A Python equivalent of the Java DVUploader. It complements other Dataverse libraries written in Python and facilitates uploading files to a Dataverse instance via direct upload.
Features
- Parallel direct upload to a Dataverse backend storage
- Files are streamed directly instead of being buffered in memory
- Supports multipart uploads and chunks data accordingly
To get started with DVUploader, you can install it via PyPI:

```bash
python3 -m pip install dvuploader
```

or from source:

```bash
git clone https://github.com/gdcc/python-dvuploader.git
cd python-dvuploader
python3 -m pip install .
```
To perform a direct upload, you need a running Dataverse instance backed by a cloud storage provider. The following example shows how to upload files to a Dataverse instance: provide the files of interest and call the `upload` method of a `DVUploader` instance.
```python
import dvuploader as dv

# Add files individually
files = [
    dv.File(filepath="./small.txt"),
    dv.File(directory_label="some/dir", filepath="./medium.txt"),
    dv.File(directory_label="some/dir", filepath="./big.txt"),
    *dv.add_directory("./data"),  # Add an entire directory
]

DV_URL = "https://demo.dataverse.org/"
API_TOKEN = "XXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
PID = "doi:10.70122/XXX/XXXXX"

dvuploader = dv.DVUploader(files=files)
dvuploader.upload(
    api_token=API_TOKEN,
    dataverse_url=DV_URL,
    persistent_id=PID,
    n_parallel_uploads=2,  # Whatever your instance can handle
)
```
DVUploader ships with a CLI ready to use outside scripts. To upload files to a Dataverse instance, provide the files of interest, the persistent identifier of the dataset, and your credentials:

```bash
dvuploader my_file.txt my_other_file.txt \
    --pid doi:10.70122/XXX/XXXXX \
    --api-token XXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX \
    --dataverse-url https://demo.dataverse.org/
```
Alternatively, you can supply a config file that contains all necessary information for the uploader. The config file is a JSON/YAML file with the following keys:

- `persistent_id`: Persistent identifier of the dataset to upload to.
- `dataverse_url`: URL of the Dataverse instance.
- `api_token`: API token of the Dataverse instance.
- `files`: List of files to upload. Each file is a dictionary with the following keys:
    - `filepath`: Path to the file to upload.
    - `directory_label`: Optional directory label to upload the file to.
    - `description`: Optional description of the file.
    - `mimetype`: Mimetype of the file.
    - `categories`: Optional list of categories to assign to the file.
    - `restrict`: Boolean indicating that this is a restricted file. Defaults to `False`.
In the following example, we upload three files to a Dataverse instance. The first file is uploaded to the root directory of the dataset, while the other two are uploaded to the directory `some/dir`.
```yaml
# config.yml
persistent_id: doi:10.70122/XXX/XXXXX
dataverse_url: https://demo.dataverse.org/
api_token: XXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
files:
  - filepath: ./small.txt
  - filepath: ./medium.txt
    directory_label: some/dir
  - filepath: ./big.txt
    directory_label: some/dir
```
The config file can then be used as follows:

```bash
dvuploader --config-path config.yml
```
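Since the config file is plain JSON or YAML, it can also be generated programmatically. Below is a minimal sketch using only the Python standard library; the DOI, token, and file names are placeholders mirroring the example above, not real values:

```python
import json

# Build a config with the keys described above; all values are placeholders.
config = {
    "persistent_id": "doi:10.70122/XXX/XXXXX",
    "dataverse_url": "https://demo.dataverse.org/",
    "api_token": "XXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "files": [
        {"filepath": "./small.txt"},
        {"filepath": "./medium.txt", "directory_label": "some/dir"},
        {"filepath": "./big.txt", "directory_label": "some/dir"},
    ],
}

# Write it as JSON, which the uploader accepts alongside YAML.
with open("config.json", "w") as fh:
    json.dump(config, fh, indent=2)
```

The generated file would then be passed in the same way: `dvuploader --config-path config.json`.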
To install the development dependencies, run the following commands:

```bash
pip install poetry
poetry install --with test
```
To test the DVUploader, you need a running Dataverse instance. You can start a local instance with the following steps:

1. Start the Dataverse instance:

   ```bash
   docker compose \
       -f ./docker/docker-compose-base.yml \
       --env-file local-test.env \
       up -d
   ```

2. Set up the environment variables:

   ```bash
   export BASE_URL=http://localhost:8080
   export $(grep "API_TOKEN" "dv/bootstrap.exposed.env")
   export DVUPLOADER_TESTING=true
   ```

3. Run the test suite with pytest:

   ```bash
   python -m pytest -v
   ```
This repository uses `ruff` to lint the code and `codespell` to check for spelling mistakes. You can run the linters with the following commands:

```bash
python -m ruff check
python -m codespell --check-filenames
```