Export Hugging Face datasets to persistent file formats from the command line.
- Clone the repo
- Install using pip
pip install .
- Now it is available as a CLI command.
Example:
datasets download imdb
This downloads the imdb dataset and saves it in csv format.
The default output location is ~/saved_datasets/.
A dataset can be saved as csv, json, or parquet files.
All the splits/files of a dataset are downloaded and stored separately.
The directory ~/saved_datasets is populated as follows:
$ tree ~/saved_datasets/
.
└── imdb
    ├── test.csv
    ├── train.csv
    └── unsupervised.csv
2 directories, 3 files
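To read an exported split back, the layout above can be walked with the Python standard library. This is a minimal sketch: `read_split` is a hypothetical helper that is not part of this tool, and it assumes each csv file starts with a header row.

```python
import csv
from pathlib import Path


def read_split(name, split, root="~/saved_datasets"):
    """Return the rows of <root>/<name>/<split>.csv as a list of dicts.

    Hypothetical helper for illustration; assumes the file has a header row.
    """
    path = Path(root).expanduser() / name / f"{split}.csv"
    # newline="" lets the csv module handle quoted multi-line fields itself.
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, `read_split("imdb", "train")` would load ~/saved_datasets/imdb/train.csv, with every value returned as a string.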
Similarly, the dataset can be downloaded as json or parquet files by using the --format option:
JSON: $ datasets download imdb --format json
Parquet: $ datasets download imdb --format parquet
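For readers who want the same result from Python, the per-split export can be sketched as follows. This is not taken from this tool's source code: `export_splits` and `SUPPORTED_FORMATS` are hypothetical names, and the sketch only assumes that each split object offers `to_csv`, `to_json` and `to_parquet` methods, as `Dataset` objects from the Hugging Face `datasets` library do.

```python
from pathlib import Path

# Hypothetical names for illustration; not part of this tool.
SUPPORTED_FORMATS = ("csv", "json", "parquet")


def export_splits(splits, name, fmt="csv", root="~/saved_datasets"):
    """Write each split to <root>/<name>/<split>.<fmt> and return the paths."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    out_dir = Path(root).expanduser() / name
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for split_name, split in splits.items():
        target = out_dir / f"{split_name}.{fmt}"
        # Dispatch to to_csv, to_json or to_parquet on the split object.
        getattr(split, f"to_{fmt}")(str(target))
        written.append(target)
    return written
```

With the `datasets` library installed, passing the result of `load_dataset("imdb")` as `splits` and `"imdb"` as `name` should produce one file per split, matching the layout shown earlier.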