At this point, your PR has been accepted and there should be no further changes to your dataloader script.
Please do the following before getting started:
- Make an account on 🤗's Hub and log in. Choose a good password, as you'll need it to authenticate.
- Join the Indobenchmark initiative here.
  - Click the "Request to join this org" button in the upper right corner.
- Make a GitHub account; you can follow instructions to install git here.
Note: your permissions will be set to READ. To be granted WRITE access, please contact an admin in your dataset's GitHub issue; access should be given after your PR is accepted.
You can find the official instructions here. We will provide what you need for the nusantara-datasets hackathon environment.
With your nusantara environment active, use the following command:
huggingface-cli login
Log in with your 🤗 Hub account username and password.
Make a repository via the 🤗 Hub here with the following details.
- Set Owner: nusantara-datasets
- Set Dataset name: the name of the dataset
- Set License: the license that applies to this dataset
- Select Private
- Click Create dataset
Please name your dataloading script with the same name as the dataset. For example, if your dataset loader script is called absa_prosa.py, then your dataset name should be absa_prosa.
If there is no appropriate license available in the provided options (for example, for datasets with specific data user agreements), you should select "other".
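The naming rule above can be checked mechanically before you create the repository. The following is a minimal sketch using only the Python standard library; the helper names `dataset_name_from_script` and `names_match` are hypothetical and not part of any hackathon tooling.

```python
from pathlib import Path


def dataset_name_from_script(script_path: str) -> str:
    """Return the dataset name implied by a data-loading script's file name."""
    path = Path(script_path)
    if path.suffix != ".py":
        raise ValueError(f"expected a .py script, got {path.name!r}")
    # The dataset name is the file name without the .py extension.
    return path.stem


def names_match(script_path: str, dataset_name: str) -> bool:
    """Check that the script file name and the Hub dataset name agree."""
    return dataset_name_from_script(script_path) == dataset_name


if __name__ == "__main__":
    # Example from this guide: absa_prosa.py should map to absa_prosa.
    print(dataset_name_from_script("absa_prosa.py"))
    print(names_match("absa_prosa.py", "absa_prosa"))
```

Running the check before filling in the "Dataset name" field helps catch mismatches such as a hyphen used in place of an underscore.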
In a terminal, choose a location for your local copy of the repository. In this location, use the following command:
git clone https://huggingface.co/datasets/indobenchmark/<your_dataset_name>
Copy your data-loading script into the cloned repository, then run the following commands from inside it to add and push your work.
git add <your_file_name.py> # add the dataset
git commit -m "Adds <your_dataset_name>"
git push origin
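If you prefer to script the add, commit, and push steps, the same commands can be assembled in Python. This is only a sketch: the function name `push_commands` is hypothetical, and the code only builds and prints the commands rather than running them.

```python
import shlex
from typing import List


def push_commands(script_file: str, dataset_name: str) -> List[List[str]]:
    """Build the git commands from this guide as argument lists."""
    return [
        ["git", "add", script_file],                      # stage the data-loading script
        ["git", "commit", "-m", f"Adds {dataset_name}"],  # same commit message pattern as above
        ["git", "push", "origin"],                        # push to the Hub repository
    ]


if __name__ == "__main__":
    for cmd in push_commands("absa_prosa.py", "absa_prosa"):
        # shlex.join quotes arguments so the line can be pasted into a shell.
        print(shlex.join(cmd))
```

To actually execute the commands, each argument list could be passed to `subprocess.run(cmd, check=True, cwd=repo_dir)`, where `repo_dir` is an assumed variable holding the path of your cloned repository.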
Test both the original dataset schema/config and the nusantara schema/config.
Run the following commands in a folder that does not include your data-loading script, so that the script is fetched from the Hub rather than picked up locally:
Public Dataset
from datasets import load_dataset
dataset_orig = load_dataset("indobenchmark/<your_dataset_name>", name="source", use_auth_token=True)
dataset_indobenchmark = load_dataset("indobenchmark/<your_dataset_name>", name="indobenchmark", use_auth_token=True)
Private Dataset
from datasets import load_dataset
dataset_orig = load_dataset(
    "indobenchmark/<your_dataset_name>",
    name="source",
    data_dir="/local/path/to/data/files",
    use_auth_token=True,
)
dataset_indobenchmark = load_dataset(
    "indobenchmark/<your_dataset_name>",
    name="indobenchmark",
    data_dir="/local/path/to/data/files",
    use_auth_token=True,
)
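Since the public and private cases differ only in whether `data_dir` is passed, the two checks can be folded into one helper. The sketch below is an illustration, not part of the official tooling: `build_load_kwargs` and `load_all_configs` are hypothetical names, the config names are copied from the calls above, and `use_auth_token=True` mirrors this guide (newer releases of the `datasets` library may expect a `token` argument instead).

```python
from typing import Any, Dict, Optional

# Config names taken from the load_dataset calls in this guide.
CONFIG_NAMES = ("source", "indobenchmark")


def build_load_kwargs(repo_id: str, data_dir: Optional[str] = None) -> Dict[str, Dict[str, Any]]:
    """Map each config name to the keyword arguments for load_dataset."""
    all_kwargs: Dict[str, Dict[str, Any]] = {}
    for config in CONFIG_NAMES:
        kwargs: Dict[str, Any] = {"path": repo_id, "name": config, "use_auth_token": True}
        if data_dir is not None:
            # Only private (local-data) datasets need data_dir.
            kwargs["data_dir"] = data_dir
        all_kwargs[config] = kwargs
    return all_kwargs


def load_all_configs(repo_id: str, data_dir: Optional[str] = None) -> Dict[str, Any]:
    """Load every config; requires the datasets package and Hub access."""
    from datasets import load_dataset  # imported lazily so the helper above works without it

    return {
        config: load_dataset(**kwargs)
        for config, kwargs in build_load_kwargs(repo_id, data_dir).items()
    }
```

For a public dataset you would call `load_all_configs("indobenchmark/<your_dataset_name>")`; for a private one, add `data_dir="/local/path/to/data/files"`. If both configs load without raising, the upload can be considered working.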
And with that, you have successfully contributed a data-loading script!