Please make a GitHub account before implementing a dataset; you can follow the instructions to install git here.
You will also need Python 3.6 or later. If you are installing Python, we recommend downloading Anaconda to curate a Python environment with the necessary packages. We strongly recommend Python 3.8+ for stability.
Optional: Set up your GitHub account with SSH (instructions here).
- Choose a dataset from the list of Nusantara datasets.
- Assign yourself an issue by commenting #self-assign under the issue. Please assign yourself only to issues with no other collaborators assigned. You should see your GitHub username associated with the issue within 1-2 minutes of making a comment.
- Search to see if the dataset exists in the 🤗 Hub. If it exists, please use the current implementation as the source and focus on implementing the task-specific nusantara schema.
- If not, find the dataset online; it is usually uploaded to GitHub or Google Drive.
Fork the nusa-crowd repository to your own GitHub account. To do this, click the link to the repository and click "fork" in the upper-right corner. You should get an option to fork to your account, provided you are signed into GitHub.
After you fork, clone the repository locally. You can do so as follows:
git clone git@github.com:<your_github_username>/nusa-crowd.git
cd nusa-crowd # enter the directory
Next, set your upstream location to enable you to push/pull (add or receive updates). You can do so as follows:
git remote add upstream git@github.com:IndoNLP/nusa-crowd.git
You can optionally check that this was set properly by running the following command:
git remote -v
The output of this command should look as follows:
origin git@github.com:<your_github_username>/nusa-crowd.git (fetch)
origin git@github.com:<your_github_username>/nusa-crowd.git (push)
upstream git@github.com:IndoNLP/nusa-crowd.git (fetch)
upstream git@github.com:IndoNLP/nusa-crowd.git (push)
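If you would rather check the remotes programmatically, the sketch below parses text in the format printed by git remote -v and reports which of the two expected remotes are missing. The helper names parse_remotes and missing_remotes are hypothetical and not part of the repository; the sample text stands in for real command output.

```python
def parse_remotes(remote_output: str) -> dict:
    """Map each remote name to its {direction: url} entries from `git remote -v` text."""
    remotes = {}
    for line in remote_output.strip().splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that is not "<name> <url> (<direction>)"
        name, url, direction = parts
        remotes.setdefault(name, {})[direction.strip("()")] = url
    return remotes


def missing_remotes(remote_output: str, required=("origin", "upstream")) -> list:
    """Return the required remote names that do not appear in the output."""
    found = parse_remotes(remote_output)
    return [name for name in required if name not in found]


# Sample text in the same layout as `git remote -v`; only origin is configured here.
sample = """\
origin  git@github.com:your_username/nusa-crowd.git (fetch)
origin  git@github.com:your_username/nusa-crowd.git (push)
"""
print(missing_remotes(sample))
```

An empty list means both origin and upstream are configured.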
If you do NOT have an origin for whatever reason, then run:
git remote add origin git@github.com:<your_github_username>/nusa-crowd.git
The goal of upstream is to keep your repository up to date with any changes that are made officially to the datasets library. You can do this by running the following commands:
git fetch upstream
git pull upstream master
Provided you have no merge conflicts, this will ensure the library stays up-to-date as you make changes. However, before you make changes, you should make a custom branch to implement your changes.
You can make a new branch as such:
git checkout -b <dataset_name>
Please do not make changes on the master branch!
Always make sure you're on the right branch with the following command:
git branch
The current branch will have an asterisk (*) in front of it; make sure it is the branch you created.
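As a small illustration of reading that output, the sketch below picks out the branch marked with the asterisk and warns if it is master. The function name current_branch is hypothetical, and the sample string stands in for real git branch output.

```python
def current_branch(branch_output: str):
    """Return the branch marked with '*' in `git branch` output, or None if none is marked."""
    for line in branch_output.splitlines():
        if line.startswith("* "):
            return line[2:].strip()
    return None


# Sample text in the same layout as `git branch`.
sample = "  master\n* my_dataset\n"
branch = current_branch(sample)
if branch == "master":
    print("You are on master; switch to your dataset branch before editing.")
else:
    print(f"Working on branch: {branch}")
```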
You can make an environment in any way you choose to. We highlight two possible options:
The following instructions will create an Anaconda env-nusantara-datasets environment.
- Install Anaconda for your operating system.
- Run the following commands while in the nusantara_datasets folder (you can pick your Python version):
conda env create -f conda.yml # Creates a conda env
conda activate env-nusantara-datasets # Activate your conda environment
You can deactivate your environment at any time by either exiting your terminal or running conda deactivate.
Python 3.3+ has venv automatically installed; official information is found here.
python3 -m venv <your_env_name_here>
source <your_env_name_here>/bin/activate # activate environment
pip install -r requirements.txt # Install this while in the datasets folder
Make sure your pip package points to your environment's source.
Make a new directory within the nusa-crowd/nusacrowd/nusa_datasets directory:
mkdir nusacrowd/nusa_datasets/<dataset_name>
Please use lowercase letters and underscores when choosing a <dataset_name>.
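A quick way to check a candidate name against this convention is sketched below. The function is_valid_dataset_name is hypothetical, and allowing digits after the first letter is an assumption that goes beyond the rule stated above.

```python
import re

# Assumed rule: start with a lowercase letter, then lowercase letters,
# digits (an assumption, not stated in this guide), or underscores.
_NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if the whole name matches the assumed naming convention."""
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["indo_nli", "IndoNLI", "indo-nli"]:
    print(candidate, is_valid_dataset_name(candidate))
```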
To implement your dataset, there are three key methods:
- _info: Specifies the schema of the expected dataloader.
- _split_generators: Downloads and extracts data for each split (e.g. train/val/test) or associates local data with each split.
- _generate_examples: Creates examples from data that conform to each schema defined in _info.
To start, copy templates/template.py to your nusa-crowd/nusacrowd/nusa_datasets/<dataset_name> directory with the name <dataset_name>.py. Within this file, fill out all the TODOs.
cp templates/template.py nusacrowd/nusa_datasets/<dataset_name>/<dataset_name>.py
For the _info function, you will need to define features for your DatasetInfo object. For the nusantara config, choose the right schema from our list of examples. You can find a description of these in the Task Schemas Document. You can find the actual schemas in the schemas directory.
You will use this schema in the _generate_examples return value.
Populate the information in the dataset according to this schema; some fields may be empty.
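To make the idea concrete, here is a minimal sketch of a generator that yields (key, example) pairs, which is the shape _generate_examples is expected to produce. The three-field schema (id, text, label), the column names, and the function name generate_examples are hypothetical; take the real field names from the schemas directory.

```python
import csv
import io


def generate_examples(tsv_file):
    """Yield (key, example) pairs shaped like a hypothetical text-classification schema."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for idx, row in enumerate(reader):
        yield idx, {
            "id": str(idx),
            "text": row["sentence"],
            # A field with no source value stays in the example but is left empty.
            "label": row.get("label") or "",
        }


# In-memory stand-in for a downloaded TSV file; the second row has no label.
raw = "sentence\tlabel\nSaya suka kopi\tpositive\nHujan lagi\t\n"
examples = list(generate_examples(io.StringIO(raw)))
for key, example in examples:
    print(key, example)
```

Every example carries the same set of keys, even when a value is missing in the source data.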
To enable quality control, please add the following lines to your file before the class definition, listing the tasks your dataset supports:
from utils.constants import Tasks
_SUPPORTED_TASKS = [Tasks.NAMED_ENTITY_RECOGNITION, Tasks.DEPENDENCY_PARSING]
To help you implement a dataset, you can see the implementation of other dataset scripts.
You can run your data loader script during development by appending the following statement to your code (templates/template.py already includes this):
if __name__ == "__main__":
    datasets.load_dataset(__file__)
If you want to use an interactive debugger during development, you will have to use breakpoint() instead of setting breakpoints directly in your IDE. Most IDEs will recognize the breakpoint() statement and pause there during debugging. If your preferred IDE doesn't support this, you can always run the script in your terminal and debug with pdb.
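One possible pattern is to guard the breakpoint() call so that it only fires on the first example and only when you ask for it. The sketch below assumes a hypothetical debug flag and a simplified generator named iter_examples; neither comes from the template.

```python
def iter_examples(rows, debug=False):
    """Yield (key, example) pairs, optionally pausing in the debugger on the first row."""
    for idx, row in enumerate(rows):
        if debug and idx == 0:
            # Drops into pdb (or your IDE's debugger); inspect `row`, then type `c` to continue.
            breakpoint()
        yield idx, {"id": str(idx), "text": row}


# With debug left at False the loop runs straight through.
examples = list(iter_examples(["halo dunia", "selamat pagi"]))
print(len(examples))
```

Calling iter_examples(rows, debug=True) pauses once, which is usually enough to inspect how a raw row looks before it is mapped to the schema.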
Make sure your dataset is implemented correctly by running the following commands in Python:
from datasets import load_dataset
data = load_dataset("nusacrowd/nusa_datasets/<dataset_name>/<dataset_name>.py", name="<dataset_name>_nusantara_<schema>")
Run these commands from the top level of the nusa-crowd repo (i.e. the same directory that contains the requirements.txt file).
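The script path and the config name both follow fixed patterns, so they can be derived from the dataset name. The helper loader_args below is hypothetical, and "text" is only a placeholder schema name; use the schema you actually implemented.

```python
from pathlib import PurePosixPath


def loader_args(dataset_name: str, schema: str):
    """Build the script path and config name in the patterns used by this guide."""
    script = PurePosixPath("nusacrowd/nusa_datasets") / dataset_name / f"{dataset_name}.py"
    config = f"{dataset_name}_nusantara_{schema}"
    return str(script), config


# "my_dataset" and "text" are placeholders for illustration only.
script_path, config_name = loader_args("my_dataset", "text")
print(script_path)
print(config_name)
```

The two values can then be passed to load_dataset as the path and the name argument.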
Once this is done, please also check that your dataloader passes our unit tests by running this command in the terminal:
python -m tests.test_nusantara nusacrowd/nusa_datasets/<dataset_name>/<dataset_name>.py [--data_dir /path/to/local/data]
Your particular dataset may require use of some of the other command line args in the test script.
To view full usage instructions, use the --help flag:
python -m tests.test_nusantara --help
From the main directory, run the Makefile via the following command:
make check_file=nusacrowd/nusa_datasets/<dataset_name>/<dataset_name>.py
This runs the black formatter, isort, and lints to ensure that the code is readable and looks nice. Flake8 linting errors may require manual changes.
First, commit your changes to the branch to "add" the work:
git add nusacrowd/nusa_datasets/<dataset_name>/<dataset_name>.py
git commit -m "A message describing your commits"
Then, run the following commands to incorporate any new changes from the master branch of datasets:
git fetch upstream
git rebase upstream/master
Alternatively, you can install the pre-commit hooks to run the formatting checks automatically before each commit:
pre-commit install
Run these commands in your custom branch.
Push these changes to your fork with the following command:
git push -u origin <dataset_name>
Make a Pull Request to implement your changes on the main repository here. To do so, click "New Pull Request". Then, choose your branch from your fork to push into "base:master".
When opening a PR, please link the issue corresponding to your dataset using closing keywords in the PR's description, e.g. resolves #17.
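If you want to double-check a description before submitting, the sketch below looks for a closing keyword followed by an issue number. It is a simplified approximation of GitHub's matching (it only handles the "keyword #number" form within the same repository), and the function name linked_issues is hypothetical.

```python
import re

# GitHub's closing keywords: close/closes/closed, fix/fixes/fixed, resolve/resolves/resolved.
_CLOSING = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)


def linked_issues(description: str) -> list:
    """Return the issue numbers that a closing keyword in the description points to."""
    return [int(number) for number in _CLOSING.findall(description)]


print(linked_issues("Adds the dataloader. Resolves #17"))
print(linked_issues("Related to #17"))
```

A description that only mentions an issue number without a closing keyword yields an empty list, so the issue would not be closed automatically when the PR is merged.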