From d484f28e0554b1e9f298d7f24fc72d78728310d7 Mon Sep 17 00:00:00 2001
From: Christian Wibisono
Date: Tue, 28 Jun 2022 18:32:20 +0700
Subject: [PATCH] add and apply pre-commit hook

---
 .github/ISSUE_TEMPLATE/add-dataset.md         |  1 -
 .pre-commit-config.yaml                       | 17 ++++
 DATALOADER.md                                 |  8 +-
 POINTS.id.md                                  |  2 +-
 POINTS.md                                     |  3 +-
 README.md                                     |  8 +-
 UPLOADING.id.md                               | 10 +--
 UPLOADING.md                                  | 26 +++---
 nusantara/nusa_datasets/bapos/bapos.py        | 48 ++++-------
 .../nusa_datasets/bible_en_id/bible_en_id.py  | 58 +++++--------
 .../nusa_datasets/bible_su_id/bible_su_id.py  | 58 +++++--------
 nusantara/nusa_datasets/emot/emot.py          |  3 +-
 .../nusa_datasets/id_abusive/id_abusive.py    |  1 -
 .../id_clickbait/id_clickbait.py              |  2 +-
 .../id_hatespeech/id_hatespeech.py            |  1 -
 .../indo_religious_mt_en_id.py                | 53 ++++++------
 nusantara/nusa_datasets/smsa/smsa.py          | 63 ++++++---------
 .../stif_indonesia/stif_indonesia.py          | 81 +++++++++----------
 nusantara/utils/common_parser.py              | 15 +---
 nusantara/utils/constants.py                  | 24 +++---
 nusantara/utils/schemas/__init__.py           |  9 +--
 nusantara/utils/schemas/seq_label.py          | 10 +--
 nusantara/utils/schemas/text_to_text.py       |  4 +-
 templates/template.py                         |  2 -
 test_example.sh                               |  2 +-
 tests/test_nusantara.py                       | 57 +++----------
 26 files changed, 223 insertions(+), 343 deletions(-)
 create mode 100644 .pre-commit-config.yaml

diff --git a/.github/ISSUE_TEMPLATE/add-dataset.md b/.github/ISSUE_TEMPLATE/add-dataset.md
index 5f68e462..fd81c75d 100644
--- a/.github/ISSUE_TEMPLATE/add-dataset.md
+++ b/.github/ISSUE_TEMPLATE/add-dataset.md
@@ -20,4 +20,3 @@ assignees: ''
 - **Is Synthetic:** *Yes/No. Put yes if the dataset is generated synthetically somehow, for example by translating from other languages, by generating from language models, or by using a CFG, etc.*
 - **License:** *Type of license; please provide public for new datasets*
 - **Motivation:** *what are some good reasons to have this dataset*
-
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
new file mode 100644
index 00000000..b3b12c88
--- /dev/null
+++ b/.pre-commit-config.yaml
@@ -0,0 +1,17 @@
+repos:
+- repo: https://github.com/pre-commit/pre-commit-hooks
+  rev: v2.3.0
+  hooks:
+  - id: check-yaml
+  - id: end-of-file-fixer
+  - id: trailing-whitespace
+- repo: https://github.com/hadialqattan/pycln
+  rev: v1.3.5
+  hooks:
+  - id: pycln
+    args: [--all]
+- repo: https://github.com/psf/black
+  rev: 22.3.0
+  hooks:
+  - id: black
+    args: [--line-length=250, --target-version=py38]
diff --git a/DATALOADER.md b/DATALOADER.md
index 927c3276..fe142fea 100644
--- a/DATALOADER.md
+++ b/DATALOADER.md
@@ -9,7 +9,7 @@ You will also need at least Python 3.6+. If you are installing python, we recomm
 
 **Optional:** Set up your GitHub account with SSH ([instructions here](https://docs.github.com/en/authentication/connecting-to-github-with-ssh)).
 
 ### 1. **Assigning a dataloader**
-- Choose a dataset from the [list of Nusantara datasets](https://github.com/orgs/IndoNLP/projects/2).
+- Choose a dataset from the [list of Nusantara datasets](https://github.com/orgs/IndoNLP/projects/2).

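The `.pre-commit-config.yaml` added above layers three kinds of checks: hygiene hooks (`check-yaml`, `end-of-file-fixer`, `trailing-whitespace`), `pycln` to strip unused imports, and `black` with a 250-character line length targeting Python 3.8, which is what produces the whitespace-only churn in the hunks below. Once `pre-commit install` has been run, the hooks fire automatically on every `git commit`. To preview what the configured `black` settings do to a snippet, black can also be called programmatically; a minimal sketch, assuming `black==22.3.0` is installed (the sample source string here is made up):

```python
# A minimal sketch of the black settings configured above; assumes
# black==22.3.0 is installed. The sample source string is made up.
import black

mode = black.Mode(
    line_length=250,                             # mirrors args: [--line-length=250]
    target_versions={black.TargetVersion.PY38},  # mirrors --target-version=py38
)

messy = "x = {  'a':1,   'b':2 }"
print(black.format_str(messy, mode=mode))  # prints: x = {"a": 1, "b": 2}
```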
@@ -104,7 +104,7 @@ Make a new directory within the `nusa-crowd/nusantara/nusa_datasets` directory:
 
     mkdir nusantara/nusa_datasets/<dataset_name>
 
-Please use lowercase letters and underscores when choosing a `<dataset_name>`.
+Please use lowercase letters and underscores when choosing a `<dataset_name>`.
 
 To implement your dataset, there are three key methods that are important:
 
 * `_info`: Specifies the schema of the expected dataloader
@@ -141,9 +141,9 @@ if __name__ == "__main__":
 ```
 
 If you want to use an interactive debugger during development, you will have to use
-`breakpoint()` instead of setting breakpoints directly in your IDE. Most IDEs will
+`breakpoint()` instead of setting breakpoints directly in your IDE. Most IDEs will
 recognize the `breakpoint()` statement and pause there during debugging. If your preferred
-IDE doesn't support this, you can always run the script in your terminal and debug with
+IDE doesn't support this, you can always run the script in your terminal and debug with
 `pdb`.
diff --git a/POINTS.id.md b/POINTS.id.md
index 5c2b58ff..2e526f2f 100644
--- a/POINTS.id.md
+++ b/POINTS.id.md
@@ -4,7 +4,7 @@ Untuk dianggap sebagai co-author, diperlukan 10 poin kontribusi.
 
 ## Data Loader
 
-Menerapkan data loader apa pun diberikan +3 poin.
+Menerapkan data loader apa pun diberikan +3 poin.
 Info lebih lanjut dapat ditemukan [di sini](DATALOADER.md).
 
 ## Proposal Dataset
diff --git a/POINTS.md b/POINTS.md
index 55561cbe..6899a5ee 100644
--- a/POINTS.md
+++ b/POINTS.md
@@ -41,8 +41,7 @@ We can have 4 different levels: Small, Medium, Large, XL
 
 ## Examples
 
-Let's assume a new sentiment analysis dataset for one of the Papuan languages, consisting of 500 sentences.
+Let's assume a new sentiment analysis dataset for one of the Papuan languages, consisting of 500 sentences.
 For data size, it is considered Small (+1 pt). While sentiment analysis is common, the language itself is extremely rare and underrepresented, so we get +6 pts for it. Lastly, assuming the data is high-quality, we obtain a total of (1 + 6) * 1.5 pts = 10.5 pts, which is enough for authorship.
 
 Another example: let's assume a new Natural Language Inference (NLI) dataset for Javanese. NLI by itself is not new for Indonesian languages, and Javanese resources are available. However, a Javanese NLI dataset would be the first ever, hence it is still considered rare (+6 pts). Assuming the dataset is Small in size with Good quality, we end up with a total of 7 pts. By additionally implementing the data loader for this dataset, we'll have a total of 10 pts, which is enough for authorship.
-
diff --git a/README.md b/README.md
index 04417ba1..57cd2d66 100644
--- a/README.md
+++ b/README.md
@@ -15,7 +15,7 @@ Indonesian NLP is underrepresented in the research community, and one of the reasons
 
 ## How to contribute?
 
-You can contribute by proposing an **unregistered NLP dataset** to [our record](https://indonlp.github.io/nusa-catalogue/). You can also propose datasets from your past work that have not been released to the public. [Just fill out this form](https://forms.gle/31dMGZik25DPFYFd6), and we will check and approve your entry.
+You can contribute by proposing an **unregistered NLP dataset** to [our record](https://indonlp.github.io/nusa-catalogue/). You can also propose datasets from your past work that have not been released to the public. [Just fill out this form](https://forms.gle/31dMGZik25DPFYFd6), and we will check and approve your entry.
 
 We will give **contribution points** based on several factors, including **dataset quality**, **language scarcity**, and **task scarcity**.
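The DATALOADER.md hunks above name three key methods (`_info`, `_split_generators`, `_generate_examples`) and recommend `breakpoint()` for debugging. As a rough sketch of that shape, assuming the HuggingFace `datasets` library that these dataloaders build on; the class name, features, and download URL are hypothetical placeholders, not the repository's actual template:

```python
# A hedged sketch of the three key methods named in DATALOADER.md, using the
# HuggingFace `datasets` library; the class name, features, and URL below are
# hypothetical placeholders.
import datasets


class MyDatasetLoader(datasets.GeneratorBasedBuilder):
    def _info(self):
        # Specifies the schema of the expected dataloader.
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"id": datasets.Value("string"), "text": datasets.Value("string")}
            )
        )

    def _split_generators(self, dl_manager):
        # Downloads (or locates) the raw data and wires it to splits.
        path = dl_manager.download_and_extract("https://example.com/train.txt")  # hypothetical URL
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"filepath": path}
            )
        ]

    def _generate_examples(self, filepath):
        # Yields (key, example) pairs matching the schema declared in _info;
        # a breakpoint() placed here pauses execution for pdb-style debugging.
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                yield idx, {"id": str(idx), "text": line.strip()}


if __name__ == "__main__":
    datasets.load_dataset(__file__)
```

Running the script directly, as in the `if __name__ == "__main__":` snippet from DATALOADER.md, exercises all three methods in order, so a `breakpoint()` inside `_generate_examples` drops into `pdb` on the first yielded example.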
@@ -36,7 +36,7 @@ The license for a dataset is not always obvious. Here are some strategies to try
 
 * check publications that announce the release of the dataset
 * check the website of the organization providing the dataset
 
-If no official license is listed anywhere, but you find a webpage that describes general data usage policies for the dataset, you can fall back to providing that URL in the `_LICENSE` variable. If you can't find any license information, please note this in your PR and put `_LICENSE="Unknown"` in your dataset script.
+If no official license is listed anywhere, but you find a webpage that describes general data usage policies for the dataset, you can fall back to providing that URL in the `_LICENSE` variable. If you can't find any license information, please note this in your PR and put `_LICENSE="Unknown"` in your dataset script.
 
 #### What if my dataset is not yet publicly available?
 
@@ -49,11 +49,11 @@ Yes, you can ask for help in NusaCrowd's community channel! Please join our [Wh
 
 ## Thank you!
 
-We greatly appreciate your help!
+We greatly appreciate your help!
 
 The artifacts of this hackathon will be described in a forthcoming academic paper targeting a machine learning or NLP audience. Please refer to [this section](#contribution-guidelines) for your contribution rewards for helping Nusantara NLP. We recognize that some datasets require more effort than others, so please reach out if you have questions. Our goal is to be inclusive with credit!
-
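On the `_LICENSE` guidance in the README hunk above, the fallback in a dataset script might look like the following sketch; the policy URL is hypothetical:

```python
# Sketch of the _LICENSE fallback described above; the URL is hypothetical.
_LICENSE = "https://example.org/corpus/data-usage-policy"  # page describing general usage terms

# If no license information can be found at all, note this in the PR and use:
# _LICENSE = "Unknown"
```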