Create dataset loader for SEACrowd Instruct Multi-task Collection #723

Open
SamuelCahyawijaya opened this issue Jul 30, 2024 · 0 comments

@SamuelCahyawijaya
Collaborator

Dataloader name: seacrowd_instruct/seacrowd_instruct.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?seacrowd_instruct

Dataset: seacrowd_instruct
Description: The SEACrowd Instruct Multi-task Collection is a multi-task, instruction-formatted compilation of 29 preexisting datasets from SEACrowd. The collection comprises 332,040 question-answer pairs. Its purpose is to integrate the permissively licensed datasets in SEACrowd into the Data Provenance Explorer tool and to support multi-task instruction fine-tuning.
Subsets: Bhinneka Korpus - Translated, CebuaNER - Translated, EMoTES-3K - Translated, Fake News Filipino - Translated, Filipino Hate Speech Tiktok - Text - Translated, Filipino Slang Spelling Normalization - Translated, id-vaccines-tweets - Translated, identifikasi-bahasa - Translated, IndoNER-Tourism - Translated, Indonesia-Chinese-MTRobustEval - Translated, LimeSoda - Translated, MABL - Translated, MKQA - Translated, Multilabel Multiclass Sentiment and Emotion Dataset from Indonesian Mobile Application Review - Translated, Myanmar (Burmese) Name Romanization with Alignment on Grapheme-Level - Translated, NTREX-128 - Translated, SPAMID-PAIR - Translated, Tagalog Profanity Dataset - Translated, Thai Depression Dataset - Translated, Thai Toxicity Tweet Corpus - Translated, Typhoon Yolanda Tweets - Translated, UIT-ViCTSD - Translated, UIT-ViOCD - Translated, Vietnamese Social Media Emotion Corpus (UIT-VSMEC) - Translated, ViHealthQA - Translated, Wisesight Thai Sentiment Corpus - Translated, Wongnai Reviews - Translated, XNLI - Translated, XStoryCloze - Translated
Languages: eng, cmn, ind, mya, vie, tha, fil, tgl, khg, hmv, hmf, hnj, lao, zlm, por, tam, yue, khm, jav, abs, ceb, day, xdy, aoz
Tasks: Instruction Tuning
License: Unknown (unknown)
Homepage: https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection/blob/main/data_summaries/Seacrowd.json
HF URL: https://huggingface.co/datasets/minnieliang5/seacrowd
Paper URL: -
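
For anyone picking this up, here is a rough, non-authoritative sketch of two helpers the dataloader might need: one that derives a per-subset config name from the subset titles listed above, and one that maps a question-answer pair into a text-to-text style record. The helper names (`subset_to_config_name`, `to_t2t`), the slug rule, and the `"input"`/`"output"` labels are hypothetical. The record layout (`id`, `text_1`, `text_2`, `text_1_name`, `text_2_name`) is assumed to match SEACrowd's text-to-text schema and should be checked against the repository. The column names in the upstream Hugging Face dataset are not stated in this issue, so none are assumed here.

```python
import re
from typing import Dict

_DATASETNAME = "seacrowd_instruct"
_SUFFIX = " - Translated"  # every subset title in this issue ends with this


def subset_to_config_name(subset: str, schema: str = "source") -> str:
    """Build a config name such as 'seacrowd_instruct_<slug>_source'.

    The slug rule (lowercase, non-alphanumerics collapsed to '_') is an
    assumption for illustration, not a documented SEACrowd convention.
    """
    title = subset[: -len(_SUFFIX)] if subset.endswith(_SUFFIX) else subset
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    return f"{_DATASETNAME}_{slug}_{schema}"


def to_t2t(idx: int, prompt: str, answer: str) -> Dict[str, str]:
    """Map one question-answer pair to a text-to-text style record.

    Field names are assumed to follow SEACrowd's t2t schema; the
    'input'/'output' labels are placeholders.
    """
    return {
        "id": str(idx),
        "text_1": prompt,
        "text_2": answer,
        "text_1_name": "input",
        "text_2_name": "output",
    }


if __name__ == "__main__":
    print(subset_to_config_name("Bhinneka Korpus - Translated"))
    print(to_t2t(0, "Translate to English: Selamat pagi", "Good morning"))
```

In an actual loader these helpers would presumably sit inside a `datasets.GeneratorBasedBuilder` subclass, with one config per subset and schema.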
@SamuelCahyawijaya SamuelCahyawijaya converted this from a draft issue Jul 30, 2024