Welcome to the collaborative repository presenting the WebSpellChecker team's approach to Grammatical Error Correction (GEC) for the Ukrainian language. You can learn more about our methodology in the article "RedPenNet for Grammatical Error Correction: Outputs to Tokens, Attentions to Spans". This repository will be updated periodically with new public data.
We are excited to share a substantial dataset containing over 3.7 million synthetic Ukrainian sentences, which you can download here. The dataset is built upon error-free texts sourced from data corpora presented on the lang-uk website.
To generate these sentences, we used mbart-large-50
as the pretrained model, fine-tuning it with a backtranslation method on training data extracted from the UA-GEC dataset. This allowed us to transduce error-free input text sequences into sequences containing synthetic errors; notably, approximately 800K sentences in this dataset contain at least one generated error. Training ran for 8 epochs on a Google Colab Premium GPU instance with a batch size of 4, a learning rate of 1e-5, and input and output lengths capped at 128 tokens each.
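The backtranslation setup above can be sketched in plain Python. This is a minimal, illustrative sketch, not our actual pipeline: the function and variable names are hypothetical, whitespace splitting stands in for the real subword tokenizer, and the config dict merely records the hyperparameters described above rather than driving a real fine-tuning run.

```python
# Illustrative sketch of preparing backtranslation training pairs from
# UA-GEC-style (errorful, corrected) sentence pairs. All names here are
# hypothetical; the real pipeline fine-tunes mbart-large-50 with standard
# sequence-to-sequence tooling.

MAX_TOKENS = 128  # input/output length cap used during fine-tuning


def make_backtranslation_pairs(gec_pairs):
    """Reverse GEC pairs: the error-generation model takes the *corrected*
    sentence as input and learns to emit the *errorful* one as output."""
    pairs = []
    for errorful, corrected in gec_pairs:
        # Drop pairs exceeding the length cap (whitespace tokens are a
        # rough stand-in for the actual subword tokenizer).
        if len(errorful.split()) > MAX_TOKENS or len(corrected.split()) > MAX_TOKENS:
            continue
        pairs.append({"src": corrected, "tgt": errorful})
    return pairs


# Hyperparameters as described above (values from this README; the dict
# itself is just a record, not a framework config).
TRAINING_CONFIG = {
    "pretrained_model": "mbart-large-50",
    "epochs": 8,
    "batch_size": 4,
    "learning_rate": 1e-5,
    "max_length": MAX_TOKENS,
}

if __name__ == "__main__":
    toy = [("Це речення з помилкую.", "Це речення з помилкою.")]
    print(make_backtranslation_pairs(toy))
```

Reversing the pair direction is the core of the trick: a model trained this way produces plausible errors when fed clean text, which is exactly how the 3.7M synthetic sentences were generated.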
The synthetic data available through this repository is generated based on texts from the lang-uk
corpora and is distributed under the same license, the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. We encourage users to familiarize themselves with the license terms before using the data. By accessing and using the synthetic data, you agree to abide by these terms, promoting ethical use and collaboration within the research community.