Welcome to the collaborative repository presenting the WebSpellChecker team's approach to Grammatical Error Correction (GEC) for the Ukrainian language. You can learn more about our methodology in the article "RedPenNet for Grammatical Error Correction: Outputs to Tokens, Attentions to Spans". This repository will be updated periodically with new public data.
We are excited to share a substantial dataset containing over 3.7 million synthetic Ukrainian sentences, which you can download here. The dataset is built upon error-free texts sourced from data corpora presented on the lang-uk website.
To generate these sentences, we used mbart-large-50
as the pretrained model, fine-tuning it with a backtranslation method on training data extracted from the UA-GEC dataset. This allowed us to transduce error-free input text sequences into sequences containing synthetic errors; notably, approximately 800K sentences in this dataset contain at least one generated error. Training ran for 8 epochs on a Google Colab Premium GPU instance with a batch size of 4, a learning rate of 1e-5, and input and output lengths capped at 128 tokens each.
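The backtranslation setup above can be sketched in plain Python. This is a minimal, illustrative sketch, not our actual pipeline: the function and variable names are hypothetical, whitespace splitting stands in for the real subword tokenizer, and the config dict merely records the hyperparameters described above rather than driving a real fine-tuning run.

```python
# Illustrative sketch of preparing backtranslation training pairs from
# UA-GEC-style (errorful, corrected) sentence pairs. All names here are
# hypothetical; the real pipeline fine-tunes mbart-large-50 with standard
# sequence-to-sequence tooling.

MAX_TOKENS = 128  # input/output length cap used during fine-tuning


def make_backtranslation_pairs(gec_pairs):
    """Reverse GEC pairs: the error-generation model takes the *corrected*
    sentence as input and learns to emit the *errorful* one as output."""
    pairs = []
    for errorful, corrected in gec_pairs:
        # Drop pairs exceeding the length cap (whitespace tokens are a
        # rough stand-in for the actual subword tokenizer).
        if len(errorful.split()) > MAX_TOKENS or len(corrected.split()) > MAX_TOKENS:
            continue
        pairs.append({"src": corrected, "tgt": errorful})
    return pairs


# Hyperparameters as described above (values from this README; the dict
# itself is just a record, not a framework config).
TRAINING_CONFIG = {
    "pretrained_model": "mbart-large-50",
    "epochs": 8,
    "batch_size": 4,
    "learning_rate": 1e-5,
    "max_length": MAX_TOKENS,
}

if __name__ == "__main__":
    toy = [("Це речення з помилкую.", "Це речення з помилкою.")]
    print(make_backtranslation_pairs(toy))
```

Reversing the pair direction is the core of the trick: a model trained this way produces plausible errors when fed clean text, which is exactly how the 3.7M synthetic sentences were generated.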
The synthetic data available through this repository is generated based on texts from the lang-uk
corpora and is distributed under the same license, the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. We encourage users to familiarize themselves with the license terms before using the data. By accessing and using the synthetic data, you agree to abide by these terms, promoting ethical use and collaboration within the research community.