Use our localization data for training #882

Open
marco-c opened this issue Oct 16, 2024 · 2 comments
Labels
data sources Data importer support

Comments

@marco-c
Collaborator

marco-c commented Oct 16, 2024

https://github.com/mozilla-l10n/mt-training-data

Maybe we could add it to OPUS.

@marco-c marco-c added the data sources Data importer support label Oct 16, 2024
@marco-c
Collaborator Author

marco-c commented Oct 16, 2024

It looks like it is already in OPUS: https://github.com/Helsinki-NLP/OPUS-ingest/tree/master/corpus/Mozilla-I10n. Though it seems to be a very old version, from 2021.

@ZJaume
Collaborator

ZJaume commented Oct 21, 2024

I think localization data from software can really help with the translation of short sentences, especially once #888 is fixed 😅

EDIT: although some language pairs in these corpora may need a little cleaning, the Ubuntu and OpenOffice corpora can help the Firefox translation models with webpage menus.
