Review and annotate likely wrong language codes #4
Okay. I'm moving a bit of the logic to also generate CSVs that expose the conversion of the languages. With asciidoctor, it is possible to import these tables into the PDF and the web documentation; this will already be used later on #2. The exception is ebooks, which would need images created from the tables :| Anyway, I still need some strategy to get CLDR information printed on those tables without requiring the full Java pipeline. There are not really a lot of CLIs doing that, so we may need to do some scripting. The use case would be to get the translations of the languages and potentially check whether the countries are what was marked there. I mean, people using …
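As a rough illustration of the CSV idea, here is a minimal sketch using the Python langcodes package (mentioned later in this thread) with its language_data companion installed; the file name, the column names and the sample tags are only placeholders, not a decided format.

```python
# Minimal sketch: dump display names for a few language codes to a CSV,
# so the conversion tables can be reviewed (and later imported via asciidoctor).
# Assumes `langcodes` plus the optional `language_data` package are installed;
# file name, columns and sample tags are illustrative only.
import csv

from langcodes import Language

SAMPLE_TAGS = ["es-LA", "ar-AR", "pt-BR", "zh-Hans"]

with open("language-codes-review.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["tag", "display_name_en", "autonym"])
    for tag in SAMPLE_TAGS:
        lang = Language.get(tag)
        writer.writerow([tag, lang.display_name("en"), lang.autonym()])
```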
…-is-valid-syntax, --speaking-population, --writing-population
Ok. Here's something. It will take more time, since I'm preparing extra reusable tools. I'm not doing this only for TICO-19 (though it helps, as it provides real test cases, with more people watching the issues that happened): when I myself deal directly with data in HXL tables (that is, directly in Google Sheets and Excel), this type of mistake can happen even more often, because it is raw, direct access to the data (not someone's service). While tools do exist that can manage sources like the CLDR data as a library (more commonly in Java, NodeJS, Python and PHP, but it is quite advanced to deal with all those XMLs; langcodes already abstracts part of the work), my idea is to expose this as a CLI. That way it doesn't matter much which software is used, and it can also fit into data pipelines.

The CLDR provides, in addition to translations related to localization, some ways to calculate statistics about speakers per language per country. This is obviously not exact, but it can help mark codes which are well formatted yet are good candidates for human review. For example, a language-plus-country combination with an estimated zero speakers or writers, even without trying to brute-force whether the terms themselves are wrong, can already catch basic human typing errors.
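A minimal sketch of that "well formatted but suspicious" check, using the Python langcodes package with language_data installed. It assumes that population estimates for territory-qualified tags are meaningful; the threshold and sample tags are arbitrary.

```python
# Sketch: a tag can pass syntax validation and still have a (near) zero
# estimated speaking population for the language+territory combination,
# which flags it for human review rather than declaring it wrong.
import langcodes


def needs_human_review(tag: str, min_speakers: int = 1000) -> bool:
    """True when the tag is invalid, or valid but with a speaker estimate
    (derived from CLDR data) below the threshold."""
    if not langcodes.tag_is_valid(tag):
        return True  # malformed tags always need review
    population = langcodes.Language.get(tag).speaking_population()
    return population < min_speakers


for tag in ["es-LA", "ar-AR", "pt-BR"]:
    print(tag, "needs review:", needs_human_review(tag))
```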
On the aid to convert language codes

Anyway, the additional use for this, besides giving hints about mislabeled language codes, is that it actually helps me map the additional language codes we would use in HXLTM tables. This is one of the reasons I will try to make a tool that converts from an (assumed already perfectly used) code such as BCP 47 and then creates the HXL language attributes that could be inferred (see the sketch below). Even if something like Glottolog or other more exact language codes were added, optimistic scenarios would still need human review, but at least that stress is reduced to fewer languages. For codes that are already not perfect, this means giving some help with human error.

On the automation to "detect" the right language (not just mislabeled codes, without awareness of content)

Actually, some libraries exist (even in Python) that brute-force natural-language detection, so they could catch extra human errors by taking both the language code and samples of text. But as boring as it may seem for me to say this, for something related to deciding which codes to use when creating a dictionary for others, this could at best be used for quick tests against human error. Also, I'm actually concerned that past and future models may be trained and labeled with wrong language codes (and, obviously, very new natural languages would be unknown to any detection solution), so putting such detection into the same tool that helps those who do lexicography could worsen the situation. To be fair, the scenarios where this is always relevant would be testing for software bugs (think: the human runs the right command, but the software uses wrong codes or swaps languages), or someone publishing data for others who can already be assumed credible, while the person who approves it cannot even read the script (but it could be a last resort of error checking, assuming collaborators could be exhausted during emergency responses).
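For the inference step, a minimal sketch of what "create the attributes that could be inferred" might look like, using CLDR likely-subtags through the Python langcodes package; the sample tags are illustrative and the HXL-attribute mapping itself is not shown here.

```python
# Sketch: expand a bare BCP 47 tag with its likely script and territory,
# so the implied writing system can be reviewed before generating attributes.
from langcodes import Language

for tag in ["zh", "hi", "el", "ka"]:
    full = Language.get(tag).maximize()  # e.g. "zh" -> "zh-Hans-CN"
    print(tag, "->", full.to_tag(), "| script:", full.script)
```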
…yrl, hi-Deva, gu-Gujr, el-Grek, ka-Geor, pa-Guru, zh-Hans, zh-Hant, he-Hebr, ko-Jamo, jv-Java, ja-Kana, km-Khmr, kn-Knda, lo-Laoo, la-Latn, my-Mymr, su-Sund, ta-Taml, te-Telu, th-Thai, bo-Tibt, ii-Yiii
…know conflicts on eus basq1248/basq1250)
…r documentation; mention to JavaScript port
At least a small part of the language codes (which are very important, since they are necessary to explain what entire community submissions or professional translators are doing) are not only malformed, but wrong. Since this is not a mere conversion that can be automated, we need to explain it before republishing.
Some of these malformed codes are es-LA (Spanish as in Laos) and ar-AR ("Arabic something" as in Argentina).

How we could do it
This is a complex subject. The amount of translations is so high that we ourselves could introduce new bugs (the same way TICO-19 eventually published them), so we could even draft command line tools (or call others to help with them) just to find the types of errors that are common mistakes; see the sketch below. But in the short term, at least we need to document this as a separate issue.

So, on this point, despite being a serious issue, we can't blame the submitters of data to the TICO-19 initiative. We should assume that, under urgency and data exchange, such mistakes are even more likely to occur.
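A minimal sketch of why a syntax check alone cannot catch these examples, again assuming the Python langcodes package with its display data installed: both tags validate, so the best a tool can do is show a human reviewer how the tag actually reads.

```python
# Both example tags are syntactically valid BCP 47; printing the interpreted
# display name makes the mismatch obvious to a human reviewer.
import langcodes

for tag in ["es-LA", "ar-AR"]:
    print(tag,
          "| valid syntax:", langcodes.tag_is_valid(tag),
          "| reads as:", langcodes.Language.get(tag).display_name("en"))
```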