TICO-19 ideal data normalization steps to generate HXLTM data #1
NOTE: While it is possible to ingest data directly from the existing files, it will be simpler (and maybe relevant for others) to process some of the original files only after some normalization (without actually changing their contents, only how they are labeled and packaged). Since this project is under urgency, we will just document it.
At a bare minimum, this means moving files into directories in such a way that each directory contains the same logical group. But I suspect we will also need to normalize the language codes used. From hxltm-action-example | /data/verum/TICO-19/terminologies-facebook-lint.patch, some of the CSVs need manual escaping: the translations contain literal commas (","), but the generated output did not escape them as per RFC 4180, so they break tooling. Not surprisingly, the data was published as it was created. Also, different providers (for example, the Google terminologies and the Facebook terminologies) used different language codes to express the same things.
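The actual fix in this repository is the patch file referenced above; purely as an illustrative sketch (the file names and the two-column layout are assumptions, not taken from the real data), repairing such a file in Python could look like this:

```python
import csv

# Hypothetical file names, for illustration only; the real files live
# under data/verum/TICO-19/ and the real fix is the lint patch above.
SOURCE = "terminologies-facebook-raw.csv"
TARGET = "terminologies-facebook-fixed.csv"
EXPECTED_COLUMNS = 2  # assumption: source term + translation

with open(SOURCE, encoding="utf-8") as infile, \
        open(TARGET, "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile, quoting=csv.QUOTE_MINIMAL)
    for line in infile:
        # The raw export did not quote fields, so a comma inside the
        # translation produced extra columns. Assuming the translation is
        # the last column, split only EXPECTED_COLUMNS - 1 times so the
        # remainder (commas included) stays together as one field.
        fields = line.rstrip("\n").split(",", EXPECTED_COLUMNS - 1)
        # csv.writer with QUOTE_MINIMAL quotes any field that contains a
        # comma, quote, or newline, giving RFC 4180-compliant output.
        writer.writerow(fields)
```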
While the Google terminologies only used country codes for specific cases, Facebook applied them explicitly to all terms, to the point that, when there was no need to specify a country, they used '_XX'. That suffix could simply be omitted.
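A minimal sketch of that normalization, assuming codes of the form 'en_XX' or 'pt_BR' (the function name here is hypothetical, not from the repository):

```python
def normalize_language_code(code: str) -> str:
    """Drop Facebook's placeholder '_XX' country suffix.

    'en_XX' -> 'en' (no country needed), while a meaningful
    country code such as 'pt_BR' is kept as-is.
    """
    language, _, country = code.partition("_")
    if country == "XX":
        return language
    return code

assert normalize_language_code("en_XX") == "en"
assert normalize_language_code("pt_BR") == "pt_BR"
```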