TICO-19 ideal data normalization steps to generate HXLTM data #1

Open
fititnt opened this issue Nov 11, 2021 · 0 comments


fititnt commented Nov 11, 2021


NOTE: While it is possible to ingest data directly from the existing files, it will be simpler (and maybe relevant) for anyone who processes some of the original files directly if they are normalized first (without changing the contents, only how they are labeled and packaged). Since this project is under urgency, we will just document it.

At a bare minimum, this means moving files into directories so that each directory contains the same logical group. But I suspect we will also need to normalize the language codes used. From hxltm-action-example | /data/verum/TICO-19/terminologies-facebook-lint.patch, some of the CSVs need manual escaping: the translations contain commas (,), but the generated output did not quote those fields as RFC 4180 requires, so it breaks tooling.
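To make the escaping problem concrete, here is a minimal sketch in Python using only the standard `csv` module. The function names (`find_malformed_rows`, `to_rfc4180`) and the sample rows are invented for illustration and are not taken from the TICO-19 files; the assumption is that a row is broken whenever its field count differs from the header's.

```python
import csv
import io


def find_malformed_rows(text):
    """Return (line_number, row) pairs whose field count differs from the header's.

    Hypothetical helper: an unquoted comma inside a translation shows up
    as a row with more fields than the header.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    bad = []
    for row in reader:
        if len(row) != len(header):
            bad.append((reader.line_num, row))
    return bad


def to_rfc4180(rows):
    """Serialize rows so that fields containing commas are quoted, with CRLF line ends."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
    writer.writerows(rows)
    return out.getvalue()


# Invented sample: the second data row has unquoted commas inside its terms.
broken = "id,en,pt\n1,wash hands,lave as mãos\n2,fever, cough,febre, tosse\n"
malformed = find_malformed_rows(broken)

# After fixing the row by hand, writing it back quotes the affected fields.
fixed = to_rfc4180([["id", "en", "pt"], ["2", "fever, cough", "febre, tosse"]])
```

A check like this only finds the suspicious rows. Deciding where the real field boundaries are still needs a manual fix, which is why a patch file was used here.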

Not surprisingly, the data was published as it was created. Also, different providers (for example, Google terminologies and Facebook terminologies) used different language codes to express the same things.

While the Google terminologies only used country codes for specific cases, Facebook did it explicitly on all terms, to the point that when there was no need to specify a country, they used '_XX'. It could simply be omitted.
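Dropping the '_XX' placeholder could look like the sketch below. The function name `normalize_language_code` and the example codes are hypothetical; the only assumption taken from the data is that codes use the `language_COUNTRY` shape with `XX` meaning "no country". Any other code is returned unchanged, since mapping the remaining codes between providers would need a reviewed lookup table.

```python
def normalize_language_code(code, placeholder="XX"):
    """Drop the country part when it is only the 'no country' placeholder.

    Hypothetical helper: codes without the placeholder are returned as-is.
    """
    lang, sep, region = code.partition("_")
    if sep and region.upper() == placeholder:
        return lang
    return code


# Illustrative codes, not a list taken from the TICO-19 files.
examples = ["es_XX", "pt_BR", "fr"]
normalized = [normalize_language_code(c) for c in examples]
```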

fititnt added a commit that referenced this issue Nov 11, 2021
fititnt added a commit that referenced this issue Nov 12, 2021
fititnt added a commit that referenced this issue Nov 17, 2021
fititnt added a commit that referenced this issue Nov 18, 2021
fititnt added a commit that referenced this issue Nov 18, 2021