Update language_list.py #269

Merged: 15 commits merged into facebookresearch:MLH-dev on Nov 21, 2023

Conversation

NIXBLACK11
Contributor

This PR simplifies language_list.py in LASER by refining the LASER2 and LASER3 language list.

This resolves the problem outlined in #267.

@facebook-github-bot added the "CLA Signed" label Nov 17, 2023
@NIXBLACK11
Contributor Author

NIXBLACK11 commented Nov 17, 2023

Can I make these associations?

chavacano -> italian -> ita_Latn context
cornish -> welsh -> cym_Latn context

@fissoreg

> Can I make these associations?
>
> chavacano -> italian -> ita_Latn

As an Italian, I have to say that this doesn't feel right at all 🤣
Chavacano seems to be a Spanish-based language spoken in the Philippines, so I wouldn't know how to treat it 🤷

@NIXBLACK11
Contributor Author

NIXBLACK11 commented Nov 17, 2023

> > Can I make these associations?
> > chavacano -> italian -> ita_Latn
>
> As an Italian, I have to say that this doesn't feel right at all 🤣 Chavacano seems to be a Spanish-based language spoken in the Philippines, so I wouldn't know how to treat it 🤷

I made it based on this, but I think they are quite different 🤣.

@heffernankevin
Contributor

heffernankevin commented Nov 17, 2023

> > > Can I make these associations?
> > > chavacano -> italian -> ita_Latn
> >
> > As an Italian, I have to say that this doesn't feel right at all 🤣 Chavacano seems to be a Spanish-based language spoken in the Philippines, so I wouldn't know how to treat it 🤷
>
> I made it based on this, but I think they are quite different 🤣.

> Can I make these associations?
>
> chavacano -> italian -> ita_Latn context
> cornish -> welsh -> cym_Latn context

You can make the associations:

Italian -> ita_Latn
Welsh -> cym_Latn

However, the other languages (from LASER2) are not covered by those language-script codes, so we will have to leave them as is; for example, Cornish is not the same as Welsh. You can find the master language list we are using for the languages with scripts here: https://github.com/facebookresearch/flores/tree/main/flores200. As you can see, it does not cover all the LASER2 languages.
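The rule described in this comment (map a language name to a FLORES-200 code only where one exists, and leave uncovered LASER2 languages unchanged) could be sketched roughly as below. This is an illustration only: the `FLORES200_CODES` dictionary and the `resolve_language` helper are hypothetical names, not the actual contents of language_list.py, and only the two associations confirmed in the discussion are listed.

```python
from typing import Dict

# Hypothetical excerpt: language names that have a FLORES-200 code
# of the form <language>_<Script>. Not the real language_list.py.
FLORES200_CODES: Dict[str, str] = {
    "italian": "ita_Latn",
    "welsh": "cym_Latn",
}


def resolve_language(name: str) -> str:
    """Return the FLORES-200 code for `name`, or the normalized name if uncovered."""
    key = name.strip().lower()
    # Languages without a FLORES-200 entry (e.g. Cornish) are left as is
    # rather than being mapped to a related but different language.
    return FLORES200_CODES.get(key, key)


print(resolve_language("Italian"))  # ita_Latn
print(resolve_language("Cornish"))  # cornish
```

Falling back to the name itself, instead of to a "nearest" language, mirrors the point made above that Cornish must not be treated as Welsh.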

Contributor

@avidale avidale left a comment

I like this PR!
Added just a few nits.

Three review threads on laser_encoders/language_list.py (outdated, resolved)
Contributor

@avidale avidale left a comment

LGTM

@avidale avidale merged commit 9cde37a into facebookresearch:MLH-dev Nov 21, 2023
2 checks passed
avidale added a commit that referenced this pull request Nov 21, 2023
* feat: converted SPMapply function to use python script

* modified laserTokenizer class to have a separate function for tokenizing a file

* modified tokenize_file function

* removed instances of Path

* created new function for opening files

* test for LaserTokenizer.tokenize

* tests for normalisation, descape and lower_case

* deleted test dir because of relative import error

* modified test tokenizer function to use the downloaded model before exiting the context manager

* test for tokenize_file

* added test for is_printable

* test for over_write when equal to True and False

* added some type hints for tests

* added type hint for log function

* added header comment

* feat: make LASER pip installable (#239)

* feat: make LASER pip installable

* Added GitHub Actions workflow for tests and linting

* upgraded python version due to node deprecation error

* removed updated python version

* removed poetry

* bug fixes

* removed dependencies install

* updated pyproject and made lint_and_test to install dev and mono dependencies

* removed isort and black

* removed mono dependencies

* removed version from pyproject

* removed duplicate of classifiers

* removed description

* removed dynamic

* added src-layout to discover only laser_encoder

* added build backend

* updated project name

* changed license to BSD

* removed src-layout to test

* added linting to actions

* updated linting to only check the laser_encoders folder

* fixed linting issues

* fixed black linting issues

* added white-space

* Refactor embedder (#241)

* refactored embedder to work in the laser tokenizer package

* downgraded numpy version to suit the installed python version

* added test for sentence encoder

* added whitespace to test workflow

* restructured test for sentence encoder

* restructured test for sentence encoder

* fixed black issues

* restructured test for sentence encoder

* changed python version because of workflow error

* updated dependencies requirements version

* removed unnecessary print statement

* updated python version

* restructured test_sentence_encoder

* restructured test_sentence encoder

* black linting fixes

* restructure calling of tempfile module

* updated workflow to remove pip cache

* removed commented code

* refactored code and added type hints

* fixed black issues

* fixed no module found error by adding Laser environment

* feat: Add Python function to download LASER models (#244)

* feat:created download function for downloading laser models in python

* added language list and made some changes to the download models

* fixed linting issues

* added type hints

* fixed linting issues

* added progress bar for downloading of models

* fixed black issues

* updated code to download laser model based on where the language is found

* fixed black and linting issues

* fixed black issues

* fixed bug in sentence encoder

* black issues and relative import issues

* removed addition of laser path

* fixed isort issues

* refactored the python entrypoint functions

* fixed black issues

* updated language list with some laser2 and laser3 languages

* refactor: added option for laser

* added laser2 language list

* added laser3 language list

* fixed black issues

* updated language list

* refactored download function to display total filesize in MB and also made some changes to raise an error when laser is not passed

* fixed black issues

* refactored download models to move model_dir to the class

* fixed black issues

* refactored laser tokenizer test to use the laser downloader class methods

* documentation for the laser_encoder

* added tokenizer part

* added some docs for tokenize file and download models

* updated readme to include supported flores200 langs

* corrected readme path and license

* added requirements for laser_encoder

* added __main__.py file for running download command easily

* black and isort fixes, updated docs to effect changes due to creation of __main__.py file

* added contributors section

* Revert "added requirements for laser_encoder"

This reverts commit 431780e.

reverting back

* reverting creation of main.py

* fixed isort and black issues

* removed irrelevant comment

* moved pyproject to laser directory and adjusted contributors' names

* workflow issues due to removal of pyproject

* pointed workflow to laser_encoders dir

* fixed EOF error

* fixed EOF error

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* bug fixes and new implementation of convert_tokens_to_id function

* bug fix

* bug fix

* bug fix

* bug fix

* bug fix

* bug fix

* bug fix

* bug fix

* bug fix

* reverting back because of workflow error

* reverting back because of workflow error

* some extra adjustment

* changed ibo to igbo

* updated doc to effect the ibo to igbo change

* refactor: modified the sentence encoder to tokenize a text before encoding it

* debugging failed test

* added a call method to separately handle the tokenization before encoding

* added value error for when there is no spm_model

* documentation for the new __call__ method for tokenization with encoder

* docs: Update docs to include reference to laserembeddings (#254)

* Handle Interrupted Model Weight Downloads (#253)

* fix: Fix interrupted downloads issue

* style: Format code using black

* Update download method to use tempfile

* style: Remove unnecessary space

* Fix OSError by using shutil.move for cross-filesystem moves

Using os.rename caused an OSError when trying to move files across different filesystems (e.g., from /tmp to another directory).
By using shutil.move, we gracefully handle such situations,
ensuring files are moved correctly regardless of the source and destination filesystems.

* Refactor `initialize_encoder` to `LaserEncoderPipeline` (#256)

* Remove 'tokenize' argument from initialize_encoder function

* Add LaserEncoderPipeline for streamlined tokenization and encoding

* docs: Update README to show use of LaserEncoderPipeline

* style: Reformat code using black

* refactor: move encoder and tokenizer initialization into respective files

* style: run black

* test: Add test for LaserEncoderPipeline

* test to validate languages

* test to validate languages

* Delete flores directory

* Update validate_models.py

* Update validate_models.py

* Update validate_models.py

* Update validate_models.py

* Update .gitignore

* added pytest to validate_models.py

* Update validate_models.py

* Update validate_models.py

* Update validate_models.py using mock downloader

* Update validate_models.py

* Update validate_models.py

* Update validate_models.py

* Update validate_models.py

* Extend Tokenizer to Support Single Strings and Lists of Strings (#258)

* Handle case for both str and list in tokenizer

* test: Add test for tokenizer call method

* Rename 'sentences' argument to 'text_or_batch' for clarity

* Handle string input in call method

* Update validate_models.py

* Update download_models.py according to 1.

* Update download_models.py

* Update download_models.py

* Update download_models.py

* Enhance LaserTokenizer with Perl Parity, Optional Punctuation Normalization, and Embedding Normalization (#262)

* Introduce Perl compatibility flag

* Add argument `normalize_punct` to `LaserTokenizer`

* Add normalize_embeddings option to encode_sentences

* Update README on normalize_embeddings option

* style: Run black and isort

* test: Add tests for normalize_embeddings flag in sentence encoder

* style: Run black

* Update validate_models.py

* Update models.py

* Update laser_tokenizer.py

* Update download_models.py

* Update validate_models.py

* Update validate_models.py

* Added slow and fast tests to validate_models.py

* Update validate_models.py

* Update validate_models.py

* Create test_validate_models.py

* Rename test_validate_models.py to test_models_initialization.py

* Update test_models_initialization.py

* Update test_models_initialization.py

* Update download_models.py

* Update test_models_initialization.py

* Update test_models_initialization.py

* Update download_models.py

* Update validate_models.py

* Update validate_models.py

* Update validate_models.py

* Update validate_models.py

* Update validate_models.py

* Update validate_models.py

* Update validate_models.py

* Update validate_models.py

* Update README.md

* Update README.md

* Decrease versions of numpy and torch required by laser-encoders (#264)

* Update requirements to follow fairseq

* Update README

* Update dependencies in toml file

* Remove requirements.txt

* Update laser_encoders README

* resolve parity with MOSES-4.0 release

* update test

* Update the main README file with a mention of `laser_encoders` (#266)

* update the main readme file

* wording changes

* update the example in the readme

* fix readme text

* Update language_list.py (#269)

* Update language_list.py

* Update language_list.py

* Update language_list.py

* Updated laser encoder pipeline

* Update models.py

* Update models.py

* Added warning for using laser2 with a language

* add tests to test_laser_tokenizer.py

* Update test_laser_tokenizer.py

* Update models.py

* Update test_laser_tokenizer.py

* Update test_laser_tokenizer.py

* Update language_list.py

* Update language_list.py

* Update language_list.py

---------

Co-authored-by: CaptainVee <[email protected]>
Co-authored-by: Victor Joseph <[email protected]>
Co-authored-by: Kevin Heffernan <[email protected]>
Co-authored-by: Okewunmi Paul <[email protected]>
Co-authored-by: NIXBLACK11 <[email protected]>
Co-authored-by: Siddharth Singh Rana <[email protected]>
Co-authored-by: Kevin Heffernan <[email protected]>
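
The download fix described in the commit message above (#253: write to a temporary file, then move it with shutil.move instead of os.rename) could look roughly like the sketch below. It is written under assumptions: `save_download` and its parameters are hypothetical names, and the iterable of byte chunks stands in for whatever HTTP client the package actually uses.

```python
import os
import shutil
import tempfile
from typing import Iterable


def save_download(chunks: Iterable[bytes], destination: str) -> None:
    """Write chunks to a temporary file, then move the finished file into place."""
    fd, tmp_path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        # shutil.move falls back to copy-and-delete when the temp directory
        # and the destination are on different filesystems, a case in which
        # os.rename raises OSError.
        shutil.move(tmp_path, destination)
    except BaseException:
        # Interrupted before the move completed: discard the partial temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because the file only appears at `destination` after every chunk has been written, a download that is interrupted partway through does not leave a truncated model file behind under the final name.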
Paulooh007 added a commit to Paulooh007/LASER that referenced this pull request Nov 21, 2023
…bookresearch#249)

5 participants