MLH fellowship contribution: adding the laser_encoders module #249

Merged
merged 134 commits into from
Nov 21, 2023
407a1d0
feat: converted SPMapply function to use python script
CaptainVee Jul 5, 2023
d68131d
Merge branch 'facebookresearch:main' into tokenize
CaptainVee Jul 6, 2023
521ac85
modified laserTokenizer class to have a seperate function for tokeniz…
CaptainVee Jul 9, 2023
23528fb
Merge branch 'facebookresearch:main' into tokenize
CaptainVee Jul 9, 2023
a168556
modified tokenize_file function
CaptainVee Jul 12, 2023
b730274
removed instances of Path
CaptainVee Jul 12, 2023
03b521b
created new function for opening files
CaptainVee Jul 17, 2023
2a9b30f
test for LaserTokenizer.tokenize
CaptainVee Jul 17, 2023
14c4336
tests for normalisation, descape and lower_case
CaptainVee Jul 18, 2023
199671d
deleted test dir because of relative import error
CaptainVee Jul 18, 2023
ab614b8
modified test tokenizer function to use the downloaded model before e…
CaptainVee Jul 18, 2023
fb1d213
test for tokenize_file
CaptainVee Jul 18, 2023
1871cc0
added test for is_printable
CaptainVee Jul 18, 2023
9c5503a
test for over_write when equal to True and False
CaptainVee Jul 19, 2023
29b3f32
added some type hints for tests
CaptainVee Jul 19, 2023
3031db6
added type hint for log function
CaptainVee Jul 19, 2023
296656b
added header comment
CaptainVee Jul 20, 2023
4424b35
Merge pull request #238 from CaptainVee/tokenize
avidale Jul 24, 2023
e45f7b6
feat: make LASER pip installable (#239)
CaptainVee Jul 26, 2023
7bb9822
Refactor embedder (#241)
CaptainVee Aug 2, 2023
fc0fd16
feat: Add Python function to download LASER models (#244)
CaptainVee Aug 18, 2023
f6e557d
documentation for the laser_encoder
CaptainVee Aug 21, 2023
2c90bbc
added tokenizer part
CaptainVee Aug 22, 2023
7059758
added some docs for tokenize file and download models
CaptainVee Aug 22, 2023
4e3c42b
updated readme to include supported flore200 langs
CaptainVee Aug 24, 2023
54a7d92
corrected readme path and license
CaptainVee Aug 24, 2023
431780e
added requirements for laser_encoder
CaptainVee Aug 24, 2023
4234e7b
added __main__.py file for running download command easily
CaptainVee Aug 25, 2023
8e46691
black and isort fixes, updated docs to effect changes due to creation…
CaptainVee Aug 25, 2023
8d5a192
added contributors section
CaptainVee Aug 25, 2023
eb4fdcb
Revert "added requirements for laser_encoder"
CaptainVee Aug 28, 2023
76843f7
reverting creation of main.py
CaptainVee Aug 28, 2023
676f3e1
fixed isort and black issues
CaptainVee Aug 28, 2023
013fcbd
removed irrelevant comment
CaptainVee Aug 28, 2023
83b2e01
moved pyproject to laser direcory and adjust contributors name
CaptainVee Aug 30, 2023
2a073f6
workflow issues due to removal of pyproject
CaptainVee Aug 30, 2023
c30c6aa
pointed workflow to laser_encoders dir
CaptainVee Aug 30, 2023
fdb5ffd
fixed EOF error
CaptainVee Aug 30, 2023
cccb24f
fixed EOF error
CaptainVee Aug 30, 2023
b1d1138
debuging
CaptainVee Aug 30, 2023
8276b5b
debuging
CaptainVee Aug 30, 2023
ba2e8c6
debuging
CaptainVee Aug 30, 2023
976cbed
debuging
CaptainVee Aug 30, 2023
8e3e19b
debuging
CaptainVee Aug 30, 2023
af8d095
debuging
CaptainVee Aug 30, 2023
726fb28
debuging
CaptainVee Aug 30, 2023
d953140
debuging
CaptainVee Aug 30, 2023
f253487
debuging
CaptainVee Aug 30, 2023
793756e
debuging
CaptainVee Aug 30, 2023
ee6def4
debuging
CaptainVee Aug 30, 2023
2f73b9e
debuging
CaptainVee Aug 30, 2023
bb768e6
bug fixes and new implementation of convert_tokens_to_id function
CaptainVee Sep 5, 2023
b79b15b
bug fix
CaptainVee Sep 5, 2023
6684564
bug fix
CaptainVee Sep 5, 2023
6966a5e
bug fix
CaptainVee Sep 5, 2023
d9b8882
bug fix
CaptainVee Sep 5, 2023
7d68522
bug fix
CaptainVee Sep 5, 2023
24cd881
bug fix
CaptainVee Sep 5, 2023
d5a4829
bug fix
CaptainVee Sep 5, 2023
acbbc36
bug fix
CaptainVee Sep 5, 2023
c4129dc
bug fix
CaptainVee Sep 5, 2023
c889d82
reverting back because of workflow error
CaptainVee Sep 5, 2023
5a1c476
reverting back because of workflow error
CaptainVee Sep 5, 2023
5d649c5
some extra adjustment
CaptainVee Sep 5, 2023
c69e749
changed ibo to igbo
CaptainVee Sep 6, 2023
b97fd24
updated doc to effect the ibo to igbo change
CaptainVee Sep 6, 2023
94bc7aa
Merge pull request #246 from CaptainVee/documentation
heffernankevin Sep 6, 2023
d8e6983
refactore: modified the sentence encoder to tokenize a text before en…
CaptainVee Sep 8, 2023
af224c6
debugging failed test
CaptainVee Sep 8, 2023
2ac3362
added a call method to seperately handle the tokenization before enco…
CaptainVee Sep 18, 2023
c2f66cd
added value error for when there is no spm_model
CaptainVee Sep 21, 2023
0858676
documentation for the new __call__ method for tokenization with encoder
CaptainVee Sep 21, 2023
51b4293
Merge pull request #248 from CaptainVee/refactor-sentence-encoder
heffernankevin Sep 22, 2023
0976ee8
docs: Update docs to include reference to laserembeddings (#254)
Paulooh007 Oct 11, 2023
e3257c1
Handle Interrupted Model Weight Downloads (#253)
Paulooh007 Oct 13, 2023
e6f4805
Refactor `initialize_encoder` to `LaserEncoderPipeline` (#256)
Paulooh007 Oct 31, 2023
8fc4b9a
test to validate languages
NIXBLACK11 Oct 31, 2023
9a3228b
test to validate languages
NIXBLACK11 Oct 31, 2023
ad9a588
Delete flores directory
NIXBLACK11 Oct 31, 2023
7f32d7a
Update validate_models.py
NIXBLACK11 Oct 31, 2023
ff3254b
Update validate_models.py
NIXBLACK11 Oct 31, 2023
cb2d91a
Update validate_models.py
NIXBLACK11 Oct 31, 2023
f4e84d2
Update validate_models.py
NIXBLACK11 Oct 31, 2023
109eac2
Update .gitignore
NIXBLACK11 Oct 31, 2023
2236fe0
added pytest to validate_models.py
NIXBLACK11 Nov 1, 2023
472657b
Update validate_models.py
NIXBLACK11 Nov 1, 2023
c744030
Update validate_models.py
NIXBLACK11 Nov 1, 2023
c71aec7
Update validate_models.py using mock downloader
NIXBLACK11 Nov 4, 2023
c816d79
Update validate_models.py
NIXBLACK11 Nov 6, 2023
31aa252
Update validate_models.py
NIXBLACK11 Nov 6, 2023
c34279d
Update validate_models.py
NIXBLACK11 Nov 6, 2023
8b25a3d
Update validate_models.py
NIXBLACK11 Nov 6, 2023
c5b6f60
Extend Tokenizer to Support Single Strings and Lists of Strings (#258)
Paulooh007 Nov 7, 2023
302d068
Update validate_models.py
NIXBLACK11 Nov 7, 2023
73f873f
Update download_models.py according to 1.
NIXBLACK11 Nov 7, 2023
5e04a2a
Update download_models.py
NIXBLACK11 Nov 7, 2023
e3552a7
Update download_models.py
NIXBLACK11 Nov 7, 2023
1d74246
Update download_models.py
NIXBLACK11 Nov 7, 2023
3c5f5ed
Enhance LaserTokenizer with Perl Parity, Optional Punctuation Normali…
Paulooh007 Nov 8, 2023
1bddd81
Update validate_models.py
NIXBLACK11 Nov 8, 2023
e4f3fd0
Update models.py
NIXBLACK11 Nov 8, 2023
03284a2
Update laser_tokenizer.py
NIXBLACK11 Nov 8, 2023
43f4d1a
Update download_models.py
NIXBLACK11 Nov 8, 2023
6ef54c2
Update validate_models.py
NIXBLACK11 Nov 8, 2023
89c9dde
Update validate_models.py
NIXBLACK11 Nov 8, 2023
d883ee0
Added slow and fast tests to validate_models.py
NIXBLACK11 Nov 9, 2023
e1e22a3
Update validate_models.py
NIXBLACK11 Nov 9, 2023
a8f4135
Update validate_models.py
NIXBLACK11 Nov 9, 2023
4cd83e8
Create test_validate_models.py
NIXBLACK11 Nov 9, 2023
e0be04f
Rename test_validate_models.py to test_models_initialization.py
NIXBLACK11 Nov 9, 2023
9ec012f
Update test_models_initialization.py
NIXBLACK11 Nov 9, 2023
fbbc6fc
Update test_models_initialization.py
NIXBLACK11 Nov 9, 2023
99ebbfd
Update download_models.py
NIXBLACK11 Nov 9, 2023
6356c4d
Update test_models_initialization.py
NIXBLACK11 Nov 9, 2023
eac3674
Update test_models_initialization.py
NIXBLACK11 Nov 9, 2023
d3935f9
Update download_models.py
NIXBLACK11 Nov 9, 2023
18c1657
Update validate_models.py
NIXBLACK11 Nov 14, 2023
c26e775
Update validate_models.py
NIXBLACK11 Nov 14, 2023
023eab2
Update validate_models.py
NIXBLACK11 Nov 14, 2023
3944556
Update validate_models.py
NIXBLACK11 Nov 14, 2023
0a4d983
Update validate_models.py
NIXBLACK11 Nov 14, 2023
e5823d6
Update validate_models.py
NIXBLACK11 Nov 14, 2023
92345be
Update validate_models.py
NIXBLACK11 Nov 14, 2023
87a08e9
Update validate_models.py
NIXBLACK11 Nov 14, 2023
b0131d9
Merge pull request #257 from NIXBLACK11/Language_model_validation
heffernankevin Nov 14, 2023
89ec5f3
Update README.md
NIXBLACK11 Nov 15, 2023
30856cc
Update README.md
NIXBLACK11 Nov 15, 2023
6360627
Merge pull request #265 from NIXBLACK11/Laser_readme_update
heffernankevin Nov 15, 2023
cd6118e
Decrease versions of numpy and torch required by laser-encoders (#264)
Paulooh007 Nov 15, 2023
ea7691c
resolve parity with MOSES-4.0 release
Nov 17, 2023
77bf7fb
update test
Nov 17, 2023
90db293
Update the main README file with a mention of `laser_encoders` (#266)
avidale Nov 17, 2023
b4aed58
Merge pull request #268 from facebookresearch/fix-parity
heffernankevin Nov 20, 2023
9cde37a
Update language_list.py (#269)
NIXBLACK11 Nov 21, 2023
32 changes: 32 additions & 0 deletions .github/workflows/lint_and_tests.yml
@@ -0,0 +1,32 @@
name: lint_and_tests

on: [push, pull_request]

jobs:
  build:
    strategy:
      max-parallel: 1
      matrix:
        platform: [ubuntu-latest]
        python-version: [3.8]

    runs-on: ${{ matrix.platform }}

    steps:
      - uses: actions/checkout@v2

      - name: Install dependencies
        run: |
          python --version
          python -m pip install --upgrade 'pip>=23.2.1'
          python -m pip show pip
          python -m pip install -e '.[dev]'

      - name: isort
        run: cd laser_encoders && isort --check --diff .

      - name: black
        run: cd laser_encoders && black --check --diff .

      - name: pytest
        run: pytest laser_encoders
3 changes: 3 additions & 0 deletions .gitignore
@@ -10,3 +10,6 @@ tasks/xnli/XNLI-1.0*
tasks/xnli/multinli_1.0*
.??*swp
.idea
__pycache__
nllb
dist
26 changes: 24 additions & 2 deletions README.md
@@ -3,6 +3,7 @@
LASER is a library to calculate and use multilingual sentence embeddings.

**NEWS**
* 2023/11/16 Released [**laser_encoders**](laser_encoders), a pip-installable package supporting LASER-2 and LASER-3 models
* 2023/06/26 [**xSIM++**](https://arxiv.org/abs/2306.12907) evaluation pipeline and data [**released**](tasks/xsimplusplus/README.md)
* 2022/07/06 Updated LASER models with support for over 200 languages are [**now available**](nllb/README.md)
* 2022/07/06 Multilingual similarity search (**xsim**) evaluation pipeline [**released**](tasks/xsim/README.md)
@@ -26,7 +27,27 @@ a language family which is covered by other languages.
A detailed description of how the multilingual sentence embeddings are trained can
be found [here](https://arxiv.org/abs/2205.12654), together with an experimental evaluation.

## Dependencies
## The core sentence embedding package: `laser_encoders`
We provide a package `laser_encoders` with minimal dependencies.
It supports LASER-2 (a single encoder for the languages listed [below](#supported-languages))
and LASER-3 (147 language-specific encoders described [here](nllb/README.md)).

The package can be installed simply with `pip install laser_encoders` and used as below:

```python
from laser_encoders import LaserEncoderPipeline
encoder = LaserEncoderPipeline(lang="eng_Latn")
embeddings = encoder.encode_sentences(["Hi!", "This is a sentence encoder."])
print(embeddings.shape) # (2, 1024)
```

The laser_encoders [readme file](laser_encoders) provides more examples of its installation and usage.

## The full LASER kit
Apart from `laser_encoders`, we provide support for LASER-1 (the original multilingual encoder)
and for various LASER applications listed below.

### Dependencies
* Python >= 3.7
* [PyTorch 1.0](http://pytorch.org/)
* [NumPy](http://www.numpy.org/), tested with 1.15.4
@@ -42,7 +63,8 @@ be found [here](https://arxiv.org/abs/2205.12654), together with an experimental
* [pandas](https://pypi.org/project/pandas), data analysis toolkit (`pip install pandas`)
* [Sentencepiece](https://github.com/google/sentencepiece), subword tokenization (installed automatically)

## Installation
### Installation
* install the `laser_encoders` package, e.g. with `pip install -e .` to install it in editable mode
* set the environment variable 'LASER' to the root of the installation, e.g.
`export LASER="${HOME}/projects/laser"`
* download encoders from Amazon s3 by e.g. `bash ./nllb/download_models.sh`
4 changes: 4 additions & 0 deletions install_external_tools.sh
@@ -181,6 +181,10 @@ InstallMecab () {
#
###################################################################

echo "Installing the laser_encoders package in editable mode"

pip install -e .

echo "Installing external tools"

InstallMosesTools
149 changes: 149 additions & 0 deletions laser_encoders/README.md
@@ -0,0 +1,149 @@
# LASER encoders

LASER Language-Agnostic SEntence Representations Toolkit

laser_encoders is the official Python package for the Facebook [LASER](https://github.com/facebookresearch/LASER) library. It provides a simple and convenient way to use LASER embeddings in Python. It allows you to calculate multilingual sentence embeddings using the LASER toolkit. These embeddings can be utilized for various natural language processing tasks, including document classification, bitext filtering, and mining.

## Dependencies

- Python `>= 3.8`
- [PyTorch `>= 1.10.0`](http://pytorch.org/)
- sacremoses `>=0.1.0`
- sentencepiece `>=0.1.99`
- numpy `>=1.21.3`
- fairseq `>=0.12.2`

You can find a full list of requirements [here](https://github.com/facebookresearch/LASER/blob/main/pyproject.toml)

## Installation

You can install the `laser_encoders` package from PyPI:

```sh
pip install laser_encoders
```

Alternatively, you can install it from a local clone of this repository, in editable mode:
```sh
pip install -e .
```

## Usage

Here's a simple example of how to obtain embeddings for sentences using the `LaserEncoderPipeline`:

>**Note:** By default, the models will be downloaded to the `~/.cache/laser_encoders` directory. To specify a different download location, you can provide the argument `model_dir=path/to/model/directory`

```py
from laser_encoders import LaserEncoderPipeline

# Initialize the LASER encoder pipeline
encoder = LaserEncoderPipeline(lang="igbo")

# Encode sentences into embeddings
embeddings = encoder.encode_sentences(["nnọọ, kedu ka ị mere"])
# If you want the output embeddings to be L2-normalized, set normalize_embeddings to True
normalized_embeddings = encoder.encode_sentences(["nnọọ, kedu ka ị mere"], normalize_embeddings=True)

```
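
As a rough illustration of what L2 normalization means for an embedding, here is a minimal pure-Python sketch. The helper `l2_normalize` is a hypothetical name used only for this example; it is not part of `laser_encoders`, and the package's own implementation may differ.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (hypothetical helper, not the package API)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # A zero vector has no direction, so return it unchanged.
        return list(vector)
    return [x / norm for x in vector]


unit = l2_normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
```

After normalization, the dot product of two embeddings equals their cosine similarity, which is why this option is convenient for similarity search.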

If you prefer more control over the tokenization and encoding process, you can initialize the tokenizer and encoder separately:
```py
from laser_encoders import initialize_encoder, initialize_tokenizer

# Initialize the LASER tokenizer
tokenizer = initialize_tokenizer(lang="igbo")
tokenized_sentence = tokenizer.tokenize("nnọọ, kedu ka ị mere")

# Initialize the LASER sentence encoder
encoder = initialize_encoder(lang="igbo")

# Encode tokenized sentences into embeddings
embeddings = encoder.encode_sentences([tokenized_sentence])
```
>By default, the `spm` flag is set to `True` when initializing the encoder, ensuring the accompanying spm model is downloaded.

**Supported Languages:** You can specify any language from the [FLORES200](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) dataset. This includes both languages identified by their full codes (like "ibo_Latn") and simpler alternatives (like "igbo").
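
To illustrate the idea of accepting both a full code and a simpler alternative, here is a minimal sketch. The table `LANGUAGE_ALIASES` and the function `resolve_language` are hypothetical names for this example only; the single entry is the "igbo" / "ibo_Latn" pair mentioned above, and the package's actual lookup may work differently.

```python
# Hypothetical alias table; only the "igbo" -> "ibo_Latn" pair comes from this README.
LANGUAGE_ALIASES = {"igbo": "ibo_Latn"}


def resolve_language(lang):
    """Map a short language name to its full code, or pass a known full code through."""
    if lang in LANGUAGE_ALIASES:
        return LANGUAGE_ALIASES[lang]
    if lang in LANGUAGE_ALIASES.values():
        return lang
    raise ValueError(f"Unsupported language: {lang!r}")


print(resolve_language("igbo"))      # ibo_Latn
print(resolve_language("ibo_Latn"))  # ibo_Latn
```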

## Downloading the pre-trained models

If you prefer to download the models individually, you can use the following command:

```sh
python -m laser_encoders.download_models --lang=your_preferred_language  # e.g., --lang="igbo"
```

By default, the downloaded models will be stored in the `~/.cache/laser_encoders` directory. To specify a different download location, utilize the following command:

```sh
python -m laser_encoders.download_models --model-dir=path/to/model/directory
```
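
The fallback between a user-supplied directory and the default cache can be pictured with the sketch below. `DEFAULT_MODEL_DIR` and `resolve_model_dir` are hypothetical names chosen for this illustration, and the sketch only assumes the default location stated above; it does not reproduce the package's download code.

```python
from pathlib import Path

# Default location stated in this README; the constant name is an assumption.
DEFAULT_MODEL_DIR = Path.home() / ".cache" / "laser_encoders"


def resolve_model_dir(model_dir=None):
    """Return the directory models would be stored in (hypothetical helper)."""
    if model_dir:
        # Expand a leading "~" so that user-relative paths work as expected.
        return Path(model_dir).expanduser()
    return DEFAULT_MODEL_DIR


print(resolve_model_dir())               # the default cache directory
print(resolve_model_dir("/tmp/models"))  # a user-supplied location
```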

> For a comprehensive list of available arguments, you can use the `--help` command with the download_models script.

Once you have successfully downloaded the models, you can utilize the `SentenceEncoder` to tokenize and encode your text in your desired language. Here's an example of how you can achieve this:

```py
from laser_encoders.models import SentenceEncoder
from pathlib import Path

encoder = SentenceEncoder(model_path="path/to/downloaded/model", spm_model=Path("path/to/spm_model"), spm_vocab="path/to/cvocab")
embeddings = encoder("This is a test sentence.")
```
If you want to perform tokenization separately, you can do so as follows:
```py
from pathlib import Path

from laser_encoders.laser_tokenizer import LaserTokenizer
from laser_encoders.models import SentenceEncoder

# Tokenize with the SentencePiece model only
tokenizer = LaserTokenizer(spm_model=Path("path/to/spm_model"))
tokenized_sentence = tokenizer.tokenize("This is a test sentence.")

# Encode the already-tokenized sentence; no spm_model is passed to the encoder
encoder = SentenceEncoder(model_path="path/to/downloaded/model", spm_vocab="path/to/cvocab")
embeddings = encoder.encode_sentences([tokenized_sentence])
```

For tokenizing a file instead of a string, you can use the following:

```py
tokenized_sentence = tokenizer.tokenize_file(inp_fname=Path("path/to/input_file.txt"), out_fname=Path("path/to/output_file.txt"))
```

### Now you can use these embeddings for downstream tasks

For more advanced usage and options, please refer to the official LASER repository documentation.
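
As one illustration of a downstream use, sentence embeddings are often compared with cosine similarity. The sketch below uses plain Python lists and a hypothetical helper `cosine_similarity`; in practice you would apply the same formula to the arrays returned by the encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Similarity is undefined for a zero vector; 0.0 is a convention chosen here.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Values close to 1 indicate vectors pointing in nearly the same direction, which for sentence embeddings is commonly read as similar meaning.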

## LASER Versions and Associated Packages

For users familiar with the earlier version of LASER, you might have encountered the [`laserembeddings`](https://pypi.org/project/laserembeddings/) package. This package primarily dealt with LASER-1 model embeddings.

For the latest LASER-2 and LASER-3 models, use the newly introduced `laser_encoders` package, which offers better performance and support for a wider range of languages.


## Contributing

We welcome contributions from the developer community to enhance and improve laser_encoders. If you'd like to contribute, you can:

1. Submit bug reports or feature requests through GitHub issues.
1. Fork the repository, make changes, and submit pull requests for review.

Please follow our [Contribution Guidelines](https://github.com/facebookresearch/LASER/blob/main/CONTRIBUTING.md) to ensure a smooth process.

### Code of Conduct

We expect all contributors to adhere to our [Code of Conduct](https://github.com/facebookresearch/LASER/blob/main/CODE_OF_CONDUCT.md).

### Contributors

The following people have contributed to this project:

- [Victor Joseph](https://github.com/CaptainVee)
- [Paul Okewunmi](https://github.com/Paulooh007)
- [Siddharth Singh Rana](https://github.com/NIXBLACK11)
- [David Dale](https://github.com/avidale/)
- [Holger Schwenk](https://github.com/hoschwenk)
- [Kevin Heffernan](https://github.com/heffernankevin)

### License

This package is released under the [LASER](https://github.com/facebookresearch/LASER/blob/main/LICENSE) BSD License.

16 changes: 16 additions & 0 deletions laser_encoders/__init__.py
@@ -0,0 +1,16 @@
#!/bin/bash
# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.
#
# LASER Language-Agnostic SEntence Representations
# is a toolkit to calculate multilingual sentence embeddings
# and to use them for document classification, bitext filtering
# and mining
#
# -------------------------------------------------------

from laser_encoders.laser_tokenizer import initialize_tokenizer
from laser_encoders.models import LaserEncoderPipeline, initialize_encoder