Merge pull request #246 from CaptainVee/documentation
docs: Readme documentation for the laser_encoder package
heffernankevin authored Sep 6, 2023
2 parents fc0fd16 + b97fd24 commit 94bc7aa
Showing 5 changed files with 174 additions and 10 deletions.
117 changes: 117 additions & 0 deletions laser_encoders/README.md
@@ -0,0 +1,117 @@
# LASER encoders

LASER Language-Agnostic SEntence Representations Toolkit

laser_encoders is the official Python package for the Facebook [LASER](https://github.com/facebookresearch/LASER) library. It provides a simple and convenient way to compute multilingual sentence embeddings with the LASER toolkit directly from Python. These embeddings can be used for various natural language processing tasks, including document classification, bitext filtering, and mining.

## Dependencies

- Python `>= 3.8`
- [PyTorch `>= 2.0`](http://pytorch.org/)
- sacremoses `>=0.0.53`
- sentencepiece `>=0.1.99`
- numpy `>=1.25.0`
- fairseq `>=0.12.2`

You can find a full list of requirements [here](requirements.txt).

## Installation

You can install laser_encoders using pip:

```sh
pip install laser_encoders
```
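
To quickly confirm that the package is importable after installation (a minimal check using the entry points shown later in this README), you can run:

```py
# Minimal import check; initialize_encoder and initialize_tokenizer are the
# entry points used throughout this README.
from laser_encoders import initialize_encoder, initialize_tokenizer

print("laser_encoders imported successfully")
```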

## Usage

Here's a simple example of how you can download and initialize the tokenizer and encoder in a single step.

**Note:** By default, the models will be downloaded to the `~/.cache/laser_encoders` directory. To specify a different download location, pass the argument `model_dir=path/to/model/directory` to the `initialize_tokenizer` and `initialize_encoder` functions.

```py
from laser_encoders import initialize_encoder, initialize_tokenizer

# Initialize the LASER tokenizer
tokenizer = initialize_tokenizer(lang="igbo")
tokenized_sentence = tokenizer.tokenize("nnọọ, kedu ka ị mere")

# Initialize the LASER sentence encoder
encoder = initialize_encoder(lang="igbo")

# Encode sentences into embeddings
embeddings = encoder.encode_sentences([tokenized_sentence])
```
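
`encode_sentences` returns one embedding vector per input sentence. A quick sanity check might look like the following (the 1024-dimensional embedding width is the usual LASER size and is stated here as an assumption, not an API guarantee):

```py
# embeddings comes from encoder.encode_sentences above; it is expected to be a
# NumPy array with one row per input sentence.
print(embeddings.shape)  # e.g. (1, 1024) for a single input sentence
```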

**Supported Languages:** You can specify any language from the [FLORES200](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) dataset. This includes both languages identified by their full codes (like "ibo_Latn") and simpler alternatives (like "igbo").
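
For instance, based on the language handling described above, the full code and the simpler alias are expected to refer to the same model (illustrative sketch):

```py
from laser_encoders import initialize_encoder

# Both spellings refer to Igbo in FLORES200; this sketch assumes the alias
# handling described above and will download the model if it is not cached.
encoder_full = initialize_encoder(lang="ibo_Latn")
encoder_alias = initialize_encoder(lang="igbo")
```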

## Downloading the pre-trained models

If you prefer to download the models individually, you can use the following command:

```sh
python -m laser_encoders.download_models --lang=your_preferred_language # e.g., --lang="igbo"
```

By default, the downloaded models will be stored in the `~/.cache/laser_encoders` directory. To specify a different download location, utilize the following command:

```sh
python -m laser_encoders.download_models --model-dir=path/to/model/directory
```

> For a comprehensive list of available arguments, you can run the `download_models` script with the `--help` flag.

Once you have successfully downloaded the models, you can use the `LaserTokenizer` to tokenize text in your desired language. Here's an example of how you can achieve this:

```py
from laser_encoders.laser_tokenizer import LaserTokenizer
from laser_encoders.models import SentenceEncoder
from pathlib import Path

tokenizer = LaserTokenizer(spm_model=Path("path/to/spm_model"))

tokenized_sentence = tokenizer.tokenize("This is a test sentence.")

encoder = SentenceEncoder(model_path="path/to/downloaded/model", spm_vocab="path/to/cvocab")
embeddings = encoder.encode_sentences([tokenized_sentence])
```
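
If the models were downloaded with the default settings, the paths above would typically live under `~/.cache/laser_encoders`. The file names below follow the LASER2 naming convention and are an assumption for illustration; adjust them to whatever `download_models` actually stored:

```py
from pathlib import Path

# Assumed default cache location and LASER2 file names (illustrative only).
model_dir = Path.home() / ".cache" / "laser_encoders"
spm_model = model_dir / "laser2.spm"     # SentencePiece model
model_path = model_dir / "laser2.pt"     # encoder checkpoint
spm_vocab = model_dir / "laser2.cvocab"  # vocabulary file
```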

For tokenizing a file instead of a string, you can use the following:

```py
tokenizer.tokenize_file(inp_fname=Path("path/to/input_file.txt"), out_fname=Path("path/to/output_file.txt"))
```

### Now you can use these embeddings for downstream tasks
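
As a small illustration (a sketch, not part of the package API), the embeddings produced earlier can be compared with cosine similarity:

```py
import numpy as np

# Sketch: compare two Igbo sentences using the tokenizer and encoder
# initialized earlier in this README; the second sentence is an
# illustrative variant of the first.
embeddings = encoder.encode_sentences([
    tokenizer.tokenize("nnọọ, kedu ka ị mere"),
    tokenizer.tokenize("kedu ka ị mere"),
])

a, b = embeddings
cosine_similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cosine_similarity:.3f}")
```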

For more advanced usage and options, please refer to the official LASER repository documentation.

## Contributing

We welcome contributions from the developer community to enhance and improve laser_encoders. If you'd like to contribute, you can:

1. Submit bug reports or feature requests through GitHub issues.
1. Fork the repository, make changes, and submit pull requests for review.

Please follow our [Contribution Guidelines](https://github.com/facebookresearch/LASER/blob/main/CONTRIBUTING.md) to ensure a smooth process.

### Code of Conduct

We expect all contributors to adhere to our [Code of Conduct](https://github.com/facebookresearch/LASER/blob/main/CODE_OF_CONDUCT.md).

### Contributors

The following people have contributed to this project:

- [Victor Joseph](https://github.com/CaptainVee)
- [David Dale](https://github.com/avidale/)
- [Holger Schwenk](https://github.com/hoschwenk)
- [Kevin Heffernan](https://github.com/heffernankevin)

### License

This package is released under the [LASER](https://github.com/facebookresearch/LASER/blob/main/LICENSE) BSD License.

### Contact

For any questions, feedback, or support, you can contact Facebook AI Research.
8 changes: 6 additions & 2 deletions laser_encoders/download_models.py
@@ -13,7 +13,6 @@
# -------------------------------------------------------
#
# This python script installs NLLB LASER2 and LASER3 sentence encoders from Amazon s3
# default to download to current directory

import argparse
import logging
@@ -122,6 +121,7 @@ def initialize_encoder(
downloader = LaserModelDownloader(model_dir)
if laser is not None:
if laser == "laser3":
lang = downloader.get_language_code(LASER3_LANGUAGE, lang)
downloader.download_laser3(lang=lang, spm=spm)
file_path = f"laser3-{lang}.v1"
elif laser == "laser2":
@@ -132,6 +132,7 @@
f"Unsupported laser model: {laser}. Choose either laser2 or laser3."
)
else:
lang = downloader.get_language_code(LASER3_LANGUAGE, lang)
if lang in LASER3_LANGUAGE:
downloader.download_laser3(lang=lang, spm=spm)
file_path = f"laser3-{lang}.v1"
@@ -158,7 +159,10 @@ def initialize_tokenizer(lang: str = None, model_dir: str = None, laser: str = N
if laser is not None:
if laser == "laser3":
lang = downloader.get_language_code(LASER3_LANGUAGE, lang)
filename = f"laser3-{lang}.v1.spm"
if lang in SPM_LANGUAGE:
filename = f"laser3-{lang}.v1.spm"
else:
filename = "laser2.spm"
elif laser == "laser2":
filename = "laser2.spm"
else:
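
The added `get_language_code` calls normalize user-facing language names to FLORES200 codes before lookup, and the tokenizer now falls back to the shared `laser2.spm` for languages without a dedicated SentencePiece model. A rough sketch of the normalization intent (the mapping entries here are illustrative assumptions, not the actual contents of `LASER3_LANGUAGE`):

```py
# Illustrative sketch only: the real LASER3_LANGUAGE mapping lives in
# laser_encoders and covers many more languages and aliases.
LASER3_LANGUAGE_EXAMPLE = {"igbo": "ibo_Latn", "ibo_Latn": "ibo_Latn"}

def get_language_code_sketch(language_list: dict, lang: str) -> str:
    # Return the canonical code when an alias is passed, else the input.
    return language_list.get(lang, lang)

assert get_language_code_sketch(LASER3_LANGUAGE_EXAMPLE, "igbo") == "ibo_Latn"
```
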
40 changes: 38 additions & 2 deletions laser_encoders/laser_tokenizer.py
@@ -16,13 +16,16 @@

import gzip
import logging
import re
import sys
import typing as tp
from pathlib import Path
from typing import IO, List

import sentencepiece as spm
from sacremoses import MosesDetokenizer, MosesPunctNormalizer

SPACE_NORMALIZER = re.compile(r"\s+")

logging.basicConfig(
stream=sys.stdout,
level=logging.INFO,
@@ -53,7 +56,7 @@ def __init__(
self.moses_detokenizer = MosesDetokenizer()
self.spm_encoder = spm.SentencePieceProcessor(model_file=str(self.spm_model))

def open(self, file: Path, mode: str, encoding="utf-8") -> tp.IO:
def open(self, file: Path, mode: str, encoding="utf-8") -> IO:
return (
gzip.open(file, mode, encoding=encoding)
if file.name.endswith(".gz")
@@ -95,3 +98,36 @@ def tokenize_file(self, inp_fname: Path, out_fname: Path) -> None:
for line in file_in:
tokens = self.tokenize(line.strip())
file_out.write(tokens + "\n")

def __call__(self, text_or_batch, batch=False):
if not batch:
return self.tokenize(text_or_batch)
else:
return self.tokenize_batch(text_or_batch)

def tokenize_batch(self, batch: List[str]) -> List[List[str]]:
return [self.tokenize(text) for text in batch]

def convert_ids_to_tokens(self, ids: List[int]) -> List[str]:
return [self.spm_encoder.DecodeIds(ids) for ids in ids]

def convert_tokens_to_ids(self, tokens: List[str]) -> List[int]:
ids = []

for token in tokens:
# Apply the same tokenization logic as in _tokenize method
tokens = SPACE_NORMALIZER.sub(" ", token).strip().split()

# Initialize an empty tensor for this token's IDs
token_ids = []

for i, token in enumerate(tokens):
token_id = self.spm_encoder.PieceToId(token)
if token_id == 0: # Handle out-of-vocabulary tokens
token_id = self.spm_encoder.PieceToId("<unk>")
token_ids.append(token_id)

# Append token IDs to the final IDs tensor
ids.extend(token_ids)

return ids
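
Taken together, the methods added above let the tokenizer be invoked directly and convert between surface tokens and SentencePiece IDs. A brief usage sketch (the SPM model path is a hypothetical placeholder for a previously downloaded model):

```py
from pathlib import Path

from laser_encoders.laser_tokenizer import LaserTokenizer

# Hypothetical path to a previously downloaded SentencePiece model.
tokenizer = LaserTokenizer(spm_model=Path("path/to/laser2.spm"))

# __call__ dispatches to tokenize() for a single string ...
tokens = tokenizer("This is a test sentence.")
# ... and to tokenize_batch() when batch=True.
token_batch = tokenizer(["first sentence", "second sentence"], batch=True)

# Round-trip between tokens and SentencePiece IDs using the new helpers.
ids = tokenizer.convert_tokens_to_ids(tokens.split())
pieces = tokenizer.convert_ids_to_tokens(ids)
```
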
8 changes: 8 additions & 0 deletions laser_encoders/requirements.txt
@@ -0,0 +1,8 @@
fairseq==0.12.2
numpy==1.25.0
pytest==7.4.0
Requests==2.31.0
sacremoses==0.0.53
sentencepiece==0.1.99
torch==2.0.1
tqdm==4.65.0
11 changes: 5 additions & 6 deletions pyproject.toml
@@ -7,7 +7,7 @@ name = "laser_encoders"
version = "0.0.1"
authors = [{name = "Facebook AI Research"}]
description = "LASER Language-Agnostic SEntence Representations is a toolkit to calculate multilingual sentence embeddings and to use them for document classification, bitext filtering and mining"
readme = "README.md"
readme = "laser_encoders/README.md"
requires-python = ">=3.8"

dependencies = [
@@ -19,12 +19,11 @@ dependencies = [
]

classifiers=[
"License :: BSD License",
"License :: OSI Approved :: BSD License",
"Topic :: Scientific/Engineering",
"Development Status :: 4 - Beta",
]


[project.urls]
"Homepage" = "https://github.com/facebookresearch/LASER"
"Bug Tracker" = "https://github.com/facebookresearch/LASER/issues"
@@ -56,14 +55,14 @@ python_version = "3.8"
show_error_codes = true
check_untyped_defs = true

ignore_missing_imports = true

files = [
"laser_encoders/"
]

ignore_missing_imports = true

[tool.pytest.ini_options]
testpaths = ["laser_encoders"]
python_files = [
"test_*.py",
]
]
