
LASER: calculation of sentence embeddings

Tool to calculate sentence embeddings for an arbitrary text file:

bash ./embed.sh INPUT-FILE OUTPUT-FILE [LANGUAGE]

The input is first tokenized, and sentence embeddings are then generated. If a language is specified, embed.sh looks for a language-specific LASER3 encoder at {model_dir}/laser3-{language}.{version}.pt. Otherwise it defaults to LASER2, which covers the same 93 languages as the original LASER encoder.
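The lookup rule above can be sketched in Python as follows. This is only an illustration of the path template, not the logic embed.sh actually runs: the function name `resolve_encoder_path`, the fallback file name `laser2.pt`, and the way the version is rendered in the file name are all assumptions, so check embed.sh for the real values.

```python
from pathlib import Path
from typing import Optional


def resolve_encoder_path(model_dir: str,
                         language: Optional[str] = None,
                         version: str = "1",
                         fallback_name: str = "laser2.pt") -> Path:
    """Build the encoder checkpoint path from the template in this README.

    Hypothetical helper: `fallback_name` is an assumed LASER2 file name, and
    `version` is inserted verbatim (the README only says it defaults to 1).
    """
    base = Path(model_dir)
    if language:
        # Language-specific LASER3 encoder: {model_dir}/laser3-{language}.{version}.pt
        return base / f"laser3-{language}.{version}.pt"
    # No language given: use the multilingual LASER2 encoder.
    return base / fallback_name
```

For example, `resolve_encoder_path("models", "wol_Latn")` yields a path whose file name is `laser3-wol_Latn.1.pt`, while omitting the language yields the assumed LASER2 file name.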

NOTE: Set the model location (model_dir in embed.sh) before running. We recommend downloading the models from the NLLB release (see here). Optionally, you can also select the model version number for downloaded LASER3 models; this currently defaults to 1 (initial release).

Output format

The embeddings are stored in float32 matrices in raw binary format. They can be read in Python by:

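import numpy as np
dim = 1024  # embedding dimension
# Read the whole file as a flat float32 vector (count=-1 means "read everything")
X = np.fromfile("my_embeddings.bin", dtype=np.float32, count=-1)
# Reshape in place to one row per input line
X.resize(X.shape[0] // dim, dim)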

X is an N x 1024 matrix, where N is the number of lines in the input text file.
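As a minimal sketch of how the matrix might be used, the helpers below load an embedding file with a size check and compare two sentences by cosine similarity. The function names `load_embeddings` and `cosine_similarity` are illustrative and not part of LASER, and the size check is an added safeguard rather than something embed.sh performs.

```python
import numpy as np


def load_embeddings(path: str, dim: int = 1024) -> np.ndarray:
    """Read a raw float32 file and return an (N, dim) array."""
    flat = np.fromfile(path, dtype=np.float32)
    # A size that is not a multiple of dim usually means a truncated file or a wrong dim.
    if flat.size % dim != 0:
        raise ValueError(f"{path}: {flat.size} floats is not a multiple of dim={dim}")
    return flat.reshape(-1, dim)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Usage, assuming my_embeddings.bin was produced by embed.sh:
#   X = load_embeddings("my_embeddings.bin")
#   print(X.shape)                         # (number of input lines, 1024)
#   print(cosine_similarity(X[0], X[1]))   # similarity of the first two sentences
```

Unlike the in-place resize, this version raises an error when the file size does not match the expected dimension, instead of silently dropping the trailing values.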

Examples

To encode an input text in any of the 93 languages supported by LASER2 (e.g. Afrikaans, English, French):

./embed.sh input_file output_file

To use a language-specific encoder (if available), for example for Wolof, Hausa, or Irish:

./embed.sh input_file output_file wol_Latn

./embed.sh input_file output_file hau_Latn

./embed.sh input_file output_file gle_Latn
