Tool to calculate sentence embeddings for an arbitrary text file:
bash ./embed.sh INPUT-FILE OUTPUT-FILE [LANGUAGE]
The input will first be tokenized, and then sentence embeddings will be generated. If a language is specified, embed.sh will look for a language-specific LASER3 encoder using the format: {model_dir}/laser3-{language}.{version}.pt. Otherwise it defaults to LASER2, which covers the same 93 languages as the original LASER encoder.
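For example, requesting the language wol_Latn would make embed.sh look for a file named laser3-wol_Latn.{version}.pt inside the configured model directory (this simply instantiates the format above with the language slot filled in).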
NOTE: please set the model location (model_dir in embed.sh) before running. We recommend downloading the models from the NLLB release (see here). Optionally, you can also select the model version number for downloaded LASER3 models; this currently defaults to 1 (initial release).
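For example, if the downloaded .pt files live in $HOME/laser_models (a hypothetical path), point model_dir at that directory in embed.sh before invoking it.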
The embeddings are stored as float32 matrices in raw binary format. They can be read in Python by:
import numpy as np

dim = 1024                      # LASER embedding dimension
X = np.fromfile("my_embeddings.bin", dtype=np.float32, count=-1)
X.resize(X.shape[0] // dim, dim)   # one row per input sentence
X is an N x 1024 matrix, where N is the number of lines in the input text file.
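As a quick sanity check, and as an example of how the matrix is typically used, the short sketch below loads the file, reports the number of sentences, and computes pairwise cosine similarities. The file name is the one from the snippet above; comparing LASER embeddings by cosine similarity is a common convention, not something embed.sh itself prescribes:
import numpy as np

dim = 1024
X = np.fromfile("my_embeddings.bin", dtype=np.float32, count=-1).reshape(-1, dim)
print("sentences:", X.shape[0])    # should match the input line count

# L2-normalize the rows so that dot products are cosine similarities
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = Xn @ Xn.T                    # sim[i, j]: similarity of sentences i and j
print(sim.shape)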
To encode an input text in any of the 93 languages supported by LASER2 (e.g. Afrikaans, English, French):
./embed.sh input_file output_file
To use a language-specific LASER3 encoder (if available), for example for Wolof, Hausa, or Irish:
./embed.sh input_file output_file wol_Latn
./embed.sh input_file output_file hau_Latn
./embed.sh input_file output_file gle_Latn
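Because LASER encoders map all languages into a shared embedding space, outputs of different encoders can be compared directly. The sketch below is a minimal illustration, assuming embed.sh has been run on an English file and on its Wolof counterpart; the file names are hypothetical and cosine similarity is again used as the comparison measure:
import numpy as np

dim = 1024

def load_embeddings(path):
    # read a raw float32 matrix written by embed.sh and L2-normalize it
    x = np.fromfile(path, dtype=np.float32, count=-1).reshape(-1, dim)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

eng = load_embeddings("english.bin")   # hypothetical embed.sh output
wol = load_embeddings("wolof.bin")     # hypothetical embed.sh output

# for each English sentence, find the most similar Wolof sentence
sim = eng @ wol.T
nearest = sim.argmax(axis=1)
for i, j in enumerate(nearest):
    print(f"eng line {i} -> wol line {j} (cos = {sim[i, j]:.3f})")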