Skip to content

Latest commit

 

History

History
27 lines (19 loc) · 2.07 KB

README.md

File metadata and controls

27 lines (19 loc) · 2.07 KB

LASER: P-xSIM (dual approach multilingual similarity error rate)

This README shows how to calculate the P-xSIM error rate (Seamless Communication et al., 2023) for a given language pair.

P-xSIM returns the error rate for recreating gold alignments using a blended combination of two different approaches. It works by performing a k-nearest-neighbor search and margin calculation (i.e. margin-based parallel alignment) using the first approach, followed by the scoring of each candidate neighbor using an auxiliary model (the second approach). Finally, the scores of both the margin-based alignment and the auxiliary model are combined together using a blended score defined as:

$$ \text{blended-score}(x, y) = \alpha \cdot \text{margin} + (1 - \alpha) \cdot \text{auxiliary-score} $$

where the parameter $\alpha$ controls the combination of both the margin-based and auxiliary scores. By default, the auxiliary-score will be calculated as the cosine between the source and candidate neighbors using the auxiliary embeddings. However, there is also an option to perform inference using a comparator model (Seamless Communication et al., 2023). In this instance, the auxiliary-score will be the AutoPCP outputs.

P-xSIM offers three margin-based scoring options (discussed in detail here):

  • distance
  • ratio
  • absolute

Example usage

Simply run the example script bash ./eval.sh to download a sample dataset (flores200), sample encoders (laser2 and LaBSE), and then perform P-xSIM. In this toy example, we use laser2 to provide the k-nearest-neighbors, followed by applying LaBSE as an auxiliary model on each candidate neighbor, before then applying the blended scoring function defined above. Dependending on your data sources, you may want to alter the approach used for either margin-based parallel alignment, or the scoring of each candidate neighbor (i.e. the auxiliary model).

In addition to LaBSE in the example above, you can also calculate P-xSIM using any model hosted on HuggingFace sentence-transformers.