A Principled Framework for Evaluating on Typologically Diverse Languages

Abstract: Beyond individual languages, multilingual natural language processing (NLP) research increasingly aims to develop models that perform well across languages generally. However, evaluating these systems on all the world’s languages is practically infeasible. To attain generalizability, representative language sampling is essential. Previous work argues that generalizable multilingual evaluation sets should contain languages with diverse typological properties. However, ‘typologically diverse’ language samples have been found to vary considerably in this regard, and popular sampling methods are flawed and inconsistent. We present a language sampling framework for selecting highly typologically diverse languages given a sampling frame, informed by language typology. We compare sampling methods with a range of metrics and find that our systematic methods consistently retrieve more typologically diverse language selections than previous methods in NLP. Moreover, we provide evidence that this affects generalizability in multilingual model evaluation, emphasizing the importance of diverse language sampling in NLP evaluation.

This repository contains the implementations, results and visualizations for the paper A Principled Framework for Evaluating on Typologically Diverse Languages. If you use any contents from this repository for your work, we kindly ask you to cite our paper:

@misc{ploeger2024principledframework,
      title={A Principled Framework for Evaluating on Typologically Diverse Languages},
      author={Esther Ploeger and Wessel Poelman and Andreas Holck Høeg-Petersen and Anders Schlichtkrull and Miryam de Lhoneux and Johannes Bjerva},
      year={2024},
      eprint={2407.05022},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.05022},
}

For the specific version related to the initial preprint, please check the preprint-v1 branch.

Installation and data

The package is tested with Python 3.10.

git clone https://github.com/esther2000/typdiv-sampling
cd typdiv-sampling
pip install .

Alternatively, with visualization support:

pip install ".[vis]"

Or for development:

pip install ".[dev]"
pre-commit install
pre-commit run --all-files

The data can be downloaded and prepared using the following script:

./prepare-data.sh

Example usage

Sampling

from typdiv_sampling import Sampler
from pathlib import Path

# A list of glottocodes to sample from.
frame = ['stan1293', 'russ1263', 'finn1318', 'nucl1301', 'stan1290', 'kore1280']
k = 3  # The number of languages to sample.
seed = 1  # A random seed for the non-deterministic methods.

# Initialize with default setup.
sampler = Sampler()
sampler.sample_maxsum(frame, k)
> ['kore1280', 'russ1263', 'stan1290']

Sampling methods include: sample_maxsum() (MaxSum), sample_maxmin() (MaxMin) and several baselines: sample_random(), sample_convenience(), sample_random_family(), sample_random_genus().
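To give an intuition for how MaxSum and MaxMin can differ, here is a minimal, self-contained sketch of greedy selection over a pairwise distance table. It is not the package's implementation: the labels, the toy distances, and the helper names (DIST, dist, greedy_sample) are all hypothetical, and the sketch assumes k is at least 2 and that selection starts from the most distant pair.

```python
from itertools import combinations

# Hypothetical toy distances between four made-up labels (not real typological data).
DIST = {
    frozenset(("a", "b")): 1.0,
    frozenset(("a", "c")): 0.9,
    frozenset(("b", "c")): 0.2,
    frozenset(("a", "d")): 0.5,
    frozenset(("b", "d")): 0.5,
    frozenset(("c", "d")): 0.6,
}


def dist(x, y):
    """Look up the symmetric distance between two labels."""
    return DIST[frozenset((x, y))]


def greedy_sample(frame, k, score):
    """Greedily select k items (k >= 2) from frame.

    score aggregates a candidate's distances to the items selected so far:
    sum gives a MaxSum-style criterion, min gives a MaxMin-style criterion.
    """
    items = sorted(frame)
    # Assumption: start from the most distant pair in the frame.
    selected = list(max(combinations(items, 2), key=lambda pair: dist(*pair)))
    while len(selected) < k:
        remaining = [x for x in items if x not in selected]
        best = max(remaining, key=lambda c: score(dist(c, s) for s in selected))
        selected.append(best)
    return sorted(selected)


frame = ["a", "b", "c", "d"]
maxsum = greedy_sample(frame, 3, sum)  # prefers the largest total distance
maxmin = greedy_sample(frame, 3, min)  # prefers the largest smallest distance
print(maxsum, maxmin)
```

In this toy table, "c" is far from "a" but close to "b", so the sum criterion picks it while the min criterion prefers "d", which is moderately far from both.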

Most options are also available from the package's CLI:

sample --help

An example of sampling usage in practice is found in: evaluation/experiment.py.

Typological diversity evaluation

from typdiv_sampling.evaluation import Evaluator

# With default settings.
evaluator = Evaluator()

sample = ['kore1280', 'russ1263', 'stan1290']
evaluator.evaluate_sample(sample)
> Result(
    run=None, # Optional result to keep track of averages across runs, unused here.
    ent_score_with=0.5374,
    ent_score_without=0.4954,
    fvi_score=0.7686,
    mpd_score=0.7836,
    fvo_score=0.6302,
    sample={'russ1263', 'kore1280', 'stan1290'},
)

An example of evaluation usage in practice is found in use_cases/dataset_expansion/next-best.ipynb.
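As a rough illustration of what a score such as mpd_score could measure, the sketch below computes a mean pairwise distance over a sample. This assumes that "mpd" abbreviates mean pairwise distance. The feature vectors, the distance function (fraction of differing features), and the names FEATURES, feature_distance and mean_pairwise_distance are hypothetical and are not taken from the package.

```python
from itertools import combinations

# Hypothetical binary feature vectors for three made-up languages.
FEATURES = {
    "lang_a": [1, 0, 1, 1],
    "lang_b": [0, 0, 1, 0],
    "lang_c": [1, 1, 0, 0],
}


def feature_distance(u, v):
    """Fraction of features on which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def mean_pairwise_distance(sample):
    """Average the distance over all unordered pairs in the sample."""
    pairs = list(combinations(sample, 2))
    total = sum(feature_distance(FEATURES[x], FEATURES[y]) for x, y in pairs)
    return total / len(pairs)


mpd = mean_pairwise_distance(["lang_a", "lang_b", "lang_c"])
print(round(mpd, 4))
```

A higher value means the sampled languages differ on more features on average, which is the sense in which such a score would reward typologically diverse samples.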

Reproducibility

The results and visualizations from the paper can be reproduced with the following scripts or notebooks:

Language families and typological similarities (Figure 1, page 5): analysis/typ-data-analysis.ipynb
Sampling algorithm visualization (Figure 2, page 10): analysis/algo-vis.ipynb
Intrinsic evaluation graph (Figure 5, page 14): evaluation/intrinsic-eval.sh
Intrinsic evaluation table (Table 1, page 15): evaluation/tables/table.ipynb
Tokenization boxplots (Figure 6, page 16): use_cases/tokenization/visualizations.ipynb
Dataset expansion results (Table 2, page 18): use_cases/dataset_expansion/next-best.ipynb
UD expansion case study (Table 3, page 19): use_cases/dataset_expansion/next-best.ipynb
Geographical distance plot (Figure 7, page 20): use_cases/geo_dist/visualize-dist.ipynb

Data Licenses

Grambank is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
WALS is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
Glottolog is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

License

The code and data in this repository are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
