Skip to content

Latest commit

 

History

History
149 lines (119 loc) · 5.79 KB

README.md

File metadata and controls

149 lines (119 loc) · 5.79 KB

Elinor: Evaluation Library in INfOrmation Retrieval

Crates.io Documentation Build Status

News: The CLI tools are now available in the elinor-cli directory!

Elinor is a Rust library for evaluating information retrieval (IR) systems. It provides a comprehensive set of metrics and statistical tests for evaluating and comparing IR systems.

Key features

  • IR-specific design: Elinor is tailored specifically for evaluating IR systems, with an intuitive interface designed for IR engineers. It offers a streamlined workflow that simplifies common IR evaluation tasks.
  • Comprehensive evaluation metrics: Elinor supports a wide range of key evaluation metrics, such as Precision, MAP, MRR, and nDCG. The supported metrics are available in Metric. The evaluation results are validated against trec_eval to ensure accuracy and reliability.
  • In-depth statistical testing: Elinor includes several statistical tests, such as Student's t-test, Bootstrap test, and Randomized Tukey HSD test. Not only p-values but also other important statistics, such as effect sizes and confidence intervals, are provided for thorough reporting. See the statistical_tests module for more details.
  • Command-line tools: elinor-cli provides command-line tools for evaluating and comparing IR systems. The tools support various metrics and statistical tests, facilitating comprehensive evaluations and in-depth analyses.

API documentation

See https://docs.rs/elinor/.

Or, you can build and open the documentation locally by running the following command:

RUSTDOCFLAGS="--html-in-header katex.html" cargo doc --no-deps --features serde --open

Command-line tools

elinor-cli provides command-line tools for evaluating and comparing IR systems.

For example, you can obtain various statistics from several statistical tests, as shown below:

Two-system comparison

# Means
+--------+----------+----------+
| Metric | System_1 | System_2 |
+--------+----------+----------+
| ndcg@5 | 0.3450   | 0.2700   |
+--------+----------+----------+

# Two-sided paired Student's t-test for (System_1 - System_2)
+--------+--------+--------+--------+--------+---------+---------+
| Metric | Mean   | Var    | ES     | t-stat | p-value | 95% MOE |
+--------+--------+--------+--------+--------+---------+---------+
| ndcg@5 | 0.0750 | 0.0251 | 0.4731 | 2.1158 | 0.0478  | 0.0742  |
+--------+--------+--------+--------+--------+---------+---------+

# Two-sided paired Bootstrap test (n_resamples = 10000)
+--------+---------+
| Metric | p-value |
+--------+---------+
| ndcg@5 | 0.0511  |
+--------+---------+

# Fisher's randomized test (n_iters = 10000)
+--------+---------+
| Metric | p-value |
+--------+---------+
| ndcg@5 | 0.0498  |
+--------+---------+

Multi-system comparison

# ndcg@5
## System means
+----------+--------+---------+
| System   | Mean   | 95% MOE |
+----------+--------+---------+
| System_1 | 0.3450 | 0.0670  |
| System_2 | 0.2700 | 0.0670  |
| System_3 | 0.2450 | 0.0670  |
+----------+--------+---------+
## Two-way ANOVA without replication
+-----------------+------------+----+----------+--------+---------+
| Factor          | Variation  | DF | Variance | F-stat | p-value |
+-----------------+------------+----+----------+--------+---------+
| Between-systems | 0.1083     | 2  | 0.0542   | 2.4749 | 0.0976  |
| Between-topics  | 1.0293     | 19 | 0.0542   | 2.4754 | 0.0086  |
| Residual        | 0.8317     | 38 | 0.0219   |        |         |
+-----------------+------------+----+----------+--------+---------+
## Effect sizes for Tukey HSD test
+----------+----------+----------+----------+
| ES       | System_1 | System_2 | System_3 |
+----------+----------+----------+----------+
| System_1 | 0.0000   | 0.5070   | 0.6760   |
| System_2 | -0.5070  | 0.0000   | 0.1690   |
| System_3 | -0.6760  | -0.1690  | 0.0000   |
+----------+----------+----------+----------+
## p-values for randomized Tukey HSD test (n_iters = 10000)
+----------+----------+----------+----------+
| p-value  | System_1 | System_2 | System_3 |
+----------+----------+----------+----------+
| System_1 | 1.0000   | 0.2561   | 0.1040   |
| System_2 | 0.2561   | 1.0000   | 0.8926   |
| System_3 | 0.1040   | 0.8926   | 1.0000   |
+----------+----------+----------+----------+

Correctness verification

In addition to simple unit tests, Elinor's evaluation results are validated to ensure accuracy and reliability:

  • The metrics are validated against trec_eval using its test data.
  • The statistical tests are validated against the results in Sakai's book using its sample data.

Acknowledgments

This library is inspired by Sakai's books on IR evaluation and statistical testing:

I recommend reading these books before using this library.

Licensing

Licensed under either of

at your option.