# Working with WMT Ratings Data

We provide code to facilitate working with data from the WMT Metrics Shared Task and to reproduce the experiments from our ACL paper. All the scripts are in the `wmt/` folder.

## Downloading and aggregating the WMT ratings

Ratings from the WMT Metrics Shared Task can be difficult to work with because the data is spread across several archives. The following command downloads all the necessary archives and aggregates the ratings into one large JSONL file:

```bash
python -m bleurt.wmt.db_builder \
  -target_language="en" \
  -rating_years="2015 2016" \
  -target_file=wmt.jsonl
```

You may use any combination of years from 2015 to 2019.
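Once the file is built, it can be loaded with a few lines of Python. A minimal sketch, assuming the standard one-record-per-line JSONL layout (the helper name `load_ratings` is ours; we print the first record rather than assume specific field names):

```python
import json

def load_ratings(path):
    """Load WMT ratings from a JSONL file (one JSON record per line)."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, if any
                records.append(json.loads(line))
    return records

ratings = load_ratings("wmt.jsonl")
print(len(ratings), "records")
print(ratings[0])  # inspect the available fields
```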

## EMNLP Paper

We release a subset of the models used in Table 1 of our EMNLP paper.

## ACL Paper (English Only)

The checkpoints page lists models that are similar to those trained for our 2020 ACL paper.

The script `wmt/benchmark.py` can be used to re-train them from scratch. It downloads ratings from the WMT website, postprocesses them, trains a BLEURT checkpoint, and computes the correlation with human ratings.

For instance, you can reproduce the results of Table 2 of the paper as follows:

```bash
BERT_DIR=bleurt/test_checkpoint
BERT_CKPT=variables/variables
python -m bleurt.wmt.benchmark \
  -train_years="2015 2016" \
  -test_years="2017" \
  -dev_ratio=0.1 \
  -model_dir=bleurt_model \
  -results_json=results.json \
  -init_checkpoint=${BERT_DIR}/${BERT_CKPT} \
  -bert_config_file=${BERT_DIR}/bert_config.json \
  -vocab_file=${BERT_DIR}/vocab.txt \
  -do_lower_case=True \
  -num_train_steps=20000
```

For the years 2018 and 2019, the flag `average_duplicates_on_test` must be set to `False` for a direct comparison with the results from the paper. This flag enables averaging the different ratings collected for each distinct reference-candidate pair, which the organizers of the WMT shared task started doing in 2018. An example invocation is shown below.
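For example, a 2018 run might look as follows. The train/test year split here is illustrative, not necessarily the exact split used in the paper; the checkpoint flags are reused from the command above:

```bash
python -m bleurt.wmt.benchmark \
  -train_years="2015 2016 2017" \
  -test_years="2018" \
  -dev_ratio=0.1 \
  -model_dir=bleurt_model \
  -results_json=results.json \
  -init_checkpoint=${BERT_DIR}/${BERT_CKPT} \
  -bert_config_file=${BERT_DIR}/bert_config.json \
  -vocab_file=${BERT_DIR}/vocab.txt \
  -do_lower_case=True \
  -num_train_steps=20000 \
  -average_duplicates_on_test=False
```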

The exact correlations will probably differ from those reported in the paper because of differences in setup and initialization (expect differences between 0.001 and 0.1).
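Once a run completes, the correlations land in the file passed to `-results_json`. A quick way to inspect them, without assuming any particular key names, is to load and print the whole JSON object:

```python
import json

# Load the metrics written by bleurt.wmt.benchmark via -results_json.
with open("results.json") as f:
    results = json.load(f)

# Print every recorded entry; the exact keys depend on the run.
for key, value in sorted(results.items()):
    print(f"{key}: {value}")
```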