We provide code to facilitate working with the WMT Metrics Shared Task and to reproduce experiments from our ACL submission. All the scripts are in the `wmt/` folder.
We found it sometimes difficult to work with ratings from the WMT Metrics shared task because the data is spread over several archives. The following command downloads all the necessary archives and aggregates the ratings in one large JSONL file.
```
python -m bleurt.wmt.db_builder \
  -target_language="en" \
  -rating_years="2015 2016" \
  -target_file=wmt.jsonl
```
You may use any combination of years from 2015 to 2019.
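If you want to inspect the aggregated ratings, the JSONL file can be read with the standard `json` module. The snippet below is a minimal sketch; it assumes the output file from the command above (`wmt.jsonl`) and makes no assumption about the exact field names, printing instead whatever fields the first record contains.

```python
import json

# Read the aggregated WMT ratings produced by bleurt.wmt.db_builder.
# Each line of the JSONL file is one rated record.
records = []
with open("wmt.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))

print(f"Loaded {len(records)} ratings.")
# Print the available fields rather than assuming their names.
if records:
    print("Fields:", sorted(records[0].keys()))
```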
We release a subset of models used in Table 1 of our EMNLP paper below:
- Teacher (RemBERT-32)
- RemBERT-3 distilled on WMT and Wikipedia
- RemBERT-6 distilled on WMT and Wikipedia
- RemBERT-12 distilled on WMT and Wikipedia
- RemBERT-12 distilled on WMT and Wikipedia, Germanic (cluster 1)
- RemBERT-12 distilled on WMT and Wikipedia, Romance (cluster 2)
- RemBERT-12 distilled on WMT and Wikipedia, Indo-Iranian-Tamil (cluster 3)
- RemBERT-12 distilled on WMT and Wikipedia, Slavic-Finno-Ugric-Kazakh-Turkish (cluster 4)
- RemBERT-12 distilled on WMT and Wikipedia, Sino-Tibetan-Japanese (cluster 5)
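Once downloaded and unzipped, these checkpoints can be loaded with the `BleurtScorer` API from this repository. The snippet below is a minimal sketch; the path `bleurt_checkpoint` is a placeholder for wherever the archive was extracted, and the sentences are arbitrary examples.

```python
from bleurt import score

# Placeholder path: replace with the directory of the unzipped checkpoint.
checkpoint = "bleurt_checkpoint"

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(
    references=["The cat sat on the mat."],
    candidates=["A cat was sitting on the mat."])
print(scores)  # One score per reference-candidate pair.
```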
The checkpoints page lists models that are similar to those trained for our 2020 ACL paper. The script `wmt/benchmark.py` can be used to re-train them from scratch. It downloads ratings from the WMT website, postprocesses them, trains a BLEURT checkpoint, and computes the correlation with human ratings.
You may, for instance, reproduce the results of Table 2 of the paper as follows:
```
BERT_DIR=bleurt/test_checkpoint
BERT_CKPT=variables/variables
python -m bleurt.wmt.benchmark \
  -train_years="2015 2016" \
  -test_years="2017" \
  -dev_ratio=0.1 \
  -model_dir=bleurt_model \
  -results_json=results.json \
  -init_checkpoint=${BERT_DIR}/${BERT_CKPT} \
  -bert_config_file=${BERT_DIR}/bert_config.json \
  -vocab_file=${BERT_DIR}/vocab.txt \
  -do_lower_case=True \
  -num_train_steps=20000
```
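When the run finishes, the file passed to `-results_json` contains the evaluation output. The following is a minimal sketch for inspecting it; it makes no assumption about the exact fields the script reports.

```python
import json

# Load the evaluation output written by bleurt.wmt.benchmark via -results_json
# and pretty-print whatever it contains.
with open("results.json", "r", encoding="utf-8") as f:
    results = json.load(f)

print(json.dumps(results, indent=2))
```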
For years 2018 and 2019, the flag `average_duplicates_on_test` must be set to `False` for a direct comparison with results from the paper. This flag enables averaging different ratings for each distinct reference-candidate pair, which the organizers of the WMT shared task started doing in 2018.
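For example, a run evaluated on 2018 could add the flag to the command above; the train/test year split below is only illustrative.

```
python -m bleurt.wmt.benchmark \
  -train_years="2015 2016 2017" \
  -test_years="2018" \
  -dev_ratio=0.1 \
  -model_dir=bleurt_model \
  -results_json=results.json \
  -init_checkpoint=${BERT_DIR}/${BERT_CKPT} \
  -bert_config_file=${BERT_DIR}/bert_config.json \
  -vocab_file=${BERT_DIR}/vocab.txt \
  -do_lower_case=True \
  -num_train_steps=20000 \
  -average_duplicates_on_test=False
```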
The exact correlations will probably be different from those reported in the paper because of differences in setup and initialization (expect differences between 0.001 and 0.1).