WeTS: A Benchmark for Translation Suggestion

Translation Suggestion (TS), which provides alternatives for specific words or phrases given the entire document translated by machine translation (MT), has been proven to play a significant role in post-editing (PE). WeTS is a benchmark dataset for TS, annotated by expert translators. WeTS contains corpora (train/dev/test) for four translation directions: English2German, German2English, Chinese2English, and English2Chinese.

For the corpus in each direction, the data is organized as:
direction.split.src: the source-side sentences
direction.split.mask: the masked translation sentences; the placeholder is "<MASK>"
direction.split.tgt: the suggestions to be predicted; the test set for English2Chinese has three references for each example

direction: En2De, De2En, Zh2En, En2Zh
split: train, dev, test
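
As a rough illustration of this layout, the following Python sketch reads one split and pairs up the three files. It assumes the files are line-aligned (line i of each file belongs to the same example) and that they sit in a single directory. The helper names `load_split` and `fill_mask` are hypothetical and not part of this repo, and the storage format of the three English2Chinese test references is not handled here.

```python
from contextlib import ExitStack
from pathlib import Path
from typing import Iterator, NamedTuple

MASK = "<MASK>"  # placeholder used in the .mask files


class Example(NamedTuple):
    src: str   # source-side sentence
    mask: str  # translation with one span replaced by <MASK>
    tgt: str   # suggestion for the masked span


def load_split(corpus_dir: str, direction: str, split: str) -> Iterator[Example]:
    """Yield examples from direction.split.{src,mask,tgt}.

    Assumption: the three files are line-aligned, one example per line.
    """
    base = Path(corpus_dir)
    paths = [base / f"{direction}.{split}.{ext}" for ext in ("src", "mask", "tgt")]
    with ExitStack() as stack:
        files = [stack.enter_context(p.open(encoding="utf-8")) for p in paths]
        for src, mask, tgt in zip(*files):
            yield Example(src.rstrip("\n"), mask.rstrip("\n"), tgt.rstrip("\n"))


def fill_mask(masked: str, suggestion: str) -> str:
    """Insert a suggestion into the first <MASK> placeholder."""
    return masked.replace(MASK, suggestion, 1)
```

For example, `for ex in load_split("corpus", "En2De", "train"): print(fill_mask(ex.mask, ex.tgt))` would print each translation with its suggestion filled in, assuming the files are stored directly under a `corpus` directory.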

Models

We release the pre-trained NMT models that were used to generate the MT sentences. The released NMT models can also be used to generate a synthetic corpus for TS, which can improve the final performance dramatically. A detailed description of how the synthetic corpus is generated can be found in our paper.
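
The paper is the authority on the actual procedure. Purely to illustrate what a synthetic TS example looks like in the file layout of this benchmark, the sketch below masks one random contiguous span of a translation and keeps the removed span as the suggestion. Random-span masking, whitespace tokenization, and the name `make_synthetic_example` are all assumptions made for illustration; this is not claimed to be the method used in the paper. In practice the translation would come from an NMT model or a parallel corpus; here it is simply passed in as a string.

```python
import random
from typing import Optional, Tuple

MASK = "<MASK>"


def make_synthetic_example(
    target: str, max_span: int = 5, rng: Optional[random.Random] = None
) -> Tuple[str, str]:
    """Mask one random contiguous span of whitespace-separated tokens.

    Returns (masked_sentence, suggestion). Illustrative only: whitespace
    tokenization and uniform span sampling are assumptions.
    """
    rng = rng or random.Random()
    tokens = target.split()
    if not tokens:
        raise ValueError("target sentence is empty")
    # Pick a span length, then a start position that keeps the span in range.
    span_len = rng.randint(1, min(max_span, len(tokens)))
    start = rng.randint(0, len(tokens) - span_len)
    suggestion = " ".join(tokens[start:start + span_len])
    masked = " ".join(tokens[:start] + [MASK] + tokens[start + span_len:])
    return masked, suggestion
```

The masked sentence and the suggestion returned here correspond to one line of a `.mask` file and one line of a `.tgt` file, respectively.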

The released models can be downloaded at:

Download the models

The password is "2iyk".

For inference with a released model, run:

sh inference_*direction*.sh

direction can be: en2de, de2en, en2zh, zh2en
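
When inference needs to be driven from Python (for instance, to loop over all four directions), the script name can be built from the direction. The sketch below is a thin wrapper and nothing more: it assumes the `inference_*.sh` scripts are in the current working directory and that the models have already been downloaded. `inference_command` and `run_inference` are hypothetical helper names.

```python
import subprocess
from typing import List

DIRECTIONS = ("en2de", "de2en", "en2zh", "zh2en")


def inference_command(direction: str) -> List[str]:
    """Build the shell command for one translation direction."""
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction {direction!r}; expected one of {DIRECTIONS}")
    return ["sh", f"inference_{direction}.sh"]


def run_inference(direction: str) -> None:
    """Run the inference script; raises CalledProcessError on a non-zero exit."""
    subprocess.run(inference_command(direction), check=True)
```

Calling `run_inference("en2de")` from the repository root is then equivalent to running `sh inference_en2de.sh`.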

Get Started

#### data preprocessing

sh process.sh

#### pre-training

Code for the first-phase pre-training is not included in this repo, as we directly used the XLM code (https://github.com/facebookresearch/XLM) with little modification. We also did not achieve much gain from the first-phase pre-training.

The second-phase pre-training:

```Bash
sh pretraining.sh
```

#### fine-tuning
```Bash
sh finetuning.sh
```

The code in this repo is mainly forked from fairseq (https://github.com/pytorch/fairseq.git).

Citation

Please cite the following paper if you find the resources in this repository useful.

@article{yang2021wets,
  title={WeTS: A Benchmark for Translation Suggestion},
  author={Yang, Zhen and Zhang, Yingxue and Li, Ernan and Meng, Fandong and Zhou, Jie},
  journal={arXiv preprint arXiv:2110.05151},
  year={2021}
}

LICENCE

See the LICENSE file.

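Number of examples in each split:

| Translation Direction | Train | Valid | Test |
| --- | --- | --- | --- |
| English2German | 14,957 | 1,000 | 1,000 |
| German2English | 11,777 | 1,000 | 1,000 |
| English2Chinese | 15,769 | 1,000 | 1,000 |
| Chinese2English | 21,213 | 1,000 | 1,000 |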
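
The sizes in the table above can serve as a quick sanity check after downloading the corpus. The sketch below compares the line count of each `.src` file against the table. It assumes one example per line, that the "Valid" column corresponds to the `dev` split, and that all files sit directly in one directory; `check_sizes` is a hypothetical helper, not part of this repo.

```python
from pathlib import Path
from typing import Dict, Tuple

# Sizes copied from the table, keyed by the direction prefix used in file names.
EXPECTED: Dict[str, Dict[str, int]] = {
    "En2De": {"train": 14957, "dev": 1000, "test": 1000},
    "De2En": {"train": 11777, "dev": 1000, "test": 1000},
    "En2Zh": {"train": 15769, "dev": 1000, "test": 1000},
    "Zh2En": {"train": 21213, "dev": 1000, "test": 1000},
}


def count_lines(path: Path) -> int:
    """Count the lines of a UTF-8 text file."""
    with path.open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_sizes(
    corpus_dir: str, expected: Dict[str, Dict[str, int]] = EXPECTED
) -> Dict[str, Tuple[int, int]]:
    """Return {file name: (expected, actual)} for every .src file that differs."""
    mismatches: Dict[str, Tuple[int, int]] = {}
    for direction, splits in expected.items():
        for split, want in splits.items():
            path = Path(corpus_dir) / f"{direction}.{split}.src"
            got = count_lines(path)
            if got != want:
                mismatches[path.name] = (want, got)
    return mismatches
```

An empty result from `check_sizes("corpus")` means every source file has the number of lines listed in the table; a missing file raises `FileNotFoundError`.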
