<HTML>
<HEAD>
<title>Automatic Post-Editing Task - ACL 2016 First Conference on Machine Translation</title>
<style> h3 { margin-top: 2em; } </style>
</HEAD>
<body>
<center>
<script src="title.js"></script>
<p><h2>Shared Task: Automatic Post-Editing</h2></p>
<script src="menu.js"></script>
</center>
<H3>OVERVIEW</H3>
<p> The second round of the APE shared task follows the first pilot round organised in 2015. The aim is to examine <b> automatic methods for correcting errors produced by an unknown machine translation (MT) system.</b> This has to be done by exploiting knowledge acquired from human post-edits, which are provided as training material.</p>
<H3>Goals</H3>
<p>
The aim of this task is to improve MT output in black-box scenarios, in which the MT system is used "as is" and cannot be modified. From the application point of view, APE components would make it possible to:
<UL>
<LI>Cope with systematic errors of an MT system whose decoding process is not accessible</LI>
<LI>Provide professional translators with improved MT output quality to reduce (human) post-editing effort</LI>
<LI>Adapt the output of a general-purpose system to the lexicon/style requested in a specific application domain</LI>
</UL>
</p>
<H3>Task Description</H3>
<p>
This year the task focuses on the Information Technology (IT) domain, in which English source sentences have been translated into German by an unknown MT system and then manually post-edited by professional translators.</p>
<p>At the training stage, the collected human post-edits have to be used to learn correction rules for the APE systems. At the test stage, they will be used for system evaluation with automatic metrics (TER and BLEU).
</p>
<H3>Data</H3>
<p>
Training, development and test data (the same used for the Sentence-level Quality Estimation task) consist of English-German triplets (source, target and post-edit) belonging to the IT domain and <b>already tokenized.</b></p>
<p>The training and development sets contain 12,000 and 1,000 triplets respectively, while the test set contains 2,000 instances. All data is provided by the EU project QT21 (<a href="http://www.qt21.eu/" target="_blank">http://www.qt21.eu/</a>).</p>
<p><b>NOTE:</b> Any use of additional data for training your system is allowed (e.g. parallel corpora, post-edited corpora).</p>
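<p>A minimal sketch, in Python, of loading the triplets, assuming the release ships three line-aligned plain-text files; the file names <code>train.src</code>, <code>train.mt</code> and <code>train.pe</code> below are placeholders, so adapt them to the actual names in the downloaded package:</p>
<pre>
# Load (source, MT output, human post-edit) triplets from line-aligned files.
# File names are assumptions; use the names shipped in the released package.
def load_triplets(src_path, mt_path, pe_path):
    with open(src_path, encoding="utf-8") as f_src, \
         open(mt_path, encoding="utf-8") as f_mt, \
         open(pe_path, encoding="utf-8") as f_pe:
        for src, mt, pe in zip(f_src, f_mt, f_pe):
            # The data is already tokenized, so a whitespace split is enough.
            yield src.split(), mt.split(), pe.split()

triplets = list(load_triplets("train.src", "train.mt", "train.pe"))
print(len(triplets))  # 12,000 triplets are expected for the training set
</pre>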
<H3>Evaluation</H3>
<p>Systems' performance will be evaluated with respect to their capability to reduce the distance that separates an automatic translation from its human-revised version.</p>
<p>Such distance will be measured in terms of TER, which will be computed between automatic and human post-edits in <b>case-sensitive mode.</b></p>
<p>BLEU will also be taken into consideration as a secondary evaluation metric. To gain further insights into the final output quality, a subset of the outputs of the submitted systems will also be manually evaluated.</p>
<p>The submitted runs will be ranked based on the average HTER calculated on the test set using the <a href="http://www.cs.umd.edu/~snover/tercom/" target="_blank">tercom</a> software.</p>
<p>The HTER calculated between the raw MT output and the human post-edits in the test set will be used as the baseline (<i>i.e.</i> the baseline is a system that leaves all the test instances unmodified).</p>
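<p>A minimal sketch, in Python, of preparing the hypothesis/reference files expected by tercom and invoking it; the trailing "(set.N)" segment identifiers, the jar name <code>tercom.7.25.jar</code> and the <code>-s</code> flag for case-sensitive scoring are assumptions to be checked against the tercom documentation:</p>
<pre>
import subprocess

# tercom expects one segment per line, followed by an identifier in parentheses.
def write_tercom_file(path, segments, set_id="ape"):
    with open(path, "w", encoding="utf-8") as out:
        for i, seg in enumerate(segments, start=1):
            out.write("%s (%s.%d)\n" % (seg.strip(), set_id, i))

ape_outputs = ["Klicken Sie auf OK .", "Speichern Sie die Datei ."]       # system post-edits (toy data)
human_post_edits = ["Klicken Sie auf OK .", "Speichern Sie die Datei ."]  # references (toy data)

write_tercom_file("hyp.txt", ape_outputs)
write_tercom_file("ref.txt", human_post_edits)

# "-s" is assumed to enable case-sensitive scoring; verify it against the
# usage message of your tercom version before relying on the scores.
subprocess.run(["java", "-jar", "tercom.7.25.jar",
                "-r", "ref.txt", "-h", "hyp.txt", "-s"], check=True)
</pre>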
<H3>Download Links</H3>
<p><a href="http://hdl.handle.net/11372/LRT-1632" target="_blank">Training and development data</a></p>
<p><a href="http://hdl.handle.net/11372/LRT-1632" target="_blank">Test data </a> <b>(gold standard references are released in <a href="http://hdl.handle.net/11372/LRT-1632" target="_blank">test_pe.zip</a>)</b></p>
<p><a href="https://www.dropbox.com/s/5jw5maariwey080/Evaluation_Script.tar.gz?dl=0" target="_blank">Evaluation script</a></p>
<H3>Results</H3>
<p>Systems are ranked according to TER score. <font color="red"><b>!!! NEW !!!</b></font></p>
<table border="1" cellspacing="0" cellpadding="5">
<tr><td><b>Systems</b></td><td><b>TER</b></td><td><b>BLEU</b></td></tr>
<tr><td>AMU_ensemble8-mt+src_PRIMARY</td><td>21.52</td><td>67.65</td></tr>
<tr><td>AMU_ensemble4-mt_CONTRASTIVE</td><td>23.06</td><td>66.09</td></tr>
<tr><td>FBK_factored_contrastive</td><td>23.92</td><td>64.75</td></tr>
<tr><td>FBK_factored-qe_primary</td><td>23.94</td><td>64.75</td></tr>
<tr><td>USAAR_OSM_PRIMARY_BOTH</td><td>24.14</td><td>64.10</td></tr>
<tr><td>USAAR_CPBOSM_CONTRASTIVE_BOTH</td><td>24.14</td><td>64.00</td></tr>
<tr><td>CUNI_edit_gen_1_PRIMARY</td><td>24.31</td><td>63.32</td></tr>
<tr bgcolor="#DCDCDC"><td>Baseline_2 (Statistical phrase-based APE)</td><td>24.64</td><td>63.47</td></tr>
<tr bgcolor="#DCDCDC"><td>Official Baseline (MT)</td><td>24.76</td><td>62.11</td></tr>
<tr><td>DCU_R34_CONTRASTIVE</td><td>26.79</td><td>58.60</td></tr>
<tr><td>JUSAAR_SC_PRIMARY_BOTH</td><td>26.92</td><td>59.44</td></tr>
<tr><td>JUSAAR_SC_D_CONTRASTIVE_BOTH</td><td>26.97</td><td>59.18</td></tr>
<tr><td>DCU_R24_PRIMARY</td><td>28.97</td><td>55.19</td></tr>
</table>
<H3>DIFFERENCES FROM THE FIRST PILOT ROUND</H3>
<p>
Compared to the pilot round, the main differences are:
<UL>
<LI>the domain specificity (from news to IT);</LI>
<LI>the target language (from Spanish to German);</LI>
<LI>the post-editors (from crowdsourced workers to professional translators);</LI>
<LI>the evaluation metrics (from case-sensitive/insensitive TER to case-sensitive TER and BLEU);</LI>
<LI>the performance analysis (from automatic metrics to automatic metrics plus manual evaluation).</LI>
</UL>
</p>
<H3>Submission Format</H3>
<p>
The output of your system should provide automatic post-edits of the target sentences in the test set in the following format:
<pre>
<b><METHOD NAME> <SEGMENT NUMBER> <APE SEGMENT></b>
</pre>
</p>
Where:
<ul>
<li><code><b>METHOD NAME</b></code> is the name of your automatic post-editing method.</li>
<li><code><b>SEGMENT NUMBER</b></code> is the line number of the plain text target file you are post-editing.</li>
<li><code><b>APE SEGMENT</b></code> is the automatic post-edition for the particular segment.</li>
</ul>
<p>Each field should be delimited by a single tab character.</p>
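<p>A minimal sketch, in Python, of writing a submission file in this format; the method name <code>UniXY_pt_1_pruned</code> and the output file name are placeholders:</p>
<pre>
# Write one line per test segment, with METHOD NAME, SEGMENT NUMBER and
# APE SEGMENT separated by single tab characters.
def write_submission(path, method_name, ape_segments):
    with open(path, "w", encoding="utf-8") as out:
        for i, segment in enumerate(ape_segments, start=1):
            out.write("\t".join([method_name, str(i), segment.strip()]) + "\n")

ape_segments = ["Klicken Sie auf OK .", "Speichern Sie die Datei ."]  # toy data
write_submission("UniXY_pt_1_pruned_PRIMARY", "UniXY_pt_1_pruned", ape_segments)
</pre>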
<H3>Submission Requirements</H3>
<p>Each participating team can submit at most 3 systems, but they have to explicitly indicate which of them represents their <i>primary</i> submission. If none of the runs is marked as primary, the latest submission received will be used as the primary submission.</p>
<p>Submissions should be sent via email to <font color="red"><a href="mailto:[email protected]">[email protected]</a></font>. Please use the following pattern to name your files:</p>
<p><code><b>INSTITUTION-NAME_METHOD-NAME_SUBTYPE</b></code>, where:</p>
<p><code><b>INSTITUTION-NAME</b></code> is an acronym/short name for your institution, e.g. "UniXY"</p>
<p><code><b>METHOD-NAME</b></code> is an identifier for your method, e.g. "pt_1_pruned"</p>
<p><code><b>SUBTYPE</b></code> indicates whether the submission is primary or contrastive with the two alternative values: <code>PRIMARY</code>, <code>CONTRASTIVE</code>.</p>
<p>You are also invited to submit a short paper (4 to 6 pages) to WMT describing your APE method(s). Submitting a paper is not mandatory; if you choose not to, we ask you to provide an appropriate reference describing your method(s) that we can cite in the WMT overview paper.</p>
<h3>Important dates</h3>
<table>
<tr><td>Release of training data </td><td>February 19, 2016</td></tr>
<tr><td>Test set distributed </td><td>April 18, 2016</td></tr>
<tr><td>Submission deadline </td><td><strike>April 24, 2016</strike> <strike>April 26, 2016</strike> May 2, 2016 </td></tr>
<tr><td>Paper submission deadline</td><td><strike>May 8, 2016</strike> May 15, 2016 </td></tr>
<tr><td>Manual evaluation</td><td>May 2016</td></tr>
<tr><td>Notification of acceptance</td><td>June 5, 2016</td></tr>
<tr><td>Camera-ready deadline</td><td>June 22, 2016</td></tr>
</table>
<h3>Organisers</h3>
Rajen Chatterjee (Fondazione Bruno Kessler)
<br>
Matteo Negri (Fondazione Bruno Kessler)
<br>
Raphael Rubino (Saarland University)
<br>
Marco Turchi (Fondazione Bruno Kessler)
<br>
Marcos Zampieri (Saarland University)
<h3>Contact</h3>
<p>For any information or question about the task, please send an email to: <a href="mailto:[email protected]">[email protected]</a>.<br>
To stay updated about this year's edition of the APE task, you can also join the <a href="http://groups.google.com/a/fbk.eu/group/wmt-ape/" target="_blank">wmt-ape group</a>.</p>
<p align="right">
Supported by the European Commission under the QT21
<a href="http://www.qt21.eu/"><img align=right src="figures/qt21.png" border=0 width=100 height=40></a>
<br>project (grant number 645452)</p>
</body>
</HTML>