<HTML>
<HEAD>
<title>Quality Estimation Task - ACL 2017 Second Conference on Machine Translation</title>
<style> h3 { margin-top: 2em; } </style>
</HEAD>
<body>
<center>
<script src="title.js"></script>
<p><h2>Shared Task: Quality Estimation</h2></p>
<script src="menu.js"></script>
</center>
<p>This shared task will build on its previous five editions to further
examine automatic methods for estimating the quality of machine
translation output at run-time, without relying on reference
translations. We include <b>word-level</b>, <b>phrase-level</b> and <b>sentence-level</b>
estimation. All tasks will make use of a large dataset produced from
post-editions by professional translators. The data will be
domain-specific (IT and Pharmaceutical domains) and substantially larger than in previous
years. In addition to advancing the state of the art at all prediction
levels, our <b>goals</b> include:
</p><ul>
<li>To test the effectiveness of larger (domain-specific and professionally
annotated) datasets. We will do so by increasing the size of one of
last year's training sets. </li>
<li>To study the effect of language direction and domain. We will do so by
providing two datasets created in similar ways, but for different
domains and language directions.</li>
<li>To investigate the utility of detailed information logged during
post-editing. We will do so by providing post-editing time,
keystrokes, and actual edits.</li>
<li>To measure progress over the years at all prediction levels. We will do so by using last year's test set for comparative experiments.</li>
</ul>
This year's shared task provides new training and test datasets for all
tasks, and allows participants to explore any additional data and
resources deemed relevant. An in-house MT system was used to produce
translations for all tasks. MT system-dependent information can be made
available upon request. The data is publicly available, but since it has
been provided by our industry partners it is subject to specific terms
and conditions. However, these have no practical implications for the use
of this data for research purposes.
<p><br></p><hr>
<br>
<font color="purple"><b>Gold-standard labels for all subtasks </b> <a href="http://www.quest.dcs.shef.ac.uk/wmt17_files_qe/wmt17_de-en_gold.tar.gz">German-English<a/> and <a href="http://www.quest.dcs.shef.ac.uk/wmt17_files_qe/wmt17_en-de_gold.tar.gz">English-German<a/></font></b>.
<!-- BEGIN SENTENCE-LEVEL-->
<h3><font color="blue">Task 1: Sentence-level QE</font></h3>
<p><b><font color="purple">Results <a href="http://www.quest.dcs.shef.ac.uk/wmt17_files_qe/wmt17_task1_results.pdf">here<a/></font></b>.
<p>Participating systems are required to score (and rank) sentences
according to post-editing effort. Multiple labels will be made
available, including the percentage of edits needed to fix the translation (HTER),
post-editing time, and keystrokes. The main prediction label will be
HTER, but we welcome participants wanting to submit models trained to
predict other labels.
Predictions according to each alternative label will be evaluated
independently. For the ranking variant, the predictions can be generated
by models built using any of these labels (or a combination of them), as
well as using external information. The <b>data</b> consists of:
</p><ul>
<li><b>English-German</b>: segments in the IT domain, translated by an in-house phrase-based SMT system and post-edited by professional translators (23,000 for training, 1,000 for development). Download the <a href="https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1974">training and development</a> data and baseline features.</li>
<li><b>German-English</b>: segments in the Pharmaceutical domain, translated by an in-house phrase-based SMT system and post-edited by professional translators (25,000 for training, 1,000 for development). Download the <a href="https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1974">training and development</a> data and baseline features. <font color="red"><b>WARNING</b>: there was an issue with the baseline features extracted for this language pair, so if you are using them, please download the file again from the same link (the 'Corrected Version' file).</font></li>
</ul>
<p>The data for download contains source sentences, their machine translations, their post-editions, and HTER as the post-editing effort score. Other scores, such as post-editing time, will be made available shortly.
In both cases, the <a href="https://github.com/ghpaetzold/PET">PET</a> tool was used to collect these various types of information during post-editing. HTER labels were computed using <a href="http://www.umiacs.umd.edu/%7Esnover/terp/">TER</a> (default settings: tokenised, case insensitive, exact matching only, with scores capped to 1).</p>
<p></p>
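<p>As a rough illustration of how an HTER-style score relates an MT output to its post-edition, the Python sketch below divides the word-level edit distance by the length of the post-edition and caps the result at 1. This is only an approximation of the official labels, which were produced with the TER tool (TER additionally models block shifts); the function names are our own.</p>
<pre>
# Approximate HTER: word-level edit distance between the MT output and its
# post-edition, divided by the post-edition length and capped at 1.
# Illustrative sketch only; the official labels come from the TER tool,
# which also models block shifts.

def edit_distance(hyp_tokens, ref_tokens):
    """Standard Levenshtein distance over tokens."""
    m, n = len(hyp_tokens), len(ref_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

def approximate_hter(mt_sentence, post_edit):
    hyp, ref = mt_sentence.lower().split(), post_edit.lower().split()
    if not ref:
        return 0.0
    return min(1.0, edit_distance(hyp, ref) / len(ref))
</pre>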
<p>As <i><font color="green">test data</font></i>, for each language pair we will provide <b>2,000</b> new sentence translations, produced by the same SMT system used for the training data for each language pair. <font color="red"><b>NEW</b></font>: Download the <a href="http://hdl.handle.net/11372/LRT-2135">test</a> data and baseline features. For English-German, note that we are also releasing the <b>2016 test data</b>. Please submit your results for both test sets so we can attempt to measure progress over years.
</p><p></p><p>
The usual <a href="http://www.quest.dcs.shef.ac.uk/quest_files/features_blackbox_baseline_17">17 features</a> used in WMT12-16 are used for the <b>baseline system</b>.
This system uses SVM regression with an RBF kernel, as well as a grid-search
algorithm to optimise the relevant parameters. <a href="https://github.com/ghpaetzold/questplusplus">QuEst++</a> is used to build the prediction models.</p>
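<p>For illustration, a comparable regression baseline could be trained with scikit-learn as sketched below. This is not the official QuEst++ setup: the file names, the parameter grid and the cross-validation settings are assumptions.</p>
<pre>
# Sketch of an SVM-regression baseline over the 17 black-box features,
# in the spirit of the official baseline. File names, the parameter grid
# and the cross-validation settings are illustrative assumptions.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

X_train = np.loadtxt("train.features")   # 17 feature values per segment (hypothetical file)
y_train = np.loadtxt("train.hter")       # one HTER label per segment (hypothetical file)
X_test = np.loadtxt("test.features")

# Grid search over the RBF-kernel hyper-parameters.
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1], "epsilon": [0.1, 0.2]}
model = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
model.fit(X_train, y_train)

predictions = model.predict(X_test)      # predicted HTER per test segment
</pre>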
<p>As in previous years, two variants of the results can be submitted:
</p><ul>
<li><b>Scoring</b>: An absolute quality score for each sentence
translation according to the type of prediction, to be interpreted as an
error metric: lower scores mean better translations.</li>
<li><b>Ranking</b>: A ranking of sentence translations for all source
sentences from best to worst. For this variant, it does not matter how
the ranking is produced (from HTER predictions, Likert predictions,
post-editing time, etc.). The reference ranking will be defined based on
the true HTER scores.</li>
</ul>
<p><i><font color="green">Evaluation</font></i> is performed against the true label and/or ranking using as metrics:
</p><ul>
<li><b>Scoring</b>: Pearson's correlation (primary), Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).</li>
<li><b>Ranking</b>: Spearman's rank correlation (primary) and DeltaAvg. </li>
</ul>
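<p>A minimal sketch of how these metrics can be computed with scipy and scikit-learn is given below; DeltaAvg is left to the official evaluation scripts.</p>
<pre>
# Scoring and ranking metrics for Task 1 (DeltaAvg omitted; it is computed
# by the official evaluation scripts).
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error

def sentence_level_metrics(gold, predicted):
    gold, predicted = np.asarray(gold), np.asarray(predicted)
    return {
        "pearson": pearsonr(gold, predicted)[0],    # primary scoring metric
        "mae": mean_absolute_error(gold, predicted),
        "rmse": float(np.sqrt(mean_squared_error(gold, predicted))),
        "spearman": spearmanr(gold, predicted)[0],  # primary ranking metric
    }
</pre>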
<!--<font color="red">Add link to evaluation script</font>-->
<p><br></p><hr>
<!-- BEGIN WORD-LEVEL-->
<h3><font color="blue">Task 2: Word-level QE</font></h3>
<p><b><font color="purple">Results <a href="http://www.quest.dcs.shef.ac.uk/wmt17_files_qe/wmt17_task2_results.pdf">here<a/></font></b>.
<p>Participating systems are required to detect errors for each token
in MT output. We frame the problem as the binary task of distinguishing
between 'OK' and 'BAD' tokens. </p>
<p>The <b>data</b> for this task is the same as provided in Task 1. As
in previous years, all segments are automatically annotated for errors
with binary word-level labels by using the alignments provided by the <a href="http://www.cs.umd.edu/%7Esnover/tercom/" target="_blank">TER</a>
tool (settings: tokenised, case insensitive, exact matching only,
disabling shifts by using the <code>-d 0</code> option) between machine
translations and their post-edited versions. Shifts (word order errors)
were not annotated as such (but rather as deletions + insertions) to
avoid introducing noise in the annotation.</p>
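<p>For intuition only, the sketch below derives OK/BAD labels by aligning the MT output to its post-edition with a longest-common-subsequence matcher. The official annotation uses the TER tool with shifts disabled, so this is an approximation rather than a reproduction of the released labels.</p>
<pre>
# Illustrative derivation of word-level OK/BAD labels by aligning the MT
# output against its post-edition. Approximation only: the official labels
# come from the TER tool with shifts disabled.
import difflib

def word_labels(mt_tokens, pe_tokens):
    labels = ["BAD"] * len(mt_tokens)
    matcher = difflib.SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            labels[i] = "OK"   # token also present in the post-edition
    return labels

print(word_labels("der Haus ist groß".split(), "das Haus ist groß".split()))
# ['BAD', 'OK', 'OK', 'OK']
</pre>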
As <i><font color="green">training</font></i> and <i><font color="green">development</font></i> data, we provide the tokenised translation outputs with tokens annotated with 'OK' or 'BAD' labels. Download:
<ul>
<!--<li><b>English-German</b> <a href="http://www.quest.dcs.shef.ac.uk/wmt17_files_qe/task2_en-de_training-dev.tar.gz">training and development</a> data and baseline features.</li>
<li><b>German-English</b> <a href="http://www.quest.dcs.shef.ac.uk/wmt17_files_qe/task2_de-en_training-dev.tar.gz">training and development</a> data and baseline features.</li> -->
<li><b>English-German</b> <a href="https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1974">training and development</a> data and baseline features.</li>
<li><b>German-English</b> <a href="https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1974">training and development</a> data and baseline features.</li>
</ul>
<p>As <i><font color="green">test data</font></i>, for each language pair we will provide <b>2,000</b> new sentence translations, produced and annotated in the same way. <font color="red"><b>NEW</b></font>: Download the <a href="http://hdl.handle.net/11372/LRT-2135">test</a> data and baseline features. For English-German, note that we are also releasing the <b>2016 test data</b>. Please submit your results for both test sets so we can attempt to measure progress over years.
<p>The baseline system is similar to the baselines used at WMT-15 and WMT-16: the set of <a href="http://www.quest.dcs.shef.ac.uk/wmt17_files_qe/task2_en-de_baseline.feature.list">baseline features</a> includes the same features as those used last year, with the addition of feature combinations (target word + left/right context, target word + source word, etc.). The features are extracted with the <a href="https://github.com/qe-team/marmot">Marmot</a> QE tool. The system is trained with the <a href="http://www.chokkan.org/software/crfsuite/">CRFSuite</a> toolkit using the passive-aggressive algorithm.</p>
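<p>A minimal sketch of training such a labeller with the python-crfsuite bindings and the passive-aggressive algorithm is shown below. The Marmot feature extraction is not reproduced; the toy features, example sentence, labels and file names are assumptions.</p>
<pre>
# Sketch of a word-level OK/BAD sequence labeller trained with python-crfsuite
# using the passive-aggressive ('pa') algorithm. The toy feature extraction
# stands in for the Marmot features used by the official baseline.
import pycrfsuite

def token_features(tokens, i):
    # Target word plus left/right context, mirroring the baseline feature combinations.
    left = tokens[i - 1] if i else "BOS"
    right = tokens[i + 1] if i + 1 != len(tokens) else "EOS"
    return ["word=" + tokens[i], "left=" + left, "right=" + right,
            "word+left=" + tokens[i] + "|" + left]

# Toy training data; in practice, load (mt_tokens, labels) pairs from the released files.
train_sentences = [("Geben Sie im Eigenschafteninspektor".split(), ["OK", "OK", "BAD", "OK"])]

trainer = pycrfsuite.Trainer(algorithm="pa", verbose=False)
for tokens, labels in train_sentences:
    trainer.append([token_features(tokens, i) for i in range(len(tokens))], labels)
trainer.train("wordlevel.crfsuite")   # hypothetical model file name

tagger = pycrfsuite.Tagger()
tagger.open("wordlevel.crfsuite")
test_tokens = "Geben Sie".split()
print(tagger.tag([token_features(test_tokens, i) for i in range(len(test_tokens))]))
</pre>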
<p>Submissions are <i><font color="green">evaluated</font></i> in
terms of classification performance via the multiplication of the F1-scores
for the 'OK' and 'BAD' classes against the original labels, as in WMT16.
We will also report the F1-BAD score.
We use this <a href="https://gist.github.com/varvara-l/028e4439fb992d089935" target="_blank">evaluation script</a> for the metrics, and
<a href="https://gist.github.com/varvara-l/d66450db8da44b8584c02f4b6c79745c">this script</a> to compute significance levels using approximate randomisation.</p>
<p><b><font color="red">NEW</font></b>: Submissions to the word-level task will also be <font color="green">evaluated</font> in terms of their performance at sentence level. The motivation for that is that we found that sometimes predictions at word level can work well as sentence-level predictors: the percentage of words labelled as 'BAD' in a sentence should essentially be similar to a sentence-level HTER score.
All submissions for Task 2 will automatically be evaluated analogously to the sentence-level scoring task: using Pearson correlation (primary metric), MAE and RMSE scores. Participants aiming to optimise their models against sentence-level metrics can submit one additional system per language pair if they wish so, using the submission format of Task 2. The binary word-level predictions will be used to compute the sentence-level score: number of words with 'BAD' label over the length of sentence.</p>
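<p>The sketch below illustrates both evaluation views: the product of the F1 scores for the 'OK' and 'BAD' classes over all tokens, and the derived sentence-level score as the fraction of tokens predicted 'BAD'.</p>
<pre>
# Word-level evaluation (product of per-class F1 scores) and the derived
# sentence-level score (fraction of tokens labelled 'BAD'), as described above.
from sklearn.metrics import f1_score

def f1_multiplied(gold_labels, predicted_labels):
    # gold_labels / predicted_labels: flat lists of 'OK'/'BAD' tokens.
    f1_ok = f1_score(gold_labels, predicted_labels, pos_label="OK")
    f1_bad = f1_score(gold_labels, predicted_labels, pos_label="BAD")
    return f1_ok * f1_bad

def sentence_score_from_words(word_predictions):
    # word_predictions: list of 'OK'/'BAD' labels for one sentence.
    return word_predictions.count("BAD") / len(word_predictions)
</pre>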
<br>
<hr>
<!-- BEGIN PHRASE-LEVEL-->
<h3><font color="blue">Task 3: Phrase-level QE</font></h3>
<p><b><font color="purple">Results <a href="http://www.quest.dcs.shef.ac.uk/wmt17_files_qe/wmt17_task3_results.pdf">here<a/></font></b>.
<p>For this task, given a 'phrase' (segmentation as given by the SMT
decoder), participants are required to label it as 'OK' or 'BAD'. Errors
made by MT engines are interdependent and one incorrectly chosen word
can cause more errors, especially in its local context. Phrases as
produced by SMT decoders can be seen as a representation of this local
context and in this task we ask participants to consider them as atomic
units, using phrase-specific information to improve upon the results of
the word-level task.
<p>
The <b>data</b> for this task is the same as provided in Tasks 1 and 2.
The labelling of this data was adapted from the word-level labelling by
assigning the 'BAD' tag to any phrase that contains at least one 'BAD'
word (see the sketch below). We note, however, that <i>the order of the words in the source sentences differs from the original word order</i>, as some pre-ordering was applied to the source sentences before decoding. Given that our phrases correspond to the decoder segmentation (based on this reordered version of the source), it is not possible to revert the pre-ordering while keeping the segmentation produced by the decoder. We also provide the original source sentences (before pre-ordering) for those interested.
<p>As <i><font color="green">training</font></i> and <i><font color="green">development</font></i>
data, we provide the tokenised translation outputs with phrase
segmentation for both source and machine-translated sentences. We also
provide target-source phrase alignments and phrase-level labels.
Download:
<ul>
<li><b>English-German</b> <a href="https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1974">training and development</a> data and baseline features.</li>
<li><b>German-English</b> <a href="https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1974">training and development</a> data and baseline features.</li>
</ul>
<p>The baseline phrase-level system is analogous to last year's system: it uses a set of <a href="http://www.quest.dcs.shef.ac.uk/wmt17_files_qe/task3_en-de_baseline.feature.list">baseline features</a> (based on black-box sentence-level features) extracted with the <a href="https://github.com/qe-team/marmot">Marmot</a> tool and is trained with the <a href="http://www.chokkan.org/software/crfsuite/">CRFSuite</a> tool.</p>
<p>As <i><font color="green">test data</font></i>, for each language pair we will provide <b>2,000</b> new sentence translations, produced and annotated in the same way. <font color="red"><b>NEW</b></font>: Download the <a href="http://hdl.handle.net/11372/LRT-2135">test</a> data and baseline features. For English-German, note that we are also releasing the <b>2016 test data</b>. Please submit your results for both test sets so we can attempt to measure progress over years. </p>
<p>Submissions will be <i><font color="green">evaluated</font></i> in terms of the multiplication of <b>phrase-level</b> F1-OK and F1-BAD. </p>
<!---- TASK 3B -->
<p><h3><font color="blue">Task 3b: Phrase-level QE with human annotation</font></h3>
<p><b><font color="red">This task was cancelled this year due to issues in the labelling of the data</font></b>.<p>
This task uses a subset of the data in Task 3 (German-English only) where each phrase has been annotated (as a phrase) by humans with three labels: 'OK', 'BAD' (as before) and 'BAD_word_order', which is a specific type of error where the phrase is in an incorrect position in the sentence.
<p>The <i><font color="green">training</font></i> and <i><font color="green">development</font></i> data follow the same structure as for Task 3, but it is smaller (124 and 3,769 sentences, respectively). Download:
<ul>
<li><b>German-English</b> <a href="https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1974">training and development</a> data and baseline features.</li>
</ul>
<p>The baseline phrase-level system and evaluation procedures are the same as for Task 3.
<p>As <i><font color="green">test data</font></i>, we will provide <b>306</b> new sentence translations, produced and annotated in the same way. <font color="red"><b>NEW</b></font>: Download the <a href="http://hdl.handle.net/11372/LRT-2135">test</a> data and baseline features. </p>
<br>
<hr>
<!-- EXTRA STUFF -->
<h3>Additional resources</h3>
<p>These are the resources we have used to extract the baseline features
in Task 1, which can also be useful for Tasks 2 and 3. If you require
other resources/info from the MT system, let us know:
</p><p>
<b>English-German</b>
</p><ul>
<li>English <a href="http://www.quest.dcs.shef.ac.uk/quest_files_16/lm.tok.en.tar.gz">language model</a></li>
<li>English <a href="http://www.quest.dcs.shef.ac.uk/quest_files_16/ngram-count.tok.en.out.clean.tar.gz">n-gram counts</a></li>
<li>German <a href="http://www.quest.dcs.shef.ac.uk/quest_files_16/lm.tok.de.tar.gz">language model</a></li>
<li>English-German (and v.v.) <a href="http://www.quest.dcs.shef.ac.uk/quest_files_16/EN-DE.lex.tar.gz">lexical translation tables</a></li>
</ul>
<p><b>German-English</b>
</p><ul>
<li>German <a href="http://www.quest.dcs.shef.ac.uk/wmt17_files_qe/lm.tok.de.tar.gz">language model</a></li>
<li>German <a href="http://www.quest.dcs.shef.ac.uk/wmt17_files_qe/ngram-count.tok.de.out.clean.tar.gz">n-gram counts</a></li>
<li>English <a href="http://www.quest.dcs.shef.ac.uk/wmt17_files_qe/lm.tok.en.tar.gz">language model</a></li>
<li>German-English (and v.v.) <a href="http://www.quest.dcs.shef.ac.uk/wmt17_files_qe/DE-EN.lex.tar.gz">lexical translation tables</a></li>
</ul>
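<p>As an example of how the language models above can be used for feature extraction, the sketch below computes the log-probability and perplexity of a target sentence with the KenLM Python bindings, assuming the downloaded archive contains an ARPA or binary model file (the file name is a placeholder).</p>
<pre>
# Sketch of LM-based baseline features (log-probability and perplexity of the
# target sentence) using the KenLM Python bindings. The model file name is a
# placeholder for whatever the downloaded archive contains.
import kenlm

lm = kenlm.Model("lm.tok.de.arpa")             # hypothetical extracted file name
sentence = "Geben Sie im Eigenschafteninspektor ."
print(lm.score(sentence))                      # total log10 probability
print(lm.perplexity(sentence))                 # sentence-level perplexity
</pre>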
<p><br></p><hr>
<!-- SUBMISSION INFO -->
<h3>Submission Format</h3>
<h4><font color="red">Tasks 1</font></h4>
<p> The output of your system for a <b>given subtask</b> should produce scores for the translations at the <i>segment-level</i> formatted in the following way: </p>
<pre><METHOD NAME> <SEGMENT NUMBER> <SEGMENT SCORE> <SEGMENT RANK><br><br></pre>
Where:
<ul>
<li><code>METHOD NAME</code> is the name of your
quality estimation method.</li>
<li><code>SEGMENT NUMBER</code> is the line number
of the plain text translation file you are scoring/ranking.</li>
<li><code>SEGMENT SCORE</code> is the predicted score (e.g. HTER) for the
particular segment; set it to 0 for all segments if you are only submitting
ranking results. </li>
<li><code>SEGMENT RANK</code> is the rank of
the particular segment; set it to 0 for all segments if you are only submitting
absolute scores. </li>
</ul>
Each field should be delimited by a single tab character.
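<p>For instance, a submission file for this task could be written as in the sketch below (the method name, file name and segment numbering convention are placeholders/assumptions).</p>
<pre>
# Sketch: write a Task 1 submission with tab-separated fields
# METHOD NAME, SEGMENT NUMBER, SEGMENT SCORE, SEGMENT RANK.
method = "SHEF_1_SVM"                # placeholder method name
predictions = [0.21, 0.05, 0.38]     # predicted HTER per segment

with open("task1_submission.txt", "w") as out:   # hypothetical file name
    # Rank 1 = best (lowest predicted HTER); here the ranking is derived from the scores.
    order = sorted(range(len(predictions)), key=lambda i: predictions[i])
    ranks = {seg: rank + 1 for rank, seg in enumerate(order)}
    for seg, score in enumerate(predictions):
        # Assuming segment numbers follow the line numbers of the translation file.
        fields = [method, str(seg + 1), str(score), str(ranks[seg])]
        out.write("\t".join(fields) + "\n")
</pre>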
<h4><font color="red">Task 2</font></h4>
<p> The output of your system should produce scores for the translations at the <i>word-level</i>
formatted in the following way: </p>
<pre><METHOD NAME> <SEGMENT NUMBER> <WORD INDEX> <WORD> <BINARY SCORE> <br><br></pre>
Where:
<ul>
<li><code>METHOD NAME</code> is the name of your quality estimation method.</li>
<li><code>SEGMENT NUMBER</code> is the line number of the plain text translation file you are scoring (starting at 0).</li>
<li><code>WORD INDEX</code> is the index of the word in the tokenised sentence, as given in the training/test sets (starting at 0).</li>
<li><code>WORD</code> is the actual word.</li>
<li><code>BINARY SCORE</code> is either 'OK' for no issue or 'BAD' for any issue.</li>
</ul>
Each field should be delimited by a single tab character.
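<p>Analogously, a word-level submission could be produced as sketched below (method and file names are placeholders).</p>
<pre>
# Sketch: write a Task 2 submission with tab-separated fields
# METHOD NAME, SEGMENT NUMBER, WORD INDEX, WORD, BINARY SCORE (0-based indices).
method = "SHEF_2_CRF"                           # placeholder method name
# One entry per test sentence: (mt_tokens, predicted OK/BAD labels).
predictions = [("Geben Sie im".split(), ["OK", "OK", "BAD"])]

with open("task2_submission.txt", "w") as out:  # hypothetical file name
    for seg, (tokens, labels) in enumerate(predictions):
        for idx, (word, label) in enumerate(zip(tokens, labels)):
            out.write("\t".join([method, str(seg), str(idx), word, label]) + "\n")
</pre>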
<h4><font color="red">Task 3 and Task 3b</font></h4>
<p> The output of your system should produce scores for the translations at the <i>phrase-level</i>
formatted in the following way: </p>
<pre><METHOD NAME> <SEGMENT NUMBER> <PHRASE INDEX> <PHRASE> <BINARY SCORE> <br><br></pre>
Where:
<ul>
<li><code>METHOD NAME</code> is the name of your quality estimation method.</li>
<li><code>SEGMENT NUMBER</code> is the line number of the plain text translation file you are scoring (starting at 0).</li>
<li><code>PHRASE INDEX</code> is the index of the phrase in the segmented sentence, as given in the training/test sets (starting at 0).</li>
<li><code>PHRASE</code> is the actual phrase. Multi-word phrases should be written in full, with words delimited by spaces.</li>
<li><code>BINARY SCORE</code> is either 'OK' for no issue or 'BAD' for any issue.</li>
</ul>
Each field should be delimited by a single tab character.
<p>Example of the phrase-level format:</p>
<table>
<tr>
<td width="20%"><tt>PHRASE_BASELINE</tt></td> <td width="10%">4</td> <td width="10%">0</td> <td>Geben Sie im Eigenschafteninspektor (</td> <td width="10%">BAD<td>
</tr>
<tr>
<td><tt>PHRASE_BASELINE</tt></td> <td>4</td> <td>1</td> <td>" Fenster " > " Eigenschaften "</td> <td>OK</td>
</tr>
<tr>
<td><tt>PHRASE_BASELINE</tt></td> <td>4</td> <td>2</td> <td>) , und wählen Sie</td> <td>BAD</td>
</tr>
<tr>
<td><tt>PHRASE_BASELINE</tt></td> <td>4</td> <td>3</td> <td>Statischer Text</td> <td>OK</td>
</tr>
<tr>
<td><tt>PHRASE_BASELINE</tt></td> <td>4</td> <td>4</td> <td>oder</td> <td>OK</td>
</tr>
<tr>
<td><tt>PHRASE_BASELINE</tt></td> <td>4</td> <td>5</td> <td>Dynamischer Text</td> <td>OK</td>
</tr>
<tr>
<td><tt>PHRASE_BASELINE</tt></td> <td>4</td> <td>6</td> <td>.</td> <td>OK</td>
</tr>
</table>
<p>The example shows the labelling performed by the <tt>PHRASE_BASELINE</tt> system for the following sentence (double vertical lines show phrase borders):</p>
<p style="text-indent:25px">Geben Sie im Eigenschafteninspektor ( || ' Fenster ' > ' Eigenschaften ' || ) , und wählen Sie || Statischer Text || oder || Dynamischer Text || .</p>
<h3>Submission Requirements</h3>
Each participating team can submit at most 2 systems for each language pair of each subtask (systems producing alternative scores, e.g. post-editing time, can be submitted as additional runs). Submissions should be sent
via email to Lucia Specia <a href="mailto:[email protected]" target="_blank">[email protected]</a>. Please use the following pattern to name your files:
<p>
<code>INSTITUTION-NAME</code>_<code>TASK-NAME</code>_<code>METHOD-NAME</code>, where:
</p><p> <code>INSTITUTION-NAME</code> is an acronym/short name for your institution, e.g. SHEF
</p><p><code>TASK-NAME</code> is one of the following: 1, 2, 3.
</p><p><code>METHOD-NAME</code> is an identifier for your method in case you have multiple methods for the same task, e.g. 2_J48, 2_SVM
</p><p> For instance, a submission from team SHEF for task 2 using method "SVM" could be named SHEF_2_SVM.
</p><p>You are invited to submit a short paper (4 to 6 pages) to WMT
describing your QE method(s). You are not required to
submit a paper if you do not want to; in that case, we ask you
to provide an appropriate reference describing your method(s) that we can cite
in the WMT overview paper.</p>
<h3>Important dates</h3>
<table>
<tbody><tr><td>Release of training data </td><td>February 10, 2017</td></tr>
<tr><td>Release of test data </td><td>April 10, 2017</td></tr>
<tr><td>QE metrics results submission deadline </td><td>May 14, 2017</td></tr>
<tr><td>Paper submission deadline</td><td>June 9, 2017</td></tr>
<tr><td>Notification of acceptance</td><td>June 30, 2017</td></tr>
<tr><td>Camera-ready deadline</td><td>July 14, 2017</td></tr>
</tbody></table>
<h3>Organisers</h3>
<br>
Varvara Logacheva (University of Sheffield)
<br>
Lucia Specia (University of Sheffield)
<br>
<h3>Contact</h3>
<p> For questions or comments, email Lucia
Specia <a href="mailto:[email protected]" target="_blank">[email protected]</a>.
</p>
<p align="right">
Supported by the European Commission under the projects
<br>
<a href="http://www.qt21.eu/"><img src="figures/qt21.png" height="40" width="100" border="0" align="right"></a>
<a href="http://cracker-project.eu/"><img src="figures/cracker-logo-no-tag-large.png" height="40" width="100" border="0" align="right"></a>
</p>
</body></html>