The Romanian UD treebank (called RoRefTrees) (Barbu Mititelu et al., 2016) is the reference treebank in UD format for standard Romanian.
It is based on RACAI-RoTb (Irimia and Barbu Mititelu, 2015) and on UAIC-RoTb (Perez, 2014). The distribution of text genres in RoRefTrees is not balanced: literature - 1818 sentences, law - 1606 sentences, medical - 1210 sentences, FrameNet translations - 1092 sentences, academic writing - 950 sentences, news - 933 sentences, science - 362 sentences, wikipedia - 251 sentences, miscellanea - 1301 sentences. These genres can be told apart by the sentences id.
We split the treebank as follows: the test set (ro-ud-test.conllu) is 12.5% of the whole treebank, the development set (ro-ud-dev.conllu) also 12.5%, while the rest of the treebank (75%) is the training set (ro-ud-train.conllu). The data was split in a random fashion.
Tree count: 9523 Word count: 218511 Token count: 218511 Dep. relations: 50 of which 11 language specific POS tags: 17 Category=value feature pairs: 57
This work was supported by a grant of the Romanian National Authority for Scientific Research and Innovation, CNCS - UEFISCDI, project number PN-II-RU-TE-2014-4-1362.
V. Barbu Mititelu, R. Ion, R. Simionescu, E. Irimia, C-A Perez. 2016. The Romanian Treebank Annotated According to Universal Dependencies. Proceedings of HrTAL2016, Dubrovnik, Croatia, 29 September - 1 October 2016. R. Ion, E. Irimia, D. Ștefănescu, D. Tufiș. 2012. ROMBAC: The Romanian Balanced Annotated Corpus. Proceedings of LREC'12, Istanbul, Turkey. E. Irimia, V. Barbu Mititelu. 2015. Building a Romanian Dependency Treebank. Corpus Linguistics 2015, Lancaster, UK, 21-24 July 2015. C-A Perez. 2014. Resurse lingvistice pentru prelucrarea limbajului natural, PhD thesis, A.I. Cuza University of Iasi.
- UD 1.2 --> 1.3: the number of trees was considerably increased.
- UD 1.3 --> 1.4:
- increase the treebank size to 9523 sentences;
- identical sentences (disregarding punctuation and numbers) at word form level have been removed;
- added a scientific (Physics, Mathematics and Computer Science) sub-corpus to make up for the loss;
- removed all words with underscores;
- removed most of the errors reported by the content validation tool;
- extensive POS-tagging and lemmatization corrections;
- ensuring more consistent data at the lexical, morphological and syntactic levels.
- UD 1.4 --> 2.0:
- manual improvements of the annotation, concerning POS-tagging, syntactic labeling, one sentence split.
- automatic conversion to UDv2 guidelines using Udapi (http://udapi.github.io/) ud.Convert1to2
- automatic reconstruction of original texts (ud.ro.SetSpaceAfter), pseudo-documents marked with newdoc markup
- re-split: train=185,113 (84.7%) tokens, dev=17,074 (7.8%) tokens, dev=16,324 (7.5%) tokens. Each pseudo-document equally distributed into train/dev/test.
- test set omitted from the UDv2.0 official release because of CoNLL 2017 shared task.
- UD 2.0 --> 2.1: no modifications to the previous version.
- UD 2.1 --> 2.2: repository renamed from UD_Romanian to UD_Romanian-RRT
- UD 2.6 --> 2.7: manual improvement of the annotations, concerning POS-tagging, syntactic labeling
- UD 2.7 --> 2.8: automatic (but manually checked) improved POS-tagging, mainly for numerals and auxiliaries; some automatic dependency relation corrections (
obl
for nouns headed by verbs,nsubj:pass
for subjects of passive constructions). Correction scripts are in GitHub at ro-ud-autocorrect.
=== Machine-readable metadata ================================================= Data available since: UD v1.2 License: CC BY-SA 4.0 Includes text: yes Genre: wiki legal news fiction medical nonfiction academic Lemmas: automatic UPOS: converted with corrections XPOS: automatic Features: converted with corrections Relations: manual native Contributors: Barbu Mititelu, Verginica; Irimia, Elena; Perez, Cenel-Augusto; Ion, Radu; Simionescu, Radu; Popel, Martin Contributing: elsewhere Contact: [email protected] ===============================================================================