UD pipelines
This page describes multilingual text analysis according to the Universal Dependencies (UD) guidelines. Models for more than 60 languages are available. The following two pipelines are intended for UD analysis.
Pipeline | Description | Input | Output* |
---|---|---|---|
deepud | Full pipeline that includes tokenization and sentence segmentation. | plain text | CoNLL-U |
deepud-pretok | Pipeline that starts from pretokenized text. | CoNLL-U | CoNLL-U |
* Pipelines can be configured for CoNLL-03 output (see the conllDumperNer processing unit configuration in lima_linguisticprocessing/conf/lima-lp-$LANG-CODE.xml).
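For instance, to locate this configuration in a source checkout, you can search the configuration files directly (the path is the one mentioned above; this is just a quick way to find the relevant section):
$ grep -n "conllDumperNer" lima_linguisticprocessing/conf/lima-lp-*.xml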
Use the lima_models.py script to download and install models to the user's home directory (we follow the XDG specification to install and search for LIMA models):
$ lima_models.py -l english
To get information about the available models, use the -i switch:
$ lima_models.py -i
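If you want to check what was installed, the models should land under your XDG data directory (~/.local/share by default). The exact subdirectory shown below is an assumption based on the resource layout described later on this page:
$ ls ~/.local/share/lima/resources/TensorFlowMorphoSyntax/ud/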
Alternatively, you can manually download the language packages you need from the Releases section of the lima-models repository. You can use as many language packages simultaneously as you need. Install each language package with apt, e.g.:
$ sudo apt install ./lima-deep-models-english_0.1.5_all.deb
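For example, fetching the English package directly before installing it could look like the command below; the repository organization and release tag in the URL are assumptions, so check the Releases page for the exact link:
$ wget https://github.com/aymara/lima-models/releases/download/v0.1.5/lima-deep-models-english_0.1.5_all.deb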
Refer to the LIMA user manual for detailed instructions. In short, use the ud "language", the deepud pipeline and the 3-letter ISO 639-3 language code to choose the required language when running analyzeText:
$ analyzeText -l ud -p deepud --meta udlang:eng your-text-file.txt
To analyze pretokenized text (a .conllu input file), use the deepud-pretok pipeline:
$ analyzeText -l ud -p deepud-pretok --meta udlang:eng your-text-file.conllu
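A pretokenized input file follows the standard CoNLL-U layout: one token per line, ten tab-separated columns, underscores for columns that are not yet annotated, and a blank line between sentences. A minimal hand-made example (the sentence itself is arbitrary):
# text = LIMA analyzes texts.
1	LIMA	_	_	_	_	_	_	_	_
2	analyzes	_	_	_	_	_	_	_	_
3	texts	_	_	_	_	_	_	_	_	SpaceAfter=No
4	.	_	_	_	_	_	_	_	_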
A short command-line syntax is also available; it works for all languages except English and French:
$ analyzeText -l spa -p deepud your-text-file.txt
For English and French, the short syntax requires UD to be mentioned as part of the language code. This form works equally for all languages:
$ analyzeText -l ud-eng -p deepud your-text-file.txt
Processing unit | deepud | deepud-pretok |
---|---|---|
Input: | | |
cpptftokenizer | + | |
conllureader | | + |
RNN-based PoS tagger and dependency parser: | | |
tfmorphosyntax | + | + |
RNN-based lemmatizer: | | |
tflemmatizer | + | + |
Output: | | |
conllDumper | + | + |
For the up-to-date definitions of these pipelines, please check the corresponding configuration files: lima_linguisticprocessing/conf/lima-lp-$LANG-CODE.xml.
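To see which process units a pipeline actually runs on your installation, you can grep the pipeline name in the corresponding configuration file (replace $LANG-CODE as above; the number of context lines is arbitrary):
$ grep -n -A 20 "deepud" lima_linguisticprocessing/conf/lima-lp-$LANG-CODE.xml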
The current performance of LIMA in all supported languages is reported on a dedicated page. Some of these results are reported below, for English and French only.
Regarding speed, note that LIMA and UDify use multithreaded computation and normally consume all available CPU cores, while UDPipe uses only one thread. This difference is ignored here.
English:
Tool | Mode | Tokens | Sentences | Words | UPOS | UFeats | Lemmas | UAS | LAS | Speed (w/s) |
---|---|---|---|---|---|---|---|---|---|---|
lima | raw | 98.85 | 85.14 | 98.85 | 94.89 | 90.81 | 94.17 | 85.15 | 82.06 | 245 |
lima | gold-tok | 100 | 100 | 100 | 95.95 | 91.86 | 95.09 | 87.91 | 84.65 | 254 |
udpipe | raw | 98.9 | 86.92 | 98.9 | 93.34 | 94.3 | 95.45 | 81.83 | 78.64 | 1793 |
udpipe | gold-tok | 100 | 100 | 100 | 94.43 | 95.37 | 96.41 | 84.4 | 81.08 | 2281 |
udify | gold-tok | 100 | 100 | 100 | 96.29 | 96.19 | 97.39 | 91.12 | 88.53 | 92 |
French:
Tool | Mode | Tokens | Sentences | Words | UPOS | UFeats | Lemmas | UAS | LAS | Speed (w/s) |
---|---|---|---|---|---|---|---|---|---|---|
lima | raw | 99.69 | 84.22 | 97.94 | 96.06 | 89.28 | 94.91 | 85.09 | 82.4 | 291 |
lima | gold-tok | 100 | 100 | 100 | 98.25 | 91.33 | 96.91 | 89.1 | 86.48 | 300 |
udpipe | raw | 99.79 | 87.5 | 99.09 | 96.1 | 94.93 | 96.93 | 84.85 | 82.09 | 3349 |
udpipe | gold-tok | 100 | 100 | 100 | 97.08 | 95.84 | 97.82 | 86.83 | 84.13 | 3349 |
udify | gold-tok | 100 | 100 | 100 | 97.93 | 89.41 | 97.24 | 92.07 | 89.22 | 86 |
- Speed: the current speed is around 300 w/s (80-900 w/s depending on the particular language model; see the evaluation page). This may not be acceptable for everyday use. We are still working on speed improvements.
- RAM consumption: depending on the particular language model and the word embeddings file size (see details below), analysis can take up to 32 GB of RAM.
- The Typo and Abbr features and the XPOS CoNLL-U column are not generated by LIMA.
The fastText models published by Facebook include word embeddings with subword information. The original binary files are ~7 GB per language, which is too large for practical use. In the lima-models repository we distribute compressed (quantized) versions of these files (~600 MB per language). The quantization process slightly affects analysis quality. The following table reports the average effect on metrics, comparing the original embeddings with two compressed versions.
Metric | 1.2 GB embeddings | 0.6 GB embeddings |
---|---|---|
UPOS | -0.01 | -0.03 |
UAS | -0.02 | -0.05 |
LAS | -0.03 | -0.1 |
You can manually replace the word embeddings with the original binary files (i.e. with subword information) from fastText.cc or with the less compressed versions we have published separately. To do this, place the downloaded files in /usr/share/apps/lima/resources/TensorFlowMorphoSyntax/ud/ under the name fasttext-xxx.bin, where xxx is the corresponding ISO 639-3 language code (eng, fra, ...).
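As an example for English, the commands below download the original fastText embeddings (with subword information) from the official fastText.cc distribution and install them under the name LIMA expects; the target path is the one given above, and the URL is the published crawl-vectors location:
$ wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
$ gunzip cc.en.300.bin.gz
$ sudo cp cc.en.300.bin /usr/share/apps/lima/resources/TensorFlowMorphoSyntax/ud/fasttext-eng.bin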