Website - http://nlp-tools.uom.lk/thamizhi-pos/
ThamizhiPOSt is a deep-learning-based POS tagger developed with the Stanza framework and trained on 11K POS-tagged sentences together with Facebook's fastText model. ThamizhiPOSt uses the Universal Dependencies POS (UPOS) tagset for annotation.
ThamizhiPOSt achieves an accuracy of 95.20 (as of 02.09.2020) on the TTB test set (https://github.com/UniversalDependencies/UD_Tamil-TTB/blob/master/ta_ttb-ud-test.conllu). This is the state of the art among the Tamil POS taggers implemented or reported as of that date.
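As an illustration of how a UPOS accuracy figure of this kind can be computed, here is a minimal sketch. It is not the evaluation script used for the reported number. It assumes the gold file is in CoNLL-U format (UPOS in the fourth tab-separated column) and that the predicted tags follow exactly the same tokenisation as the gold file. The function names `read_upos` and `upos_accuracy` are hypothetical.

```python
def read_upos(conllu_text):
    """Collect the UPOS tag of every syntactic word in a CoNLL-U string."""
    tags = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        tags.append(cols[3])  # column 4 holds the UPOS tag
    return tags


def upos_accuracy(gold_tags, predicted_tags):
    """Return the percentage of positions where both tag sequences agree."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_tags:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)
```

If the tagger tokenises the text differently from the gold file, the two sequences would need to be aligned first, which this sketch does not attempt.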
We trained this POS tagger on the AMRITA POS-tagged data. Before training, we harmonised the BIS, AMRITA and UPOS tagsets, which are the primary POS tagsets available as of today. The harmonisation of Universal Dependency POS (UPOS), BIS, and AMRITA can be found in this sheet.
We found the Amrita POS-tagged data to be cleaner than the alternatives, and therefore used it to train the POS tagger. For training we used Stanza, a neural framework developed by Stanford University as a successor to their CoreNLP framework.
The trained models can be found here in a compressed format. The file is in tgz format; you can extract it using tar.
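If you prefer to unpack the archive from Python rather than with the tar command, the standard-library `tarfile` module can do the same job. This is a minimal sketch: the function name `extract_models` and the default target folder are assumptions, and the archive path is whatever name the downloaded file has on your machine.

```python
import tarfile
from pathlib import Path


def extract_models(archive_path, target_dir="models"):
    """Extract a gzip-compressed tar archive and list what was unpacked."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens the archive for reading with gzip decompression
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    return sorted(p.name for p in target.iterdir())
```

Only extract archives you trust, since a tar file can contain arbitrary paths.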
You need Python 3 (Stanza itself requires Python 3.6 or later). In addition, install the following tools and libraries (these commands are for Debian-based distributions; you can find the equivalents for other Linux distributions and Windows on the web):
pip3 install [stanza](https://stanfordnlp.github.io/stanza/installation_usage.html)
[Download this compressed file](http://nlp-tools.uom.lk/thamizhi-pos/thamizhi-pos.zip) and uncompress it. You should see a script, thamizhi-post.py, and a folder named models.
Run the following command:
python3 thamizhi-post.py "input-file"
where "input-file" is the text file you want to POS tag (the file must not contain any empty lines). This will generate a file called pos-tagged.txt.
Note: To use this version of the tagger, each line/sentence must end with a punctuation symbol (a period, exclamation mark or question mark). Otherwise, the very last token will be tagged as punctuation.
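The two input constraints (no empty lines, and a closing punctuation symbol on every line) can be enforced with a small preprocessing step before running the tagger. The following is a minimal sketch, not part of the distributed tool. The function names are hypothetical, and it assumes that appending a space-separated period is an acceptable way to close a sentence that lacks final punctuation.

```python
from pathlib import Path

# Punctuation symbols the tagger accepts at the end of a line
SENTENCE_END = (".", "!", "?")


def prepare_lines(lines):
    """Drop empty lines and make sure every sentence ends with punctuation."""
    prepared = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # the input file must not contain empty lines
        if not text.endswith(SENTENCE_END):
            text += " ."  # assumption: close the sentence with a period
        prepared.append(text)
    return prepared


def prepare_file(src, dst):
    """Read src, clean it with prepare_lines, and write the result to dst."""
    lines = Path(src).read_text(encoding="utf-8").splitlines()
    cleaned = prepare_lines(lines)
    Path(dst).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
```

The cleaned file written by `prepare_file` can then be passed to thamizhi-post.py as the input file.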
The output for the input "தமிழ் எங்கள் உயிருக்கு நேர் ." looks like the following:
1 தமிழ் PROPN
2 எங்கள் PRON
3 உயிருக்கு NOUN
4 நேர் NOUN
5 . PUNCT
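To work with the tagged output programmatically, the lines can be read back into (token, tag) pairs. This sketch assumes the layout shown in the sample above: three whitespace-separated columns (index, token, tag), tokens without internal whitespace, and an index that restarts at 1 for each sentence. The function name `parse_tagged` is hypothetical, and the exact layout of pos-tagged.txt should be checked against a real run.

```python
def parse_tagged(lines):
    """Group 'index token tag' lines into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines or anything not in the assumed layout
        index, token, tag = parts
        if index == "1" and current:
            # an index of 1 marks the start of a new sentence
            sentences.append(current)
            current = []
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences
```

For a real file, pass the result of `open("pos-tagged.txt", encoding="utf-8")` as `lines`.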
The following datasets, tagged using ThamizhiPOSt, are available for research:
- Official data (consists of Annual reports, Audit reports, Letters - anonymised) - 8,932 tokens/1,100 sentences
- Sri Lankan Tamil news data - 124,203 tokens / 10,000 sentences
Please cite the following if you use the Thamizhi-POS tool, models or tagged data:
@misc{sarveswaran2020thamizhiudp, title={ThamizhiUDp: A Dependency Parser for Tamil}, author={Sarveswaran, Kengatharaiyer and Dias, Gihan}, year={2020}, eprint={2012.13436}, archivePrefix={arXiv}, primaryClass={cs.CL} }
This research was supported by the Accelerating Higher Education Expansion and Development (AHEAD) Operation of the Ministry of Higher Education, Sri Lanka, funded by the World Bank.