[PlanTL/medicine/document annotation/NLP preprocessing/part-of-speech] Part-of-Speech Tagger for medical domain corpus in Spanish based on FreeLing.

PlanTL-GOB-ES/SPACCC_POS-TAGGER

SPACCC_POS-TAGGER: Spanish Clinical Case Corpus Part-of-Speech Tagger

Digital Object Identifier (DOI)

https://doi.org/10.5281/zenodo.2621286

Introduction

This repository contains a Part-of-Speech tagger for Spanish medical-domain corpora, based on FreeLing 3.1. It also includes a Python wrapper that makes the tool easier to use.

Demo

A demonstration of the Part-of-Speech Tagger is available at http://temu.bsc.es/pos/

Prerequisites

To use the SPACCC_POS-TAGGER, the following resources are required:

Directory structure

  • compila_freeling.sh: Builds the adapted FreeLing 3.1 Docker image
  • config.cfg: FreeLing configuration file
  • Dockerfile: Dockerfile for image compilation
  • llamada_freeling.sh: Script to execute the analysis of a text with the adapted FreeLing
  • README.md: This file
  • singlewords.dat: File with the normalized resources (words, acronyms, and abbreviations)
  • splitter.dat: Sentence segmentation rules
  • tokenizer.dat: Tokenization rules
  • usermap.dat: Rules for POS assignment (regular expressions)
  • Med_Tagger: Folder containing the Python wrapper for this tool

Usage

To build the adapted FreeLing 3.1 Docker image, run the following command from this directory:

$> bash compila_freeling.sh

The result is the Docker image med-tagger:1.0.0.

Examples

To tag a text, pipe it to the llamada_freeling.sh script:

$> echo 'Esto es un texto de prueba.' | bash llamada_freeling.sh

The output has one line per input token, with four columns: the word form, the lemma, the PoS tag, and the score that the tagger assigns to that tag:

Esto este PD0NS000 1
es ser VSIP3S0 1
un uno DI0MS0 0.987295
texto texto NCMS000 1
de de SPS00 0.999984
prueba prueba NCFS000 0.972603
. . Fp 1
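
The four-column output is easy to consume from Python. The following is a minimal sketch, not the Med_Tagger wrapper shipped in this repository: the names `Token`, `parse_output`, `run_tagger`, and `CATEGORIES` are hypothetical, and the category table is only a summary of the first letter of EAGLES-style tags as used by FreeLing for Spanish. It assumes that columns are separated by whitespace and that every token line has exactly four fields. `run_tagger` further assumes that the Docker image has been built and that llamada_freeling.sh is in the current directory.

```python
from __future__ import annotations

import subprocess
from dataclasses import dataclass

# Assumed mapping from the first character of an EAGLES-style tag
# to a coarse category (illustrative summary, not taken from this repository).
CATEGORIES = {
    "A": "adjective", "C": "conjunction", "D": "determiner",
    "F": "punctuation", "I": "interjection", "N": "noun",
    "P": "pronoun", "R": "adverb", "S": "adposition",
    "V": "verb", "W": "date", "Z": "number",
}


@dataclass(frozen=True)
class Token:
    form: str     # input word form
    lemma: str    # lemma
    tag: str      # PoS tag
    score: float  # score assigned to the tag

    @property
    def category(self) -> str:
        """Coarse category derived from the first character of the tag."""
        return CATEGORIES.get(self.tag[:1], "unknown")


def parse_output(text: str) -> list[Token]:
    """Parse tagger output; lines without exactly four fields are skipped."""
    tokens = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue
        form, lemma, tag, score = fields
        tokens.append(Token(form, lemma, tag, float(score)))
    return tokens


def run_tagger(text: str, command=("bash", "llamada_freeling.sh")) -> str:
    """Pipe text to the tagging script on stdin and return its stdout.

    Assumes the script is in the current directory and the image is built.
    """
    result = subprocess.run(
        list(command), input=text, capture_output=True, text=True, check=True
    )
    return result.stdout


# Sample output copied from the example above.
SAMPLE = """\
Esto este PD0NS000 1
es ser VSIP3S0 1
un uno DI0MS0 0.987295
texto texto NCMS000 1
de de SPS00 0.999984
prueba prueba NCFS000 0.972603
. . Fp 1
"""

if __name__ == "__main__":
    for token in parse_output(SAMPLE):
        print(token.form, token.lemma, token.tag, token.category, token.score)
```

Parsing is kept separate from running the script so that saved output can be analysed without Docker. To tag new text, `parse_output(run_tagger("Esto es un texto de prueba."))` would combine the two steps under the assumptions stated above.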

Performance

Gold standard vs. tagger    Accuracy
Splitting                   98.85%
Tokenization                99.47%
Part-of-Speech              99.87%
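
This README does not describe how the accuracy figures were computed. As an illustration only, the sketch below shows one common definition of tagging accuracy: the fraction of tokens whose predicted tag equals the gold-standard tag, assuming both sequences are already aligned token by token. The function name `tag_accuracy` and the example tags are hypothetical and are not taken from the evaluation behind the table.

```python
from typing import Sequence


def tag_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of positions where the predicted tag matches the gold tag.

    Assumes both sequences are aligned token by token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


if __name__ == "__main__":
    # Hypothetical data: three of the four tags agree.
    gold = ["PD0NS000", "VSIP3S0", "DI0MS0", "NCMS000"]
    predicted = ["PD0NS000", "VSIP3S0", "DI0MS0", "NCFS000"]
    print(f"{tag_accuracy(gold, predicted):.2%}")
```

Splitting and tokenization scores need a different alignment step (sentence or token boundaries rather than tags), which this sketch does not cover.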

Python wrapper

See the Med_Tagger folder in this directory for the Python wrapper.

Contact

Felipe Soares ([email protected])

License

FreeLing is licensed under GPL (https://www.gnu.org/licenses/gpl-3.0.en.html).
