A silver standard parallel corpus containing 1917 English, Spanish and Portuguese clinical case reports (639 for each language) annotated with medical diagnostic codes from the ICD10-CM (International Classification of Diseases, 10th Revision, Clinical Modification) terminology. To create a new corpus follow the instructions below.
Citation:
P. Ruas, A. Lamurias, and F. Couto, “Towards a multilingual corpus for named entity linking evaluation in the clinical domain,” in The First International Workshop on Semantic Indexing and Information Retrieval for Health from heterogeneous content types and languages (SIIRH), 2020.
It is necessary an environment with Python>=3.5 installed.
Then, run the script:
./setup.sh
To download the English, Portuguese and Spanish versions of ICD10-CM:
./get_data.sh
Returns:
- icd10cm_tabular_2020.xml file with the English version
- ICD10CMPCS_2017_PT_Longa e Curta_20180821corrigidav5.2.xlsx file with the Portuguese version
- CIE10_2020_DIAGNOST_REFERENCIA_2019_10_04_1.xlsx file with the Spanish version
2. Retrieve abstracts from SciELO repository
To retrieve the clinical case reports (corresponding to SciELO search filters: * AND subject_area:("Health Sciences") AND type:("case-report") AND la:("es" OR "pt" OR "en")) whose abstract text is simultaneously in English, Portuguese and Spanish:
python retrieve_abstracts.py
Returns:
- scielo_abstracts dir containing the retrieved abstracts (file names with termination en, pt, es refer respectively to English, Portuguese and Spanish texts)
- article_scielo_ids file
- abstract_retrieval_stats file
To recognize the medical diagnostic entities in the abstracts and to link them to the appropriate code in the ICD10-CM terminology using MER (<language> is en, pt or es):
python mer_annotate.py <language>
Returns:
- mer_annotations dir containing the subdirectories en, pt and es, each subdirectory contains the annotations files for the abstracts in the referred language
- mer_annotation_stats_[LANGUAGE] file
Example of the annotation file S0034-70942002000300010-scl_pt.ann:
[Term number] [Begin] [End] [Entity mention] [ICD10-CM code]
T1 626 645 insuficiência renal N17-N19
T2 29 47 doença de Moyamoya I67.5
T3 342 360 doença de Moyamoya I67.5
T4 541 559 doença de Moyamoya I67.5