MER is a Named-Entity Recognition tool that identifies terms from any lexicon within input text, providing their exact locations (annotations). It can also link recognized entities to their respective classes when provided with an ontology (OWL file).
A demo is available at: MER Demo
- LEXICONS: A package of ready-to-use lexicons is available here.
- COMMENTS: More comments were added to the scripts to improve readability.
- ONTOLOGIES: New examples added, namely the ontologies: OSCI, CL, ENVO, and ECTO.
- DOCKER: Image available: fjmc/mer-image.
- MULTILINGUAL: English, Spanish, and Portuguese lexicons using DeCS.
- PYTHON: Interface: lasigeBioTM/merpy.
- SIMILARITY: get_similarities.sh finds the most similar term also recognized. See here.
- MER: a Shell Script and Annotation Server for Minimal Named Entity Recognition and Linking. F. Couto and A. Lamurias. Journal of Cheminformatics, 10:58, 2018. DOI: 10.1186/s13321-018-0312-9
- MER: a Minimal Named-Entity Recognition Tagger and Annotation Server. F. Couto, L. Campos, and A. Lamurias. BioCreative V.5 Challenge Evaluation, 2017. ResearchGate
MER was developed and tested using GNU awk (gawk) and grep. Other awk implementations are not guaranteed to work correctly.
To install GNU awk on Ubuntu, use the following command:
sudo apt-get install gawk
Let's walk through an example of adding a sample lexicon to MER.
First, create the lexicon file in the data folder:
α-maltose
nicotinic acid
nicotinic acid D-ribonucleotide
nicotinic acid-adenine dinucleotide phosphate
Assuming that the file is called lexicon.txt, you process it as follows:
(cd data; ../produce_data_files.sh lexicon.txt)
After processing, examples of labels will be shown as output to verify the operation. This step generates all the necessary files to utilize MER with the provided lexicon.
The script receives as input a text and a lexicon:
./get_entities.sh [text] [lexicon]
Let's try to find mentions in a snippet of text:
./get_entities.sh 'α-maltose and nicotinic acid was found, but not nicotinic acid D-ribonucleotide' lexicon
The output will be a TSV looking like this:
0 9 α-maltose
14 28 nicotinic acid
48 62 nicotinic acid
48 79 nicotinic acid D-ribonucleotide
The first column corresponds to the start-index, the second to the end-index and the third to the annotated term:
          1         2         3         4         5         6         7
01234567890123456789012345678901234567890123456789012345678901234567890123456789
α-maltose and nicotinic acid was found, but not nicotinic acid D-ribonucleotide
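To sanity-check the offsets, each annotation can be compared with the substring it points to. The sketch below is not part of MER: `check_offsets` is a hypothetical helper that assumes the tab-separated output shown above, with a 0-based start index and an exclusive end index (end minus start equals the term length). The example uses an ASCII-only text so that it behaves the same in any locale; for text containing characters such as α, awk must count characters rather than bytes, which in practice means GNU awk in a UTF-8 locale.

```shell
# Hypothetical helper (not part of MER): re-extract each annotated span
# from the text and compare it with the term in column 3.
# Usage: check_offsets TEXT < annotations.tsv
# Note: awk -v interprets backslash escapes, so TEXT should not contain backslashes.
check_offsets() {
  awk -F '\t' -v text="$1" '
    {
      # awk strings are 1-based; the offsets are 0-based with an exclusive end
      span = substr(text, $1 + 1, $2 - $1)
      # compare case-insensitively as a precaution
      status = (tolower(span) == tolower($3)) ? "OK" : "MISMATCH"
      printf "%s\t%s\t%s\t%s\n", status, $1, $2, span
    }'
}

text='nicotinic acid was found, but not nicotinic acid D-ribonucleotide'
printf '0\t14\tnicotinic acid\n34\t65\tnicotinic acid D-ribonucleotide\n' |
  check_offsets "$text"
```

Each output line starts with OK when the span and the term agree, and with MISMATCH otherwise.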
To check if the result is what was expected try:
./test.sh
If something is wrong, check that you are using UTF-8 encoding and that GNU awk and grep are installed.
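These prerequisites can be checked quickly from the shell. The following is a minimal sketch, not part of MER: `check_env` is a hypothetical function, and it assumes that the MER scripts invoke the interpreter as `awk`, so it tests what that name resolves to.

```shell
# Hypothetical pre-flight check (not part of MER): report the awk flavour,
# whether grep is available, and the character encoding of the current locale.
check_env() {
  case "$(awk --version 2>/dev/null </dev/null)" in
    *"GNU Awk"*) echo "awk: GNU awk" ;;
    *)           echo "awk: not GNU awk (install gawk)" ;;
  esac
  if command -v grep >/dev/null 2>&1; then
    echo "grep: found"
  else
    echo "grep: missing"
  fi
  # Expect UTF-8 here; "unknown" means the locale command is unavailable
  echo "charmap: $(locale charmap 2>/dev/null || echo unknown)"
}

check_env
```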
You can create a links file named lexicon_links.tsv in the data folder that associates each label (in lower case) with a URI:
α-maltose http://purl.obolibrary.org/obo/CHEBI_18167
nicotinic acid http://purl.obolibrary.org/obo/CHEBI_15940
nicotinic acid d-ribonucleotide http://purl.obolibrary.org/obo/CHEBI_15763
nicotinic acid-adenine dinucleotide phosphate http://purl.obolibrary.org/obo/CHEBI_76072
Then the mentions in a snippet of text will be associated with the respective identifier:
./get_entities.sh 'α-maltose and nicotinic acid was found, but not nicotinic acid D-ribonucleotide' lexicon
The output will be a TSV looking like this:
0 9 α-maltose http://purl.obolibrary.org/obo/CHEBI_18167
14 28 nicotinic acid http://purl.obolibrary.org/obo/CHEBI_15940
48 62 nicotinic acid http://purl.obolibrary.org/obo/CHEBI_15940
48 79 nicotinic acid D-ribonucleotide http://purl.obolibrary.org/obo/CHEBI_15763
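One way to produce such a links file is sketched below. It is only an illustration: the TAB separator is an assumption based on the .tsv extension, and the `|`-delimited here-document is just a convenient input format for this example, not something MER reads.

```shell
# Sketch: write data/lexicon_links.tsv with lower-cased labels and a TAB
# between label and URI (the separator is an assumption based on the .tsv name).
mkdir -p data
while IFS='|' read -r label uri; do
  # LC_ALL=C with A-Z only lower-cases ASCII letters and leaves bytes of
  # characters such as α untouched; other capitals need separate handling.
  lower=$(printf '%s' "$label" | LC_ALL=C tr 'A-Z' 'a-z')
  printf '%s\t%s\n' "$lower" "$uri"
done > data/lexicon_links.tsv <<'EOF'
α-maltose|http://purl.obolibrary.org/obo/CHEBI_18167
nicotinic acid|http://purl.obolibrary.org/obo/CHEBI_15940
nicotinic acid D-ribonucleotide|http://purl.obolibrary.org/obo/CHEBI_15763
nicotinic acid-adenine dinucleotide phosphate|http://purl.obolibrary.org/obo/CHEBI_76072
EOF
```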
Download an abstract from PubMed, for example 31319702:
text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=31319702&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)
To retrieve several abstracts at once, list more PubMed IDs separated by commas.
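As a small sketch of the comma-separated form, the hypothetical helper `efetch_url` below (not part of MER) joins any number of PubMed IDs into the same efetch URL used above. The second ID is simply another abstract used elsewhere in this document.

```shell
# Hypothetical helper: build the efetch URL for one or more PubMed IDs.
efetch_url() {
  ids=$(printf '%s,' "$@")   # "id1,id2,...,"
  ids=${ids%,}               # drop the trailing comma
  printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=%s&retmode=text&rettype=xml\n' "$ids"
}

efetch_url 31319702 31351426

# With network access, the URL can be used exactly like the single-ID one:
# text=$(curl -s "$(efetch_url 31319702 31351426)" | xmllint --xpath '//AbstractText/text()' /dev/stdin)
```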
Recognize the entities in the abstract:
./get_entities.sh "$text" lexicon
The output should be something like this:
1578 1592 nicotinic acid http://purl.obolibrary.org/obo/CHEBI_15940
1731 1745 nicotinic acid http://purl.obolibrary.org/obo/CHEBI_15940
Download the ontology:
(cd data; curl -L -O http://purl.obolibrary.org/obo/go.owl)
Process it:
(cd data; ../produce_data_files.sh go.owl)
Now, download an abstract from PubMed, for example 31351426:
text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=31351426&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)
Recognize the entities in the abstract:
./get_entities.sh "$text" go
The output should be something like this:
185 198 fertilization http://purl.obolibrary.org/obo/GO_0009566
284 306 PPAR signaling pathway http://purl.obolibrary.org/obo/GO_0035357
289 298 signaling http://purl.obolibrary.org/obo/GO_0023052
289 306 signaling pathway http://purl.obolibrary.org/obo/GO_0007165
1157 1179 PPAR signaling pathway http://purl.obolibrary.org/obo/GO_0035357
1162 1171 signaling http://purl.obolibrary.org/obo/GO_0023052
1162 1179 signaling pathway http://purl.obolibrary.org/obo/GO_0007165
1280 1302 PPAR signaling pathway http://purl.obolibrary.org/obo/GO_0035357
1285 1294 signaling http://purl.obolibrary.org/obo/GO_0023052
1285 1302 signaling pathway http://purl.obolibrary.org/obo/GO_0007165
1303 1318 gene expression http://purl.obolibrary.org/obo/GO_0010467
1547 1569 PPAR signaling pathway http://purl.obolibrary.org/obo/GO_0035357
1552 1561 signaling http://purl.obolibrary.org/obo/GO_0023052
1552 1569 signaling pathway http://purl.obolibrary.org/obo/GO_0007165
1641 1659 glucose metabolism http://purl.obolibrary.org/obo/GO_0006006
1649 1659 metabolism http://purl.obolibrary.org/obo/GO_0008152
1661 1682 inflammatory response http://purl.obolibrary.org/obo/GO_0006954
1862 1884 PPAR signaling pathway http://purl.obolibrary.org/obo/GO_0035357
1867 1876 signaling http://purl.obolibrary.org/obo/GO_0023052
1867 1884 signaling pathway http://purl.obolibrary.org/obo/GO_0007165
1989 2001 pathogenesis http://purl.obolibrary.org/obo/GO_0001897
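The output above contains nested matches, for example signaling and signaling pathway inside PPAR signaling pathway. If only the longest match at each position is wanted, the output can be post-processed. The sketch below is an optional filter that is not part of MER: `longest_only` is a hypothetical function that assumes tab-separated input with start and end offsets in the first two columns, and it drops every annotation that lies inside a strictly longer one.

```shell
# Hypothetical post-processing filter (not part of MER):
# keep an annotation only if no strictly longer annotation contains it.
longest_only() {
  awk -F '\t' '
    { s[NR] = $1 + 0; e[NR] = $2 + 0; line[NR] = $0 }
    END {
      for (i = 1; i <= NR; i++) {
        keep = 1
        for (j = 1; j <= NR; j++)
          if (j != i && s[j] <= s[i] && e[j] >= e[i] &&
              (e[j] - s[j]) > (e[i] - s[i])) { keep = 0; break }
        if (keep) print line[i]
      }
    }'
}

printf '%s\t%s\t%s\t%s\n' \
  284 306 'PPAR signaling pathway' http://purl.obolibrary.org/obo/GO_0035357 \
  289 298 'signaling'              http://purl.obolibrary.org/obo/GO_0023052 \
  289 306 'signaling pathway'      http://purl.obolibrary.org/obo/GO_0007165 \
  1303 1318 'gene expression'      http://purl.obolibrary.org/obo/GO_0010467 |
  longest_only
```

With this sample input, only the PPAR signaling pathway and gene expression rows remain. The comparison is quadratic in the number of annotations, which should be acceptable for abstract-sized texts.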
Download the ontology:
(cd data; curl -L -O http://purl.obolibrary.org/obo/chebi/chebi_lite.owl)
Process it:
(cd data; ../produce_data_files.sh chebi_lite.owl)
Download an abstract from PubMed, for example 31319702:
text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=31319702&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)
Recognize the entities in the abstract:
./get_entities.sh "$text" chebi_lite
The output should be something like this:
0 8 Electron http://purl.obolibrary.org/obo/CHEBI_10545
160 165 ester http://purl.obolibrary.org/obo/CHEBI_35701
213 218 ester http://purl.obolibrary.org/obo/CHEBI_35701
285 290 ester http://purl.obolibrary.org/obo/CHEBI_35701
342 347 ester http://purl.obolibrary.org/obo/CHEBI_35701
397 402 ester http://purl.obolibrary.org/obo/CHEBI_35701
475 480 ester http://purl.obolibrary.org/obo/CHEBI_35701
1051 1055 atom http://purl.obolibrary.org/obo/CHEBI_33250
1065 1080 isopropyl ester http://purl.obolibrary.org/obo/CHEBI_35725
1075 1080 ester http://purl.obolibrary.org/obo/CHEBI_35701
1128 1132 acid http://purl.obolibrary.org/obo/CHEBI_37527
1145 1152 propene http://purl.obolibrary.org/obo/CHEBI_16052
1206 1211 ester http://purl.obolibrary.org/obo/CHEBI_35701
1261 1265 acid http://purl.obolibrary.org/obo/CHEBI_37527
1289 1296 radical http://purl.obolibrary.org/obo/CHEBI_26519
1348 1354 methyl http://purl.obolibrary.org/obo/CHEBI_29309
1544 1550 methyl http://purl.obolibrary.org/obo/CHEBI_29309
1621 1627 proton http://purl.obolibrary.org/obo/CHEBI_24636
1707 1719 benzoic acid http://purl.obolibrary.org/obo/CHEBI_30746
1715 1719 acid http://purl.obolibrary.org/obo/CHEBI_37527
1789 1803 nicotinic acid http://purl.obolibrary.org/obo/CHEBI_15940
1799 1803 acid http://purl.obolibrary.org/obo/CHEBI_37527
1844 1858 carbonyl group http://purl.obolibrary.org/obo/CHEBI_23019
1853 1858 group http://purl.obolibrary.org/obo/CHEBI_24433
1929 1941 benzoic acid http://purl.obolibrary.org/obo/CHEBI_30746
1937 1941 acid http://purl.obolibrary.org/obo/CHEBI_37527
1984 1998 nicotinic acid http://purl.obolibrary.org/obo/CHEBI_15940
1994 1998 acid http://purl.obolibrary.org/obo/CHEBI_37527
2094 2097 ion http://purl.obolibrary.org/obo/CHEBI_24870
2190 2193 ion http://purl.obolibrary.org/obo/CHEBI_24870
Download the ontology:
(cd data; curl -L -O http://purl.obolibrary.org/obo/hp.owl)
Process it:
(cd data; ../produce_data_files.sh hp.owl)
Download an abstract from PubMed, for example 29490421:
text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=29490421&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)
Recognize the entities in the abstract:
./get_entities.sh "$text" hp
The output should be something like this:
50 53 dry http://purl.obolibrary.org/obo/PATO_0001801
348 354 asthma http://purl.obolibrary.org/obo/HP_0002099
359 363 COPD http://purl.obolibrary.org/obo/HP_0006510
496 500 COPD http://purl.obolibrary.org/obo/HP_0006510
504 510 asthma http://purl.obolibrary.org/obo/HP_0002099
Download the ontology:
(cd data; curl -L -O http://purl.obolibrary.org/obo/doid.owl)
Process it:
(cd data; ../produce_data_files.sh doid.owl)
Download an abstract from PubMed, for example 29490421:
text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=29490421&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)
Recognize the entities in the abstract:
./get_entities.sh "$text" doid
The output should be something like this:
348 354 asthma http://purl.obolibrary.org/obo/DOID_2841
359 363 COPD http://purl.obolibrary.org/obo/DOID_3083
496 500 COPD http://purl.obolibrary.org/obo/DOID_3083
504 510 asthma http://purl.obolibrary.org/obo/DOID_2841
Download the ontology:
(cd data; curl -L -O https://raw.githubusercontent.com/stemcellontologyresource/OSCI/master/src/ontology/osci.owl)
Process it:
(cd data; ../produce_data_files.sh osci.owl)
Download an abstract from PubMed, for example 30053745:
text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=30053745&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)
Recognize the entities in the abstract:
./get_entities.sh "$text" osci
The output should be something like this:
325 329 cell http://purl.obolibrary.org/obo/CL_0000000
447 452 human http://purl.obolibrary.org/obo/NCBITaxon_9606
475 480 human http://purl.obolibrary.org/obo/NCBITaxon_9606
531 536 human http://purl.obolibrary.org/obo/NCBITaxon_9606
545 556 pluripotent http://purl.obolibrary.org/obo/PATO_0001403
601 610 Stem cell http://purl.obolibrary.org/obo/CL_0000034
606 610 cell http://purl.obolibrary.org/obo/CL_0000000
691 702 pluripotent http://purl.obolibrary.org/obo/PATO_0001403
743 748 human http://purl.obolibrary.org/obo/NCBITaxon_9606
749 755 neuron http://purl.obolibrary.org/obo/CL_0000540
798 804 neuron http://purl.obolibrary.org/obo/CL_0000540
913 929 Neural stem cell http://purl.obolibrary.org/obo/CL_0000047
920 929 stem cell http://purl.obolibrary.org/obo/CL_0000034
925 929 cell http://purl.obolibrary.org/obo/CL_0000000
975 984 stem cell http://purl.obolibrary.org/obo/CL_0000034
980 984 cell http://purl.obolibrary.org/obo/CL_0000000
Download the ontology:
(cd data; curl -L -O http://purl.obolibrary.org/obo/cl.owl)
Process it:
(cd data; ../produce_data_files.sh cl.owl)
Download an abstract from PubMed, for example 30053745:
text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=30053745&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)
Recognize the entities in the abstract:
./get_entities.sh "$text" cl
The output should be something like this:
255 268 dentate gyrus http://purl.obolibrary.org/obo/UBERON_0001885
263 268 gyrus http://purl.obolibrary.org/obo/UBERON_0000200
276 287 hippocampus http://purl.obolibrary.org/obo/UBERON_0001954
315 329 ependymal cell http://purl.obolibrary.org/obo/CL_0000065
325 329 cell http://purl.obolibrary.org/obo/CL_0000000
447 452 human http://purl.obolibrary.org/obo/NCBITaxon_9606
475 480 human http://purl.obolibrary.org/obo/NCBITaxon_9606
531 536 human http://purl.obolibrary.org/obo/NCBITaxon_9606
601 610 Stem cell http://purl.obolibrary.org/obo/CL_0000034
606 610 cell http://purl.obolibrary.org/obo/CL_0000000
743 748 human http://purl.obolibrary.org/obo/NCBITaxon_9606
749 755 neuron http://purl.obolibrary.org/obo/CL_0000540
798 804 neuron http://purl.obolibrary.org/obo/CL_0000540
913 929 Neural stem cell http://purl.obolibrary.org/obo/CL_0000047
920 929 stem cell http://purl.obolibrary.org/obo/CL_0000034
925 929 cell http://purl.obolibrary.org/obo/CL_0000000
975 984 stem cell http://purl.obolibrary.org/obo/CL_0000034
980 984 cell http://purl.obolibrary.org/obo/CL_0000000
1041 1046 great http://purl.obolibrary.org/obo/PATO_0000586
Download the ontology:
(cd data; curl -L -O http://purl.obolibrary.org/obo/ecto.owl)
Process it:
(cd data; ../produce_data_files.sh ecto.owl)
Download an abstract from PubMed, for example 34303912:
text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=34303912&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)
Recognize the entities in the abstract:
./get_entities.sh "$text" ecto
The output should be something like this:
0 5 Water http://purl.obolibrary.org/obo/CHEBI_15377
30 35 human http://purl.obolibrary.org/obo/NCBITaxon_9606
139 150 environment http://purl.obolibrary.org/obo/ENVO_01000254
177 182 water http://purl.obolibrary.org/obo/CHEBI_15377
200 206 planet http://purl.obolibrary.org/obo/ENVO_01000800
237 242 water http://purl.obolibrary.org/obo/CHEBI_15377
237 252 water pollution http://purl.obolibrary.org/obo/ENVO_02500039
243 252 pollution http://purl.obolibrary.org/obo/ENVO_02500036
339 350 environment http://purl.obolibrary.org/obo/ENVO_01000254
397 404 process http://purl.obolibrary.org/obo/BFO_0000015
449 462 concentration http://purl.obolibrary.org/obo/PATO_0000033
542 553 agriculture http://purl.obolibrary.org/obo/ENVO_01001246
585 591 energy http://purl.obolibrary.org/obo/ENVO_2000015
661 665 role http://purl.obolibrary.org/obo/BFO_0000023
764 771 quality http://purl.obolibrary.org/obo/BFO_0000019
811 821 technology http://purl.obolibrary.org/obo/NCIT_C17187
1101 1110 behaviour http://purl.obolibrary.org/obo/GO_0007610
1167 1177 technology http://purl.obolibrary.org/obo/NCIT_C17187
1192 1198 energy http://purl.obolibrary.org/obo/ENVO_2000015
1412 1422 technology http://purl.obolibrary.org/obo/NCIT_C17187
Download the ontology:
(cd data; curl -L -O http://purl.obolibrary.org/obo/envo.owl)
Process it:
(cd data; ../produce_data_files.sh envo.owl)
Download an abstract from PubMed, for example 34303912:
text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=34303912&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)
Recognize the entities in the abstract:
./get_entities.sh "$text" envo
The output should be something like this:
0 5 Water http://purl.obolibrary.org/obo/CHEBI_15377
30 35 human http://purl.obolibrary.org/obo/NCBITaxon_9606
139 150 environment http://purl.obolibrary.org/obo/ENVO_01000254
177 182 water http://purl.obolibrary.org/obo/CHEBI_15377
200 206 planet http://purl.obolibrary.org/obo/ENVO_01000800
237 242 water http://purl.obolibrary.org/obo/CHEBI_15377
237 252 water pollution http://purl.obolibrary.org/obo/ENVO_02500039
243 252 pollution http://purl.obolibrary.org/obo/ENVO_02500036
339 350 environment http://purl.obolibrary.org/obo/ENVO_01000254
397 404 process http://purl.obolibrary.org/obo/BFO_0000015
449 462 concentration http://purl.obolibrary.org/obo/PATO_0000033
542 553 agriculture http://purl.obolibrary.org/obo/ENVO_01001246
585 591 energy http://purl.obolibrary.org/obo/ENVO_2000015
661 665 role http://purl.obolibrary.org/obo/BFO_0000023
764 771 quality http://purl.obolibrary.org/obo/BFO_0000019
1192 1198 energy http://purl.obolibrary.org/obo/ENVO_2000015
Request the XML files of DeCS in Portuguese, Spanish and English from https://decs.bvsalud.org/
Process them:
(cd data; ../produce_data_files.sh bireme_decs_eng2020.xml)
(cd data; ../produce_data_files.sh bireme_decs_spa2020.xml)
(cd data; ../produce_data_files.sh bireme_decs_por2020.xml)
Download a multilingual corpus, e.g. from https://sites.google.com/view/felipe-soares/datasets
The text below is from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4458994
text_eng='This situation also contributes to respiratory aspiration and aspiration pneumonia, which is evidenced by a persistent cough with sputum or by other signs such as fever, tachypnea, or focal consolidation confirmed by radiographic imaging'
text_spa='Esa situación también contribuye para la aspiración respiratoria y para la neumonía espirativa, que puede ser evidenciada por tos persistente con expectoración o por otras señales como fiebre, taquipnea, consolidación focal, siendo confirmada por la imagen radiográfica'
text_por='Essa situação também contribui para a aspiração respiratória e para a pneumonia aspirativa, que pode ser evidenciada por tosse persistente com expectoração ou por outros sinais como: febre, taquipneia, consolidação focal, sendo confirmada pela imagem radiográfica'
Recognize the entities in each text:
./get_entities.sh "$text_eng" bireme_decs_eng2020
./get_entities.sh "$text_spa" bireme_decs_spa2020
./get_entities.sh "$text_por" bireme_decs_por2020
The output should be something like this:
35 57 respiratory aspiration https://decs.bvsalud.org/ths/?filter=ths_regid&q=D053120
73 82 pneumonia https://decs.bvsalud.org/ths/?filter=ths_regid&q=D011014
119 124 cough https://decs.bvsalud.org/ths/?filter=ths_regid&q=D003371
130 136 sputum https://decs.bvsalud.org/ths/?filter=ths_regid&q=D013183
163 168 fever https://decs.bvsalud.org/ths/?filter=ths_regid&q=D005334
170 179 tachypnea https://decs.bvsalud.org/ths/?filter=ths_regid&q=D059246
41 64 aspiración respiratoria https://decs.bvsalud.org/ths/?filter=ths_regid&q=D053120
75 83 neumonía https://decs.bvsalud.org/ths/?filter=ths_regid&q=D011014
126 129 tos https://decs.bvsalud.org/ths/?filter=ths_regid&q=D003371
185 191 fiebre https://decs.bvsalud.org/ths/?filter=ths_regid&q=D005334
193 202 taquipnea https://decs.bvsalud.org/ths/?filter=ths_regid&q=D059246
38 60 aspiração respiratória https://decs.bvsalud.org/ths/?filter=ths_regid&q=D053120
70 79 pneumonia https://decs.bvsalud.org/ths/?filter=ths_regid&q=D011014
70 90 pneumonia aspirativa https://decs.bvsalud.org/ths/?filter=ths_regid&q=D011015
121 126 tosse https://decs.bvsalud.org/ths/?filter=ths_regid&q=D003371
183 188 febre https://decs.bvsalud.org/ths/?filter=ths_regid&q=D005334
190 200 taquipneia https://decs.bvsalud.org/ths/?filter=ths_regid&q=D059246
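Because the three lexicons link to the same DeCS identifiers, matches can be compared across languages by their ID. The sketch below is not part of MER: `decs_ids` is a hypothetical helper that assumes tab-separated output with the link in the fourth column and takes the identifier from the `q=` parameter of the URL.

```shell
# Hypothetical helper: print "ID<TAB>term" for each annotation,
# where ID is whatever follows "q=" in the link (column 4).
decs_ids() {
  awk -F '\t' '{ id = $4; sub(/.*q=/, "", id); print id "\t" $3 }'
}

# Sample rows copied from the English, Spanish and Portuguese outputs above
printf '%s\t%s\t%s\t%s\n' \
  163 168 fever  'https://decs.bvsalud.org/ths/?filter=ths_regid&q=D005334' \
  185 191 fiebre 'https://decs.bvsalud.org/ths/?filter=ths_regid&q=D005334' \
  183 188 febre  'https://decs.bvsalud.org/ths/?filter=ths_regid&q=D005334' |
  decs_ids
```

All three rows map to D005334, so sorting the combined output of the three runs by the first column groups the translations of each concept together.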
Download the ontology:
(cd data; curl -L -O http://www.w3.org/2006/03/wn/wn20/rdf/wordnet-hyponym.rdf)
Process it:
(cd data; ../produce_data_files.sh wordnet-hyponym.rdf)
Download an abstract from PubMed, for example 29490421:
text=$(curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=29490421&retmode=text&rettype=xml" | xmllint --xpath '//AbstractText/text()' /dev/stdin)
Recognize the entities in the abstract:
./get_entities.sh "$text" wordnet-hyponym
The output should be something like this:
4 11 article
50 53 dry
54 60 powder
91 95 data
105 115 literature
192 196 well
288 291 may
292 301 influence
306 314 efficacy
319 325 safety
329 336 inhaler
337 344 therapy
348 354 asthma
381 384 use
396 404 addition
423 432 potential
448 456 efficacy
460 467 inhaler
468 475 therapy
485 491 doctor
504 510 asthma
511 518 patient
519 530 perspective
556 562 choice
563 572 algorithm
587 594 patient
cd data
curl -L -O https://labs.rd.ciencias.ulisboa.pt/mer/data/lexicons202407.tgz
tar -xzf lexicons202407.tgz
cd ..
cd data
curl -L -O https://labs.rd.ciencias.ulisboa.pt/mer/data/becalm2017.tgz
tar -xzf becalm2017.tgz
tar -tzf becalm2017.tgz | xargs -l ../produce_data_files.sh
cd ..
./get_entities.sh 'heart' tissue_and_organ
./get_entities.sh 'histoglobin' protein
./get_entities.sh 'ame-miR-2b' mirna
First, install DiShIn (https://github.com/lasigeBioTM/DiShIn) or download a minimalist version:
curl -L -O https://labs.rd.ciencias.ulisboa.pt/dishin/dishin.py
curl -L -O https://labs.rd.ciencias.ulisboa.pt/dishin/ssm.py
curl -L -O https://labs.rd.ciencias.ulisboa.pt/dishin/annotations.py
Before executing the get_similarity script you need to select the following parameters:
- Measure: Resnik, Lin or JC
- Type: MICA or DiShIn
- Path: DiShIn installation folder
- Database: DiShIn db file with the ontology, e.g. chebi.db, go.db, hp.db, doid.db, radlex.db, or wordnet.db
For example, download the database for ChEBI:
curl -L -O https://labs.rd.ciencias.ulisboa.pt/dishin/chebi202407.db.gz
gunzip -N chebi202407.db.gz
Then run the get_similarity script on the output of the get_entities script:
./get_entities.sh "α-maltose and nicotinic acid was found, but not nicotinic acid D-ribonucleotide" lexicon | ./get_similarity.sh Lin DiShIn . chebi.db
The output now includes for each match the most similar term and its similarity:
0 9 α-maltose http://purl.obolibrary.org/obo/CHEBI_18167 CHEBI_15763 0.0264373654324
14 28 nicotinic acid http://purl.obolibrary.org/obo/CHEBI_15940 CHEBI_15763 0.0796995701424
48 62 nicotinic acid http://purl.obolibrary.org/obo/CHEBI_15940 CHEBI_15763 0.0796995701424
48 79 nicotinic acid D-ribonucleotide http://purl.obolibrary.org/obo/CHEBI_15763 CHEBI_15940 0.0796995701424
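The extra columns make it easy to filter matches by similarity. The sketch below is not part of MER: `similar_above` is a hypothetical helper that assumes six tab-separated columns, as in the output above, with the similarity value in the last one, and keeps only the rows at or above a chosen threshold.

```shell
# Hypothetical filter: keep rows whose similarity (column 6) is >= the threshold.
# Usage: similar_above MIN < similarity_output.tsv
similar_above() {
  awk -F '\t' -v min="$1" '($6 + 0) >= (min + 0)'
}

printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  0 9 'alpha-maltose' http://purl.obolibrary.org/obo/CHEBI_18167 CHEBI_15763 0.0264373654324 \
  14 28 'nicotinic acid' http://purl.obolibrary.org/obo/CHEBI_15940 CHEBI_15763 0.0796995701424 |
  similar_above 0.05
```

With a threshold of 0.05, only the nicotinic acid row is printed. The sample writes alpha-maltose in ASCII purely to keep the example locale-independent.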
A multilingual example:
curl -L -O https://labs.rd.ciencias.ulisboa.pt/dishin/mesh202407.db.gz
gunzip -N mesh202407.db.gz
curl -L -O https://labs.rd.ciencias.ulisboa.pt/mer/data/lexicons202407.tgz
(cd data; tar -xzf ../lexicons202407.tgz --wildcards 'bireme_decs_por2024*')
./get_entities.sh "febre, tontura, pneumonia e tosse" bireme_decs_por2024 | ./get_similarity.sh Lin DiShIn . mesh.db
The output:
0 5 febre https://decs.bvsalud.org/ths/?filter=ths_regid&q=D005334 D004244 0.29193507456
7 14 tontura https://decs.bvsalud.org/ths/?filter=ths_regid&q=D004244 D005334 0.29193507456
16 25 pneumonia https://decs.bvsalud.org/ths/?filter=ths_regid&q=D011014 D003371 0.431131076105
28 33 tosse https://decs.bvsalud.org/ths/?filter=ths_regid&q=D003371 D011014 0.431131076105
As expected, fever (febre) is closer to dizziness (tontura), and pneumonia is closer to cough (tosse).