Personalized PageRank using Semantic Similarity Measures
This is the code used to run our experiments for the paper "PPR-SSM: Personalized PageRank and Semantic Similarity Measures for Entity Linking".
The code has three steps:

- Generating the candidates file
- Running the PPR algorithm
- Analyzing the results
The code for each gold standard is organized in its own directory (hpo_src, chebi_src, and go_src). The main scripts of each gold standard are the ones starting with "parse"; the others contain helper functions to generate and process data.
You can build a Docker image using the Dockerfile provided in this repository, or pull it from Docker Hub:

```
docker pull andrelamurias/pprssm
```
We used the following corpora:
- HPO GSC+ (https://github.com/lasigeBioTM/IHP/raw/master/GSC%2B.rar)
- ChEBI patents corpus (provided with this repo)
- CRAFT (https://github.com/UCDenver-ccp/CRAFT/releases/tag/3.0 - put the brat files inside CRAFT/GO_BP and CRAFT/GO_CC)
And the following ontologies:
- HPO
- ChEBI
- Gene Ontology
For each ontology, an OBO file and a .db file processed by DiShIn are necessary. These can be obtained with the get_data.sh script.
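As a reference for what the .db file provides, here is a minimal sketch of computing a similarity score with DiShIn's ssm module directly. The usage follows DiShIn's documentation, but verify the function names against your installed version; the ChEBI identifiers are illustrative, and the resnik_dishin option passed to the PPR step below presumably refers to this measure.

```python
# Minimal sketch: querying a DiShIn semantic base (.db) for term similarity.
# Assumes DiShIn's ssm module is importable; the identifier format depends
# on how get_data.sh built the semantic base.
import ssm

ssm.semantic_base("chebi.db")   # load the pre-processed semantic base
ssm.intrinsic = True            # use intrinsic information content

e1 = ssm.get_id("CHEBI_15377")  # illustrative ChEBI terms (water, ethanol)
e2 = ssm.get_id("CHEBI_16236")
print(ssm.ssm_resnik(e1, e2))   # Resnik semantic similarity
```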
First, run dishin_app.py with Flask:

```
export FLASK_APP=dishin_app.py
export DISHIN_DB=chebi.db
flask run &
```
Then run the "parse" script of the corpus to generate the candidates file. It takes the following arguments:

- min distance
- min similarity
- corpus dir (or ontology name for Gene Ontology entities in the CRAFT corpus: "GO_BP" for GO Biological Process entities, "GO_CC" for GO Cellular Component entities)
Example:

```
python chebi_src/parse_chebi_corpus.py 1 0.5 ChebiPatents/
```
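To make the role of the two thresholds concrete, here is a hypothetical sketch of how they could gate candidate generation. The helper names are ours, and the exact semantics of min distance and min similarity are defined in the parse scripts, so treat this only as an illustration: it assumes min distance bounds a [0, 1] string-match score (1.0 keeping only exact matches) and min similarity bounds the semantic similarity between candidates.

```python
# Hypothetical sketch: filtering candidate ontology terms for one mention.
# Check the "parse" scripts for the actual semantics of both thresholds.
from difflib import SequenceMatcher

def match_score(mention, label):
    # string-match score in [0, 1]; 1.0 means an exact (case-insensitive) match
    return SequenceMatcher(None, mention.lower(), label.lower()).ratio()

def candidates(mention, labels, min_distance, min_similarity, sim):
    # keep labels close enough to the mention string
    close = [l for l in labels if match_score(mention, l) >= min_distance]
    # among those, prefer candidates semantically coherent with another candidate
    coherent = [a for a in close
                if any(sim(a, b) >= min_similarity for b in close if b != a)]
    return coherent or close
```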
Run the PPRforNED script:

```
javac ppr_for_ned_chebi.java
java ppr_for_ned_chebi resnik_dishin
```
For GO entities in the CRAFT corpus, select the desired subontology in the ppr_for_ned_go.java script.
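For readers unfamiliar with the underlying algorithm, below is a minimal sketch of personalized PageRank by power iteration. It is illustrative only, not the PPRforNED implementation; in PPR-SSM the edge weights between candidate entities would come from the semantic similarity measure.

```python
import numpy as np

def personalized_pagerank(W, restart, alpha=0.85, tol=1e-9, max_iter=100):
    """Power iteration for personalized PageRank.

    W[i, j] is the nonnegative weight of the edge i -> j between candidates;
    restart is the restart distribution (e.g. uniform over one mention's
    candidate entities) and sums to 1.
    """
    n = W.shape[0]
    P = np.zeros_like(W, dtype=float)
    for i in range(n):
        s = W[i].sum()
        P[i] = W[i] / s if s > 0 else restart  # dangling nodes jump to restart
    r = np.array(restart, dtype=float)
    for _ in range(max_iter):
        r_new = alpha * (r @ P) + (1 - alpha) * restart
        converged = np.abs(r_new - r).sum() < tol
        r = r_new
        if converged:
            break
    return r  # rank each mention's candidates by their score in r

# Toy graph: three candidates, edges weighted by some similarity measure
W = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.3],
              [0.1, 0.3, 0.0]])
print(personalized_pagerank(W, np.full(3, 1 / 3)))
```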
Process the results to obtain more detailed metrics than those reported by PPRforNED:

```
python src/process_results.py chebi
```
Example output:

```
one candidate 431
correct 909
wrong 105
total 1014
accuracy: 0.8964497041420119
accuracy (multiple candidates): 0.8198970840480274
```
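The two accuracy lines are consistent with counting every single-candidate mention as correctly linked, so the second figure isolates performance on the ambiguous mentions. A small sketch reproducing both numbers from the counts above (the function is ours, not part of process_results.py):

```python
def accuracies(correct, total, one_candidate):
    # overall accuracy over all mentions
    overall = correct / total
    # accuracy restricted to mentions with more than one candidate,
    # assuming every single-candidate mention is linked correctly
    multiple = (correct - one_candidate) / (total - one_candidate)
    return overall, multiple

print(accuracies(correct=909, total=1014, one_candidate=431))
# (0.8964497041420119, 0.8198970840480274)
```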