Hybrid X-Linker: Automated Data Generation and Extreme Multi-label Ranking for Biomedical Entity Linking

This repository includes the necessary code to reproduce all the experiments described here.

Citation:

@misc{ruas2024,
      title={Hybrid X-Linker: Automated Data Generation and Extreme Multi-label Ranking for Biomedical Entity Linking}, 
      author={Pedro Ruas and Fernando Gallego and Francisco J. Veredas and Francisco M. Couto},
      year={2024},
      eprint={2407.06292},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.06292}, 
}

Environment setup

The experiments were done in an environment based on Python 3.9.1.

Minimum storage: 11 GB free disk space.

To use the official Docker image for Python, pull the image to your local environment:

docker pull python:3.9-slim-bullseye

Run a container in the project directory:

nvidia-docker run -v $(pwd):/x_linker --name x_linker --ipc="host" --gpus '"device=0"' -it <image ID> bash

Replace the Docker image ID by the respective value and the number of GPU devices.

Inside the container, install the requirements:

./setup.sh

Run in the root directory of the project:

export PYTHONPATH="${PYTHONPATH}:"

Data preparation

Data is stored in https://zenodo.org/records/12704543.

Get knowledge base files, datasets/corpora (and PubTator annotations if building training data from scratch):

./src/bash/get_eval_data.sh

If you want to generate all the training data from scratch and train PECOS-EL models follow the instructions in the file 'build_train.md'. To just perform evaluation or use X-Linker for inference follow the next steps in this file.

Evaluation

X-Linker

To evaluate X-Linker in datasets:

python src/python/xlinker/evaluate.py -dataset <> -ent_type <> -kb <> -model_dir <> -top_k <> -clustering <> --abbrv --pipeline --ppr --fuzzy_top_k

Arguments:

-dataset: evaluation dataset, 'bc5cdr', 'biored', 'nlm_chem' or 'ncbi:disease'; type=str, required=True
-ent_type: entity type, 'Disease' or 'Chemical' type=str, required=True
-kb: target knowledge organization system, 'medic' or 'ctd_chemicals; type=str, required=True
-model_dir: directory 'data/models/trained' followed by the specific model directory: 'chemical_all_1ep', 'chemical_only_kb_1ep', 'disease_100_1ep', 'disease_200_1ep', 'disease_300_1ep', 'disease_400_1ep' 'disease_only_kb_1ep'; type=str, required=True
-top_k: top-k candidates to output; type=int, default=5
-clustering: type=str, default="pifa"
--abbrv: use abbreviation detector or not; default=False, action=BooleanOptionalAction
--pipeline: use the X-Linker pipeline or not default=False, action=BooleanOptionalAction
--threshold: threshold to filter candidates; type=float, default=0.1
--ppr: use the Personalized PageRank or not default=False, action=BooleanOptionalAction
--fuzzy_top_k: number of candidates to be retrieved by the string matcher; type=int, default=1

Example for BC5CDR-Disease dataset:

python src/python/xlinker/evaluate.py -dataset bc5cdr -ent_type Disease -kb medic -model_dir data/models/trained/disease_200_1ep -top_k 2 --abbrv --pipeline --threshold 0.15 --ppr

Example for BioRED-Chemical dataset:

python src/python/xlinker/evaluate.py -dataset biored -ent_type Chemical -kb ctd_chemicals -model_dir data/models/trained/chemical_1ep_all -top_k 2 --abbrv --pipeline --threshold 0.1 --ppr

SapBERT:

python src/python/sapbert.py -ent_type <ent_type> -dataset <disease> -kb <kb> -top_k <top_k> --abbrv

Arguments:

ent_type: entity type, 'Disease' or 'Chemical' type=str, required=True
dataset: evaluation dataset, 'bc5cdr', 'biored', 'nlm_chem' or 'ncbi:disease'; type=str, required=True
kb: target knowledge organization system, 'medic' or 'ctd_chemicals; type=str, required=True
top_k: top-k candidates to output; type=int, default=5 -- abbrv: use abbreviation detector or not; default=False, action=BooleanOptionalAction

Example:

python3.9 src/python/sapbert.py -dataset bc5cdr -ent_type Disease -kb medic -top_k 1

Inference

To use any of the PECOS-EL models for inference in a list of input entities run the script 'src/python/xlinker/inference_pecos.py'. Change the variable 'input_entities' manually.

Example using the input ["Hypertension", "Diabetes", "Cancer"]:

python src/python/xlinker/inference_pecos.py

Output:

model loaded!
KB info loaded!
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 2219.99it/s]
Entity: hypertension
Output: [('Hypertension', 'D006973', 1.0), ('Essential Hypertension', 'D000075222', 0.20821528136730194), ('Hypertension, Pregnancy-Induced', 'D046110', 0.18146659433841705), ('Hypertension, Malignant', 'D006974', 0.17562662065029144), ('Hypertension, Renovascular', 'D006978', 0.11986447870731354)]
--------
Entity: diabetes
Output: [('Diabetes Mellitus', 'D003920', 0.899384617805481), ('Diabetes Mellitus, Insulin-Dependent, 19', 'C565715', 0.587672233581543), ('Diabetes Mellitus, Insulin-Dependent, 7', 'C563957', 0.5324499607086182), ('Diabetes Mellitus, Experimental', 'D003921', 0.5279378890991211), ('Diabetes Mellitus, Transient Neonatal, 1', 'C563322', 0.47543802857398987)]
--------
Entity: cancer
Output: [('Pharyngeal Neoplasms', 'D010610', 0.7916738986968994), ('Tongue Neoplasms', 'D014062', 0.7916738986968994), ('Breast Neoplasms', 'D001943', 0.7492217421531677), ('Laryngeal Neoplasms', 'D007822', 0.7154863476753235), ('Gallbladder Neoplasms', 'D005706', 0.5262947082519531)]

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
src		src
README.md		README.md
build_train.md		build_train.md
requirements.txt		requirements.txt
setup.sh		setup.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Hybrid X-Linker: Automated Data Generation and Extreme Multi-label Ranking for Biomedical Entity Linking

Environment setup

Data preparation

Evaluation

X-Linker

SapBERT:

Inference

About

Releases

Packages

Languages

lasigeBioTM/X-Linker

Folders and files

Latest commit

History

Repository files navigation

Hybrid X-Linker: Automated Data Generation and Extreme Multi-label Ranking for Biomedical Entity Linking

Environment setup

Data preparation

Evaluation

X-Linker

SapBERT:

Inference

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages