This study aimed to enhance the accuracy and efficiency of identifying negative relations, specifically focusing on inhibitor relations within biomedical texts. It developed a novel text mining system based on the state-of-the-art K-RET system, which allows the integration of biomedical ontologies. Two datasets were used, DrugProt and ChemProt, both featuring inhibitor relations and their opposite, activator relations. The K-RET system was adapted to identify negative relations more effectively, achieving high accuracy, precision, recall, and F1 scores on the test set.
This repository contains the code and instructions to obtain K-RET-NEG, which is designed for identifying and classifying sentences related to inhibitors and activators in biomedical text. This README provides an overview of the repository structure and explains how to use the provided scripts.
- Getting the Data
- Model Preparation (K-RET)
- Final Architecture
- Data Processing
- Train, dev, test set
- Training the Model - K-RET-NEG
- Evaluating Model Predictions
The steps to obtain the data and the required directory structure are described inside `data` (see data Folder).
To run the code provided, you need to follow these steps:
- Obtain the K-RET Algorithm:
  - Download the K-RET algorithm from its GitHub repository: K-RET GitHub.
  - Clone or download the repository to your local machine.
- Replace the 'auxiliar' Folder:
  - Inside the K-RET folder, locate the `auxiliar/` folder (`K-RET/auxiliar`);
  - Replace the `auxiliar` folder from the original K-RET folder with the one provided here.
- Place the 'src' Folder:
  - The scripts developed for this project are provided in `K-RET/src/`;
  - Place the entire `src` folder inside the K-RET folder.
- Copy 'run_classification_inhibitor.py':
  - Locate the `run_classification_inhibitor.py` file in the provided code (`K-RET/run_classification_inhibitor.py`). This is a modification of the original `run_classification.py` that accepts the weight of each label as arguments.
  - Copy the `run_classification_inhibitor.py` file and place it inside the K-RET folder.
Once you've completed these steps, you should have the necessary files and folder structure set up within the K-RET directory.
The initial configuration should be like this:
- data/
  - ChemProt/
    - chemprot_development/
    - chemprot_sample/
    - chemprot_test_gs/
    - chemprot_train/
  - DrugProt/drugprot-gs-training-development/
    - development/
    - training/
  - src/
- K-RET/
  - auxiliar/ (replace the original with the one provided here)
  - brain/
  - datasets/
  - models/
  - outputs/
  - uer/
  - src/ (add this folder to K-RET)
  - run_classification_inhibitor.py (add this file to K-RET)
  - other files present in K-RET
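Before moving on, it can help to confirm that the layout is in place. The following is a minimal sketch, not part of the repository: it assumes `data/` and `K-RET/` sit side by side under one root directory, and the helper name `missing_paths` is hypothetical.

```python
from pathlib import Path

# Entries taken from the layout listed above; the check itself is illustrative.
EXPECTED = [
    "data/ChemProt/chemprot_development",
    "data/ChemProt/chemprot_sample",
    "data/ChemProt/chemprot_test_gs",
    "data/ChemProt/chemprot_train",
    "data/DrugProt/drugprot-gs-training-development/development",
    "data/DrugProt/drugprot-gs-training-development/training",
    "data/src",
    "K-RET/auxiliar",
    "K-RET/brain",
    "K-RET/datasets",
    "K-RET/models",
    "K-RET/outputs",
    "K-RET/uer",
    "K-RET/src",
    "K-RET/run_classification_inhibitor.py",
]


def missing_paths(root="."):
    """Return the expected entries that do not exist under root."""
    base = Path(root)
    return [entry for entry in EXPECTED if not (base / entry).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing entries:")
        for entry in missing:
            print("  " + entry)
    else:
        print("Layout looks complete.")
```

Run it from the directory that contains both `data/` and `K-RET/`; an empty result means every listed entry was found.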
To get the data needed to obtain K-RET-NEG, follow these steps:
- Navigate to the `data` directory: `cd data`
- Run the following command to generate various JSON files: `python3 src/get_data.py`
  - This command will produce the following JSON files:
    - Separate JSON files for each dataset.
    - `final_duplicates.json`, which merges all datasets while preserving possible duplicates.
    - `final_no_duplicates.json`, which removes duplicates from the merged dataset.
    - Two additional files, `all_complete` and `all_ready`, containing all inhibitor/activator sentences:
      - `all_ready` contains only label and text_a.
      - `all_complete` contains the ID of the PubMed article, text, arg1 (entity 1), arg2 (entity 2), and the label of the sentence.
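The relationship between the merged, deduplicated, and reduced files can be sketched as follows. This is an illustration under assumptions, not the code in `src/get_data.py`: the function names and the key names `pmid`, `text`, `arg1`, and `arg2` are hypothetical, and treating two records as duplicates only when every field matches is one possible criterion among several.

```python
import json


def merge_records(*datasets):
    """Concatenate datasets, keeping duplicates (cf. final_duplicates.json)."""
    merged = []
    for records in datasets:
        merged.extend(records)
    return merged


def drop_duplicates(records):
    """Keep the first occurrence of each record (cf. final_no_duplicates.json)."""
    seen = set()
    unique = []
    for record in records:
        # Serialise with sorted keys so that key order does not matter.
        key = json.dumps(record, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def to_ready(records):
    """Reduce full records to label and text_a only (cf. all_ready)."""
    return [{"label": r["label"], "text_a": r["text"]} for r in records]
```

With two toy datasets that share one sentence, `merge_records` returns three records and `drop_duplicates` returns two, which mirrors the difference between the two merged files.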
- Navigate to the `K-RET` directory: `cd K-RET`
- Run the following command to generate four different train, dev, and test sets: `chmod u+x create_train_test_dev.sh; ./create_train_test_dev.sh`
  - `train_test_dev.py`: present in `data/src/`, can be used to split the `final_no_duplicates.json` dataset into train, test, and development sets.
  - `create_train_test_dev.sh`: This script generates four different train, test, and dev datasets from `final_no_duplicates.json` by calling `train_test_dev.py` four times.
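The idea of producing four different splits from the same file can be sketched as below. This is a rough illustration, not the logic of `train_test_dev.py`: the 80/10/10 proportions, the use of the split number as the random seed, and the function names are all assumptions.

```python
import random


def split_dataset(records, seed, dev_frac=0.1, test_frac=0.1):
    """Shuffle with a fixed seed, then cut into train, dev, and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_dev = int(len(shuffled) * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


def make_splits(records, n_splits=4):
    """Build one (train, dev, test) triple per split number, starting at 1."""
    return {i: split_dataset(records, seed=i) for i in range(1, n_splits + 1)}
```

Fixing the seed per split keeps each of the four datasets reproducible across runs while still giving each split a different shuffle.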
To train and evaluate K-RET-NEG, use the scripts in the `K-RET` directory:
- To replicate results and check different weights automatically:
  - Run `./src/chemdrug_weight.sh`, and the results will be saved in the `output/chemdrug_weight/` directory.
- To train four different models: `chmod u+x ./src/chemdrug_crossval.sh; ./src/chemdrug_crossval.sh`
  - The `chemdrug_crossval.sh` script allows you to adjust label weights. By default, it uses the following weights obtained from calculations: 0.799 (for the activator label) and 0.201 (for the inhibitor label).
  - The results of the training process will be saved in the `output/chemdrug_crossval/` directory as both a log file and a corresponding binary model file. The naming convention for these files is as follows:
    - Example Model File: `output/chemdrug_crossval/chemdrug_crossval_1_0.799_0.201.bin`
    - Example Log File: `output/chemdrug_crossval/chemdrug_crossval_1_0.799_0.201.log`
  - In the file names, the number (e.g., 1) represents the specific train, dev, and test set, while the weights (0.799 and 0.201) indicate the assigned label weights.
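This README does not state how the default weights were calculated. One common scheme that gives the rarer label the larger weight is to weight each label by one minus its share of the examples; the sketch below shows that scheme and how such weights enter a weighted loss. Both functions are hypothetical and are not taken from `run_classification_inhibitor.py`, and the toy counts in the usage note are invented for easy arithmetic.

```python
import math
from collections import Counter


def inverse_share_weights(labels):
    """Weight each label by 1 - (its share of all examples)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 1 - n / total for label, n in counts.items()}


def weighted_nll(probs, gold, weights):
    """Weighted mean negative log-likelihood.

    probs: one dict per example mapping label -> predicted probability.
    gold: the true label of each example.
    The sum is normalised by the total weight of the gold labels.
    """
    num = sum(weights[g] * -math.log(p[g]) for p, g in zip(probs, gold))
    den = sum(weights[g] for g in gold)
    return num / den
```

For a toy set with one activator sentence and three inhibitor sentences, `inverse_share_weights` returns 0.75 for the activator label and 0.25 for the inhibitor label, so errors on the rarer label count more in the loss.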
After training the four different models, you can check the predictions made by each model on the test set by running the following script from the `K-RET` directory: `chmod u+x ./src/chemdrug_cross_process.sh; ./src/chemdrug_cross_process.sh`
- This script evaluates each of the four models on the test set of the split it was trained with. The results are saved in `processed_results/`.
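For reference, precision, recall, and F1 for one label can be computed from gold and predicted labels as sketched below. The format of the files in `processed_results/` is not described here, so the function works on plain lists of label strings; its name and the label strings in the usage note are assumptions.

```python
def precision_recall_f1(gold, predicted, positive):
    """Compute precision, recall, and F1 for the label given as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Calling it once with `positive="inhibitor"` and once with `positive="activator"` gives per-label scores, which can then be averaged if a single figure per model is needed.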
Feel free to explore the repository further and adapt the provided scripts as needed for your specific use case.