
BERT-DTI

This repo provides the experiment code for the KD-DTI benchmark, which aims to extract Drug-Target Interaction knowledge from biomedical literature. Our code is based on BERT-NMT.

The public version of the dataset is available here.

Get started:

Prepare environment

Run ./utils/prepare_environment.sh to install the required packages and to install bert-nmt to the default path /tmp/bert-nmt/.

Preprocess the raw data:

Run ./data_scripts/build_seq2seq_data.sh, a script that preprocesses the raw files. It takes two params:

  • input_dir: path to the directory containing the raw JSON data
  • output_dir: path where the processed seq2seq data is saved

Tip: see the example params in the scripts.

In this step, the raw input is processed into train.x, train.y, valid.x, valid.y, test.x, and test.y.

For the *.x files, each line is a document.

For the *.y files, each line is made up of drug_1 relation_1 target_1 drug_2 relation_2 target_2, and so on: the triples extracted from the document on the same line of the matching *.x file.
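The file layout above can be illustrated with a minimal Python sketch. It is not the repo's preprocessing code: the function names, the single-space separator, and the sample triples are assumptions made for illustration, and the actual script may delimit entities differently (entity names that contain spaces would need a dedicated separator).

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (drug, relation, target)


def linearize_triples(triples: Iterable[Triple], sep: str = " ") -> str:
    """Flatten triples into one line: drug_1 relation_1 target_1 drug_2 ...

    The separator is an assumption; adjust it to whatever the real
    preprocessing script emits.
    """
    parts: List[str] = []
    for drug, relation, target in triples:
        parts.extend([drug, relation, target])
    return sep.join(parts)


def to_parallel_lines(examples):
    """Turn (document, triples) pairs into aligned *.x and *.y lines."""
    x_lines, y_lines = [], []
    for document, triples in examples:
        # Collapse all whitespace so every document stays on a single line.
        x_lines.append(" ".join(document.split()))
        y_lines.append(linearize_triples(triples))
    return x_lines, y_lines


# Hypothetical sample data, for illustration only.
examples = [
    (
        "Aspirin irreversibly inhibits\nCOX-1 and COX-2.",
        [("aspirin", "inhibitor", "COX-1"), ("aspirin", "inhibitor", "COX-2")],
    ),
]
x_lines, y_lines = to_parallel_lines(examples)
```

Line i of the *.x list and line i of the *.y list always describe the same document, which is the property a seq2seq model needs from the two files.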

Note: Before processing the data, you should first register a DrugBank account, download the XML data set, and replace each entity ID with the corresponding entity name from DrugBank.
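A minimal sketch of the ID-to-name replacement is shown below. It assumes the DrugBank XML layout in which each drug element carries one or more drugbank-id children and a name child under the http://www.drugbank.ca namespace; check this against the file you download. The function names are hypothetical, and only drug entries are handled here; target IDs would need an analogous lookup.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Assumed namespace of the DrugBank XML export; verify against your download.
DRUGBANK_NS = {"db": "http://www.drugbank.ca"}


def build_id_to_name(root: ET.Element) -> Dict[str, str]:
    """Map every DrugBank ID of each top-level drug entry to the drug name."""
    mapping: Dict[str, str] = {}
    for drug in root.findall("db:drug", DRUGBANK_NS):
        name = drug.findtext("db:name", default="", namespaces=DRUGBANK_NS).strip()
        if not name:
            continue
        for id_el in drug.findall("db:drugbank-id", DRUGBANK_NS):
            if id_el.text:
                mapping[id_el.text.strip()] = name
    return mapping


def replace_ids(entities: List[str], mapping: Dict[str, str]) -> List[str]:
    """Swap known IDs for names; leave anything unknown untouched."""
    return [mapping.get(entity, entity) for entity in entities]


# Tiny inline document standing in for the real (much larger) XML file.
SAMPLE_XML = """
<drugbank xmlns="http://www.drugbank.ca">
  <drug>
    <drugbank-id primary="true">DB00001</drugbank-id>
    <name>Lepirudin</name>
  </drug>
</drugbank>
"""
id_to_name = build_id_to_name(ET.fromstring(SAMPLE_XML))
```

For the real file, `ET.parse(path).getroot()` would supply the root element. The full export is large, so a streaming parser such as `ET.iterparse` may be preferable when memory is limited.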

Tokenize and binarize the data:

Run ./data_scripts/move_and_bin_data.sh, a script that tokenizes and binarizes the preprocessed files. It takes two params:

  • input_dir: path to the raw seq2seq data
  • script_dir: code directory for BERT-DTI

Tip: see the example params in the scripts.

In this step, we first use build_bpe_data.sh to get the BPE data.

We then build the binarized data for the different settings:

  • For the conventional model, use bin.sh.
  • For the BERT model, use bin-bert.sh.
  • If you would like to use PubMedBERT, use bin-pubmedbert.sh.

Training and Inference

All training and inference scripts can be found in ./train_and_test_scripts/

For training, run ./train_and_test_scripts/train_seq2seq{pretrained_model_name}.sh. It takes four params:

  • dr: dropout rate
  • las: label smoothing rate
  • lr: learning rate
  • data_path: path to the processed data-bin directory, e.g. ./data/seq2seq/data-bin-BERT

For inference, run ./train_and_test_scripts/predict_seq2seq{pretrained_model_name}.sh. It takes three params:

  • model: path to the checkpoint .pt file
  • data_path: path to the directory of binarized data
  • output_file: path to the result file

Evaluation

Run ./evaluation_scripts/hard_match_evaluation.py to get the results. An example of usage is provided in ./evaluation_scripts/run_hard_eval.sh
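To make the idea of hard matching concrete, here is a small sketch of exact-match scoring over (drug, relation, target) triples. It is an illustration, not the logic of hard_match_evaluation.py: micro-averaging over documents, case-sensitive comparison, and the function name are all assumptions, and the repo's script may normalize or aggregate differently.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (drug, relation, target)


def hard_match_prf(
    gold: List[Set[Triple]], pred: List[Set[Triple]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 under exact triple match.

    A predicted triple counts as correct only if drug, relation, and
    target all equal those of a gold triple for the same document.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same documents")
    true_pos = n_pred = n_gold = 0
    for gold_set, pred_set in zip(gold, pred):
        true_pos += len(gold_set & pred_set)
        n_pred += len(pred_set)
        n_gold += len(gold_set)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: one of two predictions matches one of two gold triples.
gold = [{("aspirin", "inhibitor", "COX-1"), ("aspirin", "inhibitor", "COX-2")}]
pred = [{("aspirin", "inhibitor", "COX-1"), ("aspirin", "agonist", "COX-2")}]
scores = hard_match_prf(gold, pred)
```

Using sets per document means duplicate predictions of the same triple are counted once, which is one reasonable convention among several.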
