Deep autoregressive generative models capture the intrinsics embedded in T-cell receptor repertoires

jiangdada1221/TCRpeg

TCRpeg

TCRpeg is a deep probabilistic neural network framework for inferring the probability distribution of a given CDR3 repertoire. Beyond that, TCRpeg can provide numerical embeddings for TCR sequences and generate new TCR sequences whose statistical properties closely match those of the training repertoire. TCRpeg can also be easily extended to act as a classifier for predictive purposes (TCRpeg-c).


Installation

TCRpeg is a Python package built on the deep learning library PyTorch. It is available on PyPI and can be installed via pip:
pip install tcrpeg
(Recommended) TCRpeg can also be installed by cloning the GitHub repository and running pip:
git clone https://github.com/jiangdada1221/TCRpeg.git
cd TCRpeg
pip install .
The required software dependencies are listed below:

Numpy
matplotlib
tqdm
pandas
scikit-learn
scipy
torch >= 1.1.0

Data

All the data used in the paper are publicly available, so we suggest readers refer to the original papers for more details. We also provide the processed data, which can be downloaded via this link

Usage instructions

Define and train TCRpeg model:

from tcrpeg.TCRpeg import TCRpeg
# 'embedding_32.txt' stores the numerical embedding of each amino acid (AA); it is provided under the 'tcrpeg/data/' folder.
# 'tcrs' is the TCR repertoire, a list of CDR3 sequences: [tcr1, tcr2, ...]
model = TCRpeg(embedding_path='tcrpeg/data/embedding_32.txt', load_data=True, path_train=tcrs)
model.create_model()  # initialize the TCRpeg model
model.train_tcrpeg(epochs=20, batch_size=32, lr=1e-3)
# Defining and training TCRpeg_vj is covered in tutorial.ipynb
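
The `tcrs` argument above is a plain Python list of CDR3 amino-acid strings. A minimal sketch of building such a list from a CSV file is shown here. The helper name `load_cdr3s`, the column name `seq` (borrowed from the command-line section of this README), and the length cap of 30 are assumptions for illustration; none of them is part of the TCRpeg API.

```python
import csv

# The 20 standard amino acids; sequences containing other symbols are skipped.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def load_cdr3s(path, column="seq", max_length=30):
    """Read CDR3 strings from one CSV column, keeping only clean sequences.

    Hypothetical helper: `column` and `max_length` are assumptions and
    should be adjusted to the data at hand.
    """
    tcrs = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            seq = (row.get(column) or "").strip().upper()
            # Drop empty cells, over-long sequences, and non-standard residues.
            if seq and len(seq) <= max_length and set(seq) <= AMINO_ACIDS:
                tcrs.append(seq)
    return tcrs
```

The returned list can then be passed as `path_train=tcrs` when constructing the model.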

Load the default models:

model = TCRpeg(embedding_path='tcrpeg/data/embedding_32.txt',load_data=False)
model.create_model(load=True,path='tcrpeg/models/tcrpeg.pth')
#TCRpeg_vj model
model_vj = TCRpeg(embedding_path='tcrpeg/data/embedding_32.txt',load_data=False,vj=True)
model_vj.create_model(vj=True,load=True,path='tcrpeg/models/tcrpeg_vj.pth')

Use the pretrained TCRpeg model for downstream applications:

log_probs = model.sampling_tcrpeg_batch(tcrs)   #probability inference
new_tcrs = model.generate_tcrpeg(num_to_gen=1000, batch_size= 100)    #generation
embs = model.get_embedding(tcrs)    #embeddings for tcrs
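
As a rough illustration of working with the inferred values, the sketch below ranks sequences by log-probability and converts the top entries back to probabilities. It assumes the returned values are natural-log probabilities aligned one-to-one with the input list; the helper `rank_by_log_prob` and the numbers are placeholders, not real model output.

```python
import math


def rank_by_log_prob(tcrs, log_probs, top_k=3):
    """Return the top_k (sequence, probability) pairs, most probable first.

    Hypothetical helper; assumes natural-log probabilities.
    """
    if len(tcrs) != len(log_probs):
        raise ValueError("tcrs and log_probs must have the same length")
    ranked = sorted(zip(tcrs, log_probs), key=lambda pair: pair[1], reverse=True)
    return [(seq, math.exp(lp)) for seq, lp in ranked[:top_k]]


# Placeholder values standing in for the output of sampling_tcrpeg_batch.
example_tcrs = ["CASSLGTDTQYF", "CASSPGQGAYEQYF", "CASSIRSSYEQYF"]
example_log_probs = [-18.2, -25.7, -21.4]
top = rank_by_log_prob(example_tcrs, example_log_probs, top_k=2)
```

Sorting on the log scale avoids the underflow that can occur when very small probabilities are exponentiated before comparison.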

Updates

The downstream applications can also be applied to CDR3+V+J data:

# model_vj is the TCRpeg_vj model loaded above
new_clonetypes = model_vj.generate_tcrpeg_vj(num_to_gen=1000, batch_size=100)  # generation
log_probs_clonetypes = model_vj.sampling_tcrpeg_batch(clone_types)  # log-probabilities of CDR3_V_J
# clone_types has shape 3 x N: [[cdr3_1, cdr3_2, ...], [v_1, v_2, ...], [j_1, j_2, ...]]
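
Clonotype data is often stored one record per row, whereas `clone_types` groups all CDR3s, all V genes, and all J genes into three parallel lists. The following sketch shows one way to convert between the two layouts. The helper `to_clone_types` and the gene names are illustrative assumptions; the exact gene naming TCRpeg expects is not specified here.

```python
def to_clone_types(records):
    """Turn [(cdr3, v, j), ...] into [[cdr3, ...], [v, ...], [j, ...]].

    Hypothetical helper, not part of TCRpeg.
    """
    if not records:
        return [[], [], []]
    # zip(*records) transposes rows into columns; unpacking into three
    # names raises ValueError if a record does not have three fields.
    cdr3s, vs, js = zip(*records)
    return [list(cdr3s), list(vs), list(js)]


# Example rows with placeholder gene names.
records = [
    ("CASSLGTDTQYF", "TRBV7-9", "TRBJ2-3"),
    ("CASSPGQGAYEQYF", "TRBV5-1", "TRBJ2-7"),
]
clone_types = to_clone_types(records)
```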

We provide a tutorial Jupyter notebook named tutorial.ipynb. It covers most of TCRpeg's functionality in three parts: probability inference, numerical encodings & downstream classification, and generation.

Command line usages

We provide the scripts for the experiments in the paper in the folder tcrpeg/scripts.

python train.py --path_train ../data/TCRs_train.csv --epoch 20 --learning_rate 0.0001 --store_path ../results/model.pth 

To train a TCRpeg (with vj) model, the data file needs to have columns named 'seq', 'v', and 'j'. Run 'python train.py --h' for more details.

python evaluate.py --test_path ../data/pdf_test.csv --model_path ../results/model.pth

This computes the Pearson correlation coefficient for the probability inference task on the test set.
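
For reference, the Pearson correlation coefficient itself can be sketched in a few lines of plain Python. This is the textbook definition, not a reproduction of evaluate.py; whether that script compares probabilities or log-probabilities is not stated here, and the function name `pearson_r` is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 indicates that the inferred values track the reference values closely.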

python generate.py --model_path ../results/model.pth --n 10000 --store_path ../results/gen_seq.txt

Use the pretrained TCRpeg model to generate new sequences. Type 'python generate.py --h' for more details.

python classify.py --path_train ../data/train.csv --path_test ../data/test.csv --epoch 20 --learning_rate 0.0001

Use TCRpeg-c for classification tasks. The files should have two columns: 'seq' and 'label'. Type 'python classify.py --h' for more details.
Note that unspecified parameters take their default values (e.g. batch size).
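
The two-column input files can be prepared with the standard library. The sketch below shuffles labeled sequences and writes a train and a test CSV with the 'seq' and 'label' headers described above. The helper `write_split`, the 80/20 split, and the label values are assumptions for illustration only.

```python
import csv
import random


def write_split(pairs, train_path, test_path, test_fraction=0.2, seed=0):
    """Shuffle (seq, label) pairs and write them to two CSV files.

    Hypothetical helper; returns (number of train rows, number of test rows).
    """
    rows = list(pairs)
    random.Random(seed).shuffle(rows)  # seeded for a reproducible split
    n_test = max(1, int(len(rows) * test_fraction))
    splits = {test_path: rows[:n_test], train_path: rows[n_test:]}
    for path, subset in splits.items():
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["seq", "label"])  # column names from this README
            writer.writerows(subset)
    return len(rows) - n_test, n_test
```

The resulting files can then be passed to classify.py through `--path_train` and `--path_test`.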

The Python files and their uses are listed below:

Module name          Usage
TCRpeg.py            Contains most functions of TCRpeg
evaluate.py          Evaluates the performance of probability inference
word2vec.py          word2vec model for obtaining embeddings of AAs
model.py             Deep learning models of TCRpeg, TCRpeg-c, and TCRpeg_vj
classification.py    Applies TCRpeg-c to classification tasks
utils.py             Utility functions
process_data.py      Constructs the universal TCR pool

Contact

Name: Yuepeng Jiang
Email: [email protected]/[email protected]/[email protected]
Note: For quick questions, feel free to send me an email, since I check email often. Otherwise, you may open an issue in this repository.

License

Free use of TCRpeg is granted under the terms of the GNU General Public License version 3 (GPLv3).

Citation

@article{jiang2023deep,
  title={Deep autoregressive generative models capture the intrinsics embedded in T-cell receptor repertoires},
  author={Jiang, Yuepeng and Li, Shuai Cheng},
  journal={Briefings in Bioinformatics},
  volume={24},
  number={2},
  pages={bbad038},
  year={2023},
  publisher={Oxford University Press}
}
