Repository for the paper Light Attention Predicts Protein Location from the Language of Life. The method beats the previous SOTA by 8 percentage points on the standard subcellular localization dataset and our new benchmark.
If you have questions or any trouble running some of the code, don't hesitate to open an issue or ask me via [email protected]. I am happy to hear from you!
Either train your own model or use the weights provided in this repository to run inference only. We also provide a webserver, https://embed.protein.properties/, where you can upload sequences and get the localization predictions (and more).
The architecture works on embeddings that are generated from single sequences of amino acids, without additional evolutionary information as in profile embeddings. Just download the embeddings and place them in data_files.
Alternatively, you can generate the embedding files from .fasta files using the bio-embeddings library. To use those embeddings here, replace the path in the corresponding config file, such as configs/light_attention.yaml, to point to your embeddings and remapping file, and set the parameter key_format: hash.
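As a rough illustration of what key_format: hash implies, the sketch below assumes that the keys in the embeddings file are MD5 hashes of the amino acid sequences and that the remapping file maps those hashes back to the original identifiers. That is an assumption about the bio-embeddings output, not something this repository states, and the helper names and toy sequences are hypothetical, so check the files you actually generated.

```python
import hashlib
from typing import Dict


def sequence_key(sequence: str) -> str:
    """Return the MD5 hex digest of a sequence (the assumed hash key format)."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


def build_remapping(records: Dict[str, str]) -> Dict[str, str]:
    """Map hash key -> original identifier for a {identifier: sequence} dict."""
    return {sequence_key(seq): identifier for identifier, seq in records.items()}


# Toy sequences for illustration only; they are not taken from any dataset.
records = {"protein_A": "MKTAYIAKQR", "protein_B": "MSDNGELEDK"}
remapping = build_remapping(records)

for key, identifier in remapping.items():
    print(key, identifier)
```

If your remapping file uses a different key scheme, only sequence_key would need to change.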
If you are using conda, you can install all dependencies like this. Otherwise, look at the setup below.
conda env create -f environment.yml
conda activate bio
Make sure that you have the embeddings placed in the files specified in configs/light_attention.yaml as described in step 1. Then start training:
python train.py --config configs/light_attention.yaml
tensorboard --logdir=runs --port=6006
You can now go to localhost:6006 in your browser and watch the model train.
Either use your own trained weights that were saved in the runs directory, or download the trained_model_weights and place the folder in the repository root. Running the following command will use the weights to generate the predictions for the protein embeddings specified in configs/inference.yaml (setHARD currently).
python inference.py --config configs/inference.yaml
The predictions are then saved in the checkpoint in trained_model_weights as predictions.txt, in the same order as your input.
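Because the predictions follow the input order, they can be matched back to your sequence identifiers by position. The following is a minimal sketch, assuming predictions.txt holds one prediction per line (the file layout is not documented here); the identifiers and labels are placeholders and pair_predictions is a hypothetical helper.

```python
from typing import Iterable, List, Tuple


def pair_predictions(
    identifiers: List[str], prediction_lines: Iterable[str]
) -> List[Tuple[str, str]]:
    """Pair each input identifier with the prediction at the same position."""
    # Drop blank lines and trailing newlines before matching by position.
    predictions = [line.strip() for line in prediction_lines if line.strip()]
    if len(predictions) != len(identifiers):
        raise ValueError(
            f"{len(identifiers)} inputs but {len(predictions)} predictions"
        )
    return list(zip(identifiers, predictions))


# Placeholder data; in practice the lines would come from reading predictions.txt.
identifiers = ["protein_A", "protein_B"]
prediction_lines = ["Nucleus\n", "Cytoplasm\n"]

pairs = pair_predictions(identifiers, prediction_lines)
for identifier, label in pairs:
    print(identifier, label)
```

The length check is there so that a mismatch between inputs and predictions fails loudly instead of silently shifting labels.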
Python 3 dependencies:
- pytorch
- biopython
- h5py
- matplotlib
- seaborn
- pandas
- pyaml
- torchvision
- sklearn
You can use the conda environment file to install all of the above dependencies. Create the conda environment bio by running:
conda env create -f environment.yml
You can use the respective configuration file to reproduce the results of the methods in the paper. The 10 randomly generated seeds with which we trained 10 models of each method to get standard errors are:
[921, 969, 309, 559, 303, 451, 279, 624, 657, 702]
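For reference, a standard error over runs with these seeds could be computed roughly as below. This is a sketch that assumes the standard error of the mean with the sample standard deviation (n - 1 in the denominator); the paper may define it differently. The accuracy values are made-up placeholders, not results.

```python
import math
import statistics
from typing import Sequence, Tuple

SEEDS = [921, 969, 309, 559, 303, 451, 279, 624, 657, 702]


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Return the mean and the standard error of the mean of the values."""
    mean = statistics.mean(values)
    # Sample standard deviation divided by sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Made-up placeholder accuracies, one per seed; NOT numbers from the paper.
placeholder_accuracies = [0.80 + 0.001 * i for i in range(len(SEEDS))]

mean, sem = mean_and_standard_error(placeholder_accuracies)
print(f"accuracy: {mean:.4f} +/- {sem:.4f}")
```

In practice you would replace the placeholder list with the test accuracy of each of the 10 trained models.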
The DeepLoc dataset has 10 different subcellular localizations that need to be classified. Meanwhile, setHARD is a new dataset with less redundancy and harder targets. The dataset details can be found in our paper.
Accuracy on the DeepLoc test set:
Method | Accuracy |
---|---|
Ours | 86.01% |
DeepLoc | 77.97% |
iLoc-Euk | 68.20% |
YLoc | 61.22% |
LocTree2 | 61.20% |
SherLoc2 | 58.15% |
(Ours evaluated across 10 different randomly chosen seeds) (Numbers taken from the DeepLoc paper)
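The gap of roughly 8 percentage points to the previous best method can be checked directly from the table. The short sketch below only restates the table values and subtracts them; gap_to_ours is a hypothetical helper name.

```python
from typing import Dict

# Accuracies (in percent) copied from the table above.
accuracies = {
    "Ours": 86.01,
    "DeepLoc": 77.97,
    "iLoc-Euk": 68.20,
    "YLoc": 61.22,
    "LocTree2": 61.20,
    "SherLoc2": 58.15,
}


def gap_to_ours(table: Dict[str, float]) -> Dict[str, float]:
    """Return the difference in percentage points between Ours and each method."""
    ours = table["Ours"]
    return {
        method: round(ours - accuracy, 2)
        for method, accuracy in table.items()
        if method != "Ours"
    }


gaps = gap_to_ours(accuracies)
for method, gap in gaps.items():
    print(f"{method}: {gap:.2f} percentage points")
```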