Protein Subcellular Localization Prediction 🧬

Video | Paper

Repository for the paper "Light Attention Predicts Protein Location from the Language of Life". The method improves on the previous state of the art by about 8 percentage points on the standard subcellular localization dataset (DeepLoc) and on our new benchmark (setHARD).

If you have questions or run into trouble with any of the code, don't hesitate to open an issue or ask me via [email protected]. I am happy to hear from you!

Usage

Either train your own model, or use the weights provided in this repository and run inference only. We also provide a web server at https://embed.protein.properties/ where you can upload sequences and get localization predictions (and more).

1. Get Protein Embeddings

The architecture works on embeddings generated from single amino acid sequences, without the additional evolutionary information used in profile embeddings. Just download the embeddings and place them in data_files.

Alternatively, you can generate the embedding files from .fasta files using the bio-embeddings library. To use those embeddings here, edit the corresponding config file, such as configs/light_attention.yaml, so that its paths point to your embeddings and remapping files, and set the parameter key_format: hash.
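
With key_format: hash, entries are looked up by a hash rather than by the original FASTA identifier. As a minimal sketch, assuming the hash is an MD5 digest of the amino acid sequence (my understanding of the default bio-embeddings remapping, not something this README states), the key for a sequence could be computed as follows. The function name and the example sequence are hypothetical; compare the output against the IDs in your own remapping file before relying on it.

```python
import hashlib


def sequence_key(sequence: str) -> str:
    """Return the hex MD5 digest of a sequence (ASSUMED key scheme, see text)."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


# Hypothetical example sequence, not taken from any dataset.
example_sequence = "MKTAYIAKQR"
key = sequence_key(example_sequence)
print(key)  # a 32-character hexadecimal string
```

If the printed key does not appear as a header in your remapping file, the remapping was probably produced with a different scheme, and the assumption above does not hold for your files.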

2. Setup environment

If you are using conda, you can install all dependencies like this. Otherwise, look at the setup below.

conda env create -f environment.yml
conda activate bio

3.1 Training

Make sure that you have the embeddings placed in the files specified in configs/light_attention.yaml as described in step 1. Then start training:

python train.py --config configs/light_attention.yaml
tensorboard --logdir=runs --port=6006

You can now go to localhost:6006 in your browser and watch the model train.

3.2 Inference

Either use your own trained weights that were saved in the runs directory, or download the trained_model_weights and place the folder in the repository root. The following command uses the weights to generate predictions for the protein embeddings specified in configs/inference.yaml (currently setHARD).

python inference.py --config configs/inference.yaml

The predictions are then saved as predictions.txt in the checkpoint directory inside trained_model_weights, in the same order as your input.
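
Because the predictions keep the input order, they can be matched back to sequence identifiers by position. The sketch below assumes that predictions.txt holds one prediction per non-empty line and that the input order is the order of the FASTA headers; neither is confirmed by this README. The helper names and the demo files are made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def read_fasta_ids(fasta_path: Path) -> List[str]:
    """Collect the first token of every FASTA header line, in file order."""
    ids = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                ids.append(tokens[0] if tokens else "")
    return ids


def pair_predictions(fasta_path: Path, predictions_path: Path) -> List[Tuple[str, str]]:
    """Zip FASTA IDs with prediction lines (ASSUMES one prediction per line)."""
    ids = read_fasta_ids(fasta_path)
    with open(predictions_path) as handle:
        predictions = [line.strip() for line in handle if line.strip()]
    if len(ids) != len(predictions):
        raise ValueError(f"{len(ids)} sequences but {len(predictions)} predictions")
    return list(zip(ids, predictions))


# Demo with made-up files; real paths depend on your config and checkpoint folder.
with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "input.fasta"
    preds = Path(tmp) / "predictions.txt"
    fasta.write_text(">protein_a\nMKTAYIAKQR\n>protein_b\nMSLNDEQ\n")
    preds.write_text("Nucleus\nCytoplasm\n")
    pairs = pair_predictions(fasta, preds)

print(pairs)
```

The length check is deliberate: a count mismatch means the positional pairing is not trustworthy, so the function fails rather than silently misaligning labels.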

Architecture

Setup

Python 3 dependencies:

pytorch
biopython
h5py
matplotlib
seaborn
pandas
pyaml
torchvision
sklearn

You can use the conda environment file to install all of the above dependencies. Create the conda environment bio by running:

conda env create -f environment.yml

Reproduce exact results

To reproduce the results of a method in the paper, use its respective configuration file. We trained 10 models of each method to get standard errors, using these 10 randomly generated seeds:

[921, 969, 309, 559, 303, 451, 279, 624, 657, 702]
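
As a rough illustration of how ten per-seed results turn into a mean and a standard error, the sketch below uses the usual definition (sample standard deviation divided by the square root of the number of runs). The accuracy values are placeholders I made up so the code runs, not results from the paper, and the function name is hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple

# The seeds listed above, one training run per seed.
SEEDS = [921, 969, 309, 559, 303, 451, 279, 624, 657, 702]


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for a list of run results."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error


# PLACEHOLDER accuracies, one per seed; replace with your own measured values.
placeholder_accuracies = [0.85, 0.86, 0.87, 0.86, 0.85, 0.87, 0.86, 0.86, 0.85, 0.87]
assert len(placeholder_accuracies) == len(SEEDS)

mean, standard_error = mean_and_standard_error(placeholder_accuracies)
print(f"accuracy: {mean:.4f} +/- {standard_error:.4f}")
```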

Performance

The DeepLoc dataset has 10 different subcellular localizations that need to be classified. setHARD is a new dataset with less redundancy and harder targets. The dataset details can be found in our paper.

Accuracy on the DeepLoc test set:

Method	Accuracy
Ours	86.01%
DeepLoc	77.97%
iLoc-Euk	68.20%
YLoc	61.22%
LocTree2	61.20%
SherLoc2	58.15%

(Ours evaluated across 10 different randomly chosen seeds.) (Numbers taken from the DeepLoc paper.)
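
The improvement of roughly 8 percentage points quoted at the top of this README can be checked directly against the table. The snippet below simply subtracts each baseline accuracy from ours; the numbers are copied from the table, and the variable names are my own.

```python
# Accuracies (in percent) on the DeepLoc test set, copied from the table above.
deeploc_test_accuracy = {
    "Ours": 86.01,
    "DeepLoc": 77.97,
    "iLoc-Euk": 68.20,
    "YLoc": 61.22,
    "LocTree2": 61.20,
    "SherLoc2": 58.15,
}

ours = deeploc_test_accuracy["Ours"]

# Margin over each baseline, in percentage points.
margins = {
    method: round(ours - accuracy, 2)
    for method, accuracy in deeploc_test_accuracy.items()
    if method != "Ours"
}

for method, margin in margins.items():
    print(f"{method}: +{margin:.2f} percentage points")
# The DeepLoc line prints "+8.04 percentage points".
```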
