Fine-tuning sequence-to-expression models on personal genome and transcriptome data

The code base for our work on improving the performance of sequence-to-expression models for making individual-specific gene expression predictions by fine-tuning them on personal genome and transcriptome data. Please cite the following paper if you use our code:

@article{
    finetuning_seq2exp_models_rastogi_reddy_2024,
    author = {Rastogi, Ruchir and Reddy, Aniketh Janardhan and Chung, Ryan and Ioannidis, Nilah M.},
    title = {Fine-tuning sequence-to-expression models on personal genome and transcriptome data},
    elocation-id = {2024.09.23.614632},
    year = {2024},
    doi = {10.1101/2024.09.23.614632},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2024/09/25/2024.09.23.614632},
    eprint = {https://www.biorxiv.org/content/early/2024/09/25/2024.09.23.614632.full.pdf},
    journal = {bioRxiv}
}

We fine-tune Enformer [1] in our work. We use the PyTorch port of Enformer available at https://github.com/lucidrains/enformer-pytorch as the base model. Our code is structured as follows:

finetuning/: Scripts for fine-tuning Enformer on personal genome and transcriptome data using the various training strategies described in our paper.
fusion/: Code to run variant-based baseline methods.
analysis/: Scripts used for analysing predictions and generating figures.
process_geuvadis_data/: Scripts for processing personal gene expression data from the GEUVADIS consortium. The processed data used in our experiments can be found at process_geuvadis_data/log_tpm/corrected_log_tpm.annot.csv.gz.
process_sequence_data/: Scripts to obtain personal genome sequences for individuals with matching gene expression data.
process_enformer_data/: Code to build Enformer training data from Basenji2 training data by expanding input sequences. This data is used for joint training along with the personal genome and transcriptome data.
process_Malinois_MPRA_data/: Code to download and format the MPRA data collected by Siraj et al. [2] from ENCODE. This data is also used for joint training.
vcf_utils/: Miscellaneous utils used for processing VCF files.

References:

Avsec, Žiga, et al. "Effective gene expression prediction from sequence by integrating long-range interactions." Nature methods 18.10 (2021): 1196-1203.
Siraj, Layla, et al. "Functional dissection of complex and molecular trait variants at single nucleotide resolution." bioRxiv (2023).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fine-tuning sequence-to-expression models on personal genome and transcriptome data

References:

About

Releases

Packages

Contributors 3

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 377 Commits
analysis		analysis
finetuning		finetuning
fusion		fusion
process_Malinois_MPRA_data		process_Malinois_MPRA_data
process_enformer_data		process_enformer_data
process_geuvadis_data		process_geuvadis_data
process_sequence_data		process_sequence_data
vcf_utils		vcf_utils
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
README.md		README.md

ni-lab/finetuning-enformer

Folders and files

Latest commit

History

Repository files navigation

Fine-tuning sequence-to-expression models on personal genome and transcriptome data

References:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages