GROOT: Effective Design of Biological Sequences with Limited Experimental Data

Introduction

This is the official implementation of the paper "GROOT: Effective Design of Biological Sequences with Limited Experimental Data".

Dependencies

The environment.yml file lists the dependencies needed to run GROOT. The main pipeline requires Python 3.10 and CUDA 11.8.
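As a quick sanity check after creating the environment, the version requirements above could be verified from Python. This is a minimal sketch and not part of the GROOT codebase: the helper names `python_ok`, `cuda_ok`, and `report` are hypothetical, and it assumes the CUDA build version is read from PyTorch via `torch.version.cuda`.

```python
import sys

REQUIRED_PYTHON = (3, 10)  # Python version stated in the Dependencies section
REQUIRED_CUDA = "11.8"     # CUDA version stated in the Dependencies section


def python_ok(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.10.x."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON


def cuda_ok(cuda_version):
    """Return True if a CUDA version string such as '11.8' matches the requirement."""
    return cuda_version == REQUIRED_CUDA


def report():
    """Print a short summary; PyTorch is imported lazily so this also runs without it."""
    found = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {found}: {'ok' if python_ok() else 'expected 3.10'}")
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; cannot check the CUDA build version.")
        return
    cuda = torch.version.cuda  # None for CPU-only builds
    print(f"CUDA build {cuda}: {'ok' if cuda_ok(cuda) else 'expected 11.8'}")
```

Calling `report()` inside the activated `groot` environment prints one line per check.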

Installation

Follow these steps to install GROOT:

conda env create -f environment.yml
conda activate groot
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu118.html
pip install -e .

Set up wandb to train the VAE model with logging. To disable logging instead, prepend WANDB_DISABLED=true to the python command.
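When the training script is launched from another Python program rather than from a shell, the same effect can be obtained by setting the variable only in the child process environment. The following is a minimal sketch under that assumption; `run_without_wandb` is a hypothetical helper, not something provided by GROOT.

```python
import os
import subprocess
import sys


def run_without_wandb(argv, **kwargs):
    """Run `argv` with WANDB_DISABLED=true set only for the child process."""
    # Copy the current environment so the parent process is left unchanged.
    env = dict(os.environ, WANDB_DISABLED="true")
    return subprocess.run(argv, env=env, check=True, **kwargs)


# Demonstration with a tiny child process that echoes the variable back.
result = run_without_wandb(
    [sys.executable, "-c", "import os; print(os.environ['WANDB_DISABLED'])"],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())
```

In practice, `argv` would be the `scripts/train_vae.py` command with its arguments.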

Setup benchmarks

The oracle checkpoints provided by GGS are already included in the repository, under the ckpts directory.

The benchmark datasets are provided in the data directory. Alternatively, you can generate your own sub-dataset by splitting the ground-truth data with the split_data.py script.

Download pretrained weights

We provide pretrained VAE models for the AAV and GFP datasets here.

Usage

Training

To train the VAE model for a benchmark dataset, run the train_vae.py script as follows:

python scripts/train_vae.py [CONFIG_FILE] --csv_file [CSV_FILE] --devices [DEVICES] --output_dir [OUTPUT_DIR] --dataset [DATASET]

| Parameter | Type | Description | Options | Required | Default |
|---|---|---|---|---|---|
| `config_file` | str | Path to config module | | ✔️ | |
| `output_dir` | str | Path to output directory | | ✔️ | |
| `csv_file` | str | Path to training data | | ✔️ | |
| `dataset` | str | Training dataset | AAV, GFP | ✔️ | |
| `expected_kl` | float | Expected KL-Divergence value | | | 40 |
| `batch_size` | int | Batch size | | | 64 |
| `devices` | str | Training devices separated by comma | | | -1 |
| `ckpt_path` | str | Path to checkpoint to resume training | | | None |
| `wandb_id` | str | WandB experimental id to resume | | | None |
| `prefix` | str | Prefix to add to checkpoint file | | | "" |

You can use the configuration templates in the vae directory. Checkpoints are saved in the [OUTPUT_DIR]/vae_ckpts/ folder.
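To make the argument layout concrete, the training command can be assembled programmatically. This is only a sketch: `build_train_command` is a hypothetical helper, the file paths are placeholders, and it assumes the optional parameters in the table are passed as `--name value` flags like the required ones.

```python
import shlex

DATASETS = ("AAV", "GFP")  # options listed for `dataset`


def build_train_command(config_file, csv_file, output_dir, dataset,
                        devices="-1", **optional):
    """Return the argv list for scripts/train_vae.py."""
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {DATASETS}, got {dataset!r}")
    argv = [
        "python", "scripts/train_vae.py", config_file,  # config is positional
        "--csv_file", csv_file,
        "--devices", devices,
        "--output_dir", output_dir,
        "--dataset", dataset,
    ]
    # Assumption: optional parameters (batch_size, expected_kl, ...) are flags.
    for name, value in optional.items():
        argv += [f"--{name}", str(value)]
    return argv


# Placeholder paths; replace them with a real config and CSV file.
cmd = build_train_command(
    "path/to/config.py", "path/to/train.csv", "outputs", "AAV",
    devices="0", batch_size=64, expected_kl=40,
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run`.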

Optimization

To perform optimization, run the optimize.py script as follows:

python scripts/optimize.py [CONFIG_FILE] --devices [DEVICES] --dataset [DATASET] --model_ckpt_path [VAE_CKPT_PATH] --optim_config_path [OPTIM_CONFIG_PATH] --level [LEVEL] --output_dir [OUTPUT_DIR]

| Parameter | Type | Description | Options | Required | Default |
|---|---|---|---|---|---|
| `config_file` | str | Path to config module | | ✔️ | |
| `model_ckpt_path` | str | Path to VAE checkpoint | | ✔️ | |
| `dataset` | str | Training dataset | AAV, GFP | ✔️ | |
| `level` | str | Benchmark difficulty | easy, medium, hard, harder1, harder2, harder3 | ✔️ | |
| `optim_config_path` | str | Path to optimization configuration file | | ✔️ | |
| `output_dir` | str | Path to output directory | | ✔️ | |
| `batch_size` | int | Batch size | | | 128 |
| `devices` | str | Training devices separated by comma | | | -1 |
| `changes` | list[str] | List of overrides that replace arguments in the file given by `optim_config_path` | | | [] |

For more details about these arguments, refer to the optimize.sh file.
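Since `level` takes one of six difficulty settings, a sweep over all of them can be generated in a few lines. The sketch below is illustrative only: `build_optimize_commands` is a hypothetical helper, the paths are placeholders, and writing each level to its own `[OUTPUT_ROOT]/[DATASET]_[LEVEL]` folder is an assumption rather than a convention of the repository.

```python
import shlex

LEVELS = ("easy", "medium", "hard", "harder1", "harder2", "harder3")


def build_optimize_commands(config_file, model_ckpt_path, optim_config_path,
                            dataset, output_root, levels=LEVELS, devices="-1"):
    """Return one scripts/optimize.py argv list per requested difficulty level."""
    unknown = sorted(set(levels) - set(LEVELS))
    if unknown:
        raise ValueError(f"unknown level(s): {unknown}")
    commands = []
    for level in levels:
        commands.append([
            "python", "scripts/optimize.py", config_file,
            "--devices", devices,
            "--dataset", dataset,
            "--model_ckpt_path", model_ckpt_path,
            "--optim_config_path", optim_config_path,
            "--level", level,
            # Assumption: a separate output folder per level.
            "--output_dir", f"{output_root}/{dataset}_{level}",
        ])
    return commands


# Placeholder paths; replace them with real files before running.
commands = build_optimize_commands(
    "path/to/config.py", "path/to/vae.ckpt", "path/to/optim_config.yaml",
    dataset="GFP", output_root="outputs",
)
for argv in commands:
    print(shlex.join(argv))
```

Each printed line is a complete command for one difficulty level.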

Citation

If our paper or codebase aids your research, please consider citing us:

@misc{tran2024grooteffectivedesignbiological,
      title={GROOT: Effective Design of Biological Sequences with Limited Experimental Data}, 
      author={Thanh V. T. Tran and Nhat Khang Ngo and Viet Anh Nguyen and Truong Son Hy},
      year={2024},
      eprint={2411.11265},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2411.11265}, 
}
