This is the official implementation of the paper "GROOT: Effective Design of Biological Sequences with Limited Experimental Data".
The `environment.yml` file contains the dependencies needed to run GROOT. The main pipeline requires Python 3.10 and CUDA 11.8.
Follow these steps to install GROOT:
```bash
conda env create -f environment.yml
conda activate groot
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu118.html
pip install -e .
```
Please set up wandb to train the VAE model with logging utilities. Alternatively, you can disable logging by prepending `WANDB_DISABLED=true` to the training command.
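As a sketch, the environment-variable prefix works like this; the training invocation in the comment is illustrative, with placeholder arguments:

```shell
# Self-contained demo: a variable prefixed to a command is visible to that process.
WANDB_DISABLED=true python3 -c 'import os; print(os.environ["WANDB_DISABLED"])'

# Applied to training (placeholders, not actual repo paths):
# WANDB_DISABLED=true python scripts/train_vae.py [CONFIG_FILE] --csv_file [CSV_FILE] ...
```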
The oracle checkpoints provided by GGS are already included in the repository, under the `ckpts` directory.
The benchmark datasets are provided in the `data` directory. Alternatively, you can generate your own sub-dataset by splitting the ground truth with the script `split_data.py`.
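The exact interface of `split_data.py` is not shown here; as an illustration only, a percentile-based split of a ground-truth fitness dataset (the usual way harder benchmark levels are constructed in GGS-style setups) might look like this, with all names and logic being assumptions:

```python
import random

def split_by_fitness(rows, cutoff_percentile):
    """Keep only sequences whose fitness is at or below the given percentile.

    `rows` is a list of (sequence, fitness) pairs. This is a hypothetical
    sketch; the actual logic in split_data.py may differ.
    """
    fitnesses = sorted(f for _, f in rows)
    idx = int(len(fitnesses) * cutoff_percentile / 100)
    cutoff = fitnesses[min(idx, len(fitnesses) - 1)]
    return [(s, f) for s, f in rows if f <= cutoff]

# Tiny synthetic example: keep the lowest-fitness ~20% of sequences.
random.seed(0)
data = [(f"seq{i}", random.random()) for i in range(100)]
sub = split_by_fitness(data, 20)
print(len(sub))
```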
We provide pretrained VAE models for the AAV and GFP datasets here.

To train a VAE model for a benchmark dataset, run the script `train_vae.py` as follows:
```bash
python scripts/train_vae.py [CONFIG_FILE] --csv_file [CSV_FILE] --devices [DEVICES] --output_dir [OUTPUT_DIR] --dataset [DATASET]
```
| Parameter | Type | Description | Options | Required | Default |
|---|---|---|---|---|---|
| `config_file` | str | Path to config module | | ✔️ | |
| `output_dir` | str | Path to output directory | | ✔️ | |
| `csv_file` | str | Path to training data | | ✔️ | |
| `dataset` | str | Training dataset | AAV, GFP | ✔️ | |
| `expected_kl` | float | Expected KL-divergence value | | | 40 |
| `batch_size` | int | Batch size | | | 64 |
| `devices` | str | Training devices, separated by commas | | | -1 |
| `ckpt_path` | str | Path to checkpoint to resume training | | | None |
| `wandb_id` | str | WandB experiment ID to resume | | | None |
| `prefix` | str | Prefix to add to the checkpoint file | | | "" |
You can use the configuration templates in the `vae` directory. Checkpoints will be saved in the `[OUTPUT_DIR]/vae_ckpts/` folder.
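For example, a training run on the AAV benchmark might look like the command below; the config and CSV paths are placeholders chosen for illustration, so adjust them to your local layout:

```shell
# Hypothetical paths -- substitute your own config module and training CSV.
python scripts/train_vae.py configs/aav.py \
    --csv_file data/aav/ground_truth.csv \
    --dataset AAV \
    --devices 0 \
    --output_dir outputs/aav \
    --batch_size 64
```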
To perform optimization, run the script `optimize.py` as follows:
```bash
python scripts/optimize.py [CONFIG_FILE] --devices [DEVICES] --dataset [DATASET] --model_ckpt_path [VAE_CKPT_PATH] --optim_config_path [OPTIM_CONFIG_PATH] --level [LEVEL] --output_dir [OUTPUT_DIR]
```
| Parameter | Type | Description | Options | Required | Default |
|---|---|---|---|---|---|
| `config_file` | str | Path to config module | | ✔️ | |
| `model_ckpt_path` | str | Path to VAE checkpoint | | ✔️ | |
| `dataset` | str | Training dataset | AAV, GFP | ✔️ | |
| `level` | str | Benchmark difficulty | easy, medium, hard, harder1, harder2, harder3 | ✔️ | |
| `optim_config_path` | str | Path to optimization configuration file | | ✔️ | |
| `output_dir` | str | Path to output directory | | ✔️ | |
| `batch_size` | int | Batch size | | | 128 |
| `devices` | str | Training devices, separated by commas | | | -1 |
| `changes` | list[str] | Modifications that replace arguments in `optim_config_file` | | | [] |
For more details about these arguments, refer to the `optimize.sh` file.
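As a concrete sketch, an optimization run on the AAV medium benchmark might look like this; every path below is a placeholder, so substitute your own config module, VAE checkpoint, and optimization config:

```shell
# Hypothetical paths -- substitute your own files.
python scripts/optimize.py configs/aav.py \
    --model_ckpt_path outputs/aav/vae_ckpts/last.ckpt \
    --optim_config_path configs/optim/aav.yaml \
    --dataset AAV \
    --level medium \
    --devices 0 \
    --output_dir outputs/aav_optim
```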
If our paper or codebase aids your research, please consider citing us:
```bibtex
@misc{tran2024grooteffectivedesignbiological,
  title={GROOT: Effective Design of Biological Sequences with Limited Experimental Data},
  author={Thanh V. T. Tran and Nhat Khang Ngo and Viet Anh Nguyen and Truong Son Hy},
  year={2024},
  eprint={2411.11265},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2411.11265},
}
```