infection-timing

The goal of this project is to use viral sequence diversity to estimate HIV infection timing for infants using a variety of linear regression methods including Bayesian hierarchical regression and least-absolute-deviation regression.

Install

Everything R 4.1.1 based. R packages that are required can be installed via miniconda:

conda env create -f environment.yml
conda activate infection-timing

If you would like to run analyses involving previously-trained adult-specific least-absolute-deviation models and code (Puller et. al, PLoS Computational Biology 2017), we have built a separate Python-2-based conda environment which can be activated as follows:

conda env create -f py2_environment.yml
conda activate infection-timing_py2

Analysis outline:

Table of Contents:

Data preparation
Model training
- Summary of model types
Calculating Bayes Factors
Model evaluation
- Summary of model evaluation options
Model validation
- Summary of model validation options
Model prediction
Plot results
Supplementary analyses
- Model coverage estimates
- Model posterior predictive checks

Data preparation

All of the models presented here leverage HIV viral diversity measures (which we refer to as ADP) to estimate time since infection. The training and testing sequencing data sets used here were originally processed using this data processing script.

Preparation for making model predictions using your own data with our pre-trained models

We have provided our processed datasets here, however, if you would like to make predictions using your own data, make sure your data is in csv format and contains the following required columns:

Sample: string containing patient identifier, gene region identifier, and sequencing run number (i.e. 1_F1_R1 would correspond to individual 1, sequencing fragment/gene region 1, and sequencing run 1)
ptnum: patient identifier
pass2_APD: viral diversity measure
Fragment: gene region identifier (i.e. 1 corresponds to gene region 1, 2 corresponds to gene region 2, etc.)

After formatting your data to include these columns, if you would like to make predictions using our pre-trained models, proceed to the Model prediction section.

Preparation for training models using your own data

If you would like to train your own models using your own data, you will need several additional columns in your data:

month_visit: the month (post-birth) at which the diversity was sampled
vload: the viral load measured at the specified time point
incat_hiv: whether the individual was known to be infected in-utero or after birth

After formatting your data to include these additional columns, if you would like to train your own models, proceed to the Model training section.

Model training

Model types

Here is a summary of the main model options. You can find a description of additional model type options here and here

Model name (same as manuscript)	Infant-trained hierarchical model	Infant-trained linear models	Adult-trained linear models
Model type	Bayesian hierarchical regression model	Least absolute deviation linear regression model	Least absolute deviation linear regression model
Data input	Viral diversity measures (sampled from any of the three gene regions)	Viral diversity measures (sampled from any of the three gene regions), but a unique model will be trained for each gene region in which data is sampled	Viral diversity measures (sampled from any of the three gene regions), but a unique model will be trained for each gene region in which data is sampled
Data output	Distributions of estimated times since infection	Point estimates of time since infection	Point estimates of time since infection
Includes individual- and gene-region-specific slope-modifying terms?	Yes	No	No
Model count	1	1 per gene region	1 per gene region
Script used for model training	fit_infant_model	fit_adult_style_model	NA (this model is pre-trained using adult data within Puller et. al, PLoS Comp Bio 2017)
Location where trained models will be stored	here	here	NA

Workflow

Download the training cohort data set using the link provided in our manuscript or use your own data (but make sure the data is formatted the same as ours)
Edit the config and file paths files to be project and/or computer specific. See the README for more details.
Train model using the model fitting script for your desired model (see table above; for the infant hierarchical model, use this and for the infant linear models, use this. Both of these scripts can be run locally or on a cluster. If using the fit_infant_model.sh script, it takes a single "time correction type" argument -- options are uniform, beta, none, or beta_laplace (see here for more details; the beta argument is the one we use throughout the manuscript). The other model training scripts do not have any arguments.

Model fits will be stored in the location described in the table above (depending on the model type).

Calculating Bayes Factors

If you would like to explore whether APD slopes vary in certain contexts (e.g. by individual, by gene region, etc.), you can use this script to calculate Bayes Factors comparing two Bayesian models. This script takes 2 arguments:

the time correction type -- options are uniform, beta, or none (e.g. beta is the one we use throughout the manuscript)
variation type -- options are described below:

Variation type argument	Description
`no_fragment`	Compare a model in which APD slopes vary by gene region to a model in which they don't
`no_subject`	Compare a model in which APD slopes vary by individual to a model in which they don't
`by_infection_time`	Compare a model in which APD slopes vary by mode of infection (i.e. relative timing of infection; either before birth or after birth) to a model in which they don't
`with_vload`	Compare a model in which APD slopes vary with viral load to a model in which they don't
`with_max_vload`	Compare a model in which APD slopes vary with maximum viral load to a model in which they don't
`with_cd4_percent`	Compare a model in which APD slopes vary with percentage of CD4 cells to a model in which they don't
`with_cd4_rate`	Compare a model in which APD slopes vary with rate of CD4 cell count decline to a model in which they don't

This analysis will save a file containing the results in a directory called bayes_factor_results located in the indicated OUTPUT_PATH as specified in the config files

Model evaluation

If you would like to evaluate the performance of a model using various subsets of the training data set (i.e. leave one out cross validation), you can use the hierarchical model evaluation script or the linear model evaluation script.

Both of these scripts can be run locally or on a cluster. If using the hierarchical model evaluation script, it takes a single "time correction type" argument -- options are uniform, beta, none, or beta_laplace (see here for more details; the beta argument is the one we use throughout the manuscript). The other model evaluation scripts do not have any arguments.

All output files will be located in a directory called model_loocv within the OUTPUT_PATH as specified in the config file

Model prediction

If you would like to use the model to make predictions on a new data set and/or validate the model on a testing data set, you can follow these steps:

Download the processed testing data set, and make sure it contains the same column names as the training data set
Edit config file to be project and/or computer specific (be sure to add the path to the file from the last step)
Run the infant-trained hierarchical model prediction script, infant-trained linear model prediction script, or the adult-trained linear model prediction script. Each of these scripts require different arguments:

The infant-trained hierarchical model prediction script requires two arguments:

the time correction type--options are none, beta, or uniform
the path to the dataset you want to make predictions for

The infant-trained linear model prediction script requires one argument:

the path to the dataset you want to make predictions for

The adult-trained linear model prediction script requires two arguments:

the path to the dataset you want to make predictions for
whether the true time since infection is known--options are TRUE or FALSE

All output files will be located in a directory called model_predictions within the OUTPUT_PATH as specified in the config file

Plot results

Plot figures from the manuscript.

Also, have a look at the plotting README for more details.

Supplementary analyses

Model coverage estimates

If you would like to compute the coverage interval for the predictions from the infant-trained Bayesian hierarchical regression model, you can run the coverage script

Model posterior predictive checks

If you would like to conduct a series of posterior predictive checks for the infant-trained Bayesian hierarchical regression model, you can run the posterior predictive checks script

About the analysis

With this analysis, we want to quantify rates of viral diversification during infant HIV infection and assess the accuracy of using these diversification rates to estimate time since infection.

See the manuscript for specific model and methods details:

Russell, M. L., Fish, C. S., Drescher, S., Cassidy, N. A. J., Chanana, P., Benki-Nugent, S., Slyker, J., Mbori-Ngacha, D., Bosire, R., Richardson, B., Wamalwa, D., Maleche-Obimbo, E., Overbaugh, J., John-Stewart, G., Matsen IV, F. A., Lehman, D. A. (In preparation). Using viral sequence diversity to estimate time of HIV infection in infants.

The following packages were especially helpful in our analyses:

rstan (Stan Development Team, 2020)
rstanarm (Goodrich, et. al, 2023)
data.table (Dowle and Srinivasan, 2021)
tidyverse (Wickham et. al, 2019)
doParallel (Corporation and Steve Weston, 2020)
cowplot (Wilke, 2020)

Name		Name	Last commit message	Last commit date
Latest commit History 107 Commits
adult_model_scripts		adult_model_scripts
analysis_scripts		analysis_scripts
config		config
figs		figs
plotting_scripts		plotting_scripts
scripts		scripts
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
environment.yml		environment.yml
fit_adult_style_model.sh		fit_adult_style_model.sh
fit_infant_model.sh		fit_infant_model.sh
predict_with_adult_model.sh		predict_with_adult_model.sh
predict_with_adult_style_model.sh		predict_with_adult_style_model.sh
predict_with_adult_style_model_loocv.sh		predict_with_adult_style_model_loocv.sh
predict_with_infant_model.sh		predict_with_infant_model.sh
predict_with_infant_model_loocv.sh		predict_with_infant_model_loocv.sh
predict_with_infant_model_map_loocv.sh		predict_with_infant_model_map_loocv.sh
preprocess_sequence_data.R		preprocess_sequence_data.R
py2_environment.yml		py2_environment.yml
time.pdf		time.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

infection-timing

Install

Analysis outline:

Data preparation

Preparation for making model predictions using your own data with our pre-trained models

Preparation for training models using your own data

Model training

Model types

Workflow

Calculating Bayes Factors

Model evaluation

Model prediction

Plot results

Supplementary analyses

Model coverage estimates

Model posterior predictive checks

About the analysis

About

Releases

Packages

Contributors 2

Languages

matsengrp/infection-timing

Folders and files

Latest commit

History

Repository files navigation

infection-timing

Install

Analysis outline:

Data preparation

Preparation for making model predictions using your own data with our pre-trained models

Preparation for training models using your own data

Model training

Model types

Workflow

Calculating Bayes Factors

Model evaluation

Model prediction

Plot results

Supplementary analyses

Model coverage estimates

Model posterior predictive checks

About the analysis

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages