The goal of this project is to use viral sequence diversity to estimate HIV infection timing for infants using a variety of linear regression methods including Bayesian hierarchical regression and least-absolute-deviation regression.
Everything R 4.1.1 based. R packages that are required can be installed via miniconda:
conda env create -f environment.yml
conda activate infection-timing
If you would like to run analyses involving previously-trained adult-specific least-absolute-deviation models and code (Puller et. al, PLoS Computational Biology 2017), we have built a separate Python-2-based conda environment which can be activated as follows:
conda env create -f py2_environment.yml
conda activate infection-timing_py2
Table of Contents:
- Data preparation
- Model training
- Calculating Bayes Factors
- Model evaluation
- Model validation
- Model prediction
- Plot results
- Supplementary analyses
All of the models presented here leverage HIV viral diversity measures (which we refer to as ADP) to estimate time since infection. The training and testing sequencing data sets used here were originally processed using this data processing script.
We have provided our processed datasets here, however, if you would like to make predictions using your own data, make sure your data is in csv format and contains the following required columns:
Sample
: string containing patient identifier, gene region identifier, and sequencing run number (i.e.1_F1_R1
would correspond to individual 1, sequencing fragment/gene region 1, and sequencing run 1)ptnum
: patient identifierpass2_APD
: viral diversity measureFragment
: gene region identifier (i.e.1
corresponds to gene region 1,2
corresponds to gene region 2, etc.)
After formatting your data to include these columns, if you would like to make predictions using our pre-trained models, proceed to the Model prediction section.
If you would like to train your own models using your own data, you will need several additional columns in your data:
month_visit
: the month (post-birth) at which the diversity was sampledvload
: the viral load measured at the specified time pointincat_hiv
: whether the individual was known to be infected in-utero or after birth
After formatting your data to include these additional columns, if you would like to train your own models, proceed to the Model training section.
Here is a summary of the main model options. You can find a description of additional model type options here and here
Model name (same as manuscript) | Infant-trained hierarchical model | Infant-trained linear models | Adult-trained linear models |
---|---|---|---|
Model type | Bayesian hierarchical regression model | Least absolute deviation linear regression model | Least absolute deviation linear regression model |
Data input | Viral diversity measures (sampled from any of the three gene regions) | Viral diversity measures (sampled from any of the three gene regions), but a unique model will be trained for each gene region in which data is sampled | Viral diversity measures (sampled from any of the three gene regions), but a unique model will be trained for each gene region in which data is sampled |
Data output | Distributions of estimated times since infection | Point estimates of time since infection | Point estimates of time since infection |
Includes individual- and gene-region-specific slope-modifying terms? | Yes | No | No |
Model count | 1 | 1 per gene region | 1 per gene region |
Script used for model training | fit_infant_model | fit_adult_style_model | NA (this model is pre-trained using adult data within Puller et. al, PLoS Comp Bio 2017) |
Location where trained models will be stored | here | here | NA |
- Download the training cohort data set using the link provided in our manuscript or use your own data (but make sure the data is formatted the same as ours)
- Edit the config and file paths files to be project and/or computer specific. See the README for more details.
- Train model using the model fitting script for your desired model (see table above; for the infant hierarchical model, use this and for the infant linear models, use this. Both of these scripts can be run locally or on a cluster. If using the fit_infant_model.sh script, it takes a single "time correction type" argument -- options are
uniform
,beta
,none
, orbeta_laplace
(see here for more details; thebeta
argument is the one we use throughout the manuscript). The other model training scripts do not have any arguments.
Model fits will be stored in the location described in the table above (depending on the model type).
If you would like to explore whether APD slopes vary in certain contexts (e.g. by individual, by gene region, etc.), you can use this script to calculate Bayes Factors comparing two Bayesian models. This script takes 2 arguments:
- the time correction type -- options are
uniform
,beta
, ornone
(e.g.beta
is the one we use throughout the manuscript) - variation type -- options are described below:
Variation type argument | Description |
---|---|
no_fragment |
Compare a model in which APD slopes vary by gene region to a model in which they don't |
no_subject |
Compare a model in which APD slopes vary by individual to a model in which they don't |
by_infection_time |
Compare a model in which APD slopes vary by mode of infection (i.e. relative timing of infection; either before birth or after birth) to a model in which they don't |
with_vload |
Compare a model in which APD slopes vary with viral load to a model in which they don't |
with_max_vload |
Compare a model in which APD slopes vary with maximum viral load to a model in which they don't |
with_cd4_percent |
Compare a model in which APD slopes vary with percentage of CD4 cells to a model in which they don't |
with_cd4_rate |
Compare a model in which APD slopes vary with rate of CD4 cell count decline to a model in which they don't |
This analysis will save a file containing the results in a directory called bayes_factor_results
located in the indicated OUTPUT_PATH
as specified in the config files
If you would like to evaluate the performance of a model using various subsets of the training data set (i.e. leave one out cross validation), you can use the hierarchical model evaluation script or the linear model evaluation script.
Both of these scripts can be run locally or on a cluster. If using the hierarchical model evaluation script, it takes a single "time correction type" argument -- options are uniform
, beta
, none
, or beta_laplace
(see here for more details; the beta
argument is the one we use throughout the manuscript). The other model evaluation scripts do not have any arguments.
All output files will be located in a directory called model_loocv
within the OUTPUT_PATH
as specified in the config file
If you would like to use the model to make predictions on a new data set and/or validate the model on a testing data set, you can follow these steps:
- Download the processed testing data set, and make sure it contains the same column names as the training data set
- Edit config file to be project and/or computer specific (be sure to add the path to the file from the last step)
- Run the infant-trained hierarchical model prediction script, infant-trained linear model prediction script, or the adult-trained linear model prediction script. Each of these scripts require different arguments:
The infant-trained hierarchical model prediction script requires two arguments:
- the time correction type--options are
none
,beta
, oruniform
- the path to the dataset you want to make predictions for
The infant-trained linear model prediction script requires one argument:
- the path to the dataset you want to make predictions for
The adult-trained linear model prediction script requires two arguments:
- the path to the dataset you want to make predictions for
- whether the true time since infection is known--options are
TRUE
orFALSE
All output files will be located in a directory called model_predictions
within the OUTPUT_PATH
as specified in the config file
Plot figures from the manuscript.
Also, have a look at the plotting README for more details.
If you would like to compute the coverage interval for the predictions from the infant-trained Bayesian hierarchical regression model, you can run the coverage script
If you would like to conduct a series of posterior predictive checks for the infant-trained Bayesian hierarchical regression model, you can run the posterior predictive checks script
With this analysis, we want to quantify rates of viral diversification during infant HIV infection and assess the accuracy of using these diversification rates to estimate time since infection.
See the manuscript for specific model and methods details:
Russell, M. L., Fish, C. S., Drescher, S., Cassidy, N. A. J., Chanana, P., Benki-Nugent, S., Slyker, J., Mbori-Ngacha, D., Bosire, R., Richardson, B., Wamalwa, D., Maleche-Obimbo, E., Overbaugh, J., John-Stewart, G., Matsen IV, F. A., Lehman, D. A. (In preparation). Using viral sequence diversity to estimate time of HIV infection in infants.
The following packages were especially helpful in our analyses:
- rstan (Stan Development Team, 2020)
- rstanarm (Goodrich, et. al, 2023)
- data.table (Dowle and Srinivasan, 2021)
- tidyverse (Wickham et. al, 2019)
- doParallel (Corporation and Steve Weston, 2020)
- cowplot (Wilke, 2020)