Prerequisites
Ensure that you have the following prerequisites installed on your system:
Python (version 3.8 or higher)
R (version 4 or higher)
NCBI AMRfinderPlus version 4.0.3
NCBI BLAST+ (Included in AMRfinderPlus installation)
We suggest installing ReGAIN and all prerequisites within a Conda environment
Download miniforge
Create Conda environment and install NCBI AMRfinderPlus
conda create -n regain python=3.10
conda activate regain
Install AMRfinderPlus
conda install -y -c conda-forge -c bioconda ncbi-amrfinderplus
Check installation
amrfinder -h
Download AMRfinderPlus Database
amrfinder -u
Download ReGAIN to preferred directory
git clone https://github.com/ERBringHorvath/regain_CLI
Install Python dependencies
pip install -r requirements.txt
or pip3 install -r requirements.txt
Add ReGAIN to your PATH
Add this line to the end of .bash_profile (Linux/Unix) or .zshrc (macOS):
export PATH="$PATH:/path/to/regain_CLI/bin"
Replace /path/to/regain_CLI/bin with the actual path to the directory containing the executable.
Whatever the initial directory, this path should end with /regain_CLI/bin
Save the file and restart your terminal, or run source ~/.bash_profile or source ~/.zshrc
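As a minimal sketch, the following checks whether a directory is already an entry in PATH for the current shell. The helper name path_contains is hypothetical and is not part of ReGAIN; the path in the usage comment is the same placeholder used above.

```shell
# Hypothetical helper (not part of ReGAIN): succeeds if the given
# directory appears as an exact entry in $PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (replace the placeholder with your actual install location):
# path_contains "/path/to/regain_CLI/bin" && echo "ReGAIN bin directory is on PATH"
```

Wrapping PATH in colons lets the match require a whole entry, so a directory that merely shares a prefix with another entry is not reported as present.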
Verify installation:
regain --version
Use -h or --help to bring up the help menu
regain --help
NOTE: ReGAIN utilizes shell scripts to execute some modules. You may need to modify your permissions to execute these scripts.
If you run regain --version and see permission denied: regain, navigate to regain_CLI/bin, then run both chmod +x regain and chmod +x *.sh
Rerun regain --version and you should see something similar to: regain v.1.5.0
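To see which files in the bin directory still lack execute permission, a sketch like the following may help. The helper name list_non_executable is hypothetical and is not part of ReGAIN.

```shell
# Hypothetical helper (not part of ReGAIN): print regular files directly
# inside a directory that do not have the owner execute bit set.
list_non_executable() {
    find "$1" -maxdepth 1 -type f ! -perm -u+x
}

# Usage (placeholder path):
# list_non_executable /path/to/regain_CLI/bin
```

An empty result means every file in that directory is already executable by its owner.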
Module 1.0 regain AMR
-d, --directory, path to directory containing genome FASTA files to analyze
-O, --organism, optional; organism to analyze, if any (see supported organisms below)
-T, --threads, number of cores to dedicate for parallel processing
-o, --output-dir, output directory to store AMRfinder results
Currently supported organisms and how they should be called:
Acinetobacter_baumannii
Burkholderia_cepacia
Burkholderia_pseudomallei
Campylobacter
Clostridioides_difficile
Enterobacter_cloacae
Enterococcus_faecalis
Enterococcus_faecium
Escherichia
Klebsiella_pneumoniae
Neisseria
Pseudomonas_aeruginosa
Salmonella
Staphylococcus_aureus
Staphylococcus_pseudintermedius
Streptococcus_agalactiae
Streptococcus_pneumoniae
Streptococcus_pyogenes
Vibrio_cholerae
Module 1 example usage:
Organism specific:
regain AMR -d path/to/FASTA/files -O Pseudomonas_aeruginosa -T 8 -o path/to/output/directory
Organism non-specific:
regain AMR -d path/to/FASTA/files -T 8 -o path/to/output/directory
Output files:
One results file per submitted genome
NOTE: variable names cannot contain special characters; this transformation is automated during dataset creation
Module 1.1 regain matrix
-d, --directory, path to AMRfinder results in CSV format
--gene-type, searches for resistance, virulence, or all genes
--min, minimum gene occurrence cutoff
--max, maximum gene occurrence cutoff (should be less than number of genomes, see NOTE below)
--report-all, optional; reports all genes identified, regardless of --min/--max threshold
--keep-gene-names, optional; maintains special characters in variable names. Should not be used if proceeding to Module 2
Module 1.1 example usage
NOTE: Discrete Bayesian network analyses require all variables to exist in at least two states. For ReGAIN, these two states are 'present' and 'absent'. Ubiquitously occurring genes will break the analysis.
Best practice: for N genomes, set --max to N - 1 at most, so that genes present in every genome are excluded. Keep in mind that removing very low and very high abundance genes can reduce noise in the network.
regain matrix -d path/to/AMRfinder/results --gene-type resistance --min 5 --max 475
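As a minimal sketch, the N - 1 ceiling for --max can be derived by counting the genome files that were analyzed. The helper name suggest_max is hypothetical and is not part of ReGAIN; it assumes one genome per FASTA file and that the listed extensions cover your input files.

```shell
# Hypothetical helper (not part of ReGAIN): count genome FASTA files in a
# directory and print N - 1.
# Assumptions: one genome per file; the extension list below is a guess at
# common FASTA suffixes and may need adjusting.
suggest_max() {
    n=$(find "$1" -maxdepth 1 -type f \
        \( -name '*.fa' -o -name '*.fas' -o -name '*.fasta' -o -name '*.fna' \) | wc -l)
    echo $(( n - 1 ))
}

# Usage (placeholder paths):
# regain matrix -d path/to/AMRfinder/results --gene-type resistance --min 5 \
#     --max "$(suggest_max path/to/FASTA/files)"
```

The value printed is only the upper limit; a lower --max may still be preferable for reducing noise.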
NOTE: all results are saved in the 'ReGAIN_Dataset' folder, which will be generated within the directory defined by -d/--directory
Output files:
filtered_matrix.csv: presence/absence matrix of genes
metadata.csv: file containing genes identified in AMRfinderPlus analysis
combined_AMR_results_unfiltered.csv: concatenated file of all AMRfinderPlus results; this file contains contig and nucleotide location of all identified genes
If --report-all is used:
unfiltered_matrix.csv: presence/absence matrix of all genes identified, regardless of --min/--max thresholds
Module 2 regain bnL or regain bnS
-i, --input, input file in CSV format
-M, --metadata, file containing gene names and descriptions
-o, --output_boot, output bootstrap file
-T, --threads, number of cores to dedicate for parallel processing
-n, --number_of_boostraps, how many bootstraps to run (suggested 300-500)
-r, --number-of-resamples, how many data resamples you want to use (suggested 100)
Module 2 example usage:
NOTE: We suggest using between 300 and 500 bootstraps and 100 resamples
bnS, Bayesian network structure learning analysis for fewer than 100 genes
bnL, Bayesian network structure learning analysis for 100 or more genes
For fewer than 100 genes:
regain bnS -i matrix_filtered.csv -M metadata.csv -o bootstrapped_network -T 8 -n 500 -r 100
For 100 or more genes:
regain bnL -i matrix_filtered.csv -M metadata.csv -o bootstrapped_network -T 8 -n 500 -r 100
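The choice between bnS and bnL depends on how many genes the data matrix contains. As a minimal sketch, the count can be read from the header row. The helper name choose_bn_module is hypothetical and is not part of ReGAIN; it assumes a plain comma-separated header without quoted commas and that column 1 holds the genome identifier rather than a gene.

```shell
# Hypothetical helper (not part of ReGAIN): print "bnS" or "bnL" depending
# on the number of gene columns in a data matrix.
# Assumptions: comma-separated header, no quoted commas, and column 1 is
# the genome identifier, so it is not counted as a gene.
choose_bn_module() {
    genes=$(head -n 1 "$1" | awk -F',' '{ print NF - 1 }')
    if [ "$genes" -lt 100 ]; then
        echo "bnS"
    else
        echo "bnL"
    fi
}

# Usage (file names as in the examples above):
# regain "$(choose_bn_module matrix_filtered.csv)" -i matrix_filtered.csv \
#     -M metadata.csv -o bootstrapped_network -T 8 -n 500 -r 100
```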
Output files:
Results.csv, results file of all conditional probability and relative risk values
post_hoc_analysis.csv, results file of all bidirectional probability and fold change scores
Bayesian_Network.html, interactive Bayesian network
ReGAIN Curate is designed to allow users to generate a dataset for Bayesian network structure learning using
a custom set of gene queries, independent of ReGAIN Module 1
regain curate
-d, --directory, path to genome FASTA files
-q, --query, path to query files containing amino acid sequences in FASTA format
-T, --threads, number of cores to dedicate for parallel processing
--min, minimum gene occurrence threshold
--max, maximum gene occurrence threshold (should be less than number of genomes, see NOTE below)
--nucleotide-query, optional; use this to query nucleotide FASTA files
--report-all, optional; use this to return all BLAST hits, regardless of internal identity thresholds
--perc, optional; set a custom minimum percent identity threshold. Default = 90%
--cov, optional; set a custom minimum query coverage threshold. Default = 75%
--min-seq-length, optional; designate minimum allowed query sequence length. Use with caution
--keep-gene-names, optional; maintains special characters in variable names. Should not be used if proceeding to Module 2
ReGAIN Curate example usage:
regain curate -d /path/to/genome/files -q /path/to/query/files -T 8 --min 5 --max 475
ReGAIN Curate output files:
filtered_results.csv, all BLAST results meeting identity thresholds
curate_matrix.csv, filtered data matrix
curate_metadata.csv, metadata file for use in ReGAIN statistical modules
If --report-all is used:
all_results.csv, all BLAST results, regardless of identity thresholds
NOTE: Discrete Bayesian network analyses require all variables to exist in at least two states. For ReGAIN, these two states are 'present' and 'absent'. Ubiquitously occurring genes will break the analysis.
Best practice: for N genomes, set --max to N - 1 at most, so that genes present in every genome are excluded. Keep in mind that removing very low and very high abundance genes can reduce noise in the network.
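Before moving on to the Bayesian network modules, it can be worth confirming that no gene column holds the same value in every genome. The following is a minimal sketch; the helper name constant_genes is hypothetical and is not part of ReGAIN. It assumes a comma-separated presence/absence matrix with a header row, genome identifiers in column 1, and no quoted commas.

```shell
# Hypothetical helper (not part of ReGAIN): print the names of gene columns
# whose value is identical in every data row (always present or always absent).
# Assumptions: comma-separated file, header in row 1, genome identifier in
# column 1, no quoted commas.
constant_genes() {
    awk -F',' '
        NR == 1 {
            cols = NF
            for (i = 2; i <= cols; i++) name[i] = $i
            next
        }
        NR == 2 {
            for (i = 2; i <= cols; i++) { first[i] = $i; varies[i] = 0 }
            next
        }
        {
            for (i = 2; i <= cols; i++) if ($i != first[i]) varies[i] = 1
        }
        END {
            if (NR >= 2) for (i = 2; i <= cols; i++) if (!varies[i]) print name[i]
        }
    ' "$1"
}

# Usage:
# constant_genes curate_matrix.csv
```

Any gene printed here exists in only one state across the dataset and would need to be excluded, for example by tightening --min or --max.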
ReGAIN Extract is an optional module for use with ReGAIN Curate. This module extracts aligned sequences
identified by regain curate and is offered as an additional quality control step for gene identification.
Nucleotide sequences are extracted to a multi-FASTA file
regain extract
-c, --csv-path, path to ReGAIN Curate results file, such as filtered_results.csv
-f, --fasta-directory, path to genome FASTA files used in ReGAIN curate
-T, --threads, number of cores to dedicate for parallel processing
-o, --output-fasta, multi-FASTA file output (.fa, .fas, .fasta, .fna, .faa)
--min-evalue, optional; for use when --report-all flag is used. Sets minimum evalue threshold for sequence extraction
--min-perc, optional; same guidelines as --min-evalue
--min-cov, optional; same guidelines as --min-evalue and --min-perc
--translate, optional; translates extracted nucleotide sequences (see NOTE below)
ReGAIN Extract example usage:
regain extract -c /path/to/results/csv -f /path/to/genome/FASTA/files -T 8 -o sequences.fa
NOTE: the --translate flag should be used with care. In the event an alignment returns an incomplete CDS,
ReGAIN Extract will trim the sequence to the closest length divisible by 3 for codon prediction, which can result
in frameshifts. --translate is only suggested for use if returned alignments represent full coding sequences, or
if manual validation of gene calls is performed
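To illustrate the trimming arithmetic only, here is a minimal sketch that shortens a sequence to a length divisible by 3. The helper name trim_to_codons is hypothetical and this is not ReGAIN's implementation; in particular, dropping bases from the end of the sequence is an assumption, since the note above does not say which end is trimmed.

```shell
# Hypothetical helper (not ReGAIN's implementation): print a nucleotide
# sequence shortened to the largest length divisible by 3.
# Assumption: surplus bases are dropped from the end of the sequence.
trim_to_codons() {
    awk -v s="$1" 'BEGIN { n = length(s); print substr(s, 1, n - n % 3) }'
}

# Usage:
# trim_to_codons ATGGCCA    # 7 bases in, 6 bases out
```

Trimming fixes only the length. If the alignment starts partway through a codon, the remaining bases are still read in the wrong frame, which is why full coding sequences or manual validation are recommended.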
ReGAIN Combine is an optional module for use in combination with the ReGAIN Curate and ReGAIN AMR modules.
In the event users want to supplement the regain AMR results with a custom set of genes queried through
regain curate, regain combine will merge both datasets into a single dataset for use in ReGAIN statistical modules
regain combine
--matrix1, path to ReGAIN AMR data matrix, filtered_matrix.csv
--matrix2, path to ReGAIN Curate data matrix, curate_matrix.csv
--metadata1, path to ReGAIN AMR metadata file, metadata.csv
--metadata2, path to ReGAIN Curate metadata file, curate_metadata.csv
--delete-duplicates, optional; automatically delete duplicate values from dataset
ReGAIN Combine example usage:
regain combine --matrix1 /path/to/AMR/matrix/csv --matrix2 /path/to/curate/matrix/csv \
--metadata1 /path/to/AMR/metadata/csv --metadata2 /path/to/curate/metadata/csv
ReGAIN Combine output files:
combined_matrix.csv, combined presence/absence matrix
combined_metadata.csv, combined metadata file
NOTE: in order for regain combine to function properly, do not modify values in column 1 (file) of the data matrix files
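As a quick sanity check before combining, the column 1 values of the two matrices can be compared. This is a minimal sketch; the helper name unmatched_ids is hypothetical and is not part of ReGAIN. It assumes comma-separated files with a header row and no quoted commas, and it assumes that both matrices are expected to list the same genomes, which this README does not state explicitly.

```shell
# Hypothetical helper (not part of ReGAIN): print column 1 values that
# appear in only one of two data matrices.
# Assumptions: comma-separated files, header in row 1, no quoted commas.
unmatched_ids() {
    a=$(mktemp)
    b=$(mktemp)
    tail -n +2 "$1" | cut -d',' -f1 | sort -u > "$a"
    tail -n +2 "$2" | cut -d',' -f1 | sort -u > "$b"
    # comm -3 hides shared lines; tr removes the tab comm uses to mark
    # lines unique to the second file.
    comm -3 "$a" "$b" | tr -d '\t'
    rm -f "$a" "$b"
}

# Usage:
# unmatched_ids filtered_matrix.csv curate_matrix.csv
```

No output means the two files list the same identifiers in column 1.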
Stand Alone Network Visualization
regain network
-i, --input, input RDS file generated from bnS/bnL analysis
-d, --data, input filtered data matrix file
-M, --metadata, input metadata file
-s, --statistics_results, input 'Results.csv' file from bnS/bnL analysis
Example usage:
regain network -i network.rds -d matrix_filtered.csv -M metadata.csv -s Results.csv
This analysis is an integrated part of the standard bnS/bnL pipeline, but serves as a redundant measure in the event network visualization needs to be re-performed
Output:
Bayesian_Network.html, interactive Bayesian network
Multidimensional Analyses
Optional Module 3 regain MVA
Currently supported measures of distance:
manhattan, euclidean, canberra, clark, bray, kulczynski, jaccard, gower, horn, mountford, raup, binomial, chao, cao, mahalanobis, altGower, morisita, chisq, chord, hellinger
-i, --input, input file in CSV format
-m, --method, measure of distance method
-c, --centers, how many centers you want for your multidimensional analysis (1-10)
-C, --confidence, confidence interval for ellipses
Module 3 example usage:
regain MVA -i matrix.csv -m jaccard -c 3 -C 0.75
NOTE: the MVA analysis will generate 2 files: a PNG and a PDF of the plot
Bayesian network analysis requires both data matrix and metadata files. MVA analysis requires only a data matrix file
The metadata file MUST have two column headers, ideally 'Gene' and 'GeneClass'. The second column may contain empty rows
The data matrix MUST have headers for all columns
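The two-header requirement for the metadata file can be checked with a short sketch like the one below. The helper name metadata_header_ok is hypothetical and is not part of ReGAIN; it assumes a comma-separated file without quoted commas and only inspects the first line.

```shell
# Hypothetical helper (not part of ReGAIN): succeed only if the first line
# of a metadata file has exactly two non-empty, comma-separated headers.
# Assumptions: comma-separated file, no quoted commas in the header.
metadata_header_ok() {
    head -n 1 "$1" | awk -F',' '
        { ok = (NF == 2 && $1 != "" && $2 != "") }
        END { exit !ok }
    '
}

# Usage:
# metadata_header_ok metadata.csv || echo "metadata.csv needs two column headers"
```

Only the header is validated, so empty values in the second column of later rows do not cause a failure.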
Resistance Gene Association and Inference Network (ReGAIN): A Bioinformatics Pipeline for Assessing Probabilistic
Co-Occurrence Between Resistance Genes in Bacterial Pathogens.
Bring Horvath, E; Stein, M; Mulvey, MA; Hernandez, EJ; Winter, JM.
bioRxiv 2024.02.26.582197; doi: https://doi.org/10.1101/2024.02.26.582197