ReGAIN Installation and User guide


Prerequisites

Ensure that you have the following prerequisites installed on your system:

Python (version 3.8 or higher)

R (version 4 or higher)

NCBI AMRfinderPlus version 4.0.3
NCBI BLAST+ (Included in AMRfinderPlus installation)

Install R


We suggest that ReGAIN and all prerequisites are installed within a Conda environment

Download miniforge

Create Conda environment and install NCBI AMRfinderPlus

conda create -n regain python=3.10

conda activate regain

Install AMRfinderPlus

conda install -y -c conda-forge -c bioconda ncbi-amrfinderplus

Check installation

amrfinder -h

Download AMRfinderPlus Database

amrfinder -u

Download ReGAIN to preferred directory

git clone https://github.com/ERBringHorvath/regain_CLI

Install Python dependencies

pip install -r requirements.txt or pip3 install -r requirements.txt

Add ReGAIN to your PATH

Add this line to the end of .bash_profile (Linux/Unix) or .zshrc (macOS):

export PATH="$PATH:/path/to/regain_CLI/bin"

Replace /path/to/regain_CLI/bin with the actual path to the directory containing the executable.
Whatever the initial directory, this path should end with /regain_CLI/bin

Save the file and restart your terminal or run source ~/.bash_profile or source ~/.zshrc

Verify installation:

regain --version

Use -h or --help to bring up the help menu

regain --help

NOTE: ReGAIN uses shell scripts to execute some modules, and you may need to modify your permissions
to execute these scripts. If you run regain --version and see permission denied: regain, navigate to
regain_CLI/bin, run both chmod +x regain and chmod +x *.sh, then rerun regain --version.
You should see something similar to: regain v.1.5.0
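The PATH and permission checks above can also be scripted. The following is a minimal sketch, not part of ReGAIN: the helper names missing_exec_bit and on_path are hypothetical, and the sketch assumes the executables live in the regain_CLI/bin directory described above.

```python
import os
import shutil
import stat


def missing_exec_bit(bin_dir):
    """List 'regain' and any *.sh files in bin_dir that have no execute bit set.

    These are the files the NOTE above suggests fixing with chmod +x.
    """
    flagged = []
    for name in sorted(os.listdir(bin_dir)):
        if name != "regain" and not name.endswith(".sh"):
            continue
        mode = os.stat(os.path.join(bin_dir, name)).st_mode
        if not mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
            flagged.append(name)
    return flagged


def on_path(names=("regain", "amrfinder")):
    """Map each command name to its resolved path, or None if it is not found on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in on_path().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

Calling missing_exec_bit("/path/to/regain_CLI/bin") returns the file names that still need chmod +x; an empty list means permissions are not the problem.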


Programs and Example Usage

Resistance and Virulence Gene Identification

Module 1.0 regain AMR

-d, --directory, path to directory containing genome FASTA files to analyze
-O, --organism, optional; specify which organism (if any) you want to analyze
-T, --threads, number of cores to dedicate for parallel processing
-o, --output-dir, output directory to store AMRfinder results

Currently supported organisms and how they should be called:

Acinetobacter_baumannii
Burkholderia_cepacia
Burkholderia_pseudomallei
Campylobacter
Clostridioides_difficile
Enterobacter_cloacae
Enterococcus_faecalis
Enterococcus_faecium
Escherichia
Klebsiella_pneumoniae
Neisseria
Pseudomonas_aeruginosa
Salmonella
Staphylococcus_aureus
Staphylococcus_pseudintermedius
Streptococcus_agalactiae
Streptococcus_pneumoniae
Streptococcus_pyogenes
Vibrio_cholerae

Module 1 example usage:

Organism specific:
regain AMR -d path/to/FASTA/files -O Pseudomonas_aeruginosa -T 8 -o path/to/output/directory

Organism non-specific:
regain AMR -d path/to/FASTA/files -T 8 -o path/to/output/directory

Output files:

One results file per submitted genome
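Because organism names must be spelled exactly as listed above, it can help to validate the name before launching a long run. The sketch below is an illustration only: build_amr_command is a hypothetical helper, and it only assembles the argument list using the flags documented for regain AMR.

```python
import difflib

# Organism names copied from the supported list above.
SUPPORTED_ORGANISMS = [
    "Acinetobacter_baumannii", "Burkholderia_cepacia", "Burkholderia_pseudomallei",
    "Campylobacter", "Clostridioides_difficile", "Enterobacter_cloacae",
    "Enterococcus_faecalis", "Enterococcus_faecium", "Escherichia",
    "Klebsiella_pneumoniae", "Neisseria", "Pseudomonas_aeruginosa", "Salmonella",
    "Staphylococcus_aureus", "Staphylococcus_pseudintermedius",
    "Streptococcus_agalactiae", "Streptococcus_pneumoniae",
    "Streptococcus_pyogenes", "Vibrio_cholerae",
]


def build_amr_command(fasta_dir, output_dir, threads=8, organism=None):
    """Assemble a 'regain AMR' argument list, rejecting unsupported organism names."""
    cmd = ["regain", "AMR", "-d", fasta_dir]
    if organism is not None:
        if organism not in SUPPORTED_ORGANISMS:
            hint = difflib.get_close_matches(organism, SUPPORTED_ORGANISMS, n=1)
            suggestion = f" Did you mean {hint[0]}?" if hint else ""
            raise ValueError(f"Unsupported organism: {organism}.{suggestion}")
        cmd += ["-O", organism]
    cmd += ["-T", str(threads), "-o", output_dir]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True). A misspelled name raises a ValueError that suggests the closest supported spelling.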


Dataset Creation

NOTE: variable names cannot contain special characters; special characters are converted automatically during dataset creation

Module 1.1 regain matrix

-d, --directory, path to AMRfinder results in CSV format
--gene-type, searches for resistance, virulence, or all genes
--min, minimum gene occurrence cutoff
--max, maximum gene occurrence cutoff (should be less than number of genomes, see NOTE below)
--report-all, optional; reports all genes identified, regardless of --min/--max threshold
--keep-gene-names, optional; maintains special characters in variable names. Should not be used if proceeding to Module 2

Module 1.1 example usage

NOTE: Discrete Bayesian network analyses require every variable to exist in at least two states. For ReGAIN, these two states are 'present' and 'absent'. Genes that occur in every genome will break the analysis, so for N genomes, --max should be set to N - 1 at most. Keep in mind that removing very low and very high abundance genes can reduce noise in the network.

regain matrix -d path/to/AMRfinder/results --gene-type resistance --min 5 --max 475
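The N - 1 ceiling for --max can be computed rather than counted by hand. This is a minimal sketch under stated assumptions: suggest_max is a hypothetical helper, and the set of FASTA file extensions is an assumption that should be adjusted to match your files.

```python
from pathlib import Path

# Assumed genome FASTA extensions; adjust to match your own files.
FASTA_SUFFIXES = {".fa", ".fas", ".fasta", ".fna"}


def suggest_max(fasta_dir):
    """Return N - 1, where N is the number of genome FASTA files in fasta_dir."""
    n_genomes = sum(
        1
        for path in Path(fasta_dir).iterdir()
        if path.is_file() and path.suffix.lower() in FASTA_SUFFIXES
    )
    if n_genomes < 2:
        raise ValueError("At least two genomes are needed for a gene to vary in state.")
    return n_genomes - 1
```

For a directory of 476 genomes this returns 475, which matches the --max value in the example command above.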

NOTE: all results are saved in the 'ReGAIN_Dataset' folder, which will be generated within the directory defined by -d/--directory

Output files:

filtered_matrix.csv: presence/absence matrix of genes
metadata.csv: file containing genes identified in AMRfinderPlus analysis
combined_AMR_results_unfiltered.csv: concatenated file of all AMRfinderPlus results; this file contains contig and nucleotide location of all identified genes
If --report-all is used:
unfiltered_matrix.csv: presence/absence matrix of all genes identified, regardless of --min/--max thresholds
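Before moving on to Module 2, it may be worth confirming that no gene in the matrix is stuck in a single state. The following is a sketch, not ReGAIN's own check: single_state_genes is a hypothetical helper, and it assumes the first column identifies the genome file while every other column is one gene.

```python
import csv


def single_state_genes(matrix_csv):
    """Return gene columns whose values never vary across genomes.

    Assumes column 1 is the genome identifier and all remaining columns are genes.
    """
    with open(matrix_csv, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        states = [set() for _ in header[1:]]
        for row in reader:
            for seen, value in zip(states, row[1:]):
                seen.add(value.strip())
    return [gene for gene, seen in zip(header[1:], states) if len(seen) < 2]
```

An empty list means every gene appears in at least two states; any gene returned here is either ubiquitous or absent everywhere and would need a tighter --min/--max setting.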


Bayesian Network Structure Learning

Module 2 regain bnL or regain bnS

-i, --input, input file in CSV format
-M, --metadata, file containing gene names and descriptions
-o, --output_boot, output bootstrap file
-T, --threads, number of cores to dedicate for parallel processing
-n, --number_of_boostraps, how many bootstraps to run (suggested 300-500)
-r, --number-of-resamples, how many data resamples you want to use (suggested 100)

Module 2 example usage:

NOTE: We suggest using between 300 and 500 bootstraps and 100 resamples

bnS, Bayesian network structure learning analysis for fewer than 100 genes
bnL, Bayesian network structure learning analysis for 100 or more genes

For fewer than 100 genes:

regain bnS -i filtered_matrix.csv -M metadata.csv -o bootstrapped_network -T 8 -n 500 -r 100

For 100 or more genes:

regain bnL -i filtered_matrix.csv -M metadata.csv -o bootstrapped_network -T 8 -n 500 -r 100

Output files:

Results.csv, results file of all conditional probability and relative risk values
post_hoc_analysis.csv, results file of all bidirectional probability and fold change scores
Bayesian_Network.html, interactive Bayesian network
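The choice between bnS and bnL depends only on the number of genes, so it can be derived from the matrix header. This is a sketch under assumptions: choose_bn_module and build_bn_command are hypothetical helpers, the first matrix column is assumed to be the genome identifier, and the defaults mirror the example commands above.

```python
import csv


def choose_bn_module(matrix_csv):
    """Return 'bnS' for fewer than 100 gene columns, otherwise 'bnL'."""
    with open(matrix_csv, newline="") as handle:
        header = next(csv.reader(handle))
    n_genes = len(header) - 1  # assumes column 1 is the genome identifier
    return "bnS" if n_genes < 100 else "bnL"


def build_bn_command(matrix_csv, metadata_csv, output_boot="bootstrapped_network",
                     threads=8, bootstraps=500, resamples=100):
    """Assemble the Module 2 argument list using the flags documented above."""
    return [
        "regain", choose_bn_module(matrix_csv),
        "-i", matrix_csv, "-M", metadata_csv, "-o", output_boot,
        "-T", str(threads), "-n", str(bootstraps), "-r", str(resamples),
    ]
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the matrix and metadata files exist.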


ReGAIN Curate

ReGAIN Curate is designed to allow users to generate a dataset for Bayesian network structure learning using
a custom set of gene queries, independent of ReGAIN Module 1

regain curate

-d, --directory, path to genome FASTA files
-q, --query, path to query files containing amino acid sequences in FASTA format
-T, --threads, number of cores to dedicate for parallel processing
--min, minimum gene occurrence threshold
--max, maximum gene occurrence threshold (should be less than number of genomes, see NOTE below)
--nucleotide-query, optional; use this to query nucleotide FASTA files
--report-all, optional; use this to return all BLAST hits, regardless of internal identity thresholds
--perc, optional; set a custom minimum percent identity threshold. Default = 90%
--cov, optional; set a custom minimum query coverage threshold. Default = 75%
--min-seq-length, optional; designate minimum allowed query sequence length. Use with caution
--keep-gene-names, optional; maintains special characters in variable names. Should not be used if proceeding to Module 2

ReGAIN Curate example usage:
regain curate -d /path/to/genome/files -q /path/to/query/files -T 8 --min 5 --max 475

ReGAIN Curate output files:

filtered_results.csv, all BLAST results meeting identity thresholds
curate_matrix.csv, filtered data matrix
curate_metadata.csv, metadata file for use in ReGAIN statistical modules
If --report-all is used:
all_results.csv, all BLAST results, regardless of identity thresholds
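To make the --perc and --cov defaults concrete, the sketch below filters a list of hits against both thresholds. It is an illustration only, not ReGAIN Curate's implementation: the dictionary keys percent_identity and query_coverage are hypothetical names, and treating the thresholds as inclusive is an assumption.

```python
def passes_thresholds(hit, min_perc=90.0, min_cov=75.0):
    """True if a hit meets both minimum thresholds (inclusive comparison assumed)."""
    return hit["percent_identity"] >= min_perc and hit["query_coverage"] >= min_cov


def filter_hits(hits, min_perc=90.0, min_cov=75.0):
    """Keep only hits that meet both the identity and the coverage threshold."""
    return [hit for hit in hits if passes_thresholds(hit, min_perc, min_cov)]


# Made-up hits for illustration; keys and values are hypothetical.
example_hits = [
    {"gene": "geneA", "percent_identity": 95.0, "query_coverage": 80.0},
    {"gene": "geneB", "percent_identity": 89.9, "query_coverage": 100.0},
    {"gene": "geneC", "percent_identity": 99.0, "query_coverage": 60.0},
]
```

With the defaults, only geneA in example_hits survives: geneB falls short on identity and geneC on coverage.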

NOTE: Discrete Bayesian network analyses require every variable to exist in at least two states. For ReGAIN, these two states are 'present' and 'absent'. Genes that occur in every genome will break the analysis, so for N genomes, --max should be set to N - 1 at most. Keep in mind that removing very low and very high abundance genes can reduce noise in the network.

ReGAIN Extract

ReGAIN Extract is an optional module for use with ReGAIN Curate. This module extracts the aligned sequences
identified by regain curate and is offered as an additional quality control step for gene identification.
Nucleotide sequences are extracted to a multi-FASTA file.

regain extract

-c, --csv-path, path to ReGAIN Curate results file, such as filtered_results.csv
-f, --fasta-directory, path to genome FASTA files used in ReGAIN curate
-T, --threads, number of cores to dedicate for parallel processing
-o, --output-fasta, multi-FASTA file output (.fa, .fas, .fasta, .fna, .faa)
--min-evalue, optional; for use when --report-all flag is used. Sets minimum evalue threshold for sequence extraction
--min-perc, optional; same guidelines as --min-evalue
--min-cov, optional; same guidelines as --min-evalue and --min-perc
--translate, optional; translates extracted nucleotide sequences (see NOTE below)

ReGAIN Extract example usage:

regain extract -c /path/to/results/csv -f /path/to/genome/FASTA/files -T 8 -o sequences.fa

NOTE: the --translate flag should be used with care. If an alignment returns an incomplete CDS,
ReGAIN Extract will trim the sequence to the closest length divisible by 3 for codon prediction, which can result
in frameshifts. --translate is only suggested if the returned alignments represent full coding sequences, or
if manual validation of gene calls is performed.
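The frameshift risk described in the note can be seen in a small example. This is a sketch, not ReGAIN Extract's code: trim_to_codons and translate are hypothetical helpers, trimming from the 3' end is an assumption, and the codon table is deliberately partial, covering only the codons used here.

```python
# Partial codon table for this demonstration only; unknown codons become 'X'.
CODONS = {
    "ATG": "M", "GCC": "A", "AAA": "K", "TAA": "*",
    "TGG": "W", "CCA": "P", "AAT": "N",
}


def trim_to_codons(seq):
    """Drop trailing bases so the length is divisible by 3 (3' trimming assumed)."""
    return seq[: len(seq) - len(seq) % 3]


def translate(seq):
    """Translate consecutive codons, reading from the first base of seq."""
    return "".join(CODONS.get(seq[i:i + 3], "X") for i in range(0, len(seq), 3))


full_cds = "ATGGCCAAATAA"   # complete CDS: M A K stop
partial_cds = full_cds[1:]  # alignment missing the first base of the CDS

print(translate(full_cds))                     # MAK*
print(translate(trim_to_codons(partial_cds)))  # WPN
```

Trimming makes the partial sequence a valid length, but the reading frame is shifted by one base, so the translated product no longer resembles the real protein.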

ReGAIN Combine

ReGAIN Combine is an optional module for use in combination with the ReGAIN Curate and ReGAIN AMR modules.
In the event users want to supplement the regain AMR results with a custom set of genes queried through
regain curate, regain combine will merge both datasets into a single dataset for use in ReGAIN statistical modules

regain combine

--matrix1, path to ReGAIN AMR data matrix, filtered_matrix.csv
--matrix2, path to ReGAIN Curate data matrix, curate_matrix.csv
--metadata1, path to ReGAIN AMR metadata file, metadata.csv
--metadata2, path to ReGAIN Curate metadata file, curate_metadata.csv
--delete-duplicates, optional; automatically delete duplicate values from dataset

ReGAIN Combine example usage:

regain combine --matrix1 /path/to/AMR/matrix/csv --matrix2 /path/to/curate/matrix/csv \
--metadata1 /path/to/AMR/metadata/csv --metadata2 /path/to/curate/metadata/csv

ReGAIN Combine output files:

combined_matrix.csv, combined presence/absence matrix
combined_metadata.csv, combined metadata file

NOTE: in order for regain combine to function properly, do not modify values in column 1 (file) of the data matrix files
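Since the note above singles out column 1, a quick comparison of that column across the two matrices can catch accidental edits before merging. This is a pre-flight sketch only and does not reproduce how regain combine merges data; file_column and compare_file_columns are hypothetical helpers.

```python
import csv


def file_column(matrix_csv):
    """Return the set of values in column 1 of a matrix CSV, skipping the header."""
    with open(matrix_csv, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # header row
        return {row[0] for row in reader if row}


def compare_file_columns(matrix1_csv, matrix2_csv):
    """Return (only_in_matrix1, only_in_matrix2) as sorted lists of column-1 values."""
    first = file_column(matrix1_csv)
    second = file_column(matrix2_csv)
    return sorted(first - second), sorted(second - first)
```

Two empty lists mean both matrices describe the same set of genomes; anything else points to rows whose identifiers differ between the AMR and Curate datasets.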


ReGAIN Accessory Modules

Stand Alone Network Visualization

regain network

-i, --input, input RDS file generated from bnS/bnL analysis
-d, --data, input filtered data matrix file
-M, --metadata, input metadata file
-s, --statistics_results, input 'Results.csv' file from bnS/bnL analysis

Example usage:

regain network -i network.rds -d filtered_matrix.csv -M metadata.csv -s Results.csv

This analysis is an integrated part of the standard bnS/bnL pipeline. The stand-alone module is provided as a fallback in case the network visualization needs to be regenerated.

Output:

Bayesian_Network.html, interactive Bayesian network


Multidimensional Analyses

Optional Module 3 regain MVA

Currently supported measures of distance:

manhattan, euclidean, canberra, clark, bray, kulczynski, jaccard, gower,
horn, mountford, raup, binomial, chao, cao, mahalanobis, altGower, morisita,
chisq, chord, hellinger

-i, --input, input file in CSV format
-m, --method, measure of distance method
-c, --centers, how many centers you want for your multidimensional analysis (1-10)
-C, --confidence, confidence interval for ellipses

Module 3 example usage:

regain MVA -i matrix.csv -m jaccard -c 3 -C 0.75

NOTE: the MVA analysis will generate 2 files: a PNG and a PDF of the plot
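As an illustration of what a distance measure means for presence/absence data, the sketch below computes the Jaccard distance between two genomes. It shows the standard definition only and is not the code regain MVA runs; jaccard_distance is a hypothetical helper, and returning 0.0 when both genomes carry no genes is a choice made for this sketch.

```python
def jaccard_distance(genome_a, genome_b):
    """Jaccard distance between two equal-length presence/absence vectors.

    Distance = 1 - (genes present in both) / (genes present in at least one).
    """
    if len(genome_a) != len(genome_b):
        raise ValueError("Both genomes must cover the same genes.")
    union = sum(1 for a, b in zip(genome_a, genome_b) if a or b)
    if union == 0:
        return 0.0  # sketch-specific choice: two empty profiles count as identical
    shared = sum(1 for a, b in zip(genome_a, genome_b) if a and b)
    return 1.0 - shared / union
```

For the profiles [1, 1, 0, 0] and [1, 0, 1, 0], one gene is shared out of three present in either genome, so the distance is 1 - 1/3.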


Formatting External Data

Bayesian network analysis requires both a data matrix file and a metadata file. MVA analysis requires only a data matrix file.
The metadata file MUST have two column headers, ideally 'Gene' and 'GeneClass'. The second column may contain empty rows.
The data matrix MUST have headers for all columns.
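These header requirements can be checked before running an analysis on external data. The following is a minimal sketch, not part of ReGAIN: check_metadata_header and check_matrix_header are hypothetical helpers, and reading "two column headers" as exactly two is an assumption.

```python
import csv


def read_header(path):
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle))


def check_metadata_header(path):
    """Return a list of problems with the metadata header (empty list if none)."""
    header = read_header(path)
    problems = []
    if len(header) != 2 or any(not name.strip() for name in header):
        problems.append("metadata file needs two non-empty column headers")
    elif [name.strip() for name in header] != ["Gene", "GeneClass"]:
        problems.append("headers differ from the suggested 'Gene' and 'GeneClass'")
    return problems


def check_matrix_header(path):
    """Return the 1-based positions of matrix columns that lack a header."""
    header = read_header(path)
    return [i for i, name in enumerate(header, start=1) if not name.strip()]
```

A mismatch with 'Gene' and 'GeneClass' is reported as a warning-style message only, since the guide describes those names as ideal rather than required.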


Citing ReGAIN

Resistance Gene Association and Inference Network (ReGAIN): A Bioinformatics Pipeline for Assessing Probabilistic Co-Occurrence Between Resistance Genes in Bacterial Pathogens.
Bring Horvath, E; Stein, M; Mulvey, MA; Hernandez, EJ; Winter, JM.
bioRxiv 2024.02.26.582197; doi: https://doi.org/10.1101/2024.02.26.582197