-
Notifications
You must be signed in to change notification settings - Fork 4
Aligning Genes to Genomes
Now we get to the core functionality of gig-map
, aligning a collection of
genes against a collection of genomes. In a previous step the user should have
generated a set of deduplicated genes
and a collection of microbial genomes.
In the next step, the user will align the genes against the genomes and create
a set of output files which can be used to render
interactive gig-map
displays.
To start, create or identify a folder which will be used to run this analysis. Next, download these two template files to help you set up the alignment process:
The align.params.json
file allows you to specify which genes and genomes will be used,
and in what location the output files will be placed. The align.sh
file is a script
which will launch the appropriate utility within gig-map
using the parameters specified
in align.params.json
To list the complete set of options available for the alignment utility, run the following command:
bash align.sh --help
By default, these files are set up to align the genes present in a file named
downloaded_genes/clustered.genes.fasta.gz
against the genomes found in the folder
downloaded_genomes/
, saving the output to a folder named alignments/
within the
working directory. Please modify any of the values in the align.params.json
file as appropriate for your use-case.
Once you are satisfied that the alignments.params.json
file is pointing to the
right set of inputs and outputs, start the download process by running:
bash alignments.sh
There are many parameters which can be modified when performing the alignment process.
For each of the parameters listed below, a custom value can be specified by adding
the appropriate line to the align.params.json
file. This file must conform to the
JSON file format, and if you would like to
check and make sure that you have not accidentally introduced any formatting errors,
you can check the file against a free online JSON linting tool.
As an example, to filter out any genes shorter than 100aa, you can set the min_gene_length
parameter as follows:
{
"genomes": "downloaded_genomes/*",
"genes_fasta": "downloaded_genes/clustered.genes.fasta.gz",
"output_folder": "alignments",
"min_gene_length": 100
}
Required Arguments:
--genomes Genome sequences in FASTA format (see note below)
--genome_tables Tables of NCBI genomes to analyze (see note below)
--genes_fasta Amino acid sequences to search for (multi-FASTA format)
--genes_dmnd Amino acid sequences to search for (DIAMOND database format *.dmnd)
(either --genes_fasta or --genes_dmnd is required)
--output_folder Folder to write output files to
Optional Arguments:
--marker_genes Optionally provide marker genes in FASTA format which will be used
in addition to ANI to estimate a phylogeny for the provided genomes.
More than one FASTA file can be specified with comma delimiters.
Genes should be provided in amino acid sequence.
--pick_marker_genes The workflow will select a small number of marker genes from the
input genes which:
- are found across the largest proportion of genomes,
- have the highest average nucleotide identity alignments to genomes, and
- have the longest sequence length
Use this flag to specify the maximum number of marker genes which
should be identified with these criteria (default: 10).
Note: these marker genes will be analyzed in addition to any sequences
provided explicitly with --marker_genes.
--min_marker_coverage Minimum percent coverage required to use the aligned marker sequence
from a particular genome (default: 50)
--aligner Alignment algorithm to use (options: diamond, blast; default: diamond)
--min_gene_length If specified, filter out any gene shorter than this number of amino acids
--min_identity Percent identity threshold used for alignment (default: 90)
--min_coverage Percent coverage threshold used for alignment (default: 90)
--ftp_threads Number of FTP downloads to execute concurrently (default: 25)
--query_gencode Genetic code to use for conceptual translation (default: 11)
--max_evalue Maximum E-value for any alignment (default: 0.001)
--max_overlap Remove alignments which align to a region which overlaps
a higher-scoring alignment by at least this percentage
(default: 50, max: 100, min: 0)
--annotate_geneshot Optionally format annotations from the geneshot pipeline in a format
which can be easily loaded into the gig-map visualization app.
This flag does not require the use of --abundances_geneshot (below).
The expected file is the output of geneshot named *.results.hdf5.
--abundances_geneshot Optionally annotate genes using their observed relative abundances
from a set of metagenomic datasets, using the output of the geneshot
analysis pipeline. This flag can only be used in combination with
--annotate_geneshot.
The expected file is the output of geneshot named *.details.hdf5.
--max_n_genes_train_pca
The maximum number of genes used to train the PCA model used
for ordering genes based on the similarity of the genomes
which they align to (default: 100000)
NOTE: If this dataset has fewer genes than this, the ordering of
genes by similarity of genome alignments will be performed more
precisely by linkage clustering.
--max_pcs_tsne The maximum number of dimensions from the PCA output to use
for ordering genes by 1-dimensional t-SNE (default: 50)
--sketch_size Sketch size (see mash documentation) (default: 10000)
--genome_distances If the pairwise genome distances have already been computed,
you can directly import them into the analysis instead of repeating
that time-consuming process. This flag should be used to indicate
the distances.csv.gz file produced for this exact set of genomes.
--ani_thresholds Comma-delimited list of percent identity thresholds at which
genomes will be clustered for more rapid visualization.
Default: 99,95,90,80,70,60,50
Specifing Genomes for Alignment:
Genomes may be provided for alignment in two different ways, which can be used
individually or in combination.
In the most straightforward way, genomes are provided in gzip-compressed FASTA format
with a wildcard expression (e.g. 'genomes/*.fasta.gz'). When a single wildcard
character ('*') is used, all files are considered within the specified folder.
When a double wildcard character ('**') is used, all subdirectories are also
traversed to find matches for the wildcard expression. One or more set of paths
can be provided to --genomes using a comma to delimit multiple paths, for example:
--genomes local_genomes/*.fasta.gz,other_genomes/**.fna.gz
In addition, you may import genomes directly from the NCBI Prokaryotic Genome Browser
found at (https://www.ncbi.nlm.nih.gov/genome/browse#!/prokaryotes/). After selecting
your genomes of interest, click on the "Download" button to save a CSV listing all
of the genomes for alignment. That CSV file may also be specified with the --genome_tables
flag. More than one table of genomes may be specified using the comma delimiter as above.
NOTE: All genomes must have a unique filename
Specifying Genes for Alignment:
Genes may be provided for alignment either in FASTA format or as a pre-formatted DIAMOND
database. There is no computational benefit derived from providing a pre-formatted database,
and so that option is provided merely for convenience. Note that DIAMOND database formats
depend on the version of the software being used, and so there may be compatibility issues
when using databases compiled with versions of DIAMOND which are significantly different
from the version used for alignment in this workflow (v2.0.6 at time of writing).
To provide genes in FASTA format (gzip-optional), use the --genes_fasta flag.
To provide genes in DIAMOND database format (v2.0.6 recommended), use the --genes_dmnd flag.
Multiple files may be specified with either flag, which will be concatenated for alignment.