Skip to content

Aligning Genes to Genomes

Sam Minot edited this page Jan 18, 2022 · 7 revisions

Aligning Genes

Now we get to the core functionality of gig-map, aligning a collection of genes against a collection of genomes. In a previous step the user should have generated a set of deduplicated genes and a collection of microbial genomes. In the next step, the user will align the genes against the genomes and create a set of output files which can be used to render interactive gig-map displays.

Setting Up

To start, create or identify a folder which will be used to run this analysis. Next, download these two template files to help you set up the alignment process:

The align.params.json file allows you to specify which genes and genomes will be used, and in what location the output files will be placed. The align.sh file is a script which will launch the appropriate utility within gig-map using the parameters specified in align.params.json

To list the complete set of options available for the alignment utility, run the following command:

bash align.sh --help

By default, these files are set up to align the genes present in a file named downloaded_genes/clustered.genes.fasta.gz against the genomes found in the folder downloaded_genomes/, saving the output to a folder named alignments/ within the working directory. Please modify any of the values in the align.params.json file as appropriate for your use-case.

Once you are satisfied that the alignments.params.json file is pointing to the right set of inputs and outputs, start the download process by running:

bash alignments.sh

Adding Custom Parameters

There are many parameters which can be modified when performing the alignment process. For each of the parameters listed below, a custom value can be specified by adding the appropriate line to the align.params.json file. This file must conform to the JSON file format, and if you would like to check and make sure that you have not accidentally introduced any formatting errors, you can check the file against a free online JSON linting tool.

As an example, to filter out any genes shorter than 100aa, you can set the min_gene_length parameter as follows:

{
    "genomes": "downloaded_genomes/*",
    "genes_fasta": "downloaded_genes/clustered.genes.fasta.gz",
    "output_folder": "alignments",
    "min_gene_length": 100
}

Argument List

    Required Arguments:
      --genomes             Genome sequences in FASTA format (see note below)
      --genome_tables       Tables of NCBI genomes to analyze (see note below)
      --genes_fasta         Amino acid sequences to search for (multi-FASTA format)
      --genes_dmnd          Amino acid sequences to search for (DIAMOND database format *.dmnd)
                            (either --genes_fasta or --genes_dmnd is required)
      --output_folder       Folder to write output files to

    Optional Arguments:
      --marker_genes        Optionally provide marker genes in FASTA format which will be used
                            in addition to ANI to estimate a phylogeny for the provided genomes.
                            More than one FASTA file can be specified with comma delimiters.
                            Genes should be provided in amino acid sequence.
      --pick_marker_genes   The workflow will select a small number of marker genes from the
                            input genes which:
                                - are found across the largest proportion of genomes,
                                - have the highest average nucleotide identity alignments to genomes, and
                                - have the longest sequence length
                            Use this flag to specify the maximum number of marker genes which
                            should be identified with these criteria (default: 10).
                            Note: these marker genes will be analyzed in addition to any sequences
                            provided explicitly with --marker_genes.
      --min_marker_coverage Minimum percent coverage required to use the aligned marker sequence
                            from a particular genome (default: 50)
      --aligner             Alignment algorithm to use (options: diamond, blast; default: diamond)
      --min_gene_length     If specified, filter out any gene shorter than this number of amino acids
      --min_identity        Percent identity threshold used for alignment (default: 90)
      --min_coverage        Percent coverage threshold used for alignment (default: 90)
      --ftp_threads         Number of FTP downloads to execute concurrently (default: 25)
      --query_gencode       Genetic code to use for conceptual translation (default: 11)
      --max_evalue          Maximum E-value for any alignment (default: 0.001)
      --max_overlap         Remove alignments which align to a region which overlaps
                            a higher-scoring alignment by at least this percentage
                            (default: 50, max: 100, min: 0)
      --annotate_geneshot   Optionally format annotations from the geneshot pipeline in a format
                            which can be easily loaded into the gig-map visualization app.
                            This flag does not require the use of --abundances_geneshot (below).
                            The expected file is the output of geneshot named *.results.hdf5.
      --abundances_geneshot Optionally annotate genes using their observed relative abundances
                            from a set of metagenomic datasets, using the output of the geneshot
                            analysis pipeline. This flag can only be used in combination with
                            --annotate_geneshot.
                            The expected file is the output of geneshot named *.details.hdf5.
      --max_n_genes_train_pca
                            The maximum number of genes used to train the PCA model used
                            for ordering genes based on the similarity of the genomes
                            which they align to (default: 100000)
                            NOTE: If this dataset has fewer genes than this, the ordering of
                            genes by similarity of genome alignments will be performed more 
                            precisely by linkage clustering.
      --max_pcs_tsne        The maximum number of dimensions from the PCA output to use
                            for ordering genes by 1-dimensional t-SNE (default: 50)
      --sketch_size         Sketch size (see mash documentation) (default: 10000)
      --genome_distances    If the pairwise genome distances have already been computed,
                            you can directly import them into the analysis instead of repeating
                            that time-consuming process. This flag should be used to indicate
                            the distances.csv.gz file produced for this exact set of genomes.
      --ani_thresholds      Comma-delimited list of percent identity thresholds at which
                            genomes will be clustered for more rapid visualization.
                            Default: 99,95,90,80,70,60,50

    
    Specifing Genomes for Alignment:

    Genomes may be provided for alignment in two different ways, which can be used
    individually or in combination.
    
    In the most straightforward way, genomes are provided in gzip-compressed FASTA format
    with a wildcard expression (e.g. 'genomes/*.fasta.gz'). When a single wildcard
    character ('*') is used, all files are considered within the specified folder.
    When a double wildcard character ('**') is used, all subdirectories are also
    traversed to find matches for the wildcard expression. One or more set of paths
    can be provided to --genomes using a comma to delimit multiple paths, for example:
        --genomes local_genomes/*.fasta.gz,other_genomes/**.fna.gz

    In addition, you may import genomes directly from the NCBI Prokaryotic Genome Browser
    found at (https://www.ncbi.nlm.nih.gov/genome/browse#!/prokaryotes/). After selecting
    your genomes of interest, click on the "Download" button to save a CSV listing all
    of the genomes for alignment. That CSV file may also be specified with the --genome_tables
    flag. More than one table of genomes may be specified using the comma delimiter as above.

    NOTE: All genomes must have a unique filename


    Specifying Genes for Alignment:

    Genes may be provided for alignment either in FASTA format or as a pre-formatted DIAMOND
    database. There is no computational benefit derived from providing a pre-formatted database,
    and so that option is provided merely for convenience. Note that DIAMOND database formats
    depend on the version of the software being used, and so there may be compatibility issues
    when using databases compiled with versions of DIAMOND which are significantly different
    from the version used for alignment in this workflow (v2.0.6 at time of writing).

    To provide genes in FASTA format (gzip-optional), use the --genes_fasta flag.
    To provide genes in DIAMOND database format (v2.0.6 recommended), use the --genes_dmnd flag.

    Multiple files may be specified with either flag, which will be concatenated for alignment.