
Aligning Genes to Genomes

Sam Minot edited this page May 25, 2022 · 7 revisions

Now we get to the core functionality of gig-map: aligning a collection of genes against a collection of genomes. In a previous step, the user should have generated a set of deduplicated genes and a collection of microbial genomes. In this step, the user will align the genes against the genomes and create a set of output files which can be used to render gig-map displays.

Running Alignments

To align a collection of genes against a collection of genomes, the input genomes must be present as one or more gzip-compressed FASTA files in a single folder. If you want to combine files which are located in different folders, create symlinks to those files in a single folder.
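The symlinking step can be sketched in Python as follows. This is a minimal illustration and not part of gig-map: the function name `link_genomes`, the `*.gz` pattern, and the choice to stop on duplicate file names are all assumptions made for the example.

```python
from pathlib import Path


def link_genomes(source_dirs, target_dir, pattern="*.gz"):
    """Symlink every file matching `pattern` from each source folder into `target_dir`.

    Hypothetical helper: the name, the default pattern, and the error
    handling are assumptions, not gig-map behavior.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    linked = []
    for source in source_dirs:
        for path in sorted(Path(source).glob(pattern)):
            link = target / path.name
            if link.exists() or link.is_symlink():
                # Two folders may hold files with the same name;
                # refuse to overwrite an existing entry silently.
                raise FileExistsError(f"{link} already exists")
            # Point the link at the absolute location of the original file
            link.symlink_to(path.resolve())
            linked.append(link)
    return linked
```

For example, `link_genomes(["run1/genomes", "run2/genomes"], "all_genomes")` (placeholder folder names) would produce a single `all_genomes` folder that can be passed as the genomes input.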

Inputs and Outputs

Inputs

  • genes: Single file containing all genes to be aligned, in amino acid FASTA format (gzip-compressed) (e.g. centroids.faa.gz from the deduplicate outputs)
  • genomes: Folder containing all genomes to align against (gzip-compressed FASTA format)
  • collect_results: After alignment, perform all additional analyses needed for visualization (true/false)
  • min_coverage: Minimum percentage of a gene's length which must align in order to retain the alignment [default: 90]
  • min_identity: Minimum percent identity of the amino acid alignment required to retain the alignment [default: 90]
  • max_evalue: Maximum E-value threshold used to filter all alignments [default: 0.001]
  • aligner: Algorithm used for alignment (options: diamond, blast) [default: diamond]
  • max_overlap: Any alignment which overlaps a higher-scoring alignment by more than this amount will be filtered out [default: 50]
  • query_gencode: Genetic code used for conceptual translation of genome sequences [default: 11]
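To make the filtering parameters concrete, here is a rough Python sketch of how such thresholds could be applied to a list of alignments. It is not gig-map's implementation. The dictionary keys (`coverage`, `identity`, `evalue`, `score`, `contig`, `start`, `end`) are hypothetical, the comparisons are assumed to be inclusive, and `max_overlap` is assumed to be a percentage of the lower-scoring alignment's length.

```python
def passes_thresholds(aln, min_coverage=90.0, min_identity=90.0, max_evalue=0.001):
    """Return True if one alignment meets all three thresholds.

    Assumes `coverage` and `identity` are percentages (0-100).
    """
    return (
        aln["coverage"] >= min_coverage
        and aln["identity"] >= min_identity
        and aln["evalue"] <= max_evalue
    )


def overlap_percent(a, b):
    """Percent of alignment `a` covered by `b` (same contig, inclusive coordinates)."""
    if a["contig"] != b["contig"]:
        return 0.0
    shared = min(a["end"], b["end"]) - max(a["start"], b["start"]) + 1
    length = a["end"] - a["start"] + 1
    return 100.0 * max(shared, 0) / length


def filter_overlapping(alignments, max_overlap=50.0):
    """Greedily keep alignments from highest to lowest score.

    An alignment is dropped when it overlaps any already-kept
    (higher-scoring) alignment by more than `max_overlap` percent.
    """
    kept = []
    for aln in sorted(alignments, key=lambda x: x["score"], reverse=True):
        if all(overlap_percent(aln, other) <= max_overlap for other in kept):
            kept.append(aln)
    return kept


# Made-up records used only to exercise the functions above
example = [
    {"contig": "c1", "start": 1, "end": 100, "score": 200,
     "coverage": 98.0, "identity": 95.0, "evalue": 1e-50},
    {"contig": "c1", "start": 41, "end": 140, "score": 150,
     "coverage": 95.0, "identity": 92.0, "evalue": 1e-40},
    {"contig": "c1", "start": 300, "end": 400, "score": 120,
     "coverage": 70.0, "identity": 99.0, "evalue": 1e-30},
]
retained = filter_overlapping([a for a in example if passes_thresholds(a)])
```

In this example the third record fails the coverage threshold, and the second overlaps the higher-scoring first record by 60 percent of its length, so only the first record is retained.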

Outputs

The output from this step will include:

  • genomes.aln.csv.gz: A table with all of the alignments which were found
  • distances.csv.gz: A table of genome-genome similarity (average nucleotide identity, ANI)
  • genomes.gene_order.txt.gz: A table with the ordering of genes which resulted from this collection of alignments
  • gigmap.*.html: A quick-and-dirty visualization of the gene-to-genome alignment
  • gigmap.rdb: A complete archive of the aligned information which can be used in the interactive gig-map display tool
  • genome.manifest.csv: A template genome annotation table which can be used to build out more complex visualizations
  • gene.manifest.csv: A similar template annotation table for the genes used in the analysis
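The gzip-compressed CSV outputs can be inspected with the Python standard library. The sketch below makes no assumption about which columns the tables contain: it reads the header row from the file itself. The helper name `read_table` is made up for this example.

```python
import csv
import gzip


def read_table(path):
    """Read a gzip-compressed CSV into a list of dicts keyed by the header row."""
    # "rt" opens the compressed file in text mode; newline="" is what the
    # csv module expects so that it can handle line endings itself.
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle))
```

For example, `rows = read_table("genomes.aln.csv.gz")` followed by `print(list(rows[0].keys()))` lists the column names found in the alignment table, assuming the file is in the working directory and contains at least one row.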

[Figure: align_genes]

Useful References

Other useful references may be: