Tutorial

Sam Minot edited this page Apr 28, 2022 · 1 revision

Before starting this tutorial, make sure to follow the steps for getting started with all of the software needed to run the gig-map analysis tools on your system.

1. Find your genomes

In order for gig-map to align genes against a collection of genomes, each genome must be formatted in its own FASTA file (with nucleotide sequences). A genome may be made up of multiple records within the FASTA, but each genome must be in its own file.
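Before running any alignment, it can help to sanity-check a genome file against the layout described above. The following is a minimal sketch and is not part of gig-map: the function names are hypothetical, and the nucleotide check is only a heuristic based on IUPAC codes (a short protein sequence made up solely of letters such as M, K, and V would slip through).

```python
import gzip

# IUPAC nucleotide codes, including ambiguity codes and the gap character
NUCLEOTIDE_CODES = set("ACGTUNRYKMSWBDHV-")


def read_fasta(path):
    """Yield (header, sequence) tuples from a plain or gzip-compressed FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record begins: emit the previous one, if any
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def looks_like_nucleotides(sequence):
    """Heuristic: True if every character is an IUPAC nucleotide code."""
    return set(sequence.upper()) <= NUCLEOTIDE_CODES


def summarize_genome(path):
    """Summarize one genome file: record count, total length, alphabet check."""
    records = list(read_fasta(path))
    return {
        "n_records": len(records),
        "total_length": sum(len(seq) for _, seq in records),
        "all_nucleotide": all(looks_like_nucleotides(seq) for _, seq in records),
    }
```

Calling `summarize_genome` on each file in your genome folder gives a quick view of how many records each genome contains and flags files that appear to hold protein sequences by mistake.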

If you do not already have a collection of genomes to align against, gig-map provides a utility for downloading a set of genomes from NCBI. To download these public sequences, first search the NCBI genome portal for your genomes of interest, then download the CSV manifest describing that set of genomes.

The NCBI Genome CSV file can be used as an input for the genome download utility, which will output a folder full of FASTA files.

download_genomes

2. Find your genes

In order for gig-map to align a collection of genes against a collection of genomes, the genes must be formatted in one or more FASTA files containing amino acid (not nucleotide) sequences. Each of those FASTA files should also be gzip-compressed.
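If your amino acid FASTA files are not yet compressed, they can be gzipped with the Python standard library. This is a minimal sketch rather than a gig-map feature; the function name `gzip_fasta` and the example file name are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gzip_fasta(path):
    """Write a gzip-compressed copy of a FASTA file next to the original.

    Returns the path of the new .gz file; the original file is left in place.
    """
    src = Path(path)
    dest = src.with_name(src.name + ".gz")
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest


# Example usage (hypothetical file name):
# gzip_fasta("genes.faa")  # writes genes.faa.gz alongside genes.faa
```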

If you do not already have a collection of genes to align, gig-map provides a utility for downloading the genes contained within a set of genomes in NCBI. To download these public sequences, first go to the NCBI genome portal to search for your genomes of interest, and then download the CSV manifest describing that set of genomes.

The NCBI Genome CSV file can be used as an input for the gene download utility, which will output a folder full of FASTA files.

download_genes

3. Deduplicate your genes

Before aligning a collection of genes, it is a good idea to filter out any duplicated sequences. When gig-map aligns a set of genes against each genome, it only keeps the best hit for any genomic region. If there are two identical sequences in your query set of genes, only one of them will be seen in the results, which may be unexpected.

To ensure that your query set of genes does not contain any duplicates, you can use the gene deduplication utility. This utility is also helpful for combining genes from multiple genomes, keeping only the unique coding content.
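To illustrate what removing duplicates means here, the sketch below drops exact repeats from a list of (header, sequence) records. It is an assumption-laden illustration only: the deduplication utility itself may use a different method (for example, grouping highly similar sequences rather than only identical ones), and the function name is hypothetical.

```python
def deduplicate_exact(records):
    """Keep the first record for each distinct amino acid sequence.

    `records` is an iterable of (header, sequence) tuples. Sequences are
    compared case-insensitively, ignoring a trailing stop symbol ('*').
    """
    seen = set()
    unique = []
    for header, sequence in records:
        key = sequence.upper().rstrip("*")
        if key in seen:
            continue  # identical sequence already kept under another header
        seen.add(key)
        unique.append((header, sequence))
    return unique


genes = [
    ("geneA", "MKV"),
    ("geneB", "mkv"),   # same sequence as geneA, different case
    ("geneC", "MKVL"),
    ("geneD", "MKV*"),  # same as geneA with a trailing stop symbol
]
unique_genes = deduplicate_exact(genes)  # keeps geneA and geneC
```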

deduplicate_genes

4. Aligning genes to genomes

Now that you have (1) a folder full of genome FASTAs and (2) a deduplicated collection of genes in a single FASTA, you are ready to align the genes to the genomes.

Use the gene alignment utility to run the alignment. You can either keep the default level of stringency or provide your own custom values for the minimum identity and coverage thresholds, as appropriate.
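As a rough picture of what identity and coverage thresholds do, the sketch below filters a list of alignment hits. Everything in it is an assumption made for illustration: the `Hit` fields, the definition of coverage as the aligned fraction of the gene, and the threshold values are hypothetical and do not reflect gig-map's output format or defaults.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One gene-to-genome alignment (hypothetical fields for illustration)."""
    gene_id: str
    genome_id: str
    percent_identity: float  # 0 to 100
    aligned_length: int      # number of gene residues covered by the alignment
    gene_length: int         # full length of the gene, assumed to be > 0


def percent_coverage(hit):
    """Percentage of the gene that is covered by the alignment."""
    return 100.0 * hit.aligned_length / hit.gene_length


def filter_hits(hits, min_identity, min_coverage):
    """Keep hits that meet both the identity and coverage thresholds."""
    return [
        hit for hit in hits
        if hit.percent_identity >= min_identity
        and percent_coverage(hit) >= min_coverage
    ]


hits = [
    Hit("geneA", "genome1", 95.0, 90, 100),  # passes both thresholds
    Hit("geneB", "genome1", 85.0, 100, 100), # identity too low
    Hit("geneC", "genome2", 99.0, 50, 100),  # coverage too low
]
kept = filter_hits(hits, min_identity=90.0, min_coverage=90.0)
```

Raising either threshold makes the analysis more stringent, so fewer genes are reported as present in each genome; lowering them admits more divergent or partial matches.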

align_genes

5. Generating custom displays

While the alignment tool does produce an interactive HTML figure showing which genes are found in which genomes, you may want to add your own set of text or color labels.

Using the display generation utility, you can add custom annotations to the genes and genomes in an interactive figure. To generate this display, you need a table of annotations for both genes and genomes. The genome annotation table (CSV) must have a column with the header genome_id which corresponds to the name of the file for each genome. The gene annotation table (CSV) must have a similar column with the header gene_id. A template for each of these annotation tables can be found in the output from the gene alignment utility with the names genome.manifest.csv and gene.manifest.csv. By adding additional columns to these tables, you can build out your own annotations on the resulting figure.
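Adding a column to one of the manifest templates can be done in a spreadsheet editor or with a short script. The sketch below uses only the Python standard library; the function name, the output file name, the `clade` column, and the label values are hypothetical, while `genome_id` and `genome.manifest.csv` are the names given above.

```python
import csv


def add_annotation_column(manifest_path, output_path, id_column,
                          new_column, labels, default=""):
    """Copy a manifest CSV, adding (or overwriting) one annotation column.

    `labels` maps each value of `id_column` to its annotation; rows whose
    ID is not in `labels` receive `default`. Returns the number of rows.
    """
    with open(manifest_path, newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)

    if id_column not in fieldnames:
        raise ValueError(f"Column {id_column!r} not found in {manifest_path}")
    if new_column not in fieldnames:
        fieldnames.append(new_column)

    for row in rows:
        row[new_column] = labels.get(row[id_column], default)

    with open(output_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Example usage (hypothetical labels and output file name):
# add_annotation_column(
#     "genome.manifest.csv", "genome.annotations.csv",
#     id_column="genome_id", new_column="clade",
#     labels={"genomeA.fasta": "clade 1"}, default="unassigned",
# )
```

Writing to a new file rather than overwriting the template keeps the original manifest available in case the annotations need to be rebuilt.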

Complete set of options for rendering interactive displays

render