Tutorial
Before starting this tutorial, make sure to follow the
steps for getting started with all of the software
needed to run the gig-map
analysis tools on your system.
In order for gig-map
to align genes against a collection of genomes,
each genome must be formatted in its own FASTA file (with nucleotide sequences).
A genome may be made up of multiple records within the FASTA, but each genome
must be in its own file.
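As a quick sanity check before running any analysis, the sketch below parses one genome FASTA and tests whether every record looks like a nucleotide sequence. It is not part of gig-map; the function names `read_fasta` and `looks_like_nucleotide` and the example records are hypothetical, and the alphabet check is only a heuristic (a short protein made up of letters such as A, C, and G would also pass).

```python
from typing import Dict, Iterable

# IUPAC nucleotide codes, including ambiguity codes and the gap character
NUCLEOTIDE_CODES = set("ACGTUNRYKMSWBDHV-")


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} dictionary."""
    records: Dict[str, str] = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are concatenated onto the most recent record
            records[current_id] += line
    return records


def looks_like_nucleotide(records: Dict[str, str]) -> bool:
    """Return True if there is at least one record and all use nucleotide codes."""
    return bool(records) and all(
        set(seq.upper()) <= NUCLEOTIDE_CODES for seq in records.values()
    )


# Hypothetical genome made up of two records in a single file
example_genome = [
    ">contig_1 example chromosome",
    "ACGTACGTNN",
    "ACGT",
    ">contig_2 example plasmid",
    "GGCCTTAA",
]
genome_records = read_fasta(example_genome)
print(len(genome_records), looks_like_nucleotide(genome_records))
```

The same parser can be pointed at a real file by passing an open file handle in place of the example list.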
If you do not already have a collection of genomes to align against, gig-map
provides a utility for downloading a set
of genomes from NCBI. To download these public sequences, first go to
the NCBI genome portal
to search for your genomes of interest, and then download the CSV manifest
describing that set of genomes.
The NCBI Genome CSV file can be used as an input for the genome download utility, which will output a folder full of FASTA files.
In order for gig-map
to align a collection of genes against a
collection of genomes, the genes must be formatted in one or more
FASTA files containing amino acid (not nucleotide) sequences.
Each of those FASTA files should also be gzip-compressed.
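If your gene FASTA files are not yet compressed, one way to gzip them is with the Python standard library, as sketched below. The helper name `gzip_fasta`, the file name `genes.faa`, and the example sequence are hypothetical and are not taken from gig-map.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_fasta(fasta_path: Path) -> Path:
    """Write a gzip-compressed copy of fasta_path next to it, adding a .gz suffix."""
    gz_path = fasta_path.with_name(fasta_path.name + ".gz")
    with open(fasta_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


# Demonstration in a temporary directory with a hypothetical protein FASTA
with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "genes.faa"
    fasta.write_text(">gene_1\nMKVLAAGIW\n")
    compressed = gzip_fasta(fasta)
    compressed_name = compressed.name
    # Read the compressed file back to confirm the contents are unchanged
    with gzip.open(compressed, "rt") as handle:
        round_trip = handle.read()

print(compressed_name)
```

The original uncompressed file is left in place, so you can delete it once you have confirmed the compressed copy is readable.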
If you do not already have a collection of genes to align, gig-map
provides a utility for downloading the genes contained within a set
of genomes in NCBI. To download these public sequences, first go to
the NCBI genome portal
to search for your genomes of interest, and then download the CSV manifest
describing that set of genomes.
The NCBI Genome CSV file can be used as an input for the gene download utility, which will output a folder full of FASTA files.
Before aligning a collection of genes, it is a good idea to filter out
any duplicated sequences. When gig-map
aligns a set of genes against
each genome, it only keeps the best hit for any genomic region. If there
are two identical sequences in your query set of genes, only one of them
will be seen in the results, which may be unexpected.
To confirm that your query set of genes does not contain any duplicates, you can use the gene deduplication utility. This utility is also helpful for combining genes from multiple genomes, keeping only the unique coding sequences.
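To illustrate what deduplication means here, the following sketch keeps only the first record for each distinct sequence. It handles exact duplicates only and is an assumption-laden illustration, not the method used by the gene deduplication utility; the function name `deduplicate` and the example gene IDs are hypothetical.

```python
from typing import Dict


def deduplicate(records: Dict[str, str]) -> Dict[str, str]:
    """Keep the first record seen for each distinct sequence (case-insensitive)."""
    seen = set()
    unique: Dict[str, str] = {}
    for record_id, sequence in records.items():
        key = sequence.upper()
        if key in seen:
            # An identical sequence was already kept under another ID
            continue
        seen.add(key)
        unique[record_id] = sequence
    return unique


# Hypothetical genes pooled from two genomes; two of them are identical
pooled_genes = {
    "genomeA_gene_1": "MKVLAAGIW",
    "genomeB_gene_7": "MKVLAAGIW",
    "genomeB_gene_8": "MSTNPKPQR",
}
unique_genes = deduplicate(pooled_genes)
print(list(unique_genes))
```

Note that the ID of a dropped duplicate no longer appears in the query set, which is worth keeping in mind when you later look up a gene by name in the results.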
Now that you have (1) a folder full of genome FASTAs and (2) a deduplicated collection of genes in a single FASTA, you are ready to align the genes to the genomes.
Use the gene alignment utility to run the alignment, either keeping the default level of stringency or providing your own custom values for the minimum identity and coverage thresholds as appropriate.
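The effect of the two thresholds can be pictured with a small filter over alignment hits, sketched below. The `Hit` fields, the example hits, and the threshold values of 90 and 80 are all hypothetical and chosen only for illustration; they are not gig-map's defaults or its output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    gene_id: str
    genome_id: str
    identity: float  # percent of aligned positions that match
    coverage: float  # percent of the gene length covered by the alignment


def filter_hits(hits: List[Hit], min_identity: float, min_coverage: float) -> List[Hit]:
    """Keep only hits that meet both the identity and the coverage threshold."""
    return [
        hit
        for hit in hits
        if hit.identity >= min_identity and hit.coverage >= min_coverage
    ]


# Hypothetical hits: one passes, one fails on identity, one fails on coverage
hits = [
    Hit("gene_1", "genomeA", identity=98.5, coverage=100.0),
    Hit("gene_2", "genomeA", identity=72.0, coverage=95.0),
    Hit("gene_3", "genomeB", identity=95.0, coverage=40.0),
]
kept = filter_hits(hits, min_identity=90.0, min_coverage=80.0)
print([hit.gene_id for hit in kept])
```

Raising either threshold makes the alignment more stringent, so fewer and closer matches are reported; lowering them admits more divergent or partial matches.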
While the alignment tool does produce an interactive HTML figure showing which genes are found in which genomes, you may want to add in your own set of text or color labels.
Using the display generation utility, you can
add custom annotations to the genes and genomes in an interactive figure.
To generate this display, you need a table of annotations for both genes
and genomes. The genome annotation table (CSV) must have a column with
the header genome_id, which corresponds to the name of the file for each
genome. The gene annotation table (CSV) must have a similar column with
the header gene_id. A template for each of these annotation tables can be
found in the output from the gene alignment utility with the names
genome.manifest.csv and gene.manifest.csv. By adding additional
columns to these tables, you can build out your own annotations on the resulting
figure.
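Adding a column to one of the manifest tables can be done in a spreadsheet or with a few lines of Python, as in the sketch below. Only the `genome_id` header comes from the text above; the helper name `add_column`, the `species` column, and the example file names and values are hypothetical.

```python
import csv
import io
from typing import Dict


def add_column(
    manifest_text: str,
    id_column: str,
    new_column: str,
    values: Dict[str, str],
    default: str = "",
) -> str:
    """Return the CSV text with new_column appended, looked up by id_column."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    fieldnames = list(reader.fieldnames or [])
    if id_column not in fieldnames:
        raise ValueError(f"Column {id_column!r} not found in the manifest")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames + [new_column], lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Rows without an entry in values receive the default annotation
        row[new_column] = values.get(row[id_column], default)
        writer.writerow(row)
    return out.getvalue()


# Hypothetical genome manifest with only the required ID column
manifest = "genome_id\ngenomeA.fasta\ngenomeB.fasta\n"
annotated = add_column(
    manifest,
    id_column="genome_id",
    new_column="species",
    values={"genomeA.fasta": "Escherichia coli"},
    default="unknown",
)
print(annotated)
```

The same helper works for the gene table by passing `gene_id` as the ID column. To apply it to a real file, read the manifest with `Path.read_text()` and write the returned string back out.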