-
Notifications
You must be signed in to change notification settings - Fork 4
Searching Genomes
Sometimes you just want to identify which genomes may contain a set of
genes of interest, without going through the computationally expensive
effort of performing exhaustive alignment against all of those genomes.
gig-map
provides this functionality by creating compressed searchable
'sketches' of genome collections. Those 'sketches' can be rapidly queried
for presence of a particular gene (or better yet, a set of genes).
This set of tools is most useful for people with large genome collections
which they may want to search in the future, identifying which genomes
in the collection contain some genes of interest.
The concept of a 'sketch' refers to a robust body of work using compressed sequence representations (minmers) to rapidly compare genomes. The foundational instance of minmer-based sketching is a program called Mash, created by an insightful group of researchers including Adam Phillippy, Brian Ondov, Todd Treangen, Sergey Koren, Pall Melsted, and others.
Ondov, B.D., Treangen, T.J., Melsted, P. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17, 132 (2016). https://doi.org/10.1186/s13059-016-0997-x
Based on the utility and popularity of minmer-based searching approaches, a number of other implementations were created and continue to be extremely useful in a number of applications including long-read sequencing, genome assembly, microbial genomics, and metagenomics. Some of those implementations include:
- sourmash - C. Titus Brown
- finsh - Roderick Bovee and Nick Greenfield
- mash screen
The implementation used in gig-map
is the original Mash utility.
However, amino acid sequences are used instead of nucleotide sequences.
This gives the user the ability to search for coding sequences within
large genome collections.
The speed and utility of sketches comes from their fixed size, both in the number of k-mers included in the sketch and the value of k which is used. A more complete explanation of those terms can be found in the mash documentation.
For the purpose of this tool, what the user should know is that sketches can only be compared if they have the same k-mer length as well as the same number of k-mers. Changing either of those values makes the sketches incompatible.
Generally speaking, selecting larger values of k will make the sketch more sequence specific, and smaller values will be more general. Because each match must be exact, any variants that occur within two homologous sequences will cause that k-mer to not be found.
Selecting a larger sketch size (a larger number of k-mers in each sketch) will provide a greater degree of sensitivity with the trade-off of a longer time required to search those sketches. For a sketch with 1,000 k-mers, the greatest degree of sensitivity possible would be 1/1000, or 0.1%. Similarly, by increasing the sketch size to 10,000 it would be possible to detect 1/10,000 matches for a sensitivity of 0.01%.
It may be worth exploring a few different sets of parameters to find the combination which is most appropriate for your use-case.
To index a genome collection, run the sketch_genomes
tool:
Inputs:
-
genomes
: A folder containing the genomes to sketch -
recursive
: Whether or not to include files in subdirectories -
sketch_folder
: Output folder for the combined sketches of all genomes -
sketch_size
: Number of k-mers to include in each sketch -
kmer_size
: Length of k-mers used for sketching -
gencode
: Genetic code to use for conceptual translation of genome sequences
Outputs:
-
combined_genomes.msh
: The combined sketches of all genomes in the collection -
combined_genomes.txt
: Text file containing the full path of all genomes which were indexed
To search an indexed genome collection, run the search_sketches
tools:
Inputs:
-
query
: Protein sequence(s) to search against those genomes (in FASTA format); Multiple files may be specified with wildcard characters -
genome_sketches
: Single file containing all of the genome sketches to search against (combined_genomes.msh
) -
search_results
: Output folder used to write results -
kmer_size
: Length of k-mers used for sketching. Must match the value used to sketch genomes. -
sketch_size
: Number of k-mers to include in the query for each sketch
Outputs:
-
*.dists.csv
: For each of thequery
input sequences, an output file will be created in CSV format with the 'mash_distance' and number of matching k-mers.
The primary output of the search_sketches
tool is a CSV file with the number of matching k-mers as well as
the 'mash_distance' for the comparison. This metric takes into account the differences in sequence size between
the query and the reference in order to estimate the sequence similarity of the original source genomes.
In practical evaluations, the 'mash_distance' is correlated closely with average nucleotide identity, although
many of those assumptions may not be valid when searching a small number of protein coding sequences against
incompletely-annotated bacterial genomes.
For more details on this distance metric, read through the official documentation.
The most prudent use of this tool is to quickly identify specific genomes which may then be searched more exhaustively with the tool for aligning genes to genomes.