Searching Genomes

Sometimes you just want to identify which genomes may contain a set of genes of interest, without going through the computationally expensive effort of performing exhaustive alignment against all of those genomes. gig-map provides this functionality by creating compressed searchable 'sketches' of genome collections. Those 'sketches' can be rapidly queried for presence of a particular gene (or better yet, a set of genes). This set of tools is most useful for people with large genome collections which they may want to search in the future, identifying which genomes in the collection contain some genes of interest.

Using minmers to rapidly search genomes

The concept of a 'sketch' refers to a robust body of work using compressed sequence representations (minmers) to rapidly compare genomes. The foundational instance of minmer-based sketching is a program called Mash, created by an insightful group of researchers including Adam Phillippy, Brian Ondov, Todd Treangen, Sergey Koren, Pall Melsted, and others.

Ondov, B.D., Treangen, T.J., Melsted, P. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17, 132 (2016). https://doi.org/10.1186/s13059-016-0997-x

Based on the utility and popularity of minmer-based searching approaches, a number of other implementations were created and continue to be extremely useful in a number of applications including long-read sequencing, genome assembly, microbial genomics, and metagenomics. Some of those implementations include:

sourmash - C. Titus Brown
finsh - Roderick Bovee and Nick Greenfield
mash screen

The implementation used in gig-map is the original Mash utility. However, amino acid sequences are used instead of nucleotide sequences. This gives the user the ability to search for coding sequences within large genome collections.

Picking sketch parameters

The speed and utility of sketches comes from their fixed size, both in the number of k-mers included in the sketch and the value of k which is used. A more complete explanation of those terms can be found in the mash documentation.

For the purpose of this tool, what the user should know is that sketches can only be compared if they have the same k-mer length as well as the same number of k-mers. Changing either of those values makes the sketches incompatible.

Generally speaking, selecting larger values of k will make the sketch more sequence specific, and smaller values will be more general. Because each match must be exact, any variants that occur within two homologous sequences will cause that k-mer to not be found.

Selecting a larger sketch size (a larger number of k-mers in each sketch) will provide a greater degree of sensitivity with the trade-off of a longer time required to search those sketches. For a sketch with 1,000 k-mers, the greatest degree of sensitivity possible would be 1/1000, or 0.1%. Similarly, by increasing the sketch size to 10,000 it would be possible to detect 1/10,000 matches for a sensitivity of 0.01%.

It may be worth exploring a few different sets of parameters to find the combination which is most appropriate for your use-case.

Sketching genomes

To index a genome collection, run the sketch_genomes tool:

Inputs:

genomes: A folder containing the genomes to sketch
recursive: Whether or not to include files in subdirectories
sketch_folder: Output folder for the combined sketches of all genomes
sketch_size: Number of k-mers to include in each sketch
kmer_size: Length of k-mers used for sketching
gencode: Genetic code to use for conceptual translation of genome sequences

Outputs:

combined_genomes.msh: The combined sketches of all genomes in the collection
combined_genomes.txt: Text file containing the full path of all genomes which were indexed

Searching sketched genomes

To search an indexed genome collection, run the search_sketches tools:

Inputs:

query: Protein sequence(s) to search against those genomes (in FASTA format); Multiple files may be specified with wildcard characters
genome_sketches: Single file containing all of the genome sketches to search against (combined_genomes.msh)
search_results: Output folder used to write results
kmer_size: Length of k-mers used for sketching. Must match the value used to sketch genomes.
sketch_size: Number of k-mers to include in the query for each sketch

Outputs:

*.dists.csv: For each of the query input sequences, an output file will be created in CSV format with the 'mash_distance' and number of matching k-mers.

Mash Distance

The primary output of the search_sketches tool is a CSV file with the number of matching k-mers as well as the 'mash_distance' for the comparison. This metric takes into account the differences in sequence size between the query and the reference in order to estimate the sequence similarity of the original source genomes. In practical evaluations, the 'mash_distance' is correlated closely with average nucleotide identity, although many of those assumptions may not be valid when searching a small number of protein coding sequences against incompletely-annotated bacterial genomes. For more details on this distance metric, read through the official documentation.

The most prudent use of this tool is to quickly identify specific genomes which may then be searched more exhaustively with the tool for aligning genes to genomes.

Provide feedback

Saved searches