diff --git a/README.md b/README.md
index 93034ae..16c6eb2 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@ MashMap
 [![BioConda Install](https://img.shields.io/conda/dn/bioconda/mashmap.svg?style=flag&label=BioConda%20install)](https://anaconda.org/bioconda/mashmap)
 [![GitHub Downloads](https://img.shields.io/github/downloads/marbl/MashMap/total.svg?style=social&logo=github&label=Download)](https://github.com/marbl/MashMap/releases)
 
-MashMap implements a fast and approximate algorithm for computing local alignment boundaries between long DNA sequences. It can be useful for mapping genome assembly or long reads (PacBio/ONT) to reference genome(s). Given a minimum alignment length and an identity threshold for the desired local alignments, Mashmap computes alignment boundaries and identity estimates using *k*-mers. It does not compute the alignments explicitly, but rather estimates an unbiased *k*-mer based [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) using a combination of minmers (a novel winnowing scheme) and [MinHash](https://en.wikipedia.org/wiki/MinHash). This is then converted to an estimate of sequence identity using the [Mash](http://mash.readthedocs.org) distance. An appropriate *k*-mer sampling rate is automatically determined using the given minimum local alignment length and identity thresholds. **The automatic sampling rate has increased relative to MashMap2, resulting in more accurate mappings and ANI prediction at the cost of more RAM**. The efficiency of the algorithm improves as both of these thresholds are increased.
+MashMap implements a fast and approximate algorithm for computing local alignment boundaries between long DNA sequences. It can be useful for mapping genome assembly or long reads (PacBio/ONT) to reference genome(s). Given a minimum alignment length and an identity threshold for the desired local alignments, Mashmap computes alignment boundaries and identity estimates using *k*-mers. It does not compute the alignments explicitly, but rather estimates an unbiased *k*-mer based [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) using a combination of minmers (a novel winnowing scheme) and [MinHash](https://en.wikipedia.org/wiki/MinHash). This is then converted to an estimate of sequence identity using the [Mash](http://mash.readthedocs.org) distance. An appropriate *k*-mer sampling rate is automatically determined using the given minimum local alignment length and identity thresholds. **The automatic sampling rate has increased relative to MashMap2, resulting in more accurate mappings and ANI prediction at the cost of more RAM**.
 
 As an example, Mashmap can map a human genome assembly to the human reference genome in about one minute total execution time and < 4 GB memory using just 8 CPU threads, achieving more than an order of magnitude improvement in both runtime and memory over alternative methods. We describe the algorithms associated with Mashmap, and report on speed, scalability, and accuracy of the software in the publications listed [below](#publications). Unlike traditional mappers, MashMap does not compute exact sequence alignments. In future, we plan to add an optional alignment support to generate base-to-base alignments.
 