mm2-fast is an accelerated implementation of minimap2 on modern CPUs. mm2-fast accelerates all the three major modules of minimap2: (a) seeding, (b) chaining, and (c) pairwise alignment, achieving up to 1.8x speedup using AVX512 over minimap2. mm2-fast is a drop-in replacement of minimap2, providing the same functionality with the exact same output. In the current version, all the modules are optimized using AVX-512 and AVX2 vectorization. Detailed benchmark results are available in our publication in Nature Computational Science (https://www.nature.com/articles/s43588-022-00201-8).
Operating System: Linux
mm2-fast was tested using g++ (GCC) 9.2.0 and icpc version 19.1.3.304
Architecture: x86_64 CPUs with AVX512, AVX2
Memory requirement: ~30GB for human genome
Clone the mm2-fast github repo. The source code can be compiled by using make command. It only takes a few seconds.
git clone --recursive https://github.com/bwa-mem2/mm2-fast.git mm2-fast
cd mm2-fast
make
The usage of mm2-fast is same as minimap2. Here is an example of mapping ONT reads with test data.
./minimap2 -ax map-ont test/MT-human.fa test/MT-orang.fa > mm2-fast_output
As this verion of mm2-fast is an accelerated version of minimap2-v2.24, the output of mm2-fast can be verified against minimap2-v2.24. Note that the optimized chaining in mm2-fast is strictly required to be run with a chaining parameter max-chain-skip=infinity. Note that having parameter max-chain-skip=infinity leads to higher chaining precision. Therefore, for correctness verification, minimap2 should run with a larger value of max-chain-skip parameter. Follow the below steps to verify the accuracy of mm2-fast.
git clone --recursive https://github.com/bwa-mem2/mm2-fast.git mm2-fast
cd mm2-fast && make
./minimap2 -ax map-ont test/MT-human.fa test/MT-orang.fa --max-chain-skip=1000000 > mm2-fast_output
git clone https://github.com/lh3/minimap2.git -b v2.24
cd minimap2 && make
./minimap2 -ax map-ont test/MT-human.fa test/MT-orang.fa --max-chain-skip=1000000 > minimap2_output
The output generated by minimap2 and mm2-fast should match.
diff minimap2_output mm2-fast_output > diff_result
The file diff_result
should be empty, meaning a difference of 0 lines.
The default compilation using make applies two optimizations: vectorized chaining and sequence alignment. The learned-indexes based seeding is disabled by default as it requires availability of Rust. This is because the learned hash-table uses an external training library that runs on Rust. Rust is trivial to install, see https://rustup.rs/ and add its path to .bashrc file. Rust installation only takes a few seconds. Following are the steps to enable learned hash table optimization in mm2-fast:
# Start by building learned hash table index for optimized seeding module
./build_rmi.sh test/MT-human.fa map-ont ##Takes two arguments: 1. path-to-reference-seq-file 2. preset.
##For human genome, this step should take around 2-3 minutes to finish.
# Next, compile and run the mapping phase
make clean && make lhash=1
./minimap2 -ax map-ont test/MT-human.fa test/MT-orang.fa > mm2-fast-lhash_output
To compile mm2-fast with all optimizations turned off and switch back to default minimap2, use the following command during compilation. This could be useful for debugging.
make clean && make no_opt=1
We have observed up to 1.8x speedup across datasets (please refer to the paper for more details). For example, for the randomly sampled 100K reads from "HG002_GM24385_1_2_3_Guppy_3.6.0_prom.fastq.gz", minimap2 takes 92 seconds, while mm2-fast takes 54 seconds to map against the human genome on a 28 cores Intel® Xeon® Platinum 8280 CPUs. Our sampled datasets with 100K reads are available here.
Accelerating minimap2 for long-read sequencing applications on modern CPUs. Saurabh Kalikar, Chirag Jain, Vasimuddin Md, Sanchit Misra. Nat Comput Sci 2, 78–83 (2022). https://doi.org/10.1038/s43588-022-00201-8
The original README content of minimap2 follows.
git clone https://github.com/lh3/minimap2
cd minimap2 && make
# long sequences against a reference genome
./minimap2 -a test/MT-human.fa test/MT-orang.fa > test.sam
# create an index first and then map
./minimap2 -x map-ont -d MT-human-ont.mmi test/MT-human.fa
./minimap2 -a MT-human-ont.mmi test/MT-orang.fa > test.sam
# use presets (no test data)
./minimap2 -ax map-pb ref.fa pacbio.fq.gz > aln.sam # PacBio CLR genomic reads
./minimap2 -ax map-ont ref.fa ont.fq.gz > aln.sam # Oxford Nanopore genomic reads
./minimap2 -ax map-hifi ref.fa pacbio-ccs.fq.gz > aln.sam # PacBio HiFi/CCS genomic reads (v2.19 or later)
./minimap2 -ax asm20 ref.fa pacbio-ccs.fq.gz > aln.sam # PacBio HiFi/CCS genomic reads (v2.18 or earlier)
./minimap2 -ax sr ref.fa read1.fa read2.fa > aln.sam # short genomic paired-end reads
./minimap2 -ax splice ref.fa rna-reads.fa > aln.sam # spliced long reads (strand unknown)
./minimap2 -ax splice -uf -k14 ref.fa reads.fa > aln.sam # noisy Nanopore Direct RNA-seq
./minimap2 -ax splice:hq -uf ref.fa query.fa > aln.sam # Final PacBio Iso-seq or traditional cDNA
./minimap2 -ax splice --junc-bed anno.bed12 ref.fa query.fa > aln.sam # prioritize on annotated junctions
./minimap2 -cx asm5 asm1.fa asm2.fa > aln.paf # intra-species asm-to-asm alignment
./minimap2 -x ava-pb reads.fa reads.fa > overlaps.paf # PacBio read overlap
./minimap2 -x ava-ont reads.fa reads.fa > overlaps.paf # Nanopore read overlap
# man page for detailed command line options
man ./minimap2.1
- Getting Started
- Users' Guide
- Developers' Guide
- Limitations
Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.
For ~10kb noisy reads sequences, minimap2 is tens of times faster than mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more accurate on simulated long reads and produces biologically meaningful alignment ready for downstream analyses. For >100bp Illumina short reads, minimap2 is three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data. Detailed evaluations are available from the minimap2 paper or the preprint.
Minimap2 is optimized for x86-64 CPUs. You can acquire precompiled binaries from the release page with:
curl -L https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24_x64-linux.tar.bz2 | tar -jxvf -
./minimap2-2.24_x64-linux/minimap2
If you want to compile from the source, you need to have a C compiler, GNU make
and zlib development files installed. Then type make
in the source code
directory to compile. If you see compilation errors, try make sse2only=1
to disable SSE4 code, which will make minimap2 slightly slower.
Minimap2 also works with ARM CPUs supporting the NEON instruction sets. To
compile for 32 bit ARM architectures (such as ARMv7), use make arm_neon=1
. To
compile for for 64 bit ARM architectures (such as ARMv8), use make arm_neon=1 aarch64=1
.
Minimap2 can use SIMD Everywhere (SIMDe) library for porting
implementation to the different SIMD instruction sets. To compile using SIMDe,
use make -f Makefile.simde
. To compile for ARM CPUs, use Makefile.simde
with the ARM related command lines given above.
Without any options, minimap2 takes a reference database and a query sequence file as input and produce approximate mapping, without base-level alignment (i.e. coordinates are only approximate and no CIGAR in output), in the PAF format:
minimap2 ref.fa query.fq > approx-mapping.paf
You can ask minimap2 to generate CIGAR at the cg
tag of PAF with:
minimap2 -c ref.fa query.fq > alignment.paf
or to output alignments in the SAM format:
minimap2 -a ref.fa query.fq > alignment.sam
Minimap2 seamlessly works with gzip'd FASTA and FASTQ formats as input. You don't need to convert between FASTA and FASTQ or decompress gzip'd files first.
For the human reference genome, minimap2 takes a few minutes to generate a minimizer index for the reference before mapping. To reduce indexing time, you can optionally save the index with option -d and replace the reference sequence file with the index file on the minimap2 command line:
minimap2 -d ref.mmi ref.fa # indexing
minimap2 -a ref.mmi reads.fq > alignment.sam # alignment
Importantly, it should be noted that once you build the index, indexing parameters such as -k, -w, -H and -I can't be changed during mapping. If you are running minimap2 for different data types, you will probably need to keep multiple indexes generated with different parameters. This makes minimap2 different from BWA which always uses the same index regardless of query data types.
Minimap2 uses the same base algorithm for all applications. However, due to the
different data types it supports (e.g. short vs long reads; DNA vs mRNA reads),
minimap2 needs to be tuned for optimal performance and accuracy. It is usually
recommended to choose a preset with option -x, which sets multiple
parameters at the same time. The default setting is the same as map-ont
.
minimap2 -ax map-pb ref.fa pacbio-reads.fq > aln.sam # for PacBio CLR reads
minimap2 -ax map-ont ref.fa ont-reads.fq > aln.sam # for Oxford Nanopore reads
The difference between map-pb
and map-ont
is that map-pb
uses
homopolymer-compressed (HPC) minimizers as seeds, while map-ont
uses ordinary
minimizers as seeds. Emperical evaluation suggests HPC minimizers improve
performance and sensitivity when aligning PacBio CLR reads, but hurt when aligning
Nanopore reads.
minimap2 -ax splice:hq -uf ref.fa iso-seq.fq > aln.sam # PacBio Iso-seq/traditional cDNA
minimap2 -ax splice ref.fa nanopore-cdna.fa > aln.sam # Nanopore 2D cDNA-seq
minimap2 -ax splice -uf -k14 ref.fa direct-rna.fq > aln.sam # Nanopore Direct RNA-seq
minimap2 -ax splice --splice-flank=no SIRV.fa SIRV-seq.fa # mapping against SIRV control
There are different long-read RNA-seq technologies, including tranditional
full-length cDNA, EST, PacBio Iso-seq, Nanopore 2D cDNA-seq and Direct RNA-seq.
They produce data of varying quality and properties. By default, -x splice
assumes the read orientation relative to the transcript strand is unknown. It
tries two rounds of alignment to infer the orientation and write the strand to
the ts
SAM/PAF tag if possible. For Iso-seq, Direct RNA-seq and tranditional
full-length cDNAs, it would be desired to apply -u f
to force minimap2 to
consider the forward transcript strand only. This speeds up alignment with
slight improvement to accuracy. For noisy Nanopore Direct RNA-seq reads, it is
recommended to use a smaller k-mer size for increased sensitivity to the first
or the last exons.
Minimap2 rates an alignment by the score of the max-scoring sub-segment, excluding introns, and marks the best alignment as primary in SAM. When a spliced gene also has unspliced pseudogenes, minimap2 does not intentionally prefer spliced alignment, though in practice it more often marks the spliced alignment as the primary. By default, minimap2 outputs up to five secondary alignments (i.e. likely pseudogenes in the context of RNA-seq mapping). This can be tuned with option -N.
For long RNA-seq reads, minimap2 may produce chimeric alignments potentially caused by gene fusions/structural variations or by an intron longer than the max intron length -G (200k by default). For now, it is not recommended to apply an excessively large -G as this slows down minimap2 and sometimes leads to false alignments.
It is worth noting that by default -x splice
prefers GT[A/G]..[C/T]AG
over GT[C/T]..[A/G]AG, and then over other splicing signals. Considering
one additional base improves the junction accuracy for noisy reads, but
reduces the accuracy when aligning against the widely used SIRV control data.
This is because SIRV does not honor the evolutionarily conservative splicing
signal. If you are studying SIRV, you may apply --splice-flank=no
to let
minimap2 only model GT..AG, ignoring the additional base.
Since v2.17, minimap2 can optionally take annotated genes as input and prioritize on annotated splice junctions. To use this feature, you can
paftools.js gff2bed anno.gff > anno.bed
minimap2 -ax splice --junc-bed anno.bed ref.fa query.fa > aln.sam
Here, anno.gff
is the gene annotation in the GTF or GFF3 format (gff2bed
automatically tests the format). The output of gff2bed
is in the 12-column
BED format, or the BED12 format. With the --junc-bed
option, minimap2 adds a
bonus score (tuned by --junc-bonus
) if an aligned junction matches a junction
in the annotation. Option --junc-bed
also takes 5-column BED, including the
strand field. In this case, each line indicates an oriented junction.
minimap2 -x ava-pb reads.fq reads.fq > ovlp.paf # PacBio CLR read overlap
minimap2 -x ava-ont reads.fq reads.fq > ovlp.paf # Oxford Nanopore read overlap
Similarly, ava-pb
uses HPC minimizers while ava-ont
uses ordinary
minimizers. It is usually not recommended to perform base-level alignment in
the overlapping mode because it is slow and may produce false positive
overlaps. However, if performance is not a concern, you may try to add -a
or
-c
anyway.
minimap2 -ax sr ref.fa reads-se.fq > aln.sam # single-end alignment
minimap2 -ax sr ref.fa read1.fq read2.fq > aln.sam # paired-end alignment
minimap2 -ax sr ref.fa reads-interleaved.fq > aln.sam # paired-end alignment
When two read files are specified, minimap2 reads from each file in turn and
merge them into an interleaved stream internally. Two reads are considered to
be paired if they are adjacent in the input stream and have the same name (with
the /[0-9]
suffix trimmed if present). Single- and paired-end reads can be
mixed.
Minimap2 does not work well with short spliced reads. There are many capable RNA-seq mappers for short reads.
minimap2 -ax asm5 ref.fa asm.fa > aln.sam # assembly to assembly/ref alignment
For cross-species full-genome alignment, the scoring system needs to be tuned according to the sequence divergence.
Due to a design flaw, BAM does not work with CIGAR strings with >65535 operations (SAM and CRAM work). However, for ultra-long nanopore reads minimap2 may align ~1% of read bases with long CIGARs beyond the capability of BAM. If you convert such SAM/CRAM to BAM, Picard and recent samtools will throw an error and abort. Older samtools and other tools may create corrupted BAM.
To avoid this issue, you can add option -L
at the minimap2 command line.
This option moves a long CIGAR to the CG
tag and leaves a fully clipped CIGAR
at the SAM CIGAR column. Current tools that don't read CIGAR (e.g. merging and
sorting) still work with such BAM records; tools that read CIGAR will
effectively ignore these records. It has been decided that future tools
will seamlessly recognize long-cigar records generated by option -L
.
TL;DR: if you work with ultra-long reads and use tools that only process
BAM files, please add option -L
.
The cs
SAM/PAF tag encodes bases at mismatches and INDELs. It matches regular
expression /(:[0-9]+|\*[a-z][a-z]|[=\+\-][A-Za-z]+)+/
. Like CIGAR, cs
consists of series of operations. Each leading character specifies the
operation; the following sequence is the one involved in the operation.
The cs
tag is enabled by command line option --cs
. The following alignment,
for example:
CGATCGATAAATAGAGTAG---GAATAGCA
|||||| |||||||||| |||| |||
CGATCG---AATAGAGTAGGTCGAATtGCA
is represented as :6-ata:10+gtc:4*at:3
, where :[0-9]+
represents an
identical block, -ata
represents a deletion, +gtc
an insertion and *at
indicates reference base a
is substituted with a query base t
. It is
similar to the MD
SAM tag but is standalone and easier to parse.
If --cs=long
is used, the cs
string also contains identical sequences in
the alignment. The above example will become
=CGATCG-ata=AATAGAGTAG+gtc=GAAT*at=GCA
. The long form of cs
encodes both
reference and query sequences in one string. The cs
tag also encodes intron
positions and splicing signals (see the minimap2 manpage for
details).
Minimap2 also comes with a (java)script paftools.js that processes alignments in the PAF format. It calls variants from assembly-to-reference alignment, lifts over BED files based on alignment, converts between formats and provides utilities for various evaluations. For details, please see misc/README.md.
In the following, minimap2 command line options have a dash ahead and are highlighted in bold. The description may help to tune minimap2 parameters.
-
Read -I [=4G] reference bases, extract (-k,-w)-minimizers and index them in a hash table.
-
Read -K [=200M] query bases. For each query sequence, do step 3 through 7:
-
For each (-k,-w)-minimizer on the query, check against the reference index. If a reference minimizer is not among the top -f [=2e-4] most frequent, collect its the occurrences in the reference, which are called seeds.
-
Sort seeds by position in the reference. Chain them with dynamic programming. Each chain represents a potential mapping. For read overlapping, report all chains and then go to step 8. For reference mapping, do step 5 through 7:
-
Let P be the set of primary mappings, which is an empty set initially. For each chain from the best to the worst according to their chaining scores: if on the query, the chain overlaps with a chain in P by --mask-level [=0.5] or higher fraction of the shorter chain, mark the chain as secondary to the chain in P; otherwise, add the chain to P.
-
Retain all primary mappings. Also retain up to -N [=5] top secondary mappings if their chaining scores are higher than -p [=0.8] of their corresponding primary mappings.
-
If alignment is requested, filter out an internal seed if it potentially leads to both a long insertion and a long deletion. Extend from the left-most seed. Perform global alignments between internal seeds. Split the chain if the accumulative score along the global alignment drops by -z [=400], disregarding long gaps. Extend from the right-most seed. Output chains and their alignments.
-
If there are more query sequences in the input, go to step 2 until no more queries are left.
-
If there are more reference sequences, reopen the query file from the start and go to step 1; otherwise stop.
Manpage minimap2.1 provides detailed description of minimap2 command line options and optional tags. The FAQ page answers several frequently asked questions. If you encounter bugs or have further questions or requests, you can raise an issue at the issue page. There is not a specific mailing list for the time being.
If you use minimap2 in your work, please cite:
Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. doi:10.1093/bioinformatics/bty191
and/or:
Li, H. (2021). New strategies to improve minimap2 alignment accuracy. Bioinformatics, 37:4572-4574. doi:10.1093/bioinformatics/btab705
Minimap2 is not only a command line tool, but also a programming library. It provides C APIs to build/load index and to align sequences against the index. File example.c demonstrates typical uses of C APIs. Header file minimap.h gives more detailed API documentation. Minimap2 aims to keep APIs in this header stable. File mmpriv.h contains additional private APIs which may be subjected to changes frequently.
This repository also provides Python bindings to a subset of C APIs. File
python/README.rst gives the full documentation;
python/minimap2.py shows an example. This Python
extension, mappy, is also available from PyPI via pip install mappy
or from BioConda via conda install -c bioconda mappy
.
-
Minimap2 may produce suboptimal alignments through long low-complexity regions where seed positions may be suboptimal. This should not be a big concern because even the optimal alignment may be wrong in such regions.
-
Minimap2 requires SSE2 instructions on x86 CPUs or NEON on ARM CPUs. It is possible to add non-SIMD support, but it would make minimap2 slower by several times.
-
Minimap2 does not work with a single query or database sequence ~2 billion bases or longer (2,147,483,647 to be exact). The total length of all sequences can well exceed this threshold.
-
Minimap2 often misses small exons.