Merge remote-tracking branch 'origin/UniqFeature'

DerrickWood · Oct 13, 2020 · 809aff4 · 809aff4
2 parents a2d4a44 + 87267f8
commit 809aff4
Show file tree

Hide file tree

Showing 14 changed files with 1,458 additions and 73 deletions.
diff --git a/docs/MANUAL.html b/docs/MANUAL.html
@@ -39,6 +39,7 @@ <h1>Table of Contents</h1>
 </ul></li>
 <li><a href="#confidence-scoring">Confidence Scoring</a></li>
 <li><a href="#inspecting-a-kraken-2-databases-contents">Inspecting a Kraken 2 Database's Contents</a></li>
+<li><a href="#distinct-minimizer-count-information">Distinct minimizer count information</a></li>
 <li><a href="#kraken-2-environment-variables">Kraken 2 Environment Variables</a></li>
 </ul>
 </div>
@@ -210,7 +211,7 @@ <h1 id="custom-databases">Custom Databases</h1>
 <pre><code>kraken2-build --build --db $DBNAME</code></pre></li>
 </ol>
 <p>The <code>--threads</code> option is also helpful here to reduce build time.</p>
-<p>By default, the values of <span class="math inline"><em>k</em></span> and <span class="math inline">ℓ</span> are 35 and 31, respectively (or 15 and 12 for protein databases). These values can be explicitly set with the <code>--kmer-len</code> and <code>minimizer-len</code> options, however. Note that the minimizer length must be no more than 31 for nucleotide databases, and 15 for protein databases. Additionally, the minimizer length <span class="math inline">ℓ</span> must be no more than the <span class="math inline"><em>k</em></span>-mer length. There is no upper bound on the value of <span class="math inline"><em>k</em></span>, but sequences less than <span class="math inline"><em>k</em></span> bp in length cannot be classified.</p>
+<p>By default, the values of <span class="math inline"><em>k</em></span> and <span class="math inline">ℓ</span> are 35 and 31, respectively (or 15 and 12 for protein databases). These values can be explicitly set with the <code>--kmer-len</code> and <code>--minimizer-len</code> options, however. Note that the minimizer length must be no more than 31 for nucleotide databases, and 15 for protein databases. Additionally, the minimizer length <span class="math inline">ℓ</span> must be no more than the <span class="math inline"><em>k</em></span>-mer length. There is no upper bound on the value of <span class="math inline"><em>k</em></span>, but sequences less than <span class="math inline"><em>k</em></span> bp in length cannot be classified.</p>
 <p>Kraken 2 also utilizes a simple spaced seed approach to increase accuracy. A number <span class="math inline"><em>s</em></span> &lt; <span class="math inline">ℓ</span>/4 can be chosen, and <span class="math inline"><em>s</em></span> positions in the minimizer will be masked out during all comparisons. Masked positions are chosen to alternate from the second-to-last position in the minimizer; e.g., <span class="math inline"><em>s</em></span> = 5 and <span class="math inline">ℓ</span> = 31 will result in masking out the 0 positions shown here:</p>
 <pre><code>   111 1111 1111 1111 1111 1101 0101 0101</code></pre>
 <p>By default, <span class="math inline"><em>s</em></span> = 7 for nucleotide databases, and <span class="math inline"><em>s</em></span> = 0 for protein databases. This can be changed using the <code>--minimizer-spaces</code> option along with the <code>--build</code> task of <code>kraken2-build</code>.</p>
@@ -255,8 +256,36 @@ <h1 id="inspecting-a-kraken-2-databases-contents">Inspecting a Kraken 2 Database
  43.89% 777062062       1312736 P       1224          Proteobacteria
  18.62% 329590216       555667  C       1236            Gammaproteobacteria</code></pre>
 <p>This output indicates that 555667 of the minimizers in the database map directly to the Gammaproteobacteria class (taxid #1236), and 329590216 (18.62%) of the database's minimizers map to a taxon in the clade rooted at Gammaproteobacteria. For more information on <code>kraken2-inspect</code>'s options, use its <code>--help</code> option.</p>
+<h1 id="distinct-minimizer-count-information">Distinct minimizer count information</h1>
+<p>The <a href="https://github.com/fbreitwieser/krakenuniq">KrakenUniq</a> project extended Kraken 1 by, among other things, reporting an estimate of the number of distinct k-mers associated with each taxon in the input sequencing data. This allows users to better determine if Kraken's classifications are due to reads distributed throughout a reference genome, or due to only a small segment of a reference genome (and therefore likely false positive).</p>
+<p>Thanks to the generosity of KrakenUniq's developer Florian Breitwieser in allowing parts of the KrakenUniq source code to be licensed under Kraken 2's MIT license, this distinct counting estimation is now available in Kraken 2. Development work by Martin Steinegger and Ben Langmead helped bring this functionality to Kraken 2.</p>
+<p>At present, this functionality is an optional <em>experimental feature</em> -- meaning that we may later alter it in a way that is not backwards compatible with previous versions of the feature.</p>
+<p>To use this functionality, simply run the <code>kraken2</code> script with the additional <code>--report-minimizer-data</code> flag along with <code>--report</code>, e.g.:</p>
+<pre><code>kraken2 --db $DBNAME --report k2_report.txt --report-minimizer-data \
+    --output k2_output.txt sequence_data.fq</code></pre>
+<p>This will put the standard Kraken 2 output (formatted as described in <a href="#standard-kraken-output-format">Standard Kraken Output Format</a>) in <code>k2_output.txt</code> and the report information in <code>k2_report.txt</code>. Within the report file, two additional columns will be present, e.g.:</p>
+<p><strong>normal report format</strong>:</p>
+<pre><code>36.40   182 182 S2  211044                        Influenza A virus (A/Puerto Rico/8/1934(H1N1))</code></pre>
+<p><strong>modified report format</strong>:</p>
+<pre><code>36.40   182 182 1688    18  S2  211044                        Influenza A virus (A/Puerto Rico/8/1934(H1N1))</code></pre>
+<p>In this modified report format, the two new columns are the fourth and fifth, respectively representing the number of minimizers found to be associated with a taxon in the read sequences (1688), and the estimate of the number of distinct minimizers associated with a taxon in the read sequence data (18). This would indicate that although 182 reads were classified as belonging to H1N1 influenza, only 18 distinct minimizers led to those 182 classifications.</p>
+<p>The format with the <code>--report-minimizer-data</code> flag, then, is similar to that described in <a href="#sample-report-output-format">Sample Report Output Format</a>, but slightly different. The fields in this new format, from left-to-right, are:</p>
+<ol style="list-style-type: decimal">
+<li>Percentage of fragments covered by the clade rooted at this taxon</li>
+<li>Number of fragments covered by the clade rooted at this taxon</li>
+<li>Number of fragments assigned directly to this taxon</li>
+<li>Number of minimizers in read data associated with this taxon (<strong>new</strong>)</li>
+<li>An estimate of the number of distinct minimizers in read data associated with this taxon (<strong>new</strong>)</li>
+<li>A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., &quot;G2&quot; is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.</li>
+<li>NCBI taxonomic ID number</li>
+<li>Indented scientific name</li>
+</ol>
+<p>We decided to make this an optional feature so as not to break existing software that processes Kraken 2's standard report format. However, this new format can be converted to the standard report format with the command:</p>
+<pre><code>cut -f1-3,6-8 k2_new_report.txt &gt; k2_std_report.txt</code></pre>
+<p>As noted above, this is an <em>experimental feature</em>. We intend to continue development on this feature, and may change the new format and/or its information if we determine it to be necessary.</p>
+<p>For background on the data structures used in this feature and their interaction with Kraken, please read the <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1568-0">KrakenUniq paper</a>, and please cite that paper if you use this functionality as part of your work.</p>
 <h1 id="kraken-2-environment-variables">Kraken 2 Environment Variables</h1>
-<p>The <code>kraken2</code> and <code>kraken2-inpsect</code> scripts supports the use of some environment variables to help in reducing command line lengths:</p>
+<p>The <code>kraken2</code> and <code>kraken2-inspect</code> scripts supports the use of some environment variables to help in reducing command line lengths:</p>
 <ul>
 <li><p><strong><code>KRAKEN2_NUM_THREADS</code></strong>: if the <code>--threads</code> option is not supplied to <code>kraken2</code>, then the value of this variable (if it is set) will be used as the number of threads to run <code>kraken2</code>. (This variable does not affect <code>kraken2-inspect</code>.)</p></li>
 <li><p><strong><code>KRAKEN2_DB_PATH</code></strong>: much like the <code>PATH</code> variable is used for executables by your shell, <code>KRAKEN2_DB_PATH</code> is a colon-separated list of directories that will be searched for the database you name if the named database does not have a slash (<code>/</code>) character. By default, Kraken 2 assumes the value of this variable is &quot;<code>.</code>&quot; (i.e., the current working directory). This variable can be used to create one (or more) central repositories of Kraken databases in a multi-user system. Example usage in bash:</p>

diff --git a/docs/MANUAL.markdown b/docs/MANUAL.markdown
@@ -526,7 +526,7 @@ To build a custom database:
 
    By default, the values of $k$ and $\ell$ are 35 and 31, respectively (or
    15 and 12 for protein databases).  These values can be explicitly set
-   with the `--kmer-len` and `minimizer-len` options, however.  Note that
+   with the `--kmer-len` and `--minimizer-len` options, however.  Note that
    the minimizer length must be no more than 31 for nucleotide databases,
    and 15 for protein databases.  Additionally, the minimizer length $\ell$
    must be no more than the $k$-mer length.  There is no upper bound on
@@ -682,10 +682,94 @@ Gammaproteobacteria.  For more information on `kraken2-inspect`'s options,
 use its `--help` option.
 
 
+Distinct minimizer count information
+====================================
+
+The [KrakenUniq] project extended Kraken 1 by, among other things, reporting
+an estimate of the number of distinct k-mers associated with each taxon in the
+input sequencing data.  This allows users to better determine if Kraken's
+classifications are due to reads distributed throughout a reference genome,
+or due to only a small segment of a reference genome (and therefore likely
+false positive).
+
+Thanks to the generosity of KrakenUniq's developer Florian Breitwieser in
+allowing parts of the KrakenUniq source code to be licensed under Kraken 2's
+MIT license, this distinct counting estimation is now available in Kraken 2.
+Development work by Martin Steinegger and Ben Langmead helped bring this
+functionality to Kraken 2.
+
+At present, this functionality is an optional *experimental feature* -- meaning
+that we may later alter it in a way that is not backwards compatible with
+previous versions of the feature.
+
+To use this functionality, simply run the `kraken2` script with the additional
+`--report-minimizer-data` flag along with `--report`, e.g.:
+
+    kraken2 --db $DBNAME --report k2_report.txt --report-minimizer-data \
+        --output k2_output.txt sequence_data.fq
+
+This will put the standard Kraken 2 output (formatted as described in
+[Standard Kraken Output Format]) in `k2_output.txt` and the report information
+in `k2_report.txt`.  Within the report file, two additional columns will be
+present, e.g.:
+
+**normal report format**:
+
+    36.40	182	182	S2	211044	                      Influenza A virus (A/Puerto Rico/8/1934(H1N1))
+
+**modified report format**:
+
+    36.40	182	182	1688	18	S2	211044	                      Influenza A virus (A/Puerto Rico/8/1934(H1N1))
+
+In this modified report format, the two new columns are the fourth and fifth,
+respectively representing the number of minimizers found to be associated with
+a taxon in the read sequences (1688), and the estimate of the number of distinct
+minimizers associated with a taxon in the read sequence data (18).  This would
+indicate that although 182 reads were classified as belonging to H1N1 influenza,
+only 18 distinct minimizers led to those 182 classifications.
+
+The format with the `--report-minimizer-data` flag, then, is similar to that
+described in [Sample Report Output Format], but slightly different.  The fields
+in this new format, from left-to-right, are:
+
+1. Percentage of fragments covered by the clade rooted at this taxon
+2. Number of fragments covered by the clade rooted at this taxon
+3. Number of fragments assigned directly to this taxon
+4. Number of minimizers in read data associated with this taxon (**new**)
+5. An estimate of the number of distinct minimizers in read data associated
+   with this taxon (**new**)
+6. A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom,
+   (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies.
+   Taxa that are not at any of these 10 ranks have a rank code that is
+   formed by using the rank code of the closest ancestor rank with
+   a number indicating the distance from that rank.  E.g., "G2" is a
+   rank code indicating a taxon is between genus and species and the
+   grandparent taxon is at the genus rank.
+7. NCBI taxonomic ID number
+8. Indented scientific name
+
+We decided to make this an optional feature so as not to break existing
+software that processes Kraken 2's standard report format.  However, this
+new format can be converted to the standard report format with the command:
+
+    cut -f1-3,6-8 k2_new_report.txt > k2_std_report.txt
+
+As noted above, this is an *experimental feature*.  We intend to continue
+development on this feature, and may change the new format and/or its
+information if we determine it to be necessary.
+
+For background on the data structures used in this feature and their
+interaction with Kraken, please read the [KrakenUniq paper], and please
+cite that paper if you use this functionality as part of your work.
+
+[KrakenUniq]: https://github.com/fbreitwieser/krakenuniq
+[KrakenUniq paper]: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1568-0
+
+
 Kraken 2 Environment Variables
 ==============================
 
-The `kraken2` and `kraken2-inpsect` scripts supports the use of some
+The `kraken2` and `kraken2-inspect` scripts supports the use of some
 environment variables to help in reducing command line lengths:
 
 * **`KRAKEN2_NUM_THREADS`**: if the

diff --git a/scripts/kraken2 b/scripts/kraken2
@@ -48,6 +48,7 @@ my $report_filename;
 my $use_mpa_style = 0;
 my $report_zero_counts = 0;
 my $minimum_hit_groups = 2;
+my $report_minimizer_data = 0;
 
 GetOptions(
   "help" => \&display_help,
@@ -70,6 +71,7 @@ GetOptions(
   "use-mpa-style" => \$use_mpa_style,
   "report-zero-counts" => \$report_zero_counts,
   "minimum-hit-groups=i" => \$minimum_hit_groups,
+  "report-minimizer-data" => \$report_minimizer_data,
 );
 
 if (! defined $threads) {
@@ -137,6 +139,7 @@ push @flags, "-m" if $use_mpa_style;
 push @flags, "-z" if $report_zero_counts;
 push @flags, "-M" if $memory_mapping;
 push @flags, "-g", $minimum_hit_groups;
+push @flags, "-K" if $report_minimizer_data;
 
 # Stupid hack to keep filehandles from closing before exec
 # filehandles opened inside for loop below go out of scope
@@ -202,6 +205,8 @@ Options:
                           kraken-mpa-report
   --report-zero-counts    With --report, report counts for ALL taxa, even if
                           counts are zero
+  --report-minimizer-data With --report, report minimizer and distinct minimizer
+                          count information in addition to normal Kraken report
   --memory-mapping        Avoids loading database into RAM
   --paired                The filenames provided have paired-end reads
   --use-names             Print scientific names instead of just taxids

diff --git a/src/Makefile b/src/Makefile
@@ -15,21 +15,33 @@ clean:
 	rm -f *.o $(PROGS)
 
 taxonomy.o: taxonomy.cc taxonomy.h mmap_file.h
+hyperloglogplus.o: hyperloglogplus.cc hyperloglogplus.h
 mmap_file.o: mmap_file.cc mmap_file.h
-compact_hash.o: compact_hash.cc compact_hash.h kv_store.h mmap_file.h
+compact_hash.o: compact_hash.cc compact_hash.h kv_store.h mmap_file.h kraken2_data.h
 mmscanner.o: mmscanner.cc mmscanner.h
 seqreader.o: seqreader.cc seqreader.h
 omp_hack.o: omp_hack.cc omp_hack.h
-reports.o: reports.cc reports.h
+reports.o: reports.cc reports.h kraken2_data.h
 aa_translate.o: aa_translate.cc aa_translate.h
 utilities.o: utilities.cc utilities.h
 
-build_db: build_db.cc mmap_file.o compact_hash.o taxonomy.o seqreader.o mmscanner.o omp_hack.o utilities.o
+classify.o: classify.cc kraken2_data.h kv_store.h taxonomy.h seqreader.h mmscanner.h compact_hash.h aa_translate.h reports.h utilities.h readcounts.h
+dump_table.o: dump_table.cc compact_hash.h taxonomy.h mmscanner.h kraken2_data.h reports.h
+estimate_capacity.o: estimate_capacity.cc kv_store.h mmscanner.h seqreader.h utilities.h
+build_db.o: build_db.cc taxonomy.h mmscanner.h seqreader.h compact_hash.h kv_store.h kraken2_data.h utilities.h
+lookup_accession_numbers.o: lookup_accession_numbers.cc mmap_file.h utilities.h
 
-classify: classify.cc reports.o mmap_file.o compact_hash.o taxonomy.o seqreader.o mmscanner.o omp_hack.o aa_translate.o utilities.o
+build_db: build_db.o mmap_file.o compact_hash.o taxonomy.o seqreader.o mmscanner.o omp_hack.o utilities.o
+	$(CXX) $(CXXFLAGS) -o $@ $^
 
-estimate_capacity: estimate_capacity.cc seqreader.o mmscanner.o omp_hack.o utilities.o
+classify: classify.o reports.o hyperloglogplus.o mmap_file.o compact_hash.o taxonomy.o seqreader.o mmscanner.o omp_hack.o aa_translate.o utilities.o
+	$(CXX) $(CXXFLAGS) -o $@ $^
 
-dump_table: dump_table.cc mmap_file.o compact_hash.o omp_hack.o taxonomy.o reports.o
+estimate_capacity: estimate_capacity.o seqreader.o mmscanner.o omp_hack.o utilities.o
+	$(CXX) $(CXXFLAGS) -o $@ $^
 
-lookup_accession_numbers: lookup_accession_numbers.cc mmap_file.o omp_hack.o utilities.o
+dump_table: dump_table.o mmap_file.o compact_hash.o omp_hack.o taxonomy.o reports.o hyperloglogplus.o
+	$(CXX) $(CXXFLAGS) -o $@ $^
+
+lookup_accession_numbers: lookup_accession_numbers.o mmap_file.o omp_hack.o utilities.o
+	$(CXX) $(CXXFLAGS) -o $@ $^