Releases: jtnystrom/Discount
Version 3.0.1
Version 3.0.0
This version adds indexes (k-mer databases with counted k-mers) and operations for combining them, including intersect, union and subtract. Various combination rules are available for intersection and union, including max, min, left and right. Most operations that could formerly be done only on raw sequence files can now also be performed on indexes with a similar syntax.
Indexes are stored as bucketed parquet files, which gives good efficiency when the same input data is used multiple times, since the k-mers do not have to be shuffled again on subsequent use.
Indexes can be manipulated through the command-line interface as well as through the API from notebooks or the Spark shell.
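For API users, the sketch below illustrates roughly how such index operations could look from a notebook or Spark shell session. This is a minimal, hypothetical sketch only: the entry point, method names and combination-rule value shown here (`Discount`, `index`, `intersect`, `Min`, `write`) are assumptions for illustration and may not match the actual API; see the README and API documentation for the real signatures.

```scala
// Hypothetical sketch of index operations (method names are assumed, not the exact API).
import org.apache.spark.sql.SparkSession
import com.jnpersson.discount.spark._ // real package name; the members used below are assumed

implicit val spark: SparkSession =
  SparkSession.builder().appName("discount-index-sketch").getOrCreate()

val discount = new Discount(k = 35)        // assumed entry point configured with the k-mer length
val a = discount.index("sample_a.fastq")   // count k-mers and build an index (assumed method)
val b = discount.index("sample_b.fastq")

// Intersect the two indexes; a combination rule such as Min would keep the smaller
// count for each shared k-mer (rule name assumed).
val shared = a.intersect(b, Min)
shared.write("shared_kmers_index")         // persisted as a bucketed parquet index for reuse
```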
Summary of changes in this version:
- The minimum supported Spark version is now 3.1.0.
- Support for indexes (k-mer databases) written as parquet files.
- Index operations such as union, intersect and subtract, with various combination rules such as min, max, sum, left and right.
- Restructured the API to use indexes as much as possible.
- Several operations were moved to Spark SQL (from handcrafted Scala) for performance and simplicity.
- Run scripts were renamed and can now detect their location, which makes it easy to symlink them to somewhere in `$PATH`.
- The new `-p` flag is now the preferred way to specify the number of partitions.
- Most commands that take input can now read input from an index (using `-i`) as well as from sequence files.
- K-mer counts are now consistently represented as Int instead of Long in the user API, since they were already limited to 32-bit signed integers internally.
- Added com.globalmentor's hadoop-bare-naked-local-fs to avoid dependency on winutils.exe on Windows when running tests.
- Various simplifications and speedups.
Version 2.3.0
Version 2.3.0 greatly increases the maximum data size that can be analysed (we have tested up to 6 TB of input data). Because of some very minor incompatible changes, API users may need to migrate some code manually.
- Pre-grouped mode for handling repetitive or very large data, which can be enabled with `--method pregrouped`.
- Some minimizer sets are now bundled in the Discount jar, which means that many users will not need to supply minimizers manually.
- Improved support for large m (up to 13), which helps subdivide complex data with many distinct k-mers.
- Automatic coalescing of partitions in frequency sampling when appropriate (improves performance).
- Support for `@inputs.txt` syntax to supply a list of input files on the command line.
- More efficient frequency sampling by doing the sampling entirely in Spark SQL instead of partially on the driver.
Version 2.2.1
This release fixes a data loss bug in the parsing of some fastq files.
Version 2.2.0
- Improved support for very long fasta sequences (e.g. full chromosomes), even with multiple sequences per file. This relies on an external .fai index (as produced by e.g. samtools faidx), which is now required for sequences of unbounded length.
- File input formats can now be mixed (e.g. fastq, fasta, long fasta can be read by the same job).
- k-mer statistics can now optionally be written to an output file using a new argument (not just to standard output as before).
- For convenience, additional PASHA minimizer sets for k >= 19, m=10,11 were added to the distribution.
Version 2.1.0
- Classes were restructured under the com.jnpersson.discount package (instead of simply "discount") to comply with normal Java/Scala conventions. This is a breaking change for API users, but should be a simple migration.
- Faster algorithms for read splitting and bitwise encoding.
- Sampling and input parsing have been changed into a unified API that is consistent across short reads and long sequences, and that samples long sequences more fairly.
- Foundational work towards preserving the sequence locations of input sequence fragments.
- Additional test cases for different kinds of input data.
Version 2.0.1
This release fixes a bug where long, multiline input sequences were not handled correctly, which could occasionally lead to wrong k-mer counts. It also includes some other minor improvements.
Version 2.0.0
This version includes the following improvements.
- Nearly 50% faster counting due to better algorithms, including a version of radix sort from the Fastutil library
- Automatic selection of the most appropriate minimizer set from a directory, by matching with the desired (k, m) values
- Support for interactive notebooks (a Zeppelin example is included) and a restructured API to support this
- Hashed superkmers can now be queried by sequence to find matching k-mers
- Support for lowercase nucleotide letters in input
- Support for user-defined minimizer orderings (`-o given`)
- Various simplifications and enhancements
Version 1.4.0
Version 1.4.0 has the following improvements:
- Scala 2.12/Spark 3.1 are now the default versions when compiling.
- Bugfix for incorrect counting when k mod 16 = 0.
- sbt-assembly is now the preferred way to package Discount, including its dependencies (Scallop and Fastdoop) in a "fat" jar.
- Additional property-based unit tests using ScalaCheck.
- A minimal demo application (ReadSplitDemo) shows how to use the Discount API without Spark.
- Various simplifications, code cleanups and speedups.
Version 1.3.0
Version 1.3.0 has the following improvements:
- Improved performance for large m
- Reduced memory usage in the hashing stage
- Fixed a bug that caused Discount to crash on empty inputs
- Improved command line argument validation
- Renamed the output path for `count --stats`
- Renamed the command line arguments `--motif-set` and `--stats` to `--minimizers` and `--buckets`, respectively, for improved clarity