Commit

Speed2 (#44)
* [FIX] Speed comparison.

* [MISC] Rename library to original.

* [MISC] Change input to unique.

* [MISC] Speed up randstrobemers and minstrobemers.

* [MISC] Remove bidirectionality and random access from strobes.

* [MISC] Speed up unique.

* [MISC] Python script to mutate fasta file with certain error rate.

* [FIX] Speed minstrobemers.

* [MISC] Make strobemers fast and adapt speed snakefile.

* Add syncmer to speed.

* Correct shapes.

* Save output files in results.

* Fix typos.

* Speed for representative methods, some doc.

* Count enables representative methods for strobemers and some clean up.

* Count for strobemers with representative methods.

* Match for strobemers and representative methods.

* Correct distance and distance with strobemers as bases.

* Add strobemer as option to accuracy.

* Accuracy Workflow

* Update.

* counts

* Add counts

* Update accuracy.

* Correct seed default.

* Correct distance

* Fixes errors.

* Revert "Count enables representative methods for strobemers and some clean up."

This reverts commit ba3fbc6.

* Fix

* Updated.

* Update README.md

* [FIX] Static assert.

* [DOC] Fixes.

* [DOC] Fixes.
MitraDarja authored Aug 24, 2023
1 parent f8b74a5 commit 78e5b32
Showing 52 changed files with 4,383 additions and 2,831 deletions.
14 changes: 14 additions & 0 deletions .gitignore
@@ -41,6 +41,20 @@ src/snakemake/
!src/snakemake/genmap/Snakefile
!src/snakemake/genmap/genmap_uniqueness.py

!src/snakemake/accuracy/add_errors.py
!src/snakemake/accuracy/README
!src/snakemake/accuracy/Snakefile
!src/snakemake/accuracy/plot_match.py
!src/snakemake/accuracy/plot_match_representative.py
!src/snakemake/accuracy/create_res.py
!src/snakemake/count/Snakefile
!src/snakemake/count/plot_counts.py
!src/snakemake/count/plot_counts_representative.py
!src/snakemake/count/plot_counts_representative2.py

!src/snakemake/distance/Snakefile

!src/snakemake/speed/README
!src/snakemake/speed/Snakefile
!src/snakemake/speed/plot_speed.py
!src/snakemake/speed/plot_speed_representative.py
65 changes: 58 additions & 7 deletions README.md
@@ -11,10 +11,10 @@ Currently, the following methods are supported:
- k-mers
- minimizers
- modmers
- strobemers (integrated as submodule from [here](https://github.com/ksahlin/strobemers) and implemented as a view)
- strobemers (integrated as submodule from [here](https://github.com/ksahlin/strobemers) and implemented as a view for hybrid-, min- and randstrobemers)
- syncmers

See Issue #1 for a list of methods that will be added in the future.
See Issue #1 for a list of methods that will be added in the future. An example usage of each method is given further below.

The following evaluation metrics are implemented at the moment (see an example usage for each metric down below):

@@ -32,11 +32,21 @@ Run test to check, if Comparison is working as intended. All tests should pass.
make test
```

# Counts

Counts creates two output files: one named `{method}_{inputfile_name}_counts.out`, a binary file storing all submers and their respective count values, and one named `{method}_counts.out`, storing the minimum, the mean, the variance and the maximum of the count values. Counts can also handle multiple files and calculate the mean over all sequences found in all files. For all supported methods, counts considers the canonical version.

Example usage for calculating the counts of k-mers of a given input file `in.fasta`:
```
minions counts --method kmer -k 16 in.fasta
```
This results in the two files: `kmer_hash_16_in_counts.out` and `kmer_hash_16_counts.out`.
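
As a rough illustration of the summary stored in `{method}_counts.out`, the following sketch counts canonical k-mers of one sequence and derives the four statistics. It is not the minions implementation: the function names are hypothetical, the canonical form is assumed to be the lexicographically smaller of a k-mer and its reverse complement, and the population variance is an assumption, since the variance definition is not stated here.

```python
from collections import Counter
from statistics import mean, pvariance

# Translation table for the reverse complement of a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement (assumed definition)."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_summary(sequence: str, k: int) -> dict:
    """Count canonical k-mers and summarise the count values."""
    counts = Counter(
        canonical(sequence[i:i + k]) for i in range(len(sequence) - k + 1)
    )
    values = list(counts.values())
    return {
        "min": min(values),
        "mean": mean(values),
        "variance": pvariance(values),  # population variance: an assumption
        "max": max(values),
    }


summary = count_summary("ACGTACGTAC", 4)
```

Here `CGTA` and `TACG` are reverse complements of each other, so they are counted as one canonical k-mer.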

# Speed

Speeds creates a file called `{method}_speed.out` and returns the speed of processing a singular sequence in microseconds. As typical one sequence file contains multiple sequences the minimum speed, the mean, the variance and the maximum speed are returned. Speed can also handle multiple files and calculate the mean over all sequences found in all files.
Speed creates a file called `{method}_speed.out` and returns the speed of processing a single sequence in microseconds. As one sequence file typically contains multiple sequences, the minimum speed, the mean, the variance and the maximum speed are returned. Speed can also handle multiple files and calculate the mean over all sequences found in all files. For all supported methods, speed considers the non-canonical version.

Example usage for calculating the k-mers of a given input file `in.fa`:
Example usage for calculating the speed of k-mers of a given input file `in.fasta`:
```
minions speed --method kmer -k 16 in.fasta
```
@@ -49,9 +59,50 @@ kmer_hash_16 10 11.478 0.970317 21 -1590685541
The first number is the minimum, then follows the mean, the variance and the maximum. The last number in the row can be ignored as it's only used for internal purposes.
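
A row of this form can be read back in a downstream script roughly as follows. This is a minimal sketch that assumes whitespace-separated fields; `SpeedRow` and `parse_speed_row` are hypothetical names and not part of minions.

```python
from typing import NamedTuple


class SpeedRow(NamedTuple):
    method: str
    minimum: float
    mean: float
    variance: float
    maximum: float


def parse_speed_row(line: str) -> SpeedRow:
    """Split one row of a speed output file into named fields.

    The trailing number is dropped because it is only used internally.
    """
    fields = line.split()
    method, numbers = fields[0], fields[1:5]
    return SpeedRow(method, *map(float, numbers))


row = parse_speed_row("kmer_hash_16 10 11.478 0.970317 21 -1590685541")
```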

**Note:**
Currently, there are two implementations of the strobemers supported. The original one from [Kristoffer Sahlin](https://github.com/ksahlin/strobemers) and the one here presented. The one here presented is more comparable to the other methods used here, because they are based on the same hash functions. Therefore, these strobemers are used for almost every evaluation metric. However, currently the implementation is slower than the one from Sahlin, that is why both implementations can be used with speed.
Currently, speed supports two implementations of the strobemers: the original one from [Kristoffer Sahlin](https://github.com/ksahlin/strobemers) and the one presented here. The one presented here is more comparable to the other methods used here, because they are based on the same hash functions. Therefore, these strobemers are used for every other evaluation metric.

To use the original implementation, add the flag `--original`. Note that the original implementation supports randstrobemers of order 2 and 3, while minstrobemers and hybridstrobemers are only supported for order 2. Furthermore, the flags `--w-min` and `--w-max` have different meanings in the original implementation and in the implementation here.

For the original implementation, add the flag `--original` and note that for the original implementation, only randstrobemers are supported for irder 2 and 3, minstrobemers and hybridstrobemers only support order 2. Furthermore, the flags `--w-min` and `--w-max` have different meanings between the original implementation and the implementation here.
`w-min` in the minions implementation is the distance between the first and the second strobe. In the original implementation, it is the starting position in the sequence of the window that is considered for the second strobe. Therefore, a call with `--original` should always add (k+1) to the `w-min` used with the minions implementation.

`w-min` in the implementation from minions is the distance between the first strobe to second strobe. While for the original implementation, it is the starting position in the sequence of the window that is considered for the second strobe.
`w-max` in the minions implementation is the length of the window that is considered for every strobe besides the first one. A strobe needs to lie completely inside this window to be considered. In the original implementation, it is the position in the sequence by which a considered strobe has to start. Therefore, a strobemer with a strobe length of 8, a `w-min` of 0 and a `w-max` of 15 in the minions implementation corresponds to a `w-min` of 9 and a `w-max` of 17 in the original implementation. For more details, please read the documentation for both implementations.
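
The `w-min` conversion stated above (add k+1 when calling the original implementation) can be written down as a small helper. The function name is hypothetical, and only the `w-min` rule is covered: no general formula for `w-max` is given here, so none is assumed.

```python
def original_w_min(k: int, minions_w_min: int) -> int:
    """Convert a minions `w-min` into the value for a call with `--original`.

    Follows the rule from the text: the original implementation
    expects `w-min` shifted by (k + 1).
    """
    return minions_w_min + k + 1


# Strobe length 8 and a minions w-min of 0, as in the example above.
converted = original_w_min(8, 0)
```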

# Unique

Unique should be run after counts, as its input should be a `{method}_{inputfile_name}_counts.out` file, which stores the submers with their count values. Unique then calculates the percentage of unique submers for all given files and reports it in an output file.

Example usage for calculating the uniqueness of k-mers for the output file of the example in counts:
```
minions unique kmer_hash_16_in_counts.out -o Unique.out
```

This results in the file `Unique.out`, which looks like:
```
kmer_hash_16 89.7
```
If multiple files had been given, each file would have added another row.
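
The reported percentage can be pictured with the following sketch. It works on an in-memory mapping instead of the binary counts file, whose layout is not described here, and it assumes that a submer is "unique" when its count is exactly 1 and that the percentage is taken over all distinct submers. `percent_unique` is a hypothetical name.

```python
def percent_unique(counts: dict[str, int]) -> float:
    """Return the share of distinct submers that occur exactly once, in percent."""
    if not counts:
        return 0.0
    unique = sum(1 for count in counts.values() if count == 1)
    return 100.0 * unique / len(counts)


example_counts = {"ACGT": 1, "CGTA": 3, "GTAC": 1, "TACG": 1}
percentage = percent_unique(example_counts)
```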

# Methods

If a metric supports a method, pick it with the flag `--method`.

## k-mers

k-mers are defined by their value of k, which can be set with `-k`. A gapped k-mer can be used by defining a shape with `--shape`. `--shape` expects a number: the decimal representation of a binary number that starts and ends with a 1, in which each 0 is treated as a gap.
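
To make the shape encoding concrete, the sketch below decodes a decimal shape into its binary mask and applies it to a k-mer. The helper names are hypothetical, and requiring the mask length to equal the k-mer length is an assumption of this example.

```python
def shape_to_mask(shape: int) -> str:
    """Decode a decimal shape into its binary mask, e.g. 13 -> "1101"."""
    if shape <= 0:
        raise ValueError("shape must be a positive number")
    mask = bin(shape)[2:]  # always starts with "1" for positive numbers
    if mask[-1] != "1":
        raise ValueError("shape must end with a 1 in binary")
    return mask


def apply_shape(kmer: str, shape: int) -> str:
    """Keep the bases under a 1 of the mask and skip the bases under a 0 (gap)."""
    mask = shape_to_mask(shape)
    if len(mask) != len(kmer):
        raise ValueError("mask length and k-mer length differ")
    return "".join(base for base, bit in zip(kmer, mask) if bit == "1")


gapped = apply_shape("ACGT", 13)  # 13 is 1101 in binary: the third base is a gap
```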

## strobemers

Currently, minstrobemers with the flag `--min`, hybridstrobemers with the flag `--hybrid` and randstrobemers with the flag `--rand` are supported for order 2 and 3, which can be defined with `--order`.
The length of a single strobe can be defined with `-k`, the distance between the first strobe and the next one with `--w-min`, and the length of the window from which the second and third strobe are picked with `--w-max`.

## minimizers

Minimizers support ungapped and gapped k-mers. A window size can be given with `-w`. The order is randomized by XORing all k-mer hash values with a seed; if the lexicographical order is wanted, `--seed` should be set to 0. For more information, see the [seqan tutorial](http://docs.seqan.de/seqan/3-master-user/tutorial_minimiser.html).
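
The effect of the seed can be seen in a small sketch of minimizer selection. It is not the seqan or minions code: the 2-bit hash, the function names and the reading of the window size as a length in bases (so each window holds `w - k + 1` k-mers) are assumptions, and ties are resolved in favour of the leftmost k-mer.

```python
# Assumed 2-bit encoding, which makes hash order equal lexicographical order.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_hash(kmer: str) -> int:
    """Encode a k-mer as a base-4 number."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value


def minimizer_positions(sequence: str, k: int, w: int, seed: int = 0) -> list[int]:
    """Return the start positions of the minimizers of all windows."""
    hashes = [kmer_hash(sequence[i:i + k]) ^ seed
              for i in range(len(sequence) - k + 1)]
    per_window = w - k + 1  # number of k-mers in a window of w bases
    selected: list[int] = []
    for start in range(len(hashes) - per_window + 1):
        window = hashes[start:start + per_window]
        position = start + window.index(min(window))  # leftmost minimum
        if not selected or selected[-1] != position:
            selected.append(position)
    return selected


lexicographic = minimizer_positions("ACGTTGCA", k=2, w=4, seed=0)
randomized = minimizer_positions("ACGTTGCA", k=2, w=4, seed=15)
```

With a seed of 0 the XOR leaves the hashes unchanged, so the smallest k-mer in lexicographical order wins each window; any other seed permutes that order.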

## modmers

Modmers support ungapped and gapped k-mers. The mod value can be given with `-w`. The order is randomized by XORing all k-mer hash values with a seed; if the lexicographical order is wanted, `--seed` should be set to 0.
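
A minimal sketch of the idea, assuming the common mod-mer definition in which a k-mer is kept when its seeded hash is divisible by the mod value. The 2-bit hash and the function names are hypothetical, and the selection rule in minions may differ in detail.

```python
# Assumed 2-bit encoding of the bases.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_hash(kmer: str) -> int:
    """Encode a k-mer as a base-4 number."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value


def modmer_positions(sequence: str, k: int, mod: int, seed: int = 0) -> list[int]:
    """Return start positions of k-mers whose seeded hash is divisible by mod."""
    return [
        i
        for i in range(len(sequence) - k + 1)
        if (kmer_hash(sequence[i:i + k]) ^ seed) % mod == 0
    ]


positions = modmer_positions("ACGTTGCA", k=2, mod=3)
```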

## syncmers

Syncmers support only ungapped k-mers. The s-mer value can be given with `-w`. The positions of an s-mer that make a k-mer a syncmer can be given with `-p`. The order is randomized by XORing all k-mer hash values with a seed; if the lexicographical order is wanted, `--seed` should be set to 0.
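
The role of the positions can be sketched as follows, assuming the usual syncmer rule: a k-mer is selected when its smallest s-mer starts at one of the given offsets. The 2-bit hash, the 0-based offsets, the use of `s` as the s-mer length and the function names are assumptions of this example.

```python
# Assumed 2-bit encoding of the bases.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def smer_hash(smer: str) -> int:
    """Encode an s-mer as a base-4 number."""
    value = 0
    for base in smer:
        value = value * 4 + CODE[base]
    return value


def syncmer_positions(sequence: str, k: int, s: int,
                      positions: tuple[int, ...] = (0,), seed: int = 0) -> list[int]:
    """Return start positions of k-mers whose minimal s-mer lies at an allowed offset."""
    selected = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        hashes = [smer_hash(kmer[j:j + s]) ^ seed for j in range(k - s + 1)]
        if hashes.index(min(hashes)) in positions:  # leftmost minimal s-mer
            selected.append(i)
    return selected


at_start = syncmer_positions("ACGTTGCA", k=4, s=2, positions=(0,))
at_start_or_end = syncmer_positions("ACGTTGCA", k=4, s=2, positions=(0, 2))
```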
23 changes: 13 additions & 10 deletions include/compare.h
@@ -107,9 +107,10 @@ void store_ibf(IBFType const & ibf,
}

/*! \brief Function that creates the string name of the used view.
* \param args The arguments about the view to be used.
* \param args The arguments about the view to be used.
* \param underlying_strobemer If true, "Strobmer" is added to the name.
*/
std::string create_name(range_arguments & args);
std::string create_name(range_arguments & args, bool underlying_strobemer = false);

/*! \brief Function, comparing the methods in regard of their coverage.
* \param args The arguments about the view to be used.
@@ -119,31 +120,33 @@ void do_accuracy(accuracy_arguments & args);
/*! \brief Function, comparing the number of submers.
* \param sequence_files A vector of sequence files.
* \param args The arguments about the view to be used.
* \param underlying_strobemer True, if strobemers should be used with a representative method like minimizer.
*/
void do_counts(std::vector<std::filesystem::path> sequence_files, range_arguments & args);
void do_counts(std::vector<std::filesystem::path> sequence_files, range_arguments & args, bool underlying_strobemer = false);

/*! \brief Function, comparing the methods in regard of their distance.
* \param sequence_file A sequence file.
* \param args The arguments about the view to be used.
* \param underlying_strobemer True, if strobemers should be used with a representative method like minimizer.
*/
void do_distance(std::filesystem::path sequence_file, range_arguments & args);
void do_distance(std::filesystem::path sequence_file, range_arguments & args, bool underlying_strobemer = false);

/*! \brief Function, counting number of matches between two sequences.
* \param sequence_file1 The first sequence file.
* \param sequence_file2 The second sequence file.
* \param args The arguments about the view to be used.
* \param underlying_strobemer True, if strobemers should be used with a representative method like minimizer.
*/
void do_match(std::filesystem::path sequence_file1, std::filesystem::path sequence_file2, range_arguments & args);
void do_match(std::filesystem::path sequence_file1, std::filesystem::path sequence_file2, range_arguments & args, bool underlying_strobemer = false);

/*! \brief Function, comparing the speed.
* \param sequence_files A vector of sequence files.
* \param args The arguments about the view to be used.
*/
void do_speed(std::vector<std::filesystem::path> sequence_files, range_arguments & args);

/*! \brief Function that calculates the uniqueness of submers in given sequence files.
* \param sequence_files A vector of sequence files.
* \param method_name The name of the method.
* \param args The arguments about the view to be used.
/*! \brief Function that calculates the uniqueness of submers in given files.
* \param input_files A vector of input files. An input file is a count file obtained by counts.
* \param oname The name of the output file.
*/
void unique(std::vector<std::filesystem::path> sequence_files, std::string method_name, range_arguments & args);
void unique(std::vector<std::filesystem::path> input_files, std::filesystem::path oname);