Speed2 #44

Merged Aug 24, 2023 (35 commits)
33aea22
[FIX] Speed comparison.
MitraDarja Jan 19, 2023
b0499f6
[MISC] Rename library to original.
MitraDarja Jan 20, 2023
e9e650e
[MISC] Change input to unique.
MitraDarja Jan 20, 2023
2236f9b
[MISC] Speed up randstrobemers and minstrobemers.
MitraDarja Jan 20, 2023
027e1fd
[MISC] Remove bidirectionality and random access from strobes.
MitraDarja Jan 20, 2023
7517742
[MISC] Speed up unique.
MitraDarja Jan 24, 2023
9e7045c
[MISC] Python script to mutate fasta file with certain error rate.
MitraDarja Jan 24, 2023
9ce21d3
[FIX] Speed minstrobemers.
MitraDarja Jan 25, 2023
bea5fd0
[MISC] Make strobemers fast and adapt speed snakefile.
MitraDarja Jan 25, 2023
4c2ee1d
Add syncmer to speed.
MitraDarja Jan 25, 2023
1a4d2a8
Correct shapes.
MitraDarja Jan 26, 2023
06491e2
Save output files in results.
MitraDarja Jan 26, 2023
f9c0caf
Fix typos.
MitraDarja Jan 26, 2023
e76b26c
Speed for reprentative methods, some doc.
MitraDarja Feb 1, 2023
ba3fbc6
Count enables representative methods for strobemers and some clean up.
MitraDarja Feb 3, 2023
5488d62
Count for strobemers with representative methods.
MitraDarja Feb 6, 2023
1702733
Match for strobmers and representative methods.
MitraDarja Feb 8, 2023
2836ed1
Correct distance and distance with strobemers as bases.
MitraDarja Feb 9, 2023
c66e70b
Add strobemer as option to accuracy.
MitraDarja Feb 9, 2023
b72f0f5
Accuracy Workflow
MitraDarja May 5, 2023
239bed0
Update.
MitraDarja May 30, 2023
8f3e683
counts
MitraDarja May 30, 2023
71c1b52
Add counts
MitraDarja May 30, 2023
73d4075
Update accuracy.
MitraDarja Jun 2, 2023
4036359
Correct seed default.
MitraDarja Jun 12, 2023
489b055
Correct distance
MitraDarja Jun 12, 2023
8c6636d
Merge branch 'speed2' of github.com:seqan/minions into speed2
MitraDarja Jun 12, 2023
c45dc89
Fixes errors.
MitraDarja Jun 12, 2023
e5d54a3
Revert "Count enables representative methods for strobemers and some …
MitraDarja Jun 12, 2023
88b569e
Fix
MitraDarja Jun 12, 2023
da6760f
Updated.
MitraDarja Jul 3, 2023
978d339
Update README.md
MitraDarja Aug 17, 2023
8767a77
[FIX] Static assert.
MitraDarja Aug 24, 2023
66a5de6
[DOC] Fixes.
MitraDarja Aug 24, 2023
0fcaa75
[DOC] Fixes.
MitraDarja Aug 24, 2023
14 changes: 14 additions & 0 deletions .gitignore
@@ -41,6 +41,20 @@ src/snakemake/
!src/snakemake/genmap/Snakefile
!src/snakemake/genmap/genmap_uniqueness.py

!src/snakemake/accuracy/add_errors.py
!src/snakemake/accuracy/README
!src/snakemake/accuracy/Snakefile
!src/snakemake/accuracy/plot_match.py
!src/snakemake/accuracy/plot_match_representative.py
!src/snakemake/accuracy/create_res.py
!src/snakemake/count/Snakefile
!src/snakemake/count/plot_counts.py
!src/snakemake/count/plot_counts_representative.py
!src/snakemake/count/plot_counts_representative2.py

!src/snakemake/distance/Snakefile

!src/snakemake/speed/README
!src/snakemake/speed/Snakefile
!src/snakemake/speed/plot_speed.py
!src/snakemake/speed/plot_speed_representative.py
65 changes: 58 additions & 7 deletions README.md
@@ -11,10 +11,10 @@ Currently, the following methods are supported:
- k-mers
- minimizers
- modmers
- strobemers (integrated as submodule from [here](https://github.com/ksahlin/strobemers) and implemented as a view)
- strobemers (integrated as submodule from [here](https://github.com/ksahlin/strobemers) and implemented as a view for hybrid-, min- and randstrobemers)
- syncmers

See Issue #1 for a list of methods that will be added in the future.
See Issue #1 for a list of methods that will be added in the future (see below for an example usage of each method).

The following evaluation metrics are implemented at the moment (see an example usage for each metric down below):

@@ -32,11 +32,21 @@ Run test to check, if Comparison is working as intended. All tests should pass.
```
make test
```

# Counts

Counts creates two output files: `{method}_{inputfile_name}_counts.out`, a binary file storing all submers with their respective count values, and `{method}_counts.out`, which stores the minimum, the mean, the variance and the maximum of the count values. Counts can also handle multiple files and calculate the mean over all sequences found in all files. For all supported methods, counts considers the canonical version.

Example usage for calculating the counts of k-mers of a given input file `in.fasta`:
```
minions counts --method kmer -k 16 in.fasta
```
This results in the two files: `kmer_hash_16_in_counts.out` and `kmer_hash_16_counts.out`.
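
The naming pattern can be sketched in a few lines of Python. This is only an illustration inferred from the single example above: it assumes the `{inputfile_name}` part is the input file name without its extension, and it takes the method name (`kmer_hash_16` here) as given, because minions itself builds that string. The helper `counts_output_names` is hypothetical and not part of minions.

```python
from pathlib import Path


def counts_output_names(method_name: str, input_file: str) -> tuple[str, str]:
    """Return the two file names that counts is expected to write.

    Assumption: {inputfile_name} is the input file name without extension,
    which is what the README example (in.fasta -> "in") suggests.
    """
    stem = Path(input_file).stem
    per_file = f"{method_name}_{stem}_counts.out"  # binary submer/count file
    summary = f"{method_name}_counts.out"          # min, mean, variance, max
    return per_file, summary


print(counts_output_names("kmer_hash_16", "in.fasta"))
```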

# Speed

Speeds creates a file called `{method}_speed.out` and returns the speed of processing a singular sequence in microseconds. As typical one sequence file contains multiple sequences the minimum speed, the mean, the variance and the maximum speed are returned. Speed can also handle multiple files and calculate the mean over all sequences found in all files.
Speed creates a file called `{method}_speed.out` and reports the time needed to process a single sequence in microseconds. Because a sequence file typically contains multiple sequences, the minimum, the mean, the variance and the maximum are returned. Speed can also handle multiple files and calculate the mean over all sequences found in all files. For all supported methods, speed considers the non-canonical version.

Example usage for calculating the k-mers of a given input file `in.fa`:
Example usage for calculating the speed of k-mers of a given input file `in.fasta`:
```
minions speed --method kmer -k 16 in.fasta
```
@@ -49,9 +59,50 @@ kmer_hash_16 10 11.478 0.970317 21 -1590685541
The first number is the minimum, then follows the mean, the variance and the maximum. The last number in the row can be ignored as it's only used for internal purposes.
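
For post-processing, a row of `{method}_speed.out` can be read with a small Python helper. This is a minimal sketch, not part of minions: it assumes the fields are separated by whitespace, and the names `SpeedStats` and `parse_speed_line` are made up for this illustration. The trailing internal number is dropped, as described above.

```python
from typing import NamedTuple


class SpeedStats(NamedTuple):
    method: str
    minimum: float   # microseconds
    mean: float
    variance: float
    maximum: float


def parse_speed_line(line: str) -> SpeedStats:
    """Parse one row: method name, minimum, mean, variance, maximum.

    Assumption: fields are whitespace separated. Anything after the
    fifth field (the internal number) is ignored.
    """
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    minimum, mean, variance, maximum = (float(x) for x in fields[1:5])
    return SpeedStats(fields[0], minimum, mean, variance, maximum)


stats = parse_speed_line("kmer_hash_16 10 11.478 0.970317 21 -1590685541")
print(stats.method, stats.minimum, stats.mean, stats.variance, stats.maximum)
```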

**Note:**
Currently, there are two implementations of the strobemers supported. The original one from [Kristoffer Sahlin](https://github.com/ksahlin/strobemers) and the one here presented. The one here presented is more comparable to the other methods used here, because they are based on the same hash functions. Therefore, these strobemers are used for almost every evaluation metric. However, currently the implementation is slower than the one from Sahlin, that is why both implementations can be used with speed.
Currently, speed supports two implementations of strobemers: the original one from [Kristoffer Sahlin](https://github.com/ksahlin/strobemers) and the one presented here. The one presented here is more comparable to the other methods used here, because they are based on the same hash functions. Therefore, these strobemers are used for every other evaluation metric.

To use the original implementation, add the flag `--original`. Note that the original implementation supports randstrobemers for order 2 and 3, while minstrobemers and hybridstrobemers are supported only for order 2. Furthermore, the flags `--w-min` and `--w-max` have different meanings in the original implementation and in the implementation here.

For the original implementation, add the flag `--original` and note that for the original implementation, only randstrobemers are supported for irder 2 and 3, minstrobemers and hybridstrobemers only support order 2. Furthermore, the flags `--w-min` and `--w-max` have different meanings between the original implementation and the implementation here.
In the minions implementation, `w-min` is the distance between the first strobe and the second strobe, whereas in the original implementation it is the starting position in the sequence of the window that is considered for the second strobe. Therefore, a call with `--original` should always add (k+1) to the `w-min` used with the minions implementation.

`w-min` in the implementation from minions is the distance between the first strobe to second strobe. While for the original implementation, it is the starting position in the sequence of the window that is considered for the second strobe.
In the minions implementation, `w-max` is the window length that is considered for every strobe besides the first one; a strobe must lie completely inside this window to be considered. In the original implementation, it is the position in the sequence up to which a considered strobe has to start. For example, a strobemer with a strobe length of 8, a `w-min` of 0 and a `w-max` of 15 in the minions implementation corresponds to a `w-min` of 9 and a `w-max` of 17 in the original implementation. For more details, please read the documentation for both implementations.
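
The `w-min` rule stated above (add k+1 for the original implementation) can be written as a small Python helper. The function `original_w_min` is hypothetical and only restates that rule; it reproduces the `w-min` value of the example (0 becomes 9 for a strobe length of 8). No conversion for `w-max` is given here, because the text provides only a single example pair (15 and 17) and no general rule.

```python
def original_w_min(minions_w_min: int, k: int) -> int:
    """Translate a minions --w-min into the value for --original.

    Rule taken from the text above: add (k + 1), where k is the
    length of a single strobe.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if minions_w_min < 0:
        raise ValueError("w-min must not be negative")
    return minions_w_min + k + 1


print(original_w_min(0, 8))  # 9, as in the example above
```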

# Unique

Unique should be run after counts, as the input should be a `{method}_{inputfile_name}_counts.out` file, which stores the submers with their count values. Unique then calculates the percentage of unique submers for all given files and reports it in an output file.

Example usage for calculating the uniqueness of k-mers for the output file of the example in counts:
```
minions unique kmer_hash_16_in_counts.out -o Unique.out
```

This results in the file `Unique.out`, which looks like:
```
kmer_hash_16 89.7
```
If multiple files are given, each file adds another row.

# Methods

If a metric supports a method, pick it with the flag `--method`.

## k-mers

k-mers are defined by their value of k, which can be set with `-k`. A gapped k-mer can be used by defining a shape with `--shape`. The shape is given as a number: the decimal representation of a binary number that starts and ends with a 1, where each 0 is treated as a gap.
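
As an illustration of the shape encoding, the following Python sketch converts a binary mask into the decimal number that `--shape` expects. The helper `shape_to_decimal` is hypothetical and not part of minions; it only checks the two conditions named above (binary digits, leading and trailing 1).

```python
def shape_to_decimal(mask: str) -> int:
    """Convert a binary shape mask such as "1101" to its decimal value.

    A 1 marks a position that is used, a 0 marks a gap.
    """
    if not mask or set(mask) - {"0", "1"}:
        raise ValueError("mask must consist of the characters 0 and 1")
    if mask[0] != "1" or mask[-1] != "1":
        raise ValueError("mask must start and end with a 1")
    return int(mask, 2)


print(shape_to_decimal("1101"))  # 13: span of 4 with one gap
print(shape_to_decimal("1111"))  # 15: ungapped 4-mer
```

With this encoding, a gapped shape of span 4 with a gap at the third position would be passed as `--shape 13`.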

## strobemers

Currently, minstrobemers with the flag `--min`, hybridstrobemers with the flag `--hybrid` and randstrobemers with the flag `--rand` are supported for order 2 and 3, which can be set with `--order`.
The length of a single strobe can be set with `-k`, the distance between the first strobe and the next one with `--w-min`, and the length of the window from which the second and third strobe are picked with `--w-max`.

## minimizers

Minimizers support ungapped and gapped k-mers. A window size can be given with `-w`. The order is randomized by XORing all k-mer hash values with a seed; if the lexicographical order is wanted, `--seed` should be set to 0. For more information, see the [seqan tutorial](http://docs.seqan.de/seqan/3-master-user/tutorial_minimiser.html).
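
The effect of the seed can be illustrated with a simplified Python sketch. It is not the minions or SeqAn implementation: it handles only ungapped, non-canonical k-mers, assumes that the window size is counted in bases (so a window holds `w - k + 1` k-mers), and reports a value again only when it differs from the previously reported one. All function names are made up for this example.

```python
RANK = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_ranks(seq: str, k: int) -> list[int]:
    """2-bit encode every k-mer; the numeric order equals lexicographic order."""
    ranks = []
    for i in range(len(seq) - k + 1):
        value = 0
        for base in seq[i:i + k]:
            value = value * 4 + RANK[base]
        ranks.append(value)
    return ranks


def minimizers(seq: str, k: int, w: int, seed: int = 0) -> list[int]:
    """Return the smallest (rank XOR seed) per window of w bases."""
    if not 0 < k <= w:
        raise ValueError("need 0 < k <= w")
    hashes = [rank ^ seed for rank in kmer_ranks(seq, k)]
    per_window = w - k + 1  # number of k-mers in one window
    picked: list[int] = []
    for start in range(len(hashes) - per_window + 1):
        smallest = min(hashes[start:start + per_window])
        if not picked or picked[-1] != smallest:
            picked.append(smallest)
    return picked


print(minimizers("ACGTAC", k=2, w=3, seed=0))       # lexicographic order
print(minimizers("ACGTAC", k=2, w=3, seed=0b1111))  # order changed by the seed
```

With a seed of 0 the XOR leaves the values untouched, so the lexicographically smallest k-mer wins in every window; any other seed permutes the order.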

## modmers

Modmers support ungapped and gapped k-mers. The mod value can be given with `-w`. The order is randomized by XORing all k-mer hash values with a seed; if the lexicographical order is wanted, `--seed` should be set to 0.

## syncmers

Syncmers support ungapped k-mers. The s-mer value can be given with `-w`. The positions of an s-mer that make a k-mer a syncmer can be given with `-p`. The order is randomized by XORing all k-mer hash values with a seed; if the lexicographical order is wanted, `--seed` should be set to 0.
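
The role of the positions can be sketched in Python as follows. This is an illustration under stated assumptions, not the minions implementation: a k-mer is treated as a syncmer if its smallest s-mer (after XOR with the seed) starts at one of the given 0-based positions, and ties are resolved in favour of the leftmost s-mer. The names `smer_hashes` and `is_syncmer` are hypothetical.

```python
RANK = {"A": 0, "C": 1, "G": 2, "T": 3}


def smer_hashes(kmer: str, s: int, seed: int = 0) -> list[int]:
    """2-bit encode every s-mer of the k-mer and XOR it with the seed."""
    hashes = []
    for i in range(len(kmer) - s + 1):
        value = 0
        for base in kmer[i:i + s]:
            value = value * 4 + RANK[base]
        hashes.append(value ^ seed)
    return hashes


def is_syncmer(kmer: str, s: int, positions: set[int], seed: int = 0) -> bool:
    """True if the smallest s-mer starts at one of the allowed positions."""
    if not 0 < s <= len(kmer):
        raise ValueError("need 0 < s <= k")
    hashes = smer_hashes(kmer, s, seed)
    smallest_at = hashes.index(min(hashes))  # leftmost position on ties
    return smallest_at in positions


print(is_syncmer("ACGTT", s=2, positions={0}))     # smallest s-mer AC at 0
print(is_syncmer("GTTAC", s=2, positions={0}))     # smallest s-mer AC at 3
print(is_syncmer("GTTAC", s=2, positions={0, 3}))  # first or last position
```

Allowing only position 0 corresponds to what is usually called an open syncmer; allowing the first and the last s-mer position corresponds to a closed syncmer.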
23 changes: 13 additions & 10 deletions include/compare.h
@@ -107,9 +107,10 @@ void store_ibf(IBFType const & ibf,
}

/*! \brief Function that creates the string name of the used view.
* \param args The arguments about the view to be used.
* \param args The arguments about the view to be used.
* \param underlying_strobemer If true, "Strobmer" is added to the name.
*/
std::string create_name(range_arguments & args);
std::string create_name(range_arguments & args, bool underlying_strobemer = false);

/*! \brief Function, comparing the methods in regard of their coverage.
* \param args The arguments about the view to be used.
@@ -119,31 +120,33 @@ void do_accuracy(accuracy_arguments & args);
/*! \brief Function, comparing the number of submers.
* \param sequence_files A vector of sequence files.
* \param args The arguments about the view to be used.
* \param underlying_strobemer True, if strobemers should be used with a representative method like minimizer.
*/
void do_counts(std::vector<std::filesystem::path> sequence_files, range_arguments & args);
void do_counts(std::vector<std::filesystem::path> sequence_files, range_arguments & args, bool underlying_strobemer = false);

/*! \brief Function, comparing the methods in regard of their distance.
* \param sequence_file A sequence file.
* \param args The arguments about the view to be used.
* \param underlying_strobemer True, if strobemers should be used with a representative method like minimizer.
*/
void do_distance(std::filesystem::path sequence_file, range_arguments & args);
void do_distance(std::filesystem::path sequence_file, range_arguments & args, bool underlying_strobemer = false);

/*! \brief Function, counting number of matches between two sequences.
* \param sequence_file1 The first sequence file.
* \param sequence_file2 The second sequence file.
* \param args The arguments about the view to be used.
* \param underlying_strobemer True, if strobemers should be used with a representative method like minimizer.
*/
void do_match(std::filesystem::path sequence_file1, std::filesystem::path sequence_file2, range_arguments & args);
void do_match(std::filesystem::path sequence_file1, std::filesystem::path sequence_file2, range_arguments & args, bool underlying_strobemer = false);

/*! \brief Function, comparing the speed.
* \param sequence_files A vector of sequence files.
* \param args The arguments about the view to be used.
*/
void do_speed(std::vector<std::filesystem::path> sequence_files, range_arguments & args);

/*! \brief Function that calculates the uniqueness of submers in given sequence files.
* \param sequence_files A vector of sequence files.
* \param method_name The name of the method.
* \param args The arguments about the view to be used.
/*! \brief Function that calculates the uniqueness of submers in given files.
* \param input_files A vector of input files. An input file is a count file obtained by counts.
* \param oname The name of the output file.
*/
void unique(std::vector<std::filesystem::path> sequence_files, std::string method_name, range_arguments & args);
void unique(std::vector<std::filesystem::path> input_files, std::filesystem::path oname);