-
Notifications
You must be signed in to change notification settings - Fork 82
Requirements for Binning Directories
log2(k)
Given a text T with |T| = n and a k-mer size k, the set of consecutive k-mers S is defined as
S = {[T0, T1, ..., Tk-1], [T1, T2, ..., Tk], ..., [Tn-k, Tn-k+1, ..., Tn-1]}
k = 5
T = GACTACGTAGC
S = [ GACTA,
ACTAC,
CTACG,
TACGT,
ACGTA,
CGTAG,
GTAGC ]
To determine whether a pattern P with |P| = m occurs in the text T with up to e errors, the well known k-mer lemma can be applied: A text T of length n contains n-k+1 k-mers and a occurrence of pattern P with |P| = m shares at least m-(e+1)*k-1 k-mers with T.
Given a text T with |T| = n, a k-mer size k and an offset o, the set of offset k-mers S = [ [T_0, T_1, ..., T_k-1], [T_o, T_o+1, ..., T_o+k-1], [T_2o, T_2o+1, ..., T_2o+k-1], ... [T_ceil((n-k+1)/o)-o, ..., T_ceil(ceil((n-k+1)/o)-o+k].
Given a text T of length n, a k-mer size k and an offset o, T contains ok = ceil((n-k+1)/o) many k-mers and a occurrence of pattern |P| with |P|=m shares at least ok - e*ceil(k/o) many k-mers.
k = 5
o = 3
T = GACTACGTAGC
S = [ GACTA,
TACGT,
GTAGC ]
When looking for a pattern P in text T, all k-mers of T must be compared to the offset k-mers of P, since we don't know if P aligns to T with a multiple of offset o. In theory, it would also work the other way around (storing the offset k-mers of T and searching all k-mers of T). Storing the offset k-mers of T would reduce the memory needed to store the k-mers of T, while it (depending on how you look the k-mers up, this is the case for Binning Directories) wouldn't speed up the querying. For Binning Directories we usually store all k-mers of T and look up the offset k-mers of P to speed up the querying (usually by about a factor of o).
k = 3
o = 2
T = GACTACGTAGC
S_T = [ GAC,
CTA,
ACG,
GTA,
AGC ]
P = ACTACG
S_P = [ ACT,
TAC ]
Even though P is a substring of T, no k-mers match.
k = 3
o = 2
T = GACTACGTAGC
S_T = [ GAC,
ACT,
CTA,
TAC,
ACG,
CGT,
GTA,
TAG,
AGC ]
P = ACTACG
S_P = [ ACT,
TAC ]
Now S_P is contained in S_T.
Given a text T with |T| = n, a k-mer size k and a window size w, the set of minimizing k-mers over T is the set of the lexicographically smallest k-mers in consecutive substrings of length w of T and its reverse complement.
Heuristically we can look at the minimizer coverage of a pattern P and infer some consequences for the amount of minimizers destroyed by an error.
[...]
We have a mask.
NP-hard. There are heuristics.
[...]
In SeqAn2 the computation is done rather naively. We always go over the text and compute the hash (ignoring 0s and looking at the 1s of the mask). Since we know the mask at compile time we can probably devise a multiplication by a specific value that masks the values we want to ignore?
Dump
API:
- insert/construct
- update
- select
Bitvector
- Uncompressed
- Compressed
- Chunked
Configuration:
- If possible at compile time
- Things like k, w, o may be only determined at runtime (user set)
- Same for Bitvector config
Also:
- Resize bitvector (add bins) - required by third party (Ganon)
Decisions:
- Since it is a vector, we may not enforce b (number bins) to be a multiple of the word size (64). This may cause a slow down for look ups though (check if that's the case).
- Check if it is beneficial to use Fibonacci Hashing instead of a modulo to hash a k-mer to a position in the Binning directory. In this context also check if we always (especially if we enforce above constraint) always do a modulo by a power of 2 - this should be optimised by the compiler to a bit operation, but since the blockSize (the modulo operand) is decided at runtime this may not be the case (check) - in which case we could hard code the modulo as bit operation. Still, in this case, the Fibonacci hashing may have some advantaged regarding the distribution of the values ... yet again we have relatively uniform values to hash (arithmetic encoding of the k-mers, which we can assume as uniformly distributed), so Fibonacci Hashing may not have benefits in this regard.
Terms:
- bins: number of bins
- Size: the size of the bitvector
- Metadatasize: the size of the bits that are appended to the bitvector to store information like k, b and so on. Deprecated with the use of Cereal for serialisation.
- binWidth: how many words do we need to represent the bins (64 bins=1 64-bit word, 70 bins=2 64-bit words)
- blockBitSize: How long is a single block representing the bins (binWidth * wordsize)
- noOfBlocks: How many blocks do we have available/number of hash values (size/blockbitsize)