-
Notifications
You must be signed in to change notification settings - Fork 82
Search Module
This page gives an overview of the search module and is a good start to get familiar with the basics of it.
The search module covers two areas: string indices (such as FM indices, suffix arrays or k-mer indices) and search algorithms (those that need a string index and those who don't). Currently, only FM indices and search algorithms for FM indices are implemented.
The string indices in SeqAn3 are based on the SDSL (Succinct Data Structure Library) and so far only cover FM indices (i.e. the Burrows-Wheeler-Transform and a sampled suffix array). In the SDSL, FM indices are referred to as compressed suffix arrays (CSAs). SeqAn3 provides wrapper classes for indices to have a consistent interface with other SeqAn3 classes and for Cereal Support (Serialization). It also takes care of indexing string collections (formerly known as string sets), e.g. a vector of DNA sequences.
SeqAn3 index wrappers only construct and hold the data structures and allow serialization. The actual search algorithm is performed by iterators. These iterators (which actually do not behave like standard iterators. Thus, they will be renamed to cursors soon) can search a pattern in the indexed text. The text is searched character by character, either to the right or to the left (if the index is bidirectional).
Currently, the search schemes algorithm is the only (index-based) search algorithm implemented in SeqAn3 (Search schemes formulate a strategy for approximate string matching: they divide the query sequence into blocks and give a set of searches. A search specifies the order in which the blocks are searched with lower and upper error bounds for each block). Note that many other search algorithms on indices can be (re)formulated as search schemes (e.g. 01*0 seeds, pigeonhole, etc.). A trivial backtracking approach is implemented for unit testing only. If for a given error a precomputed optimal search scheme does not exist, a (most likely not-optimal) search scheme is constructed using the pigeonhole principle, 01*0 seeds or similar.
Search schemes can be stored in either search_scheme_type
or search_scheme_dyn_type
. The first is based on std::array
, the second on std::vector
. Since the size of search schemes (number of searches and number of blocks) are known for optimal search schemes at compile time, they are of type search_scheme_type
. Search schemes computed at compile time for larger errors (for which no optimal search scheme has been computed and implemented yet) are usually of type search_scheme_dyn_type
because the number of blocks and searches are not known at compile time (mainly because the number of errors is not a compile-time parameter).
The search scheme algorithm takes a lower error bound (i.e. a minimum number of errors that have to be spent when searching a query sequence). This is not part of the public interface, i.e. the user cannot specify a minimum number of errors. This feature is only used internally to speed up strata searches.
Unfortunately, commits have been squashed so that the algorithmic differences of the search schemes from SeqAn2 and SeqAn3 are not visible in the history anymore. As those changes might be helpful if unexpected behaviour occurs, the changes are reported below. Some if clauses were superfluous and the behaviour for insertion/deletion at the beginning and end of a sequence changed. All changes refer to search/algorithm/detail/search_scheme_algorithm.hpp
.
inline bool search_ss_deletion(...)
{
...
// Insert deletions into the current block as long as possible
- if (!(search.pi[block_id] == 1 && !go_right) && max_error_left_in_block > 0 &&
- error_left.total > 0 && error_left.deletion > 0 &&
+ if (!(search.pi[block_id] == 1 && !go_right) && // Do not allow deletions at the beginning of the leftmost block
+ !(search.pi[block_id] == search.blocks() && go_right) && // Do not allow deletions at the end of the rightmost block
+ max_error_left_in_block > 0 && error_left.total > 0 && error_left.deletion > 0 &&
...
}
inline bool search_ss_children(...)
{
...
- // TODO: incorporate error_left.deletion into formula and simplify a bit
- if (error_left.deletion == 0 && min_error_left_in_block > 0 && chars_left + delta < min_error_left_in_block + 1u)
+ // TODO: incorporate error_left.deletion into formula
+ if (error_left.deletion == 0 && chars_left + delta < min_error_left_in_block + 1u)
...
// Deletion
- if (error_left.deletion > 0)
+ // TODO: check whether the conditions for deletions at the beginning/end of the query are really necessary
+ if (error_left.deletion > 0 &&
+ !(go_right && (rb == 1 || rb == query.size() + 1)) && // No deletion at the beginning of the leftmost block.
+ !(!go_right && (lb == 0 || lb == query.size()))) // No deletion at the end of the rightmost block.
...
}