[DOC] More documentation for HIBF and config. (#104)
* [DOC] update HIBF documentation.

* [DOC] Update config documentation.

* [MISC] automatic linting

* Update include/hibf/config.hpp

* Update include/hibf/config.hpp

* Apply suggestions from code review

* Apply suggestions from code review

* [MISC] automatic linting

* Update test/snippet/hibf/hierarchical_interleaved_bloom_filter.cpp

* [MISC] automatic linting

* Update hierarchical_interleaved_bloom_filter.cpp

* Create hierarchical_interleaved_bloom_filter.out

* Apply suggestions from code review

---------

Co-authored-by: seqan-actions[bot] <[email protected]>
Co-authored-by: Enrico Seiler <[email protected]>
3 people authored Sep 13, 2023
1 parent 997d284 commit 320f2c8
Showing 5 changed files with 170 additions and 82 deletions.
72 changes: 21 additions & 51 deletions include/hibf/config.hpp
Original file line number Diff line number Diff line change
@@ -26,15 +26,33 @@ namespace seqan::hibf
using insert_iterator = std::insert_iterator<robin_hood::unordered_flat_set<uint64_t>>;

/*!\brief The configuration used to build an (H)IBF
*
* # The (H)IBF config
*
* The configuration can be used to construct an HIBF or IBF.
*
* When constructing an IBF, only the members of the `General Configuration` section are considered; layout
* parameters from the `HIBF Layout Configuration` section are ignored.
*
* \note If an option is marked [REQUIRED], an error will be thrown on (H)IBF
* construction if it is not set. An option is marked [RECOMMENDED_TO_ADAPT] if we provide a sensible default
* but recommend, from experience, that you consider adjusting it to your data.
* Here is the list of all config options:
*
* | Type | Option Name | Default | Note |
* |:--------|:-------------------------------------------------|:-------:|:-----------------------|
* | General | seqan::hibf::config::input_fn | - | [REQUIRED] |
* | General | seqan::hibf::config::number_of_user_bins | - | [REQUIRED] |
* | General | seqan::hibf::config::number_of_hash_functions | 2 | |
* | General | seqan::hibf::config::maximum_false_positive_rate | 0.05 | [RECOMMENDED_TO_ADAPT] |
* | General | seqan::hibf::config::threads | 1 | [RECOMMENDED_TO_ADAPT] |
* | Layout | seqan::hibf::config::sketch_bits | 12 | |
* | Layout | seqan::hibf::config::tmax | 0 | 0 indicates unset |
* | Layout | seqan::hibf::config::max_rearrangement_ratio | 0.5 | |
* | Layout | seqan::hibf::config::alpha | 1.2 | |
* | Layout | seqan::hibf::config::disable_estimate_union | false | |
* | Layout | seqan::hibf::config::disable_rearrangement | false | |
*
* As a copy and paste source, here are all config options with their defaults:
*
* \include test/snippet/hibf/hibf_construction.cpp
*
* ## The HIBF takes too long to construct?
*
@@ -51,54 +69,6 @@ using insert_iterator = std::insert_iterator<robin_hood::unordered_flat_set<uint
* * seqan::hibf::config::number_of_hash_functions
* * seqan::hibf::config::maximum_false_positive_rate
*
* ## General Configuration
*
* ### seqan::hibf::config::input_fn [REQUIRED]
*
* \copydetails seqan::hibf::config::input_fn
*
* ### seqan::hibf::config::number_of_user_bins [REQUIRED]
*
* \copydetails seqan::hibf::config::number_of_user_bins
*
* ### seqan::hibf::config::number_of_hash_functions
*
* \copydetails seqan::hibf::config::number_of_hash_functions
*
* ### seqan::hibf::config::maximum_false_positive_rate [RECOMMENDED_TO_ADAPT]
*
* \copydetails seqan::hibf::config::maximum_false_positive_rate
*
* ### seqan::hibf::config::threads [RECOMMENDED_TO_ADAPT]
*
* \copydetails seqan::hibf::config::threads
*
* ## HIBF Layout Configuration
*
* ### seqan::hibf::config::sketch_bits
*
* \copydetails seqan::hibf::config::sketch_bits
*
* ### seqan::hibf::config::tmax
*
* \copydetails seqan::hibf::config::tmax
*
* ### seqan::hibf::config::max_rearrangement_ratio
*
* \copydetails seqan::hibf::config::max_rearrangement_ratio
*
* ### seqan::hibf::config::alpha
*
* \copydetails seqan::hibf::config::alpha
*
* ### seqan::hibf::config::disable_estimate_union
*
* \copydetails seqan::hibf::config::disable_estimate_union
*
* ### seqan::hibf::config::disable_rearrangement
*
* \copydetails seqan::hibf::config::disable_rearrangement
*
*/
struct config
{
101 changes: 70 additions & 31 deletions include/hibf/hierarchical_interleaved_bloom_filter.hpp
@@ -28,48 +28,58 @@
namespace seqan::hibf
{

/*!\brief The HIBF binning directory. A data structure that efficiently answers set-membership queries for multiple
* bins.
* \tparam data_layout_mode_ Indicates whether the underlying data type is compressed. See
* [seqan::hibf::data_layout](https://docs.seqan.de/seqan/3.0.3/group__submodule__dream__index.html#gae9cb143481c46a1774b3cdf5d9fdb518).
* \see [seqan::hibf::interleaved_bloom_filter][1]
/*!\brief The Hierarchical Interleaved Bloom Filter (HIBF) - Fast answers to set-membership queries for multiple bins.
* \details
*
* This class improves the [seqan::hibf::interleaved_bloom_filter][1] by adding additional bookkeeping that allows
* to establish a hierarchical structure. This structure can then be used to split or merge user bins and distribute
* them over a variable number of technical bins. In the [seqan::hibf::interleaved_bloom_filter][1], the number of user bins
* and technical bins is always the same. This causes performance degradation when there are many user bins or the user
* bins are unevenly distributed.
* to establish a hierarchical structure. It is especially suited if you want to index many samples/sets/user bins
* or if their sizes are unevenly distributed.
*
* # Terminology
* Publication reference: https://doi.org/10.1186/s13059-023-02971-4
*
* ## Technical Bin
* A Technical Bin represents an actual bin in the binning directory. In the IBF, it stores its kmers in a single Bloom
* Filter (which is interleaved with all the other BFs).
* ### Example
*
* ## User Bin
* The user may impose a structure on his sequence data in the form of logical groups (e.g. species). When querying the
* IBF, the user is interested in an answer that differentiates between these groups.
* \include test/snippet/hibf/hierarchical_interleaved_bloom_filter.cpp
*
* # Hierarchical Interleaved Bloom Filter (HIBF)
* ### Cite
*
* In contrast to the [seqan::hibf::interleaved_bloom_filter][1], the user bins may be split across multiple technical bins
* , or multiple user bins may be merged into one technical bin. When merging multiple user bins, the HIBF stores
* another IBF that is built over the user bins constituting the merged bin. This lower-level IBF can then be used
* to further distinguish between merged bins.
* *Mehringer, Svenja, et al. "Hierarchical Interleaved Bloom Filter: enabling ultrafast, approximate sequence queries."
* Genome Biology 24.1 (2023): 1-25.*
*
* In this example, user bin 1 was split into two technical bins. Bins 3, 4, and 5 were merged into a single technical
* bin, and another IBF was added for the merged bin.
* \image html hibf.svg width=90%
* ## Constructing the HIBF
*
* The individual IBFs may have a different number of technical bins and differ in their sizes, allowing an efficient
* distribution of the user bins.
* The HIBF is constructed by passing a seqan::hibf::config. There are two options required to be set:
* (1) seqan::hibf::config::input_fn and (2) seqan::hibf::config::number_of_user_bins. For all other options
* we have set sensible defaults.
*
* Here are all options with their defaults:
*
* \include test/snippet/hibf/hibf_construction.cpp
*
* ## Querying
* To query the Hierarchical Interleaved Bloom Filter for values, call
* seqan::hibf::hierarchical_interleaved_bloom_filter::membership_agent() and use the returned
* Please see the documentation of seqan::hibf::config for details on how to configure the HIBF construction.
*
* ## Membership Queries with the HIBF
*
* To allow efficient, thread-safe membership queries, you need to use the
* seqan::hibf::hierarchical_interleaved_bloom_filter::membership_agent.
* In contrast to the [seqan::hibf::interleaved_bloom_filter][1], the result will consist of indices of user bins.
*
* \include test/snippet/hibf/hierarchical_interleaved_bloom_filter.cpp
*
* You retrieve a membership_agent by calling seqan::hibf::hierarchical_interleaved_bloom_filter::membership_agent().
*
* You can then pass a **range of hashes** and a **threshold**.
*
* ### Thresholding
*
* Given a number `x` of hashes to query and a threshold value `t`, a query will return all user bin ids for which at
* least `t` hashes have been found in the respective user bin in the HIBF index. In other words, the hit
* count must be greater than or equal to `t` (`count >= t`).
*
* For all practical applications it is recommended to research sensible thresholds based on the data, the false
* positive rate, the length of the query and whether (canonical) k-mers, minimizers, syncmers, etc. were used for
* hashing genomic content.
*
* ## Counting Queries with the HIBF
*
* To count the occurrences in each user bin of a range of values in the Hierarchical Interleaved Bloom Filter, call
* seqan::hibf::hierarchical_interleaved_bloom_filter::counting_agent() and use
@@ -81,7 +91,36 @@ namespace seqan::hibf
* calls to `const` member functions are safe from multiple threads (as long as no thread calls
* a non-`const` member function at the same time).
*
* [1]: https://docs.seqan.de/seqan/3.0.3/classseqan3_1_1interleaved__bloom__filter.html
* # Details on the data structure
*
* The following gives some insights about the general design of the HIBF data structure. More details can be found
* in the publication: https://doi.org/10.1186/s13059-023-02971-4
*
* ## Terminology
*
* ### Technical Bin
* A Technical Bin represents an actual bin in the binning directory. In the IBF, it stores its kmers in a single Bloom
* Filter (which is interleaved with all the other BFs).
*
* ### User Bin
* The user may impose a structure on his sequence data in the form of logical groups (e.g. species). When querying the
* IBF, the user is interested in an answer that differentiates between these groups.
*
* ## Hierarchical Interleaved Bloom Filter (HIBF)
*
* In contrast to the [seqan::hibf::interleaved_bloom_filter][1], the user bins may be split across multiple technical
* bins, or multiple user bins may be merged into one technical bin. When merging multiple user bins, the HIBF stores
* another IBF that is built over the user bins constituting the merged bin. This lower-level IBF can then be used
* to further distinguish between merged bins.
*
* In this example, user bin 1 was split into two technical bins. Bins 3, 4, and 5 were merged into a single technical
* bin, and another IBF was added for the merged bin.
* \image html hibf.svg width=90%
*
* The individual IBFs may have a different number of technical bins and differ in their sizes, allowing an efficient
* distribution of the user bins.
*
* \see [seqan::hibf::interleaved_bloom_filter][1]
*/
class hierarchical_interleaved_bloom_filter
{
30 changes: 30 additions & 0 deletions test/snippet/hibf/hibf_construction.cpp
@@ -0,0 +1,30 @@
#include <cstddef> // for size_t
#include <vector>  // for vector

#include <hibf/config.hpp>                                // for insert_iterator, config
#include <hibf/hierarchical_interleaved_bloom_filter.hpp> // for hierarchical_interleaved_bloom_filter

int main()
{
    // 2 user bins:
    std::vector<std::vector<size_t>> hashes{{1u, 2u, 3u, 4u, 5u, 6u, 7u, 8u, 9u, 10u}, {1u, 2u, 3u, 4u, 5u}};

    // input just passes hashes:
    auto my_input = [&](size_t const user_bin_id, seqan::hibf::insert_iterator it)
    {
        for (auto const hash : hashes[user_bin_id])
            it = hash;
    };

    seqan::hibf::config config{.input_fn = my_input,                // required
                               .number_of_user_bins = 2,            // required
                               .number_of_hash_functions = 2,
                               .maximum_false_positive_rate = 0.05, // recommended to adapt
                               .threads = 1,                        // recommended to adapt
                               .sketch_bits = 12,
                               .tmax = 0,                           // triggers default computation
                               .alpha = 1.2,
                               .max_rearrangement_ratio = 0.5,
                               .disable_estimate_union = false,
                               .disable_rearrangement = false};

    // construct the HIBF
    seqan::hibf::hierarchical_interleaved_bloom_filter hibf{config};
}
46 changes: 46 additions & 0 deletions test/snippet/hibf/hierarchical_interleaved_bloom_filter.cpp
@@ -0,0 +1,46 @@
#include <cstddef>  // for size_t
#include <cstdint>  // for int64_t
#include <iostream> // for cout
#include <vector>   // for vector

#include <hibf/config.hpp>                                // for insert_iterator, config
#include <hibf/hierarchical_interleaved_bloom_filter.hpp> // for hierarchical_interleaved_bloom_filter

void print(std::vector<int64_t> const & vector)
{
    std::cout << '[';

    if (!vector.empty())
    {
        for (size_t i = 0u; i < vector.size() - 1u; ++i)
            std::cout << vector[i] << ',';
        std::cout << vector.back();
    }

    std::cout << "]\n";
}

int main()
{
    // 2 user bins:
    std::vector<std::vector<size_t>> hashes{{1u, 2u, 3u, 4u, 5u, 6u, 7u, 8u, 9u, 10u}, {1u, 2u, 3u, 4u, 5u}};

    // input just passes hashes:
    auto my_input = [&](size_t const user_bin_id, seqan::hibf::insert_iterator it)
    {
        for (auto const hash : hashes[user_bin_id])
            it = hash;
    };

    seqan::hibf::config config{.input_fn = my_input, .number_of_user_bins = 2};

    // construct the HIBF
    seqan::hibf::hierarchical_interleaved_bloom_filter hibf{config};

    // query the HIBF
    std::vector<size_t> query{1u, 2u, 3u};
    std::vector<size_t> query2{8u, 9u, 10u};

    auto agent = hibf.membership_agent();              // you need an agent for efficient queries
    auto & result = agent.membership_for(query, 2u);   // both user bins have hashes 1,2,3
    print(result);                                     // [1,0]
    agent.sort_results();                              // Results can also be sorted
    print(result);                                     // [0,1]
    auto & result2 = agent.membership_for(query2, 2u); // only user bin 0 has hashes 8,9,10
    print(result2);                                    // [0]
}
3 changes: 3 additions & 0 deletions test/snippet/hibf/hierarchical_interleaved_bloom_filter.out
@@ -0,0 +1,3 @@
[1,0]
[0,1]
[0]
