Number of hash functions (k) should be floored, not ceiled #45

ashtul · 2024-08-24T16:16:08Z

Historically, when calculating the number of hash functions (k) for Bloom filters, developers have commonly opted to use ceil() on the resulting float value. However, this approach is suboptimal. Instead, calling floor() on the result offers clear advantages over ceiling.

Flooring the number of hash functions does not meaningfully degrade the false-positive rate, as reducing by just one function still maintains a similar error probability. The risk of degradation only becomes a concern when additional hash functions are removed. However, by using floor() instead of ceil(), benchmarks demonstrate a notable 10% improvement in processing time. This improvement stems from the reduced number of hash functions, which decreases the filter's fill rate while keeping the same total bit count. As a result, the computational efficiency is enhanced without significantly compromising accuracy.

Moreover, having fewer hash functions means that, for elements that are 'probably in the set,' fewer bits need to be checked. For elements that are 'definitely not in the set,' each hash function has a higher chance of failing earlier due to the lower fill rate, further boosting overall efficiency.

With ceil()
test bench_get_100     ... bench:          62.71 ns/iter (+/- 0.60)
test bench_get_1000    ... bench:          62.35 ns/iter (+/- 0.63)
test bench_get_m_1     ... bench:          79.28 ns/iter (+/- 2.05)
test bench_get_m_10    ... bench:         101.78 ns/iter (+/- 10.46)
test bench_insert_100  ... bench:         119.64 ns/iter (+/- 3.92)
test bench_insert_1000 ... bench:         119.94 ns/iter (+/- 3.78)
test bench_insert_m_1  ... bench:         127.84 ns/iter (+/- 2.51)
test bench_insert_m_10 ... bench:         183.98 ns/iter (+/- 33.69)

to

With floor()
test bench_get_100     ... bench:          55.19 ns/iter (+/- 0.90)
test bench_get_1000    ... bench:          57.16 ns/iter (+/- 0.49)
test bench_get_m_1     ... bench:          71.36 ns/iter (+/- 0.70)
test bench_get_m_10    ... bench:          94.81 ns/iter (+/- 17.49)
test bench_insert_100  ... bench:         105.73 ns/iter (+/- 7.65)
test bench_insert_1000 ... bench:         105.82 ns/iter (+/- 3.88)
test bench_insert_m_1  ... bench:         115.46 ns/iter (+/- 3.02)
test bench_insert_m_10 ... bench:         162.82 ns/iter (+/- 18.40)

jedisct1 · 2024-11-05T17:19:36Z

Shouldn't the number of hash functions be rounded instead?

ashtul added 3 commits August 24, 2024 18:40

add benchmarks

4d7001d

add false positive test

6befceb

always floor your bloom

3230729

ashtul force-pushed the floor branch from 880d882 to 3230729 Compare August 24, 2024 16:25

ashtul marked this pull request as ready for review September 1, 2024 10:37

ashtul mentioned this pull request Sep 1, 2024

[NEW] Valkey-Bloom: BloomFilter support for Valkey. valkey-io/valkey#407

Open

jedisct1 closed this in 2acb4f8 Nov 5, 2024

jedisct1 reopened this Nov 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Number of hash functions (k) should be floored, not ceiled #45

Number of hash functions (k) should be floored, not ceiled #45

ashtul commented Aug 24, 2024 •

edited

Loading

jedisct1 commented Nov 5, 2024

Number of hash functions (k) should be floored, not ceiled #45

Are you sure you want to change the base?

Number of hash functions (k) should be floored, not ceiled #45

Conversation

ashtul commented Aug 24, 2024 • edited Loading

jedisct1 commented Nov 5, 2024

ashtul commented Aug 24, 2024 •

edited

Loading