Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Number of hash functions (k) should be floored, not ceiled #45

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

ashtul
Copy link

@ashtul ashtul commented Aug 24, 2024

Historically, when calculating the number of hash functions (k) for Bloom filters, developers have commonly opted to use ceil() on the resulting float value. However, this approach is suboptimal. Instead, calling floor() on the result offers clear advantages over ceiling.

Flooring the number of hash functions does not meaningfully degrade the false-positive rate, as reducing by just one function still maintains a similar error probability. The risk of degradation only becomes a concern when additional hash functions are removed. However, by using floor() instead of ceil(), benchmarks demonstrate a notable 10% improvement in processing time. This improvement stems from the reduced number of hash functions, which decreases the filter's fill rate while keeping the same total bit count. As a result, the computational efficiency is enhanced without significantly compromising accuracy.

Moreover, having fewer hash functions means that, for elements that are 'probably in the set,' fewer bits need to be checked. For elements that are 'definitely not in the set,' each hash function has a higher chance of failing earlier due to the lower fill rate, further boosting overall efficiency.

With ceil()
test bench_get_100     ... bench:          62.71 ns/iter (+/- 0.60)
test bench_get_1000    ... bench:          62.35 ns/iter (+/- 0.63)
test bench_get_m_1     ... bench:          79.28 ns/iter (+/- 2.05)
test bench_get_m_10    ... bench:         101.78 ns/iter (+/- 10.46)
test bench_insert_100  ... bench:         119.64 ns/iter (+/- 3.92)
test bench_insert_1000 ... bench:         119.94 ns/iter (+/- 3.78)
test bench_insert_m_1  ... bench:         127.84 ns/iter (+/- 2.51)
test bench_insert_m_10 ... bench:         183.98 ns/iter (+/- 33.69)

to

With floor()
test bench_get_100     ... bench:          55.19 ns/iter (+/- 0.90)
test bench_get_1000    ... bench:          57.16 ns/iter (+/- 0.49)
test bench_get_m_1     ... bench:          71.36 ns/iter (+/- 0.70)
test bench_get_m_10    ... bench:          94.81 ns/iter (+/- 17.49)
test bench_insert_100  ... bench:         105.73 ns/iter (+/- 7.65)
test bench_insert_1000 ... bench:         105.82 ns/iter (+/- 3.88)
test bench_insert_m_1  ... bench:         115.46 ns/iter (+/- 3.02)
test bench_insert_m_10 ... bench:         162.82 ns/iter (+/- 18.40)

@jedisct1
Copy link
Owner

jedisct1 commented Nov 5, 2024

Shouldn't the number of hash functions be rounded instead?

@jedisct1 jedisct1 closed this in 2acb4f8 Nov 5, 2024
@jedisct1 jedisct1 reopened this Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants