Saving/Loading a filter of 'bytes' objects doesn't work #31

Closed
DavidJBianco opened this issue Mar 4, 2020 · 6 comments

Comments

@DavidJBianco

I have code that creates a filter containing a bunch of bytes objects (basically, MD5 hashes of things). While that Python interpreter is still running, I can load the saved filter into a different filter object and it works:

Python 3.7.6 (default, Dec 30 2019, 19:38:28)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pybloomfilter import BloomFilter
>>> f = BloomFilter(capacity=10, error_rate=0.1, filename="test.bloom")
>>> f.add(b'de12aa3166ca2e5710233ddf5b5ffe1f')
False
>>> b'de12aa3166ca2e5710233ddf5b5ffe1f' in f
True
>>> f.sync()
>>> f.close()
>>> f2=BloomFilter.open("test.bloom")
>>> b'de12aa3166ca2e5710233ddf5b5ffe1f' in f2
True

However, if I quit the interpreter and load the filter again, it no longer works:

Python 3.7.6 (default, Dec 30 2019, 19:38:28)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pybloomfilter import BloomFilter
>>> f2=BloomFilter.open("test.bloom")
>>> b'de12aa3166ca2e5710233ddf5b5ffe1f' in f2
False

As you can see, I tested the existence of the same hash value, but got a different result. However, if I create the filter using strings instead of bytes objects, the save/reload test works:

Python 3.7.6 (default, Dec 30 2019, 19:38:28)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pybloomfilter import BloomFilter
>>> f = BloomFilter(capacity=10, error_rate=0.1, filename="test.bloom")
>>> f.add('de12aa3166ca2e5710233ddf5b5ffe1f')
False
>>> 'de12aa3166ca2e5710233ddf5b5ffe1f' in f
True
>>> f.sync()
>>> f.close()

Then starting a new interpreter:

Python 3.7.6 (default, Dec 30 2019, 19:38:28)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pybloomfilter import BloomFilter
>>> f2=BloomFilter.open("test.bloom")
>>> 'de12aa3166ca2e5710233ddf5b5ffe1f' in f2
True
@mizvyt
Contributor

mizvyt commented May 29, 2020

@DavidJBianco which version of the lib are you running?

@prashnts
Owner

Sorry if this is a dumb question, but I'm just looking at the snippet for now to help triage this (I'll check the source later): did you close f2 in the first example before starting the second interpreter?

Also, a capacity of 10 and an error rate of 0.1 aren't the best parameters to choose. That is "too small" for a Bloom filter, and a standard set should perform better and more reliably at that size ("small" depends on your use case, but this capacity might have a 1/9 chance of error :)

I'd try with a bigger-capacity filter, and also try to isolate whether this is due to mmap access (slightly related to #38), edge cases in persistence, or a Python implementation detail. Otherwise we're breaking the "no false negatives" guarantee, and that needs to be fixed!
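
To put numbers on the parameter discussion above, here is a minimal sketch that sizes a Bloom filter with the textbook formulas. It is an illustration only: the function names are hypothetical, and the library may size its filters differently internally. Note that sizing only changes the false-positive rate; an item that was added should never test as absent, so small parameters alone would not explain the `False` result reported above.

```python
import math
from typing import Tuple


def bloom_parameters(capacity: int, error_rate: float) -> Tuple[int, int]:
    """Return (num_bits, num_hashes) from the textbook Bloom filter formulas.

    Hypothetical helper for illustration; not part of pybloomfilter.
    """
    # m = -n * ln(p) / (ln 2)^2
    num_bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to a whole number of hash functions
    num_hashes = max(1, round(num_bits / capacity * math.log(2)))
    return num_bits, num_hashes


def expected_false_positive_rate(num_bits: int, num_hashes: int, inserted: int) -> float:
    """Approximate false-positive rate after `inserted` items were added."""
    return (1 - math.exp(-num_hashes * inserted / num_bits)) ** num_hashes


if __name__ == "__main__":
    # The two parameter sets that appear in this thread.
    for capacity, error_rate in [(10, 0.1), (1000, 0.001)]:
        bits, hashes = bloom_parameters(capacity, error_rate)
        rate = expected_false_positive_rate(bits, hashes, capacity)
        print(f"capacity={capacity} error_rate={error_rate}: "
              f"{bits} bits, {hashes} hashes, approx. FP rate {rate:.4f}")
```

With these formulas, capacity 10 at error rate 0.1 needs only 48 bits and 3 hash functions, which shows how tiny the filter in the original report is.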

@nxanton

nxanton commented Oct 25, 2021

I can confirm this bug. Try the following script (run it twice):

from pybloomfilter import BloomFilter
import os

FILTER = 'test.bloom'
# binary test data
VALUES = [ a.encode() for a in """Lorem ipsum dolor sit amet, 
consectetur adipiscing elit. Fusce 
imperdiet augue nec erat finibus, 
sit amet gravida ipsum dictum. Ut 
pellentesque, tellus at tempor 
fermentum, nulla sem accumsan enim, 
ac malesuada leo dui non elit. Sed 
vestibulum euismod tortor, vel 
pellentesque lacus dignissim ac. 
Proin eleifend cursus maximus. 
Praesent ornare ex non tempus luctus. 
""".split('\n')]

# create filter if it doesn't already exist
if not os.path.exists(FILTER):
    bf = BloomFilter(1000, 0.001, FILTER)
    for v in VALUES:
        bf.add(v)
    print("created filter")
    # make sure it is properly synced
    bf.sync()
    bf.close()

bf = BloomFilter.open(FILTER)
for value in VALUES:
    print(f"{value.decode():<45} in bf {value in bf}")

I get the following output:

> python3 pybloomtest.py
created filter
Lorem ipsum dolor sit amet,                   in bf True
consectetur adipiscing elit. Fusce            in bf True
imperdiet augue nec erat finibus,             in bf True
sit amet gravida ipsum dictum. Ut             in bf True
pellentesque, tellus at tempor                in bf True
fermentum, nulla sem accumsan enim,           in bf True
ac malesuada leo dui non elit. Sed            in bf True
vestibulum euismod tortor, vel                in bf True
pellentesque lacus dignissim ac.              in bf True
Proin eleifend cursus maximus.                in bf True
Praesent ornare ex non tempus luctus.         in bf True
                                              in bf True
> python3 pybloomtest.py
Lorem ipsum dolor sit amet,                   in bf False
consectetur adipiscing elit. Fusce            in bf False
imperdiet augue nec erat finibus,             in bf False
sit amet gravida ipsum dictum. Ut             in bf False
pellentesque, tellus at tempor                in bf False
fermentum, nulla sem accumsan enim,           in bf False
ac malesuada leo dui non elit. Sed            in bf False
vestibulum euismod tortor, vel                in bf False
pellentesque lacus dignissim ac.              in bf False
Proin eleifend cursus maximus.                in bf False
Praesent ornare ex non tempus luctus.         in bf False
                                              in bf True

As you can clearly see, the filter is unusable the second time. The only value that is still "recognised" is the empty string.

I am running version 0.5.3.
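
When a lookup only fails in a fresh interpreter, one thing worth ruling out is Python's per-process hash randomization: the built-in `hash()` of a `bytes` (or `str`) object changes between interpreter runs unless `PYTHONHASHSEED` is fixed. Whether the library relies on `hash()` for `bytes` items is an assumption here, not something this thread confirms. The sketch below only demonstrates the Python behaviour and contrasts it with a content-based hash; the helper names are hypothetical.

```python
import hashlib
import os
import subprocess
import sys

# Same value as in the original report.
ITEM = b"de12aa3166ca2e5710233ddf5b5ffe1f"


def builtin_hash_in_fresh_interpreter(seed: str) -> int:
    """Run hash(ITEM) in a new interpreter started with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({ITEM!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


def stable_hash(item: bytes) -> int:
    """Content-based hash that does not depend on the interpreter session."""
    return int.from_bytes(hashlib.md5(item).digest()[:8], "big")


if __name__ == "__main__":
    # Different seeds stand in for two separate interpreter sessions.
    print("built-in hash, seed 1:", builtin_hash_in_fresh_interpreter("1"))
    print("built-in hash, seed 2:", builtin_hash_in_fresh_interpreter("2"))
    print("content-based hash:   ", stable_hash(ITEM))
```

If bit positions were derived from a session-dependent hash, a filter written in one process could not be queried reliably from another, which would match the symptom of every non-empty value turning `False` on the second run.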

@nxanton

nxanton commented Oct 25, 2021

I just checked out this repo to debug the issue and noticed that the bug seems to be fixed in the latest (unreleased) version, while checking out 8542533 fails this test. It seems that #41 fixed this. Would it be possible to get a new release out with this fix? It is currently blocking us and poses a potential trap for others.

@prashnts
Owner

So we can definitely make a new release with the #41 patch; I will check and see if that fixes it.

Thanks for a working example! I was able to reproduce that v0.5.3 does make the filter unusable, while HEAD of master works as intended. I'll push a release tonight or tomorrow.

@prashnts
Owner

Alright, I've released a new version (v0.5.5) on PyPI! It's a patch release, so hopefully you won't need to do anything in your build as long as the version isn't pinned.

I messed up the v0.5.4 release and accidentally uploaded the wrong source, but thankfully noticed it soon enough to make another patch! That version is yanked on PyPI.

Hope this fixes your issue, and please open another issue if it does not.
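
For builds that do pin the version, a quick check like the following can confirm that the installed release is at least v0.5.5. This is a sketch under assumptions: the distribution name `pybloomfiltermmap3` is not stated in this thread, the helper names are hypothetical, and `importlib.metadata` requires Python 3.8 or newer.

```python
import re
from importlib import metadata
from typing import Optional

# Assumed PyPI distribution name; adjust if your environment uses another one.
DIST_NAME = "pybloomfiltermmap3"

# First release mentioned in this thread as containing the fix.
FIXED_VERSION = (0, 5, 5)


def has_bytes_fix(version_text: str) -> bool:
    """Return True if the version string is v0.5.5 or later."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_text)
    if match is None:
        return False
    return tuple(int(part) for part in match.groups()) >= FIXED_VERSION


def installed_version(dist_name: str = DIST_NAME) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print(f"{DIST_NAME} is not installed")
    elif has_bytes_fix(version):
        print(f"{DIST_NAME} {version} includes the fix")
    else:
        print(f"{DIST_NAME} {version} predates v0.5.5; consider upgrading")
```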


5 participants
@prashnts @DavidJBianco @mizvyt @nxanton and others