Saving/Loading a filter of 'bytes' objects doesn't work #31

DavidJBianco · 2020-03-04T15:16:37Z

I have code that creates a filter containing a bunch of bytes objects (basically, MD5 hashes of things). While that Python interpreter is still running, I can load the saved filter into a different filter object and it works:

Python 3.7.6 (default, Dec 30 2019, 19:38:28)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pybloomfilter import BloomFilter
>>> f = BloomFilter(capacity=10, error_rate=0.1, filename="test.bloom")
>>> f.add(b'de12aa3166ca2e5710233ddf5b5ffe1f')
False
>>> b'de12aa3166ca2e5710233ddf5b5ffe1f' in f
True
>>> f.sync()
>>> f.close()
>>> f2=BloomFilter.open("test.bloom")
>>> b'de12aa3166ca2e5710233ddf5b5ffe1f' in f2
True

However, if I quit the interpreter and load the filter again, it no longer works:

Python 3.7.6 (default, Dec 30 2019, 19:38:28)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pybloomfilter import BloomFilter
>>> f2=BloomFilter.open("test.bloom")
>>> b'de12aa3166ca2e5710233ddf5b5ffe1f' in f2
False

As you can see, I tested the existence of the same hash value, but got a different result. However, if I create the filter using strings instead of bytes objects, the save/reload test works:

Python 3.7.6 (default, Dec 30 2019, 19:38:28)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pybloomfilter import BloomFilter
>>> f = BloomFilter(capacity=10, error_rate=0.1, filename="test.bloom")
>>> f.add('de12aa3166ca2e5710233ddf5b5ffe1f')
False
>>> 'de12aa3166ca2e5710233ddf5b5ffe1f' in f
True
>>> f.sync()
>>> f.close()

Then starting a new interpreter:

Python 3.7.6 (default, Dec 30 2019, 19:38:28)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pybloomfilter import BloomFilter
>>> f2=BloomFilter.open("test.bloom")
>>> 'de12aa3166ca2e5710233ddf5b5ffe1f' in f2
True


mizvyt · 2020-05-29T14:41:23Z

@DavidJBianco which version of the lib are you running?

prashnts · 2020-08-23T04:04:59Z

Sorry if it's a dumb question, but I'm just looking at the snippet for now to help triage this (I'll check the source later): did you close f2 in the first example before starting the second interpreter?

Also, a capacity of 10 and an error rate of 0.1 aren't the best parameters to choose; that's "too small" for a Bloom filter, and the standard settings should perform better and more reliably ("small" depends on your use case, but this capacity might have a 1/9 chance of error :)

I'd try a bigger-capacity filter, and also try to isolate whether this is due to mmap access (slightly related: #38), edge cases in persistence, or a Python implementation detail. Otherwise we're breaking the "no false negatives" guarantee, and that needs to be fixed!
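
One Python implementation detail that would fit the symptoms is that the built-in hash() of bytes and str objects is salted per interpreter process unless PYTHONHASHSEED is fixed. This thread does not confirm that as the cause, so treat it as an assumption; the sketch below does not use pybloomfilter at all. It only shows that hash() of the same bytes value differs between fresh interpreters started with different seeds, while a digest from hashlib does not depend on the process. The helper names are hypothetical.

```python
import hashlib
import os
import subprocess
import sys

# The MD5-style value from the original report, as a bytes object.
ITEM = b"de12aa3166ca2e5710233ddf5b5ffe1f"


def builtin_hash_in_new_process(item: bytes, seed: str) -> int:
    """Return hash(item) as computed by a fresh interpreter.

    The seed is passed through PYTHONHASHSEED so the result is reproducible;
    with the variable unset, each new interpreter picks a random seed.
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({item!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


def stable_digest(item: bytes) -> int:
    """Return a 64-bit value derived from SHA-256; it never depends on the process."""
    return int.from_bytes(hashlib.sha256(item).digest()[:8], "big")


if __name__ == "__main__":
    print("hash() with seed 1:", builtin_hash_in_new_process(ITEM, "1"))
    print("hash() with seed 2:", builtin_hash_in_new_process(ITEM, "2"))
    print("SHA-256 based value:", stable_digest(ITEM))
```

If a filter derived its bit positions from hash(), a file written by one process could not be queried reliably from another, which would look exactly like a false negative after reopening.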

nxanton · 2021-10-25T08:35:55Z

I can confirm this bug. Try the following script (run it twice):

from pybloomfilter import BloomFilter
import os

FILTER = 'test.bloom'
# binary test data
VALUES = [ a.encode() for a in """Lorem ipsum dolor sit amet, 
consectetur adipiscing elit. Fusce 
imperdiet augue nec erat finibus, 
sit amet gravida ipsum dictum. Ut 
pellentesque, tellus at tempor 
fermentum, nulla sem accumsan enim, 
ac malesuada leo dui non elit. Sed 
vestibulum euismod tortor, vel 
pellentesque lacus dignissim ac. 
Proin eleifend cursus maximus. 
Praesent ornare ex non tempus luctus. 
""".split('\n')]

# create filter if it doesn't already exist
if not os.path.exists(FILTER):
    bf = BloomFilter(1000, 0.001, FILTER)
    for v in VALUES:
        bf.add(v)
    print("created filter")
    # make sure it is properly synced
    bf.sync()
    bf.close()

bf = BloomFilter.open(FILTER)
for value in VALUES:
    print(f"{value.decode():<45} in bf {value in bf}")

I get the following output:

> python3 pybloomtest.py
created filter
Lorem ipsum dolor sit amet,                   in bf True
consectetur adipiscing elit. Fusce            in bf True
imperdiet augue nec erat finibus,             in bf True
sit amet gravida ipsum dictum. Ut             in bf True
pellentesque, tellus at tempor                in bf True
fermentum, nulla sem accumsan enim,           in bf True
ac malesuada leo dui non elit. Sed            in bf True
vestibulum euismod tortor, vel                in bf True
pellentesque lacus dignissim ac.              in bf True
Proin eleifend cursus maximus.                in bf True
Praesent ornare ex non tempus luctus.         in bf True
                                              in bf True
> python3 pybloomtest.py
Lorem ipsum dolor sit amet,                   in bf False
consectetur adipiscing elit. Fusce            in bf False
imperdiet augue nec erat finibus,             in bf False
sit amet gravida ipsum dictum. Ut             in bf False
pellentesque, tellus at tempor                in bf False
fermentum, nulla sem accumsan enim,           in bf False
ac malesuada leo dui non elit. Sed            in bf False
vestibulum euismod tortor, vel                in bf False
pellentesque lacus dignissim ac.              in bf False
Proin eleifend cursus maximus.                in bf False
Praesent ornare ex non tempus luctus.         in bf False
                                              in bf True

As you can clearly see, the filter is unusable on the second run: the only value that is still "recognised" is the empty string.

I am running version 0.5.3.

nxanton · 2021-10-25T09:35:18Z

I just checked out this repo to debug the issue and noticed that the bug seems to be fixed in the latest (unreleased) version, while checking out 8542533 fails this test. It seems that #41 fixed this. Would it be possible to get a new release out with this fix? It is currently blocking us and poses a potential trap for others.

prashnts · 2021-10-25T15:15:01Z

So we can definitely make a new release with the #41 patch; I will check and see if that fixes it.

Thanks for a working example! I was able to reproduce that v0.5.3 does make the filter unusable, while master on HEAD works as intended. I'll push a release tonight or tomorrow.

prashnts · 2021-10-28T18:37:17Z

Alright, I've released a new version (v0.5.5) on PyPI! It's a patch release, so hopefully you won't need to change anything in your build as long as the version isn't pinned.

I messed up the v0.5.4 release and accidentally uploaded the wrong source, but thankfully noticed it soon enough to make another patch! That version is yanked on PyPI.

Hope this fixes your issue, and please open another issue if it does not.
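
For builds that do pin the version, a small guard can flag releases older than v0.5.5. This is a minimal sketch, not part of the library: the distribution name `pybloomfiltermmap3` is an assumption (the thread never names the PyPI package), the helper names are hypothetical, and `importlib.metadata` requires Python 3.8 or later.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# First release reported in this thread to contain the fix.
FIRST_FIXED = (0, 5, 5)
# Assumed PyPI distribution name; adjust it to whatever your build installs.
ASSUMED_DIST_NAME = "pybloomfiltermmap3"


def parse_version(text: str) -> Tuple[int, int, int]:
    """Extract the leading MAJOR.MINOR.PATCH numbers from a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def has_bytes_fix(version_text: str) -> bool:
    """Return True if the version is v0.5.5 or newer."""
    return parse_version(version_text) >= FIRST_FIXED


def installed_version(dist_name: str = ASSUMED_DIST_NAME) -> Optional[str]:
    """Return the installed version of dist_name, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print(f"{ASSUMED_DIST_NAME} is not installed")
    elif has_bytes_fix(version):
        print(f"{version}: includes the fix")
    else:
        print(f"{version}: predates v0.5.5, filters of bytes may not survive a reload")
```

The check treats v0.5.4 as unfixed, which matches the yanked release described above.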

prashnts mentioned this issue Aug 23, 2020, in "__len__(bf) misreports after open(filename)" #28 (open)

prashnts closed this as completed Oct 28, 2021
