Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inverted indexes #56

Open
staltz opened this issue Dec 3, 2020 · 3 comments
Open

Inverted indexes #56

staltz opened this issue Dec 3, 2020 · 3 comments
Labels
idea / experiment Good for newcomers

Comments

@staltz
Copy link
Member

staltz commented Dec 3, 2020

I just had an idea that is either silly/naive or groundbreaking.

This was inspired by your sort idea #54. So, normal indexes in jitdb can be considered a hash map from offset to boolean:

0 -> 0
1 -> 0
2 -> 1
3 -> 0
4 -> 1
5 -> 1
6 -> 0
7 -> 1

But prefix indexes are:

0 -> 0
1 -> %c9A
2 -> 0
3 -> %bWw
4 -> %c9A
5 -> 0
6 -> %zy7
7 -> %bWw

And the sort idea is just a way of squishing all the similar values together. What we need to do instead is invert the hash map above, from prefixes to offsets:

%c9A -> [1, 4]
%bWw -> [3, 7]
%zy7 -> [6]

Now we don't even need to sort, we get all the values that match the msgid. And look at all that space eliminated by not storing the zeroes. Since we saved up so much space, we can actually quit the prefix and go back to full id:

%c9Ac8fc8f0Aqmcuz/8sc=.ed25519 -> [1, 4]
%bWw39c91CIelu1285riC=.ed25519 -> [3, 7]
%zy718ci0aZmc58aTFUC0=.ed25519 -> [6]

Which eliminates the need to do a second iteration to remove false positives!

Considerations: the normal index doesn't directly store the offsets because those are implicit in the arrangement of the values, but inverted index stores all the offsets plus the full key, so it "uses more space". On the other hand, we're not storing all those zeroes, and there sure was a lot of zeroes. Might be interesting to test this in practice though.

Now, the inverted index above really looks like a leveldb kind of thing, so maybe we're just reinventing leveldb. Or maybe not! We don't need to sort the keys, and leveldb has strict sorting of keys lexicographically, and sorting isn't free. We just need to load the whole thing into memory and we don't care about sorting.

Thoughts @arj03 ?

@staltz
Copy link
Member Author

staltz commented Dec 3, 2020

More about the storage of zeroes: for instance take the conventional index value_author_andrestaltzphone. I ran the statistics here, and it's 97% zeroes, and this is 166KB file. Now do that times N authors. Assume we have 16k authors here on my laptop, then you'd get in total 16k files totalling 2.6GB, of which 2.3GB is zeroes. I'm pretty sure that eliminating all those zeroes and using an inverted index would get us down to megabytes. If it's like a single 100MB file, we can chop that into pieces.

@arj03
Copy link
Member

arj03 commented Dec 3, 2020

Now, the inverted index above really looks like a leveldb kind of thing, so maybe we're just reinventing leveldb. Or maybe not! We don't need to sort the keys, and leveldb has strict sorting of keys lexicographically, and sorting isn't free. We just need to load the whole thing into memory and we don't care about sorting.

I think this would be very much like level then yes.

Assume we have 16k authors here on my laptop

I think its more like 1k, so thats 166mb. Hmm still a lot yes.

@arj03
Copy link
Member

arj03 commented Dec 4, 2020

About the author index, we do already have a level index of author messages sorted by sequence.

@staltz staltz added the idea / experiment Good for newcomers label Dec 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
idea / experiment Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants