Inverted indexes #56

staltz · 2020-12-03T21:02:36Z

I just had an idea that is either silly/naive or groundbreaking.

This was inspired by your sort idea #54. So, normal indexes in jitdb can be considered a hash map from offset to boolean:

0 -> 0
1 -> 0
2 -> 1
3 -> 0
4 -> 1
5 -> 1
6 -> 0
7 -> 1

But prefix indexes are:

0 -> 0
1 -> %c9A
2 -> 0
3 -> %bWw
4 -> %c9A
5 -> 0
6 -> %zy7
7 -> %bWw

And the sort idea is just a way of squishing all the similar values together. What we need to do instead is invert the hash map above, from prefixes to offsets:

%c9A -> [1, 4]
%bWw -> [3, 7]
%zy7 -> [6]

Now we don't even need to sort, we get all the values that match the msgid. And look at all that space eliminated by not storing the zeroes. Since we saved up so much space, we can actually quit the prefix and go back to full id:

%c9Ac8fc8f0Aqmcuz/8sc=.ed25519 -> [1, 4]
%bWw39c91CIelu1285riC=.ed25519 -> [3, 7]
%zy718ci0aZmc58aTFUC0=.ed25519 -> [6]

Which eliminates the need to do a second iteration to remove false positives!

Considerations: the normal index doesn't directly store the offsets because those are implicit in the arrangement of the values, but inverted index stores all the offsets plus the full key, so it "uses more space". On the other hand, we're not storing all those zeroes, and there sure was a lot of zeroes. Might be interesting to test this in practice though.

Now, the inverted index above really looks like a leveldb kind of thing, so maybe we're just reinventing leveldb. Or maybe not! We don't need to sort the keys, and leveldb has strict sorting of keys lexicographically, and sorting isn't free. We just need to load the whole thing into memory and we don't care about sorting.

Thoughts @arj03 ?

The text was updated successfully, but these errors were encountered:

staltz · 2020-12-03T21:15:16Z

More about the storage of zeroes: for instance take the conventional index value_author_andrestaltzphone. I ran the statistics here, and it's 97% zeroes, and this is 166KB file. Now do that times N authors. Assume we have 16k authors here on my laptop, then you'd get in total 16k files totalling 2.6GB, of which 2.3GB is zeroes. I'm pretty sure that eliminating all those zeroes and using an inverted index would get us down to megabytes. If it's like a single 100MB file, we can chop that into pieces.

arj03 · 2020-12-03T21:25:49Z

Now, the inverted index above really looks like a leveldb kind of thing, so maybe we're just reinventing leveldb. Or maybe not! We don't need to sort the keys, and leveldb has strict sorting of keys lexicographically, and sorting isn't free. We just need to load the whole thing into memory and we don't care about sorting.

I think this would be very much like level then yes.

Assume we have 16k authors here on my laptop

I think its more like 1k, so thats 166mb. Hmm still a lot yes.

arj03 · 2020-12-04T07:14:14Z

About the author index, we do already have a level index of author messages sorted by sequence.

staltz added the idea / experiment Good for newcomers label Dec 14, 2020

staltz mentioned this issue Jan 22, 2021

Optimize prefixes for queries #95

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inverted indexes #56

Inverted indexes #56

staltz commented Dec 3, 2020

staltz commented Dec 3, 2020 •

edited

Loading

arj03 commented Dec 3, 2020

arj03 commented Dec 4, 2020

Inverted indexes #56

Inverted indexes #56

Comments

staltz commented Dec 3, 2020

staltz commented Dec 3, 2020 • edited Loading

arj03 commented Dec 3, 2020

arj03 commented Dec 4, 2020

staltz commented Dec 3, 2020 •

edited

Loading