I just had an idea that is either silly/naive or groundbreaking.

This was inspired by your sort idea #54. Normal indexes in jitdb can be considered a hash map from offset to boolean, while prefix indexes map each offset to a short prefix of the value, and the sort idea is just a way of squishing all the similar values together. What we need to do instead is invert that map, going from prefixes to offsets:
%c9A -> [1, 4]
%bWw -> [3, 7]
%zy7 -> [6]
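The inversion itself is tiny. A sketch in plain JavaScript, with an object standing in for jitdb's on-disk prefix index (the data is the example above; none of these names come from jitdb's actual API):

```javascript
// Hypothetical prefix index: record offset -> 3-char prefix of the msgid.
// In jitdb this would live in a file; a plain object stands in here.
const prefixIndex = {
  1: '%c9A',
  3: '%bWw',
  4: '%c9A',
  6: '%zy7',
  7: '%bWw',
};

// Invert it: prefix -> array of offsets. Non-matching (zero) entries simply
// never appear, which is where the space saving comes from.
function invert(index) {
  const inverted = new Map();
  for (const [offset, prefix] of Object.entries(index)) {
    if (!inverted.has(prefix)) inverted.set(prefix, []);
    inverted.get(prefix).push(Number(offset));
  }
  return inverted;
}

const inverted = invert(prefixIndex);
console.log(inverted.get('%c9A')); // [ 1, 4 ]
```

A lookup is now one `Map.get` instead of a scan over every record's slot.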
Now we don't even need to sort: we directly get all the offsets whose values match the msgid. And look at all the space eliminated by not storing the zeroes. Since we saved so much space, we can even drop the prefix and go back to the full id, which eliminates the need for a second iteration to remove false positives!
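Why the full id removes that second iteration can be shown with two msgids that share a prefix (the ids and structures here are made up for illustration):

```javascript
// Two distinct msgids sharing the same 3-char prefix: a prefix index
// cannot tell them apart.
const log = new Map([
  [1, '%c9Aaaaa.sha256'],
  [4, '%c9Abbbb.sha256'],
]);

// Inverted *prefix* index: both offsets land under the same key, so a
// lookup must still re-check the log to filter out false positives.
const byPrefix = new Map([['%c9A', [1, 4]]]);
const candidates = byPrefix.get('%c9Aaaaa.sha256'.slice(0, 4)) || [];
const verified = candidates.filter((o) => log.get(o) === '%c9Aaaaa.sha256');

// Inverted *full-id* index: the key is the whole msgid, so the answer is
// exact and the verification pass disappears.
const byFullId = new Map([
  ['%c9Aaaaa.sha256', [1]],
  ['%c9Abbbb.sha256', [4]],
]);
const exact = byFullId.get('%c9Aaaaa.sha256') || [];
console.log(verified, exact); // both give only offset 1
```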
Considerations: the normal index doesn't directly store the offsets, because those are implicit in the arrangement of the values, whereas the inverted index stores all the offsets plus the full key, so in that sense it "uses more space". On the other hand, we're no longer storing all those zeroes, and there sure were a lot of zeroes. Might be interesting to test this in practice, though.
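A back-of-envelope size model of that trade-off, with all constants made up for illustration (jitdb's real record sizes will differ):

```javascript
// Rough size model -- every constant here is an assumption.
const N = 1_000_000;   // total records in the log
const m = 300;         // records matching this key
const keyBytes = 53;   // one full ssb msgid, roughly
const offsetBytes = 4; // one stored offset

// Normal index: one slot per record, offsets implicit in the arrangement,
// so the size is proportional to the whole log -- mostly zeroes.
const normalBytes = N * 1;

// Inverted index: the full key once, plus only the matching offsets.
const invertedBytes = keyBytes + m * offsetBytes;

console.log(normalBytes, invertedBytes); // 1000000 vs 1253
```

The inverted layout pays per *match*, the normal layout pays per *record*, so the sparser the key, the bigger the win.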
Now, the inverted index above really looks like a leveldb kind of thing, so maybe we're just reinventing leveldb. Or maybe not! leveldb keeps its keys strictly sorted lexicographically, and sorting isn't free; we don't need to sort the keys at all, we just load the whole thing into memory.

Thoughts @arj03 ?
More about the storage of zeroes: take, for instance, the conventional index value_author_andrestaltzphone. I ran the statistics on it: it's 97% zeroes, in a 166KB file. Now multiply that by N authors. Assuming 16k authors here on my laptop, you'd get 16k files totalling 2.6GB, of which 2.3GB is zeroes. I'm pretty sure that eliminating all those zeroes with an inverted index would get us down to megabytes. And even if it ends up as a single 100MB file, we can chop that into pieces.
> Now, the inverted index above really looks like a leveldb kind of thing, so maybe we're just reinventing leveldb.

I think this would be very much like level then, yes.
> Assume we have 16k authors here on my laptop

I think it's more like 1k, so that's 166MB. Hmm, still a lot, yes.