HLL++ in sparse mode can be larger than in normal mode #38
Comments
I think you are right. We've struggled a bit with this threshold value. Matt
Unfortunately I think that I will be too busy in the coming weeks to help you fix this one, and my current knowledge of this area of HLL++ is quite limited. However, I will try to give it a look. I also noticed that the two threshold values are "added": the sort threshold is added to the sparseSetThreshold, postponing the switch to the normal representation even further. What do you think about lazy RegisterSet creation? Wouldn't it be better and meet the goal of the sparse mode? This one is quite easy to fix and does not require a deep knowledge of the paper.
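A minimal sketch of the lazy-creation idea (hypothetical names standing in for the real stream-lib classes, not the actual code):

```java
// Sketch of lazy RegisterSet creation. The dense register array is only
// allocated when the estimator actually leaves sparse mode.
class LazyRegisters {
    private final int p;        // precision, as in HyperLogLogPlus(p, sp)
    private int[] registerSet;  // stand-in for stream-lib's RegisterSet; null while sparse

    LazyRegisters(int p) {
        this.p = p;
    }

    int[] registers() {
        if (registerSet == null) {
            registerSet = new int[1 << p]; // first use = sparse-to-normal switch
        }
        return registerSet;
    }
}
```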
I spent some time investigating this, and the issue is how we store the sparse set: at 12.4K it is still about 1.5K bigger than a comparable normal HLL+. While investigating this I also noticed that we were under counting. Also, your point on creating the RegisterSet even when we are in normal mode is a fair one. Matt
Alright, good news. I've modified the code so that we can read and write sparse representations with no delimiter. This means the sparse set with 5K elements is down to 8K. Further good news is that I realized we can use varint encoding for encoding/decoding normal representations as well. This brings the encoded size of a normal serialized HLL++ object with p=14 to 7K (down from 10K). This will certainly break binary compatibility. I'm thinking of cutting the 2.4.0 release and then moving to a 3.x snapshot for this change.
Thanks for your work. I am definitely not fluent with the sparse mode and will most likely say silly things. Please correct me when I am wrong.
Doesn't it have exactly the same impact on the memory footprint?
I currently prefer not to be that adventurous ;)
Doesn't it introduce a regression for high-cardinality or pathological streams? Don't we get a similar compression factor using Snappy or gzip? Lots of zeros in a row usually compress quite well, most likely better than an integer with only bits 25 to 30 set under a varint encoding. I think that if the ratio is the same, or even slightly better, then it is better to keep the dumb version. The size of the serialized object is highly predictable with an easy-to-check upper bound, and users are free to use the codec they want. Do they want to be fast or size efficient? Let them decide.
Sounds very reasonable to me. For 2.4.0 we could: …
Personally, I had no alternative other than to hard code a threshold value, since we always use the same p. Your suggestion regarding the sparse mode serialization in 3.0.0 sounds fine to me. I will try to see if I can find something more efficient. Is your code available somewhere?
[4 bytes for length of varint][n bytes for varint]... The 4 bytes for the length of the varint were not necessary and can be totally removed.
I've thought about this a bit further and I think we may be able to do this without breaking backwards compatibility.
Matt
I've pushed one more update to the serialization branch. Changes are:

- do not use varint for encoding normal register sets

Net result is we should be able to deploy the new code in 2.4.0x without breaking existing systems. They will magically get the new encoding format, so after HLL+ objects serialized with older code are decoded and then re-encoded using the 2.4.0x branch, the re-encoded size should be much smaller if the HLL+ was in SPARSE mode. It would be great if you could test this branch and confirm performance/correctness. I'll move it to master after you've had a chance to test. Thanks,
It definitely has an impact on memory usage. Basically the memory overhead for each byte[] is 12 bytes (8 bytes of object overhead, 4 bytes for the array length). An int is 4 bytes; we are trying to optimize it down to 1, 2 or 3 bytes for small values. Game over: not doable in Java, nor in other languages if you consider memory alignment issues, except in a large byte stream like the one we use for serialization.

```java
HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);
for (int i = 0; i < 10_000; i++) {
    hllp.offer(i);
}
System.out.println("Size: " + hllp.getBytes().length);
```

It prints 43587, but the retained size of the whole object in memory is even larger. I'm not sure I fully understand all the code, but according to these facts I do believe that deltaAdd only leads to increased memory consumption. Wouldn't a plain old array of ints be more memory efficient?
Fine for me.
Good idea. We currently only use the first bit of the first byte. Plenty of unused space to encode a version or whatever.
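A sketch of that idea; only the "first bit of the first byte" fact comes from the comment above, the rest of the layout here is hypothetical:

```java
// Hypothetical header layout: reuse the 7 spare bits of the first byte as a
// format version. Bit 0 is the existing flag mentioned above; everything
// else in this sketch is an assumption, not the actual stream-lib format.
static int headerVersion(byte[] serialized) {
    byte header = serialized[0];
    boolean existingFlag = (header & 0x01) != 0; // the one bit already in use
    return (header >>> 1) & 0x7F;                // 7 unused bits as a version
}
```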
Cool! I will take a look at it.
I can merge your branch into mine without "polluting" master, if you prefer. I will keep you informed about the output of my tests (ETA Friday or Monday).
I misunderstood your comment re memory; I thought you were referring just to the impact of serialization on memory footprint. I think you are probably correct: if we stick with an int array for the in-memory representation it should produce a better result. When we encode/decode the array we can convert it to a delta-encoded varint byte stream for an efficient serialized representation. I'll take a crack at these changes soon.
I'm currently playing with your branch and testing the byte[] -> int[] switch. Seems doable; my branch passes the test suite. It still uses a List, which is a huge memory waste due to autoboxing. Since we are always adding to the end of the list, using an array of ints is an option and will make the sparse mode memory efficient. BTW: perhaps we could base this work on pull request #34 (even if we keep the IOException)? Regarding the varint serialization, we should be able to avoid the extra byte added before each varint to indicate its size. It is usually piggybacked: if you are using the same encoding as protocol buffers, you can detect the last byte using the most significant bit of each byte. If it is set to 1 then you must read the next byte, otherwise you have reached the end of the varint. Let me know if you want me to work on any of these topics.
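For reference, a minimal sketch of that continuation-bit scheme (the same one protocol buffers use); illustrative, not the actual stream-lib code:

```java
import java.io.ByteArrayOutputStream;

public class VarInt {
    // Encode a non-negative int, 7 bits per byte, low-order group first.
    // The most significant bit of each byte is the continuation flag:
    // 1 = more bytes follow, 0 = this is the last byte.
    static void writeVarInt(int value, ByteArrayOutputStream out) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVarInt(byte[] buf, int offset) {
        int value = 0, shift = 0;
        byte b;
        do {
            b = buf[offset++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);   // continuation bit set: keep reading
        return value;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarInt(300, out);                                // encodes as 0xAC 0x02
        System.out.println(readVarInt(out.toByteArray(), 0)); // 300
    }
}
```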
Yes, I've been working on the same thing. I've replaced the List(s) with int arrays and moved the register set to lazy instantiation. Now:

```java
HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);
for (int i = 0; i < 10000; i++) {
    hllp.offer(i);
}
System.out.println("Size: " + hllp.getBytes().length);
```

Prints out 22316, and the retained size of the entire HyperLogLogPlus object is 42,232. The sparse set retains 29,840 and the tmpSet retains 12,312. Nothing else in the object consumes significant resources. All tests pass locally for me, but this does introduce some more complexity when it comes to the thread safety of the class (lazy instantiation of the RegisterSet, as an example). I think I can squeeze a bit more efficiency out of the serialized representation of the sparse set by using delta encoding (currently I'm just using varints). I will try to work on that and backwards compatibility over the weekend. Let me get this branch in a good place and then we can decide what to do with #34. Matt
Wouldn't a long[] be more efficient than an int[]?
If we don't need to put longs inside it, then no. An array of a primitive type has no per-item memory overhead, and using longs to store the data would make the code more complex or waste memory. (Since the current implementation uses a list of varints to store the delta-encoded values, I assume we only have to store ints.)
I've pushed another update to the https://github.com/clearspring/stream-lib/tree/serialization_improvements branch. I believe this version is still backwards compatible with the 2.4.x branches, but I could use some help testing. These changes improve memory efficiency by using a plain int array for storing the sparse set in memory (thanks to Clément for the tip on that). I've also modified the serialization code for sparse sets to use a delta-encoded list rather than standard varints. The result is that persisting the HLL+ instance with 10K objects now takes 13761 bytes, down from 43587 bytes using the current serialization approach in master. One side effect of this change is that the HLL+ instance is even less thread safe than it was before: we'll need to add some concurrency protection to the indexes that track our current position in the tmpSet and sparseSet as the instance is mutated in order to make it thread safe. I have some local code, not pushed to git yet, that uses lazy instantiation for the register set. This reduces the memory footprint further (and adds even more thread safety issues); I'll probably commit that code to a different branch soon for consideration. Clément, when you have a chance, can you run this against your benchmark suite? I'm unsure of the impact of these changes on runtime performance. Also, if you see any backwards compatibility issues please let me know.
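Conceptually the delta step looks like the sketch below, assuming the sparse set is kept sorted and reusing the hypothetical writeVarInt from the earlier sketch:

```java
import java.io.ByteArrayOutputStream;

// Sketch: serialize a sorted sparse set as varint-encoded deltas. Sorted
// neighbors are numerically close, so most gaps fit in one or two bytes.
class SparseSetCodec {
    static void writeSparseSet(int[] sortedSparseSet, ByteArrayOutputStream out) {
        int previous = 0;
        for (int value : sortedSparseSet) {
            VarInt.writeVarInt(value - previous, out); // small gap -> short varint
            previous = value;
        }
    }
}
```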
I should be able to do it today. Did you take a look at the extra prefix byte for varint encoding, or do you want me to test it?
Thanks. The branch you are testing has the extra prefix byte removed. I've added a test that checks that the code in this branch is able to decode instances encoded with the branch in master, but it would be nice to have further validation that there are no compatibility issues.
I did write a comment several hours ago but it seems that I forgot to send it :/ I tested your branch today. Unfortunately, I found regressions in the sparse mode running this very simple test:

```java
hllp.offer("foo");
hllp.cardinality() == 1 // Returns 2
```

Attaching a debugger showed that at least the sort method returns bogus results: the sparse array should contain only one element, not two. I stopped my tests here. IMHO, we should try to have a more object-oriented design. It could lead to simpler and more robust code.
Not sure what happened there. Not a great commit. I've fixed the issue and now all tests, including a new 'testOne', pass. Patches are welcome for changing the design. e4466d8
Will give it a try again today and bench it if it passes the test suite. Thanks.
Found another regression due to this change: …
Reverted that change. It was there for experiments with lazy init of the RegisterSet.
Just ran the benchmarks, master vs. your branch. Same performance for normal mode; sparse mode is a little bit faster. (Benchmark results for master and the branch omitted.) You can observe that the merge macro benchmark is now 30% faster; half of the merged HLLs should be in sparse mode (max cardinality is 80K). I also tried your branch with one of our jobs. It seems fine, but I need more time to assert that everything is fine. Perhaps somebody else could try to use it?
Thanks for the feedback, and glad to see positive results. One of our… Matt
These changes have been merged into master. Thanks to @cykl for helping so much with this patch.
Today I played a little bit with HLL++ in sparse mode. I have tens of millions of HLL estimators and most of them have a low cardinality. Using HLL, or HLL++ in normal mode, is not memory efficient for this use case. To be picky, I don't care that much about memory consumption; I am trying to minimize the serialized size of the estimators.
The whole idea behind the sparse mode is to not waste memory on the normal representation when we can do better for small cardinalities. It sounds reasonable to switch back to the normal representation as soon as the sparse mode consumes more memory.
However, I don't observe such a behavior with HyperLogLogPlus:
- HyperLogLog: …
- HyperLogLogPlus in normal mode: …
- Empty HyperLogLogPlus in sparse mode: …
- HyperLogLogPlus in sparse mode with 5K elements: …
According to the source code, the sparseSetThreshold only depends on p and is set to 12288 for p = 14. It means that if the set contains 12000 elements, I'm wasting almost 40 KB compared to the normal representation. Am I wrong? Is this behavior expected?
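For a rough sense of where the "almost 40 KB" figure comes from, a back-of-the-envelope sketch assuming the sparse entries are 4-byte ints and the dense RegisterSet packs six 5-bit registers per 32-bit word (the exact layout may differ):

```java
public class ThresholdMath {
    public static void main(String[] args) {
        int p = 14;
        // 12288 sparse entries stored as 4-byte ints:
        long sparseBytes = 12_288L * 4;               // 49152 bytes
        // 2^p registers, six 5-bit registers per 32-bit word (assumed layout):
        long denseBytes = (((1L << p) + 5) / 6) * 4;  // 10924 bytes
        System.out.println(sparseBytes - denseBytes); // 38228 bytes, i.e. ~38 KB "wasted"
    }
}
```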
My second question would be: do we really want to create the RegisterSet even when we are in sparse mode? It ruins the goal of being memory efficient. It currently does not matter for me, since my bottleneck is the serialization size, but I find it quite surprising.