HyperLogLogPlus merge introduces error when counters overlap #31
Comments
I recommend not using the sparse set as it's currently implemented, because it actually uses a lot more memory than the normal set. I'm surprised you're getting 30% error with the normal set at any cardinality, though. Are you using the latest version of the code? What cardinality range are you testing with?
I am more concerned with serialization size than memory (I have modified the sparse-list serialization format to be much smaller). The 30% error only occurs at very low p values, but the same problem can occur at higher p with lower error. I found the problem while testing merging a counter with a copy of itself (it was actually a serialization test: I rematerialized a counter after serializing it, and figured that an easy way to check whether the counters were identical was to merge the counter with its rematerialized self and see if anything changed). My ranges for p, sp, and cardinalities are as follows: public static final long[] cardinalities = { 0, 1, 10, 100, 1000, 10000, 100000 };
I ran a test of every combination of cardinalities, ps, and sps, and I was not able to reproduce the error. I built the HLLP, serialized it, rehydrated it into a new object, then merged that object with the original. The cardinalities were always identical. Can you share a test that shows the issue? Regarding the sparse representation: can you please share your implementation that is more memory efficient?
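For reference, a minimal sketch of that round-trip test, assuming stream-lib's HyperLogLogPlus API (offer, getBytes, Builder.build, merge); the p/sp values and element strings are illustrative:

```java
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

public class RoundTripMergeTest {
    public static void main(String[] args) throws Exception {
        int p = 14, sp = 25;       // illustrative precision settings
        long cardinality = 10000;

        HyperLogLogPlus original = new HyperLogLogPlus(p, sp);
        for (long i = 0; i < cardinality; i++) {
            original.offer("element-" + i);
        }

        // Serialize, then rehydrate into a fresh counter.
        byte[] bytes = original.getBytes();
        HyperLogLogPlus rehydrated = HyperLogLogPlus.Builder.build(bytes);

        // Merging a counter with an identical copy of itself should leave
        // the estimate unchanged; a large shift indicates the overlap bug.
        long before = original.cardinality();
        long after = original.merge(rehydrated).cardinality();
        System.out.printf("before=%d after=%d%n", before, after);
    }
}
```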
Yes. I need to do some refactoring and jump through some hoops to comply with my company's open-source policies (the pull request will come from a different account), but I will get the changes to you. It isn't more memory efficient; it is a change to getBytes() and the Builder class that makes serialization more efficient.
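(For illustration only, this is not the actual patch: one common way to shrink a sorted sparse list in getBytes() is to store deltas between consecutive entries as variable-length integers instead of fixed 4-byte values. The SparseListCodec name below is hypothetical.)

```java
import java.io.ByteArrayOutputStream;

// Hypothetical codec: encodes a sorted sparse list as varint deltas.
public final class SparseListCodec {
    public static byte[] encode(int[] sortedSparseList) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int value : sortedSparseList) {
            int delta = value - previous;         // small for sorted input
            previous = value;
            while ((delta & ~0x7F) != 0) {        // varint: 7 data bits per byte,
                out.write((delta & 0x7F) | 0x80); // high bit = "more bytes follow"
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }
}
```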
Great, thanks, we appreciate it. Matt
Does our work on #38 meet your needs? We've made significant improvements to the space required for serialization and to the memory footprint.
@deGravity Have you had a chance to look at this again? |
Seems likely to be resolved to me. Please re-open if not.
When merging two HyperLogLogPlus counters whose underlying sets have a large intersection, it is possible to introduce a large amount of error into the cardinality estimate. The cause is the check that promotes a sparse counter to a normal one: the merge tests whether the sizes of the two sparse lists sum to a number greater than the sparseThreshold. If the two lists share many elements, their actual merged list can be much smaller than the sparse threshold, so a sparse counter can be promoted to a normal counter long before it should be. When this happens it wastes space, and if it happens too early it can cause large errors in cardinality estimation (I have seen up to 30% in tests), because the bias-estimation curves have no samples for cardinalities that low.
Forcing merges to always merge the sparse lists before checking the size helps (the sketch below contrasts the two checks), but does not completely eliminate the problem. I believe this is due to problems in the merge implementation itself, which I will open another issue for.
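To make the difference concrete, here is an illustrative sketch (not the actual stream-lib source) contrasting the summed-size check with a merge-first check. It treats sparse entries as plain sorted ints; the real encoding packs a register index and rho value together, and a real merge must keep the larger rho when two entries share an index.

```java
import java.util.Arrays;

public final class SparsePromotionSketch {

    // Problematic check: two identical lists of size T/2 + 1 trip this,
    // even though the true union still has only T/2 + 1 elements.
    static boolean promoteBySummedSizes(int[] a, int[] b, int threshold) {
        return a.length + b.length > threshold;
    }

    // Merge-first check: deduplicate, then compare the real union size.
    static boolean promoteByUnionSize(int[] a, int[] b, int threshold) {
        return unionSorted(a, b).length > threshold;
    }

    // Union of two sorted int arrays with duplicates removed.
    static int[] unionSorted(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length || j < b.length) {
            int next = (j >= b.length || (i < a.length && a[i] <= b[j]))
                    ? a[i++] : b[j++];
            if (k == 0 || out[k - 1] != next) out[k++] = next;
        }
        return Arrays.copyOf(out, k);
    }
}
```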
The easiest way to reproduce this error is to create two counters containing identical elements, each holding just over half the sparseThreshold, then merge them.
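A sketch of that reproduction, again assuming stream-lib's API; the exact sparse threshold depends on p and the library version, so the element count n may need tuning:

```java
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

public class OverlapMergeRepro {
    public static void main(String[] args) throws Exception {
        int p = 4, sp = 16;  // a low p makes the estimation error easy to see

        // Two counters over the IDENTICAL element set, sized so the SUM of
        // the two sparse lists exceeds the threshold while their union does
        // not. Tune n to just over half the threshold for your version.
        int n = 20;
        HyperLogLogPlus a = new HyperLogLogPlus(p, sp);
        HyperLogLogPlus b = new HyperLogLogPlus(p, sp);
        for (int i = 0; i < n; i++) {
            a.offer("element-" + i);
            b.offer("element-" + i);
        }

        // The merged estimate should stay near n; a premature promotion to
        // the normal representation can skew it badly at this cardinality.
        long merged = a.merge(b).cardinality();
        System.out.printf("expected ~%d, merged estimate %d%n", n, merged);
    }
}
```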