Typos, documentation and parameters #64

s-j · 2014-01-31T14:20:21Z

Hi guys,

thanks for such a nice collection of algorithms, I've been playing a bit with and stumbled over a few things I would like to clarify:

HyperLogLog:151,161 seems to have a redundant "+ 1", perhaps appeared on copy-paste from LogLog, suggested correction:

final int r = Long.numberOfLeadingZeros((hashedValue << this.log2m) | (1 << (this.log2m - 1)) + 1;

TestLogLog uses 1.04 as error estimate parameter (beta), while other sources, e.g. http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining and the original Falojet's papers, uses 1.30. What is correct?
Furthermore, TestLogLog asserts the estimated value to be within 2 * se from the actual value, while TestHyperLogLog checks if it is within 3 * se? Is there any good reason for this?
HyperLogLog:94,106 mentions 1.106 vs 1.04, which is correct? (Googling 1.106 led me to the beta-32-bound given in the Flajolet's 2007 paper, while 1.04 corresponds to beta-infinity.)
CountThenEstimate:217 -- perhaps it should be LogLog(bytes) instead of LinearCounting(bytes)?

Thanks and cheers,
Simon

The text was updated successfully, but these errors were encountered:

abramsm · 2014-02-01T14:29:05Z

Simon -

Thanks!

HyperLogLog:151,161 seems to have a redundant "+ 1", perhaps
appeared on copy-paste from LogLog, suggested correction:

final int r = Long.numberOfLeadingZeros((hashedValue << this.log2m) |
(1 << (this.log2m - 1)) + 1;

MA - this is a little bit tricky but the code is correct as it is
written currently. We are adding 1 to the function retreiving the
number of leading zeros and then adding 1 to the number returned from
that function.

TestLogLog uses 1.04 as error estimate parameter (beta), while other
sources, e.g. http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining
and the original Falojet's papers, uses 1.30. What is correct?

MA - 1.30 is correct for LogLog, HyperLogLog improved the value to 1.04.

Furthermore, TestLogLog asserts the estimated value to be within 2 *
se from the actual value, while TestHyperLogLog checks if it is within
3 * se? Is there any good reason for this?

MA - you are correct, it should not be using the 2x, however the tests
never passed with that tolerance while using the default hash
function. The guava128 offered hash provides better results. This
test is ignored and has some other issues. Bottom line is LogLog
should probably be deprecated and removed from future releases given
that HLLP is far superior in performance and size.

HyperLogLog:94,106 mentions 1.106 vs 1.04, which is correct? (Googling
1.106 led me to the beta-32-bound given in the Flajolet's 2007 paper,
while 1.04 corresponds to beta-infinity.)

MA - I am not sure. I cannot recall why I choose 1.106. All tests
seem to pass with both values and I have not done a more detailed
analysis on which provides better results. Flajolet's 2007 paper
provides these beta values:

b_16=1.106, b_32=1.070, b_64=1.054, b_128=1.04, b_infinity=1.039

Do you have thoughts on what the correct value is? I think, at this
point, my recommendation for new implementations would be to use HLLP
rather than HLL.

CountThenEstimate:217 -- perhaps it should be LogLog(bytes) instead of
LinearCounting(bytes)?

MA - yes, good catch!

Thanks,
Matt

On Fri, Jan 31, 2014 at 9:20 AM, Simon Jonassen
[email protected] wrote:

Hi guys,

thanks for such a nice collection of algorithms, I've been playing a bit
with and stumbled over a few things I would like to clarify:

HyperLogLog:151,161 seems to have a redundant "+ 1", perhaps appeared on
copy-paste from LogLog, suggested correction:

final int r = Long.numberOfLeadingZeros((hashedValue << this.log2m) | (1 <<
(this.log2m - 1)) + 1;

TestLogLog uses 1.04 as error estimate parameter (beta), while other
sources, e.g.
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining
and the original Falojet's papers, uses 1.30. What is correct?
Furthermore, TestLogLog asserts the estimated value to be within 2 * se from
the actual value, while TestHyperLogLog checks if it is within 3 * se? Is
there any good reason for this?
HyperLogLog:94,106 mentions 1.106 vs 1.04, which is correct? (Googling 1.106
led me to the beta-32-bound given in the Flajolet's 2007 paper, while 1.04
corresponds to beta-infinity.)
CountThenEstimate:217 -- perhaps it should be LogLog(bytes) instead of
LinearCounting(bytes)?

Thanks and cheers,
Simon

Reply to this email directly or view it on GitHub.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Typos, documentation and parameters #64

Typos, documentation and parameters #64

s-j commented Jan 31, 2014

abramsm commented Feb 1, 2014

Typos, documentation and parameters #64

Typos, documentation and parameters #64

Comments

s-j commented Jan 31, 2014

abramsm commented Feb 1, 2014