Skip to content
This repository has been archived by the owner on Jul 7, 2020. It is now read-only.

Argument checks in HyperLogLogPlus constructor need to be more restrictive #95

Open
oertl opened this issue Nov 1, 2015 · 1 comment

Comments

@oertl
Copy link
Contributor

oertl commented Nov 1, 2015

The HyperLogLogPlus implementation only supports parameters p and sp for which 4 <= p <= sp <= 25. However, there is only a check for 4 <= p <= sp <= 32. Here is some test code with sp=26 that demonstrates the problem:

        HyperLogLogPlus hll = new HyperLogLogPlus(25, 26);

        hll.offerHashed(0X7FFFFF8000000000L);
        hll.offerHashed(0XFFFFFF8000000000L); 

        assertEquals(2, hll.cardinality()); // fails, hll.cardinality() returns 1 

The expected cardinality is 2, because the index (first 25 bits) as well as the sparse index (first 26 bits) are different.

@szhem
Copy link

szhem commented Oct 9, 2017

Hello, I was able to reproduce this issue too with the following code snippet in scala

object Main {      
  def main(args: Array[String]) {
    (1 to 5).foreach(_ => run(10 * 1000 * 1000L, 20, 25))
    println()
    (1 to 5).foreach(_ => run(10 * 1000 * 1000L, 20, 32))
  }

  def run(size: Long, p: Int, sp: Int): Unit = {
    val streamlib = new HyperLogLogPlus(p, sp)

    var i: Long = 0L
    while (i < size) {
      val uuid = UUID.randomUUID().toString
      streamlib.offer(uuid)
      i += 1
    }

    printf("p: %s, sp: %s -- exact: %s, estimated: %s, error: %f%%%n",
      p, sp, size, streamlib.cardinality(), 
      100 * Math.abs((streamlib.cardinality() - size) / size.toFloat)
    )
  }
}

... and got the following results

p: 20, sp: 25 -- exact: 10000000, estimated: 9988177, error: 0.118230%
p: 20, sp: 25 -- exact: 10000000, estimated: 10001031, error: 0.010310%
p: 20, sp: 25 -- exact: 10000000, estimated: 9998517, error: 0.014830%
p: 20, sp: 25 -- exact: 10000000, estimated: 9993139, error: 0.068610%
p: 20, sp: 25 -- exact: 10000000, estimated: 10009278, error: 0.092780%

p: 20, sp: 32 -- exact: 10000000, estimated: 9924770, error: 0.752300%
p: 20, sp: 32 -- exact: 10000000, estimated: 9920361, error: 0.796390%
p: 20, sp: 32 -- exact: 10000000, estimated: 9932979, error: 0.670210%
p: 20, sp: 32 -- exact: 10000000, estimated: 9917889, error: 0.821110%
p: 20, sp: 32 -- exact: 10000000, estimated: 9916722, error: 0.832780%

If we try to estimate the error rate boundaries for p=20 we should get 3 * 1.04 / sqrt (2^20) which is 0.003046875 or 0.3046875%, but error rate for p=20,sp=32 is much higher.

Given the described behaviour, should the implementation be patched to allow only the following input values 4 <= p <= sp <= 25 as @oertl suggested?

If so, I can submit a patch to fix it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants