Count-Min Sketch Constant for Calculating Width #93

abraithwaite · 2015-10-05T17:28:24Z

The width of the sketch according to the paper should be set to ceil(e/epsilon), where e is Euler's number. However, I noticed that the current code uses 2.0 in place of e.

I'm curious as to why this is. Is it good enough in practice for it to not make a difference?

tea-dragon · 2015-10-05T18:26:25Z

It looks like it was written that way from the first version submitted by @jkff. To clarify for the lazy, the difference is between ceil(e/epsilon) and the current ceil(2/epsilon), where e is the usual 2.7.... I can imagine them being similar enough in practice, but I couldn't say why 2 was initially chosen.
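To put rough numbers on the difference, here is a minimal sketch that computes both widths for a few values of epsilon. The class and method names are hypothetical and only illustrate the two formulas; this is not the library's actual code.

```java
public class WidthComparison {
    // Width as given in the original paper: ceil(e / epsilon).
    static int widthWithE(double epsilon) {
        return (int) Math.ceil(Math.E / epsilon);
    }

    // Width as discussed for the current code: ceil(2 / epsilon).
    static int widthWithTwo(double epsilon) {
        return (int) Math.ceil(2.0 / epsilon);
    }

    public static void main(String[] args) {
        // Illustrative epsilon values only; pick whatever your use case needs.
        double[] epsilons = {0.5, 0.25, 0.01};
        for (double eps : epsilons) {
            System.out.printf("eps=%s  ceil(2/eps)=%d  ceil(e/eps)=%d%n",
                    eps, widthWithTwo(eps), widthWithE(eps));
        }
    }
}
```

For any given epsilon, the e-based width is roughly 36% larger than the 2-based one (e/2 is about 1.36), so each row carries proportionally more counters.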

abraithwaite · 2015-10-05T19:11:28Z

Changing the 2 to e everywhere results in the tests still passing. I imagine that the current implementation is just less accurate than it would be with Math.E, since using 2 makes the sketches narrower. However, it's unlikely that you'd be able to change this without breaking whoever is using it, unfortunately.

cykl · 2015-10-05T19:25:39Z

An intern reported this issue to me last month. Confused, I quickly re-read the papers and noticed that e is used in the initial paper but 2 is used in the latest paper from the same author; see Approximating Data with the Count-Min Data Structure, page 4.

According to git log, @jkff wrote the implementation after this second publication. It could explain why. (I don't have time to do the maths but it would be interesting to investigate this change)
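A rough sketch of that calculation, assuming the standard Markov-inequality argument used for Count-Min (the base $b$ is notation introduced here, not taken from either paper): with width $w = \lceil b/\epsilon \rceil$, a single row overestimates by more than $\epsilon N$ with probability at most $1/b$, so $d = \lceil \log_b(1/\delta) \rceil$ rows are needed to reach confidence $1-\delta$. The total number of counters is then approximately:

```latex
\[
w \cdot d \;\approx\; \frac{b}{\epsilon}\,\log_b\frac{1}{\delta}
      \;=\; \frac{b}{\ln b}\cdot\frac{1}{\epsilon}\,\ln\frac{1}{\delta},
\qquad
\left.\frac{b}{\ln b}\right|_{b=e} = e \approx 2.718,
\qquad
\left.\frac{b}{\ln b}\right|_{b=2} = \frac{2}{\ln 2} \approx 2.885 .
\]
```

Under that analysis, both constants give the same guarantee as long as the depth is derived with the matching logarithm base. Choosing $b = e$ minimizes the total counter count, and $b = 2$ costs about 6% more, which may be why the two look interchangeable in practice.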

tea-dragon · 2015-10-05T19:43:30Z

The paper referenced in the javadoc uses e (as per the wayback machine anyway), but in any event, since there is no serialization issue, we could probably change it if there is cause. The latest paper using 2 gives me pause though.

abraithwaite · 2015-10-05T21:09:32Z

> since there is no serialization issue

I think using the same arguments to the constructors would result in different width and depth parameters which wouldn't be ideal, but the epsilon would adjust itself on deserialization at least. In any case I think the safer bet is just leaving it be.
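As an illustration of that adjustment, here is a minimal sketch assuming the error bound is recomputed from the stored width as constant / width. That recomputation and the helper name are assumptions made for the example, not confirmed library behaviour.

```java
public class EpsilonFromWidth {
    // Hypothetical helper: the error bound implied by a stored width,
    // assuming epsilon is derived as constant / width.
    static double impliedEpsilon(int width, double constant) {
        return constant / width;
    }

    public static void main(String[] args) {
        int storedWidth = 200; // e.g. a sketch originally built with ceil(2 / 0.01)
        System.out.println("constant 2: eps = " + impliedEpsilon(storedWidth, 2.0));
        System.out.println("constant e: eps = " + impliedEpsilon(storedWidth, Math.E));
    }
}
```

The counters themselves are untouched in this scenario; only the epsilon reported for an existing width would change with the constant.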

It is curious that there are two versions of this paper without an erratum in the later one discussing the change, but at least this mystery is solved.

Thanks for the quick responses!
