Skip to content
This repository has been archived by the owner on Jul 7, 2020. It is now read-only.

Count-Min Sketch Constant for Calculating Width #93

Open
abraithwaite opened this issue Oct 5, 2015 · 5 comments
Open

Count-Min Sketch Constant for Calculating Width #93

abraithwaite opened this issue Oct 5, 2015 · 5 comments

Comments

@abraithwaite
Copy link

The width of the sketch according to the paper should be set to ceil(e/epsilon) where e is Euler's number. However, I noticed in the current code that this is just set to 2.0.

I'm curious as to why this is. Is it good enough in practice for it to not make a difference?

@tea-dragon
Copy link
Contributor

it looks like it was written that way from the first version submitted by @jkff. To clarify for the lazy, the difference is between ceil(e/epsilon) and the current ceil(2/epsilon), where e is the usual 2.7.... I can imagine them being similar enough in practice, but I couldn't say why it was initially chosen.

@abraithwaite
Copy link
Author

Changing the 2 to e everywhere results in the tests still passing. I imagine that the current implementation would just less accurate than it would be if using Math.E since it would cause the sketches to be smaller. However it's unlikely that you'd be able to change this without breaking whomever is using it unfortunately.

@cykl
Copy link
Contributor

cykl commented Oct 5, 2015

An intern reported this issue to me last month. Being confused, I quickly re-read the papers and noticed that e has been used in the initial paper but 2 is used in the latest paper from the same author, see Approximate Date with the Count-Min Data Structure page 4.

According to git log, @jkff wrote the implementation after this second publication. It could explain why. (I don't have time to do the maths but it would be interesting to investigate this change)

@tea-dragon
Copy link
Contributor

The paper referenced in the javadoc uses e (as per the wayback machine anyway), but in any event, since there is no serialization issue, we could probably change it if there is cause. The latest paper using 2 gives me pause though.

@abraithwaite
Copy link
Author

since there is no serialization issue

I think using the same arguments to the constructors would result in different width and depth parameters which wouldn't be ideal, but the epsilon would adjust itself on deserialization at least. In any case I think the safer bet is just leaving it be.

It is curious that there are two versions of this paper without an errata in the later one discussing the changes but at least this mystery is solved.

Thanks for the quick responses!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants