-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Awful quantile estimation on small sample size #12
Comments
So here's my current understanding of this issue: This implementation estimates the value under a quantile following Whilst using the reference implementation does behave a lot better for this pathological example, it also suffers from a serious bug (Notice that the test is disabled in the reference implementation). Interestingly the current version does not suffer from the Apparently there's a new approach being developed which I'll take a look at as soon as possible. I also skimmed through the merging digest implementation and it seems like it's not much work to port it. |
Alright, branch Gotchas (explanation in the commit message): Another notable commit (because it makes test execution significantly slower): 79b62e9 I'm pretty happy with the results, even considering the above gotchas so my plan is cleaning the branch up (a lot of code has gone away, I didn't pay attention to abstractions nor "niceties") when I can and merging it as a real fix for this. If anyone is watching, now it's the time to shout 😝 |
Notice that this test has its exterme quantile (0.999) skipped because the java reference implementation behaves the same. Reasoning and more details on issue #12
The fix has been released on v1.1.3: https://github.com/caio/go-tdigest/releases/tag/v1.1.3 |
Notice that this test has its exterme quantile (0.999) skipped because the java reference implementation behaves the same. Reasoning and more details on issue caio#12
This patch is too big, but there isn't much getting away from it in smaller steps because summary{} and TDigest{} are actually tightly coupled (i.e.: the abstraction is mostly useful for code organization but fails at isolation). The major changes are: - Summaries now hold repeated items instead of just unique means and their respective counts (which led to changes in how the digest adds new centroids too) - Quantile estimation is now a straight port from the reference implementation (issue-84 branch) The digest now doesn't potentially report completely wrong values on distributions with multiple steep hills nor biases in favour of big centroids with few occurrences. This patch closes caio#12. Some historical details for the motivation for this work can be found on PR caio#11.
Following the discussion on PR #11 I've isolated a tiny pathological sample:
Which yields
-67.7500
. In hindsight, this outrageously wrong number should have raised a major red flag in my head already, but when comparing to Java this gets even more obviously wrong:Prints out
1.0
😅 (Note: before the min/max patch on tdunning/t-digest@89bb394 it yields1.5
).The text was updated successfully, but these errors were encountered: