Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Awful quantile estimation on small sample size #12

Closed
caio opened this issue Oct 16, 2017 · 3 comments
Closed

Awful quantile estimation on small sample size #12

caio opened this issue Oct 16, 2017 · 3 comments

Comments

@caio
Copy link
Owner

caio commented Oct 16, 2017

Following the discussion on PR #11 I've isolated a tiny pathological sample:

t := New(10)
data := []float64{0, 279, 2, 281}
for _, f := range data {
    t.Add(f, 1)
}
fmt.Println(t.Quantile(0.25))

Which yields -67.7500. In hindsight, this outrageously wrong number should have raised a major red flag in my head already, but when comparing to Java this gets even more obviously wrong:

public class Ughhh {
    public static void main(String[] args) {
        TDigest t = new AVLTreeDigest(10);
        Arrays.asList(0, 279, 2, 281).forEach(t::add);
        System.out.println(t.quantile(0.25));
    }
}

Prints out 1.0 😅 (Note: before the min/max patch on tdunning/t-digest@89bb394 it yields 1.5).

@caio
Copy link
Owner Author

caio commented Oct 20, 2017

So here's my current understanding of this issue:

This implementation estimates the value under a quantile following Algorithm 4 described in the paper whereas the reference implementation uses a different method.

Whilst using the reference implementation does behave a lot better for this pathological example, it also suffers from a serious bug (Notice that the test is disabled in the reference implementation).

Interestingly the current version does not suffer from the singletonInACrowd bug. So it would be trading one bug for another. It's hard for me to say which is worse for I don't quite understand the impact of this current issue on similar-but-real shapes.

Apparently there's a new approach being developed which I'll take a look at as soon as possible. I also skimmed through the merging digest implementation and it seems like it's not much work to port it.

@caio
Copy link
Owner Author

caio commented Oct 24, 2017

Alright, branch avl-wip now contains code that is able to close this issue.

Gotchas (explanation in the commit message):

Another notable commit (because it makes test execution significantly slower): 79b62e9

I'm pretty happy with the results, even considering the above gotchas so my plan is cleaning the branch up (a lot of code has gone away, I didn't pay attention to abstractions nor "niceties") when I can and merging it as a real fix for this. If anyone is watching, now it's the time to shout 😝

@caio caio mentioned this issue Oct 24, 2017
7 tasks
caio added a commit that referenced this issue Oct 25, 2017
Notice that this test has its exterme quantile (0.999)
skipped because the java reference implementation behaves
the same. Reasoning and more details on issue #12
@caio caio closed this as completed in 06d4170 Oct 25, 2017
@caio
Copy link
Owner Author

caio commented Oct 25, 2017

The fix has been released on v1.1.3: https://github.com/caio/go-tdigest/releases/tag/v1.1.3

ianwilkes pushed a commit to honeycombio/go-tdigest that referenced this issue Oct 11, 2018
Notice that this test has its exterme quantile (0.999)
skipped because the java reference implementation behaves
the same. Reasoning and more details on issue caio#12
ianwilkes pushed a commit to honeycombio/go-tdigest that referenced this issue Oct 11, 2018
This patch is too big, but there isn't much getting away from it
in smaller steps because summary{} and TDigest{} are actually
tightly coupled (i.e.: the abstraction is mostly useful for
code organization but fails at isolation).

The major changes are:

- Summaries now hold repeated items instead of just unique means
  and their respective counts (which led to changes in how the
  digest adds new centroids too)
- Quantile estimation is now a straight port from the reference
  implementation (issue-84 branch)

The digest now doesn't potentially report completely wrong values
on distributions with multiple steep hills nor biases in favour
of big centroids with few occurrences.

This patch closes caio#12. Some historical details for the motivation
for this work can be found on PR caio#11.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant