Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API to allow incremental updates #7

Open
tvondra opened this issue Aug 9, 2020 · 4 comments
Open

API to allow incremental updates #7

tvondra opened this issue Aug 9, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@tvondra
Copy link
Owner

tvondra commented Aug 9, 2020

Started as an e-mail discussion with @sporty81, who asked how to do incremental updates, i.e. being able to do something like this:

UPDATE t SET digest = tdigest_add(digest, new_value);

or something like that. A very early WIP version of this is implemented in #6 which essentially adds tdigest_add() and tdigest_union() that are conceptually equivalent to hll_add and hll_union. I'm sure the current code is rather inefficient but hopefully it's good enough for some basic feedback.

I'm more worried about accuracy, though. Essentially, every tdigest_add has to deserialize the value, adds the new value and then compacts the t-digest. The (de)serialization might be fairly expensive, but I'm sure there are ways to improve that, to some extent (one of the limiting factors will be that Postgres does not allow partial / in-place updates, so any change to the tdigest has to store a new copy of the whole value).

But I suspect compacting the digests after adding every single value may have negative impact on accuracy of the digest, so maybe we'll need to reconsider this strategy and invoke compaction less frequently.

@tvondra tvondra added the enhancement New feature or request label Aug 9, 2020
@tvondra tvondra linked a pull request Aug 9, 2020 that will close this issue
@sporty81
Copy link

What can I do to help you finish this up and release this update?

Thanks!

@tvondra
Copy link
Owner Author

tvondra commented Sep 30, 2020

Sorry, I was off for a couple days, hence the delay in responding.

I think this needs a bit of a research to

(a) quantify how expensive the de-serialization and serialization for every increment is an issue in practice. OTOH we can't do much about this anyway, because the digest is already a plain varlena buffer anyway, so most of the overhead will come from pglz and toasting.

(b) see if compacting after every value has negative impact on accuracy of the t-digest, or whether we need to batch it somehow (which I think we could without breaking the on-disk format by using a flag). I think the easiest way to evaluate this is to compare tdigests built incrementally (by using the WIP patch) and using the regular aggregates.

@tvondra
Copy link
Owner Author

tvondra commented Dec 4, 2020

I've significantly reworked/improved this feature, cleaned up the code and added docs and tests.

The API that I propose has three new functions:

  • tdigest_add(tdigest, double precision) - adds a single value to the t-digest
  • tdigest_add(tdigest, double precision[]) - adds an array of values to the t-digest
  • tdigest_union(tdigest, tdigest) - merges-in a t-digest

I'm not sure whether to rename the third function to tdigest_add too - the naming difference seems somewhat arbitrary.

There's one more thing - by default, serializing the in-memory representation into t-digest forces compaction. But forcing it after adding each individual value does not seem like a great idea, so I've added a parameter that allows disabling it. Not sure if that's a good idea, though.

@tvondra
Copy link
Owner Author

tvondra commented May 4, 2021

I've polished the dev branch a bit and pushed it as 9d34031.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants