API to allow incremental updates #7

tvondra · 2020-08-09T21:56:16Z

Started as an e-mail discussion with @sporty81, who asked how to do incremental updates, i.e. being able to do something like this:

UPDATE t SET digest = tdigest_add(digest, new_value);

or something like that. A very early WIP version of this is implemented in #6 which essentially adds tdigest_add() and tdigest_union() that are conceptually equivalent to hll_add and hll_union. I'm sure the current code is rather inefficient but hopefully it's good enough for some basic feedback.

I'm more worried about accuracy, though. Essentially, every tdigest_add has to deserialize the value, adds the new value and then compacts the t-digest. The (de)serialization might be fairly expensive, but I'm sure there are ways to improve that, to some extent (one of the limiting factors will be that Postgres does not allow partial / in-place updates, so any change to the tdigest has to store a new copy of the whole value).

But I suspect compacting the digests after adding every single value may have negative impact on accuracy of the digest, so maybe we'll need to reconsider this strategy and invoke compaction less frequently.

The text was updated successfully, but these errors were encountered:

sporty81 · 2020-09-20T13:22:06Z

What can I do to help you finish this up and release this update?

Thanks!

tvondra · 2020-09-30T21:19:08Z

Sorry, I was off for a couple days, hence the delay in responding.

I think this needs a bit of a research to

(a) quantify how expensive the de-serialization and serialization for every increment is an issue in practice. OTOH we can't do much about this anyway, because the digest is already a plain varlena buffer anyway, so most of the overhead will come from pglz and toasting.

(b) see if compacting after every value has negative impact on accuracy of the t-digest, or whether we need to batch it somehow (which I think we could without breaking the on-disk format by using a flag). I think the easiest way to evaluate this is to compare tdigests built incrementally (by using the WIP patch) and using the regular aggregates.

tvondra · 2020-12-04T01:38:59Z

I've significantly reworked/improved this feature, cleaned up the code and added docs and tests.

The API that I propose has three new functions:

tdigest_add(tdigest, double precision) - adds a single value to the t-digest
tdigest_add(tdigest, double precision[]) - adds an array of values to the t-digest
tdigest_union(tdigest, tdigest) - merges-in a t-digest

I'm not sure whether to rename the third function to tdigest_add too - the naming difference seems somewhat arbitrary.

There's one more thing - by default, serializing the in-memory representation into t-digest forces compaction. But forcing it after adding each individual value does not seem like a great idea, so I've added a parameter that allows disabling it. Not sure if that's a good idea, though.

tvondra · 2021-05-04T21:24:23Z

I've polished the dev branch a bit and pushed it as 9d34031.

tvondra added the enhancement New feature or request label Aug 9, 2020

tvondra linked a pull request Aug 9, 2020 that will close this issue

WIP support for incremental updates #6

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API to allow incremental updates #7

API to allow incremental updates #7

tvondra commented Aug 9, 2020

sporty81 commented Sep 20, 2020

tvondra commented Sep 30, 2020

tvondra commented Dec 4, 2020

tvondra commented May 4, 2021

API to allow incremental updates #7

API to allow incremental updates #7

Comments

tvondra commented Aug 9, 2020

sporty81 commented Sep 20, 2020

tvondra commented Sep 30, 2020

tvondra commented Dec 4, 2020

tvondra commented May 4, 2021