Support incremental t-digest updates
The existing code allows merging of pre-calculated tdigest values, which
is good enough for rollup use cases. But it does not work too well for
cases that require incremental updates, when the digest is updated with
individual values.

This adds two new functions tdigest_add and tdigest_union that support
these incremental updates. With tdigest_add() it's possible to add a
single value to the tdigest value

    UPDATE t SET digest = tdigest_add(digest, x, 100);

or an array of values

    UPDATE t SET digest = tdigest_add(digest, ARRAY[x, y, z], 100);

while tdigest_union() allows 'merging' two digests:

    UPDATE t SET digest = tdigest_union(digest1, digest2);

The idea is similar to hll_add/hll_union from the HLL extension, and we
might even add the same || operator for tdigest_union. For tdigest_add
that may not be possible, because we need to specify the compression for
cases where there's no initial tdigest (i.e. tdigest_add(NULL,x)), and
operators only allow two parameters.

The tdigest_add function may be rather inefficient, because it requires
deserialization, compaction and serialization for each call. This is
particularly true for the single-value variant. To mitigate this, use
the variant with an array of values, or possibly the tdigest_union.

Based on proposal/discussion with sporty81 (Matt Watson).
tvondra authored and Tomas Vondra committed May 8, 2021
1 parent 43de1bf commit 2055fae
Showing 7 changed files with 450 additions and 6 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -2,7 +2,7 @@ MODULE_big = tdigest
OBJS = tdigest.o

EXTENSION = tdigest
-DATA = tdigest--1.0.0.sql tdigest--1.0.0--1.0.1.sql tdigest--1.0.1--1.1.0.sql
+DATA = tdigest--1.0.0.sql tdigest--1.0.0--1.0.1.sql tdigest--1.0.1--1.2.0.sql
MODULES = tdigest

CFLAGS=`pg_config --includedir-server`
113 changes: 113 additions & 0 deletions README.md
@@ -187,6 +187,64 @@ There are five such aggregate functions:
* `tdigest(value double precision, count bigint, compression int)`


## Incremental updates

An existing t-digest may be updated incrementally, either by adding a single
value, or by merging-in a whole t-digest. For example, it's possible to add
1000 random values to the t-digest like this:

```
DO LANGUAGE plpgsql $$
DECLARE
    r record;
BEGIN
    FOR r IN (SELECT random() AS v FROM generate_series(1,1000)) LOOP
        UPDATE t SET d = tdigest_add(d, r.v);
    END LOOP;
END $$;
```

The overhead of doing this is fairly high, though - the t-digest has to be
deserialized and serialized over and over, for each value we're adding.
That overhead may be reduced by pre-aggregating data, either into an array
or a t-digest.

```
DO LANGUAGE plpgsql $$
DECLARE
    a double precision[];
BEGIN
    SELECT array_agg(random()) INTO a FROM generate_series(1,1000);
    UPDATE t SET d = tdigest_add(d, a);
END $$;
```

Alternatively, it's possible to use pre-aggregated t-digest values instead
of the arrays:

```
DO LANGUAGE plpgsql $$
DECLARE
    r record;
BEGIN
    FOR r IN (SELECT mod(i,3) AS a, tdigest(random(),100) AS d
                FROM generate_series(1,1000) s(i)
               GROUP BY mod(i,3)) LOOP
        UPDATE t SET d = tdigest_union(d, r.d);
    END LOOP;
END $$;
```

It may be undesirable to perform compaction after every incremental update
(especially when adding values one by one). All functions in the incremental
API allow disabling compaction by setting the `compact` parameter to `false`.
The disadvantage is that without compaction the resulting digests may be
considerably larger (by a factor of 10). It's advisable either to use the
multi-value functions (with compaction after each batch) where possible, or
to force compaction explicitly, e.g. by doing something like this:

```
UPDATE t SET d = tdigest_union(NULL, d);
```
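
As a rough sketch of how the two approaches might be combined, the loop below
adds values one by one with compaction disabled and compacts once at the end.
It assumes the table `t` with a `tdigest` column `d` used in the examples
above, and that `d` already holds a non-`NULL` digest (so no compression needs
to be passed). The parameter name `p_compact` is taken from the function
definitions in the upgrade script; how much the digest grows before the final
compaction is not specified here.

```sql
DO LANGUAGE plpgsql $$
DECLARE
    r record;
BEGIN
    -- add the values one by one, skipping compaction on each call
    FOR r IN (SELECT random() AS v FROM generate_series(1,1000)) LOOP
        UPDATE t SET d = tdigest_add(d, r.v, p_compact => false);
    END LOOP;

    -- compact once at the end, using the trick shown above
    UPDATE t SET d = tdigest_union(NULL, d);
END $$;
```

Whether this is faster than the array variant depends on the workload; the
array variant is probably still the simpler choice when the values are
available as a batch.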


## Functions

@@ -457,6 +515,61 @@ SELECT tdigest_percentile_of(d, ARRAY[438.256, 349834.1]) FROM (
- `hypothetical_value` - hypothetical values


### `tdigest_add(tdigest, double precision)`

Performs incremental update of the t-digest by adding a single value.

#### Synopsis

```
UPDATE t SET d = tdigest_add(d, random());
```

#### Parameters

- `tdigest` - t-digest to update
- `element` - value to add to the digest
- `compression` - compression (used when t-digest is `NULL`)
- `compact` - force compaction (default: true)
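
A minimal sketch of the `NULL` case, assuming a hypothetical table `t` whose
digest column `d` starts out empty. The first call has no digest to take the
compression from, so it is passed explicitly (100 is an arbitrary choice
here); later calls can rely on the existing digest.

```sql
-- hypothetical table, used only for illustration
CREATE TABLE t (d tdigest);
INSERT INTO t (d) VALUES (NULL);

-- first call: d is NULL, so the compression is passed explicitly
UPDATE t SET d = tdigest_add(d, random(), 100);

-- subsequent calls: d already exists, the compression may be omitted
UPDATE t SET d = tdigest_add(d, random());
```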


### `tdigest_add(tdigest, double precision[])`

Performs incremental update of the t-digest by adding values from an array.

#### Synopsis

```
UPDATE t SET d = tdigest_add(d, ARRAY[random(), random(), random()]);
```

#### Parameters

- `tdigest` - t-digest to update
- `elements` - array of values to add to the digest
- `compression` - compression (used when t-digest is `NULL`)
- `compact` - force compaction (default: true)


### `tdigest_union(tdigest, tdigest)`

Performs incremental update of the t-digest by merging-in another digest.

#### Synopsis

```
WITH x AS (SELECT tdigest(random(), 100) AS d FROM generate_series(1,1000))
UPDATE t SET d = tdigest_union(t.d, x.d) FROM x;
```

#### Parameters

- `tdigest` - t-digest to update
- `tdigest_add` - t-digest to merge into `tdigest`
- `compact` - force compaction (default: true)
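
The following sketch shows the same merge with compaction disabled. It
assumes the table `t` with a `tdigest` column `d` from the synopsis, and uses
the parameter name `p_compact` from the `tdigest_union` definition in the
upgrade script, where the compact flag is the third argument.

```sql
WITH x AS (
    SELECT tdigest(random(), 100) AS d
      FROM generate_series(1,1000)
)
-- merge the pre-aggregated digest into t.d without compacting the result
UPDATE t SET d = tdigest_union(t.d, x.d, p_compact => false) FROM x;
```

A later call with the default `compact` setting would then be needed to bring
the digest back to its compact form.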


Notes
-----

15 changes: 15 additions & 0 deletions tdigest--1.0.1--1.1.0.sql → tdigest--1.0.1--1.2.0.sql
@@ -72,3 +72,18 @@ CREATE AGGREGATE tdigest_percentile_of(double precision, bigint, int, double pre
COMBINEFUNC = tdigest_combine,
PARALLEL = SAFE
);

CREATE OR REPLACE FUNCTION tdigest_add(p_digest tdigest, p_element double precision, p_compression int = NULL, p_compact bool = true)
RETURNS tdigest
AS 'tdigest', 'tdigest_add_double_increment'
LANGUAGE C IMMUTABLE;

CREATE OR REPLACE FUNCTION tdigest_add(p_digest tdigest, p_elements double precision[], p_compression int = NULL, p_compact bool = true)
RETURNS tdigest
AS 'tdigest', 'tdigest_add_double_array_increment'
LANGUAGE C IMMUTABLE;

CREATE OR REPLACE FUNCTION tdigest_union(p_digest1 tdigest, p_digest2 tdigest, p_compact bool = true)
RETURNS tdigest
AS 'tdigest', 'tdigest_union_double_increment'
LANGUAGE C IMMUTABLE;