Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Storage specification needed, similar to postgres-hll #23

Open
bjhartin opened this issue Apr 18, 2022 · 2 comments
Open

Storage specification needed, similar to postgres-hll #23

bjhartin opened this issue Apr 18, 2022 · 2 comments

Comments

@bjhartin
Copy link

Use Case

I'd like to aggregate using a tdigest library within the app, publish once per minute to a tdigest table, then aggregate across instances/intervals via SQL. This works very well with postgres-hll+java-hll, because there is a shared storage specification, i.e. I can serialize the java sketches to the DB.

Request

Publish the details of how you store the TDigest sketch so that language-specific implementations can serialize to it and thus have aggregation 'all the way down', eliminating the need for a table that holds large numbers of individual values.

@tvondra
Copy link
Owner

tvondra commented Apr 24, 2022

Hi @bjhartin, by specification you mean just a description / documentation of the storage format? That shouldn't be hard, as it essentially requires describing tdigest_t and centroid_t structures, and tdigest_serial / tdigest_deserial functions. The format is pretty simple, really.

The only issue I can think of is endinanness, which is not fixed and depends on the architecture. So far the stored data is only ever exchanged through in/out or send/receive functions, which do the conversions, not directly. But allowing exchanging in the storage format would require that.

@jachinte
Copy link

@tvondra Could you please describe what would the tdigest table look like? I don't understand how I can store a tdigest in the database if I compute it in my application.

This is the dictionary I get from the t-digest Python library:

{"n": 100000, "delta": 0.01, "K": 25, "centroids": [{"m": 1.0, "c": 17.0}, {"m": 2.0, "c": 8.0}, {"m": 3.0, "c": 4.0}, ...]}

Would it be enough if I create a table called "centroids" with fields "m" and "c"? How can I instantiate a tdigest in one of my SQL queries?

Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants