Duplicate Time Values for Order Dependent Aggregates #65
davidkohn88
started this conversation in
Crosscutting issues
Replies: 2 comments 1 reply
-
- A true duplicate (duplicate time AND value) should naturally just be ignored, because one of the two points will get a weight of zero (see the sketch below).
- If you have two different values with the same timestamp, the result is indeterminate and the user should be taken out and flogged.
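To illustrate the first point, here is a minimal sketch (not the toolkit's actual implementation) of a trapezoidal time-weighted average. A true duplicate produces a zero-duration interval, so it contributes zero weight and the result is unchanged regardless of which copy "wins". The `TsPoint` struct and function name are illustrative assumptions.

```rust
#[derive(Clone, Copy, Debug)]
struct TsPoint {
    ts: i64, // timestamp, e.g. microseconds since epoch
    val: f64,
}

/// Time-weighted average over points already sorted by `ts`.
fn time_weighted_average(points: &[TsPoint]) -> Option<f64> {
    if points.len() < 2 {
        return None; // need at least one interval to weight over
    }
    let mut weighted_sum = 0.0;
    let mut total_duration = 0.0;
    for pair in points.windows(2) {
        let (a, b) = (pair[0], pair[1]);
        // Duplicate timestamp => dt == 0, so this segment adds nothing,
        // no matter which of the duplicate points it involves.
        let dt = (b.ts - a.ts) as f64;
        weighted_sum += 0.5 * (a.val + b.val) * dt; // trapezoid area
        total_duration += dt;
    }
    if total_duration == 0.0 {
        None
    } else {
        Some(weighted_sum / total_duration)
    }
}

fn main() {
    // A true duplicate at ts = 10: the result is the same as without it.
    let pts = [
        TsPoint { ts: 0, val: 1.0 },
        TsPoint { ts: 10, val: 3.0 },
        TsPoint { ts: 10, val: 3.0 }, // duplicate time AND value
        TsPoint { ts: 20, val: 5.0 },
    ];
    println!("{:?}", time_weighted_average(&pts)); // Some(3.0)
}
```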
-
After having a discussion with David on this, I'm on board with his idea of ignoring the second value. It's not an ideal solution, but duplicate timestamps do show up in real-world data due to rounding, clock skew, errors, and the like. If a user wants to guard against this case, they can and should mark the time column as unique. If they haven't done that, we just have to make our best estimate from the data we're given.
-
When aggregates rely on ordering (like the time_weighted_average (#46, #52) and counter_reset (#9) aggregates), there's a question of what to do when you encounter a point with a duplicate time value.
Should you:
a) error, or
b) ignore the later of the points?
If you error, then any valid cases with duplicate points (which should be rare, but may exist, say around DST transitions) run into trouble. If you disregard the later of the points, you get non-deterministic results: the two points share a timestamp, so we can't ensure a deterministic ordering between them in any reasonable way, and whichever one happens to arrive first this time is the one the computation actually gets done on. (I think that introducing a deterministic ordering would make the code horribly inefficient in many cases and would require significantly more state, given that runs of duplicate timestamps could theoretically go on for a very long time, though it could be done in the ordering code we're already writing, which will need to scan all the data anyway.)
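As a concrete sketch of option (b) and of where the non-determinism comes from, the following is one way (assumed, not the toolkit's actual code) to drop the later of the duplicates during the ordering step. The stable sort preserves arrival order within a timestamp, so which duplicate survives depends entirely on arrival order.

```rust
#[derive(Clone, Copy, Debug)]
struct TsPoint {
    ts: i64,
    val: f64,
}

fn sort_and_dedup(points: &mut Vec<TsPoint>) {
    // Stable sort: points with equal timestamps keep their arrival order.
    points.sort_by_key(|p| p.ts);
    // Drop every point whose timestamp equals the previous one,
    // i.e. ignore the later of the duplicates.
    points.dedup_by_key(|p| p.ts);
}

fn main() {
    let mut pts = vec![
        TsPoint { ts: 10, val: 3.0 },
        TsPoint { ts: 0, val: 1.0 },
        TsPoint { ts: 10, val: 4.0 }, // same ts, different value: dropped here
    ];
    sort_and_dedup(&mut pts);
    // Which of the ts = 10 points survives depends on arrival order,
    // which is exactly the non-determinism described above.
    println!("{:?}", pts);
}
```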
For now, I think I will ignore the later of the points: random errors would be a much worse user experience than potentially different results, either of which could be valid under any reasonable rule of interpretation. But I'd love thoughts from the community, and should we eventually introduce a GUC to control this behavior? Something like enforce_strict_ordering?
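Purely as a hypothetical sketch of how such a setting could switch between the two behaviors (the GUC name comes from the question above and does not exist; the error type and function are assumptions):

```rust
#[derive(Clone, Copy, Debug)]
struct TsPoint {
    ts: i64,
    val: f64,
}

#[derive(Debug)]
enum OrderingError {
    DuplicateTimestamp(i64),
}

fn handle_duplicates(
    mut points: Vec<TsPoint>,
    enforce_strict_ordering: bool, // hypothetical GUC value
) -> Result<Vec<TsPoint>, OrderingError> {
    // Stable sort preserves arrival order within equal timestamps.
    points.sort_by_key(|p| p.ts);
    if enforce_strict_ordering {
        // Option (a): error on any duplicate timestamp.
        for pair in points.windows(2) {
            if pair[0].ts == pair[1].ts {
                return Err(OrderingError::DuplicateTimestamp(pair[0].ts));
            }
        }
    } else {
        // Option (b): silently ignore the later of the duplicate points.
        points.dedup_by_key(|p| p.ts);
    }
    Ok(points)
}
```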