Duplicate Time Values for Order Dependent Aggregates #65
davidkohn88
started this conversation in
Crosscutting issues
Replies: 2 comments 1 reply
-
- A true duplicate (duplicate time AND value) should naturally just be ignored, because one of the two points will get a weight of zero (see the sketch below).
- If you have two different values with the same timestamp, the result is indeterminate and the user should be taken out and flogged.
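To illustrate the first point, here is a minimal sketch (not the toolkit's actual implementation) of a trapezoidal time-weighted average. A true duplicate produces a zero-duration interval, so it contributes zero weight and the result is unchanged regardless of which copy "wins". The `TsPoint` struct and function name are illustrative assumptions.

```rust
#[derive(Clone, Copy, Debug)]
struct TsPoint {
    ts: i64, // timestamp, e.g. microseconds since epoch
    val: f64,
}

/// Time-weighted average over points already sorted by `ts`.
fn time_weighted_average(points: &[TsPoint]) -> Option<f64> {
    if points.len() < 2 {
        return None; // need at least one interval to weight over
    }
    let mut weighted_sum = 0.0;
    let mut total_duration = 0.0;
    for pair in points.windows(2) {
        let (a, b) = (pair[0], pair[1]);
        // Duplicate timestamp => dt == 0, so this segment adds nothing,
        // no matter which of the duplicate points it involves.
        let dt = (b.ts - a.ts) as f64;
        weighted_sum += 0.5 * (a.val + b.val) * dt; // trapezoid area
        total_duration += dt;
    }
    if total_duration == 0.0 {
        None
    } else {
        Some(weighted_sum / total_duration)
    }
}

fn main() {
    // A true duplicate at ts = 10: the result is the same as without it.
    let pts = [
        TsPoint { ts: 0, val: 1.0 },
        TsPoint { ts: 10, val: 3.0 },
        TsPoint { ts: 10, val: 3.0 }, // duplicate time AND value
        TsPoint { ts: 20, val: 5.0 },
    ];
    println!("{:?}", time_weighted_average(&pts)); // Some(3.0)
}
```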
-
After having a discussion with David on this, I'm on board with his idea of ignoring the second value. It's not an ideal solution, but duplicate timestamps do show up in real-world data due to rounding, clock skew, errors, and the like. If a user wants to guard against this case, they can and should mark the time column as unique. If they haven't done that, we just have to make our best estimate from the data we're given.
-
When aggregates rely on ordering (like the time_weighted_average (#46, #52) and counter_reset (#9) aggregates), there's a question of what to do when you encounter a point with a duplicate time value.
Should you:
a) error, or
b) ignore the later of the points?
If you error, then any valid cases with duplicate points (which should be rare, but may exist, say around DST transitions) run into trouble. If you disregard the later of the points, you get non-deterministic results: the two points share a timestamp, so we can't ensure a deterministic ordering between them in any reasonable way, and whichever one happens to arrive first this time is the one the computation actually gets done on. (I think that introducing a deterministic ordering would make the code horribly inefficient in many cases and would require significantly more state, given that runs of duplicate timestamps could theoretically go on for a very long time, though it could be done in the ordering code we're already writing, which will need to scan all the data anyway.)
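As a concrete sketch of option (b) and of where the non-determinism comes from, the following is one way (assumed, not the toolkit's actual code) to drop the later of the duplicates during the ordering step. The stable sort preserves arrival order within a timestamp, so which duplicate survives depends entirely on arrival order.

```rust
#[derive(Clone, Copy, Debug)]
struct TsPoint {
    ts: i64,
    val: f64,
}

fn sort_and_dedup(points: &mut Vec<TsPoint>) {
    // Stable sort: points with equal timestamps keep their arrival order.
    points.sort_by_key(|p| p.ts);
    // Drop every point whose timestamp equals the previous one,
    // i.e. ignore the later of the duplicates.
    points.dedup_by_key(|p| p.ts);
}

fn main() {
    let mut pts = vec![
        TsPoint { ts: 10, val: 3.0 },
        TsPoint { ts: 0, val: 1.0 },
        TsPoint { ts: 10, val: 4.0 }, // same ts, different value: dropped here
    ];
    sort_and_dedup(&mut pts);
    // Which of the ts = 10 points survives depends on arrival order,
    // which is exactly the non-determinism described above.
    println!("{:?}", pts);
}
```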
For now, I think I will ignore the later of the points: random errors would be a much worse user experience than potentially different results, either of which could be valid under any reasonable rule of interpretation. But I'd love thoughts from the community, and should we eventually introduce a GUC to control this behavior? Something like enforce_strict_ordering?
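Purely as a hypothetical sketch of how such a setting could switch between the two behaviors (the GUC name comes from the question above and does not exist; the error type and function are assumptions):

```rust
#[derive(Clone, Copy, Debug)]
struct TsPoint {
    ts: i64,
    val: f64,
}

#[derive(Debug)]
enum OrderingError {
    DuplicateTimestamp(i64),
}

fn handle_duplicates(
    mut points: Vec<TsPoint>,
    enforce_strict_ordering: bool, // hypothetical GUC value
) -> Result<Vec<TsPoint>, OrderingError> {
    // Stable sort preserves arrival order within equal timestamps.
    points.sort_by_key(|p| p.ts);
    if enforce_strict_ordering {
        // Option (a): error on any duplicate timestamp.
        for pair in points.windows(2) {
            if pair[0].ts == pair[1].ts {
                return Err(OrderingError::DuplicateTimestamp(pair[0].ts));
            }
        }
    } else {
        // Option (b): silently ignore the later of the duplicate points.
        points.dedup_by_key(|p| p.ts);
    }
    Ok(points)
}
```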