Add fields to Parquet Statistics structure that were added in parquet-format 2.10 #15412
// cudf min/max statistics are always exact (i.e. not truncated)
encoder.field_bool(7, true);
encoder.field_bool(8, true);
Can both of these fields be present (as true) even if the stats do not have min and max values?
I hope other Parquet readers don't break if this is true when the min and max values are missing from the stats.
Hmm...good point. As currently written, the stats encoding will always write something in the min and max fields if s->has_minmax is true, so I think it's safe to always set the exact fields to true here.
That said, the rules around when min/max should and should not have values encoded have been in flux (and, as with everything Parquet, are not well documented), so it would probably be a good idea to make sure we're following all of the rules properly. I'm not sure if that's in scope for this PR or should be a separate issue.
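To make the concern above concrete, here is a minimal sketch of keeping the exactness flags tied to the presence of min/max. It is not cuDF's encoder: `StatsSketch`, `EncodedField`, and `encode_stats` are hypothetical names invented for illustration, and only `has_minmax` and field IDs 7 and 8 come from the diff. The IDs 5 and 6 for `max_value` and `min_value` reflect my reading of the `Statistics` struct in parquet.thrift and should be treated as an assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a chunk's statistics; not cuDF's actual type.
struct StatsSketch {
  bool has_minmax = false;
  std::string min_value;
  std::string max_value;
};

// Hypothetical record of an emitted Thrift field (ID plus payload as text).
struct EncodedField {
  int id;
  std::string payload;
};

// Emit the exactness flags only alongside the values they describe, so a
// reader never sees is_*_value_exact without a matching min/max.
std::vector<EncodedField> encode_stats(StatsSketch const& s)
{
  std::vector<EncodedField> out;
  if (s.has_minmax) {
    out.push_back({5, s.max_value});  // max_value (assumed field ID)
    out.push_back({6, s.min_value});  // min_value (assumed field ID)
    out.push_back({7, "true"});       // is_max_value_exact
    out.push_back({8, "true"});       // is_min_value_exact
  }
  return out;
}
```

With this shape, statistics that lack min/max produce none of the four fields, which sidesteps the question of how other readers treat an exactness flag with no accompanying value.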
/ok to test
/ok to test
/merge
/ok to test
Description
PARQUET-2352 added fields to the Statistics struct to indicate whether the min and max values are exact or have been truncated. This was somewhat ambiguous in the past. One reason to want this information is that it lets a reader skip decoding pages (or column chunks) that contain a single value: if the min and max are equal, are known to be exact, and there are no nulls, then the only possible value for the page is that value. This PR adds these new fields, which will always be true in cuDF, since cuDF does not support truncating min and max values in the statistics (but does support truncation in the page indexes).
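The single-value shortcut described above could be checked on the reader side roughly as follows. This is a sketch under stated assumptions, not code from cuDF or any Parquet reader: `PageStatsView` and `single_value` are hypothetical names, and the struct simply mirrors the optional fields of the Parquet `Statistics` struct. A missing exactness flag or a missing null count is treated conservatively as "unknown", so the shortcut is not taken.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical reader-side view of a page's (or column chunk's) statistics.
struct PageStatsView {
  std::optional<std::string> min_value;
  std::optional<std::string> max_value;
  std::optional<int64_t> null_count;
  std::optional<bool> is_min_value_exact;
  std::optional<bool> is_max_value_exact;
};

// Returns the page's only possible value when the statistics prove there is
// exactly one, otherwise std::nullopt (the page must be decoded as usual).
std::optional<std::string> single_value(PageStatsView const& st)
{
  // Absent flags mean exactness is unknown, so treat them as false.
  bool const exact =
    st.is_min_value_exact.value_or(false) && st.is_max_value_exact.value_or(false);
  // An absent null count cannot rule out nulls.
  bool const no_nulls = st.null_count.has_value() && *st.null_count == 0;
  if (exact && no_nulls && st.min_value && st.max_value &&
      *st.min_value == *st.max_value) {
    return st.min_value;
  }
  return std::nullopt;
}
```

Files written before these fields existed leave both flags unset, so a check like this falls back to normal decoding for them.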