Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fields to Parquet Statistics structure that were added in parquet-format 2.10 #15412

Merged
merged 20 commits into from
Apr 26, 2024

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Mar 28, 2024

Description

PARQUET-2352 added fields to the Statistics struct to indicate whether the min and max values were exact or had been truncated. This was somewhat ambiguous in the past. One reason to want to know this is to allow avoiding the decoding of pages (or column chunks) that contain a single value (if the min and max are the same value, and are known to be exact values, and there are no nulls, then the only valid value for the page will be that value). This PR adds these new fields, which will always be true in cuDF since cuDF does not support truncating min and max values in the statistics (but does support truncation in the page indexes).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner March 28, 2024 22:01
Copy link

copy-pr-bot bot commented Mar 28, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 28, 2024
@etseidl etseidl changed the title statistics are always exact Add fields to Parquet Statistics structure that were added in parquet-format 2.10 Mar 28, 2024
@karthikeyann karthikeyann added feature request New feature or request non-breaking Non-breaking change labels Apr 9, 2024
Comment on lines +2929 to +2931
// cudf min/max statistics are always exact (i.e. not truncated)
encoder.field_bool(7, true);
encoder.field_bool(8, true);
Copy link
Contributor

@karthikeyann karthikeyann Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can both of these fields be present (as true) even if the stats does not have min and max value?
I hope, other parquet readers don't break if this is true when min and max values are missing in stats.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...good point. As currently written, the stats encoding will always write something in the min and max fields if s->has_minmax is true, so I think it's safe to always set the exact fields to true here.

That said, the rules around when min/max should have values encoded and not have been in flux (and as with everything Parquet, are not well documented), so it would probably be a good idea to make sure we're following all of the rules properly. I'm not sure if that's in scope for this PR or should be a separate issue.

@karthikeyann
Copy link
Contributor

/ok to test

@ttnghia
Copy link
Contributor

ttnghia commented Apr 26, 2024

/ok to test

@ttnghia
Copy link
Contributor

ttnghia commented Apr 26, 2024

/merge

@ttnghia
Copy link
Contributor

ttnghia commented Apr 26, 2024

/ok to test

@rapids-bot rapids-bot bot merged commit 064dd7b into rapidsai:branch-24.06 Apr 26, 2024
69 checks passed
@etseidl etseidl deleted the exact_stats branch April 26, 2024 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants