
Add Parquet predicate pushdown filtering example #164

Open · wants to merge 2 commits into main
Conversation

@MrPowers commented Aug 26, 2020

Here's an example of Parquet predicate pushdown filtering, as suggested by @jrbourbeau here.

I am a Dask newbie and this is my first time working with a Jupyter notebook so be nice! ;)

I am not super-excited about checking in Parquet files to the repo (don't like binaries in GitHub), but it's the best way to keep the example simple. Let me know if you have other suggestions.
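
For anyone skimming, this is roughly the pattern the notebook demonstrates (the path and column name below are placeholders, not the actual files in this PR):

import dask.dataframe as dd

# Predicate pushdown: the filter is applied while the Parquet files are read,
# so row groups whose statistics can't match the predicate are skipped.
ddf = dd.read_parquet(
    "data/some_parquet_lake",                # placeholder path
    filters=[("favorite_number", ">", 10)],  # placeholder column/predicate
)
print(ddf.compute())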


@TomAugspurger (Member)

> I am not super-excited about checking in Parquet files to the repo (don't like binaries in GitHub), but it's the best way to keep the example simple. Let me know if you have other suggestions.

I think it might be better to write the parquet files in the example, since (IIRC) they need to be written specifically to support filtering. So perhaps use a built-in dataset like dask.datasets.timeseries() and then show reading it back in with a filter.
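
Something along these lines (untested sketch, argument details may need tweaking):

import dask
import dask.dataframe as dd

# Generate a small synthetic dataset and write it out as Parquet
df = dask.datasets.timeseries()
df.to_parquet("timeseries.parquet", engine="pyarrow")

# Read it back with a filter so the notebook demonstrates predicate pushdown
filtered = dd.read_parquet(
    "timeseries.parquet",
    engine="pyarrow",
    filters=[("id", ">", 1000)],
)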

@MrPowers (Author)

@TomAugspurger - thanks for reviewing. I updated the example to build the Parquet lake from CSV files. Checking in CSV files to source control doesn't feel as wrong. The Parquet lake is written to a directory that's already gitignored.
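
Roughly what the updated notebook does (the paths here are placeholders, not the exact ones in the notebook):

import dask.dataframe as dd

# Build the Parquet lake from the checked-in CSV files;
# the output directory is already gitignored
ddf = dd.read_csv("data/csvs/*.csv")
ddf.to_parquet("data/parquet_lake", engine="pyarrow")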

I looked into dask.datasets.timeseries() and it looks like I'd be able to make that work too, but I think it'd make the explanation a bit more complicated. I'm open to that approach if you feel strongly about it.

If this pull request gets merged, I'd like to use the CSV files to provide an expanded discussion of "Select only the columns that you plan to use" in the 01-data-access example. I've done benchmarking in the Spark world and found that column pruning can provide huge performance boosts. I want to provide a detailed discussion and demonstrate that column pruning isn't possible for CSV lakes, but is possible for Parquet lakes.
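
To give a sense of what I have in mind (hypothetical paths and column names):

import dask.dataframe as dd

# Parquet is columnar, so unneeded columns are never read from disk
ddf = dd.read_parquet("data/parquet_lake", columns=["id", "favorite_number"])

# With CSV, usecols only drops columns after parsing; every row of every
# file still has to be read and parsed
ddf_csv = dd.read_csv("data/csvs/*.csv", usecols=["id", "favorite_number"])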

Thanks again for reviewing!

@TomAugspurger (Member)

TomAugspurger commented Sep 8, 2020 via email

@MrPowers (Author)

MrPowers commented Sep 8, 2020

@TomAugspurger - thanks for the response. This code reads in CSV files and writes out Parquet files, so the current code should satisfy your wish to write out the Parquet files within the example itself.

Parquet predicate pushdown filtering depends on the Parquet metadata statistics (not on how the files are laid out in directories on disk). Here's how you can access the metadata statistics:

import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('some_file.parquet')
print(parquet_file.metadata.row_group(0).column(1).statistics)

See here for more info.

The partition_cols filtering is different. That's generally what I call "partition filtering". That one depends on the directory structure and how the Parquet files are organized on disk. After this PR is merged, I'll create a separate PR with a good partition filtering example.
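
For context, here's a rough, untested sketch of what that future partition filtering example would look like (names are placeholders):

import dask
import dask.dataframe as dd

df = dask.datasets.timeseries()

# Write the lake partitioned on a column; this creates one directory per
# value, e.g. data/partitioned_lake/name=Alice/part.0.parquet
df.to_parquet("data/partitioned_lake", engine="pyarrow", partition_on=["name"])

# Partition filtering: only the matching directories are read at all
alice = dd.read_parquet(
    "data/partitioned_lake",
    engine="pyarrow",
    filters=[("name", "=", "Alice")],
)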

Thanks again for reviewing and let me know if any additional changes are needed! Really appreciate your feedback!

Base automatically changed from master to main January 27, 2021 16:07