Add Parquet predicate pushdown filtering example #164
base: main
Conversation
Check out this pull request on ReviewNB to review Jupyter notebook visual diffs & provide feedback on notebooks. Powered by ReviewNB.
I think it might be better to write the parquet files in the example, since (IIRC) they need to be written specifically to support filtering. So perhaps use a built-in dataset like dask.datasets.timeseries().
@TomAugspurger - thanks for reviewing. I updated the example to build the Parquet lake from CSV files. Checking in CSV files to source control doesn't feel as wrong. The Parquet lake is written to a directory that's already gitignored.

I looked into dask.datasets.timeseries() and it looks like I'd be able to make that work too, but I think it'd make the explanation a bit more complicated. I'm open to that approach if you feel strongly about it.

If this pull request gets merged, I'd like to use the CSV files to provide an expanded discussion on "Select only the columns that you plan to use" in the 01-data-access example. I've done benchmarking in the Spark world and have found that column pruning can provide huge performance boosts. I want to provide a detailed discussion and demonstrate that column pruning isn't possible for CSV lakes, but is possible for Parquet lakes.

Thanks again for reviewing!
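As a rough sketch of that CSV-to-Parquet workflow (hypothetical paths; the notebook's actual code may differ):

import dask.dataframe as dd

# Read the CSV files that are checked into the repo (hypothetical glob path).
ddf = dd.read_csv("data/csv/*.csv")

# Write the Parquet lake to a gitignored directory. With the pyarrow engine,
# per-row-group min/max statistics are written by default, which is what
# predicate pushdown filtering relies on.
ddf.to_parquet("data/parquet-lake/", engine="pyarrow")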
Is it right to say that the parquet files need to be written with specific arguments (like partition_cols) for this performance benefit? If so I'd prefer to write them in the example.
@TomAugspurger - thanks for the response. This code reads in CSV files and writes out Parquet files, so the current code should satisfy your wish to write out the Parquet files within the example itself.

Parquet predicate pushdown filtering depends on the Parquet metadata statistics (not on how the Parquet files are written to disk). Here's how you can access the metadata statistics:

import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('some_file.parquet')
print(parquet_file.metadata.row_group(0).column(1).statistics)

See here for more info.

Thanks again for reviewing and let me know if any additional changes are needed! Really appreciate your feedback!
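For illustration, here's a minimal sketch of how the pushdown is typically exercised from Dask, assuming a hypothetical Parquet lake path and hypothetical column names (the notebook's actual code may differ):

import dask.dataframe as dd

# Predicate pushdown: row groups whose min/max column statistics rule out
# the predicate are skipped entirely, so less data is read from disk.
ddf = dd.read_parquet(
    "data/parquet-lake/",
    engine="pyarrow",
    filters=[("nickname", "==", "red")],  # hypothetical column and value
    columns=["nickname", "age"],          # column pruning: only read the columns you need
)
print(ddf.head())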
Here's an example of Parquet predicate pushdown filtering, as suggested by @jrbourbeau here.
I am a Dask newbie and this is my first time working with a Jupyter notebook so be nice! ;)
I am not super-excited about checking in Parquet files to the repo (don't like binaries in GitHub), but it's the best way to keep the example simple. Let me know if you have other suggestions.