
Add Parquet predicate pushdown filtering example #164

Open · wants to merge 2 commits into main
Conversation

@MrPowers commented Aug 26, 2020

Here's an example of Parquet predicate pushdown filtering, as suggested by @jrbourbeau here.

I am a Dask newbie and this is my first time working with a Jupyter notebook so be nice! ;)

I am not super-excited about checking in Parquet files to the repo (don't like binaries in GitHub), but it's the best way to keep the example simple. Let me know if you have other suggestions.
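
For anyone skimming, this is roughly the pattern the notebook demonstrates (the path and column name below are placeholders, not the actual files in this PR):

import dask.dataframe as dd

# Predicate pushdown: the filter is applied while the Parquet files are read,
# so row groups whose statistics can't match the predicate are skipped.
ddf = dd.read_parquet(
    "data/some_parquet_lake",                # placeholder path
    filters=[("favorite_number", ">", 10)],  # placeholder column/predicate
)
print(ddf.compute())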


@TomAugspurger (Member)

> I am not super-excited about checking in Parquet files to the repo (don't like binaries in GitHub), but it's the best way to keep the example simple. Let me know if you have other suggestions.

I think it might be better to write the parquet files in the example, since (IIRC) they need to be written specifically to support filtering. So perhaps use a built-in dataset like dask.datasets.timeseries() and then show reading it back in with a filter.
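
Something along these lines (untested sketch, argument details may need tweaking):

import dask
import dask.dataframe as dd

# Generate a small synthetic dataset and write it out as Parquet
df = dask.datasets.timeseries()
df.to_parquet("timeseries.parquet", engine="pyarrow")

# Read it back with a filter so the notebook demonstrates predicate pushdown
filtered = dd.read_parquet(
    "timeseries.parquet",
    engine="pyarrow",
    filters=[("id", ">", 1000)],
)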

@MrPowers (Author)

@TomAugspurger - thanks for reviewing. I updated the example to build the Parquet lake from CSV files. Checking in CSV files to source control doesn't feel as wrong. The Parquet lake is written to a directory that's already gitignored.
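
Roughly what the updated notebook does (the paths here are placeholders, not the exact ones in the notebook):

import dask.dataframe as dd

# Build the Parquet lake from the checked-in CSV files;
# the output directory is already gitignored
ddf = dd.read_csv("data/csvs/*.csv")
ddf.to_parquet("data/parquet_lake", engine="pyarrow")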

I looked into dask.datasets.timeseries() and it looks like I'd be able to make that work too, but I think it'd make the explanation a bit more complicated. I'm open to that approach if you feel strongly about it.

If this pull request gets merged, I'd like to use the CSV files to provide an expanded discussion of "Select only the columns that you plan to use" in the 01-data-access example. I've done benchmarking in the Spark world and found that column pruning can provide huge performance boosts. I want to provide a detailed discussion and demonstrate that column pruning isn't possible for CSV lakes, but is possible for Parquet lakes.
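
To give a sense of what I have in mind (hypothetical paths and column names):

import dask.dataframe as dd

# Parquet is columnar, so unneeded columns are never read from disk
ddf = dd.read_parquet("data/parquet_lake", columns=["id", "favorite_number"])

# With CSV, usecols only drops columns after parsing; every row of every
# file still has to be read and parsed
ddf_csv = dd.read_csv("data/csvs/*.csv", usecols=["id", "favorite_number"])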

Thanks again for reviewing!

@TomAugspurger (Member)

TomAugspurger commented Sep 8, 2020 via email

@MrPowers (Author)

MrPowers commented Sep 8, 2020

@TomAugspurger - thanks for the response. This code reads in CSV files and writes out Parquet files, so the current code should satisfy your wish to write out the Parquet files within the example itself.

Parquet predicate pushdown filtering depends on the Parquet metadata statistics (not on how the files are laid out in directories on disk). Here's how you can access the metadata statistics:

import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('some_file.parquet')
print(parquet_file.metadata.row_group(0).column(1).statistics)

See here for more info.

The partition_cols filtering is different. That's generally what I call "partition filtering". That one depends on the directory structure and how the Parquet files are organized on disk. After this PR is merged, I'll create a separate PR with a good partition filtering example.
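
For context, here's a rough, untested sketch of what that future partition filtering example would look like (names are placeholders):

import dask
import dask.dataframe as dd

df = dask.datasets.timeseries()

# Write the lake partitioned on a column; this creates one directory per
# value, e.g. data/partitioned_lake/name=Alice/part.0.parquet
df.to_parquet("data/partitioned_lake", engine="pyarrow", partition_on=["name"])

# Partition filtering: only the matching directories are read at all
alice = dd.read_parquet(
    "data/partitioned_lake",
    engine="pyarrow",
    filters=[("name", "=", "Alice")],
)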

Thanks again for reviewing and let me know if any additional changes are needed! Really appreciate your feedback!

Base automatically changed from master to main January 27, 2021 16:07