Update BQ deduplicate macro to support partition pruning downstream #929

austinclooney · 2024-07-01T21:04:44Z

resolves #928

Problem

The current BQ deduplicate macro will not work for a view as it doesn't allow partition pruning downstream of the macro.

Solution

Add an optional parameter that explicitly selects the partition_by columns outside of the array_agg.
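As a rough sketch of the idea (not the exact SQL in this PR), the query below shows the grouping keys being selected next to the array_agg result, and a downstream filter that could then be applied to the partition column. The table `my_dataset.events`, the view name, and the columns `event_date`, `event_id`, and `loaded_at` are hypothetical and used only for illustration.

```sql
-- Hypothetical view built on the deduplicated output; all names are illustrative.
create or replace view my_dataset.events_deduplicated as
select
    unique.* except (event_date, event_id),
    -- The partition_by columns come straight from the grouping keys
    -- instead of being read out of the aggregated struct.
    event_date,
    event_id
from (
    select
        event_date,
        event_id,
        array_agg(
            original
            order by loaded_at desc
            limit 1
        )[offset(0)] unique
    from my_dataset.events original
    group by event_date, event_id
);

-- Downstream query: the filter is on a plain grouping column of the view.
select *
from my_dataset.events_deduplicated
where event_date = date '2024-07-01';
```

Whether BigQuery actually prunes partitions for a given query is best confirmed by comparing bytes processed with and without the filter.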

Checklist

This code is associated with an issue which has been triaged and accepted for development.
I have read the contributing guide and understand what's expected of me
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have updated the README.md (if applicable)

dbeatty10 · 2024-07-02T13:13:49Z

macros/sql/deduplicate.sql

@@ -93,12 +93,19 @@
 --  clause in BigQuery:
 --  https://github.com/dbt-labs/dbt-utils/issues/335#issuecomment-788157572
 #}
-{%- macro bigquery__deduplicate(relation, partition_by, order_by) -%}
+{%- macro bigquery__deduplicate(relation, partition_by, order_by, partition_by_pass_thru=false) -%}


Thanks for raising this PR @austinclooney !

I'd prefer not to introduce a new parameter if we can avoid it.

BigQuery supports the qualify clause. Is there any reason we shouldn't / couldn't just use qualify in the same way as Snowflake and Databricks?

The original implementation for BigQuery was for performance reasons. But I'm not sure if those performance reasons still exist or if working for views outweighs those considerations.
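For comparison, a qualify-based deduplication might look roughly like the following. This is a sketch with hypothetical table and column names, not the rendered output of any existing macro.

```sql
-- Hypothetical names; keeps one row per (event_date, event_id),
-- choosing the most recently loaded row.
select *
from my_dataset.events
qualify
    row_number() over (
        partition by event_date, event_id
        order by loaded_at desc
    ) = 1
```

Because this form uses `select *`, the output keeps the input column order, which is the property the array_agg variant with explicitly selected keys would give up.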

To my knowledge the memory issue still exists, yes. Deduplicating with array_agg uses fewer resources than deduplicating with qualify, which allows it to run on larger datasets before hitting memory issues. I added a parameter here because explicitly selecting the partition_by columns by default would change the column order in the output, so I think it would technically be a breaking change.

There are several different scenarios to consider for the relation being deduplicated:

table

view

materialized view

CTE

What would be the consequences if we did not add a new parameter named partition_by_pass_thru and just treated it as always being true instead?

I believe the only difference between the current method and the changes in this PR is that this PR would move the partition_by columns to the end of the output (or the beginning, if preferred), whereas the current output retains the original column order.

Since we're only explicitly selecting the partitioning keys it won't affect the actual deduplication as those are the groups we're deduplicating within.
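To illustrate why a change in column order could still matter to consumers, here is a hedged example of a downstream query that depends on position. `union all` matches columns by position rather than by name, so a branch written as `select *` over the deduplicated relation would be sensitive to reordering. All relation and column names below are hypothetical.

```sql
-- Hypothetical downstream model; names are illustrative only.
-- UNION ALL pairs columns by position, so both branches must agree on order.
select event_date, event_id, payload, loaded_at
from my_dataset.events_backfill

union all

-- Listing columns by name (instead of select *) keeps the branches aligned
-- even if the deduplicated relation emits its columns in a different order.
select event_date, event_id, payload, loaded_at
from my_dataset.events_deduplicated
```

Consumers that already select columns by name would be unaffected; only positional uses such as `select *` in a `union all` would notice the difference.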

[Edit: I just re-read the prior message and updated the question below]

So the net consequence would be the following?

partition pruning allowing faster queries for less cost

output relation has the same columns but possibly in a different order than before

And these consequences would apply equally to tables, views, materialized views, and CTEs?

Can you think of any way to do the following?

output relation with the same columns in the same order as before

If so, that would allow us to ship these performance improvements without any breaking changes or new parameters.

I've thought about this for a while but I don't believe it's possible without running another query since the input to the macro can be a CTE.

Something like this would work:

{%- macro bigquery__deduplicate(relation, partition_by, order_by) -%}

    {%- set columns_query -%}
        select * from {{ relation }} limit 1
    {%- endset -%}

    {%- set columns_result = run_query(columns_query) -%}
    {%- set columns = columns_result.column_names -%}

    select
        {%- for column in columns %}
        unique.{{ column }}{%- if not loop.last %},{% endif %}
        {%- endfor %}
    from (
        select
            {{ partition_by }},
            array_agg(
                original
                order by {{ order_by }}
                limit 1
            )[offset(0)] unique
        from {{ relation }} original
        group by {{ partition_by }}
    )

{%- endmacro -%}

But I don't think we'd want to run an external query to get the macro to work.

Update BQ deduplicate macro to support partition pruning downstream.

1db3228

dbeatty10 reviewed Jul 2, 2024


dbeatty10 changed the title ~~Update BQ deduplicate macro to support partition pruning downstream.~~ Update BQ deduplicate macro to support partition pruning downstream Jul 2, 2024

dbeatty10 added the enhancement label Jul 2, 2024


austinclooney commented Jul 1, 2024 (edited by dbeatty10)

dbeatty10 Jul 2, 2024

austinclooney Jul 2, 2024

dbeatty10 Jul 2, 2024

austinclooney Jul 2, 2024

dbeatty10 Jul 2, 2024 (edited)

austinclooney Jul 3, 2024

dbeatty10 Jul 3, 2024

austinclooney Aug 1, 2024
