
[Kernel] Add partition pruning related utility methods #2098

Merged
merged 5 commits into from
Sep 28, 2023

Conversation

vkorukanti
Collaborator

@vkorukanti vkorukanti commented Sep 24, 2023

Which Delta project/connector is this regarding?

  • [ ] Spark
  • [ ] Standalone
  • [ ] Flink
  • [x] Kernel
  • [ ] Other (fill in here)

Description

Part of #2071 (Partition Pruning in Kernel). This PR adds the following utility methods:

  • Splitting the Predicate given to ScanBuilder.withFilter into a data-column predicate and a partition-column predicate.
  • Rewriting the partition-column Predicate to refer to the columns in the scan file columnar batch, with the appropriate partition-value deserialization expressions applied.
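The split described in the first bullet can be sketched roughly as follows. This is an illustrative model only: `Leaf` and `split` are hypothetical names, not the actual Kernel API, and it assumes the filter has already been flattened into a list of AND-ed conjuncts.

```java
import java.util.*;

// Illustrative sketch only: splits a list of AND-ed leaf predicates into
// a partition-column part and a data-column part. `Leaf` and `split` are
// hypothetical names, not the actual Kernel API.
class PartitionPredicateSplit {
    // A tiny predicate model: a comparison over a single column name.
    record Leaf(String column, String op, Object literal) {}

    // Returns a map: true -> conjuncts touching only partition columns,
    // false -> conjuncts touching data columns.
    static Map<Boolean, List<Leaf>> split(List<Leaf> conjuncts,
                                          Set<String> partitionCols) {
        Map<Boolean, List<Leaf>> out = new HashMap<>();
        out.put(true, new ArrayList<>());
        out.put(false, new ArrayList<>());
        for (Leaf leaf : conjuncts) {
            out.get(partitionCols.contains(leaf.column())).add(leaf);
        }
        return out;
    }
}
```

With a split like this, the partition part can be evaluated against each scan file's partition values to prune files, while the data part is pushed down to the data readers.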

How was this patch tested?

Added unit tests.

Collaborator

@scottsand-db scottsand-db left a comment


Looks awesome. Left some comments.

Set<String> partitionColNames) {
    for (Expression child : children) {
        if (child instanceof Column) {
            String[] names = ((Column) child).getNames();
Collaborator


Hm, I'm just noticing this getNames API. IMO getNames implies that the Column has multiple names, no? That seems odd. Would nestedName be better?

Collaborator Author


Spark has fieldNames. nameParts is another option; nestedName could imply it is just for nested columns.

Collaborator


nameParts is great

Collaborator


Doesn't need to be done in this PR. We can create an issue and assign it to the 3.1 milestone.

val partitionTestCases = Map[Predicate, (String, String)](
  // single predicate on a data column
  predicate("=", col("data1"), ofInt(12)) ->
    ("ALWAYS_TRUE()", "(column(`data1`) = 12)"),
Collaborator


nit: probably too late, but it should ideally just be

`column(data1)`

instead of

column(`data1`)

right? i.e. we should only use backticks for columns with `.`s in them?

column(a.b.`col.with.dot`.e)

Isn't this what Spark does?

Collaborator Author


It is just a toString method. If I recall correctly, Delta-Spark does the same (adds backticks around every column name). I can change it to add backticks only for names containing `.`, but I'm not sure it is worth it.

Collaborator


Cool. I just want to do what Spark does.
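The quoting rule discussed in this thread could look roughly like this. `render` and `quote` are hypothetical helper names for illustration, not the actual Kernel code:

```java
// Illustrative sketch: backtick-quote a name part only when it
// contains a dot, so that only ambiguous parts are quoted.
class ColumnToString {
    // Wrap a single name part in backticks only if it contains a '.'.
    static String quote(String part) {
        return part.contains(".") ? "`" + part + "`" : part;
    }

    // Join the name parts with '.' inside a column(...) wrapper.
    static String render(String[] nameParts) {
        StringBuilder sb = new StringBuilder("column(");
        for (int i = 0; i < nameParts.length; i++) {
            if (i > 0) sb.append('.');
            sb.append(quote(nameParts[i]));
        }
        return sb.append(')').toString();
    }
}
```

Under this rule, a simple column renders as column(data1) and a dotted part as column(a.b.`col.with.dot`.e), matching the example above.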

Collaborator

@scottsand-db scottsand-db left a comment


LGTM, with one comment about the ALWAYS_FALSE case. Will let you decide to accept or ignore it.

@vkorukanti vkorukanti merged commit ca69895 into delta-io:master Sep 28, 2023
7 checks passed
@vkorukanti vkorukanti deleted the partitionUtils branch September 28, 2023 19:44