Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel] Add partition_value and element_at expressions #2096

Merged
merged 2 commits into from
Sep 27, 2023

Conversation

vkorukanti
Copy link
Collaborator

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Part of #2071 (Partition Pruning in Kernel). We need the following two expressions in order to evaluate predicate on scan file columnar batch data.

  • element_at(map_column, key_value): Take input a map type column and key value, return the value for the given key. This is similar to Apache Spark UDF for similar purposes. This expression will be used to retrieve the specific partition value from the (partition column name -> string serialized partition) map
  • partition_value(string_type_value, datatype): Decode the partition value given as a string into the given datatype format. The interpretation of the string value is according to the Delta protocol.

How was this patch tested?

Added UTs

Copy link
Collaborator

@scottsand-db scottsand-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments

* deserialized as according to the Delta Protocol.
*/
public PartitionValueExpression(
Expression serializedPartitionValue, DataType partitionDataType) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 8 space indent, no? (are we on the same page about this? perhaps I missed a decision where we don't want this?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code style issue can be addressed at once.

Usually the { on a new line, simplifies this, but we choose to keep the starting bracket on the same line. Also I don't see an easy way to enforce this Intellij without affecting every multi-line statement.

}

/**
* Utility method to evaluate the {@code element_at} on given map and key vectors.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but there's no element_at as input into this method?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't a need for it. Both the map and lookup key values are passed which are sufficient for evaluating the element_at expression.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so why split up the input? instead of just passing in the actual element_at instance?

Copy link
Collaborator Author

@vkorukanti vkorukanti Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

element_at expression is just the definition, it doesn't have any input data.

static ColumnVector eval(ColumnVector map, ColumnVector lookupKey) {
return new ColumnVector() {
// Store the last lookup value to avoid multiple looks up for same row id.
private int lastLookupRowId = -1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the intuition behind assuming this will be a common lookup case?

this is essentially just a cache of size 1. any reason not to have a cache of size, say, 10?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically the pattern is: isNullAt(rowId) based on the value, make a call to get the value (getValue)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a comment :)

/**
* Utility method to return the left child of the binary input expression
*/
static Expression getLeft(Expression expression) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there not a BinaryExpression class that could have a getLeft API?

This seems like a very c / struct type design, instead of a OOP design?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like another public class BinaryExpression just for getLeft and getRight. Not sure if it is worth bloating the interface for utility methods.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im worried that these APIs won't be discoverable. with an IDE, i can see all of the APIs available for a given class. with this util organization, I won't be able to do that.

also, what's wrong with another class for just two methods? seems fine with me

not a blocker. we can decide and refactor later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are not APIs, just utility methods in the expression evaluator which is a table client implementation.

Anyways, if this becomes a problem for custom TableClient implementations, we can add the BinaryExpression.

* a column vector with decoded values according to the given partition type.
*/
static ColumnVector eval(ColumnVector input, DataType partitionType) {
return new ColumnVector() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explicitly document which data types are not yet supported? (or, will likely never be supported?) arrays, maps, structs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is already documented on the PartitionValueExpression.

Copy link
Collaborator

@scottsand-db scottsand-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but left 1 comment

@vkorukanti vkorukanti merged commit 879df3c into delta-io:master Sep 27, 2023
19 checks passed
@vkorukanti vkorukanti deleted the predicateExprs branch October 2, 2023 05:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants