Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel] Refactor expressions #1997

Merged
merged 4 commits into from
Sep 12, 2023
Merged

Conversation

vkorukanti
Copy link
Collaborator

@vkorukanti vkorukanti commented Aug 21, 2023

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

  • Refactor the Kernel expression interfaces. Currently, the Expression base interface contains a few unnecessary methods such as dataType and eval. Keep the Expression to a minimum, so that it is just used to represent a SQL string version of the expression in Kernel Expression classes.
interface Expression {
  /**
    * @return a {@link List} of the immediate children of this node
    */
  List<Expression> children();
}
  • Introduce a subtype of Expression called ScalarExpression which is a base class for all scalar expressions.
  • Introduce a subtype of ScalarExpression called Predicate as the base expression class for scalar expression that evaluates to boolean type. The Predicate is defined such that it takes a generic expression name and any number of input expressions. It is up to the evaluator to make sure the given Predicate is evaluable. Currently Predicate only allows a subset of expressions (=, <, <=, >, >=, AND, OR, ALWAYS_TRUE, ALWAYS_FALSE) as of now. In the future, this can be extended to support more predicate expressions with minimal code changes.
  • Update scan-related APIs to Predicate instead of Expression.
  • Remove the use of Literal.FALSE and Literal.TRUE and instead use AlwaysTrue.ALWAYS_TRUE and AlwaysFalse.ALWAYS_FALSE. Literal is not a predicate.
  • Extract the expression evaluation from kernel-api into kernel-defaults.
    • DefaultExpressionEvaluator validates the expression and adds necessary implicit casts to allow evaluation.

TODO (will be addressed after this PR is landed):

How was this patch tested?

Moved the existing Java-based test to Scala and also added new tests (some of them are copied over from the standalone ExpressionSuite and updated).

@vkorukanti vkorukanti force-pushed the expression_eval branch 5 times, most recently from 4736839 to 472d32a Compare August 22, 2023 21:53
Copy link
Contributor

@tdas tdas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments

* @return the String representation of this expression.
*/
String toString();

/**
* @return a {@link List} of the immediate children of this node
*/
List<Expression> children();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on using List vs Array?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am inclined to use the List. Lists are easy to create (like appending an element etc) and also can create an unmodifiable view on top of the list when the children are requested through getter.

/**
* List of supported predicate expressions
*/
private static final Set<String> PREDICATE_TYPES = Stream.of(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

types? or names? since these are actual allowed names.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either of them should be fine. Given this is a private variable, does it matter?

@vkorukanti vkorukanti changed the title [WIP][Kernel] Refactor expressions [Kernel] Refactor expressions Aug 29, 2023
Copy link
Collaborator

@allisonport-db allisonport-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions on the changes in kernel-api

Comment on lines 39 to 48
public Predicate getLeft() {
return (Predicate) children().get(0);
}

/**
* @return Right side operand.
*/
public Predicate getRight() {
return (Predicate) children().get(1);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need these APIs in the Kernel API framework?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It helps the clients of this expression. Don't need to have the code to get the childrens list, get the first element and validate it is a Predicate expression.

private static void validatePredicateExpression(String name, List<Expression> children) {
checkArgument(PREDICATE_TYPES.contains(name), "Unknown predicate expression: %s", name);
switch (name) {
case "AND": // fall through
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a user provided expression like col1 AND col2? Do we expect the connector to create a different kind of expression than Predicate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

col1 AND col2 can be rewritten using col1 = true AND col2 = true. If this becomes a problem, we can relax this requirement in future.

* @param name Predicate expression name
* @param children List of input expressions to the predicate expression.
*/
private static void validatePredicateExpression(String name, List<Expression> children) {
Copy link
Collaborator Author

@vkorukanti vkorukanti Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdas I am wondering if we should get rid of this type checks? It is becoming complicated as there is a ScalarExpression in between Predicate and Expression. ScalarExpression needs to aware of these types if it wants to validate the given expression is allowed or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScalarExpression has no data type or type validation right? So whats the problem if Predicate constructor does type checks before calling in the ScalarExpression constructor which does not do type checks?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in future when we add more scalar expression that are not Predicate, where do we do the validation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation also includes the given name is one of the supported expression.

It looks like the validation is not adding much value. The implementation of the ExpressionHandler already throws error if it gets an expression which is not one of the listed expression in the JavaDoc of Predicate or ScalarExpression. I am inclined to remove these validation checks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a case in the future where wrong expressions does not throw any error and silently does not produce the intended result.. then we throwing error might be better. Keep an eye out for those stuff.

return String.format("%s(%s)", name, args);
}

private static final Set<String> COMPARATOR_OPERATORS =
Copy link
Contributor

@tdas tdas Aug 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the names... right? why are you calling this "operators" in one place and "types" in another place (below)? why not just "names"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is removed. See here for details.

/**
* Implementation of {@link ExpressionVisitor} to evaluate it on a {@link ColumnarBatch}.
*/
private static class ExpressionEvaluator extends ExpressionVisitor<ColumnVector> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same name as ExpressionEvaluator interface!!!! This is hella confusing!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to ExpressionEvalVisitor.

ExpressionEvaluator getEvaluator(StructType batchSchema, Expression expression);
ExpressionEvaluator getEvaluator(
StructType inputSchema,
Expression expression,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be ScalarExpression if we're adding that as an an expression type?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok to update this, but we need to start adding more methods to the interface if we end up adding aggregation expressions or introduce another interface between Expression and ScalarExpression

@tdas let me know what you think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, its likely we have to for aggregate expressions... because for aggregate .. would ExpressionsEvaluator produce ColumnVector?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it makes sense to make this a ScalaExpression

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently Column and Literal are not derived from ScalarExpression. Should we make them extend ScalarExpression before making this change? Otherwise we cannot populate partition values.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, seems like literal and column should be scalar expressions

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Literal and Column are leaf expressions which don't take any input expressions. There isn't a one input and output concept here. Also Spark DS v2 doesn't extend Literal and Column from ScalarExpression.

I am inclined to keep it this way. If there is a strong reason, I will followup the change in a separate PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay good to merge. Let's follow up on it with @tdas next week?

* output value. A subclass of these expressions are of type {@link Predicate} whose result is
* `boolean`. See {@link Predicate} for predicate type scalar expressions. Supported
* non-predicate type scalar expressions are listed.
* TODO: Currently there aren't any. Will be added in future. An example one looks like this:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we intend to allow engines to create/pass through any scalar expressions they support? I was under the impression we wouldn't be enforcing a supported list for these.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have these in InternalUtils? let's not duplicate them in both places?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am overall confused. It this Utils supposed to be public or internal? things like checkArgument does not seem like things that kernel should expose as public apis

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It involves copying, but I am inclined to keep them internal and duplicate them in both kernel-api and kernel-default. This is just a few lines of code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could put them in an internal util in Kernel API and reuse that in Kernel Default. Its okay for Kernel Defaults to use some small internal stuff from Kernel API.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will followup with a separate PR (there is a TODO already).

* of the following types based on the literal data type:
*
* <u>
* <li>BOOLEAN -> {@link Boolean}</li>>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do these link to?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They link to the specific class referred with @link


/**
* Utility method to compare the left and right according to the natural ordering
* and return the comparison result (-1, 0, 1) for each row as an integer array.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what types are allowed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a comment.

@@ -156,16 +156,6 @@ private static Type prunedType(Type type, DataType deltaType) {
}
}

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove this from internal class to public class?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the previous comment response.

@@ -74,6 +72,8 @@ public static Type findSubFieldType(GroupType groupType, StructField field) {
return null;
}

// TODO: Move these precondition checks into a separate utility class.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please keep track of the TODOs and make sure all these are done before the 3.0 release.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree.

* output value. A subclass of these expressions are of type {@link Predicate} whose result type is
* `boolean`. See {@link Predicate} for predicate type scalar expressions. Supported
* non-predicate type scalar expressions are listed below.
* TODO: Currently there aren't any. Will be added in future. An example one looks like this:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vkorukanti @allisonport-db remind me what was decided... whether we are going to validation for allow expressions or not.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we weren't since we were going to let engines use whatever expressions they support

}
}

test("expression: and") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are these empty??

Copy link
Collaborator

@scottsand-db scottsand-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wonder if a BinaryPredicate could simplify the code a bit and remove duplication?

/**
* @return Left side operand.
*/
public Predicate getLeft() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't this fit best inside of a BinaryPredicate class?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are trying to minimize the number of interfaces. The goal is to have just one Predicate class that captures all the predicates. And and Or are just two special expressions.

* <p>
* Only supports primitive data types, see
* <a href="https://github.com/delta-io/delta/blob/master/PROTOCOL.md#primitive-types">Delta Transaction Log Protocol: Primitive Types</a>.
* An expression type that refers to a column by name (case-sensitive) in the input.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this handle nested columns?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* @return a {@link Literal} with data type {@link TimestampType}
*/
public static Literal of(Timestamp value) {
return new Literal(value, TimestampType.INSTANCE);
public static Literal ofTimestamp(long microsSinceEpochUTC) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ofTimestampUTC ? wouldn't that make this API super, super clear

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using the datatype as a suffix to differentiate. Adding the UTC makes it look like another datatype.

default DataType dataType() {
return BooleanType.INSTANCE;
public String toString() {
if (Arrays.asList("<", "<=", ">", ">=", "=").contains(name)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where are these expression classes (less than, greater than, etc.) defined?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid the class bloat whenever a new expression is added, we defined a minimal set of generic expressions. The supported expressions are listed in the javadoc. This type of generic interface also allows the connector to pass through any special expression it wants through Kernel-API module to its own implementation of the ExpressionHandler.

Copy link
Collaborator

@allisonport-db allisonport-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly minor comments

* <ul>
* <li>given input column is part of the input data schema</li>
* <li>expression inputs are of supported types. Insert cast according to the rules in
* {@link ImplicitCastExpression} to make the types compatible for evaluation by
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about implicit casting to the output type? Should we do that somewhere?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bump on this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good qn. We want to do this at some point. For now all our expressions either return boolean or specify the exact type (partition col population). Created #2047 to track this.

* output value. A subclass of these expressions are of type {@link Predicate} whose result type is
* `boolean`. See {@link Predicate} for predicate type scalar expressions. Supported
* non-predicate type scalar expressions are listed below.
* TODO: Currently there aren't any. Will be added in future. An example one looks like this:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we weren't since we were going to let engines use whatever expressions they support

* the same type, but the evaluator expects same type inputs. There could be more use cases, but
* for now this is the only use case.
*/
final class ImplicitCastExpression implements Expression {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do anything for decimal types?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to add more types. That's why I added a comment above to indicate the list is not exhastive.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to create a github issue to follow up on this? Seems pretty weird that a comparison between decimal types with different scales (which just determines how it's written in parquet and not the actual value) would fail

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#2044 (also mentioned in the PR description)

ExpressionEvaluator getEvaluator(StructType batchSchema, Expression expression);
ExpressionEvaluator getEvaluator(
StructType inputSchema,
Expression expression,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, seems like literal and column should be scalar expressions

Copy link
Collaborator

@allisonport-db allisonport-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending questions about casting to output type and making literal/column scalar expressions

* <ul>
* <li>given input column is part of the input data schema</li>
* <li>expression inputs are of supported types. Insert cast according to the rules in
* {@link ImplicitCastExpression} to make the types compatible for evaluation by
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bump on this?

@vkorukanti
Copy link
Collaborator Author

vkorukanti commented Sep 12, 2023

Pending questions about casting to output type and making literal/column scalar expressions

Created a tracking issue to handle the cast expression output to evaluator output.

For making the literal/column scalar expression: I responded here. Let me know if there is any strong reason to deviate from Spark DSv2 expression model.

Copy link
Collaborator

@allisonport-db allisonport-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. To follow up about literal/column as scalar expressions next week.

@vkorukanti vkorukanti merged commit 03217d1 into delta-io:master Sep 12, 2023
6 checks passed
@vkorukanti vkorukanti deleted the expression_eval branch September 14, 2023 11:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants