Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel] Refactor the Array and Map return types in ColumnVector and Row #2087

Merged
merged 10 commits into from
Oct 2, 2023

Conversation

allisonport-db
Copy link
Collaborator

@allisonport-db allisonport-db commented Sep 22, 2023

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

BASED OFF OF #2069; to review changes in this PR select the last 4 commits (here)

Refactors the ColumnVector and Row getArray and getMap APIs to return wrapper objects. These wrappers provide APIs to retrieve column vector views of the elements or keys/values of the map/array.

How was this patch tested?

Supporting the new APIs in some of the java testing infrastructure was non-trivial, so this PR also moves some tests from TestDeltaTableReads to DeltaTableReadsSuite and adds the complex types to the scala testing infrastructure.

Existing tests have been modified for this PR as well as a few additional checks/tests added.

* Get the value at given {@code rowId} from the column vector. The type of the value object
* depends on the data type of the {@code vector}.
*/
public static Object getValueAsObject(ColumnVector vector, int rowId) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved over from kernel/kernel-defaults/src/main/java/io/delta/kernel/defaults/internal/data/vector/VectorUtils.java since it was only being used in test code after these changes

Copy link
Collaborator

@scottsand-db scottsand-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Left some comments

@@ -112,11 +110,11 @@ public interface Row {
* Return array value of the column located at the given ordinal.
* Throws error if the column at given ordinal is not of array type,
*/
<T> List<T> getArray(int ordinal);
ArrayValue getArray(int ordinal);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would getArrayValue and getMapValue be better APIs?

Since getMap makes me think it will return a java.util.Map

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I necessarily agree with

getMap makes me think it will return a java.util.Map

But I'm relatively indifferent about changing this if everyone is in agreement. Noting that Spark's method is ColumnVector::getMap --> ColumnarMap.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use getArrayValue etc. Map and Array ubiquitously refer to java.util.Map and the java array type

Copy link
Collaborator Author

@allisonport-db allisonport-db Oct 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's followup on this when we do the API review (easy to change in a followup PR)

return configuration;
}

public String getColumnMappingMode() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm. I don't know if this is a very scalable or future-proof design.

we won't have getters for every other configuration, will we? (I don't think so)

So, what makes this one special?

Instead, we should just lazily convert the configuration MapValue into a Map on the first .getTableProperty(key: String) and access it that way

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added getConfiguration --> Map<String, String> instead. Let me know if you'd prefer a getTableProperty API instead, but I think we'd have to provide a couple methods

  • boolean isTablePropertySet(key)
  • String getTableProperty(key)
  • String getTableProperty(key, default) <-- optional

@@ -160,7 +157,17 @@ private static Object decodeElement(JsonNode jsonValue, DataType dataType) {
final Object parsedElement = decodeElement(element, arrayType.getElementType());
output.add(parsedElement);
}
return output;
return new ArrayValue() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this becomes a common pattern, add a utility method to convert List to ArrayValue.

}

@Override
public <K, V> Map<K, V> getMap(int ordinal) {
return (Map<K, V>) parsedValues[ordinal];
}
}

private static class TestColumnVector implements ColumnVector {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if there is another alternative to this. Does Spark has similar requirement where they need to expose a list as Vector?

@@ -30,7 +30,8 @@ public final class VectorUtils {
private VectorUtils() {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we consolidate the toJavaMap(MapValue) and toMapValue(java.util.Map here? Similar for the array.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responding to the (I think) 3 comments related to this here.

We need to expose java list/maps as MapValue/ArrayValue (and thus as ColumnVectors) in a couple of places

In kernelApi:

  • (1) JsonHandlerTestImpl

In kernelDefaults:

  • (2) DefaultJsonHandlerImpl (in DefaultJsonRow) -- very similar pattern to the first
  • (3) Tests (once in ExpressionEvaluatorSuite)

I can add a test utility to do this to cover (1) and (3) but I don't think we want to add this to VectorUtils as we shouldn't be creating ColumnVectors there?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about add utility method for defaults and it can be used in default prod code and test code.

For the JsonHandlerTestImpl, I am actually thinking of removing it and make it a proper mock.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a DefaultGenericVector and use that in DefaultJsonRow and in a test utility I added that converts a Seq[Map[..,..]]to a CV. Can't really combine more than that since in DefaultJsonRow we can build the keys/values separately and don't start with a Map

@vkorukanti vkorukanti self-requested a review September 23, 2023 06:36
Copy link
Collaborator

@vkorukanti vkorukanti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase the change and will take another pass.

mapValue.getValues().getString(i);
}
}
// TODO what is the behavior for element_at if not in the map?
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vkorukanti I can update the docs for the function as well

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't return null. We should return an Optional.

What if the value for the key is null? Isn't that perfectly valid?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Please add the docs. I didn't mention the null behavior. The behavior should be it should return null. We have tests that confirm this.

@@ -112,11 +110,11 @@ public interface Row {
* Return array value of the column located at the given ordinal.
* Throws error if the column at given ordinal is not of array type,
*/
<T> List<T> getArray(int ordinal);
ArrayValue getArray(int ordinal);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use getArrayValue etc. Map and Array ubiquitously refer to java.util.Map and the java array type

mapValue.getValues().getString(i);
}
}
// TODO what is the behavior for element_at if not in the map?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't return null. We should return an Optional.

What if the value for the key is null? Isn't that perfectly valid?

@@ -73,7 +74,7 @@ static ColumnVector eval(ColumnVector map, ColumnVector lookupKey) {
// The general pattern is call `isNullAt(rowId)` followed by `getString`.
// So the cache of one value is enough.
private int lastLookupRowId = -1;
private Object lastLookupValue = null;
private String lastLookupValue = null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does null represent? that there was no last lookup value? this is ambiguous, it should be Optional

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't represent a null value if we make it Optional<String>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think lastLookupRowId reflects whether there was a last lookup value.


@Override
public void close() {
underlyingVector.close();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a problem. We shouldn't be closing the underlying vector as it could still be used.

For Java heap-based vectors this doesn't matter, for off-heap-based ones, we need some kind of ref counting.

mapValue.getValues().getString(i);
}
}
// TODO what is the behavior for element_at if not in the map?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Please add the docs. I didn't mention the null behavior. The behavior should be it should return null. We have tests that confirm this.

@tdas
Copy link
Contributor

tdas commented Sep 29, 2023

This PR has progressed enough that its not great to change it right now, but it would have been better to split this large PR into 2 PRs one for Array and one for Map

@allisonport-db
Copy link
Collaborator Author

This PR has progressed enough that its not great to change it right now, but it would have been better to split this large PR into 2 PRs one for Array and one for Map

@tdas Tried to do this originally but it breaks a bunch of the existing tests, and couldn't refactor them to the new framework without making the changes for both types.

@allisonport-db allisonport-db merged commit a038186 into delta-io:master Oct 2, 2023
6 checks passed
vkorukanti pushed a commit to vkorukanti/delta that referenced this pull request Oct 3, 2023
…Row (delta-io#2087)

#### Which Delta project/connector is this regarding?
<!--
Please add the component selected below to the beginning of the pull request title
For example: [Spark] Title of my pull request
-->

- [ ] Spark
- [ ] Standalone
- [ ] Flink
- [x] Kernel
- [ ] Other (fill in here)

## Description

BASED OFF OF delta-io#2069; to review changes in this PR select the last 4 commits ([here](https://github.com/delta-io/delta/pull/2087/files/a5073b4c14496314bae1ca97163f9e7bbc67bc6d..d160915717d683ebfd6087a93ccb224a57216d82))

Refactors the ColumnVector and Row `getArray` and `getMap` APIs to return wrapper objects. These wrappers provide APIs to retrieve column vector views of the elements or keys/values of the map/array. 

## How was this patch tested?

Supporting the new APIs in some of the java testing infrastructure was non-trivial, so this PR also moves some tests from TestDeltaTableReads to DeltaTableReadsSuite and adds the complex types to the scala testing infrastructure.

Existing tests have been modified for this PR as well as a few additional checks/tests added.
vkorukanti pushed a commit to vkorukanti/delta that referenced this pull request Oct 3, 2023
…Row (delta-io#2087)

#### Which Delta project/connector is this regarding?
<!--
Please add the component selected below to the beginning of the pull request title
For example: [Spark] Title of my pull request
-->

- [ ] Spark
- [ ] Standalone
- [ ] Flink
- [x] Kernel
- [ ] Other (fill in here)

## Description

BASED OFF OF delta-io#2069; to review changes in this PR select the last 4 commits ([here](https://github.com/delta-io/delta/pull/2087/files/a5073b4c14496314bae1ca97163f9e7bbc67bc6d..d160915717d683ebfd6087a93ccb224a57216d82))

Refactors the ColumnVector and Row `getArray` and `getMap` APIs to return wrapper objects. These wrappers provide APIs to retrieve column vector views of the elements or keys/values of the map/array. 

## How was this patch tested?

Supporting the new APIs in some of the java testing infrastructure was non-trivial, so this PR also moves some tests from TestDeltaTableReads to DeltaTableReadsSuite and adds the complex types to the scala testing infrastructure.

Existing tests have been modified for this PR as well as a few additional checks/tests added.
Kimahriman pushed a commit to Kimahriman/delta that referenced this pull request Oct 3, 2023
…Row (delta-io#2087)

#### Which Delta project/connector is this regarding?
<!--
Please add the component selected below to the beginning of the pull request title
For example: [Spark] Title of my pull request
-->

- [ ] Spark
- [ ] Standalone
- [ ] Flink
- [x] Kernel
- [ ] Other (fill in here)

## Description

BASED OFF OF delta-io#2069; to review changes in this PR select the last 4 commits ([here](https://github.com/delta-io/delta/pull/2087/files/a5073b4c14496314bae1ca97163f9e7bbc67bc6d..d160915717d683ebfd6087a93ccb224a57216d82))

Refactors the ColumnVector and Row `getArray` and `getMap` APIs to return wrapper objects. These wrappers provide APIs to retrieve column vector views of the elements or keys/values of the map/array. 

## How was this patch tested?

Supporting the new APIs in some of the java testing infrastructure was non-trivial, so this PR also moves some tests from TestDeltaTableReads to DeltaTableReadsSuite and adds the complex types to the scala testing infrastructure.

Existing tests have been modified for this PR as well as a few additional checks/tests added.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants