-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Kernel] Refactor the Array and Map return types in ColumnVector and Row #2087
Conversation
kernel/kernel-api/src/main/java/io/delta/kernel/internal/data/ScanStateRow.java
Show resolved
Hide resolved
...defaults/src/test/java/io/delta/kernel/defaults/internal/parquet/TestParquetBatchReader.java
Outdated
Show resolved
Hide resolved
* Get the value at given {@code rowId} from the column vector. The type of the value object | ||
* depends on the data type of the {@code vector}. | ||
*/ | ||
public static Object getValueAsObject(ColumnVector vector, int rowId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved over from kernel/kernel-defaults/src/main/java/io/delta/kernel/defaults/internal/data/vector/VectorUtils.java since it was only being used in test code after these changes
kernel/kernel-defaults/src/main/java/io/delta/kernel/defaults/internal/data/DefaultJsonRow.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Left some comments
@@ -112,11 +110,11 @@ public interface Row { | |||
* Return array value of the column located at the given ordinal. | |||
* Throws error if the column at given ordinal is not of array type, | |||
*/ | |||
<T> List<T> getArray(int ordinal); | |||
ArrayValue getArray(int ordinal); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would getArrayValue
and getMapValue
be better APIs?
Since getMap
makes me think it will return a java.util.Map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I necessarily agree with
getMap makes me think it will return a java.util.Map
But I'm relatively indifferent about changing this if everyone is in agreement. Noting that Spark's method is ColumnVector::getMap --> ColumnarMap
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use getArrayValue
etc. Map
and Array
ubiquitously refer to java.util.Map and the java array type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's followup on this when we do the API review (easy to change in a followup PR)
kernel/kernel-api/src/main/java/io/delta/kernel/internal/ScanBuilderImpl.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/data/ScanStateRow.java
Show resolved
Hide resolved
return configuration; | ||
} | ||
|
||
public String getColumnMappingMode() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm. I don't know if this is a very scalable or future-proof design.
we won't have getters for every other configuration, will we? (I don't think so)
So, what makes this one special?
Instead, we should just lazily convert the configuration MapValue into a Map on the first .getTableProperty(key: String)
and access it that way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added getConfiguration --> Map<String, String>
instead. Let me know if you'd prefer a getTableProperty API instead, but I think we'd have to provide a couple methods
- boolean isTablePropertySet(key)
- String getTableProperty(key)
- String getTableProperty(key, default) <-- optional
kernel/kernel-api/src/main/java/io/delta/kernel/internal/data/ScanStateRow.java
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/utils/ArrayValueWrapper.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/utils/VectorUtils.java
Outdated
Show resolved
Hide resolved
@@ -160,7 +157,17 @@ private static Object decodeElement(JsonNode jsonValue, DataType dataType) { | |||
final Object parsedElement = decodeElement(element, arrayType.getElementType()); | |||
output.add(parsedElement); | |||
} | |||
return output; | |||
return new ArrayValue() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this becomes a common pattern, add a utility method to convert List
to ArrayValue
.
} | ||
|
||
@Override | ||
public <K, V> Map<K, V> getMap(int ordinal) { | ||
return (Map<K, V>) parsedValues[ordinal]; | ||
} | ||
} | ||
|
||
private static class TestColumnVector implements ColumnVector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if there is another alternative to this. Does Spark has similar requirement where they need to expose a list as Vector?
...defaults/src/test/java/io/delta/kernel/defaults/internal/parquet/TestParquetBatchReader.java
Outdated
Show resolved
Hide resolved
@@ -30,7 +30,8 @@ public final class VectorUtils { | |||
private VectorUtils() {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we consolidate the toJavaMap(MapValue)
and toMapValue(java.util.Map
here? Similar for the array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Responding to the (I think) 3 comments related to this here.
We need to expose java list/maps as MapValue/ArrayValue (and thus as ColumnVectors) in a couple of places
In kernelApi:
- (1) JsonHandlerTestImpl
In kernelDefaults:
- (2) DefaultJsonHandlerImpl (in DefaultJsonRow) -- very similar pattern to the first
- (3) Tests (once in ExpressionEvaluatorSuite)
I can add a test utility to do this to cover (1) and (3) but I don't think we want to add this to VectorUtils
as we shouldn't be creating ColumnVectors there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about add utility method for defaults and it can be used in default prod code and test code.
For the JsonHandlerTestImpl
, I am actually thinking of removing it and make it a proper mock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a DefaultGenericVector
and use that in DefaultJsonRow
and in a test utility I added that converts a Seq[Map[..,..]]
to a CV. Can't really combine more than that since in DefaultJsonRow
we can build the keys/values separately and don't start with a Map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rebase the change and will take another pass.
d160915
to
f72d5fb
Compare
mapValue.getValues().getString(i); | ||
} | ||
} | ||
// TODO what is the behavior for element_at if not in the map? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vkorukanti I can update the docs for the function as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't return null. We should return an Optional
.
What if the value for the key is null
? Isn't that perfectly valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Please add the docs. I didn't mention the null
behavior. The behavior should be it should return null
. We have tests that confirm this.
...st/scala/io/delta/kernel/defaults/internal/expressions/DefaultExpressionEvaluatorSuite.scala
Outdated
Show resolved
Hide resolved
@@ -112,11 +110,11 @@ public interface Row { | |||
* Return array value of the column located at the given ordinal. | |||
* Throws error if the column at given ordinal is not of array type, | |||
*/ | |||
<T> List<T> getArray(int ordinal); | |||
ArrayValue getArray(int ordinal); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use getArrayValue
etc. Map
and Array
ubiquitously refer to java.util.Map and the java array type
kernel/kernel-api/src/main/java/io/delta/kernel/internal/actions/Metadata.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/utils/VectorUtils.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/utils/VectorUtils.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/utils/VectorUtils.java
Outdated
Show resolved
Hide resolved
...defaults/src/main/java/io/delta/kernel/defaults/internal/data/vector/DefaultArrayVector.java
Outdated
Show resolved
Hide resolved
...defaults/src/main/java/io/delta/kernel/defaults/internal/data/vector/DefaultArrayVector.java
Show resolved
Hide resolved
mapValue.getValues().getString(i); | ||
} | ||
} | ||
// TODO what is the behavior for element_at if not in the map? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't return null. We should return an Optional
.
What if the value for the key is null
? Isn't that perfectly valid?
...defaults/src/main/java/io/delta/kernel/defaults/internal/expressions/ElementAtEvaluator.java
Show resolved
Hide resolved
@@ -73,7 +74,7 @@ static ColumnVector eval(ColumnVector map, ColumnVector lookupKey) { | |||
// The general pattern is call `isNullAt(rowId)` followed by `getString`. | |||
// So the cache of one value is enough. | |||
private int lastLookupRowId = -1; | |||
private Object lastLookupValue = null; | |||
private String lastLookupValue = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does null
represent? that there was no last lookup value? this is ambiguous, it should be Optional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't represent a null
value if we make it Optional<String>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think lastLookupRowId
reflects whether there was a last lookup value.
kernel/kernel-api/src/main/java/io/delta/kernel/data/ArrayValue.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/data/MapValue.java
Outdated
Show resolved
Hide resolved
|
||
@Override | ||
public void close() { | ||
underlyingVector.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a problem. We shouldn't be closing the underlying vector as it could still be used.
For Java heap-based vectors this doesn't matter, for off-heap-based ones, we need some kind of ref counting.
mapValue.getValues().getString(i); | ||
} | ||
} | ||
// TODO what is the behavior for element_at if not in the map? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Please add the docs. I didn't mention the null
behavior. The behavior should be it should return null
. We have tests that confirm this.
This PR has progressed enough that its not great to change it right now, but it would have been better to split this large PR into 2 PRs one for Array and one for Map |
@tdas Tried to do this originally but it breaks a bunch of the existing tests, and couldn't refactor them to the new framework without making the changes for both types. |
…Row (delta-io#2087) #### Which Delta project/connector is this regarding? <!-- Please add the component selected below to the beginning of the pull request title For example: [Spark] Title of my pull request --> - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description BASED OFF OF delta-io#2069; to review changes in this PR select the last 4 commits ([here](https://github.com/delta-io/delta/pull/2087/files/a5073b4c14496314bae1ca97163f9e7bbc67bc6d..d160915717d683ebfd6087a93ccb224a57216d82)) Refactors the ColumnVector and Row `getArray` and `getMap` APIs to return wrapper objects. These wrappers provide APIs to retrieve column vector views of the elements or keys/values of the map/array. ## How was this patch tested? Supporting the new APIs in some of the java testing infrastructure was non-trivial, so this PR also moves some tests from TestDeltaTableReads to DeltaTableReadsSuite and adds the complex types to the scala testing infrastructure. Existing tests have been modified for this PR as well as a few additional checks/tests added.
…Row (delta-io#2087) #### Which Delta project/connector is this regarding? <!-- Please add the component selected below to the beginning of the pull request title For example: [Spark] Title of my pull request --> - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description BASED OFF OF delta-io#2069; to review changes in this PR select the last 4 commits ([here](https://github.com/delta-io/delta/pull/2087/files/a5073b4c14496314bae1ca97163f9e7bbc67bc6d..d160915717d683ebfd6087a93ccb224a57216d82)) Refactors the ColumnVector and Row `getArray` and `getMap` APIs to return wrapper objects. These wrappers provide APIs to retrieve column vector views of the elements or keys/values of the map/array. ## How was this patch tested? Supporting the new APIs in some of the java testing infrastructure was non-trivial, so this PR also moves some tests from TestDeltaTableReads to DeltaTableReadsSuite and adds the complex types to the scala testing infrastructure. Existing tests have been modified for this PR as well as a few additional checks/tests added.
…Row (delta-io#2087) #### Which Delta project/connector is this regarding? <!-- Please add the component selected below to the beginning of the pull request title For example: [Spark] Title of my pull request --> - [ ] Spark - [ ] Standalone - [ ] Flink - [x] Kernel - [ ] Other (fill in here) ## Description BASED OFF OF delta-io#2069; to review changes in this PR select the last 4 commits ([here](https://github.com/delta-io/delta/pull/2087/files/a5073b4c14496314bae1ca97163f9e7bbc67bc6d..d160915717d683ebfd6087a93ccb224a57216d82)) Refactors the ColumnVector and Row `getArray` and `getMap` APIs to return wrapper objects. These wrappers provide APIs to retrieve column vector views of the elements or keys/values of the map/array. ## How was this patch tested? Supporting the new APIs in some of the java testing infrastructure was non-trivial, so this PR also moves some tests from TestDeltaTableReads to DeltaTableReadsSuite and adds the complex types to the scala testing infrastructure. Existing tests have been modified for this PR as well as a few additional checks/tests added.
Which Delta project/connector is this regarding?
Description
BASED OFF OF #2069; to review changes in this PR select the last 4 commits (here)
Refactors the ColumnVector and Row
getArray
andgetMap
APIs to return wrapper objects. These wrappers provide APIs to retrieve column vector views of the elements or keys/values of the map/array.How was this patch tested?
Supporting the new APIs in some of the java testing infrastructure was non-trivial, so this PR also moves some tests from TestDeltaTableReads to DeltaTableReadsSuite and adds the complex types to the scala testing infrastructure.
Existing tests have been modified for this PR as well as a few additional checks/tests added.