[Kernel] Assign base row ID to AddFile actions #3894
…Errors.java Co-authored-by: Johan Lasperas <[email protected]>
…uring conflict resolution
…actDomainMetadataMap to fillDomainMetadataMap.
}
}

test("Fail to assign base row IDs to AddFile actions w/o stats") {
More tests will be added.
RowIDAssignmentResult rowIDAssignmentResult =
    RowTracking.assignBaseRowId(protocol, readSnapshot, FULL_SCHEMA, dataActions);
dataActions = rowIDAssignmentResult.getDataActions();
domainMetadataIter =
    domainMetadataIter.combine(rowIDAssignmentResult.getDomainMetadatasIter());
It is done this way because I remember we discussed keeping dataActions only for data-related actions (AddFile, RemoveFile). If it's acceptable to include DomainMetadata actions in dataActions, we could simplify this by appending any DomainMetadata actions directly to dataActions.
// Contains domain metadata actions known prior to iterating and writing the data actions
private final List<DomainMetadata> domainMetadatas = new ArrayList<>();
// Contains domain metadata actions generated on the fly while writing the data actions
private CloseableIterator<DomainMetadata> domainMetadataIter =
    toCloseableIterator(Collections.emptyIterator());
I will probably change this. Currently the list is only used in tests, and the iterator is only used for row tracking's domain metadata. I'll look for more use cases of domain metadata in Delta-Spark to see if there is a better way of managing DomainMetadata actions in a transaction.
54b77cc to 75c0005
There are a few things missing from the row tracking spec, see my comment
import java.util.Optional;

/** A collection of helper methods for working with row tracking. */
public class RowTracking {
Check the specification of row tracking to ensure the implementation respects it: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#row-tracking
We want to implement all MUST requirements for writers so that we can support the feature. The SHOULD requirements (preserving row IDs and row commit versions by materializing them in the data files) won't be addressed here.
In particular:
- We also need to populate the defaultRowCommitVersion in add/remove actions.
- We have to ensure that the DomainMetadata feature is supported whenever RowTracking is supported. As a safeguard, we may also want a check that forbids writing DomainMetadata actions if the feature isn't supported.
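To make the writer-side bookkeeping concrete, here is a rough sketch of how baseRowId, defaultRowCommitVersion and the high watermark could be filled in together. It is not the Kernel implementation: FileSketch, Result and assign are hypothetical names, numRecords is assumed to be known for every file, the commit version is assumed to be known at assignment time, and each file is assumed to take the next contiguous range of row IDs above the current high watermark.

```java
import java.util.ArrayList;
import java.util.List;

final class RowIdAssignmentSketch {
  /** Hypothetical, simplified stand-in for an AddFile action; not the Kernel class. */
  static final class FileSketch {
    final String path;
    final long numRecords;
    final long baseRowId;               // -1 means "not assigned" in this sketch only
    final long defaultRowCommitVersion; // -1 means "not assigned" in this sketch only

    FileSketch(String path, long numRecords, long baseRowId, long defaultRowCommitVersion) {
      this.path = path;
      this.numRecords = numRecords;
      this.baseRowId = baseRowId;
      this.defaultRowCommitVersion = defaultRowCommitVersion;
    }
  }

  /** Hypothetical result holder: the rewritten files plus the new high watermark. */
  static final class Result {
    final List<FileSketch> files;
    final long rowIdHighWatermark;

    Result(List<FileSketch> files, long rowIdHighWatermark) {
      this.files = files;
      this.rowIdHighWatermark = rowIdHighWatermark;
    }
  }

  static Result assign(List<FileSketch> files, long prevHighWatermark, long commitVersion) {
    long highWatermark = prevHighWatermark;
    List<FileSketch> out = new ArrayList<>();
    for (FileSketch f : files) {
      // Assumption: the file's rows take IDs baseRowId .. baseRowId + numRecords - 1.
      long baseRowId = highWatermark + 1;
      out.add(new FileSketch(f.path, f.numRecords, baseRowId, commitVersion));
      // The watermark ends up at the highest row ID handed out so far.
      highWatermark += f.numRecords;
    }
    return new Result(out, highWatermark);
  }
}
```

For example, assuming a starting watermark of -1 for a table with no row IDs yet, files with 3 and 5 records would get base row IDs 0 and 3, and the resulting watermark would be 7.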

// This one-element array is used to keep track of the current high watermark as we iterate
// through the data actions
final long[] currRowIdHighWatermark = {prevRowIdHighWatermark};
Why not just a long?
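One possible reason, assuming the watermark is updated inside a lambda or anonymous class (the surrounding code is not shown here): Java only lets lambdas capture effectively final locals, so a plain long cannot be reassigned from inside one. The sketch below contrasts the two options; WatermarkCaptureSketch and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class WatermarkCaptureSketch {
  // Lambda version: the array reference is final, its single element is mutable.
  // Replacing the array with a local `long curr` would not compile, because the
  // lambda body reassigns it. The stream is sequential, so the order is preserved.
  static List<Long> baseRowIdsWithLambda(List<Long> numRecordsPerFile, long prevHighWatermark) {
    final long[] curr = {prevHighWatermark};
    return numRecordsPerFile.stream()
        .map(n -> {
          long base = curr[0] + 1;
          curr[0] += n;
          return base;
        })
        .collect(Collectors.toList());
  }

  // Loop version: no capture is involved, so a plain long is enough.
  static List<Long> baseRowIdsWithLoop(List<Long> numRecordsPerFile, long prevHighWatermark) {
    long curr = prevHighWatermark;
    List<Long> out = new ArrayList<>();
    for (long n : numRecordsPerFile) {
      out.add(curr + 1);
      curr += n;
    }
    return out;
  }
}
```

If the assignment can be written as a plain loop, a long works; if it has to stay lazy inside an iterator's map function, the array (or a small mutable holder) is the usual workaround.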
return new KernelException(
    "Cannot assign baseRowId to add action. "
        + "The number of records in this data file is missing.");
}
As we discussed, this can be an issue: if connectors don't populate numRecords stats in the AddFile actions that are committed, the commit will fail if row tracking is supported. (Note that this is still better than today, where we always fail in that case since we don't support row tracking.)
Question more for kernel folks: do we have some guarantee or requirement that connectors populate numRecords? Are connectors that implement writes today (if any) populating numRecords?
In any case, I would word the exception so that it puts the burden more on the connector, for example:
"All add actions must have statistics that include the number of records when writing to a Delta table with the RowTracking table feature enabled."
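A minimal sketch of what the check could look like with that wording, assuming the record count is exposed as an Optional<Long>. NumRecordsCheckSketch and requireNumRecords are hypothetical names, and IllegalStateException stands in for KernelException so the snippet stays self-contained.

```java
import java.util.Optional;

final class NumRecordsCheckSketch {
  static final String MISSING_NUM_RECORDS_MESSAGE =
      "All add actions must have statistics that include the number of records "
          + "when writing to a Delta table with the RowTracking table feature enabled.";

  // Returns the record count, or fails with a message aimed at the connector.
  static long requireNumRecords(Optional<Long> numRecords) {
    return numRecords.orElseThrow(
        () -> new IllegalStateException(MISSING_NUM_RECORDS_MESSAGE));
  }
}
```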
Which Delta project/connector is this regarding?
Description
This PR builds on the base changes which are not yet merged. For changes specific to this PR, please refer to the last commit only.
This PR implements the first part of row tracking support in Delta Kernel, based on the Delta Protocol. Specifically, it includes the following changes:
- Add baseRowId field to AddFile action
- Assign baseRowId to AddFile actions prior to committing them
- Update the rowIdHighWaterMark of the delta.rowTracking metadata domain during the base row ID assignment, which is the highest assigned fresh row ID for the table
How was this patch tested?
Added tests in RowTrackingSuite.scala.
Does this PR introduce any user-facing changes?
No.