[Kernel] Assign base row ID to AddFile actions #3894

Open
wants to merge 37 commits into
base: master

Conversation

qiyuandong-db

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

This PR builds on the base changes which are not yet merged. For changes specific to this PR, please refer to the last commit only.

This PR implements the first part of row tracking support in Delta Kernel, based on the Delta Protocol. Specifically, it includes the following changes:

  • add a new baseRowId field to AddFile action
  • implement functionality to assign baseRowId to AddFile actions prior to committing them
  • maintain the rowIdHighWaterMark of the delta.rowTracking metadata domain (the highest fresh row ID assigned so far for the table) while assigning base row IDs
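
The assignment described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SketchAddFile`, `RowIdAssignmentSketch`, and `Result` are hypothetical names, and the sketch assumes that fresh IDs are handed out contiguously starting right after the previous high watermark, that a table with no watermark yet is modelled as -1, and that files which already carry a baseRowId are left untouched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalLong;

// Hypothetical, simplified stand-in for an AddFile action (not the Kernel class).
final class SketchAddFile {
    final String path;
    final long numRecords;
    final OptionalLong baseRowId;

    SketchAddFile(String path, long numRecords, OptionalLong baseRowId) {
        this.path = path;
        this.numRecords = numRecords;
        this.baseRowId = baseRowId;
    }

    // Returns a copy of this file with the given base row ID set.
    SketchAddFile withBaseRowId(long id) {
        return new SketchAddFile(path, numRecords, OptionalLong.of(id));
    }
}

final class RowIdAssignmentSketch {
    // Holds the updated files together with the new high watermark.
    static final class Result {
        final List<SketchAddFile> files;
        final long highWaterMark;

        Result(List<SketchAddFile> files, long highWaterMark) {
            this.files = files;
            this.highWaterMark = highWaterMark;
        }
    }

    // Assumption: prevHighWaterMark is -1 when no row ID has been assigned yet.
    static Result assign(List<SketchAddFile> files, long prevHighWaterMark) {
        long highWaterMark = prevHighWaterMark;
        List<SketchAddFile> out = new ArrayList<>();
        for (SketchAddFile f : files) {
            if (f.baseRowId.isPresent()) {
                out.add(f); // already assigned, keep as-is
                continue;
            }
            out.add(f.withBaseRowId(highWaterMark + 1)); // first fresh ID
            highWaterMark += f.numRecords; // last ID covered by this file
        }
        return new Result(out, highWaterMark);
    }
}
```

Under these assumptions, two new files with 10 and 5 records written to a table with no previous watermark would receive base row IDs 0 and 10, and the watermark to record in the delta.rowTracking domain would be 14.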

How was this patch tested?

Added tests in RowTrackingSuite.scala.

Does this PR introduce any user-facing changes?

No.

qiyuandong-db and others added 30 commits October 31, 2024 14:40
Co-authored-by: Johan Lasperas <[email protected]>
…actDomainMetadataMap to fillDomainMetadataMap.
}
}

test("Fail to assign base row IDs to AddFile actions w/o stats") {
Author

More tests will be added.

Comment on lines +159 to +163
RowIDAssignmentResult rowIDAssignmentResult =
    RowTracking.assignBaseRowId(protocol, readSnapshot, FULL_SCHEMA, dataActions);
dataActions = rowIDAssignmentResult.getDataActions();
domainMetadataIter =
    domainMetadataIter.combine(rowIDAssignmentResult.getDomainMetadatasIter());
Author

It is done this way because, as I recall from our discussion, we want to keep dataActions for data-related actions only (AddFile, RemoveFile). If it's acceptable to include DomainMetadata actions in dataActions, we could simplify this by appending any DomainMetadata actions directly to dataActions.

Comment on lines +77 to +81
// Contains domain metadata actions known prior to iterating and writing the data actions
private final List<DomainMetadata> domainMetadatas = new ArrayList<>();
// Contains domain metadata actions generated on the fly while writing the data actions
private CloseableIterator<DomainMetadata> domainMetadataIter =
    toCloseableIterator(Collections.emptyIterator());
Author

I will probably change this. Currently the list is only used in tests, and the iterator is only used for row tracking's domain metadata. I'll look for more use cases of domain metadata in Delta-Spark to see whether there is a better way to manage DomainMetadata actions in a transaction.

Collaborator

@johanl-db johanl-db left a comment

A few things required by the row tracking spec are missing; see my comment.

import java.util.Optional;

/** A collection of helper methods for working with row tracking. */
public class RowTracking {
Collaborator

Check the specification of row tracking to ensure the implementation respects it: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#row-tracking

We want to implement all MUST requirements for writers so that we can support the feature. The SHOULD requirements (preserving row IDs / row commit versions by materializing them in the data files) won't be addressed here.

In particular:

  • We also need to populate the defaultRowCommitVersion in add/remove actions
  • We have to ensure that the DomainMetadata feature is supported whenever RowTracking is supported - we may want to have a check that also forbids writing DomainMetadata actions if the feature isn't supported as a safeguard
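
The safeguard suggested in the second bullet could look roughly like the sketch below. It is only an illustration: `FeatureCheckSketch` and `validate` are hypothetical names, not Kernel APIs, and the sketch assumes the writer features are available as a set of the protocol's feature names (`rowTracking`, `domainMetadata`).

```java
import java.util.Set;

// Hypothetical helper, not part of Delta Kernel.
final class FeatureCheckSketch {
    // writerFeatures: names listed in the table protocol's writerFeatures.
    // numDomainMetadataActions: DomainMetadata actions the commit wants to write.
    static void validate(Set<String> writerFeatures, int numDomainMetadataActions) {
        boolean domainMetadataSupported = writerFeatures.contains("domainMetadata");

        // Row tracking stores its high watermark in a metadata domain,
        // so it cannot be supported without the domainMetadata feature.
        if (writerFeatures.contains("rowTracking") && !domainMetadataSupported) {
            throw new IllegalStateException(
                "rowTracking requires the domainMetadata table feature to be supported.");
        }

        // Safeguard: refuse DomainMetadata actions on tables without the feature.
        if (numDomainMetadataActions > 0 && !domainMetadataSupported) {
            throw new IllegalStateException(
                "Cannot commit DomainMetadata actions: "
                    + "the domainMetadata table feature is not supported.");
        }
    }
}
```

Running such a check once per transaction, before any action is written, would keep the failure mode independent of which code path produced the DomainMetadata actions.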


// This one-element array is used to keep track of the current high watermark as we iterate
// through the data actions
final long[] currRowIdHighWatermark = {prevRowIdHighWatermark};
Collaborator

Why not just a long?
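
For context on this question: the author's answer is not shown in this thread, but one plausible reason (an assumption) is that the watermark is updated inside a lambda or anonymous class while the data actions are iterated. Java only lets lambdas capture effectively final locals, so a plain `long` could not be reassigned there, whereas the contents of a final one-element array can. A minimal sketch with hypothetical names:

```java
import java.util.function.LongUnaryOperator;

// Hypothetical illustration, not code from this PR.
final class WatermarkHolderSketch {
    // Returns {first base row ID, second base row ID, final high watermark}.
    static long[] assignTwo(long prevHighWatermark, long firstCount, long secondCount) {
        // A `long curr = prevHighWatermark;` local could be read by the lambda
        // below but not reassigned; the array reference stays final while its
        // single element remains mutable.
        final long[] curr = {prevHighWatermark};

        LongUnaryOperator nextBaseRowId = numRecords -> {
            long base = curr[0] + 1;
            curr[0] += numRecords;
            return base;
        };

        long first = nextBaseRowId.applyAsLong(firstCount);
        long second = nextBaseRowId.applyAsLong(secondCount);
        return new long[] {first, second, curr[0]};
    }
}
```

If the update happens in an ordinary loop rather than in a lambda, a plain `long` would indeed be enough.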

Comment on lines +307 to +310
  return new KernelException(
      "Cannot assign baseRowId to add action. "
          + "The number of records in this data file is missing.");
}
Collaborator

As we discussed, this can be an issue: if connectors don't populate the numRecords statistic in the AddFile actions they commit, the commit will fail when row tracking is supported. (Note that this is still better than today, where we always fail in that case because we don't support row tracking.)

A question more for the Kernel folks: do we have some guarantee or requirement that connectors populate numRecords? Are connectors that implement writes today (if any) populating numRecords?

In any case, I would word the exception so that it puts the burden more on the connector, for example:
"All add actions must have statistics that include the number of records when writing to a Delta table with the RowTracking table feature enabled."

2 participants