[SPARK] Managed Commit support for cold and hot snapshot update #2755
Conversation
Nice start!
I especially like that the logic backing `getSnapshotAtInit` and `updateInternal` has been unified. The redundancy was annoying.
// If the commit store has changed, we need to recursively invoke updateSnapshot so that we
// could get the latest commits from the new commit store.
while (newSnapshot.version >= 0 && newSnapshot.commitStoreOpt != commitStoreUsed) {
For my understanding -- it seems like we need this loop for (at least) three reasons?
- On cold read, we have no way to guess the table's commit owner. But if the table has a commit owner, we can learn that from the snapshot we created. This seems like a non-trivial overhead for cold reads of managed-commit tables?
- On warm read, it's possible the table's commit owner changed since we last updated the snapshot. If so, the first update attempt will only get the last commit by the older commit owner (which points to the new commit owner). Thus, we need to contact the new commit owner if we want the latest snapshot. This should be an exceedingly rare scenario, because we don't expect O(1) commit owner changes during the entire lifetime of any one table?
- The commit owner could technically change even while the loop is running, tho this should be a vanishingly rare scenario.
Thus, the loop ensures we always get the latest snapshot (not merely the latest backfilled snapshot, or the latest snapshot some older commit owner knew about) -- even if there were multiple commit ownership changes after the last backfilled commit?
We don't actually need to capture case 3/, because the ownership change commits must have arrived after our snapshot update process started, and we still have a linearizable result even if we don't see them.
As for cases 1/ and 2/, I believe we could change the `while` to `if` IFF the managed commit spec requires atomic backfill of all commits that change ownership. Currently the RFC requires atomic backfill for both FS -> owned and owned -> FS cases, and doesn't say anything about transferring a table from one owner to another. Seems like we should update the spec to either forbid such direct ownership changes, or else define whether they require backfill or not?
(all of this is probably academic, tho -- it seems highly unlikely a table could change ownership a second time before the first ownership change backfills)
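To make the loop's behavior in cases 1 and 2 concrete, here is a minimal, self-contained sketch of the re-listing loop described above. `Snapshot`, `createSnapshot`, and the `table` map are hypothetical stand-ins, not the real `SnapshotManagement` API; the map models what snapshot each commit store (or the filesystem, keyed by `None`) would return when asked.

```scala
// Hypothetical model: a snapshot knows its version and which commit store
// (if any) owns the table, mirroring newSnapshot.commitStoreOpt above.
case class Snapshot(version: Long, commitStoreOpt: Option[String])

// Stand-in for building a snapshot after listing via a given commit store.
def createSnapshot(
    table: Map[Option[String], Snapshot],
    askedStore: Option[String]): Snapshot =
  table(askedStore)

// Keep re-listing until the snapshot we built was produced by the same
// commit store that the snapshot itself names as its owner. This covers:
//  - cold reads (we start with no known owner), and
//  - warm reads where the owner changed since the last update.
def updateUntilOwnerStable(table: Map[Option[String], Snapshot]): Snapshot = {
  var commitStoreUsed: Option[String] = None // cold read: owner unknown
  var snapshot = createSnapshot(table, commitStoreUsed)
  while (snapshot.version >= 0 && snapshot.commitStoreOpt != commitStoreUsed) {
    commitStoreUsed = snapshot.commitStoreOpt
    snapshot = createSnapshot(table, commitStoreUsed)
  }
  snapshot
}
```

On a cold read of a managed-commit table, the filesystem listing only reveals backfilled commits plus the owner; the second iteration then asks that owner for the true latest snapshot, which is exactly the extra round trip the comment calls a non-trivial overhead.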
      isAsync: Boolean): Snapshot = {
    segmentOpt.map { segment =>
-     if (segment == previousSnapshot.logSegment) {
+     if (previousSnapshotOpt.exists(_.logSegment == segment)) {
        // If no changes were detected, just refresh the timestamp
Stale comment? The timestamp manipulation code here was deleted?
    }
    currentSnapshot.snapshot
  }

  /** Replace the given snapshot with the provided one. */
Stale comment, now that we added the keep-same-snapshot semantics?
// There are two table owners: CS1 and CS2 both of which are pointing to same underlying
// in-memory implementation.
// 1. Make 3 commits on the table with CS1 as owner.
// 2. Modify the content of commit-2. Change the table owner from 1 to 2 as part of it.
AFAIK, the RFC doesn't say anything about transferring a table between commit owners?
We should add a section to the RFC about transferring commit ownership.
The RFC already talks about FS -> MC and MC -> FS. The two could be combined to change owners, i.e. MC1 -> FS -> MC2.
This is better than directly allowing MC1 -> MC2, as that would require us to define another protocol which the owners have to follow.
After a bunch of discussion offline, it turns out direct ownership transfers are really messy, and hard to do securely, because they require two commit owners to coordinate in some ad-hoc way. At this point it looks better to keep the RFC spec as-is:
- If the Delta client wants to propose an owner for a FS table, they must send a commit request to the proposed owner; if the owner agrees to take over the table, they arrange for a direct-backfilled commit that makes the owner change.
- If the Delta client wants to remove the owner of a table (making it FS based), they must send one last commit request to the current owner; if the owner agrees to disown the table, they arrange for a direct-backfilled commit that makes the owner change.
The direct-backfilled commit may be done client-side by the owner's commit store, or server-side by the owner itself (that's up to the owner to decide).
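The transitions the spec ends up allowing can be summarized in a small sketch. The `Owner` type and `transitionAllowed` function are hypothetical illustrations of the rule discussed above (FS <-> managed-commit transitions only, no direct owner-to-owner transfer), not anything in the Delta codebase:

```scala
// Hypothetical model of table ownership states.
sealed trait Owner
case object FS extends Owner                 // filesystem-based table
case class MC(name: String) extends Owner    // managed-commit owner

// Only FS <-> managed-commit transitions are legal; a direct MC1 -> MC2
// transfer is not, so changing owners must go MC1 -> FS -> MC2.
def transitionAllowed(from: Owner, to: Owner): Boolean = (from, to) match {
  case (FS, MC(_))      => true  // propose an owner for a FS table
  case (MC(_), FS)      => true  // owner agrees to disown the table
  case (a, b) if a == b => true  // no ownership change
  case _                => false // direct MC1 -> MC2 is forbidden
}
```

This keeps each ownership change a two-party agreement (client and one owner), avoiding any ad-hoc owner-to-owner coordination protocol.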
If we are in agreement about the above, we should update this test (and possibly others as well) to follow spec?
// 4. Write commit 3/4 using new commit owner.
// 5. Read the table again and make sure right APIs are called:
//    a) If read query is run in scala, we do listing 2 times. So CS2.getCommits will be called
// twice. We should not be contacting CS1 anymore. |
I guess we expect backfill somewhere along the way? But when does that occur, and wouldn't commit-2 and its backfilled counterpart disagree after we modify?
Update: Actually, batch size 10 means we don't backfill anything, and instead rely on CS2 to know about commits performed by CS1, and also rely on warm start snapshot update to start directly from CS2?
> Actually, batch size 10 means we don't backfill anything, and instead rely on CS2 to know about commits performed by CS1, and also rely on warm start snapshot update to start directly from CS2

Yes - this is correct. Even if we add a constraint in the RFC that commits must be backfilled on ownership change, our delta-spark code could still recursively update snapshots across different owners even when the ownership-changing commits are not backfilled. This test case is testing exactly that.
> Even if we add a constraint in the RFC that commits must be backfilled on ownership change, our delta-spark code could still recursively update snapshots across different owners even when the ownership-changing commits are not backfilled. This test case is testing exactly that.

Why would we need to validate that scenario, once the spec forbids it?
That scenario has been removed/simplified now to just test what spec allows.
Seq(2).toDF.write.format("delta").mode("append").save(tablePath) // version 2
DeltaLog.clearCache()
checkAnswer(sql(s"SELECT * FROM delta.`$tablePath`"), Seq(Row(0), Row(1), Row(2)))
def deltaLog(): DeltaLog = DeltaLog.forTable(spark, tablePath) |
I only see one call to this method (L350 below), and also see code reusing `log`. Maybe this method isn't needed?
// and create a snapshot out of it. Then it will contact cs2 and fail. So deltaLog.update()
// won't succeed and throw exception. But underlying DeltaLog object now has reference to v3.
// The recorded timestamp for this must be commit timestamp and not clock timestamp.
How could the `DeltaLog` instance have a reference to v3, when `getUpdatedSnapshot` only installs the new snapshot after the commit-owner-changed loop exits?
The description here is out-of-sync from the actual logic. Fixing this.
    .commitImpl(logStore, hadoopConf, logPath, commitVersion, commitFile, commitTimestamp)
}

override protected[delta] def registerBackfill(
nit: we could simplify the override since this class is anyway scoped to this specific test case?
- override protected[delta] def registerBackfill(
+ override def registerBackfill(
)
// If the commit store has changed, we need to recursively invoke updateSnapshot so that we
// could get the latest commits from the new commit store.
while (newSnapshot.version >= 0 && newSnapshot.commitStoreOpt != commitStoreUsed) {
What commit owner does the initial snapshot use? If it's the default commit owner, attempting an `update` would fail because the default owner doesn't know about this (not yet existing) table? Seems like we need to force the commit owner to None for the initial snapshot, and let commit 0 install a commit owner if it wishes?
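The suggestion above can be sketched minimally. `Snapshot`, `initialSnapshot`, and `applyCommit` are hypothetical stand-ins (not the real `InitialSnapshot` machinery): the pre-creation snapshot carries no commit owner, and commit 0 may install one that later snapshots inherit.

```scala
// Hypothetical model: version -1 is the "table does not exist yet" snapshot.
case class Snapshot(version: Long, commitStoreOpt: Option[String])

// The initial snapshot never names a commit owner, so no owner is contacted
// for a table that doesn't exist yet.
def initialSnapshot(): Snapshot = Snapshot(-1, None)

// A commit may install a new owner via its metadata; otherwise the previous
// owner (possibly None, i.e. filesystem-based) carries forward.
def applyCommit(prev: Snapshot, newOwner: Option[String]): Snapshot =
  Snapshot(prev.version + 1, newOwner.orElse(prev.commitStoreOpt))
```

Under this model, commit 0 is the first chance to set an owner, so an `update` against the initial snapshot never consults any commit store.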
Looks great. Just needs some unit tests cleanup and we should be good to merge!
@@ -231,11 +217,11 @@ trait SnapshotManagement { self: DeltaLog =>
   * @return Some LogSegment to build a Snapshot if files do exist after the given
   *         startCheckpoint. None, if the directory was missing or empty.
   */
- protected def getLogSegmentForVersion(
+ protected def createLogSegment(
aside: This is a welcome change... the old name never really did make much sense.
@@ -253,6 +239,12 @@ trait SnapshotManagement { self: DeltaLog =>
    )
  }

+ private def createLogSegment(previousSnapshot: Snapshot): Option[LogSegment] = {
This is just a convenience overload to create a new log segment, using the previous snapshot as starting point, right? Otherwise nothing special?
(maybe worth a quick doc comment)
@@ -498,30 +490,20 @@ trait SnapshotManagement { self: DeltaLog =>
   * file as a hint on where to start listing the transaction log directory. If the _delta_log
   * directory doesn't exist, this method will return an `InitialSnapshot`.
   */
- protected def getSnapshotAtInit: CapturedSnapshot = {
+ protected def getSnapshotAtInit: CapturedSnapshot = withSnapshotLockInterruptibly {
To make sure I'm understanding the new flow correctly:
- At construction time, `getSnapshotAtInit`:
  - calls `createLogSegment` with the last checkpoint (if any) as the starting point
  - passes the resulting log segment to `getUpdatedSnapshot` (which handles commit owner changes)
  - directly returns the resulting `CapturedSnapshot` so it can be assigned
- At update time, `updateInternal`:
  - calls `createLogSegment` with the current snapshot as the starting point
  - passes the resulting log segment to `getUpdatedSnapshot` (which handles commit owner changes)
  - calls `installSnapshot` to update or replace the captured snapshot as needed

Nice. Much cleaner than the previous flow, which was creating log segments in different ways for the init vs. update paths. Good-bye, `createSnapshotAtInitInternal`!
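The two unified paths can be condensed into a toy sketch. All names here (`LogSegment`, `createLogSegment`, `getUpdatedSnapshot`, the version arithmetic) are hypothetical stand-ins that only illustrate the shared shape, not the real `SnapshotManagement` members:

```scala
// Hypothetical stand-ins for the real log machinery.
case class LogSegment(version: Long)
case class Snapshot(segment: LogSegment)

// Build a segment starting from some hint version (checkpoint or current
// snapshot); here modeled as simply listing one version past the hint.
def createLogSegment(startVersionHint: Long): LogSegment =
  LogSegment(startVersionHint + 1)

// Shared back half of both paths (handles commit owner changes in the
// real code).
def getUpdatedSnapshot(segment: LogSegment): Snapshot = Snapshot(segment)

// Init path: start from the last checkpoint, or -1 when there is none.
def getSnapshotAtInit(lastCheckpointVersion: Long): Snapshot =
  getUpdatedSnapshot(createLogSegment(lastCheckpointVersion))

// Update path: start from the current snapshot instead.
def updateInternal(current: Snapshot): Snapshot =
  getUpdatedSnapshot(createLogSegment(current.segment.version))
```

The point of the refactor shows up directly: both entry points differ only in where the starting hint comes from, and everything after segment creation is shared.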
nicely summarized.
Which Delta project/connector is this regarding?
Description
This PR adds support for cold and hot snapshot update for managed-commits.
How was this patch tested?
UTs
Does this PR introduce any user-facing changes?
No