
[Spark] Support external DSV2 catalog in Vacuum command #2039

Closed
gengliangwang wants to merge 9 commits

Conversation


@gengliangwang (Contributor) commented on Sep 11, 2023

Which Delta project/connector is this regarding?

  • [x] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Description

Support external DSV2 catalogs in the VACUUM command. After this change, VACUUM supports tables from an external DSV2 catalog.
For example, with

spark.sql.catalog.customer_catalog=org.apache.spark.sql.CustomerCatalog

we can run

SET CATALOG customer_catalog;
VACUUM t1

Or simply

VACUUM customer_catalog.default.t1

This PR also introduces a new analyzer rule, ResolveDeltaPathTable, so that external DSV2 catalogs do not need to implement resolution of Delta file-path tables themselves.
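
For illustration only (this example is not part of the PR text): with the new rule in place, a path-based Delta table can be vacuumed even while an external catalog is the current catalog. The path below is a placeholder.

    // Hypothetical usage sketch; /tmp/delta/t1 is a placeholder path.
    spark.sql("SET CATALOG customer_catalog")
    // The delta.`<path>` identifier is handled by the new ResolveDeltaPathTable rule,
    // so customer_catalog does not have to understand file-path tables itself.
    spark.sql("VACUUM delta.`/tmp/delta/t1`")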

How was this patch tested?

  1. new end-to-end tests
  2. new parser test cases

Does this PR introduce any user-facing changes?

Yes, users can run the VACUUM command on tables in their external DSV2 catalogs.


@ryan-johnson-databricks (Collaborator) left a comment:


> Support external DSV2 catalog in Vacuum command. After the changes, the restore command supports tables from external DSV2 catalog.

"... the vacuum command supports... " ?

    } else {
      throw DeltaErrors.missingTableIdentifierException("VACUUM")
    }
    val pathToVacuum = getDeltaTable(child, "VACUUM").path
    val baseDeltaPath = DeltaTableUtils.findDeltaTableRoot(sparkSession, pathToVacuum)
@ryan-johnson-databricks (Collaborator):


I don't think we need to find the table root any more with this change? If the child pointed to a subdirectory of the table, I would have expected an AnalysisException before now: query resolution would not have been able to turn UnresolvedDeltaPathOrIdentifier into a Delta table (because no _delta_log directory is present in the expected location).

If we really need to support triggering VACUUM for a table by pointing at any subdirectory of that table (as the current code does), then we'd have to somehow delay the table resolution until this point so we can findDeltaTableRoot. But allowing users to specify subdirectories, as if they were the table itself, seems more like a bug than a feature, to be honest.

And actually, L60 below seems to corroborate that subdirectories aren't supported that way, because it blows up if the found root path mismatches the given table path?

@gengliangwang (Contributor, Author):


Thanks for the suggestion. I removed the baseDeltaPath check.

@ryan-johnson-databricks (Collaborator):


Hmm, we should probably double-check the existing behavior first: if vacuuming a subdirectory was supported before, and our changes here would block it, then that's a breaking change and we need to proceed very carefully. I think it hinges on this code:

    val baseDeltaPath = DeltaTableUtils.findDeltaTableRoot(sparkSession, pathToVacuum)
    if (baseDeltaPath.isDefined) {
      if (baseDeltaPath.get != pathToVacuum) {
        throw DeltaErrors.vacuumBasePathMissingException(baseDeltaPath.get)
      }
    }

If I'm not mistaken, it requires the given path to be the actual table path, which means the proposed change is not a breaking change. Even if findDeltaTableRoot were to find the table, starting from a subdirectory, the result would fail the equality check immediately after.

CC @tdas @zsxwing

    case class ResolveDeltaPathTable(sparkSession: SparkSession) extends Rule[LogicalPlan] {

      private def maybeSQLFile(u: UnresolvedTable): Boolean = {
        sparkSession.sessionState.conf.runSQLonFile && u.multipartIdentifier.size == 2
@ryan-johnson-databricks (Collaborator):


This seems to copy ideas from ResolveSQLOnFile in Spark? Is there a reason we can't leverage that here, and let DeltaDataSource.getTable produce the DeltaTableV2 we need?

@ryan-johnson-databricks (Collaborator):


Ugh, UnresolvedTable != UnresolvedRelation, and it looks like the data source code uses UnresolvedRelation while UnresolvedPathBasedDeltaTable uses UnresolvedTable.

@gengliangwang (Contributor, Author):


  1. In the parser, this PR uses UnresolvedDeltaPathOrIdentifier, which produces UnresolvedTable for table identifiers (including the file-path table form delta.`path`).
  2. If we create UnresolvedRelation as the child of VacuumTableCommand, the relation resolved by Apache Spark is a Parquet data source relation. There is some issue with my debugger and I haven't figured out why.
  3. In the analyzer rule ResolveRelations, both UnresolvedTable and UnresolvedRelation are processed. UnresolvedTable always results in ResolvedTable, while UnresolvedRelation results in SubqueryAlias with various nodes. I think using UnresolvedTable is simpler here (see the sketch after this list). Any reason why we should use UnresolvedRelation?
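
A rough sketch of what this buys us, for illustration only (this is not the PR's exact getDeltaTable helper, and the error handling is a placeholder): once ResolveRelations has turned the command's child into a ResolvedTable, the command can simply unwrap the DeltaTableV2 it carries.

    // Sketch only. Assumes `child` is the command's child plan, already resolved,
    // and that ResolvedTable has the four-field shape of recent Spark versions.
    val deltaTable: DeltaTableV2 = child match {
      case ResolvedTable(_, _, d: DeltaTableV2, _) => d
      // Placeholder error handling; the real command reports errors via DeltaErrors helpers.
      case other => throw new IllegalStateException(s"Not a Delta table: ${other.nodeName}")
    }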

@ryan-johnson-databricks (Collaborator):


> I think using UnresolvedTable is simpler here. Any reason why we should use UnresolvedRelation?

Yeah, UnresolvedRelation only makes sense if it allows us to reuse existing machinery in some way. But:

> resolved relation from Apache Spark will be a Parquet data source relation

That's... awkward. Though I've noticed that the file index for Delta is a Parquet source, because that's the physical file format Delta reads. Is there no trace of Delta in the resulting scan node, though?

Comment on lines 48 to 49
    val deltaLog = DeltaLog.forTable(sparkSession, pathToVacuum)
    if (!deltaLog.tableExists) {
@ryan-johnson-databricks (Collaborator):


Now that we no longer have to search subdirectories, we don't need the DeltaLog.forTable call any more:

    val deltaTable = getDeltaTable(child, "VACUUM")
    if (!deltaTable.tableExists) {
      throw DeltaErrors.notADeltaTableException(
        "VACUUM",
        DeltaTableIdentifier(path = Some(deltaTable.path.toString)))
    }
    VacuumCommand.gc(sparkSession, deltaTable.deltaLog, dryRun, horizonHours).collect()

@gengliangwang (Contributor, Author):


Thanks, updated.

@gengliangwang (Contributor, Author):


This failed the following test:

  test("vacuum for a partition path") {
    withEnvironment { (tempDir, _) =>
      import testImplicits._
      val path = tempDir.getCanonicalPath
      Seq((1, "a"), (2, "b")).toDF("v1", "v2")
        .write
        .format("delta")
        .partitionBy("v2")
        .save(path)

      val ex = intercept[AnalysisException] {
        sql(s"vacuum '$path/v2=a' retain 0 hours")
      }
      assert(ex.getMessage.contains(
        s"`$path/v2=a` is not a Delta table. VACUUM is only supported for Delta tables."))
    }
  }

There is no AnalysisException thrown.

@gengliangwang (Contributor, Author):


I brought the code for checking deltaLog back.

@gengliangwang (Contributor, Author) commented on Sep 12, 2023:


This should be resolved in the latest code.

@@ -498,8 +498,7 @@ class DeltaVacuumSuite
     val e = intercept[AnalysisException] {
       vacuumSQLTest(tablePath, viewName)
     }
-    assert(e.getMessage.contains("not found") ||
-      e.getMessage.contains("TABLE_OR_VIEW_NOT_FOUND"))
+    assert(e.getMessage.contains("v is a temp view. 'VACUUM' expects a table."))
@gengliangwang (Contributor, Author):


The error message here is improved.

@ryan-johnson-databricks (Collaborator):


Shouldn't we be checking for an error class, rather than specific strings?

@gengliangwang (Contributor, Author):


  def expectTableNotViewError(
      nameParts: Seq[String],
      isTemp: Boolean,
      cmd: String,
      mismatchHint: Option[String],
      t: TreeNode[_]): Throwable = {
    val viewStr = if (isTemp) "temp view" else "view"
    val hintStr = mismatchHint.map(" " + _).getOrElse("")
    new AnalysisException(
      errorClass = "_LEGACY_ERROR_TEMP_1013",
      messageParameters = Map(
        "nameParts" -> nameParts.quoted,
        "viewStr" -> viewStr,
        "cmd" -> cmd,
        "hintStr" -> hintStr),
      origin = t.origin)
  }

The error class from Spark is a temporary one (_LEGACY_ERROR_TEMP_1013) and it won't be displayed. We can check the error class once Spark assigns it a dedicated name.
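
For reference, once a stable name exists, the assertion could key off the error class instead of the message text. A sketch, with a placeholder class name:

    // Sketch only: "EXPECT_TABLE_NOT_VIEW" is a placeholder for whatever stable name
    // Spark eventually assigns to _LEGACY_ERROR_TEMP_1013.
    val e = intercept[AnalysisException] {
      vacuumSQLTest(tablePath, viewName)
    }
    assert(e.getErrorClass == "EXPECT_TABLE_NOT_VIEW")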

@gengliangwang (Contributor, Author) commented:

The test failure is in Flink and should not be related to these code changes.
