[Spark] Support external DSV2 catalog in Vacuum command #2039
I don't think we need to find the table root any more with this change? If the `child` pointed to a subdirectory of the table, I would have expected an `AnalysisException` before now: query resolution would not have been able to turn `UnresolvedDeltaPathOrIdentifier` into a Delta table (because no `_delta_log` directory is present in the expected location).

If we really need to support triggering VACUUM for a table by pointing at any subdirectory of that table (as the current code does), then we'd have to somehow delay the table resolution until this point so we can `findDeltaTableRoot`. But allowing users to specify subdirectories as if they were the table itself seems more like a bug than a feature, to be honest.

And actually, L60 below seems to corroborate that subdirectories aren't supported that way, because it blows up if the found root path mismatches the given table path?
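The `_delta_log` check being discussed can be modeled outside Spark. This is a hedged, self-contained sketch (the helper name and local-filesystem semantics are my own, not the actual Delta code): a path resolves as a Delta table only when `_delta_log` sits directly under it, so a subdirectory of the table fails resolution before VACUUM ever runs.

```scala
import java.nio.file.{Files, Path}

// Illustrative stand-in for the resolution check discussed above:
// a path counts as a Delta table root only if `<path>/_delta_log`
// exists, so pointing at a subdirectory of the table fails here.
def resolvesAsDeltaTable(path: Path): Boolean =
  Files.isDirectory(path.resolve("_delta_log"))
```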
Thanks for the suggestion. I removed the `baseDeltaPath` check.
Hmm, we should probably double-check the existing behavior first -- if vacuuming a subdirectory was supported before, and our changes here would block it, then that's a breaking change and we need to proceed very carefully. I think it hinges on this code:

If I'm not mistaken, it requires the given path to be the actual table path, which means the proposed change is not a breaking change. Even if `findDeltaTableRoot` were to find the table starting from a subdirectory, the result would fail the equality check immediately after.

CC @tdas @zsxwing
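To make the "fail the equality check" point concrete, here is a minimal local-filesystem sketch (illustrative names and semantics, not the actual `findDeltaTableRoot` implementation): root-finding walks upward to the nearest ancestor containing `_delta_log`, and the caller then rejects the result unless it equals the path the user supplied, so subdirectories of a table are refused rather than silently vacuumed.

```scala
import java.nio.file.{Files, Path}

// Model of root-finding: nearest ancestor (including the path itself)
// that contains a `_delta_log` directory, if any.
def findDeltaTableRootLocal(path: Path): Option[Path] =
  Iterator.iterate(path.toAbsolutePath.normalize)(_.getParent)
    .takeWhile(_ != null)
    .find(p => Files.isDirectory(p.resolve("_delta_log")))

// Model of the guard under discussion: accept the path only when the
// found root is exactly the path the user gave.
def requireTableRoot(userPath: Path): Path = {
  val root = findDeltaTableRootLocal(userPath)
    .getOrElse(throw new IllegalArgumentException(s"$userPath is not a Delta table"))
  require(root == userPath.toAbsolutePath.normalize,
    s"$userPath points inside the table rooted at $root; pass the root itself")
  root
}
```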
This seems to copy ideas from `ResolveSQLOnFile` from Spark? Is there a reason we can't leverage that here, and let `DeltaDataSource.getTable` produce the `DeltaTableV2` we need?
Ugh, `UnresolvedTable` != `UnresolvedRelation`, and it looks like the data source code uses `UnresolvedRelation` while `UnresolvedPathBasedDeltaTable` uses `UnresolvedTable`.
`UnresolvedDeltaPathOrIdentifier` will produce `UnresolvedTable` on table identifiers (including the file path table `delta.path`).

With `UnresolvedRelation` as the child of `VacuumTableCommand`, the resolved relation from Apache Spark will be a Parquet data source relation. There is some issue with my debugger and I haven't figured out the reason.

In `ResolveRelations`, both `UnresolvedTable` and `UnresolvedRelation` are processed: `UnresolvedTable` always results in `ResolvedTable`, while `UnresolvedRelation` results in `SubqueryAlias` with various nodes. I think using `UnresolvedTable` is simpler here. Any reason why we should use `UnresolvedRelation`?
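The two resolution outcomes described above have roughly the following shapes in Spark's analyzer (node arguments simplified from memory and version-dependent, so treat this as a sketch rather than the real plans):

```scala
// UnresolvedTable(Seq("db", "t"), commandName = "VACUUM")
//   --ResolveRelations-->  ResolvedTable(catalog, identifier, table)
//
// UnresolvedRelation(Seq("db", "t"))
//   --ResolveRelations-->  SubqueryAlias("t", <relation>)
//     where <relation> may be a DataSourceV2Relation,
//     a V1 LogicalRelation, etc.
```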
Yeah, `UnresolvedRelation` only makes sense if it allows us to reuse existing machinery in some way. But:

That's... awkward. Tho I've noticed that the file index for Delta is a parquet source, because that's the physical file format Delta reads. Is there no trace of Delta in the resulting scan node, tho?