
normalize databricks paths as part of resolving them #157

Merged
merged 4 commits into main from normalize-paths-when-resolving-them
Oct 10, 2024

Conversation

@ericvergnaud (Contributor) commented Oct 10, 2024

Path("/a/b/../c").resolve() returns Path("/a/c").
Databricks paths should behave the same way, but currently they don't.
This PR fixes the issue.

Progresses databrickslabs/ucx#2882
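
For reference, Python's built-in pathlib already behaves this way; a quick standard-library illustration of the target semantics (POSIX paths assumed):

    from pathlib import Path, PurePosixPath

    # On a POSIX system, resolve() collapses ".." against the preceding segment,
    # even when the path does not exist on disk:
    print(Path("/a/b/../c").resolve())          # /a/c

    # Pure paths keep the ".." part verbatim, which is why an explicit
    # normalization step is needed when emulating Path semantics on top of them:
    print(PurePosixPath("/a/b/../c").parts)     # ('/', 'a', 'b', '..', 'c')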


github-actions bot commented Oct 10, 2024

✅ 40/40 passed, 2 skipped, 2m9s total

Running from acceptance #225

        return absolute
        return absolute.normalize()

    def normalize(self: P) -> P:
Contributor:

A quick initial impression while I have a closer look at this: we should not be introducing new public methods like this; they're not part of the Path() API being emulated here.

Collaborator:

good point, this method is a good candidate for being private

Contributor Author:

I don't disagree, but we're already doing this: see is_notebook() ...

Contributor Author:

done

Comment on lines 526 to 534
    for index, part in enumerate(self._path_parts):
        # we can't normalize '/../stuff' so let's ignore such scenarios
        if index == 0 or part != "..":
            continue
        segments = list(self._path_parts)
        segments.pop(index)
        segments.pop(index - 1)
        return self.with_segments(self.anchor, *segments).normalize()
    return self
Collaborator:

Suggested change:

    stack = []
    for part in self._path_parts:
        if part == "..":
            if stack:
                stack.pop()
        elif part == "." or not part:
            continue
        else:
            stack.append(part)
    return self.with_segments(self.anchor, *stack)

can we perhaps get a non-recursive option?

Contributor Author:

done
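
For illustration, the suggested single-pass approach can be exercised on its own; a minimal standalone sketch (the function name and plain-tuple input are assumptions for the example, not the library's API):

    def normalize_parts(parts: tuple[str, ...]) -> tuple[str, ...]:
        """Collapse '..' and '.' segments in a single pass, without recursion."""
        stack: list[str] = []
        for part in parts:
            if part == "..":
                if stack:
                    stack.pop()          # drop the previous segment, if any
            elif part == "." or not part:
                continue                 # skip no-op segments
            else:
                stack.append(part)
        return tuple(stack)

    # Example: mirrors Path("/a/b/../c").resolve() -> /a/c
    print(normalize_parts(("a", "b", "..", "c")))   # ('a', 'c')

Each part is visited exactly once, so long runs of ".." segments never grow the call stack, unlike a normalize() that recurses once per collapsed pair.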


@@ -520,7 +520,18 @@ def resolve(self: P, strict: bool = False) -> P:
        if strict and not absolute.exists():
Collaborator:

won't self.absolute() fail because self.cwd() is not available for _DatabricksPath?

    def absolute(self: P) -> P:
        if self.is_absolute():
            return self
        return self.with_segments(self.cwd(), self)

Contributor Author:

it will, but that's not the problem being addressed by this PR
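
For background on this exchange, here is a pure-pathlib analogue of the concern; the absolute() helper below is hypothetical and only illustrates why turning a relative path into an absolute one needs some notion of a working directory:

    from pathlib import PurePosixPath

    def absolute(path: PurePosixPath, cwd: PurePosixPath | None = None) -> PurePosixPath:
        """Hypothetical helper: absolutize a pure path against an explicit cwd."""
        if path.is_absolute():
            return path                  # nothing to do for absolute paths
        if cwd is None:
            # Without a usable cwd() the operation is undefined, which is
            # essentially the failure mode the reviewer is pointing at.
            raise NotImplementedError("no current working directory available")
        return cwd / path

    print(absolute(PurePosixPath("/a/b")))                      # /a/b
    print(absolute(PurePosixPath("b"), PurePosixPath("/a")))    # /a/b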

Comment on lines +534 to +535
            elif part is None or part == '.':
                continue
Contributor:

I don't think this is needed: by the time we get here a part can be neither None nor '.'?
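
For comparison, pathlib's pure paths drop '.' and empty segments at parse time, so if _path_parts mirrors that behaviour the branch is indeed unreachable; a quick standard-library check (not a statement about the Databricks implementation):

    from pathlib import PurePosixPath

    # '.' and empty segments are dropped while parsing; '..' is kept.
    print(PurePosixPath("/a/./b//c").parts)      # ('/', 'a', 'b', 'c')
    print(PurePosixPath("/a/../b").parts)        # ('/', 'a', '..', 'b')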

            else:
                segments.append(part)
        # pylint: disable=protected-access
        return self.with_segments(self.anchor, *segments)._normalize()
Contributor:

It's a bit unclear to me why we need to also call ._normalize() on the new instance?
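
For context, the earlier revision quoted in the suggestion thread above removed a single ".." pair and then recursed, so a trailing _normalize() call is what finished the job. A standalone sketch of that one-pair-per-pass shape (illustrative names, assumed for the example, not the merged code):

    def normalize_once_and_recurse(parts: tuple[str, ...]) -> tuple[str, ...]:
        """Remove a single ('segment', '..') pair, then recurse for the rest."""
        for index, part in enumerate(parts):
            if index == 0 or part != "..":
                continue
            segments = list(parts)
            segments.pop(index)          # drop the '..'
            segments.pop(index - 1)      # drop the segment it cancels
            return normalize_once_and_recurse(tuple(segments))
        return parts

    print(normalize_once_and_recurse(("a", "b", "..", "..", "c")))   # ('c',)

Whether the extra call is still needed depends on whether the loop already collapses every ".." in a single pass.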

@nfx (Collaborator) left a comment:

lgtm

@nfx nfx merged commit ca6571c into main Oct 10, 2024
7 of 9 checks passed
@nfx nfx deleted the normalize-paths-when-resolving-them branch October 10, 2024 15:55
nfx added a commit that referenced this pull request Oct 10, 2024
* Bump actions/checkout from 4.1.7 to 4.2.0 ([#149](#149)). In this pull request, the `actions/checkout` dependency is upgraded from version 4.1.7 to 4.2.0 in the `acceptance.yml` and `downstreams.yml` workflow files. The new version provides additional Ref and Commit outputs, as well as updated dependencies, which aim to improve the functionality and security of the checkout process. The `Ref` output is a string representing the reference that was checked out, and the `Commit` output is the SHA-1 hash of the checked-out commit. Dependency updates include bumping the `braces` package from 3.0.2 to 3.0.3 and updating the minor-npm-dependencies group across one directory with four updates. These changes contribute to a more reliable and efficient checkout process and enhance the overall functionality and maintainability of the Action. Software engineers are recommended to review the changes and ensure they do not introduce conflicts with their current setup before adopting the new version.
* Bump actions/checkout from 4.2.0 to 4.2.1 ([#152](#152)). In this update, the version of the `actions/checkout` GitHub Action is bumped from 4.2.0 to 4.2.1 in a project's GitHub workflow files. This new version includes a modification to check out other `refs/*` by commit if provided, falling back to the ref. This change enhances the flexibility of the `checkout` action in handling different types of references, which could be useful for users working with multiple branches or references in their workflows. The update also adds a workflow file for publishing releases to an immutable action package. This release was contributed by the new project collaborator, @orhantoy, who made the change in pull request [1924](https://redirect.github.com/actions/checkout/pull/1924).
* Bump databrickslabs/sandbox from acceptance/v0.3.0 to 0.3.1 ([#155](#155)). In this update, the dependency for `databrickslabs/sandbox` has been bumped from version `acceptance/v0.3.0` to `0.3.1`. This change includes bug fixes, upgrades to go-git libraries, and dependency updates. The `golang.org/x/crypto` library was specifically bumped from version `0.16.0` to `0.17.0` in both `/go-libs` and `/runtime-packages`. Additionally, the `cac167b` commit expanded acceptance test logs and introduced experimental OIDC refresh token rotation. The acceptance test job in the workflow was also updated to use the new version of `databrickslabs/sandbox`. Ignore conditions were added for previous versions of `databrickslabs/sandbox` in this release. The README was also modified, and install instructions were added to the changelog.
* Catch all errors when checking Databricks path, notably BadRequest ones ([#156](#156)). This commit introduces improvements to the error handling of the `exists` method in the `paths.py` file when checking Databricks path. Previously, only `NotFound` errors were caught, but now `BadRequest` errors are also handled, addressing issue [#2882](https://github.com/databrickslabs/blueprint/issues/2882). The `exists` method has been updated to catch and manage `DatabricksError` exceptions, which now encompass `BadRequest` errors, ensuring comprehensive error handling for Databricks path-related operations. Additionally, the `_cached_file_info` and `_cached_object_info` attributes are now initialized when a `DatabricksError` exception occurs, returning `False` accordingly. This enhancement maintains consistency and accuracy in the `exists` method while broadening the range of errors captured, resulting in a more robust and reliable codebase with enhanced error reporting for users.
* Normalize databricks paths as part of resolving them ([#157](#157)). In this release, the `resolve` method in the `paths.py` file of the databricks/labs/blueprint project has been enhanced to handle parent directory references ("..") consistently with Python's built-in `Path` object. In Python, `Path("/a/b/../c").resolve()` returns `Path("/a/c")`, whereas Databricks paths did not previously behave this way. This modification introduces a new `_normalize()` method, which processes the path parts and ensures that ".." segments are handled correctly. The commit also includes a new test function, `test_resolve_is_consistent`, which checks the consistent resolution of Databricks paths with various input formats, such as relative paths, ".." or "." components, and absolute paths. This change ensures that the resolved path is normalized regardless of the input format, contributing to the resolution of issue [#2882](https://github.com/databrickslabs/blueprint/issues/2882). By normalizing Databricks paths in the same fashion as Python's built-in `Path` object, the code becomes more robust and predictable for software engineers using the project.
* Updated databrickslabs/sandbox requirement to acceptance/v0.3.0 ([#153](#153)). In this pull request, the `databrickslabs/sandbox` package requirement in the downstreams GitHub Actions workflow is updated to version 0.3.0, which is the latest version available. This package provides a sandbox environment for development and testing, and the new version includes bug fixes and dependency updates that may enhance its reliability and performance. Dependabot has been used to ensure a smooth update process, with any conflicts being resolved automatically. However, it is recommended to review the changelog and test the updated version before merging this pull request to ensure compatibility and functionality in your specific use case. Additionally, Dependabot commands are available to manage ignore conditions for this dependency.

Dependency updates:

 * Bump actions/checkout from 4.1.7 to 4.2.0 ([#149](#149)).
 * Bump actions/checkout from 4.2.0 to 4.2.1 ([#152](#152)).
 * Updated databrickslabs/sandbox requirement to acceptance/v0.3.0 ([#153](#153)).
 * Bump databrickslabs/sandbox from acceptance/v0.3.0 to 0.3.1 ([#155](#155)).
@nfx nfx mentioned this pull request Oct 10, 2024
nfx added a commit to databrickslabs/ucx that referenced this pull request Oct 10, 2024
## Changes
Workspace API does not support relative subpaths such as "/a/b/../c".
This PR fixes the issue by resolving workspace paths before calling the
API.

### Linked issues
Resolves #2882 
Requires databrickslabs/blueprint#156
Requires databrickslabs/blueprint#157

### Functionality
None

### Tests
- [x] added integration tests

---------

Co-authored-by: Eric Vergnaud <[email protected]>
Co-authored-by: Serge Smertin <[email protected]>
nfx pushed a commit to databrickslabs/ucx that referenced this pull request Oct 14, 2024
…databricks-labs-blueprint` (#2950)

## Changes

In #2920 we introduced a change to ensure that notebook paths are
normalised, resolving #2882. This change depended on an upstream fix
(databrickslabs/blueprint#157) included in the 0.9.1 release of the
dependency. This PR ensures that we run against that release or later.