-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add verify_existing
option to cache
#630
base: main
Are you sure you want to change the base?
Conversation
Related work:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
def cache_file(self, fname: str, secrets: Optional[dict], **open_kwargs) -> None: | ||
# check and see if the file already exists in the cache | ||
logger.info(f"Caching file '{fname}'") | ||
if self.exists(fname): | ||
exists = self.exists(fname) | ||
if exists and self.verify_existing: | ||
cached_size = self.size(fname) | ||
remote_size = _get_url_size(fname, secrets, **open_kwargs) | ||
if cached_size == remote_size: | ||
# TODO: add checksumming here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future reference, do we have a way of doing this without downloading the file and computing the checksum? S3 provides that information but arbitrary HTTP endpoints can't be relied upon to provide the SHA in a predictable way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so digging into fsspec source a bit, it appears that fsspec's info
method ought to be used here. It will, by default, avoid opening the file and instead attempt to get both size and checksum from a head
request. It looks to me as though there's no benefit to caching with the current method if verify_existing
is True
, as it will get the size of an opened file rather than using a small HEAD
request
Would you like me to amend this PR with some logic to lean on those methods? Probably good to provide a warning in case it is absolutely necessary to fully load the file up (as happens here) for size/checksum validation
I'm afraid we let this get pretty stale. It might be easier to incorporate the intended changes here in related changes in #750 ... I'm not happy to do it, but the rebase/merge here gets a bit nastier than it ought to |
Adds the option to disable verification of files in the cache against their remote sizes.
For long-running / large caching operations which need to be restarted, simply verifying remote sizes can itself represent a large time cost. In such cases, it may be preferable to disable verification of remote sizes, to allow for more expedient continuation of the caching operation.