Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only the latest data files are accessible #147

Closed
ariostas opened this issue Mar 22, 2024 · 3 comments
Closed

Only the latest data files are accessible #147

ariostas opened this issue Mar 22, 2024 · 3 comments

Comments

@ariostas
Copy link
Collaborator

In scikit-hep/uproot5#1115 we noticed that tests seemed to pass even when the newest version of scikit-hep-testdata hadn't been deployed to PyPI.

I took a look at the code and found that the data files are always downloaded using these urls:

baseurl = "https://raw.githubusercontent.com/scikit-hep/scikit-hep-testdata/main/src/skhep_testdata/data/"
zipurl = "https://github.com/scikit-hep/scikit-hep-testdata/zipball/main"

This means that if a file gets updated, then it becomes impossible to use the older version by installing an older version of the package, and instead the only option is to manually move the older version of the file into the cache directory. And if a file gets deleted, then it becomes inaccessible.

So with #144 I unintentionally made previous RNTuple files inaccessible. In this case it's fine since no one should be using the older RNTuple spec, but there might be cases where files need to be updated, but with the option of using older versions by installing previous versions of this package.

So I think it would be better if the above lines could be replaced with something like

baseurl = f"https://github.com/scikit-hep/scikit-hep-testdata/raw/v{__version__}/src/skhep_testdata/data/" 
zipurl = f"https://github.com/scikit-hep/scikit-hep-testdata/archive/refs/tags/v{__version__}.zip" 

and maybe have a fallback in case someone is using an editable version or something like that.

Also, it would be nice if there was a clear_cache function.

@jpivarski
Copy link
Member

clear_cache would be good. skhep_testdata.data_path with a reload=True option would also be good. So would a checksum (skhep_testdata would come with a list of checksums for each file, and would compare a file-on-disk with its expected checksum before assuming that it's a cache hit).

I don't think there's a problem with newer versions of scikit-hep-testdata not providing access to older files, but it is a problem that the cache lives outside of package versioning: you can skhep_testdata.data_path in one version, upgrade the package, and then skhep_testdata.data_path gets the old file, rather than the new one (or vice-versa, if you downgrade). The problem is that file versions are decoupled from Python package versions, and users like me assume that the Python package version controls all of its contents (where the test files are "contents").

Since we usually only add files, this hasn't come up before, but it does make sense to replace RNTuple files until version 1.0 is declared.

@ariostas
Copy link
Collaborator Author

Okay, I guess this is really a non-issue since it is extremely rare to replace or remove files. I'd be happy to implement clear_cache and/or reload=True since they're pretty trivial. The checksum idea would be great if files were frequently changing, but it's overkill and would also require more thought about file versioning since older versions would want to get older data so that checksums match.

I'd also be happy just closing this issue and dealing with it manually for RNTuple v1.0.0.

@jpivarski
Copy link
Member

I don't view this library as something that needs to have all the niceties because it's not user-facing. Since we (developers) know that the cache is in ~/.local/skhepdata, we can already do clear_cache easily by hand. (And even if one couldn't, one would have to search some documentation to either find out where the cache is or find out that there's a clear_cache function—one is as good as the other.)

skhep_testdata.data_path(..., reload: bool = False) would be a nice convenience for certainty in local testing. This might come up again for RNTuple files, and it's more convenient to have a flag in the function you're actively working with in Python, rather than having to hunt for a particular filename in a separate terminal. You're most likely to be the user of this, so I'll let you decide whether you think it's worth spending the time on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants