Only the latest data files are accessible #147

ariostas · 2024-03-22T15:30:29Z

In scikit-hep/uproot5#1115 we noticed that tests seemed to pass even when the newest version of scikit-hep-testdata hadn't been deployed to PyPI.

I took a look at the code and found that the data files are always downloaded using these urls:

scikit-hep-testdata/src/skhep_testdata/local_files.py

Lines 22 to 23 in 0fd45be

    
           baseurl = "https://raw.githubusercontent.com/scikit-hep/scikit-hep-testdata/main/src/skhep_testdata/data/" 
        
           zipurl = "https://github.com/scikit-hep/scikit-hep-testdata/zipball/main"

This means that if a file gets updated, then it becomes impossible to use the older version by installing an older version of the package, and instead the only option is to manually move the older version of the file into the cache directory. And if a file gets deleted, then it becomes inaccessible.

So with #144 I unintentionally made previous RNTuple files inaccessible. In this case it's fine since no one should be using the older RNTuple spec, but there might be cases where files need to be updated, but with the option of using older versions by installing previous versions of this package.

So I think it would be better if the above lines could be replaced with something like

baseurl = f"https://github.com/scikit-hep/scikit-hep-testdata/raw/v{__version__}/src/skhep_testdata/data/" 
zipurl = f"https://github.com/scikit-hep/scikit-hep-testdata/archive/refs/tags/v{__version__}.zip"

and maybe have a fallback in case someone is using an editable version or something like that.

Also, it would be nice if there was a clear_cache function.

The text was updated successfully, but these errors were encountered:

jpivarski · 2024-03-22T19:26:33Z

clear_cache would be good. skhep_testdata.data_path with a reload=True option would also be good. So would a checksum (skhep_testdata would come with a list of checksums for each file, and would compare a file-on-disk with its expected checksum before assuming that it's a cache hit).

I don't think there's a problem with newer versions of scikit-hep-testdata not providing access to older files, but it is a problem that the cache lives outside of package versioning: you can skhep_testdata.data_path in one version, upgrade the package, and then skhep_testdata.data_path gets the old file, rather than the new one (or vice-versa, if you downgrade). The problem is that file versions are decoupled from Python package versions, and users like me assume that the Python package version controls all of its contents (where the test files are "contents").

Since we usually only add files, this hasn't come up before, but it does make sense to replace RNTuple files until version 1.0 is declared.

ariostas · 2024-03-22T20:56:55Z

Okay, I guess this is really a non-issue since it is extremely rare to replace or remove files. I'd be happy to implement clear_cache and/or reload=True since they're pretty trivial. The checksum idea would be great if files were frequently changing, but it's overkill and would also require more thought about file versioning since older versions would want to get older data so that checksums match.

I'd also be happy just closing this issue and dealing with it manually for RNTuple v1.0.0.

jpivarski · 2024-03-22T23:18:02Z

I don't view this library as something that needs to have all the niceties because it's not user-facing. Since we (developers) know that the cache is in ~/.local/skhepdata, we can already do clear_cache easily by hand. (And even if one couldn't, one would have to search some documentation to either find out where the cache is or find out that there's a clear_cache function—one is as good as the other.)

skhep_testdata.data_path(..., reload: bool = False) would be a nice convenience for certainty in local testing. This might come up again for RNTuple files, and it's more convenient to have a flag in the function you're actively working with in Python, rather than having to hunt for a particular filename in a separate terminal. You're most likely to be the user of this, so I'll let you decide whether you think it's worth spending the time on it.

ariostas closed this as completed Apr 2, 2024

jpivarski mentioned this issue May 16, 2024

The skhep_testdata.data_path function should run a checksum on the file that it finds #153

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Only the latest data files are accessible #147

Only the latest data files are accessible #147

ariostas commented Mar 22, 2024

jpivarski commented Mar 22, 2024

ariostas commented Mar 22, 2024

jpivarski commented Mar 22, 2024

Only the latest data files are accessible #147

Only the latest data files are accessible #147

Comments

ariostas commented Mar 22, 2024

jpivarski commented Mar 22, 2024

ariostas commented Mar 22, 2024

jpivarski commented Mar 22, 2024