

Options for reading list of netcdf files on OSN pod for rechunking #485

Open

pnorton-usgs opened this issue Jun 13, 2024 · 8 comments

@pnorton-usgs
Contributor

@rsignell, @rsignell-usgs I'm trying out the rechunk workflow for CONUS404, using the OSN pod as the source of the hourly WRF output files. What are my options for reading those files? I've been trying to figure out how to do this with fsspec, but so far I haven't found good documentation on how to get it to work.

@rsignell
Contributor

The best way would be to kerchunk the files! But you can also read them with xr.open_dataset(fs.open('s3://file.nc')).
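For example (a minimal sketch; the endpoint URL and credential setup are assumptions about the OSN pod, so adjust them for your environment):

import fsspec
import xarray as xr

# Assumed OSN pod endpoint and credentials; substitute the real values.
fs = fsspec.filesystem(
    's3',
    profile='osn',  # assumption: an AWS profile holding the pod credentials
    client_kwargs={'endpoint_url': 'https://usgs.osn.mghpcc.org'},  # assumption
)

# h5netcdf accepts the file-like object returned by fs.open();
# the netcdf4 engine cannot read Python file objects.
ds = xr.open_dataset(
    fs.open('s3://hytest-internal/wrfconus404-pgw/WY1980/wrf2d_d01_1979-10-01_00:00:00'),
    engine='h5netcdf',
)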

@pnorton-usgs
Contributor Author

@rsignell
I have a list of files to open: 144 hourly files per 6-day chunk. The file list looks like:

[
    's3://hytest-internal/wrfconus404-pgw/WY1980/wrf2d_d01_1979-10-01_00:00:00',
    's3://hytest-internal/wrfconus404-pgw/WY1980/wrf2d_d01_1979-10-01_01:00:00',
    ...
]

If I try to open the list of files with something like the following:

ds = xr.open_mfdataset(job_files, engine='netcdf4', parallel=True)

I get this error:

OSError: [Errno -128] NetCDF: Attempt to use feature that was not turned on when netCDF was built.: b's3://hytest-internal/wrfconus404-pgw/WY1980/wrf2d_d01_1979-10-01_05:00:00'

I can open individual files without any problem. I thought perhaps there was some fsspec command to handle the list, but I've been unable to find it in the documentation.

I'm not sure kerchunk would be a good option here, because we have over 380,000 hourly netCDF files.

@pnorton-usgs
Contributor Author

@rsignell
If I use fsspec.open_files(), I have to use h5netcdf for the engine, but then I get a different error:

ds = xr.open_mfdataset(fsspec.open_files(job_files), engine='h5netcdf', parallel=True)
KeyError: [<class 'h5netcdf.core.File'>, (<OpenFile 'hytest-internal/wrfconus404-pgw/WY1980/wrf2d_d01_1979-10-05_01:00:00'>,), 'r', (('decode_vlen_strings', True), ('driver', None), ('invalid_netcdf', None)), '6abad509-0177-4e5f-8d33-a2f6419645b4']

TypeError: expected str, bytes or os.PathLike object, not OpenFile
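One workaround sketch (assuming the TypeError means h5netcdf is being handed fsspec OpenFile wrappers rather than real file-like objects) is to open each file explicitly first:

import fsspec
import xarray as xr

# job_files is the list of s3:// URLs shown above.
# Materialize each OpenFile into a file-like object that h5netcdf accepts.
fileset = [f.open() for f in fsspec.open_files(job_files)]

# combine/concat_dim are assumptions about the WRF 'Time' dimension.
ds = xr.open_mfdataset(
    fileset,
    engine='h5netcdf',
    combine='nested',
    concat_dim='Time',
)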

@rsignell
Contributor

Can you open one of these files?
Are they NetCDF3 files, by chance? If so, you cannot read them directly from S3, and kerchunk would be your only good option! (Kerchunk works great for huge collections; we used it for 200,000+ NWM files. You just need to store the refs in parquet.)
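A rough sketch of that pattern (the endpoint, output path, and 'Time' concat dimension are assumptions; in practice you would parallelize the per-file step with dask):

import fsspec
from fsspec.implementations.reference import LazyReferenceMapper
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

so = {'client_kwargs': {'endpoint_url': 'https://usgs.osn.mghpcc.org'}}  # assumption

def make_ref(url):
    # Build a single-file reference set from one netCDF4/HDF5 file.
    with fsspec.open(url, 'rb', **so) as f:
        return SingleHdf5ToZarr(f, url, inline_threshold=300).translate()

refs = [make_ref(u) for u in job_files]  # map this over dask workers in practice

# Write the combined references directly into a parquet store.
fs_local = fsspec.filesystem('file')
out = LazyReferenceMapper.create(root='combined.parq', fs=fs_local, record_size=100_000)

MultiZarrToZarr(
    refs,
    remote_protocol='s3',
    remote_options=so,
    concat_dims=['Time'],  # assumption: WRF time dimension name
    out=out,
).translate()
out.flush()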

@pnorton-usgs
Contributor Author

See the note above: I can open individual files. They are netCDF4 files.

@pnorton-usgs
Contributor Author

Does kerchunk allow for easy modification of the metadata? What's an example of doing this with parquet refs?

@rsignell
Contributor

@pnorton-usgs check out this example I made a few weeks ago: https://gist.github.com/rsignell/84f727f25d923aab5aa7c534cef14151
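For reference, the generic pattern for opening a parquet reference store with xarray looks roughly like this ('combined.parq' and the endpoint below are placeholders; the gist has a complete, working version). On the metadata question: MultiZarrToZarr also takes preprocess/postprocess callables, which is one way to edit metadata while building the refs.

import xarray as xr

# Open the parquet refs through fsspec's reference filesystem.
ds = xr.open_dataset(
    'reference://',
    engine='zarr',
    backend_kwargs={
        'consolidated': False,
        'storage_options': {
            'fo': 'combined.parq',
            'remote_protocol': 's3',
            'remote_options': {
                'client_kwargs': {'endpoint_url': 'https://usgs.osn.mghpcc.org'},  # assumption
            },
        },
    },
)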

@rsignell
Contributor

Sorry for the delayed and distracted responses. I'm currently in low-bandwidth mode, but will be back on Monday.
