

Options for reading list of netcdf files on OSN pod for rechunking #485

Open

pnorton-usgs opened this issue Jun 13, 2024 · 8 comments

@pnorton-usgs
Contributor

@rsignell, @rsignell-usgs I'm trying out the rechunk workflow for CONUS404, using the OSN pod as the source of the hourly WRF output files. What are my options for reading those files? I've been trying to figure out how to do this with fsspec, but so far I haven't found good documentation on how to get it to work.

@rsignell
Contributor

The best way would be to kerchunk the files! But you can also read them with xr.open_dataset(fs.open('s3://file.nc')).
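For example (a minimal sketch; the endpoint URL and credential setup are assumptions about the OSN pod, so adjust them for your environment):

import fsspec
import xarray as xr

# Assumed OSN pod endpoint and credentials; substitute the real values.
fs = fsspec.filesystem(
    's3',
    profile='osn',  # assumption: an AWS profile holding the pod credentials
    client_kwargs={'endpoint_url': 'https://usgs.osn.mghpcc.org'},  # assumption
)

# h5netcdf accepts the file-like object returned by fs.open();
# the netcdf4 engine cannot read Python file objects.
ds = xr.open_dataset(
    fs.open('s3://hytest-internal/wrfconus404-pgw/WY1980/wrf2d_d01_1979-10-01_00:00:00'),
    engine='h5netcdf',
)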

@pnorton-usgs
Contributor Author

@rsignell
I have a list of files to open: 144 hourly files per 6-day chunk. The file list looks like:

[
    's3://hytest-internal/wrfconus404-pgw/WY1980/wrf2d_d01_1979-10-01_00:00:00',
    's3://hytest-internal/wrfconus404-pgw/WY1980/wrf2d_d01_1979-10-01_01:00:00',
    ...
]

If I try to open the list of files with something like the following:

ds = xr.open_mfdataset(job_files, engine='netcdf4', parallel=True)

I get this error:

OSError: [Errno -128] NetCDF: Attempt to use feature that was not turned on when netCDF was built.: b's3://hytest-internal/wrfconus404-pgw/WY1980/wrf2d_d01_1979-10-01_05:00:00'

I can open individual files without any problem. I thought perhaps there was some fsspec command to handle the list, but I've been unable to find it in the documentation.

I'm not sure kerchunk would be a good option here, because we have over 380,000 hourly netCDF files.

@pnorton-usgs
Contributor Author

@rsignell
If I use fsspec.open_files(), I have to use h5netcdf for the engine, but then I get a different error:

ds = xr.open_mfdataset(fsspec.open_files(job_files), engine='h5netcdf', parallel=True)
KeyError: [<class 'h5netcdf.core.File'>, (<OpenFile 'hytest-internal/wrfconus404-pgw/WY1980/wrf2d_d01_1979-10-05_01:00:00'>,), 'r', (('decode_vlen_strings', True), ('driver', None), ('invalid_netcdf', None)), '6abad509-0177-4e5f-8d33-a2f6419645b4']

TypeError: expected str, bytes or os.PathLike object, not OpenFile
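One workaround sketch (assuming the TypeError means h5netcdf is being handed fsspec OpenFile wrappers rather than real file-like objects) is to open each file explicitly first:

import fsspec
import xarray as xr

# job_files is the list of s3:// URLs shown above.
# Materialize each OpenFile into a file-like object that h5netcdf accepts.
fileset = [f.open() for f in fsspec.open_files(job_files)]

# combine/concat_dim are assumptions about the WRF 'Time' dimension.
ds = xr.open_mfdataset(
    fileset,
    engine='h5netcdf',
    combine='nested',
    concat_dim='Time',
)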

@rsignell
Contributor

Can you open one of these files?
Are they NetCDF3 files, by chance? If so, you cannot read them directly from S3, and kerchunk would be your only good option! (Kerchunk works great for huge collections; we used it for 200,000+ NWM files. You just need to store the refs in parquet.)
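A rough sketch of that pattern (the endpoint, output path, and 'Time' concat dimension are assumptions; in practice you would parallelize the per-file step with dask):

import fsspec
from fsspec.implementations.reference import LazyReferenceMapper
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

so = {'client_kwargs': {'endpoint_url': 'https://usgs.osn.mghpcc.org'}}  # assumption

def make_ref(url):
    # Build a single-file reference set from one netCDF4/HDF5 file.
    with fsspec.open(url, 'rb', **so) as f:
        return SingleHdf5ToZarr(f, url, inline_threshold=300).translate()

refs = [make_ref(u) for u in job_files]  # map this over dask workers in practice

# Write the combined references directly into a parquet store.
fs_local = fsspec.filesystem('file')
out = LazyReferenceMapper.create(root='combined.parq', fs=fs_local, record_size=100_000)

MultiZarrToZarr(
    refs,
    remote_protocol='s3',
    remote_options=so,
    concat_dims=['Time'],  # assumption: WRF time dimension name
    out=out,
).translate()
out.flush()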

@pnorton-usgs
Contributor Author

See the note above: I can open individual files. They are netCDF4 files.

@pnorton-usgs
Contributor Author

Does kerchunk allow for easy modification of the metadata? What's an example of doing this with parquet refs?

@rsignell
Contributor

@pnorton-usgs check out this example I made a few weeks ago: https://gist.github.com/rsignell/84f727f25d923aab5aa7c534cef14151
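For reference, the generic pattern for opening a parquet reference store with xarray looks roughly like this ('combined.parq' and the endpoint below are placeholders; the gist has a complete, working version). On the metadata question: MultiZarrToZarr also takes preprocess/postprocess callables, which is one way to edit metadata while building the refs.

import xarray as xr

# Open the parquet refs through fsspec's reference filesystem.
ds = xr.open_dataset(
    'reference://',
    engine='zarr',
    backend_kwargs={
        'consolidated': False,
        'storage_options': {
            'fo': 'combined.parq',
            'remote_protocol': 's3',
            'remote_options': {
                'client_kwargs': {'endpoint_url': 'https://usgs.osn.mghpcc.org'},  # assumption
            },
        },
    },
)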

@rsignell
Contributor

Sorry for the delayed and distracted responses. I'm currently in low-bandwidth mode, but will be back on Monday.
