Rechunk NWM data #6

Open · amsnyder opened this issue Aug 31, 2022 · 10 comments

@amsnyder (Contributor)
We want to rechunk all NWM v2.1 variables into a single zarr store. Chunks will be organized so that each chunk contains one feature and the entire time series for that feature.

Rechunking will be based on this notebook: https://github.com/NCAR/rechunk_retro_nwm_v21/blob/main/notebooks/usage_example_rerechunk_chrtout.ipynb
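For context, here is a minimal sketch of that target layout using rechunker. The chrtout.zarr store path, the streamflow variable name, and the memory budget and output paths are assumptions for illustration, not values taken from the NCAR notebook:

```python
import fsspec
import xarray as xr
from rechunker import rechunk

# Open the source store anonymously (assumed store name; s3fs required).
src = xr.open_zarr(
    fsspec.get_mapper(
        "s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr", anon=True
    ),
    consolidated=True,
)

# Keep just the variable of interest; drop extra coordinates so every
# remaining variable has an entry in target_chunks.
src = src[["streamflow"]].reset_coords(drop=True)

# Target layout: one feature per chunk, full time series in that chunk.
target_chunks = {
    "streamflow": {"time": src.sizes["time"], "feature_id": 1},
    "time": None,        # leave coordinate chunking as-is
    "feature_id": None,
}

plan = rechunk(
    src,
    target_chunks=target_chunks,
    max_mem="2GB",                          # per-worker memory budget
    target_store="chrtout_rechunked.zarr",
    temp_store="chrtout_tmp.zarr",          # scratch for the two-pass copy
)
plan.execute()
```

The two-pass copy through temp_store is what keeps per-worker memory bounded; running plan.execute() against a dask distributed cluster is what makes it scale.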

amsnyder self-assigned this Aug 31, 2022
@gzt5142 (Contributor) commented Sep 1, 2022

@amsnyder ,

I wrote the what-does-chunking-do tutorial notebook using that NCAR reference, and in the process found a few ways to economize on processing time (and probably memory). The end result is that re-chunking would likely run a little faster.

Check out the Cloud example (about 2/3 of the way down) in that chunking tutorial notebook to see the differences.
https://github.com/USGS-python/hytest_notebook_tutorials/blob/dev/Syllabus/L2/xx_Chunking.ipynb

@gzt5142 (Contributor) commented Sep 1, 2022

I just realized that the greatest economy came from changes in the way I selected features based on gage_id -- if you're not doing that sort of sub-setting, then my code won't run much differently from the core process in that NCAR example.
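For the record, a hypothetical sketch of that kind of gage_id sub-setting, assuming the CHRTOUT store carries a blank-padded gage_id variable along feature_id (names and path are assumptions about the store):

```python
import fsspec
import numpy as np
import xarray as xr

# Assumed store path, as in the NCAR example.
ds = xr.open_zarr(
    fsspec.get_mapper(
        "s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr", anon=True
    ),
    consolidated=True,
)

# gage_id is a small 1-D array along feature_id; load it eagerly once
# rather than masking lazy dask arrays repeatedly.
has_gage = np.char.strip(ds["gage_id"].values.astype(str)) != ""

# A single integer-based selection pulls only the gaged reaches.
ds_gages = ds.isel(feature_id=np.flatnonzero(has_gage))
```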

@gzt5142 (Contributor) commented Oct 3, 2022

Variables of interest, extracted from https://ral.ucar.edu/sites/default/files/public/WRFHydroV5_OutputVariableMatrix_V5.pdf (bold = priority):

| Variable | Domain | Description |
| --- | --- | --- |
| ACCET | LDASOUT | Accumulated total evapotranspiration |
| SNEQV | LDASOUT | Snow water equivalent |
| FSNO | LDASOUT | Fraction of surface covered by snow |
| ACCPRCP | LDASOUT | Accumulated precipitation |
| SOIL_M | LDASOUT | Volumetric soil moisture |
| CANWAT | LDASOUT | Total canopy water (liquid + ice) |
| CANICE | LDASOUT | Canopy ice water content |
| depth | GWOUT | Groundwater bucket water level |
| sfcheadrt | RTOUT | Surface head (from HYDRO) |
| SFCRNOFF | LDASOUT | Surface runoff, accumulated |
| UGDRNOFF | LDASOUT | Underground runoff, accumulated |

Data are accessed via the AWS Open Data Registry; see https://registry.opendata.aws/nwm-archive/

Datasets are available in NetCDF or zarr format via S3 buckets. We prefer zarr, so we will prioritize reading data from
https://noaa-nwm-retrospective-2-1-zarr-pds.s3.amazonaws.com/index.html
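As a quick sketch, one of these stores can be opened anonymously with xarray plus fsspec/s3fs; the consolidated-metadata flag is an assumption about how the store was written:

```python
import fsspec
import xarray as xr

# Open the LDASOUT zarr store directly from the AWS Open Data bucket.
store = fsspec.get_mapper(
    "s3://noaa-nwm-retrospective-2-1-zarr-pds/ldasout.zarr", anon=True
)
ds = xr.open_zarr(store, consolidated=True)
print(ds[["ACCET", "SNEQV", "FSNO"]])  # a few of the LDASOUT variables above
```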

@amsnyder (Contributor, Author) commented Oct 6, 2022

As I'm actually starting to look at this data, I'm seeing that rechunking into a single dataset doesn't seem feasible.

I believe the gw, lake, and streamflow datasets have data associated with features (but a different set of features for each). We could rechunk each of these datasets by feature id (like we did for streamflow), but they would still need to be 3 different datasets because the features are different.

ldasout, precip, and rtout are gridded - these could be rechunked and combined into a single store (if we want to). I'd like some input from @rsignell-usgs or @pnorton-usgs on how to think about rechunking these data.

@sfoks and @pnorton-usgs (or anyone more familiar with NWM output) - is my understanding of the data correct? And if so, who can I talk to about optimal ways to rechunk the gridded data?

@sfoks (Member) commented Oct 6, 2022

Aubrey Dugger at NCAR is someone who knows this best, but I tried to compile notes here.

WRFHydro Variables

| File format/name | Description | Notes |
| --- | --- | --- |
| CHRTOUT_DOMAIN | Streamflow output at all channel reaches/cells | NHD reaches |
| CHANOBS_DOMAIN | Streamflow output at forecast points or gage reaches/cells | n = 7994 gages for NWM v2.1 |
| CHRTOUT_GRID | Streamflow on the 2D high-resolution routing grid | high-res grid is 250 m |
| RTOUT_DOMAIN | Terrain routing variables on the 2D high-resolution routing grid | high-res grid is 250 m |
| LAKEOUT_DOMAIN | Lake output variables | 1 km; I think a grid cell is coded as lake or not lake? Aubrey would know |
| GWOUT_DOMAIN | Groundwater output variables | 1 km; pretty sure this is gridded |
| LDASOUT_DOMAIN | Land surface model output | 1 km |

@amsnyder (Contributor, Author) commented Nov 28, 2022

We will ask Alicia Rhoades to rechunk the NWM data - starting with the 3 priority variables identified by @gzt5142 above, which come from the ldasout.zarr store on the AWS Open Data Registry: https://noaa-nwm-retrospective-2-1-zarr-pds.s3.amazonaws.com/index.html.

@rsignell-usgs will prepare a rechunking tutorial for gridded datasets that he will contribute to either the hytest-org repo here: https://github.com/hytest-org/hytest/tree/main/dataset_preprocessing/tutorials or Project Pythia.

@rsignell-usgs will go over this tutorial in a rechunking demo meeting. I am checking with Alicia on her availability for the demo and to start work. I will schedule this demo and invite @amrhoades, @rsignell-usgs, @sfoks, @thodson-usgs, @kathymd (rechunking data on NHGF), and @ellen-brown (rechunking data on NHGF).

@amsnyder (Contributor, Author) commented Dec 6, 2022

Meeting scheduled for Thursday, Dec. 8 at 4pm ET. I invited all those mentioned above, plus a few more. Let me know if you didn't receive an invite and would like one.

@rsignell-usgs (Member) commented Dec 22, 2022

An update on rechunking workflows here:

  • The meeting/tutorial scheduled for Dec 8 did not happen: we ran the plan for the tutorial past one of the Pangeo Forge developers, and he recommended that we use Pangeo Forge for rechunking, so we canceled the tutorial until we had a chance to look at its capabilities.
  • We tried submitting a recipe for rechunking some of the 1 km gridded data from the NWM 2.1 reanalysis. While local testing of the recipe worked, the recipe submitted to pangeo-forge has not worked yet. Unfortunately, pangeo-forge doesn't expose the logs to users, so it requires a Pangeo Forge maintainer to diagnose the problem.
  • At the Pangeo weekly meeting this week, @rabernat mentioned that pangeo-forge doesn't use rechunker, so perhaps it's not the best tool presently if rechunking is the goal (hoping he can correct me if I got that wrong). He suggested just using Dask for rechunking. I floated the idea of deploying AWS ParallelCluster with FSx for Lustre and using SLURMCluster to scale up the rechunking process, and the group seemed to think that sounded like a good idea. Maybe I should post the idea to rechunker to see if others have thoughts.
  • Rechunking with large memory and fast disk on HPC is the approach @pnorton-usgs uses on Denali -- except instead of SLURMCluster he uses LocalCluster, because on Denali you get all 40 cores on the node. So he submits a bunch of jobs that each use 40 cores, with each job tackling a few weeks or months of the rechunking task (see the sketch after this list).
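A rough sketch of that per-job pattern; the store paths, chunk sizes, slab boundaries, and worker counts here are placeholders, not @pnorton-usgs's actual settings:

```python
import numpy as np
import xarray as xr
from dask.distributed import Client, LocalCluster

# One LocalCluster per HPC job, using all cores on the node (40 on Denali).
client = Client(LocalCluster(n_workers=10, threads_per_worker=4))

# Each job rechunks one slab of the record; slab boundaries are chosen to
# align with the target store's chunking along time.
ds = xr.open_zarr("ldasout.zarr")
slab = ds.sel(time=slice("1990-01-01", "1990-06-30"))
slab = slab.chunk({"time": slab.sizes["time"], "y": 350, "x": 350})

# Write the slab into its region of a pre-initialized target store;
# variables without a time dimension must be dropped for a region write.
i0 = int(np.argmax(ds["time"].values == slab["time"].values[0]))
static = [v for v in slab.variables if "time" not in slab[v].dims]
slab.drop_vars(static).to_zarr(
    "ldasout_rechunked.zarr",
    region={"time": slice(i0, i0 + slab.sizes["time"])},
)
```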

gzt5142 added the WIM F&E label Mar 6, 2023
amsnyder removed their assignment Mar 31, 2023
@amsnyder (Contributor, Author) commented Apr 3, 2023

@thodson-usgs will test out the current zarr data to see if creating a subset of rechunked data would be beneficial for evaluation.

thodson-usgs self-assigned this Apr 3, 2023
@thodson-usgs (Member)

While reviewing this issue, we were unsure of its purpose. After discussing, we believe the intent was to have an ARCO (analysis-ready, cloud-optimized) subset of NWM at gage locations (and daily averaged) so that we don't need to scan the entire dataset during our workflows.
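For concreteness, a sketch of what such a subset could look like, assuming a feature-rechunked CHRTOUT store with a gage_id coordinate (the paths and names here are illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.open_zarr("chrtout_rechunked.zarr")

# Keep only reaches with a co-located gage, then reduce hourly
# streamflow to daily means.
has_gage = np.char.strip(ds["gage_id"].values.astype(str)) != ""
daily = (
    ds["streamflow"]
    .isel(feature_id=np.flatnonzero(has_gage))
    .resample(time="1D")
    .mean()
)
daily.to_dataset().to_zarr("nwm_gage_daily.zarr", mode="w")
```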
