
Dev: summary stats of Reference Applications to feed IWAA National Assessment #124

Closed
rviger-usgs opened this issue Nov 10, 2022 · 19 comments

@rviger-usgs

rviger-usgs commented Nov 10, 2022

HyTEST would like to deliver software to the National Assessment project that provides summaries of PRMS and WRF-Hydro results per month, per HUC-12, by mid-Feb 2023. As part of that, I'm thinking we want to demo a prototype to that project team beforehand (hoping for December, but it could slide to January if we need it). We'd like to leverage existing HyTEST or NHGF code as the first priority, then other existing code, then new code.

This spreadsheet contains the variable names, statistics, and other details being requested. Thinking we should add a column to that so we can indicate which of the requested information we anticipate being able to deliver.

Although the data to be summarized doesn't exist yet, similarly formatted data from earlier modeling applications could be used. Thinking @jlafonta-usgs and @arezoorn might be able to help with locating some sample data.

@sfoks will be directing this, in coordination with @rsignell-usgs and @amsnyder. (not sure how or if I have privilege to assign things to people).
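
For illustration only, the per-month, per-HUC-12 summaries requested above could be sketched as follows. This is a minimal standard-library sketch, not the project's workflow: the long-format CSV layout, the column names (`date`, `huc12`, `value`), and the sample values are all assumptions made up for the example, and the real model output format was still being discussed in this thread.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def monthly_huc12_means(rows):
    """Group daily values by (huc12, 'YYYY-MM') and return the mean of each group."""
    groups = defaultdict(list)
    for row in rows:
        month = row["date"][:7]  # ISO date 'YYYY-MM-DD' -> 'YYYY-MM'
        groups[(row["huc12"], month)].append(float(row["value"]))
    return {key: mean(values) for key, values in sorted(groups.items())}


# Made-up sample standing in for model output (hypothetical layout and values).
sample_csv = """date,huc12,value
2020-01-01,010100020101,1.0
2020-01-02,010100020101,3.0
2020-02-01,010100020101,5.0
2020-01-01,010100020102,2.0
"""

summary = monthly_huc12_means(csv.DictReader(io.StringIO(sample_csv)))
for (huc12, month), value in summary.items():
    print(huc12, month, value)
```

Swapping `mean` for another aggregator (for example `sum`, `min`, or `max`) would cover other statistics of the same shape; HUC-12 codes are kept as strings so leading zeros survive.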

@sfoks sfoks self-assigned this Nov 15, 2022
@sfoks
Member

sfoks commented Nov 17, 2022

Tagging @tedstets-usgs here so we can begin notes on this. In our last conversation on this (the 11/10 Teams chat), we were thinking the HyTEST Eval & Comp Env team would be available to help with consulting, ideas, etc., and with development on dscore if needed (@thodson-usgs TBD). Additionally, HyTEST Eval would help run the summary statistics (column D, titled 'Summary Statistics') in the spreadsheet linked above. Any additions and changes can be logged here.

@tedstets-usgs

Thanks, @sfoks and @thodson-usgs. Looking forward to the ongoing discussions & interactions with HyTEST.

@jlafonta-usgs

@rviger-usgs, @sfoks, and @tedstets-usgs: the latest production run of the NHM-PRMS v1.1 GridMET, which has been updated through 2021, is located on Denali at /caldera/projects/usgs/water/wbeep/onhm_dev/historical/output_nhm_2.0.0/

@rviger-usgs
Author

rviger-usgs commented Nov 18, 2022

@jlafonta-usgs and @tedstets-usgs wondering if the "NHM-PRMS v1.1 GridMET" application should be treated as an enterprise asset, i.e., as a Reference Application?

@tedstets-usgs will you need to be using "NHM-PRMS v1.1 GridMET" in your report?

If the answer is "yes," then we might want to talk about that. Guessing Jacob would be planning to make a Data Release via ScienceBase but we might want to talk about any additional work around this in a separate ticket (e.g., should @sfoks apply current benchmark tools to that and publish a Data Release of the benchmark results?).

Even if the answer is "no," @sfoks and company can still prototype the statistical summary workflows/tools against that version.

Adding @amsnyder and @alamotte-usgs in case HyTEST data managers need to be involved.

@rviger-usgs
Author

@sfoks wanted to check my understanding of your text re: the summary statistics. I was presuming that you/your team would start development of a summary statistics type of workflow (notebook or R script or something like that) now, with the result that: 1) the prototype workflow is usable by Ted's team pretty much as soon as output from Jacob's 2 new Reference Applications is available (~end of Feb), and 2) we have a good working prototype that we can release (say w/in another FY quarter) for use by any other project.

You might be thinking about/meaning this but wanted to make sure that the effort wasn't just one-on-one consultation for Ted's project since that would not be so scalable/transferable to other projects. This consultation is a good thing but shouldn't be the whole thing.

@thodson-usgs
Member

thodson-usgs commented Nov 18, 2022

Just to chime in on summary stats, it might be good to have a call about this. We can generate whatever stats they desire, but there's a bit of a balancing act between the traditional and familiar (like RMSE and NSE) versus what's theoretically correct. For example, I've been benchmarking things in bit terms lately, as in the data are 64-bit floats but maybe only 46 of those bits are meaningful, the rest are noise/error. Bits give a fairly universal measure, but I also recognize that it can be helpful to decompose them into familiar things like RMSE, false positives, etc.
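
For readers less familiar with the two traditional metrics named above, here is a minimal sketch of their standard definitions (RMSE and Nash-Sutcliffe efficiency). The function names and sample values are made up for illustration; this is not the benchmark code used by HyTEST, and it does not attempt the bit-based measure described in the comment.

```python
import math


def _check(sim, obs):
    """Reject empty or mismatched series before computing a metric."""
    if len(sim) != len(obs) or len(obs) == 0:
        raise ValueError("sim and obs must be non-empty and the same length")


def rmse(sim, obs):
    """Root-mean-square error between simulated and observed values."""
    _check(sim, obs)
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(sim, obs)) / len(obs))


def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squared deviations of obs from its mean."""
    _check(sim, obs)
    obs_mean = sum(obs) / len(obs)
    sse = sum((s - o) ** 2 for s, o in zip(sim, obs))
    spread = sum((o - obs_mean) ** 2 for o in obs)
    return 1.0 - sse / spread  # undefined (division by zero) if obs is constant


# Made-up values for illustration only.
obs = [1.0, 2.0, 3.0, 4.0]
sim = [1.0, 2.0, 3.0, 6.0]
print(rmse(sim, obs), nse(sim, obs))
```

An NSE of 1 means the simulation matches the observations exactly, and values at or below 0 mean it does no better than predicting the observed mean.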

@rviger-usgs
Author

Thanks for chiming in, @thodson-usgs! w/@tedstets-usgs, yes, please let's figure out a plan there asap.

There's a practical end that needs to be balanced here. We're hoping to have a minimally sufficient (and not wrong/inappropriate) workflow available for Ted by Feb 15 or sooner. Ted's got constraints in terms of timeline and staff that might require that he and his team start with more basic information from an earlier date. I'm thinking that is reflected in the current wish list. If you all can iterate pretty quickly and figure out how much more they want to chew on and agree on substitutions to what they're asking for (updating that spreadsheet), please do!

We can definitely iterate and improve on the initial offering after that. Ted's team has several cycles of reports ahead of them, and I think we also want a plan for that longer timeframe.

@rviger-usgs
Author

rviger-usgs commented Nov 22, 2022

Couple of questions:

  1. @jlafonta-usgs, will the PRMS output for the National Assessment timeline be in the same format as the CSV files in /caldera/projects/usgs/water/wbeep/onhm_dev/historical/output_nhm_2.0.0/?
  2. for the group, what file format should the scripts that @sfoks will be writing run against for the output of the anticipated new CONUS404-forced PRMS model application?
  3. for the group, what file format should the scripts that @sfoks will be writing run against for the output of the anticipated new CONUS404-forced WRF-Hydro model application?

The point of the last two questions is that we have a kinda-sorta workflow for creating zarr versions of WRF-Hydro output (although itself based on a digest of the original output from WRF-Hydro), and it's not clear whether we have any kind of workflow for creating zarr versions of PRMS output. We're dealing with those in other issues (#6 and #71, respectively), but they're not done yet. If automating those workflows cannot happen quickly enough to ensure that Sydney can deliver her summary stats to Ted on or around Feb 15, then Sydney might need to read the native output formats directly (at least for the Feb 15th deadline). We're pushing to try to establish a prototype of Sydney's analysis in December and refine through January.

(Tagging @amsnyder since she's managing those other two issues I cited).

@jlafonta-usgs

@rviger-usgs we have existing tools to provide CSV or netCDF versions of the PRMS output. Native format from PRMS is CSV, but Parker Norton has done conversions from CSV to netCDF for previous archived applications of NHM-PRMS. As Parker Norton has been involved in converting PRMS CSV output to netCDF and converting WRF output to zarr format, I anticipate that PRMS output could also be converted to zarr format if needed, but that would require confirmation from Parker. At a minimum, PRMS output can be provided in CSV or netCDF as needed by post-processing algorithms.

@sfoks
Member

sfoks commented Nov 28, 2022

Linking this here for notes: https://github.com/USGS-R/ncdfgeom

@rviger-usgs
Author

> that would require confirmation from Parker

@pnorton-usgs can you comment on Jacob's point? (it's from 7 days ago).

@pnorton-usgs
Contributor

> @rviger-usgs we have existing tools to provide CSV or netCDF versions of the PRMS output. Native format from PRMS is CSV, but Parker Norton has done conversions from CSV to netCDF for previous archived applications of NHM-PRMS. As Parker Norton has been involved in converting PRMS CSV output to netCDF and converting WRF output to zarr format, I anticipate that PRMS output could also be converted to zarr format if needed, but that would require confirmation from Parker. At a minimum, PRMS output can be provided in CSV or netCDF as needed by post-processing algorithms.

@rviger-usgs @jlafonta-usgs for NHM v1.0 and v1.1 I have produced the netCDF format of the output variables from the original .CSV files. A few weeks ago I finished prototyping the code to kerchunk the NHM output netCDF files. This will provide the zarr functionality while keeping the original netCDF files; this should provide maximum flexibility for using R or Python to work with the output variables. I've prototyped with a single calibrated release run but could kerchunk the remaining calibrations fairly quickly.

@rviger-usgs
Author

@pnorton-usgs sounds great. does this mean we have a workflow whose TRL is high enough that it should be considered "production"? Or do we want to invest to get that prototype a bit more hardened so that someone in addition to you could run it?

@pnorton-usgs
Contributor

@rviger-usgs @jlafonta-usgs As it turns out I apparently wrote the script from the prototype a few weeks ago. I just finished kerchunking both NHM v1.0 and v1.1 model output variables. Should these datasets reside in the hytest_scratch area for on-prem?

@rviger-usgs
Author

I'll direct the question about where to store these data to @amsnyder and @alamotte-usgs.

@amsnyder
Contributor

I talked to @pnorton-usgs this morning and suggested he place a copy of these NHM datasets in the hytest_scratch area for now. If someone needs to work with these in the cloud, we can discuss making an additional copy, but I think we should avoid making extra copies of data that we need to maintain and paying cloud storage fees without a use case. @pnorton-usgs will let us know when he has shared the data and added it to our intake catalog.

@sfoks
Member

sfoks commented Jan 4, 2023

Thanks Parker!!!

> @pnorton-usgs sounds great. does this mean we have a workflow whose TRL [technology readiness level] is high enough that it should be considered "production"? Or do we want to invest to get that prototype a bit more hardened so that someone in addition to you could run it?

> @rviger-usgs @jlafonta-usgs As it turns out I apparently wrote the script from the prototype a few weeks ago. I just finished kerchunking both NHM v1.0 and v1.1 model output variables. [...]

Coming back to this point, @pnorton-usgs, where does this workflow reside?

@sfoks
Member

sfoks commented Jan 20, 2023

Example files on Caldera

nhm-prms (NHM-PRMS v1.1 GridMET run up through 2021):
/caldera/projects/usgs/water/wbeep/NHM/gf_v11/releases/gm_byHWobs_prms_5.2.1

wrfhydro:
/caldera/projects/usgs/water/impd/rcabell/huc12_monthly_wb_nwmv21_aorc.nc
/caldera/projects/usgs/water/impd/rcabell/huc12_monthly_wb_nwmv21_aorc.RDS

@amsnyder
Contributor

amsnyder commented Oct 6, 2023

Workflow is built and can be found here: https://github.com/hytest-org/workflow-2023-foks-nat-assessment

@amsnyder amsnyder closed this as completed Oct 6, 2023

7 participants