Adding a few rof notebooks #126

Merged
merged 91 commits into from
Nov 21, 2024

Conversation

nmizukami
Member

@nmizukami nmizukami commented Aug 23, 2024

Initial commits for ROF notebooks.

The full set of notebooks is ultimately intended to mimic the old ROF diagnostic plots.

This PR is just starting with a few notebooks.

All Submissions:

  • Have you followed the guidelines in our Contributor's Guide (including the pre-commit check)?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you linted your code locally prior to submission?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you successfully tested your changes locally?

@TeaganKing
Collaborator

Hey @nmizukami ! Thanks so much for adding these notebooks! I made those changes we discussed to cupid-run and the config file to include rof in running cupid and in the jupyter book table of contents.

One note: I think we'll need to provide math in cupid-analysis if you prefer to use that for sqrt rather than numpy in metrics.py? I don't think this should be a problem, but just let me know if you'd like me to do that?

Would you be able to pull these changes in locally, test running your notebook with cupid-run and make sure that things look as you expect?

@TeaganKing TeaganKing self-requested a review August 23, 2024 20:11
@TeaganKing TeaganKing added lnd enhancement New feature or request labels Aug 23, 2024
@nmizukami
Member Author

Hi Teagan (@TeaganKing), the notebook almost ran with cupid-run -rof. The one error came from reading a GeoPackage file (GIS vector data) via geopandas.

You can see /glade/work/mizukami/CUPiD/examples/coupled_model/month_annual_flow.ipynb.

cupid prints this on screen:

ERROR 1: PROJ: proj_create_from_database: /glade/u/apps/casper/23.10/spack/opt/spack/proj/8.2.1/gcc/12.2.0/7gif/share/proj/proj.db contains DATABASE.LAYOUT.VERSION.MINOR = 2 whereas a number >= 3 is expected. It comes from another PROJ installation.

I have seen this before. I don't fully understand the error, but it comes from the pyogrio package that was installed with geopandas; geopandas uses pyogrio internally.

When I ran the notebook outside cupid, it ran fine, but there I had activated the Python [conda-env:cupid-analysis] environment in JupyterHub. I see another one called cupid-analysis, which I believe is the one cupid actually uses. I saw a similar error when I used cupid-analysis. I am wondering what the difference is between Python [conda-env:cupid-analysis] and cupid-analysis?

Hopefully I am simply setting something up incorrectly, e.g., the environment...

@TeaganKing
Collaborator

Hi @nmizukami , Sorry I let this slip! In the environment in which this was working, did you have a particular version of geopandas or pyogrio pinned? I could also add that to the environment yaml specification. Or when you previously ran into this error, did you have another solution?

This error may be because PROJ is already installed-- I'm not sure where at this point, but can look into that.

@nmizukami
Member Author

nmizukami commented Aug 27, 2024

Hi @TeaganKing, one hint is that I can run it outside cupid-run: I can run the notebook manually on JupyterHub with the [conda-env:cupid-analysis] env selected, but NOT with cupid-analysis (I get a similar PROJ error). You can see the two similar envs in Jupyter in the image below. I believe the package versions should be OK. I can think about this more... I don't know what the difference is between [conda-env:cupid-analysis] and cupid-analysis.

Screen Shot 2024-08-27 at 12 07 05 PM

@TeaganKing
Collaborator

TeaganKing commented Aug 27, 2024

It sounds like there may be some issue related to the ipykernel installation. I think one of these might be the installation from ipykernel (a soft linked conda environment) and the other may be a conda environment found elsewhere (possibly an outdated cupid-analysis that doesn't include geopandas?). Mike mentioned that the ipykernel installation basically creates a softlink to an environment, which made me think that could be an inconsistency.

I had updated a test environment but not my actual cupid-analysis environment; I'm doing that now and will test your notebook out. This is probably not the most efficient workflow, but I wonder if it might also be worth removing your cupid-analysis environment, see if it's still listed as an option in JupyterHub, make sure that both versions are removed, and then re-install a clean version?
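One way to check which interpreter each Jupyter kernel entry actually launches is to read the `kernel.json` files in a kernels directory. The sketch below is only an illustration: the helper name `kernel_interpreters` is hypothetical, and the directory to inspect (for example `~/.local/share/jupyter/kernels`, the usual per-user location on Linux) is an assumption that may not match this JupyterHub setup.

```python
import json
from pathlib import Path


def kernel_interpreters(kernels_dir):
    """Map each kernel name under kernels_dir to the first entry of its argv."""
    result = {}
    for spec_file in sorted(Path(kernels_dir).glob("*/kernel.json")):
        spec = json.loads(spec_file.read_text())
        argv = spec.get("argv", [])
        # argv[0] is normally the Python executable the kernel starts
        result[spec_file.parent.name] = argv[0] if argv else None
    return result
```

Calling `kernel_interpreters(Path.home() / ".local/share/jupyter/kernels")` would list the kernels registered for the current user; an entry whose interpreter path points outside the expected conda environment would be a candidate for the stale kernel.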

@nmizukami
Member Author

I took the following steps in a terminal to remove the cupid-analysis env and reinstall it.

mamba remove --name cupid-analysis --all
mamba env create -f environments/cupid-analysis.yml

It did not fix it. After removing cupid-analysis, jupyterhub still showed cupid-analysis, though [conda-env:cupid-analysis] was gone.

@nmizukami
Member Author

Hi @TeaganKing, I am trying to run conda list to see which packages are in the cupid-analysis env when running cupid-run. Unfortunately, including conda list in the notebook causes an error in cupid-run:

SyntaxError: An error happened when checking the source code. 
:25:7: invalid syntax

conda list
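A bare `conda list` line is not valid Python, which is presumably why the source check rejects it. A pure-Python alternative for checking what is installed in the kernel's environment is `importlib.metadata` from the standard library. This is a minimal sketch; the helper name `package_versions` and the choice of packages to query are illustrative assumptions.

```python
from importlib import metadata


def package_versions(names):
    """Return {name: version or None} for the given distribution names."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this kernel's environment
    return versions


print(package_versions(["geopandas", "pyogrio", "pyproj"]))
```

Printing `sys.executable` in the same cell would additionally show which environment the kernel is running from.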

@nmizukami
Member Author

casper-login1:/glade/work/mizukami/CUPiD/examples/coupled_model (main_adding_rof)> cupid-run -rof

/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/ploomber/dag/dag.py:455: UserWarning: 
========================================================================================= DAG render with warnings =========================================================================================
----------------------------------------------------------------- NotebookRunner: index -> File('computed_notebook...ucture/index.ipynb') ------------------------------------------------------------------
----------------------------------------------------------------- /glade/work/mizukami/CUPiD/examples/nblibrary/infrastructure/index.ipynb -----------------------------------------------------------------
These parameters are not used in the task's source code: 'CESM_output_dir', 'lc_kwargs', 'serial', and 'subset_kwargs'
----------------------------------------------------------- NotebookRunner: month_annual_flow -> File('computed_notebook..._annual_flow.ipynb') ------------------------------------------------------------
---------------------------------------------------------------- /glade/work/mizukami/CUPiD/examples/nblibrary/rof/month_annual_flow.ipynb -----------------------------------------------------------------
These parameters are not used in the task's source code: 'CESM_output_dir', 'lc_kwargs', 'serial', and 'subset_kwargs'
============================================================================================ Summary (2 tasks) =============================================================================================
NotebookRunner: index -> File('computed_notebook...ucture/index.ipynb')
NotebookRunner: month_annual_flow -> File('computed_notebook..._annual_flow.ipynb')
========================================================================================= DAG render with warnings =========================================================================================

  warnings.warn(str(warnings_))
Executing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.03cell/s]
Building task 'month_annual_flow':  50%|███████████████████████████████████████████████████████████████████                                                                   | 1/2 [00:02<00:02,  2.92s/itERROR 1: PROJ: proj_create_from_database: /glade/u/apps/casper/23.10/spack/opt/spack/proj/8.2.1/gcc/12.2.0/7gif/share/proj/proj.db contains DATABASE.LAYOUT.VERSION.MINOR = 2 whereas a number >= 3 is expected. It comes from another PROJ installation.
                                                                                                                                                                                                           /glade/u/apps/opt/conda/condabin/conda                                                                                                                                      | 5/69 [00:20<03:39,  3.44s/cell]
Executing:  90%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍               | 62/69 [03:53<00:26,  3.76s/cell]
Building task 'month_annual_flow': 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [03:56<00:00, 118.06s/it]
Traceback (most recent call last):
  File "/glade/work/mizukami/conda-envs/cupid-dev/bin/cupid-run", line 8, in <module>
    sys.exit(run())
             ^^^^^
  File "/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/glade/work/mizukami/CUPiD/cupid/run.py", line 290, in run
    dag.build()
  File "/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/ploomber/dag/dag.py", line 557, in build
    report = callable_()
             ^^^^^^^^^^^
  File "/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/ploomber/dag/dag.py", line 662, in _build
    raise build_exception
  File "/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/ploomber/dag/dag.py", line 591, in _build
    task_reports = self._executor(dag=self, show_progress=show_progress)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/ploomber/executors/serial.py", line 203, in __call__
    raise DAGBuildError(str(exceptions_all))
ploomber.exceptions.DAGBuildError: 
============================================================================================= DAG build failed =============================================================================================
----------------------------------------------------------- NotebookRunner: month_annual_flow -> File('computed_notebook..._annual_flow.ipynb') ------------------------------------------------------------
---------------------------------------------------------------- /glade/work/mizukami/CUPiD/examples/nblibrary/rof/month_annual_flow.ipynb -----------------------------------------------------------------
---------------------------------------------------------------------------
Exception encountered at "In [24]":
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[24], line 2
      1 column_stat = []
----> 2 gauge_shp_all_case = gauge_shp.copy(deep=True)
      3 for case, grid_name in cases.items():
      4     gauge_shp_all_case = gauge_shp_all_case.merge(
      5         gauge_shp1[case][["id", f"{error_metric}_{grid_name}"]],
      6         left_on="id",
      7         right_on="id",
      8     )

NameError: name 'gauge_shp' is not defined

ploomber.exceptions.TaskBuildError: Error when executing task 'month_annual_flow'. Partially executed notebook available at /glade/work/mizukami/CUPiD/examples/coupled_model/computed_notebooks/quick-run/rof/month_annual_flow.ipynb
ploomber.exceptions.TaskBuildError: Error building task "month_annual_flow"
============================================================================================= Summary (1 task) =============================================================================================
NotebookRunner: month_annual_flow -> File('computed_notebook..._annual_flow.ipynb')
============================================================================================= DAG build failed =============================================================================================

@nmizukami
Member Author

nmizukami commented Aug 29, 2024

Hi @TeaganKing,
Small good news: I got it to run without the geopandas error. The trick is to add this
os.environ['PROJ_LIB']='/glade/work/mizukami/conda-envs/cupid-analysis/share/proj'
before loading geopandas.
However, I don't think this is a permanent solution. I will still try to consult with CISL.
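A variant of this workaround that avoids a user-specific path could derive the directory from the running interpreter's environment. The sketch below assumes that `proj.db` lives under `<env prefix>/share/proj`, which is where conda packages usually place it but is not guaranteed; the helper names are hypothetical. It also sets `PROJ_DATA`, the name newer PROJ releases use for the same setting.

```python
import os
import sys
from pathlib import Path


def proj_data_dir(prefix=sys.prefix):
    """Return <prefix>/share/proj, the usual conda location of proj.db (assumed)."""
    return Path(prefix) / "share" / "proj"


def point_proj_at_env(prefix=sys.prefix, environ=os.environ):
    """Set PROJ_LIB and PROJ_DATA if <prefix>/share/proj contains proj.db."""
    data_dir = proj_data_dir(prefix)
    if (data_dir / "proj.db").is_file():
        environ["PROJ_LIB"] = str(data_dir)
        environ["PROJ_DATA"] = str(data_dir)
        return True
    return False  # leave the environment untouched if nothing was found


point_proj_at_env()
# import geopandas  # import only after the variables are set
```

This still only masks the underlying conflict with the system PROJ installation, so it does not replace sorting out the environment itself.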

I was able to create /glade/work/mizukami/CUPiD/examples/coupled_model/computed_notebooks/quick-run/_build/html/index.html
How do you usually open this on the HPC systems? I tried opening Firefox on derecho/casper, but it is very slow. I wonder if there are other ways to view it.

@TeaganKing
Collaborator

Hi @nmizukami , I'm glad that is temporarily working (but of course we need this to work for any user's environment). Yes, I think this would be a good conversation to have with CISL.

Regarding looking at output, see the second section on this page for recommendations on NCAR machines.

@TeaganKing TeaganKing mentioned this pull request Sep 10, 2024
6 tasks
@TeaganKing
Collaborator

TeaganKing commented Sep 10, 2024

Hey @nmizukami , I added a PR to bring rof into run.py. And then I realized these changes are already in this PR... so apologies-- feel free to ignore that!

@TeaganKing
Collaborator

TeaganKing commented Sep 10, 2024

To-do:

  • update readme to include 'rof' on line 104: -rof, --river-runoff Run river runoff component diagnostics

@TeaganKing
Collaborator

Thanks for clarifying-- the way you are using LocalCluster works!

For the notebook names, would these work? Or if you prefer, maybe you could just take out _compare_obs if observations are only available sometimes?

month_annual_flow.ipynb -> global_discharge_gauge_compare_obs.ipynb
ocean_discharge.ipynb -> global_discharge_ocean_compare_obs.ipynb

RE plotting in line-- I think that's fine if the plots show up when observations are available. Since the default is save_figs=False, I think this is reasonable.

Did you want to take out the coupled_model config file changes, per our discussion on updating the config file to a similar format as the key_metrics config file?

@TeaganKing
Collaborator

Thanks for making these changes, @nmizukami ! These notebook names are clearer now. I also removed the jupyter book build bit for the rof notebooks that are no longer being computed in the coupled model config file. Both notebooks run on the order of 1min now!

@TeaganKing
Collaborator

I think the only outstanding comment I'm concerned about is the inclusion of personal directories-- which may need to be pushed to a future issue ticket when we have a more clear location for hosting observational datasets. @mnlevy1981 did you want to take a quick look at this, as well?

Collaborator

@mnlevy1981 mnlevy1981 left a comment


I spent some time trying to reduce the number of lines in metrics.py before realizing you never import it.

Also, global_discharge_gauge_compare_obs.ipynb ran successfully but

  1. Cell [14] (in 3.2. Annual cycle (at monthly step)) is the last cell that produces output; cells 15 - 21 have the %%time output but no plots or tables
  2. Did some metadata cells get moved around? I see the following sections in order:
    3. Analysis for 24 large rivers
    3.1 Annual flow series
    3.2. Annual cycle (at monthly step)
    3.3. scatter plots of monthly flow
    3.4. scatter plots of annual flow
    4. Anaysis for Large 50 rivers
    4.1 Summary tables
    4.2. scatter plot of annual mean flow
    
    3. Anaysis for all 922 sites
    3.1 Compute metris at all the sites (no plots nor tables)
    4.2. Spatial metric map
    4.3 Boxplots of Error metrics (RMSE, %bias, and correlation coefficient)
    

(Maybe the reason I could delete metrics.py is because the cells that should call those functions aren't being invoked correctly?)

I haven't had a chance to look at colors.py or utility.py yet, and if it turns out metrics.py is supposed to be called I can go look closer to see if more functions can be replaced with existing numpy calls

@nmizukami
Member Author

I spent some time trying to reduce the number of lines in metrics.py before realizing you never import it.

Thanks for looking at the scripts. Right, it is not imported. I was going to use Cell [19] in global_discharge_gauge_compare_obs.ipynb, where bias, RMSE, and corr are computed.

I included metrics.py even though it is not used. It computes error metrics as well as streamflow metrics (high flow, low flow, extreme events, etc.), which might be useful (they are used often in hydrology, at least for rivers).

I can leave it out if it is confusing now, and add it later when (if) needed.
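For reference, the three error metrics mentioned here can be computed with plain numpy calls. This is a hedged sketch, not the code from Cell [19] or metrics.py: the function name `error_metrics` is hypothetical, and percent bias is taken as 100 * sum(sim - obs) / sum(obs), which is one common definition.

```python
import numpy as np


def error_metrics(sim, obs):
    """Return percent bias, RMSE, and Pearson correlation for paired series."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    valid = ~(np.isnan(sim) | np.isnan(obs))  # drop pairs with a missing value
    sim, obs = sim[valid], obs[valid]
    diff = sim - obs
    return {
        "pbias": float(100.0 * diff.sum() / obs.sum()),
        "rmse": float(np.sqrt(np.mean(diff**2))),
        "corr": float(np.corrcoef(sim, obs)[0, 1]),
    }
```

Keeping the calculation this small makes it easy to inline in a notebook cell, which is one argument for not shipping a separate unused module.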

Also, global_discharge_gauge_compare_obs.ipynb ran successfully but

  1. Cell [14] (in 3.2. Annual cycle (at monthly step)) is the last cell that produces output; cells 15 - 21 have the %%time output but no plots or tables

Yes, Cells 15 through 21 are skipped if no observations are available. In this case, the simulation period is years 0001 - 0101, so it is outside the period when observations are available. These cells compute errors against observations.

In those cells, the second line is if obs_available:, and obs_available is False when no observations are available.
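The guard described above could look roughly like the following. This is a sketch under stated assumptions: the helper `periods_overlap` and the observation years are hypothetical, and the notebook may determine `obs_available` differently.

```python
def periods_overlap(sim_start, sim_end, obs_start, obs_end):
    """True when the inclusive year ranges share at least one year."""
    return sim_start <= obs_end and obs_start <= sim_end


# Hypothetical observation record; the simulation years follow the comment above.
obs_available = periods_overlap(1, 101, 1900, 2020)

if obs_available:
    print("compute bias, RMSE, and correlation against observations")
else:
    print("no overlapping observations; skipping error metrics")
```

With simulation years 1 to 101 the ranges do not overlap, so the error-metric cells produce no plots or tables, which matches the behavior reported in the review.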

  1. Did some metadata cells get moved around? I see the following sections in order:
    3. Analysis for 24 large rivers
    3.1 Annual flow series
    3.2. Annual cycle (at monthly step)
    3.3. scatter plots of monthly flow
    3.4. scatter plots of annual flow
    4. Anaysis for Large 50 rivers
    4.1 Summary tables
    4.2. scatter plot of annual mean flow
    
    3. Anaysis for all 922 sites
    3.1 Compute metris at all the sites (no plots nor tables)
    4.2. Spatial metric map
    4.3 Boxplots of Error metrics (RMSE, %bias, and correlation coefficient)
    

Yes, this needs to be updated. But I have been wondering if this is needed? Also, right now each section has a link to the corresponding cell. I don't think the links are useful here?

(Maybe the reason I could delete metrics.py is because the cells that should call those functions aren't being invoked correctly?)

I haven't had a chance to look at colors.py or utility.py yet, and if it turns out metrics.py is supposed to be called I can go look closer to see if more functions can be replaced with existing numpy calls

utility.py and colors.py are imported (but not all the functions), but metrics.py is not called.

So what I could do is include only the functions that are used? These scripts come from the scripts I use for my other notebooks, so if they become useful, they could be added later.

@mnlevy1981
Collaborator

2. Did some metadata cells get moved around? I see the following sections in order:
```
3. Analysis for 24 large rivers
3.1 Annual flow series
3.2. Annual cycle (at monthly step)
3.3. scatter plots of monthly flow
3.4. scatter plots of annual flow
4. Anaysis for Large 50 rivers
4.1 Summary tables
4.2. scatter plot of annual mean flow

3. Anaysis for all 922 sites
3.1 Compute metris at all the sites (no plots nor tables)
4.2. Spatial metric map
4.3 Boxplots of Error metrics (RMSE, %bias, and correlation coefficient)
```

Yes, this needs to be updated. But I have been wondering if this is needed? Also, right now each section has a link to the corresponding cell. I don't think the links are useful here?

I would definitely encourage you to have some markdown cells that describe what plots are being generated. I don't think they need to be organized like a paper with sections and subsections. I hadn't noticed the links from the top of the notebook to each specific section until you mentioned it... let me look at how cupid-build parses those links before we discuss keeping them or removing them.

utility.py and colors.py are imported (but not all the functions), but metrics.py is not called.

So what I could do is include only the functions that are used? These scripts come from the scripts I use for my other notebooks, so if they become useful, they could be added later.

Yes, please remove functions that are not being used (and then I'm happy to review whatever is left). One reason to provide modules like utility.py and colors.py is to potentially share functions with other notebooks, but if these modules contain code that isn't used anywhere then there is no guarantee that they actually work as advertised.

@nmizukami
Member Author

  1. Did some metadata cells get moved around? I see the following sections in order:
3. Analysis for 24 large rivers
3.1 Annual flow series
3.2. Annual cycle (at monthly step)
3.3. scatter plots of monthly flow
3.4. scatter plots of annual flow
4. Anaysis for Large 50 rivers
4.1 Summary tables
4.2. scatter plot of annual mean flow
3. Anaysis for all 922 sites
    3.1 Compute metris at all the sites (no plots nor tables)
    4.2. Spatial metric map
    4.3 Boxplots of Error metrics (RMSE, %bias, and correlation coefficient)

Yes, this needs to be updated. But I have been wondering if this is needed? Also, right now each section has a link to the corresponding cell. I don't think the links are useful here?

I would definitely encourage you to have some markdown cells that describe what plots are being generated. I don't think they need to be organized like a paper with sections and subsections. I hadn't noticed the links from the top of the notebook to each specific section until you mentioned it... let me look at how cupid-build parses those links before we discuss keeping them or removing them.

utility.py and colors.py are imported (but not all the functions), but metrics.py is not called.
So what I could do is include only the functions that are used? These scripts come from the scripts I use for my other notebooks, so if they become useful, they could be added later.

Yes, please remove functions that are not being used (and then I'm happy to review whatever is left). One reason to provide modules like utility.py and colors.py is to potentially share functions with other notebooks, but if these modules contain code that isn't used anywhere then there is no guarantee that they actually work as advertised.

OK, I will clean up the scripts (sorry for the confusion) and the markdown cells.

@nmizukami
Member Author

nmizukami commented Nov 14, 2024

Hi @mnlevy1981, I think the scripts are clean now, and I fixed the errors you pointed out in the notebooks. The markdown cells are also updated. The notebooks still run OK, though I saw lots of exceptions from dask distributed.

"domain_dir = setup[\n",
" \"ancillary_dir\"\n",
"] # ancillary directory including ROF domain, river network data etc.\n",
"geospatial_dir = setup[\"ancillary_dir\"] # including shapefiles etc\n",
Collaborator


Since geospatial_dir also appears in setup, can we rename this variable to ancillary_dir?

@TeaganKing
Collaborator

Once the comments regarding ancillary data variable naming and personal directories are resolved, we can merge this.

@nmizukami
Member Author

Hi @TeaganKing and @mnlevy1981 , I have updated the notebooks and setup.yaml based on your comments. thanks!

@TeaganKing
Collaborator

Great, thanks so much @nmizukami ! This looks good to me. I think merging is blocked due to Mike's change request-- @mnlevy1981 can you please check that this addressed your comments as well and then merge this in?

Collaborator

@mnlevy1981 mnlevy1981 left a comment


Looks great, thanks for the contribution!

@mnlevy1981 mnlevy1981 merged commit d3d7971 into NCAR:main Nov 21, 2024
2 checks passed
@nmizukami nmizukami deleted the main_adding_rof branch November 21, 2024 16:32