Commit
Merge pull request #98 from zyhu-hu/zeyuanhu/online_testing_tomerge
jbusecke authored Jul 25, 2024
2 parents 8deb602 + 73948c1 commit 843e1b3
Showing 80 changed files with 18,018 additions and 68 deletions.
3 changes: 2 additions & 1 deletion .gitignore
```diff
@@ -2,7 +2,7 @@ website/_build/
 website/demo_notebooks
 website/evaluation
 website/figures
-website/downstream_test
+website/online_testing
 website/README.md
 .DS_Store
@@ -85,6 +85,7 @@ target/
 
 # Jupyter Notebook
 .ipynb_checkpoints
+hu_notebooks/
 
 # IPython
 profile_default/
```
6 changes: 3 additions & 3 deletions README.md
```diff
@@ -26,10 +26,10 @@ We implement a range of deterministic and stochastic regression baselines to hig
 * [Multi-Layer Perceptron (MLP) Example](https://leap-stc.github.io/ClimSim/demo_notebooks/mlp_example.html)
 * [Convolutional Neural Network (CNN) Example](https://leap-stc.github.io/ClimSim/demo_notebooks/cnn_example.html)
 * [Water Conservation Example](https://leap-stc.github.io/ClimSim/demo_notebooks/water_conservation.html)
 
-## Online Testing
-
-* [Online Testing](https://github.com/leap-stc/ClimSim/tree/online_testing/downstream_test)
+## Online Testing
+
+* [Online Testing](https://github.com/leap-stc/ClimSim/online_testing.html)
 
 ## Project Structure
```
516 changes: 452 additions & 64 deletions climsim_utils/data_utils.py

Large diffs are not rendered by default.

101 changes: 101 additions & 0 deletions online_testing/README.md
@@ -0,0 +1,101 @@
# Hybrid E3SM-MMF-NN-Emulator Simulation and Online Evaluation

## Table of Contents

1. [Problem overview](#1-problem-overview)
2. [Data preparation](#2-data-preparation)
   1. [Data download](#21-data-download)
   2. [Combine raw data into a few single files](#22-combine-raw-data-into-a-few-single-files)
3. [Model training](#3-model-training)
   1. [General requirements](#31-general-requirements)
   2. [Training scripts of our baseline online models](#32-training-scripts-of-our-baseline-online-models)
4. [Model post-processing: create wrapper for the trained model to include any normalization and de-normalization](#4-model-post-processing-create-wrapper-for-the-trained-model-to-include-any-normalization-and-de-normalization)
5. [Run hybrid E3SM MMF-NN-Emulator simulations](#5-run-hybrid-e3sm-mmf-nn-emulator-simulations)
6. [Evaluation of hybrid simulations](#6-evaluation-of-hybrid-simulations)

## 1. Problem overview
The ultimate goal of training an ML emulator (of the cloud-resolving model embedded in the E3SM-MMF climate simulator) on the ClimSim dataset is to couple it to the host E3SM climate simulator and evaluate the performance of the resulting hybrid ML-physics simulation, e.g., whether the hybrid simulation can reproduce the statistics of the pure-physics simulation. We use "online" to denote this task of performing and evaluating hybrid simulations, in contrast to the "offline" task of training an ML model. Here we describe the entire workflow of training the baseline models and then running and evaluating the hybrid simulation. We provide a few baseline models that we trained and optimized for the online task; these pretrained models include the MLP and U-Net models from the [Stable Machine-Learning Parameterization](https://arxiv.org/abs/2407.00124) paper.

Refer to the [ClimSim-Online paper](https://arxiv.org/abs/2306.08754) for an overview of the online task and to the [Stable Machine-Learning Parameterization](https://arxiv.org/abs/2407.00124) paper for more details on the example baseline models we provide.

---

## 2. Data preparation

### 2.1 Data download

We take the low-resolution dataset as an example. Download either the [Low-Resolution Real Geography](https://huggingface.co/datasets/LEAP/ClimSim_low-res) or the [Low-Resolution Real Geography Expanded](https://huggingface.co/datasets/LEAP/ClimSim_low-res-expanded) dataset from Hugging Face. The expanded version includes additional input features, such as large-scale forcings and convection memory (state tendencies from previous time steps), that we used in our pretrained U-Net models (refer to [this paper](https://arxiv.org/abs/2407.00124) for more details).

Please don't use the current preprocessed [Subsampled Low-Resolution Data](https://huggingface.co/datasets/LEAP/subsampled_low_res), which does not include cloud and wind tendencies among its target variables. For online testing, the ML model must predict not only temperature and moisture tendencies but also the cloud and wind tendencies.

If you would like to work on the [High-Resolution Dataset](https://huggingface.co/datasets/LEAP/ClimSim_high-res) and also want to expand the input features, you can follow [this notebook](./data_preparation/adding_input_feature.ipynb), which illustrates how we created the expanded input features from the original low-resolution dataset.
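
If you prefer a scripted download over cloning the repository, one option is the `huggingface_hub` Python API; a minimal sketch, assuming the low-resolution dataset and a placeholder local directory:

```python
from huggingface_hub import snapshot_download

# Download the low-resolution ClimSim dataset from Hugging Face.
# The dataset is large, so point local_dir at a scratch/storage path.
snapshot_download(
    repo_id="LEAP/ClimSim_low-res",
    repo_type="dataset",
    local_dir="/path/to/ClimSim_low-res",  # placeholder path
)
```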

### 2.2 Combine raw data into a few single files

The raw data consist of a large number of individual files output at each E3SM model time step. We need to aggregate these individual files into a few files containing data arrays for efficient training.

Take our MLP baseline model (from the [Stable Machine-Learning Parameterization](https://arxiv.org/abs/2407.00124) paper) as an example: run the [create_dataset_example_v2rh.ipynb](./data_preparation/create_dataset/create_dataset_example_v2rh.ipynb) notebook to prepare the input/output files for the MLP_v2rh model.

If you want to reproduce the U-Net models from the [Stable Machine-Learning Parameterization](https://arxiv.org/abs/2407.00124) paper, run the [create_dataset_example_v4.ipynb](./data_preparation/create_dataset/create_dataset_example_v4.ipynb) notebook to prepare the input/output files for the Unet_v4 model, or the [create_dataset_example_v5.ipynb](./data_preparation/create_dataset/create_dataset_example_v5.ipynb) notebook for the Unet_v5 model. 'v4' is the unconstrained U-Net and 'v5' is the constrained U-Net; please refer to the original paper for more details. A minimal sketch of the aggregation step is shown below.
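
For orientation only, here is a minimal sketch of what the aggregation notebooks do, assuming per-timestep NetCDF files; the file pattern and variable lists are hypothetical placeholders (the real notebooks use the full ClimSim feature sets and `climsim_utils.data_utils`):

```python
import glob

import numpy as np
import xarray as xr

def to_features(ds: xr.Dataset, varnames: list) -> np.ndarray:
    """Flatten the chosen variables of one snapshot into (num_columns, num_features)."""
    cols = []
    for v in varnames:
        da = ds[v]
        if "lev" in da.dims:
            da = da.transpose("ncol", "lev")  # put the column dimension first
        cols.append(da.values.reshape(ds.sizes["ncol"], -1))
    return np.concatenate(cols, axis=1)

input_vars = ["state_t", "state_q0001"]   # hypothetical subset of input variables
target_vars = ["ptend_t", "ptend_q0001"]  # hypothetical subset of target variables

inputs, targets = [], []
for path in sorted(glob.glob("raw_data/*.mli.*.nc")):  # one file per model time step
    with xr.open_dataset(path) as ds:
        inputs.append(to_features(ds, input_vars))
        targets.append(to_features(ds, target_vars))

# Aggregate all time steps into a few arrays for efficient training.
np.save("train_input.npy", np.concatenate(inputs, axis=0))
np.save("train_target.npy", np.concatenate(targets, axis=0))
```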

---

## 3. Model training

### 3.1 General requirements

To be able to couple your trained NN model to E3SM seamlessly, you need to be aware of the following requirements before training your NN model:

- Your NN model must be saved in TorchScript format. Converting a PyTorch model into TorchScript is straightforward, and our training scripts include the code to save the model in TorchScript format. You can also refer to the [official TorchScript documentation](https://pytorch.org/docs/stable/jit.html) for more details.
- Your NN model's forward method should take an input tensor of shape (batch_size, num_input_features) and return an output tensor of shape (batch_size, num_output_features). The output feature dimension must have length ```num_output_features = 368``` and contain the following variables in this order: ```'ptend_t', 'ptend_q0001', 'ptend_q0002', 'ptend_q0003', 'ptend_u', 'ptend_v', 'cam_out_NETSW', 'cam_out_FLWDS', 'cam_out_PRECSC', 'cam_out_PRECC', 'cam_out_SOLS', 'cam_out_SOLL', 'cam_out_SOLSD', 'cam_out_SOLLD'```. The ptend variables are vertical profiles of tendencies of atmospheric state variables, each with a length of 60. A minimal sketch of a conforming model is shown after this list.
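
The following toy model illustrates both requirements. The architecture and the `num_input_features` value are placeholders, not one of our baselines:

```python
import torch
import torch.nn as nn

class TinyEmulator(nn.Module):
    """Toy stand-in for an emulator; not one of the pretrained baseline models."""

    def __init__(self, num_input_features: int = 1525, num_output_features: int = 368):
        super().__init__()
        # 6 tendency profiles x 60 levels + 8 scalar surface outputs = 368 features.
        self.net = nn.Sequential(
            nn.Linear(num_input_features, 512),
            nn.ReLU(),
            nn.Linear(512, num_output_features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, num_input_features) -> (batch_size, num_output_features)
        return self.net(x)

model = TinyEmulator()
scripted = torch.jit.script(model)  # convert to TorchScript
scripted.save("model.pt")           # this .pt file is what E3SM loads
```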

### 3.2 Training scripts of our baseline online models

We provide the training scripts under the ```online_testing/baseline_models/``` directory. Under the folder of each baseline model, the ```slurm``` folder contains slurm scripts to run the training job.

For example, to train the MLP model (with a huber loss and a 'step' lr scheduler), you can run the following command:
```bash
cd online_testing/baseline_models/MLP_v2rh/training/slurm/
sbatch v2rh_mlp_nonaggressive_cliprh_huber_step_3l_lr1em3.sbatch
```

The training reads the default configuration arguments listed in ```training/conf/config_single.yaml```. You need to change a few path arguments in config_single.yaml to the paths on your machine, or overwrite those paths in the slurm job scripts. By default, the training slurm scripts request 4 GPUs; you can change the number of GPUs in the slurm scripts.

Training requires the [Modulus library](https://docs.nvidia.com/deeplearning/modulus/getting-started/index.html). We used the Modulus container image for the training environment; you can download the latest version by following the instructions on the [Modulus website](https://docs.nvidia.com/deeplearning/modulus/getting-started/index.html). For reproducibility, we used version ```nvcr.io/nvidia/modulus/modulus:24.01```. If you don't want to use a container, you can instead run

```bash
pip install nvidia-modulus
```
to install on any system, although we recommend the container for best results.

---

## 4. Model post-processing: create wrapper for the trained model to include any normalization and de-normalization

The E3SM MMF-NN-Emulator code expects the NN model to take un-normalized input features and to produce un-normalized output features. The notebooks in the ```./model_postprocessing``` directory show how to create a wrapper for our pretrained MLP and U-Net models that moves pre/post-processing, such as normalization and de-normalization, inside the forward method of the TorchScript model.

For example, the [v5_nn_wrapper.ipynb](./model_postprocessing/v5_nn_wrapper.ipynb) notebook shows how to create a wrapper for the U-Net model to read raw input features, calculate additional needed input features, normalize the input, clip input values, pass them to the U-Net model, de-normalize the output features, and apply the temperature-based liquid-ice cloud partitioning.
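
Conceptually, such a wrapper looks like the minimal sketch below. The normalization convention mirrors the training code (inputs standardized with `input_sub`/`input_div`, predictions divided by `out_scale` to recover physical units); the class and buffer names are illustrative, not the notebooks' API:

```python
import torch
import torch.nn as nn

class WrappedEmulator(nn.Module):
    """Illustrative wrapper: normalization and de-normalization live inside forward."""

    def __init__(self, core: nn.Module, input_sub, input_div, out_scale):
        super().__init__()
        self.core = core
        # Register statistics as buffers so they are stored inside the TorchScript file.
        self.register_buffer("input_sub", torch.as_tensor(input_sub, dtype=torch.float32))
        self.register_buffer("input_div", torch.as_tensor(input_div, dtype=torch.float32))
        self.register_buffer("out_scale", torch.as_tensor(out_scale, dtype=torch.float32))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = (x - self.input_sub) / self.input_div  # normalize raw inputs
        x = torch.nan_to_num(x, nan=0.0, posinf=0.0, neginf=0.0)
        y = self.core(x)                           # normalized predictions
        return y / self.out_scale                  # back to physical units

# wrapped = torch.jit.script(WrappedEmulator(core, sub, div, scale))
# wrapped.save("wrapped_model.pt")
```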

---

## 5. Run hybrid E3SM MMF-NN-Emulator simulations

Please follow the instructions in the [ClimSim-Online repository](https://github.com/leap-stc/climsim-online/tree/main) to set up the container environment and run the hybrid simulation.

Please check the [NVlabs/E3SM MMF-NN-Emulator repository](https://github.com/zyhu-hu/E3SM_nvlab/tree/cleaner_workflow_tomerge/climsim_scripts) to learn about the configurations and namelist variables of the E3SM MMF-NN-Emulator version.

---

## 6. Evaluation of hybrid simulations

The notebooks in the ```./evaluation``` directory show how to reproduce the plots in the [Stable Machine-Learning Parameterization](https://arxiv.org/abs/2407.00124) paper. The data required by these evaluation/visualization notebooks can be downloaded from [Stable Machine-Learning Parameterization: Zenodo Data](https://zenodo.org/records/12797811).

---

## Author
- Zeyuan Hu, Harvard University

## References

- [ClimSim-Online: A Large Multi-scale Dataset and Framework for Hybrid ML-physics Climate Emulation](https://arxiv.org/abs/2306.08754)
- [Stable Machine-Learning Parameterization of Subgrid Processes with Real Geography and Full-physics Emulation](https://arxiv.org/abs/2407.00124)
142 changes: 142 additions & 0 deletions online_testing/baseline_models/MLP_v2rh/training/climsim_datapip.py
@@ -0,0 +1,142 @@
```python
import numpy as np
import torch
from torch.utils.data import Dataset

class climsim_dataset(Dataset):
    def __init__(self,
                 input_paths,
                 target_paths,
                 input_sub,
                 input_div,
                 out_scale,
                 qinput_prune,
                 output_prune,
                 strato_lev,
                 qc_lbd,
                 qi_lbd,
                 decouple_cloud=False,
                 aggressive_pruning=False,
                 strato_lev_qc=30,
                 strato_lev_qinput=None,
                 strato_lev_tinput=None,
                 strato_lev_out=12,
                 input_clip=False,
                 input_clip_rhonly=False):
"""
Args:
input_paths (str): Path to the .npy file containing the inputs.
target_paths (str): Path to the .npy file containing the targets.
input_sub (np.ndarray): Input data mean.
input_div (np.ndarray): Input data standard deviation.
out_scale (np.ndarray): Output data standard deviation.
qinput_prune (bool): Whether to prune the input data.
output_prune (bool): Whether to prune the output data.
strato_lev (int): Number of levels in the stratosphere.
qc_lbd (np.ndarray): Coefficients for the exponential transformation of qc.
qi_lbd (np.ndarray): Coefficients for the exponential transformation of qi.
"""
        self.inputs = np.load(input_paths)
        self.targets = np.load(target_paths)
        self.input_paths = input_paths
        self.target_paths = target_paths
        self.input_sub = input_sub
        self.input_div = input_div
        self.out_scale = out_scale
        self.qinput_prune = qinput_prune
        self.output_prune = output_prune
        self.strato_lev = strato_lev
        self.qc_lbd = qc_lbd
        self.qi_lbd = qi_lbd
        self.decouple_cloud = decouple_cloud
        self.aggressive_pruning = aggressive_pruning
        self.strato_lev_qc = strato_lev_qc
        self.strato_lev_out = strato_lev_out
        self.input_clip = input_clip
        # The default strato_lev_qinput is None, so check for None before
        # comparing with 0; fall back to strato_lev in either case.
        if strato_lev_qinput is None or strato_lev_qinput < 0:
            self.strato_lev_qinput = strato_lev
        else:
            self.strato_lev_qinput = strato_lev_qinput
        self.strato_lev_tinput = strato_lev_tinput
        self.input_clip_rhonly = input_clip_rhonly

        if self.strato_lev_qinput < self.strato_lev:
            raise ValueError('strato_lev_qinput should be greater than or equal to strato_lev, otherwise inconsistent with E3SM')

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        x = self.inputs[idx]
        y = self.targets[idx]
        # Exponential transformation of the cloud inputs: qc (indices 120:180)
        # and qi (indices 180:240) are mapped to [0, 1).
        x[120:180] = 1 - np.exp(-x[120:180] * self.qc_lbd)
        x[180:240] = 1 - np.exp(-x[180:240] * self.qi_lbd)
        # Normalize the inputs, then zero out any NaN/inf (e.g., from zeros in input_div).
        x = (x - self.input_sub) / self.input_div
        x[np.isnan(x)] = 0
        x[np.isinf(x)] = 0

        y = y * self.out_scale
        if self.decouple_cloud:
            # Zero out all cloud-related inputs: qc/qi states and qc/qi physics tendencies.
            x[120:240] = 0
            x[60*14:60*16] = 0
            x[60*19:60*21] = 0
        elif self.aggressive_pruning:
            # For profiles, only keep stratospheric temperature; prune all other profiles in the stratosphere.
            x[60:60+self.strato_lev_qinput] = 0  # prune RH
            x[120:120+self.strato_lev_qc] = 0
            x[180:180+self.strato_lev_qinput] = 0
            x[240:240+self.strato_lev] = 0  # prune u
            x[300:300+self.strato_lev] = 0  # prune v
            x[360:360+self.strato_lev] = 0
            x[420:420+self.strato_lev] = 0
            x[480:480+self.strato_lev] = 0
            x[540:540+self.strato_lev] = 0
            x[600:600+self.strato_lev] = 0
            x[660:660+self.strato_lev] = 0
            x[720:720+self.strato_lev] = 0
            x[780:780+self.strato_lev_qinput] = 0
            x[840:840+self.strato_lev_qc] = 0  # prune qc_phy
            x[900:900+self.strato_lev_qinput] = 0
            x[960:960+self.strato_lev] = 0
            x[1020:1020+self.strato_lev] = 0
            x[1080:1080+self.strato_lev_qinput] = 0
            x[1140:1140+self.strato_lev_qc] = 0  # prune qc_phy in previous time step
            x[1200:1200+self.strato_lev_qinput] = 0
            x[1260:1260+self.strato_lev] = 0
            x[1515] = 0  # SNOWHICE
        elif self.qinput_prune:
            x[120:120+self.strato_lev] = 0
            x[180:180+self.strato_lev] = 0

        # The default strato_lev_tinput is None, so guard before comparing with 0.
        if self.strato_lev_tinput is not None and self.strato_lev_tinput > 0:
            x[0:self.strato_lev_tinput] = 0

        if self.input_clip:
            if self.input_clip_rhonly:
                x[60:120] = np.clip(x[60:120], 0, 1.2)  # clip RH to (0, 1.2)
            else:
                x[60:120] = np.clip(x[60:120], 0, 1.2)      # clip RH to (0, 1.2)
                x[360:720] = np.clip(x[360:720], -0.5, 0.5)  # clip dyn forcings to (-0.5, 0.5)
                x[720:1320] = np.clip(x[720:1320], -3, 3)    # clip phy tendencies to (-3, 3)

        if self.output_prune:
            y[60:60+self.strato_lev_out] = 0
            y[120:120+self.strato_lev_out] = 0
            y[180:180+self.strato_lev_out] = 0
            y[240:240+self.strato_lev_out] = 0
            y[300:300+self.strato_lev_out] = 0
        return torch.tensor(x, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)
```
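
For reference, a minimal sketch of how this dataset class might be driven from a training script; the file paths, normalization arrays, and ```strato_lev=15``` are placeholders:

```python
import numpy as np
from torch.utils.data import DataLoader

# Placeholder normalization statistics and transformation coefficients.
input_sub = np.load("input_mean.npy")
input_div = np.load("input_std.npy")
out_scale = np.load("output_scale.npy")
qc_lbd = np.load("qc_lbd.npy")
qi_lbd = np.load("qi_lbd.npy")

dataset = climsim_dataset(
    input_paths="train_input.npy",
    target_paths="train_target.npy",
    input_sub=input_sub,
    input_div=input_div,
    out_scale=out_scale,
    qinput_prune=True,
    output_prune=True,
    strato_lev=15,  # placeholder stratosphere depth
    qc_lbd=qc_lbd,
    qi_lbd=qi_lbd,
)
loader = DataLoader(dataset, batch_size=1024, shuffle=True)
x, y = next(iter(loader))  # x, y: float32 tensors holding one training batch
```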