Commit
Merge pull request #98 from zyhu-hu/zeyuanhu/online_testing_tomerge
jbusecke authored Jul 25, 2024
2 parents 8deb602 + 73948c1 commit 843e1b3
Showing 80 changed files with 18,018 additions and 68 deletions.
3 changes: 2 additions & 1 deletion .gitignore
```diff
@@ -2,7 +2,7 @@ website/_build/
 website/demo_notebooks
 website/evaluation
 website/figures
-website/downstream_test
+website/online_testing
 website/README.md
 .DS_Store
@@ -85,6 +85,7 @@ target/
 
 # Jupyter Notebook
 .ipynb_checkpoints
+hu_notebooks/
 
 # IPython
 profile_default/
```
6 changes: 3 additions & 3 deletions README.md
```diff
@@ -26,10 +26,10 @@ We implement a range of deterministic and stochastic regression baselines to hig
 * [Multi-Layer Perceptron (MLP) Example](https://leap-stc.github.io/ClimSim/demo_notebooks/mlp_example.html)
 * [Convolutional Neural Network (CNN) Example](https://leap-stc.github.io/ClimSim/demo_notebooks/cnn_example.html)
 * [Water Conservation Example](https://leap-stc.github.io/ClimSim/demo_notebooks/water_conservation.html)
 
-## Online Testing
-
-* [Online Testing](https://github.com/leap-stc/ClimSim/tree/online_testing/downstream_test)
+## Online Testing
+
+* [Online Testing](https://github.com/leap-stc/ClimSim/online_testing.html)
 
 ## Project Structure
```
516 changes: 452 additions & 64 deletions climsim_utils/data_utils.py

Large diffs are not rendered by default.

101 changes: 101 additions & 0 deletions online_testing/README.md
@@ -0,0 +1,101 @@
# Hybrid E3SM-MMF-NN-Emulator Simulation and Online Evaluation

## Table of Contents

1. [Problem overview](#1-problem-overview)
2. [Data preparation](#2-data-preparation)
   1. [Data download](#21-data-download)
   2. [Combine raw data into a few single files](#22-combine-raw-data-into-a-few-single-files)
3. [Model training](#3-model-training)
   1. [General requirements](#31-general-requirements)
   2. [Training scripts of our baseline online models](#32-training-scripts-of-our-baseline-online-models)
4. [Model post-processing: create wrapper for the trained model to include any normalization and de-normalization](#4-model-post-processing-create-wrapper-for-the-trained-model-to-include-any-normalization-and-de-normalization)
5. [Run hybrid E3SM MMF-NN-Emulator simulations](#5-run-hybrid-e3sm-mmf-nn-emulator-simulations)
6. [Evaluation of hybrid simulations](#6-evaluation-of-hybrid-simulations)

## 1. Problem overview
The ultimate goal of training an ML emulator (of the cloud-resolving model embedded in the E3SM-MMF climate simulator) on the ClimSim dataset is to couple it to the host E3SM climate simulator and evaluate the performance of the resulting hybrid ML-physics simulation, e.g., whether the hybrid simulation can reproduce the statistics of the pure-physics simulation. We use "online" to denote this task of performing and evaluating hybrid simulations, in contrast to the "offline" task of training an ML model. Here we describe the entire workflow of training the baseline models and then running and evaluating the hybrid simulation. We provide a few baseline models that we trained and optimized for the online task; these pretrained models include the MLP and U-Net models from the [Stable Machine-Learning Parameterization](https://arxiv.org/abs/2407.00124) paper.

Refer to the [ClimSim-Online paper](https://arxiv.org/abs/2306.08754) for an overview of the online task and to the [Stable Machine-Learning Parameterization](https://arxiv.org/abs/2407.00124) paper for more details on the example baseline models we provide.

---

## 2. Data preparation

### 2.1 Data download

We take the low-resolution dataset as an example. Download either the [Low-Resolution Real Geography](https://huggingface.co/datasets/LEAP/ClimSim_low-res) or the [Low-Resolution Real Geography Expanded](https://huggingface.co/datasets/LEAP/ClimSim_low-res-expanded) dataset from Hugging Face. The expanded version includes additional input features, such as large-scale forcings and convection memory (state tendencies from previous time steps), that we used in our pretrained U-Net models (refer to [this paper](https://arxiv.org/abs/2407.00124) for more details).

Please don't use the current preprocessed [Subsampled Low-Resolution Data](https://huggingface.co/datasets/LEAP/subsampled_low_res), which does not include cloud and wind tendencies among its target variables. For online testing, the ML model must predict not only temperature and moisture tendencies but also the cloud and wind tendencies.

If you would like to work on the [High-Resolution Dataset](https://huggingface.co/datasets/LEAP/ClimSim_high-res) and also want to expand the input features, you can follow [this notebook](./data_preparation/adding_input_feature.ipynb), which illustrates how we created the expanded input features from the original low-resolution dataset.
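
If you prefer a scripted download over cloning the repository, one option is the `huggingface_hub` Python API; a minimal sketch, assuming the low-resolution dataset and a placeholder local directory:

```python
from huggingface_hub import snapshot_download

# Download the low-resolution ClimSim dataset from Hugging Face.
# The dataset is large, so point local_dir at a scratch/storage path.
snapshot_download(
    repo_id="LEAP/ClimSim_low-res",
    repo_type="dataset",
    local_dir="/path/to/ClimSim_low-res",  # placeholder path
)
```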

### 2.2 Combine raw data into a few single files

The raw data consist of a large number of individual files output at each E3SM model time step. We need to aggregate these individual files into a few files containing data arrays for efficient training.

Take our MLP baseline model (from the [Stable Machine-Learning Parameterization](https://arxiv.org/abs/2407.00124) paper) as an example: run the [create_dataset_example_v2rh.ipynb](./data_preparation/create_dataset/create_dataset_example_v2rh.ipynb) notebook to prepare the input/output files for the MLP_v2rh model.

If you want to reproduce the U-Net models from the [Stable Machine-Learning Parameterization](https://arxiv.org/abs/2407.00124) paper, run the [create_dataset_example_v4.ipynb](./data_preparation/create_dataset/create_dataset_example_v4.ipynb) notebook to prepare the input/output files for the Unet_v4 model, or the [create_dataset_example_v5.ipynb](./data_preparation/create_dataset/create_dataset_example_v5.ipynb) notebook for the Unet_v5 model. 'v4' is the unconstrained U-Net and 'v5' is the constrained U-Net; please refer to the original paper for more details. A minimal sketch of the aggregation step is shown below.
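
For orientation only, here is a minimal sketch of what the aggregation notebooks do, assuming per-timestep NetCDF files; the file pattern and variable lists are hypothetical placeholders (the real notebooks use the full ClimSim feature sets and `climsim_utils.data_utils`):

```python
import glob

import numpy as np
import xarray as xr

def to_features(ds: xr.Dataset, varnames: list) -> np.ndarray:
    """Flatten the chosen variables of one snapshot into (num_columns, num_features)."""
    cols = []
    for v in varnames:
        da = ds[v]
        if "lev" in da.dims:
            da = da.transpose("ncol", "lev")  # put the column dimension first
        cols.append(da.values.reshape(ds.sizes["ncol"], -1))
    return np.concatenate(cols, axis=1)

input_vars = ["state_t", "state_q0001"]   # hypothetical subset of input variables
target_vars = ["ptend_t", "ptend_q0001"]  # hypothetical subset of target variables

inputs, targets = [], []
for path in sorted(glob.glob("raw_data/*.mli.*.nc")):  # one file per model time step
    with xr.open_dataset(path) as ds:
        inputs.append(to_features(ds, input_vars))
        targets.append(to_features(ds, target_vars))

# Aggregate all time steps into a few arrays for efficient training.
np.save("train_input.npy", np.concatenate(inputs, axis=0))
np.save("train_target.npy", np.concatenate(targets, axis=0))
```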

---

## 3. Model training

### 3.1 General requirements

To be able to couple your trained NN model to E3SM seamlessly, you need to be aware of the following requirements before training your NN model:

- Your NN model must be saved in TorchScript format. Converting a PyTorch model into TorchScript is straightforward, and our training scripts include the code to save the model in TorchScript format. You can also refer to the [official TorchScript documentation](https://pytorch.org/docs/stable/jit.html) for more details.
- Your NN model's forward method should take an input tensor of shape (batch_size, num_input_features) and return an output tensor of shape (batch_size, num_output_features). The output feature dimension must have length ```num_output_features = 368``` and contain the following variables in this order: ```'ptend_t', 'ptend_q0001', 'ptend_q0002', 'ptend_q0003', 'ptend_u', 'ptend_v', 'cam_out_NETSW', 'cam_out_FLWDS', 'cam_out_PRECSC', 'cam_out_PRECC', 'cam_out_SOLS', 'cam_out_SOLL', 'cam_out_SOLSD', 'cam_out_SOLLD'```. The ptend variables are vertical profiles of tendencies of atmospheric state variables, each with a length of 60. A minimal sketch of a conforming model is shown after this list.
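
The following toy model illustrates both requirements. The architecture and the `num_input_features` value are placeholders, not one of our baselines:

```python
import torch
import torch.nn as nn

class TinyEmulator(nn.Module):
    """Toy stand-in for an emulator; not one of the pretrained baseline models."""

    def __init__(self, num_input_features: int = 1525, num_output_features: int = 368):
        super().__init__()
        # 6 tendency profiles x 60 levels + 8 scalar surface outputs = 368 features.
        self.net = nn.Sequential(
            nn.Linear(num_input_features, 512),
            nn.ReLU(),
            nn.Linear(512, num_output_features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, num_input_features) -> (batch_size, num_output_features)
        return self.net(x)

model = TinyEmulator()
scripted = torch.jit.script(model)  # convert to TorchScript
scripted.save("model.pt")           # this .pt file is what E3SM loads
```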

### 3.2 Training scripts of our baseline online models

We provide the training scripts under the ```online_testing/baseline_models/``` directory. Under the folder of each baseline model, the ```slurm``` folder contains slurm scripts to run the training job.

For example, to train the MLP model (with a huber loss and a 'step' lr scheduler), you can run the following command:
```bash
cd online_testing/baseline_models/MLP_v2rh/training/slurm/
sbatch v2rh_mlp_nonaggressive_cliprh_huber_step_3l_lr1em3.sbatch
```

The training reads the default configuration arguments listed in ```training/conf/config_single.yaml```. You need to change a few path arguments in config_single.yaml to the paths on your machine, or overwrite those paths in the slurm job scripts. By default, the training slurm scripts request 4 GPUs; you can change the number of GPUs in the slurm scripts.

Training requires the [Modulus library](https://docs.nvidia.com/deeplearning/modulus/getting-started/index.html). We used the Modulus container image for the training environment; you can download the latest version by following the instructions on the [Modulus website](https://docs.nvidia.com/deeplearning/modulus/getting-started/index.html). For reproducibility, we used version ```nvcr.io/nvidia/modulus/modulus:24.01```. If you don't want to use a container, you can instead run

```bash
pip install nvidia-modulus
```
to install on any system, although we recommend the container for best results.

---

## 4. Model post-processing: create wrapper for the trained model to include any normalization and de-normalization

The E3SM MMF-NN-Emulator code expects the NN model to take un-normalized input features and to produce un-normalized output features. The notebooks in the ```./model_postprocessing``` directory show how to create a wrapper for our pretrained MLP and U-Net models that moves pre/post-processing, such as normalization and de-normalization, inside the forward method of the TorchScript model.

For example, the [v5_nn_wrapper.ipynb](./model_postprocessing/v5_nn_wrapper.ipynb) notebook shows how to create a wrapper for the U-Net model to read raw input features, calculate additional needed input features, normalize the input, clip input values, pass them to the U-Net model, de-normalize the output features, and apply the temperature-based liquid-ice cloud partitioning.
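
Conceptually, such a wrapper looks like the minimal sketch below. The normalization convention mirrors the training code (inputs standardized with `input_sub`/`input_div`, predictions divided by `out_scale` to recover physical units); the class and buffer names are illustrative, not the notebooks' API:

```python
import torch
import torch.nn as nn

class WrappedEmulator(nn.Module):
    """Illustrative wrapper: normalization and de-normalization live inside forward."""

    def __init__(self, core: nn.Module, input_sub, input_div, out_scale):
        super().__init__()
        self.core = core
        # Register statistics as buffers so they are stored inside the TorchScript file.
        self.register_buffer("input_sub", torch.as_tensor(input_sub, dtype=torch.float32))
        self.register_buffer("input_div", torch.as_tensor(input_div, dtype=torch.float32))
        self.register_buffer("out_scale", torch.as_tensor(out_scale, dtype=torch.float32))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = (x - self.input_sub) / self.input_div  # normalize raw inputs
        x = torch.nan_to_num(x, nan=0.0, posinf=0.0, neginf=0.0)
        y = self.core(x)                           # normalized predictions
        return y / self.out_scale                  # back to physical units

# wrapped = torch.jit.script(WrappedEmulator(core, sub, div, scale))
# wrapped.save("wrapped_model.pt")
```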

---

## 5. Run hybrid E3SM MMF-NN-Emulator simulations

Please follow the instructions in the [ClimSim-Online repository](https://github.com/leap-stc/climsim-online/tree/main) to set up the container environment and run the hybrid simulation.

Please check the [NVlabs/E3SM MMF-NN-Emulator repository](https://github.com/zyhu-hu/E3SM_nvlab/tree/cleaner_workflow_tomerge/climsim_scripts) to learn about the configurations and namelist variables of the E3SM MMF-NN-Emulator version.

---

## 6. Evaluation of hybrid simulations

The notebooks in the ```./evaluation``` directory show how to reproduce the plots in the [Stable Machine-Learning Parameterization](https://arxiv.org/abs/2407.00124) paper. The data required by these evaluation/visualization notebooks can be downloaded from [Stable Machine-Learning Parameterization: Zenodo Data](https://zenodo.org/records/12797811).

---

## Author
- Zeyuan Hu, Harvard University

## References

- [ClimSim-Online: A Large Multi-scale Dataset and Framework for Hybrid ML-physics Climate Emulation](https://arxiv.org/abs/2306.08754)
- [Stable Machine-Learning Parameterization of Subgrid Processes with Real Geography and Full-physics Emulation](https://arxiv.org/abs/2407.00124)
142 changes: 142 additions & 0 deletions online_testing/baseline_models/MLP_v2rh/training/climsim_datapip.py
@@ -0,0 +1,142 @@
```python
import numpy as np
import torch
from torch.utils.data import Dataset

class climsim_dataset(Dataset):
    def __init__(self,
                 input_paths,
                 target_paths,
                 input_sub,
                 input_div,
                 out_scale,
                 qinput_prune,
                 output_prune,
                 strato_lev,
                 qc_lbd,
                 qi_lbd,
                 decouple_cloud=False,
                 aggressive_pruning=False,
                 strato_lev_qc=30,
                 strato_lev_qinput=None,
                 strato_lev_tinput=None,
                 strato_lev_out=12,
                 input_clip=False,
                 input_clip_rhonly=False):
"""
Args:
input_paths (str): Path to the .npy file containing the inputs.
target_paths (str): Path to the .npy file containing the targets.
input_sub (np.ndarray): Input data mean.
input_div (np.ndarray): Input data standard deviation.
out_scale (np.ndarray): Output data standard deviation.
qinput_prune (bool): Whether to prune the input data.
output_prune (bool): Whether to prune the output data.
strato_lev (int): Number of levels in the stratosphere.
qc_lbd (np.ndarray): Coefficients for the exponential transformation of qc.
qi_lbd (np.ndarray): Coefficients for the exponential transformation of qi.
"""
        self.inputs = np.load(input_paths)
        self.targets = np.load(target_paths)
        self.input_paths = input_paths
        self.target_paths = target_paths
        self.input_sub = input_sub
        self.input_div = input_div
        self.out_scale = out_scale
        self.qinput_prune = qinput_prune
        self.output_prune = output_prune
        self.strato_lev = strato_lev
        self.qc_lbd = qc_lbd
        self.qi_lbd = qi_lbd
        self.decouple_cloud = decouple_cloud
        self.aggressive_pruning = aggressive_pruning
        self.strato_lev_qc = strato_lev_qc
        self.strato_lev_out = strato_lev_out
        self.input_clip = input_clip
        # The default strato_lev_qinput is None, so check for None before
        # comparing with 0; fall back to strato_lev in either case.
        if strato_lev_qinput is None or strato_lev_qinput < 0:
            self.strato_lev_qinput = strato_lev
        else:
            self.strato_lev_qinput = strato_lev_qinput
        self.strato_lev_tinput = strato_lev_tinput
        self.input_clip_rhonly = input_clip_rhonly

        if self.strato_lev_qinput < self.strato_lev:
            raise ValueError('strato_lev_qinput should be greater than or equal to strato_lev, otherwise inconsistent with E3SM')

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        x = self.inputs[idx]
        y = self.targets[idx]
        # Exponential transformation of the cloud inputs: qc (indices 120:180)
        # and qi (indices 180:240) are mapped to [0, 1).
        x[120:180] = 1 - np.exp(-x[120:180] * self.qc_lbd)
        x[180:240] = 1 - np.exp(-x[180:240] * self.qi_lbd)
        # Normalize the inputs, then zero out any NaN/inf (e.g., from zeros in input_div).
        x = (x - self.input_sub) / self.input_div
        x[np.isnan(x)] = 0
        x[np.isinf(x)] = 0

        y = y * self.out_scale
        if self.decouple_cloud:
            # Zero out all cloud-related inputs: qc/qi states and qc/qi physics tendencies.
            x[120:240] = 0
            x[60*14:60*16] = 0
            x[60*19:60*21] = 0
        elif self.aggressive_pruning:
            # For profiles, only keep stratospheric temperature; prune all other profiles in the stratosphere.
            x[60:60+self.strato_lev_qinput] = 0  # prune RH
            x[120:120+self.strato_lev_qc] = 0
            x[180:180+self.strato_lev_qinput] = 0
            x[240:240+self.strato_lev] = 0  # prune u
            x[300:300+self.strato_lev] = 0  # prune v
            x[360:360+self.strato_lev] = 0
            x[420:420+self.strato_lev] = 0
            x[480:480+self.strato_lev] = 0
            x[540:540+self.strato_lev] = 0
            x[600:600+self.strato_lev] = 0
            x[660:660+self.strato_lev] = 0
            x[720:720+self.strato_lev] = 0
            x[780:780+self.strato_lev_qinput] = 0
            x[840:840+self.strato_lev_qc] = 0  # prune qc_phy
            x[900:900+self.strato_lev_qinput] = 0
            x[960:960+self.strato_lev] = 0
            x[1020:1020+self.strato_lev] = 0
            x[1080:1080+self.strato_lev_qinput] = 0
            x[1140:1140+self.strato_lev_qc] = 0  # prune qc_phy in previous time step
            x[1200:1200+self.strato_lev_qinput] = 0
            x[1260:1260+self.strato_lev] = 0
            x[1515] = 0  # SNOWHICE
        elif self.qinput_prune:
            x[120:120+self.strato_lev] = 0
            x[180:180+self.strato_lev] = 0

        # The default strato_lev_tinput is None, so guard before comparing with 0.
        if self.strato_lev_tinput is not None and self.strato_lev_tinput > 0:
            x[0:self.strato_lev_tinput] = 0

        if self.input_clip:
            if self.input_clip_rhonly:
                x[60:120] = np.clip(x[60:120], 0, 1.2)  # clip RH to (0, 1.2)
            else:
                x[60:120] = np.clip(x[60:120], 0, 1.2)      # clip RH to (0, 1.2)
                x[360:720] = np.clip(x[360:720], -0.5, 0.5)  # clip dyn forcings to (-0.5, 0.5)
                x[720:1320] = np.clip(x[720:1320], -3, 3)    # clip phy tendencies to (-3, 3)

        if self.output_prune:
            y[60:60+self.strato_lev_out] = 0
            y[120:120+self.strato_lev_out] = 0
            y[180:180+self.strato_lev_out] = 0
            y[240:240+self.strato_lev_out] = 0
            y[300:300+self.strato_lev_out] = 0
        return torch.tensor(x, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)
```
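
For reference, a minimal sketch of how this dataset class might be driven from a training script; the file paths, normalization arrays, and ```strato_lev=15``` are placeholders:

```python
import numpy as np
from torch.utils.data import DataLoader

# Placeholder normalization statistics and transformation coefficients.
input_sub = np.load("input_mean.npy")
input_div = np.load("input_std.npy")
out_scale = np.load("output_scale.npy")
qc_lbd = np.load("qc_lbd.npy")
qi_lbd = np.load("qi_lbd.npy")

dataset = climsim_dataset(
    input_paths="train_input.npy",
    target_paths="train_target.npy",
    input_sub=input_sub,
    input_div=input_div,
    out_scale=out_scale,
    qinput_prune=True,
    output_prune=True,
    strato_lev=15,  # placeholder stratosphere depth
    qc_lbd=qc_lbd,
    qi_lbd=qi_lbd,
)
loader = DataLoader(dataset, batch_size=1024, shuffle=True)
x, y = next(iter(loader))  # x, y: float32 tensors holding one training batch
```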