Merge pull request #12 from ClimateImpactLab/Readme_changes
Readme changes
kemccusker authored Sep 17, 2022
2 parents 845cf49 + f206318 commit 2ee3784
[![pipeline status](https://gitlab.com/ClimateImpactLab/Impacts/integration/badges/main/pipeline.svg)](https://gitlab.com/ClimateImpactLab/Impacts/integration/-/commits/master)
[![docs page](https://img.shields.io/badge/docs-latest-blue)](https://climateimpactlab.gitlab.io/Impacts/integration/)
[![coverage report](https://gitlab.com/ClimateImpactLab/Impacts/integration/badges/main/coverage.svg)](https://gitlab.com/ClimateImpactLab/Impacts/integration/-/commits/main)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

# DSCIM: The Data-driven Spatial Climate Impact Model

This Python library enables the calculation of sector-specific partial social costs of greenhouse gases (SCGHGs), as well as SCGHGs combined across sectors, using a variety of valuation methods and assumptions. The main purpose of this
library is to parse the monetized spatial damages from different sectors and integrate them
using different options ("menu options") that encompass different decisions, such as
discount levels, discount strategies, and different considerations related to
economic and climate uncertainty.

## Documentation

Full documentation is available here: https://climateimpactlab.gitlab.io/Impacts/integration/

## Setup

To begin we assume you have a system with `conda` available from the command line, and some familiarity with it. A conda distribution is available from [miniconda](https://docs.conda.io/en/latest/miniconda.html), [Anaconda](https://www.anaconda.com/), or [mamba](https://mamba.readthedocs.io/en/latest/). This helps to ensure required software packages are correctly compiled and installed, replicating the analysis environment.
```bash
python damage_fun_runs/Directory_setup.py
```

Note that this will download several gigabytes of data and may take several minutes, depending on your connection speed.

## Running SCGHGs

After setting up your environment and the input data, you can run SCGHG calculations under different conditions with

```bash
python damage_fun_runs/command_line_scc.py
```

and follow the on-screen prompts. When the selector is a caret, you may select only one option: use the arrow keys to highlight your desired option and press Enter to submit. When you are presented with `X` and `o` selectors, use the spacebar to select (`X`) or deselect (`o`) options, then press Enter once you have chosen your desired set of parameters. Once you have completed all of the options, the DSCIM run will begin.

### Command line options

Below is a short summary of what each command line option does. To view a more detailed description of what the run parameters do, see the [Documentation](https://impactlab.org/research/dscim-user-manual-version-092022-epa) for Data-driven Spatial Climate Impact Model (DSCIM).

#### Sector

The user may select only one sector per run. The selected sector determines whether the run produces the SCGHG combined across sectors or the partial SCGHGs of a single sector.

#### Discount rate

These runs use endogenous Ramsey discounting calibrated so that discount rates begin at the chosen near-term discount rate(s).
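
For orientation only (DSCIM's exact calibration is described in the documentation linked above), the Ramsey rule ties the discount rate in year $t$ to a pure rate of time preference $\rho$, an elasticity of marginal utility $\eta$, and the consumption growth rate $g_t$:

$$ r_t = \rho + \eta \, g_t $$

The near-term discount rate options correspond to calibrations of these parameters such that $r_t$ begins at the chosen value.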

#### Pulse years

The pulse year is the year in which the pulse of greenhouse gas (GHG) is emitted; an SCGHG is computed for each of the chosen pulse year(s).

#### Domain of damages

The default is a global SCGHG accounting for global damages in response to a pulse of GHG. The user has the option to instead compute a domestic SCGHG accounting only for United States damages.

#### Optional files

By default, the script will produce the expected SCGHGs as a `.csv`. The user also has the option to save the full distribution of SCGHGs -- across emissions, socioeconomic, and climate uncertainty -- as a `.csv`, and the option to save global consumption net of baseline climate damages ("global_consumption_no_pulse") as a netCDF (`.nc4`) file.
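
As a minimal sketch of inspecting these outputs in Python (the file names below are placeholders; the actual names depend on the options chosen for the run):

```python
import pandas as pd
import xarray as xr

# Placeholder file names -- substitute the paths written by your run.
scghgs = pd.read_csv("expected_scghgs.csv")  # expected SCGHGs as CSV
print(scghgs.head())

consumption = xr.open_dataset("global_consumption_no_pulse.nc4")  # netCDF output
print(consumption)
```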


## Structure and logic

The library is split into several components that implement the hierarchy
defined by the menu options. These are the main elements of the library and
serve as the main classes to call different menu options.

```mermaid
graph TD
SubGraph1Flow(Storage and I/O)
subgraph "Recipe Book"
A[StackedDamages] --> B[MainMenu]
B[MainMenu] --> C[AddingUpRecipe];
B[MainMenu] --> D[RiskAversionRecipe];
B[MainMenu] --> E[EquityRecipe]
end
```

Class | Function


and these elements can be used for the menu options:
- `AddingUpRecipe`: Adds up all damages and collapses them to calculate a general SCC without valuing uncertainty.
- `RiskAversionRecipe`: Adds a risk-aversion certainty equivalent to the consumption calculations, valuing uncertainty over econometric and climate draws.
- `EquityRecipe`: Adds risk aversion and equity to the consumption calculations. Equity includes taking a certainty equivalent over spatial impact regions.



### Documentation and contributing

Learn more about how to contribute to the library by checking our [contribution
guidelines](./CONTRIBUTING.md) and the official [documentation][8].



## For Developers


### Contained environment
Additionally, we have also built a contained environment compatible with most
HPC systems using Singularity. You can learn more about how to use Singularity
from [its quick start guide][5]. In a nutshell, our Singularity container is
an Ubuntu OS with a Python (`miniconda3`) environment and all the needed
dependencies installed. We provide options to open Jupyter notebooks that are
compatible with `Dask`. At the same time, you can write your own scripts and run
them against the same environment.

**A note on Singularity remote builds**: `singularity build` needs root access,
which might be impossible to get if you live under the HPC admin tyranny. But
Singularity has your back with remote builds: `singularity build --remote`.
This means the building process happens remotely on [Sylabs][6]
servers and the image is automatically downloaded to the local machine. To make use of
this option, you need to open a Sylabs account and authenticate; you can start
this process by running `singularity remote login`. A link will appear to
create an account and an API key. Later, a prompt will ask for your
API key; just copy and paste it into your terminal.

You can build the container using our `Makefile`:

```bash
make Makefile container
```

After running this you will have an `images/` directory with the container file.
This container will have the same libraries as the [pangeo environment][7],
but the current version of `dscim` will not be installed. In this
repo we added some tools to install `dscim` and open a Jupyter
notebook to explore data or run SCC calculations.
`infrastructure/run_in_singularity.sh` is a script that installs this repo and
opens a Jupyter notebook inside the container:

```bash
Usage: ${0} {build|notebook}
OPTIONS:
   -h|help      Show this message
   -b|--build
   -n|--notebook
INFRASTRUCTURE:
   Build the infrastructure and output python
   $ ./run_singularity.sh --build

   Run notebook inside Singularity image. This function takes arguments
   for both IP and port to use in Jupyterlab
   $ ./run_singularity.sh --notebook 0.0.0.0 8888
```
We have wrapped this process within the same `Makefile` we use to build the
Singularity container, so you can just do:
```bash
make Makefile run-jupyter
```
The Jupyter `--port` option is hardcoded in the notebook, and
auto SSH forwarding is enabled by using the `--ip` flag. Be aware that you do not
need to build the image on each run; the image will live in the `images/` folder
and you can use the `run-jupyter` target to run the Jupyter notebook. Also, every time
you build the notebook, a fresh version of the code will be installed in the
notebook (this might take a while due to compilation issues).

## Requirements
The library runs on Python 3.6+ and expects all requirements to be
installed before running any code (see the setup instructions above). The integration
process stacks different damage outcomes from several sectors
at the impact region level, so you will need several tricks to deal with
the data I/O.
## Computing
### Computing introduction
One of the tricks we rely on is the extensive use of `Dask` and `xarray` to
read raw damage data in `nc4` or `zarr` format (the latter is how coastal damages are provided).
Hence, you will need a `Dask` `distributed` client to harness the power of distributed computing.
The computing requirements will vary depending on which
menu options are executed and the number of sectors you are aggregating. These are some general rules about
computational intensity:
1. For recipes, `EquityRecipe > RiskAversionRecipe > BaselineRecipe`
2. For discounting, `euler_gwr > euler_ramsey > naive_gwr > naive_ramsey > constant > constant_model_collapsed`
3. More options (i.e., a greater number of SSPs or a greater number of sectors) means more computing resources are required.
4. `Dask` does not perfectly release memory after each menu run. Thus, if you are running
several menu options, in loops or otherwise, you may need to execute a `client.restart()` partway through
to force `Dask` into emptying memory.
5. Inclusion of the coastal sector increases memory usage dramatically (due to the 500 batches and 10 GMSL bins against which
other sectors' damages must be broadcast). Be careful and smart when running this option,
and don't be afraid to reconsider chunking for the files being read in (a chunking sketch follows this list).
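
As a chunking sketch (the path, dimension names, and chunk sizes below are illustrative, not the project's actual layout), passing `chunks=` explicitly keeps Dask from loading whole files into memory:

```python
import xarray as xr

# Illustrative path and chunk sizes -- adapt to the damages files you are reading.
damages = xr.open_mfdataset(
    "damages/*.nc4",
    combine="by_coords",
    chunks={"batch": 15, "region": 3000},
)
print(damages.chunks)  # inspect the chunk layout before computing anything
```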
### Setting up a Dask client
Ensure that the following packages are installed and updated:
[Dask](https://docs.dask.org/en/latest/install.html), [distributed](https://distributed.dask.org/en/latest/install.html), [Jupyter Dask extension](https://github.com/dask/dask-labextension), `dask_jobqueue`.
Ensure that your Jupyter Lab has add-ons enabled so that you can access Dask as an extension.
You have two options for setting up a Dask client.
#### Local client
<details><summary>Click to expand</summary>
If your local node has sufficient memory and computational power, you will only need to create a local Dask client.
_If you are operating on Midway3, you should be able to run the menu in its entirety.
Each `caslake` computing node on Midway3 has 193 GB of memory and 48 CPUs. This is sufficient for all options._
- Open the Dask tab on the left side of your Jupyter Lab page.
- Click `New + ` and wait for a cluster to appear.
- Drag and drop the cluster into your notebook and execute the cell.
- You now have a new Dask client!
- Click on the `CPU`, `Worker Memory`, and `Progress` tabs to track progress. You can arrange them in a sidebar of your
Jupyter notebook to keep them all visible at the same time.
- Note that opening 2 or 3 local clients does _not_ get you 2 or 3 times the compute space. These clients will share
the same node, so computing may in fact be slower as they fight for resources. (_check this, it's a hypothesis_)
![](images/dask_example.png)
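If you prefer to create the local client programmatically rather than through the JupyterLab extension, a minimal sketch is below; the worker counts and memory limits are placeholders, not recommendations:

```python
from dask.distributed import Client, LocalCluster

# Placeholder resources -- size these to your node (e.g., a Midway3 caslake node).
cluster = LocalCluster(n_workers=8, threads_per_worker=4, memory_limit="24GB")
client = Client(cluster)
print(client.dashboard_link)  # open this URL to monitor progress
```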
</details>
#### Distributed client
<details><summary>Click to expand</summary>
If your local node does not have sufficient computational power, you will need to manually request separate
nodes with `dask_jobqueue` and `dask.distributed`:
```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster()      # reads worker settings from ~/.config/dask/jobqueue.yaml
print(cluster.job_script())   # inspect the SLURM job script before submitting
cluster.scale(10)             # request 10 workers
client = Client(cluster)
client
```
You can adjust the number of workers by changing the integer inside `cluster.scale()`. You can adjust the CPUs
and memory per worker inside `~/.config/dask/jobqueue.yaml`.
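Alternatively, the same settings can be passed directly to `SLURMCluster` instead of editing the YAML file; the resources below are placeholders, so match them to your cluster's partitions and limits:

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Placeholder resources -- adjust cores, memory, and walltime to your allocation.
cluster = SLURMCluster(cores=24, processes=1, memory="96GB", walltime="02:00:00")
cluster.scale(10)             # request 10 workers
client = Client(cluster)
```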
To track the progress of this client, copy the "Dashboard" IP address and set up an SSH tunnel to it. Example code:
```
ssh -N -f -L 8787:10.50.250.7:8510 [email protected]
```
Then go to `localhost:8787` in your browser to watch the magic.
</details>
### Dask troubleshooting
Most Dask issues in the menu come from one of two sources:
1. requesting Dask to compute too many tasks (your chunks are too small), which will result in a sort of "hung state"
and an empty progress bar.
2. requesting Dask to compute too _large_ tasks (your chunks are too big). In this case, you will see the memory under the
`Worker Memory` tab shoot off the charts, and then your kernel will likely be killed by SLURM.
How can you avoid these situations?
1. Start with `client.restart()`. Sometimes, Dask does not properly release tasks from memory and this plugs up
the client. Doing a fresh restart (and perhaps a fresh restart of your notebook) will fix the problem.
2. Next, check your chunks! Ensure that any `xr.open_dataset()` or `xr.open_mfdataset()` call has a `chunks`
argument passed (see the sketch after this list). If not, Dask's default is to load the entire file into memory before rechunking later. This
is very bad news for impact-region-level damages, which are 10 TB of data.
3. Start executing the menu object by object. Call an object, select a small slice of it, and add `.compute()`. If the object
computes successfully without overloading memory, it is not the source of the problem. Keep moving through the menu until you find the
source of the error. _Hot tip: it's usually the initial reading-in of files where nasty things happen._ Check each object in the menu to
ensure three things:
- chunks should be a reasonable size ('reasonable' is relative, but approximately 250-750 MB is typically successful
on a Midway3 `caslake` computing node)
- not too many chunks! Again, this is relative, but more than 10,000 likely means you should reconsider your chunksize.
- not too many tasks per chunk. Again, relative, but more than 300,000 tasks early in the menu is unusual and should be
checked to make sure there aren't any unnecessary rechunking operations being forced upon the menu.
4. Consider rechunking your inputs. If your inputs are chunked in a manner that's orthogonal to your first few operations,
Dask will have a nasty time trying to rechunk all those files before executing things on them. Rechunking and resaving
usually takes a few minutes; rechunking in the middle of an operation can take hours.
5. If this has all been done and you are still getting large memory errors, it's possible that Dask isn't correctly separating
and applying operations to chunks. If this is the case, consider using `xr.map_blocks`, which explicitly
tells Dask to apply the operation to each chunk independently.
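A sketch tying together points 2 and 5 above; the file name, dimension names, chunk sizes, and the function being mapped are all illustrative:

```python
import xarray as xr

# Point 2: always pass `chunks=` so the file is opened lazily instead of read whole.
damages = xr.open_dataset("damages.nc4", chunks={"batch": 15, "region": 3000})
print(damages.chunks)  # inspect chunk sizes; the text above suggests ~250-750 MB per chunk

# Point 5: apply an operation chunk by chunk with map_blocks. The function
# receives one chunk-sized Dataset at a time.
def discount(block, rate=0.02):
    return block / (1 + rate)

discounted = xr.map_blocks(discount, damages, template=damages)
result = discounted.compute()  # or select a small slice first to test
```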
For more information about how to
use `Dask` and the `dask-jobqueue` library (in case you are on a computing
cluster), refer to the [Dask Distributed][3] and [Dask-Jobqueue][4] documentation.
You can check several use-case examples in the example notebooks.
### Priority
Maintaining priority is important when given tight deadlines to run menu options. To learn more about
priority, click [here](https://rcc.uchicago.edu/docs/tutorials/rcc-tips-and-tricks.html#priority).
In general, following these hygiene rules will keep priority high:
1. Kill all notebooks/clusters when not in use.
2. Only request what you need (in terms of `WALLTIME`, `WORKERS`, and `WORKER MEMORY`).
3. Run things right the first time around. Your notebook text is worth an extra double check :)
[3]: https://distributed.dask.org/en/latest/
[4]: https://jobqueue.dask.org/en/latest/
[5]: https://sylabs.io/guides/3.5/user-guide/quick_start.html
[6]: https://sylabs.io/
[7]: https://pangeo.io/setup_guides/hpc.html
[8]: https://climateimpactlab.gitlab.io/Impacts/integration/
