Development of an ERA5 cloud storage for efficient access #396

Closed
richardarsenault opened this issue Jun 25, 2021 · 8 comments

Description

Raven will need to access ERA5 data as a reference dataset in many workflows. Currently, ERA5 data are stored on THREDDS, but this is not efficient. Pangeo hosts a functional ERA5 implementation on Amazon AWS servers, but the data are chunked for spatial access rather than for time-series access. This project aims to develop a new, time-series-oriented ERA5 access point using cloud storage and computing.

The Wasabi service will be used to store and serve ERA5 data through an S3-compatible (Amazon AWS style) interface. Preliminary tests show that extracting basin-averaged time series over multi-decade periods takes at most a few seconds. The data will be stored in zarr format and chunked so as to maximize speed for time-series analysis.

Steps

The implementation will be done in a few steps:

1- ERA5, worldwide domain (total precipitation [tp] and air temperature [tas] only at this stage, hourly data).
2- ERA5-Land ([tp] and [tas]) for North America only (due to file-size limitations, only a subset of the global map can be accommodated).
3- ERA5 and ERA5-Land snow water equivalent [sd] for North America only.

We will then consider other variables and regions according to the needs of PAVICS-Hydro and other users, depending on the required resources. For example, pressure-level data and ensembles are excluded for now; only the actual reanalysis will be made available.

Spatial Extents

Worldwide is self-explanatory.
North America will cover the following region (first approximation/guess):
--- Latitude: [15, 85]
--- Longitude: [-167, -50] *Note that ERA5 data natively use the [0;360]° longitude convention rather than the [-180;180]° convention; see the sketch below.
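
Since that convention can trip up users, here is a minimal, hedged sketch of handling it with xarray; the dataset `ds` and its coordinate names are assumptions:

import xarray as xr

def to_pm180(ds: xr.Dataset) -> xr.Dataset:
    """Convert a longitude coordinate from [0, 360) to [-180, 180) and re-sort."""
    lon = (ds["longitude"] + 180) % 360 - 180
    return ds.assign_coords(longitude=lon).sortby("longitude")

# In the native convention, [-167, -50] maps to [193, 310] (lon + 360).
# ERA5 latitudes are typically stored in descending order, hence slice(85, 15).
# ds_na = ds.sel(latitude=slice(85, 15), longitude=slice(193, 310))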

Temporal parameters

Data will be stored at the native hourly resolution and in UTC.
The covered period will start in 1979 and extend to the most recent dates available on our end.

Updating

Files will be updated manually at first. If all goes well, we should be able to implement an auto-updating scheme so that users can always access the most recent data; a possible appending mechanism is sketched below.
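
As a rough illustration of what such a scheme could look like, here is a hedged sketch using xarray's zarr appending; the function name, the store path and the `ds_new` dataset (the freshly downloaded hours) are all assumptions:

import xarray as xr

def append_latest(ds_new: xr.Dataset, store: str) -> None:
    """Append newly downloaded hourly slices to an existing zarr store.

    ds_new must carry the same variables, spatial grid and chunking as the
    store; xarray then extends the time dimension without rewriting old chunks.
    """
    ds_new.to_zarr(store, mode="a", append_dim="time")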

Thanks

Big thanks to @sebastienlanglois for the help and contributions, without whom this project would not be possible!


sebastienlanglois commented Jul 5, 2021

Thanks @richardarsenault for the exhaustive description.

To add further explanation on why Wasabi was selected as a cloud storage service rather than more traditional ones (AWS S3, GCS, etc.): the main reason is that Wasabi charges only for storage, not for egress or API requests. Because we want the data to be available to users both on cloud and local networks, the egress fees from transferring large amounts of data out of the cloud would have added up quickly (or the cost would have had to be passed on to users)!

Also, Wasabi offers an S3-compatible API, which means it can be used with Python libraries such as fsspec and s3fs; a short sketch follows.
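
For illustration, a minimal hedged sketch of anonymous read access through that S3-compatible API with s3fs (the endpoint and bucket name are taken from the code further down this thread):

import s3fs

# Anonymous, read-only access via Wasabi's standard S3-compatible endpoint.
fs = s3fs.S3FileSystem(
    anon=True,
    client_kwargs={"endpoint_url": "https://s3.wasabisys.com",
                   "region_name": "us-east-1"},
)
print(fs.ls("era5"))  # list the top level of the era5 bucket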


sebastienlanglois commented Jul 5, 2021

After discussions with @richardarsenault, here is a suggested bucket naming scheme:

era5/...
  /world/
  /north-america/...
    /reanalysis/
    /ensemble/...
      /single-levels/
      /land/
      /pressure-levels/...
         /netcdf/
         /zarr/


sebastienlanglois commented Jul 5, 2021

With regard to Step 1 (the worldwide ERA5 domain), all tp and t2m (the ERA5 name for tas) netcdf files have been uploaded to the following bucket.

New netcdf files are added automatically on a daily basis, with a five-day lag behind real time as per the CDS API design for ERA5. I will post examples of how to access the data in the coming days.

Before moving on and creating the worldwide ERA5 zarr dataset (Step 1), we should do some testing on a subset of the data to optimise how the dataset will be chunked. As mentioned by @richardarsenault, our use case is mainly time-series analysis. The chunking that maximises time-series retrieval speed would theoretically put all time steps inside a single chunk; however, such an approach would require rewriting every chunk each time the dataset is updated (for instance, daily or monthly), which effectively means recreating the entire dataset each time. With these tests, we will try to find a more suitable chunking approach that accounts for both the dataset update frequency and the speed of data retrieval.

Depending on the size and cost of maintaining the zarr datasets, @huard also suggested considering application-specific zarr datasets for time-series and spatial-analysis applications (such as a zarr-temporal and a zarr-spatial bucket); a sketch of producing one such store follows.
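
To make the trade-off concrete, here is a hedged sketch of producing one of the test stores described in the next comment; the input file pattern is a placeholder, and the chunk sizes mirror one of the tested configurations:

import xarray as xr

# Lazily open a (hypothetical) subset of the hourly netcdf files with dask.
ds = xr.open_mfdataset("era5_subset_*.nc", combine="by_coords")

# One leap year of hourly steps (8784) per chunk, on 2x2-pixel spatial tiles.
ds = ds.chunk({"time": 8784, "latitude": 2, "longitude": 2})

# Write the rechunked dataset as a zarr store with consolidated metadata.
ds.to_zarr("era5-test-time-8784-longitude-2-latitude-2", mode="w",
           consolidated=True)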

@sebastienlanglois

Tests - Zarr chunking

As promised, here are a couple of tests to experiment with zarr's chunking (using a subset of ERA5's data in the space dimensions for test purposes).

The test datasets can be found in the following bucket: s3://era5/world/reanalysis/single-levels/zarr-tests

They are named according to how the dimensions were chunked (more chunking schemes can be tested on request):

  • era5-test-time-8784-longitude-2-latitude-2 (1 year per chunk)
  • era5-test-time-87840-longitude-1-latitude-1 (10 years per chunk)
  • era5-test-time-87840-longitude-2-latitude-2 (10 years per chunk)
  • era5-test-time-87840-longitude-5-latitude-5 (10 years per chunk)
  • era5-test-time-372552-longitude-1-latitude-1 (all timesteps per chunk)
  • era5-test-time-372552-longitude-2-latitude-2 (all timesteps per chunk)
  • era5-test-time-372552-longitude-5-latitude-5 (all timesteps per chunk)

The following code was tested with both:

  • A fairly slow home internet connection (50 Mbit/s download) as a baseline; in this case the bandwidth is the main bottleneck.
  • A Google Cloud Platform E2 VM instance.

When working in scientific environments (cloud/academic/corporate networks) with fast connections (1 Gbps+), we expect the E2 VM instance to more accurately reflect the expected performance. I invite you to test it in your own environment (ping @richardarsenault, @huard, etc.).

# Time to extract all 372552 ERA5 t2m hourly time steps at a specific pixel
# and at multiple pixels to simulate the size of a watershed

import fsspec
import xarray as xr
import time
import os
import hvplot.xarray
from dask.distributed import Client

client = Client()  # start a local dask cluster for parallel chunk reads

# Wasabi's S3-compatible endpoint
client_kwargs = {'endpoint_url': 'https://s3.wasabisys.com',
                 'region_name': 'us-east-1'}

buckets_dirname = 's3://era5/world/reanalysis/single-levels/zarr-tests'

buckets_basename = ['era5-test-time-8784-longitude-2-latitude-2',
                    'era5-test-time-87840-longitude-1-latitude-1',
                    'era5-test-time-87840-longitude-2-latitude-2',
                    'era5-test-time-87840-longitude-5-latitude-5',
                    'era5-test-time-372552-longitude-1-latitude-1',
                    'era5-test-time-372552-longitude-2-latitude-2',
                    'era5-test-time-372552-longitude-5-latitude-5']

for bucket_basename in buckets_basename:

    bucket = os.path.join(buckets_dirname, bucket_basename)
    store = fsspec.get_mapper(bucket, 
                              anon=True,
                              client_kwargs=client_kwargs)
    
    ds = xr.open_zarr(store, consolidated=True)

    # Single pixel: full hourly record at one grid point (Kelvin -> Celsius)
    start = time.time()
    da = (ds.t2m - 273.15)[:,-5,-5].load()
    end = time.time()
    print(bucket_basename, ' : ', end-start,' seconds')

    # Multiple pixels: a 6x6 block to simulate a watershed-sized query
    start = time.time()
    da = (ds.t2m - 273.15)[:,0:6,0:6].load()
    end = time.time()
    print(bucket_basename, ' (multiple pixels) : ', end-start,' seconds')

# With the 50 Mbits/s connection :
# era5-test-time-8784-longitude-2-latitude-2  :  1.5577044486999512  seconds
# era5-test-time-8784-longitude-2-latitude-2  (multiple pixels) :  5.472326755523682  seconds
# era5-test-time-87840-longitude-1-latitude-1  :  0.29694294929504395  seconds
# era5-test-time-87840-longitude-1-latitude-1  (multiple pixels) :  5.004878997802734  seconds
# era5-test-time-87840-longitude-2-latitude-2  :  2.889986991882324  seconds
# era5-test-time-87840-longitude-2-latitude-2  (multiple pixels) :  12.184975624084473  seconds
# era5-test-time-87840-longitude-5-latitude-5  :  3.0996978282928467  seconds
# era5-test-time-87840-longitude-5-latitude-5  (multiple pixels) :  11.215736627578735  seconds
# era5-test-time-372552-longitude-1-latitude-1  :  0.33968472480773926  seconds
# era5-test-time-372552-longitude-1-latitude-1  (multiple pixels) :  6.137498378753662  seconds
# era5-test-time-372552-longitude-2-latitude-2  :  0.6167998313903809  seconds
# era5-test-time-372552-longitude-2-latitude-2  (multiple pixels) :  4.357798337936401  seconds
# era5-test-time-372552-longitude-5-latitude-5  :  2.894357919692993  seconds
# era5-test-time-372552-longitude-5-latitude-5  (multiple pixels) :  11.943426370620728  seconds

# On the GCP E2 VM instance :
# era5-test-time-8784-longitude-2-latitude-2  :  1.3091447353363037  seconds
# era5-test-time-8784-longitude-2-latitude-2  (multiple pixels) :  8.865148305892944  seconds
# era5-test-time-87840-longitude-1-latitude-1  :  0.26541686058044434  seconds
# era5-test-time-87840-longitude-1-latitude-1  (multiple pixels) :  4.483410596847534  seconds
# era5-test-time-87840-longitude-2-latitude-2  :  0.6464042663574219  seconds
# era5-test-time-87840-longitude-2-latitude-2  (multiple pixels) :  0.7947907447814941  seconds
# era5-test-time-87840-longitude-5-latitude-5  :  0.32846498489379883  seconds
# era5-test-time-87840-longitude-5-latitude-5  (multiple pixels) :  1.1891248226165771  seconds
# era5-test-time-372552-longitude-1-latitude-1  :  0.27306079864501953  seconds
# era5-test-time-372552-longitude-1-latitude-1  (multiple pixels) :  1.621551513671875  seconds
# era5-test-time-372552-longitude-2-latitude-2  :  0.1411135196685791  seconds
# era5-test-time-372552-longitude-2-latitude-2  (multiple pixels) :  0.5919482707977295  seconds
# era5-test-time-372552-longitude-5-latitude-5  :  0.5591433048248291  seconds
# era5-test-time-372552-longitude-5-latitude-5  (multiple pixels) :  0.8325531482696533  seconds

Here is a snapshot of one of the test datasets (ds):
[Screenshot: xarray summary of the test dataset]

and a quick graph to visualise the data once it has been loaded onto the user's computer:

%%time
(da.hvplot(grid=True, label='t2m') *
 da.resample(time='1Y').mean().hvplot(label='t2m-yearly')).opts(legend_position='top')

[Screenshot: hourly t2m series overlaid with its yearly mean]

Other food for thought

Daily zarr datasets?

So far we have kept the hourly time steps for zarr, which we should do for the main dataset in Step 1. However, assuming most users will usually need daily time steps, it could be interesting to add another zarr dataset at the daily resolution (t2m, tmax, tmin and tp). Such a dataset would be minimal in terms of size and cost (about 1/12th the size of the main hourly dataset) and would allow sub-second requests even for users with poor internet connections. The downside of this approach is that conversion from UTC to the appropriate timezone has to happen before aggregating to the daily time step, which means maintaining daily zarr datasets for multiple North American timezones; see the sketch below.
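
As a rough illustration of that downside, a hedged sketch assuming a DataArray `da` with UTC timestamps (the function name and timezone are illustrative):

import pandas as pd
import xarray as xr

def to_local_daily(da: xr.DataArray, tz: str = "America/Toronto") -> xr.DataArray:
    """Shift UTC timestamps into a local timezone, then aggregate to daily means.

    One such pass (and hence one daily zarr dataset) would be needed per
    North American timezone, as noted above.
    """
    utc = pd.DatetimeIndex(da["time"].values, tz="UTC")
    local = utc.tz_convert(tz).tz_localize(None)  # back to naive local times
    return da.assign_coords(time=local).resample(time="1D").mean()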


huard commented Jul 7, 2021

Timings from Ouranos

era5-test-time-8784-longitude-2-latitude-2  :  3.9128899574279785  seconds
era5-test-time-8784-longitude-2-latitude-2  (multiple pixels) :  4.061275482177734  seconds
era5-test-time-87840-longitude-1-latitude-1  :  0.5617625713348389  seconds
era5-test-time-87840-longitude-1-latitude-1  (multiple pixels) :  1.9685022830963135  seconds
era5-test-time-87840-longitude-2-latitude-2  :  6.899303674697876  seconds
era5-test-time-87840-longitude-2-latitude-2  (multiple pixels) :  7.567781686782837  seconds
era5-test-time-87840-longitude-5-latitude-5  :  2.085627794265747  seconds
era5-test-time-87840-longitude-5-latitude-5  (multiple pixels) :  2.293652296066284  seconds
era5-test-time-372552-longitude-1-latitude-1  :  0.6129217147827148  seconds
era5-test-time-372552-longitude-1-latitude-1  (multiple pixels) :  1.1872832775115967  seconds
era5-test-time-372552-longitude-2-latitude-2  :  1.1998472213745117  seconds
era5-test-time-372552-longitude-2-latitude-2  (multiple pixels) :  1.7468199729919434  seconds
era5-test-time-372552-longitude-5-latitude-5  :  6.857306003570557  seconds
era5-test-time-372552-longitude-5-latitude-5  (multiple pixels) :  7.299430847167969  seconds

@sebastienlanglois

I have uploaded a new and, hopefully, fully corrected zarr version of the ERA5 (single-levels) dataset. The previous dataset had some issues that are probably related to using Copernicus' experimental netcdf format, rather than their native grib format, as the basis for the zarr conversion. The current zarr dataset was therefore created from grib files, and for now only this format should be used to append to the zarr dataset. ERA5's netcdf files are still fine as long as they are not converted to zarr. A sketch of the grib-based path follows.
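
For reference, a hedged sketch of that grib-based path; the file name, chunk size and store path are assumptions, and the cfgrib engine (plus eccodes) is required:

import xarray as xr

# Decode a monthly ERA5 grib file with cfgrib rather than the experimental
# netcdf format that caused the issues described above.
ds = xr.open_dataset("era5_2021-07.grib", engine="cfgrib")

# Align the chunking with the existing store, then append along time.
ds = ds.chunk({"time": 24 * 31})
ds.to_zarr("era5_single_levels_reanalysis", mode="a", append_dim="time")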

What is new:

  • ERA5 can now be accessed directly from the catalog:
    [Screenshot: the ERA5 entry in the data catalog]
  • Chunks are more reasonable than in the previous iteration. The intent is still to speed up data access for queries along the time dimension, but some spatial analyses remain possible.
    [Screenshot: chunk layout of the new zarr dataset]

What is coming:

A word on performance

  • The zarr dataset was tested on a large cluster (about 100 cores) by computing the mean temperature over all dimensions, which means processing 1.4 terabytes of data. The calculation took about 11 minutes, i.e. roughly 17 Gbits processed every second. The more nodes we throw at the problem, the faster the computation completes, since access to cloud storage can scale linearly (there is no practical server-side bandwidth limit). A minimal reproduction is sketched below.
    [Screenshot: the full-dataset mean computation]
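
The following hedged sketch reproduces that reduction on a smaller scale, assuming a dask cluster is available (a local one here) and reusing the anonymous Wasabi access from earlier in the thread; the store path points at one of the test datasets and is an assumption for the operational one:

import fsspec
import xarray as xr
from dask.distributed import Client

client = Client()  # start (or connect to) a dask cluster

store = fsspec.get_mapper(
    "s3://era5/world/reanalysis/single-levels/zarr-tests/"
    "era5-test-time-372552-longitude-2-latitude-2",  # assumed store path
    anon=True,
    client_kwargs={"endpoint_url": "https://s3.wasabisys.com",
                   "region_name": "us-east-1"},
)
ds = xr.open_zarr(store, consolidated=True)

# Mean over all dimensions (time, latitude, longitude): dask streams the
# chunks from object storage and reduces them in parallel.
t2m_mean = ds.t2m.mean().compute()
print(float(t2m_mean))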


alxmrs commented Nov 7, 2022

Since this issue was created, my team has released https://github.com/google-research/arco-era5. I was wondering if the folks in this issue would like to join forces and work together on a Phase 2 of our project :)


sebastienlanglois commented Apr 19, 2023

Just wanted to document a few updates on this issue before we close it.
I am happy to report that we were able to achieve all the goals we set out at the start of this issue.

We now have 2 zarr datasets:

  • era5_single_levels_reanalysis (world - fully operational right now)
  • era5_land_reanalysis (North America - will be fully operational in a few weeks)

These terabyte-scale, analysis-ready, cloud-optimised (ARCO) datasets are (or, for ERA5-Land, soon will be) freely and remotely accessible from anywhere (cloud or local laptop), efficiently chunked for time-series analysis plus some regional spatial analyses, and updated weekly. I believe they are among the first climate datasets to combine all these properties, so hopefully they will be used a lot! An illustrative query is sketched below.
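
To close the loop with the original goal of extracting basin-averaged time series in seconds, here is an illustrative hedged query; the store path and the basin bounding box are assumptions, since the exact catalog entry is not reproduced in this thread:

import fsspec
import xarray as xr

store = fsspec.get_mapper(
    "s3://era5/world/reanalysis/single-levels/zarr/"
    "era5_single_levels_reanalysis",  # assumed path to the operational store
    anon=True,
    client_kwargs={"endpoint_url": "https://s3.wasabisys.com",
                   "region_name": "us-east-1"},
)
ds = xr.open_zarr(store, consolidated=True)

# Basin-averaged daily mean temperature (degrees C) over a hypothetical box.
# Latitude is sliced high-to-low to match ERA5's descending coordinate.
box = ds.t2m.sel(latitude=slice(47, 45), longitude=slice(285, 288))
series = ((box.mean(dim=["latitude", "longitude"]) - 273.15)
          .resample(time="1D").mean().load())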

Up next, we intend to add more variables and possibly other datasets. Stay tuned!
