Development of an ERA5 cloud storage for efficient access #396
Comments
Thanks @richardarsenault for the exhaustive description. To add further explanation on why Wasabi was selected as the cloud storage service rather than more traditional ones (AWS S3, GCS, etc.): the main reason is that Wasabi does not charge egress fees or API request fees, only storage costs. Because we want the data to be available to users both on cloud and local networks, the egress fees from transferring large amounts of data from the cloud to users would have added up quickly (or the cost would have had to be passed on to users)! Also, Wasabi has an S3-compatible API, which means it can be used with Python libraries such as fsspec and s3fs.
After discussions with @richardarsenault, here is a suggestion for bucket naming: era5/...
With regards to Step 1 (the ERA5 worldwide domain), all tp and t2m (the ERA5 name for tas) NetCDF files have been uploaded to the following bucket.
New NetCDF files are added automatically on a daily basis, with a 5-day lag behind real time as per the CDS API design for ERA5. I will post examples of how to access the data in the coming days. Before moving on and creating the ERA5 worldwide zarr dataset (Step 1), we should do some testing on a subset of the data to optimise how the dataset will be chunked. As mentioned by @richardarsenault, our use case is mainly time-series analysis. The optimal chunking to maximise speed for time-series retrieval would theoretically put all time steps inside a single chunk; however, such an approach would require updating every chunk each time the dataset is updated (for instance on a daily or monthly basis), which effectively means recreating the entire dataset every time. With these tests, we will try to find a more suitable chunking approach that takes into account both the dataset update frequency and the speed of data retrieval. Depending on the size and costs of maintaining the zarr datasets, @huard also suggested considering application-specific zarr datasets for time-series and spatial analysis applications (such as a zarr-temporal and a zarr-spatial bucket).
Tests - Zarr chunking
As promised, here are a couple of tests to experiment with zarr's chunking (with a subset of ERA5's data in the space dimension, for testing purposes). The test datasets can be found in the following bucket: s3://era5/world/reanalysis/single-levels/zarr-tests. They are named according to how the dimensions were chunked (more chunkings can be tested if requested):
The following code was tested both with a 50 Mbit/s internet connection and with a GCP E2 VM instance. When working in scientific environments (cloud/academic/corporate networks) with fast bandwidth (1 Gbps+), we expect the E2 VM instance to more accurately reflect the expected performance. I invite you to test it in your own environment (ping @richardarsenault, @huard, etc.).

```python
# Time to extract all 372552 ERA5 t2m hourly time points at a specific pixel
# and at multiple pixels to simulate the size of a watershed
import fsspec
import xarray as xr
import time
import os
import hvplot.xarray
from dask.distributed import Client

client = Client()

client_kwargs = {'endpoint_url': 'https://s3.wasabisys.com',
                 'region_name': 'us-east-1'}
buckets_dirname = 's3://era5/world/reanalysis/single-levels/zarr-tests'
buckets_basename = ['era5-test-time-8784-longitude-2-latitude-2',
                    'era5-test-time-87840-longitude-1-latitude-1',
                    'era5-test-time-87840-longitude-2-latitude-2',
                    'era5-test-time-87840-longitude-5-latitude-5',
                    'era5-test-time-372552-longitude-1-latitude-1',
                    'era5-test-time-372552-longitude-2-latitude-2',
                    'era5-test-time-372552-longitude-5-latitude-5']

for bucket_basename in buckets_basename:
    bucket = os.path.join(buckets_dirname, bucket_basename)
    store = fsspec.get_mapper(bucket,
                              anon=True,
                              client_kwargs=client_kwargs)
    ds = xr.open_zarr(store, consolidated=True)

    # Single pixel
    start = time.time()
    da = (ds.t2m - 273.15)[:, -5, -5].load()
    end = time.time()
    print(bucket_basename, ' : ', end - start, ' seconds')

    # Multiple pixels (watershed-sized request)
    start = time.time()
    da = (ds.t2m - 273.15)[:, 0:6, 0:6].load()
    end = time.time()
    print(bucket_basename, ' (multiple pixels) : ', end - start, ' seconds')

# With the 50 Mbits/s connection :
# era5-test-time-8784-longitude-2-latitude-2 : 1.5577044486999512 seconds
# era5-test-time-8784-longitude-2-latitude-2 (multiple pixels) : 5.472326755523682 seconds
# era5-test-time-87840-longitude-1-latitude-1 : 0.29694294929504395 seconds
# era5-test-time-87840-longitude-1-latitude-1 (multiple pixels) : 5.004878997802734 seconds
# era5-test-time-87840-longitude-2-latitude-2 : 2.889986991882324 seconds
# era5-test-time-87840-longitude-2-latitude-2 (multiple pixels) : 12.184975624084473 seconds
# era5-test-time-87840-longitude-5-latitude-5 : 3.0996978282928467 seconds
# era5-test-time-87840-longitude-5-latitude-5 (multiple pixels) : 11.215736627578735 seconds
# era5-test-time-372552-longitude-1-latitude-1 : 0.33968472480773926 seconds
# era5-test-time-372552-longitude-1-latitude-1 (multiple pixels) : 6.137498378753662 seconds
# era5-test-time-372552-longitude-2-latitude-2 : 0.6167998313903809 seconds
# era5-test-time-372552-longitude-2-latitude-2 (multiple pixels) : 4.357798337936401 seconds
# era5-test-time-372552-longitude-5-latitude-5 : 2.894357919692993 seconds
# era5-test-time-372552-longitude-5-latitude-5 (multiple pixels) : 11.943426370620728 seconds

# On the GCP E2 VM instance :
# era5-test-time-8784-longitude-2-latitude-2 : 1.3091447353363037 seconds
# era5-test-time-8784-longitude-2-latitude-2 (multiple pixels) : 8.865148305892944 seconds
# era5-test-time-87840-longitude-1-latitude-1 : 0.26541686058044434 seconds
# era5-test-time-87840-longitude-1-latitude-1 (multiple pixels) : 4.483410596847534 seconds
# era5-test-time-87840-longitude-2-latitude-2 : 0.6464042663574219 seconds
# era5-test-time-87840-longitude-2-latitude-2 (multiple pixels) : 0.7947907447814941 seconds
# era5-test-time-87840-longitude-5-latitude-5 : 0.32846498489379883 seconds
# era5-test-time-87840-longitude-5-latitude-5 (multiple pixels) : 1.1891248226165771 seconds
# era5-test-time-372552-longitude-1-latitude-1 : 0.27306079864501953 seconds
# era5-test-time-372552-longitude-1-latitude-1 (multiple pixels) : 1.621551513671875 seconds
# era5-test-time-372552-longitude-2-latitude-2 : 0.1411135196685791 seconds
# era5-test-time-372552-longitude-2-latitude-2 (multiple pixels) : 0.5919482707977295 seconds
# era5-test-time-372552-longitude-5-latitude-5 : 0.5591433048248291 seconds
# era5-test-time-372552-longitude-5-latitude-5 (multiple pixels) : 0.8325531482696533 seconds
```

Here is a snapshot of one of the test datasets (ds), and a quick graph to visualise the data once it is loaded to the user's computer:

```python
%%time
(da.hvplot(grid=True, label='t2m') * \
 da.resample(time='1Y').mean().hvplot(label='t2m-yearly')).opts(legend_position='top')
```

Other food for thoughts

Daily zarr datasets?

So far, we have kept the hourly time steps for zarr, which we should do for the main dataset in Step 1. However, assuming most users will usually need daily time steps, it could be interesting to add another zarr dataset at the daily resolution (t2m, tmax, tmin and tp). Such a dataset would be minimal in terms of size and costs (1/12th the size of the main hourly dataset) and would allow sub-second requests even for users with poor internet connections. The downside of this approach is that the conversion from UTC to the appropriate time zone would have to be done before aggregating to the daily time step, which means having daily zarr datasets for multiple North American time zones.
Timings from Ouranos
I have uploaded a new and hopefully fully corrected zarr version of the ERA5 (single levels) dataset. The previous dataset had some issues that are probably related to using Copernicus' experimental NetCDF format as the basis for the zarr conversion rather than their native GRIB format. The current zarr dataset was therefore created from GRIB files, and for now only this format should be used to append to the zarr dataset. ERA5's NetCDF files are still fine as long as they are not converted to zarr. What is new:
What is coming:
A word on performance
Since this issue was created, my team has released https://github.com/google-research/arco-era5. I was wondering if the folks in this issue would like to join forces and work together on a Phase 2 of our project :)
Just wanted to document a few updates on this issue before we close it. We now have two zarr datasets:
These terabyte-scale analysis-ready cloud-optimised (ARCO) datasets are (or, for ERA5-Land, will be) freely and remotely accessible from anywhere (cloud or a local laptop), efficiently chunked for time-series analysis plus some regional spatial analyses, and updated weekly. I believe they are among the first climate datasets to share all of these properties, so hopefully they get used a lot! Up next, we intend to add more variables and, possibly, other datasets. Stay tuned!
Description
Raven will need to access ERA5 data as a reference dataset in many workflows. Currently, ERA5 data are stored on THREDDS, but this is not efficient. Pangeo has a functional ERA5 implementation stored on Amazon AWS servers, but the data are chunked for spatial access rather than for time series. This project aims to develop a new, time-series-oriented ERA5 access point using cloud storage and computing.
The Wasabi service will be used to store and access the ERA5 data through an Amazon S3-compatible interface. Preliminary tests show that it will take at most a few seconds to extract basin-averaged time series for multi-decade periods. The data will be stored in zarr format and chunked so as to maximize speed for time-series analysis.
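A basin-averaged extraction of this kind can be sketched with xarray (the grid and box bounds below are made-up stand-ins, not the real dataset): select a lat/lon box around the watershed, then average over space.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny ERA5-like grid: descending latitudes, [0, 360] longitude convention.
time = pd.date_range("2020-01-01", periods=24, freq="h")
lat = np.arange(60.0, 40.0, -1.0)
lon = np.arange(280.0, 300.0, 1.0)
ds = xr.Dataset(
    {"t2m": (("time", "latitude", "longitude"),
             np.random.rand(24, lat.size, lon.size).astype("float32"))},
    coords={"time": time, "latitude": lat, "longitude": lon},
)

# Select a box around a hypothetical watershed (the slice follows coordinate
# order, so latitude runs from high to low) and average over space.
basin = ds.t2m.sel(latitude=slice(50, 45), longitude=slice(285, 290))
series = basin.mean(dim=("latitude", "longitude"))

print(dict(series.sizes))  # {'time': 24}
```

With time-oriented chunking, the `.load()` on such a series touches only a handful of chunks, which is what makes the multi-decade extraction fast.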
Steps
The implementation will be done in a few steps:
1- ERA5 worldwide domain (total precipitation [tp] and air temperature [tas] only at this stage, hourly data).
2- ERA5-Land ([tp] and [tas]) for North America only (due to file size limitations, only a subset of the global map can be accommodated).
3- ERA5 and ERA5-Land snow water equivalent [sd] for North America only.
We will then consider other variables and regions according to the needs of PAVICS-Hydro and other users, depending on the required resources. For example, pressure-level data and ensembles are excluded for now; only the actual reanalysis is made available.
Spatial Extents
Worldwide is self-explanatory.
North America will cover the region (first approximation/guess):
--- Latitude: [15, 85]
--- Longitude: [-167, -50]
*Note that ERA5 data natively uses the [0, 360]° longitude convention rather than the [-180, 180]° convention.
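Because of that convention mismatch, bounds given in [-180, 180]° need converting before selecting from the native grid; a one-line helper (illustrative, not part of any library) does it:

```python
def to_0_360(lon):
    """Convert a longitude from the [-180, 180] convention to [0, 360]."""
    return lon % 360

# The North America bounds above, in ERA5's native convention:
print(to_0_360(-167), to_0_360(-50))  # 193 310
```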
Temporal parameters
Data will be stored at the native hourly resolution and UTC times.
The covered period will start in 1979 and extend to the most recent periods available on our end.
Updating
Files will be manually updated at first. If all goes well, we should be able to implement an auto-updating scheme to allow users to access the most recent data at all times.
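One possible shape for such an auto-updating scheme (a sketch only; the function and variable names are hypothetical) is to compare the last published time step against ERA5's roughly 5-day availability lag and append only the missing days:

```python
import pandas as pd

def days_to_append(last_published, now, lag_days=5):
    """Daily timestamps missing between the last published time step and the
    most recent date ERA5 makes available (now minus the CDS lag)."""
    newest_available = (now - pd.Timedelta(days=lag_days)).normalize()
    start = (last_published + pd.Timedelta(days=1)).normalize()
    return pd.date_range(start, newest_available, freq="D")

missing = days_to_append(pd.Timestamp("2021-03-01"), pd.Timestamp("2021-03-10"))
print(list(missing.strftime("%Y-%m-%d")))
# ['2021-03-02', '2021-03-03', '2021-03-04', '2021-03-05']
```

Each missing day would then be fetched from the CDS API and appended to the zarr store along the time dimension.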
Thanks
Big thanks to @sebastienlanglois for the help and contributions, without whom this project would not be possible!