
Ocean Reanalysis System 5 [ORAS5 ECMWF] #49

Open
sckw opened this issue Aug 31, 2023 · 3 comments

sckw commented Aug 31, 2023

Dataset Name

Ocean Reanalysis System 5

Dataset URL

https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-oras5?tab=form

Description

This is an ocean reanalysis dataset from ECMWF. It provides 3D, global, gridded (0.25° × 0.25°), monthly mean data from 1958 to present.

Variables include: velocity (meridional, zonal), wind stress, mixed layer depths (MLD), net heat flux, sea surface temperature (SST), potential temperature, sea surface salinity (SSS), salinity, sea ice, sea surface height (SSH), etc.

The dataset is large and takes a while to download for individual users. It would be useful to have it downloaded and stored in the LEAP Data Library, instead of each user downloading sub-datasets and keeping them in personal storage.

Size

3D datasets (e.g. meridional velocity) are larger, ~10 GB+; single-level datasets (e.g. SST) are <200 MB.

License

https://cds.climate.copernicus.eu/api/v2/terms/static/licence-to-use-copernicus-products.pdf

Data Format

NetCDF

Data Format (other)

No response

Access protocol

HTTP(S)

Source File Organization

There is one file per month, each containing a single timestep, so one year of data comprises 12 NetCDF files that can be concatenated along the time dimension, e.g. as sketched below.
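
For instance, a year of monthly files might be combined along time like this (a sketch only; the filename pattern is hypothetical and depends on what the downloaded archive contains):

import xarray as xr

# Hypothetical monthly filenames for 2009; real names depend on the
# contents of the downloaded archive.
paths = [f"vo_oras5_2009_{m:02d}.nc" for m in range(1, 13)]

# Each file holds one monthly timestep; concatenate along time.
year = xr.concat([xr.open_dataset(p) for p in paths], dim="time")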

Files are downloaded via API request. Documentation on the download how-to is found here: https://cds.climate.copernicus.eu/api-how-to

Example of an API request for meridional velocity data, for 2009-2011 (all months), is shown below.

Example URLs

import cdsapi

# The client reads the API URL and token from ~/.cdsapirc
# (see Authorization below).
c = cdsapi.Client()

# Request three full years of monthly meridional velocity at all depth
# levels; the result is delivered as a zip of monthly NetCDF files.
c.retrieve(
    'reanalysis-oras5',
    {
        'format': 'zip',
        'vertical_resolution': 'all_levels',
        'product_type': 'consolidated',
        'variable': 'meridional_velocity',
        'year': [
            '2009', '2010', '2011',
        ],
        'month': [
            '01', '02', '03',
            '04', '05', '06',
            '07', '08', '09',
            '10', '11', '12',
        ],
    },
    'example_meridional_download.zip')
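
The request returns a zip archive of monthly NetCDF files. A minimal sketch of unpacking and inspecting the result (the extraction directory is an arbitrary choice):

import zipfile

import xarray as xr

# Unpack the archive produced by the cdsapi request above.
with zipfile.ZipFile("example_meridional_download.zip") as zf:
    zf.extractall("oras5_vo")
    names = zf.namelist()

# Inspect one extracted monthly file.
ds = xr.open_dataset(f"oras5_vo/{names[0]}")
print(ds)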

Authorization

API Token
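
Per the api-how-to page linked above, the cdsapi client reads the token from a ~/.cdsapirc file; the UID and API key below are placeholders:

url: https://cds.climate.copernicus.eu/api/v2
key: <UID>:<API-KEY>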

Transformation / Processing

Since each file is one timestep (one month), it would be useful to combine the datasets into at least yearly files, or ideally the whole 1958–present range.

The files are also at 0.25° × 0.25° resolution, so regridding to 1° × 1° could be useful, e.g. as sketched below.
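
A minimal sketch of both steps with xarray and xESMF (the input paths, chunking, and regridding choices are assumptions, not a settled plan):

import xarray as xr
import xesmf as xe

# Combine all monthly files (one timestep each) into a single dataset.
ds = xr.open_mfdataset("oras5/*.nc", combine="by_coords")

# Regrid from the native 0.25-degree grid to a global 1-degree grid.
# Assumes the files expose lat/lon coordinates xESMF recognizes; the
# native NEMO grid may need its coordinates renamed first.
target = xe.util.grid_global(1.0, 1.0)
regridder = xe.Regridder(ds, target, "bilinear")
ds_1deg = regridder(ds)

# Write to Zarr, chunked along time (the chunk size is arbitrary).
ds_1deg.chunk({"time": 12}).to_zarr("oras5.zarr", mode="w")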

Target Format

Zarr

Comments

No response

sckw added the dataset label Aug 31, 2023

cisaacstern (Contributor) commented

Thanks for the request and detailed explanation, @sckw. Since the data is from ECMWF, I think we should probably use weather-dl to cache it, before transforming to Zarr with Pangeo Forge.

@alxmrs may be able to advise. Alex, I see weather-dl is documented as a CLI, but is there an opportunity to use its objects directly in a Pangeo Forge pipeline; e.g., in very coarse pseudocode (with the weather-dl bits referenced from here):

import apache_beam as beam

from pangeo_forge_recipes.transforms import StoreToZarr
from weather_dl.fetcher import Fetcher

recipe = (
    beam.Create(...)
    # the weather-dl part
    | 'GroupBy Request Limits' >> beam.GroupBy(...)
    | 'Fetch Data' >> beam.ParDo(Fetcher(...))
    # some tbd adapter
    | SomeAdapterTransform()
    # the Pangeo Forge part
    | StoreToZarr()
)

?


alxmrs commented Sep 1, 2023

I will take a deeper look at how PGF could use weather-tools (specifically weather-dl) as a library this Tuesday. I have a few ideas and words of caution.

I want to make sure that @mahrsee1997 has seen this, as he is now our weather-dl expert.

Some initial thoughts:

  • We could package our parser as a Beam transform such that users can pass in a config and it will do the create step.
  • How will we organize the partition and groupby steps into pure beam functions, given that they depend on pipeline-definition time code today?
  • weather-dl v1 has a few problems that stem from Beam. It’s why we have a v2 branch that downloads data with a k8s cluster. One main problem to note: due to low CPU utilization while we wait for data to be ready from EC’s server, Beam’s autoscaler spins up more workers to process the task. This means that we eventually hit EC’s request quota. As a mitigation, weather-dl v1 sets the max number of workers based on the input config (see the sketch after this list).
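
For example, on Dataflow a worker ceiling can be pinned via pipeline options. A minimal sketch of that mitigation (not weather-dl's actual code; the cap of 4 is arbitrary):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Cap autoscaling so concurrent requests stay within ECMWF's quota;
# max_num_workers is a standard Dataflow option, and 4 is arbitrary.
options = PipelineOptions(max_num_workers=4)

with beam.Pipeline(options=options) as p:
    pass  # download transforms would be attached here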

Would you be open to a semi-Beam or non-Beam solution? With weather-dl 1.5 or 2, we’ve found more stability in downloading and higher utilization of ECMWF licenses.

I think this could integrate well into PGF in other ways; for example, by having Zarr conversion react to raw data appearing in a bucket (more in line with the streaming stuff we’ve been talking about).
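
To make that concrete, a heavily hedged sketch of the reaction step (the bucket path, trigger wiring, and append logic are all assumptions, not an agreed design): a handler fires whenever a new monthly NetCDF lands in the raw bucket and appends its timestep to the Zarr store.

import xarray as xr

def on_new_object(path: str) -> None:
    # Hypothetical handler invoked by a bucket notification when a new
    # monthly file appears; assumes the target Zarr store already exists
    # with a compatible schema (gcsfs would handle the gs:// URL).
    ds = xr.open_dataset(path)
    ds.to_zarr("gs://leap-data-library/oras5.zarr", append_dim="time")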

cisaacstern (Contributor) commented

> I think this could integrate well into PGF in other ways; for example, by having Zarr conversion react to raw data appearing in a bucket (more in line with the streaming stuff we’ve been talking about).

I love this idea and discussed it with @rabernat, who agreed it's a great direction for us to take generally. Opened pangeo-forge/pangeo-forge-recipes#598 to discuss details. Thanks for summarizing the pros/cons of using weather-dl transforms directly in Pangeo Forge. That seems sufficiently difficult, and the streaming alternative sufficiently elegant, that I think we should focus 100% on the streaming option. 😄
