Proposed Recipes for Large Ensemble pCO2 testbed #219

jbusecke · 2022-11-07T19:57:18Z

Dataset Name

Large ensemble pCO2 testbed by @lgloege

Dataset URL

https://figshare.com/collections/Large_ensemble_pCO2_testbed/4568555

Description

This is a collection of randomly selected ensemble members from 4 large ensemble projects:

Each ensemble member was interpolated from its native grid to a 1x1 degree lat/lon grid. The variables are monthly over the 1982-2017 time frame and sampled as the SOCATv5 data product. Historical atmospheric CO2 is used up to 2005 with RCP8.5 after 2005.

The intention of this dataset is to evaluate ocean pCO2 gap-filling techniques.

License

Unknown

Data Format

NetCDF

Data Format (other)

No response

Access protocol

HTTP(S)

Source File Organization

The data is organized on different levels:

There are 5 models that provide a Large Ensemble (many different members to quantify internal variability)
For each model there is one file per ensemble member, named <model><member_id>.tar.gz
Each of the tar files contains several netcdf files that represent different variables

These variables are already concatenated in time

Example URLs

https://ndownloader.figshare.com/files/16129505

I actually have some trouble getting these from figshare. Has anyone here had experience pulling files from a collection/dataset in figshare? I'd be happy to learn the figshare API and parse HTTP links, but maybe there is something more clever to do with archive/DOI repos like figshare/zenodo?
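As a starting point for the figshare question, here is a minimal sketch of how the download links could be collected. It assumes the public figshare v2 REST API at `https://api.figshare.com/v2`, with a `/collections/{id}/articles` listing and a `/articles/{id}` record whose `files` entries carry `name` and `download_url`. Those endpoint paths and field names are assumptions that should be checked against the figshare API documentation. The helper names are made up for illustration.

```python
import json
from urllib.request import urlopen

# Assumed base URL of the public figshare v2 REST API.
FIGSHARE_API = "https://api.figshare.com/v2"


def fetch_json(url):
    """Fetch and decode a JSON document over HTTP(S)."""
    with urlopen(url) as response:
        return json.load(response)


def file_urls(article):
    """Map file name -> download URL for one article record.

    Assumes the record has a "files" list whose entries carry
    "name" and "download_url" keys.
    """
    return {f["name"]: f["download_url"] for f in article.get("files", [])}


def collection_file_urls(collection_id):
    """Collect download URLs for every article in a collection (needs network)."""
    urls = {}
    articles = fetch_json(f"{FIGSHARE_API}/collections/{collection_id}/articles")
    for summary in articles:
        article = fetch_json(f"{FIGSHARE_API}/articles/{summary['id']}")
        urls.update(file_urls(article))
    return urls
```

Calling `collection_file_urls(4568555)` would then be the entry point for this collection. The listing endpoint is probably paginated, so a real version would need to page through the results.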

Authorization

No; data are fully public

Transformation / Processing

This is pretty straightforward.

I'd suggest having one recipe per model (in a recipe dict) that simply combines the variables by merging them.

There should probably be some rechunking, but I think I need some input from the actual users (cc @hatlenheimdalthea @galenmckinley) on what the best chunking structure is for their use cases (e.g. are the gap-filling models trained on single time step maps or on location timeseries?).
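To make the chunking question concrete, here is a rough back-of-the-envelope sketch of how many bytes each access pattern has to read under a given chunk shape. It assumes a global 1x1 degree grid (180 x 360), 36 years of monthly data (1982-2017) and 4-byte floats. The dtype and the two example chunk shapes are assumptions for illustration only.

```python
import math

# Array shape implied by the description: monthly 1982-2017 on a global
# 1x1 degree grid (assumed to be 180 x 360 points).
SHAPE = (36 * 12, 180, 360)  # (time, lat, lon)


def read_cost(chunks, itemsize=4):
    """Bytes read to load one full map and one point timeseries.

    chunks is (time, lat, lon); itemsize=4 assumes float32 values.
    Every chunk touched is counted at full size, so this is an upper
    bound when the chunk shape does not divide the array evenly.
    """
    ct, cy, cx = chunks
    nt, ny, nx = SHAPE
    chunk_bytes = ct * cy * cx * itemsize
    map_chunks = math.ceil(ny / cy) * math.ceil(nx / cx)  # one time step
    series_chunks = math.ceil(nt / ct)  # one grid point
    return {
        "map_bytes": map_chunks * chunk_bytes,
        "series_bytes": series_chunks * chunk_bytes,
    }


map_friendly = read_cost((1, 180, 360))
series_friendly = read_cost((432, 10, 10))
```

With one map per chunk, a single timeseries touches every chunk in the array. With full-length time chunks on 10x10 tiles, the same is true for a single map. That is why the training access pattern matters here.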

Target Format

Zarr

Comments

No response


jbusecke · 2022-11-07T19:59:49Z

I was also unable to find the license of this dataset. I assume that it has some derived license from each of the used model datasets? Maybe @lgloege can help here.

galenmckinley · 2022-11-07T20:09:51Z

We did put this at BCO DMO, per requirement of NSF funding: https://www.bco-dmo.org/dataset/840334 There are some more references there, in case useful.

I don't know any more specifically about licenses, but I concur with your assumption. I hope @lgloege can reply there.

jbusecke · 2022-11-07T23:04:18Z

I have looked into this a bit more, but I have one aspect that I am struggling with: Each url points to a tar file that then contains multiple netcdf files which need to be merged in xarray.
This does break the assumption that there is a 1:1 mapping between URLs and files. Has anybody solved this previously? @pangeo-forge/dev-team ?
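To illustrate the one-URL-to-many-files situation, here is a small sketch using only the Python standard library. The in-memory archive stands in for one downloaded .tar.gz, and the member names are made up.

```python
import io
import tarfile


def make_demo_archive(members):
    """Build an in-memory .tar.gz from {name: bytes}.

    Stands in for the content behind one download URL.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def netcdf_members(fileobj):
    """Return the sorted names of the .nc files inside a .tar.gz file object."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        return sorted(
            m.name
            for m in archive.getmembers()
            if m.isfile() and m.name.endswith(".nc")
        )


demo = make_demo_archive({"pCO2.nc": b"a", "sst.nc": b"b", "README.txt": b"c"})
found = netcdf_members(demo)
```

Each name in `found` would have to become its own input to the merge step, which is the part that does not fit a 1:1 URL-to-file pattern.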

rabernat · 2022-11-14T15:28:15Z

@martindurant - do you know if it's possible for fsspec to index into a .tar.gz file the way it can with a .zip file? That is the key technical question. If so, we can use the same approach described in #90 (comment) to point at the individual files.

If not, we will not be able to ship this recipe without some more serious refactoring to pangeo forge recipes.

martindurant · 2022-11-14T15:57:30Z

Offsets within a gzip stream are not possible. There are no block markers and sequences can even start mid-byte. I had some vague ideas about brute force options to find viable offsets, but nothing has come of them. Much better than tar.gz would be a tar of gzipped files (which would be a static version of zip), but no one does this.

We can already index into tar and zip, and have plans to index into block-compressed files like blosc and zstd (even bzip2!) but never gzip.
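A minimal standard-library sketch of the difference (not how fsspec implements it): in a plain tar, the recorded data offset of a member can be used directly as a byte range. Once the tar is gzipped, the same offsets only make sense in the decompressed stream. The member names and payloads are made up, and `read_range` merely emulates an HTTP range request on in-memory bytes.

```python
import gzip
import io
import tarfile


def build_tar(members):
    """Build an uncompressed tar in memory from {name: bytes}."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def index_tar(raw):
    """Record (offset, size) of every member's data within the raw tar bytes."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as archive:
        return {m.name: (m.offset_data, m.size) for m in archive.getmembers()}


def read_range(raw, offset, size):
    """Emulate an HTTP byte-range request against the stored bytes."""
    return raw[offset:offset + size]


members = {"a.nc": b"first variable", "b.nc": b"second variable"}
raw_tar = build_tar(members)
index = index_tar(raw_tar)

# Plain tar: the stored bytes at the recorded offset are the member itself.
plain_hit = read_range(raw_tar, *index["b.nc"])

# tar.gz: the offsets refer to the decompressed stream, so slicing the
# compressed bytes does not return the member.
raw_targz = gzip.compress(raw_tar)
gz_hit = read_range(raw_targz, *index["b.nc"])
```

Getting `b.nc` out of `raw_targz` requires decompressing from the start of the stream, which for a remote file means downloading everything before it.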

rabernat · 2022-11-14T16:02:16Z

Martin, thanks for the quick reply! That makes sense.

Just brainstorming workarounds here... @lgloege - is there any chance you could publish a new version of this dataset using .zip files instead of .tar.gz?
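For what it's worth, the conversion itself could look roughly like the sketch below, which uses only the Python standard library. The function name and the demo member names are made up, and this is not necessarily how the dataset would actually be republished.

```python
import io
import shutil
import tarfile
import zipfile


def targz_to_zip(src, dst):
    """Copy every regular file from a .tar.gz file object into a .zip file object."""
    with tarfile.open(fileobj=src, mode="r:gz") as archive:
        with zipfile.ZipFile(dst, mode="w", compression=zipfile.ZIP_DEFLATED) as target:
            for member in archive:
                if not member.isfile():
                    continue  # skip directories, links, etc.
                with archive.extractfile(member) as source:
                    # force_zip64 allows members larger than 2 GiB.
                    with target.open(member.name, mode="w", force_zip64=True) as sink:
                        shutil.copyfileobj(source, sink)


def demo_targz():
    """Build a tiny in-memory .tar.gz with made-up member names."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in {"pCO2.nc": b"abc", "sst.nc": b"defg"}.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


zipped = io.BytesIO()
targz_to_zip(demo_targz(), zipped)
with zipfile.ZipFile(zipped) as archive:
    names = archive.namelist()
    sst = archive.read("sst.nc")
```

Because zip compresses each member separately and keeps a central directory, the individual files stay addressable after conversion.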

galenmckinley · 2022-11-16T16:51:04Z

@lgloege tells me he will work on this. He'll let us know when there's a new posting.

jbusecke added the proposed recipe label Nov 7, 2022

jbusecke mentioned this issue Nov 16, 2022

Building recipes from files located within a large tar.gz file pangeo-forge/pangeo-forge-recipes#442

Open

jbusecke mentioned this issue Jan 4, 2023

Proposed Recipes for ClimateBench #243

Closed
