How best to support Ensembles of data #172

Open
chris-little opened this issue Jan 31, 2024 · 7 comments
Labels
Priority 1 Highly desirable V1.1 Non-breaking change for Version 1.1

Comments

@chris-little
Contributor

chris-little commented Jan 31, 2024

@m-burgoyne tried to create CoverageJSON of some NWP ensemble forecast data using a fifth custom dimension. Failed schema check. @jonblower @letmaik Could the schema be enhanced to support more dimensions, and remain backward compatible?

@jonblower
Contributor

CoverageJSON can support an arbitrary number of axes, including ensemble. However, the defined domain types do specify which axes they support. So if he’s trying to use a Grid domain type (which is defined to have x and y and maybe z and t) with a custom axis, then yes, the validation will fail.
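
To illustrate why the check fails, here is a minimal Python sketch of the axis rule described above (Grid has x and y, and maybe z and t). The names `GRID_REQUIRED`, `GRID_OPTIONAL` and `check_grid_axes` are hypothetical, the axis name `ens` is an assumption, and this is not the actual CoverageJSON JSON Schema.

```python
# Hypothetical sketch of the Grid axis rule described above; not the real schema.
GRID_REQUIRED = {"x", "y"}
GRID_OPTIONAL = {"z", "t"}


def check_grid_axes(domain):
    """Return a list of problems found in a Grid domain's axes."""
    axes = set(domain.get("axes", {}))
    problems = []
    missing = GRID_REQUIRED - axes
    if missing:
        problems.append("missing required axes: " + ", ".join(sorted(missing)))
    extra = axes - GRID_REQUIRED - GRID_OPTIONAL
    if extra:
        problems.append("axes not allowed for Grid: " + ", ".join(sorted(extra)))
    return problems


domain = {
    "type": "Domain",
    "domainType": "Grid",
    "axes": {
        "x": {"values": [0.0, 1.0]},
        "y": {"values": [50.0, 51.0]},
        "t": {"values": ["2024-01-31T00:00:00Z"]},
        "ens": {"values": [0, 1, 2]},  # assumed name for the custom ensemble axis
    },
}

print(check_grid_axes(domain))
```

Dropping `domainType` (the second option below) would sidestep a rule like this, at the cost of clients no longer knowing which axes to expect.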

Options within the current spec are:

  • define a new domain type (EnsembleGrid maybe?)
  • not specify a domain type at all, and then he can use whatever axes he wants (but that makes it harder for downstream clients)
  • encode the data as a CoverageCollection, where each ensemble member is a Coverage.

The above basically mirrors what usually happens in the CF-NetCDF world (i.e. there is no established convention for an ensemble axis and most software won’t recognise one).
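
As a rough sketch of the third option, the Python below builds a CoverageCollection in which each ensemble member is its own Grid Coverage. The helper `make_member_coverage`, the `member-N` id scheme, the parameter name `temperature` and the data values are all illustrative assumptions; `referencing` and fuller parameter metadata are omitted for brevity, so the output is not a complete valid document.

```python
import json


def make_member_coverage(member, values):
    """Build one Coverage for a single ensemble member on a tiny 2x2 grid."""
    return {
        "type": "Coverage",
        "id": f"member-{member}",  # assumed convention for identifying the member
        "domain": {
            "type": "Domain",
            "domainType": "Grid",
            "axes": {
                "x": {"values": [0.0, 1.0]},
                "y": {"values": [50.0, 51.0]},
            },
        },
        "ranges": {
            "temperature": {
                "type": "NdArray",
                "dataType": "float",
                "axisNames": ["y", "x"],
                "shape": [2, 2],
                "values": values,
            }
        },
    }


# Made-up values for three ensemble members.
member_values = [
    [271.0, 271.5, 272.0, 272.5],
    [270.8, 271.9, 272.1, 272.2],
    [271.3, 271.4, 272.6, 272.9],
]

collection = {
    "type": "CoverageCollection",
    "domainType": "Grid",
    "parameters": {
        "temperature": {
            "type": "Parameter",
            "observedProperty": {"label": {"en": "Air temperature"}},
        }
    },
    "coverages": [make_member_coverage(i, v) for i, v in enumerate(member_values)],
}

print(json.dumps(collection, indent=2)[:200])
```

Note that the domain is repeated in every member, which is the metadata redundancy this option brings.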

The other aspect is that ensemble datasets are likely to get very large and JSON documents aren’t designed for large bulk transfers. Datasets could be split into tiles, but this is more complicated. Could it be easier just to have each ensemble member as a separate coverage?

@chris-little
Contributor Author

@jonblower We could discuss this online tomorrow (@m-burgoyne and I will not be in the office). But we are keen to future-proof for a variety of use cases, and therefore to support multiple dimensions. We also envisage pervasive ensemble usage, though only of "small" subsets via OGC API-EDR and its various Parts (Core, Pub-Sub, aggregated stats, etc).

We may need to break with the traditional CF-NetCDF model at some stage to benefit from (geo)Zarr, etc.

  1. Defining a new domain type would be a minimal extension and meet the immediate need;
  2. Using whatever axes are needed would give flexibility but may need some clever schema handling in both servers and clients;
  3. Encoding ensembles as a CoverageCollection would also be minimalist, not even an extension, but there may be the issue of assumed ordering of the CoverageCollection;
  4. Splitting into tiles is attractive for other large datasets, but we need to ensure that tiles can be persistent and identified for caching to support multiple users. I think we should create a separate issue for this last option.

The above is just me thinking aloud.

@jonblower
Contributor

Just to note that the purpose of the "domain types" is to define a practical set of restricted profiles of CoverageJSON so that clients don't have to deal with a large multiplicity of approaches. And yes, the point of using tiles is that the individual tiles are persistent and cacheable.

My philosophical point is that I have always been wary of treating "ensemble" as a dimension, with the same status as a spatiotemporal dimension. I know it's tempting to do so in the context of hypercubes, but ensembles (unlike S-T dimensions) don't have any defined ordering. I have always preferred to model them essentially as separate coverages, although this does introduce some redundancy of metadata.

Another option, which I've only just thought of, is to somehow deal with this in the Range. Maybe this isn't very neat, but if the ensemble members all share the same domain, then we could have (n x m) Range documents, where n is the number of ensemble members and m is the number of variables in each member. Would need more thought (and coffee!)
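
A minimal sketch of what the (n x m) idea might look like, assuming one shared Grid domain and one NdArray range per (variable, member) pair. The `<variable>_m<member>` key scheme, the `build_ranges` helper and the placeholder data are hypothetical; how such keys would relate to the `parameters` member is left open here.

```python
from itertools import product

# One shared domain, defined once (tiny 2x2 grid; referencing omitted for brevity).
domain = {
    "type": "Domain",
    "domainType": "Grid",
    "axes": {"x": {"values": [0.0, 1.0]}, "y": {"values": [50.0, 51.0]}},
}


def build_ranges(variables, n_members, values_for):
    """Return n x m NdArray ranges keyed by a hypothetical '<variable>_m<member>' name."""
    ranges = {}
    for var, member in product(variables, range(n_members)):
        ranges[f"{var}_m{member}"] = {
            "type": "NdArray",
            "dataType": "float",
            "axisNames": ["y", "x"],
            "shape": [2, 2],
            "values": values_for(var, member),
        }
    return ranges


def dummy_values(var, member):
    # Placeholder data: four numbers per range so that shape [2, 2] is satisfied.
    return [float(member)] * 4


coverage = {
    "type": "Coverage",
    "domain": domain,
    "ranges": build_ranges(["temperature", "pressure"], 3, dummy_values),
}

print(sorted(coverage["ranges"]))
```

With n = 3 members and m = 2 variables this yields six ranges against a single domain.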

@chris-little
Contributor Author

@jonblower An advantage of your third approach, though it may be prone to being voluminous and to creeping out of the original scope of CoverageJSON, is that it is more akin to the underlying theoretical scientific approach. I.e. an ensemble is an approximation to a probability distribution function (pdf) of a variable. It may be more amenable or elegant for retrieving derived statistics such as a percentile or a threshold value.
More discussion needed. Possibly multiple approaches too.
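
To make the derived-statistics point concrete, here is a small Python sketch that treats the ensemble members at each grid point as a sample from the distribution and computes a 90th percentile and a threshold exceedance probability. The member values, the threshold and the helper names are made up for illustration, and the percentile uses the standard library's "inclusive" interpolation, which is only one of several possible definitions.

```python
from statistics import quantiles

# Five hypothetical ensemble members, each holding values at two grid points.
members = [
    [271.0, 280.0],
    [272.5, 281.0],
    [273.4, 279.5],
    [274.1, 282.0],
    [275.0, 280.5],
]


def percentile_90(sample):
    """90th percentile of one grid point's ensemble values (inclusive method)."""
    return quantiles(sample, n=10, method="inclusive")[8]


def exceedance_probability(sample, threshold):
    """Fraction of ensemble members strictly above the threshold."""
    return sum(1 for v in sample if v > threshold) / len(sample)


# Regroup the data so that each tuple holds all members' values at one grid point.
per_point = list(zip(*members))

p90 = [percentile_90(sample) for sample in per_point]
exceed = [exceedance_probability(sample, 273.15) for sample in per_point]

print(p90)
print(exceed)
```

Both statistics are computed across members rather than along an ordered axis, which fits the view of the ensemble as an unordered sample.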

@jonblower
Contributor

I think any approach will be voluminous, but the appealing things about dealing with this in the Range are: (1) we only need to define the Domain once, (2) the high volume can easily be managed by splitting the Range among separate documents, so the client only needs to get the members they want. If I get some time I'll see if I can work up a proposal - there could be a fatal flaw I haven't spotted!
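
A sketch of point (2), assuming each range lives in its own document and the client holds an index of their URLs. The index structure, the `ranges_to_fetch` helper and the `example.org` URL layout are hypothetical placeholders, and no network request is made.

```python
BASE = "https://example.org/ensemble/run-1"  # placeholder base URL

# Hypothetical index: one separate range document per (variable, member) pair.
range_index = [
    {"variable": var, "member": m, "url": f"{BASE}/ranges/{var}/member-{m}.covjson"}
    for var in ("temperature", "pressure")
    for m in range(4)
]


def ranges_to_fetch(index, wanted_members, wanted_variables):
    """Pick the range-document URLs for just the requested members and variables."""
    return [
        entry["url"]
        for entry in index
        if entry["member"] in wanted_members and entry["variable"] in wanted_variables
    ]


urls = ranges_to_fetch(range_index, wanted_members={0, 3}, wanted_variables={"temperature"})
for url in urls:
    print(url)
```

Here the client would download two of the eight range documents, plus the domain once.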

chris-little added the "Priority 1 Highly desirable" and "V1.1 Non-breaking change for Version 1.1" labels Jul 11, 2024
@jonblower
Contributor

@chris-little Since I don't think this is a schema problem, but a question of how CovJSON can best support ensembles, would it be worth renaming this ticket, or closing it and opening another one? Did you and @m-burgoyne discuss a preferred approach?

chris-little changed the title from "Fix schema to allow more than 4 dimensions" to "How best to support Ensembles of data" Jul 12, 2024
@chris-little
Contributor Author

@jonblower As suggested, I have re-titled this issue and created a separate one (Issue #184) for more than 4 dimensions.
