Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MUR SST Kerchunk recipe #647

Open
abarciauskas-bgse opened this issue Nov 8, 2023 · 5 comments
Open

MUR SST Kerchunk recipe #647

abarciauskas-bgse opened this issue Nov 8, 2023 · 5 comments
Assignees

Comments

@abarciauskas-bgse
Copy link
Contributor

No description provided.

@sbquinlan
Copy link

sbquinlan commented Nov 10, 2023

This issue should probably be on pangeo-forge/staged-recipes, but I've been working on a new branch this week for what I thought would be a relatively straightforward change to the existing MUR SST recipe branch and found that the lat and lon dimensions of the source dataset aren't working with Kerchunk.

Current recipe:

identical_dims = ['lat', 'lon']
selected_vars = ['analysed_sst', 'analysis_error', 'mask', 'sea_ice_fraction']
recipe = (
    beam.Create(pattern.items())
    | OpenWithKerchunk(
        remote_protocol='s3' if selected_rel == S3_REL else 'https',
        file_type=pattern.file_type,
        # inline_threshold=10000,
        storage_options=pattern.fsspec_open_kwargs
    )
    | FilterVars(
        keep={*pattern.concat_dims, *identical_dims, *selected_vars}
    )
    | WriteCombinedReference(
        store_name='mursst',
        concat_dims=pattern.concat_dims,
        identical_dims=identical_dims,
    )
)

The key line here is inline_threshold=10000 I found that the lat and lon dataarrays are about 5kb. So if I pick an inline_threshold larger than that, kerchunk copies the dataarrays into the references.

Without inline:

>>> ds
<xarray.Dataset>
Dimensions:           (time: 2, lat: 17999, lon: 36000)
Coordinates:
  * lat               (lat) float32 nan nan nan nan nan ... nan nan nan nan nan
  * lon               (lon) float32 nan nan nan nan nan ... nan nan nan nan nan
  * time              (time) datetime64[ns] 2002-06-01T09:00:00 2002-06-02T09...
Data variables:
    analysed_sst      (time, lat, lon) float32 ...
    analysis_error    (time, lat, lon) float32 ...
    mask              (time, lat, lon) float32 ...
    sea_ice_fraction  (time, lat, lon) float32 ...
Attributes: (12/47)
    Conventions:                CF-1.5
    Metadata_Conventions:       Unidata Observation Dataset v1.0
    acknowledgment:             Please acknowledge the use of these data with...
    cdm_data_type:              grid
    comment:                    MUR = "Multi-scale Ultra-high Reolution"
    creator_email:              [email protected]
    ...                         ...
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    time_coverage_end:          20020602T210000Z
    time_coverage_start:        20020601T210000Z
    title:                      Daily MUR SST, Final product
    uuid:                       27665bc0-d5fc-11e1-9b23-0800200c9a66
    westernmost_longitude:      -180.0

With inline:

>>> ds
<xarray.Dataset>
Dimensions:           (time: 2, lat: 17999, lon: 36000)
Coordinates:
  * lat               (lat) float32 -89.99 -89.98 -89.97 ... 89.97 89.98 89.99
  * lon               (lon) float32 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
  * time              (time) datetime64[ns] 2002-06-01T09:00:00 2002-06-02T09...
Data variables:
    analysed_sst      (time, lat, lon) float32 ...
    analysis_error    (time, lat, lon) float32 ...
    mask              (time, lat, lon) float32 ...
    sea_ice_fraction  (time, lat, lon) float32 ...
Attributes: (12/47)
    Conventions:                CF-1.5
    Metadata_Conventions:       Unidata Observation Dataset v1.0
    acknowledgment:             Please acknowledge the use of these data with...
    cdm_data_type:              grid
    comment:                    MUR = "Multi-scale Ultra-high Reolution"
    creator_email:              [email protected]
    ...                         ...
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    time_coverage_end:          20020601T210000Z
    time_coverage_start:        20020531T210000Z
    title:                      Daily MUR SST, Final product
    uuid:                       27665bc0-d5fc-11e1-9b23-0800200c9a66
    westernmost_longitude:      -180.0

I'm not entirely sure what the best practice is to handle this scenario or how to properly detect that identical_dims are not in the desired format, but I'll ask around.

Separately, the resulting metadata isn't consolidated which I also don't entirely understand and maybe related. I'm working through this and hope to resolve this next week.

@sbquinlan
Copy link

I was doing it wrong, forgot that I needed to forward along the storage_options to remote_options when creating a reference fs.

@sbquinlan
Copy link

@abarciauskas-bgse
Copy link
Contributor Author

abarciauskas-bgse commented Jan 12, 2024

I was able to run this recipe locally with the scripts/changes/tests outlined in this gist: https://gist.github.com/abarciauskas-bgse/cff4c77ea841601773d087bc6b45122b

A few questions / next steps:

@ranchodeluxe
Copy link
Contributor

I was able to run this recipe locally with the scripts/changes/tests outlined in this gist: https://gist.github.com/abarciauskas-bgse/cff4c77ea841601773d087bc6b45122b

Nice 🥳 Can you run it locally full on without --prune? Have you tried that? That's probably the next step to proceed the ones below. But I imagine we can make a new PR with the changes from the gist

A few questions / next steps:

* [ ]  I see that a number of transforms are defined in https://github.com/pangeo-forge/staged-recipes/blob/master/recipes/mursst/recipe.py. Should those be added to pangeo-forge-recipes transforms?

I think all these custom ones have already been merged or are about to by Raphael. Let me review what needs to change and comment/suggest on your new PR when you have it up

* [ ]  I think we need to convert to using a file pattern rather than relying on GranuleQuery, although I'm not sure it's absolutely necessary - @ranchodeluxe have you experienced failures due to using GranuleQuery rather than a file pattern?

I think forcing the lookup to happen on the workers instead of the host is ideal yes

* [ ]  Run on https://github.com/NASA-IMPACT/veda-pforge-job-runner, perhaps replacing or updating [Failing: MUR SST NASA-IMPACT/veda-pforge-job-runner#15](https://github.com/NASA-IMPACT/veda-pforge-job-runner/issues/15) and the linked code

Yeah, let me run this later today as a prune=True and prune=False to see what happens

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants