Proposed Recipes for NASA MODIS-COSP data (satellite observations of clouds) #125
Comments
Thanks for this proposal, @RobertPincus. If I understand correctly, there are at least two related (but distinct) goals outlined here:
Generally, Pangeo Forge is focused on being very good at goal 1: producing optimized mirrors of archival datasets (in complete form). Once this is accomplished, goal 2 becomes much easier, as the data will be staged in a manner conducive to scalable parallel computation. In terms of goal 1 (producing the cloud-optimized mirror), I note that programmatic distribution of this data is available via OPeNDAP. This suggests to me that producing a Zarr dataset may be our best option. Kerchunk would be a more efficient option if the data were already staged as netCDFs. Given that is not the case, producing a kerchunk index would entail storing the dataset as netCDFs on the cloud and then producing the kerchunk indexes, a less efficient (two-step) process compared to directly creating a Zarr store from the OPeNDAP endpoint. If starting with creation of a Zarr copy of this dataset is an acceptable starting place, we can use Is working on this recipe (a few dozen lines of Python code) something you or someone in your group is interested in? If so I can point to the relevant documentation for getting started. If not, we can open this up to others (myself included, perhaps) to collaborate on this development, though note that this latter option may take a bit longer to get spun up. Looking forward to bringing this vision to life! |
@cisaacstern, thanks very much for this feedback. As a point of clarification, the data already contains the means, joint histograms, etc. that we want - they are just accessed via netCDF groups. One wrinkle is that the file names contain the date of production. Since we don't know this date a priori it amounts to a quasi-random string. Do you know if there's a way in OPeNDAP to specify opening files that match a certain pattern including a wildcard? I'm open to outputting Zarr; if people want to recycle the recipe to make local netCDF mirrors that'll be easy enough. I don't yet understand if, say, one Zarr object is roughly equivalent to a netCDF file, or if a single object could include many variables. For a first try you can certainly point me to documentation and I can see how far I can get. Thanks a lot. |
This is a really annoying feature of many datasets. Do we know if the hyrax server exposes a TDS catalog or any other catalog? If so, we could crawl it to populate the FilePattern. |
I'll see if I can find out about a TDS catalog. JSON files are provided, at least (top level). |
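Since the production timestamp in the file name is not known in advance, one way to pick the right file out of a directory listing is to match everything except the timestamp. The sketch below assumes the listing is a list of records with a "name" field, as the JSON files mentioned above appear to be; the `listing` data and the `find_granule` helper are hypothetical, made up for illustration.

```python
import re

# Hypothetical excerpt of a directory listing: a list of records with a
# "name" field. The "README.txt" record is invented for illustration.
listing = [
    {"name": "MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc"},
    {"name": "README.txt"},
]


def find_granule(records, year, day_of_year):
    """Return the single file name matching the given year and day of year."""
    # The trailing run of digits is the production timestamp, which is
    # unknown ahead of time, so it is matched with a wildcard (\d+).
    pattern = re.compile(
        rf"MCD06COSP_M3_MODIS\.A{year}{day_of_year:03d}\.061\.\d+\.nc"
    )
    matches = [r["name"] for r in records if pattern.fullmatch(r["name"])]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches[0]


print(find_granule(listing, 2002, 182))
```

Raising when there is not exactly one match is a deliberate choice here: it surfaces reprocessed or missing months instead of silently taking the first entry.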
@RobertPincus thanks for the clarification. Here is the documentation on recipe contribution. (This was published just this morning, so if anything doesn't make sense, that's my fault! Please let me know if so and I will amend.) Re: your question about what a Zarr store can represent, a single Zarr store can include as many variables as we want, so long as they exist on the same time dimension. As you'll see in the linked docs, you'll want to define a Recipe Object (in this case, an I've worked out a start for this format function based on the (very helpful!) JSON catalog link you provided:
import pandas as pd
import requests

BASE_URL = "http://ladsweb.modaps.eosdis.nasa.gov"
DATASET_ID = "61/MCD06COSP_M3_MODIS"
dates = pd.date_range("2002-07-01", "2021-07-01", freq="MS")  # "MS" for "month start"

def make_url(date):
    """Make an OPeNDAP url for NASA MODIS-COSP data based on an input date.

    :param date: A member of the ``pandas.core.indexes.datetimes.DatetimeIndex``
        created with ``dates = pd.date_range("2002-07-01", "2021-07-01", freq="MS")``.
    """
    day_of_year = date.timetuple().tm_yday
    # Look up the directory listing for this date to discover the file name
    response = requests.get(
        f"{BASE_URL}/archive/allData/{DATASET_ID}/{date.year}/{day_of_year}.json"
    )
    filename = [r["name"] for r in response.json()].pop(0)
    return f"{BASE_URL}/opendap/hyrax/allData/{DATASET_ID}/{date.year}/{day_of_year}/{filename}"
This function faithfully reproduces the example url you provided in your first comment on this thread:
url = make_url(dates[0])
print(url)
However I get an error when trying to open this URL with xarray:
import xarray as xr
ds = xr.open_dataset(url)
And the resulting dataset has no variables:
print(ds)
Perhaps I am missing some essential keyword argument(s) for |
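One detail of the date handling may be worth checking before running the full date range. `tm_yday` is a plain integer, so 1 January yields `1` rather than `001`. The only path confirmed in this thread is `2002/182`, so zero-padding to three digits is an assumption about the archive layout that I have not verified. A minimal standard-library sketch of the monthly keys and a padded path fragment (`month_starts` and `day_of_year_path` are hypothetical helper names):

```python
from datetime import date


def month_starts(first, last):
    """Yield the first day of each month from `first` to `last`, inclusive."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def day_of_year_path(d):
    """Build a '<year>/<day-of-year>' fragment with a three-digit day."""
    # Assumption: the archive zero-pads the day of year (001, 032, ...).
    return f"{d.year}/{d.timetuple().tm_yday:03d}"


dates = list(month_starts(date(2002, 7, 1), date(2021, 7, 1)))
print(len(dates))                            # number of monthly files expected
print(day_of_year_path(dates[0]))            # 2002/182
print(day_of_year_path(date(2003, 1, 1)))    # 2003/001
```

If the archive turns out not to pad the day of year, dropping the `:03d` format specifier restores the behaviour of the function above.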
@cisaacstern Thanks for this. I'll catch up later this week, but meanwhile, perhaps you can try with |
This error message is coming from the netCDF4 C library.
import netCDF4
url = 'http://ladsweb.modaps.eosdis.nasa.gov/opendap/hyrax/allData/61/MCD06COSP_M3_MODIS/2002/182/MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc'
ds = netCDF4.Dataset(url, "r")
This means that the Hyrax server is emitting data that cannot be properly parsed by the official Unidata netCDF4 library. This is a problem with the server and needs to be brought to the attention of the NASA system administrator. Is there a direct link to netCDF file download (rather than OPeNDAP endpoint)? |
One can access the files through a GUI by appending |
That website - https://ladsweb.modaps.eosdis.nasa.gov/opendap/hyrax/allData/61/MCD06COSP_M3_MODIS/2002/182/MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc.dmr.html - does not show any data variables either, just lon and lat. |
To get variables one has to open a group, i.e. |
I cannot discover any groups from that opendap url.
import netCDF4
url = 'http://ladsweb.modaps.eosdis.nasa.gov/opendap/hyrax/allData/61/MCD06COSP_M3_MODIS/2002/182/MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc'
ds = netCDF4.Dataset(url, "r")
print(ds.groups)  # --> {}
Where are you inputting the group information when you access the data? |
My attention is a little split these days, sorry. Like you both, I have been unable to open the files remotely via OPeNDAP. I will see what I can learn from NASA but they have not been very responsive. I will also see if I can sleuth out direct download links, which I have not been able to find anywhere obvious. Once the file is downloaded I've been able to see data with e.g.
|
Now the server is returning a 500 server error |
The html page is back up. Inspecting the source there reveals that appending
<input type="button" value="Get as NetCDF 4" onclick="getAs_button_action('NetCDF-4 Data', '.dap.nc4')">
Amending the earlier
import fsspec
import pandas as pd
import requests
import xarray as xr
import yaml

BASE_URL = "http://ladsweb.modaps.eosdis.nasa.gov"
DATASET_ID = "61/MCD06COSP_M3_MODIS"
dates = pd.date_range("2002-07-01", "2021-07-01", freq="MS")  # "MS" for "month start"

def make_url(date):
    """Make a NetCDF4 download url for NASA MODIS-COSP data based on an input date.

    :param date: A member of the ``pandas.core.indexes.datetimes.DatetimeIndex``
        created with ``dates = pd.date_range("2002-07-01", "2021-07-01", freq="MS")``.
    """
    day_of_year = date.timetuple().tm_yday
    response = requests.get(
        f"{BASE_URL}/archive/allData/{DATASET_ID}/{date.year}/{day_of_year}.json"
    )
    filename = [r["name"] for r in response.json()].pop(0)
    return f"{BASE_URL}/opendap/hyrax/allData/{DATASET_ID}/{date.year}/{day_of_year}/{filename}.dap.nc4"

# Download the first file to local disk
test_filename = "test.nc"
with fsspec.open(make_url(dates[0])) as src:
    with open(test_filename, mode="wb") as dst:
        dst.write(src.read())

# Try opening every group named in the file's embedded YAML config
ds = xr.open_dataset(test_filename, engine='netcdf4')
yaml_config = yaml.safe_load(ds.YAML_config)
group_name_pairs = [(v["name_in"], v["name_out"]) for v in yaml_config["variable_settings"]]
for pair in group_name_pairs:
    for group in pair:
        try:
            ds = xr.open_dataset(test_filename, engine='netcdf4', group=group)
        except OSError as e:
            print(e)
Here's the full YAML config:
|
@cisaacstern This part of the code is supposed to create a copy?
Because the file created is much smaller than the original:
|
Yes, that's the code block which aims to download the file. How did you get this 40 MB When I navigate to the GUI at and click Get as NetCDF 4 the file my web browser downloads is 47668 bytes
which is the same size as the Looks like your 40 MB |
... hmm on closer reading your file has an |
The better comparison would be to
|
Thanks for these helpful clarifications re: expected data size, Robert. I've made considerable headway with both file retrieval and a draft of the recipes themselves. Buckle up for a longish but hopefully useful post. Exploring the LAADS DAAC website a bit turned up the HTTP file service, e.g.
demonstrates a wget ... --header "Authorization: Bearer INSERT_DOWNLOAD_TOKEN_HERE" After generating a token according to these instructions and exporting it as the
import os
import fsspec

base_url = (
    "https://ladsweb.modaps.eosdis.nasa.gov/"
    "archive/allData/61/MCD06COSP_M3_MODIS/2002/182"
)
filename = "MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc"
# Pass the Earthdata token as a bearer token in the request headers
with fsspec.open(
    f"{base_url}/{filename}",
    client_kwargs=dict(headers=dict(Authorization=f"Bearer {os.environ['EARTHDATA_TOKEN']}")),
) as src:
    with open(filename, mode="wb") as dst:
        dst.write(src.read())
The resulting file has an openable group for each of the names provided in its
import xarray as xr
import yaml

ds = xr.open_dataset(filename)
yaml_config = yaml.safe_load(ds.YAML_config)
group_names = [v["name_out"] for v in yaml_config["variable_settings"]]
has_groups = []
for group in group_names:
    try:
        ds = xr.open_dataset(filename, group=group)
    except OSError as e:
        print(e)
    else:
        # Only record groups that opened without error
        has_groups.append(group)
print(has_groups)
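For anyone who wants to check the token handling without fsspec, the same bearer-token header can be built with the standard library alone. This is only a sketch: `build_request` is a hypothetical helper, the token value is a placeholder, and no network request is made until the request object is passed to `urllib.request.urlopen`.

```python
import os
import urllib.request


def build_request(url, token=None):
    """Build a request carrying the Earthdata token as a bearer token."""
    # Fall back to the EARTHDATA_TOKEN environment variable used above.
    if token is None:
        token = os.environ.get("EARTHDATA_TOKEN")
    if not token:
        raise RuntimeError("set EARTHDATA_TOKEN or pass token explicitly")
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )


req = build_request(
    "https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/61/"
    "MCD06COSP_M3_MODIS/2002/182/"
    "MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc",
    token="PLACEHOLDER",  # not a real token
)
print(req.get_header("Authorization"))
```

Failing early when the token is missing gives a clearer message than the `KeyError` raised by `os.environ['EARTHDATA_TOKEN']`.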
With this file access knowledge in hand, we can write a dictionary containing a naive
import os
import pandas as pd
import requests
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe
GROUPS = [
'Solar_Zenith',
'Solar_Azimuth',
'Sensor_Zenith',
'Sensor_Azimuth',
'Cloud_Top_Pressure',
'Cloud_Mask_Fraction',
'Cloud_Mask_Fraction_Low',
'Cloud_Mask_Fraction_Mid',
'Cloud_Mask_Fraction_High',
'Cloud_Optical_Thickness_Liquid',
'Cloud_Optical_Thickness_Ice',
'Cloud_Optical_Thickness_Total',
'Cloud_Optical_Thickness_PCL_Total',
'Cloud_Optical_Thickness_Log10_Liquid',
'Cloud_Optical_Thickness_Log10_Ice',
'Cloud_Optical_Thickness_Log10_Total',
'Cloud_Particle_Size_Liquid',
'Cloud_Particle_Size_Ice',
'Cloud_Water_Path_Liquid',
'Cloud_Water_Path_Ice',
'Cloud_Retrieval_Fraction_Liquid',
'Cloud_Retrieval_Fraction_Ice',
'Cloud_Retrieval_Fraction_Total',
]
BASE_URL = "https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/61/MCD06COSP_M3_MODIS"
dates = pd.date_range("2002-07-01", "2021-07-01", freq="MS") # "MS" for "month start"
concat_dim = ConcatDim("date", keys=dates, nitems_per_file=1)
def make_url(date):
    """Make a NetCDF4 download url for NASA MODIS-COSP data based on an input date.

    :param date: A member of the ``pandas.core.indexes.datetimes.DatetimeIndex``
        created with ``dates = pd.date_range("2002-07-01", "2021-07-01", freq="MS")``.
    """
    day_of_year = date.timetuple().tm_yday
    response = requests.get(f"{BASE_URL}/{date.year}/{day_of_year}.json")
    filename = [r["name"] for r in response.json()].pop(0)
    return f"{BASE_URL}/{date.year}/{day_of_year}/{filename}"

pattern = FilePattern(
    make_url,
    concat_dim,
    fsspec_open_kwargs={
        "client_kwargs": dict(headers=dict(Authorization=f"Bearer {os.environ['EARTHDATA_TOKEN']}"))
    },
)

def process_input(ds, filename):
    """Add missing "date" dimension to dataset to facilitate concatenation."""
    import xarray as xr

    return xr.concat([ds], dim="date")

# One recipe per netCDF group, all sharing the same file pattern
per_group_recipes = {
    group: XarrayZarrRecipe(
        pattern,
        xarray_open_kwargs=dict(group=group),
        process_input=process_input,
    )
    for group in GROUPS
}
We cannot execute these recipes on Pangeo Forge Cloud yet, because we don't yet have a mechanism to securely manage credentials (xref pangeo-forge/roadmap#36). However, I did execute a 2-month temporal subset of each of these recipes locally (and anyone else can too) with the following code:
from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.recipes import setup_logging
from pangeo_forge_recipes.storage import CacheFSSpecTarget, FSSpecTarget
fs_local = LocalFileSystem()
setup_logging("DEBUG")
for group_name, recipe in per_group_recipes.items():
    print(f"\n\n Building {group_name} onto local storage...")
    recipe.storage_config.cache = CacheFSSpecTarget(fs_local, "cache")
    recipe.storage_config.target = FSSpecTarget(fs_local, group_name + ".zarr")
    recipe_pruned = recipe.copy_pruned()
    recipe_pruned.to_function()()
and the resulting Zarr stores (one for each group) can be accessed with
import xarray as xr
ds = xr.open_zarr(f"{group_name}.zarr", consolidated=True)
By way of conclusion, for now, based on this test I'd estimate the full temporal scope of each of these recipes to build Zarr stores of between ~1.1 and 12.8 GB per group, with a total dataset size (a full temporal run for each of the 23 groups) of about 69 GB:
all_groups_full_size = 0
for group in GROUPS:
    ds = xr.open_zarr(f"{group}.zarr", consolidated=True)
    group_pruned_size = round(ds.nbytes/1e6)
    group_full_size = group_pruned_size * len(dates)
    print(f"{group} {group_pruned_size} MB -> {group_full_size/1e3} GB")
    all_groups_full_size += group_full_size
print(f"\n{all_groups_full_size/1e3} GB")
|
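A note on the arithmetic behind this kind of estimate, offered as a sketch and not as a correction of the figures above. If a pruned store holds more than one time step (the test was described as a 2-month subset), then scaling its size by the full number of dates counts each step more than once, and dividing by the number of pruned steps first gives a per-step figure. Also, `ds.nbytes` reports the uncompressed in-memory size, so the stored Zarr may be smaller. The helper name and the example numbers below are hypothetical.

```python
def estimate_full_size_gb(pruned_mb, n_pruned_steps, n_total_steps):
    """Scale the size of a pruned store up to the full time range, in GB."""
    # Size of a single time step, inferred from the pruned store.
    per_step_mb = pruned_mb / n_pruned_steps
    return per_step_mb * n_total_steps / 1e3


# Hypothetical example: a 20 MB pruned store holding 2 monthly steps,
# scaled to 229 months (2002-07 through 2021-07, inclusive).
print(estimate_full_size_gb(pruned_mb=20, n_pruned_steps=2, n_total_steps=229))
```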
@cisaacstern Thanks so much for continuing to work on this; it's spectacular. I'm not sure how y'all think of things at Pangeo-forge but, from a science user's perspective, there's a lot to be gained by more targeted processing. (By way of background, for some groups we want to extract only one field of four; for other groups we want to do some arithmetic on existing fields.) My understanding is that I should create a set of dictionary containing a set of Is there a way to handle appending new data as it is produced, month by month? |
Question: do these groups contain variables with the same dimensions / coordinates? If so, it would make sense logically to merge them into a single dataset. (That is not possible today but would become possible with the Opener refactor.) |
All variables share location and time coordinates. I would package all the scalar fields together in a single dataset. There are also some joint histograms with the same location and time coordinates but different histogram bins. Because they don't share bin definitions, and because they're large, I had thought to create separate datasets for each unique set of bin definitions. |
There is no inherent size limit to the Zarr group, because it is not a single file. It's all about doing whatever is most convenient for the person analyzing the data. In this case, it sounds like we want just one big dataset. As long as the dimensions use distinct names, we should be fine to merge into a single dataset. I.e.
Charles, I wonder if it is worthwhile to just special case earthdata login and inject some earthdata login credentials directly into our environments. This would allow us to move forward with some of these recipes before we solve the general secrets problem. |
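To illustrate the point about distinct dimension names, here is a small xarray sketch with made-up data. The variable names `hist_a` and `hist_b` and the dimension names `tau_bin` and `pressure_bin` are hypothetical stand-ins for two joint histograms with different bin definitions; only the shared latitude/longitude coordinates reflect what was described above.

```python
import numpy as np
import xarray as xr

# Shared coordinates on a tiny hypothetical grid.
coords = {
    "latitude": np.array([-1.0, 0.0, 1.0]),
    "longitude": np.array([0.0, 1.0, 2.0, 3.0]),
}

# A scalar field and two histograms whose bin dimensions have distinct names.
scalars = xr.Dataset(
    {"Cloud_Top_Pressure": (("latitude", "longitude"), np.zeros((3, 4)))},
    coords=coords,
)
hist_a = xr.Dataset(
    {"hist_a": (("latitude", "longitude", "tau_bin"), np.zeros((3, 4, 5)))},
    coords=coords,
)
hist_b = xr.Dataset(
    {"hist_b": (("latitude", "longitude", "pressure_bin"), np.zeros((3, 4, 7)))},
    coords=coords,
)

# Because the bin dimensions do not collide, the merge keeps all of them.
merged = xr.merge([scalars, hist_a, hist_b])
print(merged.sizes)
```

Had both histograms used the same dimension name with different lengths, the merge would have had to align or reject them, which is why distinct names matter here.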
Yes, merging is definitely the way to go. As Ryan said, we'll need pangeo-forge/pangeo-forge-recipes#245 to do this in a single recipe, but we can do it today in two steps, which I've done to complete the end-to-end demonstration.
I'll respond to the other questions/comments in another comment. |
Correct. As described in the API Reference, def process_input(ds: xr.Dataset, filename: str) -> xr.Dataset so to use the group name (for renaming variables, etc.) within I agree that a great next step would be for you to refine the per-group recipes I prototyped in my earlier comment so that the per-group Zarr stores they output look as you'd like them to. (Merging all these together will become a lot simpler once the above-referenced refactor is complete, so we don't need to worry about that for now.) As you go along, you can run local tests of your recipes as described in the Running a Recipe Locally docs. Once you hit a point where you have questions, rather than posting your code in comments as I've done here, I'd recommend Making a PR, which will make it easier for me to clone and work with your code.
Once we've put everything together into one recipe, yes this will be true. In the interim, while we still have a single recipe for each group, that won't happen automatically, because each recipe maintains its own cache. If you get to a point where this becomes a barrier to recipe development, just let me know and I can show you some advanced config to point all of the recipes to a single cache. I'd recommend trying to execute a few recipes first before we get into that, though.
This is on the roadmap (xref pangeo-forge/pangeo-forge-recipes#37) but for now the solution for this would be to just overwrite the original dataset with an updated date range once new data is released. For this particular dataset, that does not concern me too much, because the entire dataset is less than 100 GB, which is on the low end of what our infrastructure is designed to handle, so re-writing the whole thing should be relatively fast (a few hours, maybe).
Yes, that's a good idea. And we may end up wanting to do the same for other commonly used portals. |
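Since `process_input` only receives the dataset and the file name, one possible way to make the group name available inside it is to build one function per group with a small factory. This is a sketch of that idea, not necessarily the approach suggested above; `make_process_input` is a hypothetical name, and a plain dict stands in for an `xarray.Dataset` so the example runs without any dependencies.

```python
def make_process_input(group):
    """Return a process_input-style callable bound to one group name."""

    def process_input(ds, filename):
        # Rename the generic "Mean" entry after the group it came from.
        # With a real xarray.Dataset this would be ds.rename({"Mean": group}).
        renamed = dict(ds)
        renamed[group] = renamed.pop("Mean")
        return renamed

    return process_input


GROUPS = ["Solar_Zenith", "Cloud_Top_Pressure"]
per_group_process_input = {g: make_process_input(g) for g in GROUPS}

# A dict standing in for the dataset opened from one netCDF group.
fake_ds = {"Mean": [1.0, 2.0], "Pixel_Counts": [10, 12]}
out = per_group_process_input["Cloud_Top_Pressure"](fake_ds, "ignored.nc")
print(sorted(out))
```

Using a factory (instead of defining the inner function directly inside a loop) binds each group name at creation time, so every recipe gets its own name.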
Last comment for now but wanted to add this because I realized I did not answer the aesthetic dimension of this question. The aim of Pangeo Forge is to produce analysis-ready, cloud-optimized (ARCO) datasets. The |
@cisaacstern I've cloned this repo and started work on my recipe, building on your generous help. A couple questions arising:
|
In general I would not recommend calling
Yes. They just have to be enumerated in meta.yaml. Thanks for your patience with this Rob. It's very helpful for us to have willing guinea pigs. 🐹 |
Everything in that linked comment should work as-is with the current release of As I show there, generally I've found the most concise way to define a number of recipes with some overlapping kwargs and some unique kwargs is with a dictionary comprehension. But you can also just write them out, "long-hand", one at a time, which is more verbose but has the benefit of being more easily (human) readable. For test execution of a collection of recipes, the code in that same linked comment should also work as-is, but certainly let me know if you find otherwise. |
@cisaacstern I'm coming back to this project and now have a conda environment that includes the pangeo-forge package. I'm a little unclear how the pieces of code are supposed to fit together. Looking at the other examples in this repo, it seems that |
@RobertPincus, glad you're working on it! And thanks for the question. Your
should all work as-is in a new conda environment with Once you've replicated those three steps with copy-and-paste, you can then start editing the recipes defined in the |
Dear @cisaacstern What you describe is how I'm doing the development - I did copy/paste your code, and now I'm modifying it as needed (and making mistakes, which I might ask you for help with). But once I have things working in this test environment, how will I configure the repo to work as a feedstock, which I understand iterates over the |
A per-recipe issue: as you've seen, most data in the source files is organized within netCDF groups, i.e. group "Solar_Zenith" has variables "Mean", "Pixel_Counts" etc. Datasets produced from the recipe(s) would ideally contain data from within a group (the mean value, for example) and data that doesn't belong to any group (latitude and longitude, maybe also attributes). I was hoping to address this by opening the dataset referred to by the
But it seems like the Is there a way to point to the locally-cached copy of the data instead? How else might I accomplish what I'm after? |
We certainly need to improve our documentation of the As you may have noted, neither of those sources currently document the dictionary option, which I'll address below, but before doing so will first briefly suggest how you could use the conventional style for your case, in the event it's of interest. I've copied the relevant section from the linked template into this comment for ease of reference. As noted here, the recipes section of
recipes:
  # User chosen name for recipe. Likely similar to dataset name, ~25 characters in length
  - id: identifier-for-your-recipe
    # The `object` below tells Pangeo Cloud specifically where your recipe instance(s) are located and uses the format <filename>:<object_name>
    # <filename> is name of .py file where the Python recipe object is defined.
    # For example, if <filename> is given as "recipe", Pangeo Cloud will expect a file named `recipe.py` to exist in your PR.
    # <object_name> is the name of the recipe object (i.e. Python class instance) _within_ the specified file.
    # For example, if you have defined `recipe = XarrayZarrRecipe(...)` within a file named `recipe.py`, then your `object` below would be `"recipe:recipe"`
    object: "recipe:recipe"
So if you were to not use the dictionary approach I've previously encouraged, and instead just define a number of recipes in a single
# imports etc.
# define file pattern etc.
cloud_top_pressure_recipe = XarrayZarrRecipe(...)
cloud_mask_fraction_recipe = XarrayZarrRecipe(...)
cloud_optical_thickness_liquid_recipe = XarrayZarrRecipe(...)
# etc.
Then following the model described in the template above, the recipes section of your
recipes:
  - id: cloud-top-pressure
    object: "recipe:cloud_top_pressure_recipe"
  - id: cloud-mask-fraction
    object: "recipe:cloud_mask_fraction_recipe"
  - id: cloud-optical-thickness
    object: "recipe:cloud_optical_thickness_liquid_recipe"
  # etc. as many of these as you want
This then tells Pangeo Forge Cloud that there is a recipe with ID If you don't mind writing the recipes out "long hand" so to speak (i.e. assigning each one to a distinct variable name in The dictionary approach can be a bit more concise to write, which is why I previously demonstrated it. If you find you prefer that style, as mentioned above the
recipes:
  - dict_object: "recipe:per_group_recipes"
Thanks again for your patient questions as we bring the documentation up to speed. |
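For the "long hand" style, writing one `id`/`object` pair per group by hand is error-prone with 23 groups, so the entries could be generated. The sketch below is a hypothetical helper that prints text to paste into meta.yaml; the naming convention (lower-cased group name, hyphens in the id, a `_recipe` suffix on the object) is an assumption modeled on the examples above, not something Pangeo Forge requires.

```python
def recipe_entries(groups, module="recipe"):
    """Render a meta.yaml-style `recipes:` section for the given groups."""
    lines = ["recipes:"]
    for group in groups:
        # Assumed convention: ids use hyphens, object names use underscores.
        recipe_id = group.lower().replace("_", "-")
        lines.append(f"  - id: {recipe_id}")
        lines.append(f'    object: "{module}:{group.lower()}_recipe"')
    return "\n".join(lines)


print(recipe_entries(["Cloud_Top_Pressure", "Cloud_Mask_Fraction"]))
```

With the dictionary approach none of this is needed, since a single `dict_object` entry covers every recipe in the dictionary.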
Being able to process a dictionary of recipes is excellent. Thanks, I'll do that. |
Not sure if @rabernat will have another suggestion but to start, one option would be to adapt the local download code first posted in this comment to cache a single file at the top of your recipe (outside the
import os

import fsspec
import xarray as xr  # needed for the module-level open_dataset call below

# pangeo_forge_recipes imports here
base_url = (
    "https://ladsweb.modaps.eosdis.nasa.gov/"
    "archive/allData/61/MCD06COSP_M3_MODIS/2002/182"
)
# define FilePattern here
# you could also get this directly from the FilePattern, instead of hardcoding it
reference_filename = "MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc"
with fsspec.open(
    f"{base_url}/{reference_filename}",
    client_kwargs=dict(headers=dict(Authorization=f"Bearer {os.environ['EARTHDATA_TOKEN']}")),
) as src:
    with open(reference_filename, mode="wb") as dst:
        dst.write(src.read())
reference_ds = xr.open_dataset(reference_filename, engine="netcdf4")

def extract_mean(ds, filename):
    """Add missing "date" dimension to dataset to facilitate concatenation.

    Extract mean values and transpose latitude, longitude dimension
    """
    import xarray as xr

    # New dataset with group attributes
    newds = xr.Dataset(attrs=ds.attrs)
    # Add global attributes - needed?
    newds.attrs.update(reference_ds.attrs)
    # Discover the name of the netcdf group that ds contains.
    # There might be more robust ways to do this
    groupname = ds.Mean.attrs["title"].replace(": Mean", "")
    newds[groupname] = ds.Mean.T.rename(groupname)
    #
    # When accessing a group the lat and lon variables are indexes, not numerical values
    #
    newds["latitude"] = reference_ds.latitude
    newds["longitude"] = reference_ds.longitude
    # Return the newly built dataset (not the unmodified input) with a "date" dimension
    return xr.concat([newds], dim="date")
I think this is more reliable than trying to access the files cached by the executor dynamically, but assumes that a file representing just one time step will have the correct latitude, longitude, and attributes for all time steps. Is that a fair assumption? |
@cisaacstern It is a fair assumption that "one time step will have the correct latitude, longitude, and attributes for all time steps". Thanks for this idea - I've also been exploring the idea of adding latitude and longitude after the fact, extracting the fields outside of the groups and adding them later. As a technical matter, does code written in "recipe.py" get executed outside the executor? |
Gotcha.
No, all code in Just thinking aloud, the difficulty with accessing the recipe cache at runtime is that |
How can I extract an arbitrary file from the |
I thought I would try to extract the minimal amount of data from each group and add the coordinates later:
This fails for reasons I don't understand
Any ideas? |
With the caveat that this is definitely a bit hacky, using the
def get_url_for_nth_input(n):
    input_key = [key for key in pattern if tuple(key)[0].index == n].pop(0)
    url = pattern[input_key]
    return url
Where the argument
get_url_for_nth_input(n=2)
and
get_url_for_nth_input(n=50)
Perhaps more verbose than what you've tried, but rather than building up a new Edited: Had a typo in function (and example outputs) when I first posted this comment. Fixed now. |
@cisaacstern Thanks for the feedback. I fixed the problem of returning a Dataset rather than a DataArray by only renaming the field once :-). So how to extract the coordinates? In the comment above I asked about extracting a single file name from the pattern, but that's maybe too specific a question. Really, what I want to do is to identify any single instance of the files defined by my |
It's getting a bit hard for me to follow what particular code we are talking about at this point. Could you open a PR against this repo with your latest You can just omit |
@cisaacstern My code is very much a work in progress and indeed, the data I'm planning to apply this to won't be ready for several weeks. Does it help to point you to my fork, where the recipes live in At this stage it seems like my process is going to be
A few questions as I iterate on the recipes:
|
Yes, FilePattern can point at local files. I only moved everything to OSN so that you (or others) could interact with the output of my prototyped recipe, not as a requirement to make it run. Can you provide a link to the code which threw the error you reference, along with the full error traceback?
A minimal reproducible example would be helpful for debugging this. Could you provide a self-contained code snippet (presumably using only |
@cisaacstern Sorry for the radio silence. In the interests of having file-organizing code I can publish with a paper describing the data I've written a stand-alone Python script that manages the opening of different netCDF groups as needed. I'd still like to make the data available through Pangeo Forge, though. I was speaking to @TomNicholas earlier today; it sounds like I could do this easily if the recipes allowed me to use datatree instead of xarray datasets. Could I do this? |
@RobertPincus, thanks for the ongoing attention to this. As Tom has just posted, I believe the linked Opener refactor will be necessary to make this a reality: pangeo-forge/pangeo-forge-recipes#245 (comment) |
@cisaacstern Thanks, I had seen the note from @TomNicholas. I will be delighted to use the revised Opener when it becomes available. |
Source Dataset
This data provides satellite observations of clouds targeted at the evaluation of global models, facilitated with the use of synthetic observations ("satellite simulators"). The data are a re-packaging of standard "Level-3" (gridded, aggregated) cloud products produced by NASA's MODIS satellites; data from both instruments (morning/Terra and afternoon/Aqua) are combined. The fields conform to output from the "MODIS simulator," one of several used in the CFMIP Observation Simulator Package (COSP, paper1, paper2). Output from COSP and the MODIS simulator is requested as part of the Cloud Feedbacks Model Intercomparison Project (CFMIP), part of CMIP.
Transformation / Alignment / Merging
Ideally we would provide several related datasets. One would contain the mean values for many or even all scalar fields. This means extracting the mean value from each group from each file and concatenating the fields in time. A second would be the joint histograms, which need to be extracted and normalized, metadata refined, and also concatenated in time. Since the joint histograms are roughly 50 times as large as each scalar field it may be best to create one dataset per joint histogram (there are about a dozen).
Output Dataset
Given the user community it would be useful to produce netCDF output, perhaps with a kerchunk index. It would be fine to also produce a Zarr or related dataset, perhaps in addition. Whatever the format, the data should be structured so that it's easy to append new data as it is produced.