⭐️ IBTRACS dataset adapter #493

Open · wants to merge 17 commits into main
Conversation

@kevinsantana11 (Contributor)

No description provided.

@kevinsantana11 kevinsantana11 changed the title ⭐️ IBTRACS Dataset ⭐️ IBTRACS dataset adapter Jul 19, 2024
kevinsantana11 pushed a commit that referenced this pull request Aug 1, 2024
Associated with #493; fixes the deployed dataset in the S3 repo. The change updates the dataset URL in the `gdp1h` function to version 2.01.1, ensuring the latest version of the dataset is used. The previous URL was "https://noaa-oar-hourly-gdp-pds.s3.amazonaws.com/latest/gdp-v2.01.zarr"; it has been updated to "https://noaa-oar-hourly-gdp-pds.s3.amazonaws.com/latest/gdp-v2.01.1.zarr".
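For reference, a minimal sketch of what opening the corrected store looks like (assuming xarray with the zarr and fsspec backends installed; `gdp1h` itself may add decoding options and caching on top of this):

import xarray as xr

# Open the corrected GDP 1-hourly dataset directly from the public S3 bucket.
url = "https://noaa-oar-hourly-gdp-pds.s3.amazonaws.com/latest/gdp-v2.01.1.zarr"
ds = xr.open_dataset(url, engine="zarr")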
@philippemiron (Contributor)

Is it normal that the tests take 40 minutes?

@kevinsantana11 (Contributor, Author)

Is it normal that the tests take 40 minutes?

Not really, but the AOML servers can occasionally get overloaded. When that happens the exponential backoff kicks in, and if the server doesn't recover in time the tests can take a very long time.
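For context, a minimal sketch of the kind of retry loop described here (the function name and parameters are illustrative, not clouddrift's actual implementation):

import random
import time

import requests


def download_with_backoff(url: str, max_retries: int = 8) -> bytes:
    # Hypothetical retry helper: back off exponentially on failed requests.
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.content
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            # Sleep 2^attempt seconds plus jitter; against an overloaded
            # server these waits alone can stretch a test run by many minutes.
            time.sleep(2**attempt + random.random())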

@kevinsantana11 kevinsantana11 marked this pull request as ready for review October 21, 2024 08:48
@kevinsantana11 (Contributor, Author)

@selipot this PR is ready for review. I've also gotten started on the example notebook repo for the dataset:

https://github.com/Cloud-Drift/ibtracs-get-started/pull/1

xarray.Dataset
IBTRACS dataset as a ragged array.

Standard usage of the dataset.
Review comment (Member):

Why this line? Should we add a simple example ds = ibtracs()?
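Such an example might look like the following (signature taken from the traceback later in this thread, where the defaults are "v04r01" and "LAST_3_YEARS"):

from clouddrift.datasets import ibtracs

# Load IBTrACS as a ragged-array xarray.Dataset with the default
# version and kind.
ds = ibtracs()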

@selipot (Member) commented Oct 23, 2024

I am not able to generate version v03r09. I get the following error:

from clouddrift.datasets import ibtracs
ds03 = ibtracs(version='v03r09')

https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v03r09/access
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/file_manager.py:211, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
    210 try:
--> 211     file = self._cache[self._key]
    212 except KeyError:

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/lru_cache.py:56, in LRUCache.__getitem__(self, key)
     55 with self._lock:
---> 56     value = self._cache[key]
     57     self._cache.move_to_end(key)

KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('/var/folders/fx/qsnv05_94vs9qzp4p0qww8c00000gn/T/clouddrift/ibtracs/IBTrACS.last3years.v03r09.nc',), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False)), 'a22fcbb6-1058-4905-bd5c-fdc4c1e2e8ef']

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
Cell In[10], line 1
----> 1 ds03 = ibtracs(version='v03r09')

File ~/projects.git/clouddrift/clouddrift/datasets.py:404, in ibtracs(version, kind, tmp_path, decode_times)
    370 def ibtracs(
    371     version: _Version = "v04r01",
    372     kind: _Kind = "LAST_3_YEARS",
    373     tmp_path: str = adapters.ibtracs._DEFAULT_FILE_PATH,
    374     decode_times: bool = True,
    375 ) -> xr.Dataset:
    376     """Returns International Best Track Archive for Climate Stewardship (IBTrACS) as a ragged array xarray dataset.
    377 
    378     The function will first look for the ragged array dataset on the local
   (...)
    402     Standard usage of the dataset.
    403     """
--> 404     return _dataset_filecache(
    405         f"ibtracs_ra_{version}_{kind}.nc",
    406         decode_times,
    407         lambda: adapters.ibtracs.to_raggedarray(version, kind, tmp_path),
    408     )

File ~/projects.git/clouddrift/clouddrift/datasets.py:752, in _dataset_filecache(filename, decode_times, get_ds)
    749 os.makedirs(os.path.dirname(fp), exist_ok=True)
    751 if not os.path.exists(fp):
--> 752     ds = get_ds()
    753     if ext == ".nc":
    754         ds.to_netcdf(fp)

File ~/projects.git/clouddrift/clouddrift/datasets.py:407, in ibtracs.<locals>.<lambda>()
    370 def ibtracs(
    371     version: _Version = "v04r01",
    372     kind: _Kind = "LAST_3_YEARS",
    373     tmp_path: str = adapters.ibtracs._DEFAULT_FILE_PATH,
    374     decode_times: bool = True,
    375 ) -> xr.Dataset:
    376     """Returns International Best Track Archive for Climate Stewardship (IBTrACS) as a ragged array xarray dataset.
    377 
    378     The function will first look for the ragged array dataset on the local
   (...)
    402     Standard usage of the dataset.
    403     """
    404     return _dataset_filecache(
    405         f"ibtracs_ra_{version}_{kind}.nc",
    406         decode_times,
--> 407         lambda: adapters.ibtracs.to_raggedarray(version, kind, tmp_path),
    408     )

File ~/projects.git/clouddrift/clouddrift/adapters/ibtracs.py:90, in to_raggedarray(version, kind, tmp_path)
     87 dst_path = os.path.join(tmp_path, filename)
     88 download_with_progress([(src_url, dst_path)])
---> 90 ds = xr.open_dataset(dst_path, engine="netcdf4")
     91 ds = ds.rename_dims({"date_time": "obs"})
     93 vars = list[Hashable]()

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/api.py:611, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
    599 decoders = _resolve_decoders_kwargs(
    600     decode_cf,
    601     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    607     decode_coords=decode_coords,
    608 )
    610 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 611 backend_ds = backend.open_dataset(
    612     filename_or_obj,
    613     drop_variables=drop_variables,
    614     **decoders,
    615     **kwargs,
    616 )
    617 ds = _dataset_from_backend_dataset(
    618     backend_ds,
    619     filename_or_obj,
   (...)
    629     **kwargs,
    630 )
    631 return ds

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/netCDF4_.py:649, in NetCDF4BackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, format, clobber, diskless, persist, lock, autoclose)
    628 def open_dataset(  # type: ignore[override]  # allow LSP violation, not supporting **kwargs
    629     self,
    630     filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
   (...)
    646     autoclose=False,
    647 ) -> Dataset:
    648     filename_or_obj = _normalize_path(filename_or_obj)
--> 649     store = NetCDF4DataStore.open(
    650         filename_or_obj,
    651         mode=mode,
    652         format=format,
    653         group=group,
    654         clobber=clobber,
    655         diskless=diskless,
    656         persist=persist,
    657         lock=lock,
    658         autoclose=autoclose,
    659     )
    661     store_entrypoint = StoreBackendEntrypoint()
    662     with close_on_error(store):

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/netCDF4_.py:410, in NetCDF4DataStore.open(cls, filename, mode, format, group, clobber, diskless, persist, lock, lock_maker, autoclose)
    404 kwargs = dict(
    405     clobber=clobber, diskless=diskless, persist=persist, format=format
    406 )
    407 manager = CachingFileManager(
    408     netCDF4.Dataset, filename, mode=mode, kwargs=kwargs
    409 )
--> 410 return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/netCDF4_.py:357, in NetCDF4DataStore.__init__(self, manager, group, mode, lock, autoclose)
    355 self._group = group
    356 self._mode = mode
--> 357 self.format = self.ds.data_model
    358 self._filename = self.ds.filepath()
    359 self.is_remote = is_remote_uri(self._filename)

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/netCDF4_.py:419, in NetCDF4DataStore.ds(self)
    417 @property
    418 def ds(self):
--> 419     return self._acquire()

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/netCDF4_.py:413, in NetCDF4DataStore._acquire(self, needs_lock)
    412 def _acquire(self, needs_lock=True):
--> 413     with self._manager.acquire_context(needs_lock) as root:
    414         ds = _nc4_require_group(root, self._group, self._mode)
    415     return ds

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/contextlib.py:137, in _GeneratorContextManager.__enter__(self)
    135 del self.args, self.kwds, self.func
    136 try:
--> 137     return next(self.gen)
    138 except StopIteration:
    139     raise RuntimeError("generator didn't yield") from None

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/file_manager.py:199, in CachingFileManager.acquire_context(self, needs_lock)
    196 @contextlib.contextmanager
    197 def acquire_context(self, needs_lock=True):
    198     """Context manager for acquiring a file."""
--> 199     file, cached = self._acquire_with_cache_info(needs_lock)
    200     try:
    201         yield file

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/file_manager.py:217, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
    215     kwargs = kwargs.copy()
    216     kwargs["mode"] = self._mode
--> 217 file = self._opener(*self._args, **kwargs)
    218 if self._mode == "w":
    219     # ensure file doesn't get overridden when opened again
    220     self._mode = "a"

File src/netCDF4/_netCDF4.pyx:2470, in netCDF4._netCDF4.Dataset.__init__()

File src/netCDF4/_netCDF4.pyx:2107, in netCDF4._netCDF4._ensure_nc_success()

OSError: [Errno -51] NetCDF: Unknown file format: '/var/folders/fx/qsnv05_94vs9qzp4p0qww8c00000gn/T/clouddrift/ibtracs/IBTrACS.last3years.v03r09.nc'
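The "Unknown file format" error suggests the file saved for v03r09 is not actually NetCDF; a server error page saved under a .nc name would fail in exactly this way. A quick check (not part of the PR) is to inspect the file's magic bytes:

# Valid NetCDF files begin with b"CDF" (classic) or b"\x89HDF" (NetCDF-4/HDF5);
# anything else, e.g. HTML, would explain "Unknown file format".
path = (
    "/var/folders/fx/qsnv05_94vs9qzp4p0qww8c00000gn/T/clouddrift/ibtracs/"
    "IBTrACS.last3years.v03r09.nc"
)
with open(path, "rb") as f:
    print(f.read(8))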


Parameters
----------
version : "v03r09", "v04r00", "v04r01" (default)
Review comment (Member):

drop support of version 3

----------
version : "v03r09", "v04r00", "v04r01" (default)
Specify the dataset version to retrieve. Defaults to the latest version.
kind: "ACTIVE", "ALL", "EP", "NA", "NI", "SA", "SI", "SP", "WP", "SINCE_1980", "LAST_3_YEARS" (default)
Review comment (Member):

Does using "ACTIVE" or "LAST_3_YEARS" re-generate the ragged array or not? I think it should. So maybe disable caching for this dataset?
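One possible way to do that, simplified from the _dataset_filecache helper visible in the traceback above (the force_refresh parameter and DYNAMIC_KINDS set are hypothetical additions, not code from the PR):

import os
import tempfile

import xarray as xr

# Kinds whose upstream files change over time and so should not be cached.
DYNAMIC_KINDS = {"ACTIVE", "LAST_3_YEARS"}
_CACHE_DIR = os.path.join(tempfile.gettempdir(), "clouddrift")


def dataset_filecache(filename, get_ds, force_refresh=False):
    fp = os.path.join(_CACHE_DIR, filename)
    os.makedirs(os.path.dirname(fp), exist_ok=True)
    if force_refresh and os.path.exists(fp):
        os.remove(fp)  # drop the stale file so the ragged array is rebuilt
    if not os.path.exists(fp):
        ds = get_ds()
        ds.to_netcdf(fp)
    return xr.open_dataset(fp)

A caller such as ibtracs() could then pass force_refresh=kind in DYNAMIC_KINDS.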

@selipot (Member) commented Nov 15, 2024

@kevinsantana11 what else do we need to do to merge?
