⭐️ IBTRACS dataset adapter #493

Open · wants to merge 17 commits into main
Conversation

@kevinsantana11 (Contributor)

No description provided.

@kevinsantana11 kevinsantana11 changed the title ⭐️ IBTRACS Dataset ⭐️ IBTRACS dataset adapter Jul 19, 2024
kevinsantana11 pushed a commit that referenced this pull request Aug 1, 2024
Associated with #493; fixes the deployed dataset in the S3 repo. The change updates the dataset URL in the `gdp1h` function to version 2.01.1, ensuring the latest version of the dataset is used. The previous URL was "https://noaa-oar-hourly-gdp-pds.s3.amazonaws.com/latest/gdp-v2.01.zarr"; it has been updated to "https://noaa-oar-hourly-gdp-pds.s3.amazonaws.com/latest/gdp-v2.01.1.zarr".
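For reference, a minimal sketch of what opening the corrected store looks like (assuming xarray with the zarr and fsspec backends installed; `gdp1h` itself may add decoding options and caching on top of this):

import xarray as xr

# Open the corrected GDP 1-hourly dataset directly from the public S3 bucket.
url = "https://noaa-oar-hourly-gdp-pds.s3.amazonaws.com/latest/gdp-v2.01.1.zarr"
ds = xr.open_dataset(url, engine="zarr")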
@philippemiron (Contributor)

Is it normal that the tests take 40 minutes?

@kevinsantana11 (Contributor, Author)

Is it normal that the tests take 40 minutes?

Not really, but the AOML servers can occasionally get overloaded. When that happens the exponential backoff kicks in, and if the server doesn't recover in time the tests can take a very long time.
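For context, a minimal sketch of the kind of retry loop described here (the function name and parameters are illustrative, not clouddrift's actual implementation):

import random
import time

import requests


def download_with_backoff(url: str, max_retries: int = 8) -> bytes:
    # Hypothetical retry helper: back off exponentially on failed requests.
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.content
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            # Sleep 2^attempt seconds plus jitter; against an overloaded
            # server these waits alone can stretch a test run by many minutes.
            time.sleep(2**attempt + random.random())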

@kevinsantana11 kevinsantana11 marked this pull request as ready for review October 21, 2024 08:48
@kevinsantana11 (Contributor, Author)

@selipot this PR is ready for review. I've also gotten started on the example notebook repo for the dataset:

https://github.com/Cloud-Drift/ibtracs-get-started/pull/1

xarray.Dataset
IBTRACS dataset as a ragged array.

Standard usage of the dataset.
Review comment (Member):

Why this line? Should we add a simple example ds = ibtracs()?
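Such an example might look like the following (signature taken from the traceback later in this thread, where the defaults are "v04r01" and "LAST_3_YEARS"):

from clouddrift.datasets import ibtracs

# Load IBTrACS as a ragged-array xarray.Dataset with the default
# version and kind.
ds = ibtracs()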

@selipot (Member) commented Oct 23, 2024

I am not able to generate version v03r09. I get the following error:

from clouddrift.datasets import ibtracs
ds03 = ibtracs(version='v03r09')

https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v03r09/access
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/file_manager.py:211, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
    210 try:
--> 211     file = self._cache[self._key]
    212 except KeyError:

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/lru_cache.py:56, in LRUCache.__getitem__(self, key)
     55 with self._lock:
---> 56     value = self._cache[key]
     57     self._cache.move_to_end(key)

KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('/var/folders/fx/qsnv05_94vs9qzp4p0qww8c00000gn/T/clouddrift/ibtracs/IBTrACS.last3years.v03r09.nc',), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False)), 'a22fcbb6-1058-4905-bd5c-fdc4c1e2e8ef']

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
Cell In[10], line 1
----> 1 ds03 = ibtracs(version='v03r09')

File ~/projects.git/clouddrift/clouddrift/datasets.py:404, in ibtracs(version, kind, tmp_path, decode_times)
    370 def ibtracs(
    371     version: _Version = "v04r01",
    372     kind: _Kind = "LAST_3_YEARS",
    373     tmp_path: str = adapters.ibtracs._DEFAULT_FILE_PATH,
    374     decode_times: bool = True,
    375 ) -> xr.Dataset:
    376     """Returns International Best Track Archive for Climate Stewardship (IBTrACS) as a ragged array xarray dataset.
    377 
    378     The function will first look for the ragged array dataset on the local
   (...)
    402     Standard usage of the dataset.
    403     """
--> 404     return _dataset_filecache(
    405         f"ibtracs_ra_{version}_{kind}.nc",
    406         decode_times,
    407         lambda: adapters.ibtracs.to_raggedarray(version, kind, tmp_path),
    408     )

File ~/projects.git/clouddrift/clouddrift/datasets.py:752, in _dataset_filecache(filename, decode_times, get_ds)
    749 os.makedirs(os.path.dirname(fp), exist_ok=True)
    751 if not os.path.exists(fp):
--> 752     ds = get_ds()
    753     if ext == ".nc":
    754         ds.to_netcdf(fp)

File ~/projects.git/clouddrift/clouddrift/datasets.py:407, in ibtracs.<locals>.<lambda>()
    370 def ibtracs(
    371     version: _Version = "v04r01",
    372     kind: _Kind = "LAST_3_YEARS",
    373     tmp_path: str = adapters.ibtracs._DEFAULT_FILE_PATH,
    374     decode_times: bool = True,
    375 ) -> xr.Dataset:
    376     """Returns International Best Track Archive for Climate Stewardship (IBTrACS) as a ragged array xarray dataset.
    377 
    378     The function will first look for the ragged array dataset on the local
   (...)
    402     Standard usage of the dataset.
    403     """
    404     return _dataset_filecache(
    405         f"ibtracs_ra_{version}_{kind}.nc",
    406         decode_times,
--> 407         lambda: adapters.ibtracs.to_raggedarray(version, kind, tmp_path),
    408     )

File ~/projects.git/clouddrift/clouddrift/adapters/ibtracs.py:90, in to_raggedarray(version, kind, tmp_path)
     87 dst_path = os.path.join(tmp_path, filename)
     88 download_with_progress([(src_url, dst_path)])
---> 90 ds = xr.open_dataset(dst_path, engine="netcdf4")
     91 ds = ds.rename_dims({"date_time": "obs"})
     93 vars = list[Hashable]()

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/api.py:611, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
    599 decoders = _resolve_decoders_kwargs(
    600     decode_cf,
    601     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    607     decode_coords=decode_coords,
    608 )
    610 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 611 backend_ds = backend.open_dataset(
    612     filename_or_obj,
    613     drop_variables=drop_variables,
    614     **decoders,
    615     **kwargs,
    616 )
    617 ds = _dataset_from_backend_dataset(
    618     backend_ds,
    619     filename_or_obj,
   (...)
    629     **kwargs,
    630 )
    631 return ds

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/netCDF4_.py:649, in NetCDF4BackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, format, clobber, diskless, persist, lock, autoclose)
    628 def open_dataset(  # type: ignore[override]  # allow LSP violation, not supporting **kwargs
    629     self,
    630     filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
   (...)
    646     autoclose=False,
    647 ) -> Dataset:
    648     filename_or_obj = _normalize_path(filename_or_obj)
--> 649     store = NetCDF4DataStore.open(
    650         filename_or_obj,
    651         mode=mode,
    652         format=format,
    653         group=group,
    654         clobber=clobber,
    655         diskless=diskless,
    656         persist=persist,
    657         lock=lock,
    658         autoclose=autoclose,
    659     )
    661     store_entrypoint = StoreBackendEntrypoint()
    662     with close_on_error(store):

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/netCDF4_.py:410, in NetCDF4DataStore.open(cls, filename, mode, format, group, clobber, diskless, persist, lock, lock_maker, autoclose)
    404 kwargs = dict(
    405     clobber=clobber, diskless=diskless, persist=persist, format=format
    406 )
    407 manager = CachingFileManager(
    408     netCDF4.Dataset, filename, mode=mode, kwargs=kwargs
    409 )
--> 410 return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/netCDF4_.py:357, in NetCDF4DataStore.__init__(self, manager, group, mode, lock, autoclose)
    355 self._group = group
    356 self._mode = mode
--> 357 self.format = self.ds.data_model
    358 self._filename = self.ds.filepath()
    359 self.is_remote = is_remote_uri(self._filename)

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/netCDF4_.py:419, in NetCDF4DataStore.ds(self)
    417 @property
    418 def ds(self):
--> 419     return self._acquire()

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/netCDF4_.py:413, in NetCDF4DataStore._acquire(self, needs_lock)
    412 def _acquire(self, needs_lock=True):
--> 413     with self._manager.acquire_context(needs_lock) as root:
    414         ds = _nc4_require_group(root, self._group, self._mode)
    415     return ds

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/contextlib.py:137, in _GeneratorContextManager.__enter__(self)
    135 del self.args, self.kwds, self.func
    136 try:
--> 137     return next(self.gen)
    138 except StopIteration:
    139     raise RuntimeError("generator didn't yield") from None

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/file_manager.py:199, in CachingFileManager.acquire_context(self, needs_lock)
    196 @contextlib.contextmanager
    197 def acquire_context(self, needs_lock=True):
    198     """Context manager for acquiring a file."""
--> 199     file, cached = self._acquire_with_cache_info(needs_lock)
    200     try:
    201         yield file

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/backends/file_manager.py:217, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
    215     kwargs = kwargs.copy()
    216     kwargs["mode"] = self._mode
--> 217 file = self._opener(*self._args, **kwargs)
    218 if self._mode == "w":
    219     # ensure file doesn't get overridden when opened again
    220     self._mode = "a"

File src/netCDF4/_netCDF4.pyx:2470, in netCDF4._netCDF4.Dataset.__init__()

File src/netCDF4/_netCDF4.pyx:2107, in netCDF4._netCDF4._ensure_nc_success()

OSError: [Errno -51] NetCDF: Unknown file format: '/var/folders/fx/qsnv05_94vs9qzp4p0qww8c00000gn/T/clouddrift/ibtracs/IBTrACS.last3years.v03r09.nc'
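The "Unknown file format" error suggests the file saved for v03r09 is not actually NetCDF; a server error page saved under a .nc name would fail in exactly this way. A quick check (not part of the PR) is to inspect the file's magic bytes:

# Valid NetCDF files begin with b"CDF" (classic) or b"\x89HDF" (NetCDF-4/HDF5);
# anything else, e.g. HTML, would explain "Unknown file format".
path = (
    "/var/folders/fx/qsnv05_94vs9qzp4p0qww8c00000gn/T/clouddrift/ibtracs/"
    "IBTrACS.last3years.v03r09.nc"
)
with open(path, "rb") as f:
    print(f.read(8))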


Parameters
----------
version : "v03r09", "v04r00", "v04r01" (default)
Review comment (Member):

drop support of version 3

----------
version : "v03r09", "v04r00", "v04r01" (default)
Specify the dataset version to retrieve. Defaults to the latest version.
kind: "ACTIVE", "ALL", "EP", "NA", "NI", "SA", "SI", "SP", "WP", "SINCE_1980", "LAST_3_YEARS" (default)
Review comment (Member):

Does using "ACTIVE" or "LAST_3_YEARS" re-generate the ragged array or not? I think it should. So maybe disable caching for this dataset?
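One possible way to do that, simplified from the _dataset_filecache helper visible in the traceback above (the force_refresh parameter and DYNAMIC_KINDS set are hypothetical additions, not code from the PR):

import os
import tempfile

import xarray as xr

# Kinds whose upstream files change over time and so should not be cached.
DYNAMIC_KINDS = {"ACTIVE", "LAST_3_YEARS"}
_CACHE_DIR = os.path.join(tempfile.gettempdir(), "clouddrift")


def dataset_filecache(filename, get_ds, force_refresh=False):
    fp = os.path.join(_CACHE_DIR, filename)
    os.makedirs(os.path.dirname(fp), exist_ok=True)
    if force_refresh and os.path.exists(fp):
        os.remove(fp)  # drop the stale file so the ragged array is rebuilt
    if not os.path.exists(fp):
        ds = get_ds()
        ds.to_netcdf(fp)
    return xr.open_dataset(fp)

A caller such as ibtracs() could then pass force_refresh=kind in DYNAMIC_KINDS.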

@selipot (Member) commented Nov 15, 2024

@kevinsantana11 what else do we need to do to merge?
