Doc edits (#483)
* reorder datasets.rst

* added ragged array example in usage.rst

* break example

* break example in two
selipot authored Jun 26, 2024
1 parent 3c57be0 commit ad08d4e
Showing 2 changed files with 120 additions and 37 deletions.
67 changes: 33 additions & 34 deletions docs/datasets.rst
@@ -4,7 +4,39 @@ Datasets
========

CloudDrift provides convenience functions to access real-world ragged-array
datasets. Currently available datasets are:

- :func:`clouddrift.datasets.andro`: The ANDRO dataset as a ragged array
processed from the upstream dataset hosted at the `SEANOE repository
<https://www.seanoe.org/data/00360/47077/>`_.
- :func:`clouddrift.datasets.gdp1h`: 1-hourly Global Drifter Program (GDP) data
from a `cloud-optimized Zarr dataset on AWS <https://registry.opendata.aws/noaa-oar-hourly-gdp/>`_.
- :func:`clouddrift.datasets.gdp6h`: 6-hourly GDP data from a ragged-array
NetCDF file hosted by the public HTTPS server at
`NOAA's Atlantic Oceanographic and Meteorological Laboratory (AOML) <https://www.aoml.noaa.gov/phod/gdp/index.php>`_.
- :func:`clouddrift.datasets.glad`: 15-minute Grand LAgrangian Deployment (GLAD)
data produced by the Consortium for Advanced Research on Transport of
Hydrocarbon in the Environment (CARTHE) and hosted upstream at the `Gulf of
Mexico Research Initiative Information and Data Cooperative (GRIIDC)
<https://doi.org/10.7266/N7VD6WC8>`_.
- :func:`clouddrift.datasets.mosaic`: MOSAiC sea-ice drift dataset as a ragged
array processed from the upstream dataset hosted at the
`NSF's Arctic Data Center <https://doi.org/10.18739/A2KP7TS83>`_.
- :func:`clouddrift.datasets.subsurface_floats`: The subsurface float trajectories dataset
hosted at `NOAA's Atlantic Oceanographic and Meteorological Laboratory (AOML) <https://www.aoml.noaa.gov/phod/float_traj/index.php>`_
and maintained by Andree Ramsey and Heather Furey from the Woods Hole Oceanographic Institution.
- :func:`clouddrift.datasets.spotters`: The Sofar Ocean Spotters archive dataset hosted at the public `AWS S3 bucket <https://sofar-spotter-archive.s3.amazonaws.com/spotter_data_bulk_zarr>`_.
- :func:`clouddrift.datasets.yomaha`: The YoMaHa'07 dataset as a ragged array
processed from the upstream dataset hosted at the `Asia-Pacific Data-Research
Center (APDRC) <http://apdrc.soest.hawaii.edu/projects/yomaha/>`_.
- :func:`clouddrift.datasets.hurdat2`: The HURricane DATa 2nd generation (HURDAT2) dataset,
processed from the upstream dataset hosted at the `NOAA AOML Hurricane Research Division <https://www.aoml.noaa.gov/hrd/hurdat/Data_Storm.html>`_.

The GDP and the Spotters datasets are accessed lazily, so the data is only downloaded when
specific array values are referenced. The ANDRO, GLAD, MOSAiC, Subsurface Floats, and YoMaHa'07
datasets are downloaded in their entirety when the function is called for the first
time and stored locally for later use.

>>> from clouddrift.datasets import gdp1h
>>> ds = gdp1h()
@@ -44,36 +44,3 @@ Attributes: (12/16)
summary: Global Drifter Program hourly data
title: Global Drifter Program hourly drifting buoy collection
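
Because the hourly GDP dataset is accessed lazily, the call above only opens the
dataset without transferring its contents. As a minimal sketch (assuming the
``lon`` variable of the hourly GDP collection), values are downloaded only once
they are referenced:

>>> lon_head = ds.lon[:5].values  # requesting values triggers the download of this slice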

90 changes: 87 additions & 3 deletions docs/usage.rst
@@ -81,7 +81,7 @@ Attributes:
long_name: Longitude
units: degrees_east

You see that this array is very long--it has 197214787 elements.
This is because in a ragged array, many varying-length arrays are laid out as a
contiguous 1-dimensional array in memory.
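
As a minimal, standalone illustration of this layout (synthetic values, not taken
from the GDP dataset), a ``rowsize`` variable records how many observations belong
to each trajectory, so the flat array can be split back into per-trajectory arrays:

.. code-block:: python

   import numpy as np

   # three trajectories of lengths 3, 2, and 4 stored back to back
   rowsize = np.array([3, 2, 4])
   lon = np.array([10.1, 10.2, 10.3,               # trajectory 0
                   45.0, 45.1,                     # trajectory 1
                   -60.0, -59.9, -59.8, -59.7])    # trajectory 2

   # cumulative row sizes give the boundaries between consecutive trajectories
   lon_per_traj = np.split(lon, np.cumsum(rowsize)[:-1])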

@@ -252,9 +252,93 @@ CloudDrift provides an easy way to convert custom Lagrangian datasets into
# Alternatively, convert to Awkward Array for analysis
ds = ra.to_awkward()
The snippet above is specific to the hourly GDP dataset; however, you can use the
``RaggedArray`` class directly to convert other custom datasets into a ragged
array structure that is analysis ready via Xarray or Awkward Array packages.
The functions to do that are defined in the ``clouddrift.adapters`` submodule.
You can use these examples as a reference to ingest your own or other custom
Lagrangian datasets into ``RaggedArray``.
Lagrangian datasets into ``RaggedArray``. We provide below a simple example of
how to build a ragged-array dataset from simulated data:

.. code-block:: python

   # create synthetic Lagrangian data: 8 trajectories of random walks
   import numpy as np

   rowsize = [100, 65, 7, 22, 56, 78, 99, 70]
   x = []
   y = []
   for i in range(len(rowsize)):
       x.append(np.cumsum(np.random.normal(0, 1, rowsize[i]) + np.random.uniform(0, 1, 1)))
       y.append(np.cumsum(np.random.normal(0, 1, rowsize[i]) + np.random.uniform(0, 1, 1)))
   x = np.concatenate(x)
   y = np.concatenate(y)

   # create an instance of the RaggedArray class
   from clouddrift.raggedarray import RaggedArray

   # define the coordinates
   coords = {"time": np.arange(len(x)), "id": np.arange(len(rowsize))}

   # define the data
   data = {"x": x, "y": y}

   # define the metadata, which here includes the ``rowsize`` parameter
   metadata = {"rowsize": rowsize}

   # map the names of the dimensions to what the class expects,
   # that is, which names correspond to "rows" and "obs"
   name_dims = {"traj": "rows", "obs": "obs"}

   # map the dimensions of the coordinates defined above
   coord_dims = {"time": "obs", "id": "traj"}

   # define some attributes for the dataset and its variables
   attrs_global = {"title": "An example of synthetic data"}
   attrs_variables = {
       "id": {"long_name": "trajectory id"},
       "time": {"long_name": "time"},
       "x": {"long_name": "x coordinate"},
       "y": {"long_name": "y coordinate"},
       "rowsize": {"long_name": "number of observations in each trajectory"},
   }

   # instantiate the RaggedArray class
   ra = RaggedArray(
       coords, metadata, data, attrs_global, attrs_variables, name_dims, coord_dims
   )

Next, the ragged-array object ``ra`` can be used to generate Xarray and Awkward
Array datasets for further analysis and processing:

.. code-block:: python

   # convert to xarray dataset
   ds = ra.to_xarray()
   ds

   <xarray.Dataset> Size: 12kB
   Dimensions:  (traj: 8, obs: 497)
   Coordinates:
       time     (obs) int64 4kB 0 1 2 3 4 5 6 7 ... 489 490 491 492 493 494 495 496
       id       (traj) int64 64B 0 1 2 3 4 5 6 7
   Dimensions without coordinates: traj, obs
   Data variables:
       rowsize  (traj) int64 64B 100 65 7 22 56 78 99 70
       x        (obs) float64 4kB -0.3243 -0.2817 0.1442 1.31 ... 13.18 13.07 14.02
       y        (obs) float64 4kB 1.25 2.073 3.493 3.44 ... 11.56 9.913 10.11 11.03
   Attributes:
       title:   An example of synthetic data

   # convert to awkward array
   ds_ak = ra.to_awkward()
   ds_ak

   [{rowsize: 100, obs: {time: [...], ...}},
    {rowsize: 65, obs: {time: [...], id: 1, ...}},
    {rowsize: 7, obs: {time: [...], id: 2, ...}},
    {rowsize: 22, obs: {time: [...], id: 3, ...}},
    {rowsize: 56, obs: {time: [...], id: 4, ...}},
    {rowsize: 78, obs: {time: [...], id: 5, ...}},
    {rowsize: 99, obs: {time: [...], id: 6, ...}},
    {rowsize: 70, obs: {time: [...], id: 7, ...}}]
   -----------------------------------------------------------------------------------------------------
   type: 8 * struct[{
       rowsize: int64[parameters={"attrs": {"long_name": "number of observations in each trajectory"}}],
       obs: {
           time: [var * int64, parameters={"attrs": {"long_name": "time"}}],
           id: int64[parameters={"attrs": {"long_name": "trajectory id"}}],
           x: [var * float64, parameters={"attrs": {"long_name": "x coordinate"}}],
           y: [var * float64, parameters={"attrs": {"long_name": "y coordinate"}}]
       }
   }, parameters={"attrs": {"title": "An example of synthetic data"}}]
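
Since the ragged structure is preserved, per-trajectory statistics can be obtained
by splitting the contiguous arrays according to ``rowsize``. Below is a minimal
sketch using only NumPy on the ``ds`` dataset built above (the ``clouddrift.ragged``
module also provides helpers, such as ``apply_ragged``, for this kind of operation):

.. code-block:: python

   import numpy as np

   # start indices of the second through last trajectories along the "obs" dimension
   boundaries = np.cumsum(ds.rowsize.values)[:-1]

   # mean x position of each of the 8 synthetic trajectories
   mean_x = [traj.mean() for traj in np.split(ds.x.values, boundaries)]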
