Doc edits (#483)
* reorder datasets.rst

* added ragged array example in usage.rst

* break example

* break example in two
selipot authored Jun 26, 2024
1 parent 3c57be0 commit ad08d4e
Showing 2 changed files with 120 additions and 37 deletions.
67 changes: 33 additions & 34 deletions docs/datasets.rst
@@ -4,7 +4,39 @@ Datasets
========

CloudDrift provides convenience functions to access real-world ragged-array
datasets. Currently available datasets are:

- :func:`clouddrift.datasets.andro`: The ANDRO dataset as a ragged array
processed from the upstream dataset hosted at the `SEANOE repository
<https://www.seanoe.org/data/00360/47077/>`_.
- :func:`clouddrift.datasets.gdp1h`: 1-hourly Global Drifter Program (GDP) data
from a `cloud-optimized Zarr dataset on AWS <https://registry.opendata.aws/noaa-oar-hourly-gdp/>`_.
- :func:`clouddrift.datasets.gdp6h`: 6-hourly GDP data from a ragged-array
NetCDF file hosted by the public HTTPS server at
`NOAA's Atlantic Oceanographic and Meteorological Laboratory (AOML) <https://www.aoml.noaa.gov/phod/gdp/index.php>`_.
- :func:`clouddrift.datasets.glad`: 15-minute Grand LAgrangian Deployment (GLAD)
data produced by the Consortium for Advanced Research on Transport of
Hydrocarbon in the Environment (CARTHE) and hosted upstream at the `Gulf of
Mexico Research Initiative Information and Data Cooperative (GRIIDC)
<https://doi.org/10.7266/N7VD6WC8>`_.
- :func:`clouddrift.datasets.mosaic`: MOSAiC sea-ice drift dataset as a ragged
array processed from the upstream dataset hosted at the
`NSF's Arctic Data Center <https://doi.org/10.18739/A2KP7TS83>`_.
- :func:`clouddrift.datasets.subsurface_floats`: The subsurface float trajectories dataset
hosted at `NOAA's Atlantic Oceanographic and Meteorological Laboratory (AOML) <https://www.aoml.noaa.gov/phod/float_traj/index.php>`_
and maintained by Andree Ramsey and Heather Furey from the Woods Hole Oceanographic Institution.
- :func:`clouddrift.datasets.spotters`: The Sofar Ocean Spotters archive dataset hosted at the public `AWS S3 bucket <https://sofar-spotter-archive.s3.amazonaws.com/spotter_data_bulk_zarr>`_.
- :func:`clouddrift.datasets.yomaha`: The YoMaHa'07 dataset as a ragged array
processed from the upstream dataset hosted at the `Asia-Pacific Data-Research
Center (APDRC) <http://apdrc.soest.hawaii.edu/projects/yomaha/>`_.
- :func:`clouddrift.datasets.hurdat2`: The HURricane DATa 2nd generation (HURDAT2) dataset,
processed from the upstream dataset hosted at the `NOAA AOML Hurricane Research Division <https://www.aoml.noaa.gov/hrd/hurdat/Data_Storm.html>`_.

The GDP and the Spotters datasets are accessed lazily, so the data is only downloaded when
specific array values are referenced. The ANDRO, GLAD, MOSAiC, Subsurface Floats, and YoMaHa'07
datasets are downloaded in their entirety when the function is called for the first
time and stored locally for later use.

>>> from clouddrift.datasets import gdp1h
>>> ds = gdp1h()
@@ -44,36 +44,3 @@ Attributes: (12/16)
summary: Global Drifter Program hourly data
title: Global Drifter Program hourly drifting buoy collection
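
Because the hourly GDP dataset is accessed lazily, the call above only opens the
dataset without transferring its contents. As a minimal sketch (assuming the
``lon`` variable of the hourly GDP collection), values are downloaded only once
they are referenced:

>>> lon_head = ds.lon[:5].values  # requesting values triggers the download of this slice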

90 changes: 87 additions & 3 deletions docs/usage.rst
@@ -81,7 +81,7 @@ Attributes:
long_name: Longitude
units: degrees_east

You see that this array is very long--it has 197214787 elements.
This is because in a ragged array, many varying-length arrays are laid out as a
contiguous 1-dimensional array in memory.
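
As a minimal, standalone illustration of this layout (synthetic values, not taken
from the GDP dataset), a ``rowsize`` variable records how many observations belong
to each trajectory, so the flat array can be split back into per-trajectory arrays:

.. code-block:: python

   import numpy as np

   # three trajectories of lengths 3, 2, and 4 stored back to back
   rowsize = np.array([3, 2, 4])
   lon = np.array([10.1, 10.2, 10.3,               # trajectory 0
                   45.0, 45.1,                     # trajectory 1
                   -60.0, -59.9, -59.8, -59.7])    # trajectory 2

   # cumulative row sizes give the boundaries between consecutive trajectories
   lon_per_traj = np.split(lon, np.cumsum(rowsize)[:-1])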

@@ -252,9 +252,93 @@ CloudDrift provides an easy way to convert custom Lagrangian datasets into
# Alternatively, convert to Awkward Array for analysis
ds = ra.to_awkward()
The snippet above is specific to the hourly GDP dataset; however, you can use the
``RaggedArray`` class directly to convert other custom datasets into a ragged
array structure that is analysis ready via Xarray or Awkward Array packages.
The functions to do that are defined in the ``clouddrift.adapters`` submodule.
You can use these examples as a reference to ingest your own or other custom
Lagrangian datasets into ``RaggedArray``.
Lagrangian datasets into ``RaggedArray``. We provide below a simple example of
how to build a ragged-array dataset from simulated data:

.. code-block:: python

   # create synthetic Lagrangian data: 8 trajectories of random walks
   import numpy as np

   rowsize = [100, 65, 7, 22, 56, 78, 99, 70]
   x = []
   y = []
   for i in range(len(rowsize)):
       x.append(np.cumsum(np.random.normal(0, 1, rowsize[i]) + np.random.uniform(0, 1, 1)))
       y.append(np.cumsum(np.random.normal(0, 1, rowsize[i]) + np.random.uniform(0, 1, 1)))
   x = np.concatenate(x)
   y = np.concatenate(y)

   # create an instance of the RaggedArray class
   from clouddrift.raggedarray import RaggedArray

   # define the coordinates
   coords = {"time": np.arange(len(x)), "id": np.arange(len(rowsize))}

   # define the data
   data = {"x": x, "y": y}

   # define the metadata, which here includes the ``rowsize`` parameter
   metadata = {"rowsize": rowsize}

   # map the names of the dimensions to what the class expects,
   # that is, which names correspond to "rows" and "obs"
   name_dims = {"traj": "rows", "obs": "obs"}

   # map the dimensions of the coordinates defined above
   coord_dims = {"time": "obs", "id": "traj"}

   # define some attributes for the dataset and its variables
   attrs_global = {"title": "An example of synthetic data"}
   attrs_variables = {
       "id": {"long_name": "trajectory id"},
       "time": {"long_name": "time"},
       "x": {"long_name": "x coordinate"},
       "y": {"long_name": "y coordinate"},
       "rowsize": {"long_name": "number of observations in each trajectory"},
   }

   # instantiate the RaggedArray class
   ra = RaggedArray(
       coords, metadata, data, attrs_global, attrs_variables, name_dims, coord_dims
   )

Next, the ragged-array object ``ra`` can be used to generate Xarray and Awkward
Array datasets for further analysis and processing:

.. code-block:: python

   # convert to xarray dataset
   ds = ra.to_xarray()
   ds

   <xarray.Dataset> Size: 12kB
   Dimensions:  (traj: 8, obs: 497)
   Coordinates:
       time     (obs) int64 4kB 0 1 2 3 4 5 6 7 ... 489 490 491 492 493 494 495 496
       id       (traj) int64 64B 0 1 2 3 4 5 6 7
   Dimensions without coordinates: traj, obs
   Data variables:
       rowsize  (traj) int64 64B 100 65 7 22 56 78 99 70
       x        (obs) float64 4kB -0.3243 -0.2817 0.1442 1.31 ... 13.18 13.07 14.02
       y        (obs) float64 4kB 1.25 2.073 3.493 3.44 ... 11.56 9.913 10.11 11.03
   Attributes:
       title:   An example of synthetic data

   # convert to awkward array
   ds_ak = ra.to_awkward()
   ds_ak

   [{rowsize: 100, obs: {time: [...], ...}},
    {rowsize: 65, obs: {time: [...], id: 1, ...}},
    {rowsize: 7, obs: {time: [...], id: 2, ...}},
    {rowsize: 22, obs: {time: [...], id: 3, ...}},
    {rowsize: 56, obs: {time: [...], id: 4, ...}},
    {rowsize: 78, obs: {time: [...], id: 5, ...}},
    {rowsize: 99, obs: {time: [...], id: 6, ...}},
    {rowsize: 70, obs: {time: [...], id: 7, ...}}]
   -----------------------------------------------------------------------------------------------------
   type: 8 * struct[{
       rowsize: int64[parameters={"attrs": {"long_name": "number of observations in each trajectory"}}],
       obs: {
           time: [var * int64, parameters={"attrs": {"long_name": "time"}}],
           id: int64[parameters={"attrs": {"long_name": "trajectory id"}}],
           x: [var * float64, parameters={"attrs": {"long_name": "x coordinate"}}],
           y: [var * float64, parameters={"attrs": {"long_name": "y coordinate"}}]
       }
   }, parameters={"attrs": {"title": "An example of synthetic data"}}]
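
Since the ragged structure is preserved, per-trajectory statistics can be obtained
by splitting the contiguous arrays according to ``rowsize``. Below is a minimal
sketch using only NumPy on the ``ds`` dataset built above (the ``clouddrift.ragged``
module also provides helpers, such as ``apply_ragged``, for this kind of operation):

.. code-block:: python

   import numpy as np

   # start indices of the second through last trajectories along the "obs" dimension
   boundaries = np.cumsum(ds.rowsize.values)[:-1]

   # mean x position of each of the 8 synthetic trajectories
   mean_x = [traj.mean() for traj in np.split(ds.x.values, boundaries)]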
