Commit

Update docs (#108)
* Add top-level docstrings

* Add adapters to the API docs

* Rewrite Usage

* Update installation
milancurcic authored Feb 7, 2023
1 parent cb148c7 commit 271e728
Showing 5 changed files with 113 additions and 117 deletions.
8 changes: 8 additions & 0 deletions clouddrift/adapters/__init__.py
"""
This module provides adapters to custom datasets.
Each adapter module provides convenience functions and metadata to convert a
custom dataset to a `clouddrift.RaggedArray` instance.
Currently, clouddrift only provides an adapter module for the hourly Global
Drifter Program (GDP) data, and more adapters will be added in the future.
"""

import clouddrift.adapters.gdp
31 changes: 20 additions & 11 deletions clouddrift/adapters/gdp.py
"""
This module provides functions and metadata that can be used to convert the
hourly Global Drifter Program (GDP) data to a ``clouddrift.RaggedArray`` instance.
"""

from ..dataformat import RaggedArray
import numpy as np
import pandas as pd
def str_to_float(value: str, default=np.nan) -> float:
    ...
    return default


def cut_str(value: str, max_length: int) -> np.chararray:
    """Cut a string to a specific length and return it as a numpy chararray.

    Args:
        value (str): String to cut
        max_length (int): Length of the output

    Returns:
        out (np.chararray): String with max_length characters
    """
    charar = np.chararray(1, max_length)
    charar[:max_length] = value
    return charar
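As a quick standalone illustration of the truncation behavior (a sketch using only NumPy; the example string is hypothetical):

```python
import numpy as np

def cut_str(value: str, max_length: int) -> np.chararray:
    """Cut a string to a specific length and return it as a numpy chararray."""
    # np.chararray(1, max_length) allocates one element with itemsize max_length;
    # assigning a longer string silently truncates it to that itemsize.
    charar = np.chararray(1, max_length)
    charar[:max_length] = value
    return charar

print(cut_str("Drifter", 3)[0])  # b'Dri'
```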


def drogue_presence(lost_time, time):
    """Create drogue status from the drogue lost time and the trajectory time.

    Args:
        lost_time: Timestamp of the drogue loss (or NaT)
        time: Observation time

    Returns:
        out (bool): True if drogued and False otherwise
    """
    if pd.isnull(lost_time) or lost_time >= time[-1]:
        return np.ones_like(time, dtype="bool")
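Only the all-drogued branch is visible above; the remainder of the function is truncated. Under the assumption that the hidden branch flags observations before the loss time as drogued, the behavior can be sketched as follows (`drogue_presence_sketch` is a hypothetical stand-in, not the library function):

```python
import numpy as np
import pandas as pd

def drogue_presence_sketch(lost_time, time):
    # Drogue present for the whole record if the loss time is missing
    # or falls after the last observation
    if pd.isnull(lost_time) or lost_time >= time[-1]:
        return np.ones_like(time, dtype="bool")
    # Assumed behavior of the truncated branch: drogued while before the loss time
    return time < lost_time

t = np.array([0.0, 1.0, 2.0, 3.0])
print(drogue_presence_sketch(1.5, t))  # [ True  True False False]
```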
14 changes: 14 additions & 0 deletions docs/api.rst

Auto-generated summary of CloudDrift's API. For more details and examples, refer to the example Jupyter Notebooks.

Adapters
--------

.. automodule:: clouddrift.adapters
:members:
:undoc-members:

GDP
^^^

.. automodule:: clouddrift.adapters.gdp
:members:
:undoc-members:

Analysis
--------

52 changes: 46 additions & 6 deletions docs/install.rst
.. _install:

Installation
============

You can install the latest release of CloudDrift using pip or Conda.
You can also install the latest development (unreleased) version from GitHub.

pip
---

In your virtual environment, type:

.. code-block:: text

   pip install clouddrift
Conda
-----

First add ``conda-forge`` to your channels in your Conda environment:

.. code-block:: text

   conda config --add channels conda-forge
   conda config --set channel_priority strict

Then install CloudDrift:

.. code-block:: text

   conda install clouddrift
Developers
----------

If you need the latest development version, get it from GitHub using pip:

.. code-block:: text

   pip install git+https://github.com/Cloud-Drift/clouddrift
Running tests
=============

To run the tests, first download the CloudDrift source code from GitHub and
install it in your virtual environment:


.. code-block:: text

   git clone https://github.com/cloud-drift/clouddrift
   cd clouddrift
   python3 -m venv venv
   source venv/bin/activate
   pip install .
Then, run the tests like this:

.. code-block:: text

   python -m unittest tests/*.py

A quick how-to guide is provided on the `Usage <https://cloud-drift.github.io/clouddrift/usage.html>`_ page.
125 changes: 25 additions & 100 deletions docs/usage.rst
Usage
=====

CloudDrift provides an easy way to convert Lagrangian datasets into
`contiguous ragged arrays <https://cfconventions.org/cf-conventions/cf-conventions.html#_contiguous_ragged_array_representation>`_.

.. code-block:: python

   # Import a GDP-hourly adapter function
   from clouddrift.adapters.gdp import to_raggedarray

   # Download 100 random GDP-hourly trajectories as a ragged array
   ra = to_raggedarray(n_random_id=100)

   # Store to NetCDF and Parquet files
   ra.to_netcdf("gdp.nc")
   ra.to_parquet("gdp.parquet")

   # Convert to Xarray Dataset for analysis
   ds = ra.to_xarray()

   # Alternatively, convert to Awkward Array for analysis
   ds = ra.to_awkward()

This snippet is specific to the hourly GDP dataset; however, you can use the
``RaggedArray`` class directly to convert other custom datasets into a ragged
array structure that is analysis-ready via the Xarray or Awkward Array packages.
We provide step-by-step guides to convert the individual trajectories from the
Global Drifter Program (GDP) hourly and 6-hourly datasets, the drifters from the
`CARTHE <http://carthe.org/>`_ experiment, and a typical output from a numerical
Lagrangian experiment in our
`repository of example Jupyter Notebooks <https://github.com/cloud-drift/clouddrift-examples>`_.
You can use these examples as a reference to ingest your own or other custom
Lagrangian datasets into ``RaggedArray``.
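The padded-to-ragged reshaping behind this layout can be sketched with plain
NumPy (illustrative only; the sample coordinates are hypothetical):

```python
import numpy as np

# Two trajectories padded with NaN to a common length: shape (trajectory, time)
lon = np.array([
    [0.0, 0.1, 0.2, np.nan],
    [5.0, 5.1, np.nan, np.nan],
])

# Locate the valid observations
idx_finite = np.where(np.isfinite(lon))

# Number of observations per trajectory, and the flat (ragged) coordinate
rowsize = np.bincount(idx_finite[0])
lon_ragged = lon[idx_finite]

print(rowsize)     # [3 2]
print(lon_ragged)  # [0.  0.1 0.2 5.  5.1]
```

Together, ``rowsize`` and the flattened coordinate are exactly what a
contiguous ragged array stores: one count per trajectory plus one long vector
of observations.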

In the future, ``clouddrift`` will include functions to perform typical
oceanographic Lagrangian analyses.
