From 1cae21a172ae121555582cfd14f9e9d1b87031b2 Mon Sep 17 00:00:00 2001 From: wrongkindofdoctor <20195932+wrongkindofdoctor@users.noreply.github.com> Date: Tue, 9 Apr 2024 17:26:44 -0400 Subject: [PATCH] more doc updating and cleanup --- doc/sphinx/fmwk_datafetch.rst | 40 --------- doc/sphinx/fmwk_dataquery.rst | 68 --------------- doc/sphinx/fmwk_datasources.rst | 78 ----------------- doc/sphinx/fmwk_toc.rst | 18 +--- doc/sphinx/pod_settings.rst | 143 ++++++++++++++++++++++++++------ doc/sphinx/ref_data.rst | 94 +++++++++++++++------ doc/sphinx/ref_dev_toc.rst | 8 -- doc/sphinx/ref_envvars.rst | 97 ++-------------------- doc/sphinx/ref_settings.rst | 129 ++++++++++------------------ 9 files changed, 242 insertions(+), 433 deletions(-) delete mode 100644 doc/sphinx/fmwk_datafetch.rst delete mode 100644 doc/sphinx/fmwk_dataquery.rst delete mode 100644 doc/sphinx/fmwk_datasources.rst delete mode 100644 doc/sphinx/ref_dev_toc.rst diff --git a/doc/sphinx/fmwk_datafetch.rst b/doc/sphinx/fmwk_datafetch.rst deleted file mode 100644 index 0433e8b2c..000000000 --- a/doc/sphinx/fmwk_datafetch.rst +++ /dev/null @@ -1,40 +0,0 @@ -Data layer: Fetch -================= - -This section describes the **Select** and **Fetch** stages of the data request process, implemented in the :doc:`src.data_manager`. See :doc:`fmwk_datasources` for an overview of the process. - - -.. _ref-datasources-select: - -Selection stage ---------------- - -The purpose of the **Select** stage is to select the minimal amount of data to download which will satisfy the requirements of all the PODs. This logic comes into play when different PODs request the same variable, or when the query results for a single variable include multiple copies of the same data. The latter situation happens frequently in practice: in addition to the example above of the same CMIP6 variable being present in multiple MIP tables, model postprocessing workflows can output the same data in several formats. 
- - - - -Methods called -++++++++++++++ - - -Termination conditions -++++++++++++++++++++++ - -The logic for handling selection errors differs from the other stages, which operate on individual variables independently. - -.. _ref-datasources-fetch: - -Fetch stage ------------ - -The purpose of the **Fetch** stage is straightforward: after the **Select** stage has completed, we have an unambiguous list of remote model data we need to transfer. This stage does so, in general by calling third-party library functions. - -Methods called -++++++++++++++ - - - - -Termination conditions -++++++++++++++++++++++ \ No newline at end of file diff --git a/doc/sphinx/fmwk_dataquery.rst b/doc/sphinx/fmwk_dataquery.rst deleted file mode 100644 index ef8a7e0e6..000000000 --- a/doc/sphinx/fmwk_dataquery.rst +++ /dev/null @@ -1,68 +0,0 @@ -Data layer: Query -================= - -This section describes the **Query** stage of the data request process, implemented in the :doc:`src.data_manager`. See :doc:`fmwk_datasources` for an overview of the process. - -Overview --------- - -Currently all data sources implement the **Query** stage by querying an `intake-esm `__ catalog (in a nonstandard way), which is implemented by :class:`~src.data_manager.DataframeQueryDataSourceBase`. In addition, all current data sources also assemble this catalog on the fly, by crawling data files in a regular directory hierarchy and parsing metadata from the file naming convention in a **Pre-Query** stage. This is provided by :class:`~src.data_manager.OnTheFlyDirectoryHierarchyQueryMixin`, which inherits from :class:`~src.data_manager.OnTheFlyFilesystemQueryMixin`. The **Pre-Query** stage is done once, during the :meth:`~src.data_manager.AbstractDataSource.setup_query` hook that is executed before any queries. - -Specific data sources, which correspond to different directory hierarchy naming conventions, inherit from these classes and provide logic describing the file naming convention. - -.. 
_ref-datasources-prequery: - -Pre-query stage ---------------- - -The purpose of the **Pre-Query** stage is to perform any setup tasks that only need to be done once in order to enable data queries. As described above, current data sources crawl a directory to construct a catalog on the fly, but other sources could use this stage to open a connection to a remote database, etc. - -Catalog construction -++++++++++++++++++++ - -Data sources that inherit from the :class:`~src.data_manager.OnTheFlyFilesystemQueryMixin` class (currently, all of them) construct an intake catalog before any queries are executed. The catalog gets constructed by the :meth:`~src.data_manager.setup_query` method of OnTheFlyFilesystemQueryMixin, which is called once, before any queries take place, as part of the hooks offered by the :class:`~src.data_manager.AbstractDataSource` base class. setup_query calls :meth:`~src.data_manager.generate_catalog`, as implemented by OnTheFlyDirectoryHierarchyQueryMixin, to crawl the directory and assemble a Pandas DataFrame, which is converted to an `intake-esm `__ catalog. - - Child classes of OnTheFlyDirectoryHierarchyQueryMixin must supply two classes as attributes, ``_FileRegexClass`` and ``_DirectoryRegex``. ``_DirectoryRegex`` is a :class:`~src.util.dataclass.RegexPattern` -- a wrapper around a python regular expression -- which selects the subdirectories to be included in the catalog, based on whether they match the regex. - - ``_FileRegexClass`` implements parsing paths in the directory hierarchy into usable metadata, and is expected to be a :func:`~src.util.dataclass.regex_dataclass`: the regex_dataclass decorator extends python :py:mod:`dataclasses` to the case where the fields of a dataclass are populated by named capture groups in a regular expression. - -For concreteness, we'll describe how the CMIP6 directory hierarchy (DRS) is implemented by :class:`~src.data_sources.CMIP6LocalFileDataSource`. 
In this case ``_DirectoryRegex`` is the :func:`~src.cmip6.drs_directory_regex`, matching directories in the CMIP6 DRS, and ``_FileRegexClass`` is :class:`~src.cmip6.CMIP6_DRSPath`, which parses CMIP6 filenames and paths. Individual fields of a regex_dataclass can also be regex_dataclasses (under inheritance), in which case they apply regexes and populate fields of all parent classes as well. This is used in CMIP6_DRSPath, which simply concatenates the fields from :class:`~src.cmip6.CMIP6_DRSDirectory` and :class:`~src.cmip6.CMIP6_DRSFilename`, and so on. This is part of a more general mechanism in which the strings matched by the regex groups are used to instantiate objects of the type in the corresponding field's type annotation, e.g. the CMIP6 ``version_date`` attribute is used to create a :class:`~src.util.datelabel.Date` object. - -The regex_dataclass mechanism is intended to streamline the common aspects of parsing metadata from a string. In addition to the conditions of the regex, arbitrary validation and checking logic can be implemented in the class's ``__post_init__`` method. At the expense of regex syntax, this provides parsing functionality not available in other tools. - -Catalog column specifications -+++++++++++++++++++++++++++++ - -Each field of the ``_FileRegexClass`` dataclass defines a column of the DataFrame which is used as the catalog, and each parseable file encountered in the directory crawl is added to it as a row. Metadata about the columns for a specific data source is provided by a "column specification" object, which inherits from :class:`~src.data_manager.DataframeQueryColumnSpec` and is assigned to the ``col_spec`` attribute of the data source's class. - -The ``expt_cols`` attribute of this class is a list of column names whose values must all be the same for two files to be considered to belong to the same experiment. This is needed, e.g., to collect timeseries data chunked by date across multiple files. 
This is used to define an "experiment key", which is used to test if two files belong to the same or different experiments. Currently this just concatenates string representations of all the entries in ``expt_cols``. - -The ```pod_expt_cols`` and ```var_expt_cols`` attributes of the column spec come into play during the **Select** stage, and are discussed in :ref:`that section `. Finally, the column spec also identifies the names of the columns containing the path to the file on the remote filesystem (``remote_data_col``) and the column containing the :class:`~src.util.datelabel.DateRange` of data in each file. - - -.. _ref-datasources-query: - -Query stage ------------ - -The purpose of the **Query** stage is to locate remote data, if any is present, for each active variable for which this information is unknown. - -Methods called -++++++++++++++ - -The overarching method for the **Query** stage is the :meth:`~src.data_manager.DataSourceBase.query_data` method of DataSourceBase, which does a query for all active PODs at once. This calls :meth:`~src.data_manager.DataframeQueryDataSourceBase.query_dataset` on the child class (DataframeQueryDataSourceBase), which queries a single variable requested by a POD. The catalog query itself is done in :meth:`~src.data_manager.DataframeQueryDataSourceBase._query_catalog`. Individual conditions of the query are assembled by :meth:`~src.data_manager.DataframeQueryDataSourceBase._query_clause`, except for the clause specifying that data cover the analysis period, which is done first for technical reasons involving the use of comparison operators in object-valued columns. - -By default, \_query_clause assumes the names of columns in the catalog are the same as the corresponding attributes on the :class:`~src.diagnostic.VarlistEntry` object defining the query. This can be changed by defining a class attribute named ``_query_attrs_synonyms``: a dict that will be used to map attributes on the variable to the correct column names. 
(Translating the *values* in those columns between the naming conventions of the POD's settings file and the naming convention used by the data source is done by :class:`~src.core.VariableTranslator`). - -The query is executed by Pandas' `query `__ method, which returns a DataFrame containing a subset of the catalog's rows. There is no good reason for this, and this should be reimplemented in terms of Intake's `search `__ method, which is closely equivalent. - -The query results are then grouped by values of the "experiment key" (defined :ref:`above `). If a group is not eliminated by :meth:`~src.data_manager.check_group_daterange` or custom logic in :meth:`~src.data_manager._query_group_hook`, it's considered a successful query. A "data key" (an object of the class given in the data source's ``_DataKeyClass`` attribute) corresponding to the result is generated and stored in the ``data`` attribute of the variable being queried. Specifically, the ``data`` attribute is a dict mapping experiment keys to data keys. - -"Data keys" inherit from :class:`~src.data_manager.DataKeyBase` and are used to associate remote files (or URLs, etc.) with local paths to downloaded data during the Fetch stage. All data sources based on the DataframeQueryDataSourceBase use the :class:`~src.data_manager.DataFrameDataKey`, which identifies files based on their row index in the catalog; the path to the remote file (in ``remote_data_col``) is looked up separately. - -Termination conditions -++++++++++++++++++++++ - -The **Query** stage operates in "batch mode," executing queries for all active variables (VarlistEntry objects with ``status`` = ACTIVE) which have not already been queried (``stage`` attribute < QUERIED enum value). A successful query is one that returns a nonempty result from the catalog, which causes its ``stage`` to be updated to QUERIED and the VarlistEntry to be removed from the batch. 
Unsuccessful queries result in the deactivation of the variable and the activation of its alternates, as described :ref:`above `. These alternates will be included in the batch when it's recalculated (unless they've already been queried as a result of being an alternate for another variable as well.) - -The **Query** stage terminates when the batch of variables to query is empty (or when the batch-query process repeats more than a maximum number of times, to guard against infinite loops.) Recall, though, that because of the structure of the query-fetch-preprocess loop, the **Query** stage may execute multiple times with batches of different variables. diff --git a/doc/sphinx/fmwk_datasources.rst b/doc/sphinx/fmwk_datasources.rst deleted file mode 100644 index 98217182e..000000000 --- a/doc/sphinx/fmwk_datasources.rst +++ /dev/null @@ -1,78 +0,0 @@ -Data layer: Overview -==================== - -This section describes the :doc:`src.data_manager`, which implements general (base class) functionality for how the package finds and downloads model data requested by the PODs. It also describes some aspects of the :doc:`src.data_sources`, which contains the child class implementations of these methods that are selectable by the user through the ``--data-manager`` flag. - -In the code (and this document), the terms "data manager" and "data source" are used interchangeably to refer to this functionality. In the future, this should be standardized on "data source," to avoid user confusion. - -Overview --------- - -Functionality -+++++++++++++ - -One of the major goals of the MDTF package is to allow PODs to analyze data from a wide range of sources and formats without rewriting. This is done by having PODs specify their :doc:`data requirements `) in a model-agnostic way (see :doc:`fmwk_datamodel`, and providing separate :doc:`data source `) "plug-ins" that implement the details of the data query and transfer for each source of model data supported by the package. 
- -At a high level, the job of a data source plug-in is simple. The PODs' data request gets translated into the native format (variable naming convention, etc.) of the data source by the :class:`~src.core.VariableTranslator`). The plug-in does a search for each variable requested by each POD: if the search is successful and the data is available, the plug-in downloads the data for the POD; if not, we log an error and the POD can't run. - -This simple picture gets complicated because we also implement the following functionality that provides more flexibility in the data search process. By shifting the following responsibilities from the user to the framework, we get a piece of software that's more usable in practice. - -- PODs can be flexible in what data they accept by specifying *alternate variables*, to be used as a "fallback" or "plan B" if a variable isn't present in the model output. (Implementing and testing code that handles both cases is entirely the POD's responsibility.) -- Downloading is structured to *minimize data movement*: if multiple PODs request the same variable, only one copy of the remote files will be downloaded. If a time series is chunked across multiple files, only the files corresponding to the analysis period will be downloaded. -- The framework has a *data preprocessing* step which can do a limited set of transformations on data (in addition to changing its format), eg. automatically extracting a vertical level from 3D data. -- We allow for *optional settings* in the model data specification, which fall into several classes. Using CMIP6 as an example: - - - The values of some model data settings might be uniquely determined by others: eg, if the user wants to analyze data from the CESM2 model, setting ``source_id`` to CESM2 means ``institution_id`` must be NCAR. The user shouldn't need to supply both settings. - - Some settings for the data source may be irrelevant for the user's purposes. 
E.g., (mean) surface air pressure at monthly frequency is provided in the ``Amon``, ``cfMon`` and ``Emon`` MIP tables, but not the other monthly tables. Since the user isn't running a MIP but only cares about obtaining that variable's data, they shouldn't need to look up which MIP table contains the variable they want. - - Finally, in some use cases the user may be willing to have the framework infer settings on their behalf. E.g., if the user is doing initial exploratory data analysis, they probably want the ``revision_date`` to be the most recent version available for that model's data, without having to look up what that date is. Of course, the user may *not* want this (eg, for reproducing an earlier analysis), so this functionality can be controlled with the ``--strict`` command-line option or by explicitly setting the desired ``revision_date``. - -In general, we break the logic up into a hierarchy of multiple classes to make future customization possible without code duplication. Recall that we expect to have many data source child classes, one for each format and location of model data supported by the package, so by moving common logic into parent classes and using inheritance we can enable new child data sources to be added with less new code. - -.. _ref-datasources-mainloop: - -"Query-fetch-preprocess" main loop -++++++++++++++++++++++++++++++++++ - -All current data sources operate in the following distinct **stages**, which are described by the :class:`~src.data_manager.DataSourceBase` class. This is a base class which only provides the skeleton for the stages below, leaving details to be implemented by child classes. The entry point for the entire process is the :meth:`~src.data_manager.DataSourceBase.request_data` method, which works "backwards" through the following stages: - -0. (**Pre-Query**): in situations where a pre-existing data catalog is not available, construct one "on the fly" by crawling the directories where the data files are stored. 
This is done once, during the :meth:`~src.data_manager.AbstractDataSource.setup_query` hook that is executed before any queries. -1. **Query** the external source for the presence of variables requested by the PODs, done by :meth:`~src.data_manager.DataSourceBase.query_data`; -2. **Select** the specific files (or atomic units of data) to be transferred in order to minimize data movement, done by :meth:`~src.data_manager.DataSourceBase.select_data`; -3. **Fetch** the selected files from the provider's location via some file transfer protocol, downloading them to a local temp directory, done by :meth:`~src.data_manager.DataSourceBase.fetch_data`; -4. **Preprocess** the local copies of data, by converting them from their native format to the format expected by each POD, done by :meth:`~src.data_manager.DataSourceBase.preprocess_data`. - -Due to length, each of the stages is described in subsequent sections. The **Pre-Query** and **Query** stages are described in :doc:`fmwk_dataquery`, the **Select** and **Fetch** stages are described in :doc:`fmwk_datafetch`, and the **Preprocess** stage is described in :doc:`fmwk_preprocess`. - -Although the stages are described as a linear progression above, when we incorporate error handling the process becomes a do-while loop. We mentioned that the stages (other than the **Pre-Query** setup) are executed backwards: first **Preprocess** is called, but it discovers it doesn't have any locally downloaded files to preprocess, so it calls **Fetch**, which discovers it doesn't know which files to download, etc. The loop is organized in this "backwards" fashion to make error handling more straightforward, especially since variables (and their alternates, see next section) are processed in a batch. - -There are many situations in which processing of a variable may fail at a given stage: a query may return zero results, a file transfer may be interrupted, or the data may have mis-labeled metadata. 
The general pattern for handling such a failure is to look for alternate representations for that variable, and start processing them from the beginning of the loop. - -.. _ref-datasources-varlist: - -VarlistEntries as input -+++++++++++++++++++++++ - -The job of the data source is to obtain model data requested by the PODs from an experiment selected by the user at runtime. The way PODs request data is through a declaration in their :doc:`settings file `, which is used to define :class:`~src.diagnostic.VarlistEntry` objects. These objects, along with the user's experiment selection, are the input to the data source. We summarize relevant attributes of the VarlistEntry objects here. - -Each VarlistEntry object has a ``stage`` attribute, taking values in the :class:`~src.diagnostic.VarlistEntryStage` enum, which tracks the last stage of the loop that the variable has successfully completed. In addition, its ``status`` attribute is also relevant: only variables that have ACTIVE status are progressed through the pipeline; when a failure occurs on a variable, it's deactivated (via :meth:`~src.core.MDTFObjectBase.deactivate`) and its alternates are activated. - -As part of the :meth:`~src.data_manager.setup` process (:meth:`~src.data_manager.setup_var`), model-agnostic information in each VarlistEntry object is translated into the naming convention used by the model. This is stored in a :class:`~src.core.TranslatedVarlistEntry` object, in the ``translation`` attribute of the corresponding VarlistEntry. These form the main input to the **Preprocess** stage, as described below. - -Finally, the ``alternates`` attribute is central to how errors are handled during the data request process. Each variable (VarlistEntry) can optionally specify one or more alternate, or "backup" variables, as a list of other VarlistEntry objects stored in this attribute. 
These variables can specify their own alternates, etc., so that a single "data request" corresponding to a single logical variable is implemented as a *linked list* of VarlistEntry objects. - -The data source traverses this list in breadth-first order until data corresponding to a viable set of alternates is fully processed (makes it through all the stages): if the data specified by one VarlistEntry isn't available, we try its alternates (if it has any), and if one of those isn't found, we try its alternates, and so on. - -.. _ref-datasources-keys: - -Experiment keys and data keys -+++++++++++++++++++++++++++++ - -The final pieces of terminology we need to introduce are "*experiment keys*" and "*data keys*". These are most relevant in the **Select** stage. - -From the point of view of the MDTF package, an *experiment* is any collection of datasets that are "compatible," in that having a POD analyze the datasets together will produce a result that's sensible to look at. It makes no sense to feed variables from different CO2 forcings into a POD (unless that's part of that POD's purpose), but it may make sense to use variables from different model runs if forcings and other conditions are identical. - -As described above, a data source is a python class that provides an interface to obtain data from many different experiments (stored in the same way). The class uniquely distinguishes different experiments by defining values in a set of "experiment attributes". Likewise, the results of a single experiment will comprise multiple variables and physical files, which are described by "data attributes" -- think of columns in a data catalog. Because each unit of data is associated with one experiment, the set of experiment attributes is a subset of the set of data attributes. 
- -"Data keys" and "experiment keys," then, are objects that store or refer to these attributes for purposes of implementing the query, and are in one-to-one correspondence with units of data and experiments. In particular, the results of the query for one variable are stored in a dict on its ``data`` attribute, mapping experiment keys to data keys found by the query. - -We spell this out in detail because this is our mechanism for enabling flexible and intelligent queries, as described in the overview section. In particular, we *don't* require that the user explicitly provide values for all experiment attributes at runtime: the job of the **Select** stage is to select an experiment key consistent with all data keys that have been found by the preceding **Query** stage. diff --git a/doc/sphinx/fmwk_toc.rst b/doc/sphinx/fmwk_toc.rst index b093650c9..d67e6b10e 100644 --- a/doc/sphinx/fmwk_toc.rst +++ b/doc/sphinx/fmwk_toc.rst @@ -10,17 +10,6 @@ Internal code documentation .. Package design .. -------------- -.. These sections describe design features of the code that cut across multiple modules. - -.. .. toctree:: -.. :maxdepth: 1 - -.. fmwk_intro -.. fmwk_plugins -.. fmwk_obj_hierarchy -.. fmwk_datamodel -.. fmwk_provenance - Package code and API documentation ---------------------------------- @@ -30,9 +19,6 @@ These sections provide an overview of specific parts of the code that's more hig :maxdepth: 1 fmwk_cli - fmwk_datasources - fmwk_dataquery - fmwk_datafetch fmwk_preprocess fmwk_utils @@ -67,7 +53,8 @@ Supporting framework modules Utility modules ^^^^^^^^^^^^^^^ -The ``src.util`` subpackage provides non-MDTF-specific utility functionality used many places in the modules above. See the :doc:`fmwk_utils` documentation for an overview. +The ``src.util`` subpackage provides non-MDTF-specific utility functionality used many places in the modules above. +See the :doc:`fmwk_utils` documentation for an overview. .. 
autosummary:: @@ -79,4 +66,3 @@ The ``src.util`` subpackage provides non-MDTF-specific utility functionality use src.util.logs src.util.path_utils src.util.processes - diff --git a/doc/sphinx/pod_settings.rst b/doc/sphinx/pod_settings.rst index 6e351b99e..4799efe08 100644 --- a/doc/sphinx/pod_settings.rst +++ b/doc/sphinx/pod_settings.rst @@ -3,20 +3,28 @@ POD settings file summary ========================= -This page gives a quick introduction to how to write the settings file for your POD. See the full :doc:`documentation <./ref_settings>` on this file format for a complete list of all the options you can specify. +This page gives a quick introduction to how to write the settings file for your POD. See the full +:doc:`documentation <./ref_settings>` on this file format for a complete list of all the options you can specify. Overview -------- -The MDTF framework can be viewed as a "wrapper" for your code that handles data fetching and munging. Your code communicates with this wrapper in two ways: +The MDTF framework can be viewed as a "wrapper" for your code that handles data fetching and munging. Your code +communicates with this wrapper in two ways: -- The *settings file* is where your code talks to the framework: when you write your code, you document what model data your code uses and what format it expects it in. When the framework is run, it will fulfill the requests you make here (or tell the user what went wrong). -- When your code is run, the framework talks to it by setting :doc:`environment variables ` containing paths to the data files and other information specific to the run. +- The *settings file* is where your code talks to the framework: when you write your code, you document what model data +your code uses and what format it expects it in. When the framework is run, it will fulfill the requests you make here +(or tell the user what went wrong). 
+- When your code is run, the framework talks to it by setting :doc:`environment variables ` + containing paths to the data files and other information specific to the run. In the settings file, you specify what model data your diagnostic uses in a vocabulary you're already familiar with: - The `CF conventions `__ for standardized variable names and units. -- The netCDF4 (classic) data model, in particular the notions of `variables `__ and `dimensions `__ as they're used in a netCDF file. +- The netCDF4 (classic) data model, in particular the notions of + `variables `__ and + `dimensions `__ as they're used + in a netCDF file. Example @@ -57,13 +65,11 @@ Example "varlist" : { "my_precip_data": { "standard_name": "precipitation_flux", - "path_variable": "PATH_TO_PR_FILE", "units": "kg m-2 s-1", "dimensions" : ["time", "lat", "lon"] }, "my_3d_u_data": { "standard_name": "eastward_wind", - "path_variable": "PATH_TO_UA_FILE", "units": "m s-1", "dimensions" : ["time", "plev", "lat", "lon"] } @@ -83,13 +89,22 @@ This is where you describe your diagnostic and list the programs it needs to run Filename of the driver script the framework should call to run your diagnostic. ``realm``: - One or more of the eight CMIP6 modeling realms (aerosol, atmos, atmosChem, land, landIce, ocean, ocnBgchem, seaIce) describing what data your diagnostic uses. This is give the user an easy way to, eg, run only ocean diagnostics on data from an ocean model. + One or more of the eight CMIP6 modeling realms (aerosol, atmos, atmosChem, land, landIce, ocean, ocnBgchem, seaIce) + describing what data your diagnostic uses. This gives the user an easy way to, e.g., run only ocean diagnostics on + data from an ocean model. The realm can be specified in the ``settings`` section, or specified separately for each variable + in the ``varlist`` section. 
``runtime_requirements``: - This is a list of key-value pairs describing the programs your diagnostic needs to run, and any third-party libraries used by those programs. + This is a list of key-value pairs describing the programs your diagnostic needs to run, and any third-party libraries + used by those programs. - - The *key* is program's name, eg. languages such as "`python `__" or "`ncl `__" etc. but also any utilities such as "`ncks `__", "`cdo `__", etc. - - The *value* for each program is a list of third-party libraries in that language that your diagnostic needs. You do *not* need to list built-in libraries: eg, in python, you should to list `numpy `__ but not `math `__. If no third-party libraries are needed, the value should be an empty list. + - The *key* is the program's name, e.g. languages such as "`python `__" or + "`ncl `__", but also any utilities such as "`ncks `__", + "`cdo `__", etc. + - The *value* for each program is a list of third-party libraries in that language that your diagnostic needs. You do + *not* need to list built-in libraries: e.g., in Python, you should list `numpy `__ but not + `math `__. If no third-party libraries are needed, + the value should be an empty list. Data section ------------ @@ -97,41 +112,119 @@ Data section This section contains settings that apply to all the data your diagnostic uses. Most of them are optional. ``frequency``: - The time frequency the model data should be provided at, eg. "1hr", "6hr", "day", "mon", ... + A string specifying a time span, used e.g. to describe how frequently data is sampled. + It consists of an optional integer (if omitted, the integer is assumed to be 1) and a units string which is one of + ``hr``, ``day``, ``mon``, ``yr`` or ``fx``. ``fx`` is used where appropriate to denote time-independent data. + Common synonyms for these units are also recognized (e.g. ``monthly``, ``month``, ``months``, ``mo`` for ``mon``, + ``static`` for ``fx``, etc.). +.. 
_sec_dimensions: Dimensions section ------------------ -This section is where you list the dimensions (coordinate axes) your variables are provided on. Each entry should be a key-value pair, where the key is the name your diagnostic uses for that dimension internally, and the value is a list of settings describing that dimension. In order to be unambiguous, all dimensions must specify at least: +This section is where you list the dimensions (coordinate axes) your variables are provided on. Each entry should be a +key-value pair, where the key is the name your diagnostic uses for that dimension internally, and the value is a list of +settings describing that dimension. In order to be unambiguous, all dimensions must specify at least: ``standard_name``: - The CF `standard name `__ for that coordinate. + The CF `standard name `__ for + that coordinate. ``units``: - The units the diagnostic expects that coordinate to be in (using the syntax of the `UDUnits library `__). This is optional: if not given, the framework will assume you want CF convention `canonical units `__. + The units the diagnostic expects that coordinate to be in (using the syntax of the + `UDUnits library `__). This is + optional: if not given, the framework will assume you want CF convention + `canonical units `__. In addition, any vertical (Z axis) dimension must specify: ``positive``: - Either ``"up"`` or ``"down"``, according to the `CF conventions `__. A pressure axis is always ``"down"`` (increasing values are closer to the center of the earth). + Either ``"up"`` or ``"down"``, according to the + `CF conventions `__. A pressure axis is always + ``"down"`` (increasing values are closer to the center of the earth). + +.. _sec_varlist: Varlist section --------------- -This section is where you list the variables your diagnostic uses. 
Each entry should be a key-value pair, where the key is the name your diagnostic uses for that variable internally, and the value is a list of settings describing that variable. Most settings here are optional, but the main ones are: +Varlist entry example +^^^^^^^^^^^^^^^^^^^^^ -``standard_name``: - The CF `standard name `__ for that variable. +.. code-block:: js + + "u500": { + "standard_name": "eastward_wind", + "units": "m s-1", + "realm": "atmos", + "dimensions" : ["time", "lat", "lon"], + "scalar_coordinates": {"plev": 500}, + "requirement": "optional", + "alternates": ["another_variable_name", "a_third_variable_name"] + } -``path_variable``: - Name of the shell environment variable the framework will use to pass the location of the file containing this variable to your diagnostic when it's run. See the environment variable :doc:`documentation ` for details. +This section is where you list the variables your diagnostic uses. Each entry should be a key-value pair, where the key +is the name your diagnostic uses for that variable internally, and the value is a list of settings describing that +variable. Most settings here are optional, but the main ones are: + +``standard_name``: + The CF `standard name `__ + for that variable. ``units``: - The units the diagnostic expects the variable to be in (using the syntax of the `UDUnits library `__). This is optional: if not given, the framework will assume you want CF convention `canonical units `__. + The units the diagnostic expects the variable to be in (using the syntax of the + `UDUnits library `__). ``dimensions``: - List of names of dimensions specified in the "dimensions" section, to specify the coordinate dependence of each variable. + List of names of dimensions specified in the "dimensions" section, to specify the coordinate dependence of each + variable. 
+ +``realm`` (if not specified in the `settings` section): + String or list of strings. CMIP modeling realm(s) that the variable belongs to. + +``modifier``: + String, optional. Descriptor to distinguish variables with identical standard names and different dimensionalities or + realms. See `modifiers.jsonc `__ for + supported modifiers. Open an issue to request the addition of a new modifier to the modifiers.jsonc file, or submit a + pull request that includes the new modifier in the modifiers.jsonc file and the necessary POD settings.jsonc file(s). + +``requirement``: + String. Optional; assumed ``"required"`` if not specified. One of three values: + + - ``"required"``: variable is necessary for the diagnostic's calculations. If the data source doesn't provide the + variable (at the requested frequency, etc., for the user-specified analysis period) the framework will *not* run the + diagnostic, but will instead log an error message explaining that the lack of this data was at fault. + - ``"optional"``: variable will be supplied to the diagnostic if provided by the data source. If not available, the + diagnostic will still run, and the ``path_variable`` for this variable will be set to the empty string. + **The diagnostic is responsible for testing the environment variable** for the existence of all optional variables. + - ``"alternate"``: variable is specified as an alternate source of data for some other variable (see next property). + The framework will only query the data source for this variable if it's unable to obtain one of the *other* variables + that list it as an alternate. + +``alternates``: + Array (list) of strings (e.g., ["A", "B"]), which must be keys (names) of other variables. Optional: if provided, + specifies an alternative method for obtaining needed data if this variable isn't provided by the data source. 
+ + - If the data source provides this variable (at the requested frequency, etc., for the user-specified + analysis period), this property is ignored. + - If this variable isn't available as requested, the framework will query the data source for all of the variables + listed in this property. If *all* of the alternate variables are available, the diagnostic will be run; if any are + missing it will be skipped. Note that, as currently implemented, only one set of alternates may be given + (no "plan B", "plan C", etc.) + +``scalar_coordinates``: + optional key-value pair specifying a level to select from a 4-D field. This implements what the CF conventions refer + to as + "`scalar coordinates `__", + with the use case here being the ability to request slices of higher-dimensional data. For example, the snippet at + the beginning of this section `{"plev": 500}` shows how to request the u component of wind velocity on a 500-mb + pressure level. + + - *keys* are the key (name) of an entry in the :ref:`dimensions` section. + - *values* are a single number (integer or floating-point) corresponding to the value of the slice to extract. + **Units** of this number are taken to be the ``units`` property of the dimension named as the key. + + In order to request multiple slices (e.g. wind velocity on multiple pressure levels, with each level saved to a + different file), create one varlist entry per slice. -``modifier`` (optional): - Descriptor to distinguish variables with identical standard names and different dimensionalities or realms. See `modifiers.jsonc `__ for supported modfiers. Open an issue to request the addition of a new modifier to the modifiers.jsonc file, or submit a pull request that includes the new modifier in the modifiers.jsonc file and the necessary POD settings.jsonc file(s). 
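Reviewer note on the ``alternates`` text added to pod_settings.rst above: the fallback rule it describes can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the framework's implementation; the function name ``resolve_variable``, the ``available`` set, and the variable names ``ua`` and ``ps`` are hypothetical.

```python
def resolve_variable(name, varlist, available):
    """Return the variable names to fetch for `name`, or None if the POD must be skipped.

    `varlist` mimics the settings-file varlist (a dict of entries) and
    `available` is the set of variable names the data source can provide.
    Both are illustrative stand-ins, not framework objects.
    """
    if name in available:
        # The variable itself is provided, so `alternates` is ignored.
        return [name]
    alternates = varlist[name].get("alternates", [])
    # Only one set of alternates exists, and *all* of them must be available.
    if alternates and all(alt in available for alt in alternates):
        return list(alternates)
    return None


varlist = {"u500": {"standard_name": "eastward_wind", "alternates": ["ua", "ps"]}}
print(resolve_variable("u500", varlist, {"ua", "ps"}))
```

The sketch covers only the ``alternates`` property; a fuller version would also consult ``requirement`` so that a missing ``"optional"`` variable does not prevent the diagnostic from running.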
diff --git a/doc/sphinx/ref_data.rst b/doc/sphinx/ref_data.rst index 5e07ca7bb..8595af0fc 100644 --- a/doc/sphinx/ref_data.rst +++ b/doc/sphinx/ref_data.rst @@ -5,11 +5,15 @@ Model data format ================= -This section describes how all input model data must be "formatted" for use by the framework. By "format" we mean not only the binary file format, but also the organization of data within and across files and metadata conventions. +This section describes how all input model data must be "formatted" for use by the framework. By "format" we mean not +only the binary file format, but also the organization of data within and across files and metadata conventions. -All these questions are answered by the choice of :doc:`data source`: different data sources can, in principle, process data from different formats. The format information is presented in this section because, currently, all data sources require input data in the format described below. This accommodates the majority of use cases encountered at NCAR and GFDL; in particular, any `CF-compliant `__ data, or data published as part of `CMIP6 `__, should satisfy the requirements below. - -A core design goal of this project is the ability to run diagnostics seamlessly on data from a wide variety of sources, including different formats. If you would like the package to support formats or metadata conventions that aren't currently supported, please make a request in the appropriate GitHub `discussion thread `__. +A core design goal of this project is the ability to run diagnostics seamlessly on data from a wide variety of sources, +including different formats. The MDTF-diagnostics package leverages intake-esm catalogs and APIs to query and access the +model datasets. As such, we can expand the package requirements to query additional metadata like grid type, institution, +or cell methods. 
If you would like the package to support formats or metadata conventions that aren't +currently supported, please make a request in the appropriate GitHub +`discussion thread `__. Model data format requirements ------------------------------ @@ -17,44 +21,83 @@ Model data format requirements File organization +++++++++++++++++ -- Model data must be supplied in the form of a set of netCDF files. -- We support reading from all netCDF-3 and netCDF-4 binary formats, as defined in the `netCDF FAQ `__, via `xarray `__ (see below), with the exception that variables nested under netCDF-4 groups are not currently supported. -- Each netCDF file should only contain one variable (i.e., an array with the values of a single dependent variable, along with all the values of the coordinates at which the dependent variable was sampled). Additional variables (coordinate bounds, auxiliary or transformed coordinates) may be present in the file, but will be ignored by the framework. -- The data for one variable may be spread across multiple netCDF files, but this must take the form of contiguous chunks by date (e.g., one file for 2000-2009, another for 2010-2019, etc.). The spatial coordinates in each file in a series of chunks must be identical. +- Model data must be supplied in the form of a set of netCDF or Zarr files with locations and metadata defined in an + intake-esm catalog. +- The framework developers have provided a simple tool for generating data catalogs using CMIP, GFDL, and CESM + conventions. The user community may modify this generator to suit their needs. +- Each file may contain one variable (i.e., an array with the values of a single dependent variable, along with all of + the values of the coordinates at which the dependent variable was sampled), or multiple variables. Refer to the + intake-esm documentation for `instructions to create and access data catalogs with multiple assets + `__. 
+- The data for one variable may be spread across multiple netCDF files, but this must take the form of contiguous chunks + by date (e.g., one file for 2000-2009, another for 2010-2019, etc.). The spatial coordinates in each file in a series + of chunks must be identical. Coordinates +++++++++++ -- The framework currently only supports model data provided on a latitude-longitude grid. -- The framework currently only supports vertical coordinates given in terms of pressure. The pressure coordinate may be in any units (*mb*, *Pa*, *atm*, ...). We hope to offer support for `parametric vertical coordinates `__ in the near future. -- The time coordinate of the data must follow the `CF conventions `__; in particular, it must have a ``calendar`` attribute which matches one of the CF conventions' recognized calendars (case-insensitive). -- The framework doesn't impose any limitations on the minimum or maximum resolution of model data, beyond the storage and memory available on the machine where the PODs are run. +- The framework currently only supports model data provided on a latitude-longitude grid. The framework developers + will extend support for non-rectilinear grids once requirements are finalized and use cases are provided. +- The framework currently only supports vertical coordinates given in terms of pressure. The pressure coordinate may be + in any units (*mb*, *Pa*, *atm*, ...). We plan to offer support for + `parametric vertical coordinates `__ + in the near future. +- The time coordinate of the data must follow the + `CF conventions `__; + in particular, it must have a ``calendar`` attribute which matches one of the CF conventions' recognized calendars + (case-insensitive). +- The framework doesn't impose any limitations on the minimum or maximum resolution of model data, beyond the storage + and memory available on the machine where the PODs are run. .. 
_ref-data-metadata: Metadata ++++++++ -The framework currently makes use of two pieces of metadata (attributes for each variable in the netCDF header), in addition to the ``calendar`` attribute on the time coordinate: +The framework currently makes use of the following metadata (attributes for each variable in the netCDF header), +in addition to the ``calendar`` attribute on the time coordinate: -- ``units``: Required for all variables and coordinates. This should be a string of the form recognized by `UDUNITS2 `__, specifically the python `cfunits `__ package (which improves CF convention support, e.g. by recognizing ``'psu'`` as "practical salinity units.") +- ``units``: Required for all variables and coordinates. This should be a string of the form recognized by + `UDUNITS2 `__, specifically the python + `cfunits `__ package (which improves CF convention support, e.g. by recognizing + ``'psu'`` as "practical salinity units.") - This attribute is required because we allow PODs to request model data with specific units, rather than requiring each POD to implement and debug redundant unit conversion logic. Instead, unit checking and conversion is done by the framework. This can't be done if it's not clear what units the input data are in. - -- ``standard_name``: If present, should be set to a recognized CF convention `standard name `__. This is used to confirm that the framework has downloaded the physical quantity that the POD has requested, independently of what name the model has given to the variable. + This attribute is required because we allow PODs to request model data with specific units, rather than requiring each + POD to implement and debug redundant unit conversion logic. Instead, unit checking and conversion is done by the + framework. This can't be done if it's not clear what units the input data are in. + +- ``standard_name``: If present, should be set to a recognized CF convention + `standard name `__. 
+ This is used to confirm that the framework has downloaded the physical quantity that the POD has requested, + independently of what name the model has given to the variable. If the input files do not contain a `standard_name` + attribute, substitute the `long_name`. +- ``realm``: The model realm(s) that each variable is part of. - If the user or data source has specified a :ref:`naming convention`, missing values for this attribute will be filled in based on the variable names used in that convention. + If the user or data source has specified a :ref:`naming convention`, missing values for this + attribute will be filled in based on the variable names used in that convention. -Many utilities exist for editing metadata in netCDF headers. Popular examples are the `ncatted `__ tool in the `NCO `__ utilities and the `setattribute `__ operator in `CDO `__, as well as the functionality provided by xarray itself. Additionally, the :ref:`ref-data-source-explictfile` provides limited functionality for overwriting metadata attributes. +Many utilities exist for editing metadata in netCDF headers. Popular examples are the +`ncatted `__ tool in the `NCO `__ +utilities and the `setattribute `__ operator in +`CDO `__, as well as the functionality provided by xarray itself. Additionally, +the :ref:`ref-data-source-explictfile` provides limited functionality for overwriting metadata attributes. -In situations where none of the above options are feasible, the ``--disable-preprocessor`` :ref:`command-line flag` may be used to disable all functionality based on this metadata. In this case, the user is fully responsible for ensuring that the input model data has the units required by each POD, is provided on the correct pressure levels, etc. 
xarray reference implementation ------------------------------- -The framework uses `xarray `__ to preprocess and validate model data before the PODs are run; specifically using the `netcdf4 `__ engine and with `CF convention support `__ provided via the `cftime `__ library. We also use `cf_xarray `__ to access data attributes in a more convention-independent way. +The framework uses `xarray `__ to preprocess and validate model data before the +PODs are run; specifically using the `netcdf4 `__ engine and with +`CF convention support `__ +provided via the `cftime `__ library. We also use +`cf_xarray `__ to access data attributes in a more convention-independent +way. -If you're deciding how to post-process your model's data for use by the MDTF package, or are debugging issues with your model's data format, it may be simpler to load and examine your data with these packages interactively, rather than by invoking the entire MDTF package. The following python snippet approximates how the framework loads datasets for preprocessing. Use the `\_MDTF_base `__ conda environment to install the correct versions of each package. +If you're deciding how to post-process your model's data for use by the MDTF package, or are debugging issues with your +model's data format, it may be simpler to load and examine your data with these packages interactively, rather than by +invoking the entire MDTF package. The following python snippet approximates how the framework loads datasets for +preprocessing. Use the `\_MDTF_base `__ +conda environment to install the correct versions of each package. .. code-block:: python @@ -77,6 +120,9 @@ If you're deciding how to post-process your model's data for use by the MDTF pac # print summary ds.info() -The framework has additional logic for cleaning up noncompliant metadata (e.g., stripping whitespace from netCDF headers), but if you can load a dataset with the above commands, the framework should be able to deal with it as well. 
+The framework has additional logic for cleaning up noncompliant metadata (e.g., +stripping whitespace from netCDF headers), but if you can load a dataset with the above commands, +the framework should be able to deal with it as well. -If the framework runs into errors when run on a dataset that meets the criteria above, please file a bug report via the gitHub `issue tracker `__. +If the framework runs into errors when run on a dataset that meets the criteria above, please file a bug report via +the GitHub `issue tracker `__. diff --git a/doc/sphinx/ref_dev_toc.rst b/doc/sphinx/ref_dev_toc.rst deleted file mode 100644 index e2ea08f8a..000000000 --- a/doc/sphinx/ref_dev_toc.rst +++ /dev/null @@ -1,8 +0,0 @@ -Developer reference -------------------- - -.. toctree:: - :maxdepth: 2 - - ref_settings - ref_envvars diff --git a/doc/sphinx/ref_envvars.rst b/doc/sphinx/ref_envvars.rst index cd6ae72b7..5435e6f56 100644 --- a/doc/sphinx/ref_envvars.rst +++ b/doc/sphinx/ref_envvars.rst @@ -65,43 +65,12 @@ Locations of model data files These are set depending on the data your diagnostic requests in its :doc:`settings file <./pod_settings>`. Refer to the examples below if you're unfamiliar with how that file is organized. -Each variable listed in the ``varlist`` section of the settings file must specify a ``path_variable`` property. -The value you enter there will be used as the name of an environment variable, and the framework will set the value -of that environment variable to the absolute path to the file containing data for that variable. - -**From a diagnostic writer's point of view**, this means all you need to do here is replace paths to input data that -are hard-coded or passed from the command line with calls to read the value of the corresponding environment variable. 
- -- If the framework was not able to obtain the variable from the data source (at the requested frequency, etc., - for the user-specified analysis period), this variable will be set equal to the **empty string**. Your diagnostic is - responsible for testing for this possibility for all variables that are listed as ``optional`` or have alternates - listed (if a required variable without alternates isn't found, your diagnostic won't be run.) -- If ``multi_file_ok`` is set to ``true`` in the settings file, this environment variable may be a list of paths to - *multiple* files in chronological order, separated by colons. For example, - ``/dir/precip_1980_1989.nc:/dir/precip_1990_1999.nc:/dir/precip_2000_2009.nc`` for an analysis period of 1980-2009. - Names of variables and dimensions --------------------------------- These are set depending on the data your diagnostic requests in its :doc:`settings file <./pod_settings>`. Refer to the examples below if you're unfamiliar with how that file is organized. -*For each dimension:* - If is the name of the key labeling the key:value entry for this dimension, the framework will set an environment - variable named ``_coord`` equal to the name that dimension has in the data files it's providing. - - - If ``rename_dimensions`` is set to ``true`` in the settings file, this will always be equal to . If - ``rename_dimensions`` is ``false``, this will be whatever the model or data source's native name for this dimension - is, and your diagnostic should read the name from this variable. Your diagnostic should **only** use hard-coded - names for dimensions if ``rename_dimensions`` is set to ``true`` in its :doc:`settings file `. - - If the data source has provided (one-dimensional) bounds for this dimension, the name of the netCDF variable containing those bounds will be set in an environment variable named ``_bnds``. If bounds are not provided, this will be set to the empty string. **Note** that multidimensional boundaries (e.g. 
for horizontal cells) should be listed as separate entries in the varlist section. - -*For each variable:* - If be the name of the key labeling the key:value entry for this variable in the varlist section, the framework will set an environment variable named ``_var`` equal to the name that variable has in the data files it's providing. - - - If ``rename_variables`` is set to ``true`` in the settings file, this will always be equal to . If ``rename_variables`` is ``false``, this will be whatever the model or data source's native name for this variable is, and your diagnostic should read the name from this variable. Your diagnostic should **only** use hard-coded names for variables if ``rename_variables`` is set to ``true`` in its :doc:`settings file `. - Simple example -------------- @@ -109,13 +78,6 @@ Simple example We only give the relevant parts of the :doc:`settings file ` below. .. code-block:: js - - "data": { - "rename_dimensions": false, - "rename_variables": false, - "multi_file_ok": false, - ... - }, "dimensions": { "lat": { "standard_name": "latitude", @@ -133,63 +95,16 @@ We only give the relevant parts of the :doc:`settings file ` below "varlist": { "pr": { "standard_name": "precipitation_flux", - "path_variable": "PR_FILE" } } The framework will set the following environment variables: -#. ``lat_coord``: Name of the latitude dimension in the model's native format (because ``rename_dimensions`` is false). -#. ``lon_coord``: Name of the longitude dimension in the model's native format (because ``rename_dimensions`` is false). -#. ``time_coord``: Name of the time dimension in the model's native format (because ``rename_dimensions`` is false). -#. ``pr_var``: Name of the precipitation variable in the model's native format (because ``rename_variables`` is false). -#. ``PR_FILE``: Absolute path to the file containing ``pr`` data, e.g. ``/dir/precip.nc``. 
- - -More complex example --------------------- - -Let's elaborate on the previous example, and assume that the diagnostic is being called on model that provides precipitation_flux but not convective_precipitation_flux. - -.. code-block:: js - - "data": { - "rename_dimensions": true, - "rename_variables": false, - "multi_file_ok": true, - ... - }, - "dimensions": { - "lat": { - "standard_name": "latitude", - ... - }, - "lon": { - "standard_name": "longitude", - ... - }, - "time": { - "standard_name": "time", - ... - } - }, - "varlist": { - "prc": { - "standard_name": "convective_precipitation_flux", - "path_variable": "PRC_FILE", - "alternates": ["pr"] - }, - "pr": { - "standard_name": "precipitation_flux", - "path_variable": "PR_FILE" - } - } - - -Comparing this with the previous example: +#. ``lat_coord``: Name of the latitude dimension in the model's native format +#. ``lon_coord``: Name of the longitude dimension in the model's native format +#. ``time_coord``: Name of the time dimension in the model's native format +#. ``pr_var``: Name of the precipitation variable +#. ``PR_FILE`` (retained for backwards compatibility): Absolute path to the file containing + ``pr`` data, e.g. ``/dir/precip.nc``. -- ``lat_coord``, ``lon_coord`` and ``time_coord`` will be set to "lat", "lon" and "time", respectively, because ``rename_dimensions`` is true. The framework will have renamed these dimensions to have these names in all data files provided to the diagnostic. -- ``prc_var`` and ``pr_var`` will be set to the model's native names for these variables. Names for all variables are always set, regardless of which variables are available from the data source. -- In this example, ``PRC_FILE`` will be set to ``''``, the empty string, because it wasn't found. -- ``PR_FILE`` will be set to ``/dir/precip_1980_1989.nc:/dir/precip_1990_1999.nc:/dir/precip_2000_2009.nc``, because ``multi_file_ok`` was set to ``true``. 
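Reviewer note on the ref_envvars.rst changes above: a minimal sketch of how a diagnostic might read the environment variables listed in the simple example (``lat_coord``, ``lon_coord``, ``time_coord``, ``pr_var``, ``PR_FILE``). The helper name ``read_framework_env`` and its ``environ`` parameter are hypothetical and exist only so the sketch can be exercised without the framework; treating an empty ``PR_FILE`` as "no path provided" is an assumption.

```python
import os


def read_framework_env(environ=None):
    """Collect the names and path from the 'simple example' settings.

    `environ` defaults to os.environ; passing a plain dict is a convenience
    for testing and is not part of any framework API.
    """
    env = os.environ if environ is None else environ
    # Dimension and variable names as they appear in the data files.
    names = {key: env.get(key, "")
             for key in ("lat_coord", "lon_coord", "time_coord", "pr_var")}
    # PR_FILE is described as retained for backwards compatibility;
    # a missing or empty value is mapped to None here.
    pr_path = env.get("PR_FILE", "") or None
    return names, pr_path


names, pr_path = read_framework_env(
    {"lat_coord": "lat", "lon_coord": "lon", "time_coord": "time",
     "pr_var": "pr", "PR_FILE": "/dir/precip.nc"}
)
print(names["pr_var"], pr_path)
```

Reading the names through one helper keeps hard-coded dimension and variable names out of the rest of the diagnostic.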
diff --git a/doc/sphinx/ref_settings.rst b/doc/sphinx/ref_settings.rst index b105f38c1..4174886c8 100644 --- a/doc/sphinx/ref_settings.rst +++ b/doc/sphinx/ref_settings.rst @@ -22,11 +22,7 @@ In addition, for the purposes of the configuration file we define .. _time_duration: -4. a "time duration": this is a string specifying a time span, used e.g. to describe how frequently data is sampled. -It consists of an optional integer (if omitted, the integer is assumed to be 1) and a units string which is one of -``hr``, ``day``, ``mon``, ``yr`` or ``fx``. ``fx`` is used where appropriate to denote time-independent data. -Common synonyms for these units are also recognized (e.g. ``monthly``, ``month``, ``months``, ``mo`` for ``mon``, -``static`` for ``fx``, etc.) + **In addition**, the string ``"any"`` may be used to signify that any value is acceptable. @@ -142,15 +138,26 @@ Diagnostic runtime ^^^^^^^^^^^^^^^^^^ ``runtime_requirements``: - :ref:`object`, **required**. Programs your diagnostic needs to run (for example, scripting language interpreters) and any third-party libraries needed in those languages. Each executable should be listed in a separate key-value pair: + :ref:`object`, **required**. Programs your diagnostic needs to run (for example, scripting language + interpreters) and any third-party libraries needed in those languages. Each executable should be listed in a separate + key-value pair: - - The *key* is the name of the required executable, e.g. languages such as "`python `__" or "`ncl `__" etc. but also any utilities such as "`ncks `__", "`cdo `__", etc. - - The *value* corresponding to each key is an :ref:`array` (list) of strings, which are names of third-party libraries in that language that your diagnostic needs. You do *not* need to list standard libraries or scripts that are provided in a standard installation of your language: eg, in python, you need to list `numpy `__ but not `math `__. 
If no third-party libraries are needed, the value should be an empty list. + - The *key* is the name of the required executable, e.g. languages such as "`python `__" or + "`ncl `__" etc. but also any utilities such as "`ncks `__", + "`cdo `__", etc. + - The *value* corresponding to each key is an :ref:`array` (list) of strings, which are names of third-party + libraries in that language that your diagnostic needs. You do *not* need to list standard libraries or scripts that + are provided in a standard installation of your language: eg, in python, you need to list + `numpy `__ but not `math `__. If no third-party + libraries are needed, the value should be an empty list. - In the future we plan to offer the capability to request specific `versions `__. For now, please communicate your diagnostic's version requirements to the MDTF organizers. ``pod_env_vars``: - :ref:`object`, optional. Names and values of shell environment variables used by your diagnostic, *in addition* to those supplied by the framework. The user can't change these at runtime, but this can be used to set site-specific installation settings for your diagnostic (eg, switching between low- and high-resolution observational data depending on what the user has chosen to download). Note that environment variable values must be provided as strings. + :ref:`object`, optional. Names and values of shell environment variables used by your diagnostic, + *in addition* to those supplied by the framework. The user can't change these at runtime, but this can be used to set + site-specific installation settings for your diagnostic (eg, switching between low- and high-resolution observational + data depending on what the user has chosen to download). Note that environment variable values must be provided as + strings. .. 
_sec_data: @@ -167,10 +174,7 @@ Example "data": { "format": "netcdf4_classic", - "rename_dimensions": false, - "rename_variables": false, "realm": "atmos", - "multi_file_ok": true, "frequency": "3hr", "min_frequency": "1hr", "max_frequency": "6hr", @@ -184,8 +188,8 @@ Example ``format``: String. Optional: assumed ``"any_netcdf_classic"`` if not specified. Specifies the format(s) of *model* data your - diagnostic is able to read. As of this writing, the framework only supports retrieval of netCDF formats, so only the - following values are allowed: + diagnostic is able to read. As of this writing, the framework only supports retrieval of netCDF or Zarr formats, so + only the following values are allowed: - ``"any_netcdf"`` includes all of: @@ -205,37 +209,11 @@ Example See the `netCDF FAQ `__ for information on the distinctions. Any recent version of a supported language for diagnostics with netCDF support will be able to read all of these. However, the extended features of the ``"netcdf4"`` data model are not commonly used in practice and currently only supported at a beta level in NCL, which is why we've chosen ``"any_netcdf_classic"`` as the default. -``rename_dimensions``: - Boolean. Optional: assumed ``false`` if not specified. If set to ``true``, the framework will change the name of all - :ref:`dimensions` in the model data from the model's native value to the string specified in the - ``name`` property for that dimension. If set to ``false``, **the diagnostic is responsible for reading dimension names - from the environment variable**. See the environment variable :doc:`documentation ` for details - on how these names are provided. - -``rename_variables``: - Boolean. Optional: assumed ``false`` if not specified. If set to ``true``, the framework will change the name of all - :ref:`variables` in the model data from the model's native value to the string specified in the ``name`` - property for that variable. 
If set to ``false``, **the diagnostic is responsible for reading dimension names from the - environment variable**. See the environment variable :doc:`documentation ` for details on how these names - are provided. - ``realm``: String or :ref:`array` (list) of strings, **required**. One of the eight CMIP6 modeling realms (aerosol, atmos, atmosChem, land, landIce, ocean, ocnBgchem, seaIce) describing what data your diagnostic uses. If your diagnostic uses data from multiple realms, list them in an array (e.g. ``["atmos", "ocean"]``). This is used as part of the data catalog query to help determine which file(s) match the POD's requirements -.. _multi_file: - -``multi_file_ok``: - Boolean. Optional: assumed ``false`` if not specified. If set to ``true``, the diagnostic is signalling that it's able - to accept data for a single variable that may be spread out in multiple files, to be aggregated along the time - dimension (e.g. through the use of - `xarray `__.) Aggregation along the time - dimension is the only type of aggregation the diagnostic will need to consider. - - If ``false``, the framework will ensure all data for a single variable is presented as a single netCDF file. This may - lead to large file sizes if your diagnostic uses high-frequency data, in which case you should consider setting a - limit via ``max_duration``. ``min_duration``, ``max_duration``: :ref:`Time durations`. Optional: assumed ``"any"`` if not specified. Set minimum and maximum length of @@ -372,7 +350,9 @@ Time the number of days per month. ``need_bounds``: - Boolean. Optional: assumed ``false`` if not specified. If ``true``, the framework will ensure that bounds are supplied for this dimension, in addition to its midpoint values, following the `CF conventions `__: the ``bounds`` attribute of this dimension will be set to the name of another netCDF variable containing the bounds information. + Boolean. Optional: assumed ``false`` if not specified. 
If ``true``, the framework will ensure that bounds are supplied + for this dimension, in addition to its midpoint values, following the + `CF conventions `__: the ``bounds`` attribute of this dimension will be set to the name of another netCDF variable containing the bounds information. ``axis``: String, optional. Assumed to be ``T`` if omitted or provided. @@ -387,13 +367,22 @@ Z axis (height/depth, pressure, ...) the CMIP6 MIP tables. ``units``: - Optional, a :ref:`CFunit`. Units the diagnostic expects the dimension to be in. **If not provided, the framework will assume CF convention** `canonical units `__. + Optional, a :ref:`CFunit`. Units the diagnostic expects the dimension to be in. **If not provided, the + framework will assume CF convention** + `canonical units `__. ``positive``: - String, **required**. Must be ``"up"`` or ``"down"``, according to the `CF conventions `__. A pressure axis is always ``"down"`` (increasing values are closer to the center of the earth), but this is not set automatically. + String, **required**. Must be ``"up"`` or ``"down"``, according to the + `CF conventions `__. + A pressure axis is always ``"down"`` (increasing values are closer to the center of the earth), but this is not set + automatically. ``need_bounds``: - Boolean. Optional: assumed ``false`` if not specified. If ``true``, the framework will ensure that bounds are supplied for this dimension, in addition to its midpoint values, following the `CF conventions `__: the ``bounds`` attribute of this dimension will be set to the name of another netCDF variable containing the bounds information. + Boolean. Optional: assumed ``false`` if not specified. If ``true``, the framework will ensure that bounds are supplied + for this dimension, in addition to its midpoint values, following the + `CF conventions `__: + the ``bounds`` attribute of this dimension will be set to the name of another netCDF variable containing the bounds + information. ``axis``: String, optional. 
Assumed to be ``Z`` if omitted or provided. @@ -402,16 +391,24 @@ Other dimensions (wavelength, ...) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ``standard_name``: - **Required**, string. `Standard name `__ of the variable as defined by the `CF conventions `__, or a commonly used synonym as employed in the CMIP6 MIP tables. + **Required**, string. `Standard name `__ + of the variable as defined by the `CF conventions `__, or a commonly used synonym as + employed in the CMIP6 MIP tables. ``units``: Optional, a :ref:`CFunit`. Units the diagnostic expects the dimension to be in. **If not provided, the framework will assume CF convention** `canonical units `__. ``need_bounds``: - Boolean. Optional: assumed ``false`` if not specified. If ``true``, the framework will ensure that bounds are supplied for this dimension, in addition to its midpoint values, following the `CF conventions `__: the ``bounds`` attribute of this dimension will be set to the name of another netCDF variable containing the bounds information. + Boolean. Optional: assumed ``false`` if not specified. If ``true``, the framework will ensure that bounds are supplied + for this dimension, in addition to its midpoint values, following the + `CF conventions `__: + the ``bounds`` attribute of this dimension will be set to the name of another netCDF variable containing the bounds + information. -``modifier``: -String, Optional. Used to distinguish variables that are defined on a vertical level that is not a pressure level (e.g., 2-meter temperature) from variables that are defined on pressure levels. Modfiers are defined in data/modifiers.jsonc. MDTF-diagnostics currently supports `atmos_height`. +``modifier`` (backward: +String, Optional. Used to distinguish variables that are defined on a vertical level that is not a pressure level +(e.g., 2-meter temperature) from variables that are defined on pressure levels. Modifiers are defined in data/modifiers.jsonc. +MDTF-diagnostics currently supports `atmos_height`. .. 
_sec_varlist: @@ -478,43 +475,9 @@ The value of the key-value pair is an :ref:`object` containing propertie .. _item_var_coords: -``scalar_coordinates``: - :ref:`object`, optional. This implements what the CF conventions refer to as - "`scalar coordinates `__", - with the use case here being the ability to request slices of higher-dimensional data. For example, the snippet at - the beginning of this section shows how to request the u component of wind velocity on a 500 mb pressure level. - - - *keys* are the key (name) of an entry in the :ref:`dimensions` section. - - *values* are a single number (integer or floating-point) corresponding to the value of the slice to extract. - **Units** of this number are taken to be the ``units`` property of the dimension named as the key. - In order to request multiple slices (e.g. wind velocity on multiple pressure levels, with each level saved to a - different file), create one varlist entry per slice. ``frequency``, ``min_frequency``, ``max_frequency``: :ref:`Time durations`. Optional. Time frequency at which the variable's data is provided. If given here, overrides the values set globally in the ``data`` section (see :ref:`description` there). -``requirement``: - String. Optional: assumed ``"required"`` if not specified. One of three values: - - - ``"required"``: variable is necessary for the diagnostic's calculations. If the data source doesn't provide the - variable (at the requested frequency, etc., for the user-specified analysis period) the framework will *not* run the - diagnostic, but will instead log an error message explaining that the lack of this data was at fault. - - ``"optional"``: variable will be supplied to the diagnostic if provided by the data source. If not available, the - diagnostic will still run, and the ``path_variable`` for this variable will be set to the empty string. - **The diagnostic is responsible for testing the environment variable** for the existence of all optional variables. 
- - ``"alternate"``: variable is specified as an alternate source of data for some other variable (see next property). - The framework will only query the data source for this variable if it's unable to obtain one of the *other* variables - that list it as an alternate. - -``alternates``: - :ref:`Array` (list) of strings, which must be keys (names) of other variables. Optional: if provided, - specifies an alternative method for obtaining needed data if this variable isn't provided by the data source. - - - If the data source provides this variable (at the requested frequency, etc., for the user-specified - analysis period), this property is ignored. - - If this variable isn't available as requested, the framework will query the data source for all of the variables - listed in this property. If *all* of the alternate variables are available, the diagnostic will be run; if any are - missing it will be skipped. Note that, as currently implemented, only one set of alternates may be given - (no "plan B", "plan C", etc.)