Add minimal glossary (#185)

* Experiments towards glossary
* Remove unnecessary setup.py
* Update recommended Python version.
* Remove --user recommendation.
* Update quickstart.
* Update tutorial
* projects
* job
* querying
* flowproject
* remove collection
* configuration
* Add pre-commit config
* Add codespell.
* community
* Put back original hooks.
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci
* Apply Bradley's suggestions

  Co-authored-by: Bradley Dice <[email protected]>
* Fix job reference.

  Co-authored-by: Corwin Kerr <[email protected]>
* Fix a/an.

  Co-authored-by: Corwin Kerr <[email protected]>
* Clarify analogy to primary keys.
* Remove from "data space" terms
* Small wording
* Add initial glossary file
* Replace state point schema with project schema
* Update defs
* Update definitions taking Brandon's feedback
* add hoverxref to build requirements
* Use 4 space indentation in glossary
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci
* Add references to glossary terms

---------

Co-authored-by: Vyas Ramasubramani <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Bradley Dice <[email protected]>
glotzerlab · Mar 30, 2023 · commit 07a209b
1 parent a4dbccb
Showing 12 changed files with 70 additions and 41 deletions.
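
The glossary added in this commit defines a job id as the MD5 hash of a job's state point, and a job directory as the directory named for that id inside the workspace. The relationship can be sketched roughly as follows. This is a minimal illustration, not the **signac** implementation: it assumes the state point is serialized as JSON with sorted keys before hashing, the helper names `sketch_job_id` and `sketch_job_directory` are hypothetical, and the `"workspace"` path is an example value.

```python
import hashlib
import json
import os


def sketch_job_id(state_point):
    """Return an MD5 hex digest of a state point (illustrative only)."""
    # Assumption: serialize with sorted keys so the hash does not depend
    # on the insertion order of the dictionary.
    blob = json.dumps(state_point, sort_keys=True)
    return hashlib.md5(blob.encode("utf-8")).hexdigest()


def sketch_job_directory(workspace, state_point):
    """Return the path a job directory would have under this naming scheme."""
    return os.path.join(workspace, sketch_job_id(state_point))


# A state point maps parameters (the keys) to their values.
state_point = {"a": 0, "version": 1}
job_id = sketch_job_id(state_point)
print(job_id)
print(sketch_job_directory("workspace", state_point))
```

Because the id is derived only from the state point, two jobs with identical parameters and values map to the same directory, which matches the glossary's statement that a state point uniquely specifies a job.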
diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -3,6 +3,7 @@ jupyter_client
 jupyter_sphinx
 nbconvert
 nbsphinx
+sphinx-hoverxref
 sphinx>=4.0.0
 sphinx_rtd_theme>=1.0.0
 sphinxcontrib-bibtex>=2.2.0
diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -41,6 +41,7 @@
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
 extensions = [
+    "hoverxref.extension",
     "sphinx.ext.autodoc",
     "sphinx.ext.intersphinx",
     "sphinx.ext.todo",
@@ -51,6 +52,11 @@
     "sphinxcontrib.bibtex",
 ]
 
+# For hover x ref
+hoverxref_roles = [
+    "term",
+]
+
 # For sphinxcontrib.bibtex.
 bibtex_bibfiles = ["signac.bib", "acknowledge.bib"]
 

diff --git a/docs/source/dashboard.rst b/docs/source/dashboard.rst
@@ -3,7 +3,7 @@
 The Dashboard
 =============
 
-The **signac-dashboard** visualizes data stored in a **signac** project.
+The **signac-dashboard** visualizes data stored in a **signac** :term:`project`.
 To install the **signac-dashboard** package, see :ref:`dashboard-installation`.
 
 .. danger::

diff --git a/docs/source/flow-project.rst b/docs/source/flow-project.rst
@@ -54,7 +54,7 @@ Operations
 It is highly recommended to divide individual modifications of your project's data space into distinct functions.
 In this context, an *operation* is defined as a function whose only positional arguments are instances of :py:class:`~signac.job.Job`.
 We will demonstrate this concept with a simple example.
-Let's initialize a project with a few jobs, by executing the following ``init.py`` script within a ``~/my_project`` directory:
+Let's initialize a signac :term:`project` with a few :term:`jobs<job>`, by executing the following ``init.py`` script within a ``~/my_project`` directory:
 
 .. code-block:: python
 
@@ -66,7 +66,7 @@ Let's initialize a project with a few jobs, by executing the following ``init.py
     for i in range(10):
         project.open_job({"a": i}).init()
 
-A very simple *operation*, which creates a file called ``hello.txt`` within a job's workspace directory, could be implemented like this:
+A very simple *operation*, which creates a file called ``hello.txt`` within the :term:`job directory`, could be implemented like this:
 
 .. code-block:: python
 
@@ -104,8 +104,8 @@ Conditions
 Here the :py:meth:`~flow.FlowProject.operation` decorator function specifies that the ``hello`` operation function is part of our workflow.
 If we run ``python project.py run``, **signac-flow** will execute ``hello`` for all jobs in the project.
 
-However, we only want to execute ``hello`` if ``hello.txt`` does not yet exist in the job's workspace.
-To do this, we need to create a condition function named ``greeted`` that tells us if ``hello.txt`` already exists in the job workspace:
+However, we only want to execute ``hello`` if ``hello.txt`` does not yet exist in the job directory.
+To do this, we need to create a condition function named ``greeted`` that tells us if ``hello.txt`` already exists in the job directory:
 
 
 .. code-block:: python

diff --git a/docs/source/glossary.rst b/docs/source/glossary.rst
@@ -0,0 +1,34 @@
+Glossary
+========
+
+.. glossary::
+
+    parameter
+        A variable, like `T` or `version` or `bench_number`. The smallest unit in **signac**. Specifically, these are the dictionary keys of the state point.
+
+    state point
+        A dictionary of parameters and their values that uniquely specifies a :term:`job`, kept in sync with the file `signac_statepoint.json`.
+
+    job
+        An object holding data and metadata of the :term:`state point` that defines it.
+
+    job id
+        The MD-5 hash of a job's state point that is used to distinguish jobs.
+
+    job directory
+        The directory, named for the :term:`job id`, created when a job is initialized that will contain all data and metadata pertaining to the given job.
+
+    job document
+        A persistent dictionary for storage of simple key-value pairs in a job, kept in sync with the file `signac_job_document.json`.
+
+    workspace
+        The directory that contains all job directories of a **signac** project.
+
+    project
+        The primary interface to access and work with jobs and their data stored in the workspace.
+
+    project schema
+        The emergent database schema as defined by jobs in the project workspace. The set of all keys present in all state points, as well as their range of values.
+
+    signac schema
+        A configuration schema that defines accepted options and values, currently on v2.
diff --git a/docs/source/hooks.rst b/docs/source/hooks.rst
@@ -10,13 +10,13 @@ Introduction
 ============
 
 One of the goals of the **signac** framework is to make it easy to track the provenance of research data and to ensure its reproducibility.
-Hooks make it possible to execute user-defined functions before or after :ref:`FlowProject <flow-project>` operations act on a **signac** project.
+Hooks make it possible to execute user-defined functions before or after :ref:`FlowProject <flow-project>` operations act on a **signac** :term:`project`.
 For example, hooks can be used to track state changes before and after each operation.
 
 A hook is a function that is called at a specific time relative to the execution of a **signac-flow** :ref:`operation <operations>`.
 A hook can be triggered when an operation starts, exits, succeeds, or raises an exception.
 
-A basic use case is to log the success/failure of an operation by creating a hook that sets a job document value ``job.doc.operation_success`` to ``True`` or ``False``.
+A basic use case is to log the success/failure of an operation by creating a hook that sets a :term:`job document` value ``job.doc.operation_success`` to ``True`` or ``False``.
 As another example, a user may record the `git commit ID <https://git-scm.com/book/en/v2/Git-Basics-Viewing-the-Commit-History>`_ upon the start of an operation, allowing them to track which version of code ran the operation.
 
 .. _hook_triggers:
@@ -95,7 +95,7 @@ The ``on_exception`` hook trigger will run, and ``job.doc.error_on_a_0_success``
 Project-Level Hooks
 ===================
 
-It may be desirable to install the same hook or set of hooks for all operations in a project.
+It may be desirable to install the same hook or set of hooks for all operations in a FlowProject.
 In the following example FlowProject, the hook ``track_start_time`` is triggered when each operation starts.
 The hook appends the current time to a list in the job document that is named based on the name of the operation.
 

diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -57,6 +57,7 @@ If you are new to **signac**, the best place to start is to read the :ref:`intro
    :maxdepth: 2
    :caption: Reference
 
+   glossary
    community
    scientific_papers
    GitHub <https://github.com/glotzerlab/signac>

diff --git a/docs/source/jobs.rst b/docs/source/jobs.rst
@@ -10,7 +10,7 @@ Overview
 ========
 
 A *job* is a directory on the file system, which is part of a *project workspace*.
-That directory is called the *job workspace* and contains **all data** associated with that particular job.
+That directory is called the job directory and contains **all data** associated with that particular job.
 Every job has a unique address called the *state point*.
 
 There are two ways to access associated metadata with your job:
@@ -23,7 +23,7 @@ In other words, all data associated with a particular job should be a direct or
 
 .. important::
 
-    Every parameter that, when changed, would invalidate the job's data, should be part of the *state point*; all others should not.
+    Every :term:`parameter` that, when changed, would invalidate the job's data, should be part of the :term:`state point`; all others should not.
 
 However, you only have to add those parameters that are **actually changed** (or anticipated to be changed) to the *state point*.
 It is perfectly acceptable to hard-code parameters up until the point where you **actually change them**, at which point you would add them to the *state point* :ref:`retroactively <add-sp-keys>`.
@@ -461,15 +461,15 @@ Please see the h5py_ documentation for more information on how to interact with
 Job Stores
 ==========
 
-As mentioned before, the :attr:`Job.data` property represents an instance of :class:`~signac.H5Store`, specifically one that operates on a file called ``signac_data.h5`` in the job workspace.
+As mentioned before, the :attr:`Job.data` property represents an instance of :class:`~signac.H5Store`, specifically one that operates on a file called ``signac_data.h5`` in the :term:`job directory`.
 However, there are some reasons why one would want to operate on multiple different HDF5_ files instead of only one.
 
  1. While the HDF5-format is generally mutable, it is fundamentally designed to be used as an immutable data container.
     It is therefore advantageous to write large arrays to a new file instead of modifying an existing file many times.
  2. It easier to synchronize multiple files instead of just one.
  3. Multiple operations executed in parallel can operate on different files circumventing file locking issues.
 
-The :attr:`Job.stores` property provides a dict-like interface to access *multiple different* HDF5 files within the job workspace directory.
+The :attr:`Job.stores` property provides a dict-like interface to access *multiple different* HDF5 files within the job directory.
 In fact, the :attr:`Job.data` container is essentially just an alias for ``job.stores.signac_data``.
 
 For example, to store an array ``X`` within a file called ``my_data.h5``, one could use the following approach:

diff --git a/docs/source/projects.rst b/docs/source/projects.rst
@@ -79,7 +79,7 @@ Jobs
 
 The central assumption of the **signac** data model is that the data space is divisible into individual data points consisting of data and metadata that are uniquely addressable in some manner.
 Specifically, the workspace is divided into subdirectories where each directory corresponds to exactly one :py:class:`Job`.
-All data associated with a job is contained in the corresponding *job workspace* directory.
+All data associated with a job is contained in the corresponding job directory.
 A job can consist of any type of data, ranging from a single value to multiple terabytes of simulation data; **signac**'s only requirement is that this data can be encoded in a file.
 Each job is uniquely addressable via its *state point*, a key-value mapping describing its data.
 There can never be two jobs that share the same state point within the same project.
@@ -355,7 +355,7 @@ In addition, **signac** also provides the :py:meth:`signac.Project.fn` method, w
 Schema Detection
 ================
 
-While **signac** does not require you to specify an *explicit* state point schema, it is always possible to deduce an *implicit* semi-structured schema from a project's data space.
+While **signac** does not require you to specify an *explicit* :term:`project schema`, it is always possible to deduce an *implicit* semi-structured schema from a project's data space.
 This schema is comprised of the set of all keys present in all state points, as well as the range of values that these keys are associated with.
 
 Assuming that we initialize our data space with two state point keys, ``a`` and ``b``, where ``a`` is associated with some set of numbers and ``b`` contains a boolean value:
@@ -473,7 +473,7 @@ Linked Views
 ============
 
 Data space organization by job id is both efficient and flexible, but the obfuscation introduced by the job id makes inspecting the workspace on the command line or *via* a file browser much harder.
-A *linked view* is a directory hierarchy with human-interpretable names that link to to the actual job workspace directories.
+A *linked view* is a directory hierarchy with human-interpretable names that link to the actual :term:`job directories<job directory>`.
 Unlike the default mode for :ref:`data export <data-export>`, no data is copied for the generation of linked views.
 See :py:meth:`~.signac.Project.create_linked_view` for the Python API.
 

diff --git a/docs/source/query.rst b/docs/source/query.rst
@@ -4,8 +4,8 @@
 Query API
 =========
 
-As briefly described in :ref:`project-job-finding`, the :py:meth:`~signac.Project.find_jobs()` method provides much more powerful search functionality beyond simple selection of jobs with specific state point values.
-One of the key features of **signac** is the possibility to immediately search managed data spaces to select desired subsets as needed.
+As briefly described in :ref:`project-job-finding`, the :py:meth:`~signac.Project.find_jobs()` method provides much more powerful search functionality beyond simple selection of jobs with specific :term:`state points <state point>`.
+One of the key features of **signac** is the possibility to search the :term:`project` workspace to select desired subsets as needed.
 
 .. note::
 
@@ -20,8 +20,8 @@ This means that any filter can be used to simultaneously search for keys in both
 Namespaces are identified by prefixing filter keys with the appropriate prefixes.
 Currently, the following prefixes are recognized:
 
-  * **sp**: job state point
-  * **doc**: document
+  * **sp**: job :term:`state point`
+  * **doc**: :term:`job document`
 
 For example, in order to select all jobs whose state point key *a* has the value "foo" and document key *b* has the value "bar", you would use:
 

diff --git a/docs/source/recipes.rst b/docs/source/recipes.rst
@@ -11,13 +11,13 @@ This is a collection of recipes on how to solve typical problems using **signac*
     Move all recipes below into a 'General' section once we have added more recipes.
 
 
-Migrating (changing) the data space schema
+Migrating (changing) the project schema
 ==========================================
 
 Adding/renaming/deleting keys
 -----------------------------
 
-Oftentimes, one discovers at a later stage that important keys are missing from the metadata schema.
+Oftentimes, one discovers at a later stage that important :term:`parameters<parameter>` are missing from the :term:`project schema`.
 For example, in the tutorial we are modeling a gas using the ideal gas law, but we might discover later that important effects are not captured using this overly simplistic model and decide to replace it with the van der Waals equation:
 
 .. math::
@@ -45,19 +45,6 @@ The ``setdefault()`` function sets the value for :math:`a` and :math:`b` to 0 in
 
 .. _document-wide-migration:
 
-Initializing Jobs with Replica Indices
---------------------------------------
-If you want to initialize your workspace with multiple instances of the same state point, you may want to include a **replica_index** or **random_seed** parameter in the state point.
-
-.. code-block:: python
-
-    num_reps = 3
-    for i in range(num_reps):
-        for p in range(1, 11):
-            sp = {"p": p, "kT": 1.0, "N": 1000, "replica_index": i}
-            job = project.open_job(sp)
-            job.init()
-
 Applying document-wide changes
 ------------------------------
 
@@ -88,7 +75,7 @@ This approach makes it also easy to compare the pre- and post-migration states b
 Initializing state points with replica indices
 ==============================================
 
-We often require multiple jobs with the same state point to collect enough information to make statistical inferences about the data. Instead of creating multiple projects to handle this, we can simply add a **replica_index** to the state point. For example, we can use the following code to generate 3 copies of each state point in a workspace:
+We often require multiple jobs with the same state point to collect enough information to make statistical inferences about the data. Instead of creating multiple projects to handle this, we can add a **replica_index** to the state point. For example, we can use the following code to generate 3 copies of each state point in a workspace:
 
 .. code-block:: python
 
@@ -110,7 +97,7 @@ We often require multiple jobs with the same state point to collect enough infor
 Defining a grid of state point values
 =====================================
 
-Many signac data spaces are structured like a "grid" where the goal is an exhaustive search or a Cartesian product of multiple sets of input parameters. While this can be done with nested ``for`` loops, that approach can be cumbersome for state points with many keys. Here we offer a helper function that can assist in this kind of initialization, inspired by `this StackOverflow answer <https://stackoverflow.com/a/5228294>`__:
+Some **signac** :term:`project schemas<project schema>` are structured like a "grid" where the goal is an exhaustive search or a Cartesian product of multiple sets of :term:`parameters<parameter>`. While this can be done with nested ``for`` loops, that approach can be cumbersome for state points with many keys. Here we offer a helper function that can assist in this kind of initialization, inspired by `this StackOverflow answer <https://stackoverflow.com/a/5228294>`__:
 
 .. code-block:: python