Commit

Add minimal glossary (#185)
* Experiments towards glossary

* Remove unnecessary setup.py

* Update recommended Python version.

* Remove --user recommendation.

* Update quickstart.

* Update tutorial

* projects

* job

* querying

* flowproject

* remove collection

* configuration

* Add pre-commit config

* Add codespell.

* community

* Put back original hooks.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Apply Bradley's suggestions

Co-authored-by: Bradley Dice <[email protected]>

* Fix job reference.

Co-authored-by: Corwin Kerr <[email protected]>

* Fix a/an.

Co-authored-by: Corwin Kerr <[email protected]>

* Clarify analogy to primary keys.

* Remove from "data space" terms

* Small wording

* Add initial glossary file

* Replace state point schema with project schema

* Update defs

* Update definitions taking Brandon's feedback

* add hoverxref to build requirements

* Use 4 space indentation in glossary

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add references to glossary terms

---------

Co-authored-by: Vyas Ramasubramani <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Bradley Dice <[email protected]>
4 people authored Mar 30, 2023
1 parent a4dbccb commit 07a209b
Showing 12 changed files with 70 additions and 41 deletions.
1 change: 1 addition & 0 deletions docs/requirements.txt
Original file line number Diff line number Diff line change
@@ -3,6 +3,7 @@ jupyter_client
jupyter_sphinx
nbconvert
nbsphinx
sphinx-hoverxref
sphinx>=4.0.0
sphinx_rtd_theme>=1.0.0
sphinxcontrib-bibtex>=2.2.0
6 changes: 6 additions & 0 deletions docs/source/conf.py
@@ -41,6 +41,7 @@
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"hoverxref.extension",
"sphinx.ext.autodoc",
"sphinx.ext.intersphinx",
"sphinx.ext.todo",
@@ -51,6 +52,11 @@
"sphinxcontrib.bibtex",
]

# For sphinx-hoverxref.
hoverxref_roles = [
"term",
]

# For sphinxcontrib.bibtex.
bibtex_bibfiles = ["signac.bib", "acknowledge.bib"]

2 changes: 1 addition & 1 deletion docs/source/dashboard.rst
@@ -3,7 +3,7 @@
The Dashboard
=============

The **signac-dashboard** visualizes data stored in a **signac** project.
The **signac-dashboard** visualizes data stored in a **signac** :term:`project`.
To install the **signac-dashboard** package, see :ref:`dashboard-installation`.

.. danger::
8 changes: 4 additions & 4 deletions docs/source/flow-project.rst
@@ -54,7 +54,7 @@ Operations
It is highly recommended to divide individual modifications of your project's data space into distinct functions.
In this context, an *operation* is defined as a function whose only positional arguments are instances of :py:class:`~signac.job.Job`.
We will demonstrate this concept with a simple example.
Let's initialize a project with a few jobs, by executing the following ``init.py`` script within a ``~/my_project`` directory:
Let's initialize a signac :term:`project` with a few :term:`jobs<job>`, by executing the following ``init.py`` script within a ``~/my_project`` directory:

.. code-block:: python
@@ -66,7 +66,7 @@ Let's initialize a project with a few jobs, by executing the following ``init.py
    for i in range(10):
        project.open_job({"a": i}).init()
A very simple *operation*, which creates a file called ``hello.txt`` within a job's workspace directory, could be implemented like this:
A very simple *operation*, which creates a file called ``hello.txt`` within the :term:`job directory`, could be implemented like this:

.. code-block:: python
@@ -104,8 +104,8 @@ Conditions
Here the :py:meth:`~flow.FlowProject.operation` decorator function specifies that the ``hello`` operation function is part of our workflow.
If we run ``python project.py run``, **signac-flow** will execute ``hello`` for all jobs in the project.

However, we only want to execute ``hello`` if ``hello.txt`` does not yet exist in the job's workspace.
To do this, we need to create a condition function named ``greeted`` that tells us if ``hello.txt`` already exists in the job workspace:
However, we only want to execute ``hello`` if ``hello.txt`` does not yet exist in the job directory.
To do this, we need to create a condition function named ``greeted`` that tells us if ``hello.txt`` already exists in the job directory:


.. code-block:: python
34 changes: 34 additions & 0 deletions docs/source/glossary.rst
@@ -0,0 +1,34 @@
Glossary
========

.. glossary::

    parameter
        A variable, such as ``T``, ``version``, or ``bench_number``. Parameters are the smallest unit in **signac**; specifically, they are the dictionary keys of the state point.

    state point
        A dictionary of parameters and their values that uniquely specifies a :term:`job`, kept in sync with the file ``signac_statepoint.json``.

    job
        An object holding data and metadata of the :term:`state point` that defines it.

    job id
        The MD5 hash of a job's state point, used to distinguish jobs.

    job directory
        The directory, named for the :term:`job id`, that is created when a job is initialized and contains all data and metadata pertaining to that job.

    job document
        A persistent dictionary for storage of simple key-value pairs in a job, kept in sync with the file ``signac_job_document.json``.

    workspace
        The directory that contains all job directories of a **signac** project.

    project
        The primary interface to access and work with jobs and their data stored in the workspace.

    project schema
        The emergent database schema as defined by jobs in the project workspace: the set of all keys present in all state points, as well as their range of values.

    signac schema
        A configuration schema that defines accepted options and values, currently at v2.
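
The relationship between a state point and its job id can be illustrated with a minimal sketch. It assumes the state point is serialized to JSON with sorted keys before hashing; the function name ``sketch_job_id`` is hypothetical, and the exact serialization used by **signac** may differ, so the digests computed here should not be expected to match real job ids.

```python
import hashlib
import json


def sketch_job_id(state_point):
    """Return an MD5 hex digest of a JSON-serialized state point.

    Assumption: keys are sorted before hashing so that key order does not
    change the result. This is an illustration, not signac's own code.
    """
    blob = json.dumps(state_point, sort_keys=True)
    return hashlib.md5(blob.encode("utf-8")).hexdigest()


id_a = sketch_job_id({"a": 0, "b": True})
id_b = sketch_job_id({"b": True, "a": 0})  # same parameters, different order
id_c = sketch_job_id({"a": 1, "b": True})  # one value changed

print(id_a == id_b)  # compares two orderings of the same state point
print(id_a == id_c)  # compares two different state points
print(len(id_a))     # length of the hexadecimal digest
```

Because the id depends only on the parameters and their values, two jobs with identical state points would map to the same job directory, which is why a state point identifies exactly one job within a project.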
6 changes: 3 additions & 3 deletions docs/source/hooks.rst
@@ -10,13 +10,13 @@ Introduction
============

One of the goals of the **signac** framework is to make it easy to track the provenance of research data and to ensure its reproducibility.
Hooks make it possible to execute user-defined functions before or after :ref:`FlowProject <flow-project>` operations act on a **signac** project.
Hooks make it possible to execute user-defined functions before or after :ref:`FlowProject <flow-project>` operations act on a **signac** :term:`project`.
For example, hooks can be used to track state changes before and after each operation.

A hook is a function that is called at a specific time relative to the execution of a **signac-flow** :ref:`operation <operations>`.
A hook can be triggered when an operation starts, exits, succeeds, or raises an exception.

A basic use case is to log the success/failure of an operation by creating a hook that sets a job document value ``job.doc.operation_success`` to ``True`` or ``False``.
A basic use case is to log the success/failure of an operation by creating a hook that sets a :term:`job document` value ``job.doc.operation_success`` to ``True`` or ``False``.
As another example, a user may record the `git commit ID <https://git-scm.com/book/en/v2/Git-Basics-Viewing-the-Commit-History>`_ upon the start of an operation, allowing them to track which version of code ran the operation.

.. _hook_triggers:
@@ -95,7 +95,7 @@ The ``on_exception`` hook trigger will run, and ``job.doc.error_on_a_0_success``
Project-Level Hooks
===================

It may be desirable to install the same hook or set of hooks for all operations in a project.
It may be desirable to install the same hook or set of hooks for all operations in a FlowProject.
In the following example FlowProject, the hook ``track_start_time`` is triggered when each operation starts.
The hook appends the current time to a list in the job document that is named based on the name of the operation.

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -57,6 +57,7 @@ If you are new to **signac**, the best place to start is to read the :ref:`intro
:maxdepth: 2
:caption: Reference

glossary
community
scientific_papers
GitHub <https://github.com/glotzerlab/signac>
8 changes: 4 additions & 4 deletions docs/source/jobs.rst
@@ -10,7 +10,7 @@ Overview
========

A *job* is a directory on the file system, which is part of a *project workspace*.
That directory is called the *job workspace* and contains **all data** associated with that particular job.
That directory is called the job directory and contains **all data** associated with that particular job.
Every job has a unique address called the *state point*.

There are two ways to access the metadata associated with your job:
@@ -23,7 +23,7 @@ In other words, all data associated with a particular job should be a direct or

.. important::

Every parameter that, when changed, would invalidate the job's data, should be part of the *state point*; all others should not.
Every :term:`parameter` that, when changed, would invalidate the job's data, should be part of the :term:`state point`; all others should not.

However, you only have to add those parameters that are **actually changed** (or anticipated to be changed) to the *state point*.
It is perfectly acceptable to hard-code parameters up until the point where you **actually change them**, at which point you would add them to the *state point* :ref:`retroactively <add-sp-keys>`.
@@ -461,15 +461,15 @@ Please see the h5py_ documentation for more information on how to interact with
Job Stores
==========

As mentioned before, the :attr:`Job.data` property represents an instance of :class:`~signac.H5Store`, specifically one that operates on a file called ``signac_data.h5`` in the job workspace.
As mentioned before, the :attr:`Job.data` property represents an instance of :class:`~signac.H5Store`, specifically one that operates on a file called ``signac_data.h5`` in the :term:`job directory`.
However, there are some reasons why one would want to operate on multiple different HDF5_ files instead of only one.

1. While the HDF5-format is generally mutable, it is fundamentally designed to be used as an immutable data container.
   It is therefore advantageous to write large arrays to a new file instead of modifying an existing file many times.
2. It is easier to synchronize multiple files instead of just one.
3. Multiple operations executed in parallel can operate on different files, circumventing file locking issues.

The :attr:`Job.stores` property provides a dict-like interface to access *multiple different* HDF5 files within the job workspace directory.
The :attr:`Job.stores` property provides a dict-like interface to access *multiple different* HDF5 files within the job directory.
In fact, the :attr:`Job.data` container is essentially just an alias for ``job.stores.signac_data``.

For example, to store an array ``X`` within a file called ``my_data.h5``, one could use the following approach:
6 changes: 3 additions & 3 deletions docs/source/projects.rst
@@ -79,7 +79,7 @@ Jobs

The central assumption of the **signac** data model is that the data space is divisible into individual data points consisting of data and metadata that are uniquely addressable in some manner.
Specifically, the workspace is divided into subdirectories where each directory corresponds to exactly one :py:class:`Job`.
All data associated with a job is contained in the corresponding *job workspace* directory.
All data associated with a job is contained in the corresponding job directory.
A job can consist of any type of data, ranging from a single value to multiple terabytes of simulation data; **signac**'s only requirement is that this data can be encoded in a file.
Each job is uniquely addressable via its *state point*, a key-value mapping describing its data.
There can never be two jobs that share the same state point within the same project.
@@ -355,7 +355,7 @@ In addition, **signac** also provides the :py:meth:`signac.Project.fn` method, w
Schema Detection
================

While **signac** does not require you to specify an *explicit* state point schema, it is always possible to deduce an *implicit* semi-structured schema from a project's data space.
While **signac** does not require you to specify an *explicit* :term:`project schema`, it is always possible to deduce an *implicit* semi-structured schema from a project's data space.
This schema comprises the set of all keys present in all state points, as well as the range of values that these keys are associated with.
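
The idea of an implicit schema can be sketched in plain Python, without **signac** itself. The helper ``sketch_schema`` below is hypothetical and only handles flat state points with hashable values; it is meant to show what "keys plus their observed values" looks like, not to reproduce how **signac** detects schemas.

```python
from collections import defaultdict


def sketch_schema(state_points):
    """Collect, for every key, the values observed for it, grouped by type name.

    Assumption: state points are flat dictionaries with hashable values.
    """
    observed = defaultdict(lambda: defaultdict(set))
    for sp in state_points:
        for key, value in sp.items():
            observed[key][type(value).__name__].add(value)
    # Convert the nested sets into sorted lists for stable, readable output.
    return {
        key: {type_name: sorted(values) for type_name, values in by_type.items()}
        for key, by_type in observed.items()
    }


# Four illustrative state points: ``a`` is an integer, ``b`` is a boolean.
state_points = [{"a": i, "b": i % 2 == 0} for i in range(4)]
schema = sketch_schema(state_points)

for key, by_type in schema.items():
    print(key, by_type)
```

Grouping values by type name keeps keys that hold mixed types (for example, both integers and strings) visible in the result instead of silently merging them.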

Assuming that we initialize our data space with two state point keys, ``a`` and ``b``, where ``a`` is associated with some set of numbers and ``b`` contains a boolean value:
@@ -473,7 +473,7 @@ Linked Views
============

Data space organization by job id is both efficient and flexible, but the obfuscation introduced by the job id makes inspecting the workspace on the command line or *via* a file browser much harder.
A *linked view* is a directory hierarchy with human-interpretable names that link to to the actual job workspace directories.
A *linked view* is a directory hierarchy with human-interpretable names that link to the actual :term:`job directories<job directory>`.
Unlike the default mode for :ref:`data export <data-export>`, no data is copied for the generation of linked views.
See :py:meth:`~.signac.Project.create_linked_view` for the Python API.

8 changes: 4 additions & 4 deletions docs/source/query.rst
@@ -4,8 +4,8 @@
Query API
=========

As briefly described in :ref:`project-job-finding`, the :py:meth:`~signac.Project.find_jobs()` method provides much more powerful search functionality beyond simple selection of jobs with specific state point values.
One of the key features of **signac** is the possibility to immediately search managed data spaces to select desired subsets as needed.
As briefly described in :ref:`project-job-finding`, the :py:meth:`~signac.Project.find_jobs()` method provides much more powerful search functionality beyond simple selection of jobs with specific :term:`state points <state point>`.
One of the key features of **signac** is the possibility to search the :term:`project` workspace to select desired subsets as needed.

.. note::

@@ -20,8 +20,8 @@ This means that any filter can be used to simultaneously search for keys in both
Namespaces are identified by prefixing filter keys with the appropriate prefixes.
Currently, the following prefixes are recognized:

* **sp**: job state point
* **doc**: document
* **sp**: job :term:`state point`
* **doc**: :term:`job document`

For example, in order to select all jobs whose state point key *a* has the value "foo" and document key *b* has the value "bar", you would use:

21 changes: 4 additions & 17 deletions docs/source/recipes.rst
@@ -11,13 +11,13 @@ This is a collection of recipes on how to solve typical problems using **signac*
Move all recipes below into a 'General' section once we have added more recipes.


Migrating (changing) the data space schema
Migrating (changing) the project schema
==========================================

Adding/renaming/deleting keys
-----------------------------

Oftentimes, one discovers at a later stage that important keys are missing from the metadata schema.
Oftentimes, one discovers at a later stage that important :term:`parameters<parameter>` are missing from the :term:`project schema`.
For example, in the tutorial we are modeling a gas using the ideal gas law, but we might discover later that important effects are not captured using this overly simplistic model and decide to replace it with the van der Waals equation:

.. math::
@@ -45,19 +45,6 @@ The ``setdefault()`` function sets the value for :math:`a` and :math:`b` to 0 in

.. _document-wide-migration:

Initializing Jobs with Replica Indices
--------------------------------------
If you want to initialize your workspace with multiple instances of the same state point, you may want to include a **replica_index** or **random_seed** parameter in the state point.

.. code-block:: python

    num_reps = 3
    for i in range(num_reps):
        for p in range(1, 11):
            sp = {"p": p, "kT": 1.0, "N": 1000, "replica_index": i}
            job = project.open_job(sp)
            job.init()

Applying document-wide changes
------------------------------

@@ -88,7 +75,7 @@ This approach makes it also easy to compare the pre- and post-migration states b
Initializing state points with replica indices
==============================================

We often require multiple jobs with the same state point to collect enough information to make statistical inferences about the data. Instead of creating multiple projects to handle this, we can simply add a **replica_index** to the state point. For example, we can use the following code to generate 3 copies of each state point in a workspace:
We often require multiple jobs with the same state point to collect enough information to make statistical inferences about the data. Instead of creating multiple projects to handle this, we can add a **replica_index** to the state point. For example, we can use the following code to generate 3 copies of each state point in a workspace:

.. code-block:: python
@@ -110,7 +97,7 @@ We often require multiple jobs with the same state point to collect enough infor
Defining a grid of state point values
=====================================

Many signac data spaces are structured like a "grid" where the goal is an exhaustive search or a Cartesian product of multiple sets of input parameters. While this can be done with nested ``for`` loops, that approach can be cumbersome for state points with many keys. Here we offer a helper function that can assist in this kind of initialization, inspired by `this StackOverflow answer <https://stackoverflow.com/a/5228294>`__:
Some **signac** :term:`project schemas<project schema>` are structured like a "grid" where the goal is an exhaustive search or a Cartesian product of multiple sets of :term:`parameters<parameter>`. While this can be done with nested ``for`` loops, that approach can be cumbersome for state points with many keys. Here we offer a helper function that can assist in this kind of initialization, inspired by `this StackOverflow answer <https://stackoverflow.com/a/5228294>`__:

.. code-block:: python