diff --git a/docs/requirements.txt b/docs/requirements.txt index 2900ffc..f7d406e 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -3,6 +3,7 @@ jupyter_client jupyter_sphinx nbconvert nbsphinx +sphinx-hoverxref sphinx>=4.0.0 sphinx_rtd_theme>=1.0.0 sphinxcontrib-bibtex>=2.2.0 diff --git a/docs/source/conf.py b/docs/source/conf.py index a68e463..dbbdc5f 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -41,6 +41,7 @@ # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. extensions = [ + "hoverxref.extension", "sphinx.ext.autodoc", "sphinx.ext.intersphinx", "sphinx.ext.todo", @@ -51,6 +52,11 @@ "sphinxcontrib.bibtex", ] +# For sphinx-hoverxref. +hoverxref_roles = [ + "term", +] + # For sphinxcontrib.bibtex. bibtex_bibfiles = ["signac.bib", "acknowledge.bib"] diff --git a/docs/source/dashboard.rst b/docs/source/dashboard.rst index 7fb7425..72e4ff1 100644 --- a/docs/source/dashboard.rst +++ b/docs/source/dashboard.rst @@ -3,7 +3,7 @@ The Dashboard ============= -The **signac-dashboard** visualizes data stored in a **signac** project. +The **signac-dashboard** visualizes data stored in a **signac** :term:`project`. To install the **signac-dashboard** package, see :ref:`dashboard-installation`. .. danger:: diff --git a/docs/source/flow-project.rst b/docs/source/flow-project.rst index ae62810..738abc5 100644 --- a/docs/source/flow-project.rst +++ b/docs/source/flow-project.rst @@ -54,7 +54,7 @@ Operations It is highly recommended to divide individual modifications of your project's data space into distinct functions. In this context, an *operation* is defined as a function whose only positional arguments are instances of :py:class:`~signac.job.Job`. We will demonstrate this concept with a simple example.
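To make that definition concrete, here is a minimal sketch of an operation (the function name ``compute_volume`` and the state point keys ``N``, ``kT``, and ``p`` are illustrative assumptions echoing the ideal-gas tutorial, not prescribed by **signac-flow**):

```python
# A minimal sketch of an operation: a plain function whose only
# positional argument is a job. All names here are illustrative.
def compute_volume(job):
    # Read input parameters from the job's state point and store the
    # result in the job document (ideal gas law: V = N * kT / p).
    job.doc["volume"] = job.sp["N"] * job.sp["kT"] / job.sp["p"]
```

Because the function only touches ``job.sp`` and ``job.doc``, it can be applied to any job in the project without further arguments.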
-Let's initialize a project with a few jobs, by executing the following ``init.py`` script within a ``~/my_project`` directory: +Let's initialize a signac :term:`project` with a few :term:`jobs <job>`, by executing the following ``init.py`` script within a ``~/my_project`` directory: .. code-block:: python @@ -66,7 +66,7 @@ Let's initialize a project with a few jobs, by executing the following ``init.py for i in range(10): project.open_job({"a": i}).init() -A very simple *operation*, which creates a file called ``hello.txt`` within a job's workspace directory, could be implemented like this: +A very simple *operation*, which creates a file called ``hello.txt`` within the :term:`job directory`, could be implemented like this: .. code-block:: python @@ -104,8 +104,8 @@ Conditions Here the :py:meth:`~flow.FlowProject.operation` decorator function specifies that the ``hello`` operation function is part of our workflow. If we run ``python project.py run``, **signac-flow** will execute ``hello`` for all jobs in the project. -However, we only want to execute ``hello`` if ``hello.txt`` does not yet exist in the job's workspace. -To do this, we need to create a condition function named ``greeted`` that tells us if ``hello.txt`` already exists in the job workspace: +However, we only want to execute ``hello`` if ``hello.txt`` does not yet exist in the job directory. +To do this, we need to create a condition function named ``greeted`` that tells us if ``hello.txt`` already exists in the job directory: .. code-block:: python diff --git a/docs/source/glossary.rst b/docs/source/glossary.rst new file mode 100644 index 0000000..b916b9c --- /dev/null +++ b/docs/source/glossary.rst @@ -0,0 +1,34 @@ +Glossary +======== + +.. glossary:: + + parameter + A variable, like ``T`` or ``version`` or ``bench_number``. The smallest unit in **signac**. Specifically, these are the dictionary keys of the state point.
+ + state point + A dictionary of parameters and their values that uniquely specifies a :term:`job`, kept in sync with the file ``signac_statepoint.json``. + + job + An object holding the data and metadata associated with the :term:`state point` that defines it. + + job id + The MD5 hash of a job's state point, used to distinguish jobs. + + job directory + The directory, named after the :term:`job id`, that is created when a job is initialized and contains all data and metadata pertaining to the given job. + + job document + A persistent dictionary for storage of simple key-value pairs in a job, kept in sync with the file ``signac_job_document.json``. + + workspace + The directory that contains all job directories of a **signac** project. + + project + The primary interface to access and work with jobs and their data stored in the workspace. + + project schema + The emergent database schema as defined by jobs in the project workspace. The set of all keys present in all state points, as well as their range of values. + + signac schema + A configuration schema that defines accepted options and values, currently at version 2. diff --git a/docs/source/hooks.rst b/docs/source/hooks.rst index cf92944..67a56b9 100644 --- a/docs/source/hooks.rst +++ b/docs/source/hooks.rst @@ -10,13 +10,13 @@ Introduction ============ One of the goals of the **signac** framework is to make it easy to track the provenance of research data and to ensure its reproducibility. -Hooks make it possible to execute user-defined functions before or after :ref:`FlowProject ` operations act on a **signac** project. +Hooks make it possible to execute user-defined functions before or after :ref:`FlowProject ` operations act on a **signac** :term:`project`. For example, hooks can be used to track state changes before and after each operation. A hook is a function that is called at a specific time relative to the execution of a **signac-flow** :ref:`operation `.
A hook can be triggered when an operation starts, exits, succeeds, or raises an exception. -A basic use case is to log the success/failure of an operation by creating a hook that sets a job document value ``job.doc.operation_success`` to ``True`` or ``False``. +A basic use case is to log the success/failure of an operation by creating a hook that sets a :term:`job document` value ``job.doc.operation_success`` to ``True`` or ``False``. As another example, a user may record the `git commit ID `_ upon the start of an operation, allowing them to track which version of code ran the operation. .. _hook_triggers: @@ -95,7 +95,7 @@ Project-Level Hooks =================== -It may be desirable to install the same hook or set of hooks for all operations in a project. +It may be desirable to install the same hook or set of hooks for all operations in a FlowProject. In the following example FlowProject, the hook ``track_start_time`` is triggered when each operation starts. The hook appends the current time to a list in the job document that is named based on the name of the operation. diff --git a/docs/source/index.rst b/docs/source/index.rst index 36b6158..1682703 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -57,6 +57,7 @@ If you are new to **signac**, the best place to start is to read the :ref:`intro :maxdepth: 2 :caption: Reference + glossary community scientific_papers GitHub diff --git a/docs/source/jobs.rst b/docs/source/jobs.rst index d09f042..60cc289 100644 --- a/docs/source/jobs.rst +++ b/docs/source/jobs.rst @@ -10,7 +10,7 @@ Overview ======== A *job* is a directory on the file system, which is part of a *project workspace*. -That directory is called the *job workspace* and contains **all data** associated with that particular job. +That directory is called the :term:`job directory` and contains **all data** associated with that particular job.
Every job has a unique address called the *state point*. There are two ways to access metadata associated with your job: @@ -23,7 +23,7 @@ In other words, all data associated with a particular job should be a direct or .. important:: - Every parameter that, when changed, would invalidate the job's data, should be part of the *state point*; all others should not. + Every :term:`parameter` that, when changed, would invalidate the job's data, should be part of the :term:`state point`; all others should not. However, you only have to add those parameters that are **actually changed** (or anticipated to be changed) to the *state point*. It is perfectly acceptable to hard-code parameters up until the point where you **actually change them**, at which point you would add them to the *state point* :ref:`retroactively `. @@ -461,7 +461,7 @@ Please see the h5py_ documentation for more information on how to interact with Job Stores ========== -As mentioned before, the :attr:`Job.data` property represents an instance of :class:`~signac.H5Store`, specifically one that operates on a file called ``signac_data.h5`` in the job workspace. +As mentioned before, the :attr:`Job.data` property represents an instance of :class:`~signac.H5Store`, specifically one that operates on a file called ``signac_data.h5`` in the :term:`job directory`. However, there are some reasons why one would want to operate on multiple different HDF5_ files instead of only one. 1. While the HDF5-format is generally mutable, it is fundamentally designed to be used as an immutable data container. @@ -469,7 +469,7 @@ However, there are some reasons why one would want to operate on multiple differ 2. It is easier to synchronize multiple files instead of just one. 3. Multiple operations executed in parallel can operate on different files circumventing file locking issues. -The :attr:`Job.stores` property provides a dict-like interface to access *multiple different* HDF5 files within the job workspace directory.
+The :attr:`Job.stores` property provides a dict-like interface to access *multiple different* HDF5 files within the job directory. In fact, the :attr:`Job.data` container is essentially just an alias for ``job.stores.signac_data``. For example, to store an array ``X`` within a file called ``my_data.h5``, one could use the following approach: diff --git a/docs/source/projects.rst b/docs/source/projects.rst index 6754498..2bbd271 100644 --- a/docs/source/projects.rst +++ b/docs/source/projects.rst @@ -79,7 +79,7 @@ Jobs The central assumption of the **signac** data model is that the data space is divisible into individual data points consisting of data and metadata that are uniquely addressable in some manner. Specifically, the workspace is divided into subdirectories where each directory corresponds to exactly one :py:class:`Job`. -All data associated with a job is contained in the corresponding *job workspace* directory. +All data associated with a job is contained in the corresponding job directory. A job can consist of any type of data, ranging from a single value to multiple terabytes of simulation data; **signac**'s only requirement is that this data can be encoded in a file. Each job is uniquely addressable via its *state point*, a key-value mapping describing its data. There can never be two jobs that share the same state point within the same project. @@ -355,7 +355,7 @@ In addition, **signac** also provides the :py:meth:`signac.Project.fn` method, w Schema Detection ================ -While **signac** does not require you to specify an *explicit* state point schema, it is always possible to deduce an *implicit* semi-structured schema from a project's data space. +While **signac** does not require you to specify an *explicit* :term:`project schema`, it is always possible to deduce an *implicit* semi-structured schema from a project's data space. 
This schema is comprised of the set of all keys present in all state points, as well as the range of values that these keys are associated with. Assuming that we initialize our data space with two state point keys, ``a`` and ``b``, where ``a`` is associated with some set of numbers and ``b`` contains a boolean value: @@ -473,7 +473,7 @@ Linked Views ============ Data space organization by job id is both efficient and flexible, but the obfuscation introduced by the job id makes inspecting the workspace on the command line or *via* a file browser much harder. -A *linked view* is a directory hierarchy with human-interpretable names that link to to the actual job workspace directories. +A *linked view* is a directory hierarchy with human-interpretable names that link to the actual :term:`job directories <job directory>`. Unlike the default mode for :ref:`data export `, no data is copied for the generation of linked views. See :py:meth:`~.signac.Project.create_linked_view` for the Python API. diff --git a/docs/source/query.rst b/docs/source/query.rst index 4f922d6..7bd8c18 100644 --- a/docs/source/query.rst +++ b/docs/source/query.rst @@ -4,8 +4,8 @@ Query API ========= -As briefly described in :ref:`project-job-finding`, the :py:meth:`~signac.Project.find_jobs()` method provides much more powerful search functionality beyond simple selection of jobs with specific state point values. -One of the key features of **signac** is the possibility to immediately search managed data spaces to select desired subsets as needed. +As briefly described in :ref:`project-job-finding`, the :py:meth:`~signac.Project.find_jobs()` method provides much more powerful search functionality beyond simple selection of jobs with specific :term:`state points <state point>`. +One of the key features of **signac** is the ability to search the :term:`project` :term:`workspace` to select desired subsets as needed. ..
note:: @@ -20,8 +20,8 @@ This means that any filter can be used to simultaneously search for keys in both Namespaces are identified by prefixing filter keys with the appropriate prefixes. Currently, the following prefixes are recognized: - * **sp**: job state point - * **doc**: document + * **sp**: job :term:`state point` + * **doc**: :term:`job document` For example, in order to select all jobs whose state point key *a* has the value "foo" and document key *b* has the value "bar", you would use: diff --git a/docs/source/recipes.rst b/docs/source/recipes.rst index 332814b..cc0ab44 100644 --- a/docs/source/recipes.rst +++ b/docs/source/recipes.rst @@ -11,13 +11,13 @@ This is a collection of recipes on how to solve typical problems using **signac* Move all recipes below into a 'General' section once we have added more recipes. -Migrating (changing) the data space schema +Migrating (changing) the project schema ========================================== Adding/renaming/deleting keys ----------------------------- -Oftentimes, one discovers at a later stage that important keys are missing from the metadata schema. +Oftentimes, one discovers at a later stage that important :term:`parameters <parameter>` are missing from the :term:`project schema`. For example, in the tutorial we are modeling a gas using the ideal gas law, but we might discover later that important effects are not captured using this overly simplistic model and decide to replace it with the van der Waals equation: .. math:: @@ -45,19 +45,6 @@ The ``setdefault()`` function sets the value for :math:`a` and :math:`b` to 0 in .. _document-wide-migration: -Initializing Jobs with Replica Indices -------------------------------------- -If you want to initialize your workspace with multiple instances of the same state point, you may want to include a **replica_index** or **random_seed** parameter in the state point. -..
code-block:: python - - num_reps = 3 - for i in range(num_reps): - for p in range(1, 11): - sp = {"p": p, "kT": 1.0, "N": 1000, "replica_index": i} - job = project.open_job(sp) - job.init() - Applying document-wide changes ------------------------------ @@ -88,7 +75,7 @@ This approach makes it also easy to compare the pre- and post-migration states b Initializing state points with replica indices ============================================== -We often require multiple jobs with the same state point to collect enough information to make statistical inferences about the data. Instead of creating multiple projects to handle this, we can simply add a **replica_index** to the state point. For example, we can use the following code to generate 3 copies of each state point in a workspace: +We often require multiple jobs with the same state point to collect enough information to make statistical inferences about the data. Instead of creating multiple projects to handle this, we can add a **replica_index** to the state point. For example, we can use the following code to generate 3 copies of each state point in a workspace: .. code-block:: python @@ -110,7 +97,7 @@ We often require multiple jobs with the same state point to collect enough infor Defining a grid of state point values ===================================== -Many signac data spaces are structured like a "grid" where the goal is an exhaustive search or a Cartesian product of multiple sets of input parameters. While this can be done with nested ``for`` loops, that approach can be cumbersome for state points with many keys. Here we offer a helper function that can assist in this kind of initialization, inspired by `this StackOverflow answer `__: +Some **signac** :term:`project schemas <project schema>` are structured like a "grid" where the goal is an exhaustive search or a Cartesian product of multiple sets of :term:`parameters <parameter>`.
While this can be done with nested ``for`` loops, that approach can be cumbersome for state points with many keys. Here we offer a helper function that can assist in this kind of initialization, inspired by `this StackOverflow answer `__: .. code-block:: python diff --git a/docs/source/tips_and_tricks.rst b/docs/source/tips_and_tricks.rst index 854ee44..4a348bc 100644 --- a/docs/source/tips_and_tricks.rst +++ b/docs/source/tips_and_tricks.rst @@ -21,7 +21,7 @@ Nonetheless, there are some basic rules worth following: What is the difference between the job state point and the job document? ------------------------------------------------------------------------ -The *state point* defines the *identity* of each job in form of the *job id*. +The :term:`state point` defines the *identity* of each job in the form of the :term:`job id`. Conceptually, all data related to a job should be a function of the *state point*. That means that any metadata that could be changed without invalidating the data, should in principle be placed in the job document. @@ -33,7 +33,7 @@ That means that any metadata that could be changed without invalidating the data How do I avoid replicating metadata in filenames? ------------------------------------------------- -Many users, especially those new to **signac**, fall into the trap of storing metadata in filenames within a job's workspace even though that metadata is already encoded in the job itself. +Many users, especially those new to **signac**, fall into the trap of storing metadata in filenames within the :term:`job directory` even though that metadata is already encoded in the job itself. Using the :ref:`tutorial` project as an example, we might have stored the volume corresponding to the job at pressure 4 in a file called ``volume_pressure_4.txt``. However, this is completely unnecessary since that information can already be accessed through the job *via* ``job.sp.p``. @@ -58,7 +58,7 @@ How do I reference data/jobs in scripts?
You can reference other jobs in a script using the path to the project root directory in combination with a query-expression. While it is perfectly fine to copy & paste job ids during interactive work or for small tests, hard-coded job ids within code are almost always a bad sign. -One of the main advantages of using **signac** for data management is that the schema is flexible and may be migrated at any time without too much hassle. +One of the main advantages of using **signac** for data management is that the :term:`project schema` is flexible and may be migrated at any time without too much hassle. That also means that existing ids will change and scripts that used them in a hard-coded fashion will fail. Whenever you find yourself hard-coding ids into your code, consider replacing it with a function that uses the :py:meth:`~.signac.Project.find_jobs` function instead. @@ -69,10 +69,10 @@ How do I achieve optimal performance? What are the practical scaling limits for Because **signac** uses a filesystem backend, there are some practical limitations for project size. While there is no hard limit imposed by **signac**, some heuristics can be helpful. -On a system with a fast SSD, a project can hold about 100,000 jobs before the latency for various operations (searching, filtering, iteration) becomes unwieldy. +On a system with a fast SSD, a :term:`project` can hold about 100,000 jobs before the latency for various operations (searching, filtering, iteration) becomes unwieldy. Some **signac** projects have scaled up to around 1,000,000 jobs, but the performance can be slower. This is especially difficult on network file systems found on HPC clusters, because accessing many small files is expensive compared to accessing fewer large files.
-If your project needs to explore a large parameter space with many jobs, consider a state point schema that allows you to do more work with fewer jobs, instead of a small amount of work for many jobs, perhaps by reducing one dimension of the parameter space being explored. +If your project needs to explore a large parameter space with many jobs, consider a :term:`project schema` that allows you to do more work with fewer jobs, instead of a small amount of work for many jobs, perhaps by reducing one dimension of the parameter space being explored. After adding or removing jobs, it is recommended to run the CLI command ``$ signac update-cache`` or the Python method ``Project.update_cache()`` to update the persistent (centralized) cache of all state points in the project. For workflows implemented with **signac-flow**, the choice of pre-conditions and post-conditions can have a dramatic effect on performance. In particular, conditions that check for file existence, like ``FlowProject.post.isfile``, are typically much faster than conditions that require reading a file's contents.