From 1d7a8c82a2833a4b0698d5d4290de75b7c579e65 Mon Sep 17 00:00:00 2001 From: Corwin Kerr Date: Tue, 12 Jul 2022 15:13:37 -0400 Subject: [PATCH 01/31] Experiments towards glossary --- docs/source/conf.py | 6 ++++++ docs/source/index.rst | 1 + docs/source/jobs.rst | 2 +- 3 files changed, 8 insertions(+), 1 deletion(-) diff --git a/docs/source/conf.py b/docs/source/conf.py index 3caa0fe1..48c4e3ec 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -41,6 +41,7 @@ # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. extensions = [ + "hoverxref.extension", "sphinx.ext.autodoc", "sphinx.ext.intersphinx", "sphinx.ext.todo", @@ -51,6 +52,11 @@ "sphinxcontrib.bibtex", ] +# For hover x ref +hoverxref_roles = [ + 'term', +] + # For sphinxcontrib.bibtex. bibtex_bibfiles = ["signac.bib", "acknowledge.bib"] diff --git a/docs/source/index.rst b/docs/source/index.rst index 6472935d..9934fe25 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -58,6 +58,7 @@ If you are new to **signac**, the best place to start is to read the :ref:`intro :maxdepth: 2 :caption: Reference + glossary community scientific_papers GitHub diff --git a/docs/source/jobs.rst b/docs/source/jobs.rst index c42deb32..26b8ea6c 100644 --- a/docs/source/jobs.rst +++ b/docs/source/jobs.rst @@ -23,7 +23,7 @@ In other words, all data associated with a particular job should be a direct or .. important:: - Every parameter that, when changed, would invalidate the job's data, should be part of the *state point*; all others should not. + Every :term:`parameter` that, when changed, would invalidate the job's data, should be part of the *state point*; all others should not. However, you only have to add those parameters that are **actually changed** (or anticipated to be changed) to the *state point*. 
It is perfectly acceptable to hard-code parameters up until the point where you **actually change them**, at which point you would add them to the *state point* :ref:`retroactively `. From 8c7c0d0fe3f0ad55005a0030833a446df05766b2 Mon Sep 17 00:00:00 2001 From: Vyas Ramasubramani Date: Sun, 19 Mar 2023 15:43:18 -0400 Subject: [PATCH 02/31] Remove unnecessary setup.py --- setup.py | 0 1 file changed, 0 insertions(+), 0 deletions(-) delete mode 100644 setup.py diff --git a/setup.py b/setup.py deleted file mode 100644 index e69de29b..00000000 From 624fa3a04e750eb8cd5b28cb0b713672b1b8bdf8 Mon Sep 17 00:00:00 2001 From: Vyas Ramasubramani Date: Sun, 19 Mar 2023 16:05:35 -0400 Subject: [PATCH 03/31] Update recommended Python version. --- docs/source/installation.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/installation.rst b/docs/source/installation.rst index f5780e04..8e727fea 100644 --- a/docs/source/installation.rst +++ b/docs/source/installation.rst @@ -10,7 +10,7 @@ Installation The **signac** framework consists of three packages: **signac**, **signac-flow**, and **signac-dashboard**. All packages in the **signac** framework depend on the core **signac** package, which provides the data management functionality used by all other packages. -Most users should install the **signac** and the **signac-flow** packages, which are tested for Python 3.6+ and are built for all major platforms. +Most users should install the **signac** and the **signac-flow** packages, which are tested for Python 3.8+ and are built for all major platforms. For more details about the functionalities of individual packages, please see :ref:`package-overview`. From 4e995ee70f6b737389dd2fb4cc7d0a36613f5786 Mon Sep 17 00:00:00 2001 From: Vyas Ramasubramani Date: Sun, 19 Mar 2023 16:06:15 -0400 Subject: [PATCH 04/31] Remove --user recommendation. 
--- docs/source/installation.rst | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/docs/source/installation.rst b/docs/source/installation.rst index 8e727fea..e0d37bcb 100644 --- a/docs/source/installation.rst +++ b/docs/source/installation.rst @@ -37,11 +37,7 @@ For a standard installation with pip_, execute: .. code:: bash - $ pip install --user signac signac-flow - -.. note:: - - If you want to install packages for all users on a machine, you can remove the ``--user`` option in the install command. + $ pip install signac signac-flow Installation from Source From 8089f8c2bfdd5d55aec0676fd36f2628136212da Mon Sep 17 00:00:00 2001 From: Vyas Ramasubramani Date: Sun, 19 Mar 2023 16:24:11 -0400 Subject: [PATCH 05/31] Update quickstart. --- docs/source/quickstart.rst | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/docs/source/quickstart.rst b/docs/source/quickstart.rst index 78e437e0..4d8183ee 100644 --- a/docs/source/quickstart.rst +++ b/docs/source/quickstart.rst @@ -12,8 +12,8 @@ To get started, first :ref:`install ` **signac** and **signac-flo ~ $ mkdir my_project ~ $ cd my_project/ - ~/my_project $ signac init MyProject - Initialized project 'MyProject'. + ~/my_project $ signac init + Initialized project. .. important:: @@ -61,11 +61,9 @@ Operations can be executed for all of your jobs with: .. code-block:: bash ~/my_project $ python project.py run - Execute operation 'hello_job(15e548a2d943845b33030e68801bd125)'... - Hello from job 15e548a2d943845b33030e68801bd125, my foo is '1'. - Execute operation 'hello_job(2b985fa90138327bef586f9ad87fc310)'... - Hello from job 2b985fa90138327bef586f9ad87fc310, my foo is '2'. - Execute operation 'hello_job(7f3e901b4266f28348b38721c099d612)'... - Hello from job 7f3e901b4266f28348b38721c099d612, my foo is '0'. + Hello from job 15e548a2d943845b33030e68801bd125, my foo is 1. + Hello from job 7f3e901b4266f28348b38721c099d612, my foo is 0. 
+ Hello from job 2b985fa90138327bef586f9ad87fc310, my foo is 2. + WARNING:flow.project:Operation 'hello_job' has no postconditions! See the :ref:`tutorial` for a more detailed introduction to how to use **signac** to manage data and implement workflows. From a0a75b5db76e793db7efbf3af5cb6093c3d9e0c8 Mon Sep 17 00:00:00 2001 From: Vyas Ramasubramani Date: Sun, 19 Mar 2023 17:33:22 -0400 Subject: [PATCH 06/31] Update tutorial --- docs/source/tutorial.rst | 239 +++++++++++++++++++++------------------ 1 file changed, 126 insertions(+), 113 deletions(-) diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst index f56be8af..487ca392 100644 --- a/docs/source/tutorial.rst +++ b/docs/source/tutorial.rst @@ -3,6 +3,8 @@ ======== Tutorial ======== +######## IMPORTANT ####### +All of the example outputs on Stampede2 need to be refreshed since our output format has changed substantially over time. .. sidebar:: License @@ -35,7 +37,6 @@ To test this relationship, we start by creating an empty project directory where ~ $ mkdir ideal_gas_project ~ $ cd ideal_gas_project/ - ~/ideal_gas_project $ We then proceed by initializing the data space within a Python script called ``init.py``: @@ -44,20 +45,28 @@ We then proceed by initializing the data space within a Python script called ``i # init.py import signac - project = signac.init_project("ideal-gas-project") + project = signac.init_project() for p in range(1, 10): sp = {"p": p, "kT": 1.0, "N": 1000} - job = project.open_job(sp) - job.init() + job = project.open_job(sp).init() -The :py:func:`signac.init_project` function initializes the **signac** project in the current working directory by creating a configuration file called ``signac.rc``. -The location of this file defines the *project root directory*. 
-We can access the project interface from anywhere within and below the root directory by calling the :py:func:`signac.get_project` function, or from outside this directory by providing an explicit path, *e.g.*, ``signac.get_project('~/ideal_gas_project')``. +The :py:func:`signac.init_project` function initializes the **signac** project in the current working directory by creating a hidden ``.signac`` subdirectory. +The location of this directory defines the *project root directory*. +Initially, the ``.signac`` directory will contain the minimal configuration information required to define the project. -.. note:: +.. code:: bash - The name of the project stored in the configuration file is independent of the directory name it resides in. + ~/ideal_gas_project $ python init.py + ~/ideal_gas_project $ ls -a + . .. .signac init.py workspace + ~/ideal_gas_project $ ls .signac + config + ~/ideal_gas_project $ cat .signac/config + schema_version = 2 + + +We can access the project interface from anywhere within and below the root directory by calling the :py:func:`signac.get_project` function, or from outside this directory by providing an explicit path, *e.g.*, ``signac.get_project('~/ideal_gas_project')``. We can verify that the initialization worked by examining the *implicit* schema of the project we just created: @@ -71,7 +80,7 @@ We can verify that the initialization worked by examining the *implicit* schema } -The output of the ``$ signac schema`` command gives us a brief overview of all keys that were used as well as their value (range). +The output of the ``$ signac schema`` command gives us a brief overview of all keys that were used as well as their values (range). .. 
note:: @@ -83,8 +92,8 @@ Exploring the data space ------------------------ The core function that **signac** offers is the ability to associate metadata --- for example, a specific set of parameters such as temperature, pressure, and system size --- with a distinct directory on the file system that contains all data related to said metadata. -The :py:meth:`~signac.Project.open_job` method associates the metadata specified as its first argument with a distinct directory called a *job workspace*. -These directories are located in the ``workspace`` sub-directory within the project directory and the directory name is the so called *job id*. +The :py:meth:`~signac.Project.open_job` method associates the metadata specified as its first argument with a distinct directory, the *job directory*. +These directories are located in the ``workspace`` subdirectory within the project directory and the directory name is the so-called *job id*. .. code-block:: bash @@ -94,9 +103,9 @@ These directories are located in the ``workspace`` sub-directory within the proj 71855b321a04dd9ee27ce6c9cc0436f4 # ... -The *job id* is a highly compact, unambiguous representation of the *full metadata*, *i.e.*, a distinct set of key-value pairs will always map to the same job id. +The *job id* is a highly compact, unambiguous representation of the full metadata, *i.e.*, a distinct set of key-value pairs will always map to the same job id. However, it can also be somewhat cryptic, especially for users who would like to browse the data directly on the file system. -Fortunately, you don't need to worry about this *internal representation* of the data space while you are actively working with the data. +Fortunately, you don't need to worry about this internal representation of the data space while you are actively working with the data. Instead, you can create a *linked view* with the ``signac view`` command: .. 
code-block:: bash

    ~/ideal_gas_project $ ls -d view/p/*
    view/p/1 view/p/2 view/p/3 view/p/4 view/p/5 view/p/6 view/p/7 view/p/8 view/p/9

-The linked view is **the most compact** representation of the data space in form of a nested directory structure.
-*Most compact* means in this case, that **signac** detected that the values for *kT* and *N* are constant across all jobs and are therefore safely omitted.
-It is designed to provide a human-readable representation of the metadata in the form of a nested directory structure.
-Each directory contains a ``job`` directory, which is a symbolic link to the actual workspace directory.
+Views are designed to provide a human-readable representation of the metadata in the form of a nested directory structure.
+The directory hierarchy is composed of a sequence of nested ``key/value`` subdirectories such that the entire metadata associated with a job is encoded in the full path to the view directory.
+Each leaf node in the directory tree contains a ``job`` directory, which is a symbolic link to the actual workspace directory:
+
+.. code-block:: bash
+
+    ~/ideal_gas_project $ ls view/p/1
+    job
+
+To minimize the directory tree depth, the linked view constructed is the most compact representation of the data space, in the sense that any parameters that do not vary across the entire data space are omitted from the directory tree.
+In our example, **signac** detected that the values for *kT* and *N* are constant across all jobs and therefore omitted creating nested subdirectories for them.

.. note::

    Make sure to update the view paths by executing the ``$ signac view`` command (or equivalently with the :py:meth:`~signac.Project.create_linked_view` method) every time you add or remove jobs from your data space.
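The "most compact" rule described in the added text can be sketched in a few lines of plain Python. This is purely illustrative and not signac's actual view-building code; the function name and layout are invented for the sketch — any key whose value is constant across all state points is dropped from the view path:

```python
# Illustrative sketch only -- not signac's implementation. Build the
# nested key/value view path for one state point, omitting any key whose
# value does not vary across the whole data space.
def view_path(statepoint, all_statepoints):
    # Keep only keys that take more than one distinct value.
    varying = {
        key for key in statepoint
        if len({sp[key] for sp in all_statepoints}) > 1
    }
    parts = []
    for key in sorted(varying):
        parts.extend([key, str(statepoint[key])])
    return "/".join(parts)

statepoints = [{"p": p, "kT": 1.0, "N": 1000} for p in range(1, 10)]
print(view_path(statepoints[0], statepoints))  # p/1
```

Here only *p* varies, so the sketch reproduces the flat ``view/p/1`` … ``view/p/9`` layout shown above.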
-
 Interacting with the **signac** project
 ---------------------------------------
 
 You interact with the **signac** project on the command line using the ``signac`` command.
 You can also interact with the project within Python *via* the :py:class:`signac.Project` class.
-You can obtain an instance of that class within the project root directory and all sub-directories with:
+You can obtain an instance of that class within the project root directory and all subdirectories with:
 
 .. code-block:: pycon
 
 .. tip::
 
-    You can use the ``$ signac shell`` command to launch a Python interpreter with ``signac`` already imported
-    as well as depending on the current working directory, with variables ``project`` and ``job`` set to
-    :py:func:`~signac.get_project()` and :py:func:`~signac.get_job()` respectively.
+    You can use the ``$ signac shell`` command to launch a Python interpreter with ``signac`` already imported.
+    If this command is executed within a project directory or a job directory, additional variables such as
+    ``project`` and ``job`` will be set to :py:func:`~signac.get_project()` and :py:func:`~signac.get_job()` respectively.
 
 Iterating through all jobs within the data space is then as easy as:
 
 .. code-block:: pycon
 
     15e548a2d943845b33030e68801bd125
     2b985fa90138327bef586f9ad87fc310
     71855b321a04dd9ee27ce6c9cc0436f4
     # ...
 
-We can iterate through a select set of jobs with the :py:meth:`~signac.Project.find_jobs` method in combination with a query expression:
+To iterate over a subset of jobs, use the :py:meth:`~signac.Project.find_jobs` method in combination with a query expression:

..
code-block:: pycon @@ -159,7 +174,7 @@ We can iterate through a select set of jobs with the :py:meth:`~signac.Project.f >>> In this example we selected all jobs, where the value for :math:`kT` is equal to 1.0 -- which would be all of them -- and where the value for :math:`p` is less than 3.0. -The equivalent selection on the command line would be achieved with ``$ signac find kT 1.0 p.\$lt 3.0``. +The equivalent selection would be achieved on the command line with ``$ signac find kT 1.0 p.\$lt 3.0``. See the detailed :ref:`query` documentation for more information on how to find and select specific jobs. .. note:: @@ -170,10 +185,10 @@ Operating on the data space --------------------------- Each job represents a data set associated with specific metadata. -The point is to generate data which is a **function** of that metadata. -Within the framework's language, such a function is called a *data space operation*. +The point is to generate data which is a function of that metadata. +Within the framework's language, such a function is called an *operation*. -Coming back to our example, we could implement a very simple operation that calculates the volume :math:`V` as a function of our metadata like this: +Coming back to our example, a very simple operation that calculates the volume :math:`V` might look like this: .. code-block:: python @@ -213,7 +228,7 @@ Implementing a simple workflow ------------------------------ In many cases, it is desirable to avoid the repeat execution of data space operations, especially if they are not `idempotent `_ or are significantly more expensive than our simple example. -For this, we will incoporate the ``compute_volume()`` function into a workflow using the package ``signac-flow`` and its :class:`~.flow.FlowProject` class. +For this, we will incorporate the ``compute_volume()`` function into a workflow using the package ``signac-flow`` and its :class:`~.flow.FlowProject` class. We slightly modify our ``project.py`` script: .. 
code-block:: python @@ -222,7 +237,11 @@ We slightly modify our ``project.py`` script: from flow import FlowProject - @FlowProject.operation + class Project(FlowProject): + pass + + + @Project.operation def compute_volume(job): volume = job.sp.N * job.sp.kT / job.sp.p with open(job.fn("volume.txt"), "w") as file: @@ -230,43 +249,45 @@ We slightly modify our ``project.py`` script: if __name__ == "__main__": - FlowProject().main() + Project().main() The :py:meth:`~.flow.FlowProject.operation` decorator identifies the ``compute_volume`` function as an *operation function* of our project. Furthermore, it is now directly executable from the command line via an interface provided by the :py:meth:`~flow.FlowProject.main` method. +Note that we created a (trivial) subclass of ``FlowProject`` rather than using ``FlowProject`` directly. +Operations are associated with a class, not an instance, so encapsulating distinct workflows into separate classes is a good organizational best practice. -We can then execute all operations defined within the project with: +We can now execute all operations defined within the project with: .. code-block:: bash ~/ideal_gas_project $ python project.py run + Using environment configuration: StandardEnvironment + WARNING:flow.project:Operation 'compute_volume' has no postconditions! -However, if you execute this in your own terminal, you might have noticed a warning message printed out at the end, that looks like: - -.. code-block:: none - - WARNING:flow.project:Operation 'compute_volume' has no post-conditions! - -That is because by default, the ``run`` command will continue to execute all defined operations until they are considered *completed*. -An operation is considered completed when all its *post conditions* are met, and it is up to the user to define those post conditions. -Since we have not defined any post conditions yet, **signac** would continue to execute the same operation indefinitely. 
+We'll come back to discussing :ref:`environments ` later. +The warning indicates that the ``run`` command will continue to execute all defined operations until they are considered completed. +An operation is considered completed when all its *postconditions* are met, and it is up to the user to define those postconditions. +Since we have not defined any postconditions yet, **signac** would continue to execute the same operation indefinitely. -For this example, a good post condition would be the existence of the ``volume.txt`` file. +For this example, a good postcondition would be the existence of the ``volume.txt`` file. To tell the :py:class:`~.flow.FlowProject` class when an operation is *completed*, we can modify the above example by adding a function that defines this condition: .. code-block:: python # project.py from flow import FlowProject - import os + + + class Project(FlowProject): + pass def volume_computed(job): return job.isfile("volume.txt") - @FlowProject.post(volume_computed) - @FlowProject.operation + @Project.post(volume_computed) + @Project.operation def compute_volume(job): volume = job.sp.N * job.sp.kT / job.sp.p with open(job.fn("volume.txt"), "w") as file: @@ -274,7 +295,8 @@ To tell the :py:class:`~.flow.FlowProject` class when an operation is *completed if __name__ == "__main__": - FlowProject().main() + Project().main() + .. tip:: @@ -287,13 +309,13 @@ This should now return without any message because all operations have already b .. note:: - To simply, execute a specific operation from the command line ignoring all logic, use the ``exec`` command, *e.g.*: ``$ python project.py exec compute_volume``. + To simply execute a specific operation from the command line ignoring all logic, use the ``exec`` command, *e.g.*: ``$ python project.py exec compute_volume``. This command (as well as the run command) also accepts jobs as arguments, so you can specify that you only want to run operations for a specific set of jobs. 
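The run-until-complete behavior described in the added text can be illustrated with a small standalone sketch. This is not signac-flow's implementation — the names and the dispatch loop are invented for the example — it only demonstrates the idea that an operation is re-run while its postcondition is unmet:

```python
# Hedged sketch of "run until completed" (illustrative only, not
# signac-flow internals). Each operation is paired with a postcondition;
# run() keeps executing an operation while its postcondition is False.
import os
import tempfile

def compute_volume(jobdir):
    # Stand-in operation: write a result file into the job directory.
    with open(os.path.join(jobdir, "volume.txt"), "w") as f:
        f.write("1000.0\n")

def volume_computed(jobdir):
    # Stand-in postcondition: the result file exists.
    return os.path.isfile(os.path.join(jobdir, "volume.txt"))

def run(operations, jobdir):
    for op, post in operations:
        while not post(jobdir):
            op(jobdir)

with tempfile.TemporaryDirectory() as jobdir:
    run([(compute_volume, volume_computed)], jobdir)
    print(volume_computed(jobdir))  # True
```

Without the postcondition the ``while`` loop would never terminate — which is exactly why the real ``run`` command warns about operations that lack postconditions.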
Extending the workflow ---------------------- -So far we learned how to define and implement *data space operations* and how to define simple post conditions to control the execution of said operations. +So far we learned how to define and implement operations and how to define simple postconditions to control the execution of said operations. In the next step, we will learn how to integrate multiple operations into a cohesive workflow. First, let's verify that the volume has actually been computed for all jobs. @@ -301,11 +323,9 @@ For this we transform the ``volume_computed()`` function into a *label function* .. code-block:: python - # project.py - from flow import FlowProject - + # ... - @FlowProject.label + @Project.label def volume_computed(job): return job.isfile("volume.txt") @@ -317,16 +337,24 @@ We can then view the project's status with the ``status`` command: .. code-block:: bash ~/ideal_gas_project $ python project.py status - Generate output... + Using environment configuration: StandardEnvironment + Fetching status: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 27941.33it/s] + Fetching labels: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 58344.26it/s] + + Overview: 9 jobs/aggregates, 0 jobs/aggregates with eligible operations. 
+
+    label            ratio
+    ---------------  --------------------------------------------------------
+    volume_computed  |████████████████████████████████████████| 9/9 (100.00%)
+
+    operation/group
+    -----------------
 
-    Status project 'ideal-gas-project':
-    Total # of jobs: 10
 
-    label            progress
-    ---------------  --------------------------------------------------
-    volume_computed  |########################################| 100.00%
+    [U]:unknown [R]:registered [I]:inactive [S]:submitted [H]:held [Q]:queued [A]:active [E]:error [GR]:group_registered [GI]:group_inactive [GS]:group_submitted [GH]:group_held [GQ]:group_queued [GA]:group_active [GE]:group_error
 
-That means that there is a ``volume.txt`` file in each and every job workspace directory.
+The labels section shows that 9/9 jobs have the ``volume_computed`` label, meaning that there is a ``volume.txt`` file in each and every job directory.
 
 Let's assume that instead of storing the volume in a text file, we wanted to store it in a `JSON`_ file called ``data.json``.
 Since we are pretending that computing the volume is an expensive operation, we will implement a second operation that copies the result stored in the ``volume.txt`` file into the ``data.json`` file instead of recomputing it:
 
 .. code-block:: python
 
     # ...
 
-
-    @FlowProject.pre(volume_computed)
-    @FlowProject.post.isfile("data.json")
-    @FlowProject.operation
+    @Project.pre(volume_computed)
+    @Project.post.isfile("data.json")
+    @Project.operation
     def store_volume_in_json_file(job):
         with open(job.fn("volume.txt")) as textfile:
             data = {"volume": float(textfile.read())}
         with open(job.fn("data.json"), "w") as jsonfile:
             json.dump(data, jsonfile)
-
-    # ...
-Here we reused the ``volume_computed`` condition function as a **pre-condition** and took advantage of the ``post.isfile`` short-cut function to define the post-condition for this operation function.
+Here we reused the ``volume_computed`` condition function as a **precondition** and took advantage of the ``post.isfile`` function to define the postcondition for this operation function. .. important:: - An operation function is **eligible** for execution if all pre-conditions are met, at least one post-condition is not met and the operation is not currently submitted or running. + An operation function is **eligible** for execution if all preconditions are met, at least one postcondition is not met and the operation is not currently submitted or running. Next, instead of running this new function for all jobs, let's test it for one job first. .. code-block:: bash ~/ideal_gas_project $ python project.py run -n 1 - Execute operation 'store_volume_in_json_file(742c883cbee8e417bbb236d40aea9543)'... + Using environment configuration: StandardEnvironment + WARNING:flow.project:Reached the maximum number of operations that can be executed, but there are still eligible operations. We can verify the output with: @@ -378,36 +405,25 @@ Since that seems right, we can then store all other volumes in the respective `` .. tip:: - We could further simplify our workflow definition by replacing the ``pre(volume_computed)`` condition with ``pre.after(compute_volume)``, which is a short-cut to reuse all of ``compute_volume()``'s post-conditions as pre-conditions for the ``store_volume_in_json_file()`` operation. + We could further simplify our workflow definition by replacing the ``pre(volume_computed)`` condition with ``pre.after(compute_volume)``, which is a shortcut to reuse all of ``compute_volume()``'s postconditions as preconditions for the ``store_volume_in_json_file()`` operation. Grouping Operations ------------------- -If we wanted to submit :code:`compute_volume` and -:code:`store_volume_in_json_file` together to run in series, we currently couldn't, even though we -know that :code:`store_volume_in_json_file` can run immediately after -:code:`compute_volume`. 
With the :py:class:`FlowGroup` class, we can group the -two operations together and submit any job that is ready to run -:code:`compute_volume`. To do this, we create a group and decorate the operations -with it. +If we wanted to submit :code:`compute_volume` and :code:`store_volume_in_json_file` together to run in series, we currently couldn't, even though we know that :code:`store_volume_in_json_file` can run immediately after :code:`compute_volume`. +With the :py:class:`FlowGroup` class, we can group the two operations together and submit any job that is ready to run :code:`compute_volume`. +To do this, we create a group and decorate the operations with it. .. code-block:: python - # project.py - from flow import FlowProject - import json - - volume_group = FlowProject.make_group(name="volume") - + # ... - @FlowProject.label - def volume_computed(job): - return job.isfile("volume.txt") + volume_group = Project.make_group(name="volume") @volume_group - @FlowProject.post(volume_computed) - @FlowProject.operation + @Project.post(volume_computed) + @Project.operation def compute_volume(job): volume = job.sp.N * job.sp.kT / job.sp.p with open(job.fn("volume.txt"), "w") as file: @@ -415,45 +431,42 @@ with it. @volume_group - @FlowProject.pre(volume_computed) - @FlowProject.post.isfile("data.json") - @FlowProject.operation + @Project.pre(volume_computed) + @Project.post.isfile("data.json") + @Project.operation def store_volume_in_json_file(job): with open(job.fn("volume.txt")) as textfile: data = {"volume": float(textfile.read())} with open(job.fn("data.json"), "w") as jsonfile: json.dump(data, jsonfile) + Project().main() + # ... - if __name__ == "__main__": - FlowProject().main() -We can now run :code:`python project.py run -o volume` or -:code:`python project.py submit -o volume` to run or submit both operations. +We can now run :code:`python project.py run -o volume` to run both operations. 
The job document ---------------- -Storing results in JSON format -- as shown in the previous section -- is good practice because the JSON format is an open, human-readable format, and parsers are readily available in a wide range of languages. -Because of this, **signac** stores all metadata in JSON files and in addition comes with a built-in JSON-storage container for each job (see: :ref:`project-job-document`). - -Let's add another operation to our ``project.py`` script that stores the volume in the *job document*: +Storing results in JSON files is good practice because JSON is an open, human-readable format, and parsers are readily available in a wide range of languages. +**signac** stores all metadata in JSON files. +In addition, each job supports storing data in a separate JSON file called the :ref:`job document `. +Let's add another operation to our ``project.py`` script that stores the volume in the job document: .. code-block:: python - # project.py # ... - - @FlowProject.pre.after(compute_volume) - @FlowProject.post(lambda job: "volume" in job.document) - @FlowProject.operation + @Project.pre.after(compute_volume) + @Project.post(lambda job: "volume" in job.document) + @Project.operation def store_volume_in_document(job): with open(job.fn("volume.txt")) as textfile: job.document.volume = float(textfile.read()) -Besides needing fewer lines of code, storing data in the *job document* has one more distinct advantage: it is directly searchable. -That means that we can find and select jobs based on its content. +Besides needing fewer lines of code, storing data in the job document has one more distinct advantage: it is directly searchable. +That means that we can find and select jobs through the signac API (or CLI) based on the contents of their documents. Executing the ``$ python project.py run`` command after adding the above function to the ``project.py`` script will store all volume in the job documents. 
We can then inspect all *searchable* data with the ``$ signac find`` command in combination with the ``--show`` option:

.. code-block:: bash

    {'volume': 250.0}
    # ...

-When executed with ``--show``, the ``find`` command not only prints the *job id*, but also the metadata and the document for each job.
-In addition to selecting by metadata as shown earlier, we can also find and select jobs by their *job document* content, *e.g.*:
+When executed with ``--show``, the ``find`` command not only prints the job id, but also the metadata and the document for each job.
+In addition to selecting by metadata as shown earlier, we can also find and select jobs by their job document content, *e.g.*:

.. code-block:: bash

Job.data and Job.stores
-----------------------

-The job :py:attr:`~signac.contrib.job.Job.data` attribute provides a dict-like interface to an HDF5-file, which is designed to store large numerical data, such as numpy arrays.
-
-For example:
+The job document is useful for storing small sets of numerical values or textual data.
+Text files like JSON are generally unsuitable for large numerical data, however, due to issues with floating point precision as well as sheer file size.
+To support storing such data with **signac**, the job :py:attr:`~signac.contrib.job.Job.data` attribute provides a dict-like interface to an HDF5 file, a much more suitable format for storing large numerical data such as numpy arrays.

.. code-block:: python

    job.data.my_array = numpy.zeros((64, 32))

You can use the ``data`` attribute to store built-in types, numpy arrays, and pandas dataframes.
-The ``job.data`` property is a short-cut for ``job.stores['signac_data']``, you can access many different data stores by providing your own name, e.g., ``job.stores.my_data``.
+The ``job.data`` property is a shortcut for ``job.stores['signac_data']``; you can access many different data stores by providing your own name, e.g., ``job.stores.my_data``.

See :ref:`project-job-data` for an in-depth discussion.


Submit operations to a scheduling system
========================================

In addition to executing operations directly on the command line, **signac** can also submit operations to a scheduler such as SLURM_.
-The submit command will generate and submit a script containing the operations to run and relevant scheduler directives such as the number of processors to request.
+The ``submit`` command will generate and submit a script containing the operations to run along with relevant scheduler directives such as the number of processors to request.
**signac** also keeps track of submitted operations and workflow progress, which almost completely automates the submission process and prevents the accidental repeated submission of operations.

.. _SLURM: https://slurm.schedmd.com/

As an example, we could submit the operation ``compute_volume`` to the cluster.

``$ python project.py submit -o compute_volume -n 1 -w 1.5``

-This command submits to the cluster for the next available job (because we specified ``-n 1``), which is submitted with a walltime of 1.5 hours.
-We can use the ``--pretend`` option to output the text of the submission document.
+This command submits the next available job to the cluster with a walltime of 1.5 hours (only one job because we specified ``-n 1``).
+To inspect the submission script before submitting, use the ``--pretend`` option to print the script to the console.
Here is some sample output for Stampede2, a SLURM-based queuing system:

..
code-block:: bash From 4ee981a959b92dc3bec0d0637bc39360e41f36c0 Mon Sep 17 00:00:00 2001 From: Vyas Ramasubramani Date: Sun, 19 Mar 2023 17:50:39 -0400 Subject: [PATCH 07/31] projects --- docs/source/projects.rst | 70 +++++++++++++++++----------------------- 1 file changed, 29 insertions(+), 41 deletions(-) diff --git a/docs/source/projects.rst b/docs/source/projects.rst index 16abb994..09269002 100644 --- a/docs/source/projects.rst +++ b/docs/source/projects.rst @@ -34,41 +34,32 @@ The project interface provides simple and consistent access to the project's und .. [#f1] You can access a project interface from other locations by explicitly specifying the root directory. -To initialize a project, simply execute ``$ signac init `` on the command line inside the desired project directory (create a new project directory if needed). -For example, to initialize a **signac** project named *MyProject* in a directory called ``my_project``, execute: +To initialize a project, simply execute ``$ signac init`` on the command line inside the desired project directory (create a new project directory if needed). +For example, to initialize a **signac** project in a directory called ``my_project``, execute: .. code-block:: bash $ mkdir my_project $ cd my_project - $ signac init MyProject + $ signac init You can alternatively initialize your project within Python with the :py:func:`~signac.init_project` function: .. code-block:: pycon - >>> project = signac.init_project("MyProject") + >>> project = signac.init_project() -This will create a configuration file which contains the name of the project. -The directory that contains this configuration file is the project's root directory. +This will create a ``.signac`` directory with a configuration file. +The directory containing the ``.signac`` subdirectory is the project's root directory. .. _project-data-space: The Data Space ============== -The project data space is stored in the *workspace directory*. 
-By default this is a sub-directory within the project's root directory named *workspace*. +The project data space is stored in the *workspace directory*, a subdirectory within the project's root directory named ``workspace``. Once a project has been initialized, any data inserted into the data space will be stored within this directory. -This association is not permanent; a project can be reassociated with a new workspace at any time, and it may at times be beneficial to maintain multiple separate workspaces for a single project. -You can access your signac :py:class:`~signac.Project` and the associated *data space* from within your project's root directory or any subdirectory from the command line: - -.. code-block:: shell - - $ signac project - MyProject - -Or with the :py:func:`~signac.get_project` function: +You can access your signac :py:class:`~signac.Project` and the associated data from within your project's root directory or any subdirectory with the :py:func:`~signac.get_project` function: .. code-block:: pycon @@ -81,22 +72,20 @@ Or with the :py:func:`~signac.get_project` function: .. _project-jobs: -.. currentmodule:: signac.contrib.job +.. currentmodule:: signac.job Jobs ---- -The central assumption of the **signac** data model is that the *data space* is divisible into individual data points, consisting of data and metadata, which are uniquely addressable in some manner. -Specifically, the workspace is divided into sub-directories, where each directory corresponds to exactly one :py:class:`Job`. -Each job has a unique address, which is referred to as a *state point*. +The central assumption of the **signac** data model is that the data space is divisible into individual data points consisting of data and metadata that are uniquely addressable in some manner. +Specifically, the workspace is divided into subdirectories where each directory corresponds to exactly one :py:class:`Job`. 
+All data associated with a job is contained in the corresponding *job workspace* directory. A job can consist of any type of data, ranging from a single value to multiple terabytes of simulation data; **signac**'s only requirement is that this data can be encoded in a file. +Each job is uniquely addressable via its *state point*, a key-value mapping describing its data. +There can never be two jobs that share the same state point within the same project. +All data associated with your job should be a unique function of the state point, e.g., the parameters that go into your physics or machine learning model. -A job is essentially just a directory on the file system, which is part of a *project workspace*. -That directory is called the *job workspace* and contains **all data** associated with that particular job. - -You access a job by providing a *state point*, which is a unique key-value mapping describing your data. -All data associated with your job should be a unique function of the *state point*, e.g., the parameters that go into your physics or machine learning model. -For example, to store data associated with particular temperature or pressure of a simulation, you would first initialize a project, and then *open* a job like this: +For example, to store data associated with a particular temperature or pressure of a simulation, you would first initialize a project and then *open* a job like this: .. code-block:: python @@ -109,15 +98,14 @@ For example, to store data associated with particular temperature or pressure of .. tip:: You only need to call the :meth:`Job.init` function the first time that you are accessing a job. - Furthermore, the :meth:`Job.init` function returns itself, so you can abbreviate like this: + Furthermore, the :meth:`Job.init` function returns the job itself, so you can abbreviate like this: ..
code-block:: python job = project.open_job({"temperature": 20, "pressure": 1.0}).init() -The job *state point* represents a **unique address** of your data within one project. -There can never be two jobs that share the same *state point* within the same project. -Any other kind of data and metadata that describe your job, but do not represent a unique address should be stored within the :attr:`Job.doc`, which has the exact same interface like the :attr:`Job.sp`, but does not represent a unique address of the job. +The uniqueness of a state point should be familiar to anyone who has used a relational database: the state point parameters constitute the primary key of the data space. +Any other kind of data and metadata that describe a job but are not part of the key should be stored within the :attr:`Job.doc`, which has the exact same interface as the :attr:`Job.sp`. .. tip:: @@ -172,7 +160,7 @@ Grouping Grouping operations can be performed on the complete project data space or the results of search queries, enabling aggregated analysis of multiple jobs and state points. The return value of the :py:meth:`~Project.find_jobs()` method is a cursor that we can use to iterate over all jobs (or all jobs matching an optional filter if one is specified). -This cursor is an instance of :py:class:`~signac.contrib.project.JobsCursor` and allows us to group these jobs by state point parameters, the job document values, or even arbitrary functions. +This cursor is an instance of :py:class:`~signac.project.JobsCursor` and allows us to group these jobs by state point parameters, the job document values, or even arbitrary functions. .. note:: @@ -202,7 +190,7 @@ Similarly, we can group by values in the job document as well. Here, we group al Grouping by Multiple Keys ^^^^^^^^^^^^^^^^^^^^^^^^^ -Grouping by multiple state point parameters or job document values is possible, by passing an iterable of fields that should be used for grouping.
+To group by multiple state point parameters or job document values, pass an iterable of fields that should be used for grouping. For example, we can group jobs by state point parameters *c* and *d*: .. code-block:: python @@ -244,7 +232,7 @@ Moving, Copying and Removal --------------------------- In some cases it may desirable to divide or merge a project data space. -To **move** a job to a different project, use the :py:meth:`~signac.contrib.job.Job.move` method: +To **move** a job to a different project, use the :py:meth:`~signac.job.Job.move` method: .. code-block:: python @@ -262,15 +250,15 @@ To **move** a job to a different project, use the :py:meth:`~signac.contrib.job. for job in jobs_to_copy: project.clone(job) -Trying to move or copy a job to a project which has already an initialized job with the same *state point*, will trigger a :py:class:`~signac.errors.DestinationExistsError`. +Trying to move or copy a job to a project that already has an initialized job with the same state point will trigger a :py:class:`~signac.errors.DestinationExistsError`. .. warning:: While **moving** is a cheap renaming operation, **copying** may be much more expensive since all of the job's data will be copied from one workspace into the other. -To **clear** all data associated with a specific job, call the :py:meth:`~signac.contrib.job.Job.clear` method.
+Note that this function will do nothing if the job is uninitialized; the :py:meth:`~signac.job.Job.reset` method will also clear all data associated with a job, but it will additionally initialize the job if it was not originally initialized. +To **permanently delete** a job and its contents use the :py:meth:`~signac.job.Job.remove` method: .. code-block:: python @@ -330,7 +318,7 @@ To access data through a functional interface: ... x = project.data.get("x")[:] ... -.. currentmodule:: signac.contrib.job +.. currentmodule:: signac.job In addition, **signac** also provides the :py:meth:`signac.Project.fn` method, which is analogous to the :py:meth:`Job.fn` method described above: @@ -495,7 +483,7 @@ To create views from the command line, use the ``$ signac view`` command. When the project data space is changed by adding or removing jobs, simply update the view, by executing :py:meth:`~signac.Project.create_linked_view` or ``$ signac view`` for the same view directory again. -You can limit the *linked view* to a specific data subset by providing a set of *job ids* to the :py:meth:`~signac.Project.create_linked_view` method. +You can limit the linked view to a specific data subset by providing a set of job ids to the :py:meth:`~signac.Project.create_linked_view` method. This works similarly for ``$ signac view`` on the command line, but here you can also specify a filter directly: .. code-block:: bash @@ -518,7 +506,7 @@ Users who are familiar with ``rsync`` will recognize that most of the core funct As an example, let's assume that we have a project stored locally in the path ``/data/my_project`` and want to synchronize it with ``/remote/my_project``. We would first change into the root directory of the project that we want to synchronize data into. -Then we would call ``signac sync`` with the path of the project that we want to *synchronize with*: +Then we would call ``signac sync`` with the path of the project that we want to synchronize with: ..
code-block:: bash From 7b94c749038505c6ba2fdcac69b90131e64160db Mon Sep 17 00:00:00 2001 From: Vyas Ramasubramani Date: Sun, 19 Mar 2023 18:10:48 -0400 Subject: [PATCH 08/31] job --- docs/source/jobs.rst | 15 +++++++-------- docs/source/tutorial.rst | 2 -- 2 files changed, 7 insertions(+), 10 deletions(-) diff --git a/docs/source/jobs.rst b/docs/source/jobs.rst index c42deb32..752c9836 100644 --- a/docs/source/jobs.rst +++ b/docs/source/jobs.rst @@ -4,7 +4,7 @@ Jobs ==== -.. currentmodule:: signac.contrib.job +.. currentmodule:: signac.job Overview ======== @@ -74,12 +74,10 @@ You can initialize a job **explicitly**, by calling the :py:meth:`Job.init` meth .. code-block:: pycon - >>> job = project.open_job({"a": 2}) - # Job does not exist yet + >>> job = project.open_job({"a": 2}) # Job does not exist yet >>> job in project False - >>> job.init() - # Job now exists + >>> job.init() # Job now exists >>> job in project True @@ -188,15 +186,16 @@ You can modify **nested** *state points* in-place, but you will need to use dict modifiable copy that will not modify the underlying JSON file, you can access a dict copy of the statepoint by calling it, e.g. ``sp_dict = job.statepoint()`` instead of ``sp = job.statepoint``. - For more information, see :class:`~signac.JSONDict`. + For more information, see :class:`signac.JSONDict`. .. _project-job-document: + The Job Document ================ -In addition to the state point, additional metadata can be associated with your job in the form of simple key-value pairs using the job :py:attr:`~Job.document`. +In addition to the state point, additional metadata can be associated with your job in the form of simple key-value pairs using the job :attr:`Job.document`. This *job document* is automatically stored in the job's workspace directory in `JSON`_ format. You can access it via the :py:attr:`Job.document` or the :py:attr:`Job.doc` attribute.
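As an editorial illustration of the persistence model described in this hunk, a dict-like object whose assignments are written straight back to a JSON file can be sketched with the standard library alone. This is a simplified toy, not signac's implementation, which additionally supports nested keys, attribute access, and buffering:

```python
import json
from pathlib import Path


class MiniDocument:
    """Toy stand-in for the job document: a mapping persisted as JSON."""

    def __init__(self, filename):
        self._path = Path(filename)

    def _load(self):
        # Reload from disk on every access so the view is always current.
        if self._path.exists():
            return json.loads(self._path.read_text())
        return {}

    def __getitem__(self, key):
        return self._load()[key]

    def __setitem__(self, key, value):
        # Every assignment immediately rewrites the backing JSON file.
        data = self._load()
        data[key] = value
        self._path.write_text(json.dumps(data, sort_keys=True))

    def get(self, key, default=None):
        return self._load().get(key, default)
```

A real job document behaves analogously: an assignment such as ``job.doc['volume'] = 250.0`` is immediately reflected in the JSON document file inside the job's workspace directory.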
@@ -246,7 +245,7 @@ Like the :py:attr:`Job.document`, this information can be accessed using key-val Unlike the :py:attr:`Job.document`, :attr:`Job.data` is not searchable. Data written with :py:attr:`Job.data` is stored in a file named ``signac_data.h5`` in the associated job folder. -Data written with :py:attr:`Job.stores['key_name']` is stored in a file named ``key_name.h5``. +Data written with ``Job.stores['key_name']`` is stored in a file named ``key_name.h5``. For cases where job-associated data may be accessed from multiple sources at the same time or other instances where multiple files may be preferred to one large file, :py:attr:`Job.stores` should be used instead of :py:attr:`Job.data`. This section will focus on examples and usage of :py:attr:`Job.data`. Further discussion of :py:attr:`Job.stores` is provided in the following topic, :ref:`Job Stores `. diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst index 487ca392..fec205e8 100644 --- a/docs/source/tutorial.rst +++ b/docs/source/tutorial.rst @@ -3,8 +3,6 @@ ======== Tutorial ======== -######## IMPORTANT ####### -All of the example outputs on Stampede2 need to be refreshed since our output format has changed substantially over time. .. sidebar:: License From 919126b341ef3b3b1a23e6b5c8314c028230800a Mon Sep 17 00:00:00 2001 From: Vyas Ramasubramani Date: Sun, 19 Mar 2023 18:16:10 -0400 Subject: [PATCH 09/31] querying --- docs/source/query.rst | 15 +++------------ 1 file changed, 3 insertions(+), 12 deletions(-) diff --git a/docs/source/query.rst b/docs/source/query.rst index df019ff3..90a84b09 100644 --- a/docs/source/query.rst +++ b/docs/source/query.rst @@ -5,10 +5,7 @@ Query API ========= As briefly described in :ref:`project-job-finding`, the :py:meth:`~signac.Project.find_jobs()` method provides much more powerful search functionality beyond simple selection of jobs with specific state point values. 
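The filters accepted by ``find_jobs`` follow MongoDB-style matching semantics, including dotted keys such as ``doc.b`` and operator expressions such as ``{'$gt': 2}`` covered in this chapter. As a rough sketch of those semantics (illustrative only, not signac's actual implementation, which is considerably more general), a minimal matcher might look like this:

```python
import operator

# Arithmetic operators in the style of the query API's "$"-expressions.
OPS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(doc, filter_expr):
    """Return True if `doc` satisfies every clause of `filter_expr`."""
    for key, expected in filter_expr.items():
        # Dotted keys like "doc.b" descend into nested mappings.
        value = doc
        for part in key.split("."):
            if not isinstance(value, dict) or part not in value:
                return False
            value = value[part]
        if isinstance(expected, dict):
            # Treat a dict value as an operator expression, e.g. {"$gt": 2}.
            if not all(OPS[op](value, ref) for op, ref in expected.items()):
                return False
        elif value != expected:
            return False
    return True
```

With such a matcher, ``matches({'a': 'foo', 'doc': {'b': 'bar'}}, {'a': 'foo', 'doc.b': 'bar'})`` is true, mirroring the namespaced filters shown earlier in this chapter.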
-More generally, all **find()** functions within the framework accept filter arguments that will return a selection of jobs or documents. One of the key features of **signac** is the possibility to immediately search managed data spaces to select desired subsets as needed. -Internally, all search operations are processed by an instance of :py:class:`~signac.Collection` (see :ref:`collections`). -Therefore, they all follow the same syntax, so you can use the same type of filter arguments in :py:meth:`~signac.Project.find_jobs`, :py:meth:`~signac.Project.find_statepoints`, and so on. .. note:: @@ -39,9 +36,6 @@ This means that the following query is equivalent to the one above: project.find_jobs({"a": "foo", "doc.b": "bar"}) -For backwards compatibility, some methods in **signac** such as :py:meth:`~signac.Project.find_jobs()` accept separate ``filter`` and ``doc_filter`` arguments, where keys in the ``doc_filter`` are implicitly prefixed with ``'doc.'`` (and state point prefixes in ``filter`` are implicit). -However, any combination of ``filter`` and ``doc_filter`` without prefixes can be represented by an appropriately namespaced ``filter``, and the unified approach with prefixes should be preferred. - Basic Expressions ================= @@ -61,16 +55,13 @@ Select All If you want to select the complete data set, don't provide any filter argument at all. The default argument of ``None`` or an empty expression ``{}`` will select all jobs or documents. -As was previously demonstrated, iterating over all jobs in a project or all documents in a collection can be accomplished directly without using any *find* method at all: +As was previously demonstrated, iterating over all jobs in a project can be accomplished directly: .. code-block:: python for job in project: pass - for doc in collection: - pass - .. 
_simple-selection: Simple Selection @@ -122,7 +113,7 @@ If we wanted to match all documents where *p is greater than 2*, we would use th {"p": {"$gt": 2}} -Note that we have replaced the value for p with the expression ``{'$gt': 2}`` to select *all all jobs withe p values greater than 2*. +Note that we have replaced the value for p with the expression ``{'$gt': 2}`` to select all jobs with p values greater than 2. Here is a complete list of all available **arithmetic operators**: * ``$eq``: equal to @@ -255,7 +246,7 @@ Simplified Syntax on the Command Line It is possible to use search expressions directly on the command line, for example in combination with the ``$ signac find`` command. In this case filter arguments are expected to be provided as valid JSON expressions. -However, for simple filters you can also use a *simplified syntax*. +However, for simple filters you can also use a simplified syntax in lieu of writing JSON. For example, instead of ``{'p': 2}``, you can simply type ``p 2``. A simplified expression consists of key-value pairs in alternation. From 2c409ffac766c98e6cd147f95e2188db84fd936c Mon Sep 17 00:00:00 2001 From: Vyas Ramasubramani Date: Sun, 19 Mar 2023 18:21:19 -0400 Subject: [PATCH 10/31] flowproject --- docs/source/flow-project.rst | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/docs/source/flow-project.rst b/docs/source/flow-project.rst index b7ccba4a..989d48da 100644 --- a/docs/source/flow-project.rst +++ b/docs/source/flow-project.rst @@ -52,9 +52,7 @@ Operations ========== It is highly recommended to divide individual modifications of your project's data space into distinct functions. - -An *operation* is defined as a function whose only positional argument is an instance of :py:class:`~signac.contrib.job.Job` (in the special case of :ref:`aggregate operations `, variable positional arguments ``*jobs`` are permitted).
- +In this context, an *operation* is defined as a function whose only positional arguments are instances of :py:class:`~signac.contrib.job.Job`. We will demonstrate this concept with a simple example. Let's initialize a project with a few jobs, by executing the following ``init.py`` script within a ``~/my_project`` directory: @@ -92,6 +90,11 @@ A very simple *operation*, which creates a file called ``hello.txt`` within a jo if __name__ == "__main__": MyProject().main() +.. tip: + + By default operations only act on a single job and can simply be defined with the signature `def op(job)`. + When using :ref:`aggregate operations `, it is recommended to define operations as accepting a variadic list of ``*jobs`` parameters so that the operation is not restricted to a specific aggregate size. + .. _conditions: @@ -151,7 +154,7 @@ The entirety of the code is as follows: for more information. We can define both :py:meth:`~flow.FlowProject.pre` and :py:meth:`~flow.FlowProject.post` conditions, which allow us to define arbitrary workflows as a `directed acyclic graph `__. -A operation is only executed if **all** pre-conditions are met, and at *at least one* post-condition is not met. +An operation is only executed if **all** preconditions are met and *at least one* postcondition is not met. These are added above a `~flow.FlowProject.operation` decorator. Using these decorators before declaring a function an operation is an error. @@ -229,7 +232,7 @@ The Project Status The :py:class:`~flow.FlowProject` class allows us to generate a **status** view of our project. The status view provides information about which conditions are met and what operations are pending execution. -A *label-function* is a condition function which will be shown in the **status** view. +A *label function* is a condition function which will be shown in the **status** view. We can convert any condition function into a label function by adding the :py:meth:`~.flow.FlowProject.label` decorator: ..
code-block:: python @@ -238,13 +241,13 @@ We can convert any condition function into a label function by adding the :py:me def greeted(job): return job.isfile("hello.txt") -We will reset the workflow for only a few jobs to get a more interesting *status* view: +We will reset the workflow for only a few jobs to get a more interesting status view: .. code-block:: bash ~/my_project $ signac find a.\$lt 5 | xargs -I{} rm workspace/{}/hello.txt -We then generate a *detailed* status view with: +We then generate a detailed status view with: .. code-block:: bash From 78ba0acf01b78ac04dc00e21e5fa4f0cfb73913e Mon Sep 17 00:00:00 2001 From: Vyas Ramasubramani Date: Sun, 19 Mar 2023 18:26:53 -0400 Subject: [PATCH 11/31] remove collection --- docs/source/collections.rst | 77 ------------------------------------- docs/source/index.rst | 1 - 2 files changed, 78 deletions(-) delete mode 100644 docs/source/collections.rst diff --git a/docs/source/collections.rst b/docs/source/collections.rst deleted file mode 100644 index 906ec249..00000000 --- a/docs/source/collections.rst +++ /dev/null @@ -1,77 +0,0 @@ -.. _collections: - -============ -Collections -============ - -An instance of :py:class:`~signac.Collection` is a *container* for multiple documents, where a document is an associative array of key-value pairs. -Examples are the job state point, or the job document. - -The :py:class:`~signac.Collection` class is used internally to manage and search data space indexes which are generated on-the-fly. -But you can also use such a container explicitly for managing document data. - - -Creating collections -==================== - -To create an empty collection, simply call the default constructor: - -.. code-block:: python - - from signac import Collection - - collection = Collection() - -You can then add documents with the :py:meth:`signac.Collection.insert_one` method. - -By default, the collection is stored purely in memory. 
-But you can use the :py:class:`signac.Collection` container also to manage collections **directly on disk**. -For this, simply *open* a file like this: - -.. code-block:: python - - with Collection.open("my_collection.txt") as collection: - pass - -A collection file by default is openend in *append plus* mode, that means it is opened for both reading and writing. -The :py:func:`~signac.Collection.open` function accepts all standard file open modes, such as `r` for *read-only*, etc. - -Large collections can also be stored in a compressed format using gzip for efficiency. -To use a compressed collection, simply pass in a compression level from 1-9 as a `compresslevel` argument to the :py:class:`signac.Collection` constructor: - -.. code-block:: python - - from signac import Collection - - collection = Collection(compresslevel=9) - - -Searching collections -===================== - - -To search a collection, use the :py:meth:`signac.Collection.find` method. -As an example, to search all documents where the value ``a`` is equal to 42, execute: - -.. code-block:: python - - for doc in collection.find({"a": 42}): - pass - -The :py:meth:`signac.Collection.find` method uses the framework-wide `query` API. - -Command Line Interface -====================== - -To manage and search a collection file directly from the command line, create a Python script with the following content: - -.. code-block:: python - - from signac import Collection - - with Collection.open("my_collection.txt") as c: - c.main() - -Storing the code above in a file called ``find.py`` and then executing it will allow you to search for all or specific documents within the collection, directly from the command line ``$ python find.py``. - -For more information on how to use the command line interface, execute: ``$ python find.py --help``. 
diff --git a/docs/source/index.rst index 6472935d..81809a4a 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -49,7 +49,6 @@ If you are new to **signac**, the best place to start is to read the :ref:`intro flow-group aggregation hooks - collections configuration recipes tips_and_tricks From 3f1d8e5bac69a15298f3ed0ddc9bba9af9b67a8f Mon Sep 17 00:00:00 2001 From: Vyas Ramasubramani Date: Sun, 19 Mar 2023 18:40:01 -0400 Subject: [PATCH 12/31] configuration --- docs/source/configuration.rst | 146 +++------------------------------ 1 file changed, 10 insertions(+), 136 deletions(-) diff --git a/docs/source/configuration.rst b/docs/source/configuration.rst index f858d81d..9b5b1c31 100644 --- a/docs/source/configuration.rst +++ b/docs/source/configuration.rst @@ -7,24 +7,12 @@ Configuration Overview ======== -The **signac** framework is configured with configuration files, which are named either ``.signacrc`` or ``signac.rc``. -These configuration files are searched for at multiple locations in the following order: +The **signac** framework is configured with configuration files. +The configuration files are stored using the standard `INI file format `__. +In general, two config files are supported: - 1. in the current working directory, - 2. in each directory above the current working directory until a project configuration file is found, - 3. and the user's home directory. -The configuration file follows the standard "ini-style". -Global configuration options, should be stored in the home directory, while project-specific options should be stored *locally* in a project configuration file. -This is an example for a global configuration file in the user's home directory: -.. code-block:: ini - # ~/.signarc - [hosts] - [[localhost]] - url = mongodb://localhost + 1. Project-specific configuration uses the ``.signac/config`` file at the project root directory. + 2. 
Per-user configuration is stored in a global file at ``$HOME/.signacrc``. You can either edit these configuration files manually, or execute ``signac config`` on the command line. Please see ``signac config --help`` for more information. @@ -32,128 +20,14 @@ Please see ``signac config --help`` for more information. Project configuration ===================== -A project configuration file is defined by containing the keyword *project*. -Once **signac** found a project configuration file it will stop to search for more configuration files above the current working directory. - -For example, to initialize a project named *MyProject*, navigate to the project's root directory and either execute ``$ signac init MyProject`` on the command line, use the :py:func:`signac.init_project` function or create the project configuration file manually. +A project configuration file is defined as a file named ``config`` contained within a ``.signac`` directory. +Functions like :py:func:`~signac.get_project` will search upwards from a provided directory until a project configuration is found to indicate the project root. This is an example for a project configuration file: .. code-block:: ini # signac.rc - project = MyProject - workspace_dir = $HOME/myproject/workspace - -project - The name is required for the identification of the project's root directory. - -workspace_dir - The path to your project's workspace, which defaults to ``$project_root_dir/workspace``. - Can be configured relative to the project's root directory or as absolute path and may contain environment variables. - - -Host configuration -================== - -The current version of **signac** supports MongoDB databases as a backend. -To use **signac** in combination with a MongoDB database, make sure to install ``pymongo``. - -Configuring a new host ----------------------- - -To configure a new MongoDB database host, create a new entry in the ``[hosts]`` section of the configuration file. 
-We can do so manually or by using the ``signac config host`` command. - -Assuming that we a have a MongoDB database reachable via *example.com*, which requires a username and a password for login, execute: - -.. code-block:: bash - - $ signac config host example mongodb://example.com -u johndoe -p - Configuring new host 'example'. - Password: - Configured host 'example': - [hosts] - [[example]] - url = mongodb://example.com - username = johndoe - auth_mechanism = SCRAM-SHA-1 - password = *** - -The name of the configured host (here: *example*) can be freely chosen. -You can omit the ``-p/--password`` argument, in which case the password will not be stored and you will prompted to enter it for each session. - -We can now connect to this host with: - -.. code-block:: pycon - - >>> import signac - >>> db = signac.get_database("mydatabase", hostname="example") - -The ``hostname`` argument defaults to the first configured host and can always be omitted if there is only one configured host. - -.. note:: - - To prevent unauthorized users from obtaining your login credentials, **signac** will update the configuration file permissions such that it is only readable by yourself. - - -Changing the password ---------------------- - -To change the password for a configured host, execute - -.. code-block:: bash - - $ signac host example --update-pw -p - -.. warning:: - - By default, any password set in this way will be **encrypted**. This means that the actual password is different from the one that you entered. - However, while it is practically impossible to guess what you entered, a stored password hash will give any intruder access to the database. - This means you need to **treat the hash like a password!** - -Copying a configuration ------------------------ - -In general, in order to copy a configuration from one machine to another, you can simply copy the ``.signacrc`` file as is. 
-If you only want to copy a single host configuration, you can either manually copy the associated section or use the ``signac config host`` command for export: - -.. code-block:: bash - - $ signac config host example > example_config.rc - -Then copy the ``example_config.rc`` file to the new machine and rename or append it to an existing ``.signacrc`` file. -For security reasons, any stored password is not directly copied in this way. -To copy the password, follow: - -.. code-block:: bash - - # Copy the password from the old machine: - johndoe@oldmachine $ signac config host example --show-pw - XXXX - # Enter it on the new machine: - johndoe@newmachine $ signac config host example -p - - -Manual host configuration -------------------------- - -You can configure one or multiple hosts in the ``[hosts]`` section, where each subsection header specifies the host's name. - -url - The url specifies the MongoDB host url, e.g. ``mongodb://localhost``. -authentication_method (default=none) - Specify the authentication method with the database, possible choices are: ``none`` or ``SCRAM-SHA-1``. -username - A username is required if you authenticate via ``SCRAM-SHA-1``. -password - The password to authenticate via ``SCRAM-SHA-1``. -db_auth (default=admin) - The database to authenticate with. -password_config - In case that you update, but not store your password, the configuration file will contain only meta hashing data, such as the salt. - This allows to authenticate by entering the password for each session, which is generally more secure than storing the actual password hash. - -.. warning:: + schema_version = 2 - **signac** will automatically change the file permissions of the configuration file to *user read-write only* in case that it contains authentication credentials. - In case that this fails, you can set the permissions manually, e.g., on UNIX-like operating systems with: ``chmod 600 ~/.signacrc``. 
+schema_version + Identifier for the current internal schema used by signac. This schema version determines internal details such as the location of configuration files or caches. From 0455a9e836db37b72707b2b4d842a2d0e96a4246 Mon Sep 17 00:00:00 2001 From: Vyas Ramasubramani Date: Sun, 19 Mar 2023 18:48:28 -0400 Subject: [PATCH 13/31] Add pre-commit config --- .pre-commit-config.yaml | 59 +++--------------------------------- docs/source/aggregation.rst | 2 +- docs/source/flow-project.rst | 4 +-- docs/source/index.rst | 2 +- docs/source/jobs.rst | 2 +- docs/source/query.rst | 4 +-- docs/source/recipes.rst | 2 +- docs/source/tutorial.rst | 6 ++-- 8 files changed, 16 insertions(+), 65 deletions(-) diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index ec3153f9..38fe7cc1 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -1,61 +1,12 @@ -ci: - autoupdate_schedule: quarterly - repos: - repo: https://github.com/pre-commit/pre-commit-hooks rev: 'v4.4.0' hooks: - - id: check-json - - id: check-merge-conflict - - id: check-yaml - - id: debug-statements - id: end-of-file-fixer - - id: mixed-line-ending - id: trailing-whitespace - - repo: https://github.com/asottile/pyupgrade - rev: 'v3.3.1' + - repo: https://github.com/pre-commit/pygrep-hooks + rev: 'v1.10.0' hooks: - - id: pyupgrade - args: - - --py36-plus - - repo: https://github.com/PyCQA/isort - rev: '5.12.0' - hooks: - - id: isort - - repo: https://github.com/psf/black - rev: '22.12.0' - hooks: - - id: black - - repo: https://github.com/PyCQA/flake8 - rev: '6.0.0' - hooks: - - id: flake8 - args: - - --max-line-length=100 - - repo: https://github.com/nbQA-dev/nbQA - rev: 1.6.0 - hooks: - - id: nbqa-black - - id: nbqa-pyupgrade - - repo: https://github.com/asottile/blacken-docs - rev: v1.12.1 - hooks: - - id: blacken-docs - - repo: https://github.com/FlamingTempura/bibtex-tidy - rev: 'v1.8.5' - hooks: - - id: bibtex-tidy - args: - - 
--omit=abstract,keywords,archiveprefix,mendeley-tags,pmid,eprint,arxivid - - --wrap=90 - - --curly - - --numeric - - --space=2 - - --align=0 - - --sort=key,type,author,-year - - --duplicates=key,doi,citation - - --strip-enclosing-braces - - --drop-all-caps - - --sort-fields=title,shorttitle,author,year,month,day,journal,booktitle,location,on,publisher,address,series,volume,number,pages,doi,isbn,issn,url,urldate,copyright,category,note,metadata - - --remove-empty-fields - - --no-remove-dupe-fields + - id: rst-backticks + - id: rst-directive-colons + - id: rst-inline-touching-normal diff --git a/docs/source/aggregation.rst b/docs/source/aggregation.rst index a0bff90d..f72bc942 100644 --- a/docs/source/aggregation.rst +++ b/docs/source/aggregation.rst @@ -12,7 +12,7 @@ This chapter provides information about passing aggregates of jobs to operation Definition ========== -An :py:class:`~flow.aggregator` creates generators of aggregates for use in operation functions via `FlowProject.operation`. +An :py:class:`~flow.aggregator` creates generators of aggregates for use in operation functions via :attr:`FlowProject.operation`. Such functions may accept a variable number of positional arguments, ``*jobs``. The argument ``*jobs`` is unpacked into an *aggregate*, defined as an ordered tuple of jobs. See also the Python documentation about :ref:`argument unpacking `. diff --git a/docs/source/flow-project.rst b/docs/source/flow-project.rst index 989d48da..e610b275 100644 --- a/docs/source/flow-project.rst +++ b/docs/source/flow-project.rst @@ -90,7 +90,7 @@ A very simple *operation*, which creates a file called ``hello.txt`` within a jo if __name__ == "__main__": MyProject().main() -.. tip: +.. tip:: By default operations only act on a single job and can simply be defined with the signature `def op(job)`. 
When using :ref:`aggregate operations `, it is recommended to define operations as accepting a variadic list of ``*jobs`` parameters so that the operation is not restricted to a specific aggregate size. @@ -155,7 +155,7 @@ The entirety of the code is as follows: We can define both :py:meth:`~flow.FlowProject.pre` and :py:meth:`~flow.FlowProject.post` conditions, which allow us to define arbitrary workflows as a `directed acyclic graph `__. A operation is only executed if **all** preconditions are met, and at *at least one* postcondition is not met. -These are added above a `~flow.FlowProject.operation` decorator. +These are added above a :attr:`~flow.FlowProject.operation` decorator. Using these decorators before declaring a function an operation is an error. .. tip:: diff --git a/docs/source/index.rst b/docs/source/index.rst index 81809a4a..36b61582 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -1,7 +1,7 @@ .. signac documentation master file, created by sphinx-quickstart on Fri Oct 23 17:41:32 2015. You can adapt this file completely to your liking, but it should at least - contain the root `toctree` directive. + contain the root ``toctree`` directive. Welcome to the signac framework documentation! ============================================== diff --git a/docs/source/jobs.rst b/docs/source/jobs.rst index 752c9836..e002a9b2 100644 --- a/docs/source/jobs.rst +++ b/docs/source/jobs.rst @@ -473,7 +473,7 @@ However, there are some reasons why one would want to operate on multiple differ The :attr:`Job.stores` property provides a dict-like interface to access *multiple different* HDF5 files within the job workspace directory. In fact, the :attr:`Job.data` container is essentially just an alias for ``job.stores.signac_data``. -For example, to store an array `X` within a file called ``my_data.h5``, one could use the following approach: +For example, to store an array ``X`` within a file called ``my_data.h5``, one could use the following approach: .. 
code-block:: pycon diff --git a/docs/source/query.rst b/docs/source/query.rst index 90a84b09..3b924aba 100644 --- a/docs/source/query.rst +++ b/docs/source/query.rst @@ -84,7 +84,7 @@ We can select the 2nd document with ``{'p': 2}``, but also ``{'N': 1000, 'p': 2} Nested Keys ----------- -To match **nested** keys, avoid nesting the filter arguments, but instead use the `.`-operator. +To match **nested** keys, avoid nesting the filter arguments, but instead use the ``.``-operator. For example, if the documents shown in the example above were all nested like this: .. code-block:: python @@ -252,7 +252,7 @@ For example, instead of ``{'p': 2}``, you can simply type ``p 2``. A simplified expression consists of key-value pairs in alternation. The first argument will then be interpreted as the first key, the second argument as the first value, the third argument as the second key, and so on. If you provide an odd number of arguments, the last value will default to ``{'$exists': True}``. -Querying via operator is supported using the `.`-operator. +Querying via operator is supported using the ``.``-operator. Finally, you can use ``//`` intead of ``{'$regex': ''}`` for regular expressions. The following list shows simplified expressions on the left and their equivalent standard expression on the right. diff --git a/docs/source/recipes.rst b/docs/source/recipes.rst index d9fd1010..b111e1bc 100644 --- a/docs/source/recipes.rst +++ b/docs/source/recipes.rst @@ -440,7 +440,7 @@ Passing command line options to operations run in a container or other environme When executing an operation in a container (e.g. Singularity or Docker) or a different environment, the operation will not receive command line flags from the submitting process. ``FlowGroups`` can be -used to pass options to an ``exec`` command. This example shows how to use the `run_options` +used to pass options to an ``exec`` command. 
This example shows how to use the ``run_options`` argument to tell an operation executed in a container to run in debug mode. .. code-block:: python diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst index fec205e8..82ff301b 100644 --- a/docs/source/tutorial.rst +++ b/docs/source/tutorial.rst @@ -58,9 +58,9 @@ Initially, the ``.signac`` directory will contain the minimal configuration info ~/ideal_gas_project $ python init.py ~/ideal_gas_project $ ls -a . .. .signac init.py workspace - ~/ideal_gas_project $ ls .signac + ~/ideal_gas_project $ ls .signac config - ~/ideal_gas_project $ cat .signac/config + ~/ideal_gas_project $ cat .signac/config schema_version = 2 @@ -118,7 +118,7 @@ Each leaf node in the directory tree contains a ``job`` directory, which is a sy .. code-block:: bash - ~/ideal_gas_project $ ls view/p/1 + ~/ideal_gas_project $ ls view/p/1 job To minimize the directory tree depth, the linked view constructed is the most compact representation of the data space, in the sense that any parameters that do not vary across the entire data space are omitted from the directory tree. From 5418a1ff80816082bc6ba6683a03bee9c25e6ea0 Mon Sep 17 00:00:00 2001 From: Vyas Ramasubramani Date: Sun, 19 Mar 2023 19:10:14 -0400 Subject: [PATCH 14/31] Add codespell. 
--- .codespellrc | 4 ++++ .pre-commit-config.yaml | 4 ++++ docs/source/query.rst | 2 +- docs/source/recipes.rst | 4 ++-- docs/source/tutorial.rst | 2 +- 5 files changed, 12 insertions(+), 4 deletions(-) create mode 100644 .codespellrc diff --git a/.codespellrc b/.codespellrc new file mode 100644 index 00000000..1fd853c4 --- /dev/null +++ b/.codespellrc @@ -0,0 +1,4 @@ +[codespell] +builtin = clear +quiet-level = 2 +ignore-words-list = musil diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 38fe7cc1..c24c6bc7 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -10,3 +10,7 @@ repos: - id: rst-backticks - id: rst-directive-colons - id: rst-inline-touching-normal + - repo: https://github.com/codespell-project/codespell + rev: v2.2.4 + hooks: + - id: codespell diff --git a/docs/source/query.rst b/docs/source/query.rst index 3b924aba..eaf9e448 100644 --- a/docs/source/query.rst +++ b/docs/source/query.rst @@ -253,7 +253,7 @@ A simplified expression consists of key-value pairs in alternation. The first argument will then be interpreted as the first key, the second argument as the first value, the third argument as the second key, and so on. If you provide an odd number of arguments, the last value will default to ``{'$exists': True}``. Querying via operator is supported using the ``.``-operator. -Finally, you can use ``//`` intead of ``{'$regex': ''}`` for regular expressions. +Finally, you can use ``//`` instead of ``{'$regex': ''}`` for regular expressions. The following list shows simplified expressions on the left and their equivalent standard expression on the right. 
diff --git a/docs/source/recipes.rst b/docs/source/recipes.rst index b111e1bc..332814b6 100644 --- a/docs/source/recipes.rst +++ b/docs/source/recipes.rst @@ -144,7 +144,7 @@ Creating parameter-dependent operations ======================================= Operations defined as a function as part of a **signac-flow** workflow can only have one required argument: the job. -That is to ensure reproduciblity of these operations. +That is to ensure reproducibility of these operations. An operation should be a true function of the job's data without any hidden parameters. Here we show how to define operations that are a function of one or more additional parameters without violating the above mentioned principle. @@ -339,7 +339,7 @@ If you are using the ``run`` command for execution, simply execute the whole scr This means that the actual submission, (e.g. ``python project.py submit`` or similar) will need to be executed with a **local** Python executable. To avoid issues with dependencies that are only available in the container image, move imports into the operation function. - Condition functions will be executed during the submission process to determine *what* to submit, so depedencies for those must be installed into the local environment as well. + Condition functions will be executed during the submission process to determine *what* to submit, so dependencies for those must be installed into the local environment as well. .. tip:: diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst index 82ff301b..3a8d2d1a 100644 --- a/docs/source/tutorial.rst +++ b/docs/source/tutorial.rst @@ -126,7 +126,7 @@ In our example, **signac** detected that the values for *kT* and *N* are constan .. note:: - Make sure to update the view paths by executing the ``$ signac view`` command (or equivalently with the :py:meth:`~signac.Project.create_linked_view` method) everytime you add or remove jobs from your data space. 
+ Make sure to update the view paths by executing the ``$ signac view`` command (or equivalently with the :py:meth:`~signac.Project.create_linked_view` method) every time you add or remove jobs from your data space. Interacting with the **signac** project From 0ee7cddfe77e9e25f3b486fb956804dd93e46331 Mon Sep 17 00:00:00 2001 From: Vyas Ramasubramani Date: Sun, 19 Mar 2023 19:12:55 -0400 Subject: [PATCH 15/31] community --- docs/source/community.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/community.rst b/docs/source/community.rst index 5fd07ca5..554fede8 100644 --- a/docs/source/community.rst +++ b/docs/source/community.rst @@ -57,16 +57,16 @@ All code contributed via pull request needs to adhere to the following guideline * Use the `OneFlow`_ model of development: - - Both new features and bug fixes should be developed in branches based on ``master``. + - Both new features and bug fixes should be developed in branches based on ``main``. - Hotfixes (critical bugs that need to be released *fast*) should be developed in a branch based on the latest tagged release. * Write code that is compatible with all supported versions of Python (listed in the package ``pyproject.toml`` file). * Avoid introducing dependencies -- especially those that might be harder to install in high-performance computing environments. * All code needs to adhere to the PEP8_ style guide, with the exception that a line may have up to 100 characters. * Create `unit tests `_ and `integration tests `_ that cover the common cases and the corner cases of the code. - * Preserve backwards-compatibility whenever possible, and make clear if something must change. + * Preserve backwards compatibility whenever possible, and make clear if something must change. * Document any portions of the code that might be less clear to others, especially to new developers. 
- * Write API documentation as part of the doc-strings of the package, and put usage information, guides, and concept overviews in the `framework documentation `_, the page you are currently on (`source `_). + * Write API documentation as part of the docstrings of the package, and put usage information, guides, and concept overviews in the `framework documentation `_, the page you are currently on (`source `_). .. _GitHub: https://github.com/glotzerlab/ .. _PEP8: https://www.python.org/dev/peps/pep-0008/ From 62b0eed69c4ee0489626161a5bf6fc00f18f48c5 Mon Sep 17 00:00:00 2001 From: Vyas Ramasubramani Date: Tue, 21 Mar 2023 09:24:35 -0400 Subject: [PATCH 16/31] Put back original hooks. --- .pre-commit-config.yaml | 55 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 55 insertions(+) diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index c24c6bc7..83ecbee3 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -1,9 +1,64 @@ +ci: + autoupdate_schedule: quarterly + repos: - repo: https://github.com/pre-commit/pre-commit-hooks rev: 'v4.4.0' hooks: + - id: check-json + - id: check-merge-conflict + - id: check-yaml + - id: debug-statements - id: end-of-file-fixer + - id: mixed-line-ending - id: trailing-whitespace + - repo: https://github.com/asottile/pyupgrade + rev: 'v3.3.1' + hooks: + - id: pyupgrade + args: + - --py36-plus + - repo: https://github.com/PyCQA/isort + rev: '5.12.0' + hooks: + - id: isort + - repo: https://github.com/psf/black + rev: '22.12.0' + hooks: + - id: black + - repo: https://github.com/PyCQA/flake8 + rev: '6.0.0' + hooks: + - id: flake8 + args: + - --max-line-length=100 + - repo: https://github.com/nbQA-dev/nbQA + rev: 1.6.0 + hooks: + - id: nbqa-black + - id: nbqa-pyupgrade + - repo: https://github.com/asottile/blacken-docs + rev: v1.12.1 + hooks: + - id: blacken-docs + - repo: https://github.com/FlamingTempura/bibtex-tidy + rev: 'v1.8.5' + hooks: + - id: bibtex-tidy + args: + - 
--omit=abstract,keywords,archiveprefix,mendeley-tags,pmid,eprint,arxivid + - --wrap=90 + - --curly + - --numeric + - --space=2 + - --align=0 + - --sort=key,type,author,-year + - --duplicates=key,doi,citation + - --strip-enclosing-braces + - --drop-all-caps + - --sort-fields=title,shorttitle,author,year,month,day,journal,booktitle,location,on,publisher,address,series,volume,number,pages,doi,isbn,issn,url,urldate,copyright,category,note,metadata + - --remove-empty-fields + - --no-remove-dupe-fields - repo: https://github.com/pre-commit/pygrep-hooks rev: 'v1.10.0' hooks: From cc93b7355f88d874f36c4c76ea5c16c942428718 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Tue, 21 Mar 2023 13:28:07 +0000 Subject: [PATCH 17/31] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- docs/source/tutorial.rst | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst index 3a8d2d1a..7099de6e 100644 --- a/docs/source/tutorial.rst +++ b/docs/source/tutorial.rst @@ -323,6 +323,7 @@ For this we transform the ``volume_computed()`` function into a *label function* # ... + @Project.label def volume_computed(job): return job.isfile("volume.txt") @@ -367,6 +368,7 @@ Since we are pretending that computing the volume is an expensive operation, we # ... + @Project.pre(volume_computed) @Project.post.isfile("data.json") @Project.operation @@ -376,6 +378,7 @@ Since we are pretending that computing the volume is an expensive operation, we with open(job.fn("data.json"), "w") as jsonfile: json.dump(data, jsonfile) + # ... Here we reused the ``volume_computed`` condition function as a **precondition** and took advantage of the ``post.isfile`` function to define the postcondition for this operation function. @@ -439,6 +442,7 @@ To do this, we create a group and decorate the operations with it. 
json.dump(data, jsonfile) Project().main() + # ... @@ -456,6 +460,7 @@ Let's add another operation to our ``project.py`` script that stores the volume # ... + @Project.pre.after(compute_volume) @Project.post(lambda job: "volume" in job.document) @Project.operation From 7ef83344e9b1484655aac202ddec487a16ec84a7 Mon Sep 17 00:00:00 2001 From: Corwin Kerr Date: Mon, 27 Mar 2023 16:53:51 -0400 Subject: [PATCH 18/31] Apply Bradley's suggestions Co-authored-by: Bradley Dice --- docs/source/configuration.rst | 2 +- docs/source/flow-project.rst | 2 +- docs/source/jobs.rst | 1 - docs/source/query.rst | 4 ++-- docs/source/tutorial.rst | 6 +++--- 5 files changed, 7 insertions(+), 8 deletions(-) diff --git a/docs/source/configuration.rst b/docs/source/configuration.rst index 9b5b1c31..9f62c681 100644 --- a/docs/source/configuration.rst +++ b/docs/source/configuration.rst @@ -12,7 +12,7 @@ The configuration files are stored using the standard `INI file format `, it is recommended to define operations as accepting a variadic list of ``*jobs`` parameters so that the operation is not restricted to a specific aggregate size. diff --git a/docs/source/jobs.rst b/docs/source/jobs.rst index e002a9b2..2a484ef2 100644 --- a/docs/source/jobs.rst +++ b/docs/source/jobs.rst @@ -191,7 +191,6 @@ You can modify **nested** *state points* in-place, but you will need to use dict .. _project-job-document: - The Job Document ================ diff --git a/docs/source/query.rst b/docs/source/query.rst index eaf9e448..4f922d61 100644 --- a/docs/source/query.rst +++ b/docs/source/query.rst @@ -113,7 +113,7 @@ If we wanted to match all documents where *p is greater than 2*, we would use th {"p": {"$gt": 2}} -Note that we have replaced the value for p with the expression ``{'$gt': 2}`` to select all jobs withe p values greater than 2. +Note that we have replaced the value for p with the expression ``{'$gt': 2}`` to select all jobs with p values greater than 2. 
Here is a complete list of all available **arithmetic operators**: * ``$eq``: equal to @@ -246,7 +246,7 @@ Simplified Syntax on the Command Line It is possible to use search expressions directly on the command line, for example in combination with the ``$ signac find`` command. In this case filter arguments are expected to be provided as valid JSON expressions. -However, for simple filters you can also use a simplified syntax in lieu of writing JSON. +For simple filters, you can use a simplified syntax instead of writing JSON. For example, instead of ``{'p': 2}``, you can simply type ``p 2``. A simplified expression consists of key-value pairs in alternation. diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst index 7099de6e..f5059e01 100644 --- a/docs/source/tutorial.rst +++ b/docs/source/tutorial.rst @@ -91,7 +91,7 @@ Exploring the data space The core function that **signac** offers is the ability to associate metadata --- for example, a specific set of parameters such as temperature, pressure, and system size --- with a distinct directory on the file system that contains all data related to said metadata. The :py:meth:`~signac.Project.open_job` method associates the metadata specified as its first argument with a distinct directory, the *job directory*. -These directories are located in the ``workspace`` subdirectory within the project directory and the directory name is the so-called *job id*. +These directories are located in the ``workspace`` subdirectory within the project directory and the directory name is the *job id*. .. code-block:: bash @@ -160,7 +160,7 @@ Iterating through all jobs within the data space is then as easy as: 71855b321a04dd9ee27ce6c9cc0436f4 # ... -To iterate oer a subset of jobs, use the :py:meth:`~signac.Project.find_jobs` method in combination with a query expression: +To iterate over a subset of jobs, use the :py:meth:`~signac.Project.find_jobs` method in combination with a query expression: .. 
code-block:: pycon @@ -511,7 +511,7 @@ Job.data and Job.stores The job document is useful for storing small sets of numerical values or textual data. Text files like JSON are generally unsuitable for large numerical data, however, due to issues with floating point precision as well as sheer file size. -To support storing such data with **signac**, the job :py:attr:`~signac.contrib.job.Job.data` attribute provides a dict-like interface to an HDF5 file, a much more suitable format for storing large numerical data such as numpy arrays. +To support storing such data with **signac**, the job :py:attr:`~signac.contrib.job.Job.data` attribute provides a dict-like interface to an HDF5 file, a much more suitable format for storing large numerical data such as NumPy arrays. .. code-block:: python From f9b75a1d0491206a99916ff25e3f1aab1cfb5cd4 Mon Sep 17 00:00:00 2001 From: Bradley Dice Date: Tue, 28 Mar 2023 10:30:29 -0500 Subject: [PATCH 19/31] Fix job reference. Co-authored-by: Corwin Kerr --- docs/source/flow-project.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/flow-project.rst b/docs/source/flow-project.rst index 10ca14dc..e76f8e1d 100644 --- a/docs/source/flow-project.rst +++ b/docs/source/flow-project.rst @@ -52,7 +52,7 @@ Operations ========== It is highly recommended to divide individual modifications of your project's data space into distinct functions. -In this context, an *operation* is defined as a function whose only positional arguments are instances of :py:class:`~signac.contrib.job.Job`. +In this context, an *operation* is defined as a function whose only positional arguments are instances of :py:class:`~signac.job.Job`. We will demonstrate this concept with a simple example. 
Let's initialize a project with a few jobs, by executing the following ``init.py`` script within a ``~/my_project`` directory: From 82b34d76eaecb40505fbb04b6899b36e68702b8d Mon Sep 17 00:00:00 2001 From: Bradley Dice Date: Tue, 28 Mar 2023 10:30:44 -0500 Subject: [PATCH 20/31] Fix a/an. Co-authored-by: Corwin Kerr --- docs/source/flow-project.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/flow-project.rst b/docs/source/flow-project.rst index e76f8e1d..bbab8caa 100644 --- a/docs/source/flow-project.rst +++ b/docs/source/flow-project.rst @@ -154,7 +154,7 @@ The entirety of the code is as follows: for more information. We can define both :py:meth:`~flow.FlowProject.pre` and :py:meth:`~flow.FlowProject.post` conditions, which allow us to define arbitrary workflows as a `directed acyclic graph `__. -A operation is only executed if **all** preconditions are met, and at *at least one* postcondition is not met. +An operation is only executed if **all** preconditions are met, and at *at least one* postcondition is not met. These are added above a :attr:`~flow.FlowProject.operation` decorator. Using these decorators before declaring a function an operation is an error. From 5b19a9eab8005f534e65bbe924c12604df94ce67 Mon Sep 17 00:00:00 2001 From: Bradley Dice Date: Tue, 28 Mar 2023 10:31:34 -0500 Subject: [PATCH 21/31] Clarify analogy to primary keys. --- docs/source/projects.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/projects.rst b/docs/source/projects.rst index 09269002..78e1ba61 100644 --- a/docs/source/projects.rst +++ b/docs/source/projects.rst @@ -104,7 +104,7 @@ For example, to store data associated with particular temperature or pressure of job = project.open_job({"temperature": 20, "pressure": 1.0}).init() -The uniqueness of a state point should be familiar to anyone who has used a relational database: the state point parameters constitute the primary key of the data space. 
+The uniqueness of a state point in a signac data space is analogous to the uniqueness of a primary key in a relational database. Any other kind of data and metadata that describe a job but are not part of the key should be stored within the :attr:`Job.doc`, which has the exact same interface like the :attr:`Job.sp`. .. tip:: From df78ae05ee395aecb03549e10949cc2e20406dc0 Mon Sep 17 00:00:00 2001 From: Corwin Kerr Date: Tue, 28 Mar 2023 23:20:23 -0400 Subject: [PATCH 22/31] Remove from "data space" terms --- docs/source/intro.rst | 5 ++--- docs/source/jobs.rst | 4 ++-- 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/docs/source/intro.rst b/docs/source/intro.rst index 957b098c..573920ed 100644 --- a/docs/source/intro.rst +++ b/docs/source/intro.rst @@ -8,12 +8,11 @@ Concept The **signac** framework is designed to simplify the storage, generation and analysis of multidimensional data sets associated with large-scale, file-based computational studies. Any computational work that requires you to manage files and execute workflows may benefit from an integration with **signac**. Typical examples include hyperparameter optimization for machine learning applications and high-throughput screening of material properties with various simulation methods. -The data model assumes that the work can be divided into so called *projects*, where each project is roughly confined by similarly structured data, e.g., a parameter study. -In **signac**, the elements of a project's data space are called *jobs*. +In **signac**, collections of parameter values are *jobs* and are stored in a flat directory structure. Every job is defined by a unique set of well-defined parameters that define the job's context, and it also contains all the data associated with this metadata. This means that all data is uniquely addressable from the associated parameters. With **signac**, we define the processes generating and manipulating a specific data set as a sequence of operations on a job. 
-Using this abstraction, **signac** can define workflows on an arbitrary **signac** data space. +Using this abstraction, **signac** can define workflows on an arbitrary **signac** workspace. .. image:: images/signac_data_space.png diff --git a/docs/source/jobs.rst b/docs/source/jobs.rst index 2a484ef2..d09f042b 100644 --- a/docs/source/jobs.rst +++ b/docs/source/jobs.rst @@ -55,7 +55,7 @@ This subdirectory is named by the *job id*, therefore guaranteeing a unique file Because **signac** assumes that the state point is a unique identifier, multiple jobs cannot share the same state point. A typical remedy for scenarios where, *e.g.*, multiple replicas are required, is to append the replica number to the state point to generate a unique state point. -Both the state point and the job id are equivalent addresses for jobs in the data space. +Both the state point and the job id are equivalent addresses for jobs in the project. To access or modify a data point, obtain an instance of :py:class:`Job` by passing the associated metadata as a mapping of key-value pairs (for example, as an instance of :py:class:`dict`) into the :py:meth:`~signac.Project.open_job` method. .. code-block:: pycon @@ -69,7 +69,7 @@ To access or modify a data point, obtain an instance of :py:class:`Job` by passi In general an instance of :py:class:`Job` only gives you a handle to a Python object. -To create the underlying workspace directory and thus make the job part of the data space, you must *initialize* it. +To create the underlying workspace directory, you must *initialize* it. You can initialize a job **explicitly**, by calling the :py:meth:`Job.init` method, or **implicitly**, by either accessing the job's :ref:`job document ` or by switching into the job's workspace directory. .. 
code-block:: pycon From 12afcdd1de85acab5f8d8fd4d28285a868dfeefa Mon Sep 17 00:00:00 2001 From: Corwin Kerr Date: Tue, 28 Mar 2023 23:20:44 -0400 Subject: [PATCH 23/31] Small wording --- docs/source/projects.rst | 6 +++--- docs/source/tutorial.rst | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/projects.rst b/docs/source/projects.rst index 78e1ba61..9b400000 100644 --- a/docs/source/projects.rst +++ b/docs/source/projects.rst @@ -28,13 +28,13 @@ In the process, **signac**'s maintenance of the data space also effectively func Project Initialization ====================== -In order to use **signac** to manage a project's data, the project must be **initialized** as a **signac** project. +In order to use **signac** to manage a project's data, the project must be initialized as a **signac** project. After a project has been initialized in **signac**, all shell and Python scripts executed within or below the project's root directory have access to **signac**'s central facility, the **signac** project interface. The project interface provides simple and consistent access to the project's underlying *data space*. [#f1]_ .. [#f1] You can access a project interface from other locations by explicitly specifying the root directory. -To initialize a project, simply execute ``$ signac init`` on the command line inside the desired project directory (create a new project directory if needed). +To initialize a project, execute ``$ signac init`` on the command line inside the desired project directory (create a new project directory if needed). For example, to initialize a **signac** project in a directory called ``my_project``, execute: .. code-block:: bash @@ -172,7 +172,7 @@ Basic Grouping by Key Grouping can be quickly performed using a statepoint or job document key. 
-If *a* was a state point variable in a project's parameter space, we can quickly enumerate the groups corresponding to each value of *a* like this:
+If *a* were a state point parameter, we could enumerate the groups corresponding to each value of *a* like this:

 .. code-block:: python

diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst
index f5059e01..860978e7 100644
--- a/docs/source/tutorial.rst
+++ b/docs/source/tutorial.rst
@@ -148,7 +148,7 @@ You can obtain an instance of that class within the project root directory and a
 ``project`` and ``job`` will be set to :py:func:`~signac.get_project()` and :py:func:`~signac.get_job()` respectively.

-Iterating through all jobs within the data space is then as easy as:
+We can then iterate through all jobs in the project:

 .. code-block:: pycon

From 944d89f8074b23c5960174c357f4fd6ec637438b Mon Sep 17 00:00:00 2001
From: Corwin Kerr
Date: Tue, 28 Mar 2023 23:39:34 -0400
Subject: [PATCH 24/31] Add initial glossary file

---
 docs/source/glossary.rst | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)
 create mode 100644 docs/source/glossary.rst

diff --git a/docs/source/glossary.rst b/docs/source/glossary.rst
new file mode 100644
index 00000000..d1c5e98e
--- /dev/null
+++ b/docs/source/glossary.rst
@@ -0,0 +1,25 @@
+Glossary
+========
+
+.. glossary::
+
+    parameter
+        A variable, like `T` or `version` or `bench_number`. The smallest unit in signac. Specifically, these are the dictionary keys of the state point.
+
+    state point
+        A dictionary of parameters and their values that uniquely specifies a :term:`job`.
+
+    job
+        A single element of the data space: a :term:`state point` together with all data associated with it, stored in a :term:`job directory`.
+
+    job directory
+        The directory, named for the :term:`job id`, containing all data and metadata pertaining to the given job. Upon initialization of a job, the job directory contains the files `signac_statepoint.json` and `signac_job_document.json`.
+
+    project
+        The collection of all jobs sharing a single root directory, accessed and managed through the signac project interface.
+
+    workspace
+        The directory that contains all job directories.
+ + job id + A unique MD-5 hash of a job's state point that is used to identify a job. From e802abcc4319e41e863d283c7251ca2bb4bac184 Mon Sep 17 00:00:00 2001 From: Corwin Kerr Date: Tue, 28 Mar 2023 23:57:23 -0400 Subject: [PATCH 25/31] Replace state point schema with project schema --- docs/source/projects.rst | 2 +- docs/source/tips_and_tricks.rst | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/projects.rst b/docs/source/projects.rst index 9b400000..bc188ee1 100644 --- a/docs/source/projects.rst +++ b/docs/source/projects.rst @@ -355,7 +355,7 @@ In addition, **signac** also provides the :py:meth:`signac.Project.fn` method, w Schema Detection ================ -While **signac** does not require you to specify an *explicit* state point schema, it is always possible to deduce an *implicit* semi-structured schema from a project's data space. +While **signac** does not require you to specify an *explicit* :term:`project schema`, it is always possible to deduce an *implicit* semi-structured schema from a project's data space. This schema is comprised of the set of all keys present in all state points, as well as the range of values that these keys are associated with. Assuming that we initialize our data space with two state point keys, ``a`` and ``b``, where ``a`` is associated with some set of numbers and ``b`` contains a boolean value: diff --git a/docs/source/tips_and_tricks.rst b/docs/source/tips_and_tricks.rst index 854ee446..c6560cd2 100644 --- a/docs/source/tips_and_tricks.rst +++ b/docs/source/tips_and_tricks.rst @@ -72,7 +72,7 @@ While there is no hard limit imposed by **signac**, some heuristics can be helpf On a system with a fast SSD, a project can hold about 100,000 jobs before the latency for various operations (searching, filtering, iteration) becomes unwieldy. Some **signac** projects have scaled up to around 1,000,000 jobs, but the performance can be slower. 
This is especially difficult on network file systems found on HPC clusters, because accessing many small files is expensive compared to accessing fewer large files. -If your project needs to explore a large parameter space with many jobs, consider a state point schema that allows you to do more work with fewer jobs, instead of a small amount of work for many jobs, perhaps by reducing one dimension of the parameter space being explored. +If your project needs to explore a large parameter space with many jobs, consider a :term:`project schema` that allows you to do more work with fewer jobs, instead of a small amount of work for many jobs, perhaps by reducing one dimension of the parameter space being explored. After adding or removing jobs, it is recommended to run the CLI command ``$ signac update-cache`` or the Python method ``Project.update_cache()`` to update the persistent (centralized) cache of all state points in the project. For workflows implemented with **signac-flow**, the choice of pre-conditions and post-conditions can have a dramatic effect on performance. In particular, conditions that check for file existence, like ``FlowProject.post.isfile``, are typically much faster than conditions that require reading a file's contents. From ef22bb4a36cc0231352df7c619cd1cfb63e1fe85 Mon Sep 17 00:00:00 2001 From: Corwin Kerr Date: Tue, 28 Mar 2023 23:57:38 -0400 Subject: [PATCH 26/31] Update defs --- docs/source/glossary.rst | 28 ++++++++++++++++++++-------- 1 file changed, 20 insertions(+), 8 deletions(-) diff --git a/docs/source/glossary.rst b/docs/source/glossary.rst index d1c5e98e..5a629e5e 100644 --- a/docs/source/glossary.rst +++ b/docs/source/glossary.rst @@ -4,22 +4,34 @@ Glossary .. glossary:: parameter - A variable, like `T` or `version` or `bench_number`. The smallest unit in signac. Specifically, these are the dictionary keys of the state point. + A variable, like `T` or `version` or `bench_number`. The smallest unit in **signac**. 
Specifically, these are the dictionary keys of the state point. state point - A dictionary of parameters and their values that uniquely specifies a :term:`job`. + A dictionary of parameters and their values that uniquely specifies a :term:`job`, kept in sync with the file `signac_statepoint.json`. job - blah + An object holding data and metadata of a :term:`state point`. job directory - The directory, named for the :term:`job id`, containing all data and metadata pertaining to the given job. Upon initialization of a job, the job directory contains the files `signac_statepoint.json` and `signac_job_document.json`. + The directory, named for the :term:`job id`, created when a job is initialized containing all data and metadata pertaining to the given job. + job id + A unique MD-5 hash of a job's state point that is used to identify a job. + + job document + A persistent dictionary for storage of simple key-value pairs in a job, kept in sync with the file `signac_job_document.json`. + project - blah + The primary interface to access and work with jobs and their data. workspace - The directory that contains all job directories. + The directory that contains all job directories of a **signac** project. - job id - A unique MD-5 hash of a job's state point that is used to identify a job. + project schema + The emergent database schema as defined by jobs in the project workspace. The set of all keys present in all state points, as well as their range of values. + + signac schema + a database schema that defines how signac reads configuration options, currently on v2. 
+ + + From e75c89f1e7e00fc63c9544beeb84106465503083 Mon Sep 17 00:00:00 2001 From: Corwin Kerr Date: Thu, 30 Mar 2023 09:53:00 -0400 Subject: [PATCH 27/31] Update definitions taking Brandon's feedback --- docs/source/glossary.rst | 21 +++++++++------------ docs/source/jobs.rst | 2 +- 2 files changed, 10 insertions(+), 13 deletions(-) diff --git a/docs/source/glossary.rst b/docs/source/glossary.rst index 5a629e5e..533ed26b 100644 --- a/docs/source/glossary.rst +++ b/docs/source/glossary.rst @@ -10,28 +10,25 @@ Glossary A dictionary of parameters and their values that uniquely specifies a :term:`job`, kept in sync with the file `signac_statepoint.json`. job - An object holding data and metadata of a :term:`state point`. - - job directory - The directory, named for the :term:`job id`, created when a job is initialized containing all data and metadata pertaining to the given job. + An object holding data and metadata of the :term:`state point` that defines it. job id - A unique MD-5 hash of a job's state point that is used to identify a job. + The MD-5 hash of a job's state point that is used to distinguish jobs. + + job directory + The directory, named for the :term:`job id`, created when a job is initialized that will contain all data and metadata pertaining to the given job. job document A persistent dictionary for storage of simple key-value pairs in a job, kept in sync with the file `signac_job_document.json`. - - project - The primary interface to access and work with jobs and their data. workspace The directory that contains all job directories of a **signac** project. + project + The primary interface to access and work with jobs and their data stored in the workspace. + project schema The emergent database schema as defined by jobs in the project workspace. The set of all keys present in all state points, as well as their range of values. signac schema - a database schema that defines how signac reads configuration options, currently on v2. 
- - - + A configuration schema that defines accepted options and values, currently on v2. diff --git a/docs/source/jobs.rst b/docs/source/jobs.rst index febad070..16a06425 100644 --- a/docs/source/jobs.rst +++ b/docs/source/jobs.rst @@ -23,7 +23,7 @@ In other words, all data associated with a particular job should be a direct or .. important:: - Every :term:`parameter` that, when changed, would invalidate the job's data, should be part of the *state point*; all others should not. + Every :term:`parameter` that, when changed, would invalidate the job's data, should be part of the :term:`state point`; all others should not. However, you only have to add those parameters that are **actually changed** (or anticipated to be changed) to the *state point*. It is perfectly acceptable to hard-code parameters up until the point where you **actually change them**, at which point you would add them to the *state point* :ref:`retroactively `. From 579a367e60d27596fb74d4a8fe7d9b1b71af2390 Mon Sep 17 00:00:00 2001 From: Corwin Kerr Date: Thu, 30 Mar 2023 09:53:08 -0400 Subject: [PATCH 28/31] add hoverxref to build requirements --- docs/requirements.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/requirements.txt b/docs/requirements.txt index 2900ffc2..f7d406e2 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -3,6 +3,7 @@ jupyter_client jupyter_sphinx nbconvert nbsphinx +sphinx-hoverxref sphinx>=4.0.0 sphinx_rtd_theme>=1.0.0 sphinxcontrib-bibtex>=2.2.0 From f793ba1040058dd5ff8470b70aa2cf0cfadbe0c2 Mon Sep 17 00:00:00 2001 From: Corwin Kerr Date: Thu, 30 Mar 2023 09:54:43 -0400 Subject: [PATCH 29/31] Use 4 space indentation in glossary --- docs/source/glossary.rst | 40 ++++++++++++++++++++-------------------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/docs/source/glossary.rst b/docs/source/glossary.rst index 533ed26b..b916b9ce 100644 --- a/docs/source/glossary.rst +++ b/docs/source/glossary.rst @@ -3,32 +3,32 @@ Glossary .. 
glossary:: - parameter - A variable, like `T` or `version` or `bench_number`. The smallest unit in **signac**. Specifically, these are the dictionary keys of the state point. + parameter + A variable, like `T` or `version` or `bench_number`. The smallest unit in **signac**. Specifically, these are the dictionary keys of the state point. - state point - A dictionary of parameters and their values that uniquely specifies a :term:`job`, kept in sync with the file `signac_statepoint.json`. + state point + A dictionary of parameters and their values that uniquely specifies a :term:`job`, kept in sync with the file `signac_statepoint.json`. - job - An object holding data and metadata of the :term:`state point` that defines it. + job + An object holding data and metadata of the :term:`state point` that defines it. - job id - The MD-5 hash of a job's state point that is used to distinguish jobs. + job id + The MD-5 hash of a job's state point that is used to distinguish jobs. - job directory - The directory, named for the :term:`job id`, created when a job is initialized that will contain all data and metadata pertaining to the given job. + job directory + The directory, named for the :term:`job id`, created when a job is initialized that will contain all data and metadata pertaining to the given job. - job document - A persistent dictionary for storage of simple key-value pairs in a job, kept in sync with the file `signac_job_document.json`. + job document + A persistent dictionary for storage of simple key-value pairs in a job, kept in sync with the file `signac_job_document.json`. - workspace - The directory that contains all job directories of a **signac** project. + workspace + The directory that contains all job directories of a **signac** project. - project - The primary interface to access and work with jobs and their data stored in the workspace. + project + The primary interface to access and work with jobs and their data stored in the workspace. 
- project schema - The emergent database schema as defined by jobs in the project workspace. The set of all keys present in all state points, as well as their range of values. + project schema + The emergent database schema as defined by jobs in the project workspace. The set of all keys present in all state points, as well as their range of values. - signac schema - A configuration schema that defines accepted options and values, currently on v2. + signac schema + A configuration schema that defines accepted options and values, currently on v2. From 86bf749d6cfcffbfa283ddeb0f5676afd8275230 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Thu, 30 Mar 2023 13:55:46 +0000 Subject: [PATCH 30/31] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- docs/source/conf.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/conf.py b/docs/source/conf.py index da1c22c4..dbbdc5fb 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -54,7 +54,7 @@ # For hover x ref hoverxref_roles = [ - 'term', + "term", ] # For sphinxcontrib.bibtex. 
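The glossary patches above define the job id as an MD-5 hash of a job's state point. A stdlib-only sketch of that idea (an illustration, not signac's exact implementation; the canonical state point encoding signac uses internally may differ) computes the hash over a key-sorted JSON dump, so that logically identical state points always map to the same id:

```python
# Illustrative sketch: derive a job-id-style MD5 digest from a state point.
# NOTE: this mimics the concept described in the glossary; signac's actual
# canonical encoding of state points may differ from a sorted-key JSON dump.
import hashlib
import json


def job_id(statepoint):
    """Return a deterministic MD5 hex digest for a state point dict."""
    canonical = json.dumps(statepoint, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Key order does not matter: both dicts describe the same state point.
assert job_id({"T": 1.0, "N": 1000}) == job_id({"N": 1000, "T": 1.0})
# Changing any parameter value produces a different id, i.e. a different job.
assert job_id({"T": 1.0, "N": 1000}) != job_id({"T": 2.0, "N": 1000})
```

Because the id is a pure function of the state point, no two jobs in one workspace can share a state point, which is exactly the uniqueness property the glossary entries describe.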
From 7462c11b20b2eabdada85d53f165579398600d74 Mon Sep 17 00:00:00 2001 From: Corwin Kerr Date: Thu, 30 Mar 2023 12:17:54 -0400 Subject: [PATCH 31/31] Add references to glossary terms --- docs/source/dashboard.rst | 2 +- docs/source/flow-project.rst | 8 ++++---- docs/source/hooks.rst | 6 +++--- docs/source/jobs.rst | 6 +++--- docs/source/projects.rst | 4 ++-- docs/source/query.rst | 8 ++++---- docs/source/recipes.rst | 21 ++++----------------- docs/source/tips_and_tricks.rst | 8 ++++---- 8 files changed, 25 insertions(+), 38 deletions(-) diff --git a/docs/source/dashboard.rst b/docs/source/dashboard.rst index 7fb74255..72e4ff14 100644 --- a/docs/source/dashboard.rst +++ b/docs/source/dashboard.rst @@ -3,7 +3,7 @@ The Dashboard ============= -The **signac-dashboard** visualizes data stored in a **signac** project. +The **signac-dashboard** visualizes data stored in a **signac** :term:`project`. To install the **signac-dashboard** package, see :ref:`dashboard-installation`. .. danger:: diff --git a/docs/source/flow-project.rst b/docs/source/flow-project.rst index bbab8caa..8ed237a0 100644 --- a/docs/source/flow-project.rst +++ b/docs/source/flow-project.rst @@ -54,7 +54,7 @@ Operations It is highly recommended to divide individual modifications of your project's data space into distinct functions. In this context, an *operation* is defined as a function whose only positional arguments are instances of :py:class:`~signac.job.Job`. We will demonstrate this concept with a simple example. -Let's initialize a project with a few jobs, by executing the following ``init.py`` script within a ``~/my_project`` directory: +Let's initialize a **signac** :term:`project` with a few :term:`jobs <job>` by executing the following ``init.py`` script within a ``~/my_project`` directory: ..
code-block:: python @@ -66,7 +66,7 @@ Let's initialize a project with a few jobs, by executing the following ``init.py for i in range(10): project.open_job({"a": i}).init() -A very simple *operation*, which creates a file called ``hello.txt`` within a job's workspace directory, could be implemented like this: +A very simple *operation*, which creates a file called ``hello.txt`` within the :term:`job directory`, could be implemented like this: .. code-block:: python @@ -104,8 +104,8 @@ Conditions Here the :py:meth:`~flow.FlowProject.operation` decorator function specifies that the ``hello`` operation function is part of our workflow. If we run ``python project.py run``, **signac-flow** will execute ``hello`` for all jobs in the project. -However, we only want to execute ``hello`` if ``hello.txt`` does not yet exist in the job's workspace. -To do this, we need to create a condition function named ``greeted`` that tells us if ``hello.txt`` already exists in the job workspace: +However, we only want to execute ``hello`` if ``hello.txt`` does not yet exist in the job directory. +To do this, we need to create a condition function named ``greeted`` that tells us if ``hello.txt`` already exists in the job directory: .. code-block:: python diff --git a/docs/source/hooks.rst b/docs/source/hooks.rst index cf929448..67a56b96 100644 --- a/docs/source/hooks.rst +++ b/docs/source/hooks.rst @@ -10,13 +10,13 @@ Introduction ============ One of the goals of the **signac** framework is to make it easy to track the provenance of research data and to ensure its reproducibility. -Hooks make it possible to execute user-defined functions before or after :ref:`FlowProject ` operations act on a **signac** project. +Hooks make it possible to execute user-defined functions before or after :ref:`FlowProject ` operations act on a **signac** :term:`project`. For example, hooks can be used to track state changes before and after each operation. 
A hook is a function that is called at a specific time relative to the execution of a **signac-flow** :ref:`operation `. A hook can be triggered when an operation starts, exits, succeeds, or raises an exception. -A basic use case is to log the success/failure of an operation by creating a hook that sets a job document value ``job.doc.operation_success`` to ``True`` or ``False``. +A basic use case is to log the success/failure of an operation by creating a hook that sets a :term:`job document` value ``job.doc.operation_success`` to ``True`` or ``False``. As another example, a user may record the `git commit ID `_ upon the start of an operation, allowing them to track which version of code ran the operation. .. _hook_triggers: @@ -95,7 +95,7 @@ The ``on_exception`` hook trigger will run, and ``job.doc.error_on_a_0_success`` Project-Level Hooks =================== -It may be desirable to install the same hook or set of hooks for all operations in a project. +It may be desirable to install the same hook or set of hooks for all operations in a FlowProject. In the following example FlowProject, the hook ``track_start_time`` is triggered when each operation starts. The hook appends the current time to a list in the job document that is named based on the name of the operation. diff --git a/docs/source/jobs.rst b/docs/source/jobs.rst index 16a06425..60cc2899 100644 --- a/docs/source/jobs.rst +++ b/docs/source/jobs.rst @@ -10,7 +10,7 @@ Overview ======== A *job* is a directory on the file system, which is part of a *project workspace*. -That directory is called the *job workspace* and contains **all data** associated with that particular job. +That directory is called the job directory and contains **all data** associated with that particular job. Every job has a unique address called the *state point*. 
There are two ways to access associated metadata with your job: @@ -461,7 +461,7 @@ Please see the h5py_ documentation for more information on how to interact with Job Stores ========== -As mentioned before, the :attr:`Job.data` property represents an instance of :class:`~signac.H5Store`, specifically one that operates on a file called ``signac_data.h5`` in the job workspace. +As mentioned before, the :attr:`Job.data` property represents an instance of :class:`~signac.H5Store`, specifically one that operates on a file called ``signac_data.h5`` in the :term:`job directory`. However, there are some reasons why one would want to operate on multiple different HDF5_ files instead of only one. 1. While the HDF5-format is generally mutable, it is fundamentally designed to be used as an immutable data container. @@ -469,7 +469,7 @@ However, there are some reasons why one would want to operate on multiple differ 2. It easier to synchronize multiple files instead of just one. 3. Multiple operations executed in parallel can operate on different files circumventing file locking issues. -The :attr:`Job.stores` property provides a dict-like interface to access *multiple different* HDF5 files within the job workspace directory. +The :attr:`Job.stores` property provides a dict-like interface to access *multiple different* HDF5 files within the job directory. In fact, the :attr:`Job.data` container is essentially just an alias for ``job.stores.signac_data``. For example, to store an array ``X`` within a file called ``my_data.h5``, one could use the following approach: diff --git a/docs/source/projects.rst b/docs/source/projects.rst index bc188ee1..1119c508 100644 --- a/docs/source/projects.rst +++ b/docs/source/projects.rst @@ -79,7 +79,7 @@ Jobs The central assumption of the **signac** data model is that the data space is divisible into individual data points consisting of data and metadata that are uniquely addressable in some manner. 
Specifically, the workspace is divided into subdirectories where each directory corresponds to exactly one :py:class:`Job`. -All data associated with a job is contained in the corresponding *job workspace* directory. +All data associated with a job is contained in the corresponding job directory. A job can consist of any type of data, ranging from a single value to multiple terabytes of simulation data; **signac**'s only requirement is that this data can be encoded in a file. Each job is uniquely addressable via its *state point*, a key-value mapping describing its data. There can never be two jobs that share the same state point within the same project. @@ -473,7 +473,7 @@ Linked Views ============ Data space organization by job id is both efficient and flexible, but the obfuscation introduced by the job id makes inspecting the workspace on the command line or *via* a file browser much harder. -A *linked view* is a directory hierarchy with human-interpretable names that link to to the actual job workspace directories. +A *linked view* is a directory hierarchy with human-interpretable names that link to the actual :term:`job directories <job directory>`. Unlike the default mode for :ref:`data export `, no data is copied for the generation of linked views. See :py:meth:`~.signac.Project.create_linked_view` for the Python API. diff --git a/docs/source/query.rst b/docs/source/query.rst index 4f922d61..7bd8c185 100644 --- a/docs/source/query.rst +++ b/docs/source/query.rst @@ -4,8 +4,8 @@ Query API ========= -As briefly described in :ref:`project-job-finding`, the :py:meth:`~signac.Project.find_jobs()` method provides much more powerful search functionality beyond simple selection of jobs with specific state point values. -One of the key features of **signac** is the possibility to immediately search managed data spaces to select desired subsets as needed.
+As briefly described in :ref:`project-job-finding`, the :py:meth:`~signac.Project.find_jobs()` method provides much more powerful search functionality beyond simple selection of jobs with specific :term:`state points <state point>`. +One of the key features of **signac** is the possibility to search the :term:`project` workspace to select desired subsets as needed. .. note:: @@ -20,8 +20,8 @@ This means that any filter can be used to simultaneously search for keys in both Namespaces are identified by prefixing filter keys with the appropriate prefixes. Currently, the following prefixes are recognized: - * **sp**: job state point - * **doc**: document + * **sp**: job :term:`state point` + * **doc**: :term:`job document` For example, in order to select all jobs whose state point key *a* has the value "foo" and document key *b* has the value "bar", you would use: diff --git a/docs/source/recipes.rst b/docs/source/recipes.rst index 332814b6..cc0ab44d 100644 --- a/docs/source/recipes.rst +++ b/docs/source/recipes.rst @@ -11,13 +11,13 @@ This is a collection of recipes on how to solve typical problems using **signac* Move all recipes below into a 'General' section once we have added more recipes. -Migrating (changing) the data space schema +Migrating (changing) the project schema ========================================== Adding/renaming/deleting keys ----------------------------- -Oftentimes, one discovers at a later stage that important keys are missing from the metadata schema. +Oftentimes, one discovers at a later stage that important :term:`parameters <parameter>` are missing from the :term:`project schema`. For example, in the tutorial we are modeling a gas using the ideal gas law, but we might discover later that important effects are not captured using this overly simplistic model and decide to replace it with the van der Waals equation: .. math:: @@ -45,19 +45,6 @@ The ``setdefault()`` function sets the value for :math:`a` and :math:`b` to 0 in ..
_document-wide-migration: -Initializing Jobs with Replica Indices --------------------------------------- -If you want to initialize your workspace with multiple instances of the same state point, you may want to include a **replica_index** or **random_seed** parameter in the state point. - -.. code-block:: python - - num_reps = 3 - for i in range(num_reps): - for p in range(1, 11): - sp = {"p": p, "kT": 1.0, "N": 1000, "replica_index": i} - job = project.open_job(sp) - job.init() - Applying document-wide changes ------------------------------ @@ -88,7 +75,7 @@ This approach makes it also easy to compare the pre- and post-migration states b Initializing state points with replica indices ============================================== -We often require multiple jobs with the same state point to collect enough information to make statistical inferences about the data. Instead of creating multiple projects to handle this, we can simply add a **replica_index** to the state point. For example, we can use the following code to generate 3 copies of each state point in a workspace: +We often require multiple jobs with the same state point to collect enough information to make statistical inferences about the data. Instead of creating multiple projects to handle this, we can add a **replica_index** to the state point. For example, we can use the following code to generate 3 copies of each state point in a workspace: .. code-block:: python @@ -110,7 +97,7 @@ We often require multiple jobs with the same state point to collect enough infor Defining a grid of state point values ===================================== -Many signac data spaces are structured like a "grid" where the goal is an exhaustive search or a Cartesian product of multiple sets of input parameters. While this can be done with nested ``for`` loops, that approach can be cumbersome for state points with many keys. 
Here we offer a helper function that can assist in this kind of initialization, inspired by `this StackOverflow answer `__: +Some **signac** :term:`project schemas <project schema>` are structured like a "grid" where the goal is an exhaustive search or a Cartesian product of multiple sets of :term:`parameters <parameter>`. While this can be done with nested ``for`` loops, that approach can be cumbersome for state points with many keys. Here we offer a helper function that can assist in this kind of initialization, inspired by `this StackOverflow answer `__: .. code-block:: python diff --git a/docs/source/tips_and_tricks.rst b/docs/source/tips_and_tricks.rst index c6560cd2..4a348bc1 100644 --- a/docs/source/tips_and_tricks.rst +++ b/docs/source/tips_and_tricks.rst @@ -21,7 +21,7 @@ Nonetheless, there are some basic rules worth following: What is the difference between the job state point and the job document? ------------------------------------------------------------------------ -The *state point* defines the *identity* of each job in form of the *job id*. +The :term:`state point` defines the *identity* of each job in the form of the :term:`job id`. Conceptually, all data related to a job should be a function of the *state point*. That means that any metadata that could be changed without invalidating the data, should in principle be placed in the job document. @@ -33,7 +33,7 @@ That means that any metadata that could be changed without invalidating the data How do I avoid replicating metadata in filenames? ------------------------------------------------- -Many users, especially those new to **signac**, fall into the trap of storing metadata in filenames within a job's workspace even though that metadata is already encoded in the job itself. +Many users, especially those new to **signac**, fall into the trap of storing metadata in filenames within the :term:`job directory` even though that metadata is already encoded in the job itself.
Using the :ref:`tutorial` project as an example, we might have stored the volume corresponding to the job at pressure 4 in a file called ``volume_pressure_4.txt``. However, this is completely unnecessary since that information can already be accessed through the job *via* ``job.sp.p``. @@ -58,7 +58,7 @@ How do I reference data/jobs in scripts? You can reference other jobs in a script using the path to the project root directory in combination with a query-expression. While it is perfectly fine to copy & paste job ids during interactive work or for small tests, hard-coded job ids within code are almost always a bad sign. -One of the main advantages of using **signac** for data management is that the schema is flexible and may be migrated at any time without too much hassle. +One of the main advantages of using **signac** for data management is that the :term:`project schema` is flexible and may be migrated at any time without too much hassle. That also means that existing ids will change and scripts that used them in a hard-coded fashion will fail. Whenever you find yourself hard-coding ids into your code, consider replacing it with a function that uses the :py:meth:`~.signac.Project.find_jobs` function instead. @@ -69,7 +69,7 @@ How do I achieve optimal performance? What are the practical scaling limits for Because **signac** uses a filesystem backend, there are some practical limitations for project size. While there is no hard limit imposed by **signac**, some heuristics can be helpful. -On a system with a fast SSD, a project can hold about 100,000 jobs before the latency for various operations (searching, filtering, iteration) becomes unwieldy. +On a system with a fast SSD, a :term:`project` can hold about 100,000 jobs before the latency for various operations (searching, filtering, iteration) becomes unwieldy. Some **signac** projects have scaled up to around 1,000,000 jobs, but the performance can be slower.
This is especially difficult on network file systems found on HPC clusters, because accessing many small files is expensive compared to accessing fewer large files. If your project needs to explore a large parameter space with many jobs, consider a :term:`project schema` that allows you to do more work with fewer jobs, instead of a small amount of work for many jobs, perhaps by reducing one dimension of the parameter space being explored.
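The recipes.rst hunks above rewrite the passage on defining a grid of state point values. A minimal stdlib-only helper in that spirit might look as follows (the name ``grid`` and its exact behavior are illustrative assumptions rather than the recipe's actual implementation; in a real **signac** project each generated dict would be passed to ``project.open_job(sp).init()``):

```python
# Hypothetical helper, sketched for illustration: expand a mapping of
# parameter names to value lists into one state point dict per point of
# the Cartesian product.
from itertools import product


def grid(gridspec):
    """Yield state point dicts for the Cartesian product of gridspec."""
    keys = sorted(gridspec)  # fix the key order for reproducible iteration
    for values in product(*(gridspec[key] for key in keys)):
        yield dict(zip(keys, values))


statepoints = list(grid({"p": [1, 2, 3], "kT": [1.0], "replica_index": [0, 1]}))
assert len(statepoints) == 6  # 3 * 1 * 2 combinations
assert {"kT": 1.0, "p": 1, "replica_index": 0} in statepoints
```

Because the grid is generated rather than written as nested ``for`` loops, adding a new parameter dimension is a one-line change to the ``gridspec`` mapping instead of another level of nesting.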