Use mongodb as optional index for statepoints and job docs #331

jglaser · 2020-04-29T15:51:59Z

Description

Use MongoDB as an optional backend (via pymongo). State points and job docs are stored in the database instead of creating a directory for every state point. Local workspaces are still created by calling job.open() explicitly. Jobs are added to the database only with job.add().

To enable the database backend, configure a database host (myhost) as described here and put these variables into the configuration (signac.rc):

index_host = myhost
index_db = signac
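As a rough illustration of how these two settings might be consumed, the sketch below parses them from configuration text. The helper name `read_index_settings`, the simplified `key = value` parsing, and the fallback database name are assumptions made for this example; they are not the code in this PR.

```python
def read_index_settings(text):
    """Return {'host': ..., 'db': ...} from top-level settings, or None.

    Simplified parser (assumption): only plain `key = value` lines that
    appear before the first [section] header are considered.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.startswith("["):
            break  # top-level keys end at the first section header
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        settings[key] = value
    host = settings.get("index_host")
    if host is None:
        return None  # no database configured: use the local workspace only
    # The default database name "signac" is an assumption for this sketch.
    return {"host": host, "db": settings.get("index_db", "signac")}


# With pymongo installed, the result could be used roughly like this:
#   client = pymongo.MongoClient(cfg["host"])
#   database = client[cfg["db"]]
```

Returning `None` when `index_host` is absent keeps the database strictly optional, which matches the intent stated in the description.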

Motivation and Context

We aim to overcome a scalability issue in signac for large indices (>100,000 state points), where the file system limits the number of subdirectories, and therefore state points, that can be created. By decoupling state points from local workspace directories, distributed workers could, for example, create those directories on node-local flash memory as needed for a subset of the entire workspace, while connecting to the central database to store and retrieve job information.
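The decoupling described above can be sketched in a few lines. This is a toy model under stated assumptions, not the implementation in this PR: the class `IndexedProject` is hypothetical, a plain dict stands in for the MongoDB collection, and the job id is assumed to be an MD5 hash of the JSON-encoded state point with sorted keys.

```python
import hashlib
import json
import os


def calc_id(statepoint):
    """Hash the JSON-encoded state point (sorted keys) to a stable job id."""
    blob = json.dumps(statepoint, sort_keys=True)
    return hashlib.md5(blob.encode("utf-8")).hexdigest()


class IndexedProject:
    """Toy project: state points live in an index, not in directories."""

    def __init__(self, workspace_root):
        self.workspace_root = workspace_root
        self.index = {}  # stand-in for a MongoDB collection (assumption)

    def add(self, statepoint):
        """Register a state point in the index without touching the disk."""
        job_id = calc_id(statepoint)
        self.index.setdefault(job_id, {"statepoint": statepoint, "doc": {}})
        return job_id

    def open(self, job_id):
        """Create the local workspace directory only when it is needed."""
        if job_id not in self.index:
            raise KeyError(job_id)
        path = os.path.join(self.workspace_root, job_id)
        os.makedirs(path, exist_ok=True)
        return path
```

In this model a worker can register any number of state points with `add()` while the workspace root stays empty; only the jobs it actually opens get a directory, which could live on node-local storage.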

A reference implementation using MongoDB as a persistent backend on an OLCF OpenShift cluster is underway (actually, figuring out how to allow network access to the cluster took most of the time, rather than implementing these changes :-) ). The changes follow ideas by @csadorf.

Types of Changes

Documentation update
Bug fix
New feature
Breaking change¹

¹The change breaks (or has the potential to break) existing functionality.

Checklist:

I am familiar with the Contributing Guidelines.
I agree with the terms of the Contributor Agreement.
My name is on the list of contributors.
My code follows the code style guideline of this project.
The changes introduced by this pull request are covered by existing or newly introduced tests.

If necessary:

I have updated the API documentation as part of the package doc-strings.
I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

Example for a changelog entry: Fix issue with launching rockets to the moon (#101, #212).

jglaser · 2020-04-29T17:51:56Z

An obvious bottleneck is now the state point search, which loads all state points into the cache before filtering. We will want to map that onto a MongoDB query.
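One way such a mapping could look is sketched below. It assumes, hypothetically, that each database document has the layout `{"_id": job_id, "statepoint": {...}, "doc": {...}}`, and it translates a filter dict into MongoDB dot notation so the server does the filtering. The function names are made up for this example.

```python
def _prefix(expr, root):
    """Rewrite filter keys as dotted paths below `root`."""
    out = {}
    for key, value in expr.items():
        if key in ("$and", "$or"):
            # Logical operators keep their place; their operands are rewritten.
            out[key] = [_prefix(sub, root) for sub in value]
        elif isinstance(value, dict) and value and not any(
            k.startswith("$") for k in value
        ):
            # A nested dict without operators selects nested keys: flatten it.
            out.update(_prefix(value, f"{root}.{key}"))
        else:
            # Plain values and operator expressions such as {"$lt": 5}.
            out[f"{root}.{key}"] = value
    return out


def to_mongo_query(sp_filter=None, doc_filter=None):
    """Combine a state point filter and a job doc filter into one query."""
    parts = [
        _prefix(sp_filter or {}, "statepoint"),
        _prefix(doc_filter or {}, "doc"),
    ]
    parts = [part for part in parts if part]
    if not parts:
        return {}  # an empty query matches every job
    return parts[0] if len(parts) == 1 else {"$and": parts}


# With pymongo, the result would be passed to Collection.find(), e.g.:
#   job_ids = [d["_id"] for d in collection.find(query, {"_id": 1})]
```

The point of the translation is that only matching documents cross the network, instead of every state point being loaded into a local cache first.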

use mongodb query API

codecov · 2020-04-29T19:16:07Z

Codecov Report

Merging #331 into master will decrease coverage by 0.95%.
The diff coverage is 48.71%.

@@            Coverage Diff             @@
##           master     #331      +/-   ##
==========================================
- Coverage   76.20%   75.24%   -0.96%     
==========================================
  Files          43       44       +1     
  Lines        7090     7163      +73     
==========================================
- Hits         5403     5390      -13     
- Misses       1687     1773      +86

| Impacted Files | Coverage Δ | Trend |
| --- | --- | --- |
| signac/common/connection.py | 0.00% <ø> (-38.28%) | ⬇️ |
| signac/common/host.py | 41.53% <0.00%> (-4.90%) | ⬇️ |
| signac/contrib/indexing.py | 75.59% <0.00%> (ø) | |
| signac/db/__init__.py | 58.33% <0.00%> (ø) | |
| signac/db/database.py | 100.00% <ø> (ø) | |
| signac/core/pymongodict.py | 22.50% <22.50%> (ø) | |
| signac/contrib/project.py | 89.13% <54.54%> (-2.27%) | ⬇️ |
| signac/contrib/job.py | 90.00% <77.77%> (-1.17%) | ⬇️ |
| signac/__main__.py | 80.08% <100.00%> (+0.04%) | ⬆️ |

... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b21bdeb...75e6a5c. Read the comment docs.

csadorf · 2020-04-30T11:52:53Z

signac/contrib/project.py

-    def index(self, formats=None, depth=0,
-              skip_errors=False, include_job_document=True):
+    def index_from_workspace(self, formats=None, depth=0,
+                             skip_errors=False, include_job_document=True):


I think we should modify the index() method such that it calls both index_from_workspace() and index_from_db() and make the latter two private methods.

jglaser · 2020-06-01

That might be a good option. Are you proposing to use the union of the two sets of state points then? There could be state points that exist only locally, state points that exist only in the database, and state points that exist in both. The last group would need to be deduplicated.
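A minimal sketch of the union discussed here, assuming both sources yield index documents that carry the job id under `"_id"`. The function `merge_indexes` and the rule that the workspace copy wins on conflict are assumptions for illustration; the thread does not settle either point.

```python
def merge_indexes(workspace_docs, db_docs):
    """Union of two iterables of index documents, one entry per job id.

    Each entry records which sources contained the job. When a job is in
    both, the workspace document is kept (an arbitrary choice here).
    """
    merged = {}
    for source, docs in (("workspace", workspace_docs), ("db", db_docs)):
        for doc in docs:
            entry = merged.setdefault(doc["_id"], {"doc": doc, "sources": set()})
            entry["sources"].add(source)
    return merged
```

Tracking the sources per job makes the three cases from the comment explicit: `{"workspace"}`, `{"db"}`, or both.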

b-butler · 2020-08-17T20:25:42Z

This is being handled in a current GSoC project. MongoDB is one of several backends for signac synced collections in #364. Once the refactoring of the data structures is done, we plan to add support for different backends for job state points.

Use mongodb as optional index for statepoints and job docs

45f7dbc

jglaser requested review from csadorf and bdice April 29, 2020 15:51

jglaser added 3 commits April 29, 2020 12:26

try fix style

bf841eb

fix style

18eb5d6

try fix failing unit test

b471637

jglaser added 6 commits April 29, 2020 13:54

try fixing index unit tests

a466eb6

try fix style

0ff5667

another instance of index -> index_from_workspace

bffd1f6

try fix unit test failure again

3e60fc4

fix one more instance of .index()

4b65700

try fixing more unit test errors

b030498

csadorf requested changes Apr 30, 2020

Glaser J and others added 7 commits May 4, 2020 21:50

lazily load connection.py

427b7ab

to avoid making pymongo a hard dependency

implement job.remove()

06713ed

update nested doc entries using pymongo query

060c2d2

Add buffering for job doc updates in pymongodict

a06eb6e

also enable deletion of job doc keys

add project.add_many()

c1df50f

Enable opening and adding multiple jobs at once to the DB

use parallel bulk_write

5cac821

return list of added jobs that do not already exist in the index

75e6a5c

b-butler closed this Aug 17, 2020
