Use mongodb as optional index for statepoints and job docs #331

Closed
wants to merge 17 commits into from

Conversation

@jglaser jglaser commented Apr 29, 2020

Description

Use MongoDB as an optional backend (via pymongo). State points and job documents are stored in the database instead of creating a directory for every state point. Local workspaces are still created, but only when job.open() is called explicitly. Jobs are added to the database only with job.add().

To enable the database backend, configure a database host (myhost) as described here and put these variables into the configuration (signac.rc):

index_host = myhost
index_db = signac
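
The decoupling described above can be illustrated with a minimal sketch. It assumes a plain dict as a stand-in for the MongoDB collection so that it runs without a server. The names `InMemoryIndex`, `add`, and `open_workspace` are hypothetical and are not the API introduced by this pull request, and the id scheme (MD5 of the sorted JSON state point) is an assumption made for illustration.

```python
import hashlib
import json
import os
import tempfile


def calc_id(statepoint):
    """Hash a state point into a stable id (assumed scheme: MD5 of sorted JSON)."""
    blob = json.dumps(statepoint, sort_keys=True)
    return hashlib.md5(blob.encode()).hexdigest()


class InMemoryIndex:
    """Hypothetical stand-in for a MongoDB collection of job records."""

    def __init__(self, workspace_root):
        self.workspace_root = workspace_root
        self.records = {}  # job id -> {"statepoint": ..., "doc": ...}

    def add(self, statepoint, doc=None):
        # Registering a job only touches the index, not the file system.
        job_id = calc_id(statepoint)
        self.records.setdefault(job_id, {"statepoint": statepoint, "doc": doc or {}})
        return job_id

    def open_workspace(self, job_id):
        # A directory is created lazily, only for jobs a worker actually opens.
        if job_id not in self.records:
            raise KeyError(job_id)
        path = os.path.join(self.workspace_root, job_id)
        os.makedirs(path, exist_ok=True)
        return path


root = tempfile.mkdtemp()
index = InMemoryIndex(root)
ids = [index.add({"a": i}) for i in range(1000)]
print(len(index.records), len(os.listdir(root)))  # 1000 records, 0 directories
index.open_workspace(ids[0])
print(len(os.listdir(root)))  # 1 directory after opening a single job
```

With a real database, `records` would be replaced by a collection on the configured host, and each worker would choose its own `workspace_root`, for example on node-local storage.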

Motivation and Context

We aim to overcome a scalability issue in signac for large indices (>100,000 state points), where the file system limits the number of subdirectories, and hence state points, that can be created. By decoupling state points from local workspace directories, distributed workers could, for example, create those directories on node-local flash memory as needed for a subset of the entire workspace, while connecting to the central database to store and retrieve job information.

A reference implementation using MongoDB as a persistent backend on an OLCF OpenShift cluster is underway, following ideas by @csadorf. (Figuring out how to allow network access to the cluster actually took most of the time, rather than implementing these changes. :-)

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change [1]

[1] The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

Example for a changelog entry: Fix issue with launching rockets to the moon (#101, #212).

@jglaser jglaser requested review from csadorf and bdice April 29, 2020 15:51

jglaser commented Apr 29, 2020

An obvious bottleneck is now the state point search, which loads all state points into the cache before filtering. We will want to map that onto a MongoDB query:

  • use mongodb query API
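
One way such a mapping could look, sketched under assumptions: job records are stored as documents with a `statepoint` sub-document, and the filter uses equality conditions. The helper `to_mongo_query` is hypothetical and not part of this pull request; the dot notation for nested fields is standard MongoDB query syntax, so the resulting dict could be passed to pymongo's `Collection.find()`. A small pure-Python matcher (equality only) is included so the example runs without a database.

```python
def to_mongo_query(filter_, prefix="statepoint"):
    """Flatten a nested equality filter into dot-notation keys."""
    query = {}
    for key, value in filter_.items():
        path = f"{prefix}.{key}"
        is_operator_expr = isinstance(value, dict) and any(
            k.startswith("$") for k in value
        )
        if isinstance(value, dict) and value and not is_operator_expr:
            # Recurse into nested state point keys.
            query.update(to_mongo_query(value, prefix=path))
        else:
            # Scalars and operator expressions such as {"$gt": 1} pass through.
            query[path] = value
    return query


def matches(document, query):
    """Evaluate a dot-notation equality query against one document."""
    for path, expected in query.items():
        node = document
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                return False
            node = node[part]
        if node != expected:
            return False
    return True


docs = [
    {"_id": "j1", "statepoint": {"a": 1, "b": {"c": 2}}},
    {"_id": "j2", "statepoint": {"a": 1, "b": {"c": 3}}},
]
query = to_mongo_query({"a": 1, "b": {"c": 2}})
print(query)  # {'statepoint.a': 1, 'statepoint.b.c': 2}
print([d["_id"] for d in docs if matches(d, query)])  # ['j1']
```

Pushing the filter to the server this way avoids loading every state point into the local cache before filtering.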


codecov bot commented Apr 29, 2020

Codecov Report

Merging #331 into master will decrease coverage by 0.95%.
The diff coverage is 48.71%.


@@            Coverage Diff             @@
##           master     #331      +/-   ##
==========================================
- Coverage   76.20%   75.24%   -0.96%     
==========================================
  Files          43       44       +1     
  Lines        7090     7163      +73     
==========================================
- Hits         5403     5390      -13     
- Misses       1687     1773      +86     
Impacted Files                 Coverage Δ
signac/common/connection.py    0.00% <ø> (-38.28%) ⬇️
signac/common/host.py          41.53% <0.00%> (-4.90%) ⬇️
signac/contrib/indexing.py     75.59% <0.00%> (ø)
signac/db/__init__.py          58.33% <0.00%> (ø)
signac/db/database.py          100.00% <ø> (ø)
signac/core/pymongodict.py     22.50% <22.50%> (ø)
signac/contrib/project.py      89.13% <54.54%> (-2.27%) ⬇️
signac/contrib/job.py          90.00% <77.77%> (-1.17%) ⬇️
signac/__main__.py             80.08% <100.00%> (+0.04%) ⬆️
... and 1 more

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b21bdeb...75e6a5c. Read the comment docs.

- def index(self, formats=None, depth=0,
-           skip_errors=False, include_job_document=True):
+ def index_from_workspace(self, formats=None, depth=0,
+                          skip_errors=False, include_job_document=True):
Contributor

I think we should modify the index() method so that it calls both index_from_workspace() and index_from_db(), and make the latter two private methods.

Author

That might be a good option. Are you proposing to use the union of the two sets of state points, then? There could be state points that are only local, state points that are only in the database, and state points that are in both; the latter would need to be deduplicated.
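
The union under discussion could be sketched as follows, assuming both sources yield index documents keyed by `_id` (the job id). The private helpers `_index_from_workspace` and `_index_from_db`, the sample documents, and the choice to let the database copy win when a job exists in both places are illustrative assumptions, not the behavior of this pull request.

```python
from itertools import chain


def _index_from_workspace():
    # Hypothetical: documents discovered by crawling local job directories.
    yield {"_id": "j1", "statepoint": {"a": 1}, "source": "workspace"}
    yield {"_id": "j2", "statepoint": {"a": 2}, "source": "workspace"}


def _index_from_db():
    # Hypothetical: documents fetched from the database backend.
    yield {"_id": "j2", "statepoint": {"a": 2}, "source": "db"}
    yield {"_id": "j3", "statepoint": {"a": 3}, "source": "db"}


def index():
    """Yield each job once, preferring the database copy when both exist."""
    merged = {}
    for doc in chain(_index_from_workspace(), _index_from_db()):
        merged[doc["_id"]] = doc  # later (db) entries overwrite earlier ones
    yield from merged.values()


result = list(index())
print([(d["_id"], d["source"]) for d in result])
# [('j1', 'workspace'), ('j2', 'db'), ('j3', 'db')]
```

This version holds the merged index in memory; for very large indices, tracking only the set of ids already seen would keep the generator streaming.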

@b-butler
Member

This is being handled in a current GSoC project. MongoDB is one of several backends for signac synced collections in #364. Once the refactoring of the data structures is done, support for different backends for job state points is planned.

@b-butler b-butler closed this Aug 17, 2020
3 participants