Use mongodb as optional index for statepoints and job docs #331
Conversation
An obvious bottleneck is now the state point search, which loads all state points into the cache first before filtering. We will want to map that onto a database query.
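One way to remove that bottleneck is to translate the state point filter directly into a MongoDB query so the server does the filtering. A minimal sketch follows; the collection name, document layout, and filter-to-query translation are all assumptions, not details confirmed in this thread:

    # Hypothetical sketch: push state point filtering into MongoDB rather
    # than loading every state point into the local cache first.
    import pymongo

    client = pymongo.MongoClient("mongodb://myhost:27017/")
    collection = client["signac_db"]["statepoints"]  # assumed collection layout

    def find_job_ids(sp_filter):
        # Map a filter such as {"temperature": 300} onto nested document
        # fields, e.g. {"statepoint.temperature": 300}, and let the server
        # return only the matching job ids.
        query = {"statepoint." + key: value for key, value in sp_filter.items()}
        return [doc["_id"] for doc in collection.find(query, projection=["_id"])]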
Codecov Report
@@            Coverage Diff             @@
##           master     #331      +/-   ##
==========================================
- Coverage   76.20%   75.24%   -0.96%
==========================================
  Files          43       44       +1
  Lines        7090     7163      +73
==========================================
- Hits         5403     5390      -13
- Misses       1687     1773      +86
==========================================
Continue to review full report at Codecov.
-    def index(self, formats=None, depth=0,
-              skip_errors=False, include_job_document=True):
+    def index_from_workspace(self, formats=None, depth=0,
+                             skip_errors=False, include_job_document=True):
I think we should modify the index() method such that it calls both index_from_workspace() and index_from_db(), and make the latter two private methods.
That might be a good option. Are you proposing to use the union of the two sets of state points then? There could be state points that are only local, state points that are only in the database, and state points that are in both; the ones in both would need to be deduplicated.
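A minimal sketch of what that merge could look like, assuming both methods yield index documents with a unique "_id" key (the method names come from the diff above; the signatures and deduplication logic are assumptions, not the PR's actual code):

    def index(self, formats=None, depth=0,
              skip_errors=False, include_job_document=True):
        seen = set()
        # Chain the local workspace index and the database index; a job
        # present in both sources is emitted only once.
        for source in (self._index_from_workspace, self._index_from_db):
            for doc in source(formats=formats, depth=depth,
                              skip_errors=skip_errors,
                              include_job_document=include_job_document):
                if doc["_id"] not in seen:
                    seen.add(doc["_id"])
                    yield doc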
Avoid making pymongo a hard dependency
Also enable deletion of job doc keys
Enable opening and adding multiple jobs at once to the DB
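A common pattern for keeping pymongo optional, as the first commit message above describes, is a guarded import that fails only when the database backend is actually used. This is a sketch under that assumption, not the PR's actual code:

    # Defer the hard failure until the optional feature is requested.
    try:
        import pymongo
    except ImportError:
        pymongo = None

    def get_client(host_url):
        if pymongo is None:
            raise RuntimeError(
                "The database backend requires pymongo; "
                "install it with 'pip install pymongo'.")
        return pymongo.MongoClient(host_url)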
This is handled in a current GSoC project. MongoDB is one of the many backends for signac synced collections in #364. When the refactoring of data structures is done, adding support for different backends for job state points is planned.
Description
Use mongodb as an optional backend (via pymongo). Store state points and job docs in the database instead of creating a directory for every state point. Local workspaces are still created by calling job.open() explicitly. Jobs are added to the database only with job.add().
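A hedged sketch of that workflow (the project setup and state point values are illustrative; only open() and add() are named in this description):

    import signac

    project = signac.get_project()
    job = project.open_job({"temperature": 300, "pressure": 1.0})

    job.add()   # register the state point and job document in the database
    job.open()  # explicitly create the local workspace directory when needed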
To enable the database backend, configure a database host (myhost) as described here and put these variables into the configuration (signac.rc):
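The concrete variables did not survive in this page capture, so the following signac.rc sketch is hypothetical; the section and key names are assumptions modeled on a typical host configuration:

    [hosts]
    [[myhost]]
    url = mongodb://myhost.example.org:27017/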
Motivation and Context
We aim to overcome a scalability issue in signac for large indices (more than roughly 100,000 jobs), where the file system limits the number of subdirectories, i.e. state points, that can be created. By decoupling state points from local workspace directories, distributed workers could, for example, create those directories on node-local flash memory as needed for a subset of the entire workspace, while connecting to the central database to store and retrieve job information.
A reference implementation using mongodb as a persistent backend on an OLCF OpenShift cluster is underway (actually, figuring out how to allow network access to the cluster took most of the time, rather than implementing these changes :-). These changes follow ideas by @csadorf.
Types of Changes
The change breaks (or has the potential to break) existing functionality.
Checklist:
If necessary:
Example for a changelog entry:
Fix issue with launching rockets to the moon (#101, #212).