-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance #975
Improve performance #975
Conversation
_StatePointDict takes significant time to initialize, even when the statepoint dict is known. Adjust `Job` initialization to make more use of the statepoint cache and initialize `_StatePointDict` only when `Job.statepoint` is accessed. Provide a faster path for cached *read* access to the statepoint dict via the new property `Job.statepoint_dict`. One side effect of this change is that some warnings are now deferred to `statepoint` access that were previously issued during `Job.__init__` (see changes in tests/). There are additional opportunities to use the cached statepoint dict in `Project.groupby` that this commit does not address.
Cache the ids matching the job filter. This enables O(1) cost for __len__ and __contains__ as users would expect. In some use-cases, signac-flow repeatedly calls __contains__ on a JobsCursor. The side effect of this change is that modifications to the workspace will not be reflected in existing JobsCursor instances. This behavior was not previously documented in the user API.
`with job`, `Job.document`, and `Job.stores` call `init()` because they require that the job directory exists. Prior to this change, `init()` also forced a load of the `_StatepointDict`. These methods now call `init(validate_statepoints=False)` which exits early when the job directory exists. This change provides a reasonable performance boost (5x on NVME, more on network storage). There may be more room for improvement as there are currently 2N stat calls in this use-case: ```python for job in project: with job: pass ```
deepcopy is unexpectedly expensive. Refactor the earlier commit to deepcopy only user-provided statepoint dicts. Statepoints from the cache are passed to the user read-only via MappingProxyType.
`open_job` uses the statepoint cache to improve performance. Read the cache from disk in `open_job` (if it has not already been read). This provides consistently high performance in cases where `open_job` is called before any other method that may have triggered `_get_statepoint`.
Users may find the messages to verbose. At the same time, users might never realize that they should run `signac update-cache` without this message...
`open_job` is a user-facing function and performs error checking on the id. This check involves a stat call to verify the job directory exists. When `Project` is looping over ids from `_get_job_ids`, the directory is known to exist (subject to race conditions). `stat` calls are expensive, especially on networked filesystems. Instantiating `Job` directly bypasses this check in `open_job`.
for more information, see https://pre-commit.ci
@tcmoore3 @janbridley here are the signac modifications I've been talking about. I will post benchmarks soon. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #975 +/- ##
==========================================
+ Coverage 85.71% 86.09% +0.37%
==========================================
Files 20 20
Lines 3466 3503 +37
Branches 760 770 +10
==========================================
+ Hits 2971 3016 +45
+ Misses 337 330 -7
+ Partials 158 157 -1 ☔ View full report in Codecov by Sentry. |
…Job. This allows Job to avoid some `stat` calls which greatly improves performance on networked file systems.
5e7ab62
to
57ef8a9
Compare
These missed opportunities to pre-populate _statepoint_mapping triggered slow code paths.
Here are the results for a suite of benchmarks run on cheme-hodges (NVME) with 100,000 jobs in the workspace. For example, the for job in project:
job.statepoint_mapping['a'] The "no cache" column is run with no statepoint cache file on disk.
This pull request (7998f9a0b2084f99493312d17ee28391cffce8e4):
|
Here is the same on Great Lakes scratch (GPFS).
This pull request (7998f9a0b2084f99493312d17ee28391cffce8e4):
|
Nearly all usage scenarios are significantly faster. There are some remaining
|
bc862f4
to
4f135b6
Compare
This is ready for review. I suggest waiting to merge until I complete additional testing of this branch with signac-flow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benchmarks look great! Nice job.
I worked quite a bit on improving signac performance prior to signac 2. At first glance, these optimizations seem fine, but it would be nice to know if there are tradeoffs in guaranteed consistency. Two important cases that are very hard to protect with increased levels of caching / reduced validation are:
- parallel access from multiple Python interpreters modifying the same signac project
- files being manually modified on disk by researchers who are unaware of signac's data model (we can't protect fully here, but if possible, we want to avoid data corruption / loss and give information to the user that the data model has been violated)
At some point, the filesystem itself is the layer that gives signac atomicity, in the sense of a database transaction. Cutting out the filesystem where possible is important for performance, but may come at a cost to ACID properties (atomicity, consistency, isolation, durability). If you anticipate significant impact to those properties, please share your thoughts.
Great! That's the analysis I needed. Please verify both signac-flow and signac-dashboard tests pass, if possible. Then I can approve. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good and seems to perform well! Thanks for this
Yes, I plan to test this with flow and dashboard soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
Discussing naming offline with Josh
Also attempt to fix sphinx-doc links.
This pull request breaks flow |
Python set has a randomized iteration order. Preserve the original iteration order with a list and converto to set only for __contains__ checks.
This has the added benefit of validating all statepoints that are added to the cache. I needed to add a validate argument to update_cache because one of the unit tests relies on adding invalid statepoints to the cache.
6b30785
to
4ebf74f
Compare
Since the last review, I:
flow and dashboard work well with 1de7155 installed - behaving correctly in production runs and passing all unit tests. Here are updated benchmarks (1de7155).
Grea Lakes (scratch):
|
@@ -406,6 +414,33 @@ def update_statepoint(self, update, overwrite=False): | |||
statepoint.update(update) | |||
self.statepoint = statepoint | |||
|
|||
@property | |||
def cached_statepoint(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Must we add a new public API in order to provide the performance benefits of this PR? I am unsure if we should permit users to call this, or if it should be only leveraged internally as a private property job._cached_statepoint
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted below, I want to conceal as much as possible about topics like caching and validation from the user API as we can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I intentionally make this public.
Job.statepoint
is writeable and carries significant overhead from synced_collections
as shown in the benchmarks - reading one key from every job's statepoint takes 116 seconds even when the statepoint is in the cache.
Many workflows only need to read the statepoint, flow in particular. While flow could internally use a private _cached_statepiont
for str
keys, users need public access to the fast path so that their user-defined callable methods (key
, select
, sort_by
) can complete quickly. Many users are frustrated with 10+ minute flow status updates. As shown in the benchmarks, the same loop over projects accessing cached_statepoint
completes in 0.379 seconds - 306 times faster. This alone improves flow performance tremendously when using aggregates.
The alternative API I considered was to replace statepoint
with the read-only statepoint and require update_statepoint
to change it. I opted for a new attribute as changing statepoint
semantics is a massive breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation, that is helpful.
I would be open to changing statepoint
semantics to be read-only in a future major version. We had discussed this at one point as a possibility for signac 2. Let's file an issue for that proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted in #983.
Locally catch the JobsCorruptedError and ignore it in the test that needs to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the hard work on this @joaander.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding changes to unify docstrings around state point being two words (https://docs.signac.io/en/latest/glossary.html#term-state-point).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excited to release this!
Description
Improve the performance of signac for many use-cases, especially when workspaces have with large job counts. Read the complete commit messages for full details on each change.
To summarize:
Job.statepoint_mapping
(edit:Job.cached_statepoint
)which allows cached, read-only access to the statepoint.statepoint_mapping
is loaded on demand when the job is not in the cache.open_job
code paths now lazily loadJob._statepoint
.validate_statepoint
argument toJob.init
. WhenFalse
,init
checks only that the job directory exists.JobsCursor
so that__len__
and__contains__
are O(1).listdir
when iterating over jobs inProject
and bypass theexists
check on every job as it is opened.Motivation and Context
Users would like their scripts to complete quickly. I will post benchmark results in the comments.
Checklist: