Improve performance #975

Merged
merged 23 commits into from
Feb 13, 2024

Member

@joaander joaander commented Feb 7, 2024

Description

Improve the performance of signac for many use-cases, especially when workspaces have large job counts. Read the complete commit messages for full details on each change.

To summarize:

  • Introduce Job.statepoint_mapping (edit: Job.cached_statepoint), which allows cached, read-only access to the statepoint. statepoint_mapping is loaded on demand when the job is not in the cache.
  • Use the statepoint cache in additional code paths.
  • All open_job code paths now lazily load Job._statepoint.
  • Add validate_statepoint argument to Job.init. When False, init checks only that the job directory exists.
  • Cache job ids in JobsCursor so that __len__ and __contains__ are O(1).
  • Re-use the results from listdir when iterating over jobs in Project and bypass the exists check on every job as it is opened.

Motivation and Context

Users would like their scripts to complete quickly. I will post benchmark results in the comments.

joaander and others added 8 commits February 7, 2024 09:38
_StatePointDict takes significant time to initialize, even when the statepoint dict is
known.

Adjust `Job` initialization to make more use of the statepoint cache and initialize
`_StatePointDict` only when `Job.statepoint` is accessed. Provide a faster path for
cached *read* access to the statepoint dict via the new property `Job.statepoint_dict`.

One side effect of this change is that some warnings are now deferred to `statepoint`
access that were previously issued during `Job.__init__` (see changes in tests/).

There are additional opportunities to use the cached statepoint dict in
`Project.groupby` that this commit does not address.
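
The deferred initialization described in this commit message can be pictured with a minimal, self-contained sketch. `ExpensiveStatePoint` and `LazyJob` are hypothetical stand-ins, not signac classes; the sketch only shows the costly wrapper being built on first access instead of in `__init__`.

```python
class ExpensiveStatePoint(dict):
    """Hypothetical stand-in for a wrapper that is slow to construct."""

    instances_created = 0  # counts how often the slow path runs

    def __init__(self, data):
        super().__init__(data)
        ExpensiveStatePoint.instances_created += 1


class LazyJob:
    """Hypothetical job that builds its writable state point on demand."""

    def __init__(self, cached):
        self._cached = dict(cached)  # plain dict taken from a cache: cheap
        self._statepoint = None      # expensive wrapper not built yet

    @property
    def statepoint(self):
        # Slow path: construct the wrapper only on first access.
        if self._statepoint is None:
            self._statepoint = ExpensiveStatePoint(self._cached)
        return self._statepoint

    @property
    def cached(self):
        # Fast path: hand back the plain dict, no wrapper construction.
        return self._cached


job = LazyJob({"a": 1})
value = job.cached["a"]                          # fast path only
created_before = ExpensiveStatePoint.instances_created
_ = job.statepoint                               # triggers the slow path once
created_after = ExpensiveStatePoint.instances_created
```

In this toy version, reading through `cached` never constructs the wrapper, which is the effect the commit message describes for read-only access.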
Cache the ids matching the job filter. This enables O(1) cost for __len__ and
__contains__ as users would expect. In some use-cases, signac-flow repeatedly calls
__contains__ on a JobsCursor.

The side effect of this change is that modifications to the workspace will not be
reflected in existing JobsCursor instances. This behavior was not previously documented
in the user API.
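
A rough sketch of the trade-off described in this commit message, assuming a hypothetical `CachedCursor` class (not the real `JobsCursor`): the matching ids are scanned once, so `__len__` and `__contains__` are cheap afterwards, but later workspace changes are not seen by an existing instance.

```python
class CachedCursor:
    """Hypothetical cursor that snapshots the matching ids once."""

    def __init__(self, find_ids):
        self._find_ids = find_ids  # callable that scans the workspace
        self._ids = None           # filled on first use

    def _load(self):
        if self._ids is None:
            self._ids = frozenset(self._find_ids())
        return self._ids

    def __len__(self):
        return len(self._load())

    def __contains__(self, job_id):
        return job_id in self._load()


workspace = {"id1", "id2"}
cursor = CachedCursor(lambda: set(workspace))
n_before = len(cursor)   # triggers the single scan
workspace.add("id3")     # the workspace changes afterwards
n_after = len(cursor)    # still reports the snapshot
sees_new = "id3" in cursor
```

Creating a new cursor is the way to pick up the change in this sketch.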
`with job`, `Job.document`, and `Job.stores` call `init()` because they require that the
job directory exists. Prior to this change, `init()` also forced a load of the
`_StatePointDict`. These methods now call `init(validate_statepoint=False)`, which exits
early when the job directory exists.

This change provides a reasonable performance boost (5x on NVME, more on network
storage). There may be more room for improvement as there are currently 2N stat calls
in this use-case:
```python
for job in project:
    with job:
        pass
```
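
The early exit described in this commit message might look roughly like the following. This is a sketch under stated assumptions: `init_dir`, its `validate` flag, and the `statepoint.json` file name are hypothetical and only mirror the idea, not signac's actual `Job.init` implementation.

```python
import json
import os
import tempfile


def init_dir(path, statepoint, validate=True):
    """Hypothetical init: create the directory, optionally validate it."""
    if os.path.isdir(path) and not validate:
        return "exists"  # early exit: a single existence check
    os.makedirs(path, exist_ok=True)
    sp_file = os.path.join(path, "statepoint.json")  # hypothetical file name
    if not os.path.isfile(sp_file):
        with open(sp_file, "w") as f:
            json.dump(statepoint, f)
        return "created"
    with open(sp_file) as f:
        if json.load(f) != statepoint:  # slow path: read and compare
            raise ValueError("state point on disk does not match")
    return "validated"


with tempfile.TemporaryDirectory() as root:
    job_dir = os.path.join(root, "job")
    first = init_dir(job_dir, {"a": 1})                   # creates the directory
    second = init_dir(job_dir, {"a": 1}, validate=False)  # early exit
    third = init_dir(job_dir, {"a": 1})                   # full validation
```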
deepcopy is unexpectedly expensive. Refactor the earlier commit to deepcopy only
user-provided statepoint dicts. Statepoints from the cache are passed to the user
read-only via MappingProxyType.
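
As an illustration of the two paths in this commit message, the sketch below wraps a cached dict in `types.MappingProxyType` without copying and deep-copies only a user-provided dict. The `cache` contents and the helper names `read_only` and `from_user` are hypothetical.

```python
from copy import deepcopy
from types import MappingProxyType

cache = {"abc123": {"a": 1, "nested": {"b": 2}}}  # hypothetical cache entry


def read_only(job_id):
    # No copy: wrap the cached dict in a read-only view.
    return MappingProxyType(cache[job_id])


def from_user(statepoint):
    # User-provided dicts are copied so later edits by the caller
    # cannot leak into internal state.
    return deepcopy(statepoint)


view = read_only("abc123")
try:
    view["a"] = 5
    write_blocked = False
except TypeError:
    write_blocked = True

user_sp = {"a": 1}
stored = from_user(user_sp)
user_sp["a"] = 99  # does not affect the stored copy
```

Note that `MappingProxyType` is a shallow view: it blocks assignment on the top-level mapping only, so a nested dict reached through the view remains mutable.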
`open_job` uses the statepoint cache to improve performance. Read the
cache from disk in `open_job` (if it has not already been read). This provides
consistently high performance in cases where `open_job` is called before any other
method that may have triggered `_get_statepoint`.
Users may find the messages too verbose. At the same time, users might never realize
that they should run `signac update-cache` without this message...
`open_job` is a user-facing function and performs error checking on the id. This check
involves a stat call to verify the job directory exists. When `Project` is looping over
ids from `_get_job_ids`, the directory is known to exist (subject to race conditions).
`stat` calls are expensive, especially on networked filesystems.

Instantiating `Job` directly bypasses this check in `open_job`.
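
To make the cost difference concrete, here is a small standard-library sketch contrasting one existence check per id with a single directory listing. The helper names are hypothetical, and the temporary directory stands in for a workspace.

```python
import os
import tempfile


def ids_via_stat(root, candidate_ids):
    # One existence check per id: O(N) filesystem round trips.
    return [i for i in candidate_ids if os.path.isdir(os.path.join(root, i))]


def ids_via_listdir(root):
    # One directory listing; entries existed at listing time.
    return sorted(os.listdir(root))


with tempfile.TemporaryDirectory() as root:
    names = [f"job{i:03d}" for i in range(5)]
    for name in names:
        os.mkdir(os.path.join(root, name))
    by_stat = ids_via_stat(root, names)
    by_listdir = ids_via_listdir(root)
```

Both approaches return the same ids here; the second one simply trusts the listing instead of re-checking each directory, which is the race-condition caveat mentioned above.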
@joaander
Member Author

joaander commented Feb 7, 2024

@tcmoore3 @janbridley here are the signac modifications I've been talking about. I will post benchmarks soon.

codecov bot commented Feb 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (5a82c4c) 85.71% compared to head (6639566) 86.09%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #975      +/-   ##
==========================================
+ Coverage   85.71%   86.09%   +0.37%     
==========================================
  Files          20       20              
  Lines        3466     3503      +37     
  Branches      760      770      +10     
==========================================
+ Hits         2971     3016      +45     
+ Misses        337      330       -7     
+ Partials      158      157       -1     

…Job.

This allows Job to avoid some `stat` calls which greatly improves performance on
networked file systems.
@joaander joaander force-pushed the improve-performance branch from 5e7ab62 to 57ef8a9 Compare February 8, 2024 01:26
These missed opportunities to pre-populate _statepoint_mapping triggered slow code
paths.
@janbridley janbridley self-assigned this Feb 8, 2024
@joaander
Member Author

joaander commented Feb 8, 2024

Here are the results for a suite of benchmarks run on cheme-hodges (NVME) with 100,000 jobs in the workspace. For example, the access_statepoint_mapping benchmark is:

for job in project:
    job.statepoint_mapping['a']

The "no cache" column is run with no statepoint cache file on disk. The "cached" column is measured after executing signac update-cache.

main:

| Benchmark | no cache (s) | cached (s) |
|---|---|---|
| iterate_jobs | 0.365 | 0.379 |
| with_job | 5.6 | 5.65 |
| open_job_id | 0.299 | 0.303 |
| open_job_statepoint | 1.34 | 1.31 |
| open_job_statepoint_id | 0.942 | 0.938 |
| access_statepoint_mapping | 3.73 | 3.72 |
| access_statepoint | 3.55 | 3.62 |
| access_job_document | 8.08 | 8.03 |
| access_job_stores | 34.2 | 33.9 |
| find_all | 2.07 | 1.36 |
| groupby | 2.47 | 1.66 |
| project_len | 6.6 | 6.62 |
| project_contains | 0.47 | 0.474 |
| find_jobs_len | 24.9 | 30.1 |
| find_jobs_contains | >240 | >240 |

This pull request (7998f9a0b2084f99493312d17ee28391cffce8e4):

| Benchmark | no cache (s) | cached (s) |
|---|---|---|
| iterate_jobs | 0.158 | 0.16 |
| with_job | 0.533 | 0.541 |
| open_job_id | 0.323 | 0.169 |
| open_job_statepoint | 0.507 | 0.579 |
| open_job_statepoint_id | 0.0733 | 0.0762 |
| access_statepoint_mapping | 1.08 | 0.292 |
| access_statepoint | 3.27 | 3.3 |
| access_job_document | 3.01 | 2.79 |
| access_job_stores | 25.5 | 25.6 |
| find_all | 1.15 | 0.478 |
| groupby | 2.54 | 1.65 |
| project_len | 6.64 | 6.67 |
| project_contains | 0.497 | 0.398 |
| find_jobs_len | 1.06 | 0.391 |
| find_jobs_contains | 1.17 | 0.527 |

@joaander
Member Author

joaander commented Feb 8, 2024

Here is the same on Great Lakes scratch (GPFS).

main:

| Benchmark | no cache (s) | cached (s) |
|---|---|---|
| iterate_jobs | 45.5 | 46.6 |
| with_job | 136.0 | 128.0 |
| open_job_id | 49.0 | 50.6 |
| open_job_statepoint | 1.69 | 1.66 |
| open_job_statepoint_id | 1.19 | 1.17 |
| access_statepoint_mapping | 117.0 | 114.0 |
| access_statepoint | 117.0 | 123.0 |
| access_job_document | 199.0 | 201.0 |
| access_job_stores | >240 | >240 |
| find_all | 120.0 | 2.17 |
| groupby | 113.0 | 2.24 |
| project_len | 9.37 | 9.36 |
| project_contains | 50.1 | 52.3 |
| find_jobs_len | 151.0 | 45.1 |
| find_jobs_contains | >240 | >240 |

This pull request (7998f9a0b2084f99493312d17ee28391cffce8e4):

| Benchmark | no cache (s) | cached (s) |
|---|---|---|
| iterate_jobs | 0.214 | 0.215 |
| with_job | 52.5 | 53.2 |
| open_job_id | 48.8 | 0.256 |
| open_job_statepoint | 0.616 | 0.732 |
| open_job_statepoint_id | 0.0893 | 0.0877 |
| access_statepoint_mapping | 116.0 | 0.403 |
| access_statepoint | 110.0 | 115.0 |
| access_job_document | 116.0 | 111.0 |
| access_job_stores | >240 | >240 |
| find_all | 119.0 | 0.764 |
| groupby | 115.0 | 2.23 |
| project_len | 9.42 | 9.41 |
| project_contains | 50.8 | 50.7 |
| find_jobs_len | 112.0 | 0.66 |
| find_jobs_contains | 106.0 | 0.763 |
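
For readers who want the relative gains rather than raw timings, the ratios can be computed directly from the "cached (s)" columns of the two Great Lakes tables above. The snippet is only a convenience sketch; the numbers are copied from those tables and the selection of rows is arbitrary.

```python
# (main cached s, this PR cached s), copied from the Great Lakes tables above
cached_timings = {
    "iterate_jobs": (46.6, 0.215),
    "open_job_id": (50.6, 0.256),
    "access_statepoint_mapping": (114.0, 0.403),
    "find_jobs_len": (45.1, 0.66),
}

# Speedup factor = time on main / time on this pull request
speedups = {
    name: before / after for name, (before, after) in cached_timings.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.0f}x faster")
```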

@joaander
Member Author

joaander commented Feb 8, 2024

Nearly all usage scenarios are significantly faster. There are some remaining cached benchmarks that take more than several seconds:

  • with_job cd's into the job directory. This requires O(N) chdir calls which account for nearly the entire 49 seconds in the benchmark. There is no opportunity for further optimization here.
  • access_statepoint and access_job_document activate the expensive synced_collections interface to the json files.
  • access_job_stores spends almost all time in h5store.py and presumably the hdf5 Python package.
  • groupby uses Job.statepoint. groupby could be updated to use statepoint_mapping to greatly improve performance, but doing so would require additional work to emulate the dotted access notation.
  • project_contains checks job in project for every job. This makes O(N) stat calls which account for nearly the entire 50 seconds. Caching this is challenging because signac can make few assumptions about when the job workspace directory changes. We can relax these assumptions in signac-flow and cache the listdir results while in buffered mode. One listdir is much faster than O(N) calls to stat (see also the improvement in iterate_jobs performance).

@joaander joaander force-pushed the improve-performance branch from bc862f4 to 4f135b6 Compare February 8, 2024 17:28
@joaander joaander marked this pull request as ready for review February 8, 2024 17:36
@joaander joaander requested review from a team as code owners February 8, 2024 17:36
@joaander joaander requested review from cbkerr and jennyfothergill and removed request for a team February 8, 2024 17:36
@joaander
Member Author

joaander commented Feb 8, 2024

This is ready for review. I suggest waiting to merge until I complete additional testing of this branch with signac-flow.

Member

@bdice bdice left a comment

The benchmarks look great! Nice job.

I worked quite a bit on improving signac performance prior to signac 2. At first glance, these optimizations seem fine, but it would be nice to know if there are tradeoffs in guaranteed consistency. Two important cases that are very hard to protect with increased levels of caching / reduced validation are:

  • parallel access from multiple Python interpreters modifying the same signac project
  • files being manually modified on disk by researchers who are unaware of signac's data model (we can't protect fully here, but if possible, we want to avoid data corruption / loss and give information to the user that the data model has been violated)

At some point, the filesystem itself is the layer that gives signac atomicity, in the sense of a database transaction. Cutting out the filesystem where possible is important for performance, but may come at a cost to ACID properties (atomicity, consistency, isolation, durability). If you anticipate significant impact to those properties, please share your thoughts.

@bdice
Member

bdice commented Feb 8, 2024

Great! That's the analysis I needed. Please verify both signac-flow and signac-dashboard tests pass, if possible. Then I can approve.

@janbridley janbridley left a comment

Looks good and seems to perform well! Thanks for this

@joaander
Member Author

joaander commented Feb 8, 2024

Yes, I plan to test this with flow and dashboard soon.

Member

@cbkerr cbkerr left a comment

Great!

Discussing naming offline with Josh

@joaander joaander marked this pull request as draft February 10, 2024 17:23
@joaander
Member Author

This pull request breaks flow aggregate.groupsof (and possibly other aggregates). The jobs appear in random orders in the aggregates. I am investigating.

Python set has a randomized iteration order. Preserve the original iteration
order with a list and convert to a set only for __contains__ checks.
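
The fix described in this commit message can be sketched as a container that keeps a list for iteration and a set for membership tests. `OrderedIds` is a hypothetical name used for illustration and is not the class in signac.

```python
class OrderedIds:
    """Hypothetical container: stable iteration, O(1) membership."""

    def __init__(self, ids):
        self._ordered = list(ids)          # preserves the original order
        self._lookup = set(self._ordered)  # used only for `in` checks

    def __iter__(self):
        return iter(self._ordered)

    def __len__(self):
        return len(self._ordered)

    def __contains__(self, job_id):
        return job_id in self._lookup


ids = OrderedIds(["c3", "a1", "b2"])
order = list(ids)  # same order as the input, on every run
```

The cost is storing each id twice, in exchange for deterministic ordering and constant-time `in` checks.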
This has the added benefit of validating all statepoints that are
added to the cache. I needed to add a validate argument to update_cache
because one of the unit tests relies on adding invalid statepoints
to the cache.
@joaander joaander force-pushed the improve-performance branch from 6b30785 to 4ebf74f Compare February 12, 2024 01:41
@joaander joaander marked this pull request as ready for review February 12, 2024 01:44
@joaander
Member Author

Since the last review, I:

  1. Restored the job iteration order.
  2. Added cached_statepoint validation when it is read from disk. This includes validation on update-cache to prevent invalid mappings from reaching the cache. One unit test relied on an invalid cache, so I added a validate argument to opt out of this behavior. Now, cached_statepoint will raise a JobsCorruptedError if the calc_id(statepoint) and the job's id do not match when loaded from disk. This matches the signac 2.1.0 behavior with statepoint.
  3. Used cached_statepoint to accelerate groupby. After testing, I learned that "dotted" keys referred only to the "sp." and "doc." prefixes.
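
The validation in item 2 amounts to checking that a cache entry's content hashes back to its own key. The following is a sketch under stated assumptions: `toy_id` uses SHA-256 of key-sorted JSON purely for illustration and is not claimed to match signac's `calc_id`, and `CorruptedEntry` is a hypothetical stand-in for `JobsCorruptedError`.

```python
import hashlib
import json


class CorruptedEntry(Exception):
    """Hypothetical stand-in for the error raised on a mismatch."""


def toy_id(statepoint):
    # Assumption for illustration: id = hash of the key-sorted JSON encoding.
    blob = json.dumps(statepoint, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()


def validated(job_id, statepoint):
    # Reject cache entries whose content does not hash to their key.
    if toy_id(statepoint) != job_id:
        raise CorruptedEntry(job_id)
    return statepoint


good_id = toy_id({"a": 1})
ok = validated(good_id, {"a": 1})
try:
    validated(good_id, {"a": 2})  # content no longer matches the id
    rejected = False
except CorruptedEntry:
    rejected = True
```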

flow and dashboard work well with 1de7155 installed - behaving correctly in production runs and passing all unit tests.

Here are updated benchmarks (1de7155).
cheme-hodges (NVME):

| Benchmark | no cache (s) | cached (s) |
|---|---|---|
| iterate_jobs | 0.151 | 0.149 |
| with_job | 0.529 | 0.523 |
| open_job_id | 0.318 | 0.172 |
| open_job_statepoint | 0.508 | 0.58 |
| open_job_statepoint_id | 0.0717 | 0.0701 |
| access_cached_statepoint | 1.44 | 0.271 |
| access_statepoint | 3.15 | 3.21 |
| access_job_document | 2.78 | 2.85 |
| access_job_stores | 25.8 | 25.6 |
| find_all | 1.56 | 0.456 |
| groupby | 1.72 | 0.585 |
| to_dataframe | 5.45 | 3.72 |
| project_len | 6.57 | 6.58 |
| project_contains | 0.499 | 0.391 |
| find_jobs_len | 1.44 | 0.389 |
| find_jobs_contains | 1.58 | 0.53 |

Great Lakes (scratch):

| Benchmark | no cache (s) | cached (s) |
|---|---|---|
| iterate_jobs | 0.197 | 0.194 |
| with_job | 44.0 | 45.8 |
| open_job_id | 47.6 | 0.293 |
| open_job_statepoint | 0.603 | 0.732 |
| open_job_statepoint_id | 0.0918 | 0.0904 |
| access_cached_statepoint | 116.0 | 0.379 |
| access_statepoint | 113.0 | 116.0 |
| access_job_document | 119.0 | 117.0 |
| access_job_stores | >240 | >240 |
| find_all | 113.0 | 0.852 |
| groupby | 111.0 | 0.857 |
| to_dataframe | 183.0 | 119.0 |
| project_len | 9.58 | 9.38 |
| project_contains | 48.7 | 50.0 |
| find_jobs_len | 115.0 | 0.902 |
| find_jobs_contains | 109.0 | 0.768 |

@joaander joaander requested review from bdice and cbkerr February 12, 2024 13:10
@cbkerr cbkerr mentioned this pull request Feb 12, 2024
6 tasks
@@ -406,6 +414,33 @@ def update_statepoint(self, update, overwrite=False):
        statepoint.update(update)
        self.statepoint = statepoint

    @property
    def cached_statepoint(self):

Member

Must we add a new public API in order to provide the performance benefits of this PR? I am unsure if we should permit users to call this, or if it should be only leveraged internally as a private property job._cached_statepoint.

Member

As noted below, I want to conceal as much as possible about topics like caching and validation from the user API as we can.

Member Author

Yes, I intentionally make this public.

Job.statepoint is writeable and carries significant overhead from synced_collections as shown in the benchmarks - reading one key from every job's statepoint takes 116 seconds even when the statepoint is in the cache.

Many workflows only need to read the statepoint, flow in particular. While flow could internally use a private _cached_statepoint for str keys, users need public access to the fast path so that their user-defined callable methods (key, select, sort_by) can complete quickly. Many users are frustrated with 10+ minute flow status updates. As shown in the benchmarks, the same loop over projects accessing cached_statepoint completes in 0.379 seconds - 306 times faster. This alone improves flow performance tremendously when using aggregates.

The alternative API I considered was to replace statepoint with the read-only statepoint and require update_statepoint to change it. I opted for a new attribute as changing statepoint semantics is a massive breaking change.
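
To illustrate why read-only access is enough for user-defined callables such as `key` or `sort_by`, here is a standard-library sketch that sorts and groups ids using only reads from read-only mappings. The `statepoints` data and the `key` function are hypothetical and do not use the signac or flow APIs.

```python
from itertools import groupby
from types import MappingProxyType

# Hypothetical read-only state points keyed by job id.
statepoints = {
    "id1": MappingProxyType({"temperature": 1.0, "replica": 0}),
    "id2": MappingProxyType({"temperature": 2.0, "replica": 0}),
    "id3": MappingProxyType({"temperature": 1.0, "replica": 1}),
}


def key(job_id):
    # User-defined callable: needs read access only.
    return statepoints[job_id]["temperature"]


ordered = sorted(statepoints, key=key)  # stable sort keeps id1 before id3
groups = {k: list(g) for k, g in groupby(ordered, key=key)}
```

Nothing in this workflow writes to a state point, so a writable, synchronized object would add overhead without any benefit here.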

Member

Thanks for the explanation, that is helpful.

I would be open to changing statepoint semantics to be read-only in a future major version. We had discussed this at one point as a possibility for signac 2. Let's file an issue for that proposal.

Member Author

Noted in #983.

Locally catch the JobsCorruptedError and ignore it in the test that needs to.
Member

@bdice bdice left a comment

Thanks for the hard work on this @joaander.

Member

@cbkerr cbkerr left a comment

Adding changes to unify docstrings around state point being two words (https://docs.signac.io/en/latest/glossary.html#term-state-point).

Member

@cbkerr cbkerr left a comment

Excited to release this!

@cbkerr cbkerr merged commit 2d6db63 into main Feb 13, 2024
17 checks passed
@cbkerr cbkerr deleted the improve-performance branch February 13, 2024 16:28
@cbkerr cbkerr added this to the v2.2.0 milestone Feb 13, 2024