
Feature/synced collections #484

Merged — 55 commits merged into master from feature/synced_collections on Feb 20, 2021
Conversation

@vyasr (Contributor) commented Jan 27, 2021

Description

This pull request should not be squashed upon merge.
This pull request adds the synced_collections subpackage, which introduces a new SyncedCollection abstract base class defining a standardized interface for a collections.abc.Collection-like object that must be transparently synchronized with some underlying resource. These collections are designed to replace the _SyncedDict, SyncedAttrDict, and JSONDict classes with a new framework of classes that make it possible to easily interchange different data types (aside from dictionaries) and different resources (aside from JSON files). The new classes generalize the validation process so that different data types and/or resource types can specify the exact validation required. The new classes also generalize the buffering protocol for synced collections so that new buffering methods can be easily defined and interchanged. In addition, the new classes are more performant than the old versions, with the added bonus of thread-safety in parallel environments.

All of this is implemented via extensive inheritance. The new code is unfortunately substantially more complicated than the old code, but the separation of concerns is much clearer with the new classes and the new code is much more easily extensible. There is much less bleeding between the responsibilities of the Job class and the state point as well, while the new BufferedJSONDict is a drop-in replacement for the old JSONDict class in the Job and Project documents.

Breaking change: There is one very minor breaking change introduced by this PR. The force_write option in buffered mode is not supported by the new code. I have added a compatibility layer so that the argument will be processed, so it's not an API break, but it is a minor change in behavior because it will not actually force writing. This feature is extremely minor, and it's not used by anyone I'm aware of, so I'm comfortable with making this change within a minor release. As a result, I don't think that this PR requires a new major release of signac.

Documentation: Should I write up something for signac-docs? I started writing this up, but a lot of the associated documentation is internal-facing or at least for the purpose of others who wish to extend these classes. From signac's perspective, the existing documentation of the job/project document and state point already cover the necessary points. I added synced collections to the API documentation, and the docstrings within these classes are extensive.

Motivation and Context

Resolves #234
Resolves #196
Resolves #397
Resolves #316
Resolves #465
Resolves #249
Resolves #383

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change¹

¹ The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

vishav1771 and others added 30 commits July 31, 2020 10:39
Adds a generic framework for synced collections to signac. The abstract SyncedCollection defines the interface for any Collection-like object (as defined by Python's collections.abc.Collection) that is synced with some form of persistent storage. This concept generalizes the existing _SyncedDict and SyncedList classes by implementing a common framework for these. Critically, this framework decouples the act of saving to and loading from the backend from the recursive conversion between a SyncedCollection and its base data type (e.g. a SyncedDict to a dict and vice versa), the latter of which is necessary to ensure proper synchronization of nested collections. This decoupling makes it much easier to enable different back ends for a given data structure without having to modify the synchronization logic.
Adds Redis, Zarr, and MongoDB backends.
SyncedCollections can now apply arbitrary validators to input data to ensure that data conforms to a desired specification. This layer can be employed to ensure, for instance, that all inputs are valid JSON.
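In the spirit of the validator layer just described, a validator is simply a callable that raises on invalid input. The function name below is hypothetical, not signac's API:

```python
# Hypothetical validator sketch: a callable that raises when data
# violates a specification (here, JSON-serializability). Not signac's
# actual API, just the pattern the commit describes.
import json


def require_json_serializable(data):
    """Reject any data that cannot round-trip through JSON."""
    try:
        json.dumps(data)
    except TypeError as err:
        raise ValueError(f"not JSON-serializable: {data!r}") from err
```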
Adds the ability to buffer I/O in SyncedCollections using an in-memory cache to store data.
Reorganizes synced collections into their own subpackage and moves all corresponding tests into a subdirectory for easier identification.
Cleans up and standardizes the testing architecture so that all backends can easily apply the exact same set of tests using a few minor extra pieces of information. Prevents duplicate execution of tests upon import, while also adding a large number of additional tests for backends that were not previously tested.
Renames various methods and removes unnecessary ones from the public API. Standardizes internal APIs and simplifies the implementation of new backends by automating more subclass behaviors. Improves constructors to enable positional arguments. Improves interfaces for various backends by making it easier for the user to specify and access the precise storage mechanisms.
…ion (#446)

Makes the SyncedCollection framework adhere to black, isort, flake8, pydocstyle, and mypy requirements while adding sufficiently detailed documentation to all classes.
Simplifies and standardizes file buffering implementation. Adds extra tests for SyncedAttrDict.update and simplifies its implementation to use _update.
Optimize various aspects of the SyncedCollection framework, including validators, abstract base class type lookups, and the core data structures.
MongoDB and Redis will no longer be silently skipped on CI, so any changes that break support for those will be immediately discovered.
Operations that modify SyncedCollections (and their subclasses) now acquire object-specific mutexes prior to modification in order to guarantee consistency between threads. Thread safety can be enabled or disabled by the user.
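The per-object mutex pattern described in this commit can be sketched minimally as follows (class name illustrative, not the PR's API): every mutating operation acquires a lock owned by the instance.

```python
# Minimal sketch of per-object locking: each instance guards its data
# with its own reentrant lock. Illustrative names, not signac's API.
import threading


class ThreadSafeDict:
    def __init__(self):
        self._data = {}
        self._lock = threading.RLock()  # reentrant: nested ops may re-acquire

    def __setitem__(self, key, value):
        with self._lock:
            self._data[key] = value

    def __getitem__(self, key):
        with self._lock:
            return self._data[key]
```

An `RLock` (rather than a plain `Lock`) matters when a locked method calls another locked method on the same object, as happens naturally in recursive data structures.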
The new buffering mode is a variant on the old one that avoids unnecessary encoding, decoding, and in-memory updating by directly sharing memory between various collections accessing the same file in the buffer. This direct sharing allows all changes to be automatically persisted, avoiding any cache inconsistencies without the high overhead of JSON serialization for every single modification.
Completes TODO items scattered throughout the code base and removes a number of outdated ones that have already been addressed.
In addition to synced collections being thread safe individually, while in buffered mode the buffer accesses also have to be made thread safe for multithreaded operation to be safe. This pull request introduces sufficient locking mechanisms to support thread-safe reads from and writes to the buffers.
Unifies the implementation of the two different file buffering modes as much as possible using a shared base class. In addition, this fixes a couple of issues with the thread-safe buffering solution in #468 that only show up on PyPy where execution is fast enough to stress-test multithreaded execution. It also reduces thread locking scopes to the minimum possible.
Removes usage of contextlib.contextmanager and replaces it with custom context classes implementing the context manager protocol. The contextlib decorator has measurable overhead that these pre-instantiated context managers avoid. Furthermore, many of the context managers in synced collections follow a very similar counter-based pattern that is now generalized and shared between them.
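The counter-based pattern mentioned here can be sketched like this (class name illustrative): the callbacks fire only at the outermost nesting level, and the pre-instantiated object avoids `contextlib.contextmanager`'s per-entry generator overhead.

```python
# Sketch of the counter-based context manager pattern. The class name
# and callback signature are illustrative, not the PR's exact API.
class CountedContext:
    def __init__(self, on_enter, on_exit):
        self._count = 0
        self._on_enter = on_enter
        self._on_exit = on_exit

    def __enter__(self):
        if self._count == 0:  # only the outermost entry triggers the callback
            self._on_enter()
        self._count += 1
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._count -= 1
        if self._count == 0:  # only the outermost exit triggers the callback
            self._on_exit()
```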
…source arguments (e.g. filename for a JSONDict) is valid.
The old JSONDict and SyncedAttrDict classes are replaced with the new ones from the SyncedCollections framework. The new classes are now used for the Job's statepoint and document attributes as well as the Project document. The state point is now stored in the new _StatePointDict class, which extends the new JSONDict class to afford greater control over saving and loading of data. Much of the internals of the Job class have also been simplified, with most of the complex logic for job migration and validation when the state point changes now contained within the _StatePointDict.

* Replace old JSONDict with new BufferedJSONDict.

* Verify that replacing BufferedJSONDict with MemoryBufferedJSONDict works.

* Remove largely redundant _reset_sp method.

* Remove single-use internal functions in Job to reduce surface area for SyncedCollection integration.

* Move logic from _init into __init__.

* Working implementation of statepoint using new SyncedCollection.

* Remove _check_manifest.

* Expose loading explicitly to remove the need for internal laziness in the StatepointDict.

* Simplify the code as much as possible by inlining move method and catching the correct error.

* Improve documentation of context manager for statepoint loading.

* Replace MemoryBufferedJSONDict in Project for now.

* Add documentation of why jobs must be stored as a list in the statepoint.

* Address PR comments.

* Add back import.

* Ensure _StatepointDict is always initialized in constructor.

* Change _StatepointDict to validate id on load.

* Refactor error handling into _StatepointDict class.

* Update docstrings.

* Update comment.

* Fix some docstrings.

* Remove redundant JobsCorruptedError check.

* Rewrite reset_statepoint to not depend on creating another job.

* Reduce direct accesses of internal attributes and do some simplification of the code.

* Reraise errors in JSONCollection.

* Change reset to require a non-None argument and to call _update internally.

* Add reset_data method to provide clear access points of the _StatepointDict for the Job.

* Create new internal method for handling resetting.

* Move statepoint resetting logic into the statepoint object itself.

* Stop accessing internal statepoint filename attribute directly and rely on validation on construction.

* Make statepoint thread safe.

* Some minor cleanup.

* Remove now unnecessary protection of the filename key.

* Explicitly document behavior of returning None from _load_from_resource.

* Apply suggestions from code review

Co-authored-by: Bradley Dice <[email protected]>

* Rename SCJSONEncoder to SyncedCollectionJSONEncoder.

* Only access old id once.

* Move lazy attribute initialization into one location.

* Address PR requests that don't cause any issues.

* Remove the temporary state point file backup.

* Make as many old buffer tests pass as possible.

* Reset buffer size after test.

* Last changes from PR review.

Co-authored-by: Bradley Dice <[email protected]>
@bdice (Member) commented Feb 6, 2021

I ran benchmarks and profiles with this command: ./benchmark.py run -N 10 100 1000 -k 10 -s 1000 -o synced_collections.txt

My observations are that this PR is about 2x slower than master for loading state points (with profiling both enabled and disabled). Other common operations (e.g. iterating over the project without loading data) are also slower. I have a breakdown below.

Synced Collections (this PR)

[profiling screenshot]

master (close to v1.6.0)

[profiling screenshot]

The slowdown is coming from these two places. Correcting these two problems would make the new code just as fast as the old code, as far as I can tell.

  1. It appears that data is validated after loading from the resource. For instance, the _StatePointDict reads from JSON, then validates that data is JSON-encodable during the call to SyncedCollection._update. Is this strictly necessary? The old code does not perform such a validation when loading from a resource. Perhaps we need a parameter for SyncedCollection._update that skips validation?

  2. The method SyncedAttrDict.__setattr__ is consuming a considerable portion of time. It has to traverse a deep hierarchy of classes and brings a heavy cost to SyncedAttrDict instantiation at several levels of the class hierarchy. In turn, this makes open_job about 5x slower (Job.__init__ was nearly free before). I think it might be possible to defer the "fancy" attribute access until after the instance is ready for use, but I could not find a good solution for this (attempted a "guard" attribute _attribute_access_enabled but I couldn't work around the issue; using a metaclass to override __setattr__ with this hack led to a conflict).


@vyasr (Contributor, Author) commented Feb 6, 2021

@bdice thanks for doing some benchmarking. I'll go through your comments later, for now here are my thoughts regarding the two points you brought up on performance.

  1. I think it would be fine to remove validation entirely from _update. There are some API methods that call _update internally (reset, update, and __setitem__ for the SyncedAttrDict, off the top of my head), and I think at least the first two rely on _update for validation currently (__setitem__ does its own IIRC, which is another unnecessary inefficiency), so we would need to add self.validate calls to those methods. We can generally rely on the backend implementations to detect if the user attempts to create a SyncedCollection that points to an invalid resource. The small trade-off here is that removing validation in _update would prevent any additional validation of preexisting data. I don't think this would lead to destructive changes, only to unexpected behavior.
    Here's an example (completely untested, just based on my recollection of all the implementation details). Consider the case where a user creates a JSON file manually that contains dots in keys, e.g. json.dump({'foo.bar': 1}, f). If we remove validation in _update, the user would be able to initialize a JSONDict pointing to this file and make modifications, and I think the result would basically be that the key would appear not to exist. If the user did JSONDict(fn)['foo.bar'] = 2, the JSON file would look like {'foo.bar': 1, 'foo': {'bar': 2}}. If we wanted to offer some protection against this behavior, we could add a constructor parameter validate_on_init to run self._validate(self()) at the end of the constructor, but it would have to be implemented at the final class level (e.g. JSONDict) since it will only be possible after all backend and data type initialization is done. Alternatively, we could just document this behavior and leave it unhandled.
  2. I'm kind of surprised that this is so slow, the only work that __setattr__ should be doing that is affected by the deep hierarchy is constructing the set of protected keys for each class, but those are cached after the first time so it should only be expensive the very first time __setattr__ is called for any instance of the class. Without further investigation I can't say for sure why this is slowing it down. However, I would like to propose an alternative solution in lieu of further benchmarking.
    I have previously mentioned that I am uncomfortable with our overriding __getattr__, __setattr__, and __delattr__. While we're basically stuck with this decision for signac, I would not be unhappy if we could avoid introducing this into synced collections since I'm much more uncomfortable promising such a feature in a generic data structure class than I am with promising it within the context of a signac document. What if we removed this feature entirely (converted SyncedAttrDict to SyncedDict), then extend JSONDict in signac to just add those features? That would allow us to bypass all of these issues, and isolate this API decision just to signac.
    If you are in favor of this decision, it does require that we address one other point which I had been intentionally ignoring up to this point since it's not critical to any implementation details so far and it's quite tricky to resolve. Right now, nested SyncedCollections are handled by using is_base_type within from_base to determine what type of collection to create based on a registry of available types. One limitation of this approach is that it assumes that the first type to return True from is_base_type is what the user wants. Currently, this is not an issue because there only exists one type for every combination of backend and data type, but implementing the change I discussed above would introduce a subclass of JSONDict that also satisfied the same requirements. We avoid this entirely with the _StatePointDict by not registering it, meaning that dicts nested inside the _StatePointDict are actually instances of JSONDict, but this is not a problem since _load and _save calls eventually get passed up to the top-level _StatePointDict instance. However, if we extended JSONDict into a _JSONAttrDict within signac, we would need that to also persist for nested dictionaries. Currently the resolution order in from_base is arbitrary (but deterministic) based on the order in which SyncedCollection types are registered on construction. Introducing a well-defined, user-controlled order as an API promise is one solution, but I don't think it's general enough because the registry is on the package level. This means that a new SyncedCollection type registered by the user or another package would also modify the resolution order within signac, and vice versa. I'm not sure what the best way to overcome that problem is, and would be interested to hear ideas.
    We could also punt on making this decision now by just overriding the behavior of from_base in the new JSONAttrDict, so we don't have to resolve this now, but I wanted to bring it up in case someone has ideas for a better approach.
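The trade-off discussed in point 1 can be sketched roughly as follows. `ValidatedDict` and `no_dotted_keys` are hypothetical names, not signac API: an `_update` that validates by default but can be told to trust its source.

```python
# Hypothetical sketch of point 1: _update validates by default but can
# skip validation for trusted sources (e.g. data just read from disk).
# ValidatedDict and no_dotted_keys are illustrative, not signac's API.
class ValidatedDict:
    def __init__(self, validators):
        self._validators = validators
        self._data = {}

    def _update(self, data, _validate=True):
        if _validate:
            for validator in self._validators:
                validator(data)
        self._data = dict(data)


def no_dotted_keys(data):
    """Example validator: reject keys containing '.'."""
    for key in data:
        if "." in key:
            raise ValueError(f"invalid key: {key!r}")
```

Skipping validation on load is safe exactly when every write path was validated, which is the invariant the discussion above is weighing.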

vyasr and others added 8 commits February 16, 2021 16:09
This patch removes attribute-based access to synced dictionaries. This logic is moved to a new `AttrDict` mixin class that can be inherited by any other subclasses if this feature is desired.

* Simplify definition of __setattr__ by relying on complete list of protected keys.

* Move all attribute-based access to a separate mixin class AttrDict.

* Rename SyncedAttrDict to SyncedDict.

* Move synced_attr_dict to synced_dict.

* Remove attribute-based access from existing backend dict types and add new validator to check string keys.

* Isolate deprecated non-str key handling to the _convert_key_to_str json validator.

* Add tests of the AttrDict behavior.

* Use new attrdict based classes in signac and make all tests pass.

* Remove support for inheriting protected attributes, they must be defined by the class itself.

* Change initialization order to call super first wherever possible.

* Address PR comments.

* Address final round of PR comments.
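A minimal sketch of such an attribute-access mixin (protected-key handling here is simplified relative to the real implementation, and all names are illustrative):

```python
# Simplified sketch of an AttrDict-style mixin: attribute access falls
# through to the underlying mapping unless the name is a protected
# (real) attribute. The protected-key set is illustrative.
class AttrDict:
    _PROTECTED_KEYS = frozenset({"_data"})

    def __getattr__(self, name):
        # Invoked only when normal attribute lookup fails.
        try:
            return self.__getattribute__("_data")[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        if name.startswith("__") or name in self._PROTECTED_KEYS:
            super().__setattr__(name, value)  # a real attribute
        else:
            self._data[name] = value  # route into the mapping


class SimpleDict:
    """Stand-in for a plain synced dict without attribute access."""

    def __init__(self):
        self._data = {}


class SimpleAttrDict(AttrDict, SimpleDict):
    """Opting in to attribute access is just adding the mixin."""
```

Keeping the mixin separate means the generic `SyncedDict` never promises attribute access, while signac's document classes can still offer it.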
Isolate all numpy logic into a single utility file so that handling of numpy arrays can be standardized. Also substantially improves test coverage, testing a large matrix of different numpy dtypes and shapes with different types of synced collections. The testing framework as a whole is refactored to simplify the code and reduce the amount of boilerplate required to add the new numpy tests.

* Initial pass at isolating all numpy logic to a single file.

* Use pytest to generate temporary directories.

* Stop saving backend kwargs as class variables.

* Remove most references to class-level variables in tests.

* Remove the _backend_collection variable.

* Remove unnecessary autouse fixtures.

* Start adding comprehensive tests for numpy conversion.

* Make sure locks are released even if saving fails.

* Add tests of vector types and add tests for SyncedList as well as SyncedDict.

* Use pytest to parametrize numpy types instead of looping manually.

* Unify vector and scalar tests.

* Stop testing squeezing and just test the relevant shapes directly.

* Add test of reset.

* Move numpy tests back to main file.

* Remove _convert_numpy_scalar, which performed undesirable squeezing of 1-element 1d arrays, and replace usage with _convert_numpy.

* Separate numpy testing into separate functions and limit supported behavior to only the necessary use cases.

* Add a warning when implicit numpy conversion occurs.

* Update changelog.

* Address all PR comments aside from numpy conversion in type resolver.

* Catch numpy warnings in signac numpy integration tests.

* Support older versions of numpy for random number generation.

* Fix isort issue introduced by rebase.

* Address PR comments.

* Allow AbstractTypeResolver to perform arbitrary preprocessing, delegating the numpy-specialized code to the caller and making it less confusing.

* Add missing call to _convert_numpy.

* Set NUMPY_SHAPES for MongoDB tests when numpy is not present.
* Remove add_validators classmethod and instead require validators to be defined at class definition time, preventing one application from modifying the validation process for others and giving a standard means for subclasses to completely override parent validators.

* Change type in docstring.
The extra work performed by os.path.join can be slow, so this PR replaces it with direct string concatenation using os.sep.

* Don't use os.path.join where not needed.

* Update signac/contrib/job.py

Co-authored-by: Bradley Dice <[email protected]>
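The substitution is roughly the following; it is equivalent only under assumptions that hold on these hot paths (a clean prefix with no trailing separator and a plain relative component), and the variable names are illustrative.

```python
import os

prefix = "project_root"  # assumed: no trailing separator
job_id = "abc123"        # assumed: a plain relative name

# os.path.join also normalizes separators and handles absolute second
# arguments; when those cases cannot occur, plain concatenation gives
# the same result with less per-call overhead.
fast = prefix + os.sep + job_id
slow = os.path.join(prefix, job_id)
assert fast == slow
```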
Define a single validator for JSONAttrDict classes that combines the logic of other validators while reducing collection traversal costs. Also switch from converting numpy arrays to just bypassing the resolver's cache for them.

* Use single separate validator for state points for performance.

* Remove preprocessor from type resolver and instead use a blocklist that prevents caching data for certain types.

* Reorder resolvers to optimize performance.

* Make sure not to include strings as sequences.

* Move state point validator to collection_json and use for all JSONAttrDict types.

* Make sure to also check complex types.

* Add back missing period lost during stash merge.

* Address review comments.
Since the most common operation on protected keys is to check if some key is within the list of protected keys, this patch changes the `_PROTECTED_KEYS` attribute to a set for faster O(1) membership checks.

* Switch protected keys from a sequence to a set for faster containment checks.

* Change evaluation order of checks in setattr.

* Address PR comments.
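The reasoning behind this change in miniature (the key names below are illustrative, not the actual `_PROTECTED_KEYS` contents):

```python
# Membership checks against a set are O(1) on average, while a list or
# tuple must be scanned in O(n). Key names here are illustrative.
PROTECTED_KEYS_LIST = ["_data", "_name", "_parent", "_suspend_sync"]
PROTECTED_KEYS_SET = frozenset(PROTECTED_KEYS_LIST)

# Same answers, better asymptotics once the check sits on a hot path
# such as __setattr__, which runs for every attribute assignment.
assert ("_data" in PROTECTED_KEYS_SET) == ("_data" in PROTECTED_KEYS_LIST)
```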
@vyasr (Contributor, Author) commented Feb 19, 2021

@bdice here's what your benchmark looks like on the last few optimizations. I used timeit with 10 iterations to get rid of any profiling overhead and get a reasonable estimate of average performance (%%timeit -n 10 of load_code() in a Jupyter notebook):

master: 9.46 ms ± 809 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
9f24 (last commit before any optimizations): 33.1 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
adddc (define validators at class-definition time): 27.5 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
e71942 (no os path join): 25.2 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
48863 (disabling recursive validation): 24.7 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
f0537 (new JSON validator): 19.2 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
28cfe (using sets for protected keys): 18.1 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

If your benchmarks on #497 still hold (I suspect that it will still help, but less, since validation is now much faster), we should get pretty close to the performance of master.

* Defer state point initialization when lazy loading.

* Allow validation to be disabled in SyncedCollection._update.

* Unify reset_statepoint logic across methods.

* Revert validation-related changes.

* Add test, fix bug.

* Update tests/test_job.py
@vyasr vyasr mentioned this pull request Feb 20, 2021
* Add a resolver to fast-exit _from_base for non-collection types.

* Optimize synced collection init method.

* Remove validators property and access internal variable directly.

* Add early exit for _convert_array.

* Use locally cached root var instead of self._root.

* Remove superfluous duplicate check.
@bdice bdice mentioned this pull request Feb 20, 2021
* Add option to trust the data source during _update.

* Skip validation if JSON is valid.

* Fix pre-commit.

* Rename trust_source to _validate.

* Set _validate=False.
@vyasr (Contributor, Author) commented Feb 20, 2021

@bdice here's an updated set of benchmarks, including #512 both with and without #514:

Benchmarks without #514

master: 9.46 ms ± 809 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
9f24 (last commit before any optimizations): 33.1 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
adddc (define validators at class-definition time): 27.5 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
e71942 (no os path join): 25.2 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
48863 (disabling recursive validation): 24.7 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
f0537 (new JSON validator): 19.2 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
28cfe (using sets for protected keys): 18.1 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
e5b80 (defer statepoint init): 17.5 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
92593 (optimize SyncedCollection init and fast exit from _from_base): 15.6 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
ca3397 (optionally skip validation on load from disk): 15.7 ms ± 907 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Benchmarks with #514

master: 9.27 ms ± 3.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
9f24: 32.6 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
adddc: 26.1 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
e71942: 24.8 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
48863: 23.7 ms ± 3.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
f0537: 19.5 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
28cfe: 18.5 ms ± 3.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
e5b80: 17.6 ms ± 2.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
92593: 15.9 ms ± 3.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
ca3397: 15.3 ms ± 3.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So #514 doesn't really affect performance much here. However, all of these numbers are for running the benchmark the very first time through. If I rerun the benchmark on the same project (I'm in a Jupyter notebook, so this means not resetting the kernel and not calling signac.get_project again), then the time for the last commit (from PR #512) drops to ~14.5 ms on subsequent loads. This improvement is reproducible, and I don't observe it on previous commits, so the fast iteration path is helped somewhat by #512. Additionally, all of the benchmarks with #514 will be much faster for iterating over the project without loading the state point ([job for job in project]) since this will skip disk I/O, but this is true both on master and on this PR.

@bdice (Member) left a review comment:
This is ready to merge. Thank you @vyasr!

@vyasr vyasr merged commit 29c0749 into master Feb 20, 2021
@vyasr vyasr deleted the feature/synced_collections branch February 20, 2021 17:45
@vyasr vyasr mentioned this pull request Nov 14, 2021
@vyasr vyasr mentioned this pull request Mar 26, 2022
@vyasr vyasr mentioned this pull request May 6, 2022