
Buffered Synced collection #2

Closed
wants to merge 9 commits

Conversation

@vishav1771 (Owner) commented Jul 22, 2020

This PR adds a buffering feature to SyncedCollection. This is related to glotzerlab#249.

Description

Motivation and Context

glotzerlab#249

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change¹

¹ The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

Example for a changelog entry: Fix issue with launching rockets to the moon (#101, #212).

@vyasr commented Jul 23, 2020

@vishav1771 nice work, this is a good starting point! It looks like you've successfully ported the existing signac buffering logic, modifying it to take advantage of the new structure. I also think you've implemented something that I suggested, which is treating the buffer as a new back-end.

However, I don't think I've given you sufficient guidance on how caching could look, my apologies on that. Have a look at the signac-2 prototype, particularly the design.md file, which outlines our goals for caching. Some of my next comments might reference ideas in there. Before getting too much into a line-by-line review of the code, I think we need to zoom out and make sure that the overall API and concepts we're developing here will address the various needs and open issues with the current code base. @csadorf @bdice @mikemhenry @atravitz @b-butler here are a few big picture questions that I think we need to address as part of this project. I apologize if there's overlap or if they're not all completely well thought through, but I wanted to get the discussion started and get my thoughts down before I forgot them:

  • The first PR in this chain (the one creating the SyncedCollection infrastructure) enables the ability to suspend synchronization. This optimization allows (for instance) recursive modifications of a highly nested synced data structure to occur in-memory without any synchronization, followed by a single sync to file. This mechanism was implemented in early versions of the SyncedDict to prevent unnecessary I/O operations. This PR is adding the in-memory buffer to work across all reads/writes (within some context), something that was later added to signac as a performance enhancement. Do these really need to be independent code paths? I think we probably need to enable both types of functionality (feel free to correct me if you think we can get by without one), but at an implementation level can we handle the suspended synchronization by simply reading from/writing to an object-specific cache? One part of the signac-2 idea was that every resource would have a cache (either owned by itself, or pointing to a higher-level object's cache), so can we combine these?
  • I think we need to define clearly what we mean by caching: in particular, at what level do these things persist? One of the issues raised in Lazy statepoint loading glotzerlab/signac#239 (see specifically here) is the fact that currently in signac the project cache may contain invalidated state points. I think we should tighten some of our restrictions with respect to such behavior. In order to do that, we need to address what the primary source of truth is when caching is enabled. Are caches always in-memory objects? If not, how do we ensure their synchronization, for instance with respect to the current project cache? Are on-disk caches (like the current project cache) relevant, but out of scope for this PR?
  • How do we handle multiple synced collections pointing to the same file? We currently have an integrity check when flushing the buffer, so should we assume that all instances must point to the same cache at any given time? Can there be multiple active caches? If not, how do we prevent that? The suspended sync above applies on a per-object basis, but the caching should be possible to apply across many objects. The signac-2 prototype contains an example of this, using a global Redis cache if one is available, otherwise falling back to an in-memory dict cache.
  • I'd like us to be consistent with terminology, specifically caching vs buffering. I would prefer that we use cache everywhere (and that's consistent with the signac-2 design), but either way we should be consistent.

Those are my first thoughts. I'll add more as I have time to ponder, but that's food for thought for other reviewers. I'd like to hash out more of the concepts here.
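The two mechanisms contrasted above (per-object suspended synchronization vs. a shared in-memory buffer across all reads/writes) can be sketched roughly as follows. This is an illustrative sketch only; the class and method names (`SyncedDict`, `suspend_sync`, `buffered`) are stand-ins, not the actual signac API:

```python
import contextlib
import json


class SyncedDict:
    """A dict backed by a JSON file, with two distinct optimization modes."""

    _buffer = {}          # shared in-memory buffer, keyed by filename
    _buffering = False    # global "buffered mode" flag

    def __init__(self, filename, data=None):
        self._filename = filename
        self._suspended = False
        self._data = dict(data or {})
        self._write()

    def _write(self):
        if self._suspended:
            return                      # suspended sync: skip I/O entirely
        if SyncedDict._buffering:
            # buffered mode: defer the disk write into the shared buffer
            SyncedDict._buffer[self._filename] = dict(self._data)
            return
        with open(self._filename, "w") as f:
            json.dump(self._data, f)

    def __setitem__(self, key, value):
        self._data[key] = value
        self._write()

    def __getitem__(self, key):
        return self._data[key]

    @contextlib.contextmanager
    def suspend_sync(self):
        """Per-object: batch many modifications into a single sync."""
        self._suspended = True
        try:
            yield self
        finally:
            self._suspended = False
            self._write()

    @classmethod
    @contextlib.contextmanager
    def buffered(cls):
        """Global: route all writes through the shared buffer, flush at exit."""
        cls._buffering = True
        try:
            yield
        finally:
            cls._buffering = False
            for filename, data in cls._buffer.items():
                with open(filename, "w") as f:
                    json.dump(data, f)
            cls._buffer.clear()
```

The open question above is whether these two code paths can be unified, e.g. by implementing `suspend_sync` as reads/writes against an object-specific cache.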

@csadorf commented Jul 24, 2020

* The first PR in this chain (the one creating the SyncedCollection infrastructure) enables the ability to suspend synchronization. This optimization allows (for instance) recursive modifications of a highly nested synced data structure to occur in-memory without any synchronization, followed by a single sync to file. This mechanism was implemented in early versions of the SyncedDict to prevent unnecessary I/O operations. This PR is adding the in-memory buffer to work across all reads/writes (within some context), something that was later added to signac as a performance enhancement. Do these really need to be independent code paths? I think we probably need to enable both types of functionality (feel free to correct me if you think we can get by without one), but at an implementation level can we handle the suspended synchronization by simply reading from/writing to an object-specific cache? One part of the signac-2 idea was that every resource would have a cache (either owned by itself, or pointing to a higher-level object's cache), so can we combine these?

Caching and buffering are not the same; see here for the differentiation: https://en.wikipedia.org/wiki/Cache_(computing)#Buffer_vs._cache

In our context, buffering means that I/O operations can be delayed because no concurrent access is expected, i.e., they do not need to be atomic. The typical way to take advantage of this mode is with a cache. Whether a backend uses a cache or some other technique to improve performance should be left up to the backend implementation.

This means concretely that the buffering mode is merely an indicator to the backend, whether we expect that a completely different instance is going to access the signac data space concurrently.

Definition: Buffered mode means that the backend can safely assume that non-atomic I/O operations are currently safe.

Whether a specific backend implements a buffered mode or not should be completely transparent. This means that the only difference would be a potential performance increase in buffered mode, but no other change in behavior.
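That transparency requirement can be made concrete with a small sketch: buffered mode is an advisory pair of hooks that a backend may ignore entirely, in which case behavior is unchanged (only slower). All names here are illustrative, not signac's actual backend interface:

```python
class Backend:
    """Abstract backend. The buffering hooks default to no-ops, so a backend
    that ignores them behaves identically in and out of buffered mode."""

    def load(self):
        raise NotImplementedError

    def store(self, data):
        raise NotImplementedError

    def begin_buffered(self):
        pass  # advisory: non-atomic I/O is safe from here on

    def end_buffered(self):
        pass  # advisory: flush any deferred state now


class MemoryBackend(Backend):
    """Toy backend; the dict `_storage` stands in for a file or database."""

    def __init__(self):
        self._storage = {}
        self._pending = None  # deferred state while buffered, else None

    def load(self):
        source = self._pending if self._pending is not None else self._storage
        return dict(source)

    def store(self, data):
        if self._pending is not None:
            self._pending = dict(data)   # buffered: defer the real write
        else:
            self._storage = dict(data)   # unbuffered: write through

    def begin_buffered(self):
        self._pending = dict(self._storage)

    def end_buffered(self):
        self._storage, self._pending = dict(self._pending), None
```

From the caller's perspective, the only observable difference between the two modes is performance; reads always reflect the latest write, whether or not it has been flushed.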

* I think we need to define clearly what we mean by caching: in particular, at what level do these things persist? One of the issues raised in [glotzerlab#239](https://github.com/glotzerlab/signac/pull/239) (see specifically [here](https://github.com/glotzerlab/signac/pull/239#issuecomment-555653170)) is the fact that currently in signac the project cache may contain invalidated state points. I think we should tighten some of our restrictions with respect to such behavior. In order to do that, we need to address what the primary source of truth is when caching is enabled. Are caches always in-memory objects? If not, how do we ensure their synchronization, for instance with respect to the current project cache? Are on-disk caches (like the current project cache) relevant, but out of scope for this PR?

I'm no longer convinced that caching on the "signac level" is a good idea. As long as we are rigorously abstracting all I/O operations by having them implemented purely in the backend, then we do not need to worry about this kind of optimization anymore. In fact, a rigorous backend implementation is going to make this kind of performance optimization much easier. For example, if we use redis or zarr backend, then we can assume that operations are atomic and concurrent access is safe at all times.

From the point of signac, the data provided by the backend is the single source of truth, nothing else matters.

* How do we handle multiple synced collections pointing to the same file? We currently have an integrity check when flushing the buffer, so should we assume that all instances must point to the same cache at any given time? Can there be multiple active caches? If not, how do we prevent that? The suspended sync above applies on a per-object basis, but the caching should be possible to apply across many objects. The signac-2 prototype contains an example of this, using a global Redis cache if one is available, otherwise falling back to an in-memory dict cache.

As stated above, caching should be handled by the backend, no caching on the signac level.

* I'd like us to be consistent with terminology, specifically caching vs buffering. I would prefer that we use cache everywhere (and that's consistent with the signac-2 design), but either way we should be consistent.

See my first paragraph.

@vishav1771 (Owner, Author)

I have made the changes.

@vishav1771 closed this Aug 10, 2020
@csadorf commented Aug 10, 2020

@vishav1771 Did you accidentally close this?

@vishav1771 (Owner, Author)

@csadorf I have moved this PR to glotzerlab#363 .
