
Buffered Synced collection #2

Closed
wants to merge 9 commits

Conversation

@vishav1771 (Owner) commented Jul 22, 2020

This PR adds a buffering feature to SyncedCollection. This is related to glotzerlab#249.

Description

Motivation and Context

glotzerlab#249

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change¹

¹ The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

Example for a changelog entry: Fix issue with launching rockets to the moon (#101, #212).

@vyasr commented Jul 23, 2020

@vishav1771 nice work, this is a good starting point! It looks like you've successfully ported the existing signac buffering logic, modifying it to take advantage of the new structure. I also think you've implemented something that I suggested, which is treating the buffer as a new back-end.

However, I don't think I've given you sufficient guidance on how caching could look, my apologies on that. Have a look at the signac-2 prototype, particularly the design.md file, which outlines our goals for caching. Some of my next comments might reference ideas in there. Before getting too much into a line-by-line review of the code, I think we need to zoom out and make sure that the overall API and concepts we're developing here will address the various needs and open issues with the current code base. @csadorf @bdice @mikemhenry @atravitz @b-butler here are a few big picture questions that I think we need to address as part of this project. I apologize if there's overlap or if they're not all completely well thought through, but I wanted to get the discussion started and get my thoughts down before I forgot them:

  • The first PR in this chain (the one creating the SyncedCollection infrastructure) enables the ability to suspend synchronization. This optimization allows (for instance) recursive modifications of a highly nested synced data structure to occur in-memory without any synchronization, followed by a single sync to file. This mechanism was implemented in early versions of the SyncedDict to prevent unnecessary I/O operations. This PR is adding the in-memory buffer to work across all reads/writes (within some context), something that was later added to signac as a performance enhancement. Do these really need to be independent code paths? I think we probably need to enable both types of functionality (feel free to correct me if you think we can get by without one), but at an implementation level can we handle the suspended synchronization by simply reading from/writing to an object-specific cache? One part of the signac-2 idea was that every resource would have a cache (either owned by itself, or pointing to a higher-level object's cache), so can we combine these?
  • I think we need to define clearly what we mean by caching: in particular, at what level do these things persist? One of the issues raised in Lazy statepoint loading glotzerlab/signac#239 (see specifically here) is the fact that currently in signac the project cache may contain invalidated state points. I think we should tighten some of our restrictions with respect to such behavior. In order to do that, we need to address what the primary source of truth is when caching is enabled. Are caches always in-memory objects? If not, how do we ensure their synchronization, for instance with respect to the current project cache? Are on-disk caches (like the current project cache) relevant, but out of scope for this PR?
  • How do we handle multiple synced collections pointing to the same file? We currently have an integrity check when flushing the buffer, so should we assume that all instances must point to the same cache at any given time? Can there be multiple active caches? If not, how do we prevent that? The suspended sync above applies on a per-object basis, but the caching should be possible to apply across many objects. The signac-2 prototype contains an example of this, using a global Redis cache if one is available, otherwise falling back to an in-memory dict cache.
  • I'd like us to be consistent with terminology, specifically caching vs buffering. I would prefer that we use cache everywhere (and that's consistent with the signac-2 design), but either way we should be consistent.

Those are my first thoughts. I'll add more as I have time to ponder, but that's food for thought for other reviewers. I'd like to hash out more of the concepts here.
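The two mechanisms contrasted above (per-object suspended synchronization vs. a shared in-memory buffer across all reads/writes) can be sketched roughly as follows. This is an illustrative sketch only; the class and method names (`SyncedDict`, `suspend_sync`, `buffered`) are stand-ins, not the actual signac API:

```python
import contextlib
import json


class SyncedDict:
    """A dict backed by a JSON file, with two distinct optimization modes."""

    _buffer = {}          # shared in-memory buffer, keyed by filename
    _buffering = False    # global "buffered mode" flag

    def __init__(self, filename, data=None):
        self._filename = filename
        self._suspended = False
        self._data = dict(data or {})
        self._write()

    def _write(self):
        if self._suspended:
            return                      # suspended sync: skip I/O entirely
        if SyncedDict._buffering:
            # buffered mode: defer the disk write into the shared buffer
            SyncedDict._buffer[self._filename] = dict(self._data)
            return
        with open(self._filename, "w") as f:
            json.dump(self._data, f)

    def __setitem__(self, key, value):
        self._data[key] = value
        self._write()

    def __getitem__(self, key):
        return self._data[key]

    @contextlib.contextmanager
    def suspend_sync(self):
        """Per-object: batch many modifications into a single sync."""
        self._suspended = True
        try:
            yield self
        finally:
            self._suspended = False
            self._write()

    @classmethod
    @contextlib.contextmanager
    def buffered(cls):
        """Global: route all writes through the shared buffer, flush at exit."""
        cls._buffering = True
        try:
            yield
        finally:
            cls._buffering = False
            for filename, data in cls._buffer.items():
                with open(filename, "w") as f:
                    json.dump(data, f)
            cls._buffer.clear()
```

The open question above is whether these two code paths can be unified, e.g. by implementing `suspend_sync` as reads/writes against an object-specific cache.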

@csadorf commented Jul 24, 2020

* The first PR in this chain (the one creating the SyncedCollection infrastructure) enables the ability to suspend synchronization. This optimization allows (for instance) recursive modifications of a highly nested synced data structure to occur in-memory without any synchronization, followed by a single sync to file. This mechanism was implemented in early versions of the SyncedDict to prevent unnecessary I/O operations. This PR is adding the in-memory buffer to work across all reads/writes (within some context), something that was later added to signac as a performance enhancement. Do these really need to be independent code paths? I think we probably need to enable both types of functionality (feel free to correct me if you think we can get by without one), but at an implementation level can we handle the suspended synchronization by simply reading from/writing to an object-specific cache? One part of the signac-2 idea was that every resource would have a cache (either owned by itself, or pointing to a higher-level object's cache), so can we combine these?

Caching and buffering are not the same; see here for the differentiation: https://en.wikipedia.org/wiki/Cache_(computing)#Buffer_vs._cache

In our context, buffering means that I/O operations can be delayed because no concurrent access is expected, i.e., they do not need to be atomic. The typical way to take advantage of this mode is with a cache. Whether a backend uses a cache or some other technique to improve performance should be left up to the backend implementation.

This means concretely that the buffering mode is merely an indicator to the backend, whether we expect that a completely different instance is going to access the signac data space concurrently.

Definition: Buffered mode means that the backend can safely assume that non-atomic I/O operations are currently safe.

Whether a specific backend implements a buffered mode or not should be completely transparent. This means that the only difference would be a potential performance increase in buffered mode, but no other change in behavior.
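That transparency requirement can be made concrete with a small sketch: buffered mode is an advisory pair of hooks that a backend may ignore entirely, in which case behavior is unchanged (only slower). All names here are illustrative, not signac's actual backend interface:

```python
class Backend:
    """Abstract backend. The buffering hooks default to no-ops, so a backend
    that ignores them behaves identically in and out of buffered mode."""

    def load(self):
        raise NotImplementedError

    def store(self, data):
        raise NotImplementedError

    def begin_buffered(self):
        pass  # advisory: non-atomic I/O is safe from here on

    def end_buffered(self):
        pass  # advisory: flush any deferred state now


class MemoryBackend(Backend):
    """Toy backend; the dict `_storage` stands in for a file or database."""

    def __init__(self):
        self._storage = {}
        self._pending = None  # deferred state while buffered, else None

    def load(self):
        source = self._pending if self._pending is not None else self._storage
        return dict(source)

    def store(self, data):
        if self._pending is not None:
            self._pending = dict(data)   # buffered: defer the real write
        else:
            self._storage = dict(data)   # unbuffered: write through

    def begin_buffered(self):
        self._pending = dict(self._storage)

    def end_buffered(self):
        self._storage, self._pending = dict(self._pending), None
```

From the caller's perspective, the only observable difference between the two modes is performance; reads always reflect the latest write, whether or not it has been flushed.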

* I think we need to define clearly what we mean by caching: in particular, at what level do these things persist? One of the issues raised in [glotzerlab#239](https://github.com/glotzerlab/signac/pull/239) (see specifically [here](https://github.com/glotzerlab/signac/pull/239#issuecomment-555653170)) is the fact that currently in signac the project cache may contain invalidated state points. I think we should tighten some of our restrictions with respect to such behavior. In order to do that, we need to address what the primary source of truth is when caching is enabled. Are caches always in-memory objects? If not, how do we ensure their synchronization, for instance with respect to the current project cache? Are on-disk caches (like the current project cache) relevant, but out of scope for this PR?

I'm no longer convinced that caching on the "signac level" is a good idea. As long as we are rigorously abstracting all I/O operations by having them implemented purely in the backend, then we do not need to worry about this kind of optimization anymore. In fact, a rigorous backend implementation is going to make this kind of performance optimization much easier. For example, if we use redis or zarr backend, then we can assume that operations are atomic and concurrent access is safe at all times.

From the point of signac, the data provided by the backend is the single source of truth, nothing else matters.

* How do we handle multiple synced collections pointing to the same file? We currently have an integrity check when flushing the buffer, so should we assume that all instances must point to the same cache at any given time? Can there be multiple active caches? If not, how do we prevent that? The suspended sync above applies on a per-object basis, but the caching should be possible to apply across many objects. The signac-2 prototype contains an example of this, using a global Redis cache if one is available, otherwise falling back to an in-memory dict cache.

As stated above, caching should be handled by the backend, no caching on the signac level.

* I'd like us to be consistent with terminology, specifically caching vs buffering. I would prefer that we use cache everywhere (and that's consistent with the signac-2 design), but either way we should be consistent.

See my first paragraph.

@vishav1771 (Owner, Author)

I have made the changes.

@vishav1771 closed this Aug 10, 2020
@csadorf commented Aug 10, 2020

@vishav1771 Did you accidentally close this?

@vishav1771 (Owner, Author)

@csadorf I have moved this PR to glotzerlab#363 .
