Proposal: Unify dict classes and improve buffering and synchronization #249
The class hierarchy reflects the subtly different responsibilities needed to implement the job-document and job-statepoint interfaces. While I don't think that the hierarchy is wrong, I understand that the organization with multiple and diamond inheritance is hard to penetrate and that refactoring towards simpler relationships would likely be beneficial. I am therefore not in principle against a refactoring attempt, but I caution that it would be a non-trivial undertaking and would require very careful implementation and review, since this is an extremely sensitive area with respect to data integrity and performance. A lot of effort has been put into the current code base to ensure both of these aspects, and refactoring or possibly even replacing this code will need to reconstruct a lot of the considerations that went into the current implementation.

Before we can move forward, I would like to see a class diagram of how the revised code base would be organized. I would also like to see a detailed plan for how caching would be achieved. If we are already planning a major revision of this part of the code base, we should revisit the design for signac 2.0. Here are my recommendations for concrete next steps:
Please have a look at the signac 2.0 design and the prototype implementation. I am confident that a lot of ideas for simplification and improved caching have already been prototyped there. Finally, I recommend that we implement these core data classes in separate packages. This would help us with a clear separation of concerns, but would also enable us to release these packages separately on PyPI if we are so inclined. I would think that a lot of users could make use of a carefully implemented JSONDict and H5Store data class without needing to use signac and its namespace.
I'm fine with implementing these new versions in a separate package; I agree that it would be beneficial. I completely agree that some of this is addressed by the signac 2.0 prototype. My primary concern with signac 2.0 was that, since the changes were dramatic enough that we were considering rewriting the codebase from the ground up and then introducing compatibility layers, we ran the risk of ending up with a partial solution that wouldn't be a drop-in replacement. I think that revising our hierarchy of synced collections is self-contained enough, and high-leverage enough, that we could accomplish it within the existing framework and reap the benefits even if we made no further progress on the other components of the 2.0 proposal. My suggestion for 2.0 would be to identify similar changes that we could make to 1.x in place; once we've completed such changes, we can reevaluate the magnitude of further changes required for 2.0.

That being said, I think one of the major changes we should make that is orthogonal to the concerns 2.0 addressed is that we need to generalize our syncing to other data structures. In particular, #196 and #198 suggest the need for a more generic synced collection class.

Unless I'm very mistaken, I don't think there is any multiple or diamond inheritance. I think that a lot of what makes things confusing is really that the logic for the separation of concerns is not clear, particularly with respect to the buffering and caching logic — in particular, the fact that buffering is implemented at the `JSONDict` level.

I'll post again when I have a class diagram; that's an important first step.
I've talked to @bdice about this and he had some concerns that it would increase our maintenance burden too much. I'm coming around to his viewpoint, and I also think it's not a huge problem to delay such a split, because we can always do it in subsequent releases without harm.
I agree that a complete rewrite of the code base would require too much effort, so implementing these improvements in stages is totally reasonable. I just want to make sure that we take the design for signac 2.0 into consideration when we develop these incremental changes.
I'm not sure that a generic
👍
I've finally put together a rough class diagram that describes how I think this should be structured. A few notes for those who aren't that familiar with UML class diagram syntax:
The core idea here is to separate what I see as two distinct functionalities: 1) synchronization with an underlying file, and 2) recursively keeping a potentially nested collection-like data structure up to date.

In an ideal world, this would mean that classes at the bottom level of the hierarchy would be completely empty. For example, a SyncedJSONDict would simply inherit from the two parents, which implement all necessary abstract methods and therefore make it a concrete class. Implementing other backends would also be simpler, since adding one class would enable the use of that backend for all types of collections. In practice, certain backends may require special serialization logic for different data types, so we could implement that on an as-needed basis. In general, we should only implement the bare minimum subset to start with, but this structure should make the code more extensible if we want to enable something like an SQL backend (for simple, homogeneous schemas).

We should define a clear protocol for what a cache "looks like" (in the sense of duck typing) and allow users to provide any cache supporting this API. We can of course support some caches like Redis. I took into account the way that the signac-2 prototype was structured, but I prefer keeping the cache as a separate argument rather than having users specify an

Possible changes and important concerns:
Here's a more complex class diagram including multiple backends (JSON and pickle) and multiple collections (dict and list) to show how that might look; a code sketch of the same idea follows below. I'm not advocating pickle as a choice of backend, I just picked it as an obvious one for an example. Here's the raw UML for both diagrams so we can modify them easily later.
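To make the mix-and-match idea concrete, here is a minimal sketch of how such a hierarchy could look in code. All names are illustrative rather than a final API, and the in-place update logic is elided:

```python
import json
import pickle
from abc import ABC, abstractmethod


class SyncedCollection(ABC):
    """Base class defining the synchronization protocol."""

    @abstractmethod
    def _load(self):
        """Read data from the underlying resource into memory."""

    @abstractmethod
    def _save(self):
        """Write the in-memory data to the underlying resource."""

    @abstractmethod
    def _update(self, data):
        """Update the in-memory representation in place."""

    @abstractmethod
    def _to_base(self):
        """Convert to a plain (non-synced) Python object."""


class JSONCollection(SyncedCollection):
    """Backend mixin: synchronize with a JSON file."""

    def __init__(self, filename, **kwargs):
        self._filename = filename
        super().__init__(**kwargs)

    def _load(self):
        with open(self._filename) as f:
            self._update(json.load(f))

    def _save(self):
        with open(self._filename, 'w') as f:
            json.dump(self._to_base(), f)


class PickleCollection(SyncedCollection):
    """Backend mixin: synchronize with a pickle file."""

    def __init__(self, filename, **kwargs):
        self._filename = filename
        super().__init__(**kwargs)

    def _load(self):
        with open(self._filename, 'rb') as f:
            self._update(pickle.load(f))

    def _save(self):
        with open(self._filename, 'wb') as f:
            pickle.dump(self._to_base(), f)


class SyncedDict(SyncedCollection):
    """Collection-type mixin: dict semantics, syncing on every access."""

    def __init__(self, data=None, **kwargs):
        self._data = dict(data) if data else {}
        super().__init__(**kwargs)

    def _update(self, data):
        self._data = dict(data)  # a real implementation would reconcile in place

    def _to_base(self):
        return dict(self._data)

    def __setitem__(self, key, value):
        self._load()
        self._data[key] = value
        self._save()


class SyncedList(SyncedCollection):
    """Collection-type mixin: list semantics (methods elided in this sketch)."""


# The leaf classes are then (ideally) completely empty combinations:
class SyncedJSONDict(JSONCollection, SyncedDict): pass
class SyncedJSONList(JSONCollection, SyncedList): pass
class SyncedPickleDict(PickleCollection, SyncedDict): pass
class SyncedPickleList(PickleCollection, SyncedList): pass
```

With this layout, adding a new backend touches exactly one class, and every collection type gets it for free.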
@glotzerlab/signac-committers all comments on this would be welcome. If this is your first time reading the thread, feel free to ask any clarifying questions.
This is great! I think this architecture is clear and well-founded. A few other ideas/asides:
Had a discussion with @csadorf about this; he's largely on board, with the stipulation that the focus of the refactoring needs to be on enabling new features (better caching, interchangeable backends) and not just on reworking code that already exists. Here were a few specific points raised:
You need the hook because you must be able to react to changes of the collection. If you change the statepoint, signac must move the job's workspace directory so that the directory name stays consistent with the statepoint's hash.
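A rough sketch of what such a hook has to do; the class layout and helper names here are hypothetical, loosely modeled on the `_sp_save_hook` discussed elsewhere in this thread:

```python
import hashlib
import json
import os


def calc_id(statepoint):
    # Assumed helper: a stable hash of the statepoint contents.
    return hashlib.md5(json.dumps(statepoint, sort_keys=True).encode()).hexdigest()


class Job:
    def __init__(self, root, statepoint):
        self._root = root
        self._id = calc_id(statepoint)

    def _workspace(self, job_id):
        return os.path.join(self._root, job_id)

    def _statepoint_save_hook(self, new_statepoint):
        # React to a statepoint change: recompute the job id and move the
        # workspace directory so the "directory name == hash of statepoint"
        # invariant keeps holding.
        new_id = calc_id(new_statepoint)
        os.replace(self._workspace(self._id), self._workspace(new_id))
        self._id = new_id
```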
In addition, I'd recommend that all classes with file access should have a URI that can be used as a namespace for the global cache. So, for example, the URI for a SyncedJSONDict would be a UUID5 with namespace `signac:SyncedJSONDict` and the file URL as the name.
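A sketch of that scheme using the standard library (the helper name and the exact namespace derivation are illustrative):

```python
import uuid


def collection_uri(class_name: str, file_url: str) -> uuid.UUID:
    # Derive a namespace UUID from the class-specific string, then a UUID5
    # within that namespace keyed on the file URL.
    namespace = uuid.uuid5(uuid.NAMESPACE_URL, f"signac:{class_name}")
    return uuid.uuid5(namespace, file_url)


# Example: a stable cache-namespace key for one synced JSON file.
key = collection_uri("SyncedJSONDict", "file:///path/to/signac_statepoint.json")
```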
See also: #189
On further thought, I think the proposed infrastructure does support this, but part of the responsibility would be at the signac level, not the SyncedCollection level.
This list should contain all tasks that need to be completed before this issue can be considered closed. We can modify this list if needed, but please
There are a couple of points that I'd like some discussion on. @glotzerlab/signac-committers I'd appreciate any feedback you have on these.

### Pickle/copy/deepcopy semantics

#### Shallow copy

Shallow copies are straightforward, since all attributes are copied by reference.

#### Deep copying

I would like to revisit the appropriate copy (and therefore pickling) semantics for these classes. I believe that we should disable deep copying for all currently implemented backends, since their synchronization with some external persistent data intrinsically prevents a truly deep copy. If we add a backend that involves synchronizing a Python dict with some other dict-like object (see for instance glotzerlab/hoomd-blue#776), then those backends may choose to implement a proper deep copy. I don't think that any of our current backends make sense for this, though. For signac, a deep-copied JSONDict will always point to the same underlying file.

#### Pickling

Pickling is something of a middle ground.

### Statepoint cache

In #239 @bdice identified some issues with the statepoint cache in the process of trying to implement lazy loading. In general, the current implementation of the statepoint cache is somewhat fragile, since it is only modified in a few specific code paths. I have some ideas on how we could address this issue cleanly. Before attempting to make any changes to this part of the code, though, I'd like to get an idea of what others think about this idea or something similar, or more generally what people think about the statepoint cache and how we want to handle it going forward. Getting rid of it would be an unacceptable performance hit, so we need to find some way to fix issues with cache fidelity while retaining the same basic functionality.
### Deep copying

I don't see the issue here. In my understanding, a deep copy simply means that nested containers in the deep copy aren't references to the original container's values, but are separate objects in memory. Unless there's some pattern I'm unaware of that is producing a "singleton pattern per filename", there should be no semantic confusion over what a deep copy should mean. Shallow copy is "create a new parent object and share references to all children," while deep copy is "create a new parent object and copy all children recursively by value." The issue here is not with our implementation of synced collections or Python value/reference semantics; it's that we have to set appropriate expectations for what a deep copy should mean to users, as in the warning @vyasr linked above. This is because the in-memory representation of data is fundamentally linked to the synced source. That's the entire purpose of this class, so I think the resulting semantics (specifically, that modifying a deep copy also modifies the original object via the synchronization with the "source of truth" on disk) are reasonable. Suppose:

```python
import copy

d1 = SyncedDict('some_file.json')
shallow_copy_d1 = copy.copy(d1)
deep_copy_d1 = copy.deepcopy(d1)
assert shallow_copy_d1 == d1
assert deep_copy_d1 == d1
assert id(shallow_copy_d1) != id(d1)
assert id(shallow_copy_d1.a) == id(d1.a) # Nested data is shared by reference in a shallow copy
assert id(deep_copy_d1.a) != id(d1.a) # Nested data is NOT shared by reference in a deep copy
d2 = SyncedDict('some_file.json') # A newly constructed object should have the same semantics as a deep copy
assert d1 == d2
assert id(d1) != id(d2)
assert id(d1.a) != id(d2.a)  # Nested data is NOT shared
```

Is there anything more nuanced that I'm missing?
### State point cache

In the current implementation of #239, the state point cache in
@bdice not really, aside from the minor oddities of copying an object that's already nested. I'm mostly discussing this from the perspective that I expect most users who want to deep copy these objects to just do it, see that it succeeds, and then experience unexpected side effects, rather than read the docstring. On second thought, I no longer feel so strongly about this need; I'm still certain that this is going to happen, but I'm OK letting it happen, since deep copying is a relatively infrequently used feature anyway. We can hope that a user who knows enough to try to deep copy these objects would also recognize the problem fairly quickly, or at least know to read the docstring. Neither of these issues is really a dealbreaker (although these semantics make a deep copy pretty useless); we're just providing more ways for users to shoot themselves in the foot. Trying too hard to prevent that isn't very Pythonic, so it's just a question of where to draw the line. I prefer to play it safer with operations like this. IIRC we originally only recognized this problem after a relatively long attempt at debugging some issue that was caused by an unrecognized reference between a Job and a deep copy of it. For what it's worth, Zarr.Array supports deepcopy, while h5py.File and Python file handles do not.
Yes, I'm familiar with that discussion and the design decisions involved in the current cache, and I agree that the statepoint cache can contain extra jobs that have since been (re)moved, but my understanding was that these statements were no longer true once you started implementing lazy loading. Have you managed to rewrite your implementation so that you no longer have these problems? Or are you simply working around the performance hits that lazy loading incurs when performing validation by optimizing other parts of your code, so that the net result is faster?

The main questions that we'll need to address have more to do with the fact that, at present, the statepoint in signac is managed by a `SyncedAttrDict` whose parent hooks back into the `Job`. I think this discussion is probably best revisited once we get to the point of integrating the SyncedCollection classes into signac. At that point we can see to what extent the existing logic needs to be rewritten to take advantage of the new classes, and whether any new functionality is necessary in the new classes to avoid regressions. The suggestions that I made above are IMO the safest way to implement this, but it may be possible to use something much simpler that's good enough.
In both lazy loading and the previous implementation, the project would check its own
I agree that we should revisit this when finalizing the integration with the
Along the way to reviewing #453, I’m reading through more conceptual discussions to try to understand I see in
Clarifying my understanding: "synchronization" basically means "keeping the manifest files on disk up to date"? Is the naming of

I need help understanding why users need

From what I understand of the discussion between @vyasr and @bdice (especially the talk of users "shooting themselves in the foot"), I also support removing deep copying as it currently exists, partly because I don't know how to answer the question: what would deep copying bring to users who are not developers?
I initially didn’t know where to look but I found it: https://github.com/glotzerlab/signac-2-prototype |
@cbkerr take a deep breath, because there's a lot to unpack here :)

> The answer may be that I would be equally confused about the current implementations involving _SyncedDict, SyncedAttrDict and JSONDict and that SyncedCollection would make all of this less confusing.

I think this sentence is probably pretty representative of how most signac users and developers (including most of the committer team) feel about these classes right now, so let's orient ourselves by starting with what's currently in signac. Hopefully this explanation will prove useful to other @glotzerlab/signac-committers as well when it comes time to integrate the changes proposed by this issue.

### Current behavior of signac

The job statepoint and document are dictionary-like objects that keep a file up to date when they are modified. This process of constant synchronization is where the name `_SyncedDict` comes from. However, the requirements for the statepoint and the document are slightly different. The statepoint is not supposed to change much once it is created, so it doesn't need to be super fast, but it does need to be able to "tell" signac to move the job directory in order to maintain the invariant that a job's directory name matches the hash of its statepoint. To understand how these distinctions are handled, let's look at the `_SyncedDict` class.
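In simplified form, the pattern looks something like this (a sketch of the behavior described below, not signac's verbatim source):

```python
class _SyncedDict:
    def __init__(self, data=None, parent=None):
        self._parent = parent
        self._data = {} if data is None else dict(data)

    def _load(self):
        # Nested dicts delegate to their parent, so only the top-level
        # object actually performs file I/O.
        if self._parent is not None:
            self._parent._load()

    def _save(self):
        if self._parent is not None:
            self._parent._save()

    def __setitem__(self, key, value):
        self._load()             # refresh from disk before modifying
        self._data[key] = value
        self._save()             # persist the change immediately
```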
Hopefully it's clear from this what's happening: the data of the object is stored in an internal dictionary, and every modification loads from and saves to the file via the `_load` and `_save` methods. Now here's the catch: if you actually look at `_SyncedDict`, those methods only delegate to the `parent`, so a child class has to implement the actual I/O when no parent is set.

Now, since documents have different requirements (potentially lots of I/O, but no need to move the job), they need to behave differently. To make job documents faster, we enable buffering: if you ask them to, they'll hold off writing files to disk until after you tell them you've made all the changes that you intend to make. This allows us to make many changes, then tell the document to write them all at once, rather than writing after every small change. This logic is implemented in `jsondict.py`. You mentioned these three classes in your comment, but there's one that you left out: the

### Background on and requirements for Collections

If all of this seems confusing, that's because it is. Most of these features were added on an as-needed basis, so things just grew organically. However, over time we started identifying lots of issues. Many of them are smaller bugs that are linked somewhere in this thread. The number of different places that implement different parts of the logic is hard to keep track of, making it very likely for bugs to be fixed in some places and not others; for example, problems with synced lists need to be addressed in very different ways from synced dicts. We also identified limitations that make optimizations difficult, for instance the lazy loading that @bdice has been working to implement. Scalability beyond a certain point is also simply infeasible using a purely file-based approach, so we'd like to be able to use a true database backend, but that's extremely difficult to fit into the existing data model of signac.

There is a Python module, `collections.abc`, that provides abstract base classes (abcs) for the built-in container types. The proposal in this issue is to try to reduce the duplication and create a clearer separation of responsibilities between different classes, so that we avoid repeating ourselves (making us more bug prone) while also creating easier entry points for changing specific behaviors without needing to rewrite the entire functionality from scratch. The basic idea is that what we are trying to implement are various types of collections, all of which share the property of being synchronized with some other resource. Historically this has been a JSON file, but there's no reason that it can't be something else like a SQL database. The abcs described above basically just define specific abstract methods: methods that subclasses have to implement in order to expose necessary functionality. For instance, any `MutableMapping` must implement `__getitem__`, `__setitem__`, `__delitem__`, `__iter__`, and `__len__`, and in exchange it gets methods like `update` and `setdefault` for free. The conflict with MongoDB collections is an unfortunate coincidence, but under the circumstances both names are appropriate and make more sense than any alternative that avoids using the word Collection.

### How SyncedCollections work

Keeping any collection-like object in sync with something else adds two main twists to the standard behavior of collection-like objects. First, we need a way to actually read and write the other data, e.g. by reading a file. Second, we need a way to update the in-memory object based on what's in the file. This second aspect is more complicated than you might think, for performance reasons. Recall that we have this complex parent-child relationship set up to make sure that the top-level dictionary in a nested set of them is what syncs to a file. That also means that every time we read from or write to the file, we have to traverse our data structure to keep these linkages intact.
Additionally, it's very expensive to just recreate the whole object every single time anything happens, so we need a faster way to update it in place. The core idea with SyncedCollections is that these two problems are orthogonal: the way that you read and write is dependent on the resource you're synchronizing with (a file, a database, another in-memory object, etc.), while the way that you perform in-place updates is dependent on the type of the data (a dict, a list, etc.). Therefore, we should in principle be able to decouple these in a way that makes it possible to mix and match freely. If you go to the implementation, you'll see exactly this layout.
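For instance, defining a dict synced with another backend (reusing the hypothetical mixin names from the sketch earlier in this thread) comes down to:

```python
# Pair an existing backend mixin with an existing collection-type mixin;
# both parents implement all the abstract methods, so the body is empty.
class SyncedPickleDict(PickleCollection, SyncedDict):
    pass
```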
A two-line definition like that is effectively enough to define a new class (there's a tiny bit of extra logic I'm hiding that makes constructors play nice with multiple inheritance, but it's trivial code). Replacing

### Summary

Hopefully this super long explanation is helpful for you to understand both where we've been and what we're trying to achieve. The classes currently in signac for statepoints and documents have had various features tacked onto them over time to improve performance and functionality, but they are built around JSON, and in the case of the statepoint they are also deeply intertwined with the functionality of other core signac classes. This structure made it pretty difficult to keep up with complex bugs that could manifest differently in statepoints and job documents, or bugs in lists that didn't occur in dicts, and it also made it very difficult to improve performance, either by using different caching methods or by employing something more performant than JSON files as a backend. The new classes aim to solve this by separating these different components into different classes that can be easily mixed and matched to suit our needs in different scenarios, hopefully while also making it easier to maintain by making it obvious where to look when something breaks.
@vyasr Now that #472 is nearly done, it would be helpful if you could open a tracking PR for |
Tl;dr: We need to improve synchronization and caching logic, and I think the first step is to combine the `_SyncedDict`, `SyncedAttrDict`, and `JSONDict` classes.

I apologize in advance for the lengthy nature of this issue. It will serve as a pseudo-signac Enhancement Proposal; I'll try to document everything very thoroughly, and it can be a test case for the utility of such proposals :)
In view of our recent push for deprecations and our discussion of reorganizing namespaces and subpackages to prepare for signac 2.0, I'd like to also revisit discussion of the different dict classes. We have various open bugs and features (#234, #196, #239, #238, #198) that are related to improving our synchronization and caching processes. Our synchronization clearly has some holes in it, and in the process of making #239 @bdice has raised concerns about inconsistencies with respect to cache correctness and cache coherence, e.g. the fact that a job that exists and is cached will still exist in the cache after it is deleted (Bradley, feel free to add more information).
Fixing all of these is a complex problem, in part due to fragmentation in our implementation of various parts of the logic. I'd like to use this issue to broadly discuss the various problems that we need to fix, and we can spawn off related issues as needed once we have more of a plan of attack to address our problems. Planning this development more thoroughly is critical since the bugs that may arise touch on pretty much all critical code paths in signac. I think that a good first step is looking into simplifying the logic associated with our various dictionary classes. That change should make it easier to improve #198 since synchronization will be in one place. After that, I think it will be easier to consider the various levels of caching and properly define the invariants we want to preserve.
With respect to the various dictionary classes, I think we need to reassess and simplify our hierarchy:

- There is the core `_SyncedDict` class in `core/synceddict.py`, and I think it should exist in more or less its current form.
- `SyncedAttrDict` lives separately in `core/attrdict.py`, since attribute-based access to a dictionary is technically a feature unrelated to synchronization. However, the class is very minimal, and I think that the benefit of maintaining this level of purity in the distinction is outweighed by the increased difficulty users and newer developers have in finding code in the code base. I would like to merge these classes.
- The `JSONDict` class in `core/jsondict.py` is, in my opinion, harder to justify separating from `_SyncedDict` on a conceptual level. Although in principle one could argue for different types of file backends, in practice we're very tied to JSON. The bigger problem, though, is that in my understanding (please correct me if I'm wrong here) the primary distinction between the two classes is less about the file backend and more about buffering. We always set the `parent` of the `SyncedAttrDict` that we use for job statepoints, and this ensures that statepoint changes are immediately synced to disk. Conversely, job documents are `JSONDict` objects, which use buffering. The fact that `_SyncedDict` has `_load` and `_save` methods that essentially must be implemented by a child class when `parent` is not set, and that `JSONDict` is the only example we have of such a class, suggests that this is a level of abstraction that isn't very helpful and mainly complicates management of the code. At least for now, I would prefer to unify `JSONDict` with `_SyncedDict`: the logic for when we buffer is already governed by the `parent`, but the logic for how we buffer is governed by the various other functions in `jsondict.py` (see the buffering sketch after this list). Afterwards, if we see a benefit to separating the choice of file backend, we could recreate a `JSONDict` that really only implements JSON-specific logic. This change would have the added benefit of unifying statepoints and documents: I don't think it is intuitive design to have the document and the statepoint be two different classes for reasons of buffering, and it makes it substantially more difficult to follow the logic of how the `_sp_save_hook` works and why it's necessary. Longer term, I would like to refactor the logic for persistence vs. buffering so that the roles of `Job` and `_SyncedDict` are more disjoint, but I recognize that there may not be a way to completely decouple the bidirectional link.

@csadorf @bdice @mikemhenry any commentary on this is welcome; also please tag any other devs who might have enough knowledge of these topics to provide useful feedback.