Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorganize modules for signac 2.0 #756

Closed
vyasr opened this issue May 6, 2022 · 13 comments · Fixed by #868
Closed

Reorganize modules for signac 2.0 #756

vyasr opened this issue May 6, 2022 · 13 comments · Fixed by #868
Assignees
Milestone

Comments

@vyasr
Copy link
Contributor

vyasr commented May 6, 2022

The current layout of signac reflects the state of the package as it was originally conceived, i.e. as a starting point for users whose data would eventually grow into a full-sized database like MongoDB. However, experience with signac has shown that in practice most of our users stay with the job-project data model, which has become the true core of signac. Many parts of signac associated with more traditional database functionality will be removed in signac 2.0, and some of the core data structures like the JSONDict have been reimplemented in a generalized fashion in synced_collections (which will be spun out into its own package, see #755), so signac 2.0 will be a much leaner package than the 1.x line. For 2.0 we should reorganize our files into a flatter namespace that is more familiar to newer developers who are not familiar with or constrained by the origins of the project or its original goals.

@vyasr vyasr added this to the v2.0.0 milestone May 6, 2022
@csadorf
Copy link
Contributor

csadorf commented Jul 19, 2022

I've tried to analyze the import structure using pydeps on next and here is the result ($ pydeps signac --rankdir BT --cluster):

signac

Here are my main conclusions:

  1. The modules with highest rank (based on connectivity) are project.py and __main__.py.
  2. The number of imports into the root namespace are actually quite low with almost all of the coming from the contrib sub-package.
  3. The synced_collections package is fairly self-contained, however most imports come directly from the JSON-backend. Especially the latter is somewhat problematic, because it feels like it breaks the modularization. @vyasr Maybe you can comment on that?

Based on this and some other considerations, I recommend to adopt the following package structure:

_cli/ # any modules that are only used for the CLI
_core/  # core business logic (includes all modules and packages from core/ and contrib/)
_core/migration/ # anything migration related
_core/synced_collections/  # probably eventually vendored or dependency
_utils/ # stand-alone classes and functions
_vendor/  # vendored packages and modules
__init__.py
__main__.py  # should call function from cli/ sub-package.
config.py
errors.py
testing.py
warnings.py

Note: The above package structure is not exhaustive.

@glotzerlab/signac-committers I would be interested in your take before I move forward with the implementation.

@vyasr
Copy link
Contributor Author

vyasr commented Jul 19, 2022

Regarding the synced_collections import, this was intentional. I'm happy to revisit the discussion, but the original rationale for this design is that, while signac happens to depend almost exclusively on the JSON parsing, once synced_collections is spun out into its own package the assumption should be that client code is equally likely to use any supported format. Users would therefore then be responsible for importing only the formats that they require. Here are some of the reasons to prefer that:

  • It requires a clear signal of intent from the user about what types of data that plan to deal with.
  • Since certain formats may have different requirements, it is neither necessary nor particularly beneficial to import them all into the top-level namespace.
  • This gives formats the flexibility to perform some package/module-level initialization upon import if they need to. Currently any format with additional requirements only performs checks on object construction (e.g. MongoDB or Zarr), but that doesn't always have to be the case.
  • Some formats may involve importing heavy third-party packages that slow down imports.

@vyasr
Copy link
Contributor Author

vyasr commented Jul 19, 2022

I think I am mostly on board with the suggested structure. Some minor comments/questions:

  • Although I have seen vendor, empirically thirdparty or extern are more common in my experience. That's mainly just a naming quibble, though.
  • I'm not sure what the difference is between core and top-level entities. Why aren't errors/warnings in core? Conversely, aren't core entities going to be used by the CLI, in which case maybe core entities should all be at the top-level? I haven't quite convinced myself either way here, just wanted to raise the question.
  • We should plan on ripping out synced_collections in 2.1 IMO, so no need to plan too much there. Since we own it, I would be fine having it as a dependency rather than vendored.
  • It feels a bit odd to have testing at the top level when it feels like more of a tacked on feature. Maybe that's just me, though.

@csadorf
Copy link
Contributor

csadorf commented Jul 20, 2022

Regarding the synced_collections import, this was intentional. I'm happy to revisit the discussion, but the original rationale for this design is that, while signac happens to depend almost exclusively on the JSON parsing, once synced_collections is spun out into its own package the assumption should be that client code is equally likely to use any supported format. [...]

No need for another discussion, I probably just forgot about the rationale. One alternative to consider that would improve modularity is to use a factory function that can be imported from the top-level namespace that returns the object with the derived class for a chosen backend (potentially with default):

from synced_collections import get_buffered_attr_dict

document = get_buffered_attr_dict(backend="json", filename=fn_doc, write_concern=True)

instead of

from synced_collections.backends.collection_json import BufferedJSONAttrDict

document = BufferedJSONAttrDict(filename=fn_doc, write_concern=True)

The type of document would be BufferedJSONAttrDict in both cases.

@csadorf
Copy link
Contributor

csadorf commented Jul 20, 2022

  • Although I have seen vendor, empirically thirdparty or extern are more common in my experience. That's mainly just a naming quibble, though.

Happy to adopt whatever is most common.

  • I'm not sure what the difference is between core and top-level entities. Why aren't errors/warnings in core? Conversely, aren't core entities going to be used by the CLI, in which case maybe core entities should all be at the top-level? I haven't quite convinced myself either way here, just wanted to raise the question.

My rationale was that warnings and error classes are maybe something that users want to directly import, i.e., the module is part of the API, and that those classes could be considered "meta" compared to modules that implement core business logic. However, I like your suggestion of moving all core/* modules and packages into the root package.

  • It feels a bit odd to have testing at the top level when it feels like more of a tacked on feature. Maybe that's just me, though.

Similar to warnings, and errors, I consider the testing module a meta-module that does not actually implement the core business logic of signac, hence why I suggested to place it in the top-level directory. I was also thinking that for someone who explores the package source code, this could be a good starting point to understand how signac functions. Further, I believe that testing modules and packages are usually at the top-level of the package that they test.

@bdice
Copy link
Member

bdice commented Jul 24, 2022

My votes:

  • Core entities should all be at the top-level. We should aim for a flatter structure overall.
  • Vendored modules should exist in a private namespace, and I like the name _vendor more than _thirdparty or _extern
  • We should use private module names (starting with underscore) wherever possible

I think making a prototype will make things easier to concretely evaluate. To reiterate previous discussions, we should make these changes at the last possible moment before 2.0 is released, so that rebasing master -> next isn't a major pain as things continue to change.

@vyasr
Copy link
Contributor Author

vyasr commented Jul 25, 2022

I wouldn't mind adding a factory, although we'd probably want to put a bit more thought into the API; I would probably expect the buffered and attr to be boolean parameters rather than part of the function name. So you'd just be doing get_[dict|list|set](...) and expect that any other args/kwargs would be forwarded to the constructor.

Purely as a matter of preference I'm probably extern > vendor > thirdparty, but I don't really care what we choose. I definitely want it underscored, though.

It sounds like we're aligned on moving core/* to the top level, in which case I'm also fine with errors/warnings/testing being top-level.

@bdice if we had a prototype, what would you evaluate? What imports look like in a script using signac? How deep the directory tree is? Something else?

@bdice
Copy link
Member

bdice commented Jul 25, 2022

if we had a prototype, what would you evaluate? What imports look like in a script using signac? How deep the directory tree is? Something else?

Yes, the depth of the directory tree and where various files end up because core/* is extensive. Just want to take a moment to think about where code should live in the new major version.

@mikemhenry
Copy link
Collaborator

I'm in favor of seeing a "prototype" that is just ascii on how it will all look. Also +1 to using _ for 3rd party code we vendor.

@csadorf
Copy link
Contributor

csadorf commented Jul 27, 2022

I already experimented with creating a prototype, but obviously rewiring all the imports takes some effort. @bdice pointed out that we should wait with this until a late stage before merging next. Regardless, I can try to create a prototype where I make no serious efforts to make the package functional and we can see how it looks.

@vyasr
Copy link
Contributor Author

vyasr commented Aug 3, 2022

That's why I asked about what exactly @bdice wanted to see from a prototype. I was going to suggest exactly what @mikemhenry did, making an ASCII diagram to see what it would look like. Rewriting all the imports now just for a prototype is a lot of effort to expend if we know that we'll have to rewrite them all again later.

@stale
Copy link

stale bot commented Nov 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 2, 2022
@vyasr
Copy link
Contributor Author

vyasr commented Nov 7, 2022

This is the next item on the to-do list.

@stale stale bot removed the stale label Nov 7, 2022
@csadorf csadorf linked a pull request Nov 18, 2022 that will close this issue
6 tasks
@vyasr vyasr closed this as completed in #868 Dec 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants