GSoC 2020 Projects
This page contains project ideas for students applying to the Google Summer of Code 2020. We recommend that prospective students join our Slack workspace to discuss project proposals. Be sure to read our Code of Conduct - respect is important and you will be working with a team from many backgrounds.
signac is a data management framework named after the painter Paul Signac, whose colorful pointillist style resembles a collection of data "points". The signac framework is designed to help researchers design, manage, and execute computational studies. The core data management package signac helps users track data and metadata for file-based workflows (e.g. large molecular simulations) with features for searchability, collaboration, reproducibility, and archival.
The companion package signac-flow automates workflow submission on high performance computing clusters operated by universities, companies, and federal research labs. The architecture of signac is specifically aimed at research, where questions change rapidly, data models are always in flux, and computing infrastructure varies widely from project to project. Portability and fast prototypes are signac's strong suit -- compute some jobs, analyze the outputs, write a paper, and archive the data. The signac framework is available for Python 3.5+, can be installed with pip or conda, and is licensed BSD-3.
To learn more about signac, check out the signac website and framework documentation. You can also follow @signacdata on Twitter.
Above all else, we are looking for an enthusiastic student who is willing to learn and works well with our team. The signac framework is written in Python 3 and our organization relies on git, so basic familiarity with both Python and git is valuable.
We recommend you take a look at a "good first issue" to acquaint yourself with the project and our development process.
Note that the signac framework has a few separate repositories where issues are filed:
- signac, core data management package
- signac-flow, workflow automation
- signac-dashboard, rapid data visualization in a browser
- signac-docs, the central documentation repository
- signac-examples, a set of example projects
The core functionality of signac is the management of a database on the filesystem. The metadata associated with this database is stored in the job state point and document JSON files, which are distributed throughout the database in the subdirectories corresponding to their data points. One of signac's core features is the ability to interact with these metadata in a Pythonic manner, which it accomplishes by extending standard Python classes like dicts and lists to transparently synchronize user modifications to the data structure with a file on disk. For instance, the job state point acts just like a dictionary, except that all changes are automatically tracked in a signac_statepoint.json file.
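The idea of a dictionary that mirrors every change to a file can be illustrated with a minimal sketch. This is not signac's implementation: the class name `JSONSyncedDict` and the file name `statepoint.json` are hypothetical, and the sketch ignores nesting, caching, and concurrent access.

```python
import json
import os
import tempfile
from collections.abc import MutableMapping


class JSONSyncedDict(MutableMapping):
    """Hypothetical dict-like object that mirrors every change to a JSON file."""

    def __init__(self, filename):
        self._filename = filename
        if not os.path.exists(filename):
            self._save({})

    def _load(self):
        with open(self._filename) as f:
            return json.load(f)

    def _save(self, data):
        with open(self._filename, "w") as f:
            json.dump(data, f)

    def __getitem__(self, key):
        return self._load()[key]

    def __setitem__(self, key, value):
        # Read, modify, and write back so the file always reflects the dict.
        data = self._load()
        data[key] = value
        self._save(data)

    def __delitem__(self, key):
        data = self._load()
        del data[key]
        self._save(data)

    def __iter__(self):
        return iter(self._load())

    def __len__(self):
        return len(self._load())


# Usage: the object behaves like a dict, and the file follows along.
path = os.path.join(tempfile.mkdtemp(), "statepoint.json")
sp = JSONSyncedDict(path)
sp["temperature"] = 300
sp.update(pressure=1.0)  # inherited from MutableMapping, also synchronized

with open(path) as f:
    on_disk = json.load(f)
```

Every mutation in this sketch costs a full read and write of the file, which is the overhead that buffering is meant to reduce.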
In this project, you will work to improve signac's internal synced data structures to enhance the design, API, and performance. There are three overarching goals for this project.
- The first goal is to enable the use of different back-ends for synchronization. Currently, all synced objects are synchronized to JSON files, but depending on the specific use-case a user might benefit from using a different back-end, such as a pickled file or even a centralized database index file (e.g. an sqlite database).
- The second major goal of this redesign is to extend the synchronization to a broader set of data types. signac was originally designed to support synchronized dictionaries, and while that functionality has been extended to support the nesting of lists within these dictionaries, redesigning for first-class support for direct synchronization of various data structures should help us develop a cleaner API and avoid subtle bugs.
- The third major goal is to improve our support for buffering, which should substantially improve performance. Originally, signac always immediately synchronized changes to files, so every operation that mutated a synced dictionary would result in a file write. As the demand for larger data spaces increased, we implemented some buffering features to improve the performance by reducing disk I/O. However, rewriting the buffering support in signac from the ground up should allow us to substantially improve the performance of the code while ensuring data coherency.
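The first and third goals can be pictured together in a small sketch that separates the on-disk format from the synchronization logic and adds an opt-in buffered mode. All names here (`FileBackend`, `JSONBackend`, `PickleBackend`, `SyncedDict`) are hypothetical and do not describe signac's actual or planned API. The sketch also assumes that nothing else modifies the file while it is open, so it sidesteps the data coherency problem that a real design must solve.

```python
import json
import os
import pickle
import tempfile


class FileBackend:
    """Hypothetical base class: subclasses choose the on-disk format."""

    mode = ""  # "" for text files, "b" for binary files

    def __init__(self, filename):
        self.filename = filename
        self.writes = 0  # counts file writes, for illustration only

    def load(self):
        if not os.path.exists(self.filename):
            return {}
        with open(self.filename, "r" + self.mode) as f:
            return self._decode(f)

    def save(self, data):
        with open(self.filename, "w" + self.mode) as f:
            self._encode(data, f)
        self.writes += 1


class JSONBackend(FileBackend):
    def _decode(self, f):
        return json.load(f)

    def _encode(self, data, f):
        json.dump(data, f)


class PickleBackend(FileBackend):
    mode = "b"

    def _decode(self, f):
        return pickle.load(f)

    def _encode(self, data, f):
        pickle.dump(data, f)


class SyncedDict:
    """Writes through on every change unless used as a context manager."""

    def __init__(self, backend):
        self._backend = backend
        self._data = backend.load()
        self._buffered = False

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value
        if not self._buffered:
            self._backend.save(self._data)

    def __enter__(self):
        self._buffered = True  # defer writes until the block exits
        return self

    def __exit__(self, *exc_info):
        self._buffered = False
        self._backend.save(self._data)  # flush all buffered changes at once


tmp = tempfile.mkdtemp()
json_backend = JSONBackend(os.path.join(tmp, "doc.json"))
pickle_backend = PickleBackend(os.path.join(tmp, "doc.pkl"))

for backend in (json_backend, pickle_backend):
    doc = SyncedDict(backend)
    doc["a"] = 1  # unbuffered: one file write
    with doc:
        for i in range(100):
            doc["step"] = i  # buffered: no file writes inside the block
    # Leaving the block flushes once, for two writes in total.
```

In this sketch the 100 buffered assignments cost a single write. The trade-off is that changes made inside the block are not visible on disk until it exits.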
The signac data and workflow model is primarily designed around the concept of operations acting on jobs, where the management of the job's data is handled by the signac package and the workflow definition and execution is handled by signac-flow. The current workflow model treats operations as always acting on single jobs. However, signac is frequently used for multi-dimensional parameter sweeps where analyses typically involve grouping jobs according to common state point parameters, for instance to make plots that average over replicas. As a result, operating on multiple jobs at once is a highly desirable feature.
In this project, you will enable users to create and execute operations that accept multiple jobs as input, via job queries or manually-constructed lists. This will greatly improve the power of workflows in the signac framework. Previous work resulted in a draft of this feature, which will need to be updated substantially to account for other changes in the signac-flow execution model.
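As a rough illustration of the use case, the sketch below groups jobs by a shared state point parameter and applies one operation to each group. It uses plain dictionaries as stand-ins for jobs and does not use the signac or signac-flow APIs. The helper `group_by`, the operation `average_energy`, and the keys `temperature`, `replica`, and `energy` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical stand-ins for jobs: a state point ("sp") and a document ("doc").
jobs = [
    {"sp": {"temperature": 300, "replica": 0}, "doc": {"energy": 1.0}},
    {"sp": {"temperature": 300, "replica": 1}, "doc": {"energy": 3.0}},
    {"sp": {"temperature": 350, "replica": 0}, "doc": {"energy": 4.0}},
    {"sp": {"temperature": 350, "replica": 1}, "doc": {"energy": 6.0}},
]


def group_by(jobs, key):
    """Group jobs that share the same value of one state point parameter."""
    groups = defaultdict(list)
    for job in jobs:
        groups[job["sp"][key]].append(job)
    return dict(groups)


def average_energy(*jobs):
    """An operation that accepts several jobs at once."""
    return mean(job["doc"]["energy"] for job in jobs)


# Average over replicas at each temperature.
averages = {
    temperature: average_energy(*group)
    for temperature, group in group_by(jobs, "temperature").items()
}
```

With single-job operations, the grouping step has to live outside the workflow definition, which is the limitation this project aims to remove.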
Workflows in signac-flow are defined using "conditions" (such as whether an output file exists) that determine which operations to run. These conditions implicitly define a directed acyclic graph of operations connected by their conditions. The definition is implicit because conditions are encoded by arbitrary functions rather than by directly indicating that one operation precedes another; however, recent work on signac-flow enabled the automatic detection of the dependency graph. As a result, it is now possible to determine when an operation depends on another operation running first. In this project, you will make it possible to automatically execute an operation's dependencies before running the desired operation. This helps with applications such as active learning over a data space. It also improves the user experience for complicated workflows: a user can request that an operation be run without having to determine all the other operations that must run first to satisfy its preconditions.
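One way to picture dependency-aware execution is the sketch below. It assumes that an edge exists whenever the same condition function appears as a post-condition of one operation and a pre-condition of another; the text above does not say how signac-flow detects the graph, so treat this matching rule as an assumption. The names `condition`, `Operation`, `dependencies`, and `run_with_dependencies` are hypothetical and are not part of the signac-flow API.

```python
completed = set()  # stand-in for outputs (e.g. files) that already exist


def condition(label):
    """Return a condition function that is true once `label` is produced."""

    def check():
        return label in completed

    check.label = label
    return check


class Operation:
    def __init__(self, name, pre, post):
        self.name, self.pre, self.post = name, pre, post

    def run(self):
        # A real operation would do work; here we only mark its outputs done.
        completed.update(c.label for c in self.post)


def dependencies(op, operations):
    """Operations whose post-conditions include one of op's pre-conditions."""
    return [o for o in operations if any(c in o.post for c in op.pre)]


def run_with_dependencies(op, operations, log):
    """Run op after recursively running its incomplete dependencies.

    Assumes the graph is acyclic. Operations without post-conditions count
    as complete and are skipped in this sketch.
    """
    if all(c() for c in op.post):
        return
    for dep in dependencies(op, operations):
        run_with_dependencies(dep, operations, log)
    op.run()
    log.append(op.name)


initialized = condition("initialized")
simulated = condition("simulated")
analyzed = condition("analyzed")

operations = [
    Operation("analyze", pre=[simulated], post=[analyzed]),
    Operation("initialize", pre=[], post=[initialized]),
    Operation("simulate", pre=[initialized], post=[simulated]),
]

log = []
run_with_dependencies(operations[0], operations, log)  # request only "analyze"
```

Requesting only `analyze` runs `initialize` and `simulate` first, and a second request does nothing because the post-condition is already satisfied.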
- Learn to automate and scale computational workflows from laptops to the world's largest supercomputers
- Improve your skills in designing user-centered APIs, working on collaborative teams, and using scientific Python
- Work on a project that will be used by scientific researchers at institutions around the globe
- Work with a friendly team!
Our development team is distributed across several time zones, and we have an active Slack workspace, biweekly video calls, and biweekly development "sprints" to coordinate our efforts.
- Bradley Dice (@bdice)
- Vyas Ramasubramani (@vyasr)
- Alyssa Travitz (@atravitz)
- Mike Henry (@mikemhenry)
- Brandon Butler (@b-butler)