This repository contains code implementing the prototype analogy-making program Sifter, as described in our 'Onward!' 2020 paper "Analogy-Making as a Core Primitive in the Software Engineering Toolbox."
Sifter can make analogies about programs, e.g., identifying data structures and methods which play corresponding roles in two different programs. Sifter can also complete analogies, which allows it to automatically learn and generalize source code transformations from a small number of examples.
You must install Bazel as well as set up bazel_python with Python 3.7.4.
You should then be able to run
bazel test //... && bazel run coverage_report
to get a coverage report in htmlcov/index.html
.
To run the examples:
bazel run examples/letter_analogies:letter_analogy
bazel run examples/turing_machine:turing_machine
bazel run examples/program_analysis:program_understanding
bazel run examples/program_analysis:api_migration
bazel run examples/program_analysis:transform_learning
For the program_analysis
ones, you can view the result by visiting
http://localhost:8001
in your browser once prompted.
This repository accompanies our paper in Onward! 2020, with the goal of encouraging interest and future work in automated analogy-making for software engineering. Ultimately, we see analogy-making as involving three main processes:
-
Raw Perception, such as reading files (code, documentation) from the filesystem and into the triplet structure workspace.
Status: The
LazyStructure
andLazyTextDocument
classes in examples/program_analysis/lazy_structure.py read a file from the filesystem and add a representation of the file's contents to the triplet structure workspace. These interfaces are 'lazy' in that they allow selecting only certain parts of a file to be included in the workspace. Currently we either specify ahead-of-time which portions of the file to include in the workspace or just add the entire file contents.Future Work: We envision the laziness being used to initially focus the analogy-making on a small subset of the relevant code, then expanding to larger portions as the analogy solidifies.
-
Semantic Rewriting, such as grouping individual characters into a single token, inlining a function, and annotating invariants found via program analysis.
Status: Currently, the default
LazyTextDocument
interface performs a light-weight tokenization pass that groups characters separated by spaces and special characters (such as+
) before encoding the document into the workspace. Extra semantic information, such as program invariants, are specified ahead-of-time by either directly editing the workspace or using theAnnotateFact
method ofLazyTextDocument
. There is an example of annotating program invariants in examples/program_analysis/transform_learning.py.Future Work: We would like to directly connect Sifter to program analysis tools so analysis results can be automatically imported into the triplet structure workspace instead of needing to be manually annotated. We would like to incorporate rewrite rules that express semantic equivalence, such as inlining, syntax de-sugaring, and grouping. Finally, we would like to develop heuristics that can determine when to apply a program analyzer or semantics-preserving rewrite rule. Such a heuristic would need to operate in tandem with and be guided by the abstraction/anti-unification process.
-
Abstraction/Anti-Unification, where we pair up corresponding objects in the workspace to form an analogy.
Status: We have fairly-complete rules and heuristics for this that are described in more detail in AnalogyUtils.md. These rules roughly implement a syntactic anti-unification on unrooted, labelled graphs.
Future Work: Our existing abstraction rules work well, but depend on the semantic rewriting to have already exposed most of the correspondences in the syntactic representation of the workspace itself. So the primary future work for this process is to provide feedback to the semantic rewriting engine. For example, given programs
if x: y += 1
andz += a ? 1 : 0;
we may not be able to initially find a good syntactic abstraction, but we want the abstraction process to be able to give feedback to the semantic rewriting process to, e.g., desugar the ternary?:
operator into the correspondingif
statement, which we can then find a syntactic correspondance with. Beyond such better heuristics, we would also like to support higher-order 'slips,' i.e., analogies where the types themselves are abstracted.
In addition to these three processes, there are two other main directions for future work:
- The runtime does not currently support modifying rules with other rules; in fact, rule-related nodes are removed from the structure completely after an initial rule-parsing pass. In the future we may decide to change this to assist with meta-learning.
- We are still working on a good visualization module for triplet structures. There is a rudimentary CLI interface (runtime/interactive.py) as well as an interface specifically for source code analogies (examples/program_analysis/ui), however we plan to introduce a more general-purpose interface in the near future.
- ts_lib.py: the core library, defines the triplet structure data structure and some embedded-DSL-style helpers for describing triplet structures.
- ts_utils.py: a set of macros which make working with the library (especially expressing rules) easier.
- tactic_utils.py: a set of macros which make writing tactics ("controllers for applying the rules") easier.
- mapper.py: rules which can be added to any triplet structure to enable making abstract analogies (i.e., identifying isometries in sub-structures).
- analogy_utils.py: a Python interface and tactics to building analogies in triplet structures that have the rules from mapper.py added to them.
- ts_cpp/: C++ implementation of the core triplet structure data structure, as well as a backtracking triplet constraint solver.
- runtime/: a runtime to parse and execute rules on TripletStructures.
- examples/: examples of using triplet structures to solve analogies.
Please see TSLang.md, which documents how to write code using the Triplet Structure language.
We see much of the code in this repository as defining a domain-specific
language TSLang embedded in Python. We then write, within
TSLang
, the core analogy-making code for Sifter.
To highlight the difference between "plumbing" code implementing the TSLang
language and runtime vs. the actual analogy-making code written in TSLang
, we
use a distinct coding convention for each type of code:
- Code implementing
TSLang
, specificallyts_lib.py
andruntime/*
, is written using Google-style Python naming (e.g.,snake_case
for methods andPascalCase
for classes). - Code written in
TSLang
(everything else) is written usingPascalCase
for method names.
This naming convention is enforced intra-file by .pylintrc
.
@inproceedings{sifter:onward20,
author = {Matthew Sotoudeh and
Aditya V. Thakur},
title = {Analogy-Making as a Core Primitive in the Software Engineering Toolbox},
booktitle = {Proceedings of the 2020 {ACM} {SIGPLAN} International Symposium on
New Ideas, New Paradigms, and Reflections on Programming and Software,
Onward! 2020, November 18-20, Virtual, USA},
publisher = {{ACM}},
year = {2020},
url = {https://doi.org/10.1145/3426428.3426918},
doi = {10.1145/3426428.3426918},
}
- Matthew Sotoudeh: email [email protected].
- Aditya Thakur: email [email protected].
Licensed under the AGPLv3.