-
Notifications
You must be signed in to change notification settings - Fork 13
Qi Compiler Sync Aug 18 2023
Qi Compiler Sync Aug 18 2023
Adjacent meetings: Previous | Up | Next
Dominik gave a demo of visualizing flows using flow graphs for the Actor Basic language. We resumed working on stream fusion and incorporating it into the compiler in a minimal way to prove that it works end-to-end, and discussed high level considerations regarding the structure of the compiler. So far, we have yet to attain the performance in practice that we saw in isolated benchmarks.
Last time, we reached faster-than-Racket performance with stream fusion. This time we attempted to incorporate it into the Qi compiler to a minimal extent. Dominik independently has been working on visualizing flow graphs this week.
Having graphical visualizations for flows has always been an interesting avenue for exploration, and Dominik has been working on just that for the Actor Basic language. It uses Graphviz under the hood and uses submodules to define the graph in a similar way to how scribble/srcdoc
allows you to define docs alongside the code that are built in a separate phase. The Racket-Graphviz package came up as something that would simplify the implementation, but it is currently showing an error on the package index. It would be worthwhile to see what it would take to use such visualizations with Qi. We also discussed the possibility of using syntax-spec
to add macros to Actor Basic.
Last time, we'd observed that the foldr
version implementing filter-map
was slower than a hand-coded recursion. We also observed that the continuation-passing version had the same performance as the hand-coded version. To get a clearer sense of the reasons, we looked at the intermediate stages of compilation by the Racket compiler. Specifically, the CP0 pass is the first one that happens after Chez Scheme expansion, showing the result of inlining and other Scheme-level optimizations. Using, for example, PLT_LINKLET_SHOW_CP0=1 raco make fold-test.rkt |less
(where fold-test.rkt
is a module containing the fold implementation), we were able to see that, even if we explicitly defined the built-in foldr
implementation to be inlined, it was still slower than the hand-coded recursion due to the presence of error handling code and the need to allocate some data purely to preserve them in error contexts. We also saw that the compiled form of the continuation version was essentially identical to the hand-coded recursion, as expected.
Some forms like (~> (~> f g) h)
and (~> (~> f) (~> g) h)
are equivalent, so it seems important to have these in a standard form prior to attempting optimizations, similar to the "A-normal form" using in Scheme compilers. We called this "Qi Normal Form," and effectively, it amounts to ensuring that expressions are rewritten to a designated simplest form in terms of Qi identities, although in practice we may not actually recursively apply these identities as that may not be the most performant way to achieve the goal.
In order to optimize a Qi form, we'd like to recursively optimize every subexpression until it reaches a fixed point (i.e. depth-first), but at the same time also optimize the expression in terms of high level optimization rules (more like breadth-first). Sometimes, optimizing subexpressions may cause us to miss high level patterns that we could optimize even more effectively. One standard approach is to have two nested loops corresponding to these two optimization strategies. Either one of these could be the inner loop or the outer loop, and would represent different tradeoffs. We could potentially recur on an outer (high level) pass each time there is a change in any subexpression, but that would imply quadratic optimization time so it's likely infeasible. E-graphs were another approach discussed some time ago that solves this problem, but at the time we weren't sure if (a) there is a performant implementation of that that we could use, and (b) whether the "cost" of optimizations could be effectively defined, and (c) whether the optimization rules of Qi would be "swappable"/commutative enough to make E-graphs compelling, or whether the order of optimizations is likely to be more deterministic so that they could be hardcoded.
We'd like to have linear or Nlog(N) performance in implementing optimizations, and if there are parts that take quadratic or exponential time, those should be bounded to narrow fragments of programs.
Qi's core values-based routing forms such as amp
/ ><
convert values to lists before operating on them, and then once again convert to values before returning them. We could optimize the internal list-based implementation using stream fusion but we would need to modify the map
stream implementation to potentially yield multiple values before continuing, since each application of the function may produce any number of output values in Qi.
We decided to begin by optimizing the Racket list operations and return to this more complex version of stream fusion later.
To optimize Racket's list operations, we simply need to match literal uses of map
and filter
and so on, and rewrite them using (the continuation-passing version of) streams. We will be rewriting directly to Racket rather than Qi for this optimization, since this is an instance where we are explicitly choosing to optimize Racket code with respect to Qi invariants.
We added a very simple rule in the existing rudimentary layout of compiler passes to detect fusable list operations -- and just map
or filter
, for now -- and then rewrote that using stream fusion.
Performance for now is slightly better than Racket but not by as much as the 3x we observed last time.
Part of the reason for this, we guessed, was because it was using compose
to compose the stream operations instead of directly invoking subsequent stages. We addressed that with this macro:
(define-syntax inline-compose1
(syntax-rules ()
[(_ f) f]
[(_ f1 f ...) (f1 (inline-compose1 f ...))]))
We ran into some logistical difficulties with stale compilation and such, so we weren't able to verify performance after this change.
The ideal overall layout of the compiler is as yet unclear, but we were able to minimally incorporate stream fusion and get it working end-to-end.
(Some of these are carried over from last time)
- Understand why fusion performance in the compiler is slower than clean room performance, and attempt to recover some gains.
- Implement streams for the other list operations described in St-Amour's writeup (i.e.
unfoldr
foldr
map
filter
foldl
zip
average
). - Agree on the high level layout of compiler passes and implement that structure.
- Add additional nonlocal benchmarks to validate all cases optimized by stream fusion.
- Add docs for
syntax-spec
and put it on the package index. - Review other bindings cases that we could support next (e.g. selective binding, packing values, and binding exceptions in
try
).
Dominik, Jair, Michael, Sid, Stephen
Home | Developer's Guide | Calendar | Events | Projects | Meeting Notes