Avoiding scans by looping inside Ops? #1011
-
Pre-edit: This post got really rambling, I'm sorry. Here's the TL;DR: I have an algorithm that I want to implement in Aesara, but it requires looping. I want to avoid `scan` because `scan` is slow. I want NUTS, so I need a derivative. In case 1 below, I can take the derivative of each iteration of the algorithm, but not end-to-end, because there is no closed-form solution (otherwise why am I iterating?). In case 2, the iteration does non-mathematical operations, and I don't know how I would express a derivative at all.

As I try to work on things, I keep running into the problem of implementing iterative algorithms. Previously I posted about optimization (Newton's method), but it also comes up in linear algebra applications (solving discrete Lyapunov equations), in time-series statistics (solving Yule-Walker equations), and even in things like creating a block diagonal matrix from a 3d tensor. To do these iterative tasks I keep reaching for `scan`. This all makes me wonder if I am attacking the problem from the wrong angle. The specific question: should I instead be writing `Op`s with the loops in the `perform` method?
Long post with some specific cases I want to work on. Take a specific example: solving a discrete Lyapunov equation, a matrix-valued equation of the form `X = A X A^H + B`. The doubling algorithm in plain NumPy:

```python
import numpy as np


def doubling_solution(A, B, max_it=100):
    A, B = list(map(np.atleast_2d, [A, B]))
    alpha0 = A
    gamma0 = B
    diff = 5
    n_its = 1
    while diff > 1e-15:
        alpha1 = alpha0.dot(alpha0)
        gamma1 = gamma0 + np.dot(alpha0.dot(gamma0), alpha0.conjugate().T)
        diff = np.max(np.abs(gamma1 - gamma0))
        alpha0 = alpha1
        gamma0 = gamma1
        n_its += 1
        if n_its > max_it:
            msg = "Exceeded maximum iterations {}, check input matrices"
            raise ValueError(msg.format(max_it))
    return gamma1
```

A naive implementation of the algorithm with `aesara.scan`:

```python
import aesara
import aesara.tensor as at
from aesara.scan.utils import until as scan_until

# Symbolic inputs (dtypes chosen to match the custom Op further down)
A = at.zmatrix("A")
B = at.zmatrix("B")
tol = at.dscalar("tol")
max_iter = at.iscalar("max_iter")
def doubling_step(alpha, gamma, tol):
    new_alpha = alpha.dot(alpha)
    new_gamma = gamma + at.linalg.matrix_dot(alpha, gamma, alpha.conj().T)
    diff = at.max(at.abs(new_gamma - gamma))
    return (new_alpha, new_gamma), scan_until(diff < tol)

doubling_result, doubling_updates = aesara.scan(
    doubling_step,
    outputs_info=[A, B],
    non_sequences=[tol],
    n_steps=max_iter,
    name="doubling_algo",
)
alpha, gamma = doubling_result
```

But if I want to avoid the scan, could I bury the loop inside an `Op`? Something like this:

```python
import numpy as np

from aesara.graph.op import Op


class SolveDiscreteLyapunov(Op):
    __props__ = ()
    itypes = [at.zmatrix, at.zmatrix, at.dscalar, at.iscalar]
    otypes = [at.zmatrix, at.iscalar]

    def perform(self, node, inputs, output_storage):
        A, B, tol, max_iter = inputs
        X, info = output_storage
        alpha = A
        gamma = B
        converged = False
        for _ in range(int(max_iter)):
            new_alpha = alpha @ alpha
            new_gamma = gamma + alpha @ gamma @ alpha.conj().T
            diff = np.max(np.abs(new_gamma - gamma))
            alpha = new_alpha
            gamma = new_gamma
            if diff < tol:
                converged = True
                break
        X[0] = gamma
        # 1 if the iteration converged before max_iter, 0 otherwise
        info[0] = np.int32(converged)
```

So this seems to work, and it is much faster, at least when compiling the functions:

```python
%timeit f_doubling_1 = aesara.function([A, B, tol, max_iter], [gamma[-1]])
>> 148 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
solve_lyapunov = SolveDiscreteLyapunov()(A, B, tol, max_iter)
%timeit f_doubling_2 = aesara.function([A, B, tol, max_iter], solve_lyapunov)
>> 3.99 ms ± 54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

The catch is that the custom `Op` gives me no gradient, which I need for NUTS. In other cases I have no idea what the gradients would even mean. Consider construction of a block diagonal matrix from a 3d tensor of matrices. I posted a scan implementation here, but we could also define a `BlockDiag` `Op`:

```python
from scipy.linalg import block_diag
class BlockDiag(Op):
    __props__ = ()
    itypes = [at.dtensor3]
    otypes = [at.dmatrix]

    def perform(self, node, inputs, output_storage):
        arrs = inputs[0]
        block_matrix = output_storage[0]
        # scipy's block_diag uses a loop, so this is on-topic
        block_matrix[0] = block_diag(*arrs)
```

What is the derivative of moving stuff around? I have no idea. It would be nice to have this kind of `Op` available.
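(As a sketch, not something from the thread: since `block_diag` only rearranges entries, its gradient just gathers the corresponding entries of the output gradient back into the blocks. This assumes Aesara's NumPy-style advanced indexing, and equal-shaped blocks, which the `dtensor3` input already implies.)

```python
import aesara.tensor as at


class BlockDiagWithGrad(BlockDiag):
    """Hypothetical sketch: BlockDiag plus a gradient."""

    def L_op(self, inputs, outputs, output_grads):
        (arrs,) = inputs          # shape (n, r, c): the stack of blocks
        (g_out,) = output_grads   # shape (n*r, n*c): gradient of the output
        n, r, c = arrs.shape[0], arrs.shape[1], arrs.shape[2]
        block = at.arange(n)
        # Row/column indices occupied by block i inside the big matrix
        rows = block[:, None] * r + at.arange(r)[None, :]  # shape (n, r)
        cols = block[:, None] * c + at.arange(c)[None, :]  # shape (n, c)
        # "Moving stuff around" differentiates to moving the output
        # gradient back: select block (i, i) of g_out for each input block
        return [g_out[rows[:, :, None], cols[:, None, :]]]
```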
-
Have a look at #695 for your last example. Regarding "`scan` is slow": it's not the compilation per se, but the execution itself. It would be great to improve it, but it's a challenging area of the library to tweak.
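To make "the execution itself" concrete, here is a sketch (an illustration, not code from the thread) that times calls to the already-compiled functions from the post above, using made-up test values:

```python
import numpy as np

# Compile once, then time the *calls* rather than aesara.function itself
f1 = aesara.function([A, B, tol, max_iter], gamma[-1])       # Scan-based
f2 = aesara.function([A, B, tol, max_iter], solve_lyapunov)  # custom-Op-based

rng = np.random.default_rng(0)
A_val = (0.1 * rng.standard_normal((5, 5))).astype(np.complex128)  # stable A
B_val = np.eye(5, dtype=np.complex128)

# The Scan pays per-iteration virtual-machine overhead on every call; the
# custom Op makes a single Python-level call into perform()
%timeit f1(A_val, B_val, 1e-15, 100)
%timeit f2(A_val, B_val, 1e-15, 100)
```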
-
If you are implementing things like Lyapunov equations, then yes, I'd argue this would be a much better way. In those cases it should also be much better to work out the derivatives by hand and implement them as `Op`s as well; in almost all cases this should be much faster and probably more stable than relying on autodiff.
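As a sketch of what "work out the derivatives by hand" can look like (my derivation for the real-valued case, not code from this thread): if `X` solves `X = A @ X @ A.T + B`, then for an output gradient `G` the adjoint `S` solves `S = A.T @ S @ A + G`, and the input gradients are `dB = S` and `dA = S @ A @ X.T + S.T @ A @ X`. Conveniently, the adjoint system is itself a discrete Lyapunov equation, so the `Op` can reuse itself in its own `grad`, while `scipy.linalg.solve_discrete_lyapunov` does the looping inside `perform`:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

import aesara.tensor as at
from aesara.graph.op import Op


class SolveDiscreteLyapunovWithGrad(Op):
    __props__ = ()
    itypes = [at.dmatrix, at.dmatrix]
    otypes = [at.dmatrix]

    def perform(self, node, inputs, output_storage):
        A, B = inputs
        # SciPy solves A X A^T - X + B = 0, i.e. X = A X A^T + B
        output_storage[0][0] = solve_discrete_lyapunov(A, B)

    def grad(self, inputs, output_grads):
        A, B = inputs
        (G,) = output_grads
        X = self(A, B)      # recomputed symbolically, for simplicity
        S = self(A.T, G)    # adjoint solve, reusing the Op itself
        dA = S @ A @ X.T + S.T @ A @ X
        return [dA, S]
```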
-
Regarding your basic question, yes, a custom `Op` will always work, but, as you've noticed, you'll need to implement your own gradients. While you're considering how to implement a gradient for an `Op` that represents a combination of existing `Op`s, try to notice when you're automatically applying simplifications to the resulting gradient based on the underlying expression. These simplifications are the essence of optimizing libraries like Aesara.

Making things work more generally (i.e. for many/most combinations of `Op`s), instead of implementing an endless number of overlapping `Op`s, is where a lot of the work in this project is focused, and arguably the reason it and projects like it exist.

That said, this discussion is easily identified as an "anti-`Scan`" discussion. Instead of framing it that way, let's look at some of the issues mentioned here and try to see exactly how they relate to `Scan`.

### Compilation Time
First, it would be very helpful to have MWEs for these extremely long compilation times. The example you provided further down is nowhere near being extremely long; the latencies are literally sub-second.

More importantly, you have to ask yourself what a long compilation time is for you. The basic premise behind compilation is that a one-time optimization cost can improve the average run-time costs of the resulting code. It's not a guarantee, but it tends to be true. Regardless, if one can't personally accept that premise or the one-time cost, then elements of the compilation process can be disabled. For instance, optimizations can be disabled (e.g. certain ones or all of them). Likewise, actual compilation of C extensions can be disabled by using a different backend (e.g. pure Python). Compilation of these C extensions tends to be the most time-consuming part of the process, and this has nothing to do with `Scan`.

There are some profiling options that show exactly where this time is spent. Let's use your example to demonstrate:

```python
# Make sure the cache is clear so that we can see the actual time spent during
# compilation
# !aesara-cache clear
import aesara
import aesara.tensor as at
from aesara.scan.utils import until as scan_until
A = at.matrix("A")
B = at.matrix("B")
tol = at.scalar("tol")
max_iter = at.lscalar("max_iter")
def doubling_step(alpha, gamma, tol):
    new_alpha = alpha.dot(alpha)
    new_gamma = gamma + at.linalg.matrix_dot(alpha, gamma, alpha.T)
    diff = at.max(at.abs(new_gamma - gamma))
    return (new_alpha, new_gamma), scan_until(diff < tol)

doubling_result, doubling_updates = aesara.scan(
    doubling_step,
    outputs_info=[A, B],
    non_sequences=[tol],
    n_steps=max_iter,
    name="doubling_algo",
    # Enable profiling when the `Scan`'s inner-function is compiled
    profile=True,
)
alpha, gamma = doubling_result
# Enable profiling for compilation of the outer-function
f_doubling = aesara.function([A, B, tol, max_iter], gamma[-1], profile=True)
f_doubling.profile.summary()
# Function profiling
# ==================
# Message: <ipython-input-5-294105621f54>:1
# Time in 0 calls to Function.__call__: 0.000000e+00s
# Total compile time: 2.265139e+01s
# Number of Apply nodes: 17
# Aesara Optimizer time: 5.699408e+00s
# Aesara validate time: 2.002478e-03s
# Aesara Linker time (includes C, CUDA code generation/compiling): 16.937392234802246s
# Import time 1.482582e-02s
# Node make_thunk time 1.693630e+01s
# Node do_whileall_inplace,cpu,doubling_algo}(max_iter, IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, tol) time 1.305243e+01s
# Node Elemwise{Composite{Switch(LT((i0 + Composite{(i0 - Switch(LT(i1, i0), i1, i0))}(i1, i2)), i3), (i0 - i1), Switch(GE((i0 + Composite{(i0 - Switch(LT(i1, i0), i1, i0))}(i1, i2)), Composite{(i0 - Switch(LT(i1, i0), i1, i0))}(i1, i2)), (i2 + i1), Switch(LE(Composite{(i0 - Switch(LT(i1, i0), i1, i0))}(i1, i2), i3), (i2 + i1), (i0 + i1))))}}[(0, 1)](TensorConstant{-1}, Shape_i{0}.0, TensorConstant{1}, TensorConstant{0}) time 6.080954e-01s
# Node InplaceDimShuffle{x,0,1}(B) time 5.972800e-01s
# Node AllocEmpty{dtype='float64'}(Elemwise{add,no_inplace}.0, Shape_i{0}.0, Shape_i{1}.0) time 5.633097e-01s
# Node Subtensor{int64}(do_whileall_inplace,cpu,doubling_algo}.1, ScalarFromTensor.0) time 5.134976e-01s
#
# Time in all call to aesara.grad() 0.000000e+00s
# Time since aesara import 37.958s
# Here are tips to potentially make your code run faster
# (if you think of new ones, suggest them on the mailing list).
# Test them first, as they are not guaranteed to always provide a speedup.
# - Try the Aesara flag floatX=float32
```

Notice how the total compile time is distributed between the time spent optimizing and linking (i.e. generating the C extensions): the majority is spent on the latter. Regardless, it should be clear that what you were measuring was something closer to the cache latency.

Another part of your description implies that you're observing these long compilation times in PyMC. This means it's likely that you're observing more than just the compilation of a specific graph, which makes your example even less representative of the issue(s) you're describing. PyMC could be constructing and compiling multiple graphs, including ones that involve gradients. Attributing the latency of a cumulative process with such elements to just the use of a `Scan` isn't sound; this hypothetical doesn't isolate the costs involved. Nevertheless, those buggy graphs did provide an example of some relevant timings.
Again, this isn't a useful comparison, because the timing values are already so low that they're mostly measuring cache latency, and one graph is only comprised of a single Python-only `Op`. Let's take a look at the compilation of a graph with a gradient of that `Scan`:

```python
f_doubling_grad = aesara.function(
    [A, B, tol, max_iter], aesara.grad(gamma[-1].sum(), [A, B]), profile=True
)
f_doubling_grad.profile.summary()
# Function profiling
# ==================
# Message: <ipython-input-5-9baaf36b31a6>:1
# Time in 0 calls to Function.__call__: 0.000000e+00s
# Total compile time: 2.048326e+01s
# Number of Apply nodes: 104
# Aesara Optimizer time: 3.609982e+00s
# Aesara validate time: 1.658869e-02s
# Aesara Linker time (includes C, CUDA code generation/compiling): 16.870064735412598s
# Import time 2.915764e-02s
# Node make_thunk time 1.686693e+01s
# Node forall_inplace,cpu,grad_of_doubling_algo}(Elemwise{add,no_inplace}.0, InplaceDimShuffle{0,2,1}.0, Subtensor{int64:int64:int64}.0, InplaceDimShuffle{0,2,1}.0, Subtensor{int64:int64:int64}.0, Alloc.0, Subtensor{::int64}.0) time 3.562204e+00s
# Node Elemwise{Composite{Switch(i0, i1, Switch(AND(LT((i2 - i3), i1), GT(i3, i1)), (i4 - i5), maximum((i4 + i6), (i2 - i3))))}}[(0, 3)](Elemwise{le,no_inplace}.0, TensorConstant{0}, Elemwise{Add}[(0, 1)].0, Elemwise{Composite{Switch(i0, Switch(LT(i1, i2), i2, i1), Switch(LT(i3, i4), i3, i4))}}[(0, 1)].0, TensorConstant{-1}, Elemwise{Composite{Switch(LT((i0 + i1), i2), i2, (i0 + i1))}}.0, Elemwise{Composite{Switch(LT(Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}((i0 - i1), i2, i3), i2), i1), Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}((i0 - i1), i2, i3), i2), i1)}}.0) time 8.090849e-01s
# Node Elemwise{Composite{Switch(LT(Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(Composite{(i0 - Switch(LT(i1, i2), i2, i1))}(i0, Composite{(i0 - Switch(GE(i1, i2), i2, i1))}(i1, Composite{Switch(LT(i0, i1), i2, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(i2, i3, i4), i3, i5), i6), i3), i3, i1), i3), i7), Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(Composite{(i0 - Switch(LT(i1, i2), i2, i1))}(i0, Composite{(i0 - Switch(GE(i1, i2), i2, i1))}(i1, Composite{Switch(LT(i0, i1), i2, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(i2, i3, i4), i3, i5), i6), i3), i3, i1), i3), i7)}}[(0, 0)](Elemwise{add,no_inplace}.0, Elemwise{Composite{Switch(GE(Composite{Switch(LT(i0, i1), i2, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(i0, i1, i2), i1, i3), i4), i5, Composite{Switch(LT(i0, i1), i2, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(i0, i1, i2), i1, i3))}}[(0, 0)].0, Elemwise{Composite{Switch(i0, i1, Switch(AND(LT((i2 - i3), i1), GT(i3, i1)), (i4 - i5), maximum((i4 + i6), (i2 - i3))))}}[(0, 3)].0, TensorConstant{0}, TensorFromScalar.0, TensorConstant{-1}, Elemwise{Composite{Switch(LT((i0 + i1), i2), i2, (i0 + i1))}}.0, Elemwise{Composite{Switch(LT(i0, i1), i1, i0)}}.0) time 7.391601e-01s
# Node Elemwise{Composite{Switch(i0, i1, Switch(AND(LT((i2 + i3), i1), GT(i4, i1)), (i2 - i5), minimum((i2 + i3), i6)))}}[(0, 3)](Elemwise{le,no_inplace}.0, TensorConstant{0}, TensorConstant{-1}, Elemwise{Composite{Switch(LT(Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(Composite{(i0 - Switch(LT(i1, i2), i2, i1))}(i0, Composite{(i0 - Switch(GE(i1, i2), i2, i1))}(i1, Composite{Switch(LT(i0, i1), i2, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(i2, i3, i4), i3, i5), i6), i3), i3, i1), i3), i7), Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(Composite{(i0 - Switch(LT(i1, i2), i2, i1))}(i0, Composite{(i0 - Switch(GE(i1, i2), i2, i1))}(i1, Composite{Switch(LT(i0, i1), i2, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(i2, i3, i4), i3, i5), i6), i3), i3, i1), i3), i7)}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Shape_i{0}.0, Elemwise{Composite{Switch(LT((i0 + i1), i2), i2, (i0 + i1))}}.0) time 6.649911e-01s
# Node Elemwise{Composite{Switch(i0, i1, maximum(minimum((i2 + i3), i4), i5))}}[(0, 3)](Elemwise{le,no_inplace}.0, TensorConstant{0}, TensorConstant{-1}, Elemwise{Composite{Switch(LT(i0, i1), i1, i0)}}.0, Elemwise{Composite{Switch(LT((i0 + i1), i2), i2, (i0 + i1))}}.0, TensorConstant{0}) time 6.371284e-01s
#
# Time in all call to aesara.grad() 2.692196e-01s
# Time since aesara import 1067.307s
# Here are tips to potentially make your code run faster
# (if you think of new ones, suggest them on the mailing list).
# Test them first, as they are not guaranteed to always provide a speedup.
# - Try the Aesara flag floatX=float32
```

Again, the first time compiling such a graph takes nearly 20 seconds, and the vast majority of that time is spent compiling the C extensions. Luckily, these C extensions are fairly well covered by caching, so they really do tend to be one-time costs.

Hopefully this clarifies where the actual latency is and how one can inspect it. More importantly, I hope it provides some insight into the kind of material we need to be discussing in order to make improvements, or simply to understand what is going on. Much of the compilation and optimization process latency can be considerably reduced, even when it comes to `Scan`.
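As one concrete example of the "elements of the compilation process can be disabled" options mentioned above, here is a sketch using standard Aesara modes (how much latency this saves will depend on the graph):

```python
from aesara.compile.mode import Mode

# Pure-Python linker + minimal rewrites: no C extensions are built at all,
# trading slower execution for near-instant compilation
f_py = aesara.function(
    [A, B, tol, max_iter],
    gamma[-1],
    mode=Mode(linker="py", optimizer="fast_compile"),
)

# The predefined FAST_COMPILE mode makes a similar trade-off
f_fast = aesara.function([A, B, tol, max_iter], gamma[-1], mode="FAST_COMPILE")
```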