Avoiding scans by looping inside Ops? #1011
-
Pre-edit: This post got really rambling, I'm sorry. Here's the TL;DR: I have an algorithm that I want to implement in Aesara, but it requires looping. I want to avoid `scan` because `scan` is slow. I want NUTS, so I need a derivative. In case 1 below, I can take the derivative of each iteration of the algorithm, but not end-to-end, because there is no closed-form solution (otherwise why am I iterating?). In case 2, the iteration does non-mathematical operations, and I don't know how I would express a derivative at all.

As I try to work on things, I keep running into the problem of implementing iterative algorithms. Previously I posted about optimization (Newton's method), but it also comes up in linear algebra applications (solving discrete Lyapunov equations), in time-series statistics (solving Yule-Walker equations), and even in things like creating a block diagonal matrix from a 3d tensor. To do these iterative tasks I keep reaching for `scan`. This all makes me wonder if I am attacking the problem from the wrong angle. The specific question: should I instead be writing `Op`s with the loops in the `perform` method?
Long post with some specific cases I want to work on. Take a specific example: solving a discrete Lyapunov equation, a matrix-valued equation of the form `X = A X A^H + B`. The doubling algorithm in plain NumPy:

```python
import numpy as np


def doubling_solution(A, B, max_it=100):
    A, B = list(map(np.atleast_2d, [A, B]))
    alpha0 = A
    gamma0 = B
    diff = 5
    n_its = 1
    while diff > 1e-15:
        alpha1 = alpha0.dot(alpha0)
        gamma1 = gamma0 + np.dot(alpha0.dot(gamma0), alpha0.conjugate().T)
        diff = np.max(np.abs(gamma1 - gamma0))
        alpha0 = alpha1
        gamma0 = gamma1
        n_its += 1
        if n_its > max_it:
            msg = "Exceeded maximum iterations {}, check input matrices"
            raise ValueError(msg.format(max_it))
    return gamma1
```

A naive implementation of the algorithm with `aesara.scan`:

```python
import aesara
import aesara.tensor as at
from aesara.scan.utils import until as scan_until

# Symbolic inputs (dtypes chosen to match the custom Op further down)
A = at.zmatrix("A")
B = at.zmatrix("B")
tol = at.dscalar("tol")
max_iter = at.iscalar("max_iter")
def doubling_step(alpha, gamma, tol):
    new_alpha = alpha.dot(alpha)
    new_gamma = gamma + at.linalg.matrix_dot(alpha, gamma, alpha.conj().T)
    diff = at.max(at.abs(new_gamma - gamma))
    return (new_alpha, new_gamma), scan_until(diff < tol)

doubling_result, doubling_updates = aesara.scan(
    doubling_step,
    outputs_info=[A, B],
    non_sequences=[tol],
    n_steps=max_iter,
    name="doubling_algo",
)
alpha, gamma = doubling_result
```

But if I want to avoid the scan, could I bury the loop inside an `Op`? Something like this:

```python
import numpy as np

from aesara.graph.op import Op


class SolveDiscreteLyapunov(Op):
    __props__ = ()
    itypes = [at.zmatrix, at.zmatrix, at.dscalar, at.iscalar]
    otypes = [at.zmatrix, at.iscalar]

    def perform(self, node, inputs, output_storage):
        A, B, tol, max_iter = inputs
        X, info = output_storage
        alpha = A
        gamma = B
        converged = False
        for _ in range(int(max_iter)):
            new_alpha = alpha @ alpha
            new_gamma = gamma + alpha @ gamma @ alpha.conj().T
            diff = np.max(np.abs(new_gamma - gamma))
            alpha = new_alpha
            gamma = new_gamma
            if diff < tol:
                converged = True
                break
        X[0] = gamma
        # 1 if the iteration converged before max_iter, 0 otherwise
        info[0] = np.int32(converged)
```

So this seems to work, and it is much faster, at least when compiling the functions:

```python
%timeit f_doubling_1 = aesara.function([A, B, tol, max_iter], [gamma[-1]])
>> 148 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
solve_lyapunov = SolveDiscreteLyapunov()(A, B, tol, max_iter)
%timeit f_doubling_2 = aesara.function([A, B, tol, max_iter], solve_lyapunov)
>> 3.99 ms ± 54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

The catch is that the custom `Op` gives me no gradient, which I need for NUTS. In other cases I have no idea what the gradients would even mean. Consider construction of a block diagonal matrix from a 3d tensor of matrices. I posted a scan implementation here, but we could also define a `BlockDiag` `Op`:

```python
from scipy.linalg import block_diag
class BlockDiag(Op):
    __props__ = ()
    itypes = [at.dtensor3]
    otypes = [at.dmatrix]

    def perform(self, node, inputs, output_storage):
        arrs = inputs[0]
        block_matrix = output_storage[0]
        # scipy's block_diag uses a loop, so this is on-topic
        block_matrix[0] = block_diag(*arrs)
```

What is the derivative of moving stuff around? I have no idea. It would be nice to have this kind of `Op` available.
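(As a sketch, not something from the thread: since `block_diag` only rearranges entries, its gradient just gathers the corresponding entries of the output gradient back into the blocks. This assumes Aesara's NumPy-style advanced indexing, and equal-shaped blocks, which the `dtensor3` input already implies.)

```python
import aesara.tensor as at


class BlockDiagWithGrad(BlockDiag):
    """Hypothetical sketch: BlockDiag plus a gradient."""

    def L_op(self, inputs, outputs, output_grads):
        (arrs,) = inputs          # shape (n, r, c): the stack of blocks
        (g_out,) = output_grads   # shape (n*r, n*c): gradient of the output
        n, r, c = arrs.shape[0], arrs.shape[1], arrs.shape[2]
        block = at.arange(n)
        # Row/column indices occupied by block i inside the big matrix
        rows = block[:, None] * r + at.arange(r)[None, :]  # shape (n, r)
        cols = block[:, None] * c + at.arange(c)[None, :]  # shape (n, c)
        # "Moving stuff around" differentiates to moving the output
        # gradient back: select block (i, i) of g_out for each input block
        return [g_out[rows[:, :, None], cols[:, None, :]]]
```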
-
Have a look at #695 for your last example. Regarding "`scan` is slow": it's not the compilation per se, but the execution itself. It would be great to improve it, but it's a challenging area of the library to tweak.
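To make "the execution itself" concrete, here is a sketch (an illustration, not code from the thread) that times calls to the already-compiled functions from the post above, using made-up test values:

```python
import numpy as np

# Compile once, then time the *calls* rather than aesara.function itself
f1 = aesara.function([A, B, tol, max_iter], gamma[-1])       # Scan-based
f2 = aesara.function([A, B, tol, max_iter], solve_lyapunov)  # custom-Op-based

rng = np.random.default_rng(0)
A_val = (0.1 * rng.standard_normal((5, 5))).astype(np.complex128)  # stable A
B_val = np.eye(5, dtype=np.complex128)

# The Scan pays per-iteration virtual-machine overhead on every call; the
# custom Op makes a single Python-level call into perform()
%timeit f1(A_val, B_val, 1e-15, 100)
%timeit f2(A_val, B_val, 1e-15, 100)
```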
-
If you are implementing things like Lyapunov equations, then yes, I'd argue this would be a much better way. In those cases it should also be much better to work out the derivatives by hand and implement them as `Op`s as well; in almost all cases this should be much faster and probably more stable than relying on autodiff.
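As a sketch of what "work out the derivatives by hand" can look like (my derivation for the real-valued case, not code from this thread): if `X` solves `X = A @ X @ A.T + B`, then for an output gradient `G` the adjoint `S` solves `S = A.T @ S @ A + G`, and the input gradients are `dB = S` and `dA = S @ A @ X.T + S.T @ A @ X`. Conveniently, the adjoint system is itself a discrete Lyapunov equation, so the `Op` can reuse itself in its own `grad`, while `scipy.linalg.solve_discrete_lyapunov` does the looping inside `perform`:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

import aesara.tensor as at
from aesara.graph.op import Op


class SolveDiscreteLyapunovWithGrad(Op):
    __props__ = ()
    itypes = [at.dmatrix, at.dmatrix]
    otypes = [at.dmatrix]

    def perform(self, node, inputs, output_storage):
        A, B = inputs
        # SciPy solves A X A^T - X + B = 0, i.e. X = A X A^T + B
        output_storage[0][0] = solve_discrete_lyapunov(A, B)

    def grad(self, inputs, output_grads):
        A, B = inputs
        (G,) = output_grads
        X = self(A, B)      # recomputed symbolically, for simplicity
        S = self(A.T, G)    # adjoint solve, reusing the Op itself
        dA = S @ A @ X.T + S.T @ A @ X
        return [dA, S]
```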
-
Regarding your basic question, yes, a custom `Op` will always work, but, as you've noticed, you'll need to implement your own gradients. While you're considering how to implement a gradient for an `Op` that represents a combination of existing `Op`s, try to notice when you're automatically applying simplifications to the resulting gradient based on the underlying expression. These simplifications are the essence of optimizing libraries like Aesara.

Making things work more generally (i.e. for many/most combinations of `Op`s), instead of implementing an endless number of overlapping `Op`s, is where a lot of the work in this project is focused, and arguably the reason it and projects like it exist.

That said, this discussion is easily identified as an "anti-`Scan`" discussion. Instead of framing it that way, let's look at some of the issues mentioned here and try to see exactly how they relate to `Scan`.

### Compilation Time
First, it would be very helpful to have MWEs for these extremely long compilation times. The example you provided further down is nowhere near being extremely long; the latencies are literally sub-second.

More importantly, you have to ask yourself what a long compilation time is for you. The basic premise behind compilation is that a one-time optimization cost can improve the average run-time costs of the resulting code. It's not a guarantee, but it tends to be true. Regardless, if one can't personally accept that premise or the one-time cost, then elements of the compilation process can be disabled. For instance, optimizations can be disabled (e.g. certain ones or all of them). Likewise, actual compilation of C extensions can be disabled by using a different backend (e.g. pure Python). Compilation of these C extensions tends to be the most time-consuming part of the process, and this has nothing to do with `Scan`.

There are some profiling options that show exactly where this time is spent. Let's use your example to demonstrate:

```python
# Make sure the cache is clear so that we can see the actual time spent during
# compilation
# !aesara-cache clear
import aesara
import aesara.tensor as at
from aesara.scan.utils import until as scan_until
A = at.matrix("A")
B = at.matrix("B")
tol = at.scalar("tol")
max_iter = at.lscalar("max_iter")
def doubling_step(alpha, gamma, tol):
    new_alpha = alpha.dot(alpha)
    new_gamma = gamma + at.linalg.matrix_dot(alpha, gamma, alpha.T)
    diff = at.max(at.abs(new_gamma - gamma))
    return (new_alpha, new_gamma), scan_until(diff < tol)

doubling_result, doubling_updates = aesara.scan(
    doubling_step,
    outputs_info=[A, B],
    non_sequences=[tol],
    n_steps=max_iter,
    name="doubling_algo",
    # Enable profiling when the `Scan`'s inner-function is compiled
    profile=True,
)
alpha, gamma = doubling_result
# Enable profiling for compilation of the outer-function
f_doubling = aesara.function([A, B, tol, max_iter], gamma[-1], profile=True)
f_doubling.profile.summary()
# Function profiling
# ==================
# Message: <ipython-input-5-294105621f54>:1
# Time in 0 calls to Function.__call__: 0.000000e+00s
# Total compile time: 2.265139e+01s
# Number of Apply nodes: 17
# Aesara Optimizer time: 5.699408e+00s
# Aesara validate time: 2.002478e-03s
# Aesara Linker time (includes C, CUDA code generation/compiling): 16.937392234802246s
# Import time 1.482582e-02s
# Node make_thunk time 1.693630e+01s
# Node do_whileall_inplace,cpu,doubling_algo}(max_iter, IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, tol) time 1.305243e+01s
# Node Elemwise{Composite{Switch(LT((i0 + Composite{(i0 - Switch(LT(i1, i0), i1, i0))}(i1, i2)), i3), (i0 - i1), Switch(GE((i0 + Composite{(i0 - Switch(LT(i1, i0), i1, i0))}(i1, i2)), Composite{(i0 - Switch(LT(i1, i0), i1, i0))}(i1, i2)), (i2 + i1), Switch(LE(Composite{(i0 - Switch(LT(i1, i0), i1, i0))}(i1, i2), i3), (i2 + i1), (i0 + i1))))}}[(0, 1)](TensorConstant{-1}, Shape_i{0}.0, TensorConstant{1}, TensorConstant{0}) time 6.080954e-01s
# Node InplaceDimShuffle{x,0,1}(B) time 5.972800e-01s
# Node AllocEmpty{dtype='float64'}(Elemwise{add,no_inplace}.0, Shape_i{0}.0, Shape_i{1}.0) time 5.633097e-01s
# Node Subtensor{int64}(do_whileall_inplace,cpu,doubling_algo}.1, ScalarFromTensor.0) time 5.134976e-01s
#
# Time in all call to aesara.grad() 0.000000e+00s
# Time since aesara import 37.958s
# Here are tips to potentially make your code run faster
# (if you think of new ones, suggest them on the mailing list).
# Test them first, as they are not guaranteed to always provide a speedup.
# - Try the Aesara flag floatX=float32
```

Notice how the total compile time is distributed between the time spent optimizing and linking (i.e. generating the C extensions): the majority is spent on the latter. Regardless, it should be clear that what you were measuring was something closer to the cache latency.

Another part of your description implies that you're observing these long compilation times in PyMC. This means it's likely that you're observing more than just the compilation of a specific graph, which makes your example even less representative of the issue(s) you're describing. PyMC could be constructing and compiling multiple graphs, including ones that involve gradients. Attributing the latency of a cumulative process with such elements to just the use of a `Scan` isn't sound; this hypothetical doesn't isolate the costs involved. Nevertheless, those buggy graphs did provide an example of some relevant timings.
Again, this isn't a useful comparison, because the timing values are already so low that they're mostly measuring cache latency, and one graph is only comprised of a single Python-only `Op`. Let's take a look at the compilation of a graph with a gradient of that `Scan`:

```python
f_doubling_grad = aesara.function(
    [A, B, tol, max_iter], aesara.grad(gamma[-1].sum(), [A, B]), profile=True
)
f_doubling_grad.profile.summary()
# Function profiling
# ==================
# Message: <ipython-input-5-9baaf36b31a6>:1
# Time in 0 calls to Function.__call__: 0.000000e+00s
# Total compile time: 2.048326e+01s
# Number of Apply nodes: 104
# Aesara Optimizer time: 3.609982e+00s
# Aesara validate time: 1.658869e-02s
# Aesara Linker time (includes C, CUDA code generation/compiling): 16.870064735412598s
# Import time 2.915764e-02s
# Node make_thunk time 1.686693e+01s
# Node forall_inplace,cpu,grad_of_doubling_algo}(Elemwise{add,no_inplace}.0, InplaceDimShuffle{0,2,1}.0, Subtensor{int64:int64:int64}.0, InplaceDimShuffle{0,2,1}.0, Subtensor{int64:int64:int64}.0, Alloc.0, Subtensor{::int64}.0) time 3.562204e+00s
# Node Elemwise{Composite{Switch(i0, i1, Switch(AND(LT((i2 - i3), i1), GT(i3, i1)), (i4 - i5), maximum((i4 + i6), (i2 - i3))))}}[(0, 3)](Elemwise{le,no_inplace}.0, TensorConstant{0}, Elemwise{Add}[(0, 1)].0, Elemwise{Composite{Switch(i0, Switch(LT(i1, i2), i2, i1), Switch(LT(i3, i4), i3, i4))}}[(0, 1)].0, TensorConstant{-1}, Elemwise{Composite{Switch(LT((i0 + i1), i2), i2, (i0 + i1))}}.0, Elemwise{Composite{Switch(LT(Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}((i0 - i1), i2, i3), i2), i1), Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}((i0 - i1), i2, i3), i2), i1)}}.0) time 8.090849e-01s
# Node Elemwise{Composite{Switch(LT(Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(Composite{(i0 - Switch(LT(i1, i2), i2, i1))}(i0, Composite{(i0 - Switch(GE(i1, i2), i2, i1))}(i1, Composite{Switch(LT(i0, i1), i2, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(i2, i3, i4), i3, i5), i6), i3), i3, i1), i3), i7), Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(Composite{(i0 - Switch(LT(i1, i2), i2, i1))}(i0, Composite{(i0 - Switch(GE(i1, i2), i2, i1))}(i1, Composite{Switch(LT(i0, i1), i2, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(i2, i3, i4), i3, i5), i6), i3), i3, i1), i3), i7)}}[(0, 0)](Elemwise{add,no_inplace}.0, Elemwise{Composite{Switch(GE(Composite{Switch(LT(i0, i1), i2, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(i0, i1, i2), i1, i3), i4), i5, Composite{Switch(LT(i0, i1), i2, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(i0, i1, i2), i1, i3))}}[(0, 0)].0, Elemwise{Composite{Switch(i0, i1, Switch(AND(LT((i2 - i3), i1), GT(i3, i1)), (i4 - i5), maximum((i4 + i6), (i2 - i3))))}}[(0, 3)].0, TensorConstant{0}, TensorFromScalar.0, TensorConstant{-1}, Elemwise{Composite{Switch(LT((i0 + i1), i2), i2, (i0 + i1))}}.0, Elemwise{Composite{Switch(LT(i0, i1), i1, i0)}}.0) time 7.391601e-01s
# Node Elemwise{Composite{Switch(i0, i1, Switch(AND(LT((i2 + i3), i1), GT(i4, i1)), (i2 - i5), minimum((i2 + i3), i6)))}}[(0, 3)](Elemwise{le,no_inplace}.0, TensorConstant{0}, TensorConstant{-1}, Elemwise{Composite{Switch(LT(Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(Composite{(i0 - Switch(LT(i1, i2), i2, i1))}(i0, Composite{(i0 - Switch(GE(i1, i2), i2, i1))}(i1, Composite{Switch(LT(i0, i1), i2, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(i2, i3, i4), i3, i5), i6), i3), i3, i1), i3), i7), Composite{Switch(LT(i0, i1), i1, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(Composite{(i0 - Switch(LT(i1, i2), i2, i1))}(i0, Composite{(i0 - Switch(GE(i1, i2), i2, i1))}(i1, Composite{Switch(LT(i0, i1), i2, i0)}(Composite{Switch(LT(i0, i1), i2, i0)}(i2, i3, i4), i3, i5), i6), i3), i3, i1), i3), i7)}}[(0, 0)].0, Elemwise{sub,no_inplace}.0, Shape_i{0}.0, Elemwise{Composite{Switch(LT((i0 + i1), i2), i2, (i0 + i1))}}.0) time 6.649911e-01s
# Node Elemwise{Composite{Switch(i0, i1, maximum(minimum((i2 + i3), i4), i5))}}[(0, 3)](Elemwise{le,no_inplace}.0, TensorConstant{0}, TensorConstant{-1}, Elemwise{Composite{Switch(LT(i0, i1), i1, i0)}}.0, Elemwise{Composite{Switch(LT((i0 + i1), i2), i2, (i0 + i1))}}.0, TensorConstant{0}) time 6.371284e-01s
#
# Time in all call to aesara.grad() 2.692196e-01s
# Time since aesara import 1067.307s
# Here are tips to potentially make your code run faster
# (if you think of new ones, suggest them on the mailing list).
# Test them first, as they are not guaranteed to always provide a speedup.
# - Try the Aesara flag floatX=float32
```

Again, the first time compiling such a graph takes nearly 20 seconds, and the vast majority of that time is spent compiling the C extensions. Luckily, these C extensions are fairly well covered by caching, so they really do tend to be one-time costs.

Hopefully this clarifies where the actual latency is and how one can inspect it. More importantly, I hope it provides some insight into the kind of material we need to be discussing in order to make improvements, or simply to understand what is going on. Much of the compilation and optimization process latency can be considerably reduced, even when it comes to `Scan`.
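As one concrete example of the "elements of the compilation process can be disabled" options mentioned above, here is a sketch using standard Aesara modes (how much latency this saves will depend on the graph):

```python
from aesara.compile.mode import Mode

# Pure-Python linker + minimal rewrites: no C extensions are built at all,
# trading slower execution for near-instant compilation
f_py = aesara.function(
    [A, B, tol, max_iter],
    gamma[-1],
    mode=Mode(linker="py", optimizer="fast_compile"),
)

# The predefined FAST_COMPILE mode makes a similar trade-off
f_fast = aesara.function([A, B, tol, max_iter], gamma[-1], mode="FAST_COMPILE")
```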