- This version started out as a reaction to ConvNeXt: A ConvNet for the 2020s, first on Twitter, then expanded on Weights & Biases. In my (very biased) opinion, if anything is to be called the "ConvNets of the 2020s", it would be those that are designed from the PDE perspective.
- A newer version is now published as the appendix in A Novel Convolutional Neural Network Architecture with a Continuous Symmetry.
- Everyone is invited to open a discussion, to offer criticisms, ask for clarifications, suggest or develop new ideas and experiments, à la the Polymath projects (open collaborations in research mathematics that work quite well for iterating on "small" improvements, rather than major conceptual breakthroughs); and of course to PR any code or results that you would like to share. The hope is to allow anyone from anywhere to contribute and move the field forward, without the inefficient cycle of published papers that repeat many of the same things over and over while sweeping other things under the rug. Unlike the Polymath projects, though, there isn't a clearly defined goal here.
The key observation is that the 3x3 conv can be seen as a differential operator (of order ≤2): the so-called Sobel filters are finite-difference approximations of the partial derivatives $\partial_x$ and $\partial_y$ in the horizontal and vertical directions.
By compounding multiple layers of 3x3 conv, with "identity skip connections" (ResNet), you are effectively solving a partial differential equation, numerically.
[Indeed, the update $x \leftarrow x + \mathrm{Conv}(x)$ is one step of the so-called "forward Euler" method, with step size 1.] With the nonlinear activation, the equation becomes a nonlinear PDE. In fact, with multiple channels this is technically a system of PDEs, or a PDE with matrix coefficients. The coefficients (the weights of the kernels) are constant in the $x$ and $y$ variables but vary in $t$, i.e. from layer to layer.
Convolutional Neural Nets | Partial Differential Equations |
---|---|
x = x + Conv(x) | one step of the forward Euler method |
input layer | initial condition |
feed-forward | solving the equation numerically |
hidden layers | solution at intermediate times |
output layer | solution at final time |
convolution with 3×3 kernel | differential operator of order ≤ 2 |
weights | coefficients |
boundary handling (padding) | boundary condition |
multiple channels [e.g. 16×16×3×3 kernel; 16×16×1×1 kernel] | system of (coupled) PDEs [16×16 matrix of differential operators; 16×16 matrix of constants] |
groups (=g) | matrix is block-diagonal (direct sum of g blocks) |
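To make the identification of a 3×3 kernel with a differential operator concrete, here is a minimal sketch (my own illustrative example; the kernel normalization and the choice of grid are arbitrary): a fixed Sobel kernel, applied with `F.conv2d`, computes a finite-difference approximation of $\partial_x$.

```python
import torch
import torch.nn.functional as F

# Sobel kernel for the horizontal derivative, as a 1-in/1-out 3x3 conv weight.
sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).reshape(1, 1, 3, 3) / 8.0

# Sample u(x, y) = x^2 on a grid over [-1, 1], so that du/dx = 2x.
H, W = 16, 16
xs = torch.linspace(-1, 1, W)
u = (xs ** 2).expand(H, W).reshape(1, 1, H, W)

grid_step = 2.0 / (W - 1)
dudx = F.conv2d(u, sobel_x, padding=1) / grid_step  # central difference, i.e. approximately d/dx

print(dudx[0, 0, 8, 4:12])  # interior values: approximately 2*x
print((2 * xs)[4:12])
```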
The training of a ConvNet would be an inverse problem: we know the solutions (the dataset), and look for the PDE that would yield those solutions. In the grand scheme of things, it is part of the continuous formulation of deep neural nets, which supplements the more traditional "statistical learning" interpretation. If you thought the curse of dimensionality was bad enough, you might find relief in optimization over an infinite-dimensional space, aka the calculus of variations. (I have very little to say about backpropagation, and will focus on the "feed-forward" pass, or the "architecture", of ConvNets.)
If you think about it, a full 3x3 conv from (say) 64 channels to 64 channels is rather wasteful, for there can only be at most 9 different kernels (or rather, 9 linearly independent kernels). One way to address this is to take each of the 64 channels, convolve with 9 different kernels, and take linear combinations; in other words, 64 channels go into 64x9 channels (3x3 with groups=64), followed by a 1x1 conv. This is awkwardly named "depthwise separable convolution".
In fact, doing x9 is also rather wasteful, for the space of differential operators of order ≤ 2 is 6-dimensional (spanned by $1, \partial_x, \partial_y, \partial_{xx}, \partial_{xy}, \partial_{yy}$), or perhaps 5, so I'd go with x4. That x1 already works so well (in MobileNet, etc.) is kind of surprising. Now, the
64 --(3x3, groups=64) --> 64x4 --(1x1)--> 64
block is a lot like the Bottleneck block in ResNet. (I don't have a good intuition for the placement of the other 1x1.)
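For concreteness, here is a minimal PyTorch sketch of such a block (the class name, channel count, expansion factor, and placement of the ReLU are my own illustrative choices, not taken from any particular architecture):

```python
import torch
import torch.nn as nn

class DepthwiseBottleneck(nn.Module):
    """64 --(3x3 depthwise, x4)--> 256 --(1x1)--> 64, with a skip connection."""
    def __init__(self, channels: int = 64, mult: int = 4):
        super().__init__()
        # Each channel is convolved with `mult` 3x3 kernels of its own (groups=channels).
        self.spatial = nn.Conv2d(channels, channels * mult, kernel_size=3,
                                 padding=1, groups=channels, bias=False)
        # The 1x1 conv takes linear combinations across the expanded channels.
        self.mix = nn.Conv2d(channels * mult, channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return x + self.mix(self.relu(self.spatial(x)))  # skip connection: x = x + Conv(x)

x = torch.randn(1, 64, 32, 32)
y = DepthwiseBottleneck()(x)
assert y.shape == x.shape
```

Here the 1x1 plays the role of the "linear combinations" step, and the factor-of-4 expansion matches the x4 suggestion above.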
What I'm saying is that the PDE perspective can get you quite a lot of the design of modern ConvNets — consisting almost entirely of 3x3 and 1x1 convolutions, with skip connections — more naturally and more directly. Moreover, it suggests other ways of tweaking the architecture:
- In such PDEs, one always needs to impose a boundary condition (BC), which is simply padding in conv. It seems that only "padding with 0" (the Dirichlet BC) is ever used. One could instead try to implement the Neumann BC (by using `padding_mode="reflect"`), which could make traveling waves bounce back and thereby retain more information.
- One important class of PDEs, called symmetric hyperbolic systems (the Maxwell and Dirac equations are prominent linear examples), would be interesting to implement. Or instead, we could help make the ConvNet more "hyperbolic" by putting a suitable regularization term in the loss function. In a sense, hyperbolic equations are better suited for numerical simulation due to the "finite speed of propagation".
- The PDEs here are all "constant coefficient" (i.e., the coefficients are constant in the $x$ and $y$ variables, but do vary in $t$). What if we make them vary in $x$ and $y$ as well? That is, after the standard 3x3 conv, multiply the result by a "coordinate function" of the form $ax+by+c$. Taking the latest torchvision code for ResNet, here are the relevant changes that can be made (easily adaptable to other ConvNets):
```python
import numpy as np  # Added: np.indices is used below (torchvision's resnet.py does not import numpy)


class Bottleneck(nn.Module):
    # Bottleneck in torchvision places the stride for downsampling at the 3x3 convolution (self.conv2),
    # while the original implementation places the stride at the first 1x1 convolution (self.conv1),
    # according to "Deep Residual Learning for Image Recognition" https://arxiv.org/abs/1512.03385.
    # This variant is also known as ResNet V1.5 and improves accuracy according to
    # https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch.

    expansion: int = 4

    def __init__(
        self,
        inplanes: int,
        planes: int,
        stride: int = 1,
        downsample: Optional[nn.Module] = None,
        groups: int = 1,
        base_width: int = 64,
        dilation: int = 1,
        norm_layer: Optional[Callable[..., nn.Module]] = None,
    ) -> None:
        super().__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        width = int(planes * (base_width / 64.0)) * groups
        # Both self.conv2 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv1x1(inplanes, width)
        self.bn1 = norm_layer(width)
        # self.conv2 = conv3x3(width, width, stride, groups, dilation)  # This is the original
        # self.bn2 = norm_layer(width)  # This is the original
        # self.conv3 = conv1x1(width, planes * self.expansion)  # This is the original
        self.conv2 = conv3x3(width, width * 4, stride, width, dilation)  # Modified: depthwise (groups=width), x4
        self.bn2 = norm_layer(width * 4)  # Modified
        self.XY = None  # Modified: cache for the coordinate-function channels
        self.mix = nn.Conv2d(5, width * 4, kernel_size=1, bias=True)  # Modified (torchvision's conv1x1 helper hardcodes bias=False)
        self.conv3 = conv1x1(width * 4, planes * self.expansion)  # Modified
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x: Tensor) -> Tensor:
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        # Added: build (and cache) the coordinate channels x, y, x*x, x*y, y*y on [-1, 1],
        # and modulate the conv output by a learned 1x1 mix of them, i.e. coefficients varying in x and y.
        if self.XY is None or self.XY.size()[2] != out.size()[2]:  # Added
            N, C, H, W = out.size()  # Added
            XX = torch.Tensor(np.indices((1, 1, H, W))[3] * 2 / (W - 1) - 1)  # Added
            YY = torch.Tensor(np.indices((1, 1, H, W))[2] * 2 / (H - 1) - 1)  # Added
            self.XY = torch.cat([XX, YY, XX * XX, XX * YY, YY * YY],  # Added
                                dim=1).type(out.dtype).to(out.device)  # Added
        out = out * self.mix(self.XY)  # Added
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out
```
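As a quick sanity check (my own sketch, not part of the changes above), the modified block can be smoke-tested by building a standard ResNet-50 from a local copy of torchvision's `resnet.py` with the snippet patched in:

```python
import torch
from resnet import resnet50  # assumes a local, patched copy of torchvision/models/resnet.py

model = resnet50()                  # random initialization; every Bottleneck uses the modified block
x = torch.randn(2, 3, 224, 224)
with torch.no_grad():
    y = model(x)
print(y.shape)                      # torch.Size([2, 1000])
```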
One could motivate the addition of variable coefficients as enabling the ConvNet to rotate and scale the image, just as the Sobel filters can shift the image, but by different amounts for different parts of the image. Whether it actually learns such transformations, however, is neither guaranteed nor easy to verify. At any rate, a better explanation may be that it at least expands the "expressive power" of the network.
I hope someone with resources can put this to more thorough tests on ImageNet and share the results. It seems that only solid results will convince more people to take this perspective seriously.
I'd bet that Yann LeCun understood PDEs well when he introduced the first ConvNet, but purposefully framed it in terms of convolutions. It's a bit unfortunate that, without the guide of PDEs, the field has missed many opportunities to improve the architecture design, or did so with ad hoc reasoning. The first to note the connection between ResNet and differential equations or dynamical systems was perhaps Weinan E, an applied mathematician at Princeton. The Neural ODE paper also starts out from the same observation, but it treats each pixel as a dynamical variable (hence ODE) interacting with its immediate neighbors; it's more natural to speak of PDEs, if somewhat limited to Computer Vision, so that both the depth ($t$) and the spatial dimensions ($x$ and $y$) are treated as continuous.
Last but not least, the PDE or dynamical-systems perspective points to a partial answer (though a somewhat useless one) to the problem of interpretability. The "hypothesis space" is now a class of PDEs that seems to be rich enough to contain traces of integrability amidst chaos — analogous to the long-term stability and predictability of the solar system, despite the fact that the three-body problem is chaotic — and gradient descent is somehow quite effective in finding them. Though it would be nice to "prove" such a universal integrability hypothesis, it's also fine to assume it and see what new ideas it would inspire.
It should be added that this perspective has been getting some exposure in recent years, with several workshops on it at top venues.
One thing that might be hard to find in a PDE course is how repeated local operations (first-order differential operators) give rise to a global transformation. This is "well known", but under different names in different areas: infinitesimal generator and the exponential map (Lie group/Lie algebra), semigroup of unbounded operators (functional analysis), the method of characteristics (linear PDEs), flow of a vector field (differential geometry), and even "observables" or self-adjoint operators in quantum mechanics. The simplest way to understand it is by way of the exponential — the master key for solving any linear differential equation — of the operator $\frac{d}{dx}$:

$$e^{t\frac{d}{dx}} f(x) = \sum_{n=0}^{\infty} \frac{t^n}{n!} \frac{d^n f}{dx^n}(x) = f(x+t),$$

which is nothing but the formula for Taylor expansion. We would say that translation (in $x$, by the amount $t$) is generated by the first-order operator $\frac{d}{dx}$. The formula actually involves two limiting processes, and if we take the two limits simultaneously (replacing each $\frac{d}{dx}$ by a finite difference with the same small step), differentiability of $f$ is not even needed. Likewise, rotation and scaling of the plane are generated by $x\partial_y - y\partial_x$ and $x\partial_x + y\partial_y$, respectively. Note that these are differential operators with variable coefficients.
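To make the "two limits taken simultaneously" remark concrete, here is a tiny sketch (my own, in PyTorch; the signal and step count are arbitrary choices): when $\frac{d}{dx}$ is replaced by a forward difference with the same step $h$, each application of $(1 + h\,\frac{d}{dx})$ becomes an exact one-pixel shift, so compounding many local steps translates any sampled signal, smooth or not.

```python
import math
import torch

# A sampled periodic signal on a grid of spacing h.
N = 256
h = 2 * math.pi / N
x = torch.arange(N) * h
f = torch.sin(x)

def step(u: torch.Tensor) -> torch.Tensor:
    """One application of (1 + h*d/dx), with d/dx replaced by a forward difference of step h."""
    dudx = (torch.roll(u, -1) - u) / h  # forward difference, periodic boundary
    return u + h * dudx                 # this simplifies to an exact shift by one grid point

# Compounding N//4 local steps translates the signal by (N//4)*h = pi/2.
u = f.clone()
for _ in range(N // 4):
    u = step(u)

print(torch.max(torch.abs(u - torch.sin(x + math.pi / 2))))  # ~1e-7: a global translation from local steps
```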
IMO, this is the kind of thing that would be hard to come up with, or to justify, without the continuous formulation of neural nets. Unfortunately, the mathematics is just beyond what the standard CS and ML training would prepare you for, but it should and can be made more accessible with little or no jargon.