The Forward and Backward Passes
As we saw in <>, to train a model, we will need to compute all the gradients of a given loss with respect to its parameters, which is known as the backward pass. The forward pass is where we compute the output of the model on a given input, based on the matrix products. As we define our first neural net, we will also delve into the problem of properly initializing the weights, which is crucial for making training start properly.
Defining and Initializing a Layer
We will take the example of a two-layer neural net first. As we’ve seen, one layer can be expressed as y = x @ w + b
, with x
our inputs, y
our outputs, w
the weights of the layer (which is of size number of inputs by number of neurons if we don’t transpose like before), and b
is the bias vector:
In [ ]:
def lin(x, w, b): return x @ w + b
We can stack the second layer on top of the first, but since mathematically the composition of two linear operations is another linear operation, this only makes sense if we put something nonlinear in the middle, called an activation function. As mentioned at the beginning of the chapter, in deep learning applications the activation function most commonly used is a ReLU, which returns the maximum of x
and 0
.
We won’t actually train our model in this chapter, so we’ll use random tensors for our inputs and targets. Let’s say our inputs are 200 vectors of size 100, which we group into one batch, and our targets are 200 random floats:
In [ ]:
x = torch.randn(200, 100)
y = torch.randn(200)
For our two-layer model we will need two weight matrices and two bias vectors. Let’s say we have a hidden size of 50 and the output size is 1 (for one of our inputs, the corresponding output is one float in this toy example). We initialize the weights randomly and the bias at zero:
In [ ]:
w1 = torch.randn(100,50)
b1 = torch.zeros(50)
w2 = torch.randn(50,1)
b2 = torch.zeros(1)
Then the result of our first layer is simply:
In [ ]:
l1 = lin(x, w1, b1)
l1.shape
Out[ ]:
torch.Size([200, 50])
Note that this formula works with our batch of inputs, and returns a batch of hidden state: l1
is a matrix of size 200 (our batch size) by 50 (our hidden size).
There is a problem with the way our model was initialized, however. To understand it, we need to look at the mean and standard deviation (std) of l1
:
In [ ]:
l1.mean(), l1.std()
Out[ ]:
(tensor(0.0019), tensor(10.1058))
The mean is close to zero, which is understandable since both our input and weight matrices have means close to zero. But the standard deviation, which represents how far away our activations go from the mean, went from 1 to 10. This is a really big problem because that’s with just one layer. Modern neural nets can have hundred of layers, so if each of them multiplies the scale of our activations by 10, by the end of the last layer we won’t have numbers representable by a computer.
Indeed, if we make just 50 multiplications between x
and random matrices of size 100×100, we’ll have:
In [ ]:
x = torch.randn(200, 100)
for i in range(50): x = x @ torch.randn(100,100)
x[0:5,0:5]
Out[ ]:
tensor([[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan]])
The result is nan
s everywhere. So maybe the scale of our matrix was too big, and we need to have smaller weights? But if we use too small weights, we will have the opposite problem—the scale of our activations will go from 1 to 0.1, and after 50 layers we’ll be left with zeros everywhere:
In [ ]:
x = torch.randn(200, 100)
for i in range(50): x = x @ (torch.randn(100,100) * 0.01)
x[0:5,0:5]
Out[ ]:
tensor([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]])
So we have to scale our weight matrices exactly right so that the standard deviation of our activations stays at 1. We can compute the exact value to use mathematically, as illustrated by Xavier Glorot and Yoshua Bengio in “Understanding the Difficulty of Training Deep Feedforward Neural Networks”. The right scale for a given layer is $1/\sqrt{n_{in}}$, where $n_{in}$ represents the number of inputs.
In our case, if we have 100 inputs, we should scale our weight matrices by 0.1:
In [ ]:
x = torch.randn(200, 100)
for i in range(50): x = x @ (torch.randn(100,100) * 0.1)
x[0:5,0:5]
Out[ ]:
tensor([[ 0.7554, 0.6167, -0.1757, -1.5662, 0.5644],
[-0.1987, 0.6292, 0.3283, -1.1538, 0.5416],
[ 0.6106, 0.2556, -0.0618, -0.9463, 0.4445],
[ 0.4484, 0.7144, 0.1164, -0.8626, 0.4413],
[ 0.3463, 0.5930, 0.3375, -0.9486, 0.5643]])
Finally some numbers that are neither zeros nor nan
s! Notice how stable the scale of our activations is, even after those 50 fake layers:
In [ ]:
x.std()
Out[ ]:
tensor(0.7042)
If you play a little bit with the value for scale you’ll notice that even a slight variation from 0.1 will get you either to very small or very large numbers, so initializing the weights properly is extremely important.
Let’s go back to our neural net. Since we messed a bit with our inputs, we need to redefine them:
In [ ]:
x = torch.randn(200, 100)
y = torch.randn(200)
And for our weights, we’ll use the right scale, which is known as Xavier initialization (or Glorot initialization):
In [ ]:
from math import sqrt
w1 = torch.randn(100,50) / sqrt(100)
b1 = torch.zeros(50)
w2 = torch.randn(50,1) / sqrt(50)
b2 = torch.zeros(1)
Now if we compute the result of the first layer, we can check that the mean and standard deviation are under control:
In [ ]:
l1 = lin(x, w1, b1)
l1.mean(),l1.std()
Out[ ]:
(tensor(-0.0050), tensor(1.0000))
Very good. Now we need to go through a ReLU, so let’s define one. A ReLU removes the negatives and replaces them with zeros, which is another way of saying it clamps our tensor at zero:
In [ ]:
def relu(x): return x.clamp_min(0.)
We pass our activations through this:
In [ ]:
l2 = relu(l1)
l2.mean(),l2.std()
Out[ ]:
(tensor(0.3961), tensor(0.5783))
And we’re back to square one: the mean of our activations has gone to 0.4 (which is understandable since we removed the negatives) and the std went down to 0.58. So like before, after a few layers we will probably wind up with zeros:
In [ ]:
x = torch.randn(200, 100)
for i in range(50): x = relu(x @ (torch.randn(100,100) * 0.1))
x[0:5,0:5]
Out[ ]:
tensor([[0.0000e+00, 1.9689e-08, 4.2820e-08, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 1.6701e-08, 4.3501e-08, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 1.0976e-08, 3.0411e-08, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 1.8457e-08, 4.9469e-08, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 1.9949e-08, 4.1643e-08, 0.0000e+00, 0.0000e+00]])
This means our initialization wasn’t right. Why? At the time Glorot and Bengio wrote their article, the popular activation in a neural net was the hyperbolic tangent (tanh, which is the one they used), and that initialization doesn’t account for our ReLU. Fortunately, someone else has done the math for us and computed the right scale for us to use. In “Delving Deep into Rectifiers: Surpassing Human-Level Performance” (which we’ve seen before—it’s the article that introduced the ResNet), Kaiming He et al. show that we should use the following scale instead: $\sqrt{2 / n_{in}}$, where $n_{in}$ is the number of inputs of our model. Let’s see what this gives us:
In [ ]:
x = torch.randn(200, 100)
for i in range(50): x = relu(x @ (torch.randn(100,100) * sqrt(2/100)))
x[0:5,0:5]
Out[ ]:
tensor([[0.2871, 0.0000, 0.0000, 0.0000, 0.0026],
[0.4546, 0.0000, 0.0000, 0.0000, 0.0015],
[0.6178, 0.0000, 0.0000, 0.0180, 0.0079],
[0.3333, 0.0000, 0.0000, 0.0545, 0.0000],
[0.1940, 0.0000, 0.0000, 0.0000, 0.0096]])
That’s better: our numbers aren’t all zeroed this time. So let’s go back to the definition of our neural net and use this initialization (which is named Kaiming initialization or He initialization):
In [ ]:
x = torch.randn(200, 100)
y = torch.randn(200)
In [ ]:
w1 = torch.randn(100,50) * sqrt(2 / 100)
b1 = torch.zeros(50)
w2 = torch.randn(50,1) * sqrt(2 / 50)
b2 = torch.zeros(1)
Let’s look at the scale of our activations after going through the first linear layer and ReLU:
In [ ]:
l1 = lin(x, w1, b1)
l2 = relu(l1)
l2.mean(), l2.std()
Out[ ]:
(tensor(0.5661), tensor(0.8339))
Much better! Now that our weights are properly initialized, we can define our whole model:
In [ ]:
def model(x):
l1 = lin(x, w1, b1)
l2 = relu(l1)
l3 = lin(l2, w2, b2)
return l3
This is the forward pass. Now all that’s left to do is to compare our output to the labels we have (random numbers, in this example) with a loss function. In this case, we will use the mean squared error. (It’s a toy problem, and this is the easiest loss function to use for what is next, computing the gradients.)
The only subtlety is that our outputs and targets don’t have exactly the same shape—after going though the model, we get an output like this:
In [ ]:
out = model(x)
out.shape
Out[ ]:
torch.Size([200, 1])
To get rid of this trailing 1 dimension, we use the squeeze
function:
In [ ]:
def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()
And now we are ready to compute our loss:
In [ ]:
loss = mse(out, y)
That’s all for the forward pass—let’s now look at the gradients.
Gradients and the Backward Pass
We’ve seen that PyTorch computes all the gradients we need with a magic call to loss.backward
, but let’s explore what’s happening behind the scenes.
Now comes the part where we need to compute the gradients of the loss with respect to all the weights of our model, so all the floats in w1
, b1
, w2
, and b2
. For this, we will need a bit of math—specifically the chain rule. This is the rule of calculus that guides how we can compute the derivative of a composed function:
(g \circ f)’(x) = g’(f(x)) f’(x)
j: I find this notation very hard to wrap my head around, so instead I like to think of it as: if
y = g(u)
andu=f(x)
; thendy/dx = dy/du * du/dx
. The two notations mean the same thing, so use whatever works for you.
Our loss is a big composition of different functions: mean squared error (which is in turn the composition of a mean and a power of two), the second linear layer, a ReLU and the first linear layer. For instance, if we want the gradients of the loss with respect to b2
and our loss is defined by:
loss = mse(out,y) = mse(lin(l2, w2, b2), y)
The chain rule tells us that we have: \frac{\text{d} loss}{\text{d} b_{2}} = \frac{\text{d} loss}{\text{d} out} \times \frac{\text{d} out}{\text{d} b_{2}} = \frac{\text{d}}{\text{d} out} mse(out, y) \times \frac{\text{d}}{\text{d} b_{2}} lin(l_{2}, w_{2}, b_{2})
To compute the gradients of the loss with respect to $b_{2}$, we first need the gradients of the loss with respect to our output $out$. It’s the same if we want the gradients of the loss with respect to $w_{2}$. Then, to get the gradients of the loss with respect to $b_{1}$ or $w_{1}$, we will need the gradients of the loss with respect to $l_{1}$, which in turn requires the gradients of the loss with respect to $l_{2}$, which will need the gradients of the loss with respect to $out$.
So to compute all the gradients we need for the update, we need to begin from the output of the model and work our way backward, one layer after the other—which is why this step is known as backpropagation. We can automate it by having each function we implemented (relu
, mse
, lin
) provide its backward step: that is, how to derive the gradients of the loss with respect to the input(s) from the gradients of the loss with respect to the output.
Here we populate those gradients in an attribute of each tensor, a bit like PyTorch does with .grad
.
The first are the gradients of the loss with respect to the output of our model (which is the input of the loss function). We undo the squeeze
we did in mse
, then we use the formula that gives us the derivative of $x^{2}$: $2x$. The derivative of the mean is just $1/n$ where $n$ is the number of elements in our input:
In [ ]:
def mse_grad(inp, targ):
# grad of loss with respect to output of previous layer
inp.g = 2. * (inp.squeeze() - targ).unsqueeze(-1) / inp.shape[0]
For the gradients of the ReLU and our linear layer, we use the gradients of the loss with respect to the output (in out.g
) and apply the chain rule to compute the gradients of the loss with respect to the input (in inp.g
). The chain rule tells us that inp.g = relu'(inp) * out.g
. The derivative of relu
is either 0 (when inputs are negative) or 1 (when inputs are positive), so this gives us:
In [ ]:
def relu_grad(inp, out):
# grad of relu with respect to input activations
inp.g = (inp>0).float() * out.g
The scheme is the same to compute the gradients of the loss with respect to the inputs, weights, and bias in the linear layer:
In [ ]:
def lin_grad(inp, out, w, b):
# grad of matmul with respect to input
inp.g = out.g @ w.t()
w.g = inp.t() @ out.g
b.g = out.g.sum(0)
We won’t linger on the mathematical formulas that define them since they’re not important for our purposes, but do check out Khan Academy’s excellent calculus lessons if you’re interested in this topic.
Sidebar: SymPy
SymPy is a library for symbolic computation that is extremely useful library when working with calculus. Per the documentation:
: Symbolic computation deals with the computation of mathematical objects symbolically. This means that the mathematical objects are represented exactly, not approximately, and mathematical expressions with unevaluated variables are left in symbolic form.
To do symbolic computation, we first define a symbol, and then do a computation, like so:
In [ ]:
from sympy import symbols,diff
sx,sy = symbols('sx sy')
diff(sx**2, sx)
Out[ ]:
$\displaystyle 2 sx$
Here, SymPy has taken the derivative of x**2
for us! It can take the derivative of complicated compound expressions, simplify and factor equations, and much more. There’s really not much reason for anyone to do calculus manually nowadays—for calculating gradients, PyTorch does it for us, and for showing the equations, SymPy does it for us!
End sidebar
Once we have have defined those functions, we can use them to write the backward pass. Since each gradient is automatically populated in the right tensor, we don’t need to store the results of those _grad
functions anywhere—we just need to execute them in the reverse order of the forward pass, to make sure that in each function out.g
exists:
In [ ]:
def forward_and_backward(inp, targ):
# forward pass:
l1 = inp @ w1 + b1
l2 = relu(l1)
out = l2 @ w2 + b2
# we don't actually need the loss in backward!
loss = mse(out, targ)
# backward pass:
mse_grad(out, targ)
lin_grad(l2, out, w2, b2)
relu_grad(l1, l2)
lin_grad(inp, l1, w1, b1)
And now we can access the gradients of our model parameters in w1.g
, b1.g
, w2.g
, and b2.g
.
We have successfully defined our model—now let’s make it a bit more like a PyTorch module.
Refactoring the Model
The three functions we used have two associated functions: a forward pass and a backward pass. Instead of writing them separately, we can create a class to wrap them together. That class can also store the inputs and outputs for the backward pass. This way, we will just have to call backward
:
In [ ]:
class Relu():
def __call__(self, inp):
self.inp = inp
self.out = inp.clamp_min(0.)
return self.out
def backward(self): self.inp.g = (self.inp>0).float() * self.out.g
__call__
is a magic name in Python that will make our class callable. This is what will be executed when we type y = Relu()(x)
. We can do the same for our linear layer and the MSE loss:
In [ ]:
class Lin():
def __init__(self, w, b): self.w,self.b = w,b
def __call__(self, inp):
self.inp = inp
self.out = inp@self.w + self.b
return self.out
def backward(self):
self.inp.g = self.out.g @ self.w.t()
self.w.g = self.inp.t() @ self.out.g
self.b.g = self.out.g.sum(0)
In [ ]:
class Mse():
def __call__(self, inp, targ):
self.inp = inp
self.targ = targ
self.out = (inp.squeeze() - targ).pow(2).mean()
return self.out
def backward(self):
x = (self.inp.squeeze()-self.targ).unsqueeze(-1)
self.inp.g = 2.*x/self.targ.shape[0]
Then we can put everything in a model that we initiate with our tensors w1
, b1
, w2
, b2
:
In [ ]:
class Model():
def __init__(self, w1, b1, w2, b2):
self.layers = [Lin(w1,b1), Relu(), Lin(w2,b2)]
self.loss = Mse()
def __call__(self, x, targ):
for l in self.layers: x = l(x)
return self.loss(x, targ)
def backward(self):
self.loss.backward()
for l in reversed(self.layers): l.backward()
What is really nice about this refactoring and registering things as layers of our model is that the forward and backward passes are now really easy to write. If we want to instantiate our model, we just need to write:
In [ ]:
model = Model(w1, b1, w2, b2)
The forward pass can then be executed with:
In [ ]:
loss = model(x, y)
And the backward pass with:
In [ ]:
model.backward()
Going to PyTorch
The Lin
, Mse
and Relu
classes we wrote have a lot in common, so we could make them all inherit from the same base class:
In [ ]:
class LayerFunction():
def __call__(self, *args):
self.args = args
self.out = self.forward(*args)
return self.out
def forward(self): raise Exception('not implemented')
def bwd(self): raise Exception('not implemented')
def backward(self): self.bwd(self.out, *self.args)
Then we just need to implement forward
and bwd
in each of our subclasses:
In [ ]:
class Relu(LayerFunction):
def forward(self, inp): return inp.clamp_min(0.)
def bwd(self, out, inp): inp.g = (inp>0).float() * out.g
In [ ]:
class Lin(LayerFunction):
def __init__(self, w, b): self.w,self.b = w,b
def forward(self, inp): return inp@self.w + self.b
def bwd(self, out, inp):
inp.g = out.g @ self.w.t()
self.w.g = inp.t() @ self.out.g
self.b.g = out.g.sum(0)
In [ ]:
class Mse(LayerFunction):
def forward (self, inp, targ): return (inp.squeeze() - targ).pow(2).mean()
def bwd(self, out, inp, targ):
inp.g = 2*(inp.squeeze()-targ).unsqueeze(-1) / targ.shape[0]
The rest of our model can be the same as before. This is getting closer and closer to what PyTorch does. Each basic function we need to differentiate is written as a torch.autograd.Function
object that has a forward
and a backward
method. PyTorch will then keep trace of any computation we do to be able to properly run the backward pass, unless we set the requires_grad
attribute of our tensors to False
.
Writing one of these is (almost) as easy as writing our original classes. The difference is that we choose what to save and what to put in a context variable (so that we make sure we don’t save anything we don’t need), and we return the gradients in the backward
pass. It’s very rare to have to write your own Function
but if you ever need something exotic or want to mess with the gradients of a regular function, here is how to write one:
In [ ]:
from torch.autograd import Function
class MyRelu(Function):
@staticmethod
def forward(ctx, i):
result = i.clamp_min(0.)
ctx.save_for_backward(i)
return result
@staticmethod
def backward(ctx, grad_output):
i, = ctx.saved_tensors
return grad_output * (i>0).float()
The structure used to build a more complex model that takes advantage of those Function
s is a torch.nn.Module
. This is the base structure for all models, and all the neural nets you have seen up until now inherited from that class. It mostly helps to register all the trainable parameters, which as we’ve seen can be used in the training loop.
To implement an nn.Module
you just need to:
- Make sure the superclass
__init__
is called first when you initialize it. - Define any parameters of the model as attributes with
nn.Parameter
. - Define a
forward
function that returns the output of your model.
As an example, here is the linear layer from scratch:
In [ ]:
import torch.nn as nn
class LinearLayer(nn.Module):
def __init__(self, n_in, n_out):
super().__init__()
self.weight = nn.Parameter(torch.randn(n_out, n_in) * sqrt(2/n_in))
self.bias = nn.Parameter(torch.zeros(n_out))
def forward(self, x): return x @ self.weight.t() + self.bias
As you see, this class automatically keeps track of what parameters have been defined:
In [ ]:
lin = LinearLayer(10,2)
p1,p2 = lin.parameters()
p1.shape,p2.shape
Out[ ]:
(torch.Size([2, 10]), torch.Size([2]))
It is thanks to this feature of nn.Module
that we can just say opt.step()
and have an optimizer loop through the parameters and update each one.
Note that in PyTorch, the weights are stored as an n_out x n_in
matrix, which is why we have the transpose in the forward pass.
By using the linear layer from PyTorch (which uses the Kaiming initialization as well), the model we have been building up during this chapter can be written like this:
In [ ]:
class Model(nn.Module):
def __init__(self, n_in, nh, n_out):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out))
self.loss = mse
def forward(self, x, targ): return self.loss(self.layers(x).squeeze(), targ)
fastai provides its own variant of Module
that is identical to nn.Module
, but doesn’t require you to call super().__init__()
(it does that for you automatically):
In [ ]:
class Model(Module):
def __init__(self, n_in, nh, n_out):
self.layers = nn.Sequential(
nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out))
self.loss = mse
def forward(self, x, targ): return self.loss(self.layers(x).squeeze(), targ)
In the last chapter, we will start from such a model and see how to build a training loop from scratch and refactor it to what we’ve been using in previous chapters.