The Magic of Convolutions
One of the most powerful tools that machine learning practitioners have at their disposal is feature engineering. A feature is a transformation of the data which is designed to make it easier to model. For instance, the add_datepart function that we used for our tabular dataset preprocessing in <> added date features to the Bulldozers dataset. What kinds of features might we be able to create from images?
jargon: Feature engineering: Creating new transformations of the input data in order to make it easier to model.
In the context of an image, a feature is a visually distinctive attribute. For example, the number 7 is characterized by a horizontal edge near the top of the digit, and a top-right to bottom-left diagonal edge underneath that. On the other hand, the number 3 is characterized by a diagonal edge in one direction at the top left and bottom right of the digit, the opposite diagonal at the bottom left and top right, horizontal edges at the middle, top, and bottom, and so forth. So what if we could extract information about where the edges occur in each image, and then use that information as our features, instead of raw pixels?
It turns out that finding the edges in an image is a very common task in computer vision, and is surprisingly straightforward. To do it, we use something called a convolution. A convolution requires nothing more than multiplication and addition, two operations that are responsible for the vast majority of work that we will see in every single deep learning model in this book!
A convolution applies a kernel across an image. A kernel is a little matrix, such as the 3×3 matrix in the top right of <>.
The 7×7 grid to the left is the image we’re going to apply the kernel to. The convolution operation multiplies each element of the kernel by the corresponding element of a 3×3 block of the image. The results of these multiplications are then added together. The diagram in <> shows an example of applying a kernel to a single location in the image, the 3×3 block around cell 18.
Let’s do this with code. First, we create a little 3×3 matrix like so:
In [ ]:
top_edge = tensor([[-1,-1,-1],
                   [ 0, 0, 0],
                   [ 1, 1, 1]]).float()
We’re going to call this our kernel (because that’s what fancy computer vision researchers call these). And we’ll need an image, of course:
In [ ]:
path = untar_data(URLs.MNIST_SAMPLE)
In [ ]:
#hide
Path.BASE_PATH = path
In [ ]:
im3 = Image.open(path/'train'/'3'/'12.png')
show_image(im3);
Now we’re going to take the top-left 3×3-pixel square of our image, and multiply each of those values by the corresponding item in our kernel. Then we’ll add them up, like so:
In [ ]:
im3_t = tensor(im3)
im3_t[0:3,0:3] * top_edge
Out[ ]:
tensor([[-0., -0., -0.],
[0., 0., 0.],
[0., 0., 0.]])
In [ ]:
(im3_t[0:3,0:3] * top_edge).sum()
Out[ ]:
tensor(0.)
Not very interesting so far—all the pixels in the top-left corner are white. But let’s pick a couple of more interesting spots:
In [ ]:
#hide_output
df = pd.DataFrame(im3_t[:10,:20])
df.style.set_properties(**{'font-size':'6pt'}).background_gradient('Greys')
Out[ ]:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 12 | 99 | 91 | 142 | 155 | 246 | 182 | 155 | 155 | 155 | 155 | 131 | 52 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 138 | 254 | 254 | 254 | 254 | 254 | 254 | 254 | 254 | 254 | 254 | 254 | 252 | 210 | 122 | 33 | 0 |
7 | 0 | 0 | 0 | 220 | 254 | 254 | 254 | 235 | 189 | 189 | 189 | 189 | 150 | 189 | 205 | 254 | 254 | 254 | 75 | 0 |
8 | 0 | 0 | 0 | 35 | 74 | 35 | 35 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 224 | 254 | 254 | 153 | 0 |
9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 90 | 254 | 254 | 247 | 53 | 0 |
There’s a top edge at cell 5,7. Let’s repeat our calculation on the 3×3 block centered there:
In [ ]:
(im3_t[4:7,6:9] * top_edge).sum()
Out[ ]:
tensor(762.)
There’s a right edge at cell 8,18. What does that give us?
In [ ]:
(im3_t[7:10,17:20] * top_edge).sum()
Out[ ]:
tensor(-29.)
As you can see, this little calculation is returning a high number where the 3×3-pixel square represents a top edge (i.e., where there are low values at the top of the square, and high values immediately underneath). That’s because the -1 values in our kernel have little impact in that case, but the 1 values have a lot.
Let’s look a tiny bit at the math. The filter will take any window of size 3×3 in our images, and if we name the pixel values like this:
$$\begin{matrix} a_1 & a_2 & a_3 \\ a_4 & a_5 & a_6 \\ a_7 & a_8 & a_9 \end{matrix}$$
it will return $a_7+a_8+a_9-a_1-a_2-a_3$. If we are in a part of the image where $a_1$, $a_2$, and $a_3$ add up to the same as $a_7$, $a_8$, and $a_9$, then the terms will cancel each other out and we will get 0. However, if $a_7$ is greater than $a_1$, $a_8$ is greater than $a_2$, and $a_9$ is greater than $a_3$, we will get a bigger number as a result. So this filter detects horizontal edges; more precisely, edges where we go from low pixel values at the top (shown as light in the rendering above) to high pixel values at the bottom (shown as dark).
Changing our filter to have the row of 1s at the top and the -1s at the bottom would detect horizontal edges that go from dark to light. Putting the 1s and -1s in columns versus rows would give us filters that detect vertical edges. Each set of weights will produce a different kind of outcome.
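As a quick check of these claims, here is a minimal pure-Python sketch that applies a few such kernels to two hand-made 3×3 patches. The patch values, the helper name `response`, and the `bottom_edge` and `vertical_edge` kernels are made up for illustration; only `top_edge` matches the kernel defined earlier.

```python
def response(patch, kernel):
    """Multiply a 3x3 patch by a 3x3 kernel elementwise and add up the products."""
    return sum(p * k
               for patch_row, kernel_row in zip(patch, kernel)
               for p, k in zip(patch_row, kernel_row))

top_edge      = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]   # same weights as the tensor above
bottom_edge   = [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]   # rows of 1s and -1s swapped
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]   # 1s and -1s in columns instead

# Low values on top, high values underneath: a horizontal edge.
horizontal_patch = [[0, 0, 0], [0, 0, 0], [255, 255, 255]]
# Low values on the left, high values on the right: a vertical edge.
vertical_patch = [[0, 0, 255], [0, 0, 255], [0, 0, 255]]

print(response(horizontal_patch, top_edge))       # 765
print(response(horizontal_patch, bottom_edge))    # -765
print(response(horizontal_patch, vertical_edge))  # 0
print(response(vertical_patch, vertical_edge))    # 765
print(response(vertical_patch, top_edge))         # 0
```

Each kernel responds strongly only to the orientation and direction it was built for, and gives zero for an edge at right angles to it.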
Let’s create a function to do this for one location, and check it matches our result from before:
In [ ]:
def apply_kernel(row, col, kernel):
    return (im3_t[row-1:row+2,col-1:col+2] * kernel).sum()
In [ ]:
apply_kernel(5,7,top_edge)
Out[ ]:
tensor(762.)
But note that we can’t apply it to the corner (e.g., location 0,0), since there isn’t a complete 3×3 square there.
Mapping a Convolution Kernel
We can map apply_kernel() across the coordinate grid. That is, we’ll be taking our 3×3 kernel, and applying it to each 3×3 section of our image. For instance, <> shows the positions a 3×3 kernel can be applied to in the first row of a 5×5 image.
To get a grid of coordinates we can use a nested list comprehension, like so:
In [ ]:
[[(i,j) for j in range(1,5)] for i in range(1,5)]
Out[ ]:
[[(1, 1), (1, 2), (1, 3), (1, 4)],
[(2, 1), (2, 2), (2, 3), (2, 4)],
[(3, 1), (3, 2), (3, 3), (3, 4)],
[(4, 1), (4, 2), (4, 3), (4, 4)]]
note: Nested List Comprehensions: Nested list comprehensions are used a lot in Python, so if you haven’t seen them before, take a few minutes to make sure you understand what’s happening here, and experiment with writing your own nested list comprehensions.
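If nested comprehensions are new to you, it may help to see the same grid built with explicit loops. This is only an illustrative sketch; the names `grid` and `grid_loops` are not used elsewhere in the chapter.

```python
# The nested comprehension from above.
grid = [[(i, j) for j in range(1, 5)] for i in range(1, 5)]

# The same result, spelled out with two ordinary loops.
grid_loops = []
for i in range(1, 5):        # outer comprehension: one row per value of i
    row = []
    for j in range(1, 5):    # inner comprehension: one (i, j) pair per value of j
        row.append((i, j))
    grid_loops.append(row)

print(grid == grid_loops)    # True
```

The inner comprehension (on the left) runs once for every step of the outer one (on the right), which is why each row shares the same `i`.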
Here’s the result of applying our kernel over a coordinate grid:
In [ ]:
rng = range(1,27)
top_edge3 = tensor([[apply_kernel(i,j,top_edge) for j in rng] for i in rng])
show_image(top_edge3);
Looking good! Our top edges are black, and bottom edges are white (since they are the opposite of top edges). Now that our image contains negative numbers too, matplotlib has automatically changed our colors so that white is the smallest number in the image, black the highest, and zeros appear as gray.
We can try the same thing for left edges:
In [ ]:
left_edge = tensor([[-1,1,0],
                    [-1,1,0],
                    [-1,1,0]]).float()
left_edge3 = tensor([[apply_kernel(i,j,left_edge) for j in rng] for i in rng])
show_image(left_edge3);
As we mentioned before, a convolution is the operation of applying such a kernel over a grid in this way. In the paper “A Guide to Convolution Arithmetic for Deep Learning” there are many great diagrams showing how image kernels can be applied. Here’s an example from the paper showing (at the bottom) a light blue 4×4 image, with a dark blue 3×3 kernel being applied, creating a 2×2 green output activation map at the top.
Look at the shape of the result. If the original image has a height of h and a width of w, how many 3×3 windows can we find? As you can see from the example, there are h-2 by w-2 windows, so the resulting image has a height of h-2 and a width of w-2.
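One way to convince yourself of the h-2 by w-2 count is to list every position where a window fits. The sketch below is pure Python; the function name `window_origins` is hypothetical, and it tracks the top-left corner of each window rather than the center used by apply_kernel (the center is simply one row and one column further in).

```python
def window_origins(h, w, ks=3):
    """Top-left corners of every ks x ks window that fits inside an h x w image."""
    return [[(r, c) for c in range(w - ks + 1)] for r in range(h - ks + 1)]

small = window_origins(4, 4)      # the 4x4 example from the paper
print(len(small), len(small[0]))  # 2 2

mnist = window_origins(28, 28)    # our 28x28 digit
print(len(mnist), len(mnist[0]))  # 26 26
```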
We won’t implement this convolution function from scratch, but use PyTorch’s implementation instead (it is way faster than anything we could do in Python).
Convolutions in PyTorch
Convolution is such an important and widely used operation that PyTorch has it built in. It’s called F.conv2d (recall that F is a fastai import from torch.nn.functional, as recommended by PyTorch). The PyTorch docs tell us that it includes these parameters:
- input: input tensor of shape (minibatch, in_channels, iH, iW)
- weight: filters of shape (out_channels, in_channels, kH, kW)
Here iH,iW is the height and width of the image (i.e., 28,28), and kH,kW is the height and width of our kernel (3,3). But apparently PyTorch is expecting rank-4 tensors for both these arguments, whereas currently we only have rank-2 tensors (i.e., matrices, or arrays with two axes).
The reason for these extra axes is that PyTorch has a few tricks up its sleeve. The first trick is that PyTorch can apply a convolution to multiple images at the same time. That means we can call it on every item in a batch at once!
The second trick is that PyTorch can apply multiple kernels at the same time. So let’s create the diagonal-edge kernels too, and then stack all four of our edge kernels into a single tensor:
In [ ]:
diag1_edge = tensor([[ 0,-1, 1],
                     [-1, 1, 0],
                     [ 1, 0, 0]]).float()
diag2_edge = tensor([[ 1,-1, 0],
                     [ 0, 1,-1],
                     [ 0, 0, 1]]).float()
edge_kernels = torch.stack([left_edge, top_edge, diag1_edge, diag2_edge])
edge_kernels.shape
Out[ ]:
torch.Size([4, 3, 3])
To test this, we’ll need a DataLoader and a sample mini-batch. Let’s use the data block API:
In [ ]:
mnist = DataBlock((ImageBlock(cls=PILImageBW), CategoryBlock),
                  get_items=get_image_files,
                  splitter=GrandparentSplitter(),
                  get_y=parent_label)
dls = mnist.dataloaders(path)
xb,yb = first(dls.valid)
xb.shape
Out[ ]:
torch.Size([64, 1, 28, 28])
By default, fastai puts data on the GPU when using data blocks. Let’s move it to the CPU for our examples:
In [ ]:
xb,yb = to_cpu(xb),to_cpu(yb)
One batch contains 64 images, each of 1 channel, with 28×28 pixels. F.conv2d can handle multichannel (i.e., color) images too. A channel is a single basic color in an image; for regular full-color images there are three channels: red, green, and blue. PyTorch represents an image as a rank-3 tensor, with dimensions [channels, rows, columns].
We’ll see how to handle more than one channel later in this chapter. Kernels passed to F.conv2d need to be rank-4 tensors: [features_out, channels_in, rows, columns]. edge_kernels is currently missing one of these. We need to tell PyTorch that the number of input channels in the kernel is one, which we can do by inserting an axis of size one (this is known as a unit axis) at index 1, the second position, where the PyTorch docs show in_channels is expected. To insert a unit axis into a tensor, we use the unsqueeze method:
In [ ]:
edge_kernels.shape,edge_kernels.unsqueeze(1).shape
Out[ ]:
(torch.Size([4, 3, 3]), torch.Size([4, 1, 3, 3]))
This is now the correct shape for edge_kernels. Let’s pass this all to conv2d:
In [ ]:
edge_kernels = edge_kernels.unsqueeze(1)
In [ ]:
batch_features = F.conv2d(xb, edge_kernels)
batch_features.shape
Out[ ]:
torch.Size([64, 4, 26, 26])
The output shape shows we gave 64 images in the mini-batch, 4 kernels, and 26×26 edge maps (we started with 28×28 images, but lost one pixel from each side as discussed earlier). We can see we get the same results as when we did this manually:
In [ ]:
show_image(batch_features[0,0]);
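Comparing pictures by eye is a fairly loose test, so here is a sketch of a numeric check. To stay self-contained it assumes only plain PyTorch and uses a random 28×28 tensor as a stand-in for a digit; the names `img`, `manual`, and `fast` are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
img = torch.rand(28, 28)                      # stand-in for one 28x28 image
kernel = torch.tensor([[-1., -1., -1.],
                       [ 0.,  0.,  0.],
                       [ 1.,  1.,  1.]])      # same weights as top_edge

# Manual version: apply the kernel at every center where a full 3x3 window fits.
manual = torch.stack([
    torch.stack([(img[i-1:i+2, j-1:j+2] * kernel).sum() for j in range(1, 27)])
    for i in range(1, 27)])

# F.conv2d expects rank-4 tensors, so add unit batch and channel axes,
# then index them away again to get back a 26x26 map.
fast = F.conv2d(img[None, None], kernel[None, None])[0, 0]

print(manual.shape, fast.shape)
print(torch.allclose(manual, fast, atol=1e-5))
```

The two results should agree up to floating-point rounding, because F.conv2d slides the kernel without flipping it, which is the same multiply-and-sum that apply_kernel performs.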
The most important trick that PyTorch has up its sleeve is that it can use the GPU to do all this work in parallel—that is, applying multiple kernels, to multiple images, across multiple channels. Doing lots of work in parallel is critical to getting GPUs to work efficiently; if we did each of these operations one at a time, we’d often run hundreds of times slower (and if we used our manual convolution loop from the previous section, we’d be millions of times slower!). Therefore, to become a strong deep learning practitioner, one skill to practice is giving your GPU plenty of work to do at a time.
It would be nice to not lose those two pixels on each axis. The way we do that is to add padding, which is simply additional pixels added around the outside of our image. Most commonly, pixels of zeros are added.
Strides and Padding
With appropriate padding, we can ensure that the output activation map is the same size as the original image, which can make things a lot simpler when we construct our architectures. <> shows how adding padding allows us to apply the kernels in the image corners.
With a 5×5 input, 4×4 kernel, and 2 pixels of padding, we end up with a 6×6 activation map, as we can see in <>.
If we use a kernel of size ks by ks (with ks an odd number), the necessary padding on each side to keep the same shape is ks//2. An even number for ks would require a different amount of padding on the top versus the bottom and on the left versus the right, but in practice we almost never use an even filter size.
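The effect of ks//2 padding can be sketched in a few lines of pure Python. The helper `zero_pad` and the toy 5×5 image are assumptions made for illustration; in practice the padding is handled by the convolution routine itself.

```python
def zero_pad(img, pad):
    """Return a copy of a 2D list with `pad` rows and columns of zeros on every side."""
    width = len(img[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]                      # top border
    padded += [[0] * pad + list(row) + [0] * pad for row in img]    # left/right borders
    padded += [[0] * width for _ in range(pad)]                     # bottom border
    return padded

ks = 3
img = [[1] * 5 for _ in range(5)]      # toy 5x5 image
padded = zero_pad(img, ks // 2)        # one pixel of zeros on each side

out_h = len(padded) - ks + 1           # window positions along the height
out_w = len(padded[0]) - ks + 1        # window positions along the width
print(len(padded), len(padded[0]))     # 7 7
print(out_h, out_w)                    # 5 5
```

Padding by one pixel grows the 5×5 image to 7×7, and a 3×3 kernel then fits in exactly 5×5 positions, so the output matches the original size.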
So far, when we have applied the kernel to the grid, we have moved it one pixel over at a time. But we can jump further; for instance, we could move over two pixels after each kernel application, as in <>. This is known as a stride-2 convolution. The most common kernel size in practice is 3×3, and the most common padding is 1. As you’ll see, stride-2 convolutions are useful for decreasing the size of our outputs, and stride-1 convolutions are useful for adding layers without changing the output size.
In an image of size h by w, using a 3×3 kernel with a padding of 1 and a stride of 2 will give us a result of size (h+1)//2 by (w+1)//2. The general formula for each dimension is (n + 2*pad - ks)//stride + 1, where n is the input size along that dimension, pad is the padding, ks is the size of our kernel, and stride is the stride.
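The formula is easy to check against the cases mentioned in this chapter. The function name `conv_out_size` is hypothetical; the arithmetic is exactly the expression above.

```python
def conv_out_size(n, ks, stride=1, pad=0):
    """Output length along one dimension for input length n."""
    return (n + 2 * pad - ks) // stride + 1

print(conv_out_size(28, ks=3))                   # 26: no padding loses 2 pixels
print(conv_out_size(5, ks=4, pad=2))             # 6: the 5x5 input, 4x4 kernel example
print(conv_out_size(28, ks=3, stride=2, pad=1))  # 14: stride 2 halves the size
print(conv_out_size(7, ks=3, stride=2, pad=1))   # 4: equals (7+1)//2

# With stride 1 and an odd kernel size, padding of ks//2 preserves the input size.
print(all(conv_out_size(n, ks, pad=ks // 2) == n
          for n in range(8, 33) for ks in (1, 3, 5, 7)))  # True
```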
Let’s now take a look at how the pixel values of the result of our convolutions are computed.
Understanding the Convolution Equations
To explain the math behind convolutions, fast.ai student Matt Kleinsmith came up with the very clever idea of showing CNNs from different viewpoints. In fact, it’s so clever, and so helpful, we’re going to show it here too!
Here’s our 3×3 pixel image, with each pixel labeled with a letter:
And here’s our kernel, with each weight labeled with a Greek letter:
Since the filter fits in the image four times, we have four results:
<> shows how we applied the kernel to each section of the image to yield each result.
The equation view is in <>.
Notice that the bias term, b, is the same for each section of the image. You can consider the bias as part of the filter, just like the weights (α, β, γ, δ) are part of the filter.
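Written out, the four results take the following form. The figures are not reproduced here, so the lettering is an assumption made for this sketch: the image pixels are taken to be $A$ through $I$ in row-by-row order, the 2×2 kernel holds $\alpha, \beta$ in its top row and $\gamma, \delta$ in its bottom row, and the outputs are called $P$, $Q$, $R$, and $S$.

$$\begin{aligned}
P &= \alpha A + \beta B + \gamma D + \delta E + b \\
Q &= \alpha B + \beta C + \gamma E + \delta F + b \\
R &= \alpha D + \beta E + \gamma G + \delta H + b \\
S &= \alpha E + \beta F + \gamma H + \delta I + b
\end{aligned}$$

The same four weights and the same bias appear in every line; only the pixels they are paired with change.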
Here’s an interesting insight—a convolution can be represented as a special kind of matrix multiplication, as illustrated in <>. The weight matrix is just like the ones from traditional neural networks. However, this weight matrix has two special properties:
- The zeros shown in gray are untrainable. This means that they’ll stay zero throughout the optimization process.
- Some of the weights are equal, and while they are trainable (i.e., changeable), they must remain equal. These are called shared weights.
The zeros correspond to the pixels that the filter can’t touch. Each row of the weight matrix corresponds to one application of the filter.
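The matrix view can be sketched in pure Python. The pixel values, kernel weights, and bias below are toy numbers chosen for illustration, with a 2×2 kernel on a 3×3 image as in the example above.

```python
image  = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]        # 3x3 image, toy values
kernel = [[1, 2],
          [3, 4]]           # 2x2 kernel: alpha, beta / gamma, delta
bias   = 10

# Direct view: slide the kernel over the four positions where it fits.
direct = [sum(kernel[u][v] * image[r + u][c + v]
              for u in range(2) for v in range(2)) + bias
          for r in range(2) for c in range(2)]

# Matrix view: one row of weights per kernel position, one column per pixel.
weights = []
for r in range(2):
    for c in range(2):
        row = [0] * 9       # zeros mark pixels this position cannot touch
        for u in range(2):
            for v in range(2):
                row[(r + u) * 3 + (c + v)] = kernel[u][v]   # the shared weights
        weights.append(row)

flat = [p for image_row in image for p in image_row]        # flatten to 9 pixels
as_matmul = [sum(w * p for w, p in zip(row, flat)) + bias for row in weights]

print(weights[0])            # [1, 2, 0, 3, 4, 0, 0, 0, 0]
print(direct == as_matmul)   # True
```

Every row of `weights` contains the same four kernel values in shifted positions, which is what the shared-weights property describes, and the remaining entries are the fixed zeros.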
Now that we understand what a convolution is, let’s use convolutions to build a neural net.