Conclusions
We’ve seen that convolutions are just a type of matrix multiplication, with two constraints on the weight matrix: some elements are always zero, and some elements are tied (forced to always have the same value). In <> we saw the eight requirements from the 1986 book Parallel Distributed Processing; one of them was “A pattern of connectivity among units.” That’s exactly what these constraints do: they enforce a certain pattern of connectivity.
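To make this concrete, here is a small sketch (my own illustration, not code from the book) that builds the dense weight matrix corresponding to a 3×3 convolution on a tiny 4×4 image. Most entries of that matrix are zero, and the nonzero entries all reuse the same nine kernel values; multiplying by it gives exactly the same result as `F.conv2d`:

```python
# Sketch: a convolution is a matrix multiplication with a constrained weight
# matrix (mostly zeros, with the kernel values tied/repeated across rows).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
img = torch.randn(1, 1, 4, 4)        # one 4x4 single-channel image
kernel = torch.randn(1, 1, 3, 3)     # one 3x3 kernel

# Ordinary convolution (no padding, stride 1): output is 2x2.
conv_out = F.conv2d(img, kernel)

# Build the equivalent dense matrix: one row per output pixel, one column per
# input pixel. Each row places the 9 kernel values at the positions of the
# input pixels that output pixel "sees"; every other entry stays zero.
mat = torch.zeros(4, 16)             # 4 output pixels, 16 input pixels
for out_r in range(2):
    for out_c in range(2):
        row = out_r * 2 + out_c
        for kr in range(3):
            for kc in range(3):
                col = (out_r + kr) * 4 + (out_c + kc)
                mat[row, col] = kernel[0, 0, kr, kc]

matmul_out = mat @ img.view(16)      # plain matrix-vector product
print(torch.allclose(conv_out.view(4), matmul_out))  # True
```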
These constraints allow us to use far fewer parameters in our model, without sacrificing the ability to represent complex visual features. That means we can train deeper models faster, with less overfitting. Although the universal approximation theorem shows that a fully connected network with a single hidden layer can, in principle, represent anything, we have now seen that in practice we train much better models by being thoughtful about network architecture.
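To give a feel for the saving, here is a quick parameter count (my own example, with sizes chosen for illustration): mapping a 28×28 single-channel image to a 28×28 feature map with a fully connected layer versus a 3×3 convolution.

```python
# Parameter count: fully connected layer vs. 3x3 convolution for the
# same input/output spatial size (28x28, one channel).
import torch.nn as nn

fc   = nn.Linear(28 * 28, 28 * 28)                # every output sees every input
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # each output sees a 3x3 patch

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc))    # 615,440 parameters (784*784 weights + 784 biases)
print(count(conv))  # 10 parameters (9 shared weights + 1 bias)
```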
Convolutions are by far the most common pattern of connectivity we see in neural nets (along with regular linear layers, which we refer to as fully connected), but it’s likely that many more will be discovered.
We’ve also seen how to interpret the activations of layers in the network to check whether training is going well, and how batchnorm helps regularize training and make it smoother. In the next chapter, we will use both of those layers (convolution and batchnorm) to build the most popular architecture in computer vision: a residual network.
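As a reminder of both ideas, here is a minimal sketch (the model and sizes are my own assumptions, not the book's code) that inserts batchnorm after each convolution and uses PyTorch forward hooks to record the mean and standard deviation of each nonlinearity's activations, the kind of statistics we looked at to judge whether training is healthy:

```python
# Sketch: record activation statistics with forward hooks, in a small
# convolutional model that uses batchnorm after each convolution.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, stride=2, padding=1),
    nn.BatchNorm2d(8),      # normalizes activations, smoothing training
    nn.ReLU(),
    nn.Conv2d(8, 16, 3, stride=2, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),
)

stats = []
def record(module, inp, out):
    # Healthy training: means near zero, stds neither collapsing nor exploding.
    stats.append((type(module).__name__, out.mean().item(), out.std().item()))

for layer in model:
    if isinstance(layer, nn.ReLU):
        layer.register_forward_hook(record)

model(torch.randn(64, 1, 28, 28))   # one dummy batch of 28x28 images
for name, mean, std in stats:
    print(f"{name}: mean={mean:.3f}, std={std:.3f}")
```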