Improving Our Model

We will now look at a range of techniques to improve the training of our model. Along the way, we will explain a little more about transfer learning and how to fine-tune our pretrained model as well as possible, without breaking the pretrained weights.

The first thing we need to set when training a model is the learning rate. We saw in the previous chapter that it needs to be just right to train as efficiently as possible, so how do we pick a good one? fastai provides a tool for this.

The Learning Rate Finder

One of the most important things we can do when training a model is to make sure that we have the right learning rate. If our learning rate is too low, it can take many, many epochs to train our model. Not only does this waste time, but it also means that we may have problems with overfitting, because every time we do a complete pass through the data, we give our model a chance to memorize it.

So let’s just make our learning rate really high, right? Sure, let’s try that and see what happens:

In [ ]:

    learn = cnn_learner(dls, resnet34, metrics=error_rate)
    learn.fine_tune(1, base_lr=0.1)

epoch  train_loss  valid_loss  error_rate  time
0      2.778816    5.150732    0.504060    00:20

epoch  train_loss  valid_loss  error_rate  time
0      4.354680    3.003533    0.834235    00:24

That doesn’t look good. Here’s what happened. The optimizer stepped in the correct direction, but it stepped so far that it totally overshot the minimum loss. Repeating that multiple times takes it further and further away from the minimum, not closer and closer!

What do we do to find the perfect learning rate, neither too high nor too low? In 2015, the researcher Leslie Smith came up with a brilliant idea, called the learning rate finder. His idea was to start with a very, very small learning rate, something so small that we would never expect it to be too big to handle. We use that for one mini-batch, find what the losses are afterwards, and then increase the learning rate by some percentage (e.g., doubling it each time). Then we do another mini-batch, track the loss, and double the learning rate again. We keep doing this until the loss gets worse, instead of better. This is the point where we know we have gone too far. We then select a learning rate a bit lower than this point; a code sketch of this loop follows the list below. Our advice is to pick either:

  • One order of magnitude less than where the minimum loss was achieved (i.e., the minimum divided by 10)
  • The last point where the loss was clearly decreasing
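
To make the procedure concrete, here is a minimal sketch of the finder loop, written against a generic PyTorch-style model and optimizer. This is an illustration of the idea, not fastai's implementation; the function name, growth factor, and stopping rule are all assumptions:

    # Illustrative sketch of Smith's learning rate finder (not fastai's code).
    def lr_finder_sketch(model, loss_fn, opt, batches,
                         start_lr=1e-7, multiplier=2.0):
        lr, lrs, losses = start_lr, [], []
        best_loss = float("inf")
        for xb, yb in batches:
            for group in opt.param_groups:   # set the current learning rate
                group["lr"] = lr
            loss = loss_fn(model(xb), yb)    # one mini-batch at this rate
            opt.zero_grad()
            loss.backward()
            opt.step()
            lrs.append(lr)
            losses.append(loss.item())
            best_loss = min(best_loss, loss.item())
            if loss.item() > 4 * best_loss:  # loss clearly worse: gone too far
                break
            lr *= multiplier                 # e.g., double it for the next batch
        return lrs, losses                   # plot losses against lrs to choose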

The learning rate finder computes those points on the curve to help you. Both these rules usually give around the same value. In the first chapter, we didn’t specify a learning rate, using the default value from the fastai library (which is 1e-3):

In [ ]:

    learn = cnn_learner(dls, resnet34, metrics=error_rate)
    lr_min, lr_steep = learn.lr_find()

[Figure: learning rate finder plot of loss versus learning rate]

In [ ]:

    print(f"Minimum/10: {lr_min:.2e}, steepest point: {lr_steep:.2e}")

Minimum/10: 1.00e-02, steepest point: 5.25e-03

We can see on this plot that in the range 1e-6 to 1e-3, nothing really happens and the model doesn’t train. Then the loss starts to decrease until it reaches a minimum, and then increases again. We don’t want a learning rate greater than 1e-1 as it will give a training that diverges like the one before (you can try for yourself), but 1e-1 is already too high: at this stage we’ve left the period where the loss was decreasing steadily.

In this learning rate plot it appears that a learning rate around 3e-3 would be appropriate, so let’s choose that:

In [ ]:

    learn = cnn_learner(dls, resnet34, metrics=error_rate)
    learn.fine_tune(2, base_lr=3e-3)

epoch  train_loss  valid_loss  error_rate  time
0      1.328591    0.344678    0.114344    00:20

epoch  train_loss  valid_loss  error_rate  time
0      0.540180    0.420945    0.127876    00:24
1      0.329827    0.248813    0.083221    00:24

Note: Logarithmic Scale: The learning rate finder plot has a logarithmic scale, which is why the middle point between 1e-3 and 1e-2 is between 3e-3 and 4e-3. This is because we care mostly about the order of magnitude of the learning rate.
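
To check: the midpoint on a logarithmic scale is the geometric mean of the endpoints, not the arithmetic mean:

    import math

    # The log-scale midpoint of 1e-3 and 1e-2 is their geometric mean.
    print(math.sqrt(1e-3 * 1e-2))   # 0.00316..., i.e., about 3e-3
    print((1e-3 + 1e-2) / 2)        # 0.0055, the arithmetic mean, for contrast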

It’s interesting that the learning rate finder was only discovered in 2015, while neural networks have been under development since the 1950s. Throughout that time, finding a good learning rate has been, perhaps, the most important and challenging issue for practitioners. The solution does not require any advanced maths, giant computing resources, huge datasets, or anything else that would make it inaccessible to any curious researcher. Furthermore, Leslie Smith was not part of some exclusive Silicon Valley lab, but was working as a naval researcher. All of this is to say: breakthrough work in deep learning absolutely does not require access to vast resources, elite teams, or advanced mathematical ideas. There is lots of work still to be done that requires just a bit of common sense, creativity, and tenacity.

Now that we have a good learning rate to train our model, let’s look at how we can fine-tune the weights of a pretrained model.

Unfreezing and Transfer Learning

We discussed briefly in <> how transfer learning works. We saw that the basic idea is that a pretrained model, trained potentially on millions of data points (such as ImageNet), is fine-tuned for some other task. But what does this really mean?

We now know that a convolutional neural network consists of many linear layers with a nonlinear activation function between each pair, followed by one or more final linear layers with an activation function such as softmax at the very end. The final linear layer uses a matrix with enough columns such that the output size is the same as the number of classes in our model (assuming that we are doing classification).

This final linear layer is unlikely to be of any use for us when we are fine-tuning in a transfer learning setting, because it is specifically designed to classify the categories in the original pretraining dataset. So when we do transfer learning we remove it, throw it away, and replace it with a new linear layer with the correct number of outputs for our desired task (in this case, there would be 37 activations).
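
To make this concrete, here is roughly what that replacement looks like in raw PyTorch (a simplified sketch; fastai's cnn_learner actually builds a more elaborate head, with pooling, batchnorm, and dropout):

    import torch.nn as nn
    from torchvision.models import resnet34

    # Sketch: swap the pretrained 1,000-class ImageNet head for a new,
    # randomly initialized layer with one output per pet breed.
    body = resnet34(pretrained=True)
    in_features = body.fc.in_features   # size of the penultimate activations
    body.fc = nn.Linear(in_features, 37)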

This newly added linear layer will have entirely random weights. Therefore, our model prior to fine-tuning has entirely random outputs. But that does not mean that it is an entirely random model! All of the layers prior to the last one have been carefully trained to be good at image classification tasks in general. As we saw in the images from the Zeiler and Fergus paper in <> (see <> through <>), the first few layers encode very general concepts, such as finding gradients and edges, and later layers encode concepts that are still very useful for us, such as finding eyeballs and fur.

We want to train a model in such a way that we allow it to remember all of these generally useful ideas from the pretrained model, use them to solve our particular task (classify pet breeds), and only adjust them as required for the specifics of our particular task.

Our challenge when fine-tuning is to replace the random weights in our added linear layers with weights that correctly achieve our desired task (classifying pet breeds) without breaking the carefully pretrained weights of the other layers. There is actually a very simple trick to allow this to happen: tell the optimizer to only update the weights in those randomly added final layers, and don’t change the weights in the rest of the neural network at all. This is called freezing those pretrained layers.

When we create a model from a pretrained network, fastai automatically freezes all of the pretrained layers for us. When we call the fine_tune method, fastai does two things:

  • Trains the randomly added layers for one epoch, with all other layers frozen
  • Unfreezes all of the layers, and trains them all for the number of epochs requested

Although this is a reasonable default approach, it is likely that for your particular dataset you may get better results by doing things slightly differently. The fine_tune method has a number of parameters you can use to change its behavior, but it might be easiest for you to just call the underlying methods directly if you want to get some custom behavior. Remember that you can see the source code for the method by using the following syntax:

    learn.fine_tune??
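
Conceptually, what fine_tune does looks something like the following (a simplified sketch, not fastai's exact source; the real method and its default hyperparameters are best checked with the ?? syntax above):

    # Simplified sketch of fine_tune's two stages.
    def fine_tune_sketch(learn, epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100):
        learn.freeze()                         # only the new head trains
        learn.fit_one_cycle(freeze_epochs, base_lr)
        learn.unfreeze()                       # now train all layers, with the
        learn.fit_one_cycle(                   # pretrained ones updated more gently
            epochs, slice(base_lr / lr_mult, base_lr / 2))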

So let’s try doing this manually ourselves. First of all we will train the randomly added layers for three epochs, using fit_one_cycle. As mentioned in <>, fit_one_cycle is the suggested way to train models without using fine_tune. We’ll see why later in the book; in short, what fit_one_cycle does is to start training at a low learning rate, gradually increase it for the first section of training, and then gradually decrease it again for the last section of training.

In [ ]:

    learn = cnn_learner(dls, resnet34, metrics=error_rate)
    learn.fit_one_cycle(3, 3e-3)

epoch  train_loss  valid_loss  error_rate  time
0      1.188042    0.355024    0.102842    00:20
1      0.534234    0.302453    0.094723    00:20
2      0.325031    0.222268    0.074425    00:20

Then we’ll unfreeze the model:

In [ ]:

    learn.unfreeze()

and run lr_find again, because having more layers to train, and weights that have already been trained for three epochs, means our previously found learning rate isn’t appropriate any more:

In [ ]:

    learn.lr_find()

Out[ ]:

    (1.0964782268274575e-05, 1.5848931980144698e-06)

[Figure: learning rate finder plot after unfreezing]

Note that the graph is a little different from when we had random weights: we don’t have that sharp descent that indicates the model is training. That’s because our model has been trained already. Here we have a somewhat flat area before a sharp increase, and we should take a point well before that sharp increase—for instance, 1e-5. The point with the maximum gradient isn’t what we look for here and should be ignored.

Let’s train at a suitable learning rate:

In [ ]:

    learn.fit_one_cycle(6, lr_max=1e-5)

epoch  train_loss  valid_loss  error_rate  time
0      0.263579    0.217419    0.069012    00:24
1      0.253060    0.210346    0.062923    00:24
2      0.224340    0.207357    0.060217    00:24
3      0.200195    0.207244    0.061570    00:24
4      0.194269    0.200149    0.059540    00:25
5      0.173164    0.202301    0.059540    00:25

This has improved our model a bit, but there’s more we can do. The deepest layers of our pretrained model might not need as high a learning rate as the last ones, so we should probably use different learning rates for those—this is known as using discriminative learning rates.

Discriminative Learning Rates

Even after we unfreeze, we still care a lot about the quality of those pretrained weights. We would not expect that the best learning rate for those pretrained parameters would be as high as for the randomly added parameters, even after we have tuned those randomly added parameters for a few epochs. Remember, the pretrained weights have been trained for hundreds of epochs, on millions of images.

In addition, do you remember the images we saw in <>, showing what each layer learns? The first layer learns very simple foundations, like edge and gradient detectors; these are likely to be just as useful for nearly any task. The later layers learn much more complex concepts, like “eye” and “sunset,” which might not be useful in your task at all (maybe you’re classifying car models, for instance). So it makes sense to let the later layers fine-tune more quickly than earlier layers.

Therefore, fastai’s default approach is to use discriminative learning rates. This was originally developed in the ULMFiT approach to NLP transfer learning that we will introduce in <>. Like many good ideas in deep learning, it is extremely simple: use a lower learning rate for the early layers of the neural network, and a higher learning rate for the later layers (and especially the randomly added layers). The idea is based on insights developed by Jason Yosinski, who showed in 2014 that with transfer learning different layers of a neural network should train at different speeds, as seen in <>.

[Figure: Impact of different layers and training methods on transfer learning (Yosinski)]

fastai lets you pass a Python slice object anywhere that a learning rate is expected. The first value passed will be the learning rate in the earliest layer of the neural network, and the second value will be the learning rate in the final layer. The layers in between will have learning rates that are multiplicatively equidistant throughout that range, as the short check below illustrates.
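
For instance, fastai typically splits a cnn_learner model into a small number of parameter groups; the count of three below is an assumption about the splitter, so treat the snippet as illustrative. Multiplicatively equidistant rates are simply evenly spaced on a log scale:

    import numpy as np

    # Illustrative: the per-group learning rates implied by
    # slice(1e-6, 1e-4), assuming three parameter groups.
    print(np.geomspace(1e-6, 1e-4, num=3))   # [1.e-06 1.e-05 1.e-04]

Let’s use this approach to replicate the previous training, but this time we’ll only set the lowest layer of our net to a learning rate of 1e-6; the other layers will scale up to 1e-4. Let’s train for a while and see what happens: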

In [ ]:

    learn = cnn_learner(dls, resnet34, metrics=error_rate)
    learn.fit_one_cycle(3, 3e-3)
    learn.unfreeze()
    learn.fit_one_cycle(12, lr_max=slice(1e-6,1e-4))

epoch  train_loss  valid_loss  error_rate  time
0      1.145300    0.345568    0.119756    00:20
1      0.533986    0.251944    0.077131    00:20
2      0.317696    0.208371    0.069012    00:20

epoch  train_loss  valid_loss  error_rate  time
0      0.257977    0.205400    0.067659    00:25
1      0.246763    0.205107    0.066306    00:25
2      0.240595    0.193848    0.062246    00:25
3      0.209988    0.198061    0.062923    00:25
4      0.194756    0.193130    0.064276    00:25
5      0.169985    0.187885    0.056157    00:25
6      0.153205    0.186145    0.058863    00:25
7      0.141480    0.185316    0.053451    00:25
8      0.128564    0.180999    0.051421    00:25
9      0.126941    0.186288    0.054127    00:25
10     0.130064    0.181764    0.054127    00:25
11     0.124281    0.181855    0.054127    00:25

Now the fine-tuning is working great!

fastai can show us a graph of the training and validation loss:

In [ ]:

    learn.recorder.plot_loss()

[Figure: plot of training and validation loss over the course of training]

As you can see, the training loss keeps getting better and better. But notice that eventually the validation loss improvement slows, and sometimes even gets worse! This is the point at which the model is starting to overfit. In particular, the model is becoming overconfident in its predictions. But this does not mean that it is getting less accurate, necessarily. Take a look at the table of training results per epoch, and you will often see that the accuracy continues improving, even as the validation loss gets worse. In the end, what matters is your accuracy, or more generally your chosen metrics, not the loss. The loss is just the function we’ve given the computer to help us optimize.

Another decision you have to make when training the model is for how long to train for. We’ll consider that next.

Selecting the Number of Epochs

Often you will find that you are limited by time, rather than generalization and accuracy, when choosing how many epochs to train for. So your first approach to training should be to simply pick a number of epochs that will train in the amount of time that you are happy to wait for. Then look at the training and validation loss plots, as shown above, and in particular your metrics, and if you see that they are still getting better even in your final epochs, then you know that you have not trained for too long.

On the other hand, you may well see that the metrics you have chosen are really getting worse at the end of training. Remember, it’s not just that we’re looking for the validation loss to get worse, but the actual metrics. Your validation loss will first get worse during training because the model gets overconfident, and only later will get worse because it is incorrectly memorizing the data. We only care in practice about the latter issue. Remember, our loss function is just something that we use to allow our optimizer to have something it can differentiate and optimize; it’s not actually the thing we care about in practice.

Before the days of 1cycle training it was very common to save the model at the end of each epoch, and then select whichever model had the best accuracy out of all of the models saved in each epoch. This is known as early stopping. However, this is very unlikely to give you the best answer, because those epochs in the middle occur before the learning rate has had a chance to reach the small values, where it can really find the best result. Therefore, if you find that you have overfit, what you should actually do is retrain your model from scratch, and this time select a total number of epochs based on where your previous best results were found.
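
For instance, if the best results in the 12-epoch run above appeared around epoch 8, you might retrain from scratch like this (an illustrative sketch reusing the setup from earlier in this section):

    # Retrain from scratch, sizing the 1cycle schedule so training ends
    # near where the previous run's best results appeared (epoch 8 of 12).
    learn = cnn_learner(dls, resnet34, metrics=error_rate)
    learn.fit_one_cycle(3, 3e-3)
    learn.unfreeze()
    learn.fit_one_cycle(9, lr_max=slice(1e-6, 1e-4))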

If you have the time to train for more epochs, you may want to instead use that time to train more parameters—that is, use a deeper architecture.

Deeper Architectures

In general, a model with more parameters can model your data more accurately. (There are lots and lots of caveats to this generalization, and it depends on the specifics of the architectures you are using, but it is a reasonable rule of thumb for now.) For most of the architectures that we will be seeing in this book, you can create larger versions of them by simply adding more layers. However, since we want to use pretrained models, we need to make sure that we choose a number of layers that have already been pretrained for us.

This is why, in practice, architectures tend to come in a small number of variants. For instance, the ResNet architecture that we are using in this chapter comes in variants with 18, 34, 50, 101, and 152 layers, pretrained on ImageNet. A larger (more layers and parameters; sometimes described as the “capacity” of a model) version of a ResNet will always be able to give us a better training loss, but it can suffer more from overfitting, because it has more parameters to overfit with.

In general, a bigger model has the ability to better capture the real underlying relationships in your data, and also to capture and memorize the specific details of your individual images.

However, using a deeper model is going to require more GPU RAM, so you may need to lower the size of your batches to avoid an out-of-memory error. This happens when you try to fit too much inside your GPU, and it looks like this:

    Cuda runtime error: out of memory

You may have to restart your notebook when this happens. The way to solve it is to use a smaller batch size, which means passing smaller groups of images at any given time through your model. You can pass the batch size you want to the call creating your DataLoaders with bs=.
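
For instance, assuming the DataBlock from earlier in the chapter is named pets (an assumption about the surrounding notebook, since its creation is not shown in this section), halving a default batch size of 64 would look like this:

    # Hypothetical: recreate the DataLoaders with a smaller batch size.
    dls = pets.dataloaders(path/"images", bs=32)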

The other downside of deeper architectures is that they take quite a bit longer to train. One technique that can speed things up a lot is mixed-precision training. This refers to using less-precise numbers (half-precision floating point, also called fp16) where possible during training. As we are writing these words in early 2020, nearly all current NVIDIA GPUs support a special feature called tensor cores that can dramatically speed up neural network training, by 2-3x. They also require a lot less GPU memory. To enable this feature in fastai, just add to_fp16() after your Learner creation (you also need to import the module).

You can’t really know ahead of time what the best architecture for your particular problem is—you need to try training some. So let’s try a ResNet-50 now with mixed precision:

In [ ]:

    from fastai.callback.fp16 import *
    learn = cnn_learner(dls, resnet50, metrics=error_rate).to_fp16()
    learn.fine_tune(6, freeze_epochs=3)

epoch  train_loss  valid_loss  error_rate  time
0      1.427505    0.310554    0.098782    00:21
1      0.606785    0.302325    0.094723    00:22
2      0.409267    0.294803    0.091340    00:21

epoch  train_loss  valid_loss  error_rate  time
0      0.261121    0.274507    0.083897    00:26
1      0.296653    0.318649    0.084574    00:26
2      0.242356    0.253677    0.069012    00:26
3      0.150684    0.251438    0.065629    00:26
4      0.094997    0.239772    0.064276    00:26
5      0.061144    0.228082    0.054804    00:26

You’ll see here we’ve gone back to using fine_tune, since it’s so handy! We can pass freeze_epochs to tell fastai how many epochs to train for while frozen. It will automatically change learning rates appropriately for most datasets.

In this case, we’re not seeing a clear win from the deeper model. This is useful to remember—bigger models aren’t necessarily better models for your particular case! Make sure you try small models before you start scaling up.