Label Smoothing
In the theoretical expression of the loss for classification problems, our targets are one-hot encoded (in practice we tend to avoid this to save memory, but the loss we compute is the same as if we had used one-hot encoding). That means the model is trained to return 0 for all categories but one, for which it is trained to return 1. Even 0.999 is not “good enough”: the model will still get gradients and learn to predict activations with even higher confidence. This encourages overfitting and, at inference time, gives you a model that does not return meaningful probabilities: it will always say 1 for the predicted category even if it’s not too sure, just because it was trained this way.
This can become very harmful if your data is not perfectly labeled. In the bear classifier we studied earlier, we saw that some of the images were mislabeled or contained two different kinds of bears. In general, your data will never be perfect. Even if the labels were manually produced by humans, they could make mistakes, or disagree about images that are harder to label.
Instead, we could replace all our 1s with a number a bit less than 1, and our 0s with a number a bit greater than 0, and then train. This is called label smoothing. By encouraging your model to be less confident, label smoothing will make your training more robust, even if there is mislabeled data. The result will be a model that generalizes better.
This is how label smoothing works in practice: we start with one-hot-encoded labels, then replace all the 0s with $\frac{\epsilon}{N}$ (that’s the Greek letter epsilon, which is what was used in the paper that introduced label smoothing and is used in the fastai code), where $N$ is the number of classes and $\epsilon$ is a parameter (usually 0.1, which would mean we are 10% unsure of our labels). Since we want the labels to add up to 1, we replace the 1 with $1-\epsilon + \frac{\epsilon}{N}$. This way, we don’t encourage the model to predict something overconfidently. In our Imagenette example, where we have 10 classes, the targets become something like this (here for a target that corresponds to index 3):
[0.01, 0.01, 0.01, 0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
In practice, we don’t want to one-hot encode the labels, and fortunately we won’t need to (the one-hot encoding is just good to explain what label smoothing is and visualize it).
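To visualize what those smoothed targets look like, here is a minimal sketch in plain PyTorch (just an illustration of the arithmetic above, not fastai code; the names and the use of torch.full are our own):

import torch

eps, n_classes = 0.1, 10                    # 10% smoothing, 10 Imagenette classes
targets = torch.tensor([3])                 # integer class indices, as used in practice
# Start from eps/N everywhere, then put 1 - eps + eps/N at the target index
smoothed = torch.full((len(targets), n_classes), eps / n_classes)
smoothed[torch.arange(len(targets)), targets] = 1 - eps + eps / n_classes
print(smoothed[0])                          # 0.91 at index 3, 0.01 everywhere else

Written this way, you can check that each row still sums to 1, just as a one-hot-encoded target would.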
Sidebar: Label Smoothing, the Paper
Here is how the reasoning behind label smoothing was explained in the paper by Christian Szegedy et al.:
: This maximum is not achievable for finite $z_k$ but is approached if $z_y\gg z_k$ for all $k\neq y$—that is, if the logit corresponding to the ground-truth label is much greater than all other logits. This, however, can cause two problems. First, it may result in over-fitting: if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize. Second, it encourages the differences between the largest logit and all others to become large, and this, combined with the bounded gradient $\frac{\partial\ell}{\partial z_k}$, reduces the ability of the model to adapt. Intuitively, this happens because the model becomes too confident about its predictions.
Let’s practice our paper-reading skills to try to interpret this. “This maximum” is referring to the previous part of the paragraph, which talked about the fact that 1 is the value of the label for the positive class. So it’s not possible for any value (except infinity) to result in 1 after sigmoid or softmax. In a paper, you won’t normally see “any value” written; instead it will get a symbol, which in this case is $z_k$. This shorthand is helpful in a paper, because it can be referred to again later and the reader will know what value is being discussed.
Then it says “if $z_y\gg z_k$ for all $k\neq y$.” In this case, the paper immediately follows the math with an English description, which is handy because you can just read that. In the math, the $y$ is referring to the target ($y$ is defined earlier in the paper; sometimes it’s hard to find where symbols are defined, but nearly all papers will define all their symbols somewhere), and $z_y$ is the activation corresponding to the target. So to get close to 1, this activation needs to be much higher than all the others for that prediction.
Next, consider the statement “if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize.” This is saying that making $z_y$ really big means we’ll need large weights and large activations throughout our model. Large weights lead to “bumpy” functions, where a small change in input results in a big change to predictions. This is really bad for generalization, because it means just one pixel changing a bit could change our prediction entirely!
Finally, we have “it encourages the differences between the largest logit and all others to become large, and this, combined with the bounded gradient $\frac{\partial\ell}{\partial z_k}$, reduces the ability of the model to adapt.” The gradient of cross-entropy, remember, is basically output - target. Both output and target are between 0 and 1, so the difference is between -1 and 1, which is why the paper says the gradient is “bounded” (it can’t be infinite). Therefore our SGD steps are bounded too. “Reduces the ability of the model to adapt” means that it is hard for it to be updated in a transfer learning setting. This follows because the difference in loss due to incorrect predictions is unbounded, but we can only take a limited step each time.
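If you want to see that gradient concretely, here is a small numerical check in plain PyTorch (our own illustration, not code from the paper or the book): the gradient of cross-entropy with respect to each logit equals the softmax output minus the one-hot target, so each component stays between -1 and 1.

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.0, 0.5], requires_grad=True)
target = torch.tensor([0])                              # ground-truth class index
F.cross_entropy(logits.unsqueeze(0), target).backward()

# The gradient equals softmax(logits) - one_hot(target), componentwise in (-1, 1)
print(logits.grad)
print(F.softmax(logits.detach(), dim=0) - F.one_hot(target, 3).float().squeeze(0))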
End sidebar
To use this in practice, we just have to change the loss function in our call to Learner:
model = xresnet50(n_out=dls.c)
learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropy(),
                metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)
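If you are curious what a label-smoothing loss computes under the hood, the standard formulation can be written in a few lines of PyTorch. The sketch below is our own illustration of that formulation (a weighted mix of the usual cross-entropy and a uniform penalty over all classes), not the fastai source code:

import torch.nn.functional as F

def label_smoothing_ce(logits, target, eps=0.1):
    # Cross-entropy against the smoothed targets works out to:
    #   (1 - eps) * the usual loss on the true class
    #   + eps * the average negative log-probability over all classes
    log_preds = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_preds, target, reduction='none')
    uniform = -log_preds.mean(dim=-1)
    return ((1 - eps) * nll + eps * uniform).mean()

You could pass a function like this directly as loss_func when building a Learner, but in practice you would stick with fastai’s tested LabelSmoothingCrossEntropy as shown above.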
Like with Mixup, you won’t generally see significant improvements from label smoothing until you train more epochs. Try it yourself and see: how many epochs do you have to train before label smoothing shows an improvement?