Decoupled Weight Decay

Weight decay, which we discussed in <>, is equivalent (in the case of vanilla SGD) to updating the parameters with:

  new_weight = weight - lr*weight.grad - lr*wd*weight

The last part of this formula explains the name of this technique: each weight is decayed by a factor lr * wd.
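
To make that update concrete, here is a minimal sketch of a single SGD step with decoupled weight decay on a toy PyTorch tensor (the parameter, loss, lr, and wd values are made up purely for illustration):

  import torch

  weight = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
  lr, wd = 0.1, 0.01

  loss = (weight ** 2).mean()   # any loss will do; we just need a gradient
  loss.backward()

  with torch.no_grad():
      # Decoupled weight decay: the decay term is applied to the weight
      # directly, alongside the usual gradient step.
      weight -= lr * weight.grad + lr * wd * weight
      weight.grad.zero_()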

The other name for weight decay is L2 regularization, which consists of adding the sum of all squared weights to the loss (multiplied by the weight decay). As we have seen in <>, this can be expressed directly on the gradients as:

  weight.grad += wd*weight

For SGD, those two formulations are equivalent. However, the equivalence only holds for standard SGD: as we have seen, with momentum, RMSProp, or Adam, the update involves some additional steps around the gradient.
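
To check that equivalence numerically, here is a small sketch (with made-up values) that computes one vanilla SGD step both ways and verifies the results match:

  import torch

  torch.manual_seed(0)
  weight = torch.randn(3)
  grad = torch.randn(3)   # stand-in for weight.grad
  lr, wd = 0.1, 0.01

  # Formulation 1: weight decay applied directly to the weight
  via_decay = weight - lr*grad - lr*wd*weight

  # Formulation 2: L2 regularization folded into the gradient
  via_l2 = weight - lr*(grad + wd*weight)

  assert torch.allclose(via_decay, via_l2)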

Most libraries use the second formulation, but it was pointed out in “Decoupled Weight Decay Regularization” by Ilya Loshchilov and Frank Hutter that the first one is the only correct approach with the Adam optimizer or with momentum, which is why fastai makes it its default.
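
To see where the two formulations diverge, here is a deliberately simplified sketch of a single Adam-style step (made-up values; the moving averages and bias correction are reduced to what they are after one step from a zero state). In the L2-regularization form, the decay term is divided by the adaptive denominator along with the rest of the gradient; in the decoupled form it is not:

  import torch

  weight = torch.tensor([1.0, -2.0, 3.0])
  grad = torch.tensor([0.5, 0.1, -0.3])
  lr, wd, eps = 0.1, 0.01, 1e-8

  def adam_direction(g):
      # After one step from zero, the bias-corrected moving averages of the
      # gradient and squared gradient are just g and g**2.
      return g / (g.pow(2).sqrt() + eps)

  # L2 regularization: the decay goes through the adaptive rescaling
  via_l2 = weight - lr * adam_direction(grad + wd*weight)

  # Decoupled weight decay (the fastai default, and what AdamW does):
  # the decay is applied to the weight outside the adaptive rescaling
  via_decoupled = weight - lr * adam_direction(grad) - lr*wd*weight

  print(via_l2, via_decoupled)   # no longer equal

Because the L2 form rescales the decay term by the optimizer's per-parameter statistics, weights with large gradient histories end up being regularized less; the decoupled formulation avoids exactly that.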

Now you know everything that is hidden behind the line learn.fit_one_cycle!

Optimizers are only one part of the training process, however. When you need to change the training loop with fastai, you can’t directly change the code inside the library. Instead, we have designed a system of callbacks to let you write any tweaks you like in independent blocks that you can then mix and match.