Training callbacks
Various callbacks to customize training behavior
class ShortEpochCallback [source]

ShortEpochCallback(pct=0.01, short_valid=True) :: Callback

Fit just pct of an epoch, then stop
learn = synth_learner()
learn.fit(1, cbs=ShortEpochCallback())
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | | | 00:00 |
learn = synth_learner()
learn.fit(1, cbs=ShortEpochCallback(short_valid=False))
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | | 14.867975 | 00:00 |
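For intuition, here is a rough sketch of how a callback like this could cut an epoch short: check the Learner's iteration counters after each batch and raise one of fastai's cancel exceptions once pct of the batches have run. This is an illustrative sketch under that assumption, not necessarily the library's exact implementation (the name ShortEpochSketch is hypothetical):
from fastai.basics import *   # brings Callback, CancelTrainException, CancelValidException

class ShortEpochSketch(Callback):
    "Illustration only: stop after `pct` of the batches in each epoch."
    def __init__(self, pct=0.01, short_valid=True): self.pct,self.short_valid = pct,short_valid
    def after_batch(self):
        # self.iter / self.n_iter track progress through the current epoch
        if self.iter/self.n_iter < self.pct: return
        if self.training:    raise CancelTrainException()
        if self.short_valid: raise CancelValidException()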
class GradientAccumulation [source]

GradientAccumulation(n_acc=32) :: Callback

Accumulate gradients before updating weights
When the number of steps per accumulation is higher than the number of batches, the parameters (and therefore validation loss) don’t change at all:
learn = synth_learner()
learn.fit(1, lr=0.01, cbs=GradientAccumulation(n_acc=1000))
# ensure valid_loss didn't change
assert learn.recorder.values[-1][1] == learn.recorder.values[0][1]
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 10.941168 | 10.280428 | 00:00 |
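The mechanism itself is easy to express in plain PyTorch: keep calling backward() so gradients sum up across mini-batches, and only step the optimizer once n_acc samples have been seen. The toy model, optimizer, and data below are placeholders for illustration, not part of the callback:
import torch
from torch import nn

model   = nn.Linear(1, 1)
opt     = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
data    = [(torch.randn(8, 1), torch.randn(8, 1)) for _ in range(10)]  # 10 mini-batches of 8

n_acc, seen = 32, 0
opt.zero_grad()
for xb, yb in data:
    loss = loss_fn(model(xb), yb)
    # scale so the accumulated gradient matches one big batch of n_acc samples
    (loss * len(xb) / n_acc).backward()
    seen += len(xb)
    if seen >= n_acc:        # only update the weights every n_acc samples
        opt.step()
        opt.zero_grad()
        seen = 0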
class GradientClip [source]

GradientClip(max_norm:float=1.0, norm_type:float=2.0) :: Callback

Clip norm of gradients
Normally, if we use a learning rate that is too high, our training will diverge. This happens even with mixed precision training, which avoids infinities by using dynamic loss scaling, yet still diverges:
fp16 = MixedPrecision()
set_seed(99)
learn = synth_learner(lr=1.1, cuda=True)
learn.fit(3, cbs=fp16)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 38.214169 | 25.269012 | 00:00 |
1 | 377.146088 | 890.011780 | 00:00 |
2 | 839.391907 | 9965.712891 | 00:00 |
By adding the GradientClip callback, the norm of the gradients (of type norm_type, default 2) is clipped to at most max_norm (default 1) using nn.utils.clip_grad_norm_, which can avoid loss divergence:
set_seed(99)
learn = synth_learner(lr=1.1, cuda=True)
learn.fit(3, cbs=[GradientClip,fp16])
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 2.039427 | 2.372183 | 00:00 |
1 | 1.402424 | 0.300724 | 00:00 |
2 | 1.013551 | 0.332668 | 00:00 |
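In plain PyTorch, the same safeguard is a single call between backward() and the optimizer step. The toy model and data below are placeholders to make the snippet self-contained:
import torch
from torch import nn

model   = nn.Linear(1, 1)
opt     = torch.optim.SGD(model.parameters(), lr=1.1)  # deliberately large learning rate
loss_fn = nn.MSELoss()

xb, yb = torch.randn(8, 1), torch.randn(8, 1)
loss = loss_fn(model(xb), yb)
loss.backward()
# rescale the gradients so their total 2-norm is at most 1.0 before stepping
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, norm_type=2.0)
opt.step()
opt.zero_grad()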
BnFreeze
set_bn_eval [source]

set_bn_eval(m:Module, use_eval=True)

Set bn layers in eval mode for all recursive children of m.
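A rough sketch of what such a helper can look like in plain PyTorch: recurse over the children of m and put every BatchNorm layer into eval (or train) mode so its running statistics stop (or resume) updating. This is an illustration of the idea, not necessarily the exact fastai implementation:
from torch import nn

def set_bn_eval_sketch(m: nn.Module, use_eval: bool = True) -> None:
    "Illustration only: put all BatchNorm descendants of `m` in eval (or train) mode."
    for child in m.children():
        if isinstance(child, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            child.eval() if use_eval else child.train()
        set_bn_eval_sketch(child, use_eval)  # recurse into nested modules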
class BnFreeze [source]

BnFreeze(after_create=None, before_fit=None, before_epoch=None, before_train=None, before_batch=None, after_pred=None, after_loss=None, before_backward=None, before_step=None, after_cancel_step=None, after_step=None, after_cancel_batch=None, after_batch=None, after_cancel_train=None, after_train=None, before_validate=None, after_cancel_validate=None, after_validate=None, after_cancel_epoch=None, after_epoch=None, after_cancel_fit=None, after_fit=None) :: Callback

Basic class handling tweaks of the training loop by changing a Learner in various events
BnFreeze is useful when you'd like to train two separate models that share a common feature extractor / body; the only part of the model that differs is the head you attach for transfer learning. Learner.freeze() doesn't suffice here, as the BatchNorm layers are trainable by default and the running mean and std of batches are still tracked. For the feature extractors to fully match, you need to set train_bn=False and freeze these statistics as well, which is precisely the function of BnFreeze.
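Conceptually, all the callback has to do is re-apply set_bn_eval to the model at the start of each training phase, since fit otherwise puts the whole model back into train mode. A minimal sketch of that idea using the Callback events listed above (the name BnFreezeSketch is hypothetical and the real implementation may differ):
class BnFreezeSketch(Callback):
    "Illustration only: keep BatchNorm layers in eval mode while training."
    def before_train(self):
        set_bn_eval(self.model)   # freeze the running mean/std for this training phase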
path = untar_data(URLs.MNIST_TINY)
dls = ImageDataLoaders.from_folder(path, valid_pct=0.2)
We first demonstrate the mismatch of the running stats when using only train_bn=False, by creating a Learner…:
learn1 = cnn_learner(deepcopy(dls), resnet18, pretrained=True, train_bn=False)
…and grab the first BatchNorm layer, and store its running mean:
m = learn1.model[0][1].running_mean.clone()
You can see that the running mean has now changed:
learn1.fit(1, lr=0.02)
test_ne(to_detach(learn1.model[0][1].running_mean), m)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 1.152701 | 0.468892 | 00:02 |
When we use the BnFreeze callback, the running statistics will not be changed during training. This is often important for getting good results from transfer learning.
learn1 = cnn_learner(deepcopy(dls), resnet18, pretrained=True, train_bn=False, cbs=BnFreeze)
m = learn1.model[0][1].running_mean.detach().clone()
learn1.fit(1, lr=0.02)
test_eq(to_detach(learn1.model[0][1].running_mean), m)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.488634 | 0.277683 | 00:02 |