Hyperparam schedule
Callback and helper functions to schedule any hyper-parameter
from fastai.test_utils import *
Annealing
annealer [source]
annealer(f)
Decorator to make f return itself partially applied.
This is the decorator we will use for all of our scheduling functions, as it transforms a function taking (start, end, pos) into one taking (start, end) and returning a function of pos.
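As a quick illustration (a sketch not in the original page; sched_sqrt is a hypothetical custom schedule, not part of fastai), the decorator turns a three-argument schedule into a two-argument factory:
@annealer
def sched_sqrt(start, end, pos): return start + pos**0.5 * (end-start)  # hypothetical example schedule

f = sched_sqrt(0., 2.)   # partially applied: now a function of pos only
test_close(f(0.25), 1.)  # a quarter of the way along the square-root curve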
sched_lin [source]
sched_lin(start, end, pos)
sched_cos [source]
sched_cos(start, end, pos)
sched_no [source]
sched_no(start, end, pos)
sched_exp [source]
sched_exp(start, end, pos)
annealings = "NO LINEAR COS EXP".split()
p = torch.linspace(0.,1,100)
fns = [SchedNo, SchedLin, SchedCos, SchedExp]
for fn, t in zip(fns, annealings):
    plt.plot(p, [fn(2, 1e-2)(o) for o in p], label=t)
f = SchedPoly(2,1e-2,0.5)
plt.plot(p, [f(o) for o in p], label="POLY(0.5)")
plt.legend();
SchedLin [source]
SchedLin(start, end)
Linear schedule function from start to end
sched = SchedLin(0, 2)
test_eq(L(map(sched, [0., 0.25, 0.5, 0.75, 1.])), [0., 0.5, 1., 1.5, 2.])
SchedCos [source]
SchedCos(start, end)
Cosine schedule function from start to end
sched = SchedCos(0, 2)
test_close(L(map(sched, [0., 0.25, 0.5, 0.75, 1.])), [0., 0.29289, 1., 1.70711, 2.])
SchedNo [source]
SchedNo(start, end)
Constant schedule function with start value
sched = SchedNo(0, 2)
test_close(L(map(sched, [0., 0.25, 0.5, 0.75, 1.])), [0., 0., 0., 0., 0.])
SchedExp [source]
SchedExp(start, end)
Exponential schedule function from start to end
sched = SchedExp(1, 2)
test_close(L(map(sched, [0., 0.25, 0.5, 0.75, 1.])), [1., 1.18921, 1.41421, 1.68179, 2.])
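As a hedged note (not in the original page): the exponential schedule interpolates geometrically, i.e. start * (end/start)**pos, which is consistent with the values above and means start and end should be non-zero and of the same sign.
sched = SchedExp(1, 2)
test_close(sched(0.5), 1 * (2/1)**0.5)  # geometric interpolation between start and end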
SchedPoly [source]
SchedPoly(start, end, power)
Polynomial schedule (of power) function from start to end
sched = SchedPoly(0, 2, 2)
test_close(L(map(sched, [0., 0.25, 0.5, 0.75, 1.])), [0., 0.125, 0.5, 1.125, 2.])
p = torch.linspace(0.,1,100)
pows = [0.5,1.,2.]
for e in pows:
    f = SchedPoly(2, 0, e)
    plt.plot(p, [f(o) for o in p], label=f'power {e}')
plt.legend();
combine_scheds [source]
combine_scheds(pcts, scheds)
Combine scheds according to pcts in one function
pcts must be a list of positive numbers that add up to 1 and has the same length as scheds. The generated function will use scheds[0] from 0 to pcts[0], then scheds[1] from pcts[0] to pcts[0]+pcts[1], and so forth.
p = torch.linspace(0.,1,100)
f = combine_scheds([0.3,0.7], [SchedCos(0.3,0.6), SchedCos(0.6,0.2)])
plt.plot(p, [f(o) for o in p]);
p = torch.linspace(0.,1,100)
f = combine_scheds([0.3,0.2,0.5], [SchedLin(0.,1.), SchedNo(1.,1.), SchedCos(1., 0.)])
plt.plot(p, [f(o) for o in p]);
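As an additional numeric sketch (not in the original notebook), the piecewise behaviour described above can be checked directly at a few positions: positions below 0.3 are rescaled into the first schedule, the rest into the second.
f = combine_scheds([0.3,0.7], [SchedLin(0.,1.), SchedLin(1.,0.)])
test_close([float(f(o)) for o in [0., 0.15, 0.65, 1.]], [0., 0.5, 0.5, 0.])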
combined_cos [source]
combined_cos(pct, start, middle, end)
Return a scheduler with cosine annealing from start→middle & middle→end
This is a useful helper function for the 1cycle policy. pct is used for the start to middle part, 1-pct for the middle to end part. Handles floats or collections of floats. For example:
f = combined_cos(0.25,0.5,1.,0.)
plt.plot(p, [f(o) for o in p]);
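As a hedged check of the endpoints (a sketch, not in the original page): the schedule starts at start, reaches middle at pct, and ends at end.
f = combined_cos(0.25, 0.5, 1., 0.)
test_close([float(f(o)) for o in [0., 0.25, 1.]], [0.5, 1., 0.])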
class ParamScheduler [source]
ParamScheduler(scheds) :: Callback
Schedule hyper-parameters according to scheds
scheds is a dictionary with one key for each hyper-parameter you want to schedule, with either a scheduler or a list of schedulers as values (in the second case, the list must have the same length as the number of parameter groups of the optimizer).
learn = synth_learner()
sched = {'lr': SchedLin(1e-3, 1e-2)}
learn.fit(1, cbs=ParamScheduler(sched))
n = len(learn.dls.train)
test_close(learn.recorder.hps['lr'], [1e-3 + (1e-2-1e-3) * i/n for i in range(n)])
[0, 8.821188926696777, 4.2881364822387695, '00:00']
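Several hyper-parameters can be scheduled at once by adding more keys. A hedged sketch (values are illustrative and assume the default optimizer exposes a mom hyper-parameter):
learn = synth_learner()
scheds = {'lr': SchedCos(1e-3, 1e-2), 'mom': SchedLin(0.85, 0.95)}
learn.fit(1, cbs=ParamScheduler(scheds))
n = len(learn.dls.train)
test_close(learn.recorder.hps['mom'], [0.85 + (0.95-0.85) * i/n for i in range(n)])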
ParamScheduler.before_fit [source]
ParamScheduler.before_fit()
Initialize container for hyper-parameters
ParamScheduler.before_batch [source]
ParamScheduler.before_batch()
Set the proper hyper-parameters in the optimizer
ParamScheduler.after_batch [source]
ParamScheduler.after_batch()
Record hyper-parameters of this batch
ParamScheduler.after_fit [source]
ParamScheduler.after_fit()
Save the hyper-parameters in the recorder if there is one
Learner.fit_one_cycle [source]
Learner.fit_one_cycle(n_epoch, lr_max=None, div=25.0, div_final=100000.0, pct_start=0.25, wd=None, moms=None, cbs=None, reset_opt=False)
Fit self.model for n_epoch using the 1cycle policy.
The 1cycle policy was introduced by Leslie N. Smith et al. in Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. It schedules the learning rate with a cosine annealing from lr_max/div to lr_max, then from lr_max to lr_max/div_final (pass an array to lr_max if you want to use differential learning rates), and the momentum with cosine annealing according to the values in moms. The first phase takes pct_start of the training. You can optionally pass additional cbs and reset_opt.
learn = synth_learner(lr=1e-2)
xb,yb = learn.dls.one_batch()
init_loss = learn.loss_func(learn.model(xb), yb)
learn.fit_one_cycle(2)
xb,yb = learn.dls.one_batch()
final_loss = learn.loss_func(learn.model(xb), yb)
assert final_loss < init_loss
[0, 8.405224800109863, 1.90297269821167, '00:00']
[1, 4.177524089813232, 0.25869476795196533, '00:00']
lrs,moms = learn.recorder.hps['lr'],learn.recorder.hps['mom']
test_close(lrs, [combined_cos(0.25,1e-2/25,1e-2,1e-7)(i/20) for i in range(20)])
test_close(moms, [combined_cos(0.25,0.95,0.85,0.95)(i/20) for i in range(20)])
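The moms argument can be overridden the same way; a hedged sketch (values are illustrative):
learn = synth_learner()
learn.fit_one_cycle(1, lr_max=1e-2, moms=(0.9, 0.8, 0.9))
n = len(learn.dls.train)
test_close(learn.recorder.hps['mom'], [combined_cos(0.25, 0.9, 0.8, 0.9)(i/n) for i in range(n)])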
Recorder.plot_sched [source]
Recorder.plot_sched(keys=None, figsize=None)
learn = synth_learner()
learn.fit_one_cycle(2)
[0, 30.43614959716797, 28.462574005126953, '00:00']
[1, 28.108238220214844, 25.820087432861328, '00:00']
learn.recorder.plot_sched()
Learner.fit_flat_cos [source]
Learner.fit_flat_cos(n_epoch, lr=None, div_final=100000.0, pct_start=0.75, wd=None, cbs=None, reset_opt=False)
Fit self.model for n_epoch at flat lr before a cosine annealing.
learn = synth_learner()
learn.fit_flat_cos(2)
[0, 19.585208892822266, 13.384245872497559, '00:00']
[1, 17.099363327026367, 10.142148971557617, '00:00']
learn.recorder.plot_sched()
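A hedged check of the shape (a sketch, not in the original page): the learning rate stays at its flat value for the first part of training and is annealed afterwards.
test_close(learn.recorder.lrs[0], learn.lr)            # flat phase starts at lr
assert learn.recorder.lrs[-1] < learn.recorder.lrs[0]  # cosine annealing at the end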
Learner.fit_sgdr [source]
Learner.fit_sgdr(n_cycles, cycle_len, lr_max=None, cycle_mult=2, cbs=None, reset_opt=False, wd=None)
Fit self.model for n_cycles of cycle_len using SGDR.
This schedule was introduced by Ilya Loshchilov et al. in SGDR: Stochastic Gradient Descent with Warm Restarts. It consists of n_cycles that are cosine annealings from lr_max (defaults to the Learner lr) to 0, with a length of cycle_len * cycle_mult**i for the i-th cycle (the first one is cycle_len epochs long, then the length is multiplied by cycle_mult for each subsequent cycle). You can optionally pass additional cbs and reset_opt.
learn = synth_learner()
with learn.no_logging(): learn.fit_sgdr(3, 1)
test_eq(learn.n_epoch, 7)
iters = [k * len(learn.dls.train) for k in [0,1,3,7]]
for i in range(3):
    n = iters[i+1]-iters[i]
    #The start of a cycle can be mixed with the 0 of the previous cycle with rounding errors, so we test at +1
    test_close(learn.recorder.lrs[iters[i]+1:iters[i+1]], [SchedCos(learn.lr, 0)(k/n) for k in range(1,n)])
learn.recorder.plot_sched()
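As another hedged sketch of the cycle-length arithmetic: with cycle_len=2 and cycle_mult=2, the three cycles last 2, 4 and 8 epochs, i.e. 14 epochs in total.
learn = synth_learner()
with learn.no_logging(): learn.fit_sgdr(3, 2, cycle_mult=2)
test_eq(learn.n_epoch, 14)  # 2 + 4 + 8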
Learner.fine_tune [source]
Learner.fine_tune(epochs, base_lr=0.002, freeze_epochs=1, lr_mult=100, pct_start=0.3, div=5.0, lr_max=None, div_final=100000.0, wd=None, moms=None, cbs=None, reset_opt=False)
Fine tune with freeze for freeze_epochs, then with unfreeze for epochs, using discriminative LR
learn.fine_tune(1)
[0, 4.645303726196289, 3.2293577194213867, '00:00']
[0, 3.677720785140991, 2.936737537384033, '00:00']
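Roughly speaking, fine_tune first trains the frozen model, then unfreezes and trains with discriminative learning rates. A simplified, hypothetical sketch of that sequence of calls (hedged: the exact defaults and internals of the library differ):
# Hypothetical sketch, not the library implementation
def fine_tune_sketch(learn, epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100, **kwargs):
    learn.freeze()                                                 # train only the head first
    learn.fit_one_cycle(freeze_epochs, slice(base_lr), **kwargs)
    learn.unfreeze()                                               # then train the whole model
    learn.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr), **kwargs)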
class LRFinder [source]
LRFinder(start_lr=1e-07, end_lr=10, num_it=100, stop_div=True) :: ParamScheduler
Training with exponentially growing learning rate
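The cells below (run on the Pets dataset) check that learn.lr_find() does not disturb the training state: the optimizer's grad_avg statistics are identical before and after the call, since the LR Finder saves the model (and optimizer state) before its mock training and restores them afterwards.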
from fastai.vision.all import *
set_seed(99, True)
path = untar_data(URLs.PETS)/'images'
image_files = get_image_files(path)
if sys.platform == "win32" and IN_NOTEBOOK:
    image_files = random.choices(image_files, k=int(len(image_files)/8))
    print("Randomly select 1/8 files in NOTEBOOK on Windows to save time")
# pickle can't serialize lambda functions
def _label_func(x):
    return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, image_files, valid_pct=0.2,
    label_func=_label_func, item_tfms=Resize(224))
learn = cnn_learner(dls, resnet18)
learn.fit(1)
learn.opt.state_dict()['state'][1]['grad_avg']
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.098187 | 0.132234 | 00:10 |
tensor([-0.0043, 0.0034, 0.0000, -0.0031, 0.0000, 0.0095, -0.0037, 0.0000,
0.0012, 0.0000, -0.0033, -0.0025, -0.0004, 0.0000, -0.0068, 0.0077,
-0.0025, -0.0061, -0.0011, -0.0065, 0.0229, -0.0052, -0.0016, -0.0004,
0.0093, -0.0042, -0.0107, -0.0024, -0.0023, 0.0076, 0.0084, 0.0039,
-0.0047, 0.0147, 0.0013, 0.0065, 0.0000, -0.0154, 0.0000, 0.0109,
0.0124, -0.0144, 0.0126, -0.0028, 0.0079, -0.0105, -0.0118, -0.0009,
0.0000, 0.0291, -0.0218, 0.0464, 0.0086, -0.0104, -0.0062, -0.0089,
0.0225, -0.0076, 0.0010, 0.0077, -0.0180, -0.0321, -0.0008, -0.0139],
device='cuda:0')
learn.lr_find()
learn.opt.state_dict()['state'][1]['grad_avg']
tensor([-0.0043, 0.0034, 0.0000, -0.0031, 0.0000, 0.0095, -0.0037, 0.0000,
0.0012, 0.0000, -0.0033, -0.0025, -0.0004, 0.0000, -0.0068, 0.0077,
-0.0025, -0.0061, -0.0011, -0.0065, 0.0229, -0.0052, -0.0016, -0.0004,
0.0093, -0.0042, -0.0107, -0.0024, -0.0023, 0.0076, 0.0084, 0.0039,
-0.0047, 0.0147, 0.0013, 0.0065, 0.0000, -0.0154, 0.0000, 0.0109,
0.0124, -0.0144, 0.0126, -0.0028, 0.0079, -0.0105, -0.0118, -0.0009,
0.0000, 0.0291, -0.0218, 0.0464, 0.0086, -0.0104, -0.0062, -0.0089,
0.0225, -0.0076, 0.0010, 0.0077, -0.0180, -0.0321, -0.0008, -0.0139],
device='cuda:0')
with tempfile.TemporaryDirectory() as d:
    learn = synth_learner(path=Path(d))
    init_a,init_b = learn.model.a,learn.model.b
    with learn.no_logging(): learn.fit(20, cbs=LRFinder(num_it=100))
    assert len(learn.recorder.lrs) <= 100
    test_eq(len(learn.recorder.lrs), len(learn.recorder.losses))
    #Check stop if diverge
    if len(learn.recorder.lrs) < 100: assert learn.recorder.losses[-1] > 4 * min(learn.recorder.losses)
    #Test schedule
    test_eq(learn.recorder.lrs, [SchedExp(1e-7, 10)(i/100) for i in range_of(learn.recorder.lrs)])
    #No validation data
    test_eq([len(v) for v in learn.recorder.values], [1 for _ in range_of(learn.recorder.values)])
    #Model loaded back properly
    test_eq(learn.model.a, init_a)
    test_eq(learn.model.b, init_b)
    test_eq(learn.opt.state_dict()['state'], [{}, {}])
LRFinder.before_fit [source]
LRFinder.before_fit()
Initialize container for hyper-parameters and save the model
LRFinder.before_batch [source]
LRFinder.before_batch()
Set the proper hyper-parameters in the optimizer
LRFinder.after_batch [source]
LRFinder.after_batch()
Record hyper-parameters of this batch and potentially stop training
LRFinder.before_validate [source]
LRFinder.before_validate()
Skip the validation part of training
Recorder.plot_lr_find [source]
Recorder.plot_lr_find(skip_end=5)
Plot the result of an LR Finder test (won't work if you didn't do learn.lr_find() before)
Learner.lr_find [source]
Learner.lr_find(start_lr=1e-07, end_lr=10, num_it=100, stop_div=True, show_plot=True, suggestions=True)
Launch a mock training to find a good learning rate, return lr_min, lr_steep if suggestions is True
First introduced by Leslie N. Smith in Cyclical Learning Rates for Training Neural Networks, the LR Finder trains the model with exponentially growing learning rates from start_lr to end_lr for num_it iterations and stops in case of divergence (unless stop_div=False), then plots the losses vs the learning rates with a log scale.
A good value for the learning rate is then either:
- one tenth of the minimum before the divergence
- the point where the slope is the steepest
Those two values are returned by default by the Learning Rate Finder.
with tempfile.TemporaryDirectory() as d:
    learn = synth_learner(path=Path(d))
    weights_pre_lr_find = L(learn.model.parameters())
    lr_min,lr_steep = learn.lr_find()
    weights_post_lr_find = L(learn.model.parameters())
    test_eq(weights_pre_lr_find, weights_post_lr_find)
print(f"Minimum/10: {lr_min:.2e}, steepest point: {lr_steep:.2e}")
Minimum/10: 5.25e-02, steepest point: 1.10e-02
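As a hedged usage note (not in the original page), these suggestions are typically fed straight into a subsequent training call, for example:
learn.fit_one_cycle(2, lr_max=lr_steep)  # e.g. train using the steepest-point suggestion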