Distributed and parallel training
Callbacks and helper functions to train in parallel or use distributed training
When using multiple GPUs, you will most probably want to fit using distributed training. See examples/distrib.py for a complete example. To use distributed training, there are only two required steps:

- Add with learn.distrib_ctx(): before your learn.fit call
- Run your training script with python -m fastai.launch scriptname.py ...args...

After fastai.launch you can add --gpus 0,1, for instance, to use only GPUs 0 and 1. A minimal script following these two steps is sketched below.
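Here is a minimal sketch of such a script (the dataset, transforms, architecture, and script name are illustrative choices, not requirements):

from fastai.vision.all import *
from fastai.distributed import *

# Download/uncompress once, on the rank-0 process first (see rank0_first below)
path = rank0_first(untar_data, URLs.IMAGEWOOF_320)
dls = ImageDataLoaders.from_folder(path, valid='val', item_tfms=Resize(160), bs=64)
learn = cnn_learner(dls, resnet34, metrics=accuracy)

with learn.distrib_ctx():   # step 1: wrap the fit call
    learn.fit_one_cycle(1)

# step 2: launch with, e.g.
#   python -m fastai.launch train_imagewoof.py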
If you’re using untar_data, or may otherwise be downloading or uncompressing data or models as part of your script, you should wrap that code with rank0_first, which forces that step to run just once, on the master process, before the remaining processes run it in parallel. E.g. instead of:
path = untar_data(URLs.IMAGEWOOF_320)
…you instead use:
path = rank0_first(untar_data, URLs.IMAGEWOOF_320)
See below for details on the full API and underlying helper functions, if needed; note, however, that you will not need anything except the above unless you want to change how distributed training is implemented.
Parallel
DataParallel.reset() [source]
Patch required reset call into DataParallel
class ParallelTrainer [source]
ParallelTrainer(device_ids) :: Callback
Wrap a model in DataParallel automatically
Learner.to_parallel(device_ids=None) [source]
Add ParallelTrainer callback to a Learner
Learner.detach_parallel() [source]
Remove ParallelTrainer callback from a Learner
Learner.parallel_ctx(device_ids=None) [source]
A context manager to adapt a learner to train in data parallel mode.
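A minimal usage sketch (assuming learn is an existing Learner; the single epoch is just illustrative):

# Train with torch.nn.DataParallel across the visible GPUs for the duration of the block
with learn.parallel_ctx():    # or learn.parallel_ctx(device_ids=[0, 1]) to restrict devices
    learn.fit_one_cycle(1)
# On exit the ParallelTrainer callback is detached again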
Distributed
Helper functions
DistributedDataParallel.reset() [source]
Patch required reset call into DistributedDataParallel
setup_distrib(gpu=None) [source]
Setup this process to participate in distributed training
teardown_distrib() [source]
Free distributed training resources
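You will rarely need to call these directly; the distrib_ctx context manager described below sets up a process group when none exists and destroys any group it created on exit. A rough sketch of the manual flow, with an illustrative GPU index:

setup_distrib(gpu=0)    # set up this process to participate in distributed training on GPU 0
# ... attach DistributedTrainer and run training here ...
teardown_distrib()      # free distributed training resources when done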
DataLoader
class DistributedDL [source]
DistributedDL(dl, rank=None, world_size=None) :: TfmdDL
A TfmdDL which splits a batch into equal size pieces for each worker
# 50 items split across world_size=4 ranks: each rank gets ceil(50/4)=13 items
# (wrapping past the end via %50), served as a batch of 12 followed by a batch of 1
dl = TfmdDL(list(range(50)), bs=12, num_workers=2)
for i in range(4):
    dl1 = DistributedDL(dl, i, 4)
    test_eq(list(dl1), (torch.arange(i*13, i*13+12)%50, torch.tensor([i*13+12])%50))
class DistributedTrainer [source]
DistributedTrainer(cuda_id=0, sync_bn=True) :: Callback
Wrap model in DistributedDataParallel and dls in DistributedDL
Learner.to_distributed(cuda_id, sync_bn=True) [source]
Add DistributedTrainer to a learner
Learner.detach_distributed() [source]
Remove DistributedTrainer from a learner
distrib_ctx context manager

Learner.distrib_ctx(cuda_id=None, sync_bn=True) [source]
A context manager to adapt a learner to train in distributed data parallel mode.
distrib_ctx prepares a learner to train in distributed data parallel mode. It assumes that the distributed training environment variables have all been set up properly, as they are for scripts launched by python -m fastai.launch.

Typical usage:

with learn.distrib_ctx(): learn.fit(.....)

It attaches a DistributedTrainer callback and DistributedDL data loader to the learner, then executes learn.fit(.....). Upon exiting the context, it removes the DistributedTrainer and DistributedDL, and destroys any locally created distributed process group. The process is still attached to the GPU though.
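Conceptually, the context manager is roughly equivalent to attaching and detaching the callback yourself (a simplified sketch, not the actual implementation; the epoch count and cuda_id are illustrative):

cuda_id = 0                      # normally the local rank of this process
learn.to_distributed(cuda_id)    # attach DistributedTrainer; dls are wrapped in DistributedDL during fit
try:
    learn.fit(1)
finally:
    learn.detach_distributed()   # remove the callback again
    # a process group created by distrib_ctx itself would also be destroyed here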
rank0_first(func, *args, **kwargs) [source]
Execute func in the Rank-0 process first, then in other ranks in parallel.
rank0_first calls func in the rank-0 process first, then in parallel on the rest, in distributed training mode. In single-process, non-distributed training mode, func is called only once, as expected.
One application of rank0_first() is to make fresh downloads via untar_data safe in distributed training scripts launched by python -m fastai.launch <script>:
path = untar_data(URLs.IMDB)
becomes:
path = rank0_first(lambda: untar_data(URLs.IMDB))
Some learner factory methods may use untar_data
to download pretrained models:
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
becomes:
learn = rank0_first(lambda: text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy))
Otherwise, multiple processes will download at the same time and corrupt the data.