DATA PARALLELISM TRAINING
In Common Distributed Parallel Strategies, we introduced the characteristics of data parallelism.
OneFlow provides the oneflow.nn.parallel.DistributedDataParallel module and a launcher, which allow users to run data parallel training almost without modifying the single-node script.
A quick start of OneFlow's data parallelism training:
wget https://docs.oneflow.org/master/code/parallelism/ddp_train.py #Download
python3 -m oneflow.distributed.launch --nproc_per_node 2 ./ddp_train.py #data parallel training
Out:
50/500 loss:0.004111831542104483
50/500 loss:0.00025336415274068713
...
500/500 loss:6.184563972055912e-11
500/500 loss:4.547473508864641e-12
w:tensor([[2.0000],
[3.0000]], device='cuda:1', dtype=oneflow.float32,
grad_fn=<accumulate_grad>)
w:tensor([[2.0000],
[3.0000]], device='cuda:0', dtype=oneflow.float32,
grad_fn=<accumulate_grad>)
Codes
The complete script used in the run above is as follows:
import oneflow as flow
from oneflow.nn.parallel import DistributedDataParallel as ddp

train_x = [
    flow.tensor([[1, 2], [2, 3]], dtype=flow.float32),
    flow.tensor([[4, 6], [3, 1]], dtype=flow.float32),
]
train_y = [
    flow.tensor([[8], [13]], dtype=flow.float32),
    flow.tensor([[26], [9]], dtype=flow.float32),
]


class Model(flow.nn.Module):
    def __init__(self):
        super().__init__()
        self.lr = 0.01
        self.iter_count = 500
        self.w = flow.nn.Parameter(flow.tensor([[0], [0]], dtype=flow.float32))

    def forward(self, x):
        x = flow.matmul(x, self.w)
        return x


m = Model().to("cuda")
m = ddp(m)  # wrap the module for data parallel training
loss = flow.nn.MSELoss(reduction="sum")
optimizer = flow.optim.SGD(m.parameters(), m.lr)

for i in range(0, m.iter_count):
    rank = flow.env.get_rank()  # current process rank, one process per device
    x = train_x[rank].to("cuda")
    y = train_y[rank].to("cuda")

    y_pred = m(x)
    l = loss(y_pred, y)

    if (i + 1) % 50 == 0:
        print(f"{i+1}/{m.iter_count} loss:{l}")

    optimizer.zero_grad()
    l.backward()
    optimizer.step()

print(f"\nw:{m.w}")
There are only two differences between the data parallel training code and the stand-alone single-card script:
- Use DistributedDataParallel to wrap the module object (m = ddp(m)).
- Use get_rank to get the rank of the current device and distribute the data to that device.
Then run the script with the launcher and leave everything else to OneFlow, which makes distributed training as simple as stand-alone single-card training:
python3 -m oneflow.distributed.launch --nproc_per_node 2 ./ddp_train.py
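Under the launcher, each spawned process is assigned its own rank, which is how the script above picks both its data shard and its GPU. The following is a minimal sketch for inspecting that mapping; the file name check_ranks.py is hypothetical, and flow.env.get_world_size() is assumed to be available alongside flow.env.get_rank() (only get_rank appears in the script above).

# check_ranks.py - inspect the processes spawned by the launcher (illustrative sketch)
import oneflow as flow

rank = flow.env.get_rank()              # index of this process: 0, 1, ...
world_size = flow.env.get_world_size()  # assumed API: total number of launched processes

# Each process handles the data slice matching its rank,
# just like train_x[rank] / train_y[rank] in the training script above.
print(f"process {rank}/{world_size} will train on cuda:{rank}")

It is launched the same way as the training script:
python3 -m oneflow.distributed.launch --nproc_per_node 2 ./check_ranks.py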
DistributedSampler
In this article, the data is distributed manually to highlight DistributedDataParallel. In practical applications, you can use DistributedSampler together with data parallelism: a DataLoader is instantiated in each process, and each DataLoader instance loads a portion of the complete data, completing the data distribution automatically.
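A minimal sketch of that pattern is shown below. It assumes oneflow.utils.data mirrors the familiar PyTorch-style API (TensorDataset, DataLoader, and distributed.DistributedSampler with set_epoch); these names are assumptions rather than something used in the script above, so check them against your OneFlow version.

# Sketch: data parallel training with DistributedSampler (assumed PyTorch-style API)
import oneflow as flow
from oneflow.nn.parallel import DistributedDataParallel as ddp
from oneflow.utils.data import DataLoader, TensorDataset
from oneflow.utils.data.distributed import DistributedSampler

# Deterministic toy dataset for the same y = x @ [[2], [3]] fitting problem as above.
xs = flow.arange(0, 16, dtype=flow.float32).reshape(8, 2)
ys = flow.matmul(xs, flow.tensor([[2.0], [3.0]]))
dataset = TensorDataset(xs, ys)

# The sampler gives every process a non-overlapping shard of the dataset,
# so the manual train_x[rank] / train_y[rank] split is no longer needed.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

model = ddp(flow.nn.Linear(2, 1, bias=False).to("cuda"))
loss_fn = flow.nn.MSELoss(reduction="sum")
optimizer = flow.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(50):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch (assumed, PyTorch-style)
    for x, y in loader:
        x, y = x.to("cuda"), y.to("cuda")
        l = loss_fn(model(x), y)
        optimizer.zero_grad()
        l.backward()
        optimizer.step()

As before, the script would be started with the launcher (python3 -m oneflow.distributed.launch --nproc_per_node 2 ...), so that one process per GPU is created and each loads only its own shard.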