CONSISTENT VIEW
The concept of consistent view in OneFlow is introduced to simplify distributed training. In short, the cluster is abstracted as a “Super Computing Device” under OneFlow consistent view.
Instead of caring about the details of computation and communication in a cluster, users can program as if on a single node, and OneFlow trains the model in a distributed way.
OneFlow’s consistent view relies on several important concepts: Placement, SBP and SBP Signature.
Placement
In the consistent view, every OneFlow Tensor has a placement attribute, which specifies the physical devices the Tensor is placed on.
OneFlow automatically numbers the devices in the cluster. For example, if there are four hosts in a cluster and each host has eight cards, then the four hosts are assigned the IDs 0, 1, 2 and 3, and the cards on each host are numbered 0 to 7. To place a Tensor on the first four cards on machine 0, simply configure: placement("cuda", {0: [0, 1, 2, 3]}).
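The numbering scheme can be illustrated in plain Python. This is only a sketch of how a placement addresses devices by host ID and card ID, not OneFlow's implementation: the helper expand_placement and the constants NUM_HOSTS and CARDS_PER_HOST are hypothetical names chosen for this example, and the dictionary layout simply mirrors the placement call shown above.

```python
# Illustrative only: models how a placement names devices by (host ID, card ID).
# `expand_placement` is a hypothetical helper, not part of OneFlow.

NUM_HOSTS = 4        # hosts are numbered 0..3
CARDS_PER_HOST = 8   # cards on each host are numbered 0..7


def expand_placement(device_type, host_to_cards):
    """Return every (device_type, host_id, card_id) covered by a placement."""
    devices = []
    for host_id in sorted(host_to_cards):
        if not 0 <= host_id < NUM_HOSTS:
            raise ValueError(f"unknown host ID: {host_id}")
        for card_id in host_to_cards[host_id]:
            if not 0 <= card_id < CARDS_PER_HOST:
                raise ValueError(f"unknown card ID on host {host_id}: {card_id}")
            devices.append((device_type, host_id, card_id))
    return devices


# The first four cards on machine 0, as in the example above.
first_four = expand_placement("cuda", {0: [0, 1, 2, 3]})
print(first_four)
```

A placement that covers the whole cluster would list all eight cards for each of the four hosts, giving 32 devices in total.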
Placement makes it easy for OneFlow to support pipeline parallelism, and we’ll see examples of placement in other articles on this topic.
SBP
SBP is a unique concept in OneFlow, which describes the mapping of data from a “Super Computing Device” perspective to data on real physical devices in a cluster. It is a combination of the initials of three words: split, broadcast, partial.
In detail:
split means that the physical Tensor is obtained by splitting the logical Tensor along a certain dimension. An axis parameter indicates the dimension of the split. Concatenating the physical Tensors along that dimension restores the logical Tensor.
broadcast indicates that each physical Tensor is an exact copy of the logical Tensor.
partial indicates that, although the physical Tensor has the same shape as the logical Tensor, each value in the physical Tensor is only part of the value at the corresponding position in the logical Tensor. Adding the physical Tensors element-wise at the same positions restores the logical Tensor. Besides sum, operations such as min and max are also available for partial.
The figures below show some examples of SBP, including split(0), split(1), broadcast and partial sum.
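The same four cases can also be sketched in plain Python, with nested lists standing in for Tensors and a two-element list standing in for two devices. This is an illustration of the mappings only and does not use the OneFlow API; all variable names are chosen for this example, and the particular partial sum decomposition is arbitrary.

```python
# Illustrative only: nested lists stand in for Tensors, and each outer
# two-element list holds the physical Tensors of two devices.

logical = [[1, 2, 3, 4],
           [5, 6, 7, 8]]  # the logical Tensor, shape (2, 4)

# split(0): cut along dimension 0 (rows); concatenating the rows restores it.
split0 = [logical[:1], logical[1:]]
restored_from_split0 = split0[0] + split0[1]

# split(1): cut along dimension 1 (columns); joining row by row restores it.
split1 = [[row[:2] for row in logical], [row[2:] for row in logical]]
restored_from_split1 = [left + right for left, right in zip(split1[0], split1[1])]

# broadcast: every device holds a full copy of the logical Tensor.
broadcast = [[row[:] for row in logical] for _ in range(2)]

# partial sum: each device holds a Tensor of the full shape, and the
# element-wise sum over devices restores the logical values.
partial_sum = [
    [[1, 0, 3, 1], [2, 6, 0, 4]],
    [[0, 2, 0, 3], [3, 0, 7, 4]],
]
restored_from_partial = [
    [x + y for x, y in zip(row_a, row_b)]
    for row_a, row_b in zip(partial_sum[0], partial_sum[1])
]

print(restored_from_split0 == logical)
print(restored_from_split1 == logical)
print(restored_from_partial == logical)
```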
When you create a Consistent Tensor, you can specify the SBP of the Tensor. An example appears in the next article, Consistent Tensor.
SBP Signature
SBP describes the mapping relationship between the data under the consistent view and the data on the physical devices. When doing distributed training, OneFlow distributes the data to the physical devices and computes the results according to the SBP attributes of the data.
For an isolated Tensor, we can set its SBP attributes at will. However, for an operator with input and output data, we cannot set the SBP attributes of its inputs and outputs arbitrarily, because an arbitrary combination may not agree with what the operator computes under the consistent view.
Let us discuss this problem with the example of matrix multiplication, and look at which combinations of input and output SBP are legal and which are illegal in a distributed system with two devices.
Suppose, from the consistent view, that a matrix $A$ with the shape $(m, k)$ is multiplied by a matrix $B$ with the shape $(k, n)$ to get $Y$; the shape of $Y$ must be $(m, n)$.
According to the rule of matrix multiplication, we can divide the matrix $A$ into two matrices $A_0$ and $A_1$ along dimension 0, with the shapes $(m_0, k)$ and $(m_1, k)$ respectively, where $m_0 + m_1 = m$:
Device 1: $A_0 B = Y_0$
Device 2: $A_1 B = Y_1$
It is easy to see the relationship between the physical Tensors $A_0$, $A_1$ and the Tensor $A$ under the consistent view, and likewise the relationship between $Y_0$, $Y_1$ and the consistent-view data $Y$:
$$A = \mathrm{concat}(A_0, A_1)$$
$$Y = \mathrm{concat}(Y_0, Y_1)$$
Note: The concat above represents a concatenate operation.
In this way, it is possible to execute the operation and get the correct result from the consistent view by distributing the data to each physical device. The long story above, described in SBP terms, is surprisingly simple:
the first input $A$ is split(0), the second input $B$ is broadcast, and the output $Y$ is split(0).
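This combination can be checked numerically with a small pure-Python sketch. It does not use OneFlow; the matmul helper, the example values, and the names A, B and Y for the two inputs and the output are chosen for this illustration only.

```python
# Illustrative only: checks the combination split(0), broadcast -> split(0)
# on two simulated devices, with nested lists standing in for Tensors.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


A = [[1, 2], [3, 4], [5, 6], [7, 8]]  # shape (4, 2)
B = [[1, 0, 2], [0, 1, 3]]            # shape (2, 3)
Y = matmul(A, B)                      # result under the consistent view

# A is split(0): each device receives a block of rows.
A0, A1 = A[:2], A[2:]

# B is broadcast: both devices use the full matrix.
Y0 = matmul(A0, B)  # computed on device 1
Y1 = matmul(A1, B)  # computed on device 2

# Y is split(0): concatenating the row blocks restores the logical result.
Y_restored = Y0 + Y1
print(Y_restored == Y)
```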
We can see that, for matrix multiplication, combining the SBP of its inputs and output in the above way is legal. Matrix multiplication has more than one valid SBP combination, such as:
$A$ is broadcast, $B$ is split(1), and $Y$ is split(1).
Or:
$A$ is split(1), $B$ is split(0), and $Y$ is partial sum.
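Both of these combinations can be verified the same way. The sketch below is plain Python rather than OneFlow code; the helpers matmul and add, the example values, and the names A, B and Y for the two inputs and the output are assumptions made for this illustration.

```python
# Illustrative only: checks two more valid combinations for matrix
# multiplication on two simulated devices.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


A = [[1, 2], [3, 4]]  # shape (2, 2)
B = [[5, 6], [7, 8]]  # shape (2, 2)
Y = matmul(A, B)      # result under the consistent view

# Case 1: A is broadcast, B is split(1) -> Y is split(1).
B_left = [row[:1] for row in B]
B_right = [row[1:] for row in B]
Y_left = matmul(A, B_left)    # device 1 holds the left columns of Y
Y_right = matmul(A, B_right)  # device 2 holds the right columns of Y
Y_from_split1 = [left + right for left, right in zip(Y_left, Y_right)]

# Case 2: A is split(1), B is split(0) -> Y is partial sum.
A_left = [row[:1] for row in A]
A_right = [row[1:] for row in A]
B_top, B_bottom = B[:1], B[1:]
P0 = matmul(A_left, B_top)       # device 1 holds a full-shape partial result
P1 = matmul(A_right, B_bottom)   # device 2 holds a full-shape partial result
Y_from_partial = add(P0, P1)

print(Y_from_split1 == Y, Y_from_partial == Y)
```

In the second case each device produces a matrix with the full shape of the output, but only the element-wise sum of the two carries the correct values, which is exactly what partial sum describes.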
While we showed multiple valid SBP combinations above, not all SBP combinations are valid. For example, for matrix multiplication, if $A$ and $B$ are both split(0), then:
$$A = \mathrm{concat}(A_0, A_1), \quad B = \mathrm{concat}(B_0, B_1)$$
Because the shapes of $A_0$ and $B_0$ do not meet the requirements of matrix multiplication (the number of columns of $A_0$ no longer equals the number of rows of $B_0$), it is impossible to compute the matrix multiplication on the physical devices. We can say that the combination of $A$ as split(0) and $B$ as split(0) is illegal.
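The shape mismatch is easy to reproduce in a pure-Python sketch. As before, this is not OneFlow code; the matmul helper, the example values and the names A and B for the two inputs are assumptions made for this illustration.

```python
# Illustrative only: with both inputs split along dimension 0, the blocks
# held by one device no longer have compatible shapes.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError(
            f"shape mismatch: ({len(a)}, {len(a[0])}) x ({len(b)}, {len(b[0])})"
        )
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


A = [[1, 2], [3, 4], [5, 6], [7, 8]]  # shape (4, 2)
B = [[1, 0, 2], [0, 1, 3]]            # shape (2, 3)

A0 = A[:2]  # block on device 1, shape (2, 2)
B0 = B[:1]  # block on device 1, shape (1, 3)

try:
    matmul(A0, B0)
    local_matmul_ok = True
except ValueError as err:
    local_matmul_ok = False
    print(err)
```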
We define a specific, valid combination of the SBP attributes of an operator's inputs and outputs, like those shown above, as an SBP Signature of that operator.
Every operator in OneFlow predefines all of its possible SBP signatures according to the operator's computation rules. The user only needs to set the placement and SBP attributes of the data; the selection of an SBP signature is transparent to the user.
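To make the idea of predefined signatures concrete, here is a deliberately simplified sketch that stores the three matrix multiplication combinations discussed in this article in a table and looks up the output SBP for a given pair of input SBPs. It is not how OneFlow implements signature selection: the table MATMUL_SIGNATURES, the function infer_output_sbp and the string encoding of SBP values are all hypothetical.

```python
# Illustrative only: a hypothetical lookup table holding the three matmul
# combinations named in this article. OneFlow's real mechanism is not shown.

MATMUL_SIGNATURES = [
    # ((SBP of first input, SBP of second input), SBP of output)
    (("split(0)", "broadcast"), "split(0)"),
    (("broadcast", "split(1)"), "split(1)"),
    (("split(1)", "split(0)"), "partial sum"),
]


def infer_output_sbp(first_sbp, second_sbp):
    """Return the output SBP for a listed combination, or None if not listed."""
    for inputs, output in MATMUL_SIGNATURES:
        if inputs == (first_sbp, second_sbp):
            return output
    return None


print(infer_output_sbp("split(1)", "split(0)"))
print(infer_output_sbp("split(0)", "split(0)"))
```

The second lookup finds no entry, which corresponds to the illegal combination in which both inputs are split(0).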
Conclusion
placement, SBP, and SBP Signature are the foundations of OneFlow's distributed consistent view, which makes distributed training with OneFlow as simple as training on a single machine with a single card.
In the next article Consistent Tensor, we’ll show you an example of programming under the consistent view.