Custom Operators

Did you check out the wide array of Operators already provided in Caffe2? Still want to roll your own operator? Read on, but don’t forget to contribute your fancy new operator back to the project!

Writing a Basic Operator

Almost every operator uses both a .cc file to register the operator and a .h file for the actual implementation, though this can vary: in some cases the implementation is coded directly in the .cc file. In addition, several operators also have GPU/CUDA implementations, which are stored in .cu files.

If a CUDA implementation involves actual CUDA kernels, the file has to use the .cu extension so that it is compiled by NVCC. If it only calls into existing CUDA libraries, we name it _gpu.cc to save on compilation time.
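To make this layout concrete, here is a minimal sketch of the header side for a hypothetical operator. The file name my_op.h, the class MyOp, and the scale argument are invented for illustration and are not part of Caffe2:

    // my_op.h -- hypothetical example for illustration only
    #ifndef CAFFE2_OPERATORS_MY_OP_H_
    #define CAFFE2_OPERATORS_MY_OP_H_

    #include "caffe2/core/context.h"
    #include "caffe2/core/operator.h"

    namespace caffe2 {

    template <typename T, class Context>
    class MyOp final : public Operator<Context> {
     public:
      USE_OPERATOR_CONTEXT_FUNCTIONS;
      MyOp(const OperatorDef& operator_def, Workspace* ws)
          : Operator<Context>(operator_def, ws),
            scale_(OperatorBase::GetSingleArgument<float>("scale", 1.0f)) {}

      // The actual computation; the definition may live here, in the .cc
      // file, or in a .cu file for CUDA kernels.
      bool RunOnDevice() override;

     protected:
      float scale_;
    };

    } // namespace caffe2

    #endif // CAFFE2_OPERATORS_MY_OP_H_

We will return to this hypothetical operator below when discussing schemas and implementations.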

We will start by describing what goes into the .cc file. As an example, consider the operator defined in fully_connected_op.cc:

    #include "caffe2/operators/fully_connected_op.h"

    namespace caffe2 {
    namespace {

    REGISTER_CPU_OPERATOR(FC, FullyConnectedOp<float, CPUContext>);
    REGISTER_CPU_OPERATOR(FCGradient, FullyConnectedGradientOp<float, CPUContext>);

First, the operator and its corresponding gradient operator are registered with these macros. Registration binds the name FC, whenever it is used from Python, to the FullyConnectedOp implementation; the template arguments float and CPUContext dictate the expected input type and the execution context. The context can be either CPUContext or CUDAContext, depending on whether the operator runs on a CPU or a GPU device.

Fully Connected also has a GPU implementation that can be found in fully_connected_op_gpu.cc.

    #include "caffe2/core/context_gpu.h"
    #include "caffe2/operators/fully_connected_op.h"

    namespace caffe2 {
    namespace {
    REGISTER_CUDA_OPERATOR(FC, FullyConnectedOp<float, CUDAContext>);
    REGISTER_CUDA_OPERATOR(FCGradient,
                           FullyConnectedGradientOp<float, CUDAContext>);
    } // namespace
    } // namespace caffe2

Note that the primary differences between the GPU implementation and the CPU implementation are the use of REGISTER_CUDA_OPERATOR and CUDAContext instead of REGISTER_CPU_OPERATOR and CPUContext. Also note the additional header file context_gpu.h, which you will want to include in any GPU implementation.
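Following the same pattern, a GPU registration for the hypothetical my_op example could live in a my_op_gpu.cc, assuming it only calls into existing CUDA libraries (a .cu file otherwise). Again, this is a sketch, not Caffe2 code:

    #include "caffe2/core/context_gpu.h"
    #include "caffe2/operators/my_op.h" // hypothetical header from above

    namespace caffe2 {
    // A CUDA specialization of MyOp<float, CUDAContext>::RunOnDevice()
    // would also have to be provided.
    REGISTER_CUDA_OPERATOR(MyOp, MyOp<float, CUDAContext>);
    } // namespace caffe2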

Referring back to fully_connected_op.cc, we will look at the remainder of the file and discuss the operator schema. The schema declares how many inputs the operator accepts and how many outputs it produces. It is also used to generate the documentation for the operator in the Operators Catalog, so be thorough in describing the arguments and the functionality. Also note below that for .Arg, .Input, and .Output the last parameter is a description that is likewise used when generating documentation.

fully_connected_op.cc

    OPERATOR_SCHEMA(FC)
        .NumInputs(3)
        .NumOutputs(1)
        .SetDoc(R"DOC(
    Computes the result of passing an input vector X into a fully connected layer with 2D weight matrix W and 1D bias vector b.

    The layer computes Y = X * W + b, where X has size (M x K), W has size (K x N), b has size (N), and Y has size (M x N), where M is the batch size. Even though b is 1D, it is resized to size (M x N) implicitly and added to each vector in the batch. These dimensions must be matched correctly, or else the operator will throw errors.
    )DOC")
        .Arg("axis", "(int32_t) default to 1; describes the axis of the inputs; "
             "defaults to one because the 0th axis most likely describes the batch_size")
        .Input(0, "X", "2D input of size (MxK) data")
        .Input(1, "W", "2D blob of size (KxN) containing fully connected weight "
               "matrix")
        .Input(2, "b", "1D blob containing bias vector")
        .Output(0, "Y", "2D output tensor");

As you can see in the schema code above, this operator has 3 inputs and 1 output, which were specified by .NumInputs and .NumOutputs respectively. The documentation is thorough and specified with .SetDoc. It also has one additional optional argument that defaults to 1 as specified with .Arg.

.SetDoc(R"DOC(docs go here)DOC") is where you provide the operator’s documentation.

.Input sets the main data used in the operator, such as the weight matrices for a fully connected layer. The example above shows three entries for .Input. Note the first parameter is the index of the input, starting at 0 for the first input. The second parameter is the name of the variable such as X, W, or b. Finally, the third parameter is the description.

.Arg specifies auxiliary arguments that are usually not involved in the raw data manipulation.

.Output specifies the outputs. Its parameters follow the same pattern as .Input: (index, name, description). A minimal schema for a hypothetical operator is sketched below.
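Putting these pieces together, a minimal registration and schema for the hypothetical MyOp from earlier might look as follows (again a sketch; the names and descriptions are ours, not Caffe2's):

    REGISTER_CPU_OPERATOR(MyOp, MyOp<float, CPUContext>);

    OPERATOR_SCHEMA(MyOp)
        .NumInputs(1)
        .NumOutputs(1)
        .SetDoc(R"DOC(Multiplies the input tensor by the scalar argument scale.)DOC")
        .Arg("scale", "(float) default 1.0; scalar multiplier applied to every element")
        .Input(0, "X", "input tensor of arbitrary shape")
        .Output(0, "Y", "output tensor with the same shape as X");

Inside the operator itself, a declared argument is typically read in the constructor with OperatorBase::GetSingleArgument<float>("scale", 1.0f), as sketched in the header earlier.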

The schema goes on to describe a second operator, FCGradient.

fully_connected_op.cc

    OPERATOR_SCHEMA(FCGradient).NumInputs(3).NumOutputs(2, 3);

    class GetFCGradient : public GradientMakerBase {
      using GradientMakerBase::GradientMakerBase;
      vector<OperatorDef> GetGradientDefs() override {
        CHECK_EQ(def_.input_size(), 3);
        return SingleGradientDef(
            "FCGradient", "",
            vector<string>{I(0), I(1), GO(0)},
            vector<string>{GI(1), GI(2), GI(0)});
      }
    };
    REGISTER_GRADIENT(FC, GetFCGradient);
    } // namespace
    } // namespace caffe2

The inputs and outputs of the gradient operator have to be tagged inside GradientMakerBase::GetGradientDefs(). By doing so, we effectively inform Caffe2 how the inputs and outputs of the gradient operator relate to those of the corresponding forward operator. The first vector tags the gradient operator's inputs and the second vector tags its outputs; here, I(i) refers to the i-th input of the forward operator, GO(i) to the gradient of its i-th output, and GI(i) to the gradient of its i-th input. Note that a doc schema is usually not necessary for gradient operators, unless you see fit to add one.
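For a unary operator such as the hypothetical MyOp, whose gradient needs only the gradient of the output, the corresponding sketch could look like this (the gradient operator MyOpGradient would also need to be implemented and registered):

    OPERATOR_SCHEMA(MyOpGradient).NumInputs(1).NumOutputs(1);

    class GetMyOpGradient : public GradientMakerBase {
      using GradientMakerBase::GradientMakerBase;
      vector<OperatorDef> GetGradientDefs() override {
        // Y = scale * X, so dX depends only on dY.
        return SingleGradientDef(
            "MyOpGradient", "",
            vector<string>{GO(0)},
            vector<string>{GI(0)});
      }
    };
    REGISTER_GRADIENT(MyOp, GetMyOpGradient);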

Implementation Details

As previously mentioned, most of the implementation details live in the header file in the general case, though they can also be placed directly in the .cc file. For CUDA implementations, the brunt of the logic and code is in .cu files.
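As a sketch of what such an implementation might look like, here is a possible CPU RunOnDevice() for the hypothetical MyOp, using the tensor accessors from the Caffe2 operator API:

    template <>
    bool MyOp<float, CPUContext>::RunOnDevice() {
      const auto& X = Input(0);
      auto* Y = Output(0);
      Y->ResizeLike(X); // allocate the output with the same shape as the input
      const float* x_data = X.data<float>();
      float* y_data = Y->mutable_data<float>();
      for (int i = 0; i < X.size(); ++i) {
        y_data[i] = scale_ * x_data[i]; // Y = scale * X, element by element
      }
      return true; // returning false signals failure to the executor
    }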

Unit Testing Caffe2 Operators

It is a very good idea to write some unit tests to verify your operator is correctly implemented. There are a few helper libraries provided within Caffe2 to make sure your operator tests have good coverage.

Hypothesis is a very useful library for property-based testing. The key idea here is to express properties of the code under test (e.g. that it passes a gradient check, that it implements a reference function, etc), and then generate random instances and verify they satisfy these properties.
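If Hypothesis is new to you, here is a minimal, self-contained example of the idea, unrelated to Caffe2: the property is that sorting is idempotent, and Hypothesis generates the random inputs.

    from hypothesis import given
    import hypothesis.strategies as st

    @given(st.lists(st.integers()))
    def test_sort_is_idempotent(xs):
        # Property: sorting an already-sorted list changes nothing.
        assert sorted(sorted(xs)) == sorted(xs)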

The main functions of interest are exposed on HypothesisTestCase, defined in caffe2/python/hypothesis_test_util.py.

You should add your unit test to the folder caffe2/python/operator_tests/. In that directory you can find many existing examples to work from.

The key functions are:

  • assertDeviceChecks(devices, op, inputs, outputs): This asserts that the operator computes the same outputs, regardless of which device it is executed on.
  • assertGradientChecks(device, op, inputs, output_, outputs_with_grads): This implements a standard numerical gradient checker for the operator in question.
  • assertReferenceChecks(device, op, inputs, reference): This runs the reference function (effectively calling reference(*inputs)) and compares the result to the output of the operator.

hypothesis_test_util.py also exposes some useful pre-built samplers:

  • hu.gcs - a gradient checker device (gc) and device checker devices (dc)
  • hu.gcs_cpu_only - a gradient checker device (gc) and device checker devices (dc) for CPU-only operators

For a simple example:

    @given(X=hu.tensor(), **hu.gcs)
    def test_averaged_loss(self, X, gc, dc):
        op = core.CreateOperator("AveragedLoss", ["X"], ["loss"])
        self.assertDeviceChecks(dc, op, [X], [0])
        self.assertGradientChecks(gc, op, [X], 0, [0])

Another example that demonstrates the usage of assertReferenceChecks:

    @given(inputs=hu.tensors(n=3),
           in_place=st.booleans(),
           beta1=st.floats(min_value=0.1, max_value=0.9),
           beta2=st.floats(min_value=0.1, max_value=0.9),
           lr=st.floats(min_value=0.1, max_value=0.9),
           iters=st.integers(min_value=1, max_value=10000),
           epsilon=st.floats(min_value=1e-5, max_value=1e-2),
           **hu.gcs)
    def test_adam(self, inputs, in_place, beta1, beta2, lr, iters, epsilon,
                  gc, dc):
        grad, m1, m2 = inputs
        m2 += np.abs(m2) + 0.01
        lr = np.asarray([lr], dtype=np.float32)
        iters = np.asarray([iters], dtype=np.int32)
        op = core.CreateOperator(
            "Adam",
            ["grad", "m1", "m2", "lr", "iters"],
            ["grad" if in_place else "grad_o",
             "m1" if in_place else "m1_o",
             "m2" if in_place else "m2_o"],
            beta1=beta1, beta2=beta2, epsilon=epsilon,
            device_option=gc)
        input_device_options = {"lr": hu.cpu_do, "iters": hu.cpu_do}
        self.assertDeviceChecks(
            dc, op, [grad, m1, m2, lr, iters], [0], input_device_options)

        # Reference implementation of the Adam update
        def adam(grad, m1, m2, lr, iters):
            lr = lr[0]
            iters = iters[0]
            t = iters + 1
            corrected_local_rate = lr * np.sqrt(1. - np.power(beta2, t)) / \
                (1. - np.power(beta1, t))
            m1_o = (beta1 * m1) + (1. - beta1) * grad
            m2_o = (beta2 * m2) + (1. - beta2) * np.square(grad)
            grad_o = corrected_local_rate * m1_o / \
                (np.sqrt(m2_o) + epsilon)
            return (grad_o, m1_o, m2_o)

        self.assertReferenceChecks(gc, op, [grad, m1, m2, lr, iters],
                                   adam, input_device_options)

Here is a fancier example that demonstrates drawing more sophisticated random inputs:

    @given(prediction=hu.arrays(dims=[10, 3],
                                elements=st.floats(allow_nan=False,
                                                   allow_infinity=False,
                                                   min_value=0,
                                                   max_value=1)),
           labels=hu.arrays(dims=[10],
                            dtype=np.int32,
                            elements=st.integers(min_value=0,
                                                 max_value=3 - 1)),
           **hu.gcs)
    def test_accuracy(self, prediction, labels, gc, dc):
        op = core.CreateOperator(
            "Accuracy",
            ["prediction", "labels"],
            ["accuracy"]
        )

        def op_ref(prediction, labels):
            N = prediction.shape[0]
            correct = 0
            max_ids = np.argmax(prediction, axis=1)
            for i in range(0, N):
                if max_ids[i] == labels[i]:
                    correct += 1
            accuracy = correct / N
            return (accuracy,)

        self.assertReferenceChecks(
            device_option=gc,
            op=op,
            inputs=[prediction, labels],
            reference=op_ref)

Don’t forget to contribute: create an Issue describing your operator and linking to your project.