Custom Op

Before writing a custom op, please check the op list documentation to avoid unnecessary duplication.

In MNN, adding an op involves the following steps:

1. Add the op's model description
2. Add the op's model conversion
3. Compute the op's output tensor size
4. Add the op's Execution for the corresponding backends (CPU|Metal|Vulkan|OpenCL|OpenGL)

Add the Op's Model Description

First, add the op's name and its parameter description under the model schema folder, as follows:

1. If the op comes from Caffe, add the parameter description in CaffeOp.fbs; if it comes from TensorFlow, add it in TensorflowOp.fbs. For example:

   ```
   table Pool {
       padX:int;
       padY:int;
       isGlobal:bool=false;
       kernelX:int;
       kernelY:int;
       strideX:int;
       strideY:int;
       type:PoolType;
       padType:PoolPadType;
       dataType:DataType=DT_FLOAT;
   }
   ```

2. Add the op's name to the OpType list in MNN.fbs, then add the parameter description name (i.e. Pool, the name following table) to the OpParameter list.

If the op has no parameters, step 1 above can be skipped; just add the op's name to the OpType list.
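A minimal sketch of the two MNN.fbs additions described above; the surrounding entries are placeholders for the existing lists, not real ones:

```
enum OpType : int {
    // ... existing op types ...
    Pooling,
}

union OpParameter {
    // ... existing parameter tables ...
    Pool,
}
```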

Add the Op's Model Conversion

Conversion is currently supported for the TensorFlow, Caffe, ONNX, and TensorFlow Lite model formats.

TensorFlow Model Conversion

1. Add xxxTf.cpp under the tensorflow directory; see PoolingTf.cpp for a reference implementation:

   ```c++
   // Declare the concrete op converter PoolingTf, inheriting the converter base class tfOpConverter.
   // There are two ways to do this: spell the class out...
   class PoolingTf : public tfOpConverter {
   public:
       virtual void run(MNN::OpT *dstOp, TmpNode *srcNode, TmpGraph *tempGraph);
       PoolingTf() {}
       virtual ~PoolingTf() {}
       virtual MNN::OpType opType();
       virtual MNN::OpParameter type();
   };
   // ...or use the macro:
   // DECLARE_OP_CONVERTER(PoolingTf);
   ```

The run() function parses the model's proto to obtain the parameters and assigns them to the custom flatbuffer parameters. You can call find_attr_value(const tensorflow::NodeDef& node, const char* key, tensorflow::AttrValue& value) to fetch the value of a given attribute.

In virtual void run(MNN::OpT *dstOp, TmpNode *srcNode, TmpGraph *tempGraph), the parameter srcNode holds the input/output node information, and the corresponding TmpNode can be looked up in tempGraph from those input/output nodes.
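A minimal sketch of such a run() implementation; it assumes TmpNode exposes the original NodeDef as srcNode->tfNode, and the NHWC attribute layout is chosen for illustration (see PoolingTf.cpp for the real mapping):

```c++
void PoolingTf::run(MNN::OpT *dstOp, TmpNode *srcNode, TmpGraph *tempGraph) {
    auto pool = new MNN::PoolT;

    tensorflow::AttrValue value;
    // Read the "strides" attribute off the TensorFlow node (NHWC assumed).
    find_attr_value(*srcNode->tfNode, "strides", value);
    pool->strideY = value.list().i(1);
    pool->strideX = value.list().i(2);

    // Hand the filled parameter table to the destination op.
    dstOp->main.value = pool;
}
```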

2. Add the mapping from the TensorFlow op name to the MNN op name in OpMapper.hpp:

   ```c++
   {"MaxPool", MNN::OpType_Pooling},
   {"AvgPool", MNN::OpType_Pooling},
   ```
3. Handling ops with an attached Const:

   - If the Const is treated as a standalone op rather than a parameter of this op, this step can be skipped.
   - If the op should absorb its attached Const as a parameter, modify the function _genMinGraph() in TmpGraph.cpp and set the isCovered attribute of the corresponding Const node to true.

Caffe Model Conversion

1. Add xxx.cpp under the caffe directory (see Pool.cpp for reference) and parse the caffe parameters in the run function.

   In virtual void run(MNN::OpT* dstOp, const caffe::LayerParameter& parameters, const caffe::LayerParameter& weight), the parameter parameters holds the op's parameter information, while weight holds data such as convolution/BN weights.
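A minimal sketch of a Pool converter's run() under that signature; the field mapping is illustrative, see Pool.cpp for the complete version:

```c++
void Pool::run(MNN::OpT* dstOp, const caffe::LayerParameter& parameters,
               const caffe::LayerParameter& weight) {
    auto pool = new MNN::PoolT;
    auto& p   = parameters.pooling_param();

    // Map caffe's PoolingParameter onto the flatbuffer Pool table.
    pool->kernelX  = p.has_kernel_w() ? p.kernel_w() : p.kernel_size();
    pool->kernelY  = p.has_kernel_h() ? p.kernel_h() : p.kernel_size();
    pool->strideX  = p.has_stride_w() ? p.stride_w() : p.stride();
    pool->strideY  = p.has_stride_h() ? p.stride_h() : p.stride();
    pool->padX     = p.has_pad_w() ? p.pad_w() : p.pad();
    pool->padY     = p.has_pad_h() ? p.pad_h() : p.pad();
    pool->isGlobal = p.global_pooling();

    dstOp->main.value = pool;
}
```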

ONNX Model Conversion

1. Add the op converter xxxOnnx.cpp, implementing the following three functions:

   ```c++
   MNN::OpType PoolingOnnx::opType() { return MNN::OpType_Pooling; }
   MNN::OpParameter PoolingOnnx::type() { return MNN::OpParameter_Pool; }

   void PoolingOnnx::run(MNN::OpT* dstOp, const onnx::NodeProto* onnxNode,
                         std::vector<const onnx::TensorProto*> initializers);
   ```

   In PoolingOnnx::run(MNN::OpT* dstOp, const onnx::NodeProto* onnxNode, std::vector<const onnx::TensorProto*> initializers), onnxNode holds the original ONNX node information; weights and other data must be fetched from initializers.
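A minimal sketch of PoolingOnnx::run() pulling attributes off the node; the attribute names follow the ONNX Pool operator, and the mapping shown is illustrative:

```c++
void PoolingOnnx::run(MNN::OpT* dstOp, const onnx::NodeProto* onnxNode,
                      std::vector<const onnx::TensorProto*> initializers) {
    auto pool = new MNN::PoolT;

    // Walk the node's attributes and pick out the ones Pool needs.
    for (int i = 0; i < onnxNode->attribute_size(); ++i) {
        const auto& attr = onnxNode->attribute(i);
        if (attr.name() == "kernel_shape") {
            pool->kernelY = attr.ints(0);
            pool->kernelX = attr.ints(1);
        } else if (attr.name() == "strides") {
            pool->strideY = attr.ints(0);
            pool->strideX = attr.ints(1);
        }
    }

    dstOp->main.value = pool;
}
```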
TensorFlow Lite Model Conversion

1. Add XXXTflite.cpp:

   ```c++
   DECLARE_OP_COVERTER(XXXTflite);

   // The following functions need to be implemented:
   XXXTflite::opType(bool quantizedModel);
   XXXTflite::type(bool quantizedModel);
   XXXTflite::run(MNN::OpT *dstOp, const std::unique_ptr<tflite::OperatorT> &tfliteOp,
                  const std::vector<std::unique_ptr<tflite::TensorT> > &tfliteTensors,
                  const std::vector<std::unique_ptr<tflite::BufferT> > &tfliteModelBuffer,
                  const std::vector<std::unique_ptr<tflite::OperatorCodeT> > &tfliteOpSet,
                  bool quantizedModel)

   // Compared with the TensorFlow interface, these take an extra quantizedModel
   // parameter: if quantizedModel is true, the model is quantized and the op
   // should be converted to the corresponding quantized op; if false, convert
   // to a float op.

   // In run(), the input/output tensor indices need to be set:
   dstOp->inputIndexes.resize(1);
   dstOp->outputIndexes.resize(1);
   dstOp->inputIndexes[0]  = tfliteOp->inputs[0];
   dstOp->outputIndexes[0] = tfliteOp->outputs[0];

   // Register the op converter:
   using namespace tflite;
   REGISTER_CONVERTER(SqueezeTflite, BuiltinOperator_SQUEEZE);
   ```

Compute the Op's Output Tensor Size

1. Compute the output tensor's dimensions from the input tensor's dimensions, and set the output tensor's data type. Inherit the base class SizeComputer and implement the onComputeSize function: return false if the input dimensions are unknown, and true once the computation succeeds. Taking Pooling as an example:

   ```c++
   class PoolSizeComputer : public SizeComputer {
   public:
       virtual bool onComputeSize(const MNN::Op* op, const std::vector<Tensor*>& inputs,
                                  const std::vector<Tensor*>& outputs) const override {
           MNN_ASSERT(1 == inputs.size());
           MNN_ASSERT(1 == outputs.size());
           auto input  = inputs[0];
           auto output = outputs[0];
           ::memcpy(output->buffer().dim, input->buffer().dim,
                    input->buffer().dimensions * sizeof(halide_dimension_t));

           // Pool parameters
           auto layer = op->main_as_Pool();
           int outw   = 1;
           int outh   = 1;
           if (!layer->isGlobal()) {
               int w = input->width();
               int h = input->height();
               if (layer->padX() > 0)
                   w += layer->padX() * 2;
               if (layer->padY() > 0)
                   h += layer->padY() * 2;

               // TensorFlow padding mode SAME
               if (layer->padType() == PoolPadType_SAME) {
                   outw = ceil((float)w / (float)layer->strideX());
                   outh = ceil((float)h / (float)layer->strideY());
               }
               // TensorFlow padding mode VALID
               else if (layer->padType() == PoolPadType_VALID) {
                   outw = ceil((float)(w - layer->kernelX() + 1) / (float)layer->strideX());
                   outh = ceil((float)(h - layer->kernelY() + 1) / (float)layer->strideY());
               } else {
                   outw = UP_DIV(w - layer->kernelX(), layer->strideX()) + 1;
                   outh = UP_DIV(h - layer->kernelY(), layer->strideY()) + 1;
               }
           }
           // Return false when the input information is unknown.
           if (outw <= 0 || outh <= 0) {
               return false;
           }
           // Pooling only changes height and width.
           output->buffer().dim[3].extent = outw;
           output->buffer().dim[2].extent = outh;
           return true;
       }
   };
   // Register the shape computation.
   REGISTER_SHAPE(XXXComputer, OpType_XXX);
   ```

Add the Corresponding Backend's Execution

CPU

Add CPUXXX.hpp and CPUXXX.cpp under the CPU directory.

1. Class declaration: inherit the base class Execution and implement onResize() and onExecute():
   ```c++
   class CPUPool : public Execution {
   public:
       CPUPool(Backend *b, const Pool *parameter);
       virtual ~CPUPool() = default;
       // If onExecute needs a cache, acquire it in this function; otherwise
       // onResize need not be declared.
       virtual ErrorCode onResize(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) override;
       // The op's actual execution function.
       virtual ErrorCode onExecute(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) override;

   private:
       const Pool *mParameter;
       Tensor mCacheLine;
   };
   ```
2. Implementation: in onResize(), call backend()->onAcquireBuffer(&mCacheLine, Backend::DYNAMIC) to acquire the cache and backend()->onReleaseBuffer(&mCacheLine, Backend::DYNAMIC) to release it, so the memory can be reused. In onExecute(), perform the necessary input checks so problems surface early, and return NO_ERROR on success (see the sketch after this list).

3. Implement the CPU op creator and register it:

   ```c++
   class CPUPoolCreator : public CPUBackend::Creator {
   public:
       virtual Execution *onCreate(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs,
                                   const MNN::Op *op, Backend *backend) const override {
           return new CPUPool(backend, op->main_as_Pool());
       }
   };
   REGISTER_CPU_OP_CREATOR(CPUPoolCreator, OpType_Pooling);
   ```
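A minimal sketch of the onResize()/onExecute() pattern from step 2; the cache shape and error code are illustrative, and the pooling loop itself is omitted:

```c++
ErrorCode CPUPool::onResize(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) {
    // Describe the cache tensor, then acquire and immediately release it: the
    // allocator keeps the memory valid for onExecute while still being able
    // to plan reuse across ops.
    mCacheLine.buffer().dimensions    = 1;
    mCacheLine.buffer().dim[0].extent = inputs[0]->width() * 4;
    backend()->onAcquireBuffer(&mCacheLine, Backend::DYNAMIC);
    backend()->onReleaseBuffer(&mCacheLine, Backend::DYNAMIC);
    return NO_ERROR;
}

ErrorCode CPUPool::onExecute(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) {
    // Validate inputs early so problems surface before the heavy work.
    if (inputs.size() != 1 || outputs.size() != 1) {
        return INPUT_DATA_ERROR;
    }
    // ... pooling computation using mCacheLine as scratch space ...
    return NO_ERROR;
}
```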

Metal

Add MetalXXX.hpp and MetalXXX.cpp under the Metal directory.

1. Declaration

Inherit the base class Execution and declare the constructor, destructor, onResize and onExecute functions:

```c++
class MetalPooling : public Execution {
public:
    MetalPooling(Backend *backend, const Pool *pooling);
    virtual ~MetalPooling();
    virtual ErrorCode onResize(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) override;
    virtual ErrorCode onExecute(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) override;

private:
    bool mGlobal;
    PoolType mPoolType;
    int mKernelX;
    int mKernelY;
    int mStrideX;
    int mStrideY;
    int mPadX;
    int mPadY;
    id<MTLBuffer> mConstBuffer;
};
```

2. Implementation

- Unlike CPU tensors, which store their data in the host pointer, Metal keeps its data pointer in deviceId, and the value stored in deviceId is an id<MTLBuffer>:

  ```c++
  auto buffer = (__bridge id<MTLBuffer>)(void *)tensor->deviceId();
  ```
- A Metal op's specific parameters can be stored via an id<MTLBuffer>. The buffer's data type may differ from the tensor's, and a buffer may even mix several data types, as long as the correct length is specified at creation time. For example:

  ```c++
  auto buffer = [context newDeviceBuffer:2 * sizeof(int) + 2 * sizeof(__fp16) access:CPUWriteOnly];
  ((__fp16 *)buffer.contents)[0] = mAlpha / mLocalSize;   // alpha
  ((__fp16 *)buffer.contents)[1] = mBeta;                 // beta
  ((int *)buffer.contents)[1]    = mLocalSize;            // local size
  ((int *)buffer.contents)[2]    = inputs[0]->channel();  // channel
  ```
- When creating a buffer, an access permission must be specified. Three permissions are currently available: CPUReadWrite, where data is shared between CPU and GPU, generally used for device buffers; CPUWriteOnly, where data is written by the CPU and never read back, generally used for parameter buffers (a parameter-buffer sketch follows this list); and CPUTransparent, where data lives only on the GPU, generally used for heap buffers.

- MNNMetalContext offers two similar sets of buffer-creation interfaces, differing only in data lifetime: memory occupied by a device buffer is never reused within a single inference pass, while memory occupied by a heap buffer can be reused by other ops once -[MNNMetalContext releaseHeapBuffer:] has been called. Generally, heap buffers are only used together with CPUTransparent.

Heap buffers only take effect on iOS 10+; on iOS 9 and below they fall back to device buffers.

- When using Metal, do not create your own device or library unless there is a special reason. Loading a library and compiling functions are time-consuming, and MNNMetalContext already caches them as needed. An example of executing Metal through the context:

  ```c++
  auto context   = (__bridge MNNMetalContext *)backend->context();
  auto kernel    = /* metal kernel name NSString */;
  auto encoder   = [context encoder];
  auto bandwidth = [context load:kernel encoder:encoder];
  /* encoder set buffer(s)/sampler(s) */
  [context dispatchEncoder:encoder
                   threads:{x, y, z}
        maxThreadsPerGroup:maxThreadsPerThreadgroup]; // recommended way to dispatch
  [encoder endEncoding];
  ```
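Putting the interfaces above together, a minimal onResize() sketch that packs the pooling parameters into mConstBuffer; the parameter layout here is chosen for illustration and is not the actual MetalPooling implementation:

```c++
ErrorCode MetalPooling::onResize(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) {
    auto metalBackend = static_cast<MetalBackend *>(backend());
    auto context      = (__bridge MNNMetalContext *)metalBackend->context();

    // Parameters are written once by the CPU, so CPUWriteOnly is enough.
    mConstBuffer  = [context newDeviceBuffer:6 * sizeof(int) access:CPUWriteOnly];
    auto contents = (int *)mConstBuffer.contents;
    contents[0] = mKernelX;
    contents[1] = mKernelY;
    contents[2] = mStrideX;
    contents[3] = mStrideY;
    contents[4] = mPadX;
    contents[5] = mPadY;
    return NO_ERROR;
}
```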

3. Implement the Metal op creator and register it:

```c++
class MetalPoolingCreator : public MetalBackend::Creator {
public:
    virtual Execution *onCreate(const std::vector<Tensor *> &inputs, const MNN::Op *op, Backend *backend) const {
        return new MetalPooling(backend, op->main_as_Pool());
    }
};
REGISTER_METAL_OP_CREATOR(MetalPoolingCreator, OpType_Pooling);
```

Vulkan

1. Shader

Add the shader (*.comp) under the glsl directory. Pooling's input memory layout defaults to NC4HW4, so it is implemented with image; other layouts use the buffer implementation.

```glsl
#version 440 core
layout(std140) buffer;
layout(set=0, binding=0, rgba16f) writeonly restrict uniform image3D uOutput;
layout(set=0, binding=1) uniform sampler3D uInput;
layout(set=0, binding=2) uniform constBuffer {
    ivec4 inputSize;
    ivec4 outputSize;
    ivec2 pad;
    ivec2 kernelSize;
    ivec2 stride;
} uConstant;
layout (local_size_x = 8, local_size_y = 8, local_size_z = 1) in;
void main()
{
    ivec3 pos = ivec3(gl_GlobalInvocationID);
    ivec3 outputSize = uConstant.outputSize.xyz;
    ivec2 spos = pos.xy*uConstant.stride-uConstant.pad;
    if (all(lessThan(pos, outputSize)))
    {
        ivec2 inputSizeXY = uConstant.inputSize.xy;
        vec4 color = vec4(-1000000.0);
        ivec2 sfxy = max(ivec2(0), -spos);
        ivec2 efxy = min(uConstant.kernelSize, inputSizeXY-spos);
        for (int fy=sfxy.y; fy<efxy.y; ++fy)
        {
            for (int fx=sfxy.x; fx<efxy.x; ++fx)
            {
                ivec2 spos_ = spos + ivec2(fx, fy);
                color = max(texelFetch(uInput, ivec3(spos_.x, spos_.y, pos.z), 0), color);
            }
        }
        imageStore(uOutput, pos, color);
    }
}
```

Then run the makeshader.py script to compile the shaders.

2. Class declaration

Add VulkanXXX.hpp and VulkanXXX.cpp under the execution directory. The class declaration inherits VulkanBasicExecution:

```c++
class VulkanPool : public VulkanBasicExecution {
public:
    VulkanPool(const Op* op, Backend* bn);
    virtual ~VulkanPool();
    ErrorCode onEncode(const std::vector<Tensor*>& inputs, const std::vector<Tensor*>& outputs,
                       const VulkanCommandPool::Buffer* cmdBuffer) override;

private:
    // Parameters needed by the GPU shader
    std::shared_ptr<VulkanBuffer> mConstBuffer;
    // Pipeline
    const VulkanPipeline* mPoolPipeline;
    const Pool* mCommon;
    // Layout descriptor set
    std::shared_ptr<VulkanPipeline::DescriptorSet> mDescriptorSet;
};
```

3. Implementation

Implement onEncode(). First check the memory layout (with NC4HW4 the shader uses the image implementation, otherwise buffer), and return NO_ERROR once encoding succeeds. A sketch of the flow follows.
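A comment-level sketch of that flow; the step order follows the shader above, while the concrete descriptor and pipeline calls are left as comments since they are backend internals (a sketch, not the actual VulkanPool implementation):

```c++
ErrorCode VulkanPool::onEncode(const std::vector<Tensor*>& inputs, const std::vector<Tensor*>& outputs,
                               const VulkanCommandPool::Buffer* cmdBuffer) {
    // 1. Check the memory layout: NC4HW4 -> image-based shader, otherwise buffer-based.
    // 2. Fill mConstBuffer with inputSize / outputSize / pad / kernelSize / stride,
    //    matching the constBuffer block of the shader above.
    // 3. Write the output image, input sampler, and constant buffer into
    //    mDescriptorSet at bindings 0 / 1 / 2.
    // 4. Bind mPoolPipeline with the descriptor set on cmdBuffer, then dispatch
    //    with group counts derived from the output extents and the shader's
    //    local_size of (8, 8, 1).
    return NO_ERROR;
}
```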

4. Implement the Vulkan execution creator and register it:

```c++
class VulkanPoolCreator : public VulkanBackend::Creator {
public:
    virtual Execution* onCreate(const std::vector<Tensor*>& inputs, const MNN::Op* op,
                                Backend* backend) const override {
        return new VulkanPool(op, backend);
    }
};
static bool gResistor = []() {
    VulkanBackend::addCreator(OpType_Pooling, new VulkanPoolCreator);
    return true;
}();
```

OpenCL

1. Kernel

Add the kernel (*.cl) under the cl directory. Pooling is implemented with image, with memory laid out as (H : batch * height, W : channel/4 * width * channel4); the sketch below makes this mapping concrete, and the kernel follows.
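A hypothetical helper (not part of MNN) that illustrates the layout: each RGBA pixel packs 4 channels, so a tensor coordinate (n, c, h, w) maps to a 2D image coordinate as follows.

```c++
#include <utility>

// Map a tensor coordinate (n, c, h, w) to the image2d coordinate used by the
// pooling kernel below; 4 channels are packed into each RGBA pixel.
static inline std::pair<int, int> nc4hw4ImageCoord(int n, int c, int h, int w,
                                                   int height, int width) {
    int x = (c / 4) * width + w; // W : channel/4 * width
    int y = n * height + h;      // H : batch * height
    return {x, y};
}
```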

```c
__kernel void pooling(GLOBAL_SIZE_3_DIMS __read_only image2d_t input, __private const int in_height,
                      __private const int in_width, __private const int out_height, __private const int pad_top,
                      __private const int pad_left, __private const int stride_h, __private const int stride_w,
                      __private const int pooling_size_h, __private const int pooling_size_w,
                      __write_only image2d_t output) {
    const int out_channel_idx = get_global_id(0);
    const int out_width_idx   = get_global_id(1);
    const int out_hb_idx      = get_global_id(2);
    if (out_channel_idx >= global_size_dim0 || out_width_idx >= global_size_dim1 || out_hb_idx >= global_size_dim2) {
        return;
    }
    const int out_width = global_size_dim1;

    const int n_b = out_hb_idx / out_height;
    const int mod_b = out_hb_idx - mul24(n_b, out_height);
    const int batch_idx = mul24(n_b, in_height);
    const int in_height_start = mad24(mod_b, stride_h, -pad_top);
    const int in_width_start  = mad24(out_width_idx, stride_w, -pad_left);
    const int in_channel_offset = mul24(out_channel_idx, in_width);

    DATA_TYPE4 res = (DATA_TYPE4)(MIN_VALUE);
    for (int height = 0; height < pooling_size_h; ++height) {
        int in_height_idx = in_height_start + height;
        in_height_idx = select(batch_idx + in_height_idx, -1, (in_height_idx < 0 || in_height_idx >= in_height));
        if (in_height_idx != -1) {
            for (int width = 0; width < pooling_size_w; ++width) {
                int in_width_idx = in_width_start + width;
                in_width_idx =
                    select(in_channel_offset + in_width_idx, -1, (in_width_idx < 0 || in_width_idx >= in_width));
                if (in_width_idx != -1) {
                    DATA_TYPE4 in = READ_IMAGE(input, SAMPLER, (int2)(in_width_idx, in_height_idx));
                    res = MNN_MAX(res, in);
                }
            }
        }
    }
    const int pos = mad24(out_channel_idx, out_width, out_width_idx);
    WRITE_IMAGE(output, (int2)(pos, out_hb_idx), res);
}
```

Then execute opencl_codegen.py to generate the string map corresponding to the kernel.

Note: macros used in the kernel:

a. GLOBAL_SIZE_3_DIMS: corresponds to the specified global work group size.
b. READ_IMAGE / WRITE_IMAGE: read from and write to an image.
c. DATA_TYPE: the specified data type (float/half/int32).

2. Class declaration

Add xxx.h and XXX.cpp under the execution directory. The class declaration inherits Execution, as follows:

```c++
template <typename T>
class PoolOp : public Execution {
public:
    PoolOp(const std::vector<Tensor *> &inputs, const MNN::Op *op, Backend *backend);
    virtual ~PoolOp() = default;
    virtual ErrorCode onResize(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) override;
    virtual ErrorCode onExecute(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) override;
    bool buildPoolingKernel();
    std::vector<uint32_t> poolLocalWS(const std::vector<uint32_t> &gws, const uint32_t maxWorkGroupSize);

private:
    const Pool *mPoolParams;
    PoolType mPoolType;
    PoolPadType mPadType;
    std::vector<int> mStrides{1, 1};
    std::vector<int> mKernels{1, 1};
    std::vector<int> mPaddings{0, 0};
    std::vector<int> mDilations{1, 1};
    cl::Kernel mKernel;
    uint32_t mMaxWorkGroupSize;
    std::vector<int> mInputShape;
    OpenCLBackend *mOpenCLBackend;
};
```

3. Implementation

Implement onResize() (optional) and onExecute(), returning NO_ERROR on success; a sketch follows.
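A minimal sketch of an onExecute() for the kernel above; the setArg order mirrors the kernel signature, while openCLImage() and run3DKernelDefault() are assumed backend helpers, not confirmed API:

```c++
template <typename T>
ErrorCode PoolOp<T>::onExecute(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) {
    auto input  = inputs[0];
    auto output = outputs[0];

    // Global work size (channel/4, out_width, batch * out_height), matching
    // the kernel's get_global_id() usage above.
    std::vector<uint32_t> gws = {
        (uint32_t)UP_DIV(output->channel(), 4),
        (uint32_t)output->width(),
        (uint32_t)(output->batch() * output->height()),
    };

    uint32_t idx = 0;
    for (auto size : gws) { // GLOBAL_SIZE_3_DIMS expands to three int arguments
        mKernel.setArg(idx++, size);
    }
    mKernel.setArg(idx++, openCLImage(input));
    mKernel.setArg(idx++, (int)input->height());
    mKernel.setArg(idx++, (int)input->width());
    mKernel.setArg(idx++, (int)output->height());
    mKernel.setArg(idx++, mPaddings[0]); // pad_top
    mKernel.setArg(idx++, mPaddings[1]); // pad_left
    mKernel.setArg(idx++, mStrides[0]);  // stride_h
    mKernel.setArg(idx++, mStrides[1]);  // stride_w
    mKernel.setArg(idx++, mKernels[0]);  // pooling_size_h
    mKernel.setArg(idx++, mKernels[1]);  // pooling_size_w
    mKernel.setArg(idx++, openCLImage(output));

    auto lws = poolLocalWS(gws, mMaxWorkGroupSize);
    run3DKernelDefault(mKernel, gws, lws, mOpenCLBackend->getOpenCLRuntime());
    return NO_ERROR;
}
```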

4. Implement the OpenCL op creator and register it, as follows:

```c++
OpenCLCreatorRegister<TypedCreator<PoolOp<cl_data_t>>> __Pool_op(OpType_Pooling);
```

OpenGL

1. Shader

2. Executor

Add the executor under OpenGL/execution/; see GLPool.h and GLPool.cpp for reference.

3. Registration

OpenGL does not use the abstract-factory scheme; instead, modify GLBackend.cpp under the OpenGL directory, as follows:

```c++
switch (op->type()) {
    case OpType_Convolution:
        exe = new GLConvolution(op, this);
        break;
    case OpType_Pooling:
        exe = new GLPool(op->main_as_Pool(), this);
        break;
    case OpType_Eltwise:
        exe = new GLEltwise(op->main_as_Eltwise()->type(), inputs.size(), this);
        break;
    /* add the new op's case after these */
}
```