Extending Theano with a GPU Op

Note

This covers the gpuarray back-end for the GPU.

This tutorial covers how to extend Theano with an op that offers a GPU implementation. It assumes you are familiar with how to write new Theano ops. If that is not the case you should probably follow the Creating a new Op: Python implementation and Extending Theano with a C Op sections before continuing on.

Writing a new GPU op can be done in Python for some simple tasks, but will usually be done in C to access the complete API and avoid paying the overhead of a Python function call.
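For illustration, here is a minimal sketch of a Python-only GPU op (the class name is hypothetical, and the elementwise addition assumes pygpu's operator support); it combines the make_node and perform pieces explained in the sections that follow:

from theano.gof import Apply, Op
from theano.gpuarray.basic_ops import (infer_context_name,
                                       as_gpuarray_variable)


class GpuPlusOne(Op):
    __props__ = ()

    def make_node(self, x):
        ctx = infer_context_name(x)
        x = as_gpuarray_variable(x, ctx)
        return Apply(self, [x], [x.type()])

    def perform(self, node, inputs, output_storage):
        x, = inputs
        # pygpu arrays support basic elementwise arithmetic on the GPU.
        output_storage[0][0] = x + 1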

Dealing With the Context

One of the major differences with GPU ops is that they require a context (a.k.a. device) to execute. Most of the time you can infer the context to run on from your inputs. There is a way for the user to transfer things between contexts and to tag certain variables for transfer. It might also be the case that your inputs are not all from the same context and you would have to choose which one to run on.

In order to support all of those options and have a consistent interface, theano.gpuarray.basic_ops.infer_context_name() was written. An example usage is below:

def make_node(self, a, b, c):
    ctx = infer_context_name(a, b, c)
    a = as_gpuarray_variable(a, ctx)
    b = as_gpuarray_variable(b, ctx)
    c = as_gpuarray_variable(c, ctx)
    return Apply(self, [a, b, c], [a.type()])

In this example the Op takes three inputs, all on the GPU. In case one or more of your inputs is not supposed to be on the GPU, you should not pass it to infer_context_name() or call as_gpuarray_variable() on it.
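For instance, here is a sketch of a make_node for a hypothetical op whose last input is a plain CPU scalar:

def make_node(self, a, b, alpha):
    # Only the GPU inputs participate in context inference.
    ctx = infer_context_name(a, b)
    a = as_gpuarray_variable(a, ctx)
    b = as_gpuarray_variable(b, ctx)
    # alpha is not supposed to be on the GPU; just make it symbolic.
    alpha = tensor.as_tensor_variable(alpha)
    return Apply(self, [a, b, alpha], [a.type()])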

Also note that theano.gpuarray.basic_ops.as_gpuarray_variable() takes context_name as a mandatory parameter. This is because it’s not enough to know you want the value to be on the GPU, you also want to know which GPU to put it on. In almost all cases, you can pass in the return value of infer_context_name() there.

If you also need the context during runtime (for example to allocate the output), you can use the context of one of your inputs to know which one to use. Here is another example:

def perform(self, node, inputs, output_storage):
    A, B = inputs
    C, = output_storage
    C[0] = pygpu.empty([A.shape[0], B.shape[1]], dtype=A.dtype,
                       context=A.context)
    pygpu.blas.gemm(1, A, B, 0, C[0], overwrite_c=True)

Finally if you require the context before perform, such as during make_thunk() to initialize kernels and such, you can access the context of your inputs through the type of the variables:

def make_thunk(self, node, storage_map, compute_map, no_recycling):
    ctx = node.inputs[0].type.context

Note that GpuArrayType objects also have a context_name attribute, which is the symbolic equivalent of context. It can’t be used for calls to pygpu or libgpuarray, but it should be used for Theano operations and variables.
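For example (a sketch; the dtype and shapes are illustrative):

ctx_name = node.inputs[0].type.context_name  # symbolic: for Theano types/ops
ctx = node.inputs[0].type.context            # concrete: for pygpu calls

otype = GpuArrayType(dtype='float32', broadcastable=(False, False),
                     context_name=ctx_name)  # Theano-level use
buf = pygpu.empty((2, 2), dtype='float32',
                  context=ctx)               # pygpu-level use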

The last place where you might need the context is in the C initialization code. For that you will have to use the params. The params type should be theano.gpuarray.type.gpu_context_type and the params object should be a context object from one of your input variables:

def get_params(self, node):
    return node.inputs[0].type.context
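A minimal sketch of how the two pieces fit together on a hypothetical op class:

from theano.gpuarray.type import gpu_context_type


class SomeGpuOp(Op):
    # The context will be handed to the C code as the params object.
    params_type = gpu_context_type

    def get_params(self, node):
        return node.inputs[0].type.context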

If you don’t have any input variables on the GPU you can follow the example of GpuFromHost or GpuEye. This is not a case that you should encounter often, so it will not be covered further.

Defining New Kernels

If your op needs to do some transformation on the data, chances are that you will need to write a new kernel. The best way to do this is to leverage GpuKernelBase (or CGpuKernelBase if you want to use the COp functionality).

For plain GpuKernelBase, you have to define a method called gpu_kernels which returns a list of Kernel objects. You can define as many kernels as you want for a single op. An example would look like this:

def gpu_kernels(self, node, name):
    code = """
KERNEL void k(GLOBAL_MEM ga_double *a, ga_size n, ga_size m) {
    ga_size nb = n < m ? n : m;
    for (ga_size i = LID_0; i < nb; i += LDIM_0) {
        a[i*m + i] = 1;
    }
}"""
    return [Kernel(
            code=code, name="k",
            params=[gpuarray.GpuArray, gpuarray.SIZE, gpuarray.SIZE],
            flags=Kernel.get_flags('float64'))]

If you want to use COp, then you should use CGpuKernelBase instead. It adds a new section to the parsed files whose tag is kernels. Inside that section you can define some kernels with #kernel name:params:flags.

Here name is the name of the kernel function in the following code, and params is a comma-separated list of numpy typecode names. There are three exceptions: size_t, which should be noted as size; ssize_t, which should be noted as ssize; and a pointer, which should be noted as *.

flags is a |-separated list of C kernel flag values (can be empty). The same kernel definition as above would look like this with CGpuKernelBase:

#section kernels

#kernel k : *, size, size : GA_USE_DOUBLE

KERNEL void k(GLOBAL_MEM ga_double *a, ga_size n, ga_size m) {
    ga_size nb = n < m ? n : m;
    for (ga_size i = LID_0; i < nb; i += LDIM_0) {
        a[i*m + i] = 1;
    }
}

The second method is to handle the kernel compilation and cache on your own. This is not recommended because there are lots of details to pay attention to that can cripple your performance if not done right, which GpuKernelBase handles for you. But if you really want to go this way, then you can look up the C API for kernels in libgpuarray.

In any case you will need to call your compiled kernel with some data, in most cases in your c_code() method. This is done by using the provided wrapper function. An example calling the above kernel would be:

size_t dims[2];
size_t n = 256;

// ...

err = k_scall(1, &n, 0, input->ga.data, dims[0], dims[1]);

// ...

If you want explicit control over the scheduling, you can use the _call wrapper instead, which works like this:

size_t ls, gs;

// ...

gs = 1;
ls = 256;
err = k_call(1, &gs, &ls, 0, input->ga.data, dims[0], dims[1]);

The name of the wrapper function depends on the name you passed to Kernel() when you declared it (or the name in your #kernel statement). It defaults to name + ‘_call’ or ‘_scall’.

For other operations in the C code you should refer to the libgpuarray documentation.

Dealing with float16

To support limited-precision storage in a kernel you have to be careful to load values properly, declare working memory in float32 and write results properly. To help with that, some functions have been declared in theano.gpuarray.fp16_help.

To load the inputs you should wrap the read with the function returned by load_w(). Similarly, writes should be wrapped in the function returned by write_w(). Finally, working data should have the type returned by work_dtype().

Here is a +1 kernel that is not ready to deal with float16 input:

type_x = dtype_to_ctype(x.dtype)
type_y = dtype_to_ctype(y.dtype)
"""
KERNEL void k(const ga_size n, %(type_x)s *x, %(type_y)s *y) {
    ga_size i = GID_0 * LDIM_0 + LID_0;
    %(type_x)s z = x[i];
    z += 1;
    y[i] = z;
}
""" % dict(type_x=type_x, type_y=type_y)

Here is the same kernel, but now ready to handle float16:

type_x = dtype_to_ctype(x.dtype)
type_y = dtype_to_ctype(y.dtype)
work_x = dtype_to_ctype(work_dtype(x.dtype))
load_x = load_w(x.dtype)
write_y = write_w(y.dtype)
"""
KERNEL void k(const ga_size n, %(type_x)s *x, %(type_y)s *y) {
    ga_size i = GID_0 * LDIM_0 + LID_0;
    %(work_x)s z = %(load_x)s(x[i]);
    z += 1;
    y[i] = %(write_y)s(z);
}
""" % dict(type_x=type_x, type_y=type_y, work_x=work_x, load_x=load_x,
           write_y=write_y)

Once you have converted your kernels for float16 support you need to tag your op with _f16_ok = True so that the linker will accept to generate C code on float16 types. This is done by inserting it as a class property like this:

class SomeOp(Op):
    _f16_ok = True

If this attribute is not present or is False, the linker will print a message saying that it’s refusing to use C code for float16 for the op.

A Complete Example

This is a complete example using both approaches for an implementation of the Eye operation.

GpuKernelBase

Python File

class GpuEye(GpuKernelBase, Op):
    """
    Eye for GPU.

    """
    __props__ = ('dtype', 'context_name')
    _f16_ok = True

    def __init__(self, dtype=None, context_name=None):
        if dtype is None:
            dtype = config.floatX
        self.dtype = dtype
        self.context_name = context_name

    def get_params(self, node):
        return get_context(self.context_name)

    def make_node(self, n, m, k):
        n = tensor.as_tensor_variable(n)
        m = tensor.as_tensor_variable(m)
        k = tensor.as_tensor_variable(k)
        assert n.ndim == 0
        assert m.ndim == 0
        assert k.ndim == 0
        otype = GpuArrayType(dtype=self.dtype,
                             broadcastable=(False, False),
                             context_name=self.context_name)

        return Apply(self, [n, m, k], [otype()])

    def infer_shape(self, node, in_shapes):
        out_shape = [node.inputs[0], node.inputs[1]]
        return [out_shape]

    def grad(self, inp, grads):
        return [grad_undefined(self, i, inp[i])
                for i in xrange(3)]

    def gpu_kernels(self, node, name):
        code = """#include "cluda.h"

KERNEL void eye(GLOBAL_MEM %(ctype)s *a, ga_size a_off,
                ga_size n, ga_size m, ga_ssize k) {
    a = (GLOBAL_MEM %(ctype)s *)(((GLOBAL_MEM char *)a) + a_off);
    ga_ssize coff = max(k, (ga_ssize) 0);
    ga_ssize roff = -min(k, (ga_ssize) 0);
    ga_size nb = (ga_size) min(n - roff, m - coff);
    for (ga_size i = LID_0; i < nb; i += LDIM_0) {
        a[(i + roff)*m + i + coff] = %(write_a)s(1);
    }
}""" % dict(ctype=pygpu.gpuarray.dtype_to_ctype(self.dtype),
            name=name, write_a=write_w(self.dtype))
        return [Kernel(
                code=code, name="eye",
                params=[gpuarray.GpuArray, gpuarray.SIZE, gpuarray.SIZE,
                        gpuarray.SIZE, gpuarray.SSIZE],
                flags=Kernel.get_flags(self.dtype),
                objvar='k_eye_' + name)]

    def c_code(self, node, name, inp, out, sub):
        if len(inp) == 2:
            n, m = inp
            k = 0
        elif len(inp) == 3:
            n, m, k = inp

        z, = out
        fail = sub['fail']
        ctx = sub['params']
        typecode = pygpu.gpuarray.dtype_to_typecode(self.dtype)
        kname = self.gpu_kernels(node, name)[0].objvar
        s = """
        size_t dims[2] = {0, 0};
        size_t ls, gs;
        ssize_t k;
        size_t col_off;
        size_t row_off;
        int err;

        dims[0] = ((dtype_%(n)s*)PyArray_DATA(%(n)s))[0];
        dims[1] = ((dtype_%(m)s*)PyArray_DATA(%(m)s))[0];
        k = ((dtype_%(k)s*)PyArray_DATA(%(k)s))[0];

        Py_CLEAR(%(z)s);

        %(z)s = pygpu_zeros(2, dims,
                            %(typecode)s,
                            GA_C_ORDER,
                            %(ctx)s, Py_None);
        if (%(z)s == NULL) {
            %(fail)s
        }

        ls = 1;
        gs = 256;
        col_off = (size_t) (k > 0?k:0);
        row_off = (size_t) (k < 0?-k:0);
        if (row_off < dims[0] && col_off < dims[1]) {
            err = eye_call(1, &gs, &ls, 0, %(z)s->ga.data, %(z)s->ga.offset,
                           dims[0], dims[1], k);
            if (err != GA_NO_ERROR) {
                PyErr_Format(PyExc_RuntimeError,
                             "gpuarray error: kEye: %%s. n=%%lu, m=%%lu.",
                             GpuKernel_error(&%(kname)s, err),
                             (unsigned long)dims[0], (unsigned long)dims[1]);
                %(fail)s;
            }
        }

        """ % locals()

        return s

    def c_code_cache_version(self):
        return (10,)

CGpuKernelBase

Python File

class GpuEye(CGpuKernelBase, Op):
    """
    Eye for GPU.

    """
    __props__ = ('dtype', 'context_name')
    params_type = ParamsType(typecode=int_t, context=gpu_context_type)

    def __init__(self, dtype=None, context_name=None):
        if dtype is None:
            dtype = config.floatX
        self.dtype = dtype
        self.context_name = context_name
        CGpuKernelBase.__init__(self, ['c_code/tstgpueye.c'],
                                'APPLY_SPECIFIC(tstgpueye)')

    def get_params(self, node):
        from pygpu.gpuarray import dtype_to_typecode
        return self.params_type.get_params(typecode=dtype_to_typecode(self.dtype),
                                           context=get_context(self.context_name))

    def c_headers(self):
        return ['<gpuarray/types.h>', '<gpuarray/kernel.h>']

    def make_node(self, n, m):
        n = tensor.as_tensor_variable(n)
        m = tensor.as_tensor_variable(m)
        assert n.ndim == 0
        assert m.ndim == 0
        otype = GpuArrayType(dtype=self.dtype,
                             broadcastable=(False, False),
                             context_name=self.context_name)

        return Apply(self, [n, m], [otype()])

    def infer_shape(self, node, in_shapes):
        out_shape = [node.inputs[0], node.inputs[1]]
        return [out_shape]

    def grad(self, inp, grads):
        return [grad_undefined(self, i, inp[i])
                for i in xrange(2)]

tstgpueye.c

#section kernels

#kernel eye : *, size, size, size :
#include <cluda.h>
/* The eye name will be used to generate supporting objects. The only
   one you probably need to care about is the kernel object which will be
   named 'k_' + <the name above> (k_eye in this case). This name also
   has to match the kernel function name below.
*/

KERNEL void eye(GLOBAL_MEM DTYPE_OUTPUT_0 *a, ga_size a_off, ga_size n, ga_size m) {
    a = (GLOBAL_MEM DTYPE_OUTPUT_0 *)(((GLOBAL_MEM char *)a) + a_off);
    ga_size nb = n < m ? n : m;
    for (ga_size i = LID_0; i < nb; i += LDIM_0) {
        a[i*m + i] = 1;
    }
}

#section support_code_struct

int APPLY_SPECIFIC(tstgpueye)(PyArrayObject *n, PyArrayObject *m,
                              PyGpuArrayObject **z, PARAMS_TYPE* params) {
    size_t dims[2] = {0, 0};
    size_t ls, gs;
    void *args[3];
    int err;

    dims[0] = ((DTYPE_INPUT_0 *)PyArray_DATA(n))[0];
    dims[1] = ((DTYPE_INPUT_1 *)PyArray_DATA(m))[0];

    Py_XDECREF(*z);
    *z = pygpu_zeros(2, dims,
                     params->typecode,
                     GA_C_ORDER,
                     params->context, Py_None);
    if (*z == NULL)
        return -1;

    ls = 1;
    gs = 256;
    /* The eye_call name comes from the kernel declaration above. */
    err = eye_call(1, &gs, &ls, 0, (*z)->ga.data, (*z)->ga.offset, dims[0], dims[1]);
    if (err != GA_NO_ERROR) {
        PyErr_Format(PyExc_RuntimeError,
                     "gpuarray error: kEye: %s. n=%lu, m=%lu.",
                     GpuKernel_error(&k_eye, err),
                     (unsigned long)dims[0], (unsigned long)dims[1]);
        return -1;
    }
    return 0;
}

Wrapping Existing Libraries

PyCUDA

For things in PyCUDA (or things wrapped with PyCUDA), we usually need to create a PyCUDA context. This can be done with the following code:

with gpuarray_cuda_context:
    pycuda_context = pycuda.driver.Context.attach()

If you don’t need to create a context, because the library doesn’t require it, you can also just use the pygpu context and a with statement like above for all your code, which will make the context the current context on the CUDA stack.
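For example (a sketch; the library call is a hypothetical placeholder):

with gpuarray_cuda_context:
    # Inside the block the pygpu context is current on the CUDA stack,
    # so a library that uses the current CUDA context can be called directly.
    result = some_cuda_library_call(pycuda_array)  # hypothetical call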

GpuArray objects are compatible with PyCUDA and will expose the necessary interface so that they can be used in most things. One notable exception is PyCUDA kernels, which require native objects. If you need to convert a pygpu GpuArray to a PyCUDA GPUArray, this code should do the trick:

assert pygpu_array.flags['IS_C_CONTIGUOUS']
pycuda_array = pycuda.gpuarray.GPUArray(pygpu_array.shape,
                                        pygpu_array.dtype,
                                        base=pygpu_array,
                                        gpudata=(pygpu_array.gpudata +
                                                 pygpu_array.offset))

As long as the computations happen on the NULL stream there are no special considerations to watch for with regards to synchronization. Otherwise, you will have to make sure that you synchronize the pygpu objects by calling the .sync() method before scheduling any work, and synchronize with the work that happens in the library after all the work is scheduled.
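A sketch of that protocol, assuming the library schedules its work on a non-NULL PyCUDA stream that it exposes (the library object and its attributes are hypothetical):

pygpu_array.sync()             # wait for pending pygpu work on this array
library.do_work(pycuda_array)  # hypothetical call scheduling GPU work
library.stream.synchronize()   # wait for the library's work before reuse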