Using the GPU
For an introductory discussion of Graphics Processing Units (GPUs) and their use for intensive parallel computation, see GPGPU.
One of Theano’s design goals is to specify computations at an abstract level, so that the internal function compiler has a lot of flexibility about how to carry out those computations. One of the ways we take advantage of this flexibility is in carrying out calculations on a graphics card.
Using the GPU in Theano is as simple as setting the device configuration flag to device=cuda. You can optionally target a specific GPU by specifying its number, e.g. device=cuda2. Setting the floating-point precision to float32 is also encouraged when working on the GPU, as that is usually much faster. For example: THEANO_FLAGS='device=cuda,floatX=float32'. You can also set these options in the [global] section of the .theanorc file:
- [global]
- device = cuda
- floatX = float32
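If you would rather generate that file from a script, the following is a minimal sketch using Python's standard configparser module. The helper name render_theanorc is hypothetical (it is not part of Theano), and it only renders the text; writing the result to your .theanorc file is left to the caller.

```python
import configparser
import io


def render_theanorc(device="cuda", floatX="float32"):
    """Return the text of a .theanorc file with a [global] section.

    render_theanorc is a hypothetical helper, not part of Theano.
    """
    parser = configparser.ConfigParser()
    # Keep option names case-sensitive so that floatX is not lowercased.
    parser.optionxform = str
    parser["global"] = {"device": device, "floatX": floatX}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


print(render_theanorc())
```

Passing device="cuda0" instead would pin the configuration to the first GPU.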
Note
- If your computer has multiple GPUs and you use device=cuda, the driver selects the one to use (usually cuda0).
- You can use the program nvidia-smi to change this policy.
- By default, when device indicates a preference for GPU computations, Theano will fall back to the CPU if there is a problem with the GPU. You can use the flag force_device=True to instead raise an error when Theano cannot use the GPU.
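The same flags can also be set from Python through the THEANO_FLAGS environment variable, as long as this happens before Theano is imported. The sketch below assumes that approach; build_theano_flags is a hypothetical helper introduced only for illustration.

```python
import os


def build_theano_flags(**options):
    """Join keyword options into a THEANO_FLAGS string (hypothetical helper)."""
    return ",".join("%s=%s" % (name, value) for name, value in options.items())


flags = build_theano_flags(device="cuda", floatX="float32", force_device=True)
# The variable must be set before Theano is imported for it to take effect.
os.environ["THEANO_FLAGS"] = flags
print(flags)
```

With force_device=True in the string, a missing or broken GPU should raise an error instead of silently falling back to the CPU.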
GpuArray Backend
If you have not done so already, you will need to install libgpuarray as well as at least one computing toolkit (CUDA or OpenCL). Detailed instructions to accomplish that are provided at libgpuarray.
To install Nvidia’s GPU-programming toolchain (CUDA) and configure Theano to use it, see the installation instructions for Linux, MacOS and Windows.
While all types of devices are supported if using OpenCL, for the remainder of this section, whatever compute device you are using will be referred to as GPU.
Note
The GpuArray backend uses config.gpuarray.preallocate for GPU memory allocation.
Warning
The backend was designed to support OpenCL; however, current support is incomplete. A lot of very useful ops still do not support it because they were ported from the old backend with minimal change.
Testing Theano with GPU
To see if your GPU is being used, cut and paste the following program into a file and run it.
Use the Theano flag device=cuda to require the use of the GPU. Use the flag device=cuda{0,1,…} to specify which GPU to use.
- from theano import function, config, shared, tensor
- import numpy
- import time
- vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
- iters = 1000
- rng = numpy.random.RandomState(22)
- x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
- f = function([], tensor.exp(x))
- print(f.maker.fgraph.toposort())
- t0 = time.time()
- for i in range(iters):
-     r = f()
- t1 = time.time()
- print("Looping %d times took %f seconds" % (iters, t1 - t0))
- print("Result is %s" % (r,))
- if numpy.any([isinstance(x.op, tensor.Elemwise) and
-               ('Gpu' not in type(x.op).__name__)
-               for x in f.maker.fgraph.toposort()]):
-     print('Used the cpu')
- else:
-     print('Used the gpu')
The program just computes exp() of a bunch of random numbers. Note that we use the theano.shared() function to make sure that the input x is stored on the GPU.
- $ THEANO_FLAGS=device=cpu python gpu_tutorial1.py
- [Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
- Looping 1000 times took 2.271284 seconds
- Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
- 1.62323285]
- Used the cpu
- $ THEANO_FLAGS=device=cuda0 python gpu_tutorial1.py
- Using cuDNN version 5105 on context None
- Mapped name None to device cuda0: GeForce GTX 750 Ti (0000:07:00.0)
- [GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
- Looping 1000 times took 1.697514 seconds
- Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
- 1.62323285]
- Used the gpu
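To read these two runs side by side, the short sketch below parses the timing lines printed above and works out the per-call time and the resulting speedup. The parse_timing helper is hypothetical; the numbers are copied from the output shown here. Note that both runs used float64, as the printed graphs indicate.

```python
import re

cpu_line = "Looping 1000 times took 2.271284 seconds"
gpu_line = "Looping 1000 times took 1.697514 seconds"


def parse_timing(line):
    """Return (iterations, seconds) from a 'Looping N times took T seconds' line."""
    match = re.match(r"Looping (\d+) times took ([\d.]+) seconds", line)
    return int(match.group(1)), float(match.group(2))


cpu_iters, cpu_seconds = parse_timing(cpu_line)
gpu_iters, gpu_seconds = parse_timing(gpu_line)
speedup = (cpu_seconds / cpu_iters) / (gpu_seconds / gpu_iters)
print("CPU: %.3f ms per call" % (1000 * cpu_seconds / cpu_iters))
print("GPU: %.3f ms per call" % (1000 * gpu_seconds / gpu_iters))
print("Speedup: %.2fx" % speedup)
```

For these figures the speedup comes out at roughly 1.3x, a modest gain for a float64 computation that also copies its result back to the host on every call.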
Returning a Handle to Device-Allocated Data
By default, functions that execute on the GPU still return a standard numpy ndarray. A transfer operation is inserted just before the results are returned to ensure a consistent interface with CPU code. This allows changing the device some code runs on by only replacing the value of the device flag without touching the code.
If you don’t mind a loss of flexibility, you can ask Theano to return the GPU object directly. The following code is modified to do just that.
- from theano import function, config, shared, tensor
- import numpy
- import time
- vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
- iters = 1000
- rng = numpy.random.RandomState(22)
- x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
- f = function([], tensor.exp(x).transfer(None))
- print(f.maker.fgraph.toposort())
- t0 = time.time()
- for i in range(iters):
-     r = f()
- t1 = time.time()
- print("Looping %d times took %f seconds" % (iters, t1 - t0))
- print("Result is %s" % (numpy.asarray(r),))
- if numpy.any([isinstance(x.op, tensor.Elemwise) and
-               ('Gpu' not in type(x.op).__name__)
-               for x in f.maker.fgraph.toposort()]):
-     print('Used the cpu')
- else:
-     print('Used the gpu')
Here tensor.exp(x).transfer(None) means “copy exp(x) to the GPU”, with None the default GPU context when not explicitly given. For information on how to set GPU contexts, see Using multiple GPUs.
The output is:
- $ THEANO_FLAGS=device=cuda0 python gpu_tutorial2.py
- Using cuDNN version 5105 on context None
- Mapped name None to device cuda0: GeForce GTX 750 Ti (0000:07:00.0)
- [GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float64, (False,))>)]
- Looping 1000 times took 0.040277 seconds
- Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
- 1.62323285]
- Used the gpu
While the time per call appears to be much lower than the two previous invocations (and should indeed be lower, since we avoid a transfer), the massive speedup we obtained is in part due to the asynchronous nature of execution on GPUs, meaning that the work isn’t completed yet, just ‘launched’. We’ll talk about that later.
The object returned is a GpuArray from pygpu. It mostly acts as a numpy ndarray with some exceptions due to its data being on the GPU. You can copy it to the host and convert it to a regular ndarray by using usual numpy casting such as numpy.asarray().
For even more speed, you can play with the borrow flag. See Borrowing when Constructing Function Objects.
What Can be Accelerated on the GPU
The performance characteristics will of course vary from device to device, and also as we refine our implementation:
- In general, matrix multiplication, convolution, and large element-wise operations can be accelerated a lot (5-50x) when arguments are large enough to keep 30 processors busy.
- Indexing, dimension-shuffling and constant-time reshaping will be equally fast on GPU as on CPU.
- Summation over rows/columns of tensors can be a little slower on the GPU than on the CPU.
- Copying of large quantities of data to and from a device is relatively slow, and often cancels most of the advantage of one or two accelerated functions on that data. Getting GPU performance largely hinges on making data transfer to the device pay off.
The backend supports all regular Theano data types (float32, float64, int, …); however, GPU support varies and some units can’t deal with double (float64) or small (less than 32 bits, like int16) data types. You will get an error at compile time or runtime if this is the case.
By default all inputs will get transferred to the GPU. You can prevent an input from getting transferred by setting its tag.target attribute to 'cpu'.
Complex support is untested and most likely completely broken.
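To put the transfer and data type points of this section into numbers, the sketch below computes how many bytes the vector from the test program above occupies in float32 and in float64. It only does arithmetic on element sizes and makes no claim about actual transfer speed, which depends on your hardware.

```python
import struct

vlen = 10 * 30 * 768  # same vector length as the test program above

# struct reports the size in bytes of a C float (float32) and double (float64).
bytes_per_float32 = struct.calcsize("f")
bytes_per_float64 = struct.calcsize("d")

for name, itemsize in (("float32", bytes_per_float32),
                       ("float64", bytes_per_float64)):
    total = vlen * itemsize
    print("%s: %d elements -> %d bytes (%.2f MiB)"
          % (name, vlen, total, total / 2.0 ** 20))
```

Using float32 halves the amount of data that has to cross between host and device for the same number of elements.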
Tips for Improving Performance on GPU
- Consider adding floatX=float32 (or the type you are using) to your .theanorc file if you plan to do a lot of GPU work.
- The GPU backend supports float64 variables, but they are still slower to compute than float32. The more float32, the better GPU performance you will get.
- Prefer constructors like matrix, vector and scalar (which follow the type set in floatX) to dmatrix, dvector and dscalar. The latter enforce double precision (float64 on most machines), which slows down GPU computations on current hardware.
- Minimize transfers to the GPU device by using shared variables to store frequently-accessed data (see shared()). When using the GPU, tensor shared variables are stored on the GPU by default to eliminate transfer time for GPU ops using those variables.
- If you aren’t happy with the performance you see, try running your script with the profile=True flag. This should print some timing information at program termination. Is time being used sensibly? If an op or Apply is taking more time than its share, then if you know something about GPU programming, have a look at how it’s implemented in theano.gpuarray. Check the line similar to Spent Xs(X%) in cpu op, Xs(X%) in gpu op and Xs(X%) in transfer op. This can tell you if not enough of your graph is on the GPU or if there is too much memory transfer.
- To investigate whether all the Ops in the computational graph are running on the GPU, you can debug or check your code by providing a value to the assert_no_cpu_op flag: warn for a warning, raise for raising an error, or pdb for putting a breakpoint in the computational graph if there is a CPU Op.
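As a lightweight complement to the profiler, you can also count what kind of ops ended up in a compiled graph. The following is a rough sketch: summarize_ops is a hypothetical helper, the 'Gpu' substring test mirrors the check used in the test program earlier, and treating HostFromGpu and GpuFromHost as transfer ops is an assumption of this sketch.

```python
def summarize_ops(op_names):
    """Count GPU, transfer, and CPU ops from a list of op class names.

    The 'Gpu' substring test mirrors the check in the test program; treating
    HostFromGpu and GpuFromHost as transfers is an assumption of this sketch.
    """
    transfer_names = ("HostFromGpu", "GpuFromHost")
    counts = {"gpu": 0, "transfer": 0, "cpu": 0}
    for name in op_names:
        if name in transfer_names:
            counts["transfer"] += 1
        elif "Gpu" in name:
            counts["gpu"] += 1
        else:
            counts["cpu"] += 1
    return counts


# With a compiled function f, the names could be collected as:
#   names = [type(node.op).__name__ for node in f.maker.fgraph.toposort()]
names = ["GpuElemwise", "HostFromGpu"]
print(summarize_ops(names))
```

The two names used here are the ops printed by the first GPU run above: one GPU computation followed by one transfer back to the host.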
GPU Async Capabilities
By default, all operations on the GPU are run asynchronously. This means that they are only scheduled to run and the function returns. This is made somewhat transparently by the underlying libgpuarray.
A forced synchronization point is introduced when doing memory transfers between device and host.
It is possible to force synchronization for a particular GpuArray by calling its sync() method. This is useful to get accurate timings when doing benchmarks.
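A benchmarking loop that takes this into account might look like the sketch below. The time_calls helper is hypothetical; it assumes that the calls are queued in order on a single context, so that synchronizing on the last result is enough. It is demonstrated here on a plain Python callable so that it runs without a GPU.

```python
import time


def time_calls(fn, iters):
    """Time iters calls of fn, synchronizing on the last result if possible."""
    start = time.time()
    result = None
    for _ in range(iters):
        result = fn()
    # A GpuArray exposes sync(); a numpy ndarray does not, so guard the call.
    if hasattr(result, "sync"):
        result.sync()
    return time.time() - start, result


elapsed, value = time_calls(lambda: sum(range(100)), 1000)
print("1000 calls took %f seconds" % elapsed)
```

With a Theano function f that returns a GpuArray, time_calls(f, 1000) would wait for the pending GPU work before stopping the clock.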
Changing the Value of Shared Variables
To change the value of a shared variable, e.g. to provide new data to processes, use shared_variable.set_value(new_value). For a lot more detail about this, see Understanding Memory Aliasing for Speed and Correctness.
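One common pattern is to reuse a single shared variable for successive chunks of data. The sketch below illustrates the idea under stated assumptions: run_over_batches is a hypothetical helper, and FakeShared is a stand-in class (not a Theano object) that only imitates set_value() and get_value() so the example runs without Theano.

```python
def run_over_batches(shared_variable, batches, step):
    """Load each batch into shared_variable, then call step() on it."""
    results = []
    for batch in batches:
        # set_value replaces the data held by the shared variable.
        shared_variable.set_value(batch)
        results.append(step())
    return results


class FakeShared(object):
    """Stand-in for a Theano shared variable, for illustration only."""

    def set_value(self, value):
        self.value = value

    def get_value(self):
        return self.value


store = FakeShared()
totals = run_over_batches(store, [[1, 2], [3, 4]],
                          lambda: sum(store.get_value()))
print(totals)
```

In real code, step would be a compiled Theano function that reads the shared variable, and each set_value call would replace the data it works on.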
Exercise
Consider again the logistic regression:
- import numpy
- import theano
- import theano.tensor as T
- rng = numpy.random
- N = 400
- feats = 784
- D = (rng.randn(N, feats).astype(theano.config.floatX),
-      rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
- training_steps = 10000
- # Declare Theano symbolic variables
- x = T.matrix("x")
- y = T.vector("y")
- w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
- b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
- x.tag.test_value = D[0]
- y.tag.test_value = D[1]
- # Construct Theano expression graph
- p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability of having a one
- prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
- xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
- cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
- gw,gb = T.grad(cost, [w,b])
- # Compile expressions to functions
- train = theano.function(
-     inputs=[x, y],
-     outputs=[prediction, xent],
-     updates=[(w, w - 0.01 * gw), (b, b - 0.01 * gb)],
-     name="train")
- predict = theano.function(inputs=[x], outputs=prediction,
-                           name="predict")
- if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
-         train.maker.fgraph.toposort()]):
-     print('Used the cpu')
- elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
-           train.maker.fgraph.toposort()]):
-     print('Used the gpu')
- else:
-     print('ERROR, not able to tell if theano used the cpu or the gpu')
-     print(train.maker.fgraph.toposort())
- for i in range(training_steps):
-     pred, err = train(D[0], D[1])
- print("target values for D")
- print(D[1])
- print("prediction on D")
- print(predict(D[0]))
- print("floatX=", theano.config.floatX)
- print("device=", theano.config.device)
Modify and execute this example to run on GPU with floatX=float32 and time it using the command line time python file.py. (Of course, you may use some of your answer to the exercise in section Configuration Settings and Compiling Mode.)
Is there an increase in speed from CPU to GPU?
Where does it come from? (Use the profile=True flag.)
What can be done to further increase the speed of the GPU version? Put your ideas to the test.
Software for Directly Programming a GPU
Leaving aside Theano, which is a meta-programmer, there are:
- CUDA: GPU programming API by NVIDIA based on an extension to C (CUDA C)
  - Vendor-specific.
  - Numeric libraries (BLAS, RNG, FFT) are maturing.
- OpenCL: multi-vendor version of CUDA
  - More general, standardized.
  - Fewer libraries, less widespread.
- PyCUDA: Python bindings to the CUDA driver interface that give access to Nvidia’s CUDA parallel computation API from Python
  - Convenience:
    - Makes it easy to do GPU meta-programming from within Python.
    - Abstractions to compile low-level CUDA code from Python (pycuda.compiler.SourceModule).
    - GPU memory buffer (pycuda.gpuarray.GPUArray).
    - Helpful documentation.
  - Completeness: Binding to all of CUDA’s driver API.
  - Automatic error checking: All CUDA errors are automatically translated into Python exceptions.
  - Speed: PyCUDA’s base layer is written in C++.
  - Good memory management of GPU objects:
    - Object cleanup tied to lifetime of objects (RAII, ‘Resource Acquisition Is Initialization’).
    - Makes it much easier to write correct, leak- and crash-free code.
    - PyCUDA knows about dependencies (e.g. it won’t detach from a context before all memory allocated in it is also freed).
  (This is adapted from PyCUDA’s documentation and Andreas Kloeckner’s website on PyCUDA.)
- PyOpenCL: PyCUDA for OpenCL
Learning to Program with PyCUDA
If you already enjoy a good proficiency with the C programming language, you may easily leverage your knowledge by learning, first, to program a GPU with the CUDA extension to C (CUDA C) and, second, to use PyCUDA to access the CUDA API with a Python wrapper.
The following resources will assist you in this learning process:
- CUDA API and CUDA C: Introductory
- CUDA API and CUDA C: Advanced
- MIT IAP2009 CUDA (full coverage: lectures, leading Kirk-Hwu textbook, examples, additional resources)
- Course U. of Illinois (full lectures, Kirk-Hwu textbook)
- NVIDIA’s knowledge base (extensive coverage, levels from introductory to advanced)
- practical issues (on the relationship between grids, blocks and threads; see also linked and related issues on same page)
- CUDA optimization
- PyCUDA: Introductory
- PyCUDA: Advanced
The following examples give a foretaste of programming a GPU with PyCUDA. Once you feel competent enough, you may try your hand at the corresponding exercises.
Example: PyCUDA
- # (from PyCUDA's documentation)
- import pycuda.autoinit
- import pycuda.driver as drv
- import numpy
- from pycuda.compiler import SourceModule
- mod = SourceModule("""
- __global__ void multiply_them(float *dest, float *a, float *b)
- {
- const int i = threadIdx.x;
- dest[i] = a[i] * b[i];
- }
- """)
- multiply_them = mod.get_function("multiply_them")
- a = numpy.random.randn(400).astype(numpy.float32)
- b = numpy.random.randn(400).astype(numpy.float32)
- dest = numpy.zeros_like(a)
- multiply_them(
- drv.Out(dest), drv.In(a), drv.In(b),
- block=(400,1,1), grid=(1,1))
- assert numpy.allclose(dest, a*b)
- print(dest)
Exercise
Run the preceding example.
Modify and execute to work for a matrix of shape (20, 10).
Example: Theano + PyCUDA
- import numpy, theano
- import theano.misc.pycuda_init
- from pycuda.compiler import SourceModule
- import theano.sandbox.cuda as cuda
- class PyCUDADoubleOp(theano.Op):
-     __props__ = ()
-     def make_node(self, inp):
-         inp = cuda.basic_ops.gpu_contiguous(
-             cuda.basic_ops.as_cuda_ndarray_variable(inp))
-         assert inp.dtype == "float32"
-         return theano.Apply(self, [inp], [inp.type()])
-     def make_thunk(self, node, storage_map, _, _2, impl):
-         mod = SourceModule("""
-     __global__ void my_fct(float * i0, float * o0, int size) {
-         int i = blockIdx.x*blockDim.x + threadIdx.x;
-         if(i<size){
-             o0[i] = i0[i]*2;
-         }
-     }""")
-         pycuda_fct = mod.get_function("my_fct")
-         inputs = [storage_map[v] for v in node.inputs]
-         outputs = [storage_map[v] for v in node.outputs]
-         def thunk():
-             z = outputs[0]
-             if z[0] is None or z[0].shape != inputs[0][0].shape:
-                 z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
-             grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
-             pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
-                        block=(512, 1, 1), grid=grid)
-         return thunk
Use this code to test it:
- >>> x = theano.tensor.fmatrix()
- >>> f = theano.function([x], PyCUDADoubleOp()(x))
- >>> xv = numpy.ones((4, 5), dtype="float32")
- >>> assert numpy.allclose(f(xv), xv*2)
- >>> print(numpy.asarray(f(xv)))
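The launch configuration used by the thunk above (blocks of 512 threads, with enough blocks to cover every element) can be checked on the CPU. The sketch below is a pure-Python simulation, not GPU code: launch_config and simulate_double are hypothetical helpers that reproduce the grid-size formula and the index computation of my_fct, with one loop iteration standing in for one GPU thread.

```python
import math


def launch_config(size, threads_per_block=512):
    """Return (grid_x, block_x) covering size elements, as in the thunk above."""
    return int(math.ceil(size / float(threads_per_block))), threads_per_block


def simulate_double(values, threads_per_block=512):
    """CPU simulation of my_fct: one loop iteration per GPU thread."""
    size = len(values)
    grid_x, block_x = launch_config(size, threads_per_block)
    out = [0.0] * size
    for block_idx in range(grid_x):
        for thread_idx in range(block_x):
            i = block_idx * block_x + thread_idx
            if i < size:  # threads past the end of the data do nothing
                out[i] = values[i] * 2
    return out


print(launch_config(4 * 5))   # the 4 x 5 test matrix has 20 elements
print(launch_config(1000))
print(simulate_double([1.0] * 20, threads_per_block=8))
```

The bounds check matters because the grid is rounded up: with 20 elements and 8 threads per block, 3 blocks are launched and the last 4 threads have no element to process.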
Exercise
Run the preceding example.
Modify and execute to multiply two matrices: x * y.
Modify and execute to return two outputs: x + y and x - y.
(Notice that Theano’s current elemwise fusion optimization is only applicable to computations involving a single output. Hence, to gain efficiency over the basic solution that is asked here, the two operations would have to be jointly optimized explicitly in the code.)
Modify and execute to support stride (i.e. to avoid constraining the input to be C-contiguous).
Note
- See Other Implementations to know how to handle random numbers on the GPU.
- The mode FAST_COMPILE disables C code, so also disables the GPU. You can use the Theano flag optimizer='fast_compile' to speed up compilation and keep the GPU.