Understanding Memory Aliasing for Speed and Correctness

The aggressive reuse of memory is one of the ways through which Theano makes code fast, and it is important for the correctness and speed of your program that you understand how Theano might alias buffers.

This section describes the principles based on which Theano handles memory, and explains when you might want to alter the default behaviour of some functions and methods for faster performance.

The Memory Model: Two Spaces

There are some simple principles that guide Theano’s handling of memory. The main idea is that there is a pool of memory managed by Theano, and Theano tracks changes to values in that pool.

  • Theano manages its own memory space, which typically does not overlap with the memory of normal Python variables that non-Theano code creates.
  • Theano functions only modify buffers that are in Theano’s memory space.
  • Theano’s memory space includes the buffers allocated to store shared variables and the temporaries used to evaluate functions.
  • Physically, Theano’s memory space may be spread across the host and one or more GPU devices, and in the future may even include objects on a remote machine.
  • The memory allocated for a shared variable buffer is unique: it is never aliased to another shared variable.
  • Theano’s managed memory is constant while Theano functions are not running and Theano’s library code is not running.
  • The default behaviour of a function is to return user-space values for outputs, and to expect user-space values for inputs.

The distinction between Theano-managed memory and user-managed memory can be broken down by some Theano functions (e.g. shared, get_value and the constructors for In and Out) by using a borrow=True flag. This can make those methods faster (by avoiding copy operations) at the expense of risking subtle bugs in the overall program (by aliasing memory).

The rest of this section is aimed at helping you to understand when it is safe to use the borrow=True argument and reap the benefits of faster code.

Borrowing when Creating Shared Variables

A borrow argument can be provided to the shared-variable constructor.

    import numpy, theano

    np_array = numpy.ones(2, dtype='float32')

    s_default = theano.shared(np_array)
    s_false = theano.shared(np_array, borrow=False)
    s_true = theano.shared(np_array, borrow=True)

By default (s_default) and when explicitly setting borrow=False, the shared variable we construct gets a [deep] copy of np_array. So changes we subsequently make to np_array have no effect on our shared variable.

    np_array += 1  # now it is an array of 2.0 s

    print(s_default.get_value())
    print(s_false.get_value())
    print(s_true.get_value())

which prints:

    [ 1. 1.]
    [ 1. 1.]
    [ 2. 2.]

If we are running this with the CPU as the device, then the changes we make to np_array show up right away in s_true.get_value() because NumPy arrays are mutable, and s_true is using the np_array object as its internal buffer.

However, this aliasing of np_array and s_true is not guaranteed to occur, and may occur only temporarily even if it occurs at all. It is not guaranteed to occur because if Theano is using a GPU device, then the borrow flag has no effect. It may occur only temporarily because if we call a Theano function that updates the value of s_true, the aliasing relationship may or may not be broken (the function is allowed to update the shared variable by modifying its buffer, which will preserve the aliasing, or by changing which buffer the variable points to, which will terminate the aliasing).
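On the CPU, this behaviour boils down to NumPy’s distinction between sharing and copying a buffer. Here is a minimal NumPy-only sketch (no Theano involved) of what borrow=False versus borrow=True means for the constructor on the CPU:

```python
import numpy

np_array = numpy.ones(2, dtype='float32')

copied = np_array.copy()  # like borrow=False: an independent buffer
aliased = np_array        # like borrow=True on the CPU: the same buffer

np_array += 1             # mutate the original in place

# The copy is unaffected (still 1.0s); the alias sees the change (2.0s).
```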

Take home message:

It is a safe practice (and a good idea) to use borrow=True in a shared variable constructor when the shared variable stands for a large object (in terms of memory footprint) and you do not want to create copies of it in memory.

It is not a reliable technique to use borrow=True to modify shared variables through side-effect, because with some devices (e.g. GPU devices) this technique will not work.

Borrowing when Accessing Value of Shared Variables

Retrieving

A borrow argument can also be used to control how a shared variable’s value is retrieved.

    s = theano.shared(np_array)

    v_false = s.get_value(borrow=False)  # N.B. borrow default is False
    v_true = s.get_value(borrow=True)

When borrow=False is passed to get_value, it means that the return value may not be aliased to any part of Theano’s internal memory. When borrow=True is passed to get_value, it means that the return value might be aliased to some of Theano’s internal memory. But both of these calls might create copies of the internal memory.
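As an illustration of this contract (plain NumPy, not the actual Theano implementation), imagine a toy container whose get_value returns either a guaranteed copy or, possibly, its internal array. The class SharedLike and its layout are invented for this sketch:

```python
import numpy

class SharedLike(object):
    """Toy stand-in for a shared variable's storage (illustration only)."""
    def __init__(self, value):
        self._buf = numpy.array(value)  # internal buffer

    def get_value(self, borrow=False):
        if borrow:
            return self._buf         # may alias internal memory
        return self._buf.copy()      # guaranteed not to alias it

s = SharedLike([1.0, 1.0])
v_false = s.get_value(borrow=False)
v_true = s.get_value(borrow=True)

v_true += 1  # mutates the internal buffer through the alias
# v_false still holds 1.0s; the container's value is now 2.0s
```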

The reason that borrow=True might still make a copy is that the internal representation of a shared variable might not be what you expect. When you create a shared variable by passing a NumPy array, for example, then get_value() must return a NumPy array too. That is how Theano can make GPU use transparent. But when you are using a GPU (or, in the future, perhaps a remote machine), the numpy.ndarray is not the internal representation of your data. If you really want Theano to return its internal representation and never copy it, then you should use the return_internal_type=True argument to get_value. It will never cast the internal object (so it always returns in constant time), but might return various datatypes depending on contextual factors (e.g. the compute device, the dtype of the NumPy array).

    v_internal = s.get_value(borrow=True, return_internal_type=True)

It is possible to use borrow=False in conjunction with return_internal_type=True, which will return a deep copy of the internal object. This is primarily for internal debugging, not for typical use.

To keep the various optimizations Theano can apply transparent, the policy is that get_value() by default returns the same type of object it received when the shared variable was created. So if you manually created data on the GPU and constructed a shared variable on the GPU with this data, get_value will always return GPU data, even when return_internal_type=False.

Take home message:

It is safe (and sometimes much faster) to use get_value(borrow=True) when your code does not modify the return value. Do not use this to modify a shared variable by side-effect, because it will make your code device-dependent: modification of GPU variables through this sort of side-effect is impossible.

Assigning

Shared variables also have a set_value method that can accept an optional borrow=True argument. The semantics are similar to those of creating a new shared variable: borrow=False is the default, and borrow=True means that Theano may reuse the buffer you provide as the internal storage for the variable.

A standard pattern for manually updating the value of a shared variable is asfollows:

    s.set_value(
        some_inplace_fn(s.get_value(borrow=True)),
        borrow=True)

This pattern works regardless of the computing device, and when the device makes it possible to expose Theano’s internal variables without a copy, it proceeds as fast as an in-place update.
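For concreteness, some_inplace_fn above stands for any function that modifies its argument’s buffer and returns it, rather than allocating a new array. A hypothetical NumPy example of such a function (the name scale_in_place is ours, not a Theano API):

```python
import numpy

def scale_in_place(buf, factor=2.0):
    buf *= factor  # in-place: no new allocation
    return buf     # return the same buffer so set_value can reuse it

arr = numpy.ones(3, dtype='float32')
out = scale_in_place(arr)
assert out is arr  # same object: the update happened in place
```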

When shared variables are allocated on the GPU, the transfers to and from GPU device memory can be costly. Here are a few tips to ensure fast and efficient use of GPU memory and bandwidth:

  • Prior to Theano 0.3.1, set_value did not work in-place on the GPU. This meant that, sometimes, GPU memory for the new value would be allocated before the old memory was released. If you’re running near the limits of GPU memory, this could cause you to run out of GPU memory unnecessarily.

Solution: update to a newer version of Theano.

  • If you are going to swap several chunks of data in and out of a shared variable repeatedly, you will want to reuse the memory that you allocated the first time if possible - it is both faster and more memory efficient.

Solution: upgrade to a recent version of Theano (>0.3.0) and consider padding your source data to make sure that every chunk is the same size.

  • It is also worth mentioning that current GPU copying routines support only contiguous memory. So Theano must make the value you provide C-contiguous prior to copying it, which can require an extra copy of the data on the host.

Solution: make sure that the value you assign to a CudaNdarraySharedVariable is already C-contiguous.
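You can check and, if necessary, fix contiguity on the host before assigning. A short NumPy sketch (the shared-variable assignment itself is omitted here):

```python
import numpy

a = numpy.ones((4, 4), dtype='float32')
t = a.T  # a transposed view is typically not C-contiguous

if not t.flags['C_CONTIGUOUS']:
    # Make the C-contiguous copy yourself, once, up front,
    # instead of letting the transfer routine do it implicitly.
    t = numpy.ascontiguousarray(t)

assert t.flags['C_CONTIGUOUS']
```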

(Further information on the current implementation of the GPU version of set_value() can be foundhere: sandbox.cuda.var – The Variables for Cuda-allocated arrays)

Borrowing when Constructing Function Objects

A borrow argument can also be provided to the In and Out objects that control how theano.function handles its arguments and return values.

    import theano, theano.tensor

    x = theano.tensor.matrix()
    y = 2 * x
    f = theano.function([theano.In(x, borrow=True)], theano.Out(y, borrow=True))

Borrowing an input means that Theano will treat the argument you provide as if it were part of Theano’s pool of temporaries. Consequently, your input may be reused as a buffer (and overwritten!) during the computation of other variables in the course of evaluating that function (e.g. f).

Borrowing an output means that Theano will not insist on allocating a fresh output buffer every time you call the function. It will possibly reuse the same one as on a previous call, and overwrite the old content. Consequently, it may overwrite old return values through side-effect. Those return values may also be overwritten in the course of evaluating another compiled function (for example, the output may be aliased to a shared variable). So be careful to use a borrowed return value right away before calling any more Theano functions. The default is of course to not borrow internal results.
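The hazard with borrowed outputs can be imitated in plain NumPy: a function that writes every result into the same preallocated buffer. This is a sketch of the failure mode, not Theano code; the names _out_buf and f_borrowed are invented:

```python
import numpy

_out_buf = numpy.empty(3)  # one output buffer, reused on every call

def f_borrowed(x):
    numpy.multiply(x, 2, out=_out_buf)  # overwrite the shared buffer
    return _out_buf

r1 = f_borrowed(numpy.ones(3))
saved = r1.copy()                # copy right away if you need it later
r2 = f_borrowed(numpy.full(3, 10.0))

# r1 and r2 are the same object; r1's old contents (2.0s) are gone.
```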

It is also possible to pass a return_internal_type=True flag to the Out variable, which has the same interpretation as the return_internal_type flag to the shared variable’s get_value function. Unlike get_value(), the combination of return_internal_type=True and borrow=True arguments to Out() is not guaranteed to avoid copying an output value. They are just hints that give more flexibility to the compilation and optimization of the graph.

For GPU graphs, this borrowing can have a major speed impact. See the following code:

    from theano import function, config, shared, sandbox, tensor, Out
    import numpy
    import time

    vlen = 10 * 30 * 768  # 10 x # cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f1 = function([], sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)))
    f2 = function([],
                  Out(sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)),
                      borrow=True))
    t0 = time.time()
    for i in range(iters):
        r = f1()
    t1 = time.time()
    no_borrow = t1 - t0
    t0 = time.time()
    for i in range(iters):
        r = f2()
    t1 = time.time()
    print(
        "Looping %s times took %s seconds without borrow "
        "and %s seconds with borrow" % (iters, no_borrow, (t1 - t0))
    )
    if numpy.any([isinstance(x.op, tensor.Elemwise) and
                  ('Gpu' not in type(x.op).__name__)
                  for x in f1.maker.fgraph.toposort()]):
        print('Used the cpu')
    else:
        print('Used the gpu')

Which produces this output:

    $ THEANO_FLAGS=device=gpu0,floatX=float32 python test1.py
    Using gpu device 0: GeForce GTX 275
    Looping 1000 times took 0.368273973465 seconds without borrow and 0.0240728855133 seconds with borrow.
    Used the gpu

Take home message:

When an input x to a function is not needed after the function returns and you would like to make it available to Theano as additional workspace, then consider marking it with In(x, borrow=True). It may make the function faster and reduce its memory requirement. When a return value y is large (in terms of memory footprint), and you only need to read from it once, right away when it’s returned, then consider marking it with Out(y, borrow=True).