- scan – Looping in Theano
  - Guide
    - Simple loop with accumulation: Computing A**k
    - Iterating over the first dimension of a tensor: Calculating a polynomial
    - Simple accumulation into a scalar, ditching lambda
    - Another simple example
    - Using shared variables - Gibbs sampling
    - Using shared variables - the strict flag
    - Multiple outputs, several taps values - Recurrent Neural Network with Scan
    - Conditional ending of Scan
    - Reducing Scan’s memory usage
    - Optimizing Scan’s performance
  - reference
scan – Looping in Theano
Guide
The scan function provides the basic functionality needed to do loops in Theano. Scan comes with many bells and whistles, which we will introduce by way of examples.
Simple loop with accumulation: Computing A**k

Assume that, given k, you want to get A**k using a loop. More precisely, if A is a tensor, you want to compute A**k elementwise. The Python/NumPy code might look like:
```python
result = 1
for i in range(k):
    result = result * A
```
There are three things here that we need to handle: the initial value assigned to `result`, the accumulation of results in `result`, and the unchanging variable `A`. Unchanging variables are passed to scan as `non_sequences`. Initialization occurs in `outputs_info`, and the accumulation happens automatically.
The equivalent Theano code would be:
```python
import theano
import theano.tensor as T

k = T.iscalar("k")
A = T.vector("A")

# Symbolic description of the result
result, updates = theano.scan(fn=lambda prior_result, A: prior_result * A,
                              outputs_info=T.ones_like(A),
                              non_sequences=A,
                              n_steps=k)

# We only care about A**k, but scan has provided us with A**1 through A**k.
# Discard the values that we don't care about. Scan is smart enough to
# notice this and not waste memory saving them.
final_result = result[-1]

# compiled function that returns A**k
power = theano.function(inputs=[A, k], outputs=final_result, updates=updates)

print(power(range(10), 2))
print(power(range(10), 4))
```

```
[ 0. 1. 4. 9. 16. 25. 36. 49. 64. 81.]
[ 0.00000000e+00 1.00000000e+00 1.60000000e+01 8.10000000e+01
  2.56000000e+02 6.25000000e+02 1.29600000e+03 2.40100000e+03
  4.09600000e+03 6.56100000e+03]
```
Let us go through the example line by line. What we did is first to construct a function (using a lambda expression) that, given `prior_result` and `A`, returns `prior_result * A`. The order of parameters is fixed by scan: the output of the prior call to `fn` (or the initial value, initially) is the first parameter, followed by all non-sequences.

Next we initialize the output as a tensor with the same shape and dtype as `A`, filled with ones. We give `A` to scan as a non-sequence parameter and specify the number of steps `k` to iterate over our lambda expression.

Scan returns a tuple containing our result (`result`) and a dictionary of updates (empty in this case). Note that the result is not a matrix, but a 3D tensor containing the value of `A**k` for each step. We want the last value (after `k` steps), so we compile a function to return just that. Note that there is an optimization that, at compile time, detects that you are using only the last value of the result and ensures that scan does not store all the intermediate values it computes. So do not worry if `A` and `k` are large.
Iterating over the first dimension of a tensor: Calculating a polynomial
In addition to looping a fixed number of times, scan can iterate over the leading dimension of tensors (similar to Python’s `for x in a_list`).

The tensor(s) to be looped over should be provided to scan using the `sequences` keyword argument.

Here’s an example that builds a symbolic calculation of a polynomial from a list of its coefficients:
```python
import numpy

coefficients = theano.tensor.vector("coefficients")
x = T.scalar("x")

max_coefficients_supported = 10000

# Generate the components of the polynomial
components, updates = theano.scan(fn=lambda coefficient, power, free_variable: coefficient * (free_variable ** power),
                                  outputs_info=None,
                                  sequences=[coefficients, theano.tensor.arange(max_coefficients_supported)],
                                  non_sequences=x)
# Sum them up
polynomial = components.sum()

# Compile a function
calculate_polynomial = theano.function(inputs=[coefficients, x], outputs=polynomial)

# Test
test_coefficients = numpy.asarray([1, 0, 2], dtype=numpy.float32)
test_value = 3
print(calculate_polynomial(test_coefficients, test_value))
print(1.0 * (3 ** 0) + 0.0 * (3 ** 1) + 2.0 * (3 ** 2))
```

```
19.0
19.0
```
There are a few things to note here.

First, we calculate the polynomial by first generating each of its terms and then summing them at the end. (We could also have accumulated them along the way, and then taken the last one, which would have been more memory-efficient, but this is an example.)

Second, since there is no accumulation of results, we can set `outputs_info` to `None`. This indicates to scan that it doesn’t need to pass the prior result to `fn`.

The general order of function parameters to `fn` is:

- sequences (if any), prior result(s) (if needed), non-sequences (if any)

Third, there’s a handy trick used to simulate Python’s `enumerate`: simply include `theano.tensor.arange` in the sequences.

Fourth, given multiple sequences of uneven lengths, scan will truncate to the shortest of them, as illustrated below. This makes it safe to pass a very long arange, which we need to do for generality, since arange must have its length specified at creation time.
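Here is a minimal sketch of that truncation behaviour (the names `xs` and `scale_by_index` are made up for illustration): the symbolic arange below is a million entries long, yet scan stops after the three elements of the other sequence.

```python
import theano
import theano.tensor as T

xs = T.vector("xs")

# Pair a short sequence with a very long symbolic range; scan only iterates
# as many steps as the shortest sequence allows.
products, _ = theano.scan(fn=lambda x_t, i_t: x_t * i_t,
                          sequences=[xs, T.arange(1000000)],
                          outputs_info=None)

scale_by_index = theano.function([xs], products)
print(scale_by_index([10., 20., 30.]))   # three steps only, e.g. [ 0. 20. 60.]
```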
Simple accumulation into a scalar, ditching lambda
Although this example would seem almost self-explanatory, it stresses a pitfall to be careful of: the initial output state that is supplied, that is `outputs_info`, must be of a shape similar to that of the output variable generated at each iteration and, moreover, it must not involve an implicit downcast of the latter.
```python
import numpy as np
import theano
import theano.tensor as T

up_to = T.iscalar("up_to")

# define a named function, rather than using lambda
def accumulate_by_adding(arange_val, sum_to_date):
    return sum_to_date + arange_val
seq = T.arange(up_to)

# An unauthorized implicit downcast from the dtype of 'seq', to that of
# 'T.as_tensor_variable(0)' which is of dtype 'int8' by default would occur
# if this instruction were to be used instead of the next one:
# outputs_info = T.as_tensor_variable(0)

outputs_info = T.as_tensor_variable(np.asarray(0, seq.dtype))
scan_result, scan_updates = theano.scan(fn=accumulate_by_adding,
                                        outputs_info=outputs_info,
                                        sequences=seq)
triangular_sequence = theano.function(inputs=[up_to], outputs=scan_result)

# test
some_num = 15
print(triangular_sequence(some_num))
print([n * (n + 1) // 2 for n in range(some_num)])
```

```
[  0   1   3   6  10  15  21  28  36  45  55  66  78  91 105]
[0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105]
```
Another simple example
Unlike some of the prior examples, this one is hard to reproduce except by using scan.
This example takes a sequence of array indices and values to place there, along with a “model” output array (whose shape and dtype will be mimicked), and produces a sequence of arrays with the shape and dtype of the model, with all values set to zero except at the provided array indices.
```python
location = T.imatrix("location")
values = T.vector("values")
output_model = T.matrix("output_model")

def set_value_at_position(a_location, a_value, output_model):
    zeros = T.zeros_like(output_model)
    zeros_subtensor = zeros[a_location[0], a_location[1]]
    return T.set_subtensor(zeros_subtensor, a_value)

result, updates = theano.scan(fn=set_value_at_position,
                              outputs_info=None,
                              sequences=[location, values],
                              non_sequences=output_model)

assign_values_at_positions = theano.function(inputs=[location, values, output_model], outputs=result)

# test
test_locations = numpy.asarray([[1, 1], [2, 3]], dtype=numpy.int32)
test_values = numpy.asarray([42, 50], dtype=numpy.float32)
test_output_model = numpy.zeros((5, 5), dtype=numpy.float32)
print(assign_values_at_positions(test_locations, test_values, test_output_model))
```

```
[[[  0.   0.   0.   0.   0.]
  [  0.  42.   0.   0.   0.]
  [  0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.]]

 [[  0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.]
  [  0.   0.   0.  50.   0.]
  [  0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.]]]
```
This demonstrates that you can introduce new Theano variables into a scan function.
Using shared variables - Gibbs sampling
Another useful feature of scan is that it can handle shared variables. For example, if we want to implement a Gibbs chain of length 10 we would do the following:
```python
import theano
from theano import tensor as T

W = theano.shared(W_values) # we assume that ``W_values`` contains the
                            # initial values of your weight matrix

bvis = theano.shared(bvis_values)
bhid = theano.shared(bhid_values)

trng = T.shared_randomstreams.RandomStreams(1234)

def OneStep(vsample):
    hmean = T.nnet.sigmoid(theano.dot(vsample, W) + bhid)
    hsample = trng.binomial(size=hmean.shape, n=1, p=hmean)
    vmean = T.nnet.sigmoid(theano.dot(hsample, W.T) + bvis)
    return trng.binomial(size=vsample.shape, n=1, p=vmean,
                         dtype=theano.config.floatX)

sample = theano.tensor.vector()

values, updates = theano.scan(OneStep, outputs_info=sample, n_steps=10)

gibbs10 = theano.function([sample], values[-1], updates=updates)
```
The first, and probably most crucial, observation is that the updates dictionary becomes important in this case. It links a shared variable with its updated value after k steps. In this case it tells how the random streams get updated after 10 iterations. If you do not pass this updates dictionary to your function, you will always get the same 10 sets of random numbers. You can even use the `updates` dictionary afterwards. Look at this example:
```python
a = theano.shared(1)
values, updates = theano.scan(lambda: {a: a+1}, n_steps=10)
```
In this case the lambda expression does not require any input parameters and returns an update dictionary which tells how `a` should be updated after each step of scan. If we write:
```python
b = a + 1
c = updates[a] + 1
f = theano.function([], [b, c], updates=updates)

print(f())            # evaluates b and c, then applies the updates to a
print(a.get_value())
```
We will see that because `b` does not use the updated version of `a`, it will be 2, `c` will be 12, while `a.get_value()` is 11. If we call the function again, `b` will become 12, `c` will be 22 and `a.get_value()` 21. If we do not pass the `updates` dictionary to the function, then `a.get_value()` will always remain 1, `b` will always be 2 and `c` will always be 12.
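To make those three cases concrete, here is a small self-contained sketch (the function name `g` and the `set_value` reset are only for illustration; the values in the comments follow the explanation above):

```python
import theano

a = theano.shared(1)
values, updates = theano.scan(lambda: {a: a + 1}, n_steps=10)
b = a + 1
c = updates[a] + 1

f = theano.function([], [b, c], updates=updates)  # applies the 10 updates per call
g = theano.function([], [b, c])                   # ignores the updates entirely

print(f(), a.get_value())   # first call:  [2, 12], then a == 11
print(f(), a.get_value())   # second call: [12, 22], then a == 21

a.set_value(1)              # reset, to show the no-updates behaviour
print(g(), a.get_value())   # [2, 12] on every call, and a stays 1
```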
The second observation is that if we use shared variables (`W`, `bvis`, `bhid`) but we do not iterate over them (i.e., scan doesn’t really need to know anything in particular about them, just that they are used inside the function applied at each step), you do not need to pass them as arguments. Scan will find them on its own and add them to the graph. However, passing them to the scan function is a good practice, as it avoids the Scan Op calling any earlier (external) Op over and over. This results in a simpler computational graph, which speeds up the optimization and the execution. To pass the shared variables to Scan you need to put them in a list and give it to the `non_sequences` argument. Here is the Gibbs sampling code updated:
```python
W = theano.shared(W_values) # we assume that ``W_values`` contains the
                            # initial values of your weight matrix

bvis = theano.shared(bvis_values)
bhid = theano.shared(bhid_values)

trng = T.shared_randomstreams.RandomStreams(1234)

# OneStep, with explicit use of the shared variables (W, bvis, bhid)
def OneStep(vsample, W, bvis, bhid):
    hmean = T.nnet.sigmoid(theano.dot(vsample, W) + bhid)
    hsample = trng.binomial(size=hmean.shape, n=1, p=hmean)
    vmean = T.nnet.sigmoid(theano.dot(hsample, W.T) + bvis)
    return trng.binomial(size=vsample.shape, n=1, p=vmean,
                         dtype=theano.config.floatX)

sample = theano.tensor.vector()

# The new scan, with the shared variables passed as non_sequences
values, updates = theano.scan(fn=OneStep,
                              outputs_info=sample,
                              non_sequences=[W, bvis, bhid],
                              n_steps=10)

gibbs10 = theano.function([sample], values[-1], updates=updates)
```
Using shared variables - the strict flag
As we just saw, passing the shared variables to scan may result in a simpler computational graph, which speeds up the optimization and the execution. A good way to remember to pass every shared variable used during scan is to use the `strict` flag. When set to true, scan checks that all the necessary shared variables in `fn` are passed as explicit arguments to `fn`. This has to be ensured by the user. Otherwise, it will result in an error.
Using the original Gibbs sampling example, with `strict=True` added to the `scan()` call:
```python
# Same OneStep as in original example.
def OneStep(vsample):
    hmean = T.nnet.sigmoid(theano.dot(vsample, W) + bhid)
    hsample = trng.binomial(size=hmean.shape, n=1, p=hmean)
    vmean = T.nnet.sigmoid(theano.dot(hsample, W.T) + bvis)
    return trng.binomial(size=vsample.shape, n=1, p=vmean,
                         dtype=theano.config.floatX)

# The new scan, adding strict=True to the original call.
values, updates = theano.scan(OneStep,
                              outputs_info=sample,
                              n_steps=10,
                              strict=True)
```

```
Traceback (most recent call last):
...
MissingInputError: An input of the graph, used to compute
DimShuffle{1,0}(<TensorType(float64, matrix)>), was not provided and
not given a value. Use the Theano flag exception_verbosity='high', for
more information on this error.
```
The error indicates that `OneStep` relies on variables that are not passed as arguments explicitly. Here is the correct version, with the shared variables passed explicitly to `OneStep` and to scan:
```python
# OneStep, with explicit use of the shared variables (W, bvis, bhid)
def OneStep(vsample, W, bvis, bhid):
    hmean = T.nnet.sigmoid(theano.dot(vsample, W) + bhid)
    hsample = trng.binomial(size=hmean.shape, n=1, p=hmean)
    vmean = T.nnet.sigmoid(theano.dot(hsample, W.T) + bvis)
    return trng.binomial(size=vsample.shape, n=1, p=vmean,
                         dtype=theano.config.floatX)

# The new scan, adding strict=True to the original call, and passing
# explicitly W, bvis and bhid.
values, updates = theano.scan(OneStep,
                              outputs_info=sample,
                              non_sequences=[W, bvis, bhid],
                              n_steps=10,
                              strict=True)
```
Multiple outputs, several taps values - Recurrent Neural Network with Scan
The examples above showed simple uses of scan. However, scan also supports referring not only to the prior result and the current sequence value, but also to results from further back than one step.

This is needed, for example, to implement an RNN using scan. Assume that our RNN is defined as follows (written in the same order as the `theano.dot` calls in the code below):

    x(t) = tanh(x(t-1)·W + u(t)·W_in_1 + u(t-4)·W_in_2 + y(t-1)·W_feedback)
    y(t) = x(t-3)·W_out

Note that this network is far from a classical recurrent neural network and might be useless. The reason we defined it as such is to better illustrate the features of scan.
In this case we have a sequence over which we need to iterate, `u`, and two outputs, `x` and `y`. To implement this with scan we first construct a function that computes one iteration step:
```python
def oneStep(u_tm4, u_t, x_tm3, x_tm1, y_tm1, W, W_in_1, W_in_2, W_feedback, W_out):

    x_t = T.tanh(theano.dot(x_tm1, W) + \
                 theano.dot(u_t, W_in_1) + \
                 theano.dot(u_tm4, W_in_2) + \
                 theano.dot(y_tm1, W_feedback))
    y_t = theano.dot(x_tm3, W_out)

    return [x_t, y_t]
```
As a naming convention for the variables we used `a_tmb` to mean `a` at `t-b` and `a_tpb` to mean `a` at `t+b`. Note the order in which the parameters are given, and in which the result is returned. Try to respect chronological order among the taps (time slices of sequences or outputs) used. For scan, what is crucial is only that the variables representing the different time taps appear in the same order as the one in which these taps are given. Also, not only the taps but also the variables should respect an order, since this is how scan figures out what should be represented by what. Given that we have all the Theano variables needed, we construct our RNN as follows:
```python
W = T.matrix()
W_in_1 = T.matrix()
W_in_2 = T.matrix()
W_feedback = T.matrix()
W_out = T.matrix()

u = T.matrix()    # it is a sequence of vectors
x0 = T.matrix()   # initial state of x has to be a matrix, since
                  # it has to cover x[-3]
y0 = T.vector()   # y0 is just a vector since scan has only to provide
                  # y[-1]


([x_vals, y_vals], updates) = theano.scan(fn=oneStep,
                                          sequences=dict(input=u, taps=[-4,-0]),
                                          outputs_info=[dict(initial=x0, taps=[-3,-1]), y0],
                                          non_sequences=[W, W_in_1, W_in_2, W_feedback, W_out],
                                          strict=True)
     # for second input y, scan adds -1 in output_taps by default
```
Now `x_vals` and `y_vals` are symbolic variables pointing to the sequence of x and y values generated by iterating over u. The `sequence_taps` and `outputs_taps` give scan information about which slices are exactly needed. Note that if we want to use `x[t-k]` we do not need to also have `x[t-(k-1)], x[t-(k-2)], ...`, but when applying the compiled function, the numpy array given to represent this sequence should be large enough to cover these values. Assume that we compile the above function and we give as `u` the array `uvals = [0,1,2,3,4,5,6,7,8]`. Abusing notation slightly, scan will consider `uvals[0]` as `u[-4]`, and will start scanning from `uvals[4]` towards the end.
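Here is a minimal sketch of this alignment with a single tapped sequence (the helper name `shifted_sums` is made up for illustration): with taps `[-2, 0]`, the first two entries of the array only serve as history, so nine input values produce seven outputs.

```python
import numpy
import theano
import theano.tensor as T

u = T.vector("u")

# At each step scan receives u[t-2] and u[t], in the order the taps are listed.
out, _ = theano.scan(fn=lambda u_tm2, u_t: u_tm2 + u_t,
                     sequences=dict(input=u, taps=[-2, 0]))

shifted_sums = theano.function([u], out)

uvals = numpy.arange(9, dtype=theano.config.floatX)
print(shifted_sums(uvals))   # uvals[0] and uvals[1] act as history: [ 2. 4. 6. 8. 10. 12. 14.]
```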
Conditional ending of Scan
Scan can also be used as a `repeat-until` block. In such a case scan will stop when either the maximal number of iterations is reached or the provided condition evaluates to True.

For an example, we will compute all powers of two smaller than some provided value `max_value`.
```python
def power_of_2(previous_power, max_value):
    return previous_power*2, theano.scan_module.until(previous_power*2 > max_value)

max_value = T.scalar()
values, _ = theano.scan(power_of_2,
                        outputs_info = T.constant(1.),
                        non_sequences = max_value,
                        n_steps = 1024)

f = theano.function([max_value], values)

print(f(45))
```

```
[  2.   4.   8.  16.  32.  64.]
```
As you can see, in order to terminate on a condition, the only thing required is that the inner function `power_of_2` also return the condition, wrapped in the class `theano.scan_module.until`. The condition has to be expressed in terms of the arguments of the inner function (in this case `previous_power` and `max_value`).

As a rule, scan always expects the condition to be the last thing returned by the inner function; otherwise an error will be raised.
Reducing Scan’s memory usage
This section presents the `scan_checkpoints` function. In short, this function reduces the memory usage of scan (at the cost of more computation time) by not keeping in memory all the intermediate time steps of the loop, and recomputing them when computing the gradients. This function is therefore only useful if you need to compute the gradient of the output of scan with respect to its inputs, and shouldn’t be used otherwise.
Before going more into the details, here are its current limitations:

- It only works in the case where only the output of the last time step is needed, like when computing `A**k` or in an encoder-decoder setup.
- It only accepts sequences of the same length.
- If `n_steps` is specified, it has the same value as the length of any of the sequences.
- It is singly-recurrent, meaning that only the previous time step can be used to compute the current one (i.e. `h[t]` can only depend on `h[t-1]`). In other words, `taps` can not be used in `sequences` and `outputs_info`.
Often, in order to be able to compute the gradients through scan operations, Theano needs to keep in memory some intermediate computations of scan. This can sometimes use a prohibitively large amount of memory. `scan_checkpoints` allows discarding some of those intermediate steps and recomputing them when computing the gradients. Its `save_every_N` argument specifies the number of time steps to do without storing the intermediate results. For example, `save_every_N = 4` will reduce the memory usage by a factor of 4, at the cost of recomputing 3/4 of the time steps of the forward loop. Since the grad of scan is about 6x slower than the forward pass, a ~20% slowdown is expected. Apart from the `save_every_N` argument and the current limitations, the usage of this function is similar to the classic `scan` function.
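As a rough sketch of the usage, here is the A**k example rewritten with `theano.scan_checkpoints` (the step count of 1000 and `save_every_N=100` are arbitrary choices for illustration, and the call assumes the signature given in the reference section below):

```python
import theano
import theano.tensor as T

A = T.vector("A")

# Same accumulation as in the first example, but only every 100th
# intermediate result is kept in memory.
result, updates = theano.scan_checkpoints(fn=lambda prior_result, A: prior_result * A,
                                          outputs_info=T.ones_like(A),
                                          non_sequences=[A],
                                          n_steps=1000,
                                          save_every_N=100)

final_result = result[-1]                     # only the last step is needed
grad_A = theano.grad(final_result.sum(), A)   # discarded steps are recomputed here
power_1000 = theano.function([A], [final_result, grad_A], updates=updates)
```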
Optimizing Scan’s performance
This section covers some ways to improve the performance of a Theano function using Scan.
Minimizing Scan usage
Scan makes it possible to define simple and compact graphs that can do the same work as much larger and more complicated graphs. However, it comes with a significant overhead. As such, when performance is the objective, a good rule of thumb is to perform as much of the computation as possible outside of Scan. This may have the effect of increasing memory usage but can also reduce the overhead introduced by using Scan.
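As an illustrative sketch (the names `step_slow` and `step_fast` are made up here), the projection of the inputs below can be done as one large matrix product outside of Scan instead of one small product per iteration:

```python
import theano
import theano.tensor as T

U = T.matrix("U")      # one row of inputs per time step
W = T.matrix("W")
h0 = T.vector("h0")

# Variant 1: the product U[t].dot(W) is recomputed inside every iteration.
def step_slow(u_t, h_tm1, W):
    return T.tanh(T.dot(u_t, W) + h_tm1)

h_slow, _ = theano.scan(step_slow, sequences=U, outputs_info=h0,
                        non_sequences=W)

# Variant 2: do a single large product outside of Scan and iterate over
# the precomputed rows; the loop body becomes much lighter.
UW = T.dot(U, W)

def step_fast(uw_t, h_tm1):
    return T.tanh(uw_t + h_tm1)

h_fast, _ = theano.scan(step_fast, sequences=UW, outputs_info=h0)
```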
Explicitly passing inputs of the inner function to scan
It is possible, inside of Scan, to use variables previously defined outside of it without explicitly passing them as inputs to the Scan. However, it is often more efficient to explicitly pass them as non-sequence inputs instead. Section Using shared variables - Gibbs sampling provides an explanation for this, and section Using shared variables - the strict flag describes the strict flag, a tool that Scan provides to help ensure that the inputs to the function inside Scan have all been provided as explicit inputs to the `scan()` function.
Deactivating garbage collecting in Scan
Deactivating the garbage collection for Scan can allow it to reuse memory between executions instead of always having to allocate new memory. This can improve performance at the cost of increased memory usage. By default, Scan reuses memory between iterations of the same execution but frees the memory after the last iteration.
There are two ways to achieve this: setting the Theano flag `config.scan.allow_gc` to False, or passing `allow_gc=False` to the `theano.scan()` call (when a value is not provided for this argument, the value of the flag `config.scan.allow_gc` is used).
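A short sketch of the second option (the running-sum loop itself is just a placeholder):

```python
import numpy as np
import theano
import theano.tensor as T

# Option 1 is set outside of the code, e.g.:
#   THEANO_FLAGS='scan.allow_gc=False' python my_script.py

# Option 2: disable garbage collection for this particular scan() call.
x = T.vector("x")
outputs, updates = theano.scan(fn=lambda x_t, acc: acc + x_t,
                               sequences=x,
                               outputs_info=T.as_tensor_variable(np.asarray(0, x.dtype)),
                               allow_gc=False)
running_sum = theano.function([x], outputs)
```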
Graph optimizations
This one is simple but still worth pointing out. Theano is able to automatically recognize and optimize many computation patterns. However, there are patterns that Theano doesn’t optimize because doing so would change the user interface (such as merging shared variables together into a single one, for instance). Additionally, Theano doesn’t catch every case that it could optimize, so it remains useful for performance that the user defines an efficient graph in the first place. This is also the case, and sometimes even more so, for the graph inside of Scan, because it will be executed many times for every execution of the Theano function that contains it.
The LSTM tutorial on DeepLearning.net provides an example of an optimization that Theano cannot perform. Instead of performing many small matrix multiplications between the input `x_t` and each of the shared weight matrices (`W_i`, `W_f`, `W_c`, `W_o`), those matrices are merged into a single shared matrix `W` and the graph performs a single larger matrix multiplication between `x_t` and `W`. The resulting matrix is then sliced to obtain the results that the small individual matrix multiplications would have produced. This optimization replaces several small and inefficient matrix multiplications by a single larger one and thus improves performance at the cost of a potentially higher memory usage.
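A sketch of the idea outside of any particular network (the sizes and names below are made up): four per-gate products are replaced by one product followed by slicing.

```python
import numpy
import theano
import theano.tensor as T

floatX = theano.config.floatX
n_in, n_hid = 100, 200

# One shared matrix holding the four per-gate weight blocks side by side.
W_all = theano.shared(numpy.random.randn(n_in, 4 * n_hid).astype(floatX),
                      name="W_all")

x = T.matrix("x")        # one row of inputs per time step
proj = T.dot(x, W_all)   # single large multiplication, done once outside Scan

# Slice the merged result back into the four pieces a step function would use.
proj_i = proj[:, 0 * n_hid: 1 * n_hid]
proj_f = proj[:, 1 * n_hid: 2 * n_hid]
proj_c = proj[:, 2 * n_hid: 3 * n_hid]
proj_o = proj[:, 3 * n_hid: 4 * n_hid]
```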
reference
This module provides the Scan Op.
Scanning is a general form of recurrence, which can be used for looping. The idea is that you scan a function along some input sequence, producing an output at each time-step that can be seen (but not modified) by the function at the next time-step. (Technically, the function can see the previous K time-steps of your outputs and L time steps (from the past and future) of your inputs.)
So for example, `sum()` could be computed by scanning the `z + x_i` function over a list, given an initial state of `z = 0`.
Special cases:

- A reduce operation can be performed by returning only the last output of a `scan`.
- A map operation can be performed by applying a function that ignores previous steps of the outputs.
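A minimal sketch of these two special cases, using the convenience wrappers documented below (the names `squares` and `total` are arbitrary):

```python
import numpy as np
import theano
import theano.tensor as T

x = T.vector("x")

# map: no recurrence, the step function only sees the current element.
squares, _ = theano.map(fn=lambda x_i: x_i ** 2, sequences=x)

# reduce: like scan with an accumulator, but only the last output is kept,
# so this computes sum(x) given an initial state of z = 0.
total, _ = theano.reduce(fn=lambda x_i, z: z + x_i,
                         sequences=x,
                         outputs_info=T.as_tensor_variable(np.asarray(0, x.dtype)))

f = theano.function([x], [squares, total])
print(f([1., 2., 3.]))   # roughly: [array([1., 4., 9.]), array(6.)]
```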
Often a for-loop can be expressed as a `scan()` operation, and `scan` is the closest that Theano comes to looping. The advantage of using `scan` over for loops is that it allows the number of iterations to be a part of the symbolic graph.
The Scan Op should typically be used by calling any of the following functions: `scan()`, `map()`, `reduce()`, `foldl()`, `foldr()`.
theano.map(fn, sequences, non_sequences=None, truncate_gradient=-1, go_backwards=False, mode=None, name=None)

Similar behaviour to Python’s map.

Parameters:

- fn – The function that `map` applies at each iteration step (see `scan` for more info).
- sequences – List of sequences over which `map` iterates (see `scan` for more info).
- non_sequences – List of arguments passed to `fn`. `map` will not iterate over these arguments (see `scan` for more info).
- truncate_gradient – See `scan`.
- go_backwards (bool) – Decides the direction of iteration. True means that sequences are parsed from the end towards the beginning, while False is the other way around.
- mode – See `scan`.
- name – See `scan`.
theano.reduce(fn, sequences, outputs_info, non_sequences=None, go_backwards=False, mode=None, name=None)

Similar behaviour to Python’s reduce.

Parameters:

- fn – The function that `reduce` applies at each iteration step (see `scan` for more info).
- sequences – List of sequences over which `reduce` iterates (see `scan` for more info).
- outputs_info – List of dictionaries describing the outputs of reduce (see `scan` for more info).
- non_sequences – List of arguments passed to `fn`. `reduce` will not iterate over these arguments (see `scan` for more info).
- go_backwards (bool) – Decides the direction of iteration. True means that sequences are parsed from the end towards the beginning, while False is the other way around.
- mode – See `scan`.
- name – See `scan`.
theano.foldl(fn, sequences, outputs_info, non_sequences=None, mode=None, name=None)

Similar behaviour to Haskell’s foldl.

Parameters:

- fn – The function that `foldl` applies at each iteration step (see `scan` for more info).
- sequences – List of sequences over which `foldl` iterates (see `scan` for more info).
- outputs_info – List of dictionaries describing the outputs of `foldl` (see `scan` for more info).
- non_sequences – List of arguments passed to `fn`. `foldl` will not iterate over these arguments (see `scan` for more info).
- mode – See `scan`.
- name – See `scan`.
theano.foldr(fn, sequences, outputs_info, non_sequences=None, mode=None, name=None)

Similar behaviour to Haskell’s foldr.

Parameters:

- fn – The function that `foldr` applies at each iteration step (see `scan` for more info).
- sequences – List of sequences over which `foldr` iterates (see `scan` for more info).
- outputs_info – List of dictionaries describing the outputs of `foldr` (see `scan` for more info).
- non_sequences – List of arguments passed to `fn`. `foldr` will not iterate over these arguments (see `scan` for more info).
- mode – See `scan`.
- name – See `scan`.
theano.scan(fn, sequences=None, outputs_info=None, non_sequences=None, n_steps=None, truncate_gradient=-1, go_backwards=False, mode=None, name=None, profile=False, allow_gc=None, strict=False, return_list=False)

This function constructs and applies a Scan op to the provided arguments.

Parameters:

- fn – `fn` is a function that describes the operations involved in one step of `scan`. `fn` should construct variables describing the output of one iteration step. It should expect as input Theano variables representing all the slices of the input sequences and previous values of the outputs, as well as all other arguments given to scan as `non_sequences`. The order in which scan passes these variables to `fn` is the following:

  - all time slices of the first sequence
  - all time slices of the second sequence
  - ...
  - all time slices of the last sequence
  - all past slices of the first output
  - all past slices of the second output
  - ...
  - all past slices of the last output
  - all other arguments (the list given as `non_sequences` to scan)

  The order of the sequences is the same as the one in the list `sequences` given to scan. The order of the outputs is the same as the order of `outputs_info`. For any sequence or output, the order of the time slices is the same as the one in which they have been given as taps. For example, if one writes the following:
```python
scan(fn, sequences = [ dict(input= Sequence1, taps = [-3,2,-1])
                     , Sequence2
                     , dict(input =  Sequence3, taps = 3) ]
       , outputs_info = [ dict(initial =  Output1, taps = [-3,-5])
                        , dict(initial = Output2, taps = None)
                        , Output3 ]
       , non_sequences = [ Argument1, Argument2])
```
  `fn` should expect the following arguments in this given order:

  - `Sequence1[t-3]`
  - `Sequence1[t+2]`
  - `Sequence1[t-1]`
  - `Sequence2[t]`
  - `Sequence3[t+3]`
  - `Output1[t-3]`
  - `Output1[t-5]`
  - `Output3[t-1]`
  - `Argument1`
  - `Argument2`
  The list of `non_sequences` can also contain shared variables used in the function, though `scan` is able to figure those out on its own, so they can be skipped. For the clarity of the code we recommend, though, to provide them to scan. To some extent `scan` can also figure out other non-sequences (not shared) even if not passed to scan (but used by `fn`). A simple example of this would be:

```python
import theano.tensor as TT
W   = TT.matrix()
W_2 = W**2
def f(x):
    return TT.dot(x, W_2)
```
  The function is expected to return two things. One is a list of outputs ordered in the same order as `outputs_info`, with the difference that there should be only one output variable per output initial state (even if no tap value is used). Secondly, `fn` should return an update dictionary (that tells how to update any shared variable after each iteration step). The dictionary can optionally be given as a list of tuples. There is no constraint on the order of these two lists; `fn` can return either `(outputs_list, update_dictionary)` or `(update_dictionary, outputs_list)` or just one of the two (in case the other is empty).

  To use `scan` as a while loop, the user needs to change the function `fn` such that a stopping condition is also returned. To do so, the condition has to be wrapped in an `until` class. The condition should be returned as a third element, for example:

```python
...
return [y1_t, y2_t], {x:x+1}, theano.scan_module.until(x < 50)
```

  Note that a number of steps (considered here as the maximum number of steps) is still required even though a condition is passed (it is used to allocate memory if needed).

- sequences – `sequences` is the list of Theano variables or dictionaries describing the sequences `scan` has to iterate over. If a sequence is given wrapped in a dictionary, a set of optional information can be provided about it. The dictionary should have the following keys:

  - `input` (mandatory) – Theano variable representing the sequence.
  - `taps` – Temporal taps of the sequence required by `fn`. They are provided as a list of integers, where a value `k` implies that at iteration step `t` scan will pass to `fn` the slice `t+k`. Default value is `[0]`.

  Any Theano variable in the list `sequences` is automatically wrapped into a dictionary where `taps` is set to `[0]`.

- outputs_info – `outputs_info` is the list of Theano variables or dictionaries describing the initial state of the outputs computed recurrently. When these initial states are given as dictionaries, optional information can be provided about the corresponding output. The dictionary should have the following keys (a short multi-tap example appears after this reference entry):

  - `initial` – Theano variable that represents the initial state of a given output. In case the output is not computed recursively (think of a map) and does not require an initial state, this field can be skipped. Given that (only) the previous time step of the output is used by `fn`, the initial state should have the same shape as the output and should not involve a downcast of the data type of the output. If multiple time taps are used, the initial state should have one extra dimension that covers all the possible taps. For example if we use `-5`, `-2` and `-1` as past taps, at step 0, `fn` will require (by an abuse of notation) `output[-5]`, `output[-2]` and `output[-1]`. This will be given by the initial state, which in this case should have the shape `(5,) + output.shape`. If this variable containing the initial state is called `init_y`, then `init_y[0]` corresponds to `output[-5]`, `init_y[1]` corresponds to `output[-4]`, `init_y[2]` corresponds to `output[-3]`, `init_y[3]` corresponds to `output[-2]` and `init_y[4]` corresponds to `output[-1]`. While this order might seem strange, it comes naturally from splitting an array at a given point. Assume that we have an array `x`, and we choose `k` to be time step `0`. Then our initial state would be `x[:k]`, while the output will be `x[k:]`. Looking at this split, elements in `x[:k]` are ordered exactly like those in `init_y`.
  - `taps` – Temporal taps of the output that will be passed to `fn`. They are provided as a list of negative integers, where a value `k` implies that at iteration step `t` scan will pass to `fn` the slice `t+k`.

  `scan` will follow this logic if partial information is given:

  - If an output is not wrapped in a dictionary, `scan` will wrap it in one assuming that you use only the last step of the output (i.e. it makes your tap value list equal to [-1]).
  - If you wrap an output in a dictionary and you do not provide any taps but you provide an initial state, it will assume that you are using only a tap value of -1.
  - If you wrap an output in a dictionary but you do not provide any initial state, it assumes that you are not using any form of taps.
  - If you provide `None` instead of a variable or an empty dictionary, `scan` assumes that you will not use any taps for this output (like for example in the case of a map).

  If `outputs_info` is an empty list or None, `scan` assumes that no tap is used for any of the outputs. If information is provided just for a subset of the outputs, an exception is raised (because there is no convention on how scan should map the provided information to the outputs of `fn`).

- non_sequences – `non_sequences` is the list of arguments that are passed to `fn` at each step. One can opt to exclude variables used in `fn` from this list as long as they are part of the computational graph, though for clarity we encourage not to do so.

- n_steps – `n_steps` is the number of steps to iterate given as an int or Theano scalar. If any of the input sequences do not have enough elements, scan will raise an error. If the value is 0, the outputs will have 0 rows. If `n_steps` is not provided, `scan` will figure out the amount of steps it should run given its input sequences. `n_steps` < 0 is not supported anymore.

- truncate_gradient – `truncate_gradient` is the number of steps to use in truncated BPTT. If you compute gradients through a Scan op, they are computed using backpropagation through time. By providing a value different from -1, you choose to use truncated BPTT instead of classical BPTT, where you go back only `truncate_gradient` steps in time.

- go_backwards – `go_backwards` is a flag indicating if `scan` should go backwards through the sequences. If you think of each sequence as indexed by time, making this flag True would mean that `scan` goes back in time, namely that for any sequence it starts from the end and goes towards 0.

- name – When profiling `scan`, it is crucial to provide a name for any instance of `scan`. The profiler will produce an overall profile of your code as well as profiles for the computation of one step of each instance of `scan`. The `name` of the instance appears in those profiles and can greatly help to disambiguate information.

- mode – It is recommended to leave this argument to None, especially when profiling `scan` (otherwise the results are not going to be accurate). If you prefer the computations of one step of `scan` to be done differently than the entire function, you can use this parameter to describe how the computations in this loop are done (see `theano.function` for details about possible values and their meaning).

- profile – Flag or string. If true, or different from the empty string, a profile object will be created and attached to the inner graph of scan. In case `profile` is True, the profile object will have the name of the scan instance, otherwise it will have the passed string. The profile object collects (and prints) information only when running the inner graph with the new CVM linker (with default modes; for other linkers this argument is useless).

- allow_gc – Set the value of allow_gc for the internal graph of scan. If set to None, this will use the value of config.scan.allow_gc.

  The full scan behavior related to allocation is determined by this value and the Theano flag allow_gc. If the flag allow_gc is True (default) and this scan parameter allow_gc is False (default), then we let scan allocate all intermediate memory on the first iteration; it is not garbage collected during that first iteration (this is determined by the scan allow_gc). This speeds up the allocation of the following iterations. But all those temporary allocations are freed at the end of all iterations (this is what the Theano flag allow_gc means).

  If you use cnmem and this scan is on GPU, the speed-up from the scan allow_gc is small. If you are missing memory, disabling the scan allow_gc could help you run graphs that request much memory.

- strict – If true, all the shared variables used in `fn` must be provided as a part of `non_sequences` or `sequences`.

- return_list – If True, will always return a list, even if there is only 1 output.

Returns:
Tuple of the form (outputs, updates); `outputs` is either a Theano variable or a list of Theano variables representing the outputs of `scan` (in the same order as in `outputs_info`). `updates` is a subclass of dictionary specifying the update rules for all shared variables used in scan. This dictionary should be passed to `theano.function` when you compile your function. The change compared to a normal dictionary is that keys are validated to be SharedVariables and the addition of such dictionaries is validated to be consistent.

Return type: tuple
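As a small sketch of the multi-tap `outputs_info` convention described above (the Fibonacci recurrence is chosen only for illustration): with taps `[-2, -1]`, the initial state has one extra leading dimension of size 2, and its first entry plays the role of `output[-2]`.

```python
import numpy
import theano
import theano.tensor as T

# Initial state of shape (2,): fib0[0] is output[-2], fib0[1] is output[-1].
fib0 = T.as_tensor_variable(numpy.asarray([0, 1], dtype='int64'))

fibs, _ = theano.scan(fn=lambda f_tm2, f_tm1: f_tm2 + f_tm1,
                      outputs_info=dict(initial=fib0, taps=[-2, -1]),
                      n_steps=10)

fibonacci = theano.function([], fibs)
print(fibonacci())   # -> [ 1  2  3  5  8 13 21 34 55 89]
```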
theano.scan_checkpoints(fn, sequences=[], outputs_info=None, non_sequences=[], name='checkpointscan_fn', n_steps=None, save_every_N=10, padding=True)

Scan function that uses less memory, but is more restrictive.

In `scan()`, if you compute the gradient of the output with respect to the input, you will have to store the intermediate results at each time step, which can be prohibitively huge. This function allows doing `save_every_N` steps of forward computations without storing the intermediate results, and recomputing them during the gradient computation.

Notes

Current assumptions:

- Every sequence has the same length.
- If `n_steps` is specified, it has the same value as the length of any sequence.
- The value of `save_every_N` divides the number of steps the scan will run without remainder.
- Only singly-recurrent and non-recurrent outputs are used. No multiple recurrences.
- Only the last timestep of any output will ever be used.

Parameters:

- fn – `fn` is a function that describes the operations involved in one step of `scan`. See the documentation of `scan()` for more information.
- sequences – `sequences` is the list of Theano variables or dictionaries describing the sequences `scan` has to iterate over. All sequences must be the same length in this version of `scan`.
- outputs_info – `outputs_info` is the list of Theano variables or dictionaries describing the initial state of the outputs computed recurrently.
- non_sequences – `non_sequences` is the list of arguments that are passed to `fn` at each step. One can opt to exclude variables used in `fn` from this list as long as they are part of the computational graph, though for clarity we encourage not to do so.
- n_steps – `n_steps` is the number of steps to iterate given as an int or Theano scalar (> 0). If any of the input sequences do not have enough elements, scan will raise an error. If n_steps is not provided, `scan` will figure out the amount of steps it should run given its input sequences.
- save_every_N – `save_every_N` is the number of steps to go without storing the computations of `scan` (i.e. they will have to be recomputed during the gradient computation).
- padding – If the length of the sequences is not a multiple of `save_every_N`, the sequences will be zero padded to make this version of `scan` work properly, but this will also result in a memory copy. It can be avoided by setting `padding` to False, but then you need to make sure the length of the sequences is a multiple of `save_every_N`.

Returns: Tuple of the form `(outputs, updates)` as in `scan()`, but with a small change: it only contains the output at each `save_every_N` step. The time steps that are not returned by this function will be recomputed during the gradient computation (if any).

Return type: tuple

See also

`scan()`: Looping in Theano.