- scan – Looping in Theano
- Guide
- Simple loop with accumulation: Computing A**k
- Iterating over the first dimension of a tensor: Calculating a polynomial
- Simple accumulation into a scalar, ditching lambda
- Another simple example
- Using shared variables - Gibbs sampling
- Using shared variables - the strict flag
- Multiple outputs, several taps values - Recurrent Neural Network with Scan
- Conditional ending of Scan
- Reducing Scan’s memory usage
- Optimizing Scan’s performance
- reference
scan – Looping in Theano
Guide
The scan function provides the basic functionality needed to do loops in Theano. Scan comes with many bells and whistles, which we will introduce by way of examples.
Simple loop with accumulation: Computing A**k
Assume that, given k, you want to get A**k using a loop. More precisely, if A is a tensor you want to compute A**k elementwise. The python/numpy code might look like:
```python
result = 1
for i in range(k):
    result = result * A
```
There are three things here that we need to handle: the initial value assigned to result, the accumulation of results in result, and the unchanging variable A. Unchanging variables are passed to scan as non_sequences. Initialization occurs in outputs_info, and the accumulation happens automatically.
The equivalent Theano code would be:
```python
import theano
import theano.tensor as T

k = T.iscalar("k")
A = T.vector("A")

# Symbolic description of the result
result, updates = theano.scan(fn=lambda prior_result, A: prior_result * A,
                              outputs_info=T.ones_like(A),
                              non_sequences=A,
                              n_steps=k)

# We only care about A**k, but scan has provided us with A**1 through A**k.
# Discard the values that we don't care about. Scan is smart enough to
# notice this and not waste memory saving them.
final_result = result[-1]

# compiled function that returns A**k
power = theano.function(inputs=[A, k], outputs=final_result, updates=updates)

print(power(range(10), 2))
print(power(range(10), 4))
```

```
[  0.   1.   4.   9.  16.  25.  36.  49.  64.  81.]
[  0.00000000e+00   1.00000000e+00   1.60000000e+01   8.10000000e+01
   2.56000000e+02   6.25000000e+02   1.29600000e+03   2.40100000e+03
   4.09600000e+03   6.56100000e+03]
```
Let us go through the example line by line. What we did is first to construct a function (using a lambda expression) that given prior_result and A returns prior_result * A. The order of parameters is fixed by scan: the output of the prior call to fn (or the initial value, initially) is the first parameter, followed by all non-sequences.

Next we initialize the output as a tensor with the same shape and dtype as A, filled with ones. We give A to scan as a non-sequence parameter and specify the number of steps k to iterate over our lambda expression.

Scan returns a tuple containing our result (result) and a dictionary of updates (empty in this case). Note that the result is not a vector, but a matrix containing the value of A**k at each step. We want the last value (after k steps) so we compile a function to return just that. Note that there is an optimization that, at compile time, will detect that you are using just the last value of the result and ensure that scan does not store all the intermediate values that are used. So do not worry if A and k are large.
Iterating over the first dimension of a tensor: Calculating a polynomial
In addition to looping a fixed number of times, scan can iterate over the leading dimension of tensors (similar to Python's for x in a_list). The tensor(s) to be looped over should be provided to scan using the sequences keyword argument.

Here's an example that builds a symbolic calculation of a polynomial from a list of its coefficients:
```python
import numpy

coefficients = theano.tensor.vector("coefficients")
x = T.scalar("x")

max_coefficients_supported = 10000

# Generate the components of the polynomial
components, updates = theano.scan(fn=lambda coefficient, power, free_variable: coefficient * (free_variable ** power),
                                  outputs_info=None,
                                  sequences=[coefficients, theano.tensor.arange(max_coefficients_supported)],
                                  non_sequences=x)
# Sum them up
polynomial = components.sum()

# Compile a function
calculate_polynomial = theano.function(inputs=[coefficients, x], outputs=polynomial)

# Test
test_coefficients = numpy.asarray([1, 0, 2], dtype=numpy.float32)
test_value = 3
print(calculate_polynomial(test_coefficients, test_value))
print(1.0 * (3 ** 0) + 0.0 * (3 ** 1) + 2.0 * (3 ** 2))
```

```
19.0
19.0
```
There are a few things to note here.
First, we calculate the polynomial by first generating each of its components, and then summing them at the end. (We could also have accumulated them along the way, and then taken the last one, which would have been more memory-efficient, but this is an example.)
Second, since there is no accumulation of results, we can set outputs_info to None. This indicates to scan that it doesn't need to pass the prior result to fn.

The general order of function parameters to fn is:
- sequences (if any), prior result(s) (if needed), non-sequences (if any)
Third, there's a handy trick used to simulate Python's enumerate: simply include theano.tensor.arange in the sequences.

Fourth, given multiple sequences of uneven lengths, scan will truncate to the shortest of them. This makes it safe to pass a very long arange, which we need to do for generality, since arange must have its length specified at creation time.
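To make the full parameter order concrete, here is a small sketch: the accumulated variant of the polynomial example mentioned above, with a hypothetical step function named step. The sequence slices come first, then the prior result, then the non-sequences:

```python
import numpy
import theano
import theano.tensor as T

# float64 variables keep all intermediate dtypes consistent in this sketch
coefficients = T.dvector("coefficients")
x = T.dscalar("x")

# Parameter order: sequence slices first, then the prior result, then non-sequences.
def step(coefficient, power, prior_sum, free_variable):
    return prior_sum + coefficient * (free_variable ** power)

outputs, updates = theano.scan(fn=step,
                               sequences=[coefficients, T.arange(coefficients.shape[0])],
                               outputs_info=T.as_tensor_variable(numpy.asarray(0, dtype="float64")),
                               non_sequences=x)
polynomial = outputs[-1]          # only the last accumulated value is needed
calc = theano.function([coefficients, x], polynomial, updates=updates)
# calc([1, 0, 2], 3) -> 19.0
```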
Simple accumulation into a scalar, ditching lambda
Although this example would seem almost self-explanatory, it stresses a pitfall to be careful of: the initial output state that is supplied, that is outputs_info, must be of a shape similar to that of the output variable generated at each iteration and, moreover, it must not involve an implicit downcast of the latter.
```python
import numpy as np
import theano
import theano.tensor as T

up_to = T.iscalar("up_to")

# define a named function, rather than using lambda
def accumulate_by_adding(arange_val, sum_to_date):
    return sum_to_date + arange_val
seq = T.arange(up_to)

# An unauthorized implicit downcast from the dtype of 'seq', to that of
# 'T.as_tensor_variable(0)' which is of dtype 'int8' by default would occur
# if this instruction were to be used instead of the next one:
# outputs_info = T.as_tensor_variable(0)

outputs_info = T.as_tensor_variable(np.asarray(0, seq.dtype))
scan_result, scan_updates = theano.scan(fn=accumulate_by_adding,
                                        outputs_info=outputs_info,
                                        sequences=seq)
triangular_sequence = theano.function(inputs=[up_to], outputs=scan_result)

# test
some_num = 15
print(triangular_sequence(some_num))
print([n * (n + 1) // 2 for n in range(some_num)])
```

```
[  0   1   3   6  10  15  21  28  36  45  55  66  78  91 105]
[0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105]
```
Another simple example
Unlike some of the prior examples, this one is hard to reproduce except by using scan.
This takes a sequence of array indices, and values to place there, and a "model" output array (whose shape and dtype will be mimicked), and produces a sequence of arrays with the shape and dtype of the model, with all values set to zero except at the provided array indices.
```python
location = T.imatrix("location")
values = T.vector("values")
output_model = T.matrix("output_model")

def set_value_at_position(a_location, a_value, output_model):
    zeros = T.zeros_like(output_model)
    zeros_subtensor = zeros[a_location[0], a_location[1]]
    return T.set_subtensor(zeros_subtensor, a_value)

result, updates = theano.scan(fn=set_value_at_position,
                              outputs_info=None,
                              sequences=[location, values],
                              non_sequences=output_model)

assign_values_at_positions = theano.function(inputs=[location, values, output_model], outputs=result)

# test
test_locations = numpy.asarray([[1, 1], [2, 3]], dtype=numpy.int32)
test_values = numpy.asarray([42, 50], dtype=numpy.float32)
test_output_model = numpy.zeros((5, 5), dtype=numpy.float32)
print(assign_values_at_positions(test_locations, test_values, test_output_model))
```

```
[[[  0.   0.   0.   0.   0.]
  [  0.  42.   0.   0.   0.]
  [  0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.]]

 [[  0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.]
  [  0.   0.   0.  50.   0.]
  [  0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.]]]
```
This demonstrates that you can introduce new Theano variables into a scan function.
Using shared variables - Gibbs sampling
Another useful feature of scan is that it can handle shared variables. For example, if we want to implement a Gibbs chain of length 10 we would do the following:
```python
import theano
from theano import tensor as T

W = theano.shared(W_values) # we assume that ``W_values`` contains the
                            # initial values of your weight matrix

bvis = theano.shared(bvis_values)
bhid = theano.shared(bhid_values)

trng = T.shared_randomstreams.RandomStreams(1234)

def OneStep(vsample):
    hmean = T.nnet.sigmoid(theano.dot(vsample, W) + bhid)
    hsample = trng.binomial(size=hmean.shape, n=1, p=hmean)
    vmean = T.nnet.sigmoid(theano.dot(hsample, W.T) + bvis)
    return trng.binomial(size=vsample.shape, n=1, p=vmean,
                         dtype=theano.config.floatX)

sample = theano.tensor.vector()

values, updates = theano.scan(OneStep, outputs_info=sample, n_steps=10)

gibbs10 = theano.function([sample], values[-1], updates=updates)
```
The first, and probably most crucial, observation is that the updates dictionary becomes important in this case. It links a shared variable with its updated value after k steps. In this case it tells how the random streams get updated after 10 iterations. If you do not pass this update dictionary to your function, you will always get the same 10 sets of random numbers. You can even use the updates dictionary afterwards. Look at this example:
```python
a = theano.shared(1)
values, updates = theano.scan(lambda: {a: a+1}, n_steps=10)
```
In this case the lambda expression does not require any input parameters and returns an update dictionary which tells how a should be updated after each step of scan. If we write:
```python
b = a + 1
c = updates[a] + 1
f = theano.function([], [b, c], updates=updates)

print(b)
print(c)
print(a.get_value())
```
After calling f() once, we will see that because b does not use the updated version of a, it will be 2, c will be 12, while a.get_value() will be 11. If we call f() again, b will become 12, c will be 22 and a.get_value() 21. If we do not pass the updates dictionary to the function, then a.get_value() will always remain 1, b will always be 2 and c will always be 12.
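Concretely, a small usage sketch reusing a, f and updates from above (the numbers in the comments follow the description just given):

```python
# Each call to f() computes b and c and then applies the updates,
# advancing the shared variable a by 10 (one increment per scan step).
print(f())             # b = 2  (old a), c = 12 (updates[a] + 1)
print(a.get_value())   # 11
print(f())             # b = 12, c = 22
print(a.get_value())   # 21
```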
The second observation is that if we use shared variables (W, bvis, bhid) but we do not iterate over them (i.e. scan doesn't really need to know anything in particular about them, just that they are used inside the function applied at each step) you do not need to pass them as arguments. Scan will find them on its own and add them to the graph. However, passing them to the scan function is a good practice, as it avoids the Scan Op calling any earlier (external) Op over and over. This results in a simpler computational graph, which speeds up the optimization and the execution. To pass the shared variables to Scan you need to put them in a list and give it to the non_sequences argument. Here is the Gibbs sampling code updated:
```python
W = theano.shared(W_values) # we assume that ``W_values`` contains the
                            # initial values of your weight matrix

bvis = theano.shared(bvis_values)
bhid = theano.shared(bhid_values)

trng = T.shared_randomstreams.RandomStreams(1234)

# OneStep, with explicit use of the shared variables (W, bvis, bhid)
def OneStep(vsample, W, bvis, bhid):
    hmean = T.nnet.sigmoid(theano.dot(vsample, W) + bhid)
    hsample = trng.binomial(size=hmean.shape, n=1, p=hmean)
    vmean = T.nnet.sigmoid(theano.dot(hsample, W.T) + bvis)
    return trng.binomial(size=vsample.shape, n=1, p=vmean,
                         dtype=theano.config.floatX)

sample = theano.tensor.vector()

# The new scan, with the shared variables passed as non_sequences
values, updates = theano.scan(fn=OneStep,
                              outputs_info=sample,
                              non_sequences=[W, bvis, bhid],
                              n_steps=10)

gibbs10 = theano.function([sample], values[-1], updates=updates)
```
Using shared variables - the strict flag
As we just saw, passing the shared variables to scan may result in a simpler computational graph, which speeds up the optimization and the execution. A good way to remember to pass every shared variable used during scan is to use the strict flag. When set to true, scan checks that all the necessary shared variables in fn are passed as explicit arguments to fn. This has to be ensured by the user. Otherwise, it will result in an error.
Using the original Gibbs sampling example, with strict=True added to the scan() call:
```python
# Same OneStep as in original example.
def OneStep(vsample):
    hmean = T.nnet.sigmoid(theano.dot(vsample, W) + bhid)
    hsample = trng.binomial(size=hmean.shape, n=1, p=hmean)
    vmean = T.nnet.sigmoid(theano.dot(hsample, W.T) + bvis)
    return trng.binomial(size=vsample.shape, n=1, p=vmean,
                         dtype=theano.config.floatX)

# The new scan, adding strict=True to the original call.
values, updates = theano.scan(OneStep,
                              outputs_info=sample,
                              n_steps=10,
                              strict=True)
```

```
Traceback (most recent call last):
...
MissingInputError: An input of the graph, used to compute
DimShuffle{1,0}(<TensorType(float64, matrix)>), was not provided and
not given a value. Use the Theano flag exception_verbosity='high', for
more information on this error.
```
The error indicates that OneStep relies on variables that are not passed as arguments explicitly. Here is the correct version, with the shared variables passed explicitly to OneStep and to scan:
```python
# OneStep, with explicit use of the shared variables (W, bvis, bhid)
def OneStep(vsample, W, bvis, bhid):
    hmean = T.nnet.sigmoid(theano.dot(vsample, W) + bhid)
    hsample = trng.binomial(size=hmean.shape, n=1, p=hmean)
    vmean = T.nnet.sigmoid(theano.dot(hsample, W.T) + bvis)
    return trng.binomial(size=vsample.shape, n=1, p=vmean,
                         dtype=theano.config.floatX)

# The new scan, adding strict=True to the original call, and passing
# explicitly W, bvis and bhid.
values, updates = theano.scan(OneStep,
                              outputs_info=sample,
                              non_sequences=[W, bvis, bhid],
                              n_steps=10,
                              strict=True)
```
Multiple outputs, several taps values - Recurrent Neural Network with Scan
The examples above showed simple uses of scan. However, scan also supports referring not only to the prior result and the current sequence value, but also looking back more than one step.
This is needed, for example, to implement an RNN using scan. Assume that our RNN is defined as follows (written with the same names as the code below):

    x(t) = tanh(dot(x(t-1), W) + dot(u(t), W_in_1) + dot(u(t-4), W_in_2) + dot(y(t-1), W_feedback))
    y(t) = dot(x(t-3), W_out)

Note that this network is far from a classical recurrent neural network and might be useless. The reason we defined it as such is to better illustrate the features of scan.
In this case we have a sequence over which we need to iterate, u, and two outputs, x and y. To implement this with scan we first construct a function that computes one iteration step:
```python
def oneStep(u_tm4, u_t, x_tm3, x_tm1, y_tm1, W, W_in_1, W_in_2, W_feedback, W_out):

    x_t = T.tanh(theano.dot(x_tm1, W) + \
                 theano.dot(u_t, W_in_1) + \
                 theano.dot(u_tm4, W_in_2) + \
                 theano.dot(y_tm1, W_feedback))
    y_t = theano.dot(x_tm3, W_out)

    return [x_t, y_t]
```
As a naming convention for the variables we used a_tmb to mean a at t-b and a_tpb to mean a at t+b. Note the order in which the parameters are given, and in which the result is returned. Try to respect chronological order among the taps (time slices of sequences or outputs) used. For scan, it is crucial only that the variables representing the different time taps be in the same order as the one in which these taps are given. Also, not only the taps should respect an order, but also the variables, since this is how scan figures out what should be represented by what. Given that we have all the Theano variables needed, we construct our RNN as follows:
```python
W = T.matrix()
W_in_1 = T.matrix()
W_in_2 = T.matrix()
W_feedback = T.matrix()
W_out = T.matrix()

u = T.matrix()  # it is a sequence of vectors
x0 = T.matrix() # initial state of x has to be a matrix, since
                # it has to cover x[-3]
y0 = T.vector() # y0 is just a vector since scan has only to provide
                # y[-1]

([x_vals, y_vals], updates) = theano.scan(fn=oneStep,
                                          sequences=dict(input=u, taps=[-4,-0]),
                                          outputs_info=[dict(initial=x0, taps=[-3,-1]), y0],
                                          non_sequences=[W, W_in_1, W_in_2, W_feedback, W_out],
                                          strict=True)
     # for second input y, scan adds -1 in output_taps by default
```
Now x_vals and y_vals are symbolic variables pointing to the sequences of x and y values generated by iterating over u. The taps given for the sequences and outputs tell scan exactly which slices are needed. Note that if we want to use x[t-k] we do not need to also have x[t-(k-1)], x[t-(k-2)], ..., but when applying the compiled function, the numpy array given to represent this sequence should be large enough to cover these values. Assume that we compile the above function, and we give as u the array uvals = [0,1,2,3,4,5,6,7,8]. By abusing notation, scan will consider uvals[0] as u[-4], and will start scanning from uvals[4] towards the end.
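To make the tap semantics concrete, here is a compilation sketch (with hypothetical sizes) that assumes the variables defined in the code above. Note how the initial states and the sequence must be shaped so that they cover every tap that is requested:

```python
import numpy
import theano

# Compile the RNN defined above into a callable function (sketch).
rnn_fn = theano.function([u, x0, y0, W, W_in_1, W_in_2, W_feedback, W_out],
                         [x_vals, y_vals],
                         updates=updates)

# Hypothetical sizes: u holds vectors of length n_in, x of length n_hid,
# y of length n_out. The weight matrices must be given compatible values
# when rnn_fn is called.
n_in, n_hid, n_out = 3, 4, 2
floatX = theano.config.floatX

uvals  = numpy.zeros((9, n_in), dtype=floatX)   # taps [-4, 0]: uvals[0] plays u[-4],
                                                # so scanning starts from uvals[4]
x0vals = numpy.zeros((3, n_hid), dtype=floatX)  # taps [-3, -1]: covers x[-3], x[-2], x[-1]
y0vals = numpy.zeros((n_out,), dtype=floatX)    # tap  [-1]: just y[-1]
```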
Conditional ending of Scan
Scan can also be used as a repeat-until block. In such a case scan will stop when either the maximal number of iterations is reached, or the provided condition evaluates to True.
For an example, we will compute all powers of two smaller than some provided value max_value.
```python
def power_of_2(previous_power, max_value):
    return previous_power*2, theano.scan_module.until(previous_power*2 > max_value)

max_value = T.scalar()
values, _ = theano.scan(power_of_2,
                        outputs_info = T.constant(1.),
                        non_sequences = max_value,
                        n_steps = 1024)

f = theano.function([max_value], values)

print(f(45))
```

```
[  2.   4.   8.  16.  32.  64.]
```
As you can see, in order to terminate on condition, the only thing required is that the inner function power_of_2 also return the condition wrapped in the class theano.scan_module.until. The condition has to be expressed in terms of the arguments of the inner function (in this case previous_power and max_value).
As a rule, scan always expects the condition to be the last thing returned by the inner function, otherwise an error will be raised.
Reducing Scan’s memory usage
This section presents the scan_checkpoints function. In short, this function reduces the memory usage of scan (at the cost of more computation time) by not keeping in memory all the intermediate time steps of the loop, and recomputing them when computing the gradients. This function is therefore only useful if you need to compute the gradient of the output of scan with respect to its inputs, and shouldn't be used otherwise.
Before going more into the details, here are its current limitations:
- It only works in the case where only the output of the last time step is needed, like when computing A**k or in an encoder-decoder setup.
- It only accepts sequences of the same length.
- If n_steps is specified, it has the same value as the length of any sequence.
- It is singly-recurrent, meaning that only the previous time step can be used to compute the current one (i.e. h[t] can only depend on h[t-1]). In other words, taps can not be used in sequences and outputs_info.
Often, in order to be able to compute the gradients through scan operations, Theano needs to keep in memory some intermediate computations of scan. This can sometimes use a prohibitively large amount of memory. scan_checkpoints allows discarding some of those intermediate steps and recomputing them again when computing the gradients. Its save_every_N argument specifies the number of time steps to do without storing the intermediate results. For example, save_every_N = 4 will reduce the memory usage by 4, while having to recompute 3/4 of the time steps of the forward loop. Since the grad of scan is about 6x slower than the forward pass, a ~20% slowdown is expected. Apart from the save_every_N argument and the current limitations, the usage of this function is similar to the classic scan function.
Optimizing Scan’s performance
This section covers some ways to improve performance of a Theano function using Scan.
Minimizing Scan usage
Scan makes it possible to define simple and compact graphs that can do the same work as much larger and more complicated graphs. However, it comes with a significant overhead. As such, when performance is the objective, a good rule of thumb is to perform as much of the computation as possible outside of Scan. This may have the effect of increasing memory usage but can also reduce the overhead introduced by using Scan.
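For example, here is a sketch (hypothetical names) of moving a per-step input projection out of Scan, so that it is done once as a single large operation:

```python
import theano
import theano.tensor as T

U = T.matrix("U")        # sequence of input vectors, one row per step
W_in = T.matrix("W_in")
W_h = T.matrix("W_h")
h0 = T.vector("h0")

# Instead of projecting the input inside the inner function at every step,
# do the projection for all steps at once, outside of Scan ...
U_proj = T.dot(U, W_in)

# ... and let the inner function only do the part that is truly recurrent.
def step(u_proj_t, h_tm1, W_h):
    return T.tanh(u_proj_t + T.dot(h_tm1, W_h))

h, updates = theano.scan(step,
                         sequences=U_proj,
                         outputs_info=h0,
                         non_sequences=W_h)
```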
Explicitly passing inputs of the inner function to scan
It is possible, inside of Scan, to use variables previously defined outside of the Scan without explicitly passing them as inputs to the Scan. However, it is often more efficient to explicitly pass them as non-sequence inputs instead. Section Using shared variables - Gibbs sampling provides an explanation for this and section Using shared variables - the strict flag describes the strict flag, a tool that Scan provides to help ensure that the inputs to the function inside Scan have all been provided as explicit inputs to the scan() function.
Deactivating garbage collecting in Scan
Deactivating the garbage collection for Scan can allow it to reuse memory between executions instead of always having to allocate new memory. This can improve performance at the cost of increased memory usage. By default, Scan reuses memory between iterations of the same execution but frees the memory after the last iteration.
There are two ways to achieve this: using the Theano flag config.scan.allow_gc and setting it to False, or using the argument allow_gc of the function theano.scan() and setting it to False (when a value is not provided for this argument, the value of the flag config.scan.allow_gc is used).
Graph optimizations
This one is simple but still worth pointing out. Theano is able to automatically recognize and optimize many computation patterns. However, there are patterns that Theano doesn't optimize because doing so would change the user interface (such as merging shared variables together into a single one, for instance). Additionally, Theano doesn't catch every case that it could optimize and so it remains useful for performance that the user defines an efficient graph in the first place. This is also the case, and sometimes even more so, for the graph inside of Scan. This is because it will be executed many times for every execution of the Theano function that contains it.
The LSTM tutorial on DeepLearning.net provides an example of an optimization that Theano cannot perform. Instead of performing many small matrix multiplications between the input and each of the individual shared weight matrices, the weight matrices are merged into a single larger shared matrix and the graph performs a single larger matrix multiplication between the input and that merged matrix. The resulting matrix is then sliced to obtain the results that the small individual matrix multiplications would have produced. This optimization replaces several small and inefficient matrix multiplications by a single larger one and thus improves performance at the cost of a potentially higher memory usage.
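A sketch of the idea (hypothetical names and sizes; the individual weight matrices stand for the usual LSTM gate matrices):

```python
import numpy
import theano
import theano.tensor as T

n_in, n_hid = 100, 200
rng = numpy.random.RandomState(0)
floatX = theano.config.floatX

# One merged shared matrix instead of four separate ones (W_i, W_f, W_c, W_o).
W = theano.shared(rng.uniform(size=(n_in, 4 * n_hid)).astype(floatX), name="W")

x = T.matrix("x")        # the whole input sequence, one row per time step

# A single large matrix multiplication, done once outside of Scan ...
x_proj = T.dot(x, W)

# ... then sliced into the four per-gate projections where they are needed.
x_i = x_proj[:, 0 * n_hid:1 * n_hid]
x_f = x_proj[:, 1 * n_hid:2 * n_hid]
x_c = x_proj[:, 2 * n_hid:3 * n_hid]
x_o = x_proj[:, 3 * n_hid:4 * n_hid]
```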
reference
This module provides the Scan Op.
Scanning is a general form of recurrence, which can be used for looping. The idea is that you scan a function along some input sequence, producing an output at each time-step that can be seen (but not modified) by the function at the next time-step. (Technically, the function can see the previous K time-steps of your outputs and L time steps (from the past and future) of your inputs.)
So for example, sum() could be computed by scanning the z + x_i function over a list, given an initial state of z = 0.
Special cases:
- A reduce operation can be performed by returning only the last output of a scan.
- A map operation can be performed by applying a function that ignores previous steps of the outputs.
Often a for-loop can be expressed as a scan() operation, and scan is the closest that Theano comes to looping. The advantage of using scan over for loops is that it allows the number of iterations to be a part of the symbolic graph.
The Scan Op should typically be used by calling any of the following functions: scan(), map(), reduce(), foldl(), foldr().
theano.map(fn, sequences, non_sequences=None, truncate_gradient=-1, go_backwards=False, mode=None, name=None)
Similar behaviour as Python's map.
Parameters:
- fn – The function that map applies at each iteration step (see scan for more info).
- sequences – List of sequences over which map iterates (see scan for more info).
- non_sequences – List of arguments passed to fn. map will not iterate over these arguments (see scan for more info).
- truncate_gradient – See scan.
- go_backwards (bool) – Decides the direction of iteration. True means that sequences are parsed from the end towards the beginning, while False is the other way around.
- mode – See scan.
- name – See scan.
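A minimal usage sketch (hypothetical names), following the parameters described above:

```python
import theano
import theano.tensor as T

v = T.vector("v")
# Apply fn independently to every element of the sequence; no state is carried over.
squares, updates = theano.map(fn=lambda x: x ** 2, sequences=v)
square_all = theano.function([v], squares, updates=updates)
# square_all([1, 2, 3]) -> [1., 4., 9.]
```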
theano.reduce(fn, sequences, outputs_info, non_sequences=None, go_backwards=False, mode=None, name=None)
Similar behaviour as Python's reduce.
Parameters:
- fn – The function that reduce applies at each iteration step (see scan for more info).
- sequences – List of sequences over which reduce iterates (see scan for more info).
- outputs_info – List of dictionaries describing the outputs of reduce (see scan for more info).
- non_sequences – List of arguments passed to fn. reduce will not iterate over these arguments (see scan for more info).
- go_backwards (bool) – Decides the direction of iteration. True means that sequences are parsed from the end towards the beginning, while False is the other way around.
- mode – See scan.
- name – See scan.
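A minimal usage sketch (hypothetical names): unlike map, reduce carries the accumulated state through outputs_info and returns only the last output:

```python
import numpy
import theano
import theano.tensor as T

x = T.dvector("x")
acc0 = T.as_tensor_variable(numpy.asarray(1, dtype="float64"))  # initial accumulator

# fn receives the current sequence element first, then the prior accumulator.
product, updates = theano.reduce(fn=lambda x_i, acc: acc * x_i,
                                 sequences=x,
                                 outputs_info=acc0)
compute_product = theano.function([x], product, updates=updates)
# compute_product([1, 2, 3, 4]) -> 24.0
```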
theano.foldl(fn, sequences, outputs_info, non_sequences=None, mode=None, name=None)
Similar behaviour as Haskell's foldl.
Parameters:
- fn – The function that foldl applies at each iteration step (see scan for more info).
- sequences – List of sequences over which foldl iterates (see scan for more info).
- outputs_info – List of dictionaries describing the outputs of reduce (see scan for more info).
- non_sequences – List of arguments passed to fn. foldl will not iterate over these arguments (see scan for more info).
- mode – See scan.
- name – See scan.
theano.foldr(fn, sequences, outputs_info, non_sequences=None, mode=None, name=None)
Similar behaviour as Haskell's foldr.
Parameters:
- fn – The function that foldr applies at each iteration step (see scan for more info).
- sequences – List of sequences over which foldr iterates (see scan for more info).
- outputs_info – List of dictionaries describing the outputs of reduce (see scan for more info).
- non_sequences – List of arguments passed to fn. foldr will not iterate over these arguments (see scan for more info).
- mode – See scan.
- name – See scan.
theano.scan(fn, sequences=None, outputs_info=None, non_sequences=None, n_steps=None, truncate_gradient=-1, go_backwards=False, mode=None, name=None, profile=False, allow_gc=None, strict=False, return_list=False)
This function constructs and applies a Scan op to the provided arguments.
Parameters:
- fn –
  fn is a function that describes the operations involved in one step of scan. fn should construct variables describing the output of one iteration step. It should expect as input Theano variables representing all the slices of the input sequences and previous values of the outputs, as well as all other arguments given to scan as non_sequences. The order in which scan passes these variables to fn is the following:
  - all time slices of the first sequence
  - all time slices of the second sequence
  - ...
  - all time slices of the last sequence
  - all past slices of the first output
  - all past slices of the second output
  - ...
  - all past slices of the last output
  - all other arguments (the list given as non_sequences to scan)

  The order of the sequences is the same as the one in the list sequences given to scan. The order of the outputs is the same as the order of outputs_info. For any sequence or output the order of the time slices is the same as the one in which they have been given as taps. For example, if one writes the following:
```python
scan(fn, sequences = [ dict(input= Sequence1, taps = [-3,2,-1])
                     , Sequence2
                     , dict(input =  Sequence3, taps = 3) ]
       , outputs_info = [ dict(initial =  Output1, taps = [-3,-5])
                        , dict(initial = Output2, taps = None)
                        , Output3 ]
       , non_sequences = [ Argument1, Argument2])
```
  fn should expect the following arguments in this given order:
  1. Sequence1[t-3]
  2. Sequence1[t+2]
  3. Sequence1[t-1]
  4. Sequence2[t]
  5. Sequence3[t+3]
  6. Output1[t-3]
  7. Output1[t-5]
  8. Output3[t-1]
  9. Argument1
  10. Argument2
  The list of non_sequences can also contain shared variables used in the function, though scan is able to figure those out on its own, so they can be skipped. For the clarity of the code we recommend though to provide them to scan. To some extent scan can also figure out other non-sequences (not shared), even if not passed to scan (but used by fn). A simple example of this would be:
```python
import theano.tensor as TT
W   = TT.matrix()
W_2 = W**2
def f(x):
    return TT.dot(x, W_2)
```
  The function is expected to return two things. One is a list of outputs ordered in the same order as outputs_info, with the difference that there should be only one output variable per output initial state (even if no tap value is used). Secondly, fn should return an update dictionary (that tells how to update any shared variable after each iteration step). The dictionary can optionally be given as a list of tuples. There is no constraint on the order of these two lists; fn can return either (outputs_list, update_dictionary) or (update_dictionary, outputs_list) or just one of the two (in case the other is empty).
  To use scan as a while loop, the user needs to change the function fn such that a stopping condition is also returned. To do so, one needs to wrap the condition in an until class. The condition should be returned as a third element, for example:
```python
...
return [y1_t, y2_t], {x: x + 1}, theano.scan_module.until(x < 50)
```
  Note that a number of steps (considered here as the maximum number of steps) is still required even though a condition is passed (it is used to allocate memory if needed).
- sequences –
  sequences is the list of Theano variables or dictionaries describing the sequences scan has to iterate over. If a sequence is given as wrapped in a dictionary, then a set of optional information can be provided about the sequence. The dictionary should have the following keys:
  - input (mandatory) – Theano variable representing the sequence.
  - taps – Temporal taps of the sequence required by fn. They are provided as a list of integers, where a value k implies that at iteration step t scan will pass to fn the slice t+k. The default value is [0].

  Any Theano variable in the list sequences is automatically wrapped into a dictionary where taps is set to [0].
- outputs_info –
  outputs_info is the list of Theano variables or dictionaries describing the initial state of the outputs computed recurrently. When these initial states are given as dictionaries, optional information can be provided about the output corresponding to those initial states. The dictionary should have the following keys:
  - initial – Theano variable that represents the initial state of a given output. In case the output is not computed recursively (think of a map) and does not require an initial state, this field can be skipped. Given that (only) the previous time step of the output is used by fn, the initial state should have the same shape as the output and should not involve a downcast of the data type of the output. If multiple time taps are used, the initial state should have one extra dimension that covers all the possible taps. For example, if we use -5, -2 and -1 as past taps, at step 0, fn will require (by an abuse of notation) output[-5], output[-2] and output[-1]. These will be given by the initial state, which in this case should have the shape (5,) + output.shape. If this variable containing the initial state is called init_y, then init_y[0] corresponds to output[-5], init_y[1] corresponds to output[-4], init_y[2] corresponds to output[-3], init_y[3] corresponds to output[-2], and init_y[4] corresponds to output[-1]. While this order might seem strange, it comes naturally from splitting an array at a given point. Assume that we have an array x, and we choose k to be time step 0. Then our initial state would be x[:k], while the output will be x[k:]. Looking at this split, elements in x[:k] are ordered exactly like those in init_y.
  - taps – Temporal taps of the output that will be passed to fn. They are provided as a list of negative integers, where a value k implies that at iteration step t scan will pass to fn the slice t+k.

  scan will follow this logic if partial information is given:
  - If an output is not wrapped in a dictionary, scan will wrap it in one assuming that you use only the last step of the output (i.e. it makes your tap value list equal to [-1]).
  - If you wrap an output in a dictionary and you do not provide any taps but you provide an initial state, it will assume that you are using only a tap value of -1.
  - If you wrap an output in a dictionary but you do not provide any initial state, it assumes that you are not using any form of taps.
  - If you provide a None instead of a variable or an empty dictionary, scan assumes that you will not use any taps for this output (like for example in case of a map).

  If outputs_info is an empty list or None, scan assumes that no tap is used for any of the outputs. If information is provided just for a subset of the outputs, an exception is raised (because there is no convention on how scan should map the provided information to the outputs of fn).
- non_sequences –
  non_sequences is the list of arguments that are passed to fn at each step. One can opt to exclude variables used in fn from this list, as long as they are part of the computational graph, though for clarity we encourage not to do so.
- n_steps –
  n_steps is the number of steps to iterate given as an int or Theano scalar. If any of the input sequences do not have enough elements, scan will raise an error. If the value is 0, the outputs will have 0 rows. If n_steps is not provided, scan will figure out the number of steps it should run given its input sequences. n_steps < 0 is not supported anymore.
- truncate_gradient –
  truncate_gradient is the number of steps to use in truncated BPTT. If you compute gradients through a scan op, they are computed using backpropagation through time. By providing a different value than -1, you choose to use truncated BPTT instead of classical BPTT, where you go for only truncate_gradient number of steps back in time.
- go_backwards –
  go_backwards is a flag indicating if scan should go backwards through the sequences. If you think of each sequence as indexed by time, making this flag True would mean that scan goes back in time, namely that for any sequence it starts from the end and goes towards 0.
- name – When profiling scan, it is crucial to provide a name for any instance of scan. The profiler will produce an overall profile of your code as well as profiles for the computation of one step of each instance of scan. The name of the instance appears in those profiles and can greatly help to disambiguate information.
- mode – It is recommended to leave this argument to None, especially when profiling scan (otherwise the results are not going to be accurate). If you prefer the computations of one step of scan to be done differently than the entire function, you can use this parameter to describe how the computations in this loop are done (see theano.function for details about possible values and their meaning).
- profile – Flag or string. If true, or different from the empty string, a profile object will be created and attached to the inner graph of scan. In case profile is True, the profile object will have the name of the scan instance, otherwise it will have the passed string. The profile object collects (and prints) information only when running the inner graph with the new CVM linker (with default modes; for other linkers this argument is useless).
- allow_gc – Set the value of allow_gc for the internal graph of scan. If set to None, this will use the value of config.scan.allow_gc.
  The full scan behavior related to allocation is determined by this value and the Theano flag allow_gc. If the flag allow_gc is True (default) and this scan parameter allow_gc is False (default), then we let scan allocate all intermediate memory on the first iteration; it is not garbage collected during that first iteration (this is determined by the scan allow_gc). This speeds up allocation in the following iterations. But we free all those temporary allocations at the end of all iterations (this is what the Theano flag allow_gc means).
  If you use cnmem and this scan is on GPU, the speed up from the scan allow_gc is small. If you are missing memory, disabling the scan allow_gc could help you run graphs that request much memory.
- strict – If true, all the shared variables used in fn must be provided as a part of non_sequences or sequences.
- return_list – If True, will always return a list, even if there is only 1 output.

Returns:
Tuple of the form (outputs, updates); outputs is either a Theano variable or a list of Theano variables representing the outputs of scan (in the same order as in outputs_info). updates is a subclass of dictionary specifying the update rules for all shared variables used in scan. This dictionary should be passed to theano.function when you compile your function. The change compared to a normal dictionary is that we validate that the keys are SharedVariables and that additions to the dictionary are consistent.

Return type: tuple
theano.scan_checkpoints(fn, sequences=[], outputs_info=None, non_sequences=[], name='checkpointscan_fn', n_steps=None, save_every_N=10, padding=True)
Scan function that uses less memory, but is more restrictive.
In scan(), if you compute the gradient of the output with respect to the input, you will have to store the intermediate results at each time step, which can be prohibitively huge. This function allows doing save_every_N steps of forward computations without storing the intermediate results, and recomputing them during the gradient computation.
Notes
Current assumptions:
- Every sequence has the same length.
- If n_steps is specified, it has the same value as the length of any sequence.
- The value of save_every_N divides the number of steps the scan will run without remainder.
- Only singly-recurrent and non-recurrent outputs are used. No multiple recurrences.
- Only the last timestep of any output will ever be used.
Parameters:
- fn – fn is a function that describes the operations involved in one step of scan. See the documentation of scan() for more information.
- sequences – sequences is the list of Theano variables or dictionaries describing the sequences scan has to iterate over. All sequences must be the same length in this version of scan.
- outputs_info – outputs_info is the list of Theano variables or dictionaries describing the initial state of the outputs computed recurrently.
- non_sequences – non_sequences is the list of arguments that are passed to fn at each step. One can opt to exclude variables used in fn from this list, as long as they are part of the computational graph, though for clarity we encourage not to do so.
- n_steps – n_steps is the number of steps to iterate given as an int or Theano scalar (> 0). If any of the input sequences do not have enough elements, scan will raise an error. If n_steps is not provided, scan will figure out the number of steps it should run given its input sequences.
- save_every_N – save_every_N is the number of steps to go without storing the computations of scan (i.e. they will have to be recomputed during the gradient computation).
- padding – If the length of the sequences is not a multiple of save_every_N, the sequences will be zero padded to make this version of scan work properly, but this will also result in a memory copy. It can be avoided by setting padding to False, but then you need to make sure the length of the sequences is a multiple of save_every_N.

Returns: Tuple of the form (outputs, updates) as in scan(), but with a small change: it only contains the output at each save_every_N step. The time steps that are not returned by this function will be recomputed during the gradient computation (if any).

Return type: tuple
See also
scan(): Looping in Theano.