Optimizations
- Theano applies many kinds of graph optimizations, with different objectives:
- simplifying and standardizing the form of the expression graph (e.g. merge, add canonicalization ),
- reducing the maximum memory footprint (e.g. inplace_elemwise),
- increasing execution speed (e.g. constant folding).
The optimizations are listed in roughly chronological order. The table belowgives a quick summary of the optimizations included in the default modes.The descriptions are brief and point to further reading.
If you would like to add an additional optimization, refer toGraph optimization in the guide to extending Theano.
Note
This list is partial.
The print_summary method allows several OpDBs and optimizers to list the executed optimizations.This makes it possible to have an up-to-date list.
python -c ‘import theano; theano.compile.FAST_RUN.optimizer.print_summary()’
python -c ‘import theano; theano.compile.FAST_COMPILE.optimizer.print_summary()’
Optimization | FAST_RUN | FAST_COMPILE | Stabilization |
---|---|---|---|
merge | x | x | |
constant folding | x | x | |
GPU transfer | x | x | |
shape promotion | x | ||
fill cut | x | ||
inc_subtensor srlz. | x | ||
reshape_chain | x | ||
const. elimination | x | ||
add canonical. | x | ||
mul canonical. | x | ||
dot22 | x | ||
sparse_dot | x | ||
sum_scalar_mul | x | ||
neg_neg | x | ||
neg_div_neg | x | ||
add specialize | x | ||
mul specialize | x | ||
pow specialize | x | ||
inplace_setsubtensor | x | ||
gemm | x | ||
inplace_elemwise | x | ||
inplace_random | x | ||
elemwise fusion | x | ||
local_log_softmax | x | x | |
local_remove_all_assert |
- merge
- A simple optimization in which redundant Apply nodes arecombined. For example, in
function([x,y], [(x+y)2, (x+y)3])
the mergeoptimization will ensure thatx
andy
are only added once.
This optimization is very useful because it frees users to writehighly redundant mathematical code. Theano will make sure to computejust what is necessary.
See MergeOptimizer
.
- constant folding
- When all the inputs to an expression are constant, then the expressioncan be pre-computed at compile-time.
See opt.constant_folding()
- shape promotion
- Theano often knows how to infer the shape of an output from the shapeof its inputs. Without this optimization, it would otherwise have tocompute things (e.g.
log(x)
) just to find out the shape of it!
See opt.localshape_lift*()
- fill cut
- Fill(a,b) means to make a tensor of the shape of a full of the value b.Often when fills are used with elementwise operations (e.g. f) they areun-necessary:
f(fill(a,b), c) -> f(b, c)
f(fill(a, b), fill(c, d), e) -> fill(a, fill(c, f(b, d, e)))
See opt.local_fill_sink()
- inc_subtensor serialization
- Incrementing a small subregion of a large tensor can be done quicklyusing an inplace operation, but if two increments are being done onthe same large tensor, then only one of them can be done inplace.This optimization reorders such graphs so that all increments can bedone inplace.
inc_subensor(a,b,idx) + inc_subtensor(a,c,idx) -> inc_subtensor(inc_subtensor(a,b,idx),c,idx)
See local_IncSubtensor_serialize()
- reshape_chain
- This optimizes graphs like
reshape(reshape(x, shape1), shape2)
->reshape(x, shape2)
See local_reshape_chain()
- constant elimination
- Many constants indicate special cases, such as
pow(x,1) -> x
.Theano recognizes many of these special cases.
See localmul_specialize()
, local_mul_specialize()
,:func:_local_mul_specialize
- add canonicalization
- Rearrange expressions of additions and subtractions to a canonicalform:
See Canonizer
, local_add_canonizer
- mul canonicalization
- Rearrange expressions of multiplication and division to a canonicalform:
See Canonizer
, local_mul_canonizer
- dot22
- This simple optimization replaces dot(matrix, matrix) with a specialdot22 op that only works for matrix multiplication. This op isimplemented with a call to GEMM, and sometimes replaced entirely bythe gemm optimization.
See local_dot_to_dot22()
- sparse_dot
- Theano has a sparse matrix multiplication algorithm that is faster inmany cases than scipy’s (for dense matrix output). This optimizationswaps scipy’s algorithm for ours.
See local_structured_dot()
- sum_scalar_mul
- This optimizes graphs like
sum(scalar tensor)
->scalar
sum(tensor)
See local_sum_mul_by_scalar()
- neg_neg
- Composition of two negatives can be cancelled out.
See local_neg_neg()
- neg_div_neg
- Matching negatives in both the numerator and denominator can both be removed.
See local_neg_div_neg()
- add specialization
- This optimization simplifies expressions involving the addition ofzero.
See local_add_specialize()
- mul specialization
- Several special cases of mul() exist, and this optimization tries torecognize them. Some examples include:
mul(x,x)
->x**2
mul(x,0)
->zeros_like(x)
*mul(x, -1)
->neg(x)
See local_mul_specialize()
- pow specialization
- Several special cases of pow() exist, and this optimization tries torecognize them. Some examples include:
pow(x,2)
->x**2
pow(x,0)
->ones_like(x)
*pow(x, -0.5)
->inv(sqrt(x))
See local_pow_specialize()
- inplace_setsubtensor
- In order to be a pure Op, setsubtensor must copy its entire input, andmodify just the subtensor in question (possibly a single element). Itis much more efficient to modify that element inplace.
See local_inplace_setsubtensor()
- gemm
- Numerical libraries such as MKL and ATLAS implement the BLAS-level-3interface, and provide a function GEMM that implements, for matrices A, B_and _Z, and scalars .
This optimization tries to rearrange a variety of linear algebraexpressions into one or more instances of this motif, and replace themeach with a single Gemm Op.
See GemmOptimizer
- inplace_elemwise
- When one of the inputs to an elementwise expression has the same typeand shape as the output, and is no longer needed for computation afterthe elemwise expression is evaluated, then we can reuse the storage ofthe input to store the output.
See insert_inplace_optimizer()
- inplace_random
- Typically when a graph uses random numbers, the RandomState is storedin a shared variable, used once per call and, updated after each functioncall. In this common case, it makes sense to update the random number generator in-place.
See random_make_inplace()
- elemwise fusion
- This optimization compresses subgraphs of computationally cheapelementwise operations into a single Op that does the whole job in asingle pass over the inputs (like loop fusion). This is a win whentransfer from main memory to the CPU (or from graphics memory to theGPU) is a bottleneck.
See FusionOptimizer
- GPU transfer
The current strategy for choosing which expressions to evaluate on theCPU and which to evaluate on the GPU is a greedy one. There are anumber of Ops TODO with GPU implementations and whenever we finda graph copying data from GPU to CPU in order to evaluate anexpression that could have been evaluated on the GPU, we substitutethe GPU version of that Op for the CPU version. Likewise if we arecopying the output of a Op with a GPU implementation to the GPU,then we substitute the GPU version for the CPU version. In this way, if all goes well,this procedure will result in a graph with the following form:
- copy non-shared inputs to GPU
- carry out most/all computations on the GPU
- copy output back to CPU
When using a GPU,
shared()
will default to GPU storage for‘float32’ ndarray arguments, and these shared variables act as seedsfor the greedy algorithm.
See theano.sandbox.cuda.opt.*()
.
- local_log_softmax
- This is a stabilization optimization.It can happen due to rounding errors that the softmax probability of one value gets to 0.Taking the log of 0 would generate -inf that will probably generate NaN later.We return a closer answer.
- local_remove_all_assert
- This is an unsafe optimization.For the fastest possible Theano, this optimization can be enabled bysetting
optimizer_including=local_remove_all_assert
which willremove all assertions in the graph for checking user inputs are valid.Use this optimization if you are sure everthing is valid in your graph.