Deep Learning with Theano
For debugging purposes, Theano can print more verbose information and offers different optimization modes:
>>> theano.config.exception_verbosity='high'
>>> theano.config.mode
'Mode'
>>> theano.config.optimizer='fast_compile'
For Theano to use the config.optimizer value, config.mode has to be set to Mode; otherwise, the value in config.mode takes precedence:
| config.mode / function mode | config.optimizer (*) | Description |
|---|---|---|
| | | Default; best run performance, slow compilation |
| | | Disable optimizations |
| | | Reduce the number of optimizations, compiles faster |
| | | Use the default mode, equivalent to |
| | | NaNs, Infs, and abnormally large values will raise errors |
| | | Self-checks and assertions during compilation |
The same values as in config.mode can be passed to the mode parameter when compiling a function:
>>> f = theano.function([a,s0], results, updates=updates, mode='FAST_COMPILE')
Disabling optimization and choosing high verbosity will help you find errors in the computation graph.
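The debugging settings discussed here can also be supplied from outside the script through the THEANO_FLAGS environment variable, which Theano reads at import time as a comma-separated list of key=value pairs. The following is a minimal sketch under that assumption; `build_theano_flags` is a hypothetical helper introduced for illustration and is not part of Theano.

```python
import os


def build_theano_flags(**settings):
    """Join settings into the comma-separated key=value form of THEANO_FLAGS.

    Hypothetical helper, not a Theano API.
    """
    return ",".join("{}={}".format(key, value) for key, value in settings.items())


# Disable graph optimizations and ask for verbose error messages.
debug_flags = build_theano_flags(optimizer=None, exception_verbosity="high")

# Theano parses THEANO_FLAGS when it is first imported, so the variable
# has to be set before any `import theano` statement runs.
os.environ["THEANO_FLAGS"] = debug_flags
print(debug_flags)
```

From a shell, the same effect would be obtained by prefixing the command with the variable, in the same way as for CUDA_LAUNCH_BLOCKING. Keeping these flags out of the source code makes it easy to switch back to the default, optimized mode once the error is found.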
For debugging on the GPU, you need to force synchronous execution with the environment variable CUDA_LAUNCH_BLOCKING, since GPU execution is, by default, fully asynchronous:
CUDA_LAUNCH_BLOCKING=1 python
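When the debugged program is started from another Python script rather than from a shell, the variable can be passed through the child environment. The sketch below only illustrates the mechanism: the child process is a stand-in that echoes the variable, where in practice it would be the Theano script under investigation.

```python
import os
import subprocess
import sys

# Copy the current environment and request synchronous CUDA kernel launches.
env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")

# Stand-in child process: it only prints the variable it received.
# A real run would launch the Theano script to debug instead.
child_code = "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env,
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)
seen_by_child = result.stdout.strip()
print(seen_by_child)
```

With synchronous launches, an error is reported at the call that caused it instead of at some later point, at the cost of slower execution, so the variable is best left unset outside of debugging sessions.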
To find out the origin of the latencies in your computation graph, Theano provides a profiling mode.
Activate profiling:
>>> theano.config.profile=True
Activate memory profiling:
>>> theano.config.profile_memory=True
Activate profiling of optimization phase:
>>> theano.config.profile_optimizer=True
Or directly during compilation:
>>> f = theano.function([a,s0], results, profile=True)
>>> f.profile.summary()
Function profiling
==================
Message: <stdin>:1
Time in 1 calls to Function.__call__: 1.490116e-03s
Time in Function.fn.__call__: 1.251936e-03s (84.016%)
Time in thunks: 1.203537e-03s (80.768%)
Total compile time: 1.720619e-01s
Number of Apply nodes: 14
Theano Optimizer time: 1.382768e-01s
Theano validate time: 1.308680e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 2.405691e-02s
Import time 1.272917e-03s
Node make_thunk time 2.329803e-02s
Time in all call to theano.grad() 0.000000e+00s
Time since theano import 520.661s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
58.2% 58.2% 0.001s 7.00e-04s Py 1 1 theano.scan_module.scan_op.Scan
27.3% 85.4% 0.000s 1.64e-04s Py 2 2 theano.sandbox.cuda.basic_ops.GpuFromHost
6.1% 91.5% 0.000s 7.30e-05s Py 1 1 theano.sandbox.cuda.basic_ops.HostFromGpu
5.5% 97.0% 0.000s 6.60e-05s C 1 1 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
1.1% 98.0% 0.000s 3.22e-06s C 4 4 theano.tensor.elemwise.Elemwise
0.7% 98.8% 0.000s 8.82e-06s C 1 1 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.7% 99.4% 0.000s 7.87e-06s C 1 1 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.3% 99.7% 0.000s 3.81e-06s C 1 1 theano.compile.ops.Shape_i
0.3% 100.0% 0.000s 1.55e-06s C 2 2 theano.tensor.basic.ScalarFromTensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
58.2% 58.2% 0.001s 7.00e-04s Py 1 1 forall_inplace,gpu,scan_fn}
27.3% 85.4% 0.000s 1.64e-04s Py 2 2 GpuFromHost
6.1% 91.5% 0.000s 7.30e-05s Py 1 1 HostFromGpu
5.5% 97.0% 0.000s 6.60e-05s C 1 1 GpuIncSubtensor{InplaceSet;:int64:}
0.7% 97.7% 0.000s 8.82e-06s C 1 1 GpuSubtensor{int64:int64:int16}
0.7% 98.4% 0.000s 7.87e-06s C 1 1 GpuAllocEmpty
0.3% 98.7% 0.000s 4.05e-06s C 1 1 Elemwise{switch,no_inplace}
0.3% 99.0% 0.000s 4.05e-06s C 1 1 Elemwise{le,no_inplace}
0.3% 99.3% 0.000s 3.81e-06s C 1 1 Shape_i{0}
0.3% 99.6% 0.000s 1.55e-06s C 2 2 ScalarFromTensor
0.2% 99.8% 0.000s 2.86e-06s C 1 1 Elemwise{Composite{Switch(LT(i0, i1), i0, i1)}}
0.2% 100.0% 0.000s 1.91e-06s C 1 1 Elemwise{Composite{Switch(i0, i1, minimum(i2, i3))}}[(0, 2)]
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
58.2% 58.2% 0.001s 7.00e-04s 1 12 forall_inplace,gpu,scan_fn}(TensorConstant{10}, GpuSubtensor{int64:int64:int16}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuFromHost.0)
21.9% 80.1% 0.000s 2.64e-04s 1 3 GpuFromHost(<TensorType(float32, vector)>)
6.1% 86.2% 0.000s 7.30e-05s 1 13 HostFromGpu(forall_inplace,gpu,scan_fn}.0)
5.5% 91.6% 0.000s 6.60e-05s 1 4 GpuIncSubtensor{InplaceSet;:int64:}(GpuAllocEmpty.0, CudaNdarrayConstant{[ 0.]}, Constant{1})
5.3% 97.0% 0.000s 6.41e-05s 1 0 GpuFromHost(s0)
0.7% 97.7% 0.000s 8.82e-06s 1 11 GpuSubtensor{int64:int64:int16}(GpuFromHost.0, ScalarFromTensor.0, ScalarFromTensor.0, Constant{1})
0.7% 98.4% 0.000s 7.87e-06s 1 1 GpuAllocEmpty(TensorConstant{10})
0.3% 98.7% 0.000s 4.05e-06s 1 8 Elemwise{switch,no_inplace}(Elemwise{le,no_inplace}.0, TensorConstant{0}, TensorConstant{0})
0.3% 99.0% 0.000s 4.05e-06s 1 6 Elemwise{le,no_inplace}(Elemwise{Composite{Switch(LT(i0, i1), i0, i1)}}.0, TensorConstant{0})
0.3% 99.3% 0.000s 3.81e-06s 1 2 Shape_i{0}(<TensorType(float32, vector)>)
0.3% 99.6% 0.000s 3.10e-06s 1 10 ScalarFromTensor(Elemwise{switch,no_inplace}.0)
0.2% 99.8% 0.000s 2.86e-06s 1 5 Elemwise{Composite{Switch(LT(i0, i1), i0, i1)}}(TensorConstant{10}, Shape_i{0}.0)
0.2% 100.0% 0.000s 1.91e-06s 1 7 Elemwise{Composite{Switch(i0, i1, minimum(i2, i3))}}[(0, 2)](Elemwise{le,no_inplace}.0, TensorConstant{0}, Elemwise{Composite{Switch(LT(i0, i1), i0, i1)}}.0, Shape_i{0}.0)
0.0% 100.0% 0.000s 0.00e+00s 1 9 ScalarFromTensor(Elemwise{Composite{Switch(i0, i1, minimum(i2, i3))}}[(0, 2)].0)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
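To read the summary, it helps to see where the percentages come from. The short sketch below recomputes a few of them from the timings printed above. It assumes, consistently with the numbers shown, that the percentages in the header are relative to the time in Function.__call__, and that the `<% time>` column of the Class, Ops, and Apply tables is relative to the time in thunks. The variable names are chosen here for illustration.

```python
# Timings copied from the profile summary above, in seconds.
time_in_call = 1.490116e-03    # Time in 1 calls to Function.__call__
time_in_fn = 1.251936e-03      # Time in Function.fn.__call__
time_in_thunks = 1.203537e-03  # Time in thunks
compile_time = 1.720619e-01    # Total compile time
optimizer_time = 1.382768e-01  # Theano Optimizer time
scan_time = 7.00e-04           # <time per call> of the Scan row (1 call)


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


fn_share = percent(time_in_fn, time_in_call)           # header: 84.016%
thunk_share = percent(time_in_thunks, time_in_call)    # header: 80.768%
scan_share = percent(scan_time, time_in_thunks)        # Class table: 58.2%
optimizer_share = percent(optimizer_time, compile_time)

# Time spent in the call outside of the Apply nodes themselves.
overhead = time_in_call - time_in_thunks

print("fn: {:.3f}%  thunks: {:.3f}%  scan: {:.1f}%".format(
    fn_share, thunk_share, scan_share))
print("optimizer share of compile time: {:.1f}%".format(optimizer_share))
print("overhead outside thunks: {:.3e}s".format(overhead))
```

In this run, the Scan node and the two host-to-GPU transfers account for most of the execution time, while compilation, dominated by the optimizer, takes far longer than the single call itself.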