Let's take a look at what happens when we run this operator in a graph with the GPU configuration enabled:
>>> y = mult4plus5op(2 * x) + 4 * x
>>> f = theano.function([x], y)
>>> theano.printing.debugprint(f)
HostFromGpu(gpuarray) [id A] ''   6
 |GpuElemwise{Composite{(i0 + (i1 * i2))}}[(0, 0)]<gpuarray> [id B] ''   5
 | |GpuFromHost<None> [id C] ''   4
 | | |AXPBOp{a=4, b=5} [id D] ''   3
 | | | |HostFromGpu(gpuarray) [id E] ''   2
 | | | | |GpuElemwise{mul,no_inplace} [id F] ''   1
 | | | | | |GpuArrayConstant{[[ 2.]]} [id G]
 | | | | | |GpuFromHost<None> [id H] ''   0
 | | | | | | |<TensorType(float32, matrix)> [id I]
 | |GpuArrayConstant{[[ 4.]]} [id J]
 | |GpuFromHost<None> [id H] ''   0
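As a reference point for reading this graph, the computation performed by the `AXPBOp{a=4, b=5}` node is simply the elementwise `a * x + b`. Here is a minimal NumPy sketch of the same expression evaluated entirely on the CPU (the function name `axpb` and the input shape are illustrative assumptions, not part of the original code):

import numpy as np

def axpb(x, a=4, b=5):
    # Elementwise a*x + b -- the computation AXPBOp{a=4, b=5} performs on the CPU
    return a * x + b

# Mirrors the graph expression: mult4plus5op(2 * x) + 4 * x
x = np.ones((2, 2), dtype=np.float32)
y = axpb(2 * x) + 4 * x
# For x = 1: 4 * (2 * 1) + 5 + 4 * 1 = 17

With this in mind, the `GpuElemwise{Composite{(i0 + (i1 * i2))}}` node in the graph above corresponds to the final `+ 4 * x` part, which Theano keeps on the GPU, while only the `AXPBOp` part runs on the CPU.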
Since we have only defined a CPU implementation of the new operator in Python, while the rest of the graph runs on the GPU, the data is transferred from the GPU to the CPU and back in the middle of the graph to apply our new CPU operator: these transfers appear as the `HostFromGpu` and `GpuFromHost` nodes surrounding `AXPBOp` in the printed graph.
To avoid...