GPU-Accelerated Computing with Python 3 and CUDA
For this specific example, the compute metrics are high relative to the memory throughput and efficiency metrics. This indicates that performance is likely limited by compute throughput rather than by memory access. To verify this, we can calculate the value of nu explicitly inside the kernel as αΔt/h² instead of passing it in as a constant:
u_new[x, y] = tile_xy + (dt * alpha / (h * h)) * laplacian
Performing this redundant extra calculation, in particular the division, increases the runtime by 6%, which indicates a compute-bound problem. If the kernel were memory-bound, the extra arithmetic would be expected to have minimal impact on performance, since it would largely be hidden behind the time spent waiting on memory.
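Before reading anything into the timing difference, it helps to confirm that the two variants compute the same update, so that the extra runtime can be attributed to arithmetic alone. The following is a minimal pure-Python sketch that runs on the CPU, not the Numba kernel from this chapter. The function names `five_point_sum`, `step_precomputed`, and `step_inline` are hypothetical, and the laplacian is assumed to be the undivided five-point stencil sum:

```python
def five_point_sum(u, x, y):
    """Undivided five-point stencil at interior point (x, y) (assumed form)."""
    return (u[x - 1][y] + u[x + 1][y] + u[x][y - 1] + u[x][y + 1]
            - 4.0 * u[x][y])


def step_precomputed(u, nu):
    """One explicit update where nu is computed once by the caller."""
    u_new = [row[:] for row in u]  # boundary values are left unchanged
    for x in range(1, len(u) - 1):
        for y in range(1, len(u[0]) - 1):
            u_new[x][y] = u[x][y] + nu * five_point_sum(u, x, y)
    return u_new


def step_inline(u, alpha, dt, h):
    """Same update, but the coefficient is recomputed at every grid point."""
    u_new = [row[:] for row in u]
    for x in range(1, len(u) - 1):
        for y in range(1, len(u[0]) - 1):
            # Redundant multiply and divide per point, as in the experiment.
            u_new[x][y] = u[x][y] + (dt * alpha / (h * h)) * five_point_sum(u, x, y)
    return u_new


# Illustrative values only: a 5x5 grid with a single hot cell in the middle.
alpha, dt, h = 0.1, 0.01, 0.1
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0

same = step_precomputed(grid, dt * alpha / (h * h)) == step_inline(grid, alpha, dt, h)
```

Here `same` should be true, since both functions evaluate the same expression. Only the number of times the coefficient is evaluated differs, which is what the 6% experiment measures on the GPU.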
Accordingly, we can improve the kernel by reducing the use of conditional statements while loading the halo regions into shared memory. Branching can be expensive on a GPU: when threads in the same warp take different paths, the paths are executed one after the other (warp divergence), which adds latency. The following code shows the same implementation...