When the size of the data is not an exact multiple of the block size, the grid has to be rounded up so that it covers all of the data. Threads in the border blocks that fall outside the data have nothing to do and finish earlier than the other threads, and the kernel code has to be written to check for out-of-bounds memory accesses.
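The bounds check described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the kernel name `scale`, the array size of 1000, and the block size of 256 are all assumptions chosen so that the data does not divide evenly into blocks.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each element of `data` by `factor` in place.
__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {               // threads past the end of the array do nothing
        data[i] *= factor;
    }
}

int main()
{
    const int n = 1000;        // assumed size, not a multiple of blockSize
    const int blockSize = 256;
    // Round up so the grid covers all n elements: 4 blocks, 1024 threads.
    const int gridSize = (n + blockSize - 1) / blockSize;

    float* d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    scale<<<gridSize, blockSize>>>(d_data, n, 2.0f);

    // Check both the launch itself and the kernel execution.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

With these assumed sizes, the last block launches 24 threads whose index is 1000 or greater. Without the `if (i < n)` guard they would write past the end of the allocation.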
Parallel programming also brings new pitfalls to check for: race conditions, bank conflicts in shared memory, and per-thread data that does not fit in the available registers and spills to slower memory. Coalescing global memory accesses is by far the most critical aspect of achieving good performance. The NVIDIA® Nsight™ tool will help you develop, debug, and profile the code that executes on the CPU and the GPU.
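To make the point about coalescing concrete, the sketch below contrasts two copy kernels. The kernel names, the array size, and the stride of 32 are assumptions made for illustration. In the first kernel, neighbouring threads read and write neighbouring addresses, so the hardware can combine the accesses of a warp into a few memory transactions. In the second, neighbouring threads touch addresses `stride` elements apart, so the same warp spans many more memory segments. How large the difference is depends on the GPU, and a profiler is the way to measure it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Coalesced: thread i accesses element i, so a warp touches contiguous memory.
__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}

// Strided (hypothetical counter-example): thread i accesses element i * stride,
// so the accesses of one warp are scattered across memory.
__global__ void copyStrided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = i * stride;
    if (j < n) {
        out[j] = in[j];
    }
}

int main()
{
    const int n = 1 << 20;     // assumed array size
    const int blockSize = 256;
    const int gridSize = (n + blockSize - 1) / blockSize;

    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        cudaFree(d_in);
        return 1;
    }
    cudaMemset(d_in, 0, n * sizeof(float));

    copyCoalesced<<<gridSize, blockSize>>>(d_in, d_out, n);
    copyStrided<<<gridSize, blockSize>>>(d_in, d_out, n, 32);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Note that the strided kernel copies only every 32nd element, so the two kernels do not do the same amount of useful work. The example is meant to show the access pattern, not to serve as a benchmark.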