In this chapter we discuss a few optimization techniques and then illustrate some of them using a simple example: matrix multiplication. In a step-by-step process we combine multiple optimization strategies, one at a time, to obtain gradual performance improvements. The main advantage of matrix multiplication over many simpler algorithms is that its data-parallel workload is easy to understand, and that it demonstrates well the benefits of private memory, local memory, and vectors, as well as the problem of bank conflicts.
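To make the data-parallel structure concrete, the following is a minimal sketch in plain C, assuming square matrices stored in row-major order. The function name `matmul_seq` and its parameter layout are illustrative assumptions and are not taken from the listings used later in the chapter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: computes C = A * B for n x n row-major matrices.
 * Every output element c[i*n + j] depends only on row i of A and
 * column j of B, so all n*n elements could be computed independently. */
void matmul_seq(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < n; ++i) {          /* row of C */
        for (size_t j = 0; j < n; ++j) {      /* column of C */
            float sum = 0.0f;                 /* per-element accumulator */
            for (size_t k = 0; k < n; ++k)    /* dot product of row i and column j */
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

The two outer loops carry no dependencies between iterations, which is what makes it natural to assign one output element to one work-item. The accumulator `sum` hints at where private memory comes into play, and the repeated reads of rows of A and columns of B are the accesses that local memory can help with.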
We start this chapter with a discussion of various ways to find performance bottlenecks. First we discuss collecting event-based timing information using the clWaitForEvents API. Then we mention some available tools for performance analysis. After that we jump into the case study, starting from a sequential implementation for the CPU. We then gradually describe a naive OpenCL implementation on the Graphics Processing Unit (GPU), followed by...