Book Image

OpenCL Programming by Example

Book Image

OpenCL Programming by Example

Overview of this book

Research in parallel programming has been a mainstream topic for a decade, and will continue to be so for many decades to come. Many parallel programming standards and frameworks exist, but only take into account one type of hardware architecture. Today computing platforms come with many heterogeneous devices. OpenCL provides royalty free standard to program heterogeneous hardware. This guide offers you a compact coverage of all the major topics of OpenCL programming. It explains optimization techniques and strategies in-depth, using illustrative examples and also provides case studies from diverse fields. Beginners and advanced application developers will find this book very useful. Beginning with the discussion of the OpenCL models, this book explores their architectural view, programming interfaces and primitives. It slowly demystifies the process of identifying the data and task parallelism in diverse algorithms. It presents examples from different domains to show how the problems within different domains can be solved more efficiently using OpenCL. You will learn about parallel sorting, histogram generation, JPEG compression, linear and parabolic regression and k-nearest neighborhood, a clustering algorithm in pattern recognition. Following on from this, optimization strategies are explained with matrix multiplication examples. You will also learn how to do an interoperation of OpenGL and OpenCL. "OpenCL Programming by Example" explains OpenCL in the simplest possible language, which beginners will find it easy to understand. Developers and programmers from different domains who want to achieve acceleration for their applications will find this book very useful.
Table of Contents (18 chapters)
OpenCL Programming by Example
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

Different parallel programming techniques


There are several different forms of parallel computing such as bit-level, instruction level, data, and task parallelism. This book will largely focus on data and task parallelism using heterogeneous devices. We just now coined a term, heterogeneous devices. How do we tackle complex tasks "in parallel" using different types of computer architecture? Why do we need OpenCL when there are many (already defined) open standards for Parallel Computing?

To answer this question, let us discuss the pros and cons of different Parallel computing Framework.

OpenMP

OpenMP is an API that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran. It is prevalent only on a multi-core computer platform with a shared memory subsystem.

A basic OpenMP example implementation of the OpenMP Parallel directive is as follows:

#pragma omp parallel
{
  body;
}

When you build the preceding code using the OpenMP shared library, libgomp would expand to something similar to the following code:

void subfunction (void *data)
{
    use data;
    body;
}
     
setup data;
GOMP_parallel_start (subfunction, &data, num_threads);
subfunction (&data);
GOMP_parallel_end ();
void GOMP_parallel_start (void (*fn)(void *), void *data, unsigned num_threads)     

The OpenMP directives make things easy for the developer to modify the existing code to exploit the multicore architecture. OpenMP, though being a great parallel programming tool, does not support parallel execution on heterogeneous devices, and the use of a multicore architecture with shared memory subsystem does not make it cost effective.

MPI

Message Passing Interface (MPI) has an advantage over OpenMP, that it can run on either the shared or distributed memory architecture. Distributed memory computers are less expensive than large shared memory computers. But it has its own drawback with inherent programming and debugging challenges. One major disadvantage of MPI parallel framework is that the performance is limited by the communication network between the nodes.

Supercomputers have a massive number of processors which are interconnected using a high speed network connection or are in computer clusters, where computer processors are in close proximity to each other. In clusters, there is an expensive and dedicated data bus for data transfers across the computers. MPI is extensively used in most of these compute monsters called supercomputers.

OpenACC

The OpenACC Application Program Interface (API) describes a collection of compiler directives to specify loops and regions of code in standard C, C++, and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs, and accelerators. OpenACC is similar to OpenMP in terms of program annotation, but unlike OpenMP which can only be accelerated on CPUs, OpenACC programs can be accelerated on a GPU or on other accelerators also. OpenACC aims to overcome the drawbacks of OpenMP by making parallel programming possible across heterogeneous devices. OpenACC standard describes directives and APIs to accelerate the applications. The ease of programming and the ability to scale the existing codes to use the heterogeneous processor, warrantees a great future for OpenACC programming.

CUDA

Compute Unified Device Architecture (CUDA) is a parallel computing architecture developed by NVIDIA for graphics processing and GPU (General Purpose GPU) programming. There is a fairly good developer community following for the CUDA software framework. Unlike OpenCL, which is supported on GPUs by many vendors and even on many other devices such as IBM's Cell B.E. processor or TI's DSP processor and so on, CUDA is supported only for NVIDIA GPUs. Due to this lack of generalization, and focus on a very specific hardware platform from a single vendor, OpenCL is gaining traction.

CUDA or OpenCL?

CUDA is more proprietary and vendor specific but has its own advantages. It is easier to learn and start writing code in CUDA than in OpenCL, due to its simplicity. Optimization of CUDA is more deterministic across a platform, since less number of platforms are supported from a single vendor only. It has simplified few programming constructs and mechanisms. So for a quick start and if you are sure that you can stick to one device (GPU) from a single vendor that is NVIDIA, CUDA can be a good choice.

OpenCL on the other hand is supported for many hardware from several vendors and those hardware vary extensively even in their basic architecture, which created the requirement of understanding a little complicated concepts before starting OpenCL programming. Also, due to the support of a huge range of hardware, although an OpenCL program is portable, it may lose optimization when ported from one platform to another.

The kernel development where most of the effort goes, is practically identical between the two languages. So, one should not worry about which one to choose. Choose the language which is convenient. But remember your OpenCL application will be vendor agnostic. This book aims at attracting more developers to OpenCL.

There are many libraries which use OpenCL programming for acceleration. Some of them are MAGMA, clAMDBLAS, clAMDFFT, BOLT C++ Template library, and JACKET which accelerate MATLAB on GPUs. Besides this, there are C++ and Java bindings available for OpenCL also.

Once you've figured out how to write your important "kernels" it's trivial to port to either OpenCL or CUDA. A kernel is a computation code which is executed by an array of threads. CUDA also has a vast set of CUDA accelerated libraries, that is, CUBLAS, CUFFT, CUSPARSE, Thrust and so on. But it may not take a long time to port these libraries to OpenCL.

Renderscripts

Renderscripts is also an API specification which is targeted for 3D rendering and general purpose compute operations in an Android platform. Android apps can accelerate the performance by using these APIs. It is also a cross-platform solution. When an app is run, the scripts are compiled into a machine code of the device. This device can be a CPU, a GPU, or a DSP. The choice of which device to run it on is made at runtime. If a platform does not have a GPU, the code may fall back to the CPU. Only Android supports this API specification as of now. The execution model in Renderscripts is similar to that of OpenCL.

Hybrid parallel computing model

Parallel programming models have their own advantages and disadvantages. With the advent of many different types of computer architectures, there is a need to use multiple programming models to achieve high performance. For example, one may want to use MPI as the message passing framework, and then at each node level one might want to use, OpenCL, CUDA, OpenMP, or OpenACC.

Besides all the above programming models many compilers such as Intel ICC, GCC, and Open64 provide auto parallelization options, which makes the programmers job easy and exploit the underlying hardware architecture without the need of knowing any parallel computing framework. Compilers are known to be good at providing instruction-level parallelism. But tackling data level or task level auto parallelism has its own limitations and complexities.