OpenCL Programming by Example
Overview of this book

Research in parallel programming has been a mainstream topic for a decade, and will continue to be so for many decades to come. Many parallel programming standards and frameworks exist, but most take into account only one type of hardware architecture, whereas today's computing platforms come with many heterogeneous devices. OpenCL provides a royalty-free standard to program heterogeneous hardware. This guide offers compact coverage of all the major topics of OpenCL programming. It explains optimization techniques and strategies in depth, using illustrative examples, and also provides case studies from diverse fields. Beginners and advanced application developers alike will find this book useful.

Beginning with a discussion of the OpenCL models, this book explores their architectural view, programming interfaces, and primitives. It gradually demystifies the process of identifying the data and task parallelism in diverse algorithms. It presents examples from different domains to show how problems within those domains can be solved more efficiently using OpenCL. You will learn about parallel sorting, histogram generation, JPEG compression, linear and parabolic regression, and k-nearest neighbor, a classification algorithm used in pattern recognition. Following on from this, optimization strategies are explained with matrix multiplication examples. You will also learn how to make OpenCL and OpenGL interoperate. "OpenCL Programming by Example" explains OpenCL in the simplest possible language, which beginners will find easy to understand. Developers and programmers from different domains who want to accelerate their applications will find this book very useful.

Advances in computer architecture


Throughout the 20th century, computer architectures advanced many fold. The trend is continuing in the 21st century and will persist for a long time to come. Many of these architectural trends follow Moore's Law: "Moore's law is the observation that, over the history of computing hardware, the number of transistors on integrated circuits doubles approximately every two years." Many devices in the computer industry are linked to Moore's law, whether they are DSPs, memory devices, or digital cameras. All these hardware advances would be of no use without corresponding software advances. Algorithms and software applications grow in complexity as more and more user interaction comes into play.

An algorithm can be highly sequential, or it may be parallelized using a parallel computing framework. Amdahl's Law predicts the speedup that can be obtained for an algorithm when it is executed by n threads. This speedup depends on B, the fraction of the code that is strictly serial (non-parallelizable). The time T(n) an algorithm takes to finish when executed on n thread(s) of execution corresponds to:

T(n) = T(1) * (B + (1 - B)/n)

Therefore, the theoretical speedup that can be obtained for a given algorithm is given by:

Speedup(n) = 1/(B + (1-B)/n)
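
As a quick sanity check, Amdahl's formula can be written as a one-line C function (a minimal sketch; the function name amdahl_speedup is ours, chosen for this illustration):

/* Amdahl's Law: theoretical speedup on n threads, given the
   strictly serial (non-parallelizable) fraction B of the code. */
double amdahl_speedup(double B, int n)
{
    return 1.0 / (B + (1.0 - B) / n);
}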

Amdahl's Law has a limitation: because it assumes a fixed problem size, it does not fully exploit the computing power that becomes available as the number of processing cores increases.

Gustafson's Law takes into account the scaling of the platform as more processing elements are added. It assumes that the total amount of work that can be done in parallel varies linearly with the number of processing elements. Let an algorithm's single-processor execution time be decomposed as (a + b), where a is the serial execution time and b is the parallel execution time. The corresponding speedup for P processing elements is then given by:

Speedup(P) = (a + P*b) / (a + b)

Defining α = a/(a + b) as the sequential fraction of execution gives the speedup for P processing elements:

Speedup(P) = P - α*(P - 1)
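
Gustafson's formula translates just as directly into C (again a sketch; gustafson_speedup is a name of our choosing):

/* Gustafson's Law: scaled speedup on P processing elements, given
   the sequential fraction alpha = a / (a + b). */
double gustafson_speedup(double alpha, int P)
{
    return P - alpha * (P - 1);
}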

Given a problem that can be solved using OpenCL, the same problem can also be solved on different hardware with different capabilities. Gustafson's Law suggests that as more compute units are added, the data set should also grow, that is, "fixed work per processor". Amdahl's Law, in contrast, gives the speedup that can be obtained for the existing data set when more compute units are added, that is, "fixed work for all processors". Let's take the following example:

Let the serial and parallel components of execution be one unit each.

In Amdahl's Law, the strictly serial fraction of the code is B = 1/(1 + 1) = 0.5. For two processors, the speedup is given by:

Speedup(2) = 1 / (0.5 + (1 - 0.5)/2) = 1.33

Similarly for four and eight processors, the speedup is given by:

Speedup(4) = 1.6 and Speedup(8) = 1.78

Adding more processors does not help much; even as n tends to infinity, the maximum speedup obtainable is only 1/B = 2. On the other hand, in Gustafson's Law, α = 1/(1 + 1) = 0.5 (which is again the serial fraction of the code). The speedup for two processors is given by:

Speedup(2) = 2 - 0.5*(2 - 1) = 1.5

Similarly for four and eight processors, the speedup is given by:

Speedup(4) = 2.5 and Speedup(8) = 4.5
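
The two helper functions sketched earlier reproduce all of these numbers. The following small C program (assuming the amdahl_speedup and gustafson_speedup definitions above are in the same file) prints the speedups side by side:

#include <stdio.h>

int main(void)
{
    const double B = 0.5;     /* strictly serial fraction (Amdahl)  */
    const double alpha = 0.5; /* sequential fraction (Gustafson)    */
    int n;

    for (n = 2; n <= 8; n *= 2)
        printf("n = %d: Amdahl = %.2f, Gustafson = %.2f\n",
               n, amdahl_speedup(B, n), gustafson_speedup(alpha, n));
    return 0;
}

This prints 1.33/1.50, 1.60/2.50, and 1.78/4.50 for 2, 4, and 8 processors, matching the values computed by hand.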

The following figure shows the workload scaling of Gustafson's Law compared to Amdahl's Law with a constant workload:

Comparison of Amdahl's and Gustafson's Law

OpenCL is all about parallel programming, and Gustafson's Law fits this book well, as we will be using OpenCL for data-parallel applications. Workloads that are data parallel in nature can easily grow their data set and take advantage of scalable platforms by adding more compute units. For example, more pixels can be processed as more compute units are added.