Python Parallel Programming Cookbook

By: Giancarlo Zaccone

Overview of this book

This book teaches parallel programming techniques through examples in Python and helps you explore the many ways to write code that lets more than one process run at once. It starts by introducing the world of parallel computing and then covers the fundamentals in Python. Next, it explores the thread-based parallelism model using the Python threading module, covering thread synchronization with locks, mutexes, semaphores, and queues, as well as the GIL and the thread pool. You will then learn about process-based parallelism, where you will synchronize processes using message passing and learn about the performance of the MPI Python modules. After that, you will learn the asynchronous parallel programming model using the Python asyncio module, along with exception handling. Moving on, you will discover distributed computing with Python and learn how to install a broker, use the Celery Python module, and create a worker. You will also come to understand PyCSP, the SCOOP framework, and the Disco module in Python. Finally, you will learn GPU programming with Python using the PyCUDA module and evaluate its performance limitations.

How to design a parallel program

Designing an algorithm that exploits parallelism involves a series of operations, all of which must be carried out for the program to do its job correctly, without producing partial or erroneous results. The macro operations required to parallelize an algorithm correctly are:

Task decomposition
Task assignment
Agglomeration
Mapping

Task decomposition

In this first phase, the software program is split into tasks, or sets of instructions, that can then be executed on different processors to implement parallelism. Two methods are used for this subdivision:

Domain decomposition: Here, the data of the problem is decomposed; every processor runs the same application, each on a different portion of the data. This methodology is used when we have a large amount of data that must be processed.
Functional decomposition: In this case, the problem is split into tasks, where each task performs a particular operation on all the available data.
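
The two methods can be contrasted in a minimal sketch. The function names (`square_chunk`, `domain_decomposition`, `functional_decomposition`) and the choice of operations are illustrative assumptions, not taken from the book. Threads from `concurrent.futures` are used only to show the structure; for CPU-bound work in Python, processes would usually be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor

def square_chunk(chunk):
    # The same operation, applied to one portion of the data.
    return [x * x for x in chunk]

def domain_decomposition(data, n_workers):
    # Split the data into roughly equal slices; every worker runs
    # the same function on its own slice.
    size = max(1, -(-len(data) // n_workers))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(square_chunk, chunks))
    return [y for part in parts for y in part]

def functional_decomposition(data):
    # Split the work by operation: each task runs a different
    # function over all of the data.
    operations = {"total": sum, "smallest": min, "largest": max}
    with ThreadPoolExecutor(max_workers=len(operations)) as pool:
        futures = {name: pool.submit(op, data) for name, op in operations.items()}
        return {name: f.result() for name, f in futures.items()}
```

In the first function the data is what gets divided; in the second, the data stays whole and the operations are what get divided.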

Task assignment

In this step, the mechanism by which the tasks will be distributed among the various processes is specified. This phase is very important because it establishes the distribution of the workload among the various processors. Load balance is crucial here; all processors must work continuously, without sitting idle for long periods. To achieve this, the programmer takes into account the possible heterogeneity of the system and tries to assign more tasks to better-performing processors. Finally, for greater efficiency of parallelization, it is necessary to limit communication between processors as much as possible, as it is often a source of slowdowns and resource consumption.
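
One simple way to account for heterogeneity is to give each processor a share of the tasks proportional to its relative speed. The sketch below assumes equal-sized tasks and positive integer speed ratings; the function name `assign_by_speed` and the rating scheme are hypothetical, chosen only for illustration.

```python
def assign_by_speed(n_tasks, speeds):
    """Return how many equal-sized tasks each processor receives."""
    total_speed = sum(speeds)
    # Start from the proportional share, rounded down.
    counts = [n_tasks * s // total_speed for s in speeds]
    # Hand out the tasks lost to rounding, fastest processors first.
    leftover = n_tasks - sum(counts)
    fastest_first = sorted(range(len(speeds)), key=lambda i: speeds[i], reverse=True)
    for i in fastest_first[:leftover]:
        counts[i] += 1
    return counts

# Four processors; the last one is rated four times faster than the first.
print(assign_by_speed(100, [1, 1, 2, 4]))  # → [12, 12, 25, 51]
```

With this split, every processor finishes at roughly the same time, which is the load-balance goal described above. The sketch ignores communication costs entirely.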

Agglomeration

Agglomeration is the process of combining smaller tasks into larger ones in order to improve performance. If the previous two stages of the design process partitioned the problem into a number of tasks that greatly exceeds the number of processors available, and if the computer is not specifically designed to handle a huge number of small tasks (some architectures, such as GPUs, handle this fine and indeed benefit from running millions or even billions of tasks), then the design can turn out to be highly inefficient. Commonly, this is because each task has to be communicated to the processor or thread that will compute it. Most communication has a cost that is not only proportional to the amount of data transferred, but also includes a fixed cost for every communication operation (such as the latency inherent in setting up a TCP connection). If the tasks are too small, this fixed cost can easily make the design inefficient.
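
The effect of the fixed per-message cost can be illustrated with a small cost model. The latency, bandwidth, and payload figures below are made-up assumptions chosen only to make the arithmetic visible, and the helper names `agglomerate` and `communication_cost` are hypothetical.

```python
def agglomerate(tasks, batch_size):
    # Combine consecutive small tasks into batches that are sent together.
    return [tasks[i:i + batch_size] for i in range(0, len(tasks), batch_size)]

def communication_cost(n_messages, bytes_total, latency, bandwidth):
    # Fixed cost per message plus a cost proportional to the data volume.
    return n_messages * latency + bytes_total / bandwidth

tasks = list(range(10_000))   # 10,000 tiny tasks
task_bytes = 8                # assumed payload per task, in bytes
latency = 1e-4                # assumed fixed cost per message, in seconds
bandwidth = 1e8               # assumed transfer rate, in bytes per second

total_bytes = len(tasks) * task_bytes
one_by_one = communication_cost(len(tasks), total_bytes, latency, bandwidth)
batches = agglomerate(tasks, batch_size=500)
batched = communication_cost(len(batches), total_bytes, latency, bandwidth)

print(f"one message per task: {one_by_one:.4f} s")
print(f"{len(batches)} batched messages: {batched:.4f} s")
```

Under these assumptions the data volume is identical in both cases; only the number of messages changes, so the whole difference comes from the fixed cost. The standard library's `multiprocessing.Pool.map` exposes a comparable knob through its `chunksize` argument.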

Mapping

In the mapping stage of the parallel algorithm design process, we specify where each task is to be executed. The goal is to minimize the total execution time. Here, you must often make tradeoffs, as the two main strategies often conflict with each other:

Tasks that communicate frequently should be placed on the same processor to increase locality
Tasks that can be executed concurrently should be placed on different processors to enhance concurrency

This is known as the mapping problem, and it is known to be NP-complete. As such, no polynomial-time solution to the problem is known for the general case. For tasks of equal size and tasks with easily identified communication patterns, the mapping is straightforward (we can also perform agglomeration here to combine tasks that map to the same processor). However, if the tasks have communication patterns that are hard to predict or the amount of work varies per task, it is hard to design an efficient mapping and agglomeration scheme. For these types of problems, load balancing algorithms can be used to identify agglomeration and mapping strategies during runtime. The hardest problems are those in which the amount of communication or the number of tasks changes during the execution of the program. For these kinds of problems, dynamic load balancing algorithms, which run periodically during the execution, can be used.
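
Because an exact solution is impractical in general, heuristics are common. The sketch below shows one widely used greedy heuristic (heaviest task first, always onto the least-loaded processor). It is not taken from the book, it assumes task costs are known in advance, and it deliberately ignores communication between tasks, so it addresses only the concurrency side of the tradeoff. The name `greedy_mapping` is hypothetical.

```python
import heapq

def greedy_mapping(task_costs, n_processors):
    """Map each task index to a processor, heaviest tasks first."""
    # Heap of (current load, processor id); the least-loaded is on top.
    loads = [(0, p) for p in range(n_processors)]
    heapq.heapify(loads)
    mapping = {}
    by_cost = sorted(range(len(task_costs)), key=lambda t: task_costs[t], reverse=True)
    for task in by_cost:
        load, proc = heapq.heappop(loads)
        mapping[task] = proc
        heapq.heappush(loads, (load + task_costs[task], proc))
    # The finishing time is set by the most heavily loaded processor.
    makespan = max(load for load, _ in loads)
    return mapping, makespan

mapping, makespan = greedy_mapping([7, 5, 4, 3, 3, 2], n_processors=2)
print(mapping, makespan)
```

For this input the two processors end up with equal loads of 12 each, which happens to be optimal; in general the heuristic gives a good but not necessarily optimal mapping.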

Dynamic mapping

There exist many load balancing algorithms for various problems, both global and local. Global algorithms require global knowledge of the computation being performed, which often adds a lot of overhead. Local algorithms rely only on information that is local to the task in question, which reduces overhead compared to global algorithms, but they are usually worse at finding an optimal agglomeration and mapping. However, the reduced overhead may reduce the execution time even though the mapping itself is worse. If the tasks rarely communicate other than at the start and end of the execution, a task-scheduling algorithm is often used that simply maps tasks to processors as they become idle. In a task-scheduling algorithm, a task pool is maintained. Tasks are placed in this pool and are taken from it by workers.
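
A task pool of this kind can be sketched with the standard `queue` and `threading` modules. This is a minimal illustration, assuming all tasks are known before the workers start and that tasks do not create further tasks; `run_task_pool` and its parameters are hypothetical names.

```python
import queue
import threading

def run_task_pool(tasks, n_workers, work):
    # Place every task in the shared pool.
    pool = queue.Queue()
    for task in tasks:
        pool.put(task)
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                task = pool.get_nowait()  # take the next task, if any
            except queue.Empty:
                return                     # pool is drained: worker stops
            outcome = work(task)
            with results_lock:
                results.append((task, outcome))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A call such as `run_task_pool(range(20), 4, lambda x: x * x)` returns one `(task, result)` pair per task. The order of the pairs depends on which worker happened to become idle first, which is exactly the "map tasks to processors as they become idle" behavior.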

There are three common approaches in this model, which are explained next.

Manager/worker

This is the basic dynamic mapping scheme, in which all the workers connect to a centralized manager. The manager repeatedly sends tasks to the workers and collects the results. This strategy is probably the best for a relatively small number of processors. The basic strategy can be improved by fetching tasks in advance so that communication and computation overlap.
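
The scheme can be sketched with threads standing in for processors. In this illustration the manager is the calling thread, and the bounded inbox approximates fetching tasks in advance: up to `n_workers * prefetch` tasks wait ready for the workers at any time. All names here, including the `prefetch` parameter, are assumptions made for the sketch, and tasks are assumed to be hashable so they can key the result dictionary.

```python
import queue
import threading

def manager_worker(tasks, n_workers, work, prefetch=2):
    inbox = queue.Queue(maxsize=n_workers * prefetch)  # tasks sent ahead of need
    outbox = queue.Queue()                             # results for the manager
    stop = object()                                    # unique stop signal

    def worker():
        while True:
            task = inbox.get()
            if task is stop:
                return
            outbox.put((task, work(task)))

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()

    # The manager sends every task, then one stop signal per worker.
    tasks = list(tasks)
    for task in tasks:
        inbox.put(task)
    for _ in workers:
        inbox.put(stop)

    # The manager collects exactly one result per task.
    results = dict(outbox.get() for _ in tasks)
    for w in workers:
        w.join()
    return results
```

Every task and every result passes through the single manager, which is why the scheme suits a relatively small number of processors.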

Hierarchical manager/worker

This is a variant of manager/worker that has a semi-distributed layout; workers are split into groups, each with its own manager. These group managers communicate with the central manager (and possibly among themselves as well), while workers request tasks from the group managers. This spreads the load among several managers and can, as such, handle a larger number of processors than if all workers requested tasks from the same manager.
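
The two-level layout can be sketched as follows. This is a deliberate simplification: the central manager hands each group manager its whole batch once, instead of group managers requesting more work on demand, and thread pools stand in for groups of processors. The names `group_manager` and `central_manager` are hypothetical, and tasks are assumed to be hashable.

```python
from concurrent.futures import ThreadPoolExecutor

def group_manager(batch, workers_per_group, work):
    # A group manager distributes its batch among its own workers.
    with ThreadPoolExecutor(max_workers=workers_per_group) as group:
        return list(group.map(work, batch))

def central_manager(tasks, n_groups, workers_per_group, work):
    # The central manager deals tasks out to the groups round-robin.
    tasks = list(tasks)
    batches = [tasks[g::n_groups] for g in range(n_groups)]
    with ThreadPoolExecutor(max_workers=n_groups) as managers:
        futures = [
            managers.submit(group_manager, batch, workers_per_group, work)
            for batch in batches
        ]
        partial = [f.result() for f in futures]
    # Pair each task with its result, group by group.
    return {
        task: result
        for batch, results in zip(batches, partial)
        for task, result in zip(batch, results)
    }
```

The central manager talks only to `n_groups` group managers, never to the individual workers, which is what keeps its load low as the number of workers grows.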

Decentralize

In this scheme, everything is decentralized. Each processor maintains its own task pool and communicates with the other processors in order to request tasks. How a processor chooses which other processors to request tasks from varies and is determined on the basis of the problem.
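
A decentralized scheme can be sketched with one task pool per worker. In this illustration a worker whose own pool is empty takes a task from a randomly chosen worker that still has some; picking at random is only one possible policy and is an assumption of this sketch, as are the name `work_stealing` and the premise that tasks never create new tasks.

```python
import random
import threading
from collections import deque

def work_stealing(task_lists, work):
    pools = [deque(tasks) for tasks in task_lists]  # one task pool per worker
    done = [[] for _ in pools]                      # results, kept per worker

    def worker(me):
        others = [i for i in range(len(pools)) if i != me]
        while True:
            try:
                task = pools[me].pop()               # prefer local work
            except IndexError:
                victims = [i for i in others if pools[i]]
                if not victims:
                    return                            # nothing left anywhere
                try:
                    # Take a task from the opposite end of another pool.
                    task = pools[random.choice(victims)].popleft()
                except IndexError:
                    continue                          # it was emptied meanwhile
            done[me].append((task, work(task)))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(pools))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Calling `work_stealing([list(range(30)), [], []], lambda x: x * x)` starts with all the work on one worker; each task is still executed exactly once, and how the tasks end up split among the three workers depends on thread timing.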
