Distributed Computing with Python

Overview of this book

CPU-intensive data processing tasks have become crucial given the complexity of today's big data applications. Reducing the CPU load carried by each process is key to improving the overall speed of an application. This book teaches you how to execute computations in parallel by distributing them across the multiple processors of a single machine, thereby improving the overall performance of a big data processing task. We will cover synchronous and asynchronous models, shared memory and file systems, communication between processes, synchronization, and more.

Multiple processes


Traditionally, the way Python programmers have worked around the GIL and its effect on CPU-bound threads has been to use multiple processes instead of multiple threads. This approach (multiprocessing) has some disadvantages, which mostly boil down to having to launch multiple instances of the Python interpreter with all the startup time and memory usage penalties that this implies.
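To make the trade-off concrete, here is a minimal sketch (not taken from the book) that farms a pure-Python, CPU-bound function out to a pool of worker processes with multiprocessing.Pool; the function name and the workload sizes are illustrative assumptions:

import multiprocessing as mp
import time


def busy_work(n):
    # Pure-Python CPU-bound loop; with threads this would serialize on the GIL.
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    tasks = [5_000_000] * 4  # four independent chunks of work (sizes are arbitrary)

    start = time.time()
    with mp.Pool(processes=4) as pool:
        results = pool.map(busy_work, tasks)
    print(f"4 worker processes took {time.time() - start:.2f}s")

Each call to busy_work runs in its own interpreter process, so the four chunks can execute on separate cores; the price is the cost of starting those interpreters and of pickling arguments and results between them.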

At the same time, however, using multiple processes to execute tasks in parallel has some nice properties. Multiple processes have their own memory space and implement a share-nothing architecture, making it easy to reason about data-access patterns. They also allow us to (more) easily transition from a single-machine architecture to a distributed application, where one would have to use multiple processes (on different machines) anyway.
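The share-nothing property is easy to demonstrate. The following short sketch (again, an illustrative example rather than the book's own code) shows that a module-level variable incremented in a child process is left untouched in the parent, because each process works on its own copy of the data:

import multiprocessing as mp

counter = 0  # state owned by the parent process


def increment():
    # Runs in the child process, which has its own copy of the module state.
    global counter
    counter += 1
    print(f"child sees counter = {counter}")   # prints 1


if __name__ == "__main__":
    p = mp.Process(target=increment)
    p.start()
    p.join()
    print(f"parent sees counter = {counter}")  # still 0: nothing is shared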

There are two main modules in the Python Standard Library that we can use to implement process-based parallelism, and both of them are truly excellent. One is...