Book Image

Concurrent Patterns and Best Practices

By : Atul S. Khot
Book Image

Concurrent Patterns and Best Practices

By: Atul S. Khot

Overview of this book

Selecting the correct concurrency architecture has a significant impact on the design and performance of your applications. Concurrent design patterns help you understand the different characteristics of parallel architecture to make your code faster and more efficient. This book will help Java developers take a hands-on approach to building scalable and distributed apps by following step-by-step explanations of essential concepts and practical examples. You’ll begin with basic concurrency concepts and delve into the patterns used for explicit locking, lock-free programming, futures, and actors. You’ll explore coding with multithreading design patterns, including master, slave, leader, follower, and map-reduce, and then move on to solve problems using synchronizer patterns. You'll even discover the rationale for these patterns in distributed and parallel applications, and understand how future composition, immutability, and the monadic flow help you create more robust code. By the end of the book, you’ll be able to use concurrent design patterns to build high performance applications confidently.
Table of Contents (14 chapters)

The message passing model


Before we dive into the details of the message passing model, we will look at a bit of basic terminology. 

When an executable program runs, it is a process. The shell looks up the executable, talks to the operating system (OS) using system calls, and thereby creates a child process. The OS also allocates memory and resources, such as file descriptors. So, for example, when you run the find command (the executable lives at /usr/bin/find), it becomes a child process whose parent process is the shell, as shown in the following diagram:

In case you don't have the pstree command, you could try the ptree command instead. The ps --forest command will also work to show you the tree of processes. 

Here is a UNIX shell command recursively searching a directory tree for HTML files containing a word:

 % find . -type f -name '*.html' | xargs egrep -w Mon /dev/null
./Untitled.html:Mon Jun 5 10:23:38 IST 2017
./Untitled.html:Mon Jun 5 10:23:38 IST 2017
./Untitled.html:Mon Jun 5 10:23:38 IST 2017
./Untitled.html:Mon Jun 5 10:23:38 IST 2017

We see a shell pipeline in action here. The find command searches the directory tree rooted at the current directory. It searches for all files with the .html extension and outputs the filenames to standard output. The shell creates a process from thefindcommand and another process for thexargscommand. An active (running) program is called a process.The shell also arranges the output of thefindcommand to go to the input of thexargscommand via a pipe.

The find process is a producer here. The list of files it produces is consumed by the xargs command.  xargs collects a bunch of filenames and invokes egrep on them. Lastly, the output appears in the console. It is important to note that both the processes are running concurrently, as shown in the following diagram: 

Both these processes are collaborating with each other, so our goal of recursively searching the directory tree is achieved. One process is producing the filenames. The other is searching these files. As these are running in parallel, we start getting the results as soon as there are some qualifying filenames. We start getting results faster, which means the system is responsive.

Note

Quiz: What would happen if both these processes ran one after another? How would the system arrange that the result of the find command is communicated to the xargs command?      

Just as in real life, collaboration needs communication. The pipe is the mechanism that enables the find process to communicate with the xargs process. The pipe acts both as a coordinator and as a communication mechanism. 

Coordination and communication

We need to make sure that when the find process has nothing more to report, which means that it has found all the qualifying filenames, egrep should also stop!

Similarly, if any of the processes in the pipeline quits for any reason, then the entire pipeline should stop.  

For example, here is a pipeline that computes the factorial of 1,000:

% seq 1000 | paste -s -d '*' | bc
40238726007709377354370243392300398571937486421071463254379991042993\
85123986290205920442084869694048004799886101971960586316668729948085\
.... rest of the output truncated 

The pipeline has three filters: seq, paste, and bc. The seq command just prints numbers from 1 to 1,000 and puts them in the console. The shell arranges things so that the output gets fed into the pipe that is consumed by the paste filter. 

The paste filter now joins all the lines with the * delimiter. It just does that little bit, and outputs the line to standard output, as shown in the following screenshot:

The paste command writes to the console; the shell has arranged the output to go into a pipe again. At the other end, the consumer is bcThe bc command or filter is capable of arbitrary precision arithmetic; in simpler terms, it can perform very large computations.

When the seq command exits normally, this triggers an EOF (end of file) on the pipe. This tells paste that the input stream has nothing more to read, so it does the joining, writes the output on the console (which is going to a pipe really), and quits in turn. 

This quitting results in an EOF for the bc process, so it computes the multiplication, prints the result to the standard output, which is really a console, and finally quits. This is an ordered shutdown; no more work needs to be done, so exit and relinquish the computing resources for other concurrent processes, if there are any. The melodramatic term for this marker is poison pill. See https://dzone.com/articles/producers-and-consumers-part-3 for more information.

At this point, the pipeline processing is done and we get back to the shell prompt again, as shown in the following diagram:             

Unbeknownst to all the filters participating in the pipeline, the parent shell has arranged for this coordination. This ability of the framework to be composed of smaller parts without the parts themselves being aware of the composition is a great design pattern, called pipes and filters. We will see how composition is one central theme, yielding robust concurrent programs.

What happens when the seq process produces numbers way too fast? Would the consumer (paste in this case) get overwhelmed? Aha, no! The pipeline also has an implicit flow control built into it. This is yet another central theme, called back-pressure, where the faster producer (or consumer) is forced to wait so the slower filter catches up. 

Let's next look at this flow control mechanism.    

Flow control    

The wonderful idea behind the previously mentioned pipeline is that the find producer and the xargs consumer don't know each other. That is, you could compose any filters using pipes. This is the celebrated pipes and filters design pattern in action. The shell command line gives you a framework that enables you to compose any filters together into a pipeline. 

What does it give us? You can reuse the same filter in unexpected and creative ways to get your work done. Each filter just needs to follow a simple protocol of accepting input on file descriptor 0, writing output to file descriptor 1, and writing errors to descriptor 2.

You can refer to a UNIX shell programming guide for more information on descriptors and related ideas. My personal favorite is UNIX Power Tools, 3rd Ed. by Jerry Peek et al.           

Flow control means we are trying to regulate the flow of something. When you tell someone to talk slowly so that you can follow their meaning, you are trying to control the flow of words.

Flow control is essential in ensuring that the producer (such as a fast speaker) does not overwhelm the consumer (such as a listener). In the example we have been working on, the find process could produce filenames faster; the egrep process might need more time to process each file. The find producer works at its own pace, and does not care about a slow consumer.

If the pipe gets full because of the slower consumption by xargs, the output call by find is blocked; that is, the process is waiting, and so it can't run. This pauses find until the consumer has finally found the time to consume some filenames and the pipe has some free space. It works the other way around as well. A fast consumer blocks an empty pipe. Blocking is a process-level mechanism, andfind (or any other filter) does not know it is blocking or unblocking.

The moment a process starts running, it will perform its computation for the find filter, ferret out some filenames, and output these to the console. Here is a simplified state diagram , showing a process's life cycle:

What is this scheduled state? As mentioned, a running process could get blocked waiting for some I/O to happen, and thus it cannot use the CPU. So it is put on the back burner for a while, and other processes, waiting their turn, are given a chance to run. Drawing a parallel with the previously mentioned receptionist scenario, the receptionist can ask us to be seated and wait a while, and then move on to the next guest in the queue.

The other idea is that the process has run its allocated slice of time, so other processes should now get a chance, too. In this case, even though the process can run and utilize the CPU, it is moved back to the scheduled state, and can run again once other processes have used their run slices. This is preemptive multitasking we have here, which makes it a fair world to live in! Processes need to run so that useful work can happen. Preemptive scheduling is an idea to help each process get a slice of CPU time.

However, there is another notion that could throw a spanner into this scheme of things. A process with a higher priority is given preference over lower priority processes.

A real-world example should help make this clear. While driving on roads, when we see an ambulance or a police car with a screaming siren, we are required to make way for them. Similarly, a process executing a piece of business logic may need more priority than the data backup process.

Divide and conquer

GNU parallel  (https://www.gnu.org/software/parallel/) is a tool for executing commands in parallel on one or more nodes. The following diagram shows a simple run where we generate 10 text files and zip them (using the gzip command) in parallel. All the available cores are used to run gzip , thereby reducing the overall processing time:

The core principle at work is divide and conquer. We see the same principle again and again: a parallelizable job is split into pieces, each of which is processed in parallel (thereby overlapping processing and reducing the time). The parallel command also allows you to distribute long-running jobs on different nodes (machines), thereby allowing you to harness the idle (possibly unused) cores to process jobs quickly.

The concept of state

The communication depicted in the preceding section could be looked at as message passing; find is passing on the filename as a message to the egrep process, or seq is passing messages (numbers) to the paste process. Generally speaking, a producer is sending messages to the consumer for consuming, as shown in the following diagram:

As shown in the preceding diagram, each process has its own state by design, and this state is hidden from other processes. The processes communicate with explicit messaging channels, in the same way that a pipe directs the flow of water. 

This notion of state is very important to understand the various upcoming concurrency patterns. We could look at the state as data in a certain stage of processing. For example, the paste process could be using program counters to generate the numbers. It could also be writing the numbers to the standard output (file descriptor 1; by default, the console). At the same time, the paste process is processing its input and writing data to its standard output. Both processes do not care about each other's state; in fact, they don't even know anything about the other process.

The real world is full of encapsulated states. The following diagram shows an example:

 

It defeats common sense to share the state (the need to buy milk) with the postal department employee. It is useless for him to know it, and it could create confusion.

Likewise, the employee will be going about his daily tasks and has a state of his own. Why do we, as consumers, need to know the internal details (stateof how he is going to manage his work (dispatch this big stack of letters)? The world is concurrent, and the various entities in it also hide unnecessary details from each other to avoid confusion. If we don't hide the internal details (that is, the state), it would create havoc. 

We could ask whether there is a global shared memory. If there is, then we could use it as a message channel. Using a shared data structure of our choice, the producer could put the data in it for subsequent consumption; that is, the memory is used as a channel of communication.