When an executable program runs, it is a process. The shell looks up the executable, talks to the operating system (OS) using system calls, and thereby creates a child process. The OS also allocates memory and resources, such as file descriptors. So, for example, when you run the
find command (the executable lives at
/usr/bin/find), it becomes a child process whose parent process is the shell, as shown in the following diagram:
In case you don't have the
pstree command, you could try the
ptree command instead. The
ps --forest command will also work to show you the tree of processes.
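You can also verify the parent/child relationship directly. The following sketch starts a child process in the background and uses the POSIX ps options to show that the child's parent PID (PPID) is the shell's own PID:

```shell
# Start a child process from this shell; the shell becomes its parent.
sleep 30 &
child=$!

# POSIX ps can report the child's parent PID; it matches this shell's PID.
parent=$(ps -o ppid= -p "$child" | tr -d ' ')
echo "shell: $$, child: $child, child's parent: $parent"

# Clean up the background child.
kill "$child"
```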
Here is a UNIX shell command recursively searching a directory tree for HTML files containing a word:
% find . -type f -name '*.html' | xargs egrep -w Mon /dev/null
./Untitled.html:Mon Jun 5 10:23:38 IST 2017
./Untitled.html:Mon Jun 5 10:23:38 IST 2017
./Untitled.html:Mon Jun 5 10:23:38 IST 2017
./Untitled.html:Mon Jun 5 10:23:38 IST 2017
We see a shell pipeline in action here. The
find command searches the directory tree rooted at the
current directory. It searches for all files with the
.html extension and outputs the filenames to standard output. The shell creates a process from the
find command and another process for the
xargs command. The shell also arranges for the output of the
find command to go to the input of the
xargs command via a pipe.
The find process is the producer here. The list of files it produces is consumed by the
xargs process. xargs collects a bunch of filenames and invokes
egrep on them. Lastly, the output appears in the console. It is important to note that both processes are running concurrently, as shown in the following diagram:
Both these processes are collaborating with each other, so our goal of recursively searching the directory tree is achieved. One process is producing the filenames. The other is searching these files. As these are running in parallel, we start getting the results as soon as there are some qualifying filenames. We start getting results faster, which means the system is responsive.
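This streaming behavior is easy to observe with a small sketch (the producer function here is hypothetical, standing in for a slow find):

```shell
# A slow producer: emits one line per second. The consumer timestamps
# each line as it arrives; the differing timestamps show the consumer
# runs while the producer is still working, not after it finishes.
producer() {
  for i in 1 2 3; do
    echo "item $i"
    sleep 1
  done
}

producer | while read -r line; do
  echo "$(date +%s) received: $line"
done
```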
Quiz: What would happen if both these processes ran one after another? How would the system arrange for the result of the
find command to be communicated to the xargs command?
Just as in real life, collaboration needs communication. The pipe is the mechanism that enables the
find process to communicate with the
xargs process. The pipe acts both as a coordinator and as a communication mechanism.
Similarly, if any of the processes in the pipeline quits for any reason, then the entire pipeline should stop.
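This early shutdown can be demonstrated with a producer that would otherwise run for a very long time:

```shell
# head exits after printing three lines; seq's next write to the now
# closed pipe raises SIGPIPE, terminating it, so the whole pipeline
# stops almost immediately instead of counting to a billion.
seq 1000000000 | head -n 3
```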
For example, here is a pipeline that computes the factorial of 1,000:
% seq 1000 | paste -s -d '*' | bc
40238726007709377354370243392300398571937486421071463254379991042993\
85123986290205920442084869694048004799886101971960586316668729948085\
.... rest of the output truncated
The pipeline has three filters. The
seq command just prints the numbers from 1 to 1,000 to its standard output. The shell arranges things so that this output gets fed into a pipe and is consumed by the next filter. The
paste filter then joins all the lines with the
* delimiter. It just does that little bit, and writes the resulting line to standard output, as shown in the following screenshot:
Though the paste command seems to write to the console, the shell has arranged for its output to go into a pipe again. At the other end of this pipe, the consumer is bc. The
bc command, or filter, is capable of arbitrary precision arithmetic; in simpler terms, it can perform very large computations.
When the seq command exits normally, it closes its end of the pipe, and this triggers an EOF (end of file) on the pipe. This tells
paste that the input stream has nothing more to read, so it does the joining, writes the output to the console (which is really going to a pipe), and quits in turn.
This quitting results in an EOF for the
bc process, so it computes the multiplication, prints the result to the standard output, which is really a console, and finally quits. This is an ordered shutdown; no more work needs to be done, so exit and relinquish the computing resources for other concurrent processes, if there are any. The melodramatic term for this marker is poison pill. See https://dzone.com/articles/producers-and-consumers-part-3 for more information.
Unbeknownst to all the filters participating in the pipeline, the parent shell has arranged for this coordination. This ability of the framework to be composed of smaller parts without the parts themselves being aware of the composition is a great design pattern, called pipes and filters. We will see how composition is one central theme, yielding robust concurrent programs.
What happens when the
seq process produces numbers way too fast? Would the consumer (
paste in this case) get overwhelmed? Aha, no! The pipeline also has an implicit flow control built into it. This is yet another central theme, called back-pressure, where the faster producer (or consumer) is forced to wait so the slower filter catches up.
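A sketch of this blocking behavior follows; the scratch file name is arbitrary, and the pipe buffer size (64 KB on Linux) is an assumption about the platform. seq tries to write far more than the pipe can buffer, so it cannot finish until the sleeping consumer starts reading:

```shell
# Record when the producer finishes. Because ~1.2 MB of numbers cannot
# fit in the pipe buffer, seq blocks in write() until the consumer
# wakes up after its sleep and drains the pipe.
start=$(date +%s)
{ seq 200000; date +%s > /tmp/producer_done.$$; } | { sleep 2; cat > /dev/null; }
done_at=$(cat /tmp/producer_done.$$)
elapsed=$((done_at - start))
echo "producer could finish only after ${elapsed}s"
rm -f /tmp/producer_done.$$
```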
Let's next look at this flow control mechanism.
The wonderful idea behind the previously mentioned pipeline is that the
find producer and the
xargs consumer don't know each other. That is, you could compose any filters using pipes. This is the celebrated pipes and filters design pattern in action. The shell command line gives you a framework that enables you to compose any filters together into a pipeline.
What does it give us? You can reuse the same filter in unexpected and creative ways to get your work done. Each filter just needs to follow a simple protocol of accepting input on file descriptor 0, writing output to file descriptor 1, and writing errors to descriptor 2.
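Any function or program honoring this protocol is a filter. Here is a minimal sketch (the to_upper name is made up): it reads file descriptor 0, writes file descriptor 1, and would send any diagnostics to descriptor 2, so it composes freely with other filters:

```shell
# A minimal filter: fd 0 in, fd 1 out; diagnostics (if any) go to fd 2.
to_upper() {
  tr '[:lower:]' '[:upper:]'
}

echo 'hello pipeline' | to_upper          # prints HELLO PIPELINE
seq 3 | to_upper | paste -s -d ',' -      # composes with other filters: prints 1,2,3
```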
You can refer to a UNIX shell programming guide for more information on descriptors and related ideas. My personal favorite is UNIX Power Tools, 3rd Ed. by Jerry Peek et al.
Flow control means we are trying to regulate the flow of something. When you tell someone to talk slowly so that you can follow their meaning, you are trying to control the flow of words.
Flow control is essential in ensuring that the producer (such as a fast speaker) does not overwhelm the consumer (such as a listener). In the example we have been working on, the
find process could produce filenames faster; the
egrep process might need more time to process each file. The
find producer works at its own pace, and does not care about a slow consumer.
If the pipe gets full because of the slower consumption by
xargs, the output call by
find is blocked; that is, the process is waiting, and so it can't run. This pauses
find until the consumer has finally found the time to consume some filenames and the pipe has some free space. It works the other way around as well: a fast consumer blocks on an empty pipe, waiting for the producer to write something. Blocking is a process-level mechanism, and
find (or any other filter) does not know it is blocking or unblocking.
The moment a process starts running, it performs its computation; the find filter, for example, ferrets out some filenames and outputs them. Here is a simplified state diagram, showing a process's life cycle:
What is this scheduled state? As mentioned, a running process could get blocked waiting for some I/O to happen, and thus it cannot use the CPU. So it is put on the back burner for a while, and other processes, waiting their turn, are given a chance to run. Drawing a parallel with the previously mentioned receptionist scenario, the receptionist can ask us to be seated and wait a while, and then move on to the next guest in the queue.
The other idea is that the process has run its allocated slice of time, so other processes should now get a chance, too. In this case, even though the process can run and utilize the CPU, it is moved back to the scheduled state, and can run again once other processes have used their run slices. This is preemptive multitasking we have here, which makes it a fair world to live in! Processes need to run so that useful work can happen. Preemptive scheduling is an idea to help each process get a slice of CPU time.
However, there is another notion that could throw a spanner into this scheme of things. A process with a higher priority is given preference over lower priority processes.
A real-world example should help make this clear. While driving on roads, when we see an ambulance or a police car with a screaming siren, we are required to make way for them. Similarly, a process executing a piece of business logic may need more priority than the data backup process.
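On UNIX, the nice command is the everyday way to lower a process's priority. Run with no command, nice prints the current niceness, so nesting it shows the effect; the backup invocation in the comment is only a sketch with a hypothetical file name:

```shell
# The inner nice inherits the niceness set by the outer one and prints
# it, demonstrating that the child really runs 10 steps lower in
# priority. A low-priority backup could be started similarly, e.g.:
#   nice -n 19 gzip -9 backup.tar    (backup.tar is hypothetical)
nice -n 10 nice    # prints 10
```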
GNU parallel (https://www.gnu.org/software/parallel/) is a tool for executing commands in parallel on one or more nodes. The following diagram shows a simple run where we generate 10 text files and zip them (using the
gzip command) in parallel. All the available cores are used to run
gzip, thereby reducing the overall processing time:
The core principle at work is divide and conquer. We see the same principle again and again: a parallelizable job is split into pieces, each of which is processed in parallel (thereby overlapping processing and reducing the time). The
parallel command also allows you to distribute long-running jobs on different nodes (machines), thereby allowing you to harness the idle (possibly unused) cores to process jobs quickly.
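The same divide-and-conquer run can be sketched as follows. With GNU parallel installed, the compression step could be written as parallel gzip ::: file*.txt; since parallel may not be present everywhere, this sketch uses xargs -P, a widely available way to run up to N commands concurrently:

```shell
# Generate 10 small text files in a scratch directory, then compress
# them with up to 4 concurrent gzip processes.
workdir=$(mktemp -d)
cd "$workdir"
for i in $(seq 10); do seq 1000 > "file$i.txt"; done

# -n 1: one filename per gzip invocation; -P 4: up to 4 at a time.
printf '%s\n' file*.txt | xargs -n 1 -P 4 gzip

ls file*.txt.gz | wc -l    # 10 compressed files
```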
The communication depicted in the preceding section could be looked at as message passing;
find is passing on the filename as a message to the
egrep process, or
seq is passing messages (numbers) to the
paste process. Generally speaking, a producer is sending messages to the consumer for consuming, as shown in the following diagram:
As shown in the preceding diagram, each process has its own state by design, and this state is hidden from other processes. The processes communicate with explicit messaging channels, in the same way that a pipe directs the flow of water.
This notion of state is very important for understanding the various upcoming concurrency patterns. We could look at the state as data in a certain stage of processing. For example, the seq process could be using a counter to generate the numbers, writing them to its standard output (file descriptor 1; by default, the console). At the same time, the paste process is processing its input and writing data to its standard output. Neither process cares about the other's state; in fact, they don't even know anything about the other process.
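Besides the anonymous pipes the shell creates, UNIX also offers named pipes (FIFOs) as explicit messaging channels between unrelated processes. A small sketch (the channel path is arbitrary, via mktemp -u):

```shell
# Create a named channel, send one message through it from a background
# producer, and read it in the consumer. Neither process sees the
# other's internal state; they share only the message.
fifo=$(mktemp -u)
mkfifo "$fifo"

( echo 'hello from producer' > "$fifo" ) &   # producer writes the message
msg=$(cat "$fifo")                           # consumer reads it
wait                                         # reap the background producer
echo "consumer received: $msg"

rm -f "$fifo"
```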
The real world is full of encapsulated states. The following diagram shows an example:
It defeats common sense to share the state (the need to buy milk) with the postal department employee. It is useless for him to know it, and it could create confusion.
Likewise, the employee will be going about his daily tasks and has a state of his own. Why do we, as consumers, need to know the internal details (state) of how he is going to manage his work (dispatch this big stack of letters)? The world is concurrent, and the various entities in it also hide unnecessary details from each other to avoid confusion. If we don't hide the internal details (that is, the state), it would create havoc.
We could ask whether there is a global shared memory. If there is, then we could use it as a message channel. Using a shared data structure of our choice, the producer could put the data in it for subsequent consumption; that is, the memory is used as a channel of communication.