Linux system architecture

In order to clearly understand the Linux system architecture, one needs to first understand a few important concepts: the processor Application Binary Interface (ABI), CPU privilege levels, and how these affect the code we write. Accordingly, and with a few code examples, we'll delve into these here, before diving into the details of the system architecture itself.

Preliminaries

If one is posed the question, "what is the CPU for?", the answer is pretty obvious: the CPU is the heart of the machine – it reads in, decodes, and executes machine instructions, working on memory and peripherals. It does this by incorporating various stages.

Very simplistically, in the Instruction Fetch stage, it reads in machine instructions (which we represent in various human-readable ways – in hexadecimal, assembly, and high-level languages) from memory (RAM) or CPU cache. Then, in the Instruction Decode phase, it proceeds to decipher the instruction. Along the way, it makes use of the control unit, its register set, ALU, and memory/peripheral interfaces.

The ABI

Let's imagine that we write a C program, and run it on the machine.

Well, hang on a second. C code cannot possibly be directly deciphered by the CPU; it must be converted into machine language. So, we understand that on modern systems we will have a toolchain installed – this includes the compiler, linker, library objects, and various other tools. We compile and link the C source code, converting it into an executable format that can be run on the system.

The processor Instruction Set Architecture (ISA) – documents the machine's instruction formats, the addressing schemes it supports, and its register model. In fact, CPU Original Equipment Manufacturers (OEMs) release a document that describes how the machine works; this document is generally called the ABI. The ABI describes more than just the ISA; it describes the machine instruction formats, the register set details, the calling convention, the linking semantics, and the executable file format, such as ELF. Try out a quick Google for x86 ABI – it should reveal interesting results.

The publisher makes the full source code for this book available on their website; we urge the reader to perform a quick Git clone on the following URL. Build and try it: https://github.com/PacktPublishing/Hands-on-System-Programming-with-Linux.

Let's try this out. First, we write a simple Hello, World type of C program:

 $ cat hello.c
 /*
 * hello.c
 *
 ****************************************************************
 * This program is part of the source code released for the book
 *  "Linux System Programming"
 *  (c) Kaiwan N Billimoria
 *  Packt Publishers
 *
 * From:
 *  Ch 1 : Linux System Architecture
 ****************************************************************
 * A quick 'Hello, World'-like program to demonstrate using 
 * objdump to show the corresponding assembly and machine 
 * language.
 */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int main(void)
{
    int a;

    printf("Hello, Linux System Programming, World!\n");
    a = 5;
    exit(0);
}
$

We build the application via the Makefile, with make. Ideally, the code must compile with no warnings:

$ gcc -Wall -Wextra hello.c -o hello
hello.c: In function ‘main':
hello.c:23:6: warning: variable ‘a' set but not used [-Wunused-but-set-variable]
  int a;
      ^
$

Important! Do not ignore compiler warnings with production code. Strive to get rid of all warnings, even the seemingly trivial ones; this will help a great deal with correctness, stability, and security.

In this trivial example code, we understand and anticipate the unused variable warning that gcc emits, and just ignore it for the purpose of this demo.

The exact warning and/or error messages you see on your system could differ from what you see here. This is because my Linux distribution (and version), compiler/linker, library versions, and perhaps even CPU, may differ from yours. I built this on a x86_64 box running the Fedora 27/28 Linux distribution.

Similarly, we build the debug version of the hello program (again, ignoring the warning for now), and run it:

$ make hello_dbg
[...]
$ ./hello_dbg 
Hello, Linux System Programming, World!
$

We use the powerful objdump utility to see the intermixed source-assembly-machine language of our program (objdump's --source option switch
-S, --source Intermix source code with disassembly):

$ objdump --source ./hello_dbg
./hello_dbg:     file format elf64-x86-64

Disassembly of section .init:

0000000000400400 <_init>:
  400400:    48 83 ec 08              sub    $0x8,%rsp

[...]

int main(void)
{
  400527:    55                       push   %rbp
  400528:    48 89 e5                 mov    %rsp,%rbp
  40052b:    48 83 ec 10              sub    $0x10,%rsp
    int a;

    printf("Hello, Linux System Programming, World!\n");
  40052f:    bf e0 05 40 00           mov    $0x4005e0,%edi
  400534:    e8 f7 fe ff ff           callq  400430 <puts@plt>
    a = 5;
  400539:    c7 45 fc 05 00 00 00     movl   $0x5,-0x4(%rbp)
    exit(0);
  400540:    bf 00 00 00 00           mov    $0x0,%edi
  400545:    e8 f6 fe ff ff           callq  400440 <exit@plt>
  40054a:    66 0f 1f 44 00 00        nopw   0x0(%rax,%rax,1)

[...]

$

The exact assembly and machine code you see on your system will, in all likelihood, differ from what you see here; this is because my Linux distribution (and version), compiler/linker, library versions, and perhaps even CPU, may differ from yours. I built this on a x86_64 box running Fedora Core 27.

Alright. Let's take the line of source code a = 5; where, objdump reveals the corresponding machine and assembly language:

    a = 5;
  400539:    c7 45 fc 05 00 00 00     movl   $0x5,-0x4(%rbp)

We can now clearly see the following:

C source	Assembly language	Machine instructions
`a = 5;`	`movl $0x5,-0x4(%rbp)`	`c7 45 fc 05 00 00 00`

So, when the process runs, at some point it will fetch and execute the machine instructions, producing the desired result. Indeed, that's exactly what a programmable computer is designed to do!

Though we have shown examples of displaying (and even writing a bit of) assembly and machine code for the Intel CPU, the concepts and principles behind this discussion hold up for other CPU architectures, such as ARM, PPC, and MIPS. Covering similar examples for all these CPUs goes beyond the scope of this book; however, we urge the interested reader to study the processor datasheet and ABI, and try it out.

Accessing a register's content via inline assembly

Now that we've written a simple C program and seen its assembly and machine code, let's move on to something a little more challenging: a C program with inline assembly to access the contents of a CPU register.

Details on assembly-language programming are outside the scope of this book; refer to the Further reading section on the GitHub repository.

x86_64 has several registers; let's just go with the ordinary RCX register for this example. We do make use of an interesting trick: the x86 ABI calling convention states that the return value of a function will be the value placed in the accumulator, that is, RAX for the x86_64. Using this knowledge, we write a function that uses inline assembly to place the content of the register we want into RAX. This ensures that this is what it will return to the caller!

Assembly micro-basics includes the following:

at&t syntax:
movq <src_reg>, <dest_reg>
Register : prefix name with %
Immediate value : prefix with $

For more, see the Further reading section on the GitHub repository.

Let's take a look at the following code:

$ cat getreg_rcx.c
/*
 * getreg_rcx.c
 *
 ****************************************************************
 * This program is part of the source code released for the book
 *  "Linux System Programming"
 *  (c) Kaiwan N Billimoria
 *  Packt Publishers
 *
 * From:
 *  Ch 1 : Linux System Architecture
 ****************************************************************
 * Inline assembly to access the contents of a CPU register.
 * NOTE: this program is written to work on x86_64 only.
 */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

typedef unsigned long u64;

static u64 get_rcx(void)
{
    /* Pro Tip: x86 ABI: query a register's value by moving its value into RAX.
     * [RAX] is returned by the function! */
    __asm__ __volatile__(
            "push %rcx\n\t" 
            "movq $5, %rcx\n\t" 
            "movq %rcx, %rax"); 
                   /* at&t syntax: movq <src_reg>, <dest_reg> */
    __asm__ __volatile__("pop %rcx");
}

int main(void)
{
    printf("Hello, inline assembly:\n [RCX] = 0x%lx\n",  
                get_rcx());
    exit(0);
}
$ gcc -Wall -Wextra getreg_rcx.c -o getreg_rcx
getreg_rcx.c: In function ‘get_rcx':
getreg_rcx.c:32:1: warning: no return statement in function returning non-void [-Wreturn-type]
 }
 ^
$ ./getreg_rcx 
Hello, inline assembly:
 [RCX] = 0x5
$

There; it works as expected.

Accessing a control register's content via inline assembly

Among the many fascinating registers on the x86_64 processor, there happen to be six control registers, named CR0 through CR4, and CR8. There's really no need to delve into detail regarding them; suffice it to say that they are crucial to system control.

For the purpose of an illustrative example, let's consider the CR0 register for a moment. Intel's manual states: CR0—contains system control flags that control operating mode and states of the processor.

Intel's manuals can be downloaded conveniently as PDF documents from here (includes the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3 (3A, 3B and 3C): System Programming Guide):

https://software.intel.com/en-us/articles/intel-sdm

Clearly, CR0 is an important register!
We modify our previous program to access and display its content (instead of the ordinary RCX register). The only relevant code (which has changed from the previous program) is the function that queries the CR0 register value:

static u64 get_cr0(void)
{
    /* Pro Tip: x86 ABI: query a register's value by moving it's value into RAX.
     * [RAX] is returned by the function! */
    __asm__ __volatile__("movq %cr0, %rax"); 
        /* at&t syntax: movq <src_reg>, <dest_reg> */
}

Build and run it:

$ make getreg_cr0
[...]
$ ./getreg_cr0 
Segmentation fault (core dumped)
$

It crashes!

Well, what happened here? Read on.

CPU privilege levels

As mentioned earlier in this chapter, the essential job of the CPU is to read in machine instructions from memory, decipher, and execute them. In the early days of computing, this is pretty much all the processor did. But then, engineers, thinking deeper on it, realized that there is a critical issue with this: if a programmer can feed an arbitrary stream of machine instructions to the processor, which it, in turn, blindly and obediently executes, herein lies scope to do damage, to hack the machine!

How? Recall from the previous section the Intel processor's CR0 control register: Contains system control flags that control operating mode and states of the processor. If one has unlimited (read/write) access to the CR0 register, one could toggle bits that could do the following:

Turn hardware paging on or off
Disable the CPU cache
Change caching and alignment attributes
Disable WP (write protect) on memory (technically, pages) marked as read-only by the OS

Wow, a hacker could indeed wreak havoc. At the very least, only the OS should be allowed this kind of access.

Precisely for reasons such as the security, robustness, and correctness of the OS and the hardware resources it controls, all modern CPUs include the notion of privilege levels.

The modern CPU will support at least two privilege levels, or modes, which are generically called the following:

Supervisor
User

You need to understand that code, that is, machine instructions, runs on the CPU at a given privilege level or mode. A person designing and implementing an OS is free to exploit the processor privilege levels. This is exactly how modern OSes are designed. Take a look at the following table Generic CPU Privilege Levels:

Privilege level or mode name	Privilege level	Purpose	Terminology
Supervisor	High	OS code runs here	kernel-space
User	Low	Application code runs here	user-space (or userland)

Table 1: Generic CPU Privilege Levels

Privilege levels or rings on the x86

To understand this important concept better, let's take the popular x86 architecture as a real example. Right from the i386 onward, the Intel processor supports four privilege levels or rings: Ring 0, Ring 1, Ring 2, and Ring 3. On the Intel CPU's, this is how the levels work:

Figure 1: CPU ring levels and privilege

Let's visualize this Figure 1 in the form of a Table 2: x86 privilege or ring levels:

Privilege or ring level	Privilege	Purpose
Ring 0	Highest	OS code runs here
Ring 1	< ring 0	<Unused>
Ring 2	< ring 1	<Unused>
Ring 3	Lowest	Application code runs here (userland)

Table 2: x86 privilege or ring levels

Originally, ring levels 1 and 2 were intended for device drivers, but modern OSes typically run driver code at ring 0 itself. Some hypervisors (VirtualBox being one) used to use Ring 1 to run the guest kernel code; this was the case earlier when no hardware virtualization support was available (Intel VT-x, AMD SV).

The ARM (32-bit) processor has seven modes of execution; of these, six are privileged, and only one is the non-privileged mode. On ARM, generically, the equivalent to Intel's Ring 0 is Supervisor (SVC) mode, and the equivalent to Intel's Ring 3 is User mode.

For interested readers, there are more links in the Further reading section on the GitHub repository.

The following diagram clearly shows of all modern OSes (Linux, Unix, Windows, and macOS) running on an x86 processor exploit processor-privilege levels:

Figure 2: User-Kernel separation

Importantly, the processor ISA assigns every machine instruction with a privilege level or levels at which they are allowed to be executed. A machine instruction that is allowed to execute at the user privilege level automatically implies it can also be executed at the Supervisor privilege level. This distinguishing between what can and cannot be done at what mode also applies to register access.

To use the Intel terminology, the Current Privilege Level (CPL) is the privilege level at which the processor is currently executing code.

For example, that on a given processor shown as follows:

The foo1 machine instruction has an allowed privilege level of Supervisor (or Ring 0 for x86)
The foo2 machine instruction has an allowed privilege level of User (or Ring 3 for x86)

So, for a running application that executes these machine instructions, the following table emerges:

Machine instruction	Allowed-at mode	CPL (current privilege level)	Works?
foo1	Supervisor (0)	0	Yes
foo1	Supervisor (0)	3	No
foo2	User (3)	0	Yes
foo2	User (3)	3	Yes

Table 3: Privilege levels – an example

So, thinking about it, foo2 being allowed at User mode would also be allowed to execute with any CPL. In other words, if the CPL <= allowed privilege level, it works, otherwise it does not.

When one runs an application on, say, Linux, the application runs as a process (more on this later). But what privilege (or mode or ring) level does the application code run at? Refer to the preceding table: User Mode (Ring 3 on x86).

Aha! So now we see. The preceding code example, getreg_rcx.c, worked because it attempted to access the content of the general-purpose RCX register, which is allowed in User Mode (Ring 3, as well as at the other levels, of course)!

But the code of getreg_cr0.c failed; it crashed, because it attempted to access the content of the CR0 control register, which is disallowed in User Mode (Ring 3), and allowed only at the Ring 0 privilege! Only OS or kernel code can access the control registers. This holds true for several other sensitive assembly-language instructions as well. This approach makes a lot of sense.

Technically, it crashed because the processor raised a General Protection Fault (GPF).

Linux architecture

The Linux system architecture is a layered one. In a very simplistic way, but ideal to start on our path to understanding these details, the following diagram illustrates the Linux system architecture:

Figure 3: Linux – Simplified layered architecture

Layers help, because each layer need only be concerned with the layer directly above and below it. This leads to many advantages:

Clean design, reduces complexity
Standardization, interoperability
Ability to swap layers in and out of the stack
Ability to easily introduce new layers as required

On the last point, there exists the FTSE. To quote directly from Wikipedia:

The "fundamental theorem of software engineering (FTSE)" is a term originated by Andrew Koenig to describe a remark by Butler Lampson attributed to the late David J. Wheeler

We can solve any problem by introducing an extra level of indirection.

Now that we understand the concept of CPU modes or privilege levels, and how modern OSes exploit them, a better diagram (expanding on the previous one) of the Linux system architecture would be as follows:

Figure 4: Linux system architecture

In the preceding diagram, P1, P2, …, Pn are nothing but userland processes (Process 1, Process 2) or in other words, running applications. For example, on a Linux laptop, we might have the vim editor, a web browser, and terminal windows (gnome-terminal) running.

Libraries

Libraries, of course, are archives (collections) of code; as we well know, using libraries helps tremendously with code modularity, standardization, preventing the reinvent-the-wheel syndrome, and so on. A Linux desktop system might have libraries numbering in the hundreds, and possibly even a few thousand!

The classic K&R hello, world C program uses the printf API to write the string to the display:

printf(“hello, world\n”);

Obviously, the code of printf is not part of the hello, world source. So where does it come from? It's part of the standard C library; on Linux, due to its GNU origins, this library is commonly called GNU libc (glibc).

Glibc is a critical and required component on a Linux box. It not only contains the usual standard C library routines (APIs), it is, in fact, the programming interface to the operating system! How? Via its lower layer, the system calls.

System calls

System calls are actually kernel functionality that can be invoked from userspace via glibc stub routines. They serve a critical function; they connect userspace to kernel-space. If a user program wants to request something of the kernel (read from a file, write to the network, change a file's permissions), it does so by issuing a system call. Therefore, system calls are the only legal entry point to the kernel. There is no other way for a user-space process to invoke the kernel.

For a list of all the available Linux system calls, see section 2 of the man pages (https://linux.die.net/man/2/). One can also do: man 2 syscalls to see the man page on all supported system calls

Another way to think of this: the Linux kernel internally has literally thousands of APIs (or functions). Of these, only a small fraction are made visible or available, that is, exposed, to userspace; these exposed kernel APIs are system calls! Again, as an approximation, modern Linux glibc has around 300 system calls.

On an x86_64 Fedora 27 box running the 4.13.16-302.fc27.x86_64 kernel, there are close to 53,000 kernel APIs!

Here is the key thing to understand: system calls are very different from all other (typically library) APIs. As they ultimately invoke kernel (OS) code, they have the ability to cross the user-kernel boundary; in effect, they have the ability to switch from normal unprivileged User mode to completely privileged Supervisor or kernel mode!

How? Without delving into the gory details, system calls essentially work by invoking special machine instructions that have the built-in ability to switch the processor mode from User to Supervisor. All modern CPU ABIs will provide at least one such machine instruction; on the x86 processor, the traditional way to implement system calls is to use the special int 0x80 machine instruction. Yes, it is indeed a software interrupt (or trap). From Pentium Pro and Linux 2.6 onward, the sysenter/syscall machine instructions are used. See the Further reading section on the GitHub repository.

From the viewpoint of the application developer, a key point regarding system calls is that system calls appear to be regular functions (APIs) that can be invoked by the developer; this design is deliberate. The reality: the system call APIs that one invokes – such as open(), read(), chmod(), dup(), and write() – are merely stubs. They are a neat mechanism to get at the actual code that is in the kernel (getting there involves populating a register the accumulator on x86 – with the system call number, and passing parameters via other general-purpose registers) to execute that kernel code path, and return back to user mode when done. Refer to the following table:

CPU	Machine instruction(s) used to trap to Supervisor (kernel) Mode from User Mode	Allocated Register for system call number
`x86[_64]`	`int 0x80 or syscall`	`EAX / RAX`
`ARM`	`swi / svc`	R0 to R7
`Aarch64`	`svc`	`X8`
`MIPS`	`syscall`	`$v0`

Table 4: System calls on various CPU Architectures for better understanding

Linux – a monolithic OS

Operating systems are generally considered to adhere to one of two major architectural styles: monolithic or microkernel.

Linux is decidedly a monolithic OS.

What does that mean?

The English word monolith literally means a large single upright block of stone:

Figure 5: Corinthian columns – they're monolithic!

On the Linux OS, applications run as independent entities called processes. A process may be single-threaded (original Unix) or multithreaded. Regardless, for now, we will consider the process as the unit of execution on Linux; a process is defined as an instance of a program in execution.

When a user-space process issues a library call, the library API, in turn, may or may not issue a system call. For example, issuing the atoi(3) API does not cause glibc to issue a system call as it does not require kernel support to implement the conversion of a string into an integer. <api-name>(n) ; n is the man page section.

To help clarify these important concepts, let's check out the famous and classic K&R Hello, World C program again:

#include <stdio.h>
main()
{
    printf(“hello, world\n”);
}

Okay, that should work. Indeed it does.
But, the question is, how exactly does the printf(3) API write to the monitor device?

The short answer: it does not.
The reality is that printf(3) only has the intelligence to format a string as specified; that's it. Once done, printf actually invokes the write(2) API – a system call. The write system call does have the ability to write the buffer content to a special device file – the monitor device, seen by write as stdout. Go back to our discussion regarding The Unix philosophy in a nutshell : if it's not a process, it's a file! Of course, it gets really complex under the hood in the kernel; to cut a long story short, the kernel code of write ultimately switches to the correct driver code; the device driver is the only component that can directly work with peripheral hardware. It performs the actual write to the monitor, and return values propagate all the way back to the application.

In the following diagram, P is the hello, world process at runtime:

Fig 6: Code flow: printf-to-kernel

Also, from the diagram, we can see that glibc is considered to consist of two parts:

Arch-independent glibc: The regular libc APIs (such as [s|sn|v]printf, memcpy, memcmp, atoi)
Arch-dependent glibc: The system call stubs

Here, by arch, we mean CPU.
Also the ellipses (...) represent additional logic and processing within kernel-space that we do not show or delve into here.

Now that the code flow path of hello, world is clearer, let's get back to the monolithic stuff!

It's easy to assume that it works this way:

The hello, world app (process) issues the printf(3) library call.
printf issues the write(2) system call.
We switch from User to Supervisor (kernel) Mode.
The kernel takes over – it writes hello, world onto the monitor.
Switch back to non-privileged User Mode.

Actually, that's NOT the case.

The reality is, in the monolithic design, there is no kernel; to word it another way, the kernel is actually part of the process itself. It works as follows:

The hello, world app (process) issues the printf(3) library call.
printf issues the write(2) system call.
The process invoking the system call now switches from User to Supervisor (kernel) Mode.
The process runs the underlying kernel code, the underlying device driver code, and thus, writes hello, world onto the monitor!
The process is then switched back to non-privileged User Mode.

To summarize, in a monolithic kernel, when a process (or thread) issues a system call, it switches to privileged Supervisor or kernel mode and runs the kernel code of the system call (working on kernel data). When done, it switches back to unprivileged User mode and continues executing userspace code (working on user data).

This is very important to understand:

Fig 7: Life of a process in terms of privilege modes

The preceding diagram attempts to illustrate that the X axis is the timeline, and the Y axis represents User Mode (at the top) and Supervisor (kernel) Mode (at the bottom):

time t₀: A process is born in kernel mode (the code to create a process is within the kernel of course). Once fully born, it is switched to User (non-privileged) Mode and it runs its userspace code (working on its userspace data items as well).
time t₁: The process, directly or indirectly (perhaps via a library API), invokes a system call. It now traps into kernel mode (refer the table System Calls on CPU Architectures shows the machine instructions depending on the CPU to do so) and executes kernel code in privileged Supervisor Mode (working on kernel data items as well).
time t₂: The system call is done; the process switches back to non-privileged User Mode and continues to execute its userspace code. This process continues, until some point in the future.
time t_n: The process dies, either deliberately by invoking the exit API, or it is killed by a signal. It now switches back to Supervisor Mode (as the exit(3) library API invokes the _exit(2) system call), executes the kernel code of _exit(), and terminates.

In fact, most modern operating systems are monolithic (especially the Unix-like ones).

Technically, Linux is not considered 100 percent monolithic. It's considered to be mostly monolithic, but also modular, due to the fact that the Linux kernel supports modularization (the plugging in and out of kernel code and data, via a technology called Loadable Kernel Modules (LKMs)).
Interestingly, MS Windows (specifically, from the NT kernel onward) follows a hybrid architecture that is both monolithic and microkernel.

Tech Concepts

Programming languages

Tech Tools

Unlimited access to the largest independent learning library in tech of over 8,000 expert-authored tech books and videos.

Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.

50+ new titles added per month and exclusive early access to books as they are being written.

Hands-On System Programming with Linux

By : Kaiwan N. Billimoria, Tigran Aivazian

Hands-On System Programming with Linux

By: Kaiwan N. Billimoria, Tigran Aivazian

Overview of this book

Linux system architecture

Preliminaries

The ABI

Accessing a register's content via inline assembly

Accessing a control register's content via inline assembly

CPU privilege levels

Privilege levels or rings on the x86

Linux architecture

Libraries

System calls

Linux – a monolithic OS

What does that mean?

Hands-On System Programming with Linux

By : Kaiwan N. Billimoria, Tigran Aivazian

Hands-On System Programming with Linux

By: Kaiwan N. Billimoria, Tigran Aivazian

Overview of this book

Confirmation

Buy this book with your credits?

Submit Your Feedback

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access