Book Image

Mastering Linux Kernel Development

By : CH Raghav Maruthi
Book Image

Mastering Linux Kernel Development

By: CH Raghav Maruthi

Overview of this book

Mastering Linux Kernel Development looks at the Linux kernel, its internal arrangement and design, and various core subsystems, helping you to gain significant understanding of this open source marvel. You will look at how the Linux kernel, which possesses a kind of collective intelligence thanks to its scores of contributors, remains so elegant owing to its great design. This book also looks at all the key kernel code, core data structures, functions, and macros, giving you a comprehensive foundation of the implementation details of the kernel’s core services and mechanisms. You will also look at the Linux kernel as well-designed software, which gives us insights into software design in general that are easily scalable yet fundamentally strong and safe. By the end of this book, you will have considerable understanding of and appreciation for the Linux kernel.
Table of Contents (19 chapters)
Title Page
About the Author
About the Reviewer
Customer Feedback

Process descriptors

Right from the time a process is born until it exits, it’s the kernel's process management subsystem that carries out various operations, ranging from process creation, allocating CPU time, and event notifications to destruction of the process upon termination.

Apart from the address space, a process in memory is also assigned a data structure called the process descriptor, which the kernel uses to identify, manage, and schedule the process. The following figure depicts process address spaces with their respective process descriptors in the kernel:

In Linux, a process descriptor is an instance of type struct task_struct defined in <linux/sched.h>, it is one of the central data structures, and contains all the attributes, identification details, and resource allocation entries that a process holds. Looking at struct task_struct is like a peek into the window of what the kernel sees or works with to manage and schedule a process.

Since the task structure contains a wide set of data elements, which are related to the functionality of various kernel subsystems, it would be out of context to discuss the purpose and scope of all the elements in this chapter. We shall consider a few important elements that are related to process management.

Process attributes - key elements

Process attributes define all the key and fundamental characteristics of a process. These elements contain the process's state and identifications along with other key values of importance.


A process right from the time it is spawned until it exits may exist in various states, referred to as process states--they define the process’s current state:

  • TASK_RUNNING (0): The task is either executing or contending for CPU in the scheduler run-queue.
  • TASK_INTERRUPTIBLE (1): The task is in an interruptible wait state; it remains in wait until an awaited condition becomes true, such as the availability of mutual exclusion locks, device ready for I/O, lapse of sleep time, or an exclusive wake-up call. While in this wait state, any signals generated for the process are delivered, causing it to wake up before the wait condition is met.
  • TASK_KILLABLE: This is similar to TASK_INTERRUPTIBLE, with the exception that interruptions can only occur on fatal signals, which makes it a better alternative to TASK_INTERRUPTIBLE.
  • TASK_UNINTERRUTPIBLE (2): The task is in uninterruptible wait state similar to TASK_INTERRUPTIBLE, except that generated signals to the sleeping process do not cause wake-up. When the event occurs for which it is waiting, the process transitions to TASK_RUNNING. This process state is rarely used.
  • TASK_ STOPPED (4): The task has received a STOP signal. It will be back to running on receiving the continue signal (SIGCONT).
  • TASK_TRACED (8): A process is said to be in traced state when it is being combed, probably by a debugger.
  • EXIT_ZOMBIE (32): The process is terminated, but its resources are not yet reclaimed.
  • EXIT_DEAD (16): The child is terminated and all the resources held by it freed, after the parent collects the exit status of the child using wait.

The following figure depicts process states:


This field contains a unique process identifier referred to as PID. PIDs in Linux are of the type pid_t (integer). Though a PID is an integer, the default maximum number PIDs is 32,768 specified through the /proc/sys/kernel/pid_max interface. The value in this file can be set to any value up to 222 (PID_MAX_LIMIT, approximately 4 million).

To manage PIDs, the kernel uses a bitmap. This bitmap allows the kernel to keep track of PIDs in use and assign a unique PID for new processes. Each PID is identified by a bit in the PID bitmap; the value of a PID is determined from the position of its corresponding bit. Bits with value 1 in the bitmap indicate that the corresponding PIDs are in use, and those with value 0 indicate free PIDs. Whenever the kernel needs to assign a unique PID, it looks for the first unset bit and sets it to 1, and conversely to free a PID, it toggles the corresponding bit from 1 to 0.


This field contains the thread group id. For easy understanding, let's say when a new process is created, its PID and TGID are the same, as the process happens to be the only thread. When the process spawns a new thread, the new child gets a unique PID but inherits the TGID from the parent, as it belongs to the same thread group. The TGID is primarily used to support multi-threaded process. We will delve into further details in the threads section of this chapter.

thread info

This field holds processor-specific state information, and is a critical element of the task structure. Later sections of this chapter contain details about the importance of thread_info.


The flags field records various attributes corresponding to a process. Each bit in the field corresponds to various stages in the lifetime of a process. Per-process flags are defined in <linux/sched.h>:

#define PF_EXITING           /* getting shut down */
#define PF_EXITPIDONE        /* pi exit done on shut down */
#define PF_VCPU              /* I'm a virtual CPU */
#define PF_WQ_WORKER         /* I'm a workqueue worker */
#define PF_FORKNOEXEC        /* forked but didn't exec */
#define PF_MCE_PROCESS       /* process policy on mce errors */
#define PF_SUPERPRIV         /* used super-user privileges */
#define PF_DUMPCORE          /* dumped core */
#define PF_SIGNALED          /* killed by a signal */
#define PF_MEMALLOC          /* Allocating memory */
#define PF_NPROC_EXCEEDED    /* set_user noticed that RLIMIT_NPROC was exceeded */
#define PF_USED_MATH         /* if unset the fpu must be initialized before use */
#define PF_USED_ASYNC        /* used async_schedule*(), used by module init */
#define PF_NOFREEZE          /* this thread should not be frozen */
#define PF_FROZEN            /* frozen for system suspend */
#define PF_FSTRANS           /* inside a filesystem transaction */
#define PF_KSWAPD            /* I am kswapd */
#define PF_MEMALLOC_NOIO0    /* Allocating memory without IO involved */
#define PF_LESS_THROTTLE     /* Throttle me less: I clean memory */
#define PF_KTHREAD           /* I am a kernel thread */
#define PF_RANDOMIZE         /* randomize virtual address space */
#define PF_SWAPWRITE         /* Allowed to write to swap */
#define PF_NO_SETAFFINITY    /* Userland is not allowed to meddle with cpus_allowed */
#define PF_MCE_EARLY         /* Early kill for mce process policy */
#define PF_MUTEX_TESTER      /* Thread belongs to the rt mutex tester */
#define PF_FREEZER_SKIP      /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK      /* this thread called freeze_processes and should not be frozen */

exit_code and exit_signal

These fields contain the exit value of the task and details of the signal that caused the termination. These fields are to be accessed by the parent process through wait() on termination of the child.


This field holds the name of the binary executable used to start the process.


This field is enabled and set when the process is put into trace mode using the ptrace() system call.

Process relations - key elements

Every process can be related to a parent process, establishing a parent-child relationship. Similarly, multiple processes spawned by the same process are called siblings. These fields establish how the current process relates to another process.

real_parent and parent

These are pointers to the parent's task structure. For a normal process, both these pointers refer to the same task_struct; they only differ for multi-thread processes, implemented using posix threads. For such cases, real_parent refers to the parent thread task structure and parent refers the process task structure to which SIGCHLD is delivered.


This is a pointer to a list of child task structures.


This is a pointer to a list of sibling task structures.


This is a pointer to the task structure of the process group leader.

Scheduling attributes - key elements

All contending processes must be given fair CPU time, and this calls for scheduling based on time slices and process priorities. These attributes contain necessary information that the scheduler uses when deciding on which process gets priority when contending.

prio and static_prio

prio helps determine the priority of the process for scheduling. This field holds static priority of the process within the range 1 to 99 (as specified by sched_setscheduler()) if the process is assigned a real-time scheduling policy. For normal processes, this field holds a dynamic priority derived from the nice value.

se, rt, and dl

Every task belongs to a scheduling entity (group of tasks), as scheduling is done at a per-entity level. se is for all normal processes, rt is for real-time processes, and dl is for deadline processes. We will discuss more on these attributes in the next chapter on scheduling.


This field contains information about the scheduling policy of the process, which helps in determining its priority.


This field specifies the CPU mask for the process, that is, on which CPU(s) the process is eligible to be scheduled in a multi-processor system.


This field specifies the priority to be applied by real-time scheduling policies. For non-real-time processes, this field is unused.

Process limits - key elements

The kernel imposes resource limits to ensure fair allocation of system resources among contending processes. These limits guarantee that a random process does not monopolize ownership of resources. There are 16 different types of resource limits, and the task structure points to an array of type struct rlimit, in which each offset holds the current and maximum values for a specific resource.

struct rlimit {
  __kernel_ulong_t        rlim_cur;
  __kernel_ulong_t        rlim_max;
These limits are specified in include/uapi/asm-generic/resource.h

 #define RLIMIT_CPU        0       /* CPU time in sec */
 #define RLIMIT_FSIZE      1       /* Maximum filesize */
 #define RLIMIT_DATA       2       /* max data size */
 #define RLIMIT_STACK      3       /* max stack size */
 #define RLIMIT_CORE       4       /* max core file size */
 #ifndef RLIMIT_RSS
 # define RLIMIT_RSS       5       /* max resident set size */
 # define RLIMIT_NPROC     6       /* max number of processes */
 # define RLIMIT_NOFILE    7       /* max number of open files */
 # define RLIMIT_MEMLOCK   8       /* max locked-in-memory   
 address space */
 #ifndef RLIMIT_AS
 # define RLIMIT_AS        9       /* address space limit */
 #define RLIMIT_LOCKS      10      /* maximum file locks held */
 #define RLIMIT_SIGPENDING 11      /* max number of pending signals */
 #define RLIMIT_MSGQUEUE   12      /* maximum bytes in POSIX mqueues */
 #define RLIMIT_NICE       13      /* max nice prio allowed to 
 raise to 0-39 for nice level 19 .. -20 */
 #define RLIMIT_RTPRIO     14      /* maximum realtime priority */
 #define RLIMIT_RTTIME     15      /* timeout for RT tasks in us */
 #define RLIM_NLIMITS      16

File descriptor table - key elements

During the lifetime of a process, it may access various resource files to get its task done. This results in the process opening, closing, reading, and writing to these files. The system must keep track of these activities; file descriptor elements help the system know which files the process holds.


Filesystem information is stored in this field.


The file descriptor table contains pointers to all the files that a process opens to perform various operations. The files field contains a pointer, which points to this file descriptor table.

Signal descriptor - key elements

For processes to handle signals, the task structure has various elements that determine how the signals must be handled.


This is of type struct signal_struct, which contains information on all the signals associated with the process.


This is of type struct sighand_struct, which contains all signal handlers associated with the process.

sigset_t blocked, real_blocked

These elements identify signals that are currently masked or blocked by the process.


This is of type struct sigpending, which identifies signals which are generated but not yet delivered.


This field contains a pointer to an alternate stack, which facilitates signal handling.


This filed shows the size of the alternate stack, used for signal handling.