Mastering Linux Device Driver Development

By: John Madieu

Overview of this book

Linux is one of the fastest-growing operating systems around the world, and in the last few years, the Linux kernel has evolved significantly to support a wide variety of embedded devices with its improved subsystems and a range of new features. With this book, you’ll find out how you can enhance your skills to write custom device drivers for your Linux operating system. Mastering Linux Device Driver Development provides complete coverage of kernel topics, including video and audio frameworks, that usually go unaddressed. You’ll work with some of the most complex and impactful Linux kernel frameworks, such as PCI, ALSA for SoC, and Video4Linux2, and discover expert tips and best practices along the way. In addition to this, you’ll understand how to make the most of frameworks such as NVMEM and Watchdog. Once you’ve got to grips with Linux kernel helpers, you’ll advance to working with special device types such as Multi-Function Devices (MFD) followed by video and audio device drivers. By the end of this book, you’ll be able to write feature-rich device drivers and integrate them with some of the most complex Linux kernel frameworks, including V4L2 and ALSA for SoC.
Table of Contents (19 chapters)
Section 1: Kernel Core Frameworks for Embedded Device Driver Development
Section 2: Multimedia and Power Saving in Embedded Linux Systems
Section 3: Staying Up to Date with Other Linux Kernel Subsystems

Work deferring mechanisms

Work deferring is a mechanism the Linux kernel offers. It allows you to defer a task until the system’s workload allows it to run smoothly or until a given time has elapsed. Depending on the type of work, the deferred task can run either in a process context or in an atomic context. It is common to use work deferring to complement the interrupt handler in order to compensate for some of its limitations, some of which are as follows:

  • The interrupt handler must be as fast as possible, meaning that only a critical task should be performed in the handler so that the rest can be deferred later when the system is less busy.
  • In the interrupt context, we cannot use blocking calls. The sleeping task should be scheduled in the process context.

The work deferring mechanism allows us to perform the minimum possible work in the interrupt handler (the so-called top half, which runs in an interrupt context) and to schedule an asynchronous action from it (the so-called bottom half, which may, but does not always, run in a process context) so that the rest of the operations can be executed at a later time. Nowadays, the term bottom half is mostly associated with deferred work running in a process context, since deferred work often needs to sleep, which work running in an interrupt context cannot do. Linux now has three different implementations of this: softIRQs, tasklets, and workqueues. Let’s take a look at these:

  • SoftIRQs: These are executed in an atomic context.
  • Tasklets: These are also executed in an atomic context.
  • Work queues: These run in a process context.

We will learn about each of them in detail in the next few sections.

SoftIRQs

As the name suggests, softIRQ stands for software interrupt. Such a handler can preempt all other tasks on the system except for hardware IRQ handlers, since softIRQs are executed with IRQs enabled. SoftIRQs are intended to be used for high-frequency threaded job scheduling. Network and block devices are the main driver-facing subsystems that make direct use of softIRQs; the timer, scheduler, and RCU code use them as well. Even though softIRQ handlers run with interrupts enabled, they cannot sleep, and any shared data needs proper locking. The softIRQ API is implemented in kernel/softirq.c in the kernel source tree, and any drivers that wish to use this API need to include <linux/interrupt.h>.

Note that you cannot dynamically create or destroy softIRQs. They are statically allocated at compile time. Moreover, the usage of softIRQs is restricted to statically compiled kernel code; they cannot be used with dynamically loadable modules. SoftIRQs are represented by struct softirq_action structures defined in <linux/interrupt.h>, as follows:

struct softirq_action {
    void (*action)(struct softirq_action *);
};

This structure embeds a pointer to the function to be run when the softirq action is raised. Thus, the prototype of your softIRQ handler should look as follows:

void softirq_handler(struct softirq_action *h)

Running a softIRQ handler results in this action function being executed. It only has one parameter: a pointer to the corresponding softirq_action structure. You can register the softIRQ handler at runtime by means of the open_softirq() function:

void open_softirq(int nr, 
                   void (*action)(struct softirq_action *))

nr represents the softIRQ’s index, which also serves as the softIRQ’s priority (where 0 is the highest). action is a pointer to the softIRQ’s handler. All possible indexes are enumerated in the following enum:

enum
{
    HI_SOFTIRQ=0,     /* High-priority tasklets */
    TIMER_SOFTIRQ,    /* Timers */
    NET_TX_SOFTIRQ,   /* Send network packets */
    NET_RX_SOFTIRQ,   /* Receive network packets */
    BLOCK_SOFTIRQ,    /* Block devices */
    IRQ_POLL_SOFTIRQ, /* IRQ polling (formerly BLOCK_IOPOLL_SOFTIRQ) */
    TASKLET_SOFTIRQ,  /* Normal-priority tasklets */
    SCHED_SOFTIRQ,    /* Scheduler */
    HRTIMER_SOFTIRQ,  /* High-resolution timers */
    RCU_SOFTIRQ,      /* RCU locking */
    NR_SOFTIRQS       /* Number of softIRQ types, 10 in total */
};

SoftIRQs with lower indexes (highest priority) run before those with higher indexes (lowest priority). The names of all the available softIRQs in the kernel are listed in the following array:

const char * const softirq_to_name[NR_SOFTIRQS] = {
    "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "IRQ_POLL",
    "TASKLET", "SCHED", "HRTIMER", "RCU"
};

These names also appear in the /proc/softirqs virtual file, which shows how many times each softIRQ has run on each CPU:

~$ cat /proc/softirqs 
                    CPU0       CPU1       CPU2       CPU3       
          HI:      14026         89        491        104
       TIMER:     862910     817640     816676     808172
      NET_TX:          0          2          1          3
      NET_RX:       1249        860        939       1184
       BLOCK:        130        100        138        145
    IRQ_POLL:          0          0          0          0
     TASKLET:      55947         23        108        188
       SCHED:    1192596     967411     882492     835607
     HRTIMER:          0          0          0          0
         RCU:     314100     302251     304380     298610
~$

An array of NR_SOFTIRQS struct softirq_action entries is declared in kernel/softirq.c:

static struct softirq_action softirq_vec[NR_SOFTIRQS];

Each entry in this array may contain one and only one softIRQ. As a consequence, there can be at most NR_SOFTIRQS registered softIRQs (10 in v4.19, which is the latest version at the time of writing). Registering a handler simply fills in the corresponding entry:

void open_softirq(int nr, 
                   void (*action)(struct softirq_action *))
{
    softirq_vec[nr].action = action;
}

A concrete example of this is the network subsystem, which registers softIRQs that it needs (in net/core/dev.c) as follows:

open_softirq(NET_TX_SOFTIRQ, net_tx_action);
open_softirq(NET_RX_SOFTIRQ, net_rx_action);

Before a registered softIRQ gets a chance to run, it should be activated/scheduled. To do this, you must call raise_softirq() or raise_softirq_irqoff() (if interrupts are already off):

void __raise_softirq_irqoff(unsigned int nr)
void raise_softirq_irqoff(unsigned int nr)
void raise_softirq(unsigned int nr)

The first function simply sets the appropriate bit in the per-CPU softIRQ bitmap (the __softirq_pending field in the struct irq_cpustat_t data structure, which is allocated per-CPU in kernel/softirq.c), as follows:

irq_cpustat_t irq_stat[NR_CPUS] ____cacheline_aligned;
EXPORT_SYMBOL(irq_stat);

This allows it to run when the flag is checked. This function has been described here for study purposes and should not be used directly.
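To make the pending bitmap concrete, here is a minimal user-space sketch of the idea. It is not kernel code: every name ending in _model is hypothetical, a single global variable stands in for one CPU's __softirq_pending field, and only four indexes are modeled.

```c
#include <assert.h>
#include <stdint.h>

/* User-space model only; this is not kernel code. */
enum { HI_MODEL = 0, TIMER_MODEL, NET_TX_MODEL, NET_RX_MODEL, NR_MODEL };

/* Stands in for one CPU's __softirq_pending bitmap. */
uint32_t pending_model;

/* Models __raise_softirq_irqoff(): mark softIRQ nr as pending. */
void raise_model(unsigned int nr)
{
    assert(nr < NR_MODEL);
    pending_model |= (uint32_t)1 << nr;
}

/* Returns 1 if softIRQ nr is marked pending, 0 otherwise. */
int is_pending_model(unsigned int nr)
{
    assert(nr < NR_MODEL);
    return (int)((pending_model >> nr) & 1u);
}
```

Because raising only sets a bit, raising the same softIRQ twice before it runs leaves a single pending bit, so in this model the handler would be invoked once for both requests.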

raise_softirq_irqoff() needs to be called with interrupts disabled. First, it internally calls __raise_softirq_irqoff(), as described previously, to activate the softIRQ. Then, it checks whether it has been called from within an interrupt (either hard or soft) context by means of the in_interrupt() macro, which tests the hardirq, softirq, and NMI bits of the current preempt_count: a zero result means we are not in an interrupt context, while a nonzero result means we are. If in_interrupt() is nonzero, nothing more is done, because softIRQ flags are checked on the exit path of any I/O IRQ handler (asm_do_IRQ() for ARM or do_IRQ() for x86 platforms, which makes a call to irq_exit()). Here, softIRQs run in an interrupt context. However, if in_interrupt() == 0, then wakeup_softirqd() gets invoked. This is responsible for waking the local CPU ksoftirqd thread up (it schedules it) to ensure the softIRQ runs soon, but in a process context this time.

raise_softirq() first calls local_irq_save() (which disables interrupts on the local processor after saving its current interrupt flags). It then calls raise_softirq_irqoff(), as described previously, to schedule the softIRQ on the local CPU (remember, this function must be invoked with IRQs disabled on the local CPU). Finally, it calls local_irq_restore() to restore the previously saved interrupt flags.

There are a few things to remember about softIRQs:

  • A softIRQ can never preempt another softIRQ. Only hardware interrupts can. SoftIRQs are executed at a high priority with scheduler preemption disabled, but with IRQs enabled. This makes softIRQs suitable for the most time-critical and important deferred processing on the system.
  • While a handler runs on a CPU, other softIRQs on this CPU are disabled. SoftIRQs can run concurrently, however. While a softIRQ is running, another softIRQ (even the same one) can run on another processor. This is one of the main advantages of softIRQs over hardIRQs, and is the reason why they are used in the networking subsystem, which may require heavy CPU power.
  • For locking between softIRQs (or even the same softIRQ as it may be running on a different CPU), you should use spin_lock() and spin_unlock().
  • SoftIRQs are mostly scheduled in the return paths of hardware interrupt handlers. SoftIRQs that are scheduled outside of the interrupt context will run in a process context if they are still pending when the local ksoftirqd thread is given the CPU. Their execution may be triggered in the following cases:

    --By the local per-CPU timer interrupt (on SMP systems only, with CONFIG_SMP enabled). See timer_tick(), update_process_times(), and run_local_timers() for more.

    --By making a call to the local_bh_enable() function (mostly invoked by the network subsystem for handling packet receiving/transmitting softIRQs).

    --On the exit path of any I/O IRQ handler (see do_IRQ, which makes a call to irq_exit(), which in turn invokes invoke_softirq()).

    --When the local ksoftirqd is given the CPU (that is, it’s been awakened).

The actual kernel function responsible for walking through the softIRQ’s pending bitmap and running them is __do_softirq(), which is defined in kernel/softirq.c. This function is always invoked with interrupts disabled on the local CPU. It performs the following tasks:

  1. Once invoked, the function saves the current per-CPU pending softIRQ’s bitmap in a so-called pending variable and locally disables softIRQs by means of __local_bh_disable_ip.
  2. It then resets the current per-CPU pending bitmask (which has already been saved) and then reenables interrupts (softIRQs run with interrupts enabled).
  3. After this, it enters a while loop, checking for pending softIRQs in the saved bitmap. If there is no softIRQ pending, nothing happens. Otherwise, it will execute the handler of each pending softIRQ, taking care to increment its execution statistics.
  4. After all the pending handlers have been executed (we are out of the while loop), __do_softirq() disables IRQs again and re-reads the per-CPU pending bitmask into the same pending variable in order to check if any softIRQs were scheduled while it was in the while loop. If there are any pending softIRQs, the whole process will restart (based on a goto loop), starting from step 2. This helps with handling, for example, softIRQs that have rescheduled themselves.

However, __do_softirq() will not repeat if one of the following conditions occurs:

  • It has already repeated up to MAX_SOFTIRQ_RESTART times, which is set to 10 in kernel/softirq.c. This is actually the limit for the softIRQ processing loop, not the upper bound of the previously described while loop.
  • It has hogged the CPU for more than MAX_SOFTIRQ_TIME, which is set to 2 ms (msecs_to_jiffies(2)) in kernel/softirq.c, since softIRQ processing keeps the scheduler from running other tasks on this CPU.

If one of the two situations occurs, __do_softirq() will break its loop and call wakeup_softirqd() to wake the local ksoftirqd thread, which will later execute the pending softIRQs in the process context. Since do_softirq() is called at many points in the kernel, it is likely that another invocation of __do_softirq() will handle pending softIRQs before ksoftirqd has the chance to run.
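The restart logic described above can be sketched as a small user-space model. This is only an illustration under stated assumptions: names ending in _model are hypothetical, the MAX_SOFTIRQ_TIME check and the IRQ enable/disable steps are left out, and a flag stands in for the call to wakeup_softirqd().

```c
#include <assert.h>
#include <stdint.h>

/* User-space model only; names ending in _model are hypothetical. */
#define NR_SOFTIRQS_MODEL 10
#define MAX_RESTART_MODEL 10   /* mirrors MAX_SOFTIRQ_RESTART */
#define LOG_SIZE_MODEL    64

typedef void (*handler_model_t)(unsigned int nr);

handler_model_t handlers_model[NR_SOFTIRQS_MODEL];
uint32_t pending_model;        /* live per-CPU pending bitmap */
int ksoftirqd_woken_model;     /* 1 once leftovers are handed to ksoftirqd */
unsigned int run_log_model[LOG_SIZE_MODEL];
int run_count_model;

void raise_model(unsigned int nr)
{
    assert(nr < NR_SOFTIRQS_MODEL);
    pending_model |= (uint32_t)1 << nr;
}

/* Demo handler: records which softIRQ ran. */
void record_handler_model(unsigned int nr)
{
    if (run_count_model < LOG_SIZE_MODEL)
        run_log_model[run_count_model] = nr;
    run_count_model++;
}

/* Demo handler: records the run, then raises itself again. */
void rearm_handler_model(unsigned int nr)
{
    record_handler_model(nr);
    raise_model(nr);
}

void do_softirq_model(void)
{
    int max_restart = MAX_RESTART_MODEL;
    uint32_t snapshot = pending_model;            /* step 1: save the bitmap */
    unsigned int nr;

    while (snapshot) {
        pending_model = 0;                        /* step 2: reset the live bitmap */
        for (nr = 0; nr < NR_SOFTIRQS_MODEL; nr++)  /* step 3: lowest index first */
            if ((snapshot & ((uint32_t)1 << nr)) && handlers_model[nr])
                handlers_model[nr](nr);
        snapshot = pending_model;                 /* step 4: re-read the bitmap */
        if (snapshot && --max_restart == 0) {
            ksoftirqd_woken_model = 1;            /* stands in for wakeup_softirqd() */
            break;
        }
    }
}
```

In this model, a handler that keeps re-raising itself runs for ten passes, after which its bit is left pending and the work is handed over, which is the behavior the restart limit is meant to enforce.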

Note that softIRQs do not always run in an interrupt context, although the exception is quite specific. The next section explains how and why softIRQs may be executed in a process context.

A word on ksoftirqd

A ksoftirqd is a per-CPU kernel thread that’s raised in order to handle unserved software interrupts. It is spawned early on in the kernel boot process, as stated in kernel/softirq.c:

static __init int spawn_ksoftirqd(void)
{
    cpuhp_setup_state_nocalls(CPUHP_SOFTIRQ_DEAD,
                              "softirq:dead", NULL,
                              takeover_tasklets);
    BUG_ON(smpboot_register_percpu_thread(&softirq_threads));
    return 0;
}
early_initcall(spawn_ksoftirqd);

After running the top command, you will be able to see some ksoftirqd/n entries, where n is the logical CPU index of the CPU running the ksoftirqd thread. Since ksoftirqd threads run in a process context, they are scheduled like any other process or thread and compete for the CPU in the same way. A ksoftirqd thread hogging a CPU for a long time may indicate a system under heavy load.

Now that we have finished looking at our first work deferring mechanism in the Linux kernel, we’ll discuss tasklets, which are an alternative (from an atomic context point of view) to softIRQs, though the former are built using the latter.

Tasklets

Tasklets are bottom halves built on top of the HI_SOFTIRQ and TASKLET_SOFTIRQ softIRQs, with the only difference being that HI_SOFTIRQ-based tasklets run prior to the TASKLET_SOFTIRQ-based ones. This simply means tasklets are softIRQs, so they follow the same rules. Unlike softIRQs, however, two instances of the same tasklet never run concurrently. The tasklet API is quite basic and intuitive.

Tasklets are represented by the struct tasklet_struct structure, which is defined in <linux/interrupt.h>. Each instance of this structure represents a unique tasklet:

struct tasklet_struct {
    struct tasklet_struct *next; /* next tasklet in the list */
    unsigned long state;         /* state of the tasklet,
                                  * TASKLET_STATE_SCHED or
                                  * TASKLET_STATE_RUN */
    atomic_t count;              /* reference counter */
    void (*func)(unsigned long); /* tasklet handler function */
    unsigned long data; /* argument to the tasklet function */
};

The func member is the handler of the tasklet that will be executed by the underlying softIRQ. It plays the same role that action plays for a softIRQ, although its prototype differs: data is passed as its sole argument.

You can use the tasklet_init() function to initialize a tasklet at runtime. For the static method, you can use the DECLARE_TASKLET macro. The option you choose will depend on your need (or requirement) to have a direct or indirect reference to the tasklet. Using tasklet_init() usually means embedding the tasklet structure in a bigger, dynamically allocated object. An initialized tasklet can be scheduled right away; you could say it is enabled by default. DECLARE_TASKLET_DISABLED is the alternative for declaring a tasklet that is disabled by default, and this will require the tasklet_enable() function to be invoked to make the tasklet schedulable. Tasklets are scheduled (similar to raising a softIRQ) via the tasklet_schedule() and tasklet_hi_schedule() functions. You can use tasklet_disable() to disable a tasklet. This function disables the tasklet and only returns when the tasklet has terminated its execution (assuming it was running). After this, the tasklet can still be scheduled, but it will not run on the CPU until it is enabled again. The asynchronous variant, tasklet_disable_nosync(), can be used too and returns immediately, even if termination has not occurred. Moreover, a tasklet that has been disabled several times should be enabled exactly the same number of times (this is tracked by its count field):

DECLARE_TASKLET(name, func, data);
DECLARE_TASKLET_DISABLED(name, func, data);
void tasklet_init(struct tasklet_struct *t,
                  void (*func)(unsigned long),
                  unsigned long data);
void tasklet_enable(struct tasklet_struct *t);
void tasklet_disable(struct tasklet_struct *t);
void tasklet_schedule(struct tasklet_struct *t);
void tasklet_hi_schedule(struct tasklet_struct *t);

The kernel maintains normal-priority and high-priority tasklets in two different queues. Queues are actually singly linked lists, and each CPU has its own queue pair (low and high priority). tasklet_schedule() adds the tasklet to the normal-priority list, thereby scheduling the associated softIRQ with a TASKLET_SOFTIRQ flag. With tasklet_hi_schedule(), the tasklet is added to the high-priority list, thereby scheduling the associated softIRQ with a HI_SOFTIRQ flag. Once the tasklet has been scheduled, its TASKLET_STATE_SCHED flag is set, and the tasklet is added to a queue. At the time of execution, the TASKLET_STATE_RUN flag is set and the TASKLET_STATE_SCHED state is removed, thus allowing the tasklet to be rescheduled during its execution, either by the tasklet itself or from within an interrupt handler.

High-priority tasklets are meant to be used for soft interrupt handlers with low latency requirements. Calling tasklet_schedule() on a tasklet that’s already been scheduled, but whose execution has not started, will do nothing, resulting in the tasklet being executed only once. A tasklet can reschedule itself, which means you can safely call tasklet_schedule() in a tasklet. High-priority tasklets are always executed before normal ones and should be used carefully; otherwise, you may increase system latency.

Stopping a tasklet is as simple as calling tasklet_kill(), which waits for the tasklet to complete if it is currently scheduled or running, and ensures that it is no longer scheduled when the function returns. If the tasklet reschedules itself, you should prevent it from doing so prior to calling this function:

void tasklet_kill(struct tasklet_struct *t);
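The scheduling and disabling rules described in this section can be illustrated with a small user-space model. It is a sketch, not kernel code: names ending in _model are hypothetical, a plain integer stands in for the TASKLET_STATE_SCHED bit, and tasklet_schedule_model() returns a value only so that the behavior can be observed (the real tasklet_schedule() returns void).

```c
#include <assert.h>

/* User-space model only; names ending in _model are hypothetical. */
struct tasklet_model {
    int scheduled;                  /* models TASKLET_STATE_SCHED */
    int count;                      /* disable depth; 0 means enabled */
    void (*func)(unsigned long);    /* handler */
    unsigned long data;             /* argument passed to the handler */
};

/* Returns 1 if the tasklet was newly scheduled, 0 if it was already pending. */
int tasklet_schedule_model(struct tasklet_model *t)
{
    if (t->scheduled)
        return 0;
    t->scheduled = 1;
    return 1;
}

void tasklet_disable_model(struct tasklet_model *t)
{
    t->count++;
}

void tasklet_enable_model(struct tasklet_model *t)
{
    assert(t->count > 0);
    t->count--;
}

/* Models one softIRQ pass over the tasklet; returns 1 if the handler ran. */
int tasklet_run_model(struct tasklet_model *t)
{
    if (!t->scheduled || t->count != 0)
        return 0;        /* a disabled tasklet stays scheduled but does not run */
    t->scheduled = 0;    /* cleared first, so the handler may reschedule itself */
    t->func(t->data);
    return 1;
}

/* Demo handler: accumulates its argument. */
unsigned long demo_hits_model;

void demo_handler_model(unsigned long data)
{
    demo_hits_model += data;
}
```

In this model, scheduling twice before the handler runs results in one execution, and a tasklet disabled twice only runs again after two matching enable calls.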

That being said, let’s take a look at the following example of tasklet code usage:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/interrupt.h> /* for tasklets API */

static char tasklet_data[] =
    "We use a string; but it could be pointer to a structure";

/* Tasklet handler, that just prints the data */
static void tasklet_work(unsigned long data)
{
    pr_info("%s\n", (char *)data);
}

/* Statically declare the tasklet and bind it to its handler and data */
static DECLARE_TASKLET(my_tasklet, tasklet_work,
                       (unsigned long) tasklet_data);

static int __init my_init(void)
{
    tasklet_schedule(&my_tasklet);
    return 0;
}

static void __exit my_exit(void)
{
    /* Make sure the tasklet is not left scheduled on unload */
    tasklet_kill(&my_tasklet);
}

module_init(my_init);
module_exit(my_exit);
MODULE_AUTHOR("John Madieu");
MODULE_LICENSE("GPL");

In the preceding code, we statically declared our my_tasklet tasklet and the function that’s supposed to be invoked when this tasklet is scheduled, along with the data that will be given as an argument to this function.

Important note

Because the same tasklet never runs concurrently, the locking case between a tasklet and itself doesn’t need to be addressed. However, any data that’s shared between two tasklets should be protected with spin_lock() and spin_unlock(). Remember, tasklets are implemented on top of softIRQs.

Workqueues

In the previous section, we dealt with tasklets, which are an atomic deferral mechanism. Apart from atomic mechanisms, there are cases where we may want to sleep in the deferred task. Workqueues allow this.

A workqueue is an asynchronous work deferring mechanism that is widely used across the kernel, allowing a dedicated function to run asynchronously in a process execution context. This makes workqueues suitable for long-running tasks or work that needs to sleep. At the core of the workqueue subsystem, there are two data structures that can explain the concept behind this mechanism:

  • The work to be deferred (that is, the work item) is represented in the kernel by instances of struct work_struct, which indicates the handler function to be run. Typically, this structure is embedded in a larger, driver-defined structure that describes the work. If you need a delay before the work can be run after it has been submitted to the workqueue, the kernel provides struct delayed_work instead. A work item is a basic structure that holds a pointer to the function that is to be executed asynchronously. To summarize, we can enumerate two types of work item structures:

    --The struct work_struct structure, which schedules a task to be run at a later time (as soon as possible when the system allows it).

    --The struct delayed_work structure, which schedules a task to be run after at least a given time interval.

  • The workqueue itself, which is represented by a struct workqueue_struct. This is the structure that work is placed on. It is a queue of work items.

Apart from these data structures, there are two generic terms you should be familiar with:

  • Worker threads, which are dedicated threads that execute the functions off the queue, one after the other.
  • Worker pools, which are collections of worker threads (thread pools) used to manage those threads.

The first step in using workqueues consists of creating a work item, represented by struct work_struct, or by struct delayed_work for the delayed variant, both of which are defined in linux/workqueue.h. The kernel provides either the DECLARE_WORK macro for statically declaring and initializing a work structure, or the INIT_WORK macro for doing the same dynamically. If you need delayed work, you can use the INIT_DELAYED_WORK macro for dynamic initialization, or DECLARE_DELAYED_WORK for the static option:

DECLARE_WORK(name, function)
DECLARE_DELAYED_WORK(name, function)
INIT_WORK(work, func);
INIT_DELAYED_WORK(work, func);

The following code shows what our work item structure looks like:

struct work_struct {
    atomic_long_t data;
    struct list_head entry;
    work_func_t func;
#ifdef CONFIG_LOCKDEP
    struct lockdep_map lockdep_map;
#endif
};
struct delayed_work {
    struct work_struct work;
    struct timer_list timer;
    /* target workqueue and CPU ->timer uses to queue ->work */
    struct workqueue_struct *wq;
    int cpu;
};

The func field, which is of the work_func_t type, tells us a bit more about the prototype of a work function:

typedef void (*work_func_t)(struct work_struct *work);

work is an input parameter that corresponds to the work structure associated with your work. If you’ve submitted a delayed work, this would correspond to the delayed_work.work field. Here, it will be necessary to use the to_delayed_work() function to get the underlying delayed work structure:

struct delayed_work *to_delayed_work(struct work_struct *work)
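The handler only receives a pointer to the embedded work structure, so it has to walk back to the enclosing objects. The following user-space sketch illustrates that pointer arithmetic. It is not kernel code: all names ending in _model are hypothetical stand-ins, struct my_device_model is an invented driver structure, and container_of_model is a simplified version of the kernel's container_of() macro built on offsetof().

```c
#include <assert.h>
#include <stddef.h>

/* User-space model only; names ending in _model are hypothetical. */
struct work_model;
typedef void (*work_func_model_t)(struct work_model *work);

struct work_model {
    work_func_model_t func;
};

struct delayed_work_model {
    struct work_model work;     /* embedded, as in struct delayed_work */
    unsigned long delay;
};

/* Simplified container_of(): step back from a member to its enclosing object. */
#define container_of_model(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct delayed_work_model *to_delayed_work_model(struct work_model *work)
{
    return container_of_model(work, struct delayed_work_model, work);
}

/* Invented driver data that embeds a delayed work item. */
struct my_device_model {
    int id;
    int handled;
    struct delayed_work_model poll_work;
};

/* Work handler: recovers the device from the work pointer it is given. */
void poll_handler_model(struct work_model *work)
{
    struct delayed_work_model *dwork = to_delayed_work_model(work);
    struct my_device_model *dev =
        container_of_model(dwork, struct my_device_model, poll_work);

    dev->handled = dev->id;
}
```

Embedding the work item in the driver structure is what makes this recovery possible without any global lookup table.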

Workqueues let your driver create a kernel thread, called a worker thread, to handle deferred work. A new workqueue can be created with these functions:

struct workqueue_struct *create_workqueue(const char *name)
struct workqueue_struct
    *create_singlethread_workqueue(const char *name)

create_workqueue() creates a dedicated thread (a worker) per CPU on the system, which is probably not a good idea. On an 8-core system, this will result in 8 kernel threads being created to run work that’s been submitted to your workqueue. In most cases, a single system-wide kernel thread should be enough. In this case, you should use create_singlethread_workqueue() instead, which creates, as its name suggests, a single-threaded workqueue; that is, one with a single worker thread system-wide. Either normal or delayed work can be enqueued on the same queue. To schedule work on your created workqueue, you can use either queue_work() or queue_delayed_work(), depending on the nature of the work:

bool queue_work(struct workqueue_struct *wq,
                struct work_struct *work)
bool queue_delayed_work(struct workqueue_struct *wq,
                        struct delayed_work *dwork,
                        unsigned long delay)

These functions return false if the work was already on a queue and true otherwise. queue_delayed_work() can be used to plan (delayed) work for execution with a given delay. The time unit for the delay is jiffies. If you don’t want to bother with seconds-to-jiffies conversion, you can use the msecs_to_jiffies() and usecs_to_jiffies() helpers, which convert milliseconds or microseconds into jiffies, respectively:

unsigned long msecs_to_jiffies(const unsigned int m)
unsigned long usecs_to_jiffies(const unsigned int u)

The following example uses 200 ms as a delay:

schedule_delayed_work(&drvdata->tx_work, msecs_to_jiffies(200));
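To make the unit handling concrete, the following user-space sketch models the arithmetic behind the two conversion helpers. It is an approximation under stated assumptions, not the kernel implementation: the _model functions are hypothetical, the tick rate is passed in as a parameter instead of coming from the kernel's HZ constant, it must divide 1000 evenly, and overflow for very large inputs is ignored. Results are rounded up so that a delay is never shorter than requested.

```c
#include <assert.h>

/* User-space model only; assumes hz divides 1000 evenly (100, 250, 1000...). */
unsigned long msecs_to_jiffies_model(unsigned int m, unsigned int hz)
{
    unsigned int ms_per_jiffy;

    assert(hz > 0 && hz <= 1000 && 1000u % hz == 0);
    ms_per_jiffy = 1000u / hz;
    /* Round up so the resulting delay is at least m milliseconds. */
    return (m + ms_per_jiffy - 1) / ms_per_jiffy;
}

unsigned long usecs_to_jiffies_model(unsigned int u, unsigned int hz)
{
    unsigned int us_per_jiffy;

    assert(hz > 0 && hz <= 1000 && 1000u % hz == 0);
    us_per_jiffy = 1000000u / hz;
    /* Round up so the resulting delay is at least u microseconds. */
    return (u + us_per_jiffy - 1) / us_per_jiffy;
}
```

With a tick rate of 100, this model turns 200 milliseconds into 20 jiffies but 200 microseconds into a single jiffy, so picking the helper that matches the intended unit matters.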

Submitted work items can be canceled by calling either cancel_delayed_work(), cancel_delayed_work_sync(), or cancel_work_sync():

bool cancel_work_sync(struct work_struct *work)
bool cancel_delayed_work(struct delayed_work *dwork)
bool cancel_delayed_work_sync(struct delayed_work *dwork)

The following describes what these functions do:

  • cancel_work_sync() synchronously cancels the given workqueue entry. In other words, it cancels work and waits for its execution to finish. The kernel guarantees that work won’t be pending or executing on any CPU when this function returns, even if the work migrates to another workqueue or requeues itself. It returns true if work was pending, or false otherwise.
  • cancel_delayed_work() asynchronously cancels a pending workqueue entry (a delayed one). It returns true (a non-zero value) if dwork was pending and canceled and false if it wasn’t pending, probably because it is actually running, and thus might still be running after cancel_delayed_work(). To ensure the work really ran to its end, you may want to use flush_workqueue(), which flushes every work item in the given queue, or cancel_delayed_work_sync(), which is the synchronous version of cancel_delayed_work().

To wait for all the work items to finish, you can call flush_workqueue(). When you are done with a workqueue, you should destroy it with destroy_workqueue(). Both these options can be seen in the following code:

void flush_workqueue(struct workqueue_struct *queue);
void destroy_workqueue(struct workqueue_struct *queue);

While you’re waiting for any pending work to execute, the _sync variant functions sleep, which means they can only be called from a process context.
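The queueing, cancelling, and flushing semantics covered in this section can be sketched with a single-threaded user-space model. It is only an illustration: names ending in _model are hypothetical, there is no real worker thread (flushing simply runs the queue in the caller), and cancelling only covers work that has not started yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space model only; names ending in _model are hypothetical. */
struct work_model {
    bool pending;                           /* true while sitting on a queue */
    void (*func)(struct work_model *work);  /* handler */
    struct work_model *next;
};

struct workqueue_model {
    struct work_model *head, *tail;         /* FIFO of pending work items */
};

/* Returns false if the work was already queued, true otherwise. */
bool queue_work_model(struct workqueue_model *wq, struct work_model *work)
{
    if (work->pending)
        return false;
    work->pending = true;
    work->next = NULL;
    if (wq->tail)
        wq->tail->next = work;
    else
        wq->head = work;
    wq->tail = work;
    return true;
}

/* Removes a pending item from wq; returns true if it was pending there. */
bool cancel_work_model(struct workqueue_model *wq, struct work_model *work)
{
    struct work_model **link = &wq->head, *prev = NULL;

    if (!work->pending)
        return false;
    while (*link && *link != work) {
        prev = *link;
        link = &(*link)->next;
    }
    if (!*link)
        return false;                       /* pending, but not on this queue */
    *link = work->next;
    if (wq->tail == work)
        wq->tail = prev;
    work->pending = false;
    return true;
}

/* Runs every queued item in FIFO order; returns how many handlers ran. */
int flush_workqueue_model(struct workqueue_model *wq)
{
    int ran = 0;

    while (wq->head) {
        struct work_model *work = wq->head;

        wq->head = work->next;
        if (!wq->head)
            wq->tail = NULL;
        work->pending = false;              /* cleared before the handler runs */
        work->func(work);
        ran++;
    }
    return ran;
}

/* Demo handler: counts how many times it was called. */
int handler_calls_model;

void count_handler_model(struct work_model *work)
{
    (void)work;
    handler_calls_model++;
}
```

Note that a handler that requeues itself unconditionally would keep this model's flush loop running forever, which mirrors why self-requeueing work has to be stopped before a queue is flushed or destroyed.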

The kernel shared queue

In most situations, your code does not necessarily need to have the performance of its own dedicated set of threads, and because create_workqueue() creates one worker thread for each CPU, it may be a bad idea to use it on very large multi-CPU systems. In this situation, you may want to use the kernel shared queue, which has its own set of kernel threads preallocated (early during boot, via the workqueue_init_early() function) for running works.

This global kernel workqueue is the so-called system_wq, and is defined in kernel/workqueue.c. In the legacy implementation, there was one instance per CPU, each backed by a dedicated thread named events/n, where n is the processor number that the thread is bound to; with the concurrency-managed implementation described later in this chapter, its work items are served by shared kworker threads. You can queue work to the system’s default workqueue using one of the following functions:

bool schedule_work(struct work_struct *work);
bool schedule_delayed_work(struct delayed_work *dwork,
                           unsigned long delay);
bool schedule_work_on(int cpu, struct work_struct *work);
bool schedule_delayed_work_on(int cpu,
                              struct delayed_work *dwork,
                              unsigned long delay);

schedule_work() immediately schedules the work, which will be executed as soon as possible after the worker thread on the current processor wakes up. With schedule_delayed_work(), the work will be put in the queue in the future, after the delay timer has expired. The _on variants are used to schedule the work on a specific CPU (this does not need to be the current one). Each of these functions queues work on the system’s shared workqueue, system_wq, which is defined in kernel/workqueue.c:

struct workqueue_struct *system_wq __read_mostly;
EXPORT_SYMBOL(system_wq);

To flush the kernel-global workqueue – that is, to ensure the given batch of work is completed – you can use flush_scheduled_work():

void flush_scheduled_work(void);

flush_scheduled_work() is a wrapper that calls flush_workqueue() on system_wq. Note that there may be work in system_wq that you have not submitted and have no control over. Due to this, flushing this workqueue entirely is overkill. It is recommended to use cancel_delayed_work_sync() or cancel_work_sync() instead.

Tip

Unless you have a strong reason to create a dedicated thread, the default (kernel-global) thread is preferred.

Workqueues – a new generation

The original (now legacy) workqueue implementation used two kinds of workqueues: those with a single thread system-wide, and those with a thread per-CPU. However, due to the increasing number of CPUs, this led to some limitations:

  • On very large systems, the kernel could run out of process IDs (defaulted to 32k) just at boot, before init was started.
  • Multi-threaded workqueues provided poor concurrency management as their threads competed for the CPU with other threads on the system. Since there were more CPU contenders, this introduced some overhead; that is, more context switches than necessary.
  • Workqueues consumed far more resources than were really needed.

Moreover, subsystems that needed a dynamic or fine-grained level of concurrency had to implement their own thread pools. As a result of this, a new workqueue API has been designed and the legacy workqueue API (create_workqueue(), create_singlethread_workqueue(), and create_freezable_workqueue()) has been scheduled to be removed. However, these are actually wrappers around the new ones – the so-called concurrency-managed workqueues. This is done using per-CPU worker pools that are shared by all the workqueues in order to automatically provide a dynamic and flexible level of concurrency, thus abstracting such details for API users.

Concurrency-managed workqueues

The concurrency-managed workqueue is an upgrade of the workqueue API. Using this new API implies that you must choose between two macros to create the workqueue: alloc_workqueue() and alloc_ordered_workqueue(). These macros both allocate a workqueue and return a pointer to it on success, and NULL on failure. The returned workqueue can be freed using the destroy_workqueue() function:

#define alloc_workqueue(fmt, flags, max_active, args...)
#define alloc_ordered_workqueue(fmt, flags, args...)
void destroy_workqueue(struct workqueue_struct *wq);

fmt is the printf format for the name of the workqueue, while args... are the arguments for fmt. destroy_workqueue() is to be called on the workqueue once you are done with it. All work that’s currently pending will be completed first, before the kernel destroys the workqueue. alloc_workqueue() creates a workqueue based on max_active, which defines the concurrency level by limiting the number of work items (tasks) from this workqueue that can be executing (workers in a runnable state) simultaneously on any given CPU. For example, a max_active of 5 would mean that, at most, five work items on this workqueue can be executing at the same time per CPU. On the other hand, alloc_ordered_workqueue() creates a workqueue that processes work items one by one in the queued order (that is, FIFO order).
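The following module is a minimal sketch of these three calls. The workqueue names, the my_work item, and the max_active value of 5 are hypothetical choices for illustration. flags is left at 0 here because the individual flags are covered separately:

```c
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/workqueue.h>

/* Hypothetical workqueues and work item used for illustration. */
static struct workqueue_struct *my_wq;         /* up to 5 items per CPU */
static struct workqueue_struct *my_ordered_wq; /* one item at a time, FIFO */
static struct work_struct my_work;

static void my_work_fn(struct work_struct *work)
{
	pr_info("my_work executed\n");
}

static int __init my_init(void)
{
	/* "my_wq_%d" is the printf format; the trailing 0 is its argument. */
	my_wq = alloc_workqueue("my_wq_%d", 0, 5, 0);
	if (!my_wq)
		return -ENOMEM;

	my_ordered_wq = alloc_ordered_workqueue("my_ordered_wq", 0);
	if (!my_ordered_wq) {
		destroy_workqueue(my_wq);
		return -ENOMEM;
	}

	INIT_WORK(&my_work, my_work_fn);
	queue_work(my_wq, &my_work);
	return 0;
}

static void __exit my_exit(void)
{
	/* Pending work is completed before each workqueue is destroyed. */
	destroy_workqueue(my_ordered_wq);
	destroy_workqueue(my_wq);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");
```

Checking both return values against NULL matters because the second allocation can fail after the first has succeeded, in which case the first workqueue must be released before returning the error.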

flags controls how and when work items are queued, assigned execution resources, scheduled, and executed. Various flags are used in this new API. Let’s take a look at some of them:

  • WQ_UNBOUND: Legacy workqueues had a worker thread per CPU and were designed to run tasks on the CPU where they were submitted. The kernel scheduler had no choice but to always schedule a worker on the CPU that it was defined on. With this approach, even a single workqueue could prevent a CPU from idling and being turned off, which leads to increased power consumption or poor scheduling policies. WQ_UNBOUND turns off this behavior. Work is not bound to a CPU anymore, hence the name unbound workqueues. There is no more locality, and the scheduler can reschedule the worker on any CPU as it sees fit. The scheduler has the last word now and can balance CPU load, especially for long and sometimes CPU-intensive work.
  • WQ_MEM_RECLAIM: This flag is to be set for workqueues that need to guarantee forward progress during a memory reclaim path, that is, when free memory is running dangerously low and the system is under memory pressure. In this case, GFP_KERNEL allocations may block and deadlock the entire workqueue. A workqueue with this flag set is guaranteed to have a ready-to-use worker thread, a so-called rescuer thread, reserved for it regardless of memory pressure, so that it can make progress. One rescuer thread is allocated for each workqueue that has this flag set.

Let’s consider a situation where we have three work items (w1, w2, and w3) in our workqueue, W. w1 does some work and then waits for w3 to complete (let’s say it depends on the computation result of w3). Meanwhile, w2 (which is independent of the others) performs a kmalloc() allocation (GFP_KERNEL), but there is not enough memory, so w2 blocks. While w2 is blocked, it still occupies a worker of W. This results in w3 not being able to run, despite the fact that there is no dependency between w2 and w3. Since there is not enough memory available, there is no way to create a new worker thread to run w3. A pre-allocated thread solves this problem, not by magically allocating the memory for w2, but by running w3 so that w1 can continue its job, and so on. w2 will resume as soon as enough memory becomes available for its allocation. This pre-allocated thread is the so-called rescuer thread. You must set the WQ_MEM_RECLAIM flag if you think the workqueue might be used in the memory reclaim path. This flag replaces the old WQ_RESCUER flag as of the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=493008a8e475771a2126e0ce95a73e35b371d277.

  • WQ_FREEZABLE: This flag is used for power management purposes. A workqueue with this flag set will be frozen when the system is suspended or hibernates. On the freezing path, all current work items of the workers will be processed. When the freeze is complete, no new work items will be executed until the system is unfrozen. Filesystem-related workqueues may use this flag to ensure that modifications made to files are pushed to disk on the freezing path, before the hibernation image is created, and that no modifications are made on disk after the hibernation image has been created. In this situation, non-freezable work items could still touch the disk and lead to filesystem corruption. As an example, all of the XFS internal workqueues have this flag set (see fs/xfs/xfs_super.c) to ensure no further changes are made on disk once the freezer infrastructure freezes the kernel threads and creates the hibernation image. You should not set this flag if your workqueue can run tasks as part of the hibernation/suspend/resume process of the system. More information on this topic can be found in Documentation/power/freezing-of-tasks.txt (freezing-of-tasks.rst in recent kernels), as well as by taking a look at the kernel’s internal freeze_workqueues_begin() and thaw_workqueues() functions.
  • WQ_HIGHPRI: This flag is used for workqueues whose work items require high priority for execution. Such work items are served by worker threads with a high priority level (a lower nice value), so they start as soon as possible instead of waiting behind normal-priority work.

    In the early days of the CMWQ, high-priority work items were just queued at the head of a global normal priority worklist so that they could immediately run. Nowadays, there is no interaction between normal priority and high-priority workqueues as each has its own worklist and its own worker pool. The work items of a high-priority workqueue are queued to the high-priority worker pool of the target CPU. Tasks in this workqueue should not block much. Use this flag if you do not want your work item competing for CPU with normal or lower-priority tasks. Crypto and Block subsystems use this, for example.

  • WQ_CPU_INTENSIVE: Work items that are part of a CPU-intensive workqueue may burn a lot of CPU cycles and will not participate in the workqueue’s concurrency management. Instead, their execution is regulated by the system scheduler, just like any other task. This makes this flag useful for bound work items that may hog CPU cycles. Though their execution is regulated by the system scheduler, the start of their execution is still regulated by concurrency management, and runnable non-CPU-intensive work items can delay the execution of CPU-intensive work items. Conversely, to prevent CPU-intensive work items from delaying the execution of other, non-CPU-intensive work items, they are not taken into account when the workqueue code determines whether the CPU is available. The crypto and dm-crypt subsystems use such workqueues.
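The w1/w2/w3 scenario described for WQ_MEM_RECLAIM can be replayed with a small userspace toy model. This is not kernel code and does not reflect how the workqueue core is implemented. The simulate() function, its parameters, and the three states are all hypothetical, and the model simply assumes that memory is exhausted, so no new worker thread can be created:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy states for a work item in this model (not kernel types). */
enum item_state { PENDING, BLOCKED, DONE };

struct sim_result {
	enum item_state w1, w2, w3;
};

/*
 * regular_workers: worker threads that already exist for the workqueue.
 * has_rescuer:     true models a workqueue created with WQ_MEM_RECLAIM.
 * Assumption: memory is exhausted, so no additional worker can be spawned.
 */
struct sim_result simulate(int regular_workers, bool has_rescuer)
{
	struct sim_result r = { PENDING, PENDING, PENDING };
	int free_workers = regular_workers;

	assert(regular_workers >= 0);

	/* w1 starts, then sleeps waiting for w3; its worker stays busy. */
	if (free_workers > 0) {
		free_workers--;
		r.w1 = BLOCKED;
	}

	/* w2 starts, then blocks inside its GFP_KERNEL allocation. */
	if (free_workers > 0) {
		free_workers--;
		r.w2 = BLOCKED;
	}

	/* w3 needs a worker: a free one, or else the reserved rescuer. */
	if (free_workers > 0) {
		free_workers--;
		r.w3 = DONE;
	} else if (has_rescuer) {
		r.w3 = DONE;
	}

	/* w1 only depends on w3, so it completes once w3 is done. */
	if (r.w1 == BLOCKED && r.w3 == DONE)
		r.w1 = DONE;

	return r;
}
```

With two existing workers and no rescuer, simulate(2, false) leaves w1 and w2 blocked and w3 pending, which is the deadlock described above. With a rescuer, simulate(2, true) lets w3 and then w1 complete, while w2 remains blocked until memory becomes available.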

To remain compatible with the old workqueue API, the legacy functions are mapped to the new ones as follows:

  • create_workqueue(name) is mapped to alloc_workqueue(name, WQ_MEM_RECLAIM, 1).
  • create_singlethread_workqueue(name) is mapped to alloc_ordered_workqueue(name, WQ_MEM_RECLAIM).
  • create_freezable_workqueue(name) is mapped to alloc_workqueue(name, WQ_FREEZABLE | WQ_UNBOUND | WQ_MEM_RECLAIM, 1).
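New code can call the allocation macros directly with the same flags instead of going through the legacy wrappers. The sketch below shows what that might look like in a driver. The struct my_dev type, its fields, and the workqueue names are hypothetical:

```c
#include <linux/errno.h>
#include <linux/workqueue.h>

/* Hypothetical per-device state holding two private workqueues. */
struct my_dev {
	struct workqueue_struct *cmd_wq;   /* was create_singlethread_workqueue() */
	struct workqueue_struct *flush_wq; /* was create_freezable_workqueue() */
};

static int my_dev_alloc_wqs(struct my_dev *dev)
{
	/* Ordered: one work item at a time, in queueing order. */
	dev->cmd_wq = alloc_ordered_workqueue("my_cmd_wq", WQ_MEM_RECLAIM);
	if (!dev->cmd_wq)
		return -ENOMEM;

	/* Same flags and max_active as the legacy freezable mapping. */
	dev->flush_wq = alloc_workqueue("my_flush_wq",
					WQ_FREEZABLE | WQ_UNBOUND | WQ_MEM_RECLAIM,
					1);
	if (!dev->flush_wq) {
		destroy_workqueue(dev->cmd_wq);
		dev->cmd_wq = NULL;
		return -ENOMEM;
	}

	return 0;
}

static void my_dev_free_wqs(struct my_dev *dev)
{
	destroy_workqueue(dev->flush_wq);
	destroy_workqueue(dev->cmd_wq);
}
```

Spelling out the flags makes it easy to drop the ones a driver does not need, for example WQ_MEM_RECLAIM on a workqueue that is never involved in memory reclaim.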

To summarize, alloc_ordered_workqueue() actually replaces create_freezable_workqueue() and create_singlethread_workqueue() (as per the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=81dcaf6516d8). Workqueues allocated with alloc_ordered_workqueue() are unbound and have max_active set to 1.

When it comes to scheduled items in a workqueue, the work items that have been queued to a specific CPU using queue_work_on() will execute on that CPU. Work items that have been queued via queue_work() will prefer the queueing CPU, though this locality is not guaranteed.
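The difference between the two calls can be sketched as follows. The helper name and its parameters are hypothetical, wq is assumed to be a bound (non-WQ_UNBOUND) workqueue, and CPU 0 is an arbitrary choice for illustration:

```c
#include <linux/printk.h>
#include <linux/workqueue.h>

/*
 * Hypothetical helper: wq is assumed to be a bound workqueue and both
 * work items are assumed to have been set up with INIT_WORK().
 */
static void my_queue_examples(struct workqueue_struct *wq,
			      struct work_struct *local_work,
			      struct work_struct *pinned_work)
{
	/* Prefers the CPU we are queueing from; locality is not guaranteed. */
	if (!queue_work(wq, local_work))
		pr_debug("local_work was already pending\n");

	/* Executes on CPU 0; the caller must ensure that CPU stays online. */
	if (!queue_work_on(0, wq, pinned_work))
		pr_debug("pinned_work was already pending\n");
}
```

Both functions return false when the work item was already pending, in which case it is not queued a second time.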

Important Note

Note that schedule_work() is a wrapper that calls queue_work() on the system workqueue (system_wq), while schedule_work_on() is a wrapper around queue_work_on(). Also, keep in mind that system_wq is created with alloc_workqueue("events", 0, 0). Take a look at the workqueue_init_early() function in kernel/workqueue.c in the kernel sources to see how other system-wide workqueues are created.

Memory reclaim is a Linux kernel mechanism on the memory allocation path. When free memory runs low, the kernel frees pages that are already in use by moving their current content somewhere else, so that the pending allocation can be satisfied.

With that, we have finished looking at workqueues, and the concurrency-managed ones in particular. Next, we’ll introduce Linux kernel interrupt management, which is where most of the previous mechanisms come into play.