Podman for DevOps

By: Alessandro Arrichiello, Gianni Salinetti

Overview of this book

As containers have become the new de facto standard for packaging applications and their dependencies, understanding how to implement, build, and manage them is now an essential skill for developers, system administrators, and SRE/operations teams. Podman and its companion tools Buildah and Skopeo make a great toolset to boost the development, execution, and management of containerized applications. Starting with the basic concepts of containerization and its underlying technology, this book will help you get your first container up and running with Podman. You'll explore the complete toolkit and go over the development of new containers, their lifecycle management, troubleshooting, and security aspects. Together with Podman, the book illustrates Buildah and Skopeo to complete the tools ecosystem and cover the complete workflow for building, releasing, and managing optimized container images. Podman for DevOps provides a comprehensive view of the full-stack container technology and its relationship with the operating system foundations, along with crucial topics such as networking, monitoring, and integration with systemd, docker-compose, and Kubernetes. By the end of this DevOps book, you'll have developed the skills needed to build and package your applications inside containers as well as to deploy, manage, and integrate them with system services.
Table of Contents (19 chapters)

  • Section 1: From Theory to Practice: Running Containers with Podman
  • Section 2: Building Containers from Scratch with Buildah
  • Section 3: Managing and Integrating Containers Securely

What are containers?

This section describes container technology from the ground up, starting with basic concepts such as processes, filesystems, system calls, and process isolation, and moving up to container engines and runtimes. Its purpose is to describe how containers implement process isolation. We also describe what differentiates containers from virtual machines and highlight the best use cases for each.

Before asking ourselves what a container is, we should answer another question: what is a process?

According to The Linux Programming Interface, an enjoyable book by Michael Kerrisk, a process is an instance of an executing program. A program is a file holding the information necessary to execute the process. A program can be dynamically linked to external libraries, or the libraries can be statically linked into the program itself (the Go programming language uses this approach by default).

This leads us to an important concept: a process is executed on the machine's CPU and is allocated a portion of memory containing the program code and the variables used by the code itself. The process is instantiated in the machine's user space and its execution is orchestrated by the operating system kernel. When a process is executed, it needs to access different machine resources such as I/O (disk, network, terminals, and so on) or memory. When the process needs to access those resources, it performs a system call into the kernel space (for example, to read a disk block or send packets via the network interface).

The process indirectly interacts with the host disks using a filesystem, a multi-layer storage abstraction, that facilitates the write and read access to files and directories.

How many processes usually run in a machine? A lot. They are orchestrated by the OS kernel with complex scheduling logic that makes each process behave as if it were running on a dedicated CPU core, while the core is actually shared among many of them.

The same program can instantiate many processes of its kind (for example, multiple web server instances running on the same machine). Conflicts, such as many processes trying to access the same network port, must be managed accordingly.

Nothing prevents us from running different versions of the same program on the host, as long as system administrators accept the burden of managing potential conflicts among binaries, libraries, and their dependencies. This can become a complex task that is not always easy to solve with common practices.

This brief introduction was necessary to set the context.

Containers are a simple and smart answer to the need of running isolated process instances. We can safely affirm that containers are a form of application isolation that works on many levels:

  • Filesystem isolation: Containerized processes have a separated filesystem view, and their programs are executed from the isolated filesystem itself.
  • Process ID isolation: Containerized processes run under an independent set of process IDs (PIDs).
  • User isolation: User IDs (UIDs) and group IDs (GIDs) are isolated to the container. A process's UID and GID inside the container can differ from those on the host, so it can run with a privileged UID or GID that is valid inside the container only.
  • Network isolation: This kind of isolation relates to the host network resources, such as network devices, IPv4 and IPv6 stacks, routing tables, and firewall rules.
  • IPC isolation: Containers provide isolation for host IPC resources, such as POSIX message queues or System V IPC objects.
  • Resource usage isolation: Containers rely on Linux control groups (cgroups) to limit or monitor the usage of certain resources, such as CPU, memory, or disk. We will discuss more about cgroups later in this chapter.

From an adoption point of view, the main purpose of containers, or at least the most common use case, is to run applications in isolated environments. To better understand this concept, we can look at the following diagram:

Figure 1.1 – Native applications versus containerized ones

Applications running natively on a system that does not provide containerization features share the same binaries and libraries, as well as the same kernel, filesystem, network, and users. This could lead to many issues when an updated version of an application is deployed, especially conflicting library issues or unsatisfied dependencies.

On the other hand, containers offer a consistent layer of isolation for applications and their related dependencies that ensures seamless coexistence on the same host. A new deployment only consists of the execution of the new containerized version, as it will not interact or conflict with the other containers or native applications.

Linux containers are enabled by different native kernel features, with the most important being Linux namespaces. Namespaces abstract specific system resources (notably, the ones described before, such as network, filesystem mount, users, and so on) and make them appear as unique to the isolated process. In this way, the process has the illusion of interacting with the host resource, for example, the host filesystem, while an alternative and isolated version is being exposed.

Currently, we have a total of eight kinds of namespaces:

  • PID namespaces: These isolate the process ID number in a separate space, allowing processes in different PID namespaces to retain the same PID.
  • User namespaces: These isolate user and group IDs, root directory, keyrings, and capabilities. This allows a process to have a privileged UID and GID inside the container while simultaneously having unprivileged ones outside the namespace.
  • UTS namespaces: These allow the isolation of hostname and NIS domain name.
  • Network namespaces: These allow isolation of networking system resources, such as network devices, IPv4 and IPv6 protocol stacks, routing tables, firewall rules, port numbers, and so on. Users can create virtual network devices called veth pairs to build tunnels between network namespaces.
  • IPC namespaces: These isolate IPC resources such as System V IPC objects and POSIX message queues. Objects created in an IPC namespace can be accessed only by the processes that are members of the namespace. Processes use IPC to exchange data, events, and messages in a client-server mechanism.
  • cgroup namespaces: These isolate cgroup directories, providing a virtualized view of the process's cgroups.
  • Mount namespaces: These provide isolation of the mount point list that is seen by the processes in the namespace.
  • Time namespaces: These provide an isolated view of system time, letting processes in the namespace run with a time offset against the host time.
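
The kernel exposes the namespaces a process belongs to as symbolic links under /proc/<PID>/ns. The following is a minimal sketch of how they could be listed; the list_namespaces function name is hypothetical, and the script assumes a POSIX shell with the readlink utility available:

```shell
#!/bin/sh
# Print "<type> <identifier>" for each namespace link found in a directory.
# Defaults to the namespaces of the calling shell; pass /proc/<PID>/ns
# to inspect another process.
list_namespaces() {
    ns_dir="${1:-/proc/self/ns}"
    for link in "$ns_dir"/*; do
        # Skip anything that is not a symlink (for example, an unmatched glob)
        [ -L "$link" ] || continue
        printf '%s %s\n' "${link##*/}" "$(readlink "$link")"
    done
}

list_namespaces
```

Two processes are members of the same namespace when the identifier shown for that namespace type (the number in square brackets) is the same for both.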

Now, let's move on to resource usage.

Resource usage with cgroups

cgroups are a native feature of the Linux kernel whose purpose is to organize processes in a hierarchical tree and limit or monitor their resource usage.

The kernel cgroups interface, similar to what happens with /proc, is exposed with a cgroupfs pseudo-filesystem. This filesystem is usually mounted under /sys/fs/cgroup in the host.

cgroups offer a series of controllers (also called subsystems) that can be used for different purposes, such as limiting the CPU time share or memory usage of a process, freezing and resuming processes, and so on.

The organizational hierarchy of controllers has changed through time, and there are currently two versions, V1 and V2. In cgroups V1, different controllers could be mounted against different hierarchies. Instead, cgroups V2 provide a unified hierarchy of controllers, with processes residing in the leaf nodes of the tree.
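
A quick way to tell which hierarchy a host uses is to look at the filesystem type mounted on /sys/fs/cgroup. The sketch below assumes GNU stat (stat -fc %T) and the common convention that a unified V2 hierarchy reports cgroup2fs, while a V1 setup reports tmpfs; the cgroup_version function name is illustrative:

```shell
#!/bin/sh
# Map the filesystem type reported for /sys/fs/cgroup to a cgroups version.
cgroup_version() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1" ;;
        *)         echo "unknown" ;;
    esac
}

# Only query the host when the mount point exists.
if [ -d /sys/fs/cgroup ]; then
    cgroup_version "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
fi
```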

cgroups are used by containers to limit CPU or memory usage. For example, users can limit CPU quota, which means limiting the number of microseconds the container can use the CPU over a given period, or limit CPU shares, the weighted proportion of CPU cycles for each container.
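
As an illustration of the quota mechanism, cgroups V2 stores the quota and the period together in a file named cpu.max, as two values in microseconds (or the word max when no quota applies). The following sketch, whose cpu_limit function name is hypothetical, converts such a value into the equivalent number of CPUs:

```shell
#!/bin/sh
# Convert the contents of a cgroups V2 cpu.max file ("<quota> <period>")
# into the amount of CPU the group may use, expressed in CPUs.
cpu_limit() {
    printf '%s\n' "$1" | awk '{
        if ($1 == "max") print "unlimited"
        else printf "%.2f\n", $1 / $2
    }'
}

cpu_limit "50000 100000"    # half a CPU: prints 0.50
cpu_limit "max 100000"      # no quota configured: prints unlimited
```

A quota of 50000 over a period of 100000 means the processes in the group can run for at most 50 ms in every 100 ms window.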

Now that we have illustrated how process isolation works (both for namespaces and resources), we can illustrate a few basic examples.

Running isolated processes

A useful fact to know is that GNU/Linux operating systems offer all the features necessary to run a container manually. This result can be achieved by working with specific system calls (notably unshare() and clone()) and utilities such as the unshare command.

For example, to run a process, let's say /bin/sh, in an isolated PID namespace, users can rely on the unshare command:

# unshare --fork --pid --mount-proc /bin/sh 

The result is the execution of a new shell process in an isolated PID namespace. Users can try to monitor the process view and will get an output such as the following:

sh-5.0# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0 226164  4012 pts/4    S    22:56   0:00 /bin/sh
root           4  0.0  0.0 227968  3484 pts/4    R+   22:56   0:00 ps aux

Interestingly, the shell process of the preceding example is running with PID 1, which is correct, since it is the very first process running in the new isolated namespace.

However, apart from the private mount namespace that --mount-proc implicitly creates to remount /proc, the PID namespace is the only one being abstracted; all the other system resources remain the original host ones. If we want to add more isolation, for example on the network stack, we can add the --net flag to the previous command:

# unshare --fork --net --pid --mount-proc /bin/sh

The result is a shell process isolated on both PID and network namespaces. Users can inspect the network IP configuration and realize that the host native devices are no longer directly seen by the unshared process:

sh-5.0# ip addr show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

The preceding examples are useful to understand a very important concept: containers are strongly related to Linux native features. The OS provided a solid and complete interface that helped container runtime development, and the capability to isolate namespaces and resources was the key that unlocked container adoption. The role of the container runtime is to abstract the complexity of the underlying isolation mechanisms, with mount point isolation being probably the most crucial of them. Therefore, it deserves a better explanation.

Isolating mounts

We have seen so far examples of unsharing that did not impact mount points and the filesystem view from the process side. To gain the filesystem isolation that prevents binary and library conflicts, users need to create another layer of abstraction for the exposed mount points.

This result is achieved by leveraging mount namespaces and bind mounts. First introduced in 2002 with the Linux kernel 2.4.19, mount namespaces isolate the list of mount points seen by the process. Each mount namespace exposes a discrete list of mount points, thus making processes in different namespaces aware of different directory hierarchies.

With this technique, it is possible to expose to the executing process an alternative directory tree that contains all the necessary binaries and libraries of choice.

Despite seeming a simple task, managing a mount namespace is anything but straightforward or easy to master. For example, users would have to handle archives of directory trees from different distributions and versions, extract them, and bind mount them in separate namespaces. We will see later that the first container implementations in Linux followed this approach.

The success of containers is also bound to an innovative, multi-layered, copy-on-write approach of managing the directory trees that introduced a simple and fast method of copying, deploying, and using the tree necessary to run the container – container images.

Container images to the rescue

We must thank Docker for the introduction of this smart method of storing data for containers. Later, images would become an Open Container Initiative (OCI) standard specification (https://github.com/opencontainers/image-spec).

Images can be seen as a filesystem bundle that is downloaded (pulled) and unpacked in the host before running the container for the first time.

Images are downloaded from repositories called image registries. Those repositories can be seen as specialized object storages that hold image data and related metadata. There are both public and free-to-use registries (such as quay.io or docker.io) and private registries that can be run in the customer's private infrastructure, on-premises, or in the cloud.

Images can be built by DevOps teams to fulfill special needs or embed artifacts that must be deployed and executed on a host.

During the image build process, developers can inject pre-built artifacts or source code that can be compiled in the build container itself. To optimize image size, it is possible to create multi-stage builds with a first stage that compiles the source code using a base image with the necessary compilers and runtimes, and a second stage where the built artifacts are injected into a minimal, lightweight image, optimized for fast startup and minimal storage footprint.

The recipe of the build process is defined in a special text file called a Dockerfile, which defines all the necessary steps to assemble the final image.
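
As a minimal sketch of the multi-stage pattern, the following script writes a two-stage Dockerfile. The image names, the Go source layout, the write_multistage_dockerfile function name, and the /tmp/demo-build path are all assumptions chosen for illustration:

```shell
#!/bin/sh
# Write an illustrative two-stage Dockerfile into the given directory.
write_multistage_dockerfile() {
    mkdir -p "$1"
    cat > "$1/Dockerfile" <<'EOF'
# Stage 1: compile the source with a full toolchain image
FROM docker.io/library/golang:1.22 AS builder
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/app .

# Stage 2: copy only the built artifact into a minimal image
FROM registry.access.redhat.com/ubi9/ubi-minimal
COPY --from=builder /out/app /usr/local/bin/app
ENTRYPOINT ["/usr/local/bin/app"]
EOF
}

write_multistage_dockerfile /tmp/demo-build
# The image could then be built with, for example:
#   podman build -t demo-app /tmp/demo-build
```

Only the second stage ends up in the final image, so the compilers and intermediate build files of the first stage do not add to its size.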

After building them, users can push their own images to public or private registries for later use or for complex, orchestrated deployments.

The following diagram summarizes the build workflow:

Figure 1.2 – Image build workflow

We will cover the build topic more extensively later in this book.

What makes a container image so special? The smart idea behind images is that they can be considered as a packaging technology. When users build their own image with all the binaries and dependencies installed in the OS directory tree, they are effectively creating a self-consistent object that can be deployed everywhere with no further software dependencies. From this point of view, container images are an answer to the long-debated sentence, "It works on my machine."

Developer teams love them because they can be certain of the execution environment of their applications, and operations teams love them because they simplify the deployment process by removing the tedious task of maintaining and updating a server's library dependencies.

Another smart feature of container images is their copy-on-write, multi-layered approach. Instead of having a single bulk binary archive, an image is made up of many tar archives called blobs or layers. Layers are composed together using image metadata and squashed into a single filesystem view. This result can be achieved in many ways, but the most common approach today is by using union filesystems.

OverlayFS (https://www.kernel.org/doc/html/latest/filesystems/overlayfs.html) is the most used union filesystem nowadays. It is maintained in the kernel tree, despite not being completely POSIX-compliant.

According to the kernel documentation, "An overlay filesystem combines two filesystems – an 'upper' filesystem and a 'lower' filesystem." This means that it can combine multiple directory trees and provide a single, squashed view. The directories are the layers and are referred to as lowerdir and upperdir to respectively define the low-level directory and the one stacked on top of it. The unified view is called merged. It supports up to 128 layers.
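
To make the lookup order of the merged view more concrete, the following sketch imitates it in user space with plain directories. It is a simplification under stated assumptions: the overlay_lookup function is hypothetical, it does not mount anything, and it ignores whiteouts (the markers OverlayFS uses for deleted files). The mount command in the final comment follows the form given in the kernel documentation, where workdir is an extra empty directory needed for a writable overlay:

```shell
#!/bin/sh
# Resolve a relative path as the merged view would: layers are listed from
# top (upperdir) to bottom (lowerdir), and the first one containing the
# path wins. Whiteouts (deleted files) are deliberately not modeled.
overlay_lookup() {
    rel_path="$1"
    shift
    for layer in "$@"; do
        if [ -e "$layer/$rel_path" ]; then
            printf '%s\n' "$layer/$rel_path"
            return 0
        fi
    done
    return 1
}

# For reference, a real overlay mount (root required) has this form:
#   mount -t overlay overlay \
#     -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged
```

Calling overlay_lookup etc/hosts /upper /lower would print the copy under /upper when both layers contain the file, and the one under /lower otherwise.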

OverlayFS is not aware of the concept of container image; it is merely used as a foundation technology to implement the multi-layered solution used by OCI images.

OCI images also implement the concept of immutability. The layers of an image are all read-only and cannot be modified. The only way to change something in the lower layers is to rebuild the image with appropriate changes.

Immutability is an important pillar of the cloud computing approach. It simply means that an infrastructure (such as an instance, container, or even complex clusters) can only be replaced by a different version and not modified to achieve the target deployment. Therefore, we usually do not change anything inside a running container (for example, installing packages or updating config files manually), even though it could be possible in some contexts. Rather, we replace its base image with a new updated version. This also ensures that every copy of the running containers stays in sync with others.

When a container is executed, a new read/write thin layer is created on top of the image. This layer is ephemeral, thus any changes on top of it will be lost after the container is destroyed:

Figure 1.3 – A container's layers

This leads to another important statement: we do not store anything inside containers. Their only purpose is to offer a working and consistent runtime environment for our applications. Data must be accessed externally, by using bind mounts inside the container itself or network storage (such as Network File System (NFS), Simple Storage Service (S3), Internet Small Computer System Interface (iSCSI), and so on).

Containers' mount isolation and the layered design of images provide a consistent immutable infrastructure, but more security restrictions are necessary to prevent processes with malicious behaviors from escaping the container sandbox to steal the host's sensitive information or use the host to attack other machines. The following subsection introduces security considerations to show how container runtimes can limit those behaviors.

Security considerations

From a security point of view, there is a hard truth to share: the mere fact that a process runs inside a container does not make it more secure than other processes.

A malicious attacker can still make their way through the host filesystem and memory resources. To achieve better security isolation, additional features are available:

  • Mandatory access control: SELinux or AppArmor can be used to enforce container isolation against the parent host. These subsystems, and their related command-line utilities, use a policy-based approach to better isolate the running processes in terms of filesystem and network access.
  • Capabilities: When an unprivileged process is executed in the system (which means a process with an effective UID different from 0), it is subject to permission checking based on the process credentials (its effective UID, effective GID, and supplementary groups), while privileged processes bypass those checks. The privileges traditionally reserved to the superuser are split into distinct units called capabilities, which can be enabled independently, granting an unprivileged process a limited set of privileges to access specific resources. When running a container, we can add or drop capabilities.
  • Secure Computing Mode (Seccomp): This is a native kernel feature that can be used to restrict the system calls that a process is able to make from user space to kernel space. By identifying the system calls strictly necessary for a process to run, administrators can apply seccomp profiles to limit the attack surface.
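
As a small illustration of capabilities, the kernel reports the capability sets of a process as hexadecimal bitmasks in /proc/<PID>/status (for example, the CapEff field for the effective set). The sketch below checks a single bit in such a mask. The has_capability helper is hypothetical, the bit numbers are those defined in the kernel's linux/capability.h header, and the script assumes a shell whose arithmetic expansion accepts hexadecimal constants:

```shell
#!/bin/sh
# Succeed when bit number $2 is set in the hexadecimal capability mask $1
# (the format used by the Cap* fields of /proc/<PID>/status).
has_capability() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

CAP_NET_BIND_SERVICE=10   # bit numbers as defined in linux/capability.h
CAP_SYS_ADMIN=21

# Inspect the effective set inherited by this script when /proc is available.
if [ -r /proc/self/status ]; then
    mask=$(awk '/^CapEff:/ { print $2 }' /proc/self/status)
    if [ -n "$mask" ] && has_capability "$mask" "$CAP_SYS_ADMIN"; then
        echo "CAP_SYS_ADMIN is in the effective set"
    else
        echo "CAP_SYS_ADMIN is not in the effective set"
    fi
fi
```

Running the same check inside a container started with dropped capabilities is a simple way to verify that the restriction is actually in place.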

Applying the preceding security features manually is not always easy or immediate, as some of them have a steep learning curve. Tools that automate and simplify these security constraints (possibly in a declarative way) provide high value.

We will discuss security topics in further detail later in this book.

Container engines and runtimes

Despite being feasible and particularly useful from a learning point of view, running and securing containers manually is an unreliable and complex approach. It is too hard to reproduce and automate on production environments and can easily lead to configuration drift among different hosts.

This is the reason container engines and runtimes were born – to help automate the creation of a container and all the related tasks necessary that culminate with a running container.

The two concepts are quite different and are often confused, so a clarification is needed:

  • A container engine is a software tool that accepts and processes requests from users to create a container with all the necessary arguments and parameters. It can be seen as a sort of orchestrator, since it takes care of putting in place all the necessary actions to have the container up and running; yet it is not the actual executor of the container (that is the container runtime's role).

Engines usually solve the following problems:

  • Providing a command line and/or REST interface for user interaction
  • Pulling and extracting container images (discussed later in this book)
  • Managing the container mount point and bind-mounting the extracted image
  • Handling container metadata
  • Interacting with the container runtime

We have already stated that when a new container is instantiated, a thin R/W layer is created on top of the image; this task is achieved by the container engine, which takes care of presenting a working stack of the merged directories to the container runtime.
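
The behavior of the thin R/W layer can be imitated with two plain directories. The following is a simplified sketch of the copy-on-write idea, not what an engine actually executes: the cow_append function and its arguments are hypothetical, and real engines delegate this work to a storage driver such as OverlayFS:

```shell
#!/bin/sh
# Append a line to a file through a simulated thin read/write layer.
# $1: read-only image layer   $2: writable container layer
# $3: path relative to the layer root   $4: text to append
cow_append() {
    lower="$1"; upper="$2"; rel="$3"; text="$4"
    mkdir -p "$(dirname "$upper/$rel")"
    # Copy-up: the first write copies the file from the image layer
    if [ ! -e "$upper/$rel" ] && [ -e "$lower/$rel" ]; then
        cp -p "$lower/$rel" "$upper/$rel"
    fi
    # The change only ever lands in the writable layer
    printf '%s\n' "$text" >> "$upper/$rel"
}
```

Removing the writable directory afterward leaves the image layer exactly as it was, which mirrors what happens to the ephemeral layer when a container is destroyed.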

The container ecosystem offers a wide choice of container engines. Docker is, without doubt, the most well-known (despite not being the first) engine implementation, along with Podman (the core subject of this book), CRI-O, rkt, and LXD.

  • A container runtime is a low-level piece of software used by container engines to run containers in the host.

The container runtime provides the following functionalities:

  • Starting the containerized process in the target mount point (usually provided by the container engine) with a set of custom metadata
  • Managing the cgroups' resource allocation
  • Managing mandatory access control policies (SELinux and AppArmor) and capabilities

There are many container runtimes nowadays, and most of them implement the OCI runtime spec reference (https://github.com/opencontainers/runtime-spec). This is an industry standard that defines how a runtime should behave and the interface it should implement.

The most common OCI runtime is runc, used by most notable engines, along with other implementations such as crun, kata-containers, railcar, rkt, and gVisor.

This modular approach lets container engines swap the container runtime as needed. For example, when Fedora 31 came out, it made cgroups V2 the new default cgroups hierarchy. runc did not support cgroups V2 in the beginning, and Podman simply swapped runc with another OCI-compatible container runtime (crun) that was already compliant with the new hierarchy. Now that runc supports cgroups V2, Podman can safely use it again with no impact for the end user.

After introducing container runtimes and engines, it's time for one of the most debated and asked questions during container introductions – the difference between containers and virtual machines.

Containers versus virtual machines

Until now, we have talked about isolation achieved with native OS features and enhanced with container engines and runtimes. Many users could be tricked into thinking that containers are a form of virtualization.

Nothing could be further from the truth; containers are not virtual machines.

So, what is the main difference between a container and a virtual machine? Before answering, we can look at the following diagram:

Figure 1.4 – A system call to a kernel from a container

A container, despite being isolated, holds a process that directly interacts with the host kernel using system calls. The process may not be aware of the host namespaces, but it still needs to context-switch into kernel space to perform operations such as I/O access.

On the other hand, a virtual machine is always executed on top of a hypervisor, running a guest operating system with its own filesystem, networking, storage (usually as image files), and kernel. The hypervisor is software that provides a layer of hardware abstraction and virtualization to the guest OS, enabling a single bare-metal machine running on capable hardware to instantiate many virtual machines. The hardware seen by the guest OS kernel is mostly virtualized hardware, with some exceptions:

Figure 1.5 – Architecture – virtualization versus containers

This means that when a process performs a system call inside a virtual machine, it is always directed to the guest OS kernel.

To recap, we can affirm that containers share the same kernel with the host, while virtual machines have their own guest OS kernel.

This statement implies a lot of considerations.

From a security point of view, virtual machines provide better isolation from potential attacks. However, some of the latest CPU-based attacks (Spectre or Meltdown, most notably) could exploit CPU vulnerabilities to access VMs' address spaces.

Containers have refined their isolation features and can be hardened according to security benchmarks and compliance standards (such as the CIS Docker Benchmark, NIST, HIPAA, and so on), which makes them quite hard to exploit.

From a scalability point of view, containers are faster to spin up than VMs. Running a new container instance is a matter of milliseconds if the image is already available in the host. These fast results are also achieved by the kernel-less nature of the container. Virtual machines must boot a kernel and initramfs, pivot into the root filesystem, run some kind of init (such as systemd), and start a variable number of services.

A VM will usually consume more resources than a container. To spin up a guest OS, we usually need to allocate more RAM, CPU, and storage than the resources needed to start a container.

Another great differentiator between VMs and containers is the focus on workloads. The best practice for containers is to spin up a container for every specific workload. On the other hand, a VM can run different workloads together.

Imagine a LAMP or WordPress architecture: on non-production or small production environments, it would not be strange to have everything (Apache, PHP, MySQL, and WordPress) installed on the same virtual machine. With containers, this design would be split into a multi-container (or multi-tier) architecture, with one container running the frontend (Apache-PHP-WordPress) and one container running the MySQL database. The container running MySQL could access storage volumes to persist the database files. At the same time, it would be easier to scale the frontend containers up or down.

Now that we understand how containers work and what differentiates them from virtual machines, we can move on to the next big question: why do I need a container?