Kubernetes in Production Best Practices

By: Aly Saleh, Murat Karslioglu

Overview of this book

Although out-of-the-box solutions can help you to get a cluster up and running quickly, running a Kubernetes cluster that is optimized for production workloads is a challenge, especially for users with basic or intermediate knowledge. With detailed coverage of cloud industry standards and best practices for achieving scalability, availability, operational excellence, and cost optimization, this Kubernetes book is a blueprint for managing applications and services in production. You'll discover the most common way to deploy and operate Kubernetes clusters, which is to use a public cloud-managed service from AWS, Azure, or Google Cloud Platform (GCP). This book explores Amazon Elastic Kubernetes Service (Amazon EKS), the AWS-managed version of Kubernetes, for working through practical exercises. As you get to grips with implementation details specific to AWS and EKS, you'll understand the design concepts, implementation best practices, and configuration applicable to other cloud-managed services. Throughout the book, you’ll also discover standard and cloud-agnostic tools, such as Terraform and Ansible, for provisioning and configuring infrastructure. By the end of this book, you’ll be able to leverage Kubernetes to operate and manage your production environments confidently.

Designing an Amazon EKS infrastructure

In this chapter, we have discussed and explored various aspects of Kubernetes cluster design and the different architectural considerations you need to take into account. Now, we need to put things together for the design that we will follow throughout this book. The decisions we make here are not the only right ones, but they represent the preferred design we will follow to build minimally acceptable production clusters for this book's practical exercises. You can certainly use the same design with modifications, such as different cluster sizing.

In the following sections, we will explore our choices regarding the cloud provider, the provisioning and configuration tools, and the overall infrastructure architecture. In the chapters to follow, we will build upon these choices and use them to provision production-like clusters and to deploy the configuration and services on top of them.

Choosing the infrastructure provider

As we learned in the previous sections, there are different ways to deploy Kubernetes. You can deploy it locally, on-premises, or in a public cloud, private cloud, hybrid, multi-cloud, or edge environment. Each of these infrastructure types has its own use cases, benefits, and drawbacks. However, the most common is the public cloud, followed by the hybrid model; the remaining choices are limited to specific use cases.

In a single book like ours, we cannot cover each of these infrastructure platforms, so we decided to go with the most common choice for deploying Kubernetes: one of the public clouds (AWS, Azure, or GCP). You can still use another cloud provider, a private cloud, or even an on-premises setup, and most of the concepts and best practices discussed in this book will remain applicable.

When it comes to choosing one of the public clouds, we do not advocate one over the others, and we definitely recommend using the cloud provider that already hosts your existing infrastructure. If you are just embarking on your cloud journey, however, we advise you to perform a deeper benchmarking analysis of the public clouds to see which one is best for your business.

In the practical exercises in this book, we will use AWS and Amazon Elastic Kubernetes Service (EKS). In the previous chapter, we explained the infrastructure design principle of always preferring a managed service over its self-managed counterpart, and it applies here when choosing between EKS and building self-managed clusters on top of AWS.

Choosing the cluster and node size

When you plan your cluster, you need to decide on both the cluster and node sizes. This decision should be based on the estimated utilization of your workloads, which you may know beforehand from your previous infrastructure, or which you can estimate approximately and then adjust after going live in production. In either case, you will need to decide on initial cluster and node sizes, and then keep adjusting them until you reach a utilization level that balances cost and reliability. Target a utilization level between 70% and 80% unless you have a solid justification for a different level.
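
For example (with illustrative numbers), suppose your workloads peak at roughly 60 vCPUs and 240 GB of memory. Targeting 75% utilization means provisioning about 80 vCPUs and 320 GB of capacity, which ten nodes with 8 vCPUs and 32 GB each would satisfy; you would then adjust the node count after observing real usage in production.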

These are the common cluster and node size choices, which you can consider individually or in combination:

  • Few large clusters: In this setup, you deploy a small number of large clusters, which can be production and non-production clusters. A cluster can be large in terms of node size, node count, or both. Large clusters are usually easier to manage because there are fewer of them. They are cost-efficient because you achieve higher utilization per node and per cluster (assuming you run an appropriate amount of workloads), and this improved utilization comes from saving the resources otherwise spent on system management. On the downside, large clusters lack hard isolation for multiple tenants, as namespaces provide only soft isolation between them. They also introduce a single point of failure to your production environment (especially when you run a single cluster). Finally, Kubernetes supports a maximum of 5,000 nodes per cluster, and with a single cluster you can hit this upper limit if you run a large number of pods.
  • Many small clusters: In this setup, you deploy many small clusters, which can be small in terms of node size, node count, or both. Small clusters are good for security, as they provide hard isolation between resources and tenants, along with strong access control for organizations with multiple teams and departments. They also reduce the blast radius of failures and avoid a single point of failure. On the downside, small clusters come with operational overhead, as you must manage a fleet of clusters. They are also inefficient in terms of resource usage, as you cannot reach the utilization levels achievable with large clusters, and they increase costs, since a fleet of small clusters requires more control plane resources to manage the same total number of worker nodes than a single large cluster does.
  • Large nodes: This concerns the size of the nodes in a cluster. When you deploy large nodes, you get better and higher node utilization (assuming your workloads utilize 70-80% of each node), because a large node can absorb application spikes and can handle applications with high CPU/memory requirements. In addition, a well-utilized large node usually brings cost savings, as it reduces the overall cluster resources required for system management, and you can often purchase such nodes at discounted prices from your cloud provider. On the downside, large nodes increase the blast radius of failures, affecting the reliability of both the cluster and its apps. Adding a new large node during an upscaling event also adds capacity, and cost, that you may not need, so if your cluster experiences variable scaling events over short periods, large nodes are the wrong choice. Finally, Kubernetes limits the number of pods that can run on a single node (110 by default), regardless of its type and size, and for a large node this limitation can lead to underutilization.
  • Small nodes: This concerns deploying smaller nodes within a cluster. Small nodes reduce the blast radius during failures and reduce costs during upscaling events. On the downside, small nodes tend to be underutilized, they cannot handle applications with high resource requirements, and the total amount of system resources required to manage them (kubelet, kube-proxy, the container runtime, and so on) is higher than for the same compute power in larger nodes. Small nodes also have a lower pods-per-node limit.
  • Centralized versus decentralized clusters: Organizations usually follow one of these two approaches to managing their Kubernetes clusters.

    In a decentralized approach, teams or individuals within an organization are allowed to create and manage their own Kubernetes clusters. This provides flexibility for teams to get the best out of their clusters and to customize them to fit their use cases; on the other hand, it increases operational overhead and cloud costs, and makes it difficult to enforce standardization, security, best practices, and common tools across clusters. This approach is more appropriate for organizations that are highly decentralized, or that are going through cloud transformations, product life cycle transitions, or periods of exploring and innovating with new technologies and solutions.

    In a centralized approach, teams or individuals share a single cluster, or a small group of identical clusters, that uses a common set of standards, configurations, and services. This approach mitigates the drawbacks of the decentralized model; however, it can be inflexible, slow down cloud transformation, and decrease teams' agility. It is more suitable for organizations working towards maturity and platform stability, reducing cloud costs, enforcing and promoting standards and best practices, and focusing on products rather than the underlying platform.

Some organizations run a hybrid model combining the aforementioned alternatives, such as having large, medium, and small nodes to get the best of each type according to their applications' needs, as illustrated in the sketch below. However, we recommend that you run experiments to decide which model suits your workloads' performance and meets your cost reduction goals.
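
As a minimal, hypothetical illustration of the hybrid node model, the following Kubernetes manifest pins a resource-hungry workload to large nodes with a nodeSelector. The node-size: large label, image name, and resource figures are illustrative assumptions; they presume you attach such a label to your large node group:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: analytics-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: analytics-worker
  template:
    metadata:
      labels:
        app: analytics-worker
    spec:
      # Schedule only onto nodes labeled as part of the large node group
      nodeSelector:
        node-size: large
      containers:
        - name: worker
          image: example.com/analytics-worker:1.0 # illustrative image
          resources:
            requests:
              cpu: "4"      # high requests suit large nodes
              memory: 16Gi
            limits:
              cpu: "6"
              memory: 24Gi
```

Smaller, bursty services can omit the selector (or target a small node group), so that each workload lands on the node size that fits it best.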

Choosing tools for cluster deployment and management

In the early days of Kubernetes, we used to deploy it from scratch, an approach commonly called Kubernetes the Hard Way. Fast forward to today: the Kubernetes community has grown, and many tools have emerged to automate deployment, ranging from simple automation helpers to complete one-click installers.

In the context of this book, we are not going to cover every tool on the market (there are many), nor compare and benchmark them. Instead, we will present our choices along with brief reasoning behind them.

Infrastructure provisioning

When you deploy Kubernetes for the first time, you will most likely use a command-line tool to provision the cluster with a single command, or a cloud provider's web console. Either way, this approach is suitable for experimentation and learning, but for real implementations across production and development environments, a provisioning tool becomes a must.

The majority of organizations considering Kubernetes already have existing cloud infrastructure or are going through a cloud migration, which means Kubernetes will not be the only piece of cloud infrastructure they use. This is why we prefer a provisioning tool that achieves the following:

  • It can be used to provision Kubernetes as well as other pieces of infrastructure (databases, file stores, API gateways, serverless, monitoring, logging, and so on).
  • It fulfills and empowers infrastructure as code (IaC) principles.
  • It is a cloud-agnostic tool.
  • It has been battle-tested in production by other companies and teams.
  • It has community support and active development.

We find these characteristics in Terraform, which is why we chose it for the production clusters we have managed, as well as for the practical exercises in this book. We highly recommend Terraform, but if you prefer another provisioning tool, you can skip this chapter, continue with the rest of the book, and apply the same concepts and best practices.
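
To make this concrete, the following is a minimal Terraform sketch of provisioning an EKS control plane. The region, cluster name, Kubernetes version, and the variables for the IAM role and subnets are illustrative assumptions; a real setup also needs networking, IAM, and node resources:

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1" # illustrative region
}

# Private subnets the cluster will use (provisioned elsewhere)
variable "private_subnet_ids" {
  type = list(string)
}

# IAM role the EKS control plane assumes (provisioned elsewhere)
variable "cluster_role_arn" {
  type = string
}

resource "aws_eks_cluster" "prod" {
  name     = "prod-cluster" # illustrative name
  role_arn = var.cluster_role_arn
  version  = "1.29"         # pick a currently supported Kubernetes version

  vpc_config {
    subnet_ids = var.private_subnet_ids
  }
}
```

Because the desired state lives in version-controlled code, running terraform plan before terraform apply previews every change, which is the IaC workflow this book relies on.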

Configuration management

Kubernetes configuration is declarative by nature. After deploying a cluster, we need to manage its configuration and the deployed add-ons that provide services for various areas of functionality, including networking, security, monitoring, and logging. This is why a solid and versatile configuration management tool is a required part of your toolset.

The following are solid choices:

  • Regular configuration management tools, such as Ansible, Chef, and Puppet
  • Kubernetes-specific tools, such as Helm and Kustomize
  • Terraform

Our preferred order of suitable tools is as follows:

  1. Ansible
  2. Helm
  3. Terraform

This order is debatable, and we believe any of these tools can fulfill the configuration management needs of Kubernetes clusters. However, we prefer Ansible for its versatility and flexibility: it can handle Kubernetes as well as the other configuration management needs of your environment, which makes it preferable to Helm. Ansible is also preferred over Terraform because Terraform is a provisioning tool at heart; while it can handle configuration management, it is not the best tool for that job.

In the hands-on exercises in this book, we decided to use Ansible with its Kubernetes module and Jinja2 templates.
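
To give a flavor of this approach, here is a minimal, hypothetical playbook built on the kubernetes.core.k8s module. It assumes a cluster reachable through your kubeconfig, the kubernetes Python library installed on the control machine, and a deployment.yaml.j2 Jinja2 template next to the playbook:

```yaml
---
- name: Configure cluster services
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    app_namespace: monitoring # illustrative namespace
    replica_count: 2          # consumed inside the Jinja2 template
  tasks:
    - name: Ensure the target namespace exists
      kubernetes.core.k8s:
        state: present
        definition:
          apiVersion: v1
          kind: Namespace
          metadata:
            name: "{{ app_namespace }}"

    - name: Apply a manifest rendered from a Jinja2 template
      kubernetes.core.k8s:
        state: present
        namespace: "{{ app_namespace }}"
        template: deployment.yaml.j2
```

Because the k8s module is declarative and idempotent, rerunning the playbook converges the cluster to the described state rather than blindly reapplying changes.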

Deciding the cluster architecture

Each organization has its own way of managing cloud accounts. However, we recommend having at least two AWS accounts: one for production and another for non-production. The production Kubernetes cluster resides in the production account, and the non-production cluster in the non-production account. This structure is preferred for security, reliability, and operational efficiency.
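
As a small illustration of how this split looks in Terraform, you can define one provider configuration per account; the profile names below refer to locally configured credentials and are illustrative assumptions:

```hcl
# Production account
provider "aws" {
  alias   = "production"
  region  = "us-east-1"    # illustrative region
  profile = "prod-account" # assumed local credentials profile
}

# Non-production account
provider "aws" {
  alias   = "nonproduction"
  region  = "us-east-1"
  profile = "nonprod-account"
}
```

Each cluster's resources then reference the matching provider alias, keeping production and non-production strictly separated.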

Based on the technical decisions and choices that we made in the previous sections, we propose the following AWS architecture for the Kubernetes clusters that we will use in this book, which you can also use to deploy your own production and non-production clusters:

Figure 2.1 – Cluster architecture diagram

In the preceding architecture diagram, we decided to do the following:

  • Create a separate VPC for the cluster network, choosing a Classless Inter-Domain Routing (CIDR) range with sufficient IPv4 addressing capacity for future scaling. Each Kubernetes node, pod, and service gets its own IP address, and we should keep in mind that the number of services will grow.
  • Create public and private subnets. Publicly accessible resources, such as load balancers and bastion hosts, are placed in the public subnets, while privately accessible resources, such as Kubernetes nodes, databases, and caches, are placed in the private subnets.
  • For high availability, create the resources across three availability zones, with one private and one public subnet in each zone.
  • For scaling, run multiple EKS node groups (see the sketch following this list).
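
Continuing the earlier Terraform sketch, the following fragment outlines this layout. The CIDR range, subnet sizing, and node group parameters are illustrative assumptions, and the node IAM role is assumed to be provisioned elsewhere:

```hcl
# Dedicated VPC for the cluster, with ample IPv4 space to grow into
resource "aws_vpc" "cluster" {
  cidr_block           = "10.0.0.0/16" # illustrative CIDR range
  enable_dns_support   = true
  enable_dns_hostnames = true
}

data "aws_availability_zones" "available" {
  state = "available"
}

# One private subnet per availability zone (nodes, databases, caches)
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.cluster.id
  cidr_block        = cidrsubnet(aws_vpc.cluster.cidr_block, 4, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}

# One public subnet per availability zone (load balancers, bastions)
resource "aws_subnet" "public" {
  count                   = 3
  vpc_id                  = aws_vpc.cluster.id
  cidr_block              = cidrsubnet(aws_vpc.cluster.cidr_block, 4, count.index + 8)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true
}

# IAM role the worker nodes assume (provisioned elsewhere)
variable "node_role_arn" {
  type = string
}

# A general-purpose managed node group; add more groups for other node sizes
resource "aws_eks_node_group" "general" {
  cluster_name    = aws_eks_cluster.prod.name
  node_group_name = "general"
  node_role_arn   = var.node_role_arn
  subnet_ids      = aws_subnet.private[*].id
  instance_types  = ["m5.large"] # illustrative instance size

  scaling_config {
    desired_size = 3
    min_size     = 3
    max_size     = 9
  }
}
```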

We will discuss the details of these design decisions, along with the remaining technical aspects of the cluster architecture, in the next chapters.