Mastering Ceph

By: Nick Fisk

Overview of this book

Mastering Ceph covers all that you need to know to use Ceph effectively. Starting with the design goals and planning steps that should be undertaken to ensure successful deployments, you will be guided through setting up and deploying a Ceph cluster with the help of orchestration tools. Key areas of Ceph, including BlueStore, erasure coding, and cache tiering, are covered with the help of examples. Developing applications that use librados and running distributed computations with shared object classes are also covered. A section on tuning takes you through the process of optimizing both Ceph and its supporting infrastructure. Finally, you will learn to troubleshoot issues and handle various scenarios where Ceph is unlikely to recover on its own. By the end of the book, you will be able to successfully deploy and operate a resilient, high-performance Ceph cluster.
Table of Contents (12 chapters)

Infrastructure design

When considering infrastructure design, we need to take care of certain components. We will now briefly look at these components.


SSDs

SSDs are great. They have come down enormously in price over the past 10 years, and all the evidence suggests that they will continue to do so. They offer access times several orders of magnitude lower than rotating disks and consume less power.

One important concept to understand about SSDs is that although their read and write latencies are typically measured in tens of microseconds, overwriting existing data in a flash block requires the entire flash block to be erased before the write can happen. A typical flash block size in an SSD may be 128 KB, and even a 4 KB write I/O would require the entire block to be read and erased, and then the existing data and the new I/O to be written back. The erase operation can take several milliseconds and, without clever routines in the SSD firmware, would make writes painfully slow. To get around this limitation, SSDs are equipped with a RAM buffer so that they can acknowledge writes instantly, while the firmware internally moves data around flash blocks to optimize the overwrite process and wear leveling. However, the RAM buffer is volatile memory and would normally result in the possibility of data loss and corruption in the event of sudden power loss. To protect against this, SSDs can have power loss protection, which is accomplished by having a large capacitor on board to store enough power to flush any outstanding writes to flash.

One of the biggest trends in recent years is the range of different tiers of SSDs that have become available. Broadly speaking, these can be broken down into the following categories.


Consumer

These are the cheapest you can buy and are pitched at the average PC user. They provide a lot of capacity very cheaply and offer fairly decent performance. They will likely offer no power loss protection and will either demonstrate extremely poor performance when asked to do synchronous writes or falsely report that data has been safely stored. They will also likely have very poor write endurance, but still more than enough for standard use.


Prosumer

These are a step up from the consumer models and will typically provide better performance and higher write endurance, although still far off what enterprise SSDs provide.

Before moving on to the enterprise models, it is worth just covering why you should not under any condition use the earlier-mentioned models of SSDs for Ceph:

  • Lack of proper power loss protection will either result in extremely poor performance or not ensure proper data consistency
  • Firmware is not as heavily tested as that of enterprise SSDs and often reveals data-corrupting bugs
  • Low write endurance will mean that they will quickly wear out, often ending in sudden failure
  • Due to high wear and failure rates, their initial cost benefits rapidly disappear

The use of consumer SSDs with Ceph will result in poor performance and increase the chance of catastrophic data loss.

Enterprise SSDs

The biggest difference between consumer and enterprise SSDs is that an enterprise SSD should guarantee that when it responds to the host system confirming that data has been safely stored, it actually has been. That is to say, if power is suddenly removed from a system, all data that the operating system believes was committed to disk will be safely stored in flash. Furthermore, in order to accelerate writes while maintaining this data safety condition, the SSDs should be expected to contain supercapacitors that provide just enough power to flush the SSD's RAM buffer to flash in the event of a power loss.

Enterprise SSDs are normally provided in a number of different flavors to offer a wide range of cost-per-GB options balanced against write endurance.

Enterprise - read intensive

Read intensive SSDs are a bit of a marketing term. All SSDs will easily handle reads; the name refers to their lower write endurance. They will, however, provide the best cost per GB. These SSDs will often only have a write endurance of around 0.3-1 drive writes per day (DWPD) over a 5 year period. That is to say, at 1 DWPD you should be able to write 400 GB a day to a 400 GB SSD and expect it to still be working in 5 years' time. If you write 800 GB a day to it, it will only be guaranteed to last 2.5 years. In general, for most Ceph workloads, this range of SSDs is normally deemed not to have enough write endurance.
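The DWPD arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation; the function names are hypothetical, and it assumes the rated endurance is simply capacity x DWPD x 365 x warranty period, consumed at a constant daily write rate.

```python
def daily_write_budget_gb(capacity_gb: float, dwpd: float) -> float:
    """GB that can be written per day while staying within the rated endurance."""
    return capacity_gb * dwpd


def expected_life_years(capacity_gb: float, dwpd: float,
                        actual_gb_per_day: float,
                        warranty_years: float = 5.0) -> float:
    """Years until the rated endurance is used up at a constant daily write rate.

    Assumption: total rated writes = capacity * DWPD * 365 * warranty period.
    """
    total_rated_gb = capacity_gb * dwpd * 365 * warranty_years
    return total_rated_gb / (actual_gb_per_day * 365)


# The 400 GB, 1 DWPD drive from the text, written at 400 GB and 800 GB per day.
print(daily_write_budget_gb(400, 1.0))
print(expected_life_years(400, 1.0, 400))
print(expected_life_years(400, 1.0, 800))
```

With the figures from the text, this gives a 400 GB daily budget, 5 years at 400 GB a day, and 2.5 years at 800 GB a day.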

Enterprise - general usage

General usage SSDs will normally provide 3-5 DWPD and are a good balance of cost and write endurance. For use in Ceph, they will normally be a good choice for an SSD-based OSD, assuming that the workload on the Ceph cluster is not planned to be overly write heavy.

Enterprise - write intensive

Write intensive SSDs are the most expensive type; they will often offer write endurance of 10 DWPD or more. They should be used for journals for spinning disks in Ceph clusters, or for SSD-only OSDs if very heavy write workloads are planned.

Currently, Ceph uses filestore as its method of storing objects on disks. The details of how and why filestore works are covered later in Chapter 3, BlueStore. For now, it's important to understand that normal POSIX filesystems cannot provide atomic transactions across the several pieces of data Ceph needs to write, so a journal is used. If no separate SSD is used for the journal, a separate partition is created for it. Every write that the OSD handles will first be written to the journal and then flushed to the main storage area on the disk. This is the main reason why using an SSD as a journal for spinning disks is advised. The double write severely impacts spinning disk performance, mainly because of the random movement of the disk heads between the journal and data areas.

Likewise, an SSD-based OSD still requires a journal, so it will experience approximately double the number of writes and thus provide half the expected client performance.

As can now be seen, not all models of SSDs are equal, and Ceph's requirements can make choosing the correct one a tough process. Fortunately, a quick test can be carried out to establish an SSD's potential for use as a Ceph journal.
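As a rough illustration of what such a check measures, the following Python sketch times small writes that are each forced to stable storage before the next one is issued. It is not necessarily the test the author has in mind: it assumes that the journal's write pattern is dominated by small synchronous writes, it writes to a scratch file through the filesystem rather than to the raw device, and the function name and file name are hypothetical. A dedicated I/O benchmarking tool run against the device itself would give more representative figures.

```python
import os
import tempfile
import time


def sync_write_latency_us(path: str, writes: int = 200, block_size: int = 4096) -> float:
    """Average latency, in microseconds, of small writes that are each flushed to stable storage."""
    block = b"\0" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, block)
            os.fsync(fd)  # wait until the device reports the data as safely stored
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / writes * 1_000_000


if __name__ == "__main__":
    # Assumption: the scratch directory lives on the SSD under test.
    with tempfile.TemporaryDirectory() as scratch_dir:
        latency = sync_write_latency_us(os.path.join(scratch_dir, "journal-test.bin"))
        print(f"average synchronous 4 KB write latency: {latency:.0f} microseconds")
```

A drive without power loss protection that honors the flush request would be expected to show much higher latencies here than an enterprise drive, in line with the behavior described for consumer SSDs earlier.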


Memory

Official recommendations are for 1 GB of memory for every 1 TB of storage. In truth, there are a number of variables that lead to this recommendation, but suffice it to say that you never want to find your OSDs running low on memory, and any excess memory will be used to improve performance.

Aside from the baseline memory usage of an OSD, the main variable affecting memory usage is the number of PGs running on the OSD. Although total data size does have an impact on memory usage, it is dwarfed by the effect of the number of PGs. A healthy cluster running within the recommendation of 200 PGs per OSD will probably use less than 2 GB of RAM per OSD. However, in a cluster where the number of PGs has been set higher, against best practice, memory usage will be higher. It is also worth noting that when an OSD is removed from a cluster, extra PGs will be placed on the remaining OSDs to rebalance the cluster; this will increase memory usage, as will the recovery operation itself. This spike in memory usage can sometimes be the cause of cascading failures if insufficient RAM has been provisioned. A large swap partition on an SSD should always be provisioned to reduce the risk of the Linux out-of-memory (OOM) killer randomly killing OSD processes in the event of a low memory situation.

As a minimum, look to provision around 2 GB per OSD + OS overheads, but this should be treated as the bare minimum and 4 GB per OSD would be recommended.
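The sizing rules above can be combined into a quick estimate for a node. The sketch below is illustrative only: the function names are hypothetical, and the 4 GB operating system overhead is an assumed placeholder, as the text does not give a figure for it.

```python
def ram_by_osd_count_gb(num_osds: int, per_osd_gb: float, os_overhead_gb: float = 4.0) -> float:
    """RAM for a node sized per OSD; os_overhead_gb is an assumed placeholder value."""
    return num_osds * per_osd_gb + os_overhead_gb


def ram_by_capacity_gb(num_osds: int, disk_tb: float) -> float:
    """RAM for a node sized by the 1 GB per 1 TB of storage rule."""
    return num_osds * disk_tb * 1.0


# Hypothetical node: 12 OSDs, each on a 4 TB disk.
print("bare minimum (2 GB per OSD):", ram_by_osd_count_gb(12, 2.0), "GB")
print("recommended (4 GB per OSD): ", ram_by_osd_count_gb(12, 4.0), "GB")
print("1 GB per TB rule:           ", ram_by_capacity_gb(12, 4.0), "GB")
```

Comparing the three figures side by side makes it easier to see which rule dominates for a given disk size and OSD count.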

Depending on your workload and size of spinning disks being used for the Ceph OSDs, extra memory may be required to ensure that the operating system can sufficiently cache the directory entries and file nodes from the filesystem used to store the Ceph objects. This may have a bearing on the RAM you wish to configure your nodes with and is covered in more detail in the tuning section of the book.

Regardless of the configured memory size, ECC memory should be used at all times.


CPU

Ceph's official recommendations are for 1 GHz of CPU power per OSD. Unfortunately, in real life, it's not quite as simple as this. What the official recommendations don't point out is that a certain amount of CPU power is required per I/O; it's not just a static figure. Thinking about it, this makes sense: the CPU is only used when there is something to be done. No I/O, no CPU is required. This also scales the other way: the more I/O, the more CPU is required. The official recommendation is a good safe bet for spinning disk-based OSDs. An OSD node equipped with fast SSDs can often find itself consuming several times this recommendation. To complicate things further, the CPU requirements also vary depending on I/O size, with larger I/Os requiring more CPU.

If the OSD node starts to struggle for CPU resource, it can lead to OSDs timing out and being marked out of the cluster, often to rejoin several seconds later. This continual loss and recovery tends to place more strain on the already limited CPU resource, causing cascading failures.

A good figure to aim for would be around 1-10 MHz per I/O, corresponding to 4 KB-4 MB I/Os, respectively. As always, testing should be carried out before going live to confirm that CPU requirements are met both in normal and stressed I/O loads.
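To see how the per-I/O figure compares with the static 1 GHz per OSD recommendation, the following sketch multiplies an expected I/O rate by a MHz-per-I/O cost. The function names and the example workloads are hypothetical; only the 1-10 MHz per I/O range and the 1 GHz per OSD baseline come from the text.

```python
def cpu_mhz_for_io(iops: float, mhz_per_io: float) -> float:
    """CPU capacity in MHz needed to service a given I/O rate."""
    return iops * mhz_per_io


def cpu_mhz_baseline(num_osds: int, mhz_per_osd: float = 1000.0) -> float:
    """CPU capacity in MHz according to the static 1 GHz per OSD recommendation."""
    return num_osds * mhz_per_osd


# Hypothetical 12-OSD spinning disk node at roughly 75 IOPS per disk, 4 KB I/Os.
print(cpu_mhz_for_io(12 * 75, 1.0), "MHz needed vs", cpu_mhz_baseline(12), "MHz baseline")

# Hypothetical 12-OSD SSD node serving 50,000 4 KB IOPS in total.
print(cpu_mhz_for_io(50_000, 1.0), "MHz needed vs", cpu_mhz_baseline(12), "MHz baseline")
```

In the first case the static recommendation comfortably covers the load; in the second, the I/O-driven requirement is several times higher, which matches the observation about SSD-equipped nodes.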

Another aspect of CPU selection, which is key to determining performance in Ceph, is the clock speed of the cores. A large proportion of the I/O path in Ceph is single threaded, so a faster clocked core will run through this code path faster, leading to lower latency. Due to the limited thermal design of most CPUs, there is often a trade-off of clock speed as the number of cores increases. High core count CPUs with high clock speeds also tend to be placed at the top of the pricing structure. Therefore, it is beneficial to understand your I/O and latency requirements in order to choose the best CPU.

A small experiment was done to find the effect of CPU clock speed against write latency. A Linux workstation running Ceph had its CPU clock manually adjusted using the userspace governor. The following results clearly show the benefit of high-clocked CPUs:

CPU MHz    4 KB write I/Os per second    Average latency (microseconds)
1600       797                           1250
2000       815                           1222
2400       1161                          857
2800       1227                          812
3300       1320                          755
4300       1548                          644
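The two result columns can be cross-checked against each other. The sketch below assumes that the middle column is the number of 4 KB write I/Os completed per second and that only a single I/O was outstanding at a time, in which case the average latency should be close to one second divided by the I/O rate. Both assumptions are inferred from the figures rather than stated in the text.

```python
# (CPU MHz, 4 KB write I/Os per second, average latency in microseconds)
results = [
    (1600, 797, 1250),
    (2000, 815, 1222),
    (2400, 1161, 857),
    (2800, 1227, 812),
    (3300, 1320, 755),
    (4300, 1548, 644),
]

for mhz, iops, latency_us in results:
    implied_latency_us = 1_000_000 / iops  # latency implied by one outstanding I/O
    print(f"{mhz} MHz: measured {latency_us} us, implied {implied_latency_us:.0f} us")

# Relative gain in I/O rate between the slowest and fastest clock speeds.
speedup = results[-1][1] / results[0][1]
print(f"I/O rate gain from 1600 MHz to 4300 MHz: {speedup:.2f}x")
```

The implied and measured latencies agree to within about one percent in every row, which supports reading the table as a latency-bound, single-threaded test.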

If low latency and especially low write latency is important, then go for the highest clocked CPUs you can get, ideally at least higher than 3 GHz. This may require a compromise in SSD only nodes on how many cores are available and thus how many SSDs each node can support. For nodes with 12 spinning disks and SSD journals, single socket quad core processors make an excellent choice as they are often available with very high clock speeds and are very aggressively priced.

Where latency is not as important, for example, object workloads, look at entry-level processors with well-balanced core counts and clock speeds.

Another consideration around CPU and motherboard choice is the number of sockets. In dual socket designs, the memory, disk controllers, and network interface controllers (NICs) are shared between the sockets. When one CPU needs data from a resource attached to the other CPU's socket, the data must cross the interlink bus between the two CPUs. Modern CPUs have high-speed interconnects, but they do introduce some performance penalty, and thought should be given to whether a single socket design is achievable. There are some options given in the tuning section on how to work around some of these possible performance penalties.


Disks

When choosing the disks to build a Ceph cluster with, there is always the temptation to go with the biggest disks you can, as the figures look great on paper. Unfortunately, in reality, this is often not a great choice. Although disks have dramatically increased in capacity over the past 20 years, their performance hasn't. First, ignore any sequential MBps figures; you will never see them in enterprise workloads. There is always something making the I/O pattern nonsequential enough that it might as well be random. Second, remember these figures:

7.2k disks = 70-80 4k IOPS

10k disks = 120-150 4k IOPS

15k disks = You should be using SSDs

As a general rule, if you are designing a cluster that will serve active workloads rather than bulk inactive/archive storage, design for the required input/output operations per second (IOPS), not capacity. If your cluster will contain largely spinning disks with the intention of providing storage for an active workload, a larger number of smaller capacity disks is normally preferred over the use of larger disks. With the decrease in cost of SSD capacity, serious thought should be given to using them in your cluster, either as a cache tier or even for a full SSD cluster.

Thought should also be given to the use of SSDs, either as journals with Ceph's filestore or for storing the DB and write-ahead log (WAL) when using BlueStore. Filestore performance is dramatically improved when using SSD journals, and running without them is not recommended unless the cluster is designed to be used with very cold data.

Also, consider that the default replication level of 3 will mean that each client write I/O will generate at least 3x the I/O on the backend disks. In reality, due to the internal mechanisms in Ceph, this number in some instances will be nearer six times write amplification. If no SSD journals are to be used in the cluster, then this number might be nearer 12 times write amplification in the worst case scenarios.
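Combining the per-disk IOPS figures with these write amplification factors gives a first estimate of how many spinning disks a write workload needs. The sketch below is illustrative: the function names and the 1,000 client write IOPS target are hypothetical, and 75 IOPS is simply the midpoint of the 70-80 range quoted for 7.2k disks.

```python
import math


def backend_write_iops(client_write_iops: float, amplification: float) -> float:
    """Write I/Os per second that reach the backend disks."""
    return client_write_iops * amplification


def disks_needed(client_write_iops: float, amplification: float, iops_per_disk: float) -> int:
    """Number of disks needed to absorb the amplified write load."""
    return math.ceil(backend_write_iops(client_write_iops, amplification) / iops_per_disk)


# Hypothetical target of 1,000 client write IOPS on 7.2k disks at ~75 IOPS each.
for label, amplification in [("replication only", 3),
                             ("with internal overheads", 6),
                             ("no SSD journals, worst case", 12)]:
    print(label, "->", disks_needed(1000, amplification, 75), "disks")
```

For this example the estimate is 40, 80, and 160 disks respectively, which shows how quickly the disk count is driven by IOPS rather than capacity.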

Although Ceph enables much more rapid recovery from a failed disk, as every disk in the cluster takes part in the recovery, larger disks still pose a challenge, particularly when having to recover from a node failure. In a cluster comprising ten 1 TB disks, each 50% full, a disk failure would leave the remaining disks having to recover 500 GB of data between them, or around 55 GB each. At an average recovery speed of 20 MBps, recovery would be expected to take around 45 minutes. A cluster with a hundred 1 TB disks would still only have to recover 500 GB of data, but this time the task is shared between 99 disks. In theory, the larger cluster would take around four minutes to recover from a single disk failure. In reality, these recovery times will be higher, as there are additional mechanisms at work that increase recovery time. In smaller clusters, recovery times should be a key factor when selecting disk capacity.
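The recovery estimate above can be reproduced with a short calculation. This is a simplified sketch with a hypothetical function name; it assumes the failed disk's data is spread evenly across the surviving disks, that each disk recovers its share at a constant rate, and that 1 GB is 1,000 MB.

```python
def recovery_minutes(total_disks: int, data_on_failed_disk_gb: float,
                     recovery_mb_per_s: float = 20.0) -> float:
    """Theoretical time to recover one failed disk, in minutes."""
    surviving_disks = total_disks - 1
    share_per_disk_mb = data_on_failed_disk_gb * 1000 / surviving_disks
    return share_per_disk_mb / recovery_mb_per_s / 60


# The two clusters from the text: 1 TB disks, each 50% full (500 GB of data per disk).
print(f"10 disks:  {recovery_minutes(10, 500):.1f} minutes")
print(f"100 disks: {recovery_minutes(100, 500):.1f} minutes")
```

This gives roughly 46 minutes for the ten-disk cluster and just over 4 minutes for the hundred-disk cluster, in line with the figures quoted.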


Networking

The network is a key and often overlooked component in a Ceph cluster; a poorly designed network can lead to a number of problems that manifest themselves in peculiar ways and make for a confusing troubleshooting session.

10G networking requirement

10G networking is strongly recommended for building a Ceph cluster. While 1G networking will work, latency will be pushing the bounds of being unacceptable, and it will limit the size of the nodes you can deploy. Thought should also be given to recovery; in the event of a disk or node failure, large amounts of data will need to be moved around the cluster. Not only will a 1G network be unable to provide sufficient performance for this, but normal I/O traffic will also be impacted. In the very worst of cases, this may lead to OSDs timing out, causing cluster instability.

As mentioned, one of the main benefits of 10G networking is the lower latency. Quite often a cluster will never push enough traffic to make full use of the 10G bandwidth; however, the latency improvement is realized no matter the load on the cluster. The round trip time for a 4k packet over a 10G network might be around 90 microseconds, whereas the same 4k packet over 1G networking will take over 1 millisecond. In the tuning section of this book, you will learn that latency has a direct effect on the performance of a storage system, particularly when performing direct or synchronous I/O.
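The effect of round trip time on synchronous I/O can be illustrated with a simple upper bound: if each I/O must complete before the next is issued, a single thread can never exceed one I/O per round trip. The sketch below uses a hypothetical function name, considers network latency only, and ignores the time spent in the OSD and on the disk, so real figures will be lower.

```python
def max_sync_iops(round_trip_us: float) -> float:
    """Upper bound on I/Os per second for one thread issuing one I/O at a time."""
    return 1_000_000 / round_trip_us


# Round trip times for a 4k packet, as quoted in the text.
print(f"10G (90 us):  at most {max_sync_iops(90):.0f} IOPS per thread")
print(f"1G (1000 us): at most {max_sync_iops(1000):.0f} IOPS per thread")
```

Even before any storage latency is added, the 1G network caps a single synchronous writer at around a thousand I/Os per second, roughly a tenth of what the 10G network allows.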

If your OSD nodes come equipped with dual NICs, look closely at a network design that allows you to use them active/active for both transmit and receive. Leaving a 10G link in a passive state is wasteful, and using both links will help to lower latency under load.