Book Image

MariaDB High Performance

By : Pierre Mavro
Book Image

MariaDB High Performance

By: Pierre Mavro

Overview of this book

Table of Contents (18 chapters)
MariaDB High Performance
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

BIOS power management optimization


By default, new servers are configured for low consumption power (green energy). This is a really good point for the environment! However, reducing electric power implies reducing computation power.

The CPU is the main component that directly affects performance because of its wake up time during sleep states. CPU power management should be disabled as much as possible. However, depending on the hardware manufacturer, multiple options may change.

If you're wondering what the effects are of that power management, it's simple. If the activity of the processor is reduced too much, a standard action that takes 1 second could take more than 2 seconds in that period. During this period of 1 or more seconds, CPU power management techniques are changing the CPU state (lowering frequency, lowering voltage, and deactivating some subsystems, and so on).

When the workload increases, subsystems are reactivated, but C-State reactivation implies some latency (milliseconds to seconds) before being at maximum performance.

C-States

C-States are idle CPU states. C0 is the working processor while C1 is the first idle level of the processor. The problem is no instructions are executed during the C1 state, as the CPU manufacturer introduced new C-States to reduce power consumption during an idle period.

P-States

P-States are different levels of the operation state of the processor. Each level differs from the others by different working voltages/frequencies. For example, at P0, a processor can run at 3 GHz, but at P1 it can run at only 1.5 GHz. Voltage is also scaled to reduce power consumption.

In a P-State, the CPU is working, performing operations and executing instructions. This is not an idle state.

Constructor name options

Depending on the CPU constructor, power management technologies do not have the same name. Here is a list of functionalities that should be disabled. For Intel, they are called:

  • Speedstep

  • Turbo boost

  • C1E

  • QPI Power Management

For AMD, the functionalities are as follows:

  • PowerNow!

  • Cooln'Quiet

  • Turbo Core

Some constructors have enabled extra power management features that are keyboard tricks (for example, with HP hardware, a Ctrl + A in the BIOS shows an additional Services Options menu).

Note

In most cases, it's recommended to look at the BIOS constructor documentation to see what should be turned on or off.

Power management optimization

Most of the following power management commands may not work under virtual machines. So, you should consider having a physical machine to test and run them.

cpufreq

cpufreq allows the OS to control P-States. This means that the OS-based idleness can lower the frequency of the CPU to reduce power consumption.

To get cpufreq information on a specified core, look at the cpufreq folder by running the following command:

> ls -1 /sys/devices/system/cpu/cpu0/cpufreq
affected_cpus
bios_limit
cpuinfo_cur_freq
cpuinfo_max_freq
cpuinfo_min_freq
cpuinfo_transition_latency
freqdomain_cpus
related_cpus
scaling_available_frequencies
scaling_available_governors
scaling_cur_freq
scaling_driver
scaling_governor
scaling_max_freq
scaling_min_freq
scaling_setspeed

When enabled in the BIOS, cpufreq drivers are loaded and the cpufreq directory in sysfs is available. You can look at the used governor (power management mechanism using the following command):

> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
ondemand

If the BIOS settings are disabled, no drivers are loaded and no scaling frequency is allowed. That means your server works at maximum performance. However, you can also tune the scaling governor for performance purposes.

To always get the maximum performance without disabling the options in the BIOS, we'll install a package that will configure all the cores on your machine:

aptitude install cpufrequtils
cp /usr/share/doc/cpufrequtils/examples/cpufrequtils.loadcpufreq.sample /etc/default/cpufrequtils

Then, edit the configuration file (/etc/default/cpufrequtils) and set the new configuration:

# /etc/default/loadcpufreq sample file
#
# Use this file to override the CPUFreq kernel module to
# be loaded or disable loading at all

ENABLE=true
FREQDRIVER=performance

You can get all the available governors with the following command:

> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
ondemand performance

Then, you can restart the cpufrequtils service and your governors' cores will be updated to performance.

cpuidle

cpuidle allows the OS to control the CPU C-states (control how the CPU goes into idle/sleep state). Depending on the CPU constructor and model, several C-States are available. Standard ones are C0 and C1. C0 is a running state while C1 is an idle state.

Even if C-States have been disabled in the BIOS settings, the cpuidle driver can be loaded and managed. To look at the loaded driver, run the following command:

> cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle

The intel_idle driver handles more C-States and aggressively puts the CPU into a lower idle mode. Since we have significant latency to wake up from lower C-States, this can affect performance.

When the intel_idle driver is loaded, specific cpuidle configurations are available for each CPU:

> ls -1R /sys/devices/system/cpu/cpu0/cpuidle/      
/sys/devices/system/cpu/cpu0/cpuidle/:
state0
state1
state2
state3

/sys/devices/system/cpu/cpu0/cpuidle/state0:
desc
latency
name
power
time
usage
[...]
/sys/devices/system/cpu/cpu0/cpuidle/state3:
desc
latency
name
power
time
usage

Each of the C-States are described here with the latency to wake up. To know the time it takes a C-State to wake up and check the latency file, run the following command:

> cat /sys/devices/system/cpu/cpu0/cpuidle/state3/latency
150

In the preceding command, the time for C-State 3 to wake up is 150 ms! To avoid having all the C-States enabled, change the grub boot configuration and add the following option (in the /etc/default/grub location):

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=0"

To make it work, upgrade the grub configuration and reboot:

update-grub

Disk and filesystem optimization

For disks and filesystems, there are multiple factors that can slow down your MariaDB instance:

  • Magnetic drives' rotation per minute

  • Magnetic drives with data at the beginning of the disk

  • Partitions not aligned to the disk

  • Small partitions at the end of the disk

  • Disk bus speed

  • Magnetic drives' seek time

  • Active SWAP partitions

Some of these factors can only be resolved by changing the hardware, but others can be changed by tuning the operating system.

Kernel disks' I/O schedulers

The kernel I/O scheduler permits us to change the way we read and write data on the disk. There are three kinds of schedulers. You can select a disk and look at the currently used scheduler using the following command:

# cat /sys/block/sda/queue/scheduler
noop deadline [cfq]

The I/O scheduler used here is Completely Fair Queuing (CFQ).

The noop scheduler queues requests as they are sent to the I/O.

The deadline scheduler prevents excessive seek movement by serving I/O requests that are near to the new location on the disk. This is the best solution for SSDs.

The CFQ scheduler is the default scheduler on most Linux distributions. The goal of this scheduler is to minimize seek head movements. This is the best solution for magnetic disks if there is no other mechanism above it (such as RAID, Fusion-io, and so on). In the case of SSDs, you have to use the deadline scheduler.

To change the disk I/O scheduler with deadline, use the following command:

echo deadline > /sys/block/device/queue/scheduler

You have to replace the device with the device name (such as sda).

Another solution to avoid changing the disk I/O scheduler manually is to install sysfsutils by using the following command:

aptitude install sysfsutils

Then, you have to configure it in /etc/sysfs.conf:

block/sdb/queue/scheduler = deadline

Easy to use and understand, sysfsutils is a daemon that permits us to make changes in /sys automatically (as there is no sysctl for /sys).

Now, it could be a problem if you have a lot of disks on your machine and want to set the same I/O scheduler on all devices. Simply change the grub boot settings (/etc/default/grub) with the elevetor option:

GRUB_CMDLINE_LINUX_DEFAULT="quiet elevator=deadline"

To make the preceding setting work, upgrade the grub configuration and reboot:

update-grub

If you want to go ahead, there are several options for each I/O scheduler, and there is no optimal configuration. For example, on MyISAM, you need to increase nr_requests to multiply the throughput. You have to test them and look at the better solution corresponding to your needs.

In the latest version of CFQ, it automatically detects if it's a magnetic disk and adapts itself to avoid changing the elevator value. You can find all the required information on the Linux kernel website (http://www.kernel.org).

Partition alignment

The goal of partition alignment is to match logical block partitions with physical blocks to limit the number of disk operations. You must make the first partition begin from the disk sector 2048. However, it can be done automatically if you're using the parted command. First of all, install the package:

aptitude install parted

Here is an example:

device=/dev/sdb
parted -s -a optimal $device mklabel gpt
parted -s -a optimal $device mkpart primary ext4 0% 100%

In the preceding example, we set /dev/sdb as the disk device, then created a gpt table partition, and finally created a single partition that takes the full disk size. 0% means the beginning of the disk (which in fact starts at 2M) and goes to the end (100%). The optimal option means we want the best partition alignment to get the best performance.

SSD optimization

From the 2.6.33 version of kernel, you can enable TRIM support. Btrfs, Ext4, JFS, and XFS are optimized for TRIM when you activate this option. The TRIM feature blocks data that is no longer considered in use and that can be wiped internally. It allows the SSDs to handle garbage collection overhead that otherwise slows down future operations on the blocks.

Ext4 is one of the best solutions for high performance. To enable TRIM on it, modify your fstab (/etc/fstab) configuration to add the discard option:

/dev/sda2 / ext4 rw,discard,errors=remount-ro 0 1

Now, remount your partition to enable TRIM support for Ext4:

mount -o remount /

On LVM, you can also enable TRIM for all the logical volumes by changing the issue_discards option in your LVM configuration file (/etc/lvm/lvm.conf):

issue_discards = 1

Finally, we want to limit needless utilization of SSD, and this can be done by setting temporary folders in the RAM using the tmpfs filesystem. To achieve this, edit the fstab file at /etc/fstab and add the following three lines:

tmpfs /tmp tmpfs defaults,noatime,mode=1777 0 0
tmpfs /var/lock tmpfs defaults,noatime,mode=1777 0 0
tmpfs /var/run tmpfs defaults,noatime,mode=1777 0 0

Mount the preceding partitions to make them active.

Tip

On Debian, you do not need to change /etc/fstab, and you can make add tmpfs is /etc/default/tmpfs instead.

Filesystem options

Several kinds of filesystems exist and their performances generally depend on their usage. For MariaDB, I've performed several tests against XFS. My conclusion is the same as what we can find on most of the specialized websites on the Internet: XFS is a good solution but Ext4 is slightly faster.

On Ext4, you can add several interesting options to limit write access on the disk. You can, for example, disable the access time on all files and folders. This will avoid writing the last access time information to any acceded files on partition. As MariaDB often needs to access the same files, they are updated on each MariaDB modification (insert/update/delete), which is disk I/O consuming.

This could be a problem in some cases (for example, if you absolutely need these updates), but most of the time, it can be disabled by adding the following options in the fstab configuration (/etc/fstab):

/dev/sda2 / ext4 rw,noatime,nodiratime,data=writeback,discard 0 1

On a high disk I/O system, you will reduce the disk's access significantly.

You've also noticed that we used data=writeback. This option means that only metadata writes are journalized. It works well with InnoDB and is safe. Why? Because InnoDB has its own transaction logs, there is no need to duplicate the same action. This is the fastest solution, but if you prefer a safer one, you can use data=ordered instead to get data written before metadata.

Another interesting filesystem performance solution is to separate the Ext4 journal from the data disk (as in journaling, the filesystem writes data twice). Place the journal on a separate fast drive such as SSD. By default, the journal occupies between 2.5 percent and 5 percent of the filesystem size. It's suggested to keep the size at minimum for performance (it could be reduced on a very large data size).

First of all, check your current filesystem block size (here /dev/mapper/vg-home):

> dumpe2fs /dev/mapper/vg-home | grep "^Block"
dumpe2fs 1.42.5 (29-Jul-2012)
Block count:              1327104
Block size:                 4096
Blocks per group:      32768

Here, we've got a 4096 block size and the journal needs to have the same block size as well.

To dedicate a journal to a current partition, we need to unmount it. To be sure that there is no access, remove the current journal from the partition, create the journal partition on the dedicated device (partition size * 5 / 100), attach it to the wished partition, and then remount it:

> umount /home
> dumpe2fs /dev/mapper/vg-home | grep "Journal"
Journal inode:            8
Journal backup:           inode blocks
Journal features:         (none)
Journal size:             128M
Journal length:           32768
Journal sequence:         0x0000002f
Journal start:            0
> tune2fs -f -O ^has_journal /dev/mapper/vg-home
> mke2fs -O journal_dev -b 4096 /dev/sdb1
> tune2fs -j -J device=/dev/sdb1 /dev/mapper/vg-home
> mount /home

Now, you check on your partition to see whether the journal is located on another partition:

> dumpe2fs /dev/mapper/vg-home | grep "Journal"
Journal UUID:             8a3c6cec-2d45-4aa9-ac2f-4a181093a92e
Journal device:	          0x0811
Journal backup:           inode blocks

To locate it, use the following command:

> blkid | grep 8a3c6cec-2d45-4aa9-ac2f-4a181093a92e
/dev/mapper/vg-home: UUID="6b8f2604-e1ac-4bea-a5c9-e7acf08cec8c" TYPE="ext4" EXT_JOURNAL="8a3c6cec-2d45-4aa9-ac2f-4a181093a92e"
/dev/sdb1: UUID="8a3c6cec-2d45-4aa9-ac2f-4a181093a92e" TYPE="jbd"

As you now have a dedicated journal for your partition, add two other options to /etc/fstab (journal_async_commit). The advantage is that the commit block can be written to disk without waiting for the descriptor blocks. This option will boost performance. The code is as follows:

/dev/mapper/vg-home /home ext4 rw,noatime,nodiratime,data=writeback,discard,journal_async_commit 0 2

Another option exists for Ext4: barrier=0. It will boost performance as well. Do not use it if you have a standalone server, because it will delay journal data writes and you may not be able to recover your data if your system crashes. You only have to use barrier=0 if you're using a RAID car with a BBU.

Tip

The Linux kernel evolves very quickly. XFS has new options, new filesystems appear, and Ext4 may not be the best solution in all cases. You should stay in touch with all the kernel-related news and test your usage cases yourself.

SWAP

As SWAP is used on a physical disk (magnetic or SSD), it's slower than RAM. Linux, by default, likes swapping for several reasons. To avoid your MariaDB data being SWAP instead of RAM, you have to play with a kernel parameter called swappiness.

A swappiness value is used to change the balance between swapping out runtime memory and dropping pages from the system page cache. The higher the value, the more the system will swap. The lower the value, the less the system will swap. The maximum value is 100, the minimum is 0, and 60 is the default. To change this parameter in the persistence mode, add this line to your sysctl.conf file in /etc/sysctl.conf:

vm.swappiness = 0

To avoid a system reboot to get this value set on the running system, you can launch the following command:

sysctl -w vm.swappiness=0

And now check the value to be sure it has been applied:

> sysctl vm.swappiness
vm.swap
piness = 0

Dedicating hardware with cgroups

Linux kernel brings features that permit the isolation of a process from others, called cgroups (since version 2.6.24). If we want to dedicate CPU, RAM, or disk I/O, we can use cgroups to do it (it also provides other interesting features if you want to go ahead). With this solution, you can be sure to dedicate hardware to your MariaDB instance.

To start using cgroups, we must start preparing the environment. In fact, cgroups needs a specific folder hierarchy to work, but you'll see the advantages when we use it. So, edit the fstab file in /etc/fstab to mount cgroups at each machine startup, and add the following line:

cgroup  /sys/fs/cgroup  cgroup  defaults  0   0

Mount cgroup now to make cgroups available:

  mount /sys/fs/cgroup

To get all the CPU and memory features enabled, you need to change the grub configuration by adding two new features in /etc/default/grub (cgroup_enable and swapaccount):

GRUB_CMDLINE_LINUX_DEFAULT="quiet cgroup_enable=memory swapaccount=1"

Then, upgrade your grub settings and reboot:

update-grub

After the machine has rebooted, you can check whether your cgroup hierarchy exists:

> mount | grep ^cgroup
cgroup on /sys/fs/cgroup type cgroup (rw,relatime,perf_event,blkio,net_cls,freezer,devices,memory,cpuacct
,cpu,cpuset)

Manual solution

Let's create our first cgroup, the MariaDB one! Create a folder with a name of your choice in the cgroup folder:

mkdir /sys/fs/cgroup/mariadb_cgroup

If we now look at the mariadb_cgroup content, you can see all the limitations that the cgroup features are able to offer:

> ls -1 /sys/fs/cgroup/mariadb_cgroup/
[...]
cpuset.cpu_exclusive
cpuset.cpus
cpuset.mem_exclusive
cpuset.mem_hardwall
cpuset.memory_migrate
cpuset.memory_pressure
cpuset.memory_spread_page
cpuset.memory_spread_slab
cpuset.mems
[...]
tasks

You can see that there's a lot of stuff! Ok, now let's look at your processor information to see how many cores you've got:

> cat /proc/cpuinfo | grep ^processor
processor  : 0
processor  : 1
processor  : 2
processor  : 3

I can see that I've got four cores available on this machine. For example, let's say I want to dedicate two cores to my MariaDB instance. The first thing to do is to assign two cores to the mariadb_cgroup cgroup:

echo 2,3 > /sys/fs/cgroup/mariadb_cgroup/cpuset.cpus

You can set multiple cores separated by commas or with the minus character if you want a CPU range (0-3 to set from C0 to C3).

In case of multiple cores, I've just asked the cgroup to be bound to the last two cores. That means this cgroup is only able to use those two cores and that doesn't mean it is the only one able to use them. Those cores are still sharable with other processes. To make them dedicated to this cgroup, simply use the following command:

echo 1 > /sys/fs/cgroup/mariadb_cgroup/cpuset.cpu_exclusive

You can check the configuration of your cgroup simply with cat:

> cat /sys/fs/cgroup/mariadb_cgroup/cpuset.cpu*
1
2-3

We also need to specify the memory nodes that the tasks will be allowed to access. First, let's get a look at the available memory nodes:

> numactl --hardware | grep ^available
available: 1 nodes (0)

Then, set to the wished memory node (here 0):

echo 0 > /sys/fs/cgroup/mariadb_cgroup/cpuset.mems

Now, the cgroup is ready to dedicate cores to a process ID:

echo $(pidof mysqld) > /sys/fs/cgroup/mariadb_cgroup/tasks

That is it! If you want to be sure that you've correctly configured your cgroups, you can add another PID in that cgroup that will burst the two cores and check with the top or htop command, for example.

You can check your configuration using a PID in the following way:

> cat /proc/$(pidof mysqld)/status | grep _allowed
Cpus_allowed:  c
Cpus_allowed_list:  2-3
Mems_allowed:  00000000,00000001
Me
ms_allowed_list:  0

Automatic solution using the cgconfig daemon

It's preferable to be able to manage the manual solution before the automatic solution to check whether your configuration works as expected.

Now, if you want to have it enabled on boot and automatically configured correctly, you will need to use the cgconfig daemon. It will load a configuration and then watch all the launched processes. If one matches its set configuration, it will automatically apply the defined rules.

To get cgconfig, you'll need to install the following package:

aptitude install cgroup-bin daemon

The cgroup-bin package in Debian wheezy is a little bit young, so we need to manually set up the init file and the configuration from the package documentation.

Unfortunately, you need to do a little hack with the init skeleton file to be able to use the update-rc.d command for the cgconfig services because the original init files are not 100 percent Debian-compliant yet:

cd /etc/init.d
cp skeleton cgconfig
cp skeleton cgred
chmod 755 cgconfig cgred
sed -i 's/skeleton/cgconfig/' cgconfig
sed -i 's/skeleton/cgred/' cgred
update-rc.d cgconfig defaults
update-rc.d cgred defaults
cd /usr/share/doc/cgroup-bin/examples/
cp cgred.conf /etc/default/
cp cgconfig.conf cgrules.conf /etc/
gzip -d cgconfig.gz
cp cgconfig cgred /etc/init.d/
cd /etc/init.d/
sed -i 's/sysconfig/defaults/' cgred cgconfig
sed -i 's/\/etc\/rc.d\/init.d\/functions/\/lib\/init\/vars.sh/' cgred
sed -i 's/--check/--name/' cgred
sed -i 's/killproc.*/kill $(cat $pidfile)/' cgred
sed -i 's/touch "$lockfile"/test -d \/var\/lock\/subsys || mkdir \/var\/lock\/subsys\n\t&/' cgconfig
chmod 755 cgconfig cgred

In the meantime, we've updated a Red Hat path to a Debian one (sysconfig | defaults), modified the folder to store the lock file of the daemon, and changed the default cgred init to correct some bugs.

Regarding the configuration files, let's start with /etc/cgconfig.conf:

#
#  Copyright IBM Corporation. 2007
#
#  Authors:     Balbir Singh <[email protected]>
#  This program is free software; you can redistribute it and/or modify it
#  under the terms of version 2.1 of the GNU Lesser General Public License
#  as published by the Free Software Foundation.
#
#  This program is distributed in the hope that it would be useful, but
#  WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
group mariadb_cgroup {
    perm {
        admin {
            uid = mysql;
        }
        task {
            uid = mysql;
        }
    }

        cpuset {
                cpuset.mems = 0;
                cpuset.cpus = "2,3";
                cpuset.cpu_exclusive = 1;
        }
}

Here, we've got the cgroup name mariadb_cgroup. When a user, mysql, launches an operation, the cpuset configuration will be applied. In the same way as in the manual method, we've limited the mysql user process to the second and third cores.

The last thing to configure is the cgrules.conf file in /etc/cgrules.conf, which will indicate which process belongs to which cgroup. You need to add the user mysql to modify the cpu information and the cgroup folder name where it should be placed:

mysql          cpu           mariadb_cgroup/

Of course, you can check your configuration in /sys/fs/cgroup when you want.

When you've finished configuring your cgroup and want the new configuration to be active, restart the services in the following order:

  • /etc/init.d/cgred stop

  • /etc/init.d/cgconfig stop

  • umount /sys/fs/cgroup 2>/dev/null

  • rmdir /sys/fs/cgroup/* /sys/fs/cgroup 2>/dev/null

  • mount /sys/fs/cgroup

  • /etc/init.d/cgconfig start

  • /etc/init.d/cgred start

Dedicating hardware optimization with NUMA

With large InnoDB databases (~ >32G), it becomes important to take a look at this kind of optimization.

In old/classic Uniform Memory Architecture (UMA), all the memory was shared among all the processors with equal access. There wasn't any affinity and performances were equal among all cores to the memory bank. With the Non-Uniform Memory Access (NUMA) architecture (starting with Intel Nehalem and AMD Opteron), this is totally different:

Each core has a local memory bank that gives closer access and thus reduces the latency. Of course, the whole system is visible as one unit, but optimization can be done to restrict a processor to its local memory bank. If there is no NUMA optimization, a core can ask for memory outside its local memory, which will increase the latency and lower the global performances.

By default, Linux automatically knows when it runs on a NUMA architecture and performs the following kind of operations natively:

  • Collects hardware information to understand the running architecture

  • Binds the correct memory module to the local core it belongs to

  • Splits physical processors to nodes

  • Collects cost information regarding inter-node communication

To look at the NUMA hardware on a running system, you can use the numactl command (install it first if not present):

> numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 65490 MB
node 0 free: 56085 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 65536 MB
node 1 free: 35150 MB
node distances:
node 0 1 
0: 10 20 
1: 20 10

We can see two different nodes here that indicate two different physical CPUs and the physical allocated RAM.

The node distances represent the cost of interconnect access. The weight for node 0 to access its local bank is 10, and for node 1, it's 20. This is the same constraint for node 1 to access node 0.

You can see the NUMA policy and information using the following command:

> numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 
cpubind: 0 1 
nodebind: 0 1 
membind: 0 1

Now, if you want to bind a process to a CPU, use the following command:

numactl –physcpubind=0,1 <PID>

To allocate the local memory of a NUMA node, use the following command:

numactl --physcpubind=0 --localalloc <PID>

Now, if you want to get stats and see how your NUMA system works when the interleave has been hit and so on, use the numastat command:

> numastat
node0 node1
numa_hit 407467513 656867541
numa_miss 0 0
numa_foreign 0 0
interleave_hit 32442 32470
local_node 407037248 656824235
other_node 430265 43306