Red Hat Enterprise Linux Troubleshooting Guide

By: Benjamin Cane

Overview of this book

Red Hat Enterprise Linux is an operating system that allows you to modernize your infrastructure, boost efficiency through virtualization, and finally prepare your data center for an open, hybrid cloud IT architecture. It provides the stability to take on today's challenges and the flexibility to adapt to tomorrow's demands. In this book, you begin with simple troubleshooting best practices and get an overview of the Linux commands used for troubleshooting. The book will cover the troubleshooting methods for web applications and services such as Apache and MySQL. Then, you will learn to identify system performance bottlenecks and troubleshoot network issues; all while learning about vital troubleshooting steps such as understanding the problem statement, establishing a hypothesis, and understanding trial, error, and documentation. Next, the book will show you how to capture and analyze network traffic, use advanced system troubleshooting tools such as strace, tcpdump & dmesg, and discover common issues with system defaults. Finally, the book will take you through a detailed root cause analysis of an unexpected reboot where you will learn to recover a downed system.

Root cause analysis

Root cause analysis (RCA) is a process performed after an incident occurs. The goal of the RCA process is to identify the root cause of an incident and any corrective actions that would prevent the same incident from occurring again. These corrective actions might range from something as simple as establishing user training to something as involved as reconfiguring Apache across all web servers.

The RCA process is not unique to technology and is a widely practiced process in fields such as aviation and occupational safety. In these fields, an incident is often more than simply a few computers being offline. They are incidents where a person's life might have been at risk.

The anatomy of a good RCA

Different work environments might implement RCA processes differently, but every good RCA contains a few key elements:

The problem as it was reported
The actual root cause of the problem
A timeline of events and actions taken
Any key data points to validate the root cause
A plan of action to prevent the incident from reoccurring
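
As a minimal sketch, the five elements above could be captured in a simple record so that an incomplete RCA is easy to spot. The class name, field names, and sample values below are hypothetical, chosen only for illustration; they are not part of any standard RCA tooling.

```python
from dataclasses import dataclass, field


@dataclass
class RootCauseAnalysis:
    """Hypothetical record holding the five elements of an RCA."""
    reported_problem: str = ""
    root_cause: str = ""
    timeline: list = field(default_factory=list)          # (time, event) tuples
    key_data_points: dict = field(default_factory=dict)   # name -> value
    action_plan: list = field(default_factory=list)       # corrective actions

    def missing_elements(self):
        """Return the names of the elements that have not been filled in."""
        return [name for name, value in vars(self).items() if not value]


# A partially completed RCA: only two of the five elements are present.
rca = RootCauseAnalysis(
    reported_problem="Outage with e-mail servers",
    timeline=[("08:00", "Outage reported to the NOC helpline")],
)
print(rca.missing_elements())
```

A check like `missing_elements()` is only a completeness test; it says nothing about whether the recorded root cause is actually correct.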

The problem as it was reported

One of the first steps in the troubleshooting process is to identify the problem, and the problem as it was reported is a key piece of information for an RCA. The reason it matters varies with the issue. Sometimes, it shows whether or not the issue was correctly identified. Most of the time, it serves as an estimate of the impact of the issue.

Understanding the impact of an issue can be very important. For some companies and issues, it could mean lost revenue; for others, it could mean damage to their brand; and, depending on the issue, it could mean nothing at all.

The actual root cause of the problem

The importance of this element of a root cause analysis is self-explanatory. However, sometimes it might not be possible to identify a root cause. In this chapter and in Chapter 12, Root Cause Analysis of an Unexpected Reboot, I will discuss how to handle issues where a full root cause is unavailable.

A timeline of events and actions taken

If we use an aviation incident as an example, it is easy to see where a timeline of events, such as when the plane took off, when passengers boarded, and when the maintenance crew finished their evaluation, can be useful. A timeline for technology incidents can also be very useful, as it can be used to identify the length of impact and when key actions were taken.

A good timeline should consist of times and major events of the incident. The following is an example timeline of a technology incident:

At 08:00, Joe B. phoned the NOC helpline to report an outage with e-mail servers in Tempe
At 08:15, John C. logged into the e-mail servers in Tempe and noticed they were running out of available memory
At 08:17, as per the runbook, John C. began rebooting the e-mail servers one by one
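
To illustrate how a timeline helps identify when key actions were taken, the following sketch stores the example entries and computes the minutes between two of them. The function name and the tuple layout are assumptions made for this example. Note that the example timeline has no entry for when service was restored, so it cannot yet give the full length of impact.

```python
from datetime import datetime

# Timeline entries from the example above, as (time, event) tuples.
timeline = [
    ("08:00", "Joe B. phoned the NOC helpline to report an e-mail outage in Tempe"),
    ("08:15", "John C. logged into the e-mail servers and noticed low available memory"),
    ("08:17", "John C. began rebooting the e-mail servers one by one"),
]


def minutes_between(start, end):
    """Return the whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


reported = timeline[0][0]       # time the issue was reported
first_action = timeline[-1][0]  # time the first corrective action began
elapsed = minutes_between(reported, first_action)
print(f"Minutes from report to first corrective action: {elapsed}")
```

This sketch assumes all entries fall on the same day; an incident that spans midnight would need full dates in each entry.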

Any key data points to validate the root cause

In addition to a timeline of events, the RCA should also include key data points. To use the aviation example again, key data points would be the weather conditions during the incident, the work hours of those involved, or the condition of the aircraft.

Our timeline example contained a few key data points:

Time of incident: 08:00
Condition of e-mail servers: Running out of available memory
Affected service: E-mail

Whether the data points are on their own or within a timeline, it is important to ensure those data points are well documented in the RCA.

A plan of action to prevent the incident from reoccurring

The entire point of performing a root cause analysis is to establish why an incident occurred and the plan of action to prevent it from happening again.

Unfortunately, this is an area that I see many RCAs neglect. An RCA process can be useful when implemented well; however, when implemented poorly, it can turn into a waste of time and resources.

Often with poor implementations, you will find that RCAs are required for every incident, big or small. The problem with this is that it leads to a reduction in the quality of the RCAs. An RCA should only be performed when the incident causes significant impact. For example, hardware failures are not always preventable. You can proactively identify failing hardware using tools such as smartd for hard drives, but apart from replacing the drives, you cannot always prevent them from failing. Requiring an RCA for every hardware failure and replacement is an example of a poorly implemented RCA process.

When engineers are required to establish a root cause for something as common as hardware failing, they neglect the root cause process. When engineers neglect the RCA process for one type of incident, the neglect can spread to other types of incidents, causing the quality of RCAs to suffer.

An RCA should be reserved for incidents with significant impact. Minor or routine incidents should never have an RCA requirement; they should, however, be tracked. By tracking the number of hard drives that have been replaced, along with the make and model of those hard drives, it is possible to identify hardware quality issues. The same is true of routine incidents such as resetting user passwords. By tracking these types of incidents, it is possible to identify areas for improvement.
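
As a rough sketch of this kind of tracking, the following example counts replaced hard drives by make and model. The replacement log is entirely hypothetical placeholder data; in practice it would come from a ticketing or inventory system.

```python
from collections import Counter

# Hypothetical replacement log: one (make, model) entry per replaced drive.
replacements = [
    ("VendorA", "Model-1"),
    ("VendorB", "Model-7"),
    ("VendorA", "Model-1"),
    ("VendorA", "Model-1"),
    ("VendorB", "Model-9"),
]


def replacement_counts(log):
    """Count replaced drives per (make, model), most frequent first."""
    return Counter(log).most_common()


for (make, model), count in replacement_counts(replacements):
    print(f"{make} {model}: {count} replaced")
```

A model that appears far more often than others is a candidate for a closer look, although the counts only mean something when compared with how many drives of each model are deployed.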

Establishing a root cause

To give a better understanding of the RCA process, let's use a hypothetical problem seen in production environments.

Note

A web application crashed when writing to a file

After logging into the system, you find that the application crashed because the filesystem it attempted to write to was full.

Note

The root cause is not always the obvious cause

Was the root cause of the issue the fact that the filesystem was full? No. While the filesystem being full might have caused the application to crash, this is what is called a contributing factor. A contributing factor, such as the filesystem being full, can be corrected, but correcting it will not prevent the issue from reoccurring.

At this point, it is important to identify why the filesystem was full. On further investigation, you find that it was due to a co-worker disabling a cron job that removes old application files. After the cron job was disabled, the available space on the filesystem slowly kept decreasing. Eventually, the filesystem was 100 percent utilized.

In this case, the root cause of the issue was the disabled cron job.
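
Because the filesystem in this scenario filled up slowly, a periodic utilization check could have surfaced the contributing factor before the application crashed. The following is a minimal sketch using Python's standard `shutil.disk_usage`; the function names, the path, and the 90 percent threshold are assumptions for illustration, and the figure may differ slightly from what `df` reports because `df` accounts for reserved blocks.

```python
import shutil


def percent_used(path):
    """Return how full the filesystem containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # pseudo-filesystems can report a size of zero
    return usage.used / usage.total * 100


def is_nearly_full(path, threshold=90.0):
    """True when utilization is at or above the (assumed) threshold."""
    return percent_used(path) >= threshold


# Assumed path; substitute the filesystem the application writes to.
path = "/"
print(f"{path} is {percent_used(path):.1f}% used")
```

A check like this only detects the contributing factor. Preventing a reoccurrence still means addressing the root cause, in this case the disabled cron job.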

Sometimes you must sacrifice a root cause analysis

Let's look at another hypothetical situation, where an issue causes an outage. Since the issue caused significant impact, it will absolutely require an RCA. The problem is, in order to resolve the issue, you will need to perform an activity that eliminates the possibility of performing an accurate RCA.

These situations sometimes require a judgment call: whether to live with the outage a little longer, or to resolve the outage and sacrifice any chance of an RCA. Unfortunately, there is no single answer for these situations; the correct answer depends on both the issue and the environment affected.

Tip

While working on financial systems, I found myself having to make this decision often. With mission-critical systems, the answer was almost always to restore service rather than perform the root cause analysis. However, whenever possible, it is always preferable to first capture data, even if that data cannot be reviewed immediately.
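
One way to capture data quickly before restoring service is to save the output of a few diagnostic commands to a timestamped directory for later review. The sketch below is one possible approach, not a prescribed procedure: the function name, the choice of commands, and the 30-second timeout are all assumptions, and the commands should be adjusted to suit the incident.

```python
import subprocess
import time
from pathlib import Path

# Assumed set of commands; adjust to whatever is relevant to the incident.
DEFAULT_COMMANDS = {
    "processes": ["ps", "aux"],
    "disk_usage": ["df", "-h"],
    "memory": ["free", "-m"],
    "kernel_messages": ["dmesg"],
}


def capture_state(output_root, commands=None):
    """Run each command and save its output under a timestamped directory."""
    commands = commands or DEFAULT_COMMANDS
    out_dir = Path(output_root) / time.strftime("%Y%m%d-%H%M%S")
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, argv in commands.items():
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True,
                errors="replace", timeout=30,
            )
            text = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Keep going: a missing or hung tool should not block the capture.
            text = f"could not run {argv}: {exc}\n"
        (out_dir / f"{name}.txt").write_text(text)
    return out_dir
```

For example, `capture_state("/var/tmp/incident")` (a hypothetical location) would write one text file per command. If the incident involves a full filesystem, the output directory should be on a different filesystem.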
