Red Hat Enterprise Linux Troubleshooting Guide

By: Benjamin Cane

Overview of this book

Red Hat Enterprise Linux is an operating system that allows you to modernize your infrastructure, boost efficiency through virtualization, and finally prepare your data center for an open, hybrid cloud IT architecture. It provides the stability to take on today's challenges and the flexibility to adapt to tomorrow's demands. In this book, you begin with simple troubleshooting best practices and get an overview of the Linux commands used for troubleshooting. The book will cover the troubleshooting methods for web applications and services such as Apache and MySQL. Then, you will learn to identify system performance bottlenecks and troubleshoot network issues; all while learning about vital troubleshooting steps such as understanding the problem statement, establishing a hypothesis, and understanding trial, error, and documentation. Next, the book will show you how to capture and analyze network traffic, use advanced system troubleshooting tools such as strace, tcpdump & dmesg, and discover common issues with system defaults. Finally, the book will take you through a detailed root cause analysis of an unexpected reboot where you will learn to recover a downed system.

Back to troubleshooting our RAID

Now that we have a better understanding of RAID and the different configurations, let's go back to investigating our errors.

Apr 26 10:25:44 nfs kernel: md/raid1:md127: Disk failure on sdb1, disabling device.
md/raid1:md127: Operation continuing on 1 devices.

From the preceding error, we can see that our RAID device is md127 and that it is a RAID 1 device (md/raid1). The error also names the failed member, sdb1. The message Operation continuing on 1 devices means that the other half of the mirror is still operational.
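As a rough illustration of reading this message programmatically, the following Python sketch pulls the RAID level, array name, and failed member out of the log line. The regular expression and the function name parse_md_failure are assumptions made for this example, based only on the message format shown above; other kernel versions may word the message differently.

```python
import re

# Pattern based solely on the message shown above; an assumption, not a
# guaranteed kernel log format.
MD_FAILURE = re.compile(
    r"md/(?P<level>raid\d+):(?P<array>md\d+): "
    r"Disk failure on (?P<member>\w+), disabling device\."
)


def parse_md_failure(line):
    """Return the RAID level, array, and failed member, or None if no match."""
    match = MD_FAILURE.search(line)
    if match is None:
        return None
    return match.groupdict()


line = ("Apr 26 10:25:44 nfs kernel: md/raid1:md127: "
        "Disk failure on sdb1, disabling device.")
print(parse_md_failure(line))
# prints {'level': 'raid1', 'array': 'md127', 'member': 'sdb1'}
```

A line that does not contain a disk failure message returns None, so the function can be applied to every line of a log without special handling.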

The good news is that only one side of the mirror has failed. If both sides were unavailable, the RAID would fail completely and cause far worse problems.

Since we now know the RAID device affected, the type of RAID used, and even the hard disk that failed, we have quite a bit of information about this failure. If we continue looking at the log entries from /var/log/messages, we can find out even more:

Apr 26 10:25:55 nfs kernel: md: unbind<sdb1>
Apr 26 10:25:55 nfs kernel: md: export_rdev...
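When a log contains many unrelated entries, it can help to filter it down to the md messages first. The following Python sketch shows one possible way to do that. The function name md_messages and the pattern are assumptions for this example, and the third sample line is made up to stand in for unrelated log noise. The pattern only matches lines that carry the kernel: prefix, so a continuation line without that prefix would be skipped.

```python
import re

# Matches syslog lines whose kernel message starts with "md:" or "md/".
# The layout is assumed from the excerpts above.
MD_LINE = re.compile(r"kernel: (md[:/].*)$")


def md_messages(lines):
    """Yield the md portion of each kernel log line that has one."""
    for line in lines:
        match = MD_LINE.search(line.rstrip("\n"))
        if match:
            yield match.group(1)


sample = [
    "Apr 26 10:25:44 nfs kernel: md/raid1:md127: "
    "Disk failure on sdb1, disabling device.",
    "Apr 26 10:25:55 nfs kernel: md: unbind<sdb1>",
    # Hypothetical unrelated entry, included only to show the filtering.
    "Apr 26 10:25:50 nfs sshd[1234]: an unrelated message",
]

for message in md_messages(sample):
    print(message)
```

Because md_messages accepts any iterable of lines, an open file object for /var/log/messages could be passed in place of the sample list, provided the user has permission to read that file.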
