Book Image

VMware vSphere Troubleshooting

Book Image

VMware vSphere Troubleshooting

Overview of this book

VMware vSphere is the leading server virtualization platform with consistent management for virtual data centers. It enhances troubleshooting skills to diagnose and resolve day to day problems in your VMware vSphere infrastructure environment. This book will provide you practical hands-on knowledge of using different performance monitoring and troubleshooting tools to manage and troubleshoot the vSphere infrastructure. It begins by introducing systematic approach for troubleshooting different problems and show casing the troubleshooting techniques. You will be able to use the troubleshooting tools to monitor performance, and troubleshoot issues related to Hosts and Virtual Machines. Moving on, you will troubleshoot High Availability, storage I/O control problems, virtual LANS, and iSCSI, NFS, VMFS issues. By the end of this book, you will be able to analyze and solve advanced issues related to vShpere environment such as vcenter certificates, database problems, and different failed state errors.
Table of Contents (16 chapters)
VMware vSphere Troubleshooting
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Installing VMware vRealize Operations Manager
Power CLI - A Basic Reference
Index

Troubleshooting techniques


We all fix things in our daily lives, and all it takes to fix these things are troubleshooting skills. As with all skills, whether it's playing the piano, fixing a broken car, acting, or writing a computer program, some people are gifted with these skills for troubleshooting by nature. If you have a natural skill, you might assume that everyone else is also gifted. You may have learned how to ride a bike effortlessly, without knowing how much work other people may have had to put into it.

In the same way, some people have a natural talent for troubleshooting and are better at it than others. Such people quickly grasp the necessary steps and easily isolate the problem until they are able to find the root cause. Let's say your motorbike stops working and you take it to a mechanic, telling him the problems and the symptoms of your motorbike. A mechanic who is good at troubleshooting could be able to isolate the problem right away. He could also be able to explain you why your motorbike fails and what is the root cause of the problem. On the contrary, when you take your motorbike to a mechanic who isn't good in troubleshooting, you can expect more time to fix the motorbike and a higher repair bill. You may also need to go every now and then to see the mechanic to get your motorbike fixed at the earliest.

But this does not mean that if you don't have troubleshooting skills, you cannot learn them. Troubleshooting skills can be learned and mastered by anyone. For example, like many other skills, we apply certain techniques in troubleshooting as well—it does not matter whether we are gifted with this skill or not. When we start practicing, it becomes our second nature. We all want to be better troubleshooters, but we also need to be precise and fast. A good system engineer is gifted with troubleshooting skills. When we work in highly available environments where downtime is measured in dollars, we always want to have the right troubleshooting skill set to solve the problem. This requires precision, speed, comprehension, and troubleshooting skills.

Of course, it makes sense that you would prefer to go to the good mechanic who knows what it takes to fix your motorbike efficiently. Applying these scenarios will not only help you to troubleshoot in all aspects of life but also to troubleshoot vSphere in terms of identifying problems and their root causes, and understanding how to fix them.

You should consider a structured approach to troubleshooting rather than doing so without applying any methodology. The following aspects can be helpful and can teach you how to best practice troubleshooting, taking the motorbike to be repaired as an example:

Root Cause of Problem

Troubleshooting Skills Required

In the Engine

Action Needed

Not working at all

Easy

Dead battery

Problem understanding

Malfunctioning

Medium

Dashboard blinking light

Problem understanding + investigation

Malfunctioning, but the symptoms are seen in other components

Hard

Loss of power

Problem understanding + real-time investigation + correlation of events

Not working, but the problem disappeared

Requires on long analysis

Weak battery or some mechanical problems

Problem understanding + historical investigation + correlation of events

Precise communication

You should always establish good communication methods within your work environment. Communicating your problem effectively is one of the key skills required essentially for troubleshooting, especially when you are working in a collaborative environment. Lack of communication can lead to some serious and never-ending problems with increasing down-time. You might be working continuously without realizing that your other team members are working on the same problem as your are. If you've precise communication, you will always avoid the path that your other team members have already discovered.

The following communication methods can be effectively used to communicate within and outside of teams:

  • Direct conversation: You can communicate the problem directly, in person, with your team members

  • Voice/Video chats: Voice and video chats are very common now a days and enable a geographically distributed team to conduct regular meetings

  • Web sessions: Web sessions can be used to access remote systems, conducting presentations and sharing whiteboards

  • Email/Text chat: Email is the most common tool to used now a days for all kind of office communication

Creating a knowledge base of identified problems and solutions

While working on any system, you will face many common problems again and again. You should always create a knowledge base of these common problems, which includes identifying the problem, its symptoms, and the solution to be applied, along with a Root Cause Analysis (RCA) of the problem. Documenting and creating a knowledge repository of these problems and steps taken to troubleshoot them will save you a lot of work in the future. This will also help you to share the knowledge of troubleshooting with all your team members at one place. In addition, it will help you transfer knowledge to your newly hired team members and allow them to use a smarter and more methodological approach towards troubleshooting.

You might be able to fix the issue with no understanding of the root cause, but you cannot completely prevent it. You should always isolate and find the correct root cause in order to avoid problems in the future. If you know the root cause, you can easily assign the problematic issue to the correct team to resolve it accordingly. Sometimes you can come across very complex problems, where you may find the root cause, but sometimes that changes several times in the procedure. Highly available environments also have high stress and require your full concentration, excellent troubleshooting skills, and the correct domain knowledge. This becomes more crucial when it costs your organization money at every single second.

Obtaining the required knowledge of the problem space

For highly available environments, where every second of down time can cost you dollars, you would always have the right people in the right place in order to make sure your investment has been made at the right place. The value you will get by having the right people for the right job would save you not only in terms of Return on Investment (ROI) but also in terms of your reputation. If the required knowledge is missing, you should conduct training: first educate yourself and then transfer the knowledge to your team members. A technical team equipped with the knowledge of the problem space is highly desirable at all times.

Isolating the problem space

Whenever you face a critical problem, you should always try to divide the problem into smaller issues and try to divide it among your team members. If your team has only one member, you can still divide the problem into smaller ones. This approach does not only enable you to solve the problem quickly but also engages your team members to concentrate on different areas of the problem. Obviously, you should avoid working on the same problem that your other team members are working on. Thus, you should always make sure you have divided the problem space appropriately.

Documenting and keeping track of changes

You should always encourage your team members to log all their problems, their solutions, and the steps that were taken to reach to the solutions. You could centralize such information using a Knowledgebase or a local Wiki within your organization. Once you have your Knowledgebase in place with records of problems and their troubleshooting solutions, you can start testing the solutions. This will assure you that the solutions in your knowledge base are robust and well tested. You can use some kind of document version control so that as the problem evolves, your documentation can keep track of all of these changes.

When you are working in a data center, where you need to work together with other members of a team, this documentation process enables the entire team to solve the problem more easily. If you document the solutions in your organization, you truly enable your junior team members to learn new things and solve problems without involving senior team members.

Troubleshooting with power tools

In VMware vSphere troubleshooting, we will discuss and troubleshoot problems with different vSphere hosts, virtual machines, and vCenter Server. In simple walkthroughs, we will identify the problems and fix those problems by applying our knowledge. You will see how to isolate vSphere-related technical issues and how to apply troubleshooting techniques to those issues. We will discuss different VMware power tools to mange a vSphere infrastructure in centralized way, which includes VMware vSphere Management Assistant (vMA), EXCLI, vSphere PowerCLI, ESXTOP, resxotop, performance monitoring charts, and many other tools. These tools will be introduced step by step in the upcoming chapters.