VMware vRealize Operations Essentials

By: Matthew Steiner

Overview of this book

This book will enable you to deliver on the operational disciplines of Performance, Health, Capacity, Configuration, and Compliance by making the best use of the solutions provided by vRealize Operations. Starting with architecture, design, and sizing, we will ensure your implementation of vRealize Operations is a success. We will then dive into using the solution to manage your vSphere infrastructure. Next, we will employ the out-of-the-box Dashboards and the very powerful Views and Reporting functionality of vRealize Operations to create your custom dashboards and address your reporting requirements. After that, we go through the Alerting framework and how Symptoms, Recommendations, and Actions are used to achieve efficient operations. Later, you will master the topic of Capacity Planning, where we look at how important it is to craft appropriate policies to match your requirements, and we'll consider your attitude toward capacity risk, which will help you build future project requirements into your capacity plans. Finally, we will look at extending the solution to manage Storage, Applications, and other IT infrastructures using Management Packs from Solution Exchange, as well as how the solution can be enhanced with the integration of Log Insight.
Table of Contents (18 chapters)
VMware vRealize Operations Essentials
Credits
Foreword
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Operational disciplines addressed by vRealize Operations


As the name of the solution suggests, vRealize Operations is an operational management solution. It has been designed to address the operational disciplines of Performance, Capacity, Configuration, and Compliance.

Each of these can be thought of as being related and acting in concert with each other. Together they define the level of availability achieved by the infrastructure being managed, and whether the Service Level Agreements (SLAs) in place between the business and the IT department are being met.

For example, if there is insufficient capacity in a cluster, the performance of VMs in that cluster may deteriorate, and the service or application that these VMs support may become unavailable.

vRealize Operations uses a variety of features such as content, alerts, symptoms, management packs, and reporting to provide the required visibility and control of the infrastructure, and deliver on these operational disciplines. Let's look at them in more detail.

Performance

vRealize Operations Manager monitors the performance of managed systems and provides system administrators with a set of intuitive dashboards that let them quickly visualize problems and issues as they arise. When the performance of the systems is not as expected, the solution helps with troubleshooting by directing the administrator quickly to the root cause of the problem. This is all underpinned by analytics and content.

Analytics

Every five minutes, vRealize Operations collects and stores the metric and property data about every resource it manages. The data is kept for six months at full granularity and is used by the Analytics engine to allow the system to understand normal behavior.

Note

The data collection frequency and the retention period can both be tuned from their defaults of 5 minutes and 6 months respectively. However, care must be taken when changing them, as they can significantly affect the sizing requirements of the vRealize Operations nodes.
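To get a feel for why these two settings matter for sizing, the following minimal sketch counts the data points retained for a single metric. It is not a vRealize Operations sizing tool: the function name is hypothetical, six months is approximated as 180 days, and real sizing depends on more than the raw sample count.

```python
def samples_per_metric(interval_minutes: int, retention_days: int) -> int:
    """Data points kept for one metric over the whole retention window."""
    samples_per_day = (24 * 60) // interval_minutes
    return samples_per_day * retention_days


# Defaults from the text: 5-minute collection, roughly 6 months (assumed 180 days).
default_points = samples_per_metric(5, 180)

# Hypothetical tuning: collecting every minute instead of every five.
one_minute_points = samples_per_metric(1, 180)

print(default_points)                       # 288 samples/day * 180 days
print(one_minute_points // default_points)  # growth factor per metric
```

Multiplying the per-metric figure by the number of metrics per object and the number of managed objects gives a rough idea of how quickly the stored dataset grows when either setting is changed.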

Every night, a set of analytics algorithms is run against every metric's historical dataset to determine the expected behavior of each metric for the upcoming 24 hours. This expected behavior for a metric is called a Dynamic Threshold (DT). As metrics are collected and stored, they are compared against the DT to determine whether the object is exhibiting normal behavior. This is illustrated in Figure 1.1.

The analytics are designed to look for different patterns of behavior, such as hourly, daily, weekly, monthly, and quarterly.

It will obviously take some time for vRealize Operations to learn all the expected behavior, as it needs to observe at least three data points to start seeing a trend, and many more to predict the trend with greater confidence. For example, a metric exhibiting a weekly cadence of behavior requires at least three weeks of data for a weekly trend to be detected.

Figure 1.1

The preceding simplified example shows how a DT and metric may be measured and tracked. The grey shading is the DT, and the diagram shows that during the early morning it is expecting this metric's value to be 0-10%, then 50-60% during the work day, and then back down to 0-10% for the evening. There is a short peak just before midnight, which is possibly a batch or a backup job. The black line is the observed metric and we can see that normal behavior has occurred; so in this case, there is no alerting to be done as the metric is operating normally.

If an observed metric deviates outside of the DT range, it is classed as an Anomaly and highlighted in yellow in the Metric Selectors and the associated Metric Graphs in the vRealize Operations dashboards.
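The comparison of an observed value against its DT can be sketched as follows. This is an illustration only: the text does not describe the actual analytics algorithms, so a simple mean plus or minus k standard deviations per hour of day is used here as a stand-in, and the function names and the value of k are assumptions.

```python
from statistics import mean, pstdev


def hourly_band(history, k=2.0):
    """Build a simplified threshold band per hour of day.

    history maps an hour (0-23) to the values observed at that hour on
    previous days. The band is mean +/- k standard deviations, which is
    a stand-in for the product's real algorithms.
    """
    band = {}
    for hour, values in history.items():
        centre, spread = mean(values), pstdev(values)
        band[hour] = (centre - k * spread, centre + k * spread)
    return band


def is_anomaly(band, hour, value):
    """True when the observed value falls outside the band for that hour."""
    low, high = band[hour]
    return not (low <= value <= high)


# Three earlier observations per hour, echoing the three-data-point minimum.
history = {3: [4.0, 5.0, 6.0], 11: [52.0, 55.0, 58.0]}
band = hourly_band(history)

print(is_anomaly(band, 11, 56.0))  # work-day load inside its band
print(is_anomaly(band, 3, 40.0))   # unexpected early-morning spike
```

The same 40% reading would be unremarkable at 11:00 but is flagged at 03:00, which is the point of a threshold that follows the metric's learned pattern rather than a fixed value.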

The number of anomalies observed over time is also recorded for every object, and vRealize Operations uses these derived metrics to determine whether the number of anomalies being observed is significant and whether an alert should be generated.

Performance or availability problems are generally caused by something different happening with the resources within an environment, and this "something different" causes associated metrics to breach their DTs. This means that the majority of alerts that are performance or metric related will only be generated when abnormal behavior occurs. This dramatically reduces the number of alerts that IT operations receive and increases the quality of those alerts.

Content

The content baked into vRealize Operations is how the solution creates the intelligent and meaningful alerts. There is a lot of content provided by the solution and much more content will be added with the installation of Management Packs. Custom content can also be created very easily and will be described in Chapter 5, Alerts, Symptoms, Recommendations, and Actions.

An example of one of the out-of-the-box content alerts, and how it is constructed, is as follows:

  • Symptom(s): They are descriptions of one or more conditions under which the alert is triggered. In the preceding example, the symptoms are that a VM is swapping to disk, has high ballooning or has memory compressed, and has high memory contention.

  • Recommendations: They are remediation action(s) that can be taken to resolve the symptoms. In the preceding case, the action may be to add a memory reservation or initiate a vMotion to migrate some VMs to another host or cluster with more capacity.

  • Actions: They act on the recommendation(s). vRealize Operations has the capability to initiate actions using Python scripts or vRealize Orchestrator workflows to carry out the recommendations. In the preceding example, the out-of-the-box Python script can be used to set a memory reservation, or vRealize Orchestrator can be used to initiate a vMotion.
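The relationship between an alert, its symptoms, and its recommendations can be modelled with a small data structure. This is a sketch, not the vRealize Operations content format: the class names, the metric keys, and the numeric limits are all hypothetical, and requiring every symptom to be present is an assumption made for simplicity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Symptom:
    description: str
    condition: Callable[[Metrics], bool]


@dataclass
class AlertDefinition:
    name: str
    symptoms: List[Symptom]
    recommendations: List[str] = field(default_factory=list)

    def triggered(self, metrics: Metrics) -> bool:
        # Assumption: the alert fires only when all symptoms are present.
        return all(s.condition(metrics) for s in self.symptoms)


# Metric keys and limits below are illustrative, not product values.
memory_alert = AlertDefinition(
    name="VM memory pressure (illustrative)",
    symptoms=[
        Symptom("VM is swapping to disk", lambda m: m["swap_kb"] > 0),
        Symptom("High ballooning or compressed memory",
                lambda m: m["balloon_pct"] > 10 or m["compressed_kb"] > 0),
        Symptom("High memory contention", lambda m: m["contention_pct"] > 5),
    ],
    recommendations=[
        "Add a memory reservation",
        "vMotion some VMs to a host or cluster with more capacity",
    ],
)

healthy = {"swap_kb": 0, "balloon_pct": 0, "compressed_kb": 0, "contention_pct": 1}
stressed = {"swap_kb": 2048, "balloon_pct": 25, "compressed_kb": 0, "contention_pct": 9}

print(memory_alert.triggered(healthy))
print(memory_alert.triggered(stressed))
```

Actions are left out of the sketch on purpose; in the product they are carried out by Python scripts or vRealize Orchestrator workflows, as described above.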

Dynamic thresholds and hard thresholds

Alerts based on metrics that fall outside the range of their calculated DTs can be considered fairly generic: they are caused by "things happening differently". They tend to be used to troubleshoot and alert on unexpected behavior.

As well as triggering alerts based on unexpected behavior, much of the content in vRealize Operations Manager is based on specific behavior and documented best practices. For instance, a storage administrator would generally consider storage latency to be performance-impacting when it reaches 20-30ms.

Content within vRealize Operations Manager can also include Hard Thresholds (HTs), such as a figure of 20-30ms for storage latency, which can trigger alerts regardless of the state of the DT for the given metrics.
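The difference between the two kinds of threshold can be shown with a minimal sketch. The function name and the returned labels are hypothetical; the 20ms figure is the lower end of the storage latency range mentioned in the text.

```python
def evaluate(value, dt_low, dt_high, hard_threshold=None):
    """Return the reasons, if any, why an observed value deserves attention.

    The DT check asks "is this unusual for this metric?", while the
    hard threshold check asks "is this bad regardless of what is usual?".
    """
    findings = []
    if not (dt_low <= value <= dt_high):
        findings.append("anomaly")
    if hard_threshold is not None and value >= hard_threshold:
        findings.append("hard_threshold")
    return findings


# A datastore that is always slow: 25ms is "normal" for it, yet still a problem.
print(evaluate(25.0, dt_low=20.0, dt_high=30.0, hard_threshold=20.0))

# A datastore that is usually fast: 12ms is unusual, but below the hard limit.
print(evaluate(12.0, dt_low=2.0, dt_high=8.0, hard_threshold=20.0))
```

The first case is the reason HTs exist: a metric that is consistently poor never breaches its own DT, so only a fixed limit will catch it.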

Content and alerts will be covered in much more depth in Chapter 5, Alerts, Symptoms, Recommendations, and Actions.

Capacity

Capacity management is one of the most important disciplines in IT Operations. Unfortunately, as virtualization has matured, traditional capacity management techniques have tended not to keep up with the technology. My experience of working with clients with mature virtualized environments and outdated capacity management practices is that they find themselves with a lot of underutilized infrastructure, resulting in a lot of wasted resources.

vRealize Operations Manager has a very rich capacity engine, which will help with this, illustrating capacity utilization in two main ways:

  • Capacity remaining: Taking into account the reserved capacity for vSphere HA and the headroom buffers, it answers the question "how much capacity does a given resource have remaining?" In the preceding screenshot, we can see that we have enough capacity available in this resource to support a further 32 average-sized virtual machines.

  • Time remaining: Again, taking into account the reserved capacity for vSphere HA and the headroom buffers, it answers the question "when am I going to run out of capacity?" In the following screenshot, we can see that the capacity for this resource is going to run out in 87 days and that CPU is the constrained resource.
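The arithmetic behind these two views can be sketched as follows. This is a simplification under stated assumptions: the HA reserve and the buffer are simply subtracted from total capacity, growth is treated as linear, and all function names and input figures are invented for illustration rather than taken from the capacity engine.

```python
import math


def capacity_remaining(total, ha_reserve, buffer, used):
    """Capacity left after the HA reserve, headroom buffer, and current usage."""
    return max(total - ha_reserve - buffer - used, 0.0)


def vms_remaining(remaining, avg_vm_demand):
    """How many more average-sized VMs fit into the remaining capacity."""
    return math.floor(remaining / avg_vm_demand)


def days_remaining(remaining, growth_per_day):
    """Days until capacity is exhausted, assuming a linear growth trend."""
    if growth_per_day <= 0:
        return math.inf  # flat or shrinking demand never runs out
    return remaining / growth_per_day


# Illustrative CPU figures in GHz; none of these come from the product.
left = capacity_remaining(total=200.0, ha_reserve=50.0, buffer=20.0, used=66.0)

print(left)                                     # GHz still available
print(vms_remaining(left, avg_vm_demand=2.0))   # further average-sized VMs
print(days_remaining(left, growth_per_day=0.5)) # days until exhaustion
```

Running the same calculation for each resource type (CPU, memory, disk) and taking the smallest result identifies the constrained resource.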

Capacity models

Every object or resource in vRealize Operations can have a capacity model configured against it. This describes the metric(s) used to determine the capacity and the other factors or constraints to be considered, such as vSphere HA. The models themselves are not configurable; however, how they are applied is configurable, and this is managed within the policies section of vRealize Operations.

Note

Many of the VMware and third-party Management Packs have capacity models associated with the resources they are managing. The documentation for these Management Packs usually gives the administrator details of how the capacity of a given object type is calculated.

The policies governing the capacity management in vRealize Operations are very granular and controllable. This allows the administrator to define what combination of demand or allocation capacity policies are applied against specific resources or groups of resources. This will be covered in detail in Chapter 6, Capacity Planning and Capacity Projects.

Capacity projects

As well as understanding the current capacity and the time remaining, many organizations will have ongoing projects that are going to add planned workload or additional hardware to their infrastructure. A new feature, Capacity Projects, introduced in vRealize Operations 6.0, allows the administrator to define these forecasted changes in the workload or resources, and assign a date against them.

The effect on capacity and the time remaining can then be visualized and any capacity shortfalls identified. The projects can be subsequently committed and they will then be reflected in the real-time capacity reporting.

For example, if an infrastructure has the capacity for a further 50 average-sized VMs, and a project is planned to implement 20 average-sized VMs, the capacity dashboards, badges, and reports will all change to reflect that there is now only capacity for 30 average-sized VMs.
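The effect of dated projects on remaining capacity can be sketched in a few lines. The class and function names are hypothetical, the dates are invented, and everything is expressed in units of average-sized VMs for simplicity; this is not how the product stores projects.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class CapacityProject:
    name: str
    start: date
    vm_delta: int  # positive: planned workload; negative: added hardware


def vms_remaining_on(current_remaining: int,
                     projects: List[CapacityProject],
                     on: date) -> int:
    """Remaining capacity on a given date, once started projects are applied."""
    planned = sum(p.vm_delta for p in projects if p.start <= on)
    return current_remaining - planned


projects = [CapacityProject("New application rollout", date(2015, 6, 1), 20)]

print(vms_remaining_on(50, projects, date(2015, 5, 1)))  # before the project
print(vms_remaining_on(50, projects, date(2015, 7, 1)))  # after the project
```

A hardware purchase would be modelled the same way with a negative `vm_delta`, which increases the remaining capacity from its start date.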

Configuration and compliance

The final operational disciplines being addressed are configuration and compliance. Misconfiguration of systems is the root cause of a large proportion of system outages, so ensuring that all your systems are configured the way you want them to be is one of the key weapons in ensuring uptime.

As well as ensuring uptime, there may be legal and regulatory reasons, such as PCI-DSS, for the systems to be configured in a certain way. Alternatively, there may be security or hardening standards that an organization's security department deems essential to ensure that the integrity of the systems is maintained. Both of these would be classed as compliance requirements.

For in-depth configuration and compliance, vCenter Configuration Manager is provided as part of vRealize Operations Advanced and Enterprise editions. However, the use of vCenter Configuration Manager is not covered in this book.

With the release of vRealize Operations 6.0, some configuration and compliance capabilities have been introduced into the vRealize Operations Manager platform. As well as collecting metrics, vRealize Operations now collects properties from the ESXi hosts and the VM containers.

These properties can be used to assess the configuration posture of the ESXi hosts and the VM containers, using the Alerts, Symptoms, Recommendations, and Actions framework.

Content has been created that reflects the vSphere Hardening Guidelines, which means that, out of the box, vRealize Operations can now report on how compliant the ESXi hosts and the VM containers are against these guidelines. The reporting is available through the alerts, views, and reports functionality, and also via the Compliance badge in the vRealize Operations dashboards.
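Conceptually, a compliance check compares the collected properties of an object against the values a guideline expects. The sketch below illustrates that idea only: the property names and expected values are placeholders and are not taken from the vSphere Hardening Guidelines or from the product's content.

```python
def non_compliant(properties, rules):
    """Return {property: (expected, actual)} for every rule that is not met.

    A property missing from the collected data is reported with an
    actual value of None.
    """
    return {
        key: (expected, properties.get(key))
        for key, expected in rules.items()
        if properties.get(key) != expected
    }


# Placeholder rules and collected properties for a single host.
rules = {"ssh_service_running": False, "ntp_configured": True}
host_properties = {"ssh_service_running": True, "ntp_configured": True}

violations = non_compliant(host_properties, rules)
print(violations)
print("compliant" if not violations else "non-compliant")
```

Each violation found this way corresponds to a symptom, which is how property checks fit into the same Alerts, Symptoms, Recommendations, and Actions framework used for metrics.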

vSphere Hardening Guidelines will be covered in Chapter 5, Alerts, Symptoms, Recommendations, and Actions.