A virtual data center versus a physical data center


Having covered the SDDC in some depth, we can now summarize the key differences between a physical data center and a virtual one. To highlight the differences, this comparison assumes the physical data center is 0 percent virtualized and the virtual data center is 100 percent virtualized. For the virtual data center, I also assume you have adjusted your operations, because running a virtual data center with a physical-operations mindset results in a lot of frustration and suboptimal virtualization. This means your processes and organization chart have been adapted to the virtual data center.

Data center

The following table summarizes the differences between a physical and a virtual data center:

Physical data center: Bounded by one physical site. Data center migration is a major and expensive project.
Virtual data center: Not bound to any physical site. Multiple virtual data centers can exist in one physical data center, and a single virtual data center can span multiple physical data centers. The entire data center can be replicated and migrated.

Server

The following table summarizes servers in physical and virtual data centers:

Physical data center: 1,000 physical servers (just an example, so we can make a comparison).
Virtual data center: Perhaps 2,000 VMs. The number of VMs is higher for several reasons: VM sprawl; a physical server tends to run multiple applications or instances whereas a VM runs only one; and DR is much easier, so more VMs are protected.

Physical data center: Growth is relatively static and predictable, and normally it goes only one way (adding more servers).
Virtual data center: The number of VMs can go up and down due to dynamic provisioning.

Physical data center: Downtime for hardware maintenance or a technology refresh is routine in a large environment, as components fail regularly.
Virtual data center: Planned downtime is eliminated with vMotion and Storage vMotion.

Physical data center: 5 to 10 percent average CPU utilization, especially on CPUs with high core counts.
Virtual data center: 50 to 80 percent utilization for both the VMs and the ESXi hosts.

Physical data center: Racks of physical boxes, often with a top-of-rack access switch and UPS. The data center is a large consumer of power.
Virtual data center: Rack space requirements shrink drastically as servers are consolidated and the infrastructure is converged. There is a drastic reduction in space and power.

Physical data center: Low complexity. Lots of repetitive and coordination work, but not a lot of expertise required.
Virtual data center: High complexity. A lower volume of work, but deep expertise is required. Far fewer people, but each one is an expert.

Physical data center: Availability and performance are monitored by management tools, which normally use agents. It is typical for a server to have many agents.
Virtual data center: Availability and performance monitoring happens via vCenter and is agentless for the infrastructure. All other management tools get their data from vCenter, not from individual ESXi hosts or VMs (a sketch of this follows the table). Application-level monitoring is typically still done with agents.

Physical data center: The word cluster means two servers joined with a heartbeat and shared storage, which is typically SAN.
Virtual data center: The word cluster has a very different meaning: a group of ESXi hosts sharing a workload, normally 8 to 12 hosts rather than 2.

Physical data center: High Availability (HA) is provided by clusterware such as MSCS or Veritas. Every cluster pair needs shared storage, typically SAN. Typically, one service needs two physical servers with a physical network heartbeat; hence, most servers are not clustered, as the cost and complexity are high.
Virtual data center: HA is provided by vSphere HA, including service monitoring via Application HA. All VMs are protected, not just a small percentage. The need for traditional clustering software drops significantly, and a new kind of clustering tool develops: clustering tools built for VMware integrate with vSphere and use the vSphere API.

Physical data center: Fault tolerance is rarely used due to cost and complexity. You need specialized hardware, such as Stratus ftServer.
Virtual data center: Fault tolerance is an on-demand feature, as it is software-based. For example, you can turn it on temporarily while batch jobs run.

Physical data center: Antivirus is installed on every server. Management is harder in a large environment.
Virtual data center: Antivirus runs at the hypervisor level. It is agentless and hence no longer visible to malware.
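
To make the agentless monitoring point concrete, here is a minimal pyvmomi (Python) sketch that pulls CPU utilization for every ESXi host and VM from vCenter alone, with no agent inside any guest. The vCenter hostname and the read-only account are placeholders; adapt them to your environment.

```python
# Minimal pyvmomi sketch: read CPU utilization for all hosts and VMs from
# vCenter alone (agentless for the infrastructure, no per-guest agents).
# The hostname and credentials below are placeholders.
import atexit
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE  # lab only; verify certificates in production

si = SmartConnect(host="vcenter.example.com",
                  user="readonly@vsphere.local",
                  pwd="***",
                  sslContext=ctx)
atexit.register(Disconnect, si)

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem, vim.VirtualMachine], True)

for obj in view.view:
    stats = obj.summary.quickStats
    if isinstance(obj, vim.HostSystem):
        hw = obj.summary.hardware
        capacity_mhz = hw.cpuMhz * hw.numCpuCores
        pct = 100.0 * stats.overallCpuUsage / capacity_mhz
        print(f"ESXi {obj.name}: {pct:.0f}% CPU")
    else:
        print(f"VM   {obj.name}: {stats.overallCpuUsage or 0} MHz CPU demand")

view.DestroyView()
```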

Storage

The following table summarizes storage in physical and virtual data centers:

Physical data center: 1,000 physical servers (just an example, so we can make a comparison), where IOPS and capacity do not impact each other. A relatively static environment from a storage point of view, because normally only 10 percent of these machines are on SAN/NAS due to cost.
Virtual data center: Up to 2,000 interdependent VMs, which impact one another. A very dynamic environment where management becomes critical, because almost all VMs are on shared storage, including distributed storage.

Physical data center: Every server on the SAN has its own dedicated LUN. Some servers, such as database servers, may have multiple LUNs.
Virtual data center: Most VMs do not use RDM. They use VMDK files and share VMFS or NFS datastores. The VMDK files of one VM may reside on different datastores.

Physical data center: Storage migration means major downtime, even within the same array. A lot of manual work is required.
Virtual data center: Storage migration is live with Storage vMotion (a sketch follows this table). Intra-array migration is faster thanks to the VAAI API.

Physical data center: Backup, especially in the x64 architecture, is done with backup agents. As SAN is relatively expensive and SAN boot is complex at scale, backup is done via the backup LAN with an agent installed. This creates its own problems, as the backup agents have to be deployed, patched, upgraded, and managed. The backup process also creates high disk I/O, impacting application performance. Because it is network intensive and carries sensitive data, an entire network is born just for backup purposes.
Virtual data center: The backup service is provided by the hypervisor. It is LAN-free and agentless. Most backup software uses the VMware VADP API to back up VMs. Granted, this does not apply to databases or other applications, but it is good enough for 90 percent of the VM population. Because the backup is performed outside the VM, there is no performance impact on the application or guest OS. There is also no security risk, as the guest OS administrator cannot see the backup network.

Physical data center: Storage QoS is handled by the array, although the array has no control over the IOPS demand coming from the servers.
Virtual data center: Storage QoS is handled by vSphere, which has full control over every VM.
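
As an illustration of how live storage migration is driven through the vSphere API, here is a minimal pyvmomi sketch that Storage vMotions a running VM to another datastore. The VM name, datastore name, and credentials are placeholders, and the unverified SSL context is for lab use only.

```python
# Minimal pyvmomi sketch of a live storage migration (Storage vMotion):
# move a running VM's disks to another datastore with no downtime.
# VM name, datastore name, and vCenter credentials are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

def find_by_name(content, vimtype, name):
    """Return the first managed object of the given type with the given name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.DestroyView()

ctx = ssl._create_unverified_context()  # lab only
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
try:
    content = si.RetrieveContent()
    vm = find_by_name(content, vim.VirtualMachine, "app01")
    target_ds = find_by_name(content, vim.Datastore, "gold-datastore-02")

    # A RelocateSpec with only a datastore set means Storage vMotion (the host stays put).
    spec = vim.vm.RelocateSpec(datastore=target_ds)
    WaitForTask(vm.RelocateVM_Task(spec=spec))
    print(f"{vm.name} now resides on {target_ds.name}")
finally:
    Disconnect(si)
```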

Network

The following table summarizes the network in physical and virtual data centers:

Physical data center: The access network is typically 1 GE, as that is sufficient for most servers. Typically, it is an entry-level top-of-rack switch.
Virtual data center: The top-of-rack switch is generally replaced with an end-of-row distribution switch, as the access switch is completely virtualized. ESXi typically uses 10 GE.

Physical data center: VLANs are normally used for segregation. This results in VLAN complexity.
Virtual data center: VLANs are not required for segregation with NSX (even traffic within the same VLAN can be blocked).

Physical data center: Impacted by Spanning Tree.
Virtual data center: No Spanning Tree.

Physical data center: The switch must learn the MAC addresses, as they come with the servers.
Virtual data center: There is no need to learn MAC addresses, as they are assigned by vSphere.

Physical data center: Network QoS is provided by core switches.
Virtual data center: Network QoS is provided by vSphere and NSX.

Physical data center: The DMZ is physically separate. Separation is done at the IP layer. IDS/IPS deployment is normally limited to the DMZ due to cost and complexity.
Virtual data center: The DMZ is logically separate. Separation is not limited to IP and is done at the hypervisor layer. IDS/IPS can be deployed in all zones, as it is also hypervisor-based.

Physical data center: There is no DR test network. As a result, the same hostname cannot exist on the DR site, making a true DR test impossible without shutting down production servers.
Virtual data center: A DR test network is used. As a result, the same hostname can exist on any site, which means a DR test can be done at any time without impacting production.

Physical data center: The firewall is not part of the server. It is typically centrally located, and it is not aware of the servers, as it is completely independent of them.
Virtual data center: The firewall becomes a built-in property of the VM. The rules follow the VM: when a VM is vMotioned to another host, the rules move with it and are enforced by the hypervisor.

Physical data center: The firewall scales vertically and independently of the workload (the demand from the servers). This makes sizing difficult. IT ends up buying the biggest firewall it can afford, which increases the cost.
Virtual data center: The firewall scales horizontally. It grows with demand, since it is deployed as part of the hypervisor (using NSX). The upfront cost is lower, as there is no need to buy a pair of high-end firewalls up front.

Physical data center: Traffic has to be deliberately directed to the firewall. Without that, the traffic "escapes" the firewall.
Virtual data center: All traffic passes through the firewall, as it is embedded in the VM and the hypervisor. Traffic cannot "escape" the firewall.

Physical data center: Firewall rules are typically based on IP addresses. Changing an IP address means changing the rules. This results in a long and complicated rule base. After a while, the firewall administrator dares not delete any rules, as the rule base becomes huge and unmanageable.
Virtual data center: Rules are not tied to IP addresses or hostnames, which makes them much simpler. For example, we can say that all VMs in the Contractor Desktop pool cannot talk to one another. That is just one rule; when a VM is added to the pool, the rule applies to it automatically (a sketch of this idea follows the table).

Physical data center: The load balancer is typically centrally located. Just like the firewall, sizing becomes difficult and the cost goes up.
Virtual data center: The load balancer is distributed. It scales with demand.
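
To show why group-based rules are so much simpler than IP-based ones, here is a small, self-contained Python model of the idea. It is deliberately not the NSX API; it only illustrates that enforcement keys off group membership, so re-IPing a VM or adding one to the pool never requires a rule change.

```python
# Toy model of group-based (NSX-style) firewall rules; not the NSX API.
# A rule references security groups; membership, not IP addresses, decides
# whether traffic is allowed, so re-IPing or adding a VM needs no rule edit.
from dataclasses import dataclass, field

@dataclass
class SecurityGroup:
    name: str
    members: set = field(default_factory=set)   # VM names, not IPs

@dataclass
class Rule:
    source: SecurityGroup
    dest: SecurityGroup
    action: str                                  # "allow" or "deny"

def evaluate(rules, src_vm, dst_vm, default="allow"):
    """Return the action of the first rule whose groups contain both VMs."""
    for rule in rules:
        if src_vm in rule.source.members and dst_vm in rule.dest.members:
            return rule.action
    return default

contractors = SecurityGroup("Contractor Desktop", {"ctr-vm-01", "ctr-vm-02"})
rules = [Rule(contractors, contractors, "deny")]   # one rule covers the whole pool

print(evaluate(rules, "ctr-vm-01", "ctr-vm-02"))   # deny
contractors.members.add("ctr-vm-03")               # new VM inherits the rule
print(evaluate(rules, "ctr-vm-03", "ctr-vm-01"))   # deny, no rule edited
print(evaluate(rules, "ctr-vm-01", "web-vm-01"))   # allow (default)
```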

Disaster Recovery

The following table summarizes Disaster Recovery (DR) in physical and virtual data centers:

Physical data center: Architecturally, DR is done on a per-application basis. Every application has its own bespoke solution.
Virtual data center: DR is provided as a service by the platform. It is one solution for all applications. This enables data center-wide DR.

Physical data center: A standby server is required on the DR site. This increases the cost. Because the standby server has to be compatible with the associated production server, it also increases complexity in a large environment.
Virtual data center: No standby server is needed. The ESXi cluster on the DR site typically runs non-production workloads, which can be suspended (hibernated) during DR. The DR site can use a different server brand and CPU.

Physical data center: DR is a manual process, relying on a manually written run book. It also requires all hands on deck. The unavailability of key IT staff when disaster strikes can impact the organization's ability to recover.
Virtual data center: The entire set of DR steps is automated. Once management decides to trigger DR, all that needs to be done is to execute the right recovery plan in VMware Site Recovery Manager. No manual intervention is required.

Physical data center: A complete DR dry run is rarely done, as it is time consuming and requires production to be down.
Virtual data center: A DR dry run can be done frequently, as it does not impact the production system. It can even be done on the day before the actual planned DR.

Physical data center: The report produced after a DR exercise is typed manually. It is not possible to prove that what is documented in the Microsoft Word or Excel document is what actually happened in the data center.
Virtual data center: The report is generated automatically, with no human intervention. It timestamps every step and records whether each one succeeded. The report can be used as audit proof.

Application

The following table summarizes applications in physical and virtual data centers:

Physical data center: Licensing is bound to the physical server. It is relatively simple to manage.
Virtual data center: Licensing is bound to an entire cluster or is per VM. It can be more expensive or cheaper, and hence it is more complex from a management point of view (a worked example follows this table).

Physical data center: All applications are supported.
Virtual data center: Most applications are supported. The ones that are not are mostly held back by outdated perceptions on the part of the ISV. As more applications are developed in virtual environments, this perception will go away.
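
The "more expensive or cheaper" point is easiest to see with numbers. The following sketch uses purely hypothetical licence prices to compare per-physical-server licensing with licensing every host in a cluster (often required when the VM is free to run on any host in it); change the instance count and the conclusion can flip.

```python
# Illustrative-only arithmetic (all prices hypothetical) showing why licensing
# a virtualized cluster can be cheaper or more expensive than per-server licensing.
price_per_physical_server = 5_000   # hypothetical licence per physical server
price_per_cluster_host    = 12_000  # hypothetical licence per ESXi host (covers all VMs on it)

app_instances    = 40               # application instances that must be licensed
hosts_in_cluster = 10               # every host is licensed if the VMs can run anywhere

physical_cost = app_instances * price_per_physical_server
virtual_cost  = hosts_in_cluster * price_per_cluster_host

print(f"Physical (per server):      ${physical_cost:,}")   # $200,000
print(f"Virtual (per cluster host): ${virtual_cost:,}")    # $120,000 - cheaper here
# With only 10 app instances the picture flips: 10 * 5,000 = $50,000 physical
# versus the same $120,000 for the cluster, unless the VMs are pinned to fewer hosts.
```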

Infrastructure team

The following table summarizes the infrastructure team in physical and virtual data centers:

Physical data center: There is a clear silo between the compute, storage, and network teams. In organizations where the IT team is big, the DR, Windows, and Linux teams may also be separate. There is also a separation between the engineering, integration (projects), and operations (business as usual) teams. These teams, in turn, need layers of management. This results in rigidity in IT.
Virtual data center: With virtualization, IT is taking the game to the next level. It is a lot more powerful than the previous architecture, and when you take the game to the next level, the enemy is also stronger: the expertise required is deeper and the experience required is more extensive. Earlier, you may have needed 10 people to manage 1,000 physical servers. With virtualization, you might only need 3 people to manage 2,000 VMs on 100 ESXi hosts. However, those 3 people have deeper expertise and more experience than the 10 people combined.