Defining high availability and resilience


Before we delve into how to make Exchange 2013 highly available, it is important to understand the difference between a highly available solution and a resilient one.

Availability

According to the Oxford English Dictionary, available means "able to be used or obtained; at someone's disposal". From an Exchange perspective, we can interpret availability as the proportion of time that Exchange is accessible to users during normal operations, planned maintenance, and unplanned outages. In simple terms, we are trying to provide service availability, that is, to keep the messaging service running and available to users. Remember that uptime and availability are not synonymous: Exchange can be up and running but not available to users, as in the case of a network outage.

The availability of any IT system is often measured as a percentage; more specifically, by the number of nines in that percentage (the "class of nines"). The higher the percentage, the higher the availability of the system. As an example, when the business states that the organization's target is 99.9 percent Exchange availability, this is referred to as three nines, or class three. And 99.9 percent sounds excellent, right? Actually, it depends on the organization itself and on its requirements and goals. Looking at the following table, we can see that 99.9 percent availability means that Exchange may actually be down for almost 9 hours in a year, or 10 minutes every week on average. While this might seem acceptable, imagine if that weekly period of downtime happened during peak utilization hours. The following table gives an overview of the approximate downtime for different levels of availability, starting at 90 percent:

Availability (%)           | Downtime per year (365d) | Downtime per month (30d) | Downtime per week
---------------------------|--------------------------|--------------------------|------------------
90 percent (1 nine)        | 36.50 days               | 3.00 days                | 16.80 hours
95 percent                 | 18.25 days               | 36.00 hours              | 8.40 hours
99 percent (2 nines)       | 3.65 days                | 7.20 hours               | 1.68 hours
99.5 percent               | 1.82 days                | 3.60 hours               | 50.40 minutes
99.9 percent (3 nines)     | 8.76 hours               | 43.20 minutes            | 10.08 minutes
99.95 percent              | 4.38 hours               | 21.60 minutes            | 5.04 minutes
99.99 percent (4 nines)    | 52.56 minutes            | 4.32 minutes             | 1.01 minutes
99.999 percent (5 nines)   | 5.26 minutes             | 25.92 seconds            | 6.05 seconds
99.9999 percent (6 nines)  | 31.54 seconds            | 2.59 seconds             | 0.60 seconds
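The figures in this table follow from simple arithmetic: multiply the length of the period by the permitted fraction of unavailability. The following Python sketch, included purely as an illustration (it is not part of any Exchange tooling), reproduces the table's rows:

```python
# Compute the downtime budget implied by an availability percentage.

PERIODS = {
    "per year (365d)": 365 * 24 * 3600,  # seconds in the period
    "per month (30d)": 30 * 24 * 3600,
    "per week": 7 * 24 * 3600,
}

def downtime_budget(availability_percent: float) -> dict:
    """Return the allowed downtime, in seconds, for each period."""
    unavailable = 1 - availability_percent / 100
    return {period: seconds * unavailable for period, seconds in PERIODS.items()}

# Example: three nines allows roughly 8.76 hours of downtime per year.
for period, seconds in downtime_budget(99.9).items():
    print(f"{period}: {seconds / 3600:.2f} hours")
```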

While a typical user would probably be content with an availability of 99.9 percent, users in a financial institution may expect, or even demand, better than 99.99 percent. High levels of availability do not happen naturally or by chance; they are the result of excellent planning, design, and maintenance.

The ideal environment for any Exchange administrator is obviously one that is capable of achieving the highest level of availability possible. However, the higher the level of availability one tries to achieve, the higher the cost and complexity of the requirements that guarantee those extra few minutes or hours of availability.

Furthermore, how does one measure the availability of an Exchange environment? Is it by counting the minutes for which users were unable to access their mailboxes? What if only a subset of the user population was affected? Unfortunately, how availability is measured changes from organization to organization, and sometimes even from administrator to administrator, depending on how it is interpreted. An Exchange environment that has been up for an entire year might have been unavailable to users because of a network failure that lasted 8 hours. Users, and possibly the business, will regard Exchange as having been unavailable, while its administrators may still claim 100 percent availability. If we take the true definition of availability, Exchange was only approximately 99.9 percent available. But is this fair to the Exchange administrator? After all, Exchange was unavailable not because of an issue with Exchange itself, but because of the network.

The use of "nines" has been questioned a few times because it does not reflect the impact of unavailability according to when it occurs. If, in an entire year, Exchange was unavailable only for 50 minutes, on Christmas day at 3 A.M. when no one tried to access it, should its availability be quantified as 99.99 percent or as 100 percent?

The definition of availability must be properly established and agreed upon. It also needs to be accurately measured, ideally with powerful monitoring tools, such as Microsoft System Center Operations Manager, that are themselves highly available. Only when everyone agrees on a shared interpretation and defines how to measure availability accurately will it actually be useful to the business.
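To make the measurement question concrete, here is a hypothetical sketch (no Exchange API is involved; the outage figures come from the examples above, and the decision about which outages count is an assumption each organization must agree on):

```python
# Measure availability as the percentage of a period not lost to outages.

def measured_availability(period_hours: float, outage_hours: list[float]) -> float:
    """Availability percentage over a period, given a list of outage durations."""
    downtime = sum(outage_hours)
    return 100 * (period_hours - downtime) / period_hours

year = 365 * 24

# Counting the 8-hour network failure from the example above:
print(f"{measured_availability(year, [8]):.3f} percent")  # ~99.909 percent

# If the business agrees not to count outages that affected no one
# (such as 50 minutes at 3 A.M. on Christmas day), the same year
# yields a different, equally defensible, figure:
print(f"{measured_availability(year, []):.3f} percent")   # 100.000 percent
```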

The level of availability that the business expects from Exchange will not be expressed simply as, for example, 99.9 percent. It will be part of a Service Level Agreement (SLA), which is one of the few ways of ensuring that Exchange meets the business objectives. SLAs differ for every organization, and there is no established process for defining one for Exchange. Typically, Exchange SLAs cover five categories (a simple sketch for checking such targets is shown after this list):

  • Performance: An SLA of this category pertains to the delivery and speed of e-mails. An example would be that 90 percent of all e-mails are to be delivered within 10 minutes. If desired, the SLA might also set a target for the remaining 10 percent.

  • Availability: An SLA of this category establishes the level of availability of Exchange to the end users using the "class of nines" that we discussed previously.

  • Disaster Recovery: An SLA of this category defines how long it should take to recover data or restore a service when a disaster occurs. These SLAs typically focus on the service recovery time as well as on more specific targets such as a single server or a mailbox. To help establish these SLAs, two other elements of business continuity are used:

    • Recovery Time Objective (RTO): This element establishes the duration of time in which Exchange must be restored after a disaster. For example, Exchange must be made available within 4 hours in the secondary datacenter if a major incident happens in the primary datacenter.

    • Recovery Point Objective (RPO): This element establishes the maximum tolerable period of data loss from Exchange due to a major incident. For example, in case of a major incident, no more than 1 hour of data can be lost. In environments where a secondary datacenter is used for disaster recovery, the RPO can be defined as the time taken for data to be replicated to the secondary datacenter; if a disaster strikes and the primary datacenter is unrecoverable, any data written during that window could be lost.

  • Security: An SLA of this category generally includes assurances regarding malware-detection rate, encryption performance, data at rest and in transit, e-mail-scanning time, and physical security of servers and the datacenter(s) where these are located.

  • Management: An SLA of this category helps ensure that the messaging solution meets both user and maintenance requirements.
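As noted above, the following sketch shows one way to record availability, RTO, and RPO targets and check measured values against them. It is illustrative only; the class, field names, and figures are assumptions for this example, not part of Exchange or of any formal SLA standard:

```python
# Record SLA targets and check measured values against them.

from dataclasses import dataclass

@dataclass
class ExchangeSla:
    availability_percent: float  # e.g., 99.9 ("three nines")
    rto_hours: float             # maximum time to restore service
    rpo_hours: float             # maximum tolerable data loss

    def is_met(self, measured_availability: float,
               recovery_took_hours: float, data_lost_hours: float) -> bool:
        return (measured_availability >= self.availability_percent
                and recovery_took_hours <= self.rto_hours
                and data_lost_hours <= self.rpo_hours)

# The example targets from the text: 99.9 percent availability, failover
# to the secondary datacenter within 4 hours, at most 1 hour of lost data.
sla = ExchangeSla(availability_percent=99.9, rto_hours=4, rpo_hours=1)
print(sla.is_met(measured_availability=99.95,
                 recovery_took_hours=3.5, data_lost_hours=0.5))  # True
```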

Translating an SLA document into practice requires administrators to be suitably skilled and to have the infrastructure and tools necessary to achieve it. After SLAs have been planned, developed, and deployed, they must be reviewed periodically to ensure that they are being met and are achieving the desired results. It is extremely important to ensure that SLAs remain cost-effective and realistic.

Resilience

According to the Oxford English Dictionary, the adjective resilient means "able to withstand or recover quickly from difficult conditions".

Resilience, or resiliency as it is sometimes called, is the ability to provide a satisfactory level of service in the face of faults and challenges to normal operation. More specifically, it is the ability of a server, a network, or an entire datacenter to recover quickly and continue operating normally during a disruption.

Resilience is usually achieved by installing additional equipment (redundancy) together with careful design to eliminate single points of failure (deploying multiple Hub Transport servers, for example) and well-planned maintenance. Although adding redundant equipment might be straightforward, it can be expensive and, as such, should be done only after considering its costs versus its benefits.

A typical example: when a server with a single power supply loses that power supply, the server fails and its services become unavailable until they are restored on another suitable server or the server itself is repaired. If the same server had a redundant power supply, the second supply would keep the server running while the failed one was being replaced.

A resilient network infrastructure, for example, is expected to continue operating at or above the minimum service levels, even during localized failures, disruptions, or attacks. Continuing operation, in this example, refers to the service provided by the communications infrastructure. If the routing infrastructure is capable of maintaining its core purpose of routing packets despite local failures or attacks, it is said to be robust or resilient.

The same concept holds true from the server level all the way up to the datacenter facilities. Datacenter resilience is typically guaranteed by using redundant components and/or facilities. When an element experiences a disruption (or fails), its redundant counterpart seamlessly takes over and continues to provide services to the users. For example, datacenters are usually powered by two independent utility feeds from different providers, so that a backup provider is available in case the other fails. If one is to design a resilient Exchange environment that spans multiple datacenters, no detail should be overlooked.

Let us briefly throw another term into the mix: reliability, the probability that a component or system will perform for an anticipated period of time without failing. Reliability in itself does not account for any repairs that may take place, only for the time it takes the component (or system) to fail while in operation.

Reliability is also an important notion because maintaining a high level of availability with unreliable equipment is unrealistic, as it would require too much effort and a large stock of spares and redundant equipment. A resilient design takes into consideration the reliability of equipment in a redundant topology.

Taking storage as an example, the advertised reliability of the disks used in a storage array might influence the choice between RAID 1, RAID 5, RAID 6, or even RAID 10.
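The value of redundancy can be quantified with the standard series/parallel availability formulas: a set of redundant components fails only when all of them fail, while a chain of dependent components fails when any one of them does. The following sketch uses illustrative figures, not vendor data:

```python
# Standard series/parallel availability formulas.

def parallel(*availabilities: float) -> float:
    """Redundant components: the system fails only if all of them fail."""
    failure = 1.0
    for a in availabilities:
        failure *= (1 - a)
    return 1 - failure

def series(*availabilities: float) -> float:
    """Dependent components: the system fails if any one of them fails."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# A single power supply at 99 percent vs. a redundant pair:
print(f"{parallel(0.99, 0.99):.4%}")         # 99.9900%
# A server that depends on power, network, and storage in series:
print(f"{series(0.999, 0.999, 0.999):.4%}")  # 99.7003%
```

Note how two 99 percent power supplies in parallel already reach four nines, while chaining three 99.9 percent dependencies drops the overall system below three nines.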

Note

Designing and implementing a highly available and resilient Exchange 2013 environment is the sole purpose of this book. Although the main focus will be on the Exchange application layer, technologies such as Active Directory, DNS, and virtualization are also covered in some detail in the final chapter.