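When choosing a DR approach, organizations rely on the level of service required, as measured by two recovery objectives: the recovery time objective (RTO), which is how long the business can tolerate being down, and the recovery point objective (RPO), which is how much data loss, measured in time, the business can tolerate.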
The preceding objectives remain relevant with SharePoint, as does the amount of money the business is willing to spend. This is covered in depth in Chapter 2, Creating, Testing, and Maintaining the DR Plan, and Chapter 6, Working with Data Sizing and Data Structure.
With dedicated and shared DR models, organizations are often forced to make trade-offs between cost and speed. As the pressure to achieve high availability while reducing costs continues to increase, many organizations can no longer accept those trade-offs. A bank, for example, cannot choose a cold standby model simply because it is cheaper; the C-level executives, such as your CIO, are going to want to know why it took 4 or 5 days to recover and why data was lost, possibly costing your organization thousands of dollars. There is no set rule for this. The formula comes down to how much your organization is willing to pay and how much data loss is acceptable.
Most organizations where SharePoint is mission critical use a hot standby, which is a duplicate farm in a DR datacentre. Depending on how much downtime is acceptable to your organization and how much time you want to spend keeping both farms synchronized, you would choose one of the following options:
Have just three servers running and the rest turned off; in the case of a disaster, you would turn on the rest of the servers and add whatever solutions and patches need to be added.
Have all your servers live all the time; this is much faster but obviously more expensive.
Have all your servers live all the time and use a third-party tool, such as Metalogix Replicator (C), for real-time synchronization.
I was the lead architect for recovery.gov. They have 45 servers on the AWS cloud in one region and 45 servers in their DR region. Although all the servers are live, it is not an active-active environment; it is an active-passive environment.
In case of a disaster, they would need to fail over to their DR farm manually; this takes about 1 hour, a window that is acceptable to them. So you see, the decision is yours: what is an acceptable loss of data, and what is an acceptable amount of downtime?
While DR was originally intended for critical back-office processes, many organizations are now dependent on real-time enterprise applications like SharePoint that handle everything from their internet and intranet to their extranet, which are the primary interfaces for their clients and employees. A minute of downtime may cost them thousands of dollars.
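To put a number on an acceptable downtime window, the cost of an outage can be estimated as a simple product. The following is only a sketch: the function name and the per-minute figure of 2,000 dollars are hypothetical values chosen for illustration, not measurements from any real organization.

```python
def downtime_cost(minutes_down: float, cost_per_minute: float) -> float:
    """Estimate the cost of an outage as duration multiplied by a per-minute cost."""
    return minutes_down * cost_per_minute


if __name__ == "__main__":
    # Assumption: a 60-minute manual failover at a hypothetical 2,000 dollars per minute.
    print(downtime_cost(60, 2_000))
```

Running the same calculation for each candidate recovery window gives a figure to weigh against the cost of the faster standby option.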
Standby datacentres are required for scenarios where local redundant systems and backups cannot recover from the outage at the primary datacentre. A standby datacentre is commonly described as hot, warm, or cold, according to the time it takes to get a farm up and running in the other location. Our definitions for these farm recovery datacentres are as follows:
Cold standby: A redundancy method that involves having one system as a backup for another identical primary system that can provide availability within hours or days.
Warm standby: A redundancy method that involves having one system running in the background of an identical primary system that can provide availability within minutes or hours.
Hot standby: A redundancy method that involves having one system running simultaneously with another identical primary system that can provide availability within seconds or minutes.
Each of these standby datacentres has an associated cost to operate and maintain.
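As a rough rule of thumb, the three definitions can be read as a mapping from a target recovery time to the cheapest tier that is likely to meet it. The sketch below assumes cut-offs of one hour and one day; these thresholds and the function name are illustrative assumptions, since the definitions themselves give overlapping ranges.

```python
from datetime import timedelta

# Assumed cut-offs, chosen only for illustration: the definitions give
# overlapping ranges ("seconds or minutes", "minutes or hours", "hours or days").
HOT_LIMIT = timedelta(hours=1)
WARM_LIMIT = timedelta(days=1)


def suggest_standby(target_rto: timedelta) -> str:
    """Return the least demanding standby tier that should fit the target recovery time."""
    if target_rto < HOT_LIMIT:
        return "hot"
    if target_rto < WARM_LIMIT:
        return "warm"
    return "cold"


if __name__ == "__main__":
    for rto in (timedelta(minutes=5), timedelta(hours=4), timedelta(days=3)):
        print(rto, "->", suggest_standby(rto))
```

A real decision would also weigh acceptable data loss and budget, so treat the result as a starting point for discussion, not a recommendation.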
Cold standby DR strategy: A business ships backups to an offsite storage site regularly, and has contracts in place for emergency server rentals.
Pros:
Cons:
The slowest option to recover.
Often an expensive option to recover, because it requires that physical servers be configured correctly after a disaster has occurred.
Some datacentres do not have the SharePoint expertise in house to deploy and configure your farm, so you will need to implement a solution to facilitate this, such as Microsoft's System Center Data Protection Manager or a PowerShell script. You may still run into problems such as the hardware not being the same, which can cause all sorts of problems and delays.
Warm standby DR strategy: A business ships/uploads backups or virtual machine images to local and regional disaster recovery farms.
Pros:
Cons:
Can be very expensive and time consuming to maintain.
You pay a lot of money in storage fees. For example, if you take a backup of one of your servers and it is 90 GB in size, the virtual machine will be 90 GB in size. Multiply that by 6 or 10 servers, then add the cost of uploading that data every time you send the datacentre a new backup, not to mention the cost of having them upload those images and, of course, test them at least once a month. (Remember: if you haven't tested it and had a successful restore, it is not a good DR plan; it's a shot in the dark.)
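The storage arithmetic in the last point can be sketched in a few lines. The 90 GB image size and the 6 or 10 servers come from the example above; the 100 Mbps link speed and the function names are assumptions added for illustration, and the estimate ignores protocol overhead and compression.

```python
def total_image_size_gb(image_gb: float, servers: int) -> float:
    """Total data to store and ship when every server has an image of the same size."""
    return image_gb * servers


def upload_hours(total_gb: float, link_mbps: float) -> float:
    """Hours needed to move total_gb over a link of link_mbps megabits per second."""
    total_megabits = total_gb * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    return total_megabits / link_mbps / 3600


if __name__ == "__main__":
    for servers in (6, 10):
        size = total_image_size_gb(90, servers)
        # Assumption: a dedicated 100 Mbps uplink to the DR datacentre.
        print(servers, "servers:", size, "GB,", upload_hours(size, 100), "hours")
```

Under these assumptions, 10 servers produce 900 GB per backup cycle and take about 20 hours to upload, which is before any monthly restore test is counted.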
Hot standby DR strategy: A business runs multiple datacentres, but serves content and services through only one datacentre.
Pros:
It is often fairly fast to recover. If you are using a third-party tool, such as Metalogix Replicator (C), that can synchronize two or more distant SharePoint farms in real time, you can ensure that SharePoint content is always available and up to date. Bi-directional replication syncs all your SharePoint content: documents, sites, applications, permissions, and workflows, with full metadata, versioning, and permissions. Replicator can sync immediately after changes happen or on a regular schedule.
Cons:
In a cold standby disaster recovery scenario, you have to recover by setting up a new farm in your cold standby datacentre and restoring the backups that you have stored there. In this scenario, if your primary farm fails before you get to make the backups to ship out to the cold standby datacentre, you will lose all the data added or changed since your last backup.
In a warm standby disaster recovery scenario, you have to create a duplicate farm in the warm standby datacentre and ensure that it is updated regularly by using full and incremental backups of the farm in the primary datacentre. This requires some continuous monitoring, server maintenance, SharePoint upgrades, and other data activity to keep the environment warm. In the event of a failure, you will lose all the data added or changed since your last backup.
Virtualization provides a cost-effective option for a warm standby recovery solution. Typically, you can use Hyper-V or VMware as an in-house solution for recovery. This is explained in further detail in Chapter 4, Virtual Environment Backup and Restore Procedures. But even this has its downside. If it takes two days for the VMs or backups to get to the DR datacentre, or to upload all the VMs to the DR datacentre, your backups are now two days out of date.
Otherwise, you have to make sure that the virtual images are created often enough to provide the level of farm configuration and content freshness that you must have for recovering the farm at the secondary DR site. You must have an environment available in which you can host the VMs. We will dig a bit deeper into virtualization technologies later in this chapter.
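The effect of transfer delay on freshness can be made explicit. In the worst case, a disaster strikes just before the next image arrives at the DR site, so the newest usable copy is as old as the backup interval plus the transfer time. The sketch below assumes a fixed interval and a fixed transfer time; the function name and the weekly and daily schedules are illustrative assumptions.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta, transfer_time: timedelta) -> timedelta:
    """Age of the newest copy at the DR site just before the next copy arrives."""
    return backup_interval + transfer_time


if __name__ == "__main__":
    two_days = timedelta(days=2)  # transfer time from the example above
    # Assumption: images are created weekly or daily.
    print(worst_case_data_loss(timedelta(days=7), two_days))
    print(worst_case_data_loss(timedelta(days=1), two_days))
```

Comparing this figure with the acceptable data loss shows whether images need to be created more often or moved to the DR site faster.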
In a hot standby disaster recovery scenario, you have to create a duplicate farm in the hot standby datacentre, so that it can assume production operations almost immediately after the primary farm fails. This requires a third-party tool, such as Metalogix Replicator, for real-time synchronization.
Note
For more information on Metalogix Replicator, visit http://www.metalogix.com/Products/Replicator/Replicator-for-SharePoint.aspx.
Both the RTO and RPO approaches include shared and dedicated models. These are explained below.
In a dedicated model, the infrastructure is dedicated to a single organization. Compared to other traditional models, this can offer a faster time for recovery, because the IT infrastructure is mirrored at the disaster recovery site and is ready to be called upon in the event of a disaster. While this model can reduce RTO because the hardware and software are preconfigured, it does not eliminate all delays. You still need to restore the data. This approach is costly because the hardware sits idle when not being used for disaster recovery. Some organizations use the DR infrastructure for development and testing to mitigate the cost, but that introduces additional risk. When the time comes to use the site for an actual disaster, the farms are no longer the same; they are drastically different. Solutions have not been maintained or documented correctly, and now you are in a bind.
In a shared model, the infrastructure is shared among multiple organizations so it is more cost effective. After a disaster is declared, the hardware, the operating system, and the application software at the disaster site must be configured from the ground up to match the IT site that has declared a disaster. On top of that, the data restoration process must be completed. This can take hours or even days.
This is normally a service provided by the company that is managing your data operations.
There is also a hybrid model, where a SharePoint component such as SQL Server leverages a DR process from another application. This does reduce costs, but of course both DR plans need to be kept in sync. It can also become very complex: how do you separate the two, and what is the process when it comes to restoring? I personally don't like this model because of its complexity, and as a best practice it is never a good idea to add any other database to your SharePoint SQL Server.