Disaster Types and Scenarios Covered by This Book
Because this book is meant as a reference and discusses several different scenarios, an overview of those scenarios is necessary. The following types of disasters or incidents are covered in this book. Wherever necessary, illustrations and flowcharts are provided to make the disasters easier to visualize.
Recovery of Deleted Objects
The most common scenario (more common than a single DC hardware failure) is the accidental deletion of objects, such as computer accounts, users, or Organizational Units (OUs), within the AD. This typically happens where no proper change management controls are in place, or where testing is not done properly. The restore can take some time, even if the backup tapes are immediately at hand, because the object relationships in AD are quite complex, and simply restoring the deleted objects will not work.
The real fun starts when you have a "safe" replication schedule because of time zones and other factors, such as office locations and line speeds. There are, and have been, scenarios where the deletion or modification of a critical service account, such as the Exchange service group, is replicated over the course of 12 hours to all locations within the organization. The service that uses the account then stops working and, as it is probably a mission-critical service, the failure gets noticed, fixed, and force-replicated to the closest DC. If things proceed smoothly, all locations will have their service restored, one after another, up to the point where one of the last locations starts replicating forward in the chain to the first DC again, before the restored information has been applied to it. A vicious circle then forms, as shown in the following diagram, giving way to some interesting possibilities. One possibility is that the service in different locations goes from working to non-working and back within a few hours; another is that everything returns to step one, with the account still deleted. This illustrates the need for proper restoration of lost objects and a proper process for forced replication.
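The vicious circle described above can be sketched with a toy model. This is only an illustration under loud assumptions: the sites form a simple one-way ring, each replication round blindly overwrites a site's copy with its upstream neighbour's copy, and the function names are hypothetical. It does not describe how AD itself decides which change wins.

```python
def replicate_once(states):
    """One replication round: every site blindly copies its upstream neighbour.

    states[i] is True when the service account exists at site i.
    Site 0 copies from the last site, which closes the ring (assumption).
    """
    return [states[i - 1] for i in range(len(states))]


def simulate(num_sites, rounds):
    """Start with the account restored only at site 0 and deleted elsewhere."""
    states = [True] + [False] * (num_sites - 1)
    history = [states]
    for _ in range(rounds):
        states = replicate_once(states)
        history.append(states)
    return history


if __name__ == "__main__":
    # 'W' = service working at that site, '-' = account still deleted.
    for round_no, states in enumerate(simulate(num_sites=4, rounds=8)):
        print(round_no, "".join("W" if ok else "-" for ok in states))
```

In this model the restored account never settles: each site works for one round and then loses the account again when the stale copy arrives from upstream, which is the working/non-working oscillation the text describes.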
Single DC Hardware Failure
This is another common scenario. You lose a DC due to a hardware or software failure. The cause can be the failure of any hardware component because of a faulty part, or an external event such as water damage or a computer virus. At this stage, the DC is no longer operational and cannot be booted again.
If you have a small branch office with only one DC, this can be catastrophic, and bringing the lost DC back online is critical because no one at the location will be able to log in or use the directory service. Bringing a failed DC back is not very difficult, but certain steps must be taken to ensure that the recovery does not affect the rest of your AD infrastructure. This incident might not be classified as extremely critical if you have two DCs at the site, but if some of these steps are skipped and the DC has not been cleanly demoted, it can cause issues in the long term.
Some small offices also like to combine the file server, Exchange server, and DC onto one physical server, so that more than just authentication and the directory service is hosted on it. In the case of a file server, the recovery of the files is outside the scope of this book. However, if you run an Exchange server, use the Distributed File System (DFS) service, or run services with domain accounts, such as Microsoft SQL Server, then the procedures outlined in this book will most definitely help you get your services back up and running.
Single DC AD Corruption
Single DC AD corruption is also quite common, especially in smaller companies where the DC has more than one role, such as also acting as a file, Exchange, and print server. AD corruption essentially means that the directory service cannot start because the directory database is corrupted, and that no user can log on to this DC with domain authentication or use any of the AD services, such as the global address book in Exchange. It is also possible (though not very common) that, during a write or replication process, one of the DCs fails or interrupts the data stream for some reason. It then replicates the changes with its nearest DC, which is usually its failover, located in the same server room. Both AD databases are then corrupted, and essentially all directory services for that site fail.
Owing to the nature of AD, DNS, and the client authentication process (mentioned earlier in this chapter), the clients may still try to authenticate against the corrupted DCs, but may not get a valid response and may therefore have to rely on the cached login information on the client machine. The users will be allowed to log in, but will not be able to access any file shares or other services in the domain if the information on the servers has not been cached, or if the cache has expired (in Windows 2003, Universal Group caching lasts for 8 hours).
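The fallback behaviour described above can be sketched as a small decision function. This is a deliberately simplified model, not a description of the Windows implementation: it collapses everything into a single cache entry with one lifetime, the function name `access_path` is hypothetical, and the 8-hour lifetime is simply the figure quoted in the text.

```python
from datetime import datetime, timedelta

# Assumption: one cache lifetime, set to the 8 hours quoted in the text.
CACHE_LIFETIME = timedelta(hours=8)


def access_path(dc_responds, cached_at, now, lifetime=CACHE_LIFETIME):
    """Return how a toy client request would be served.

    'domain' - a DC gave a valid answer
    'cached' - no valid DC answer, but an unexpired cache entry exists
    'denied' - no valid DC answer and no usable cache entry
    """
    if dc_responds:
        return "domain"
    if cached_at is not None and now - cached_at <= lifetime:
        return "cached"
    return "denied"


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    print(access_path(False, now - timedelta(hours=2), now))  # fresh cache entry
    print(access_path(False, now - timedelta(hours=9), now))  # expired cache entry
```

The point of the sketch is the ordering of the checks: the cache only matters once the DCs stop giving valid answers, which is why a corrupted site can appear healthy for several hours.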
Site AD Corruption
If your AD gets corrupted on one DC in one site, the corrupted data is likely to replicate itself to other DCs within the same site very quickly. This leaves your entire site with a corrupted AD that makes it impossible for any users or services to use domain authentication. Basically, this is the same as the Single DC AD corruption, except that steps are outlined to recover an entire site, and not just a single DC.
Corporate (Complete) AD Corruption
This scenario is very dramatic, but it can happen faster than you would have thought possible. A corruption can be anything from failed forest preps to schema modifications that were either incomplete or wrong. Another possibility is a denial of service attack, or the exploitation of vulnerabilities by a disgruntled employee (maybe an administrator within the organization), although this is a remote possibility. Consider a situation where one DC has a corrupt AD due to human error, such as someone making changes to the AD schema at a remote location on a Saturday night without recognizing his or her mistake. The chances are high that this mistake is replicated out to the other DCs before anyone notices it.
Now, this becomes something of a race condition, with the clients or systems continuously authenticating against the AD. The DCs will replicate the corrupt AD one by one, while the clients don't notice anything, because if one DC gives no answer, the client continues to query the next one in the list, and so on, until the last DC receives the replication of the bad database and goes offline. Then the alarm bells go off and the systems come to a grinding halt. In addition, if you have a very decentralized organization, a lot of time will be spent coordinating the restoration efforts as well.
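The race described above can be sketched in a few lines. This is a toy model under stated assumptions: corruption reaches exactly one DC per step in list order, the client walks the same list and uses the first DC that answers, and the DC names and function names are hypothetical.

```python
def first_healthy_dc(dc_healthy):
    """Return the first DC in the client's list that still answers, or None."""
    for name, healthy in dc_healthy.items():
        if healthy:
            return name
    return None


def simulate_spread(dc_names):
    """Corrupt one DC per step and record which DC the client ends up using."""
    dc_healthy = {name: True for name in dc_names}
    timeline = []
    for name in dc_names:
        dc_healthy[name] = False  # the bad database reaches this DC
        timeline.append(first_healthy_dc(dc_healthy))
    return timeline


if __name__ == "__main__":
    for step, answered_by in enumerate(simulate_spread(["DC1", "DC2", "DC3"]), 1):
        print(f"step {step}: client served by {answered_by or 'nobody (outage)'}")
```

From the client's point of view nothing is wrong until the very last step, which is why the failure tends to be noticed only when every DC is already affected.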
Of course, there are steps you can take to recover from this as well, but response time is very important in this situation, and effective and correct processes are also necessary.
Complete Site Hardware Failure
This scenario, describing an AD site and not necessarily a single physical site, is already quite drastic as it describes a total loss of AD service due to a complete hardware failure at a specific site. A site is a branch in your organization that is connected to your domain forest via a LAN or a WAN connection. This could also mean that a site includes two or more buildings, possibly distributed across an entire city. This scenario assumes that you have at least one other DC in your organization at another location that is unaffected. This scenario can be caused by anything that affects the whole server room, and is most likely to be physical. Fire and water, as well as storms or explosions, are very high on the probability list.
In this scenario, it is most likely that you have other servers that are also affected. This scenario addresses how to get a complete site back up and running as quickly as possible. You can, of course, re-route your users to another site for authentication if your WAN link is restored quickly, but if the links are not very fast, this can cause extreme slowness and lead to incidents such as timeouts and "domain controller not found" messages on the clients.
This is even worse if you have mission-critical systems authenticating against the AD as illustrated in the following diagram:
Corporate (Complete) Hardware Failure
If your corporation or organization has its entire AD infrastructure in one location (which is not recommended, but neither is it unheard of in small organizations), and a disaster such as fire, water, or any other destructive incident occurs, you need to rebuild everything. Backups are valuable but will not do the work for you. The most crucial task, at that point, is to get a working system back so that users can start their work. Damage control is not part of your job, but bringing back the company's domain infrastructure is. This means that your first priority is to get the DCs back online and to restore the applications that rely on them. Don't waste valuable time trying to get the print server to work when your clients and applications cannot authenticate. You also need to be aware that just re-installing the DCs from scratch will not work, as you have hundreds or even thousands of systems bound to your AD infrastructure. Some services depend on this structure very heavily, and re-configuring all the clients and services is definitely not an option once your organization grows to critical size.
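The "DCs first, then whatever depends on them" rule can be sketched as a dependency ordering. This is only an illustration: the service names and the dependency map are hypothetical, and the sketch assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be running first.
DEPENDS_ON = {
    "domain_controllers": set(),
    "exchange": {"domain_controllers"},
    "sql_server": {"domain_controllers"},
    "print_server": {"domain_controllers"},
}


def recovery_order(depends_on):
    """Return the services in an order where every dependency comes first."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for position, service in enumerate(recovery_order(DEPENDS_ON), 1):
        print(position, service)
```

A dependency order only says what must come first. Which of the remaining services to restore next (the mail system before the print server, for example) is a business decision that the sketch does not try to make.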
At this point, your client machines have no way of getting any information out of the AD, and the only reason most of them are still operating is cached logins. You might even have a Group Policy preventing cached logins, in which case you will have quite a few users who cannot get anything done, and a management team that is calculating the loss of revenue per hour.