Implementing Cloud Design Patterns for AWS

The previous sections have tried to shed light on issues found in traditional settings, which might make moving to a Cloud infrastructure seem like a logical choice with no ramifications. This is not true. While Cloud infrastructure aims to resolve many problems, it also introduces new issues for the user.

Underlying hardware failures

Some issues can be avoided while others cannot. Apart from user error, one example of an unavoidable issue is an underlying hardware fault that propagates up to the user. Hardware has a failure rate and is guaranteed to fail at some point. The benefit of IaaS is that the hardware is abstracted away, but its failures are still relevant to anyone using it.

AWS has a Service Level Agreement (SLA) for each service, which commits AWS to a certain percentage of uptime on their end. This implies that a certain amount of downtime remains possible, for example for scheduled maintenance and repairs of the underlying hardware.
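
To make the relationship between an uptime percentage and the downtime it still permits concrete, the arithmetic can be sketched as follows. The percentages used here are illustrative assumptions only, not the figures from any actual AWS SLA, and the function name is hypothetical.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Return the downtime, in minutes, that an uptime percentage still permits."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    # Illustrative percentages; check the SLA of each service for real values.
    for pct in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% uptime over 30 days permits about {minutes:.1f} minutes of downtime")
```

Even a seemingly high percentage leaves room for an outage long enough to matter to a real-time application, which is why the SLA alone is not a substitute for a recovery policy.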

As an AWS user, this means you can expect an e-mail at some point warning that instances will be stopped and need to be restarted manually. While this is no different from a physical environment where the user schedules their own downtime, it does mean that instances can misbehave when the hardware is starting to fail. Most of the replication and failover is handled underneath, but if the application is real-time and is stopped, you, as a user, should have policies in place to handle this situation.
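
Rather than waiting for the e-mail, one possible policy is to poll for scheduled events. The sketch below assumes a boto3 EC2 client is passed in and relies on the `describe_instance_status` call, whose response lists scheduled events per instance. The function name and the `FakeEC2Client` stand-in are hypothetical, pagination is ignored for brevity, and the stand-in exists only so the example runs without AWS credentials.

```python
def instances_with_scheduled_events(ec2_client):
    """Return {instance_id: [event summaries]} for instances with scheduled events."""
    response = ec2_client.describe_instance_status(IncludeAllInstances=True)
    pending = {}
    for status in response.get("InstanceStatuses", []):
        events = status.get("Events", [])
        if events:
            pending[status["InstanceId"]] = [
                f'{event.get("Code")}: {event.get("Description")}' for event in events
            ]
    return pending


class FakeEC2Client:
    """Offline stand-in for boto3.client("ec2"); returns a canned response."""

    def describe_instance_status(self, **kwargs):
        return {
            "InstanceStatuses": [
                {
                    "InstanceId": "i-0aaa",
                    "Events": [
                        {
                            "Code": "instance-stop",
                            "Description": "The instance is running on degraded hardware",
                        }
                    ],
                },
                {"InstanceId": "i-0bbb", "Events": []},
            ]
        }


if __name__ == "__main__":
    print(instances_with_scheduled_events(FakeEC2Client()))
```

In real use, the stand-in would be replaced by `boto3.client("ec2")`, and the result could feed whatever alerting or restart procedure the team has agreed on.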

Over-provisioning

Another issue with having virtual machines in the Cloud is over-provisioning. When an instance is launched, an instance type is selected that corresponds to the virtualized hardware it requires. Without taking measures to ensure that replication or scaling happens across multiple data centers, there is a very real risk that when a new instance is needed, the hardware will not be immediately available. If scaling policies specify that your application should scale out to a certain number of instances, but all of those instances are in a data center nearing its maximum capacity, the scaling policy will fail. A scaling policy that cannot always scale to the required size defeats the purpose of having one.
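
The effect of spreading a scaling policy across several data centers can be illustrated with a small simulation. This is a sketch under stated assumptions: the zone names and capacity numbers are made up, and `place_instances` is a hypothetical helper, not how AWS Auto Scaling is implemented.

```python
def place_instances(desired: int, free_capacity: dict) -> dict:
    """Spread `desired` instances across zones, skipping zones that are full.

    Raises RuntimeError when the zones together cannot hold `desired` instances.
    """
    remaining = dict(free_capacity)
    placement = {zone: 0 for zone in free_capacity}
    for _ in range(desired):
        candidates = [zone for zone in remaining if remaining[zone] > 0]
        if not candidates:
            raise RuntimeError("insufficient capacity in all zones")
        # Prefer the zone that currently holds the fewest of our instances.
        zone = min(candidates, key=lambda z: placement[z])
        remaining[zone] -= 1
        placement[zone] += 1
    return placement


if __name__ == "__main__":
    # One nearly full zone: scaling to six instances cannot succeed.
    try:
        place_instances(6, {"zone-a": 1})
    except RuntimeError as error:
        print("single zone:", error)

    # The same request succeeds once other zones are allowed.
    print("multi zone:", place_instances(6, {"zone-a": 1, "zone-b": 4, "zone-c": 4}))
```

The single-zone request fails as soon as the zone runs out of capacity, while the multi-zone request absorbs the shortage in `zone-a` by placing the remainder elsewhere.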

Under-provisioning

A topic that is rarely talked about but is very common is under-provisioning, the opposite of over-provisioning. We will start with an example: say we purchase a server for hosting a database and choose the smallest instance type possible. Let's assume that for the next few days this is the only machine running in a specific rack in the AWS data center. We are promised the resources of the instance we purchased, but because the rest of the hardware is idle, we get a boost in performance that we unknowingly grow accustomed to.

After a few days, the hardware has been provisioned for other customers and now gives us only the resources we were promised, not the extra boost we got for free in the background. While monitoring, we now see a performance degradation. The database that was originally able to perform a certain number of transactions per second now performs far fewer. The problem here is that we grew accustomed to processing power that technically was not ours, and now our database does not perform the way we expected it to.

Perhaps the promised amount is not suitable, but the database is live and has customer data within it. To resolve this, we must take the instance offline and move it to a more powerful instance type, which could have downstream effects or even cause full downtime for the customer. This is the danger of under-provisioning, and it is extremely hard to trace. Not knowing what kind of performance we should actually get (as promised in the SLA) means we risk affecting the customer, which is never ideal.
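
One way to make this kind of drift visible is to record a throughput baseline and compare recent measurements against it. The following is a minimal sketch, assuming throughput samples in transactions per second are already being collected. The function name, the 20 percent tolerance, and the sample values are all illustrative assumptions.

```python
from statistics import median


def is_degraded(baseline_tps, recent_tps, tolerance=0.2):
    """Return True when recent median throughput is more than `tolerance` below baseline."""
    if not baseline_tps or not recent_tps:
        raise ValueError("both sample lists must be non-empty")
    baseline = median(baseline_tps)
    return median(recent_tps) < baseline * (1 - tolerance)


if __name__ == "__main__":
    first_days = [1000, 980, 1020]   # samples taken while the rack was otherwise idle
    this_week = [610, 590, 600]      # samples taken after neighbors arrived
    print("degraded:", is_degraded(first_days, this_week))
```

A check like this only reports that performance has dropped relative to what was observed. It cannot tell whether the earlier numbers were a free boost, so the baseline is best taken over a long enough period to include contended conditions.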

Replication

The previous examples are extreme and rarely encountered. Replication is a more everyday concern. In a traditional hosting environment where there are multiple applications behind a load balancer, replication is not trivial. Replicating an application server requires registering it with the load balancer, which is usually done manually or requires configuration each time. AWS-provided ELBs are a first-class entity, just like the virtual machines themselves. The registration between the two is abstracted and can be done with the click of a button or automatically through auto scaling groups and start-up scripts.
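
As an illustration of registration from a start-up script, the sketch below wraps the Classic ELB call `register_instances_with_load_balancer`, which a boto3 `elb` client provides. The wrapper name, the load balancer name, and the instance ID are hypothetical, and `FakeELBClient` is an offline stand-in so the example runs without AWS access.

```python
def register_with_load_balancer(elb_client, load_balancer_name, instance_id):
    """Register one instance and return the IDs the load balancer now reports."""
    response = elb_client.register_instances_with_load_balancer(
        LoadBalancerName=load_balancer_name,
        Instances=[{"InstanceId": instance_id}],
    )
    return [instance["InstanceId"] for instance in response.get("Instances", [])]


class FakeELBClient:
    """Offline stand-in for boto3.client("elb"); remembers registered instances."""

    def __init__(self):
        self.registered = []

    def register_instances_with_load_balancer(self, LoadBalancerName, Instances):
        for instance in Instances:
            if instance not in self.registered:
                self.registered.append(instance)
        return {"Instances": list(self.registered)}


if __name__ == "__main__":
    client = FakeELBClient()
    print(register_with_load_balancer(client, "web-elb", "i-0aaa"))
    print(register_with_load_balancer(client, "web-elb", "i-0bbb"))
```

With an auto scaling group attached to the load balancer, this step happens automatically, so a script like this is mainly relevant for instances launched outside a group.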

Redundancy

Apart from replication, redundancy is another hot topic. Most database clustering takes redundancy into account but requires special configuration and initial setup. RDS allows replication to be specified at the time of setup and guarantees redundancy and uptime as per its SLA. Its Multi-AZ specification guarantees that the replication crosses availability zones and provides automatic failover. Traditionally, storing offsite backups requires special software or configuration in addition to replication. With S3, an instance may simply synchronize with an S3 bucket. S3 is itself redundant storage that spans data center sites, and the synchronization can be done via the AWS CLI or the provided API. S3 is also a first-class entity, so permissions can be provided transparently to virtual machines.
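
A backup to S3 through the API could look roughly like the sketch below. It assumes a boto3 S3 client is passed in and uses its `upload_file(Filename, Bucket, Key)` method. The helper name, bucket name, and prefix are hypothetical. Unlike `aws s3 sync`, this simplified version uploads every file each time instead of comparing it with what is already in the bucket.

```python
import os
import tempfile


def backup_directory(s3_client, local_dir, bucket, prefix=""):
    """Upload every file under `local_dir` to `bucket` and return the keys used."""
    uploaded = []
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            # Build the object key from the path relative to the backup root.
            relative = os.path.relpath(path, local_dir).replace(os.sep, "/")
            key = f"{prefix}{relative}"
            s3_client.upload_file(path, bucket, key)
            uploaded.append(key)
    return uploaded


class FakeS3Client:
    """Offline stand-in for boto3.client("s3"); records uploads instead of sending them."""

    def __init__(self):
        self.calls = []

    def upload_file(self, Filename, Bucket, Key):
        self.calls.append((Bucket, Key))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        with open(os.path.join(directory, "dump.sql"), "w") as handle:
            handle.write("-- example backup\n")
        print(backup_directory(FakeS3Client(), directory, "example-backups", "db/"))
```

On an instance with an appropriate IAM role, the stand-in would be replaced by `boto3.client("s3")` and no credentials would need to be stored on the machine.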

The previous database example hints at a set of issues grouped under high availability. The purpose of high availability is to mitigate failures through redundancy, whether by means of a load balancer, a proxy, or crossing availability zones. This is a part of risk management and disaster recovery. The last thing an operations team would want is to have their database go down, only to find that it was replicated to New Orleans during Hurricane Katrina. This is an extreme and incredibly rare example, but the risk exists. Disaster recovery exists, and will always exist, for the simple reason that accidents happen and have happened to those who were ill-prepared.

Improving the end user experience

Another set of problems to be solved is optimization for the end user. Optimization in this case means proxying through cache servers so that high workloads can be handled without spinning up more instances. Under a scaling policy, high bandwidth usage would lead to more instances, which incur cost and startup time. Caching static content, where possible, can help mitigate high bandwidth peaks. Another way to optimize is to use Content Delivery Networks (CDNs) such as the AWS-provided CloudFront service, which automatically choose servers geographically close to the user.
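
How a cache in front of the application reduces load can be shown with a small time-to-live cache. This is a minimal in-process sketch, not a cache server or CloudFront configuration. The class name, the 60-second TTL, and the `fetch` callback standing in for the origin server are assumptions made for illustration.

```python
import time


class TTLCache:
    """Serve repeated requests from memory until an entry's time-to-live expires."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch            # called only on a miss or an expired entry
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._entries = {}             # key -> (expires_at, value)
        self.origin_requests = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        self.origin_requests += 1
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    cache = TTLCache(lambda path: f"<contents of {path}>")
    for _ in range(1000):
        cache.get("/static/logo.png")
    print("requests that reached the origin:", cache.origin_requests)
```

A thousand requests for the same static file reach the origin only once within the TTL, which is the kind of peak that would otherwise trigger a scale-out.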

Monitoring and log-gathering

The last set of problems to discuss briefly is operational in nature. Most applications generate logs, and large software stacks generate many disparate logs. Third-party software such as Loggly and Splunk exists to aggregate and search log collections, but AWS has services dedicated to this as well. The preferred way is to upload logs to CloudWatch. CloudWatch allows you to directly search and create alerts on the data within logs. Since CloudWatch is a first-class AWS service, it comes with an SLA similar to that of the instances themselves, and the storage is scalable.
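
Uploading log lines through the API might be sketched as follows. The example assumes a boto3 CloudWatch Logs client is passed in and uses its `put_log_events` call, which takes a log group, a log stream, and a list of timestamped messages. The helper names, group name, and stream name are hypothetical, the group and stream are assumed to exist already, and batching limits are ignored for brevity.

```python
import time


def to_log_events(lines, now_ms=None):
    """Convert non-empty text lines into the event dictionaries put_log_events expects."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return [
        {"timestamp": now_ms, "message": line.rstrip("\n")}
        for line in lines
        if line.strip()
    ]


def upload_logs(logs_client, group, stream, lines):
    """Send the lines as one batch and return how many events were sent."""
    events = to_log_events(lines)
    if events:
        logs_client.put_log_events(
            logGroupName=group, logStreamName=stream, logEvents=events
        )
    return len(events)


class FakeLogsClient:
    """Offline stand-in for boto3.client("logs"); stores batches in memory."""

    def __init__(self):
        self.batches = []

    def put_log_events(self, logGroupName, logStreamName, logEvents):
        self.batches.append((logGroupName, logStreamName, logEvents))
        return {}


if __name__ == "__main__":
    client = FakeLogsClient()
    sent = upload_logs(client, "example-app", "web-1", ["GET / 200\n", "\n", "GET /x 500\n"])
    print("events sent:", sent)
```

In practice, an agent installed on the instance usually handles shipping log files, so a direct call like this is mostly useful for application-level events.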

These are only some of the issues that someone shifting to AWS might encounter or need to fortify their infrastructure against. Reading through the chapters of this book will serve as a beginner's guide of sorts to creating an infrastructure that is resilient against these issues and others.
