As we’ve said before, this part of the book is a level set so that everyone is on the same page with the basic concepts. If you are 100% sure you understand the following, you can skip this part:
- Basic cloud concepts
- Security and privacy implications
- Cloud services
- Cloud workload types
- Pricing and support options
If you feel less confident, maybe just skim this section. You can always come back and read it if required. Also, when you gift copies of this book to people in your organization, they can quickly catch up with the acronyms, concepts, and ideas here. If you are one of those people who received this book as a gift and need to understand cloud concepts, welcome! Someone in your organization loves you enough to want you to educate yourself further and become an active participant in your organization’s digital transformation journey.
Smash that like button!
I feel like I should now shout at you to comment, like, and subscribe as it really helps the channel out. But this is not YouTube, so it might be a bit more difficult for you to do. So, get in touch in other ways!
As we continue forward, we will focus on Azure services; however, these concepts – and the concepts in this book more broadly – apply to any hyperscale cloud provider, so if you primarily work day to day with AWS or GCP, you should be perfectly fine translating them to their respective services.
So why adopt the cloud and why should we care about it?
Again, you really must think of it in terms of agility. The cloud is a way for us (all of us) to deliver value into the hands of our customers faster. It is also a way for us to deliver value that we just couldn’t before from our own data centers. Be it due to physical or economic constraints (hundreds of thousands of compute units at our fingertips), it was not easy or quick to run an experiment or prototype an application, let alone test it with a small subset of our customers and then scale it to the entire global market.
The cloud also brings services that we would previously have had to build into our architectures ourselves, planning for their development, deployment, testing, scaling, support, updating, monitoring, and so on.
Azure Service Bus
A service such as Azure Service Bus now gives us the flexibility to handle publish-subscribe events with one deployment of a template and all our services can avail of it, without us having to develop, deploy, and test it for functionality.
It can be highly available and scale automatically, it has 24/7 support, and it updates itself, delivering both new features and security improvements (it can even be made highly available across Azure availability zones simply by picking the Premium tier and configuring it). That is an awful lot of work we don’t have to do. Focus on the features and services your organization cares about; you shouldn’t be building a service like Azure Service Bus.
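To make the publish-subscribe idea concrete, here is a minimal sketch of the pattern that Service Bus gives you as a managed service. The in-memory `TopicBroker` is entirely hypothetical – it stands in for the topic/subscription machinery you would otherwise have to develop, deploy, test, scale, and support yourself:

```python
from collections import defaultdict
from typing import Callable

class TopicBroker:
    """Hypothetical in-memory stand-in for a managed topic such as Azure Service Bus."""
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        # Every subscription independently receives its own copy of the message
        for handler in self._subscribers[topic]:
            handler(message)
        return len(self._subscribers[topic])

broker = TopicBroker()
received: list[dict] = []
broker.subscribe("orders", received.append)
broker.subscribe("orders", lambda m: None)  # a second, independent consumer
delivered_to = broker.publish("orders", {"order_id": 42})
```

With the managed service, the publisher and subscribers never know about each other – they only know the topic – which is exactly what lets all your services avail of one deployment.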
Azure Service Bus is a comprehensive offering with many options and possible configurations. There is a wealth of information available on the Azure website on things such as pricing tiers, messaging design pattern scalability, observability, and so on. For example, try out these links:
- Azure Service Bus: https://docs.microsoft.com/en-us/azure/service-bus-messaging/service-bus-messaging-overview
- Publish-Subscribe pattern: https://docs.microsoft.com/en-us/azure/architecture/patterns/publisher-subscriber
- Azure support: https://azure.microsoft.com/en-us/support/plans/
As another example, a service such as Azure Monitor brings with it a wealth of integrations (automatic, semi-automatic, and manual) that allow you to monitor your entire Azure estate from a single pane of glass (to use another buzz phrase). This means that for all Azure services, and for a whole range of applications and services you are building, you get out-of-the-box monitoring and metrics without having to do anything other than configure the endpoints and start ingesting your telemetry.
The power of Azure Monitor (the Application Insights part of it especially) doesn’t end there, as you can extend Azure Monitor’s default events with your own custom events, which Azure usually cannot reason about on its own. For example, every time a person completes a level in your game, evaluate all the inputs, game paths, times, and scores, check them for cheating, and submit an event to App Insights with the result of the evaluation. Later, you can investigate these events either automatically or manually and further improve your anti-cheat methods.
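As an illustration of what such a custom event might look like, here is a hedged sketch; the function, the cheating thresholds, and the payload values are all hypothetical, but the name/properties/measurements shape mirrors the structure Application Insights custom events use:

```python
# Hypothetical anti-cheat evaluation producing a custom event payload of the
# kind you might submit to Application Insights via track_event().
def evaluate_level_completion(inputs_per_second: float, level_time_s: float,
                              fastest_known_time_s: float) -> dict:
    # Made-up heuristics: impossibly fast completion or inhuman input rate
    suspicious = (level_time_s < fastest_known_time_s * 0.5
                  or inputs_per_second > 30)
    return {
        "name": "LevelCompleted",
        "properties": {"verdict": "suspicious" if suspicious else "clean"},
        "measurements": {"levelTimeSeconds": level_time_s,
                         "inputsPerSecond": inputs_per_second},
    }

normal = evaluate_level_completion(inputs_per_second=12.0, level_time_s=95.0,
                                   fastest_known_time_s=80.0)
cheater = evaluate_level_completion(inputs_per_second=50.0, level_time_s=20.0,
                                    fastest_known_time_s=80.0)
```

Once such events are flowing into App Insights, the automatic or manual investigation mentioned above is just a query over them.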
Different definitions of the cloud
Getting back to the concept of the cloud, you now understand why the cloud is so powerful. But now let’s switch to what the cloud is. Ask 10 people to define the cloud and you will get at least 13 answers. Ask these 10 people tomorrow and you will get 13 completely different answers. And for sure at least 50% of all those are correct. They might even all be. The cloud means different things to different people. What does it mean for you?
CFOs might focus on cost-saving provided by PaaS services over IaaS and traditional virtualization in their own data centers – bringing costs down means an opportunity to reinvest in more research.
CTOs might focus on a standard catalog of services to be used in a compliant and repeatable way – thus easing the onboarding of future services the organization creates.
The head of engineering might focus on reusable components, technologies, and services – thus unlocking career progression opportunities for team members to move between different teams with ease.
A developer might focus on writing just the code they need – rather than also needing to worry about what type of infrastructure will be needed to run the code. They also might focus on how easy it is now to debug in production when compared to on-premises deployments in customers’ environments the developer had no direct insight into.
And all of these are true, so how can the cloud bring that about?
What is the cloud?
Picture a seemingly endless web of physical servers spread across the world. These servers, each with their own special tasks, come together to create the all-encompassing wonder we call the cloud. The cloud is compute, storage, memory, and common IT building blocks at your fingertips without the (traditional) headaches. It is also, for most purposes, “infinitely” scalable in those dimensions. (Alright, technically not infinite, but we rarely worry about having the resources available to scale typical business applications.)
The cloud also delivers global scale, massive bandwidth, and minimal latency through data centers located closer to your customers than you can ever be. Azure has more than 60 regions with more than 60 data centers and tens of thousands of edge locations (with partners such as Akamai and others).
Could you build a service and offer it globally and cheaply before the cloud? Sure, you could. Would it be as cost-optimized? Hell no! Can you do so today in the cloud literally in hours? Yes, absolutely. You can and you should.
The cloud is a vast network of virtualized services, ready for you to cherry-pick the ones you need, and capable of bursting and scaling as your services require. It is glued together by an unimaginable length of wires and number of chips, and is ultimately a testament to human ingenuity and a vision of ever more powerful computers in the hands of every single person and organization, enabling them – and you – to achieve more, every day. I hope Microsoft forgives the paraphrasing of their corporate mission here: empower every person and every organization on the planet to achieve more:
Figure 1.1 – Cloud meme
The cloud is, as that meme tells us, someone else’s computer. In fact, it is hundreds of thousands – even millions – of someone else’s computers. And it’s perfect that it is that way. We can use them, and we don’t have to maintain them.
I don’t own a car
I rent. Either long-term (weeks and months) or short-term (hours or days). I get the benefit of a car – transportation. I don’t get the hassle of repairing and maintaining the car, taxing and insuring it, or worrying about what happens if I scratch it or crash it. I get any car that I want: small and cheap, large and useful, or fancy and expensive.
Is renting a car for everyone? Maybe not – self-driving cars may eventually bring about a mindset change for us all. Is this marginally more expensive on a per-use basis? Yes. Are the benefits worth it? Absolutely. Do I even like driving and am I even a good driver? No, to both. Am I terrible at parking? Yes. Did I get to drive cars I would never be able to afford (at least before everyone and their friend buys this book)? Yes.
Like the cloud, where you rent compute, memory, storage, and bandwidth, I rent cars.
And both renting computers in the cloud and renting cars are a future certainty. An inevitability that is coming soon for all of us.
Now that we are aligned on the cloud itself, let’s focus on what it means to architect for the cloud.
Architecting the cloud
Microsoft’s Well-Architected Framework defines five pillars to consider when architecting for the cloud:
- Operational excellence
- Reliability
- Performance efficiency
- Security
- Cost optimization
Under the topic of operational excellence, you should consider why the processes in your organization are set up as they are and what needs to change to achieve agility. You should also look to balance your teams’ freedom (the desire to do as they like and define their own processes) against following the standard processes as defined.
Application performance management (APM) tools must enable visibility of all aspects of application performance, from server health to user experience, giving teams deep insights into how their applications are operating in real time. Over time, APM should provide teams with data points showing how changes are impacting their applications, positively or negatively, allowing them to take proactive measures – for example, to maintain an optimal level of efficiency or to pivot on the functional direction of their application. This type of agility is core to operational excellence.
IaC and automation go hand in hand. They essentially mean that nothing you do should be unique, one of a kind, or manual. Every change and every action needs to go through continuous integration and continuous deployment pipelines. It needs to go through as a unit, as a completed task that is traceable from the idea to the line of code to the telemetry event. This needs to be repeatable and must produce the identical result every time (this is also referred to as idempotency).
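A tiny sketch of what idempotency means in practice. The `ensure_resource` function and the dict standing in for deployed state are hypothetical, but the point carries: applying the same desired configuration twice converges on the identical result, which is exactly what an IaC pipeline relies on for safe re-runs and rollbacks:

```python
# Idempotent "ensure" semantics: declare the desired state, and the
# operation creates or updates as needed to converge on it.
def ensure_resource(state: dict, name: str, config: dict) -> dict:
    state[name] = dict(config)  # create-or-update: same end state either way
    return state

state: dict = {}
first = ensure_resource(state, "service-bus", {"tier": "premium"})
second = ensure_resource(state, "service-bus", {"tier": "premium"})  # re-run
```

Contrast this with an imperative "create" that fails or duplicates on the second run – that is precisely the kind of one-of-a-kind manual action the pipeline is meant to eliminate.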
What this also gives you is – say it again – agility. You must be able to roll back any change that is causing disruption, performance dips, or instability.
Is that easy? No.
Is there a lot to do to prepare for that? Yes.
Can it be done iteratively, so we get better at operational excellence over time? Yes.
Is it worth the peace of mind achieved once it is in place? Yes.
The end goal is for you and for everyone in your organization to be able to step away from work at any time for vacations, for fun and adventure, or just to sleep and have no worries that anything will happen that won’t be automatically resolved (by at least rolling back to what worked before). If your organization can deploy new code and new services on a Friday afternoon and no one cares or worries, you are there – you are living the dream of operational excellence. If you are one of these individuals, we’d love to hear from you.
Have I seen any organization achieve all of this? No. Never. Some, though, are so very close.
And that is what it’s all about – doing better today than you did yesterday. And every good deployment, and equally every bad deployment, is an opportunity to learn. No one needs to accept the blame and no one needs to get fired – the solution is always that the process improves.
Yes, someone may still get fired and even prosecuted for deliberate malicious activity, but the solution is and must always be the process improves, we improve, and we do better going forward.
I’ve had customers work with me to figure out what to do with their services if a DDoS attack is initiated against them. Inevitably, someone will suggest that we should probably just turn all the services off to save costs in the event of a DDoS, since mitigating one by throwing infrastructure resources at the problem can get expensive – just shut down the services and wait until the attacker goes away.
To which my reply is always, let us consider the reason behind a DDoS attack and what the goal is. Pause here and think. What is the goal?
OK, so if the goal is to make your services inaccessible to others, what good does shutting them down do, except doing exactly what they wanted to achieve? For example, a DDoS attack against an Xbox service is designed to make gamers unable to, well, game. If you then turn off the service as a response, what have you achieved?
The key thing about reliability is for the services to continue to function.
DDoS mitigation could very well be a book in its own right so we won’t go into that here, but just to give you a head start: Azure has a service that mitigates DDoS attacks, one tier being free and the other costing you money. Turning that on is a really (really, really) good idea for public-facing endpoints. Also, Microsoft will have teams ready to assist at a moment’s notice if the attack does happen and the automatic mechanisms don’t prevent it. And you will have a priority service if that is the case.
Before you invest time in high availability and resiliency from a redundancy perspective, ensure that is the actual business requirement. I’ve seen so many teams struggle to achieve unreasonably high availability, only to answer my question “What is the traffic to the service?” with “Nine queries a week on average.” Or, my question “What exactly does the service do?” with “PDF generator”. Unless your business is PDF generation, people can usually come back for their PDF or wait until it is processed and generated in a background thread and emailed to them.
I am already looking forward to all the feedback like “Well, actually, our PDF service is mission-critical.” All I am saying is think before you invest effort in reliability. Ask the business how critical the service is.
And another aside here: if all services are critical, then no service is critical. This has a slight possibility of being incorrect, but I’ve never seen it.
Another way to improve resiliency is for the services to fall back to less-intensive responses. For example, if the service returns the most sold items today and it needs to do massive parallel queries and correlate the values, it can fall back to a cached list from yesterday, which is just one fast query.
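That fallback idea can be sketched in a few lines; the service shape and the `TimeoutError` trigger here are illustrative, not a prescription:

```python
def top_sellers_today(run_expensive_query, cached_yesterday: list[str]) -> list[str]:
    """Graceful degradation: fall back to yesterday's cached list
    if the expensive real-time aggregation path fails."""
    try:
        return run_expensive_query()
    except TimeoutError:
        return cached_yesterday  # degraded but still functional

def overloaded_query() -> list[str]:
    # Stand-in for the massive parallel query failing under load
    raise TimeoutError("parallel aggregation timed out")

result = top_sellers_today(overloaded_query, cached_yesterday=["widget", "gadget"])
```

The user still sees a best-seller list – slightly stale – instead of an error page, which is the whole point of degrading gracefully.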
Resiliency is another topic we could spend a lot of time on, but for now, just remember these concepts: single point of failure and graceful degradation. And one last thing – if there are issues with one service in your architecture, expect issues to cascade up and/or down the stack, and even after you have mitigated them, expect further issues in the next week or two, so be prepared and staffed. That’s a rule of thumb – here for you free of charge (almost) – that will save you a lot of headaches.
The reason behind this is that in the architectures we see today, interconnectedness is baked in (unfortunately) more than it should be, and it is often not easy to visualize all the dependencies – so maybe work on visualizing those as well, before issues happen.
Why is it that in the cloud, which is so powerful and useful, these issues are more pronounced? Well, there are now more people and machines connected to the internet, and more and more services being used by more and more of them, so what wasn’t such an issue in the 1990s is one today. The underpinning concept behind cloud computing is using commodity hardware, and at such a scale that small percentages matter. For example, a 1% failure rate per year on 2 disks means the disks will be fine almost all the time. But a 1% failure rate at a scale of 60 million disks means that 600,000 will fail this year. That is an issue. And while disks fail at more than 1% per year, other components must also be considered, such as chips, and so on.
Also, the cloud is, for our purposes, public (as opposed to the private cloud), meaning the cloud is a shared service. Though logically isolated, you may find yourself with noisy neighbors that may impact your services. You will also get hackers, from the bedroom variety to the state-sponsored type, who sometimes do, but most often don’t, target specific organizations – rather, they spray and pray they get you, and you too can pray that you don’t get caught in the crossfire.
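The small-percentages-at-scale arithmetic, spelled out:

```python
# 1% annual failure rate is negligible for 2 disks, but at cloud scale
# the absolute numbers get serious.
disk_count = 60_000_000
annual_failure_rate = 0.01
expected_failures = int(disk_count * annual_failure_rate)  # disks failing this year
```

Six hundred thousand failed disks a year means failure handling cannot be an exceptional path – it has to be routine.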
Now that you are in the cloud, you also need to consider that updates to the underlying technology don’t always go well, and Microsoft, Amazon, and Google will destroy and disrupt services in one or all regions with regularity. No slight meant here against their SRE teams; that is just playing with large numbers and small percentages again. If they do 1 update a year, then a 1% failure rate is negligible, but if they do thousands a day, then that 1% starts growing rapidly. However, that is the whole idea behind the cloud – everything can and will fail, and you will learn to love and understand that, because that very fact brings about new ways to simplify and plan for high availability differently than if you were running your own data center. Not to mention that you and your organization are not above failure either.
What is the risk of losing a data center? I have seen risk logs with an entry for a scenario where a meteor crashes into the data center. But that is such a remote chance that your cloud provider destroying a data center is much more likely.
Now that you know that failures are not only expected but inevitable, you can design and architect your services around that – if the business requirements are there that demand it. Remember, people doing manual work make so many more mistakes compared to automated machine processes – hence automation is again your friend. Invest in your friend.
Performance efficiency is defined by Microsoft as the ability of the system to adapt to changes in load. And this again brings us back to… agility. How hard do you have to work for your service to go from supporting one customer to a billion customers?
Can you design and configure a service that does this automatically? Azure Active Directory, Azure Traffic Manager, and Azure Functions are examples of such services with auto-scaling.
Prefer PaaS and SaaS services over IaaS, prefer managed services over unmanaged, and prefer horizontal scaling (out) over vertical scaling (up). This applies in both directions: scale back in just as readily as you scale out.
You should consider offloading processing to the background. If you can collect data and then process it later, the performance will improve. If you can process data not in real time but as and when you have the baseline capacity, the performance will improve.
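A minimal sketch of this offloading idea, using a plain in-process queue as a stand-in for a real message queue such as Azure Service Bus or Azure Storage Queues; the function names are illustrative:

```python
from collections import deque

# Accept work immediately on the hot path; process it later in the background.
work_queue: deque[dict] = deque()

def accept_request(payload: dict) -> str:
    work_queue.append(payload)  # fast path: just enqueue
    return "accepted"           # respond before any heavy processing happens

def drain_queue(process) -> int:
    handled = 0
    while work_queue:
        process(work_queue.popleft())  # heavy work, off the hot path
        handled += 1
    return handled

statuses = [accept_request({"doc": i}) for i in range(3)]
processed = drain_queue(lambda job: None)  # e.g. run when baseline capacity allows
```

The caller gets an instant acknowledgment, and the expensive work runs when you have capacity – the PDF-generator-in-a-background-thread example later in this chapter is exactly this shape.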
You should consider caching everything you can – static resources. Then consider caching more – dynamic resources that aren’t real-time sensitive. Then consider caching more – results from the database, lookup tables, and so on. When should you stop caching? When everything is cached and you can cache no more, the performance will improve. A great caching service is Azure Cache for Redis, but it is by no means the only one. Another amazing one to consider is the Azure CDN service.
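The classic cache-aside pattern behind all of this caching can be sketched as follows; the dict stands in for a real cache such as Azure Cache for Redis, and `db_lookup` for the slow source of truth:

```python
# Cache-aside: check the cache first, go to the database only on a miss,
# and populate the cache so the next reader gets a hit.
cache: dict[str, str] = {}
db_calls = 0

def db_lookup(key: str) -> str:
    global db_calls
    db_calls += 1               # count expensive round trips
    return f"value-for-{key}"

def get(key: str) -> str:
    if key not in cache:
        cache[key] = db_lookup(key)  # miss: one database hit, then cached
    return cache[key]                # hit: no database round trip at all

first, second = get("user:1"), get("user:1")
```

In a real deployment you would also set an expiry on each cached entry so stale data ages out – which leads nicely into the time-to-live discussion below.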
Have you considered your write and read paths and are they stressing the environment? Try data partitioning, data duplication, data aggregation, data denormalization, or data normalization. All of these can help improve performance.
Are you using the most expensive service to store data? Azure SQL is great when you need queries, and you need to do them often. But having a 1-TB database for the past 6 months of records that keeps growing while all your users only search today’s events is a waste.
Moving data around is what you should get used to. Use the right storage and the right compute resources at the right time. Moving data to another region may be costly but moving it within the region may be completely free. And using the most appropriate storage can save you millions. And to facilitate this, a lot of Azure services provide data management and offloading capabilities.
Cosmos DB has time-to-live functionality, so if you know an item won’t be needed after a certain time, you can expunge it automatically while still simultaneously storing it in a file. Azure Blob Storage has Hot, Cool, and Archive tiers, and it can move data between the underlying tiers automatically as well. If a file no longer needs to be highly available, move it to a lower tier of storage – you will pay a lot less.
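A sketch of the time-to-live semantics described above; the item shape here is illustrative (in Cosmos DB, the per-item `ttl` property in seconds works together with the container-level default TTL setting), but the expiry logic is the same idea:

```python
import time

# TTL semantics: each item carries a ttl in seconds; anything past its
# ttl is treated as expunged on read (and cleaned up in the background).
def live_items(items: list[dict], now: float) -> list[str]:
    return [i["id"] for i in items if now - i["created"] < i["ttl"]]

now = time.time()
items = [
    {"id": "fresh", "created": now - 10,   "ttl": 3600},  # 10 s old, lives 1 h
    {"id": "stale", "created": now - 7200, "ttl": 3600},  # 2 h old, expired
]
remaining = live_items(items, now)
```

The appeal is that nothing has to remember to delete old records: the data store expunges them for you, and your archival copy (the file mentioned above) is where history lives.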
And remember, there is an egress cost! When you are about to move data, always ask, “What about egress costs?”
Security, as defined by Microsoft, follows the zero trust model in protecting your applications and data from threats – including from components within your controlled network. There are so many ways to protect your workload.
We have Azure DDoS Protection, which protects against denial-of-service attacks; Azure Front Door geo-filtering, which limits the traffic you will accept to specific regions or countries; Azure Web Application Firewall, which controls access by inspecting traffic on a per-request basis; IP whitelisting, which limits exposure to only the accepted IPs; VNET integration of backend services, which restricts access from the public internet; Azure Sentinel, which is a cloud-native security information and event management (SIEM) service; and so on.
A lot of these don’t require you to manage them day to day – you set and forget them. For example, with VNET integration, once you’ve enabled it and written some automated tests to ensure it works every time, you are done.
For more detailed guidance, check out the following resources:
- Microsoft Azure Well-Architected Framework:
- Microsoft Azure Well-Architected Review:
AWS and GCP offer similar guidance as well. These are specific to each hyperscale cloud provider and to each service and concept as it pertains to them, so while the general concepts are similar, the actual guidance may differ based on service definitions and implementations.
Cloud security and data privacy
Security is a shared responsibility between your entire organization and your cloud provider, especially as we are playing here on many different levels, from the physical security of the data centers to the security of your passwords and the other cryptographic secrets needed in your services’ operation.
You need to protect your – as well as your customers’ – data, services, applications, and the underlying infrastructure.
Services such as Microsoft Defender for Cloud are your friend and will give you plenty to concern yourself with – everything from ports open to the public to automatic security insights such as traffic anomalies, for example, machine A has started communicating with machine E and has never previously done so.
You will also need to understand the patterns around the use of Azure Key Vault and how to successfully use Key Vault in your IaC scripts and in your applications and services.
Then there are services that protect the public perimeter, such as Azure DDoS Protection, Azure Front Door, Azure Web Application Firewall, and so on. And each service comes with security recommendations, best practices, and guidance on how best to protect it from internal and external threats.
Sometimes, though, you will just need to guarantee that data hasn’t been tampered with, so we slowly start moving from security to compliance. Azure confidential ledger (ACL) is one such service, ensuring that your data is stored and hosted in a trusted execution environment. The scope of these services is fascinating, and the science and patterns really showcase what is possible with technology today – not just possible but guaranteed.
At Microsoft, there are teams whose job is to ensure the compliance of the services and the platform with legal and regulatory standards around the world. You name it, they have it. AWS and GCP are close behind as well.
Again, a reminder that implementing recommendations from any or all of these does not mean you are compliant or secure. Shared responsibility means you still must do your due diligence and work to satisfy the requirements of compliance frameworks. Both theory and practice must be satisfied.
As mentioned, we’ve focused on Azure in this book as a primary hyperscale cloud provider, but here are three great pages (one from GCP and two from Azure) that give an overview and compare services and offerings so you can easily understand similar services across these providers:
- AWS, Azure, GCP service comparison: https://cloud.google.com/free/docs/aws-azure-gcp-service-comparison
- Azure for GCP Professionals: https://docs.microsoft.com/en-us/azure/architecture/gcp-professional/
Figure 1.2 – Azure for GCP Professionals screenshot
- Azure for AWS Professionals: https://docs.microsoft.com/en-us/azure/architecture/aws-professional/
Figure 1.3 – Azure for AWS Professionals screenshot
Getting to grips with one cloud platform may seem like a daunting task. If so, you probably think that learning about all three is an impossibility. Rest assured that each cloud has many similarities and the skills you acquire now will stand you in good stead if you ever need to use another cloud in the future. Hopefully, these articles have enlightened you a little and shown just how similar the major cloud platforms really are.
Cloud workload types
A workload is a collection of assets that are deployed together to support a technology service or a business process – or both. Specifically, we are talking about things such as database migration, cloud-native applications, and so on.
When talking about cloud adoption, we are looking for an inventory of things that we will be deploying to the cloud, either directly or via migration.
You need to work across the organization with all the stakeholders to identify workloads and understand them, prioritize them, and understand their interdependencies to be able to properly plan and parallelize or serialize your workloads depending on their needs and dependencies.
You and the stakeholders will need to identify, explain, and document each workload in terms of its name, description, motivations, sponsors, units, parts of the organization they belong to, and so on. This then means you can further identify metrics of success for each workload and the impact this workload has on the business, on data, and on applications.
Then you can approach the technical details such as the adoption approach and pattern, criticality, and resulting SLAs, data classification, sources, and applicable regions. This will enable you to assign who will lead each workload and who can align and manage the assets and personnel required.
The highest priority must be given to the decision between migrating as is (commonly known as lift and shift) and a more modern cloud-native approach, because any error here will cause delays, and because of dependency issues, the timeline slip may escalate quickly. And with enterprise customers, there may be thousands of workloads to execute. Take care that this step is taken very seriously and meticulously by the organization.
One common thing that happens is that a lot of responsibility gets assigned to a very small team who may not have all the information and must hunt for the information in the organization while trying to prioritize and plan the workloads and dependencies. This usually results in poor decisions. While it might be tempting to go for modernization, where migration is concerned it is best to lift and shift first, followed quickly by an optimization phase. Business reasons for the migration are usually tied to contractual obligations (for example, a data center contract) and modernization for teams new to the cloud rarely goes swimmingly with a looming deadline.
On the topic of business cases for each workload, do remember to compare apples to apples and so compare the total cost of ownership for all options. This rarely gets done properly, if at all, especially if done internally without a cloud provider or consulting support.
Ensuring cost consciousness is another activity that gets overlooked. You need to plan who will be responsible and how you will monitor costs before you start moving workloads around. Overprovisioning happens often and with regularity. And remember: ensuring cost-optimized workloads is not a one-time activity. You are now signing up for continuous improvement over the lifetime of the workload, or you risk costs spiraling out of control. Once they do, it is even harder to understand them and get them back under control.
As mentioned quite a few times now, agility, not cost, must be your primary goal. Having said that, letting costs spiral out of control is wasteful, so occasionally (at least quarterly), every team should invest some time in optimizing costs. And if you are a great architect, you (or your team) will want to join in (or initiate things) and help out with cross-team synergies that individual teams may have missed.
Counter-intuitively, cloud providers and their account teams should be – and usually are – incentivized to help you optimize costs, so check in with them regularly if they don’t proactively reach out to you. The reason is simple: the happier you are with cloud costs, the more you can do for the same amount of money, so you will do more in the future. It really is that simple.
OK, so you want to optimize your costs. What do you look at first?
The easiest is to start with two things: individual services and high availability requirements. Individual services are updating all the time, adding new cost tiers (for example, the Azure Storage archival tier), adding serverless options (for example, pay only for the actual usage on a per request basis, such as the Azure Cosmos DB serverless option), and moving features from higher tiers into lower ones, giving you the ability to trade off cost versus capacity, performance, and features.
The next best thing is that, thanks to the overzealous reliability requirements present at the start of any project, you can usually go back, architect around such requirements, or remove them completely, and save considerably. For example, consider a calculation service that is deployed to 14 data centers because you started with one Azure region pair of 2, then replicated that to all other paired regions, and now have the service deployed 14 times across 7 two-region pairs. Is that really required, or could it be just 1 deployment in each paired region, with fallback to any of the other 6? Beware of data residency requirements here – maybe it still is a valid requirement.
Multi-regional failover is relatively easy and is often overlooked. With just a few DNS changes and a few Azure Traffic Manager settings, you can increase reliability significantly and quickly with little effort.
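The priority-based failover routing that Azure Traffic Manager implements for you can be sketched like this; the endpoint names and health flags are illustrative:

```python
# Priority routing: send traffic to the highest-priority endpoint
# (lowest priority number) that is still passing health probes.
def pick_endpoint(endpoints: list[dict]) -> str:
    healthy = [e for e in endpoints if e["healthy"]]
    return min(healthy, key=lambda e: e["priority"])["name"]

endpoints = [
    {"name": "westeurope",  "priority": 1, "healthy": False},  # primary is down
    {"name": "northeurope", "priority": 2, "healthy": True},   # failover target
]
chosen = pick_endpoint(endpoints)
```

With the managed service, the health probing and the DNS answer swap are done for you – which is why the reliability gain comes with so little effort.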
Other things you can do require a bit more effort, such as moving from one database type to another (for example, Azure SQL to Azure Cosmos DB), switching between comparable services, optimizing APIs to have them be less chatty, deploying to Linux machines instead of Windows, and so on.
Sometimes you can get amazing results. For example, my favorite service in Azure is Azure SignalR, which is used to add real-time functionality to your apps. But if you think about it, real-time functionality is similar to querying a database directly, and if you receive a lot of the same queries, there may be a way to use SignalR to execute the query once and have thousands of requests return the same response – like caching, but without even having to query the cache, because the response is pushed to clients before any request to the cache or database gets made.
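A sketch of that “query once, push to thousands” idea; the in-memory inboxes are hypothetical stand-ins for SignalR connections:

```python
# One expensive query runs; its result is pushed to every connected
# subscriber, so no client ever has to ask for it.
query_runs = 0

def expensive_query() -> dict:
    global query_runs
    query_runs += 1             # count how often the database is actually hit
    return {"top_score": 9001}

def broadcast(subscribers: list[list], result: dict) -> None:
    for inbox in subscribers:
        inbox.append(result)    # push the same payload to every connection

inboxes = [[] for _ in range(1000)]  # a thousand connected clients
broadcast(inboxes, expensive_query())
```

A thousand clients are served, yet the query ran exactly once – better than caching, because not even a cache lookup per client was needed.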
Azure has a pricing calculator on the website, which you can use to get your overall estimate, but for cost optimization, it doesn’t really help outside of showing you the reservation options. For example, if you have a standard baseline usage of some services (for example, Azure VMs, Azure Cosmos DB, Azure SQL, etc.), you may reserve capacity and prepay for it and get significant discounts – over 50% in some cases.
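To see the shape of the reservation saving, here is an illustrative calculation – both hourly rates are made up, not real Azure prices:

```python
# Hypothetical rates to show how a reservation discount compounds over
# steady baseline usage; plug in real pricing-calculator numbers instead.
payg_hourly = 0.20       # made-up pay-as-you-go rate per hour
reserved_hourly = 0.09   # made-up prepaid reserved rate per hour
hours_per_year = 24 * 365

payg_cost = payg_hourly * hours_per_year
reserved_cost = reserved_hourly * hours_per_year
saving_pct = round(100 * (1 - reserved_cost / payg_cost))
```

The discount only pays off for genuinely steady baseline load – reserve the floor of your usage, and let bursty demand stay pay-as-you-go.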
You will also get recommendations from the AI behind the Azure Advisor service, and while those are almost always great to act upon, quarterly reviews are still a necessity.
As for paid support, there are multiple options available. If you are playing around in a sandbox environment and don’t really need support, you will get some help making things work from Stack Overflow and other random blogs. However, only official support can diagnose certain technical issues. And in production, you will likely need a quick response time and help through service outages.
The support options in Azure are as follows (https://azure.microsoft.com/en-us/support/plans/):
- Basic: Included for all customers; provides self-help resources
- Developer: Access to Technical Support via email
- Standard: For production environments; 24/7, 8-hour response
- Professional Direct: Includes proactive guidance; 24/7, 1-hour response
- Premier: For support across the Microsoft suite of products, including Azure; 24/7, 1-hour response
Table 1.1 – Compare Azure support plans