Book Image

Software Architecture Patterns for Serverless Systems - Second Edition

By : John Gilbert
Book Image

Software Architecture Patterns for Serverless Systems - Second Edition

By: John Gilbert

Overview of this book

Organizations undergoing digital transformation rely on IT professionals to design systems to keep up with the rate of change while maintaining stability. With this edition, enriched with more real-world examples, you’ll be perfectly equipped to architect the future for unparalleled innovation. This book guides through the architectural patterns that power enterprise-grade software systems while exploring key architectural elements (such as events-driven microservices, and micro frontends) and learning how to implement anti-fragile systems. First, you'll divide up a system and define boundaries so that your teams can work autonomously and accelerate innovation. You'll cover the low-level event and data patterns that support the entire architecture while getting up and running with the different autonomous service design patterns. This edition is tailored with several new topics on security, observability, and multi-regional deployment. It focuses on best practices for security, reliability, testability, observability, and performance. You'll be exploring the methodologies of continuous experimentation, deployment, and delivery before delving into some final thoughts on how to start making progress. By the end of this book, you'll be able to architect your own event-driven, serverless systems that are ready to adapt and change.
Table of Contents (16 chapters)
Other Books You May Enjoy

Creating subsystem bulkheads

What is the most critical subsystem in your system? This is an interesting question. Certainly, they are all important, but some are more important. For many businesses, the customer-facing portion of the system is the most important. After all, we must be able to engage with the customers to provide a service for them. For example, in an e-commerce system, the customer must be able to access the catalog and place orders, whereas the authoring of new offers is less crucial. So, we want to protect the critical subsystems from the rest of the system.

Our aim is to fortify all the architectural boundaries in a system so that autonomous teams can forge ahead with experiments, confident in the knowledge that the blast radius will be contained when teams make mistakes. At the subsystem level, we are essentially enabling autonomous organizations to manage their autonomous subsystems independently. Let’s look at how we can fortify our autonomous subsystems with bulkheads.

Separate cloud accounts

Cloud accounts form natural bulkheads that we should leverage as much as possible to help protect us from ourselves. Far too often we overload our cloud accounts with too many unrelated workloads, which puts all the workloads at risk. At a bare minimum, development and production environments must be in separate accounts. But we can do better by having separate accounts, per subsystem, per environment. This will help control the blast radius when there is a failure in one account. Here are some of the natural benefits of using multiple accounts:

  • We control the technical debt that naturally accumulates as the number of resources within an account grows. It becomes difficult, if not impossible, to see the forest for the trees, so to speak, when we put too many workloads in one account. The learning curve increases because the account is not clean. Tagging resources helps, but they are prone to omission. Engineers eventually resist making any changes because the risk of making a mistake is too high. The likelihood of a catastrophic system failure also increases.
  • We improve our security posture by limiting the attack surface of each account. Restricting access is as simple as assigning team members to the right role in the right account. If a breach does occur, then access is limited to the resources in that one account. In the case of legacy systems, we can minimize the number of accounts that have access to the corporate network, preferably to just one per environment.
  • We have less competition for limited resources. Many cloud resources have soft limits at the account level that throttle access when transaction volumes exceed a threshold. The likelihood of hitting these limits increases as the number of workloads increases. A denial-of-service attack or a runaway mistake on one workload could starve all other workloads. We can request increases to these limits, but this takes time. Instead, the default limits may provide plenty of headroom once we allocate accounts for individual subsystems.
  • Cost allocation is simple and error resistant because we allocate everything in an account to a single cost bucket without the need for tagging. This means that no unallocated costs occur when tagging is incomplete. We also minimize the hidden costs of orphaned resources because they are easier to identify.
  • Observability and governance are more accurate and informative because monitoring tools tag all metrics by account. This allows filtering and alerting by subsystem. When failures do occur, the limited blast radius also facilitates root cause analysis and a shorter mean time to recovery.

Having multiple accounts means that there are cross-cutting capabilities that we must duplicate across accounts, but we will address this shortly when we discuss automation in the Governing without impeding section.

External domain events

We have already discussed the benefits of using events as contracts, the importance of backward compatibility, and how asynchronous communication via events creates a bulkhead along our architectural boundaries. Now we need to look at the distinction between internal and external domain events.

Within a subsystem, its services will communicate via internal domain events. The definitions of these events are relatively easy to change because the autonomous teams that own the services work together in the same autonomous organization. The event definitions will start out messy but will quickly evolve and stabilize as the subsystem matures. We will leverage the Robustness principle to facilitate this change. The events will also contain raw information that we want to retain for auditing purposes but that is of no importance outside of the subsystem. All of this is OK because it is all in the family, so to speak.

Conversely, across subsystem boundaries, we need more regulated team communication and coordination to facilitate changes to these contracts. As we have seen, this communication increases lead time, which is the opposite of what we want. We want to limit the impact that this has on internal lead time, so we are free to innovate within our autonomous subsystems. We essentially want to hide internal information and not air our dirty laundry in public.

Instead, we will perform all inter-subsystem communication via external domain events (sometimes called integration events). These external events will have much more stable contracts with stronger backward compatibility requirements. We will intend for these contracts to change slowly to help create a bulkhead between subsystems. Domain-Driven Design (DDD) refers to this technique as context mapping, such as when we use domain aggregates with the same terms in multiple bounded contexts, but with different meanings.

External events represent the subsystem’s ports in hexagonal terminology. In Chapter 7, Bridging Intersystem Gaps, we will cover the External Service Gateway (ESG) pattern. Each subsystem will treat related subsystems as external systems. We will bridge the internal event hubs of related subsystems to create the event-first topology depicted in Figure 2.3. Each subsystem will define egress gateways that define what events it is willing to share and hide everything else. Subsystems will define ingress gateways that act as an anti-corruption layer to consume upstream external domain events and transform (that is, adapt) them to its internal formats.

Now that we have an all-important subsystem architecture with proper bulkheads, let’s look at the architecture within a subsystem. Let’s see how we can decompose an autonomous subsystem into autonomous services.