In the meantime, start breaking things. Not in your user-facing environment. We can spin up copies of our service landscape using Terraform. By experimenting with failure in a low-risk place, we can pinpoint potential deficiencies while minimizing the blast radius. Netflix pioneered the emerging science of chaos engineering on AWS; more information can be found on the Chaos Monkey GitHub page (https://netflix.github.io/chaosmonkey/).
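To make the idea concrete, here is a minimal sketch of such an experiment in Python. It is a toy simulation, not a real AWS integration and not how Chaos Monkey itself works: the `Instance`, `LoadBalancer`, and `break_one_instance` names, the instance names, and the `"staging"` and `"production"` environment labels are all assumptions made for illustration. It shows the two habits worth building: refuse to run against the user-facing environment, and check that traffic still gets served after something breaks.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical stand-in for a server behind the load balancer."""
    name: str
    healthy: bool = True


@dataclass
class LoadBalancer:
    """Toy load balancer: round-robins across healthy instances only."""
    instances: list
    _next: int = field(default=0, repr=False)

    def route(self) -> str:
        healthy = [i for i in self.instances if i.healthy]
        if not healthy:
            raise RuntimeError("no healthy instances left")
        target = healthy[self._next % len(healthy)]
        self._next += 1
        return target.name


def break_one_instance(lb, environment, rng=random):
    """Mark one random healthy instance as failed, but never in production."""
    if environment == "production":
        raise ValueError("refusing to run a chaos experiment in production")
    victim = rng.choice([i for i in lb.instances if i.healthy])
    victim.healthy = False
    return victim


if __name__ == "__main__":
    # Assumed setup: two instances in a disposable copy of the environment.
    lb = LoadBalancer([Instance("web-1"), Instance("web-2")])
    victim = break_one_instance(lb, environment="staging")
    served_by = {lb.route() for _ in range(10)}
    print(f"stopped {victim.name}; traffic now served by {sorted(served_by)}")
```

In a real experiment the "break" step would go through the cloud provider's API against the Terraform-built copy, and the check would be a request to the service's public endpoint rather than a method call.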
In our earlier work with Terraform, we have seen how it requires a plan before an apply. This test-driven approach can be applied to our services too. Turn off an instance. No problem, because our load balancer shifts all traffic to the other one. However, what happens if the load balancer disappears? Probably not good things. Luckily, AWS ALBs run across a number of instances, so we shouldn't have a problem even if one of them fails. What about an inadvertent security rule change? That's probably going to cause a...