-
Book Overview & Buying
-
Table Of Contents
The Platform Engineer's Handbook
By :
When Maria was optimizing NewTech's cloud spend in the previous chapter, she discovered that her engineering teams were provisioning resources conservatively, afraid of hitting the cost controls. Cost optimization techniques we discussed there helped the team, but she realized something else that was missing—that she couldn't save money on systems that weren't reliable. The moment a production outage hits, the cost savings disappear in the face of customer churn or even an incident response overhead.
Resilience automation is where reliability commitments are tested against real failure conditions. It's the difference between hoping that the systems stay up and knowing that they will survive potentially catastrophic failures. Think of this as a graduate level observability where we are talking proactive resilience and not just proactive monitoring. This needs to be done in the context of defining what acceptable means for...