To reduce the revenue impact of system downtime, many organizations are adopting Chaos Engineering to gain confidence that their systems are fault-tolerant, that is, built to anticipate and mitigate a variety of software and hardware failures. Several have implemented internal "failure as a service" systems, such as Failure Injection Testing (FIT) [6] and Simian Army [7] at Netflix and uDestroy at Uber; commercial offerings such as Gremlin (https://gremlin.com) are also available.
The teams behind these systems advocate treating Chaos Engineering as a scientific discipline, in which each experiment follows these steps:
Form a hypothesis: What do you think could go wrong in the system?
Plan an experiment: How can you recreate the failure without impacting users?
Minimize the blast radius: Start with the smallest experiment that can still teach you something.
Run the experiment: Monitor the results and the system behavior carefully.
Analyze: If the system did not work as expected, congratulations, you found a bug. If everything worked as...