Each of the performance bottlenecks and risks I've just discussed can be mitigated with good design and robust testing strategies. In distributed systems, however, failure is inevitable, so the risk we seek to minimize through the design and testing of our system can never be fully eliminated. What we can do is use the results of our performance tests as a guide to minimize the impact of that inevitable failure. To do this, we'll need to implement a robust system for monitoring the health and availability of our services.
Performance monitoring
Naive monitoring strategies
It's not uncommon for developers to confuse the concept of application monitoring with that of logging. And this is not an...