Alerting us when something goes wrong
If you're seriously looking at microservices, you're probably running a 24/7 service. Customers expect your service to be available at any time. Contrast this growing need for availability with the reality that distributed systems are constantly experiencing some kind of failure. No system is ever completely healthy.
Whether you have a monolithic or a microservices architecture, it is pointless to try to avoid production incidents altogether. Instead, optimize how you respond to failures, limiting their impact on customers by reducing the time it takes to resolve them.
Reducing the time it takes to resolve incidents (often measured as mean time to resolve, or MTTR) involves first reducing the mean time to detect (MTTD). Accurately alerting the right on-call engineer when a service is in a customer-impacting failure state is essential to maintaining uptime. Good alerts should be actionable...
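To make the two metrics concrete, here is a minimal Python sketch of how MTTD and MTTR could be computed from a list of incident records. The Incident type, its field names, and the sample timestamps are hypothetical and exist only for illustration. The sketch also assumes that time to resolve is measured from the moment the failure began; some teams measure it from the moment of detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started_at: datetime   # when the failure began affecting customers
    detected_at: datetime  # when an alert fired or a human noticed
    resolved_at: datetime  # when service was restored


def _mean(durations):
    """Average a non-empty list of timedelta values."""
    if not durations:
        raise ValueError("at least one incident is required")
    return sum(durations, timedelta()) / len(durations)


def mean_time_to_detect(incidents):
    """MTTD: average delay between failure start and detection."""
    return _mean([i.detected_at - i.started_at for i in incidents])


def mean_time_to_resolve(incidents):
    """MTTR: average delay between failure start and resolution.

    Assumption: the clock starts when the failure begins, so time
    spent undetected counts toward the time to resolve.
    """
    return _mean([i.resolved_at - i.started_at for i in incidents])


# Made-up sample data, for illustration only.
SAMPLE_INCIDENTS = [
    Incident(
        started_at=datetime(2024, 1, 10, 10, 0),
        detected_at=datetime(2024, 1, 10, 10, 12),
        resolved_at=datetime(2024, 1, 10, 10, 45),
    ),
    Incident(
        started_at=datetime(2024, 1, 17, 14, 0),
        detected_at=datetime(2024, 1, 17, 14, 4),
        resolved_at=datetime(2024, 1, 17, 14, 35),
    ),
]

if __name__ == "__main__":
    print("MTTD:", mean_time_to_detect(SAMPLE_INCIDENTS))
    print("MTTR:", mean_time_to_resolve(SAMPLE_INCIDENTS))
```

Under this definition, every minute an incident goes undetected is also a minute added to its time to resolve, which is why faster detection lowers MTTR directly.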