Monitoring systems become critical as you scale. Effective monitoring can drastically ease the maintenance of services.
Having spoken to multiple experts in this field, I have collected the following advice on the subject:
Choose your key statistics carefully. Users don't care if your machine is low on CPU, but they do care if your API is slow.
Use aggregators; think about services, not machines. If you have more than a handful of machines, you should treat them as an amorphous blob.
Avoid the Wall of Graphs. It is slow, and it's information overload for a human. Each dashboard should have no more than five graphs, with no more than five lines per graph.
Quantiles aren't aggregable: you cannot combine per-machine quantiles into a service-wide quantile, and they're hard to get meaningful information from. Averages, however, are easy to reason about. A response time of 10 ms in the first quartile isn't really useful information, but a 400 ms average response time shows a clear problem that needs to be addressed.
In addition to this, averages are far easier to calculate than quantiles...
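To make the aggregation point concrete, here is a minimal Python sketch. The machine names and latency values are made up for illustration, and `summarize` is a hypothetical helper, not part of any monitoring tool. It assumes each machine exports only a running sum, a count, and its local median. The service-wide average can be rebuilt exactly from the sums and counts, while naively combining the per-machine medians does not give the true service-wide median.

```python
from statistics import median

# Hypothetical response times (ms) per machine for a single service.
latencies_ms = {
    "machine-a": [10, 12, 11, 13],
    "machine-b": [400, 420, 410, 390, 405, 415, 395, 425],
}

def summarize(samples):
    """What one machine might export: running sum, count, and local median."""
    return {"sum": sum(samples), "count": len(samples), "median": median(samples)}

summaries = [summarize(samples) for samples in latencies_ms.values()]

# Averages aggregate cleanly: add up the sums and the counts, then divide.
service_avg = sum(s["sum"] for s in summaries) / sum(s["count"] for s in summaries)

# Ground truth, computed from every raw sample (usually not available centrally).
all_samples = [x for samples in latencies_ms.values() for x in samples]
true_avg = sum(all_samples) / len(all_samples)
true_median = median(all_samples)

# Quantiles do not: averaging per-machine medians gives a different number.
naive_median = sum(s["median"] for s in summaries) / len(summaries)

print(f"service average from sums/counts: {service_avg} ms (true: {true_avg} ms)")
print(f"naive combined median: {naive_median} ms (true: {true_median} ms)")
```

The average needs only two numbers per machine, which is also why it is cheap to calculate; an exact quantile needs the raw samples themselves.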