This talk from RebelCon 2017 gives an overview of different monitoring patterns and how different tools can be used to help build a fuller picture of distributed applications in production.

Understanding the running state of an application is key to troubleshooting production issues efficiently and, ultimately, to anticipating outages.

When systems grow larger and become distributed, the visibility of application health needs to become a first-class concern. As the likelihood of something going wrong increases, the focus shifts from increasing Mean Time Between Failures to reducing Mean Time To Recovery. The best way to achieve this consistently is to build monitoring in as an integral part of product development, instead of treating it as an afterthought.

Monitoring can start simple, with basic telemetry such as Healthchecks, which increase visibility into the system’s status. Exposing more advanced Metrics can then give more detail on how the system is working at the system level (e.g. resource usage), the application level (e.g. response times) and the business level (e.g. completed sales). Later on, these Healthchecks and Metrics can be used to trigger alerts when observed values fall outside expected thresholds.
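
As a rough illustration of that progression, here is a minimal Python sketch using only the standard library. It is not taken from the talk: the names `Metrics`, `healthcheck` and `evaluate_alerts`, the metric keys and the threshold values are all hypothetical, chosen only to show a healthcheck, metrics at the three levels, and a threshold-based alert side by side.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List, Tuple


@dataclass
class Metrics:
    """Hypothetical in-memory metrics, one example per level."""
    cpu_percent: float = 0.0                                      # system level
    response_times_ms: List[float] = field(default_factory=list)  # application level
    completed_sales: int = 0                                      # business level

    def snapshot(self) -> Dict[str, float]:
        """Return the current values as a flat name -> number mapping."""
        avg = mean(self.response_times_ms) if self.response_times_ms else 0.0
        return {
            "cpu_percent": self.cpu_percent,
            "avg_response_ms": avg,
            "completed_sales": float(self.completed_sales),
        }


def healthcheck(dependencies: Dict[str, Callable[[], bool]]) -> Dict[str, object]:
    """Run each dependency probe; a probe that raises counts as failing."""
    checks: Dict[str, bool] = {}
    for name, probe in dependencies.items():
        try:
            checks[name] = bool(probe())
        except Exception:
            checks[name] = False
    status = "ok" if all(checks.values()) else "degraded"
    return {"status": status, "checks": checks}


def evaluate_alerts(
    values: Dict[str, float],
    thresholds: Dict[str, Tuple[float, float]],
) -> List[str]:
    """Return one message per metric whose value lies outside its (low, high) range."""
    alerts: List[str] = []
    for name, (low, high) in thresholds.items():
        value = values.get(name)
        if value is None:
            continue  # metric not reported; a real system might alert on this too
        if not (low <= value <= high):
            alerts.append(f"{name}={value} outside [{low}, {high}]")
    return alerts
```

A caller could build a `Metrics` instance, pass `snapshot()` together with a mapping such as `{"cpu_percent": (0, 90)}` to `evaluate_alerts`, and forward any returned messages to whatever notification channel is in use. In practice a dedicated metrics library and alerting system would replace these in-memory helpers.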