Prometheus workshops follow-up: frequently asked questions
When I run workshops on practical monitoring with Prometheus, the same kinds of questions tend to come up. Here are a few of them, with my answers and pointers to other resources (articles, talks) for learning more.
How much data can you put in labels?
Labels are really powerful, so it can be tempting to annotate each metric with very specific information. However, there are important limitations on what should be used as label values.
Prometheus considers each unique combination of label names and label values as a distinct time series. As a result, if a label has an unbounded set of possible values, Prometheus will have a very hard time storing all the resulting time series. To avoid performance issues, labels should not be used for high-cardinality data (e.g. unique customer IDs).
My personal advice on this is:
- Labels should rarely have more than 100 possible values
- The set of label values should be bounded (think of them as enums, and not open ended sets of strings)
- Prometheus docs: Do not overuse labels
- PromCon 2017: Best Practices and Beastly Pitfalls by Julius Volz
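As a safety net, a known high-cardinality label can also be stripped at scrape time with `metric_relabel_configs`. A minimal sketch of a Prometheus scrape config (the job name, target, and `customer_id` label are made up for illustration):

```yaml
scrape_configs:
  - job_name: my_app                 # hypothetical job name
    static_configs:
      - targets: ['localhost:8080']  # hypothetical target
    # Drop the high-cardinality "customer_id" label from every scraped
    # metric before it is stored, keeping the series count bounded.
    metric_relabel_configs:
      - regex: customer_id
        action: labeldrop
```

Dropping the label in Prometheus is a mitigation, not a fix: the better option is to remove the label at the source, in the instrumented application.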
Running Prometheus in Kubernetes: what about storage?
Prometheus stores the content of its time-series database on the file system. When running Prometheus in a containerized environment like Kubernetes, this data is by default stored inside the running container. The main drawback of this is that the stored data will be lost whenever the container restarts (node failure, load rebalancing, or simply a redeployment).
The advice here is to use volumes to store the Prometheus data directory, which ensures the data persists across Prometheus restarts.
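One way to do this is to run Prometheus as a StatefulSet with a persistent volume claim per replica. A minimal sketch (image tag, storage size, and resource names are illustrative; the `/prometheus` mount path matches the data directory used by the official image):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
spec:
  serviceName: prometheus
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.45.0        # illustrative tag
          args:
            - --storage.tsdb.path=/prometheus   # TSDB data directory
          volumeMounts:
            - name: prometheus-data
              mountPath: /prometheus
  # One PersistentVolumeClaim per replica, so the TSDB survives restarts
  volumeClaimTemplates:
    - metadata:
        name: prometheus-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi                       # size to your retention needs
```

In practice you would also look at the Prometheus Operator or the community Helm chart, which handle this wiring for you.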
What are the recommended alerting practices to avoid alerting overload?
The number one benefit of observability is a better understanding of how your system behaves. Once you have Prometheus metrics coming from your application, start by learning what normal looks like, and only then move on to creating alerts for when specific thresholds are crossed.
Remember that not every metric is worth alerting on; many are only interesting when you want to ask questions of your system (during exploratory testing, debugging, or especially during an outage).
My advice on valuable alerts:
- Alerts must be actionable (and actioned) any time they are triggered
- Alerts must come with context (why is this triggered? what is the impact? is there a dashboard with more info?)
- Alerts must link to run-books (what should I do to investigate & mitigate?)
If some of the above can’t be done for an alert you’re about to create, then think twice, especially for the sake of whoever is going to have to handle it (maybe at 2am).
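The three points above map directly onto fields of a Prometheus alerting rule. A minimal sketch (the metric name, threshold, and runbook URL are made up for illustration):

```yaml
groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate          # hypothetical alert
        # Actionable: fires only on a sustained, user-visible symptom
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          # Context: what is happening and what the impact is
          summary: "More than 5% of HTTP requests are failing"
          description: "Current error rate: {{ $value | humanizePercentage }}."
          # Run-book: where to go to investigate & mitigate
          runbook_url: https://wiki.example.org/runbooks/high-error-rate
```

If you can't fill in the `description` or the `runbook_url`, that's usually the sign to think twice before shipping the alert.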
- Part 5 of my Prometheus series: Alerting Rules
- The My philosophy on alerting paper by Rob Ewaschuk
- Avoiding Alerts Overload from Microservices talk at QCon London 2017, by Sarah Wells
Looking to get more practical training with Prometheus?
I am regularly teaching workshops on Cloud Native monitoring with Prometheus & Grafana - don’t hesitate to get in touch with me (firstname.lastname@example.org) for more info about running a workshop for your team or at your meetup/conference.