On-call only works when responders are happy

June 9, 2022 in SRE

Incidents can happen despite our best efforts. It’s the nature of the software systems we build today, which rely on complex distributed systems with emergent behaviours. Incidents don’t care about working hours, and without the luxury of a follow-the-sun response team, somebody sometimes needs to respond out of hours.

Continue reading · 7 min read

Bubbling up customer-specific experience with high-cardinality observability

June 16, 2021 in observability

When averages look fine, how can we discover that one customer who’s having a bad experience?

Continue reading · 4 min read

Hack The Box write-up: Cache

October 10, 2020 in security

Since completing OSCP in November 2019, I have been refining my penetration testing skills on Hack The Box, a penetration testing lab. Each target is usually a rollercoaster of frustration and excitement, and a real test of the “Try Harder” philosophy. Here is a step-by-step guide to rooting one of the recently retired machines: Cache.

Continue reading · 9 min read

Prometheus workshops follow-up: frequently asked questions

May 2, 2019 in monitoring

When I run workshops on practical monitoring with Prometheus, the same kinds of questions usually come up. Here are a few of them, with my answers and pointers to other resources (articles, talks) to learn more.

Continue reading · 3 min read

Patterns for zero-downtime deployments of large applications

December 10, 2018 in devops

For anybody starting a product today, the idea of deployments requiring downtime is ludicrous. The truth is, a lot of today’s products started their development more than a decade ago, with architectures reflecting a different set of practices and assumptions.

Continue reading · 7 min read

Prometheus Blog Series (Part 5): Alerting rules

December 31, 2017 in monitoring

Instrumented applications bring in a wealth of information on how they behave. In the previous parts of this blog series, the focus has been mostly on getting applications to expose their metrics and on how to query Prometheus to make sense of these metrics. This exploratory approach is extremely valuable for uncovering unknown unknowns, either proactively (testing) or reactively (debugging).

Metrics can also be used to help with things we already know and care about: once those things are instrumented and their normal state is known, we can alert on situations that are judged problematic. Prometheus supports this through the definition of alerting rules.

Continue reading · 6 min read

Prometheus Blog Series (Part 4): Instrumenting code in Go and Java

December 28, 2017 in monitoring

So far in this Prometheus blog series, we have looked into Prometheus metrics and labels (see Parts 1 and 2), as well as how Prometheus integrates into a distributed architecture (see Part 3). In this fourth part, it is time to look at code to create custom instrumentation. Luckily, client libraries make this pretty easy, which is one of the reasons behind Prometheus' wide adoption.

This post will go through examples in Go and Java. Prometheus supports a number of other languages through official and community-maintained client libraries.

Continue reading · 7 min read

Pierre Vincent