Incidents can happen despite our best efforts. It’s in the nature of the software we build today: complex distributed systems with emergent behaviours. Incidents don’t care about working hours, and without the luxury of a follow-the-sun response team, somebody sometimes needs to respond out-of-hours.
Here is an overview of the key principles of our on-call rotation at Glofox, which allowed my team to transition away from stressful firefighting by fostering a predictable and sustainable approach to incident response.
Making out-of-hours response explicit
The very first thing is to acknowledge that out-of-hours response happens, regardless of whether an on-call rotation is in place. My team needed this acknowledgment a few years ago: the responsibility had organically grown to fall on a couple of key people who became the de facto responders. Let’s be honest, they were doing an awesome job at it, but let’s be even more honest, it’s absolutely not sustainable for anybody to be expected to respond to problems 24 hours a day, 7 days a week. Besides the inevitable risk of burning out the very few people who hold the weight of uptime on their shoulders, this dynamic has many other downsides, like siloing knowledge and fostering heroic behaviour.
When the reality of out-of-hours response is called out, it’s a lot easier to have conversations about what this response means, when it’s needed and how we can support the teams doing it in a fair, balanced and sustainable way.
A fair and balanced schedule
The scheduling rule of our on-call rotation is “no more than 1 week per month”. With four people on the rotation (at the time of writing), it’s a rule we’ve almost never deviated from. After years of living on the edge 24/7, this alone made a huge difference to the work-life balance of our SREs, who can finally step away from Slack after work knowing that somebody is in charge. Knowing our on-call schedule a few months in advance means there are no surprises and allows people in the rotation to freely swap shifts. The schedule is also flexible for last-minute changes when daily life takes over - and there’s always somebody volunteering to take the pager for a night when that happens.
Moving to an explicit rotation schedule has also meant letting go of old habits. It had become routine for some of us to continuously check email or Slack after work, just to make sure everything was ok - for fear that we’d miss an important alert or a thread we were tagged in.
Being able to step away (and stay away!) when not on-call can be hard at first, but it’s critical because it sets expectations for others. Old habits get broken by trusting that there is always somebody in charge to deal with escalations. Turn off notifications after work or, better yet, uninstall Slack from your phone: if you’re on-call, you have PagerDuty; if not, somebody else has it.
Time on-call means extra pay
Speak to anybody who has ever been on-call: the additional stress comes more from the fact that you can be called than from the calls themselves. This is why we chose a compensation model that doesn’t depend on the number of calls - in other words, regardless of whether the pager goes off, the on-call person gets the same additional pay for every day of their shift.
In addition to this, any time spent responding to incidents is credited as additional time-off, usually taken shortly after. You’ve been woken up at 2 am? You get to start a few hours later tomorrow. You’ve spent your Saturday afternoon responding to an alert? Take a half-day next Friday!
Compensation and time-off for on-call duties aren’t perks: they are important signals acknowledging these additional responsibilities and their impact on personal lives.
On-call rotation is opt-in, not mandatory
Regardless of extra pay, not everyone has space in their personal lives to be on-call.
That’s why there’s no obligation to be on-call - on the contrary, our rotation is on an opt-in basis and anybody can join, including developers (not just SREs). Having developers as responders brings another perspective to incident response, as they have more context about application-related issues. They also quickly become advocates for building and architecting with reliability in mind, having more empathy for those who may be woken up at night.
In either case, everybody still gets to play a part in incident response, at least during working hours.
Sticking with a small number of actionable alerts
Very few of our alerts get promoted to triggering PagerDuty, for the simple reason that if somebody is going to be woken up in the middle of the night, it had better be worth it. The conditions are:
- The alert has been shown to have no false positives
- The alert is synonymous with a customer-impacting outage, or an imminent risk of one
- The alert is actionable, documented with a runbook to guide the response
In the rare case where a critical issue did not result in an alert, we also provide a break-the-glass mechanism for anybody in the company to trigger incident response. It comes with a few warnings to confirm that there is indeed an emergency affecting production, as a reminder that the on-call person might be about to be woken up! Thankfully, this is something we only saw used a couple of times last year, and always in justifiable situations.
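To give an idea of what such a mechanism can look like, here is a minimal sketch (not our actual tooling) of a script that asks for confirmation and then triggers an incident through PagerDuty’s Events API v2. The routing key, environment variable and summary text are placeholders for the example:

```python
"""Break-the-glass sketch (illustrative only, not our actual implementation).

Posts a critical event to PagerDuty's Events API v2 after a confirmation
prompt. The routing key is assumed to come from an environment variable.
"""
import json
import os
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = os.environ["PAGERDUTY_ROUTING_KEY"]  # placeholder integration key


def break_the_glass(summary: str, source: str) -> None:
    """Trigger the on-call escalation. Only for production emergencies."""
    # Last chance to back out before waking somebody up.
    answer = input("This will page the on-call engineer. Is production impacted? [y/N] ")
    if answer.strip().lower() != "y":
        print("Aborted - consider raising the issue in the support channel instead.")
        return

    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": "critical",
        },
    }
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        print(f"Incident triggered (HTTP {response.status})")


if __name__ == "__main__":
    break_the_glass("Members cannot book classes", "break-the-glass-script")
```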
Making response repeatable with runbooks
Runbooks are one of the cheapest tools to scale incident response by making sure every responder is equipped with the same information to evaluate the impact of an alert, troubleshoot the system and mitigate the issue.
Here are some of the things we expect in all our runbooks:
- Investigation checklists
- Links to dashboards and pre-built queries in our observability tooling
- Links to specific services in AWS Console
- Useful troubleshooting commands, with examples of their output in good or bad scenarios
- Step-by-step guides for mitigation actions, with short videos (e.g. “this is how you manually scale up a Fargate service” - see the sketch after this list)
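To illustrate that last point, here is a minimal sketch of the kind of mitigation step a runbook can walk through - manually scaling up a Fargate service using boto3. The cluster name, service name, region and desired count are made up for the example; a real runbook would reference the actual resources and safe limits:

```python
"""Illustrative runbook step: manually scale up a Fargate (ECS) service.

All names and numbers below are placeholders for the example.
"""
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")  # region is an assumption

CLUSTER = "production"   # hypothetical cluster name
SERVICE = "booking-api"  # hypothetical service name


def scale_up(desired_count: int) -> None:
    """Bump the number of running tasks, then wait for the service to settle."""
    ecs.update_service(cluster=CLUSTER, service=SERVICE, desiredCount=desired_count)

    # Block until the service reaches a steady state with the new task count.
    waiter = ecs.get_waiter("services_stable")
    waiter.wait(cluster=CLUSTER, services=[SERVICE])
    print(f"{SERVICE} is stable at {desired_count} tasks")


if __name__ == "__main__":
    scale_up(desired_count=6)
```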
As a side effect, the simple act of writing the runbook filters out non-actionable alerts: if an alert is only informative and it’s not clear what should be investigated or done, it most likely doesn’t belong in PagerDuty.
Dropping tools to avoid repeat alerts
It’s so frustrating to respond several times to the same alerts, caused by the same problems, without having the power to address the underlying causes.
Our philosophy is that if PagerDuty was triggered overnight, the only thing we’ll focus on the next day is making sure it doesn’t happen again the exact same way. That means going beyond the immediate mitigation, looking at the contributing causes of the issue and collaborating with the appropriate teams on durable fixes.
Finally, the incident is wrapped up by systematically conducting a post-mortem, where we openly (and blamelessly) discuss the details of what happened. More importantly, this is a chance to reflect on the incident response experience:
- What was hard to do? What was confusing?
- How come this area of the product is suddenly the source of more issues?
- Where did we get lucky? What could have gone worse?
On-call only works when responders are happy
The Glofox platform is used 24/7 by gym-goers all around the world, so we need to be able to respond outside of our working hours. When the unexpected happens, our on-call rotation is a consistent safety net: it gives us the assurance that somebody is going to respond, and that they have the tools at their disposal to respond fast.
The principles described here are our way of acknowledging that on-call only works when the people responding feel that they’re treated fairly and that they’re empowered to improve the system, rather than passively dealing with the consequences of decisions they can’t influence.