Patterns for zero-downtime deployments of large applications
For anybody starting a product today, the idea of deployments requiring downtime is ludicrous. The truth is, a lot of today’s products started their development more than a decade ago, with architectures reflecting a different set of practices and assumptions.
What do you mean downtime during deployments?
Historical assumptions lead to early design and architecture choices that then follow a project for years. One of these assumptions, 10-15 years ago, was that it was acceptable to take a system down for maintenance for a few hours every six months to deploy the latest version.
As the years pass, the product becomes successful and more engineers start contributing to its growth. Embracing agile principles, the teams gravitate towards more frequent releases, and the friction starts: more releases mean more downtime. On top of that, finding a good time to upgrade the application becomes a struggle, as there is no longer such a thing as out-of-hours for a worldwide user base.
All this is compounded by business concerns: downtime is a thing of the past! How come the new competitor on the block doesn’t have any monthly maintenance?
Looking back at the application architecture, carrying the inertia of all those original decisions, the immediate reaction is “this will never work”. But hold off on the big re-architecting project: in my experience, reality is much less scary.
The reason zero-downtime deployments sound scary is that we look at them from the inside out. Zero downtime is actually an external property of the system: it doesn’t mean everything is up, it doesn’t mean everything is running the latest version, it means nobody notices!
Deployment mapping exercise
Trying to solve everything at once is rarely the best way to achieve a good result. Going from a 3-hour blackout to nothing is not going to happen in one go. But there is an obvious success metric to track progress: by how much did we reduce deployment downtime?
First thing is to map out what happens during the deployment:
- What are the different steps?
- How long do they take?
- Do they happen sequentially? In parallel?
- Why does the system need to be down during these steps?
- How could the system be up during these steps?
Armed with the answers to the above, work can be prioritised just like when building a product: what is going to bring the most value?
Tackling roadblocks to zero-downtime
The following points are based on my experience getting the Poppulo core product to zero-downtime deployments. The specifics will vary for each application, but most applications will have to tackle something similar.
Backward compatible database schemas
A common reason for downtime is to have a window in which to migrate database schemas without worrying about backward compatibility. In other words, the old schema is for the old application version, the new one for the new version.
Decoupling database migration from application upgrade forces us to think about this backward compatibility: the old version of the application needs to keep working with the new schema version. This can be achieved by applying the expand and contract pattern, which staggers breaking changes across multiple releases.
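As a hypothetical illustration of expand and contract, imagine splitting a `customer.name` column into `first_name` and `last_name`. Instead of one breaking migration, the change is staggered so that every deployed version only ever sees columns it understands. A minimal sketch using plain JDBC (table, column names and the `JDBC_URL` variable are made up for the example):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

/**
 * Hypothetical expand/contract migration: splitting customer.name into
 * first_name/last_name across three releases instead of one breaking change.
 * Each method would run as the migration step of a separate release.
 */
public class CustomerNameMigration {

    /** Release n (expand): add new columns; the running version keeps using "name". */
    static void expand(Statement stmt) throws SQLException {
        stmt.execute("ALTER TABLE customer ADD COLUMN first_name VARCHAR(255)");
        stmt.execute("ALTER TABLE customer ADD COLUMN last_name VARCHAR(255)");
    }

    /** Release n+1 (migrate): the new code writes both shapes; backfill old rows. */
    static void backfill(Statement stmt) throws SQLException {
        stmt.execute("UPDATE customer SET first_name = name WHERE first_name IS NULL");
    }

    /** Release n+2 (contract): drop the old column once no deployed version reads it. */
    static void contract(Statement stmt) throws SQLException {
        stmt.execute("ALTER TABLE customer DROP COLUMN name");
    }

    public static void main(String[] args) throws SQLException {
        // For illustration only: in practice each step ships with its own release.
        try (Connection conn = DriverManager.getConnection(System.getenv("JDBC_URL"));
             Statement stmt = conn.createStatement()) {
            expand(stmt);
        }
    }
}
```

The key property is that at every point in time, both the currently deployed version and the one being rolled out can work against the schema as it stands.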
Loosen up application & schema version checks
Once database schema migration and application upgrade are decoupled, some code assertions may need to be reviewed or dropped. For example, the application may be running sanity checks against its database to ensure it is not running against an old schema version.
If these sanity checks cannot be dropped, they can be adjusted so that version n of the application tolerates running with a newer version of the schema (n+1).
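A minimal sketch of what relaxing such a check could look like, assuming the application knows which schema version it was built against and reads the deployed version from the database at startup (the class and constant names are invented for the example):

```java
/**
 * Hypothetical startup check: tolerate a newer schema (rolled out ahead of the
 * application), but refuse to run against an older one.
 */
public class SchemaVersionCheck {

    /** Schema version this build of the application was written against. */
    static final int EXPECTED_SCHEMA_VERSION = 42;

    static void assertSchemaCompatible(int deployedSchemaVersion) {
        // Before: any mismatch with EXPECTED_SCHEMA_VERSION was a hard failure.
        // After: only an *older* schema fails; a newer one is tolerated because
        // migrations are backward compatible (expand/contract).
        if (deployedSchemaVersion < EXPECTED_SCHEMA_VERSION) {
            throw new IllegalStateException(
                "Schema version " + deployedSchemaVersion
                + " is older than required version " + EXPECTED_SCHEMA_VERSION);
        }
    }

    public static void main(String[] args) {
        assertSchemaCompatible(43); // schema n+1, application n: allowed
    }
}
```

The reverse case, a newer application against an older schema, stays a hard failure, since the new code may rely on columns that do not exist yet.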
Remove need for wind-down period
Upgrades may need some form of pre-downtime service degradation. For example, if the application processes long-running tasks, the upgrade must start by preventing anybody from starting more of them. While the application may be up during that wind-down time, stopping users from using core functionality has a non-trivial impact.
If these tasks are heavyweight, it’s probably safe to assume they are already decoupled into separate processes for horizontal scaling purposes. These processes may pick up tasks from a queue and run a number of them at a time.
Instead of shutting all of them down during the upgrade, the strategy here is to apply the upgrade in a rolling fashion. The tasks are still long-running, so each process gets its own wind-down period before being upgraded (in other words: “finish what you are doing and stop picking up new work from the queue”). The big difference is that this is invisible to the user: processes are drained and upgraded one after another, always making sure that some of them are fully working during the entire upgrade.
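Here is a sketch of what the per-process wind-down could look like, assuming a worker that polls tasks from a queue (the queue and class names are stand-ins, not a specific library): on a drain request it stops taking new work, finishes the task in flight, and only then exits, so the deployment tooling can upgrade workers one at a time.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical long-running task worker that drains gracefully:
 * "finish what you are doing and stop picking up new work from the queue".
 */
public class DrainableWorker implements Runnable {

    private final BlockingQueue<Runnable> queue;
    private final AtomicBoolean draining = new AtomicBoolean(false);

    DrainableWorker(BlockingQueue<Runnable> queue) {
        this.queue = queue;
    }

    /** Called by the deployment tooling (or a shutdown hook) before upgrading this process. */
    void requestDrain() {
        draining.set(true);
    }

    @Override
    public void run() {
        while (!draining.get()) {
            try {
                // Poll with a timeout so the drain flag is re-checked regularly.
                Runnable task = queue.poll(1, TimeUnit.SECONDS);
                if (task != null) {
                    task.run(); // the task in flight always runs to completion
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        // No new work is picked up past this point; the process can now be stopped.
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        DrainableWorker worker = new DrainableWorker(queue);
        Thread workerThread = new Thread(worker);
        workerThread.start();

        Runtime.getRuntime().addShutdownHook(new Thread(worker::requestDrain));

        queue.put(() -> System.out.println("processing a long-running task"));
        TimeUnit.SECONDS.sleep(2);
        worker.requestDrain();  // simulate the rolling-upgrade drain signal
        workerThread.join();
    }
}
```

The deployment tooling then walks through the workers one by one: drain, wait for completion, upgrade, verify, move on to the next, so there is always capacity processing the queue.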
Prevent messaging incompatibilities
Database schemas aren’t the only thing that needs backward compatibility. If messages are being sent between processes, rolling upgrades imply that different versions of processes will be communicating, at least for a period of time.
Backward compatibility of message schemas can be tackled the same way as for databases, using the expand/contract pattern to handle breaking changes. Processes must also follow Tolerant Reader principles when deserializing messages (e.g. ignoring unknown fields, defining default values for missing fields). The message serialization method may also need to be reviewed - for example, serialized Java objects can lack the flexibility of an evolving structure, compared to a JSON representation.
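As an illustration of the Tolerant Reader side, here is a small sketch assuming messages are JSON and deserialized with Jackson (the `TaskScheduled` message and its fields are invented for the example): unknown fields are ignored so an older consumer can read messages produced by a newer version, and missing fields fall back to defaults.

```java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;

/** Hypothetical message consumed by workers of different versions. */
@JsonIgnoreProperties(ignoreUnknown = true) // tolerate fields added by newer producers
class TaskScheduled {
    public String taskId;
    public int priority = 5; // default when an older producer omits the field
}

public class TolerantReaderExample {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Message from a newer producer: the extra "deadline" field is ignored.
        TaskScheduled fromNewer = mapper.readValue(
            "{\"taskId\":\"t-1\",\"priority\":2,\"deadline\":\"2024-01-01\"}",
            TaskScheduled.class);

        // Message from an older producer: missing "priority" keeps its default.
        TaskScheduled fromOlder = mapper.readValue(
            "{\"taskId\":\"t-2\"}", TaskScheduled.class);

        System.out.println(fromNewer.taskId + " / " + fromOlder.priority);
    }
}
```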
Mitigate user session loss
User-facing applications typically come with some form of session persistence. When processes are restarted, session information must be preserved as much as possible to avoid disrupting users (e.g. forcing them to log in again).
This is a pretty challenging area, especially for large applications that store a lot of information in the session. Here are some options (none of them trivial, unfortunately):
- Persist the session in a data store (can have a performance impact; see the sketch after this list)
- Share sessions between applications (can be challenging due to network broadcasts)
- Remove the session entirely from backend services (i.e. stateless backends)
- Use session affinity, directing all new sessions to upgraded backends and waiting for existing sessions to complete before upgrading the remaining ones
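To make the first option more concrete, here is a minimal sketch of persisting session state outside the application process; the in-memory map stands in for a real data store (a database, Redis, etc.) and all names are invented for the example. Once the state lives outside the process, a backend restarted during a rolling upgrade can pick the session up again.

```java
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical externalized session store: session state lives outside the
 * application process, so a restarted backend can resolve it again.
 */
public class ExternalSessionStore {

    // Stand-in for the external data store, keyed by session ID.
    private final Map<String, Map<String, Serializable>> store = new ConcurrentHashMap<>();

    /** Called whenever a session attribute is written. */
    public void save(String sessionId, String attribute, Serializable value) {
        store.computeIfAbsent(sessionId, id -> new ConcurrentHashMap<>()).put(attribute, value);
    }

    /** Called when a (possibly freshly restarted) backend loads a session. */
    public Serializable load(String sessionId, String attribute) {
        Map<String, Serializable> session = store.get(sessionId);
        return session == null ? null : session.get(attribute);
    }

    public static void main(String[] args) {
        ExternalSessionStore sessions = new ExternalSessionStore();
        sessions.save("abc123", "userId", 42);

        // After this backend is restarted during a rolling upgrade, another
        // (or the same) backend can still resolve the user's session.
        System.out.println(sessions.load("abc123", "userId"));
    }
}
```

In practice this is usually handled by a framework or the application container (something like Spring Session, for example) rather than hand-rolled, and the performance cost comes from serializing the session on every write - which is why trimming what goes into the session is often the first step.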
Making zero-downtime operationally viable
Operating a zero-downtime upgrade is quite a different experience than upgrading during a blackout period: users will be going about their usual work while traffic gets rerouted, processes get shut down, machines get restarted, database tables get migrated. And all of this without them noticing anything (hopefully).
Use load-balancing safely
Upgrading processes in a rolling fashion means that traffic needs to be drained from backends before shutdown and properly re-enabled once they are back up. It goes without saying that if all of this is done by manually editing configuration files, it is not going to end well.
Load balancers like HAProxy can be automated very effectively, using APIs to manage backends or, even better, relying on health checks to decide which backends should receive traffic.
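One concrete way to lean on health checks is to give each backend a health endpoint that the deployment pipeline can flip before shutdown: once the check starts failing, the load balancer (e.g. HAProxy with an HTTP check such as `option httpchk GET /health`) stops sending new traffic to that backend, and the process can be stopped once in-flight requests have completed. A minimal sketch using the JDK’s built-in HTTP server (endpoint paths and port are made up):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical health endpoint used for draining: the health check flips to
 * 503 before shutdown, so the load balancer stops routing new traffic here.
 */
public class DrainableHealthEndpoint {

    private static final AtomicBoolean accepting = new AtomicBoolean(true);

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);

        // Health check polled by the load balancer.
        server.createContext("/health", exchange -> {
            byte[] body = (accepting.get() ? "UP" : "DRAINING").getBytes();
            exchange.sendResponseHeaders(accepting.get() ? 200 : 503, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });

        // Called by the deployment pipeline before stopping this process.
        server.createContext("/drain", exchange -> {
            accepting.set(false);
            exchange.sendResponseHeaders(204, -1);
            exchange.close();
        });

        server.start();
    }
}
```

The same endpoint doubles as the readiness signal after the restart: the load balancer only routes traffic to the backend again once its health check passes on the new version.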
Automate the deployment pipeline
The number of tasks involved in an upgrade can be quite daunting: database migrations, load-balancing, rolling upgrades of hundreds of processes.
The answer here is a solid deployment pipeline: each part of the upgrade is written as code, stored in source control and automated to run the same way regardless of who triggers it (e.g. GitLab Pipelines, Jenkins Pipelines).
An automated deployment pipeline takes away one of the main anxieties of the deployment operator: “will I remember everything I need to do?”. Additionally, using source control not only means the code can be reviewed, but also that the deployment process can be repeatedly tested and any change validated to prevent regressions.
Be ready for the unexpected
Unfortunately, it is impossible to guarantee that failure will not happen, so the best we can do is be ready to recover quickly when it does.
On top of automating the deployment process, it is very important to build enough visibility to understand what is happening during the upgrade (and after it).
Dedicated dashboards showing the progress of upgrades can be very useful, with information such as:
- Breakdown of processes running in old vs new version
- Number and progress of long-running tasks awaiting completion
- Queue lengths for long-running tasks
- Health status of all processes
- Status of synthetic monitoring of core user journeys
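For the old-vs-new breakdown specifically, a common approach is for each process to expose the version it is running as a labelled metric and let the dashboard aggregate by that label. A sketch assuming the Prometheus Java simpleclient (the metric name, port and `APP_VERSION` variable are made up):

```java
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.HTTPServer;
import java.io.IOException;

/**
 * Hypothetical build-info metric: every process exports the version it is
 * running, so a dashboard can chart old vs new processes during the upgrade.
 */
public class BuildInfoMetric {

    private static final Gauge BUILD_INFO = Gauge.build()
            .name("app_build_info")
            .help("Constant gauge labelled with the running application version.")
            .labelNames("version")
            .register();

    public static void main(String[] args) throws IOException {
        // The version would normally come from the build (manifest, env var, ...).
        String version = System.getenv().getOrDefault("APP_VERSION", "dev");
        BUILD_INFO.labels(version).set(1);

        // Expose /metrics for Prometheus to scrape.
        HTTPServer metricsServer = new HTTPServer(9400);
    }
}
```

A query along the lines of `count by (version) (app_build_info)` then gives the old-vs-new breakdown for the dashboard.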
Visibility goes further than dashboards. Especially if something unexpected happens, the people involved need easy access to all the information the applications expose: metrics, logs, events, traces, etc.
In closing
Whether a deployment counts as zero-downtime is entirely in the eyes of the user. Applications carrying the weight of architecture and technology decisions made 10 years ago don’t need to go through a rewrite to achieve it.
Start by mapping out what your downtime timeline looks like and you will be able to spot easy wins. Applying a few simple patterns should help reduce the impact of deployments bit by bit, while automation and observability consolidate the risky parts, getting you closer to upgrading your applications during working hours without anybody noticing.