Measurement is the number one ally of technical work

As engineers, we tend to have high expectations of product managers to justify the value of adding new features. What problem is this solving? How will we validate that we’re building the right thing? Is this the most important thing to work on right now?

These are great questions, but interestingly we don’t always apply the same rigor to our own technical work.

Without concrete data backing up theories, the outcome of any technical work is only the result of guesswork. In the long run, technical work ends up being undervalued because its impact is unpredictable, which leads to engineering work getting under-prioritised and under-resourced.

Can’t we just add more CPU/memory?

In the face of production issues (e.g. performance, instability…), the temptation is to throw more resources at the problem.

It is fine to increase available memory to mitigate an incident. Unfortunately, time is often not taken afterwards to find out why the added resources were needed. Once the symptoms go away, the lack of data prevents any durable fix. The result is poor in the long run: there is no real understanding of what the exact problem was or how to prevent it from happening again (and it will!).

Load testing is an excellent way to gain awareness of what normal looks like. Once a baseline is established, production monitoring can be used to flag anything out of the ordinary: peaks, creeping trends, etc. The simple fact of seeing a component’s key metrics before and after a release can help isolate the source of a change in behaviour.
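As a minimal sketch of that idea, the snippet below summarises load-test samples into a baseline, flags production values that stray too far from it, and compares a metric before and after a release. The function names, the three-standard-deviation tolerance and the latency figures are all assumptions made for illustration, not a recommended alerting policy.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarise load-test samples as (mean, standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, tolerance=3.0):
    """True if value is more than `tolerance` standard deviations from the baseline mean."""
    avg, spread = baseline
    return abs(value - avg) > tolerance * spread


def release_shift(before, after):
    """Relative change in a metric's mean across a release (0.1 means +10%)."""
    return (mean(after) - mean(before)) / mean(before)


# Hypothetical p95 latencies in milliseconds, for illustration only.
load_test_latencies_ms = [100, 102, 98, 101, 99]
baseline = build_baseline(load_test_latencies_ms)

slow_request_flagged = is_anomalous(140, baseline)
normal_request_flagged = is_anomalous(103, baseline)
```

A real monitoring platform would do this over time windows and with more robust statistics, but even this crude comparison turns "it feels slower" into a number that can be tracked.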

Component XYZ really needs a rewrite!

We all take pride in the code we write, but the reality is that code written yesterday is already legacy code. As we keep learning, it is often frustrating to look back at code written a few months or years ago and think of a dozen ways to write it differently. Perhaps the information available at the time was incomplete, time constraints forced compromises on the design, or the code could simply be more readable.

Software engineering is about choosing what code to write and what code not to write. Refactoring a particular piece of code will come at the price of not improving another component, not improving operability or not adding new features. Having data to figure out which of these is the most valuable is a great way to do our job better.

Typical measurements to justify refactoring of a component can revolve around the visible consequences of complex code, for example the number of new bugs/incidents resulting from frequent changes to this component. On the other hand, if a component isn’t the source of issues and is almost never changed, the cost and risk of rewriting it purely for readability purposes likely outweigh the benefits.
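One rough way to put numbers on this is to rank components by incidents per change, ignoring those that are rarely touched. The sketch below assumes the change and incident counts have already been collected (for example from version control and an issue tracker); the component names, the figures and the `min_changes` threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ComponentStats:
    name: str
    changes: int    # changes touching the component in the period
    incidents: int  # bugs/incidents traced back to it in the period


def refactoring_candidates(stats, min_changes=5):
    """Components changed often enough to matter, ordered by incidents per change."""
    active = [s for s in stats if s.changes >= min_changes]
    return sorted(active, key=lambda s: s.incidents / s.changes, reverse=True)


# Hypothetical figures, for illustration only.
stats = [
    ComponentStats("billing", changes=40, incidents=12),
    ComponentStats("search", changes=25, incidents=2),
    ComponentStats("legacy-export", changes=1, incidents=0),
]
candidates = [s.name for s in refactoring_candidates(stats)]
```

Here `legacy-export` drops out because it is almost never changed, which matches the argument above: rewriting it for readability alone is unlikely to pay off.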

We should definitely move to (insert new trend here)!

The technical landscape is moving fast and there are always new frameworks, architectures and other silver bullets claiming to solve everything. It is very tempting to trade away yesterday’s solution for today’s shiny new tool, but this doesn’t come without a cost. Once again, it is pure guesswork without measurement: not only might the current system actually be fit for purpose, but we also won’t know whether the new system is an improvement!

Before jumping on the next trend, we need to ask ourselves a few questions:

What problem are we trying to solve?
Can we measure it? If not, what do we need to do to measure it?
What dial(s) are we trying to move?
How is this new solution moving us in the right direction?
Can we trial it? (and what improvement can we expect?)
Are there simpler ways to achieve the same result?
How will we know that we have succeeded? (and that we can move on to something more important!)

But wait, measurement is not free!

To quote Site Reliability Engineering (Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy):

If you can’t monitor a service, you don’t know what’s happening, and if you’re blind to what’s happening, you can’t be reliable.

Instrumenting code and maintaining a monitoring platform are definitely not free. Like everything mentioned in this post, they come at the expense of other types of work. However, without them any other type of work remains based on instinct; it might make things better, it might not… and worse still, we won’t know which it is!
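To give a sense of what the smallest possible instrumentation can look like, here is a sketch of a timing decorator that records how long a function takes. The in-memory `durations` store stands in for a real metrics backend, and `timed` and `parse_order` are hypothetical names used only for this example.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory store standing in for a real metrics backend (an assumption).
durations = defaultdict(list)


def timed(name):
    """Record the wall-clock duration of each call under the given metric name."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even if the call raises, so failures are measured too.
                durations[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("parse_order")
def parse_order(raw):
    """Hypothetical function being instrumented."""
    return raw.strip().split(",")


parsed = parse_order(" 42,widget ")
```

Even a sketch like this has a cost: the decorator has to be applied, the data has to be stored somewhere and someone has to look at it, which is exactly the trade-off described above.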

Pierre Vincent