I’m a betting man; I just enjoy making bets, even when there are no stakes at all.

And when you talk about bets, you end up talking about odds.
It turns out that reliability is also about odds, even though we don’t use the language of odds in our domain. Consider how we talk about availability. We report system availability as a number of nines: for example, we might say “four nines of availability”, which means 99.99% of somethings are good over some time interval. The canonical example of a something is a request, and a good one is a request that succeeded. In that case, if someone says a service has four nines of availability over the past three months, that means that 99.99% of requests succeeded over that time period. We could express the same information by saying that there was a one-in-ten-thousand chance that any given request failed in the past three months.
If your system has exhibited four nines of availability in the past three months, and you assume that the availability of your system in the near future will be like the availability of the past (a dangerous and unwarranted assumption, but let’s go with it for a moment), then we could also express this information using the language of odds, by stating that the odds against a request failing are roughly ten thousand to one (9,999 to 1, to be exact).
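To make the arithmetic concrete, here is a minimal Python sketch that converts a number of nines into a failure probability and into odds. The function names are my own, made up for illustration, and the sketch assumes every request is equally likely to fail, which is the same simplification as above.

```python
from fractions import Fraction


def failure_probability(nines: int) -> Fraction:
    """Failure probability implied by `nines` nines of availability."""
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    # Four nines means 99.99% good, so 1 in 10**4 is bad.
    return Fraction(1, 10 ** nines)


def odds_against_failure(nines: int) -> Fraction:
    """Odds against a failure, as the ratio of successes to failures."""
    p = failure_probability(nines)
    return (1 - p) / p


def describe(nines: int) -> str:
    """Render the same availability figure as a chance and as odds."""
    p = failure_probability(nines)
    odds = odds_against_failure(nines)
    return (
        f"{nines} nines: 1 in {p.denominator:,} chance of failure, "
        f"odds {odds.numerator:,} to {odds.denominator:,} against"
    )


print(describe(4))
# 4 nines: 1 in 10,000 chance of failure, odds 9,999 to 1 against
```

Strictly speaking, a one-in-ten-thousand chance works out to odds of 9,999 to 1 against, which is close enough to ten thousand to one for conversational purposes.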
But this isn’t a post about describing availability in the language of odds. Instead, what I want to talk about is how all reliability work is inherently about improving the odds, increasing the likelihood that the system stays up. Any time we build any sort of reliability mechanism, be it load shedding, autoscaling, canarying, staged deployments, automated rollbacks, or what have you, we are building automation into the system that either eliminates or reduces the impact of a subset of potential problems. If you ask an engineer working on improving reliability, “will this prevent all future incidents?”, they will tell you “no, of course not”.
However, we don’t explicitly think of reliability work in terms of improving the odds. Instead, we tend to think of it as deterministically addressing a specific class of problem. You’ll hear questions like, “how many historical incidents would this tech have prevented?” when people are trying to determine whether engineering should invest in a particular reliability solution. The people asking are looking for an answer like, “this would have prevented 20% of our SEV1s and SEV0s”. This 20% isn’t interpreted as a likelihood; instead, it’s used as an estimate of impact, as in “this will improve our availability by around 20%”. The idea is that this reliability work will deterministically eliminate or mitigate a certain fraction of incidents; we just don’t know exactly what that fraction is, so we estimate it from historical data.
What I would like to propose in this post is that we think about all of the various kinds of reliability work as improving the odds of our system being up longer, instead of assuming that reliability work will have a fixed effect and trying to estimate the effect size. I’ve got two motivations for taking this perspective of reliability work as odds improvement.
The first motivation is that I don’t think we can ever estimate the effect size without error bars that are so huge that the estimates are themselves meaningless. As I’ve written about previously, the variation in incidents is just too large relative to the amount of data we have available. And, to make the problem of estimating from historical data even worse, our system is changing over time. Or, to put it in technical terms, I don’t believe that incidents can be modeled as a stationary process. (Heck, if they were stationary, that would mean reliability work could not have an impact, because any impact would change the process over time!) Note that I’ve never seen anybody try to validate the estimates: they’re always point-in-time estimates used to justify work, and then promptly forgotten about. In one sense, that’s fine; they served their purpose of convincing leadership that we should allocate cycles for a particular kind of reliability work. But we shouldn’t fool ourselves into believing that these estimates are meaningful: they’re for persuasion, not insight.
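To get a feel for how wide those error bars can be, here’s a rough Python sketch that puts a 95% Wilson score interval around a “would have prevented 20%” figure. The numbers (2 preventable incidents out of 10) are made up for illustration, and the calculation generously assumes that incidents are independent draws from a fixed process, which is exactly the stationarity assumption I don’t believe holds. The real uncertainty would be larger than what this shows.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    p = successes / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Hypothetical data: 2 of the last 10 major incidents would have been prevented.
low, high = wilson_interval(2, 10)
print(f"point estimate 20%, 95% interval {low:.0%} to {high:.0%}")
```

Even under those generous assumptions, the interval runs from roughly 6% to 51%, so the “20%” could plausibly be almost nothing or about half.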
It’s my second motivation, though, that prompted me to write this blog post. And that’s because the idea of reliability work as improving the odds of effectively mitigating future incidents is a useful framework for thinking about work that improves resilience. I’m interested in improving the skills of the people who respond to incidents, putting them in a better position to deal with those future unforeseen, surprising scenarios. One way to do this is learning from how responders dealt with previous incidents, the different sorts of observability data they had access to and how, the different knobs that they were able to turn, and so on. While the next incidents will be different, the set of tools that is available during incident response is generally the same. There’s no way I can give a quantitative estimate of how this sort of skill improvement work will impact reliability. But despite the enormous number of random factors, I am confident that it will improve our odds.