Here’s a thought experiment that John Allspaw related to me, in paraphrased form (John tells me that he will eventually capture this in a blog post of his own, at which time I’ll put a proper link).
Consider a small-ish tech company that has four engineering teams (A, B, C, D), where an engineer from Team A was involved in an incident (In John’s telling, the incident involves the Norway problem). In the wake of this incident, a post-incident write-up is completed, and the write-up does a good job of describing what happened. Next, imagine that the write-up is made available to teams A, B, and C, but not to team D. Nobody on team D is allowed to read the write-up, and nobody from the other teams is permitted to speak to team D about the details of the incident. The question is: are the members of team D at a disadvantage compared to the other teams?
The point of this scenario is to convey the intuition that, even though team D wasn’t involved in the incident, its members can still learn something from its details that makes them better engineers.
Switching gears for a moment, let’s talk about the new tools that are emerging under the label AI SRE. We’re now starting to see more tools that leverage LLMs to try to automate incident diagnosis and remediation, such as incident.io’s AI SRE product, Datadog’s Bits AI SRE, Resolve.ai (tagline: Your always-on AI SRE), and Cleric (tagline: AI SRE teammate). These tools work by reading in signals from your organization such as alerts, metrics, Slack messages, and source code repositories.
To effectively diagnose what’s happening in your system, you don’t just want to know what’s happening right now, you also want access to historical data, since maybe there was a similar problem that happened, say, a year ago. While LLMs will have been trained with a lot of general knowledge about software systems, they won’t have been trained on the specific details of your system, and your system will fail in system-specific ways, which means that (I assume!) these AI SRE systems will work better if they have access to historical data about your system.
Here’s a second thought experiment, this one my own: Imagine that you’ve adopted one of these AI SRE tools, but the only historical data about your system that you can feed the tool is the collection of your company’s post-incident write-ups. What kinds of details would be useful to an AI SRE tool in helping to troubleshoot future incidents? Perhaps we should encourage people to write their incident reports as if they will be consumed by an AI SRE tool that will use them to learn as much as possible about the work involved in diagnosing and remediating incidents in your company. I bet the humans who read them would learn more that way too.
(With apologies to the screenwriters of Forrest Gump)
I’m going to use this post to pull together some related threads from different sources I’ve been reading lately.
Rationalization as discarding information
The first thread is from The Control Revolution by the late American historian and sociologist James Beniger, which was published back in the 1980s: I discovered this book because it was referenced in Neil Postman’s Technopoly.
Beniger references Max Weber’s concept of rationalization, which I had never heard of before. I’m used to the term “rationalization” as a pejorative term meaning something like “convincing yourself that your emotionally preferred option is the most rational option”, but that’s not how Weber meant it. Here’s Beniger, emphasis mine (from p15):
Although [rationalization] has a variety of meanings … most definitions are subsumed by one essential idea: control can be increased not only by increasing the capacity to process information but also by decreasing the amount of information to be processed.
…
In short, rationalization might be defined as the destruction or ignoring of information in order to facilitate its processing.
This idea of rationalization feels very close to James Scott’s idea of legibility, where organizations depend on simplified models of the system in order to manage it.
Decision making: humans versus statistical models
The second thread is from Benjamin Recht, a professor of computer science at UC Berkeley who does research in machine learning. Recht wrote a blog post recently called The Actuary’s Final Word about the performance of algorithms versus human experts on performing tasks such as medical diagnosis. The late American psychology professor Paul Meehl argued back in the 1950s that the research literature showed that statistical models outperformed human doctors when it came to diagnosing medical conditions. Meehl’s work even inspired the psychologist Daniel Kahneman, who famously studied heuristics and biases.
In his post, Recht asks, “what gives?” If we have known since the 1950s that statistical models do better than human experts, why do we still rely on human experts? Recht’s answer is that Meehl is cheating: he’s framing diagnostic problems as statistical ones.
Meehl’s argument is a trick. He builds a rigorous theory scaffolding to define a decision problem, but this deceptively makes the problem one where the actuarial tables will always be better. He first insists the decision problem be explicitly machine-legible. It must have a small number of precisely defined actions or outcomes. The actuarial method must be able to process the same data as the clinician. This narrows down the set of problems to those that are computable. We box people into working in the world of machines.
…
This trick fixes the game: if all that matters is statistical outcomes, then you’d better make decisions using statistical methods.
Once you frame a problem as being statistical in nature, then a statistical solution will be the optimal one, by definition. But, Recht argues, it’s not obvious that we should be using the average of the machine-legible outcomes in order to do our evaluation. As Recht puts it:
How we evaluate decisions determines which methods are best. That we should be trying to maximize the mean value of some clunky, quantized, performance indicator is not normatively determined. We don’t have to evaluate individual decisions by crude artificial averages. But if we do, the actuary will indeed, as Meehl dourly insists, have the final word.
Statistical averages and safe self-driving cars
I had Recht’s post in mind when reading Philip Koopman’s new book Embodied AI Safety. Koopman is Professor Emeritus of Electrical Engineering at Carnegie Mellon University, and he’s a safety researcher who specializes in automotive safety. (I first learned about him from his work on the Toyota unintended acceleration cases from about ten years ago).
I’ve just started his book, but these lines from the preface jumped out at me (emphasis mine):
In this book, I consider what happens once you … come to realize there is a lot more to safety than low enough statistical rates of harm.
…
[W]e have seen numerous incidents and even some loss events take place that illustrate “safer than human” as a statistical average does not provide everything that stakeholders will expect from an acceptably safe system. From blocking firetrucks, to a robotaxi tragically “forgetting” that it had just run over a pedestrian, to rashes of problems at emergency response scenes, real-world incidents have illustrated that a claim of significantly fewer crashes than human drivers does not put the safety question to rest.
More numbers than you can count
I’m also reading The Annotated Turing by Charles Petzold. I had tried to read Alan Turing’s original paper where he introduced the Turing machine, but found it difficult to understand, and Petzold provides a guided tour through the paper, which is exactly what I was looking for.
I’m currently in Chapter 2, where Petzold discusses the German mathematician Georg Cantor’s famous result that the real numbers are not countable, that the size of the set of real numbers is larger than the size of the set of natural numbers. (In particular, it’s the transcendental numbers like π and e that aren’t countable: we can actually count what are called the algebraic real numbers, like √2).
To tie this back to the original thread: rationalization feels to me like the process of focusing on only the algebraic numbers (which include the integers and rational numbers), even though most of the real numbers are transcendental.
Ignoring the messy stuff is tempting because it makes analyzing what’s left much easier. But we can’t forget that our end goal isn’t to simplify analysis, it’s to achieve insight. And that’s exactly why you don’t want to throw away the messy stuff.
Recent U.S. headlines have been dominated by school shootings. The bulk of the stories have been about the assassination of Charlie Kirk on the campus of Utah Valley University and the corresponding political fallout. On the same day, there was also a shooting at Evergreen High School in Colorado, where a student shot and injured two of his peers. This post isn’t about those school shootings, but rather, one that happened three years ago. On May 24, 2022, at Robb Elementary School in Uvalde, Texas, 19 students and 2 teachers were killed by a shooter who managed to make his way onto the campus.
Law enforcement were excoriated for how they responded to the Uvalde shooting incident: several were fired, and two were indicted on charges of child endangerment. On January 18, 2024, the Department of Justice released the report on their investigation of the shooting: Critical Incident Review: Active Shooter at Robb Elementary School. According to the report, there were multiple things that went wrong during the incident. Most significantly, the police originally believed that the shooter had barricaded himself in an empty classroom, when in fact the shooter was in a classroom with students. There were also communication issues that resulted in a common ground breakdown during the response. But what I want to talk about in this post is the keys.
The search for the keys
During the response to the Uvalde shooting, there was significant effort by the police on the scene to locate master keys to unlock rooms 111/112 (numbered p14, PDF p48, emphasis mine).
Phase III of the timeline begins at 12:22 p.m., immediately following four shots fired inside classrooms 111 and 112, and continues through the entry and ensuing gunfight at 12:49 p.m. During this time frame, officers on the north side of the hallway approach the classroom doors and stop short, presuming the doors are locked and that master keys are necessary.
The search for keys started before this, because room 109 was locked, and had children in it, and the police wanted to evacuate those children (numbered p 13, PDF p48):
By approximately 12:09 p.m., all classrooms in the hallways have been evacuated and/or cleared except rooms 111/112, where the subject is, and room 109. Room 109 is found to be locked and believed to have children inside.
If you look at the Minute-by-Minute timeline section of the report (numbered p17, PDF p50) you’ll see the text “Events: Search for Keys” appear starting at 12:12 PM, all of the way until 12:45 PM.
The irony here is that the door to room 111/112 may have never been locked to begin with, as suggested by the following quote (numbered p15, PDF p48), emphasis mine:
At around 12:48 p.m., the entry team enters the room. Though the entry team puts the key in the door, turns the key, and opens it, pulling the door toward them, the [Critical Incident Review] Team concludes that the door is likely already unlocked, as the shooter gained entry through the door and it is unlikely that he locked it thereafter.
Ultimately, the report explicitly calls out how the search for the keys led to delays in response (numbered p xxviii, PDF p30):
Law enforcement arriving on scene searched for keys to open interior doors for more than 40 minutes. This was partly the cause of the significant delay in entering to eliminate the threat and stop the killing and dying inside classrooms 111 and 112. (Observation 10)
Fixation
In hindsight, we can see that the responders got something very important wrong in the moment: they were searching for keys for a door that probably wasn’t even locked. In this specific case, there appears to have been some communication-related confusion about the status of the door, as shown by the following (numbered p53, PDF p86):
The BORTAC [U.S. Border Patrol Tactical Unit] commander is on the phone, while simultaneously asking officers in the hallway about the status of the door to classrooms 111/112. UPD Sgt. 2 responds that they do not know if the door is locked. The BORTAC commander seems to hear that the door is locked, as they say on the phone, “They’re saying the door is locked.” UPD Sgt. 2 repeats that they do not know the status of the door.
More generally, this sort of problem is always going to happen during incidents: we are forever going to come to conclusions during an incident about what’s happening that turn out to be wrong in hindsight. We simply can’t avoid that, no matter how hard we try.
The problem I want to focus on here is not the unavoidable getting it wrong in the moment, but the actually-preventable problem of fixation. We “fixate” when we focus solely on one specific aspect of the situation. The problem here is not searching for keys, but searching for keys to the exclusion of other activities.
During complex incidents, the underlying problem is frequently not well understood, and so the success of a proposed mitigation strategy is almost never guaranteed. Maybe a rollback will fix things, but maybe it won’t! The way to overcome this problem is to pursue multiple strategies in parallel. One person or group focuses on rolling back a deployment that aligns in time, another looks for other types of changes that occurred around the same time, yet another investigates the logs, another looks into scaling up the amount of memory, someone else investigates traffic pattern changes, and so on. By pursuing multiple diagnostic and mitigation strategies in parallel, we reduce the risk of delaying the mitigation of the incident by blocking on the investigation of one avenue that may turn out to not be fruitful.
Doing this well requires diversity of perspectives and effective coordination. You’re more likely to come up with a broader set of options to pursue if your responders have a broader range of experiences. And the more avenues that you pursue, the more the coordination overhead increases, as you now need to keep the responders up to date about what’s going on in the different threads without overwhelming them with details.
Fixation is a pernicious risk because we’re more likely to fixate when we’re under stress. Since incidents are stressful by nature, they are effectively incubators of fixation. In the heat of the moment, it’s hard to take a breath, step back for a moment, understand what’s been tried already, and calmly ask about what the different possible options are. But the alternative is to tumble down the rabbit hole, searching for keys to a door that is already unlocked.
A progressive rollout refers to the act of rolling out some new functionality gradually rather than all at once. This means that, when you initially deploy it, the change only impacts a fraction of your users. The idea behind a progressive rollout is to reduce the risk of a deployment by reducing the blast radius: if something goes wrong with the new thing during deployment, then the impact is much smaller than if you had deployed it all-at-once, to all of the traffic.
The impact of a bad rollout is shown in red
There are two general strategies for doing a progressive rollout. One strategy is coarse grained, where you stage your deploys across domains. For example, deploying the new functionality to one geographic region at a time. The second strategy is more fine-grained, where you define a ramp up schedule (e.g., 1% of traffic to the new thing, then 5%, then 10%, etc.).
Note that the two strategies aren’t mutually exclusive: you can stage your deploy across regions, and within each region, do a fine-grained ramp-up. And you can also think of it as a spectrum rather than two separate categories, since you can control the granularity. But I make the distinction here because I want to talk specifically about the fine-grained approach, where we use a ramp.
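For concreteness, here’s a minimal sketch of the fine-grained approach, where a hash of the user ID decides which users see the new code path at the current ramp percentage. The schedule and names here are hypothetical, not taken from any particular feature-flag system.

import hashlib

# Hypothetical ramp schedule: the percentage of traffic on the new code path
# at each stage of the rollout.
RAMP_SCHEDULE = [1, 5, 10, 25, 50, 100]

def in_new_cohort(user_id: str, ramp_percent: int) -> bool:
    # Deterministically bucket a user (0-99) based on a hash of their ID, so a
    # given user stays in the same cohort as the ramp percentage increases.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < ramp_percent

The deterministic hash is a common design choice: as you walk up the schedule, users who already saw the new code path keep seeing it, and new users are added at the margin.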
The ramp is clearly superior if you’re able to detect a problem during deployment, as shown in the diagram above. It’s a real win if you have automation that can detect a problem automatically based on a metric like error rate. The problem with the ramp is the scenario where you don’t detect that there’s a problem with the deployment.
My claim here in this post is that if you don’t detect a problem with a fine-grained progressive rollout until after the rollout has completed, then it will tend to take you longer to diagnose what the problem is:
Paradoxically, progressive rollout can increase the blast radius by making after-the-fact diagnosis harder
Here’s my argument: once you know something is wrong with your system, but you don’t know what it is that has gone wrong, one of the things you’ll do is look at dashboard graphs for a signal that identifies when the problem started, such as an increase in error rate or request latency. When you do a fine-grained progressive rollout, if something has gone wrong, then the impact will get smeared out over time, and it will be harder to identify the rollout as the relevant change by looking at a dashboard. If you’re lucky, your observability tools will let you slice on the rollout dimension. This is why I like coarse-grained rollouts: if you have explicit deployment domains like geographical regions, then your observability tools will almost certainly let you slice the data based on those. Heck, you should have existing dashboards that already slice on it. But for fine-grained rollouts, you may not think to slice on a particular rollout dimension (especially if you’re rolling out a bunch of things at once, all of them doing fine-grained deployments), and you might not even be able to.
Whether fine-grained rollouts are a net win depends on a number of factors whose values are not obvious, including:
the probability you detect a problem during the rollout vs after the rollout
how much longer it takes to diagnose the problem if not caught during rollout
your cost model for an incident
On the third bullet: the above diagram implicitly assumes that impact to the business is linear with respect to time. However, it might be non-linear: an hour-long incident may turn out to be more expensive than two half-hour-long incidents combined.
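As a toy illustration of how those three factors interact (every number below is invented, and the cost model is deliberately crude):

def expected_incident_cost(p_caught_during_rollout, cost_caught_early, cost_caught_late):
    # Expected cost of a bad change under a simple two-outcome model.
    return (p_caught_during_rollout * cost_caught_early
            + (1 - p_caught_during_rollout) * cost_caught_late)

# With a high chance of catching the problem mid-rollout, the fine-grained ramp
# wins; with a lower chance, the coarse-grained rollout wins.
for p in (0.8, 0.5):
    ramp = expected_incident_cost(p, cost_caught_early=5, cost_caught_late=100)
    coarse = expected_incident_cost(p, cost_caught_early=30, cost_caught_late=50)
    print(f"p={p}: ramp={ramp:.1f}, coarse={coarse:.1f}")

A nonlinear cost-versus-duration curve would change these numbers further, which is exactly the point: the answer isn’t obvious.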
As someone who works in the reliability space, I’m acutely aware of the pain of incidents that take a long time to mitigate because they are difficult to diagnose. But I think the trade-offs of fine-grained progressive rollouts are generally not recognized as such: it’s easy to imagine the benefits when problems are caught earlier, and it’s harder to imagine the scenarios where the problem isn’t caught until later, and how much harder things get because of it.
The Axiom of Experience: the future will be like the past, because, in the past, the future was like the past. – Gerald M. Weinberg, An Introduction to General Systems Thinking
Last Friday, the San Francisco Bay Area Rapid Transit system (known as BART) experienced a multiple-hour outage. Later that day, the BART Deputy General Manager released a memo about the outage with some technical details. The memo is brief, but I was honestly surprised to see this amount of detail in a public document that was released so quickly after an incident, especially from a public agency. What I want to focus on in this post is this line (emphasis mine):
Specifically, network engineers were performing a cutover to a new network switch at Montgomery St. Station… The team had already successfully performed eight similar cutovers earlier this year.
This reminded me of something I read in the Buildkite writeup from an incident that happened back in January of this year (emphasis mine):
Given the confidence gained by initial load testing and the migrations already performed over the past year, we wanted to allow customers to take advantage of their seasonal low periods to perform shard migrations, as a win-win. This caused us to discount the risk of performing migrations during a seasonal low period and what impacts might emerge when regular peak traffic returned.
There’s a similar line in the report on the major Rogers network outage that took place in Canada in 2022:
Rogers had assessed the risk for the initial change of this seven-phased process as “High”. Subsequent changes in the series were listed as “Medium.” [redacted] was “Low” risk based on the Rogers algorithm that weighs prior success into the risk assessment value. Thus, the risk value for [redacted] was reduced to “Low” based on successful completion of prior changes.
Whenever we make any sort of operational change, we have a mental model of the risk associated with the change. We view novel changes (I’ve never done something like this before!) as riskier than changes we’ve performed successfully multiple times in the past (I’ve done this plenty of times). I don’t think this sort of thinking is a fallacy: rather, it’s a heuristic, and it’s generally a pretty effective one! But, like all heuristics, it isn’t perfect. As shown in the examples above, the application of this heuristic can result in a miscalibrated mental model of the risk associated with a change.
So, what’s the broader lesson? In practice, our risk models (implicit or otherwise) are always miscalibrated: a history of past successes is just one of multiple avenues that can lead us astray. Trying to achieve a perfect risk model is like trying to deploy software that is guaranteed to have zero bugs: it’s never going to happen. Instead, we need to accept the reality that, like our code, our models of risk will always have defects that are hidden from us until it’s too late. So we’d better get damned good at recovery.
You’ve undoubtedly heard of the psychological concept called flow state. This is the feeling you get when you’re in the zone, where you’re doing some sort of task, and you’re just really into it, and you’re focused, and it’s challenging but not frustratingly so. It’s a great feeling. You might experience this with a work task, or a recreational one, like when playing a sport or a video game. The pioneering researcher on the phenomenon of flow was the Hungarian-American psychologist Mihaly Csikszentmihalyi, and he wrote a popular book on the subject back in 1990 with the title Flow: The Psychology of Optimal Experience, which I read many years ago. But the thing from Csikszentmihalyi’s book that stuck with me most was the research method that he used to study flow.
One of the challenges of studying people’s experiences is that it’s difficult for researchers to observe them directly. This problem comes up when an organization tries to do research on the current state of developer productivity within the organization. I harp on “make work visible” a lot because so much of the work we do in the software world is so hard for others to see. There are different data collection techniques that developer productivity researchers use, including surveys, interviews, and focus groups, as well as automatic collection of metrics, like the DORA metrics. Of those, only the automatic collection of metrics captures in-the-moment data, and it’s a very thin type of data at that. Those metrics can’t give you any insights into the challenges that your developers are facing.
My preferred technique is the case study, which I try to apply to incidents. I like the incident case study technique because it gives us an opportunity to go deep into the nature of the work for a specific episode. But incident-as-case-study only works for, well, incidents, and while a well-done incident case study can shine a light on the nature of the development work, there’s also a lot that it will miss.
Csikszentmihalyi used a very clever approach which was developed by his PhD student Suzanne Prescott, called experience sampling. He gave the participants of his study pagers, and he would page them at random times. When paged, the participants would write down information about their experiences in a journal in-the-moment. In this way, he was able to collect information about subjective experience, without the problems you get when trying to elicit an account retrospectively.
I’ve never read about anybody trying to use this approach to study developer productivity, and I think that’s a shame. It’s something I’ve wanted to try myself, except that I have not worked in the developer productivity space for a long, long time.
These days, I’d probably use Slack rather than a pager and journal to randomly reach out to the volunteers during the study and collect their responses, but the principle is the same. I’ve long wanted to capture an “are you currently banging your head against a wall” metric from developers, but with experience sampling, you could capture a “what are you currently banging your head against the wall about?”
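Here’s a rough sketch of what that might look like, assuming the slack_sdk Python package and a bot token with permission to message the volunteers; the user IDs, prompt text, and sampling interval are all made up.

import random
import time

from slack_sdk import WebClient

client = WebClient(token="xoxb-...")   # hypothetical bot token
volunteers = ["U012ABC", "U034DEF"]    # hypothetical Slack user IDs of study volunteers

def sample(user_id):
    # Send an in-the-moment prompt; passing a user ID as the channel typically
    # posts to a direct message with that user.
    client.chat_postMessage(
        channel=user_id,
        text="Experience sampling ping: what are you working on right now, "
             "and what (if anything) are you banging your head against?",
    )

while True:
    # Ping a randomly chosen volunteer at a random time in the next few hours.
    time.sleep(random.uniform(1, 3) * 3600)
    sample(random.choice(volunteers))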
Would this research technique actually work for studying developer productivity issues within an organization? I honestly don’t know. But I’d love to see someone try.
Note: I originally had the incorrect publication date for the Flow book. Thanks to Daniel Miller for the correction.
Accountability is a mechanism that achieves better outcomes by aligning incentives, in particular, negative ones. Specifically: if you do a bad thing, or fail to do a good thing, under your sphere of control, then bad things will happen to you. I recently saw several LinkedIn posts that referenced the U.S. Coast Guard report on the OceanGate experimental submarine implosion. These posts described how this incident highlights the importance of accountability in leadership. And, indeed, the report itself references accountability five times.
However, I think this incident is an example of a type of problem where accountability doesn’t actually help. Here I want to talk about two classes of problems where accountability is a poor solution to addressing the problem, where the OceanGate accident falls into the second class.
Coordination challenges
Managing a large organization is challenging. Accountability is a popular tool in such organizations to ensure that work actually gets done, by identifying someone who is designated as the stuckee for ensuring that a particular task or project gets completed. I’ll call this top-down accountability. This kind of accountability is sometimes referred to, unpleasantly, as the “one throat to choke” model.
Darth Vader enforcing accountability
For this model to work, the problem you’re trying to solve needs to be addressable by the individual that is being held accountable for it. Where I’ve seen this model fall down is in post-incident work. As I’ve written about previously, I’m a believer in the resilience engineering model of complex systems failures, where incidents arise due to unexpected interactions between components. These are coordination problems, where the problems don’t live in one specific component, but, rather, how the components interact with each other.
But this model of accountability demands that we identify an individual to own the relevant follow-up incident work. And so it creates an incentive to always identify a root cause service, which is owned by the root cause team, who are then held accountable for addressing the issue.
Now, just because you have a coordination problem, that doesn’t mean that you don’t need an individual to own driving the reliability improvements around it. In fact, that’s why technical project managers (known as TPMs) exist. They act as the accountable individuals for efforts that require coordination across multiple teams, and every large tech organization that I know of employs TPMs. The problem I’m highlighting here, such as in the case of incidents, is that accountability is applied as a solution without recognizing that the problem revealed by the incident is a coordination problem.
You can’t solve a coordination problem by identifying one of the agents involved in the coordination and making them accountable. You need someone who is well-positioned in the organization, recognizes the nature of the problem, and has the necessary skills to be the one who is accountable.
Miscalibrated risk models
The other way people talk about accountability is about holding leaders such as politicians and corporate executives responsible for their actions, where there are explicit consequences for them acting irresponsibly, including actions such as corruption, or taking dangerous risks with the people and resources that have been entrusted to them. I’ll call this bottom-up accountability.
The bottom-up accountability enforcement tool of choice in France, circa 1792
This brings us back to the OceanGate accident of June 18, 2023. In this accident, the TITAN submersible imploded, killing everyone aboard. One of the crewmembers who died was Stockton Rush, who was both pilot of the vessel and CEO of OceanGate.
The report is a scathing indictment of Rush. In particular, it criticizes how he sacrificed safety for his business goals, ran an organization that lacked the expertise required to engineer experimental submersibles, promoted a toxic workplace culture that suppressed signs of trouble instead of addressing them, and centralized all authority in himself.
However, one thing we can say about Rush was that he was maximally accountable. After all, he was both CEO and pilot. He believed so much that TITAN was safe that he literally put his life on the line. As Nassim Taleb would put it, he had skin in the game. And yet, despite this accountability, he still took irresponsible risks, which led to disaster.
By being the pilot, Rush personally accepted the risks. But his actual understanding of the risk, his model of risk, was fundamentally incorrect. It was wrong, dangerously so.
Rush assessed the risk index of the fateful dive at 35. The average risk index of previous dives was 36.
Assigning accountability doesn’t help when there’s an expertise gap. Just as giving a software engineer a pager does not bestow upon them the skills that they need to effectively do on-call operations work, having the CEO of OceanGate also be the pilot of the experimental vehicle did not lead to him being able to exercise better judgment about safety.
Rush’s sins weren’t merely lack of expertise, and the report goes into plenty of detail about his other management shortcomings that contributed to this incident. But, stepping back from the specifics of the OceanGate accident, there’s a greater point here: making executives accountable isn’t sufficient to avoid major incidents if the risk models that executives use to make decisions are out of whack with the actual risks. And by risk models here, I don’t just mean some sort of formal model like the risk assessment matrix above. Everyone carries with them an implicit risk model in their heads: a mental risk model.
Double binds
While the CEO also being a pilot sounds like it should be a good thing for safety (skin in the game!), it also creates a problem that the resilience engineering folks refer to as a double bind. Yes, Rush had strong incentives to ensure he wasn’t taking stupid risks, because otherwise he might die. But he also had strong incentives to keep the business going, and those incentives were in direct conflict with the safety incentives. But double-binds are not just an issue for CEO-pilots, because anyone in the organization will feel pressure from above to make decisions in support of the business, which may cut against safety. Accountability doesn’t solve the problem of double-binds, it exacerbates them, by putting someone on the hook for delivering.
Once again, from the resilience engineering literature, one way to deal with this problem is through cross-checks. For example, see the paper Collaborative Cross-Checking to Enhance Resilience by Patterson, Woods, Cook, and Render. Instead of depending on a single individual (accountability), you take advantage of the different perspectives of multiple people (diversity).
You also need someone who is not under a double-bind who has the authority to say “this is unsafe”. That wasn’t possible at OceanGate, where the CEO was all-powerful, and anybody who spoke up was silenced or pushed out.
On this note, I’ll leave you with a six-minute C-SPAN video clip from 2003. In this clip, the resilience engineering researcher David Woods spoke at a U.S. Senate hearing in the wake of the Columbia accident. Here he was talking about the need for an independent safety organization at NASA as a mechanism for guarding against the risks that emerge from double binds.
One of the early criticisms of Darwin’s theory of evolution by natural selection was about how it could account for the development of complex biological structures. It’s often not obvious to us how the earlier forms of some biological organ would have increased fitness. “What use”, asked the 19th century English biologist St. George Jackson Mivart, “is half a wing?”
One possible answer is that while half a wing might not be useful for flying, it may have had a different function, and evolution eventually repurposed that half-wing for flight. This concept, that evolution can take some existing trait in an organism that serves a function and repurpose it to serve a different function, is called exaptation.
Biology seems to be quite good at using the resources that it has at hand in order to solve problems. Not too long ago, I wrote a review of the book How Life Works: A User’s Guide to the New Biology by the British science writer Philip Ball. One of the main themes of the book is how biologists’ view of genes has shifted over time from the idea of DNA-as-blueprint to DNA-as-toolbox. Biological organisms are able to deal effectively with a wide range of challenges by having access to a broad set of tools, which they can deploy as needed based on their circumstances.
We’ll come back to the biology, but for a moment, let’s talk about software design. Back in 2011, Rich Hickey gave a talk at the (sadly defunct) Strange Loop conference with the title Simple Made Easy (transcript, video). In this talk, Hickey drew a distinction between the concepts of simple and easy. Simple is the opposite of complex, whereas easy is something that’s familiar to us: the term he used to describe the concept of easy that I really liked was at hand. Hickey argues that when we do things that are easy, we can initially move quickly, because we are doing things that we know how to do. However, because easy doesn’t necessarily imply simple, we can end up with unnecessarily complex solutions, which will slow us down in the long run. Hickey instead advocates for building simple systems. According to Hickey, simple and easy aren’t inherently in conflict, but are instead orthogonal. Simple is an absolute concept, and easy is relative to what the software designer already knows.
I enjoy all of Rich Hickey’s talks, and this one is no exception. He’s a fantastic speaker, and I encourage you to listen to it (there are some fun digs at agile and TDD in this one). And I agree with the theme of his talk. But I also think that, no matter how many people listen to this talk and agree with it, easy will always win out over simple. One reason is the ever-present monster that we call production pressure: we’re always under pressure to deliver our work within a certain timeframe, and easier solutions are, by definition, going to be ones that are faster to implement. That means the incentives on software developers tilt the scales heavily towards the easy side. Even more generally, though, easy is just too effective a strategy for solving problems. The late MIT mathematics professor Gian-Carlo Rota noted that every mathematician has only a few tricks, and that includes famous mathematicians like Paul Erdős and David Hilbert.
Let’s look at two specific examples of the application of easy from the software world, specifically, database systems. The first example is about knowledge that is at hand. Richard Hipp implemented SQLite v1 as a compiler that translated SQL into byte code, because he had previous experience building compilers but not database engines. The second example is about an exaptation, leveraging an implementation that was at hand. Postgres’s support for multi-version concurrency control (MVCC) relies upon an implementation that was originally designed for other features, such as time-travel queries. (Multi-version support was there from the beginning, but MVCC was only added in version 6.5).
Now, the fact that we rely frequently on easy solutions doesn’t necessarily mean that they are good solutions. After all, the Postgres source I originally linked to has the title The Part of PostgreSQL We Hate the Most. Hickey is right that easy solutions may be fast now, but they will ultimately slow us down, as the complexity accretes in our system over time. Heck, one of the first journal papers that I published was a survey paper on this very topic of software getting more difficult to maintain over time. Any software developer that has worked at a company other than a startup has felt the pain of working with a codebase that is weighed down by what Hickey refers to in his talk as incidental complexity. It’s one of the reasons why startups can move faster than more mature organizations.
But, while companies are slowed down by this complexity, it doesn’t stop them entirely. What Hickey refers to in his talk as complected systems, the resilience engineering researcher David Woods refers to as tangled. In the resilience engineering view, Woods’s tangled, layered networks inevitably arise in complex systems.
Hickey points out that humans can only keep a small number of entities in their head at once, which puts a hard limit on our ability to reason about our systems. But the genuinely surprising thing about complex systems, including the ones that humans build, is that individuals don’t have to understand the system for them to work! It turns out that it’s enough for individuals to only understand parts of the system. Even without anyone having a complete understanding of the whole system, we humans can keep the system up and running, and even extend its functionality over time.
Now, there are scenarios when we do need to bring to bear an understanding of the system that is greater than any one person possesses. My own favorite example is when there’s an incident that involves an interaction between components, where no one person understands all of the components involved. But here’s another thing that human beings can do: we can work together to perform cognitive tasks that none of us could do on their own, and one such task is remediating an incident. This is an example of the power of diversity, as different people have different partial understandings of the system, and we need to bring those together.
To circle back to biology: evolution is terrible at designing simple systems: I think biological systems are the most complex systems that we humans have encountered. And yet, they work astonishingly well. Now, I don’t think that we should design software the way that evolution designs organisms. Like Hickey, I’m a fan of striving for simplicity in design. But I believe that complex systems, whether you call them complected or tangled, are inevitable: they’re just baked into the fabric of the adaptive universe. I also believe that easy is such a powerful heuristic that it is also baked into how we build and evolve systems. That being said, we should be inspired, by both biology and Hickey, to have useful tools at hand. We’re going to need them.
There are software technologies that work really well in-the-small, but they don’t scale up well. The challenge here is that the problem size grows incrementally, and migrating off of them requires significant effort, so locally it makes sense to keep using them, but then you reach a point where you’re well into the size where they are a liability rather than an asset. Here are some examples.
Shell scripts
Shell scripts are fantastic in the small: throughout my career, I’ve written hundreds and hundreds of bash scripts that are twenty lines or less, typically closer to ten, frequently fewer than five. But as soon as I need to write an if statement, that’s a sign to me that I should probably write it in something like Python instead. Fortunately, I’ve rarely encountered large shell scripts in the wild these days, with DevStack being a notable exception.
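As an example of the threshold I’m describing, here’s the kind of check that starts life as a bash one-liner but that I’d rather write in Python the moment it needs branching; the path and threshold here are made up.

import shutil
import sys

# Check how full a log partition is and fail loudly if it's nearly full.
usage = shutil.disk_usage("/var/log")
percent_used = usage.used / usage.total * 100

if percent_used > 90:
    print(f"disk almost full: {percent_used:.0f}% used", file=sys.stderr)
    sys.exit(1)

print(f"disk ok: {percent_used:.0f}% used")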
Makefiles
I love using makefiles as simple task runners. In fact, I regularly use just, which is like an even simpler version of make, and has similar syntax. And I’ve seen makefiles used to good effect for building simple Go programs.
But there’s a reason technologies like Maven, Gradle, and Bazel emerged, and it’s because large-scale makefiles are an absolute nightmare. Someone even wrote a paper called Recursive Make Considered Harmful.
YAML
I’m not a YAML hater, I actually like it for configuration files that are reasonably sized, where “reasonably sized” means something like “30 lines or fewer”. I appreciate support for things like comments and not having to quote strings.
However, given how much of software operations runs on YAML these days, I’ve been burned too many times by having to edit very large YAML files. What’s human-readable in the small isn’t human-readable in the large.
Spreadsheets
The business world runs on spreadsheets: they are the biggest end-user programming success story in human history. Unfortunately, spreadsheets sometimes evolve into being de facto databases, which is terrifying. The leap required to move from using a spreadsheet as your system of record to a database is huge, which explains why this happens so often.
Amazon’s recent announcement of their spec-driven AI tool, Kiro, inspired me to write a blog post on a completely unrelated topic: formal specifications. In particular, I wanted to write about how a formal specification is different from a traditional program. It took a while for this idea to really click in my own head, and I wanted to motivate some intuition here.
In particular, there have been a number of formal specification tools that have been developed in recent years which use programming-language-like notation, such as FizzBee, P, PlusCal, and Quint. I think these notations are more approachable for programmers than the more set-theoretic notation of TLA+. But I think the existence of programming-language-like formal specification languages makes it even more important to drive home the difference between a program and a formal spec.
The summary of this post is: a program is a list of instructions, a formal specification is a set of behaviors. But that’s not very informative on its own. Let’s get into it.
What kind of software do we want to specify
Generally speaking, we can divide the world of software into two types of programs. There is one type where you give the program a single input, it produces a single output, and then it stops. The other type runs for an extended period of time and interacts with the world, receiving inputs and generating outputs over time. In a paper published in the mid-1980s, the computer scientists David Harel (developer of statecharts) and Amir Pnueli (the first person to apply temporal logic to software specifications) made a distinction between programs they called transformational (the first kind) and programs they called reactive (the second kind).
A compiler is an example of a transformational program, and you can think of many command-line tools as falling into this category. An example of the second type is the flight control software in an airplane, which runs continuously, taking in inputs and generating outputs over time. In my world, services are a great example of reactive systems: they’re long-running programs that receive requests as inputs and generate responses as outputs. The specifications that I’m talking about here apply to the more general reactive case.
A motivating example: a counter
Let’s consider the humble counter as an example of a system whose behavior we want to specify. I’ll describe what operations I want my counter to support using Python syntax:
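(A sketch of those signatures; the method names match the inc, get, and reset calls that show up later in the post.)

class Counter:
    def inc(self) -> None:
        """Increment the counter by one."""
        ...

    def get(self) -> int:
        """Return the current value of the counter."""
        ...

    def reset(self) -> None:
        """Reset the counter back to zero."""
        ...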
My example will be sequential to keep things simple, but all of the concepts apply to specifying concurrent and distributed systems as well. Note that implementing a distributed counter is a common system design interview problem.
Behaviors
Above I just showed the method signatures, but I implemented this counter and interacted with it in the Python REPL; here’s what that looked like:
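(The exact values here are a sketch, consistent with a counter that starts at zero.)

>>> c = Counter()
>>> c.get()
0
>>> c.inc()
>>> c.get()
1
>>> c.inc()
>>> c.get()
2
>>> c.reset()
>>> c.get()
0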
People sometimes refer to the sort of thing above by various names: a session, an execution, an execution history, an execution trace. The formal methods people refer to this sort of thing as a behavior, and that’s the term that we’ll use in the rest of this post. Specifications are all about behaviors.
Sometimes I’m going to draw behaviors in this post. I’m going to denote a behavior as a squiggle.
To tie this back to the discussion about reactive systems, you can think of method invocation as inputs, and return values as outputs. The above example is a correct behavior for our counter. But a behavior doesn’t have to be correct: a behavior is just an arbitrary sequence of inputs and outputs. Here’s an example of an incorrect behavior for our counter.
>>> c = Counter()
>>> c.inc()
>>> c.get()
4
We expected the get method to return 1, but instead it returned 4. If we saw that behavior, we’d say “there’s a bug somewhere!”
Specifications and behaviors
What we want out of a formal specification is a device that can answer the question: “here’s a behavior: is it correct or not?”. That’s what a formal spec is for a reactive system. A formal specification is an entity such that, given a behavior, we can determine whether the behavior satisfies the spec. Correct = satisfies the specification.
Once again, a spec is a thing that will tell us whether or not a given behavior is correct.
A spec as a set of behaviors
I depicted a spec in the diagram above as, literally, a black box. Let’s open that box. We can think of a specification simply as a set that contains all of the correct behaviors. Now, the “correct?” processor above is just a set membership check: all it does is check whether the behavior is an element of the set spec.
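In Python terms, and pretending for a moment that we could hold the whole set in memory, the check is a one-liner (a sketch, of course; spec here stands in for that hypothetical set of behaviors):

def correct(behavior, spec):
    # A specification is a set of behaviors; checking correctness is set membership.
    return behavior in spec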
What could be simpler?
Note that this isn’t a simplification: this is what a formal specification is in a system like TLA+. It’s just a set of behaviors: nothing more, nothing less.
Describing a set of behaviors
You’re undoubtedly familiar with sets. For example, here’s a set of the first three positive natural numbers: {1, 2, 3}. Here, we described the set by explicitly enumerating each of the elements.
While the idea of a spec being a set of behaviors is simple, actually describing that set is trickier. That’s because we can’t explicitly enumerate the elements of the set like we did above. For one thing, each behavior is, in general, of infinite length. Taking the example of our counter, one valid behavior is to just keep calling any operation over and over again, ad infinitum.
This is a correct behavior for our counter, but we can’t write it out explicitly, because it goes on forever.
The other problem is that the specs that we care about typically contain an infinite number of behaviors. If we take the case of a counter, for any finite correct behavior, we can always generate a new correct behavior by adding another inc, get, or reset call.
So, even if we restricted ourselves to behaviors of finite length, if we don’t restrict the total length of a behavior (i.e., if our behaviors are finite but unbounded, like natural numbers), then we cannot define a spec by explicitly enumerating all of the behaviors in the specification.
And this is where formal specification languages come in: they allow us to define infinite sets of behaviors without having to explicitly enumerate every correct behavior.
Describing infinite sets by generating them
Mathematicians deal with infinite sets all of the time. For example, we can use set-builder notation to describe the infinitely large set of all even natural numbers without explicitly enumerating each one: { n ∈ ℕ : n mod 2 = 0 }.
The example above references another infinite set, the set of natural numbers (ℕ). How do we generate that infinite set without reference to another one?
One way is to define the set by describing how to generate the set of natural numbers. To do this, we specify:
an initial natural number (either 0 or 1, depending on who you ask)
a successor function for how to generate a new natural number from an existing one
This allows us to describe the set of natural numbers without having to enumerate each one explicitly. Instead, we describe how to generate them. If you remember your proofs by induction from back in math class, this is like defining a set by induction.
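As a sketch in Python, the same inductive idea looks like a generator: an initial value plus a successor rule.

def naturals():
    # Generate the natural numbers: an initial number plus a successor function.
    n = 0  # or 1, depending on who you ask
    while True:
        yield n
        n += 1  # the successor function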
Specifications as generating a set of behaviors
A formal specification language is just a notation for describing a set of behaviors by generating them. In TLA+, this is extremely explicit. All TLA+ specifications have two parts:
Init – which describes all valid initial states
Next – which describes how to extend an existing valid behavior to one or more new valid behavior(s)
Here’s a visual representation of generating correct behaviors for the counter.
Generating all correct behaviors for our counter
Note how in the case of the counter, there’s only one valid initial state in a behavior: all of the correct behaviors start the same way. After that, when generating a new behavior based on a previous one, whether one behavior or multiple behaviors can be generated depends on the history. If the last event was a method invocation, then there’s only one valid way to extend that behavior, which is the expected response of the request. If the last event was a return of a method, then you can extend the behavior in three different ways, based on the three different methods you can call on the counter.
The (Init, Next) pair describes all of the possible correct behaviors of the counter by generating them.
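To make that concrete, here’s a small Python sketch of the same generating idea. This is my own illustration, not TLA+: a behavior is modeled as a tuple of call and return events, and the Init and Next functions play the roles described above.

def counter_value(behavior):
    # Replay the calls in a behavior to compute the counter's current value.
    value = 0
    for event in behavior:
        if event == ("call", "inc"):
            value += 1
        elif event == ("call", "reset"):
            value = 0
    return value

def Init():
    # All valid initial behaviors: just the empty behavior of a fresh counter.
    return [()]

def Next(behavior):
    # All valid ways to extend a correct behavior by one more event.
    if behavior and behavior[-1][0] == "call":
        # A pending call has exactly one valid continuation: its return event.
        method = behavior[-1][1]
        result = counter_value(behavior) if method == "get" else None
        return [behavior + (("return", result),)]
    # Otherwise, any of the three methods may be invoked next.
    return [behavior + (("call", m),) for m in ("inc", "get", "reset")]

def behaviors_up_to(depth):
    # Enumerate every correct behavior with at most `depth` events.
    frontier = Init()
    yield from frontier
    for _ in range(depth):
        frontier = [longer for b in frontier for longer in Next(b)]
        yield from frontier

Calling list(behaviors_up_to(2)) yields the empty behavior, the three one-event behaviors that begin with a method call, and the three two-event behaviors that complete those calls, matching the branching described above.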
Nondeterminism
One area where formal methods can get confusing for newcomers is that the notation for writing the behavior generator can look like a programming language, particularly when it comes to nondeterminism.
When you’re writing a formal specification, you want to express “here are all of the different ways that you can validly extend this behavior”, hence you get that branching behavior in the diagram in the previous section: you’re generating all of the possible correct behaviors. In a formal specification, when we talk about “nondeterminism”, we mean “there are multiple ways a correct behavior can be extended”, and that includes all of the different potential inputs that we might receive from outside. In formal specifications, nondeterminism is about extending a correct behavior along multiple paths.
On the other hand, in a computer program, when we talk about code being nondeterministic, we mean “we don’t know which path the code is going to take”. In the programming world, we typically use nondeterminism to refer to things like random number generation or race conditions. One notable area where they’re different is that formal specifications treat inputs as a source of nondeterminism, whereas programmers don’t include inputs when they talk about nondeterminism. If you said “user input is one of the sources of nondeterminism”, a formal modeler would nod their head, and a programmer would look at you strangely.
Properties of a spec: sets of behaviors
I’ve been using the expressions correct behavior and behavior satisfies the specification interchangeably. However, in practice, we build formal specifications to help us reason about the correctness of the system we’re trying to build. Just because we’ve written a formal specification doesn’t mean that the specification is actually correct! That means that we can’t treat the formal specification that we build as the correct description of the system in general.
The most frequent tactic people use to reason about their formal specifications is to define correctness properties and use a model-checking tool to check whether their specification conforms to the property or not.
Here’s an example of a property for our counter: the get operation always returns a non-negative value. Let’s give it a name: the no-negative-gets property. If our specification has this property, we don’t know for certain it’s correct. But if it doesn’t have this property, we know for sure something is wrong!
Like a formal specification, a property is nothing more than a set of behaviors! Here’s an example of a behavior that satisfies the no-negative-gets property:
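(The specific return values below are illustrative.)

>>> c = Counter()
>>> c.get()
0
>>> c.inc()
>>> c.get()
100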
Note that the second get probably looks wrong to you. We haven’t actually written out a specification for our counter in this post, but if we did, the behavior above would certainly violate it: that’s not how counters work. On the other hand, it still satisfies the no-negative-gets property. In practice, the set of behaviors defined by a property will include behaviors that aren’t in the specification, as depicted below.
A spec that satisfies a property.
When we check that a spec satisfies a property, we’re checking that Spec is a subset of Property. We just don’t care about the behaviors that are in the Property set but not in the Spec set. What we care about are behaviors that are in Spec but not in Property: those tell us that our specification can generate behaviors that do not possess the property that we care about.
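Continuing the earlier counter sketch, a property is just another set of behaviors, which we can represent as a predicate; checking the spec against the property amounts to asking whether every behavior the spec generates lands inside the property’s set.

def no_negative_gets(behavior):
    # True if every value returned by a get in this behavior is non-negative.
    return all(value >= 0
               for kind, value in behavior
               if kind == "return" and value is not None)

# The odd behavior from earlier: wrong for a counter, but inside the property's set.
weird = (("call", "get"), ("return", 0),
         ("call", "inc"), ("return", None),
         ("call", "get"), ("return", 100))
assert no_negative_gets(weird)

# A model checker in miniature: look for generated spec behaviors that fall
# outside the property set (using behaviors_up_to from the earlier sketch).
assert all(no_negative_gets(b) for b in behaviors_up_to(6))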
A spec that does not satisfy a property
Consider the property: get always returns a positive number. We can call it all-positive-gets. Note that zero is not considered a positive number. Assuming our counter specification starts at zero, here’s a behavior that violates the all-positive-gets property:
>>> c = Counter()
>>> c.get()
0
Thinking in sets
When writing formal specifications, I found that thinking in terms of sets of behaviors was a subtle but significant mind-shift from thinking in terms of writing traditional programs. Where it has helped me most is in making sense of the errors I get when debugging my TLA+ specifications using the TLC model checker. After all, it’s when things break that you really need to understand what’s going on under the hood. And I promise you, when you write formal specs, things are going to break. That’s why we write them: to find where the breaks are.