I was recently a guest on the This is Fine! podcast, hosted by Colette Alexander and Clint Byrum. Here’s a video clip from the episode.
Caveat promptor
In the wake of a major incident, you’ll occasionally hear a leader admonish the engineering organization that we need to be more careful in order to prevent such incidents from happening in the future. Ultimately, these sorts of admonishments don’t help improve reliability, because they miss an essential truth about the nature of work in organizations.
One of the big ideas from resilience engineering is the efficiency-thoroughness trade-off, also known as the ETTO Principle. The ETTO principle was first articulated by Erik Hollnagel, one of the founders of the field. The idea is that there’s a fundamental trade-off between how quickly we can complete tasks, and how thorough we can be when working on each individual task. Let’s consider the work of doing software development using AI agents through the lens of the ETTO principle.
Coding agents like Claude Code and OpenAI are capable of automatically generating significant amounts of code. Honestly, it’s astonishing what these tools are capable of today. But like all LLMs, while they will always generate plausible-looking output, they do not always generate correct output. This means that a human needs to check an AI agent’s work to ensure that it’s generating code that’s up to snuff: a human has to review the code generated by the agent.

As any human software engineer will tell you, reviewing code is hard. It takes effort to understand code that you didn’t write. And larger changes are harder to review, which means that the more work that the agent does, the more work the human in the loop has to do to verify it.
If the code compiles and runs and all tests pass, how much time should the human spend on reviewing it? The ETTO principle tells us there’s a trade-off here: the incentives push software engineers towards completing our development tasks more quickly, which is why we’re all adopting AI in the first place. After all, if it ends up taking just as long to review the AI-generated code as it would have for the human reviewer to write it from scratch, then that defeats the purpose of automating the development task to begin with.
Maybe at first we’re skeptical and we spend more time reviewing the agent code. But, as we get better at working with the agents, and as the AI models themselves get better over time, we’ll figure out where the trouble spots of AI-generated code tend to pop up, and we’ll focus our code review effort accordingly. In essence, we’re riding the ETTO trade-off curve by figuring out how much review effort we should be putting in and where that effort should go.
Eventually, though, a problem with AI-generated code will slip through this human review process and will contribute to an incident. In the wake of this incident, the software engineers will be reminded that AI agents can make mistakes, and that they need to carefully review the generated code. But, as always, such reminders will do nothing to improve reliability. Because, while AI agents change the way that software developers work, they don’t eliminate the efficiency-thoroughness trade-off.
The illegible nature of software development talent
Here’s another blog post on gathering some common threads from reading recent posts. Today’s topic is about the unassuming nature of talented software engineers.
The first thread was a tweet by Mitchell Hashimoto about how his best former colleagues are ones where you would have no signal about their skills based on their online activities or their working hours.
The second thread was a blog post written a week later by Nikunj Kothari titled The Quiet Ones: Working within the seams. In this post, Kothari wasn’t writing about a specific engineer per se, but rather a type of engineer, one whose contributions aren’t captured by the organization’s performance rubric (emphasis mine):
They don’t hit your L5 requirements because they’re doing L3 and L7 work simultaneously. Fixing the deploy pipeline while mentoring juniors. Answering customer emails while rebuilding core systems. They can’t be ranked because they do what nobody thought to measure.
The third thread was a LinkedIn post written yesterday by Gergely Orosz (emphasis mine).
One of the best staff-level engineers I worked with is on the market.
…
What you need to know about this person: every team he’s ever worked on, he did standout work, in every situation. He got stuff done with high quality, helped others, is not argumentative but is firm in holding up common sense and practicality, and is very curious and humble to top all of this off.
…
And still, from the outside, this engineer is near completely invisible. He has no social media footprint. His LinkedIn lists his companies he worked at, and nothing else: no technologies, no projects, nothing. His GitHub is empty for the last 5 years, and has perhaps a dozen commits throughout the last 10.
The reason that Mitchell Hashimoto, Nikunj Kothari, and Gergely Orosz were able to identify these talented colleagues is that they worked directly with them. People making hiring decisions don’t have that luxury. For promotions, there are organizational constraints that push organizations to define a formal process with explicit criteria.
For both hiring and promotion, decision-makers have a legibility problem. This problem will inevitably lead to a focus on details that are easier to observe directly precisely because they are easier to observe directly. This is how fields like graphology and phrenology come about. But just because we can directly observe someone’s handwriting or the shapes of the bumps on their head doesn’t mean that those are effective techniques for learning something about that person’s personality.
I think it’s unlikely the industry will get much better at identifying and evaluating candidates anytime soon. And so I’m sure we’ll continue to see posts about the importance of your LinkedIn profile, or your GitHub, or your passion project. But you neglect at your peril the engineers who are working nine-to-five days at boring companies.
Two thought experiments
Here’s a thought experiment that John Allspaw related to me, in paraphrased form (John tells me that he will eventually capture this in a blog post of his own, at which time I’ll put a proper link).
Consider a small-ish tech company that has four engineering teams (A, B, C, D), where an engineer from Team A was involved in an incident (in John’s telling, the incident involves the Norway problem). In the wake of this incident, a post-incident write-up is completed, and the write-up does a good job of describing what happened. Next, imagine that the write-up is made available to teams A, B, and C, but not to team D. Nobody on team D is allowed to read the write-up, and nobody from the other teams is permitted to speak to team D about the details of the incident. The question is: are the members of team D at a disadvantage compared to the other teams?
The point of this scenario is to convey the intuition that, even though team D wasn’t involved in the incident, its members can still learn something from its details that makes them better engineers.
Switching gears for a moment, let’s talk about the new tools that are emerging under the label AI SRE. We’re now starting to see more tools that leverage LLMs to try to automate incident diagnosis and remediation, such as incident.io’s AI SRE product, Datadog’s Bits AI SRE, Resolve.ai (tagline: Your always-on AI SRE), and Cleric (tagline: AI SRE teammate). These tools work by reading in signals from your organization such as alerts, metrics, Slack messages, and source code repositories.
To effectively diagnose what’s happening in your system, you don’t just want to know what’s happening right now, but you also want to have access to historical data, since maybe there was a similar problem that happened, say, a year ago. While LLMs will have been trained with a lot of general knowledge about software systems, they won’t have been trained on the specific details of your system, and your system will fail in system-specific ways, which means that (I assume!) these AI SRE systems will work better if they have access to historical data about your system.
Here’s a second thought experiment, this one my own: Imagine that you’ve adopted one of these AI SRE tools, but the only historical data of the system that you can feed your tool is the collection of your company’s post-incident write-ups. What kinds of details would be useful to an AI SRE tool in helping to troubleshoot future incidents? Perhaps we should encourage people to write their incident reports as if they will be consumed by an AI SRE tool that will use them to learn as much as possible about the work involved in diagnosing and remediating incidents in your company. I bet the humans who read them would learn more that way too.
A statistic is as a statistic does
(With apologies to the screenwriters of Forrest Gump)
I’m going to use this post to pull together some related threads from different sources I’ve been reading lately.
Rationalization as discarding information
The first thread is from The Control Revolution by the late American historian and sociologist James Beniger, which was published back in the 1980s. I discovered this book because it was referenced in Neil Postman’s Technopoly.
Beniger references Max Weber’s concept of rationalization, which I had never heard of before. I’m used to the term “rationalization” as a pejorative term meaning something like “convincing yourself that your emotionally preferred option is the most rational option”, but that’s not how Weber meant it. Here’s Beniger, emphasis mine (from p15):
Although [rationalization] has a variety of meanings … most definitions are subsumed by one essential idea: control can be increased not only by increasing the capacity to process information but also by decreasing the amount of information to be processed.
…
In short, rationalization might be defined as the destruction or ignoring of information in order to facilitate its processing.
This idea of rationalization feels very close to James Scott’s idea of legibility, where organizations depend on simplified models of the system in order to manage it.
Decision making: humans versus statistical models
The second thread is from Benjamin Recht, a professor of computer science at UC Berkeley who does research in machine learning. Recht wrote a blog post recently called The Actuary’s Final Word about the performance of algorithms versus human experts on performing tasks such as medical diagnosis. The late American psychology professor Paul Meehl argued back in the 1950s that the research literature showed that statistical models outperformed human doctors when it came to diagnosing medical conditions. Meehl’s work even inspired the psychologist Daniel Kahneman, who famously studied heuristics and biases.
In his post, Recht asks, “what gives?” If we have known since the 1950s that statistical models do better than human experts, why do we still rely on human experts? Recht’s answer is that Meehl is cheating: he’s framing diagnostic problems as statistical ones.
Meehl’s argument is a trick. He builds a rigorous theory scaffolding to define a decision problem, but this deceptively makes the problem one where the actuarial tables will always be better. He first insists the decision problem be explicitly machine-legible. It must have a small number of precisely defined actions or outcomes. The actuarial method must be able to process the same data as the clinician. This narrows down the set of problems to those that are computable. We box people into working in the world of machines.
…
This trick fixes the game: if all that matters is statistical outcomes, then you’d better make decisions using statistical methods.
Once you frame a problem as being statistical in nature, then a statistical solution will be the optimal one, by definition. But, Recht argues, it’s not obvious that we should be using the average of the machine-legible outcomes in order to do our evaluation. As Recht puts it:
How we evaluate decisions determines which methods are best. That we should be trying to maximize the mean value of some clunky, quantized, performance indicator is not normatively determined. We don’t have to evaluate individual decisions by crude artificial averages. But if we do, the actuary will indeed, as Meehl dourly insists, have the final word.
Statistical averages and safe self-driving cars
I had Recht’s post in mind when reading Philip Koopman’s new book Embodied AI Safety. Koopman is Professor Emeritus of Electrical Engineering at Carnegie Mellon University; he’s a safety researcher who specializes in automotive safety. (I first learned about him from his work on the Toyota unintended acceleration cases from about ten years ago.)
I’ve just started his book, but these lines from the preface jumped out at me (emphasis mine):
In this book, I consider what happens once you … come to realize there is a lot more to safety than low enough statistical rates of harm.
…
[W]e have seen numerous incidents and even some loss events take place that illustrate “safer than human” as a statistical average does not provide everything that stakeholders will expect from an acceptably safe system. From blocking firetrucks, to a robotaxi tragically “forgetting” that it had just run over a pedestrian, to rashes of problems at emergency response scenes, real-world incidents have illustrated that a claim of significantly fewer crashes than human drivers does not put the safety question to rest.
More numbers than you can count
I’m also reading The Annotated Turing by Charles Petzold. I had tried to read Alan Turing’s original paper where he introduced the Turing machine, but found it difficult to understand, and Petzold provides a guided tour through the paper, which is exactly what I was looking for.
I’m currently in Chapter 2, where Petzold discusses the German mathematician Georg Cantor’s famous result that the real numbers are not countable, that the size of the set of real numbers is larger than the size of the set of natural numbers. (In particular, it’s the transcendental numbers like π and e that aren’t countable: we can actually count what are called the algebraic real numbers, like √2).
To tie this back to the original thread: rationalization feels to me like the process of focusing on only the algebraic numbers (which include the integers and rational numbers), even though most of the real numbers are transcendental.
Ignoring the messy stuff is tempting because it makes analyzing what’s left much easier. But we can’t forget that our end goal isn’t to simplify analysis, it’s to achieve insight. And that’s exactly why you don’t want to throw away the messy stuff.
Fixation: the ever-present risk during incident handling
Recent U.S. headlines have been dominated by school shootings. The bulk of the stories have been about the assassination of Charlie Kirk on the campus of Utah Valley University and the corresponding political fallout. On the same day, there was also a shooting at Evergreen High School in Colorado, where a student shot and injured two of his peers. This post isn’t about those school shootings, but rather, one that happened three years ago. On May 24, 2022, at Robb Elementary School in Uvalde, Texas, 19 students and 2 teachers were killed by a shooter who managed to make his way onto the campus.
Law enforcement were excoriated for how they responded to the Uvalde shooting incident: several were fired, and two were indicted on charges of child endangerment. On January 18, 2024, the Department of Justice released the report on their investigation of the shooting: Critical Incident Review: Active Shooter at Robb Elementary School. According to the report, there were multiple things that went wrong during the incident. Most significantly, the police originally believed that the shooter had barricaded himself in an empty classroom, when in fact the shooter was in a classroom with students. There were also communication issues that resulted in a common ground breakdown during the response. But what I want to talk about in this post is the keys.
The search for the keys
During the response to the Uvalde shooting, there was significant effort by the police on the scene to locate master keys to unlock rooms 111/112 (numbered p14, PDF p48, emphasis mine).
Phase III of the timeline begins at 12:22 p.m., immediately following four shots fired inside classrooms 111 and 112, and continues through the entry and ensuing gunfight at 12:49 p.m. During this time frame, officers on the north side of the hallway approach the classroom doors and stop short, presuming the doors are locked and that master keys are necessary.
The search for keys started before this, because room 109 was locked, and had children in it, and the police wanted to evacuate those children (numbered p13, PDF p48):
By approximately 12:09 p.m., all classrooms in the hallways have been evacuated and/or cleared except rooms 111/112, where the subject is, and room 109. Room 109 is found to be locked and believed to have children inside.
If you look at the Minute-by-Minute timeline section of the report (numbered p17, PDF p50) you’ll see the text “Events: Search for Keys” appear starting at 12:12 PM, all of the way until 12:45 PM.
The irony here is that the door to room 111/112 may have never been locked to begin with, as suggested by the following quote (numbered p15, PDF p48), emphasis mine:
At around 12:48 p.m., the entry team enters the room. Though the entry team puts the key in the door, turns the key, and opens it, pulling the door toward them, the [Critical Incident Review] Team concludes that the door is likely already unlocked, as the shooter gained entry through the door and it is unlikely that he locked it thereafter.
Ultimately, the report explicitly calls out how the search for the keys led to delays in response (numbered p xxviii, PDF p30):
Law enforcement arriving on scene searched for keys to open interior doors for more than 40 minutes. This was partly the cause of the significant delay in entering to eliminate the threat and stop the killing and dying inside classrooms 111 and 112. (Observation 10)
Fixation
In hindsight, we can see that the responders got something very important wrong in the moment: they were searching for keys for a door that probably wasn’t even locked. In this specific case, there appears to have been some communication-related confusion about the status of the door, as shown by the following (numbered p53, PDF p86):
The BORTAC [U.S. Border Patrol Tactical Unit] commander is on the phone, while simultaneously asking officers in the hallway about the status of the door to classrooms 111/112. UPD Sgt. 2 responds that they do not know if the door is locked. The BORTAC commander seems to hear that the door is locked, as they say on the phone, “They’re saying the door is locked.” UPD Sgt. 2 repeats that they do not know the status of the door.
More generally, this sort of problem is always going to happen during incidents: we are forever going to come to conclusions during an incident about what’s happening that turn out to be wrong in hindsight. We simply can’t avoid that, no matter how hard we try.
The problem I want to focus on here is not the unavoidable getting it wrong in the moment, but the actually-preventable problem of fixation. We “fixate” when we focus solely on one specific aspect of the situation. The problem here is not searching for keys, but searching for keys to the exclusion of other activities.
During complex incidents, the underlying problem is frequently not well understood, and so the success of a proposed mitigation strategy is almost never guaranteed. Maybe a rollback will fix things, but maybe it won’t! The way to overcome this problem is to pursue multiple strategies in parallel. One person or group focuses on rolling back a deployment that aligns in time, another looks for other types of changes that occurred around the same time, yet another investigates the logs, another looks into scaling up the amount of memory, someone else investigates traffic pattern changes, and so on. By pursuing multiple diagnostic and mitigation strategies in parallel, we reduce the risk of delaying the mitigation of the incident by blocking on the investigation of one avenue that may turn out to not be fruitful.
Doing this well requires diversity of perspectives and effective coordination. You’re more likely to come up with a broader set of options to pursue if your responders have a broader range of experiences. And the more avenues that you pursue, the more the coordination overhead increases, as you now need to keep the responders up to date about what’s going on in the different threads without overwhelming them with details.
Fixation is a pernicious risk because we’re more likely to fixate when we’re under stress. Since incidents are stressful by nature, they are effectively incubators of fixation. In the heat of the moment, it’s hard to take a breath, step back for a moment, understand what’s been tried already, and calmly ask about what the different possible options are. But the alternative is to tumble down the rabbit hole, searching for keys to a door that is already unlocked.
The hidden trade-offs of fine-grained progressive rollouts
A progressive rollout refers to the act of rolling out some new functionality gradually rather than all at once. This means that, when you initially deploy it, the change only impacts a fraction of your users. The idea behind a progressive rollout is to reduce the risk of a deployment by reducing the blast radius: if something goes wrong with the new thing during deployment, then the impact is much smaller than if you had deployed it all-at-once, to all of the traffic.

There are two general strategies for doing a progressive rollout. One strategy is coarse grained, where you stage your deploys across domains. For example, deploying the new functionality to one geographic region at a time. The second strategy is more fine-grained, where you define a ramp up schedule (e.g., 1% of traffic to the new thing, then 5%, then 10%, etc.).
Note that the two strategies aren’t mutually exclusive: you can stage your deploy across regions, and within each region, you can do a fine-grained ramp-up. And you can also think of it as a spectrum rather than two separate categories, since you can control the granularity. But I make the distinction here because I want to talk specifically about the fine-grained approach, where we use a ramp.
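To make the fine-grained ramp concrete, here is a minimal sketch of one common way to implement it: hash each user into a stable bucket and compare the bucket against the current ramp percentage. The schedule values, the function names, and the rollout name are all hypothetical, chosen only for illustration.

```python
import hashlib

# Hypothetical ramp schedule: percent of users on the new code path at each stage.
RAMP_SCHEDULE = [1, 5, 10, 25, 50, 100]

def bucket(user_id: str, rollout_name: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given rollout."""
    # Including the rollout name means different rollouts pick different users.
    digest = hashlib.sha256(f"{rollout_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100

def in_rollout(user_id: str, rollout_name: str, percent: int) -> bool:
    """Return True if this user should see the new functionality at this stage."""
    return bucket(user_id, rollout_name) < percent
```

Because the bucket is stable, a user who is included at 5% stays included at 10% and beyond, so each stage only adds users rather than reshuffling them.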
The ramp is clearly superior if you’re able to detect a problem during deployment, as shown in the diagram above. It’s a real win if you have automation that can automatically detect a problem based on a metric like error rate. The problem with the ramp is the scenario when you don’t detect that there’s a problem with the deployment.
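As a rough illustration of what such automation might check, here is a minimal sketch that compares the error rate of the new version against the baseline and decides whether to halt the ramp. The function name, thresholds, and defaults are assumptions for the sake of the example, not taken from any particular deployment tool.

```python
def should_halt_rollout(
    baseline_errors: int,
    baseline_requests: int,
    new_errors: int,
    new_requests: int,
    max_ratio: float = 2.0,     # hypothetical: halt if new rate exceeds 2x baseline
    min_requests: int = 200,    # hypothetical: minimum traffic before judging
    rate_floor: float = 0.001,  # hypothetical: avoids halting on noise when baseline is ~0
) -> bool:
    """Return True if the new version's error rate looks clearly worse than baseline."""
    if new_requests < min_requests:
        return False  # not enough traffic on the new version to judge yet
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    new_rate = new_errors / new_requests
    return new_rate > max(baseline_rate, rate_floor) * max_ratio
```

Note the `min_requests` guard: at a small ramp percentage, the new version sees little traffic, so a check like this may stay silent for a while even when something is wrong.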
My claim here in this post is that if you don’t detect a problem with a fine-grained progressive rollout until after the rollout has completed, then it will tend to take you longer to diagnose what the problem is:

Here’s my argument: once you know something is wrong with your system, but you don’t know what it is that has gone wrong, one of the things you’ll do is to look at dashboard graphs for a signal that identifies when the problem started, such as an increase in error rate or request latency. When you do a fine-grained progressive rollout, if something has gone wrong, then the impact will get smeared out over time, and it will be harder to identify the rollout as the relevant change by looking at a dashboard. If you’re lucky, your observability tools will let you slice on the rollout dimension. This is why I like coarse-grained rollouts, because if you have explicit deployment domains like geographical regions, then your observability tools will almost certainly let you slice the data based on those. Heck, you should have existing dashboards that already slice on it. But for fine-grained rollouts, you may not think to slice on a particular rollout dimension (especially if you’re rolling out a bunch of things at once, all of them doing fine-grained deployments), and you might not even be able to.
Whether fine-grained rollouts are a net win depends on a number of factors whose values are not obvious, including:
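The smearing effect can be illustrated with a toy model. The sketch below assumes made-up error rates for the old and new versions and compares the aggregate error rate over time for an all-at-once deploy versus a linear ramp; every number and function name here is hypothetical.

```python
BASELINE_ERROR_RATE = 0.01  # assumed error rate of the old version
NEW_ERROR_RATE = 0.03       # assumed error rate of the (buggy) new version

def blended_error_rate(fraction_on_new: float) -> float:
    """Aggregate error rate when a fraction of traffic is on the new version."""
    return (1 - fraction_on_new) * BASELINE_ERROR_RATE + fraction_on_new * NEW_ERROR_RATE

def all_at_once(minutes: int, deploy_at: int) -> list:
    """Per-minute error rate when all traffic switches at minute deploy_at."""
    return [blended_error_rate(1.0 if t >= deploy_at else 0.0) for t in range(minutes)]

def linear_ramp(minutes: int, start: int, duration: int) -> list:
    """Per-minute error rate when traffic ramps from 0% to 100% over duration minutes."""
    series = []
    for t in range(minutes):
        fraction = min(max((t - start) / duration, 0.0), 1.0)
        series.append(blended_error_rate(fraction))
    return series

def largest_jump(series: list) -> float:
    """Biggest minute-over-minute increase: a proxy for how visible the change is on a graph."""
    return max(b - a for a, b in zip(series, series[1:]))
```

In this model, `largest_jump(all_at_once(120, 10))` is the full two-percentage-point step, while `largest_jump(linear_ramp(120, 10, 100))` is a hundredth of that. Both series end at the same elevated error rate, but the ramp never produces a sharp edge to line up against a deploy marker.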
- the probability you detect a problem during the rollout vs after the rollout
- how much longer it takes to diagnose the problem if not caught during rollout
- your cost model for an incident
On the third bullet: the above diagram implicitly assumes that impact to the business is linear with respect to time. However, it might be non-linear: an hour-long incident may turn out to be more expensive than two half-hour-long incidents combined.
As someone who works in the reliability space, I’m acutely aware of the pain of incidents that take a long time to mitigate because they are difficult to diagnose. But I think that the trade-offs of fine-grained progressive rollouts are generally not recognized as such: it’s easy to imagine the benefits when the problems are caught earlier, but it’s harder to imagine the scenarios where the problem isn’t caught until later, and how much harder things get because of it.
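To see how the cost model changes the comparison, here is a small sketch contrasting a linear cost function with a superlinear one. The per-minute cost and the exponent are invented purely for illustration; a real cost model would have to come from your own business.

```python
def linear_cost(minutes: float, cost_per_minute: float = 100.0) -> float:
    """Cost grows in proportion to incident duration."""
    return cost_per_minute * minutes

def superlinear_cost(minutes: float, cost_per_minute: float = 100.0, exponent: float = 1.5) -> float:
    """Hypothetical model where each additional minute hurts more than the last."""
    return cost_per_minute * minutes ** exponent

# Under the linear model, one 60-minute incident costs the same as two 30-minute ones.
one_long_linear = linear_cost(60)
two_short_linear = 2 * linear_cost(30)

# Under the superlinear model, the single long incident is the more expensive case.
one_long_superlinear = superlinear_cost(60)
two_short_superlinear = 2 * superlinear_cost(30)
```

Under a superlinear model, anything that lengthens diagnosis time is penalized more heavily, which shifts the balance against a strategy that trades a smaller blast radius for slower diagnosis.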
Nothing fails like a history of success
The Axiom of Experience: the future will be like the past, because, in the past, the future was like the past. – Gerald M. Weinberg, An Introduction to General Systems Thinking
Last Friday, the San Francisco Bay Area Rapid Transit system (known as BART) experienced a multi-hour outage. Later that day, the BART Deputy General Manager released a memo about the outage with some technical details. The memo is brief, but I was honestly surprised to see this amount of detail in a public document that was released so quickly after an incident, especially from a public agency. What I want to focus on in this post is this line (emphasis mine):
Specifically, network engineers were performing a cutover to a new network switch at Montgomery St. Station… The team had already successfully performed eight similar cutovers earlier this year.
This reminded me of something I read in the Buildkite writeup from an incident that happened back in January of this year (emphasis mine):
Given the confidence gained by initial load testing and the migrations already performed over the past year, we wanted to allow customers to take advantage of their seasonal low periods to perform shard migrations, as a win-win. This caused us to discount the risk of performing migrations during a seasonal low period and what impacts might emerge when regular peak traffic returned.
It also reminded me about the 2022 Rogers Telecommunications outage in Canada (emphasis mine, [redacted] comments in the original):
Rogers had assessed the risk for the initial change of this seven-phased process as “High”. Subsequent changes in the series were listed as “Medium.” [redacted] was “Low” risk based on the Rogers algorithm that weighs prior success into the risk assessment value. Thus, the risk value for [redacted] was reduced to “Low” based on successful completion of prior changes.
Whenever we make any sort of operational change, we have a mental model of the risk associated with the change. We view novel changes (I’ve never done something like this before!) as riskier than changes we’ve performed successfully multiple times in the past (I’ve done this plenty of times). I don’t think this sort of thinking is a fallacy: rather, it’s a heuristic, and it’s generally a pretty effective one! But, like all heuristics, it isn’t perfect. As shown in the examples above, the application of this heuristic can result in a miscalibrated mental model of the risk associated with a change.
So, what’s the broader lesson? In practice, our risk models (implicit or otherwise) are always miscalibrated: a history of past successes is just one of multiple avenues that can lead us astray. Trying to achieve a perfect risk model is like trying to deploy software that is guaranteed to have zero bugs: it’s never going to happen. Instead, we need to accept the reality that, like our code, our models of risk will always have defects that are hidden from us until it’s too late. So we’d better get damned good at recovery.
My favorite developer productivity research method that nobody uses
You’ve undoubtedly heard of the psychological concept called flow state. This is the feeling you get when you’re in the zone, where you’re doing some sort of task, and you’re just really into it, and you’re focused, and it’s challenging but not frustratingly so. It’s a great feeling. You might experience this with a work task, or a recreational one, like when playing a sport or a video game. The pioneering researcher on the phenomenon of flow was the Hungarian-American psychologist Mihaly Csikszentmihalyi, and he wrote a popular book on the subject back in 1990 with the title Flow: The Psychology of Optimal Experience, which I read many years ago. But the one thing that stuck around most with me from Csikszentmihalyi’s book on Flow was the research method that he used to study flow.
One of the challenges of studying people’s experiences is that it’s difficult for researchers to observe them directly. This problem comes up when an organization tries to do research on the current state of developer productivity within the organization. I harp on “make work visible” a lot because so much of the work we do in the software world is so hard for others to see. There are different data collection techniques that developer productivity researchers use, including surveys, interviews, focus groups, as well as automatic collection of metrics, like the DORA metrics. Of those, only the automatic collection of metrics focuses on in-the-moment data, and it’s a very thin type of data at that. Those metrics can’t give you any insights into the challenges that your developers are facing.
My preferred technique is the case study, which I try to apply to incidents. I like the incident case study technique because it gives us an opportunity to go deep into the nature of the work for a specific episode. But incident-as-case-study only works for, well, incidents, and while a well-done incident case study can shine a light on the nature of the development work, there’s also a lot that it will miss.
Csikszentmihalyi used a very clever approach which was developed by his PhD student Suzanne Prescott, called experience sampling. He gave the participants of his study pagers, and he would page them at random times. When paged, the participants would write down information about their experiences in a journal in-the-moment. In this way, he was able to collect information about subjective experience, without the problems you get when trying to elicit an account retrospectively.
I’ve never read about anybody trying to use this approach to study developer productivity, and I think that’s a shame. It’s something I’ve wanted to try myself, except that I have not worked in the developer productivity space for a long, long time.
These days, I’d probably use Slack rather than a pager and journal to randomly reach out to the volunteers during the study and collect their responses, but the principle is the same. I’ve long wanted to capture an “are you currently banging your head against a wall” metric from developers, but with experience sampling, you could capture a “what are you currently banging your head against the wall about?”
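The random-timing part of experience sampling is simple to sketch. The code below picks random prompt times within a working-hours window for one participant on one day; the function name, the default window, and the number of prompts are assumptions for illustration, and how the prompt is actually delivered (Slack or otherwise) is left out entirely.

```python
import random
from datetime import date, datetime, time, timedelta
from typing import Optional

def sample_prompt_times(
    day: date,
    prompts: int,
    start_hour: int = 9,   # hypothetical start of the sampling window
    end_hour: int = 17,    # hypothetical end of the sampling window
    rng: Optional[random.Random] = None,
) -> list:
    """Pick distinct random minutes within the window, returned in chronological order."""
    rng = rng or random.Random()
    window_minutes = (end_hour - start_hour) * 60
    offsets = sorted(rng.sample(range(window_minutes), prompts))
    window_start = datetime.combine(day, time(hour=start_hour))
    return [window_start + timedelta(minutes=m) for m in offsets]
```

Passing a seeded `random.Random` makes a schedule reproducible for testing, while leaving `rng` unset gives each participant an unpredictable schedule, which is the point of the method.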
Would this research technique actually work for studying developer productivity issues within an organization? I honestly don’t know. But I’d love to see someone try.
Note: I originally had the incorrect publication date for the Flow book. Thanks to Daniel Miller for the correction.
The problems that accountability can’t fix
Accountability is a mechanism that achieves better outcomes by aligning incentives, in particular, negative ones. Specifically: if you do a bad thing, or fail to do a good thing, under your sphere of control, then bad things will happen to you. I recently saw several LinkedIn posts that referenced the U.S. Coast Guard report on the OceanGate experimental submarine implosion. These posts described how this incident highlights the importance of accountability in leadership. And, indeed, the report itself references accountability five times.
However, I think this incident is an example of a type of problem where accountability doesn’t actually help. Here I want to talk about two classes of problems where accountability is a poor solution. The OceanGate accident falls into the second class.
Coordination challenges
Managing a large organization is challenging. Accountability is a popular tool in such organizations to ensure that work actually gets done, by identifying someone who is designated as the stuckee for ensuring that a particular task or project gets completed. I’ll call this top-down accountability. This kind of accountability is sometimes referred to, unpleasantly, as the “one throat to choke” model.

For this model to work, the problem you’re trying to solve needs to be addressable by the individual that is being held accountable for it. Where I’ve seen this model fall down is in post-incident work. As I’ve written about previously, I’m a believer in the resilience engineering model of complex systems failures, where incidents arise due to unexpected interactions between components. These are coordination problems, where the problems don’t live in one specific component, but, rather, how the components interact with each other.
But this model of accountability demands that we identify an individual to own the relevant follow-up incident work. And so it creates an incentive to always identify a root cause service, which is owned by the root cause team, who are then held accountable for addressing the issue.
Now, just because you have a coordination problem, that doesn’t mean that you don’t need an individual to own driving the reliability improvements around it. In fact, that’s why technical project managers (known as TPMs) exist. They act as the accountable individuals for efforts that require coordination across multiple teams, and every large tech organization that I know of employs TPMs. The problem I’m highlighting here, such as in the case of incidents, is that accountability is applied as a solution without recognizing that the problem revealed by the incident is a coordination problem.
You can’t solve a coordination problem by identifying one of the agents involved in the coordination and making them accountable. You need someone who is well-positioned in the organization, recognizes the nature of the problem, and has the necessary skills to be the one who is accountable.
Miscalibrated risk models
The other way people talk about accountability is about holding leaders such as politicians and corporate executives responsible for their actions, where there are explicit consequences for them acting irresponsibly, including actions such as corruption, or taking dangerous risks with the people and resources that have been entrusted to them. I’ll call this bottom-up accountability.

This brings us back to the OceanGate accident of June 18, 2023. In this accident, the TITAN submersible imploded, killing everyone aboard. One of the crewmembers who died was Stockton Rush, who was both pilot of the vessel and CEO of OceanGate.
The report is a scathing indictment of Rush. In particular, it criticizes how he sacrificed safety for his business goals, ran an organization that lacked the expertise required to engineer experimental submersibles, promoted a toxic workplace culture that suppressed signs of trouble instead of addressing them, and centralized all authority in himself.
However, one thing we can say about Rush was that he was maximally accountable. After all, he was both CEO and pilot. He believed so much that TITAN was safe that he literally put his life on the line. As Nassim Taleb would put it, he had skin in the game. And yet, despite this accountability, he still took irresponsible risks, which led to disaster.
By being the pilot, Rush personally accepted the risks. But his actual understanding of the risk, his model of risk, was fundamentally incorrect. It was wrong, dangerously so.

Assigning accountability doesn’t help when there’s an expertise gap. Just as giving a software engineer a pager does not bestow upon them the skills that they need to effectively do on-call operations work, having the CEO of OceanGate also be the pilot of the experimental vehicle did not lead to him being able to exercise better judgment about safety.
Rush’s sins weren’t merely lack of expertise, and the report goes into plenty of detail about his other management shortcomings that contributed to this incident. But, stepping back from the specifics of the OceanGate accident, there’s a greater point here: making executives accountable isn’t sufficient to avoid major incidents if the risk models that executives use to make decisions are out of whack with the actual risks. And by risk models here, I don’t just mean some sort of formal model like the risk assessment matrix above. Everyone carries with them an implicit risk model in their heads: this is a mental risk model.
Double binds
While the CEO also being a pilot sounds like it should be a good thing for safety (skin in the game!), it also creates a problem that the resilience engineering folks refer to as a double bind. Yes, Rush had strong incentives to ensure he wasn’t taking stupid risks, because otherwise he might die. But he also had strong incentives to keep the business going, and those incentives were in direct conflict with the safety incentives. But double binds are not just an issue for CEO-pilots, because anyone in the organization will feel pressure from above to make decisions in support of the business, which may cut against safety. Accountability doesn’t solve the problem of double binds; it exacerbates them by putting someone on the hook for delivering.
Once again, from the resilience engineering literature, one way to deal with this problem is through cross-checks. For example, see the paper Collaborative Cross-Checking to Enhance Resilience by Patterson, Woods, Cook, and Render. Instead of depending on a single individual (accountability), you take advantage of the different perspectives of multiple people (diversity).
You also need someone who is not under a double bind who has the authority to say “this is unsafe”. That wasn’t possible at OceanGate, where the CEO was all-powerful, and anybody who spoke up was silenced or pushed out.
On this note, I’ll leave you with a six-minute C-SPAN video clip from 2003. In this clip, the resilience engineering researcher David Woods spoke at a U.S. Senate hearing in the wake of the Columbia accident. Here he was talking about the need for an independent safety organization at NASA as a mechanism for guarding against the risks that emerge from double binds.
(I could not get it to embed, here’s the link: https://www.c-span.org/clip/senate-committee/user-clip-david-woods-senate-hearing/4531343)
(As far as I know, the new independent safety organization that Woods proposed was not created)