Cache invalidation really is one of the hardest problems in computer science

My colleagues recently wrote a great post on the Netflix tech blog about a tough performance issue they wrestled with. They ultimately diagnosed the problem as false sharing, which is a performance problem that involves caching.

I’m going to take that post and write a simplified version of part of it here, as an exercise to help me understand what happened. After all, the best way to understand something is to try to explain it to someone else.

But note that the topic I’m writing about here is outside of my personal area of expertise, so caveat lector!

The problem: two bands of CPU performance

Here’s a graph from that post that illustrates the problem. It shows CPU utilization for different virtual machine instances (nodes) inside of a cluster. Note that all of the nodes are configured identically, including running the same application logic and taking the same traffic.

Note that there are two “bands”, a low band at around 15-20% CPU utilization, and a high band that varies a lot, from about 25%-90%.

Caching and multiple cores

Computer programs keep the data that they need in main memory. The problem with main memory is that accessing it is slow in computer time. According to this site, a CPU instruction cycle is about 400ps, and accessing main memory (DRAM access) is 50-100ns, which means it takes ~ 125 – 250 cycles. To improve performance, CPUs keep some of the memory in a faster, local cache.
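To make the arithmetic above concrete, here is a minimal C++ sketch that converts a memory latency into CPU cycles. It uses only the figures quoted above (400ps per cycle, 50-100ns per DRAM access); the names kCyclePicoseconds and cycles_for are hypothetical, chosen for illustration.

```cpp
#include <cassert>

// Cycle time quoted in the text: roughly 400 picoseconds per CPU cycle.
constexpr double kCyclePicoseconds = 400.0;

// Convert a latency in nanoseconds into an approximate number of CPU cycles.
constexpr double cycles_for(double latency_ns) {
    return latency_ns * 1000.0 / kCyclePicoseconds;  // ns -> ps, then divide
}

// The DRAM range quoted in the text: 50 ns and 100 ns.
static_assert(cycles_for(50.0) == 125.0, "50 ns is about 125 cycles");
static_assert(cycles_for(100.0) == 250.0, "100 ns is about 250 cycles");
```

In other words, a single trip to main memory costs as much as a hundred or more instruction cycles, which is why the cache matters so much.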

There’s a tradeoff between the size of the cache and its speed, and so computer architects use a hierarchical cache design where they have multiple caches of different sizes and speeds. It was an interaction pattern with the fastest on-core cache (the L1 cache) that led to the problem described here, so that’s the cache we’ll focus on in this post.

If you’re a computer engineer designing a multi-core system where each core has an on-core cache, your system has to implement a solution for the problem known as cache coherency.

Cache coherency

Imagine a multi-threaded program where each thread is running on a different core. There’s a variable, which we’ll call x.

Let’s also assume that both threads have previously read x, so the memory associated with x is loaded in the caches of both. So the caches look like this:

Now imagine thread T1 modifies x, and then T2 reads x.

T1             T2
--             --
x = x + 1

               if (x == 0) {
                 // shouldn't execute this!
               }

The problem is that T2’s local cache has become stale, and so it reads a value that is no longer valid.

The term cache coherency refers to the problem of ensuring that local caches in a multi-core (or, more generally, distributed) system stay in sync.

This problem is solved by a hardware device called a cache controller. The cache controller can detect when values in a cache have been modified on one core, and whether another core has cached the same data. In this case, the cache controller invalidates the stale cache. In the example above, the cache controller would invalidate the cache in T2. When T2 went to read the variable x, it would have to read the data from main memory into the core.

Cache coherency ensures that the behavior is correct, but every time a cache is invalidated and the same memory has to be retrieved from main memory again, it pays the performance penalty of reading from main memory.

The diagram above shows that the cache contains both the data and the addresses in main memory where the data comes from: we only need to invalidate caches that correspond to the same range of memory.
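The invalidation behavior described in this section can be sketched as a toy model in C++. This is a deliberately simplified, hypothetical simulation (two cores, one variable, write-through to memory), not a description of how real cache controllers are built; ToySystem and its members are names invented for illustration.

```cpp
#include <array>
#include <cassert>

// Toy model: two cores, each with a private one-entry cache for the variable x.
struct ToySystem {
    int memory_x = 0;                // x as stored in main memory
    std::array<int, 2> cached_x{};   // each core's cached copy of x
    std::array<bool, 2> valid{};     // whether each cached copy is usable
    int main_memory_reads = 0;       // count of slow main-memory reads

    // Read x on the given core, going to main memory only on a miss.
    int read(int core) {
        assert(core == 0 || core == 1);
        if (!valid[core]) {
            cached_x[core] = memory_x;
            valid[core] = true;
            ++main_memory_reads;
        }
        return cached_x[core];
    }

    // Write x on the given core and invalidate the other core's copy.
    void write(int core, int value) {
        assert(core == 0 || core == 1);
        cached_x[core] = value;
        valid[core] = true;
        memory_x = value;            // write-through, to keep the model simple
        valid[1 - core] = false;     // the other core's copy is now stale
    }
};
```

Replaying the scenario above: after both cores have read x, the counter is at 2. When core 0 writes x + 1, core 1's copy is invalidated, so core 1's next read returns the new value but costs a third main-memory read.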

Data gets brought into cache in chunks

Let’s say a program needs to read data from main memory. For example, let’s say it needs to read the variable named x. Let’s assume x is implemented as a 32-bit (4 byte) integer. When the CPU reads from main memory, the memory that holds the variable x will be brought into the cache.

But the CPU won’t just read the variable x into cache. It will read a contiguous chunk of memory that includes the variable x into cache. On x86 systems, the size of this chunk is 64 bytes. This means that accessing the 4 bytes that encode the variable x actually ends up bringing 64 bytes along for the ride.

These chunks of memory stored in the cache are referred to as cache lines.
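As a rough illustration of which bytes come along for the ride, the sketch below computes the start of the 64-byte cache line that contains a given address. The address 0x1004 is a made-up example, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Cache line size quoted in the text for x86 systems.
constexpr std::uint64_t kCacheLineBytes = 64;

// Round an address down to the start of the cache line that contains it.
constexpr std::uint64_t cache_line_base(std::uint64_t address) {
    return address & ~(kCacheLineBytes - 1);
}

// Two addresses are on the same cache line if they share a line base.
constexpr bool same_cache_line(std::uint64_t a, std::uint64_t b) {
    return cache_line_base(a) == cache_line_base(b);
}

// A 4-byte x at the hypothetical address 0x1004 lives in the line
// that spans 0x1000 through 0x103F.
static_assert(cache_line_base(0x1004) == 0x1000, "line starts at 0x1000");
```

So reading x at 0x1004 would pull in everything from 0x1000 to 0x103F, including whatever unrelated data happens to sit next to it.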

False sharing

We now almost have enough context to explain the failure mode. Here’s a C++ code snippet from the OpenJDK repository (from src/hotspot/share/oops/klass.hpp):

class Klass : public Metadata {

  // Cache of last observed secondary supertype
  Klass*      _secondary_super_cache;
  // Array of all secondary supertypes
  Array<Klass*>* _secondary_supers;

This declares two pointer variables inside of the Klass class: _secondary_super_cache, and _secondary_supers. Because these two variables are declared one after the other, they will get laid out next to each other in memory.

The two variables are adjacent in main memory.

The _secondary_super_cache is, itself, a cache. It’s a very small cache, one that holds a single value. It’s used in a code path for dynamically checking if a particular Java class is a subtype of another class. This code path isn’t commonly used, but it does happen for programs that dynamically create classes at runtime.

Now imagine the following scenario:

  1. There are two threads: T1 on CPU 1, T2 on CPU 2
  2. T1 wants to write the _secondary_super_cache variable and already has the memory associated with the _secondary_super_cache variable loaded in its L1 cache.
  3. T2 wants to read from the _secondary_supers variable and already has the memory associated with the _secondary_supers variable loaded in its L1 cache.

When T1 (CPU 1) writes to _secondary_super_cache, if CPU 2 has the same block of memory loaded in its cache, then the cache controller will invalidate that cache line in CPU 2.

But if that cache line contained the _secondary_supers variable, then CPU 2 will have to reload that data from main memory to do its read, which is slow.

ssc refers to _secondary_super_cache, ss refers to _secondary_supers

This phenomenon, where the cache controller invalidates cached non-stale data that a core needed to access, which just so happens to be on the same cache line as stale data, is called false sharing.
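To see why adjacent fields are exposed to false sharing, here is a small C++ sketch that compares two layouts. The structs are hypothetical stand-ins for the two pointer fields, not the real Klass class, and the alignas padding shown is just one generic way to keep fields on separate cache lines; I'm not claiming it is what the OpenJDK developers did.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLineBytes = 64;

// Stand-in for the original layout: two pointers declared back to back.
struct Adjacent {
    void* secondary_super_cache;
    void* secondary_supers;
};

// Same two fields, each forced to start on its own 64-byte boundary.
struct Separated {
    alignas(kCacheLineBytes) void* secondary_super_cache;
    alignas(kCacheLineBytes) void* secondary_supers;
};

// Distance in bytes between the starts of the two fields in each layout.
constexpr std::size_t kAdjacentGap =
    offsetof(Adjacent, secondary_supers) - offsetof(Adjacent, secondary_super_cache);
constexpr std::size_t kSeparatedGap =
    offsetof(Separated, secondary_supers) - offsetof(Separated, secondary_super_cache);

// Fields closer together than one cache line can end up on the same line.
constexpr bool may_share_line(std::size_t gap) {
    return gap < kCacheLineBytes;
}
```

With back-to-back pointers the gap is just the size of one pointer, so the fields can share a line. In the padded layout the gap is a full 64 bytes and the struct itself is 64-byte aligned, so they cannot. The cost is that the padded struct grows to 128 bytes.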

What’s the probability of false sharing in this scenario?

In this case, the two variables are both pointers. On this particular CPU architecture, pointers are 64 bits, or 8 bytes. The L1 cache line size is 64 bytes. That means a cache line can store 8 pointers. Or, put another way, a pointer can occupy one of 8 positions in the cache line.

There’s only one scenario where the two variables don’t end up on the same cache line: when _secondary_super_cache occupies position 8, and _secondary_supers occupies position 1. In all of the other scenarios, the two variables will occupy the same cache line, and hence will be vulnerable to false sharing.

1 / 8 = 12.5%, and that’s roughly the fraction of nodes that were observed in the low band in this scenario: those are the nodes where the two variables happened to land on different cache lines.
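The counting argument can be checked by brute force. The sketch below enumerates the 8 possible positions of the first pointer and counts how many leave the next pointer on the same cache line. It assumes pointers are 8-byte aligned and that each position is equally likely, which the text implies but does not state; the function names are hypothetical.

```cpp
#include <cassert>

constexpr int kCacheLineBytes = 64;
constexpr int kPointerBytes = 8;
constexpr int kSlotsPerLine = kCacheLineBytes / kPointerBytes;  // 8 positions

// Positions are numbered 1 through 8, as in the text. If the first pointer
// is in `position`, the adjacent pointer is in the next slot, which is on
// the same cache line unless the first pointer is in the last position.
constexpr bool next_pointer_shares_line(int position) {
    return position + 1 <= kSlotsPerLine;
}

// Count the positions of the first pointer that leave both on one line.
int count_shared_positions() {
    int shared = 0;
    for (int position = 1; position <= kSlotsPerLine; ++position) {
        if (next_pointer_shares_line(position)) {
            ++shared;
        }
    }
    return shared;
}
```

Seven of the eight positions put both pointers on the same line, which leaves one position out of eight (12.5%) where they are split across two lines.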

And now I recommend you take another look at the original blog post, which has a lot more details, including how they solved this problem, as well as a new problem that emerged once they fixed this one.

There is no “Three Mile Island” event coming for software

In Critical Digital Services: An Under-Studied Safety-Critical Domain, John Allspaw asks:

Critical digital services has yet to experience its “Three-Mile Island” event. Is such an accident necessary for the domain to take human performance seriously? Or can it translate what other domains have learned and make productive use of those lessons to inform how work is done and risk is anticipated for the future?

I don’t think the software world will ever experience such an event.

The effect of TMI

The Three Mile Island accident (TMI) is notable, not because of the immediate impact on human lives, but because of the profound effect it had on the field of safety science.

Before TMI, the prevailing theories of accidents held that they were caused by issues like mechanical failures (e.g., bridge collapse, boiler explosion), unsafe operator practices, and mixing up physical controls (e.g., the switch that lowers the landing gear looks similar to the switch that lowers the flaps).

But TMI was different. The operators weren’t doing the wrong things: they did the right things based on their understanding of what was happening. But that understanding, which was based on the information they were getting from their instruments, didn’t match reality. As a result, the actions that they took contributed to the incident, even though they did what they were supposed to do. (For more on this, I recommend watching Richard Cook’s excellent lecture: It all started at TMI, 1979).

TMI led to a kind of Cambrian explosion of research into human error and its role in accidents. This is the beginning of the era where you see work from researchers such as Charles Perrow, Jens Rasmussen, James Reason, Don Norman, David Woods, and Erik Hollnagel.

Why there won’t be a software TMI

TMI was significant because it was an event that could not be explained using existing theories. I don’t think any such event will happen in a software system, because I think that every complex software system failure can be “explained”, even if the resulting explanation is lousy. No matter what the software failure looks like, someone will always be able to identify a “root cause”, and propose a solution (more automation, better procedures). I don’t think a complex software failure is capable of creating TMI style cognitive dissonance in our industry: we’re, unfortunately, too good at explaining away failures without making any changes to our priors.

We’ll continue to have Therac-25s, Knight Capitals, Air France 447s, 737 Maxs, 911 outages, Rogers outages, and Tesla autopilot deaths. Some of them will cause enormous loss of human life, and will result in legislative responses. But no such accident will compel the software industry to, as Allspaw puts it, take human performance seriously.

Our only hope is that the software industry eventually learns the lessons that safety science learned from the original TMI.

Up and down the abstraction hierarchy

As operators, when the system we operate is working properly, we use a functional description of the system to reason about its behavior.

Here’s an example, taken from my work on a delivery system. If somebody asks me, “Hey, Lorin, how do I configure my deployment so that a canary runs before it deploys to production?”, then I would tell them, “In your delivery config, add a canary constraint to the list of constraints associated with your production environment, and the delivery system will launch a canary and ensure it passes before promoting new versions to production.”

This type of description is functional; it’s the sort of verbiage you’d see in a functional spec. On the other hand, if an alert fires because the environment check rate has dropped precipitously, the first question I’m going to ask is, “did something deploy a code change?” I’m not thinking about function anymore; I’m thinking at the lowest level of abstraction.

In the mid nineteen-seventies, the safety researcher Jens Rasmussen studied how technicians debugged electronic devices, and in the mid-eighties he proposed a cognitive model about how operators reason about a system when troubleshooting, in a paper titled The Role of Hierarchical Knowledge Representation in Decisionmaking and System Management. He called this model the abstraction hierarchy.

Rasmussen calls this model a means-ends hierarchy, where the “ends” are at the top (the function: what you want the system to do), and the “means” are at the bottom (how the system is physically realized). We describe the proper function of the system top-down, and when we successfully diagnose a problem with the system, we explain the problem bottom-up.

The abstraction hierarchy, explained with an example

The five levels of the abstraction hierarchy:

1. functional purpose
2. abstract functions
3. generalized functions
4. physical functions
5. physical form

To explain these, I’ll use the example of a car.

The functional purpose of the car is to get you from one place to another. But to make things simpler, let’s zoom in on the accelerator. The functional purpose of the accelerator is to make the car go faster.

The abstract functions include transferring power from the car’s power source to the wheels, as well as transferring information from the accelerator to that system about how much power should be delivered. You can think of abstract functions as being functions required to achieve the functional purpose.

The generalized functions are the generic functional building blocks you use to implement the abstract functions. In the case of the car, you need a power source, a mechanism for transforming the stored energy to mechanical energy, and a mechanism for transferring the mechanical energy to the wheels.

The physical functions capture how the generalized function is physically implemented. In an electric vehicle, your mechanism for transforming stored energy to mechanical energy is an electric motor; in a traditional car, it’s an internal combustion engine.

The physical form captures the construction details of how the physical function is realized. For example, if it’s an electric vehicle that uses an electric motor, the physical form includes details such as where the motor is located in the car, what its dimensions are, and what materials it is made out of.

Applying the abstraction hierarchy to software

Although Rasmussen had physical systems in mind when he designed the hierarchy (his focus was on process control, and he worked at a lab that focused on nuclear power plants), I think the model can map onto software systems as well.

I’ll use the deployment system that I work on, Managed Delivery, as an example.

The functional purpose is to promote software releases through deployment environments, as specified by the service owner (e.g., first deploy to the test environment, then run smoke tests, then deploy to staging, wait for manual judgment, then run a canary, etc.).

Here are some examples of abstract functions in our system.

  • There is an “environment check” control loop that evaluates whether each pending version of code is eligible for promotion to the next environment by checking its constraints.
  • There is a subsystem that listens for “new build” events and stores them in our database.
  • There is a “resource check” control loop that evaluates whether the currently deployed version matches the most recent eligible version.

For generalized functions, here are some larger scale building blocks we use:

  • a queue to consume build events generated by the CI system
  • a relational database to track the state of known versions
  • a workflow management system for executing the control loops

For the physical functions that realize the generalized functions:

  • SQS as our queue
  • MySQL Aurora as our relational database
  • Temporal as our workflow management system

For physical form, I would map these to:

  • source code representation (files and directory structure)
  • binary representation (e.g., container image, Debian package)
  • deployment representation (e.g., organization into clusters, geographical regions)

Consider: you don’t care about how your database is implemented, until you’re getting some sort of operational problem that involves the database, and then you really have to care about how it’s implemented to diagnose the problem.

Why is this useful?

If Rasmussen’s model is correct, then we should build operator interfaces that take the abstraction hierarchy into account. Rasmussen called this approach ecological interface design (EID), where the abstraction hierarchy is explicitly represented in the user interface, to enable operators to more easily navigate the hierarchy as they do their troubleshooting work.

I have yet to see an operator interface that does this well in my domain. One of the challenges is that you can’t rely solely on off-the-shelf observability tooling, because you need to have a model of the functional purpose and the abstract functions to build those models explicitly into your interface. This means that what we really need are toolkits so that organizations can build custom interfaces that can capture those top levels well. In addition, we’re generally lousy at building interfaces that traverse different levels: at best we have links from one system to another. I think the “single pane of glass” marketing suggests that people have some basic understanding of the problem (moving between different systems is jarring), but they haven’t actually figured out how to effectively move between levels in the same system.

Production pressure

The individual contributor feels production pressure from their manager.

The manager feels production pressure from their director.

The director feels production pressure from their vice president.

The vice president feels production pressure from their C-level executive.

The C-level executive feels production pressure from the CEO.

The CEO feels production pressure from the board of directors.

The board of directors feels production pressure from investment funds, who are the major shareholders.

And what about the managers of these investment funds? They feel production pressure to provide good returns so that customers will continue to invest. Many of their customers happen to hold shares of the fund in their retirement accounts. Customers such as … the individual contributor.

And the circle of production pressure is complete.

Uvalde: a reasonable officer

In the ALERRT report on the Uvalde shooting, the term reasonable appears four times (emphasis mine):

A reasonable officer would conclude in this case, based upon the totality of the circumstances, that use of deadly force was warranted.

ALERRT report, p13

The suspect was actively firing his weapon when the officers entered the building, and a reasonable officer would assume that there were injured people in the classrooms.

ALERRT report, p16

A reasonable officer would have considered this an active situation and devised a plan to address the suspect.

ALERRT report, p17

During each of these instances, the situation had gone active, and the immediate action plan should have been triggered because it was reasonable to believe that people were being killed.

ALERRT report, p18

The implication here is that the responses of the officers who responded were unreasonable, because they did not conform to what the ALERRT staff considered to be reasonable.

Labeling the responders’ actions as unreasonable enables us to explain away the failures in the law enforcement response as deficiencies with the individual responders. I suspect a law enforcement officer in another city reading the ALERRT report would conclude “this type of thing would never happen to my department, because we know what we’re doing. It’s these bozos in Uvalde that were the problem here.”

Once we identify the problem as being the individual responders, we don’t have to dig any deeper to understand what happened. There’s nothing to learn, because we’ve explained the failure away. It was due to incompetence!

The problem with this type of assessment of the responders’ behavior is that it makes it more difficult to learn from the incident, an effect that Cook and Woods call distancing through differencing. They describe a case study of a chemical processing company where a chemical fire broke out in a foreign processing plant (emphasis mine).

Interestingly, the relevant people at the plant knew all about the previous incident as soon as it had occurred through more informal communication channels. They had reviewed the incident, noted many features that were different from their plant (non-US location, slightly different model of the same machine, different safety systems to contain fires). The safety people consciously classified the incident as irrelevant to the local setting, and they did not initiate any broader review of hazards in the local plant. Overall they decided the incident “couldn’t happen here.”

But these local workers regarded the overseas fire not as evidence of a type of hazard that existed in the local workplace but rather as evidence that workers at the other plant were not as skilled, as motivated and as careful as they were, after all, they were not Americans (the other plant was in a first world country). The consequence of this view was that no broader implications of the fire overseas were extracted locally after that event.

Cook & Woods, Distancing Through Differencing

Later on, there was a chemical fire at an American facility. There were similar systemic failures in both fires, but the Americans had not learned the lessons of the systemic failures from the foreign fire. Ironically, the same pattern of distancing through differencing was observed after the second fire (emphasis mine):

Interestingly (and ominously) this distancing through differencing that occurred in response to the external, overseas fire, was repeated internally after the local fire. Workers in the same plant, working in the same area in which the fire occurred but on a different shift, attributed the fire to lower skills of the workers on the other shift. They regarded the workers to whom the accident happened as inattentive and unskilled. Not surprisingly, this meant that they saw the fire as largely irrelevant to their own work. After all, their reasoning went, the fire occurred because the workers to whom it happened were less careful than we are. Despite their beliefs, there was no evidence whatsoever that there were significant differences between workers on different shifts or in different countries (in fact, there was evidence that one of the workers involved was among the better skilled at this plant).

Cook & Woods, Distancing Through Differencing

If we want to learn as much as we can from an incident, we have to fight the urge to diagnose an incident as due to the incompetence of individuals involved. We need to assume that the incident happened even though everyone involved was acting reasonably. Only then will we be able to see the systemic problems with clarity.

Uvalde: would you have taken the shot?

Here’s an excerpt from the Uvalde shooting interim report about one of the first officers on the scene at the Uvalde shooting. Based on the timeline, I’d guess that the event described below occurred around 11:32 AM, just after the suspect fired shots outside of the school, but before he entered the school.

One of those officers testified to the Committee that, based on the sound of echoes, he believed the shooter had fired in their direction. That officer saw children dressed in bright colors in the playground, all running away. Then, at a distance exceeding 100 yards, he saw a person dressed in black, also running away. Thinking that the person dressed in black was the attacker, he raised his rifle and asked Sgt. Coronado for permission to shoot. Sgt. Coronado testified he heard the request, and he hesitated. He knew there were children present. He considered the risk of shooting a child, and he quickly recalled his training that officers are responsible for every round that goes downrange.

Interim report, p42

Should the officer have fired? Here’s the ALERRT report’s assessment (emphasis mine):

Third, a Uvalde PD officer reported that he was at the crash site and observed the suspect carrying a rifle prior to the suspect entering the west hall exterior door. The UPD officer was armed with a rifle and sighted in to shoot the attacker; however, he asked his supervisor for permission to shoot. The UPD officer did not hear a response and turned to get confirmation from his supervisor. When he turned back to address the suspect, the suspect had already entered the west hall exterior door at 11:33:00. The officer was justified in using deadly force to stop the attacker. Texas Penal Code § 9.32, DEADLY FORCE IN DEFENSE OF PERSON states, an individual is justified in using deadly force when the individual reasonably believes the deadly force is immediately necessary to prevent the commission of murder (amongst other crimes). In this instance, the UPD officer would have heard gunshots and/or reports of gunshots and observed an individual approaching the school building armed with a rifle. A reasonable officer would conclude in this case, based upon the totality of the circumstances, that use of deadly force was warranted. Furthermore, the UPD officer was approximately 148 yards from the west hall exterior door. One-hundred and forty-eight yards is well within the effective range of an AR-15 platform. The officer did comment that he was concerned that if he missed his shot, the rounds could have penetrated the school and injured students. We also note that current State of Texas standards for patrol rifle qualifications do not require officers to fire their rifles from more than 100 yards away from the target. It is, therefore, possible that the officer had never fired his rifle at a target that was that far away. Ultimately, the decision to use deadly force always lies with the officer who will use the force. If the officer was not confident that he could both hit his target and of his backdrop if he missed, he should not have fired.

had the UPD officer engaged the suspect with his rifle, he may have been able to neutralize, or at least distract, the suspect preventing him from entering the building.

ALERRT report, pp13–14

In hindsight, it sounds like that officer made the wrong call. If he had acted, perhaps he could have stopped the attacker from entering the school and slaughtering children.

If you’re thinking “if it was me in place of that officer, I would have taken the shot”, then, congratulations, you would have killed an innocent man (emphasis mine):

The officers testified to the Committee that it turned out that the person they had seen dressed in black was not the attacker, but instead it was Robb Elementary Coach Abraham Gonzales.

In a subsequent DPS interview, the officer in question described the person he saw not as “the shooter” but as “a person in black toward the back of the school, but kids were behind that individual.” DPS interview (June 13, 2022). These DPS interview reports do not include or support the detail suggested in the ALERRT report that a Uvalde police officer “observed the suspect carrying a rifle outside the west hall entry.” Based on its review of evidence to date, this Committee concludes that it is more likely that the officer saw Coach Gonzales dressed in black near a group of schoolchildren than that there was an actual opportunity to shoot the attacker from over 100 yards away, as assumed by ALERRT’s partial report.

Interim report, p43

This is yet another reminder that incident responders are faced with making time-pressured risk trade-offs under uncertainty. There are risks associated with both action and inaction, you don’t have enough information to make a fully informed decision, and you can’t take an arbitrary amount of time to make a decision, because the situation can change rapidly.

If you want to make sense of how responders behave in a situation like Uvalde, you need to understand what an incident looks like from the inside.

Common ground breakdown in Uvalde

According to the interim report on the elementary school shooting in Uvalde, there were 376(!) law enforcement officers that responded. How do we make sense of the fact that it took such a long time to neutralize the shooter? One way is to see the problem as what the researchers Klein, Feltovich, and Woods refer to as a fundamental common ground breakdown.

The term “common ground” refers to a kind of shared understanding between people who are coordinating in some way, so that the participants can predict each other’s future actions. To take a simple example, if you and I are playing a board game, “whose turn is next” would be part of the common ground.

For a group of people to coordinate, they need to maintain common ground: a shared understanding of the situation and of what actions they expect others to take.

Because incidents are dynamic and the responders have access to different information, a common risk during incidents is the erosion of common ground. One particular risk is what Klein et al. refer to as confusion over who knows what.

During the shooting at Uvalde, many of the officers believed that Chief Arredondo was the incident commander:

The general consensus of witnesses interviewed by the Committee was that officers on the scene either assumed that Chief Arredondo was in charge, or that they could not tell that anybody was in charge of a scene described by several witnesses as “chaos” or a “cluster.”

Interim report, p62

But Arredondo saw himself in the role of a responder trying to resolve the situation directly, not as an incident commander working on coordination.

[W]hile you’re in there, you don’t title yourself … .I know our policy states you’re the incident commander. My approach and thought was responding as a police officer. And so I didn’t title myself. But once I got in there and we took that fire, back then, I realized, we need some things. We’ve got to get in that door. We need an extraction tool. We need those keys. As far as … I’m talking about the command part … the people that went in, there was a big group of them outside that door. I have no idea who they were and how they walked in or anything. I kind of – I wasn’t given that direction.
you can always hope and pray that there’s an incident command post outside. I just didn’t have access to that. I didn’t know anything about that.

Interim report, p63

Here’s another example of how common ground got eroded: different understandings of the situation based on whether you were inside or outside.

Also, the misinformation reported to officers on the outside likely prevented some of them from taking a more assertive role. For example, many officers were told to stay out of the building because Chief Arredondo was inside a room with the attacker actively negotiating.

Interim report, p63

Another challenge that Klein et al discuss is communication problems, which we also see in Uvalde. In a previous post, I wrote about how Chief Arredondo believed that the shooter was barricaded alone in the room. There was evidence to the contrary, but that information didn’t reach him during the incident.

There was a series of phone calls with a student inside Room 112, initiated by the student calling 911 at 12:03 p.m. Radio traffic communicated to those officers who could hear it the fact that a student had called from within the classroom. Several witnesses indicated that they were aware of this, but not Chief Arredondo.

Interim report, p62

In particular, the police radios didn’t work properly inside of the school, a fact that is mentioned multiple times in the report (emphasis mine):

An effective incident commander located away from the drama unfolding inside the building would have realized that radios were mostly ineffective, and that responders needed other lines of communication to communicate important information like the victims’ phone calls from inside the classrooms.

Interim report, p8

Uvalde CISD police officers commonly carried two radios: one for the school district, and another “police radio” which transmitted communications from various local law enforcement agencies. While the school district radios tended to work reliably, the police radios worked more intermittently depending on where they were used.

Interim report, p14

Upon entering the building, the officers tried but were unable to communicate on their radios.

Interim report, p51

As mentioned in the narratives above, there were important events happening outside the north and south ends of the west building. In part due to the difficulty of maintaining radio communications within the building, not everybody inside the building received all of this information.

Interim report, p62

Radio communication was ineffective, so something else was needed for decisionmakers to receive critical information, such as the fact that victims had called from inside the rooms with the attacker.

Interim report, p64

Fundamental common ground breakdown is an ever-present danger whenever we coordinate, and it’s especially dangerous during incidents. The shooting in Uvalde is a painful reminder of this failure mode.

The fog of war in Uvalde

The interim report on the shooting in Uvalde, Texas faults the responders for treating the incident as a “barricaded subject” scenario when they should have treated it as an “active shooter” scenario.

Here are some excerpts from the report (emphasis mine):

Instead of continuing to act as if they were addressing a barricaded subject scenario in which responders had time on their side, they should have reassessed the scenario as one involving an active shooter. Correcting this error should have sparked greater urgency to immediately breach the classroom by any possible means, to subdue the attacker, and to deliver immediate aid to surviving victims. Recognition of an active shooter scenario also should have prompted responders to prioritize the rescue of innocent victims over the precious time wasted in a search for door keys and shields to enhance the safety of law enforcement responders.

Interim report, p8

An offsite overall incident commander who properly categorized the crisis as an active shooter scenario should have urged using other secondary means to breach the classroom, such as using a sledgehammer as suggested in active shooter training or entering through the exterior windows.

Interim report, p8

Although the encounter had begun as an “active shooter” scenario, Chief Arredondo testified that he immediately began to think of the attacker as being “cornered” and the situation as being one of a “barricaded subject” where his priority was to protect people in the other classrooms from being victimized by the attacker

Interim report, p52

Here’s how Chief Pete Arredondo described his mental model of the situation in the moment:

We have this guy cornered. We have a group of officers on … the north side, a group of officers on the south side, and we have children now that we know in these other rooms. My thought was: We’re a barrier; get these kids out — not the hallway, because the bullets are flying through the walls, but get them out the wall – out the windows, because I know, on the outside, it’s brick.

[T]o me … once he’s … in a room, you know, to me, he’s barricaded in a room. Our thought was: “If he comes out, you know, you eliminate the threat,” correct? And just the thought of other children being in other classrooms, my thought was: “We can’t let him come back out. If he comes back out, we take him out, or we eliminate the threat. Let’s get these children out.”

It goes back to the categorizing. … I couldn’t tell you when — if there was any different kind of categorizing. I just knew that he was cornered. And my thought was: “ … We’re a wall for these kids.” That’s the way I looked at it. “We’re a wall for these kids. We’re not going to let him get to these kids in these classrooms” where … we saw the children.

[W]hen there’s a threat … you have to visibly be able to see the threat. You have to have a target before you engage your firearm. That was just something that’s gone through my head a million times … .[G]etting fired at through the wall … coming from a blind wall, I had no idea what was on the other side of that wall. But … you eliminate the threat when you could see it. … I never saw a threat. I never got to … physically see the threat or the shooter.

Interim report, pp 52–53

The report goes on to say:

Chief Arredondo’s testimony about his immediate perception of the circumstances is consistent with that of the other responders to the extent they uniformly testified that they were unaware of what was taking place behind the doors of Rooms 111 and 112. They obviously were in a school building, during school hours, and the attacker had fired a large number of rounds from inside those rooms. But the responders testified that they heard no screams or cries from within the rooms, and they did not know whether anyone was trapped inside needing rescue or medical attention. Not seeing any injured students during their initial foray into the hallway, Sgt. Coronado testified that he thought that it was probably a “bailout” situation.

Chief Arredondo and other officers contended they were justified in treating the attacker as a “barricaded subject” rather than an “active shooter” because of lack of visual confirmation of injuries or other information.

Interim report, p53

(Aside: A “bailout” situation refers to human traffickers who try to outrun the police. They commonly crash their vehicles and then flee. These bailout situations were so common in Uvalde that they led to alert fatigue(!). See p6 of the report for more details).

Of course, it’s impossible to know the true state of mind of the officers at the time. And, as the report notes, video camera evidence suggests that officers eventually believed there were people who had been injured by the shooter:

For example, later in the incident, Sgt. Coronado’s body-worn camera footage recorded that somebody asked, at 12:34 p.m., “we don’t know if he has anyone in the room with him, do we?” Chief Arredondo responded, “I think he does. There’s probably some casualties.” Sgt. Coronado agreed, saying “yeah, he does … casualties.” Then at 12:41 p.m.: “Just so you understand, we think there are some injuries in there.”

Interim report, p54, footnote 164

But even the report suggests that the issue was around fixation, as opposed to the officer lying about what he believed in the moment.

This “barricaded subject” approach never changed over the course of the incident despite evidence that Chief Arredondo’s perspective evolved to a later understanding that fatalities and injuries within the classrooms were a very strong probability.

Interim report, pp 53–54

My claim here is that we should assume the officer is telling the truth and was acting reasonably if we want to understand how these types of failure modes can happen.

If, instead of assuming that Chief Arredondo simply made a mistake, we assume that he reasonably concluded the shooter was a “barricaded subject”, then we can better appreciate the ambiguous nature of incidents in the moment. In order to understand the challenges that people like Chief Arredondo faced, we need to put ourselves in his place, and imagine what our understanding would be like if we only saw the signals that he did.

This isn’t the last time a responder is going to reach the wrong conclusion based on partial information, and then get fixated on it. If we simply label Chief Arredondo as “acting unreasonably” or “being a coward”, then we might feel better when he gets fired, but we won’t get better at handling these sorts of failure modes. We must assume that a person can act reasonably and still come to the wrong conclusion in order to make progress.

What’s allowed to count as a cause: ALERRT edition

The Advanced Law Enforcement Rapid Response Training (ALERRT) Center, based at Texas State University, trains law enforcement officers on how to deal with active shooter incidents. After the shooting at Uvalde, ALERRT produced an after-action report titled Robb Elementary School Attack Response Assessment and Recommendations.

The “Tactical Assessment” section of the report criticizes the actions of the responding officers. It’s too long to excerpt in this post, but here are some examples:

A reasonable officer would have considered this an active situation and devised a plan to address the suspect. Even if the suspect was no longer firing his weapon, his presence and prior actions were preventing officers from accessing victims in the classroom to render medical aid (ALERRT & FBI, 2020, p. 2-17).

ALERRT report, p17

In a hostage/barricade, officers are taught to utilize the 5 Cs (Contain, Control, Communicate, Call SWAT, Create a Plan; ALERRT & FBI, 2020, pp. 2-17 to 2-19). In this instance, the suspect was contained in rooms 111 and 112. The officers established control in that they slowed down the assault. However, the officers did not establish communication with the suspect. The UCISD PD Chief did request SWAT/tactical teams. SWAT was called, but it takes time for the operators to arrive on scene. In the meantime, it is imperative that an immediate action plan is created. This plan is used if active violence occurs. It appears that the officers did not create an immediate action plan.

ALERRT report, p17

(Note: per the interim report, the officers did try to establish communication with the suspect, but the ALERRT authors weren’t aware of this at the time).

At 11:40:58, the suspect fired one shot. At 11:44:00, the suspect fired another shot, and finally, at 12:21:08, the suspect fired 4 more shots. During each of these instances, the situation had gone active, and the immediate action plan should have been triggered because it was reasonable to believe that people were being killed.

ALERRT report, p18

Additionally, we have noted in this report that it does not appear that effective incident command was established during this event. The lack of effective command likely impaired both the Stop the Killing and Stop the Dying parts of the response.

ALERRT report, p19

The interim report also covers some of this territory in the subsection titled “ALERRT Standard for Active Shooter Training”, which starts on p17.

What struck me after reading the ALERRT report is that there is no mention of the fact that several of the responding police officers had received ALERRT training, including the chief of the Uvalde school district police, Pete Arredondo. From the interim report:

Before joining the Uvalde CISD Police Department, Chief Arredondo received active shooter training from the ALERRT Center, which the FBI has recognized as “the National Standard in Active Shooter Response Training.” Every school district peace officer in Texas must be trained on how to respond in active shooter scenarios. Not all of them get ALERRT training, but Chief Arredondo and other responders at Robb Elementary did.

Interim report, pp 17–18

The ALERRT report discusses how the actions of the officers were contrary to ALERRT training, and that is one potential explanation for why things went badly. But another potential explanation is that the ALERRT training wasn’t good enough to prepare the officers to deal with this situation. For example, perhaps the training doesn’t go into enough detail about the danger of fixation, where Chief Arredondo focused on trying to get a key for the door, when it wasn’t even clear whether the door was locked or not. (Does ALERRT train peace officers to diagnose fixation in other responders?)

The interim report gestures in the direction of ALERRT training being inadequate when it comes to checking the locks, although not about the more general problem of fixation.

ALERRT has noted the failure to check the lock in its criticisms. See ALERRT, Robb Elementary School Attack Response Assessment and Recommendations at 18-19 (July 6, 2022). A representative of ALERRT testified before the Committee that the “first rule of breaching” is to check the lock. See Testimony of John Curnutt, ALERRT (July 11, 2022). Unfortunately, ALERRT apparently has neglected to include that “first rule of breaching” in its active-shooter training materials, which includes modules entitled “Closed and Locked Interior Doors” and “Entering Locked Buildings Quickly, Discreetly, and Safely.” See Federal Bureau of Investigation & ALERRT, Active Shooter Response – Level 1, at STU 3-8 – 3-10, 4-20 – 4-25.

Interim report, p64, footnote 206

Now, these criticisms are hindsight-laden, and my goal here isn’t to criticize ALERRT’s training: this isn’t my domain, and I don’t pretend to know how to train officers to deal with active shooter scenarios. Rather, my point is that the folks writing the ALERRT report were never going to consider that their own training is inadequate. After all, they’re the experts!

ALERRT was recognized as the national standard in active shooter response training by the FBI in 2013. ALERRT’s excellence in training was recognized in 2016 with a Congressional Achievement Award.

More than 200,000 state, local, and tribal first responders (over 140,000 law enforcement) from all 50 states, the District of Columbia, and U.S. territories have received ALERRT training over the last 20 years.

ALERRT training is research based. The ALERRT research team not only evaluates the efficacy of specific response tactics (Blair & Martaindale, 2014; Blair & Martaindale, 2017; Blair, Martaindale, & Nichols, 2014; Blair, Martaindale, & Sandel, 2019; Blair, Nichols, Burns, & Curnutt, 2013;) but also has a long, established history of evaluating the outcomes of active shooter events to inform training (Martaindale, 2015; Martaindale & Blair, 2017; Martaindale, Sandel, & Blair, 2017). Specifically, ALERRT has utilized case studies of active shooter events to develop improved curriculum to better prepare first responders to respond to similar situations (Martaindale & Blair, 2019).

For these reasons, ALERRT staff will draw on 20 years of experience training first responders and researching best practices to fulfill the Texas DPS request and objectively evaluate the law enforcement response to the May 24, 2022, attack at Robb Elementary School.

ALERRT report, p1

I think it’s literally inconceivable for the ALERRT staff to consider the inadequacy of their own training curriculum as being a contributor to the incident. It’s a great example of something that isn’t allowed to count as a cause.

I’ll end this blog post with some shade that the interim report threw on the ALERRT report.

The recent ALERRT report states that “[o]nce the officers retreated, they should have quickly made a plan to stop the attacker and gain access to the wounded,” noting “[t]here were several possible plans that could have been implemented.” “Perhaps the simplest plan,” according to ALERRT, “would have been to push the team back down the hallway and attempt to control the classrooms from the windows in the doors.” The report explains the purported simplicity of the plan by noting: “Any officer wearing rifle-rated body armor (e.g., plates) would have assumed the lead as they had an additional level of protection.” ALERRT, Robb Elementary School Attack Response Assessment and Recommendations (July 6, 2022). A problem with ALERRT’s depiction of its “simplest plan” is that no officer present was wearing “rifle-rated body armor (e.g., plates).” The Committee agrees the officers should have attempted to breach the classrooms even without armor, but it is inflammatory and misleading to release to the public a report describing “plans that could have been implemented” that assume the presence of protective equipment that the officers did not have.

Interim report, pp 51–52, footnote 158


Last week, the Investigative Committee on the Robb Elementary shooting (in Uvalde, Texas) released an interim report with their findings. I recommend reading it if you’re interested in incidents, especially section 5, “May 24 Incident & Law Enforcement Response”, which goes into detail on how the police responded.

I was pleasantly surprised to see terms like contributing factors and systemic failures in the report, and not a single reference to root cause. On the other hand, there’s way too much counterfactual reasoning in the report for my taste: there’s an entire subsection with the title “What Didn’t Happen in Those 73 Minutes?” It doesn’t get more counterfactual-y than that. There’s also normative language like egregious poor decision making. It’s disappointing, but not surprising, to see this type of language, given the nature of the incident.

Reading the report, I found there was too much I wanted to comment on to fit into one post, and so I’m going to try and write a series of posts instead. I’ve also created a GitHub repo with pointers to various artifacts related to the shooting (reports, images, videos).