Code rewrites and joint cognitive systems

Way back in the year 2000, Joel Spolsky famously criticized the idea of doing a code rewrite.

The idea that new code is better than old is patently absurd. Old code has been used. It has been tested. Lots of bugs have been found, and they’ve been fixed. There’s nothing wrong with it.

Joel Spolsky, Things You Should Never Do, Part I

I think Spolsky is wrong here. His error comes from considering the software in isolation. The problem here isn’t the old code, it’s the interaction between the old code and the humans who are responsible for maintaining the software. If you draw the boundary around those people and the software together, you get what the cognitive systems engineering community calls a joint cognitive system.

One of the properties of joint cognitive systems is that the system has knowledge about itself. Being responsible for maintaining a legacy codebase is difficult because the joint cognitive system is missing important knowledge about itself.

Here’s Spolsky again:

When you throw away code and start from scratch, you are throwing away all that knowledge. 

But that knowledge is already gone! The people who wrote the code have left, and the current maintainers don’t know what their intent was. The joint cognitive system, the combination of code and the current maintainers, doesn’t know why it’s implemented the way it is.

Spolsky gestures at this, but doesn’t grasp its implications:

The reason that they think the old code is a mess is because of a cardinal, fundamental law of programming: It’s harder to read code than to write it.

Spolsky is missing the importance of a system’s ability to understand itself. Ironically, the computer scientist Peter Naur was writing about this phenomenon fifteen years earlier. In an essay titled Programming as Theory Building, he described the importance of having an accurate mental model or theory of the software, and the negative consequences of software being modified by maintainers with poor mental models.

It isn’t just about the software. It’s about the people and the software together.

Taking the hit

Here’s a scenario I frequently encounter: I’m working on writing up an incident or an OOPS. I’ve already interviewed key experts on the system, and based on those interviews, I understand the implementation details well enough to explain the failure mode in writing.

But, when I go to write down the explanation, I discover that I don’t actually have a good understanding of all of the relevant details. I could go back and ask clarifying questions, but I worry that I’ll have to do this multiple times, and I want to avoid taking up too much of other people’s time.

I’m now faced with a choice when describing the failure mode. I can either:

(a) Be intentionally vague about the parts that I don’t understand well.

(b) Make my best guess about the implementation details for the parts I’m not sure about.

Whenever I go with option (b), I always get some of the details incorrect. This becomes painfully clear to me when I show a draft to the key experts, and they tell me straight-out, “Lorin, this section is wrong.”

I call choosing option (b) taking the hit because, well, I hate the feeling of being wrong about something. However, I always try to go with this approach because this maximizes both my own learning and (hopefully) the learning of the readers. I take the hit. When you know that your work will be reviewed by an expert, it’s better to be clear and wrong than vague.

Bitrot

Engineering deals in lifetimes, both human and otherwise. If not fatigue or fracture, then corrosion or erosion; if not war or vandalism, then taste or fashion claim not only the body but the very souls of once-new machines…

The lifetime of a structure is no mere anthropomorphic metaphor, for how long a piece of engineering must last can be one of the most important considerations for its design.

Henry Petroski, To Engineer is Human: The Role of Failure in Successful Design

Unfathomed misunderstanding is further revealed by the term “software maintenance”, as a result of which many people continue to believe that programs —and even programming languages themselves— are subject to wear and tear. Your car needs maintenance too, doesn’t it? Famous is the story of the oil company that believed that its PASCAL programs did not last as long as its FORTRAN programs “because PASCAL was not maintained”.

Edsger W. Dijkstra, On the cruelty of really teaching computing science

Before Borland’s new spreadsheet for Windows shipped, Philippe Kahn, the colorful founder of Borland, was quoted a lot in the press bragging about how Quattro Pro would be much better than Microsoft Excel, because it was written from scratch. All new source code! As if source code rusted.

The idea that new code is better than old is patently absurd. Old code has been used. It has been tested. Lots of bugs have been found, and they’ve been fixed. There’s nothing wrong with it. It doesn’t acquire bugs just by sitting around on your hard drive. Au contraire, baby! Is software supposed to be like an old Dodge Dart, that rusts just sitting in the garage? Is software like a teddy bear that’s kind of gross if it’s not made out of all new material?

Joel Spolsky, Things You Should Never Do, Part I

In the two quotes above, Dijkstra and Spolsky ridicule the notion that software systems wear out. Unlike physical systems, software doesn’t suffer from fatigue due to prolonged usage.

And, yet, anyone who has uttered the phrase “legacy system” in the presence of a software engineer and watched the change of expression on their face knows that engineers find older code more difficult to deal with than newer code. The motivation of Dijkstra’s and Spolsky’s writings above is to express contempt for this point of view.

What Dijkstra and Spolsky are missing is that the world changes around software. Software doesn’t exist in a vacuum: it’s part of an ecosystem. Legacy systems have legacy dependencies, and run in legacy environments. Those dependencies and environments are not static: they change over time, and sometimes the old ones go away, or become too expensive or risky to keep using.

Software is indeed different from physical artifacts, in that software artifacts (source code, binaries) don’t change with use. But in the world of software, that’s exactly the problem. The world keeps changing, and the software doesn’t, unless you put the work into it. And, unlike civil engineers, we aren’t yet good at thinking about the intended lifetime of a software system when we’re designing it.

Aristotle’s revenge

Imagine you’re walking around a university campus. It’s a couple of weeks after the spring semester has ended, and so there aren’t many people about. You enter a building and walk into one of the rooms. It appears to be some kind of undergraduate lab, most likely either physics or engineering.

In the lab, you come across a table. On the table, someone has balanced a rectangular block on its smallest end.

You nudge the top of the block. As expected, it falls over with a muted plonk. You look around to see if you might have gotten in trouble, but nobody’s around.

You come across another table. This table has some sort of track on it. The table also has a block on it that’s almost identical to the one on the other table, except that the block has a pin in it that connects it to some sort of box that is mounted on the track.

Not being able to resist, you nudge the top of this block, and it starts to fall. Then, the little box on the track whirs to life, moving along the track in the same direction that you nudged the block in. Because of the motion of the box, the block stays upright.


The ancient philosopher Aristotle believed that there were four distinct types of causes that explained why things happened. One of these is what Aristotle called the efficient cause: “Y behaved the way it did because X acted on Y”. For example, the red billiard ball moved because it was struck by the white ball. This is the most common way we think about causality today, and it’s sometimes referred to as linear causality.

Efficient cause does a good job of explaining the behavior of the first rectangular block in the anecdote: it fell over because we nudged it with our finger. But it doesn’t do a good job of explaining the observed behavior of the second rectangular block: we nudged it, and it started to fall, but it righted itself, and ended up balanced again.

Another type of cause Aristotle talked about was what he called the final cause. This is a teleological view of cause, which explains the behavior of objects in terms of their purpose or goal.

Final cause sounds like an archaic, unscientific view of the world. And, yet, reasoning with final cause enables us to explain the behavior of the second block more effectively than efficient cause does. That’s because the second block is being controlled by a negative-feedback system that was designed to keep the block balanced. The system will act to compensate for disturbances that could lead to the block falling over (like somebody walking over and nudging it). Because the output (the reading from a sensor that measures the angle of the block) is fed back into the input of the control system, the relationship between external disturbance and system behavior isn’t linear. This is sometimes referred to as circular causality, because of the circular nature of feedback loops.

The systems that we deal with contain goals and feedback loops, just like the inverted pendulum control system that keeps the block balanced. If you try to use linear causality to understand a closed-loop system, you will be baffled by the resulting behavior. Only when you understand the goals that the system is trying to achieve, and the feedback loops that it uses to adjust its behavior to reach those goals, will the resulting behavior make sense.
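To make the circular causality concrete, here’s a minimal simulation sketch of the balancing rig from the anecdote (the gains, dimensions, and time step are all invented for illustration): the controller reads the block’s angle and accelerates the cart in the direction of the lean, and that motion feeds back into the very angle it is reading.

```python
import math

# A sketch of the negative-feedback loop described above: a cart that
# accelerates in the direction the block is leaning, so the pivot moves
# back underneath it. Constants are illustrative, not from any real rig.

G = 9.81             # gravity, m/s^2
L = 0.5              # pivot-to-center-of-mass distance, m
KP, KD = 30.0, 5.0   # proportional and derivative gains (assumed)
DT = 0.001           # simulation time step, s

theta = 0.0          # block angle from vertical, radians
theta_dot = 0.5      # angular velocity right after the "nudge"

for step in range(3000):  # simulate 3 seconds
    # Feedback: the sensed angle determines the cart's acceleration,
    # which in turn changes the angle -- a loop, not a line.
    cart_accel = KP * theta + KD * theta_dot

    # Simplified inverted-pendulum dynamics: gravity tips the block over,
    # cart acceleration pushes the pivot back under it.
    theta_ddot = (G * math.sin(theta) - cart_accel * math.cos(theta)) / L

    theta_dot += theta_ddot * DT
    theta += theta_dot * DT

    if step % 500 == 0:
        print(f"t={step * DT:4.1f}s  angle={math.degrees(theta):6.2f} deg")
```

After the nudge, the angle decays back toward zero. To explain why, you have to talk about the goal (keep the block upright) and the feedback loop, not just the push.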

Henry Yin on what the cyberneticists got wrong

I’ve been on a bit of a control systems kick lately, and, serendipitously, I happened to see this tweet, which referenced a paper by Henry Yin at Duke University titled The crisis in neuroscience.

In the paper, Yin argues that neuroscience has failed to make progress in modeling human behavior because it tries to model the brain as a linear system, where you can study it by generating inputs and observing outputs.

Input/output model of brain

Yin proposes an alternative model: to understand behavior from a neurological perspective, you need to view the brain as a collection of hierarchical, closed-loop control systems.

Now, the cybernetics folks have long argued that you should model human brains as control systems. But Yin argues that the cyberneticists got an important thing wrong in their control models: their models were too close to engineering applications to be directly applicable to organisms.

Classical engineering model of a feedback control system

In an engineered control system, a human operator specifies the set point. For example, for a cruise control system, you’d set the desired speed. In the block diagram above, this set point is provided as the input to the system.

The output of the “Plant” block is the current state of the variable you’re trying to control (e.g., the current speed). The controller takes as input the difference between the set point and the current state, and uses that to determine how to drive the plant (e.g., the input to the motor).

Here’s a block diagram of everyone’s favorite control systems example, the thermostat:

A thermostat that controls temperature

I’ve used double arrows to indicate signals that propagate through the environment, and single arrows to indicate signals that propagate through wires. I’ve put a red box around what Yin claims the cyberneticists hold as their model for control in animals.

The variable under control is the temperature. A human sets the desired temperature, and a temperature sensor reads the current temperature. The controller takes as input the difference between the desired temperature and the current temperature, and uses that to determine whether or not to turn on the furnace.

The actual temperature in the house is determined both by the output of the furnace, and by other factors (e.g., temperature outside, how good the insulation is, whether someone has opened a door), which I’ve labeled disturbance.
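Here’s a toy version of that loop in code. It’s a sketch with made-up constants rather than a model of any real furnace, but it shows the shape of the diagram: the set point comes from outside, the controller only ever sees the difference between the set point and the sensor reading, and the disturbance keeps pushing the temperature around.

```python
import random

# Toy thermostat loop: external set point, sensor reading, on/off
# controller with hysteresis, and a disturbance. All constants invented.

set_point = 20.0      # desired temperature (deg C), chosen by a human
temperature = 15.0    # current indoor temperature
furnace_on = False
HYSTERESIS = 0.5      # avoid rapid on/off cycling
FURNACE_GAIN = 1.0    # deg C added per time step while the furnace runs
LOSS_RATE = 0.05      # fraction of the indoor/outdoor gap lost per step
outside = 5.0         # part of the "disturbance"

for minute in range(120):
    error = set_point - temperature   # controller input: desired minus actual
    if error > HYSTERESIS:
        furnace_on = True
    elif error < -HYSTERESIS:
        furnace_on = False

    # Plant plus disturbance: furnace output, heat loss, random fluctuation.
    temperature += FURNACE_GAIN if furnace_on else 0.0
    temperature -= LOSS_RATE * (temperature - outside)
    temperature += random.uniform(-0.05, 0.05)

    if minute % 20 == 0:
        print(f"{minute:3d} min  {temperature:5.2f} C  furnace={'on' if furnace_on else 'off'}")
```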

The problem with this, Yin argues, is that the red box is not a good model for the control that happens in the brain. As an alternative, he proposes the following model:

You can think of the red box as the stuff inside some aspect of the brain, and the “plant” as the things that this aspect controls (e.g., other parts of the brain, muscles).

The difference in Yin’s model is that the controller determines the set point. There’s no external agent specifying the desired value as an input. Instead, the controller generates its own set point, which Yin calls the reference value.

Also note that Yin’s model includes the input function inside the red box. This takes sensory input and calculates the variable that’s under control. The difference between this model and the thermostat is that, in the thermostat model, you know from the outside that temperature is the variable being controlled. In Yin’s model, you can’t see from the outside which variable is being controlled for: the variable is internal to the control system.
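Here’s a small sketch of that structural difference (my own illustration, not code from Yin’s paper): the reference value and the input function both live inside the controller, so an outside observer sees only raw sensory input going in and commands coming out, never the controlled variable itself.

```python
# Sketch of a controller in the style Yin describes: it generates its own
# reference value and computes the controlled variable internally.
# The specific input function and gain are invented for illustration.

class SelfReferencingController:
    def __init__(self):
        self.reference = 1.0   # generated internally, not supplied from outside
        self.gain = 0.5

    def _input_function(self, raw_senses):
        # Computes the controlled variable from raw sensory input.
        # Which variable is being controlled is invisible from outside.
        return sum(raw_senses) / len(raw_senses)

    def step(self, raw_senses):
        perceived = self._input_function(raw_senses)
        error = self.reference - perceived
        return self.gain * error   # output that drives the "plant"


controller = SelfReferencingController()
print(controller.step([0.2, 0.4, 0.6]))   # -> 0.3, i.e. 0.5 * (1.0 - 0.4)
```

If you treat this object as an input/output box and study it by feeding in senses and recording outputs, the mapping looks arbitrary until you work out what the input function computes and what reference it’s being compared against, which echoes Yin’s complaint about the input/output view of the brain.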

Despite knowing nothing about neuroscience, and only knowing a bit about control systems, I still found this paper surprisingly accessible. I recommend it. There’s a lot more here than what I’ve touched on in this post.

The ambiguity of real work

All ambiguity is resolved by actions of practitioners at the sharp end of the system.

Dr. Richard I. Cook, How Complex Systems Fail

There’s a wonderful book by the late urban planning professor Donald Schön called The Reflective Practitioner: How Professionals Think in Action. In the first chapter, he discusses the “rigor or relevance” dilemma that faces educators in professional degree programs. In the case of a university program aimed at preparing students for a career in software development, this is the “should we teach topological sort or React?” question.

Schön argues that the dilemma itself is a fundamental misunderstanding of the nature of professional work. What it misses is the ambiguity and uncertainty inherent in the work of professional life. The “rigor vs relevance” debate is an argument over the best way to get from the problem to the solution: do you teach the students first principles, or do you teach them how to use the current set of tools? Schön observes that a more significant challenge for professionals is defining the problems to solve in the first place, since an ill-defined problem admits no technical solution at all.

In the varied topography of professional practice, there is a high, hard ground where practitioners can make effective use of research-based theory and technique, and there is a swampy lowland where situations are confusing “messes” incapable of technical solution. The difficulty is that the problems of the high ground, however great their technical interest, are often relatively unimportant to clients or to the larger society, while in the swamp are the problems of greatest human concern.

His use of the term “messes” evokes Russell Ackoff’s use of the term in his paper The Future of Operational Research is Past:

Managers are not confronted with problems that are independent of each other, but with dynamic situations that consist of complex systems of changing problems that interact with each other. I call such situations messes. Problems are abstractions extracted from messes by analysis; they are to messes as atoms are to tables and chairs. We experience messes, tables, and chairs; not problems and atoms.

To take another example from the software domain: imagine that you’re doing quarterly planning, there’s a collection of reliability work that you’d like to do, and you’re trying to figure out how to prioritize it. You could apply a rigorous approach, where you quantify some values in order to do the prioritization work, estimating things like:

  • the probability of hitting a problem if the work isn’t done
  • the cost to the organization if the problem is encountered
  • the amount of effort involved in doing the reliability work

But you’re soon going to discover the enormous uncertainty involved in trying to put a number on any of those things. And, in fact, doing any reliability work can actually introduce new failure modes.
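To see how quickly the uncertainty swamps the exercise, here’s a back-of-the-envelope sketch with invented numbers. Even generous low and high bounds produce expected-cost ranges spanning orders of magnitude.

```python
# Hypothetical reliability work items with low/high guesses for the
# probability of hitting the problem and the cost if it's hit.
# Every number here is made up.

reliability_items = {
    # item: ((probability low, high), (cost-if-hit low, high in $), effort in eng-weeks)
    "add retry budget":   ((0.05, 0.40), (20_000, 500_000), 3),
    "shard hot database": ((0.10, 0.60), (50_000, 2_000_000), 8),
}

for name, ((p_lo, p_hi), (c_lo, c_hi), effort) in reliability_items.items():
    expected_lo = p_lo * c_lo   # optimistic expected cost of doing nothing
    expected_hi = p_hi * c_hi   # pessimistic expected cost of doing nothing
    print(f"{name}: expected cost ${expected_lo:,.0f} to ${expected_hi:,.0f} "
          f"for {effort} weeks of effort")
```

The ranges are so wide that they barely constrain the decision, and they say nothing about the new failure modes the reliability work itself might introduce.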

Over and over, I’ve seen the theme of ambiguity and uncertainty appear in ethnographic research that looks at professional work in action. In Designing Engineers, the aerospace engineering professor Louis Bucciarelli did an ethnographic study of engineers in a design firm, and discovered that the engineers all had partial understanding of the problem and solution space, and that their understandings also overlapped only partially. As a consequence, a lot of the engineering work that was done actually involved engineers resolving their incomplete understanding through various forms of communication, often informal. Remarkably, the engineers were not themselves aware of this process of negotiating understandings of the problems and solutions.

The famous Common Ground and Coordination in Joint Activity paper by Gary Klein, Paul Feltovich, and David Woods, makes explicit the role that ambiguity plays in human coordination and communication.

You’ll sometimes hear researchers who study work talk about the process of sensemaking. For example, there’s a paper by Sara Albolino, Richard Cook, and Michael O’Connor called Sensemaking, safety, and cooperative work in the intensive care unit that describes this type of work in an intensive care unit. I think of sensemaking as an activity that professionals perform to try to resolve ambiguity and uncertainty.

(Ambiguity isn’t always bad. In the book On Line and On Paper, the sociologist Kathryn Henderson describes how engineers use engineering drawings as boundary objects. These are artifacts that are understood differently by the different stakeholders: two engineers looking at the same drawing will have different mental models of the artifact based on their own domain expertise(!). However, there is also overlap in their mental models, and it is this combination of overlap and the fact that individuals can use the same artifact for different purposes that makes it useful. Here the ambiguity has actual value! In fact, her research showed that computer models, which eliminate the ambiguity, were less useful for this sort of work.)

As practitioners, we have no choice: we always have to deal with ambiguity. As noted by Richard Cook in the quote that opens this blog post, we are the ones, at the sharp end, that are forced to resolve it.

The Howie Guide: How to get started with incident investigations

Until now, if you wanted to improve your organization’s ability to learn from incidents, there wasn’t a lot of how-to style material you could draw from. Sure, there were research papers you could read (oh, so many research papers!). But academic papers aren’t a great source of advice for someone who is starting on an effort to improve how they do incident analysis.

There simply weren’t any publications targeted at the infotech industry about how to get started with incident investigations. Your best bet was the Etsy Debrief Facilitation Guide. It was practical, but it focused on only a single aspect of the incident investigation process: the group incident retrospective meeting. And there’s so much more to incident investigation than that meeting.

The folks at Jeli have stepped up to the challenge. They just released Howie: The Post-Incident Guide.

Readers of this blog will know that this is a topic near and dear to my heart. The name “Howie” is short for “How we got here”, which is what we call our incident writeups at Netflix. (This isn’t a coincidence: we came up with this name at Netflix when Nora Jones of Jeli and I were on the CORE team).

Writing a guide like this is challenging, because so much of incident investigation is contextual: what you look at and what questions you ask will depend on what you’ve learned so far. But there are also commonalities across all investigations; the central activities (constructing timelines, doing one-on-one interviews, building narratives) happen each time. The Howie guide gently walks the newcomer through these. It’s accessible.

When somebody says, “OK, I believe there’s value in learning more from incidents, and we want to go beyond doing a traditional root-cause-analysis. But what should I actually do?”, we now have a canonical answer: go read Howie.

I have no idea what I’m doing

A few days ago, David Heinemeier Hansson (who generally goes by DHH) wrote a blog post titled Programmers should stop celebrating incompetence.

I disagreed with the post, but for different reasons than most of the other responses I saw on twitter.

Here are a couple of lines from the post:

You can’t become the I HAVE NO IDEA WHAT I’M DOING dog as a professional identity. Don’t embrace being a copy-pasta programmer whose chief skill is looking up shit on the internet.

From the twitter reactions, it seems like people thought DHH was saying, “you shouldn’t be looking things up on the internet and copy-pasting code”. But I think that gets the thrust of his argument wrong. This wasn’t a diatribe against Stack Overflow: it was about how programmers see themselves and their work.

DHH was criticizing a sort of anti-intellectual mode of expression. The attitude he was criticizing reminds me of an essay I once read (I can’t remember the source or author; it might have been Paul Lockhart) in which a mathematics(?) professor was talking with some colleagues from the humanities department. When the math professor mentioned their field, one of the humanities professors said, “Oh, I was never any good at math,” and it came off almost as a point of pride.

Where I disagree with DHH is that I don’t see this type of anti-intellectualism in our field at all. I don’t see “LOL, I don’t know what I’m doing” on people’s LinkedIn profiles or in their resumes, I don’t hear it in interviews, I don’t see it on pull request comments, I don’t hear it in technical meetings. I don’t think it exists in our field.

You can see our field’s professionalism in criticisms of technical interviews that involve live coding. You don’t hear programmers criticizing it by saying, “LOL, actually, nobody knows how to do this.” What you hear instead is, “these interviews don’t effectively evaluate my actual skills as a software developer”.

So, what’s going on here? What led DHH astray? Where does the dog meme come from?

To explain my theory, I’m going to use a recent blog post by Diomidis Spinellis called Rather than alchemy, methodical troubleshooting.

Spinellis is a software engineering professor who has written numerous books for practitioners and has contributed to numerous open source projects (including the FreeBSD kernel). He is as professional as they come.

His blog post is about his struggles getting a React Native project to build in Xcode, including trying (in vain) various bits of advice he found through Googling. Spinellis actually feels bad about his initial approach:

Although advice from the web can often help us solve tough problems in seconds, as the author of the book Effective Debugging, I felt ashamed of wasting time by following increasingly nonsensical advice. 

I bring this up not to pile onto Spinellis, but to point out that the surface area of the software world is vast, so vast that even the most professional software engineer will encounter struggles, will hit issues outside of their expertise.

(As an aside: note that Spinellis does not solve the problem by developing a deep understanding of the failure mode, but instead by systematically eliminating the differences between a succeeding build and a failed one.)
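That style of troubleshooting can be sketched as a bisection over the differences between a working setup and a failing one. This is a simplification that assumes a single culprit (real delta debugging handles messier cases), and the difference list and build_succeeds check below are stand-ins, not real tooling.

```python
def find_culprit(differences, build_succeeds):
    """Bisect the differences between a working and a failing setup."""
    candidates = list(differences)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Apply only this half of the differences to the working setup;
        # if the build now fails, the culprit is somewhere in this half.
        if not build_succeeds(half):
            candidates = half
        else:
            candidates = candidates[len(candidates) // 2 :]
    return candidates[0]


# Toy example: the build breaks whenever the Xcode upgrade is applied.
diffs = ["node 14 -> 16", "xcode 12 -> 13", "pod update", "new build flag"]
print(find_culprit(diffs, lambda applied: "xcode 12 -> 13" not in applied))
```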

In the book Designing Engineers, Louis Bucciarelli notes that Murphy’s Law and the horror stories told by engineers are symptoms of the dissonance between the certainty of engineering models and the uncertainty of reality. I think the dog meme is another such symptom. It uses humor to help us deal with the fact that, no matter how skilled we become in our profession as software engineers, we will always encounter problems that extend beyond our areas of expertise.

To put it another way: the dog meme is a coping mechanism for professionals in a domain that will always throw problems at them that push them beyond their local knowledge. It doesn’t indicate a lack of professionalism. Instead, it calls attention to the ironies of professionalism in software engineering. Even the best software engineers are still reduced to Googling incomprehensible error messages.

How much did that outage cost?

People like to put dollar values on outages. There’s no true answer to the question of how much an outage costs an organization. If your company is transaction-based, you can estimate how many transactions were missed, but there are all sorts of other factors that you could decide to model. (Are those transactions really lost, or will people come back later? Does it impact the public’s perception of the organization? What if your business isn’t transaction-based?). If you ask John Allspaw, he’ll tell you that incidents can provide benefits to an organization in addition to costs.
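Even the simplest transaction-based estimate is built out of judgment calls. Here’s a deliberately naive sketch with invented numbers, just to show how many assumptions hide inside a single dollar figure:

```python
# A naive lost-transaction estimate. Every constant is a judgment call:
# the recovery fraction and the reputational impact are guesses, not data.

outage_minutes = 45
normal_orders_per_minute = 120
average_order_value = 35.00      # dollars
fraction_truly_lost = 0.4        # how many customers never come back?

missed_orders = outage_minutes * normal_orders_per_minute
naive_cost = missed_orders * average_order_value
adjusted_cost = naive_cost * fraction_truly_lost

print(f"missed orders: {missed_orders}")
print(f"naive cost:    ${naive_cost:,.0f}")
print(f"adjusted cost: ${adjusted_cost:,.0f}  (reputation not included)")
```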

Putting all of that aside for now, one question around incident cost that I think is interesting is the perceived cost within the organization. How costly does leadership feel that this incident was?

Here’s a proposed approach to try and capture this. After an incident, go to different people in leadership, and ask them the following question:

Imagine I could wave a magic wand and alter history to undo the incident: it would be as if the incident never happened. However, I’ll only do this if you pay me money out of your org’s budget. How much are you willing to pay me to wave the wand?

I think this would be an interesting way to convey to the organization how leadership perceived the incident.