Easy will always trump simple

One of the early criticisms of Darwin’s theory of evolution by natural selection was about how it could account for the development of complex biological structures. It’s often not obvious to us how the earlier forms of some biological organ would have increase fitness. “What use”, asked the 19th century English biologist St. George Jackson Mivart, “is half a wing?”

One possible answer is that while half a wing might not be useful for flying, it may have had a different function, and evolution eventually repurposed that half-wing for flight. This concept, that evolution can take some existing trait in an organism that serves a function and repurpose it to serve a different function, is called exaptation.

Biology seems to be quite good at using the resources that it has at hand in order to solve problems. Not too long ago, I wrote a review of the book How Life Works: A User’s Guide to the New Biology by the British science writer Philip Ball. One of the main themes of the book is how biologists’ view of genes has shifted over time from the idea DNA-as-blueprint to DNA-as-toolbox. Biological organisms are able to deal effectively with a wide range of challenges by having access to a broad set of tools, which they can deploy as needed based on their circumstances.

We’ll come back to the biology, but for a moment, let’s talk about software design. Back in 2011, Rich Hickey gave a talk at the (sadly defunct) Strange Loop conference with the title Simple Made Easy (transcript, video). In this talk, Hickey drew a distinction between the concepts of simple and easy. Simple is the opposite of complex, where easy is something that’s familiar to us: the term he used to describe the concept of easy that I really liked was at hand. Hickey argues that when we do things that are easy, we can initially move quickly, because we are doing things that we know how to do. However, because easy doesn’t necessarily imply simple, we can end up with unnecessarily complex solutions, which will slow us down in the long run. Hickey instead advocates for building simple systems. According to Hickey, simple and easy aren’t inherently in conflict, but are instead orthogonal. Simple is an absolute concept, and easy is relative to what the software designer already knows.

I enjoy all of Rich Hickey’s talks, and this one is no exception. He’s a fantastic speaker, and I encourage you to listen to it (there are some fun digs at agile and TDD in this one). And I agree with the theme of his talk. But I also think that, no matter how many people listen to this talk and agree with it, easy will always win out over simple. One reason is the ever-present monster that we call production pressure: we’re always under pressure to deliver our work within a certain timeframe, and easier solutions are, by definition, going to be ones that are faster to implement. That means the incentives on software developers tilts the scales heavily towards the easy side. Even more generally, though, easy is just too effective a strategy for solving problems. The late MIT mathematics professor Gian-Carlo Rota noted that every mathematician has only a few tricks, and that includes famous mathematicians like Paul Erdős and David Hilbert.

Let’s look at two specific examples of the application of easy from the software world, specifically, database systems. The first example is about knowledge that is at-hand. Richard Hipp implemented the SQLite v1 as a compiler that would translate SQL into byte code, because he had previous experience building compilers but not building database engines. The second example is about an exaptation, leveraging an implementation that was at-hand. Postgres’s support for multi-version concurrency control (MVCC) relies upon an implementation that was originally designed for other features, such as time-travel queries. (Multi-version support was there from the beginning, but MVCC was only added in version 6.5).

Now, the fact that we rely frequently on easy solutions doesn’t necessarily mean that they are good solutions. After all, the Postgres source I originally linked to has the title The Part of PostgreSQL We Hate the Most. Hickey is right that easy solutions may be fast now, but they will ultimately slow us down, as the complexity accretes in our system over time. Heck, one of the first journal papers that I published was a survey paper on this very topic of software getting more difficult to maintain over time. Any software developer that has worked at a company other than a startup has felt the pain of working with a codebase that is weighed down by what Hickey refers to in his talk as incidental complexity. It’s one of the reasons why startups can move faster than more mature organizations.

But, while companies are slowed down by this complexity, it doesn’t stop them entirely. What Hickey refers to in his talk as complected systems, the resilience engineering researcher David Woods refers to as tangled. In the resilience engineering view, Woods’s tangled, layered networks inevitably arise in complex systems.

Hickey points out that humans can only keep a small number of entities in their head at once, which puts a hard limit on our ability to reason about our systems. But the genuinely surprising thing about complex systems, including the ones that humans build, is that individuals don’t have to understand the system for them to work! It turns out that it’s enough for individuals to only understand parts of the system. Even without anyone having a complete understanding of the whole system, we humans can keep the system up and running, and even extend its functionality over time.

Now, there are scenarios when we do need to bring to bear an understanding of the system that is greater than any one person possesses. My own favorite example is when there’s an incident that involves an interaction between components, where no one person understands all of the components involved. But here’s another thing that human beings can do: we can work together to perform cognitive tasks that none of us could do on their own, and one such task is remediating an incident. This is an example of the power of diversity, as different people have different partial understandings of the system, and we need to bring those together.

To circle back to biology: evolution is terrible at designing simple systems: I think biological systems are the most complex systems that we humans have encountered. And yet, they work astonishingly well. Now, I don’t think that we should design software the way that evolution designs organisms. Like Hickey, I’m a fan of striving for simplicity in design. But I believe that complex systems, whether you call them complected or tangled, are inevitable, they’re just baked in to the fabric of the adaptive universe. I also believe that easy is such a powerful heuristic that it is also baked in to how we build and involved systems. That being said, we should be inspired, by both biology and Hickey, to have useful tools at-hand. We’re going to need them.

Why LFI is a tough sell

There are two approaches to doing post-incident analysis:

the (traditional) root cause analysis (RCA) perspective
the (more recent) learning from incidents (LFI) perspective

In the RCA perspective, the occurrence of an incident has demonstrated that there is a vulnerability that caused the incident to happen, and the goal of the analysis is to identify and eliminate the vulnerability.

In the LFI perspective, an incident presents the organization with an opportunity to learn about the system. The goal is to learn as much as possible with the time that the organization is willing to devote to post-incident work.

The RCA approach has the advantage of being intuitively appealing. The LFI approach, by contrast, has three strikes against it:

LFI requires more time and effort than RCA
LFI requires more skill than RCA
It’s not obvious what advantages LFI provides over RCA.

I think the value of LFI approach is based on assumptions that people don’t really think about because these assumptions are not articulated explicitly.

In this post, I’m going to highlight two of them.

Nobody knows how the system really works

The LFI approach makes the following assumption: No individual in the organization will ever have an accurate mental model about how the entire system works. To put it simply:

It’s the stuff we don’t know that bites us
There’s always stuff we don’t know

By “system” here, I mean the socio-technical system, which includes both the software and what it does, and the humans who do the work to develop and operate the system.

You’ll see the topic of incorrect mental models discussed in the safety literature in various ways. For example, David Woods uses the term miscalibration to describe incorrect mental models, and Diane Vaughan writes about structural secrecy, which is a mechanism that leads to incorrect mental models.

But incorrect mental models are not something we talk much about explicitly in the software world. The RCA approach implicitly assumes there’s only a single thing that we didn’t know: the root cause of the incident. Once we find that, we’re done.

To believe that the LFI approach is worth doing, you need to believe that there is a whole bunch of things about the system that people don’t know, not just a single vulnerability. And there are some things that, say, Alice knows that Bob doesn’t, and that Alice doesn’t know that Bob doesn’t know.

Better system understanding leads to better decision making in the future

The payoff for RCA is clear: the elimination of a known vulnerability. But the payoff for LFI is a lot fuzzier: if the people in the organization know more about the system, they are going to make better decisions in the future.

The problem with articulating the value is that we don’t know when these future decisions will be made. For example, the decision might happen when responding to the next incident (e.g., now I know how to use that observability tool because I learned from how someone else used it effectively in the last incident). Or the decision might happen during the design phase of a future software project (e.g., I know to shard my services by request type because I’ve seen what can go wrong when “light” and “heavy” requests are serviced by same cluster) or during the coding phase (e.g., I know to explicitly set a reasonable timeout because Java’s default timeout is way too high).

The LFI approach assumes that understanding the system better will advance the expertise of the engineers in the organization, and that better expertise means better decision making.

On the one hand, organizations recognize that expertise leads to better decision making: it’s why they are willing to hire senior engineers even though junior engineers are cheaper. On the other hand, hiring seems to be the only context where this is explicitly recognized. “This activity will advance the expertise of our staff, and hence will lead to better future outcomes, so it’s worth investing in” is the kind of mentality that is required to justify work like the LFI approach.

There is no “Three Mile Island” event coming for software

In Critical Digital Services: An Under-Studied Safety-Critical Domain, John Allspaw asks:

Critical digital services has yet to experience its “Three-Mile Island” event. Is
such an accident necessary for the domain to take human performance seriously? Or can it translate what other domains have learned and make productive use of
those lessons to inform how work is done and risk is anticipated for the future?

I don’t think the software world will ever experience such an event.

The effect of TMI

The Three Mile Island accident (TMI) is notable, not because of the immediate impact on human lives, but because of the profound effect it had on the field of safety science.

Before TMI, the prevailing theories of accidents was that they were because of issues like mechanical failures (e.g., bridge collapse, boiler explosion), unsafe operator practices, and mixing up physical controls (e.g., switch that lowers the landing gear looks similar to switch that lowers the flaps).

But TMI was different. It’s not that the operators were doing the wrong things, but rather that they did the right things based on their understanding of what was happening, but their understanding of what was happening, which was based on the information that they were getting from their instruments, didn’t match reality. As a result, the actions that they took contributed to the incident, even though they did what they were supposed to do. (For more on this, I recommend watching Richard Cook’s excellent lecture: It all started at TMI, 1979).

TMI led to a kind of Cambrian explosion of research into human error and its role in accidents. This is the beginning of the era where you see work from researchers such as Charles Perrow, Jens Rasmussen, James Reason, Don Norman, David Woods, and Erik Hollnagel.

Why there won’t be a software TMI

TMI was significant because it was an event that could not be explained using existing theories. I don’t think any such event will happen in a software system, because I think that every complex software system failure can be “explained”, even if the resulting explanation is lousy. No matter what the software failure looks like, someone will always be able to identify a “root cause”, and propose a solution (more automation, better procedures). I don’t think a complex software failure is capable of creating TMI style cognitive dissonance in our industry: we’re, unfortunately, too good at explaining away failures without making any changes to our priors.

We’ll continue to have Therac-25s, Knight Capitals, Air France 447s, 737 Maxs, 911 outages, Rogers outages, and Tesla autopilot deaths. Some of them will cause enormous loss of human life, and will result in legislative responses. But no such accident will compel the software industry to, as Allspaw puts it, take human performance seriously.

Our only hope is that the software industry eventually learns the lessons that the safety science learned from the original TMI.

Making operational work more visible

I wrote an article for the GitHub ReadME Project.

The ambiguity of real work

All ambiguity is resolved by actions of practitioners at the sharp end of the system.
Dr. Richard I. Cook, How Complex Systems Fail

There’s a wonderful book by the late urban planning professor Donald Schön called The Reflective Practitioner: How Professionals Think in Action. In the first chapter, he discusses the “rigor or relevance” dilemma that faces educators in professional degree programs. In the case of a university program aimed at preparing students for a career in software development, this is the “should we teach topological sort or React?” question.

Schön argues that the dilemma itself is a fundamental misunderstanding of the nature of professional work. What it misses is the ambiguity and uncertainty inherent in the work of professional life. The “rigor vs relevance” debate is an argument over the best way to get from the problem to the solution: do you teach the students first principles, or do you teach them how to use the current set of tools? Schön observes that a more significant challenge for professionals is defining the problems to solve in the first place, since an ill-defined problem admits no technical solution at all.

In the varied topography of professional practice, there is a high, hard ground where practitioners can make effective use of research-based theory and technique, and there is a swampy lowland where situations are confusing “messes” incapable of technical solution. The difficulty is that the problems of the high ground, however great their technical interest, are often relatively unimportant to clients or to the larger society, while in the swamp are the problems of greatest human concern.

His use of the term “messes” evokes Russell Ackoff’s use of the term in his paper The Future of Operational Research is Past:

Managers are not confronted with problems that are independent of each other, but with dynamic situations that consist of complex systems of changing problems that interact with each other. I call such situations messes. Problems are abstractions extracted from messes by analysis; they are to messes as atoms are to tables and chairs. We experience messes, tables, and chairs; not problems and atoms

To take another example from the software domain. Imagine that you’re doing quarterly planning, and there’s a collection of reliability work that you’d like to do, and you’re trying to figure out how to prioritize it. You could apply a rigorous approach, where you quantify some values in order to do the prioritization work, and so you try to estimate information like:

the probability of hitting a problem if the work isn’t done
the cost to the organization if the problem is encountered
the amount of effort involved in doing the reliability work

But you’re soon going to discover the enormous uncertainty involved in trying to put a number on any of those things. And, in fact, doing any reliability work can actually introduce new failure modes.

Over and over, I’ve seen the theme of ambiguity and uncertainty appear in ethnographic research that looks at professional work in action. In Designing Engineers, the aerospace engineering professor Louis Bucciarelli did an ethnographic study of engineers in a design firm, and discovered that the engineers all had partial understanding of the problem and solution space, and that their understandings also overlapped only partially. As a consequence, a lot of the engineering work that was done actually involved engineers resolving their incomplete understanding through various forms of communication, often informal. Remarkably, the engineers were not themselves aware of this process of negotiating understandings of the problems and solutions.

The famous Common Ground and Coordination in Joint Activity paper by Gary Klein, Paul Feltovich, and David Woods, makes explicit the role that ambiguity plays in human coordination and communication.

You’ll sometimes hear researchers who study work talk about the process of sensemaking. For example, there’s a paper by Sana Albolino, Richard Cook, and Micahel O’Connor called Sensemaking, safety, and cooperative work in the intensive care unit that describes this type of work in an intensive care unit. I think of sensemaking as an activity that professionals perform to try to resolve ambiguity and uncertainty.

(Ambiguity isn’t always bad. In the book On Line and On Paper, the sociologist Kathryn Henderson describes how engineers use engineering drawings as boundary objects. These are artifacts are that are understood differently by the different stakeholders: two engineers looking at the same drawing will have different mental models of the artifact based on their own domain expertise(!). However, there is also overlap in their mental models, and it is this combination of overlap and the fact that individuals can use the same artifact for different purposes that makes it useful. Here the ambiguity has actual value! In fact, her research shows that computer models, which eliminate the ambiguity, were less useful for this sort of work).

As practitioners, we have no choice: we always have to deal with ambiguity. As noted by Richard Cook in the quote that opens this blog post, we are the ones, at the sharp end, that are forced to resolve it.

Trouble during startup

I asked this question on twitter today:

Have you ever encountered an incident where one of the contributing factors was service behavior on startup?
— @norootcause@hachyderm.io on mastodon (@norootcause) July 11, 2021

I received a lot of great responses.

My question was motivated by this lecture by Dr. Richard Cook about an explosion at a BP petroleum processing plant in Texas in 2005 that killed fifteen plant workers. Here’s a brief excerpt, starting at 16:31 of the video:

Let me make one other observation about this that I think is important, which is that this occurred during startup. That is, once these processes get going, they work in a way that’s different than starting them up. So starting up the process requires a different set of activities than running it continuously. Once you have it running continuously, you can be pouring stuff in one end and getting it out the other, and everything runs smoothly in-between. But startup doesn’t, it doesn’t have things in it, so you have to prime all the pumps by doing a different set of operations.

This is yet another reminder of how similar we are to other fields that control processes.

Designing like a joint cognitive system

There’s a famous paper by Gary Klein, Paul Feltovich, and David Woods, called Common Ground and Coordination in Joint Activity. Written in 2004, this paper discusses the challenges a group of people face when trying to achieve a common goal. The authors introduce the concept of common ground, which must be established and maintained by all of the participants in order for them to reach the goal together.

I’ve blogged previously about the concept of common ground, and the associated idea of the basic compact. (You can also watch John Allspaw discuss the paper at Papers We Love). Common ground is typically discussed in the context of high-tempo activities. The most popular example in our field is an ad hoc team of engineers responding to an incident.

The book Designing Engineers was originally published in 1994, ten years before the Common Ground paper, and so Louis Bucciarelli never uses the phrase. And yet, the book calls forward to the ideas of common ground, and applies them to the lower-tempo work of engineering design. Engineering design, Bucciarelli claims, is a social process. While some design work is solitary, much of it takes place in social interactions, from formal meetings to informal hallway conversations.

But Bucciarelli does more than discover the ideas of common ground, he extends them. Klein et al. talk about the importance of an agreed upon set of rules, and the need to establish interpredictability: for participants to communicate to each other what they’re going to do next. Bucciarelli talks about how engineering design work involves actually developing the rules, making constraints concrete that were initially uncertain. Instead of interpredictability, Bucciarelli talks about how engineers argue for specific interpretations of requirements based on their own interests. Put simply, where Klein et al., talk about establishing, sustaining, and repairing common ground, Bucciarelli talks about constructing, interpreting, and negotiating the design.

Bucciarelli’s book is fascinating because he reveals how messy and uncertain engineering work is, and how concepts that we may think of as fixed and explicit are actually plastic and ambiguous.

For example, we think of building codes as being precise, but when applied to new situations, they are ambiguous, and the engineers must make a judgment about how to apply them. Bucciarelli tells an anecdote about the design of an an array of solar cells to mount on a roof. The building codes put limits on how much weight a roof can support, but the code only discusses distributed loads, and one of the proposed designs is based on four legs, which would be a concentrated load. An engineer and an architect negotiate on the type of design for the mounting: the engineer favors a solution that’s easier for the engineering company, but more work for the architect. The architect favors a solution that is more work and expense for the engineering company. The two must negotiate to reach an agreement on the design, and the relevant building code must be interpreted in this context.

Bucciarelli also observes that the performance requirements given to engineers are much less precise than you would expect, and so the engineers must construct more precise requirements as part of the design work. He gives the example of a company designing a cargo x-ray system for detecting contraband. The requirement is that it should be able to detect “ten pounds of explosive”. As the engineers prepare to test their prototype, a discussion ensues: what is an explosive? Is it a device with wires? A bag of plastic? The engineers must define what an explosive means, and that definition becomes a performance requirement.

Even technical terms that sound well-defined are ambiguous, and may be interpreted differently by different members of the engineering design team. The author witnesses a discussion of “module voltage” for a solar power generator. But the term can refer to open circuit voltage, maximum power voltage, operating voltage, or nominal voltage. It is only through social interactions that this ambiguity is resolved.

What Bucciarelli also notices in his study of engineers is that they do not themselves recognize the messy, social nature of design: they don’t see the work that they do establishing common ground as the design work. I mentioned this in a previous blog post. And that’s really a shame. Because if we don’t recognize these social interactions as design work, we won’t invest in making them better. To borrow a phrase from cognitive systems engineering, we should treat design work as work that’s done by a joint cognitive system.

You have to go there

A major claim of this study is that this process [of engineering design] transcends rational, instrumental process. But you have to go there and watch the process to capture this essential feature of designing. It is not recoverable from the artifact, nor is it there in the documentation and textual remnants of the design process. These, in their object-world voice, speak only of the empty shell of designing. Oral histories are a potentially sufficient source; these will also take the form of object-world representations of process. Multiple oral histories, if my thesis is correct, ought to be conflicting, reflecting the different interests and responsibilities of participants. The synthesis of a coherent story about design process from these becomes then a process of negotiation.
Louis Bucciarelli, Designing Engineers

Grappling with contingency

For want of a nail the shoe was lost.
For want of a shoe the horse was lost.
For want of a horse the rider was lost.
For want of a rider the message was lost.
For want of a message the battle was lost.
For want of a battle the kingdom was lost.
And all for the want of a horseshoe nail.

Contingency is the idea that history could easily have turned out completely different, if only certain minor events had happened differently. If there was a zig somewhere instead of a zag, maybe the election would have gone the other way, or the outcome of the revolution would have be different.

In terms of the work we do, contingency means that the success of projects, or the length of an incident, may vary dramatically based on happenstance. Maybe someone happened to be out sick one day and missed a critical meeting, and so didn’t have a certain important bit of information or wasn’t able to give feedback on a design. Maybe someone on the team happened to have prior experience with just the sort of problem that they are all grappling with.

When we look back on successes and failures, they feel inevitable somehow, like there were an inexorable set of forces pushing in the direction that led to the success or failure. You can see that in incident retrospectives in particular, as people search for the cause, the essential reason this happened.

We’re uncomfortable with contingency, preferring essentialism. That’s why so often commit the fundamental attribution error: I snapped at you because I missed lunch, which put me in a grouchy mood; you snapped at me because you’re a hot-tempered jerk.

So, while we do have influence over outcomes, much depends on, well, chance. The difference between success and failure might hinge on the occurrence of a random hallway conversation where we pick up an extra bit of context, or whether our kid has a fever on a particular day and we need to take them to see the doctor.

The disaster meeting

This post is mostly an excerpt from the book Designing Engineers by Louis Bucciarelli. This book describes Bucciarelli’s observational study of engineers doing design work at three different engineering companies.

At some point, I’ll write a proper review of the book, but I wanted to highlight a specific passage, a meeting among engineers working to solve a specific problem.

The engineers attending this meeting work at a company that sells photograph processing machines. The company is planning on releasing a new product (“Atlas”) in a few months, but there’s a problem with the design: a phenomenon that they call dropout. Dropout happens when parts of the image that are barely visible end up not getting printed on the paper. The problem can be hard to notice unless someone looks very closely at the photo, but it’s enough of an issue that they are putting in resources to solving it.

This meeting is being led by Sergio, the engineer leading the effort to solve the dropout problem. Before this meeting, he identified fourteen potential solutions to the dropout problems. He’s called this engineering meeting in order to apply a structured decision-making process (the Pugh Method) to help him narrow down this list to the most promising-sounding solutions.

The meeting does not go as the organizer hoped. The transcript is long-ish, but worth reading in full. You might even find it familiar.

Sergio: OK. Let’s start. You all got this. [He holds up a description of Pugh methodology]. I sent it around last Thursday. It pretty much says what we’re going to try to do, except I’m going to make a few changes. You’ll see as we go along. The basic idea today is that we want to first set up some criteria to judge. Then we compare how the fourteen go, compare them against these criteria. By the end of this morning I’d like to have narrowed things down, not to one option, but to three, say, something we can get going on. Yeah, Harold.

Harold: It says in this method that we ought to pick a baseline option to compare against. How are we going to do that? It seems to me any one of the fourteen would be as good or bad, for that matter, as any of the others.

Sergio: I thought about that, and here is what I propose. Let’s pick the option we know best, OK? Say the QWP. We know how that works, and other than that it probably won’t fit in the space we have to play with, it still can be our reference. But first we have to set up some criteria. So, let me get this chart around here.

Hans: Obviously we need a criterion, something like “Gets the job done” or “Eliminates dropout.”

Sergio: Yeah, that’s got to be one. The thing has got to work, to solve the problem. How did you state it?

Marco: What do we mean when we go and claim that, say, the QWEP eliminates the dropout? I mean, all of those up there have a chance of doing the job.

Sergio: I know. But we score, not with numbers but say three, four marks—better than the baseline, say the QWP. This is where the baseline comes in. Second would be neutral—no better, no worse than the QWP—and third would be negative; that is, we think it won’t be as good as what we know works now.

Marco: Yeah, but some of these options I think might work as good, even better on some papers but probably won’t work at all on others. How do you grade it then?

Sergio: What do you mean? Give me a more specific example.

Marco: I mean like with the air knife. It might work with Z-weight paper, but with the heavier M-weight I don’t think it will work.

Hans: Why not make that another criterion: “Works with all papers.“

Sergio: Or “Sensitivity to paper.” Sort of pull that out from under “Does the job.”

Marco: You mean that there are some options that will do the job, but some of those won’t be able to handle the heavy paper?

Sergio: Yeah, that’s one way to look at it. “Does the job” is our best guess that the thing will work, but we give paper type a separate category. We may want to say something else has to be done to handle the heavy paper; that becomes another problem.

Fritz: How do we know whether paper type is critical for the air knife? It seems to me we don’t really know what the problem is. How can we compare options when we don’t know what is causing the problem?

Marco: Fritz, that’s a good point. Do we really know enough to—

Sergio: We know we have dropout on Atlas. We know that the QWP gives good results. We have a pretty good idea of what consistency it takes to give good print—print that a trained eye can’t find a hole in. (With a magnifying glass, you still see some.)

Fritz: Yes, but we can know, and should know, a lot more before we go judging these proposals on whether or not they will solve the problem. If this place hadn’t cut back on its chemistry research, we might have a chance of knowing what the hell is going on, not just with Atlas but we had it on Mars as well.

Sergio: Look, some things are beyond our control. We have no power over the powers-that-be. We don’t have a chemistry group working on this problem to call up and say “Get over here and help us evaluate these options.” We’ve got to go with what we have. Atlas is due to go out onto the streets in seven months.

Fritz: That’s the way it always goes around her. Someone wants your solutions yesterday.

Sergio: OK. So we have “Eliminates dropout” and “Sensitivity to paper.” What are some others?

Hans: Cost.

Marco: Have you guys thought about some kind of chemical pretreatment… different papers?

Sergio: Cost. Let’s think about that. Is cost really that important? Leonard says he doesn’t see cost as really significant unless it really is some huge sum. But I don’t see how we will ever get to that point. And Atlas—

Harold: Yeah, I don’t see how unit cost can be that great. We’re not going to be able to fool around much inside Atlas at this late date.

Marco: We ought to think about what we can do without going inside.

Hans: On the other hand, if we do convince them that they have to move the paper feed, say, it is going to get costly.

Harold: In terms of engineering change but not in terms of unit costs. We still aren’t going to go in there with some exotic machinery. All those options, except maybe the E&M device, are just bending metal, cams, gears… mechanical stuff, nothing fancy.

George: We might have a problem holding tolerances. Machining can get expensive. We ask too much of my people, even with the mechanical parts.

Sergio: Maybe we make that another category, another criterion: “Engineering change,” “Extent of engineering change.”

Harold: What you really want to say is something like “Compatible with existing product.” Like the QWP we know will work fine. It does in Mars, but we know it will be extremely hard to fit in Atlas, so… Or the E&M that’s going to require a power supply, right?

Fritz: But the QWP is our reference. That’s not a good example. And that’s not a good example. And, for that matter, what good is the criterion if we know the QWP won’t fit? If that’s the case, won’t all the options be scored a plus, all the same?

Sergio: Good point, good point. But I see some that will be just as hard to retrofit—for example, the cam with a solenoid. Solenoids aren’t any miniature electronic device. They’ve got to have room, especially with the forces and reaction times we’re going to be demanding.

Hans: And the air knife requires a plenum, or the E&M—Marco, was it you who said they will need a power supply?

Sergio: Fritz, you have a good point, but let’s put it up there for now. There won’t be maybe any negatives there, but still… OK? How did you say it?

Harold: “Compatible with existing product” or maybe we ought to say “products,” with Leonard in mind.

Sergio: Yeah, got it.

Fritz: That brings up another thing. Who are we making this design for? Leonard out in Colorado and Atlas are not in sync. Atlas is well along, they’re getting into the panic mode now. But Leonard has more time, another year at least, right?

Sergio: I spoke to Leonard yesterday, and even though he has another year past Atlas, he wants to se a solution to what he thinks is his dropout problem well before that. He doesn’t want to go the panic route.

Fritz: But we still have more time with him. And shouldn’t we be thinking about the long term?

Sergio: We can’t afford to do too much of that. I’ve got the higher-ups breathing down my neck to get something going here. That makes me think fo another criterion: How well can we meet a schedule? Let’s say “Ease of schedule.”

George: How about “Pain and suffering”? [Laughter]

Sergio: No, we want to be positive about this.

Marco: Yeah, so we can mark them down. [Laughter]

Fritz: That’s why we chose the QWP as a baseline. He knows that can’t possibly fit here.

Sergio: Come on guys. That’s not true. Let’s get serious. We want to get out of here by lunchtime. Jeez, is it already 10:30?

Hans: I’ve got 10:40.

Sergio: OK. So far we’ve got—

Harold: I think we’re missing a big one. You all know how difficult it is to keep the QWP clean. Anything mechanical you add in there is going to collect sludge. Some of those, like the cam, are going to have. areal problem there with that—keeping clean.

Sergio: Good. That’s another good one. The guys in Service are not going to like it if they get called out every week.

Marco: Does that figure into the cost, the cost of servicing? Do we need a separate category?

Sergio: I think we ought to break that one out, just like we did with the paper. That’s something we are liable not to think of—what it takes to maintain the fix in the field. So let’s add—

Fritz: We don’t even know if it will work.

Sergio: We got some interesting results yesterday with a mock-up. I think it looks promising.

Fritz: But still, it’s got a long way to go. That’s what I mean. We don’t really know. if it will work, and I, at least, can’t make a good judgment even though you may be able to, because I don’t think we understand enough about the problem!

Marco: I’m with Fritz on that. I don’t think. we have enough information about these different options. I’m finding it hard to do. this method, and I think the reason is because we don’t really understand the problem.

Sergio: How much do we need to know? I admit that the E&M is a long shot, that we’ve got to get it going, that it will take a longer time to evaluate than, say, the cam concepts, and we’ve been promised a machine for next week. When we get the hardware, we can do both, evaluate the E&Ms and, in the process, get a firmer grip on what is the problem. But we don’t have all year. Jeez, it’s 11:00. We don’t have all morning either. And besides, this is just an exercise; we are not going to pick a definite option. and go with that. We only want to narrow the field some this morning. Then we give it a hard look again, after we’ve done some work on the three, come back at it and evaluate again. In fact, I can see us running pretty far with, say, two or three options in parallel, as long as they don’t interfere. Maybe that’s another thing to consider.

Hans: Seeing what time it is, maybe we better cut off our criteria here. Serge, I think we better get to ranking.

Sergio: OK, OK. So far we’ve got ‘Does the job,” “Sensitive to paper,” “Cost,” “Compatible with existing hardware,” “Ease of schedule,” “Ease of maintenance.” Anyone think of any more?

George: How about “Ease of production?”

Marco: That’s in cost. I see that as a main factor in cost.

Fritz: Look, I think we have a problem with these criteria. I’m having a hell of a time keeping them straight, trying to fix what they might mean. Are they all to be considered as having the same priority? I still think this exercise is not useful unless we know more about what we have to do, what the problem is.

Marco: I think even then these criteria would get all mixed up. When we say “Do the job” I see costs, sludge all in that, too.

Sergio: We are always going to have that problem. Where we are now, we’ve got to move. All I want is to get us narrowed down.

Fritz: But you yourself think PT’s additional option is worth keeping. I don’t think we’re ready.

Sergio: It’s getting late. We’re not going. to get there today. That’s clear. I’ll tell you what. Can we meet again? [Grunts, groans]

Sergio: No, I promise you. In the meantime, Hans and I will go back and sort out these criteria, try to explain what we see as what they are meant to measure. At least in that way we will start on the same wavelength. I will send you that before we get together. Then we will narrow.

Marco: When? I’ve got to go out to Colorado next week for two days. Can you take that into account?

Fritz: And I’m tied up in the lab the early part of the week.

George: We’ve got a production trial scheduled sometime.

Sergio: Look, I’ll have Cheryl survey, but it might have to go another week. I’ve got to get out and back to Colorado myself sometime next week. OK? Is that it? That’s enough!

(pp. 152–156)

The Pugh technique is an appealing model in principle, but we see problems crop up as Sergio tries to apply it: the engineers work to define criteria, but the categories are slippy. They have different opinions about how to cut up the space into categories, and whether they have enough information to even evaluate these criteria.

Note how well defined the problem seems to be on first glance. It’s a specific problem (dropout) on a system that otherwise has been fully designed. Not only that, but potential solutions have already been identified! Sergio’s goal is just to narrow down the solution space so that they can explore three options instead of fourteen.

Instead of a structured process, we see a much messier interaction, one that ultimately frustrates Sergio, who used the phrase “the disaster meeting” to describe what happened. What we observe, though, is a kind of progress: a group of engineers who have different understandings of the situations trying to establish common ground, building a shared understanding so that they can work together to accomplish this task. Real engineering work is messy.