Code rewrites and joint cognitive systems

Way back in the year 2000, Joel Spolsky famously criticized the idea of doing a code rewrite.

The idea that new code is better than old is patently absurd. Old code has been used. It has been tested. Lots of bugs have been found, and they’ve been fixed. There’s nothing wrong with it.

Joel Spolsky, Things You Should Never Do, Part I

I think Spolsky is wrong here. His error comes from considering the software in isolation. The problem here isn’t the old code, it’s the interaction between the old code and the humans who are responsible for maintaining the software. If you draw the boundary around those people and the software together, you get what the cognitive systems engineering community calls a joint cognitive system.

One of the properties of joint cognitive systems is that the system has knowledge about itself. Being responsible for maintaining a legacy codebase is difficult because the joint cognitive system is missing important knowledge about itself.

Here’s Spolsky again:

When you throw away code and start from scratch, you are throwing away all that knowledge. 

But that knowledge is already gone! The people who wrote the code have left, and the current maintainers don’t know what their intent was. The joint cognitive system, the combination of code and the current maintainers, doesn’t know why it’s implemented the way it is.

Spolsky gestures at this, but doesn’t grasp its implications:

The reason that they think the old code is a mess is because of a cardinal, fundamental law of programming: It’s harder to read code than to write it.

Spolsky is missing the importance of a system’s ability to understand itself. Ironically, the computer scientist Peter Naur was writing about this phenomenon fifteen years earlier. In an essay titled Programming as Theory Building, he described the importance of having an accurate mental model or theory of the software, and the negative consequences of software being modified by maintainers with poor mental models.

It isn’t just about the software. It’s about the people and the software together.

Taking the hit

Here’s a scenario I frequently encounter: I’m working on writing up an incident or an OOPS. I’ve already interviewed key experts on the system, and based on those interviews, I understand the implementation details well enough to explain the failure mode in writing.

But, when I go to write down the explanation, I discover that I don’t actually have a good understanding of all of the relevant details. I could go back and ask clarifying questions, but I worry that I’ll have to do this multiple times, and I want to avoid taking up too much of other people’s time.

I’m now faced with a choice when describing the failure mode. I can either:

(a) Be intentionally vague about the parts that I don’t understand well.

(b) Make my best guess about the implementation details for the parts I’m not sure about.

Whenever I go with option (b), I always get some of the details incorrect. This becomes painfully clear to me when I show a draft to the key experts, and they tell me straight-out, “Lorin, this section is wrong.”

I call choosing option (b) taking the hit because, well, I hate the feeling of being wrong about something. However, I always try to go with this approach because this maximizes both my own learning and (hopefully) the learning of the readers. I take the hit. When you know that your work will be reviewed by an expert, it’s better to be clear and wrong than vague.

Bitrot

Engineering deals in lifetimes, both human and otherwise. If not fatigue or fracture, then corrosion or erosion; if not war or vandalism, then taste or fashion claim not only the body but the very souls of once-new machines…

The lifetime of a structure is no mere anthropomorphic metaphor, for how long a piece of engineering must last can be one of the most important considerations for its design.

Henry Petroski, To Engineer is Human: The Role of Failure in Successful Design

Unfathomed misunderstanding is further revealed by the term “software maintenance”, as a result of which many people continue to believe that programs —and even programming languages themselves— are subject to wear and tear. Your car needs maintenance too, doesn’t it? Famous is the story of the oil company that believed that its PASCAL programs did not last as long as its FORTRAN programs “because PASCAL was not maintained”.

Edsger W. Dijkstra, On the cruelty of really teaching computing science

Before Borland’s new spreadsheet for Windows shipped, Philippe Kahn, the colorful founder of Borland, was quoted a lot in the press bragging about how Quattro Pro would be much better than Microsoft Excel, because it was written from scratch. All new source code! As if source code rusted.

The idea that new code is better than old is patently absurd. Old code has been used. It has been tested. Lots of bugs have been found, and they’ve been fixed. There’s nothing wrong with it. It doesn’t acquire bugs just by sitting around on your hard drive. Au contraire, baby! Is software supposed to be like an old Dodge Dart, that rusts just sitting in the garage? Is software like a teddy bear that’s kind of gross if it’s not made out of all new material?

Joel Spolsky, Things You Should Never Do, Part I

In the two quotes above, Dijkstra and Spolsky ridicule the notion that software systems wear out. Unlike physical systems, software doesn’t suffer from fatigue due to prolonged usage.

And, yet, anyone who has uttered the phrase “legacy system” in the presence of a software engineer and watched the change of expression on their face knows that engineers find older code more difficult to deal with than newer code. The motivation of Dijkstra’s and Spolsky’s writings above is to express contempt for this point of view.

What Dijkstra and Spolsky are missing is that the world changes around software. Software doesn’t exist in a vacuum: it’s part of an ecosystem. Legacy systems have legacy dependencies, and run in legacy environments. Those dependencies and environments are not static: they change over time, and sometimes the old ones go away, or become too expensive or risky to keep using.

Software is indeed different from physical artifacts, in that software artifacts (source code, binaries) don’t change with use. But in the world of software, that’s exactly the problem. The world keeps changing, and the software doesn’t, unless you put the work into it. And, unlike civil engineers, we aren’t yet good at thinking about the intended lifetime of a software system when we’re designing it.

Aristotle’s revenge

Imagine you’re walking around a university campus. It’s a couple of weeks after the spring semester has ended, and so there aren’t many people about. You enter a building and walk into one of the rooms. It appears to be some kind of undergraduate lab, most likely either physics or engineering.

In the lab, you come across a table. On the table, someone has balanced a rectangular block on its smallest end.

You nudge the top of the block. As expected, it falls over with a muted plonk. You look around to see if you might have gotten in trouble, but nobody’s around.

You come across another table. This table has some sort of track on it. The table also has a block on it that’s almost identical to the one on the other table, except that the block has a pin in it that connects it to some sort of box that is mounted on the track.

Not being able to resist, you nudge the top of this block, and it starts to fall. Then, the little box on the track whirs to life, moving along the track in the same direction that you nudged the block in. Because of the motion of the box, the block stays upright.


The ancient philosopher Aristotle believed that there were four distinct types of causes that explained why things happened. One of these is what Aristotle called the efficient cause: “Y behaved the way it did because X acted on Y”. For example, the red billiard ball moved because it was struck by the white ball. This is the most common way we think about causality today, and it’s sometimes referred to as linear causality.

Efficient cause does a good job of explaining the behavior of the first rectangular block in the anecdote: it fell over because we nudged it with our finger. But it doesn’t do a good job of explaining the observed behavior of the second rectangular block: we nudged it, and it started to fall, but it righted itself, and ended up balanced again.

Another type of cause Aristotle talked about was what he called the final cause. This is a teleological view of cause, which explains the behavior of objects in terms of their purpose or goal.

Final cause sounds like an archaic, unscientific view of the world. And, yet, reasoning with final cause enables us to explain the behavior of the second block more effectively than efficient cause does. That’s because the second block is being controlled by a negative-feedback system that was designed to keep the block balanced. The system will act to compensate for disturbances that could lead to the block falling over (like somebody walking over and nudging it). Because the output (the reading from a sensor that measures the angle of the block) is fed back into the input of the control system, the relationship between external disturbance and system behavior isn’t linear. This is sometimes referred to as circular causality, because of the circular nature of feedback loops.

The systems that we deal with contain goals and feedback loops, just like the inverted pendulum control system that keeps the block balanced. If you try to use linear causality to understand a closed-loop system, you will be baffled by the resulting behavior. Only when you understand the goals that the system is trying to achieve, and the feedback loops that it uses to adjust its behavior to reach those goals, will the resulting behavior make sense.
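To make the circular causality concrete, here’s a minimal simulation sketch of the balancing rig from the anecdote (the gains, dimensions, and time step are all invented for illustration): the controller reads the block’s angle and accelerates the cart in the direction of the lean, and that motion feeds back into the very angle it is reading.

```python
import math

# A sketch of the negative-feedback loop described above: a cart that
# accelerates in the direction the block is leaning, so the pivot moves
# back underneath it. Constants are illustrative, not from any real rig.

G = 9.81             # gravity, m/s^2
L = 0.5              # pivot-to-center-of-mass distance, m
KP, KD = 30.0, 5.0   # proportional and derivative gains (assumed)
DT = 0.001           # simulation time step, s

theta = 0.0          # block angle from vertical, radians
theta_dot = 0.5      # angular velocity right after the "nudge"

for step in range(3000):  # simulate 3 seconds
    # Feedback: the sensed angle determines the cart's acceleration,
    # which in turn changes the angle -- a loop, not a line.
    cart_accel = KP * theta + KD * theta_dot

    # Simplified inverted-pendulum dynamics: gravity tips the block over,
    # cart acceleration pushes the pivot back under it.
    theta_ddot = (G * math.sin(theta) - cart_accel * math.cos(theta)) / L

    theta_dot += theta_ddot * DT
    theta += theta_dot * DT

    if step % 500 == 0:
        print(f"t={step * DT:4.1f}s  angle={math.degrees(theta):6.2f} deg")
```

After the nudge, the angle decays back toward zero. To explain why, you have to talk about the goal (keep the block upright) and the feedback loop, not just the push.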

Henry Yin on what the cyberneticists got wrong

I’ve been on a bit of a control systems kick lately, and, serendipitously, I happened to see this tweet, which referenced a paper by Henry Yin at Duke University titled The crisis in neuroscience.

In the paper, Yin argues that neuroscience has failed to make progress in modeling human behavior because it tries to model the brain as a linear system, where you can study it by generating inputs and observing outputs.

Input/output model of brain

Yin proposes an alternative model: to understand behavior from a neurological perspective, you need to view the brain as a collection of hierarchical, closed-loop control systems.

Now, the cybernetics folks have long argued that you should model human brains as control systems. But Yin argues that the cyberneticists got an important thing wrong in their control models: their models were too close to engineering applications to be directly applicable to organisms.

Classical engineering model of a feedback control system

In an engineered control system, a human operator specifies the set point. For example, for a cruise control system, you’d set the desired speed. In the block diagram above, this set point is provided as the input to the system.

The output of the “Plant” block is the current state of the variable you’re trying to control (e.g., the current speed). The controller takes as input the difference between the set point and the current state, and uses that to determine how to drive the plant (e.g., the input to the motor).

Here’s a block diagram of everyone’s favorite control systems example, the thermostat:

A thermostat that controls temperature

I’ve used double arrows to indicate signals that propagate through the environment, and single arrows to indicate signals that propagate through wires. I’ve put a red box around what Yin claims the cyberneticists hold as their model for control in animals.

The variable under control is the temperature. A human sets the desired temperature, and a temperature sensor reads the current temperature. The controller takes as input the difference between the desired temperature and the current temperature, and uses that to determine whether or not to turn on the furnace.

The actual temperature in the house is determined both by the output of the furnace, and by other factors (e.g., temperature outside, how good the insulation is, whether someone has opened a door), which I’ve labeled disturbance.
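Here’s a toy version of that loop in code. It’s a sketch with made-up constants rather than a model of any real furnace, but it shows the shape of the diagram: the set point comes from outside, the controller only ever sees the difference between the set point and the sensor reading, and the disturbance keeps pushing the temperature around.

```python
import random

# Toy thermostat loop: external set point, sensor reading, on/off
# controller with hysteresis, and a disturbance. All constants invented.

set_point = 20.0      # desired temperature (deg C), chosen by a human
temperature = 15.0    # current indoor temperature
furnace_on = False
HYSTERESIS = 0.5      # avoid rapid on/off cycling
FURNACE_GAIN = 1.0    # deg C added per time step while the furnace runs
LOSS_RATE = 0.05      # fraction of the indoor/outdoor gap lost per step
outside = 5.0         # part of the "disturbance"

for minute in range(120):
    error = set_point - temperature   # controller input: desired minus actual
    if error > HYSTERESIS:
        furnace_on = True
    elif error < -HYSTERESIS:
        furnace_on = False

    # Plant plus disturbance: furnace output, heat loss, random fluctuation.
    temperature += FURNACE_GAIN if furnace_on else 0.0
    temperature -= LOSS_RATE * (temperature - outside)
    temperature += random.uniform(-0.05, 0.05)

    if minute % 20 == 0:
        print(f"{minute:3d} min  {temperature:5.2f} C  furnace={'on' if furnace_on else 'off'}")
```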

The problem with this, Yin argues, is that the red box is not a good model for the control that happens in the brain. As an alternative, he proposes the following model:

You can think of the red box as the stuff inside some aspect of the brain, and the “plant” as the things that this aspect controls (e.g., other parts of the brain, muscles).

The difference in Yin’s model is that the controller determines the set point. There’s no external agent specifying the desired value as an input. Instead, the controller generates its own set point, which Yin calls the reference value.

Also note that Yin’s model includes the input function inside the red box. This takes sensory input and calculates the variable that’s under control. The difference between this model and the thermostat is that, in the thermostat model, you know from the outside that temperature is the variable being controlled. In Yin’s model, you can’t see from the outside which variable is being controlled for: the variable is internal to the control system.
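Here’s a small sketch of that structural difference (my own illustration, not code from Yin’s paper): the reference value and the input function both live inside the controller, so an outside observer sees only raw sensory input going in and commands coming out, never the controlled variable itself.

```python
# Sketch of a controller in the style Yin describes: it generates its own
# reference value and computes the controlled variable internally.
# The specific input function and gain are invented for illustration.

class SelfReferencingController:
    def __init__(self):
        self.reference = 1.0   # generated internally, not supplied from outside
        self.gain = 0.5

    def _input_function(self, raw_senses):
        # Computes the controlled variable from raw sensory input.
        # Which variable is being controlled is invisible from outside.
        return sum(raw_senses) / len(raw_senses)

    def step(self, raw_senses):
        perceived = self._input_function(raw_senses)
        error = self.reference - perceived
        return self.gain * error   # output that drives the "plant"


controller = SelfReferencingController()
print(controller.step([0.2, 0.4, 0.6]))   # -> 0.3, i.e. 0.5 * (1.0 - 0.4)
```

If you treat this object as an input/output box and study it by feeding in senses and recording outputs, the mapping looks arbitrary until you work out what the input function computes and what reference it’s being compared against, which echoes Yin’s complaint about the input/output view of the brain.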

Despite knowing nothing about neuroscience, and only knowing a bit about control systems, I still found this paper surprisingly accessible. I recommend it. There’s a lot more here than what I’ve touched on in this post.

The ambiguity of real work

All ambiguity is resolved by actions of practitioners at the sharp end of the system.

Dr. Richard I. Cook, How Complex Systems Fail

There’s a wonderful book by the late urban planning professor Donald Schön called The Reflective Practitioner: How Professionals Think in Action. In the first chapter, he discusses the “rigor or relevance” dilemma that faces educators in professional degree programs. In the case of a university program aimed at preparing students for a career in software development, this is the “should we teach topological sort or React?” question.

Schön argues that the dilemma itself is a fundamental misunderstanding of the nature of professional work. What it misses is the ambiguity and uncertainty inherent in the work of professional life. The “rigor vs relevance” debate is an argument over the best way to get from the problem to the solution: do you teach the students first principles, or do you teach them how to use the current set of tools? Schön observes that a more significant challenge for professionals is defining the problems to solve in the first place, since an ill-defined problem admits no technical solution at all.

In the varied topography of professional practice, there is a high, hard ground where practitioners can make effective use of research-based theory and technique, and there is a swampy lowland where situations are confusing “messes” incapable of technical solution. The difficulty is that the problems of the high ground, however great their technical interest, are often relatively unimportant to clients or to the larger society, while in the swamp are the problems of greatest human concern.

His use of the term “messes” evokes Russell Ackoff’s use of the term in his paper The Future of Operational Research is Past:

Managers are not confronted with problems that are independent of each other, but with dynamic situations that consist of complex systems of changing problems that interact with each other. I call such situations messes. Problems are abstractions extracted from messes by analysis; they are to messes as atoms are to tables and chairs. We experience messes, tables, and chairs; not problems and atoms.

To take another example from the software domain: imagine that you’re doing quarterly planning, there’s a collection of reliability work that you’d like to do, and you’re trying to figure out how to prioritize it. You could apply a rigorous approach, where you quantify some values in order to do the prioritization work, estimating things like:

  • the probability of hitting a problem if the work isn’t done
  • the cost to the organization if the problem is encountered
  • the amount of effort involved in doing the reliability work

But you’re soon going to discover the enormous uncertainty involved in trying to put a number on any of those things. And, in fact, doing any reliability work can actually introduce new failure modes.
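To see how quickly the uncertainty swamps the exercise, here’s a back-of-the-envelope sketch with invented numbers. Even generous low and high bounds produce expected-cost ranges spanning orders of magnitude.

```python
# Hypothetical reliability work items with low/high guesses for the
# probability of hitting the problem and the cost if it's hit.
# Every number here is made up.

reliability_items = {
    # item: ((probability low, high), (cost-if-hit low, high in $), effort in eng-weeks)
    "add retry budget":   ((0.05, 0.40), (20_000, 500_000), 3),
    "shard hot database": ((0.10, 0.60), (50_000, 2_000_000), 8),
}

for name, ((p_lo, p_hi), (c_lo, c_hi), effort) in reliability_items.items():
    expected_lo = p_lo * c_lo   # optimistic expected cost of doing nothing
    expected_hi = p_hi * c_hi   # pessimistic expected cost of doing nothing
    print(f"{name}: expected cost ${expected_lo:,.0f} to ${expected_hi:,.0f} "
          f"for {effort} weeks of effort")
```

The ranges are so wide that they barely constrain the decision, and they say nothing about the new failure modes the reliability work itself might introduce.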

Over and over, I’ve seen the theme of ambiguity and uncertainty appear in ethnographic research that looks at professional work in action. In Designing Engineers, the aerospace engineering professor Louis Bucciarelli did an ethnographic study of engineers in a design firm, and discovered that the engineers all had partial understanding of the problem and solution space, and that their understandings also overlapped only partially. As a consequence, a lot of the engineering work that was done actually involved engineers resolving their incomplete understanding through various forms of communication, often informal. Remarkably, the engineers were not themselves aware of this process of negotiating understandings of the problems and solutions.

The famous Common Ground and Coordination in Joint Activity paper by Gary Klein, Paul Feltovich, and David Woods, makes explicit the role that ambiguity plays in human coordination and communication.

You’ll sometimes hear researchers who study work talk about the process of sensemaking. For example, there’s a paper by Sara Albolino, Richard Cook, and Michael O’Connor called Sensemaking, safety, and cooperative work in the intensive care unit that describes this type of work in an intensive care unit. I think of sensemaking as an activity that professionals perform to try to resolve ambiguity and uncertainty.

(Ambiguity isn’t always bad. In the book On Line and On Paper, the sociologist Kathryn Henderson describes how engineers use engineering drawings as boundary objects. These are artifacts that are understood differently by the different stakeholders: two engineers looking at the same drawing will have different mental models of the artifact based on their own domain expertise(!). However, there is also overlap in their mental models, and it is this combination of overlap and the fact that individuals can use the same artifact for different purposes that makes it useful. Here the ambiguity has actual value! In fact, her research showed that computer models, which eliminate the ambiguity, were less useful for this sort of work.)

As practitioners, we have no choice: we always have to deal with ambiguity. As noted by Richard Cook in the quote that opens this blog post, we are the ones, at the sharp end, that are forced to resolve it.

The Howie Guide: How to get started with incident investigations

Until now, if you wanted to improve your organization’s ability to learn from incidents, there wasn’t a lot of how-to style material you could draw from. Sure, there were research papers you could read (oh, so many research papers!). But academic papers aren’t a great source of advice for someone who is starting on an effort to improve how they do incident analysis.

There simply weren’t any publications targeted at the infotech industry about how to get started with incident investigations. Your best bet was the Etsy Debrief Facilitation Guide. It was practical, but it focused on only a single aspect of the incident investigation process: the group incident retrospective meeting. And there’s so much more to incident investigation than that meeting.

The folks at Jeli have stepped up to the challenge. They just released Howie: The Post-Incident Guide.

Readers of this blog will know that this is a topic near and dear to my heart. The name “Howie” is short for “How we got here”, which is what we call our incident writeups at Netflix. (This isn’t a coincidence: we came up with this name at Netflix when Nora Jones of Jeli and I were on the CORE team).

Writing a guide like this is challenging, because so much of incident investigation is contextual: what you look at and what questions you ask will depend on what you’ve learned so far. But there are also commonalities across all investigations; the central activities (constructing timelines, doing one-on-one interviews, building narratives) happen each time. The Howie guide gently walks the newcomer through these. It’s accessible.

When somebody says, “OK, I believe there’s value in learning more from incidents, and we want to go beyond doing a traditional root-cause-analysis. But what should I actually do?”, we now have a canonical answer: go read Howie.

I have no idea what I’m doing

A few days ago, David Heinemeier Hansson (who generally goes by DHH) wrote a blog post titled Programmers should stop celebrating incompetence.

I disagreed with the post, but for different reasons than most of the other responses I saw on twitter.

Here are a couple of lines from the post:

You can’t become the I HAVE NO IDEA WHAT I’M DOING dog as a professional identity. Don’t embrace being a copy-pasta programmer whose chief skill is looking up shit on the internet.

From the twitter reactions, it seems like people thought DHH was saying, “you shouldn’t be looking things up on the internet and copy-pasting code”. But I think that gets the thrust of his argument wrong. This wasn’t a diatribe against Stack Overflow: it was about how programmers see themselves and their work.

DHH was criticizing a sort of anti-intellectual mode of expression. The attitude he was criticizing reminds me of an essay I once read (I can’t remember the source or author; it might have been Paul Lockhart) in which a mathematics(?) professor was talking with some colleagues from the humanities department. When the math professor mentioned their field, one of the humanities professors said, “Oh, I was never any good at math,” and it came off almost as a point of pride.

Where I disagree with DHH is that I don’t see this type of anti-intellectualism in our field at all. I don’t see “LOL, I don’t know what I’m doing” on people’s LinkedIn profiles or in their resumes, I don’t hear it in interviews, I don’t see it on pull request comments, I don’t hear it in technical meetings. I don’t think it exists in our field.

You can see our field’s professionalism in criticisms of technical interviews that involve live coding. You don’t hear programmers criticizing it by saying, “LOL, actually, nobody knows how to do this.” What you hear instead is, “these interviews don’t effectively evaluate my actual skills as a software developer”.

So, what’s going on here? What led DHH astray? Where does the dog meme come from?

To explain my theory, I’m going to use a recent blog post by Diomidis Spinellis called Rather than alchemy, methodical troubleshooting.

Spinellis is a software engineering professor who has written numerous books for practitioners and has contributed to numerous open source projects (including the FreeBSD kernel). He is as professional as they come.

His blog post is about his struggles getting a React Native project to build in Xcode, including trying (in vain) various bits of advice he found through Googling. Spinellis actually feels bad about his initial approach:

Although advice from the web can often help us solve tough problems in seconds, as the author of the book Effective Debugging, I felt ashamed of wasting time by following increasingly nonsensical advice. 

I bring this up not to pile onto Spinellis, but to point out that the surface area of the software world is vast, so vast that even the most professional software engineer will encounter struggles, will hit issues outside of their expertise.

(As an aside: note that Spinellis does not solve the problem by developing a deep understanding of the failure mode, but instead by systematically eliminating the differences between a succeeding build and a failed one.)
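That style of troubleshooting can be sketched as a bisection over the differences between a working setup and a failing one. This is a simplification that assumes a single culprit (real delta debugging handles messier cases), and the difference list and build_succeeds check below are stand-ins, not real tooling.

```python
def find_culprit(differences, build_succeeds):
    """Bisect the differences between a working and a failing setup."""
    candidates = list(differences)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Apply only this half of the differences to the working setup;
        # if the build now fails, the culprit is somewhere in this half.
        if not build_succeeds(half):
            candidates = half
        else:
            candidates = candidates[len(candidates) // 2 :]
    return candidates[0]


# Toy example: the build breaks whenever the Xcode upgrade is applied.
diffs = ["node 14 -> 16", "xcode 12 -> 13", "pod update", "new build flag"]
print(find_culprit(diffs, lambda applied: "xcode 12 -> 13" not in applied))
```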

In the book Designing Engineers, Louis Bucciarelli notes that Murphy’s Law and the horror stories told by engineers are symptoms of the dissonance between the certainty of engineering models and the uncertainty of reality. I think the dog meme is another such symptom. It uses humor to help us deal with the fact that, no matter how skilled we become in our profession as software engineers, we will always encounter problems that extend beyond our areas of expertise.

To put it another way: the dog meme is a coping mechanism for professionals in a domain that will always throw problems at them that push them beyond their local knowledge. It doesn’t indicate a lack of professionalism. Instead, it calls attention to the ironies of professionalism in software engineering. Even the best software engineers are still reduced to Googling incomprehensible error messages.

How much did that outage cost?

People like to put dollar values on outages. There’s no true answer to the question of how much an outage costs an organization. If your company is transaction-based, you can estimate how many transactions were missed, but there are all sorts of other factors that you could decide to model. (Are those transactions really lost, or will people come back later? Does it impact the public’s perception of the organization? What if your business isn’t transaction-based?). If you ask John Allspaw, he’ll tell you that incidents can provide benefits to an organization in addition to costs.
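Even the simplest transaction-based estimate is built out of judgment calls. Here’s a deliberately naive sketch with invented numbers, just to show how many assumptions hide inside a single dollar figure:

```python
# A naive lost-transaction estimate. Every constant is a judgment call:
# the recovery fraction and the reputational impact are guesses, not data.

outage_minutes = 45
normal_orders_per_minute = 120
average_order_value = 35.00      # dollars
fraction_truly_lost = 0.4        # how many customers never come back?

missed_orders = outage_minutes * normal_orders_per_minute
naive_cost = missed_orders * average_order_value
adjusted_cost = naive_cost * fraction_truly_lost

print(f"missed orders: {missed_orders}")
print(f"naive cost:    ${naive_cost:,.0f}")
print(f"adjusted cost: ${adjusted_cost:,.0f}  (reputation not included)")
```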

Putting all of that aside for now, one question around incident cost that I think is interesting is the perceived cost within the organization. How costly does leadership feel that this incident was?

Here’s a proposed approach to try and capture this. After an incident, go to different people in leadership, and ask them the following question:

Imagine I could wave a magic wand and alter history to undo the incident: it would be as if the incident never happened. However, I’ll only do this if you pay me money out of your org’s budget. How much are you willing to pay me to wave the wand?

I think this would be an interesting way to convey to the organization how leadership perceived the incident.