Bad Religion: A review of Work Pray Code

A few months after I started as a professor at the University of Nebraska–Lincoln, the chair of the computer science department asked me, “Have you found a church community yet?” I had not. I had, however, found a synagogue. The choice wasn’t difficult: there were only two. Nobody asked me a question like that after I moved to San Jose, which describes itself as the heart of Silicon Valley.

Why is Silicon Valley so non-religious? That is the question sociologist Carolyn Chen seeks to answer here. Chen, a tenured faculty member at UC Berkeley, is a Bay Area resident herself. Like so many of us here, she’s a transplant: she grew up in Pennsylvania and Southern California, and first moved to the area in 2013 to do research on Asian religions in secular spaces.

Chen soon changed the focus of her research from Asian religions to the work culture of tech companies. She observes that people tend to become less religious when they move to the area, and less engaged in their local communities. Tech work is totalizing, absorbing employees’ entire lives. Tech companies care for many of the physical needs of their employees in a way that companies in other sectors do not. They provide meditation/mindfulness programs (the companies use the terms interchangeably) to help their employees stay productive, but what they offer is a neutered version of the practice, stripped of its religious, Buddhist roots. Tech companies push up the cost of living, and they provide private substitutes for public infrastructure, like shuttle buses.

Chen tries to weave these threads together into a narrative about how work substitutes for religion in the lives of tech workers in Silicon Valley. But the pieces just don’t fit together. Instead, they feel shoehorned in to support her thesis. And that’s a shame, because many of the observations themselves ring true to my own experience as a Silicon Valley tech worker. Unlike Nebraska, Silicon Valley really is a very secular place, so much so that it was a plot point in an episode of HBO’s Silicon Valley. As someone who sends my children to religious school, I’m clearly in the minority at work. My employer provides amenities like free meals and shuttles. They even provide meditation rooms, access to guided meditations provided by the Mental Health Employee Resource Group, and subscriptions to the Headspace meditation app. The sky-high cost of living in Silicon Valley is a real problem for the area.

But Chen isn’t able to make the case that her thesis is the best explanation for this grab bag of observations. And her ultimate conclusion, that tech companies behave more and more like cults, just doesn’t match my own experiences working at a large tech company in Silicon Valley.

Most frustratingly, Chen never seems to ask, “are there other domains where some of these observations also hold?” So much of her description of the secular and insular nature of Silicon Valley tech workers applies equally well to academia, the culture that Chen herself is immersed in!

Take this excerpt from Chen:

Workplaces are like big and powerful magnets that attract the energy of individuals away from weaker magnets such as families, religious congregations, neighborhoods, and civic associations—institutions that we typically associate with “life” in the “work-life” binary. The magnets don’t “rob” or “extract”—words that we use to describe labor exploitation. Instead they attract the filings, monopolizing human energy by exerting an attractive rather than extractive force. By creating workplaces that meet all of life’s needs, tech companies attract the energy and devotion people would otherwise devote to other social institutions, ones that, traditionally and historically, have been sources of life fulfillment.

Work Pray Code, p197

Compare this to an excerpt from a very different book: Robert Sommer’s sardonic 1963 book Expertland (sadly, now out of print), which describes itself as “an unrestricted inside view of the world of scientists, professors, consultants, journals, and foundations, with particular attention to the quaint customs, distinctive dilemmas, and perilous prospects”.

Experts know very few real people. Except for several childhood friends or close relatives, the expert does not know anybody who drives a truck, runs a grocery store, or is vice-president of the local Chamber of Commerce. His only connection with these people is in some kind of service relationship; they are not his friends, colleagues, or associates. The expert feels completely out of place at a Lion’s or Fish and Game meeting. If he is compelled to attend such gatherings, he immediately gravitates to any other citizen of Expertland who is present… He has no roots, no firm allegiances, and nothing to gain or lose in local elections… Because he doesn’t vote in local elections, join service clubs, or own the house he lives in, outsiders often feel that the expert is not a good citizen.

Expertland pp 2-3

Chen acknowledges that work is taking over the lives of all high-skilled professionals, not just tech workers. But I found work-life balance to be much worse in academia than at a Silicon Valley tech company! To borrow a phrase from the New Testament: “And why beholdest thou the mote that is in thy brother’s eye, but considerest not the beam that is in thine own eye?”

We value possession of experience, but not its acquisition

Imagine you’re being interviewed for a software engineering position, and the interviewer asks you: “Can you provide me with a list of the work items that you would do if you were hired here?” This is how the action item approach to incident retrospectives feels to me.

We don’t hire people based on their ability to come up with a set of work items. We’re hiring them for their judgment, their ability to make good engineering decisions and tradeoffs based on the problems that they will encounter at the company. In the interview process, we try to assess their expertise, which we assume they have developed based on their previous work experience.

Incidents provide us with excellent learning opportunities because they confront us with surprises. If we examine an incident in detail, we can learn something about our system behavior that we didn’t know before.

Yet, while we recognize the value of experienced candidates when we hire, we don’t seem to recognize the value of increasing the experience of our current employees. Incidents are a visceral type of experience, and reflecting on these experiences is what increases our expertise. But you have to actually do that reflection to maximize the value, and you have to share what you learn with the rest of the organization so that it isn’t just the incident responders who benefit from the experience.

To me, learning from incidents is about increasing the expertise of an organization by reflecting on and sharing out the experiences of surprising operational events. Action items are a dime a dozen. What I care about is improving the organization’s ability to engineer software.

Software engineering in-the-large: the coordination challenge

Back when I was an engineering student, I wanted to know “How do the big companies develop software? How does it happen in the real world?”

Now that I work at a company that has to do large-scale software development, I understand better why it’s not something you can really teach effectively in a university setting. It’s not that companies doing large-scale software development are somehow better at writing software than companies that work on smaller-scale software projects. It’s that large-scale projects face challenges that small-scale projects don’t.

The biggest challenge at large-scale is coordination. My employer provides a single service, which means that, in theory, any project that anyone is working on inside of the company could potentially impact what anybody else is working on. In my specific case, I work on delivery tools, so we might be called upon to support some new delivery workflow.

You can take a top-down, command-and-control approach to the problem, where the people at the top attempt to filter all of the incoming information down to just what they need and then coordinate everyone hierarchically. However, this structure isn’t effective in dynamic environments: as the facts on the ground change, it takes too long for information to work its way up the hierarchy, for the people at the top to adapt, and for new orders to flow back down.

You can take a bottom-up approach to the problem, where you have a collection of teams that work autonomously. But the challenge then is getting them aligned. In theory, you hire people with good judgment and provide them with the right context. The problem is that there’s too much context! You can’t just firehose all of the available information at everyone; that doesn’t scale, because everyone would spend all of their time reading docs. The grand challenge in this context becomes: how do you get the information into the heads of the people who need it?

It’s hard to convey the nature of this problem in a university classroom to someone who has never worked in a setting like this. The flurry of memos and planning documents, the misunderstandings, the sync meetings, the work towards alignment, the “One X” initiatives: these are all things that I had to experience viscerally, first-hand, to really get a sense of the nature of the problem.

Code rewrites and joint cognitive systems

Way back in the year 2000, Joel Spolsky famously criticized the idea of doing a code rewrite.

The idea that new code is better than old is patently absurd. Old code has been used. It has been tested. Lots of bugs have been found, and they’ve been fixed. There’s nothing wrong with it.

Joel Spolsky, Things You Should Never Do, Part I

I think Spolsky is wrong here. His error comes from considering the software in isolation. The problem here isn’t the old code, it’s the interaction between the old code and the humans who are responsible for maintaining the software. If you draw the boundary around those people and the software together, you get what the cognitive systems engineering community calls a joint cognitive system.

One of the properties of joint cognitive systems is that the system has knowledge about itself. Being responsible for maintaining a legacy codebase is difficult because the joint cognitive system is missing important knowledge about itself.

Here’s Spolsky again:

When you throw away code and start from scratch, you are throwing away all that knowledge. 

But that knowledge is already gone! The people who wrote the code have left, and the current maintainers don’t know what their intent was. The joint cognitive system, the combination of the code and its current maintainers, doesn’t know why the code is implemented the way it is.

Spolsky gestures at this, but doesn’t grasp its implications:

The reason that they think the old code is a mess is because of a cardinal, fundamental law of programming: It’s harder to read code than to write it.

Spolsky is missing the importance of a system’s ability to understand itself. Ironically, the computer scientist Peter Naur was writing about this phenomenon fifteen years earlier. In an essay titled Programming as Theory Building, he described the importance of having an accurate mental model or theory of the software, and the negative consequences of software being modified by maintainers with poor mental models.

It isn’t just about the software. It’s about the people and the software together.

Bitrot

Engineering deals in lifetimes, both human and otherwise. If not fatigue or fracture, then corrosion or erosion; if not war or vandalism, then taste or fashion claim not only the body but the very souls of once-new machines…

The lifetime of a structure is no mere anthropomorphic metaphor, for how long a piece of engineering must last can be one of the most important considerations for its design.

Henry Petroski, To Engineer is Human: The Role of Failure in Successful Design

Unfathomed misunderstanding is further revealed by the term “software maintenance”, as a result of which many people continue to believe that programs —and even programming languages themselves— are subject to wear and tear. Your car needs maintenance too, doesn’t it? Famous is the story of the oil company that believed that its PASCAL programs did not last as long as its FORTRAN programs “because PASCAL was not maintained”.

Edsger W. Dijkstra, On the cruelty of really teaching computing science

Before Borland’s new spreadsheet for Windows shipped, Philippe Kahn, the colorful founder of Borland, was quoted a lot in the press bragging about how Quattro Pro would be much better than Microsoft Excel, because it was written from scratch. All new source code! As if source code rusted.

The idea that new code is better than old is patently absurd. Old code has been used. It has been tested. Lots of bugs have been found, and they’ve been fixed. There’s nothing wrong with it. It doesn’t acquire bugs just by sitting around on your hard drive. Au contraire, baby! Is software supposed to be like an old Dodge Dart, that rusts just sitting in the garage? Is software like a teddy bear that’s kind of gross if it’s not made out of all new material?

Joel Spolsky, Things You Should Never Do, Part I

In the Dijkstra and Spolsky quotes above, both authors ridicule the notion that software systems wear out. Unlike physical systems, software doesn’t suffer from fatigue due to prolonged usage.

And, yet, anyone who has uttered the phrase “legacy system” in the presence of a software engineer and watched the change of expression on their face knows that engineers find older code more difficult to deal with than newer code. The motivation of Dijkstra’s and Spolsky’s writings above is to express contempt for this point of view.

What Dijkstra and Spolsky are missing is that the world changes around software. Software doesn’t exist in a vacuum: it’s part of an ecosystem. Legacy systems have legacy dependencies, and run in legacy environments. Those dependencies and environments are not static, they change over time, and sometimes the old ones go away, or are too expensive or risky to keep using.

Software is indeed different from physical artifacts, in that software artifacts (source code, binaries) don’t change with use. But in the world of software, that’s exactly the problem. The world keeps changing, and the software doesn’t, unless you put the work into it. And, unlike civil engineers, we aren’t yet good at thinking about the intended lifetime of a software system when we’re designing it.

I have no idea what I’m doing

A few days ago, David Heinemeier Hansson (who generally goes by DHH) wrote a blog post titled Programmers should stop celebrating incompetence.

I disagreed with the post, but for different reasons than most of the other responses I saw on Twitter.

Here are a couple of lines from the post:

You can’t become the I HAVE NO IDEA WHAT I’M DOING dog as a professional identity. Don’t embrace being a copy-pasta programmer whose chief skill is looking up shit on the internet.

From the Twitter reactions, it seems like people thought DHH was saying, “you shouldn’t be looking things up on the internet and copy-pasting code.” But I think that gets the thrust of his argument wrong. This wasn’t a diatribe against Stack Overflow; it was about how programmers see themselves and their work.

DHH was criticizing a sort of anti-intellectual mode of expression. The attitude he was criticizing reminds me of an essay I once read (I can’t remember the source or author; it might have been Paul Lockhart) in which a professor of mathematics (if I remember right) was talking with some colleagues from the humanities department. When the math professor mentioned their field, one of the humanities professors said, “Oh, I was never any good at math,” and it came off almost as a point of pride.

Where I disagree with DHH is that I don’t see this type of anti-intellectualism in our field at all. I don’t see “LOL, I don’t know what I’m doing” on people’s LinkedIn profiles or in their resumes, I don’t hear it in interviews, I don’t see it on pull request comments, I don’t hear it in technical meetings. I don’t think it exists in our field.

You can see our field’s professionalism in criticisms of technical interviews that involve live coding. You don’t hear programmers criticizing it by saying, “LOL, actually, nobody knows how to do this.” What you hear instead is, “these interviews don’t effectively evaluate my actual skills as a software developer”.

So, what’s going on here? What led DHH astray? Where does the dog meme come from?

To explain my theory, I’m going to use this recent blog post by Diomidis Spinellis, called Rather than alchemy, methodical troubleshooting.

Spinellis is a software engineering professor who has written numerous books for practitioners and has contributed to numerous open source projects (including the FreeBSD kernel). He is as professional as they come.

His blog post is about his struggles getting a React Native project to build in Xcode, including trying (in vain) various bits of advice he found through Googling. Spinellis actually feels bad about his initial approach:

Although advice from the web can often help us solve tough problems in seconds, as the author of the book Effective Debugging, I felt ashamed of wasting time by following increasingly nonsensical advice. 

I bring this up not to pile onto Spinellis, but to point out that the surface area of the software world is vast, so vast that even the most professional software engineer will encounter struggles, will hit issues outside of their expertise.

(As an aside: note that Spinellis does not solve the problem by developing a deep understanding of the failure mode, but instead by systematically eliminating the differences between a succeeding build and a failed one.)

In the book Designing Engineers, Louis Bucciarelli notes that Murphy’s Law and the horror stories engineers tell are symptoms of the dissonance between the certainty of engineering models and the uncertainty of reality. I think the dog meme is another such symptom. It uses humor to help us deal with the fact that, no matter how skilled we become as software engineers, we will always encounter problems that lie beyond our expertise.

To put it another way: the dog meme is a coping mechanism for professionals in a domain that will always throw problems at them that push them beyond their local knowledge. It doesn’t indicate a lack of professionalism. Instead, it calls attention to the ironies of professionalism in software engineering. Even the best software engineers still find themselves reduced to Googling incomprehensible error messages.

The strange beauty of strange loop failure modes

As I’ve posted about previously, at my day job, I work on a project called Managed Delivery. When I first joined the team, I was a little horrified to learn that the service that powers Managed Delivery deploys itself using Managed Delivery.

“How dangerous!”, I thought. What if we push out a change that breaks Managed Delivery? How will we recover? However, after having been on the team for over a year now, I have a newfound appreciation for this approach.

Yes, sometimes there’s something that breaks, and that makes it harder to roll back, because Managed Delivery provides the main functionality for easy rollback. However, it also means that the team gets quite a bit of practice at bypassing Managed Delivery when something goes wrong. They know how to disable Managed Delivery and use the traditional Spinnaker UI to deploy an older version. They know how to poke and prod at the database if the Managed Delivery UI doesn’t respond properly.

These strange loop failure modes are real: if Managed Delivery breaks, we may lose out on the functionality of Managed Delivery to help us recover. But it also means that we’re more ready for handling the situation if something with Managed Delivery goes awry. Yes, Managed Delivery depends on itself, and that’s odd. But we have experience with how to handle things when this strange loop dependency creates a problem. And that is a valuable thing.

The local nature of culture

I’m really enjoying Turn the Ship Around!, a book by David Marquet about his experiences as commander of a nuclear submarine, the USS Santa Fe, and how he worked to improve its operational performance.

One of the changes that Marquet introduced is something he calls “thinking out loud”, where he encourages crew members to speak aloud their thoughts about things like intentions, expectations, and concerns. He notes that this approach contradicted naval best practices:

As naval officers, we stress formal communications and even have a book, the Interior Communications Manual, that specifies exactly how equipment, watch stations, and evolutions are spoken, written, and abbreviated …

This adherence to formal communications unfortunately crowds out the less formal but highly important contextual information needed for peak team performance. Words like “I think…” or “I am assuming…” or “It is likely…” that are not specific and concise orders get written up by inspection teams as examples of informal communications, a big no-no. But that is just the communication we need to make leader-leader work.

Turn the Ship Around! p103

This change did improve the ship operations, and this improvement was recognized by the Navy. Despite that, Marquet still got pushback for violating norms.

[E]ven though Santa Fe was performing at the top of the fleet, officers steeped in the leader-follower mind-set would criticize what they viewed as the informal communications on Santa Fe. If you limit all discussion to crisp orders and eliminate all contextual discussion, you get a pretty quiet control room. That was viewed as good. We cultivated the opposite approach and encouraged a constant buzz of discussions among the watch officers and crew. By monitoring that level of buzz, more than the actual content, I got a good gauge of how well the ship was running and whether everyone was sharing information.

Turn the Ship Around! p103

Reading this reminded me of just how local culture can be. I shouldn’t be surprised, though. At Netflix, I’ve worked on three teams (and had six managers!), and each team had a very different local culture, despite all of them being in the same organization, Platform Engineering.

I used to wonder, “how does a large company like Google write software?” But I no longer think that’s a meaningful question. It’s not Google as an organization that writes software, it’s individual teams that do. The company provides the context that the teams work in, and the teams are constrained by various aspects of the organization, including the history of the technology they work on. But, there’s enormous cultural variation from one team to the next. And, as Marquet illustrates, you can change your local culture, even cutting against organizational “best practices”.

So, instead of asking, “what is it like to work at company X”, the question you really want answered is, “what is it like to work on team Y at company X?”

What do you work on, anyway?

I often struggle to describe the project that I work on at my day job, even though it’s an open-source project that even has its own domain name: managed.delivery. I’ll often mumble something like, “it’s a declarative deployment system”. But that explanation does not yield much insight.

I’m going to use Kubernetes as an analogy to explain my understanding of Managed Delivery. This is dangerous, because I’m not a Kubernetes user(!). But if I didn’t want to live dangerously, I wouldn’t blog.

With Kubernetes, you describe the desired state of your resources declaratively, and then the system takes action to bring the current state of the system to the desired state. In particular, when you use Kubernetes to launch a pod of containers, you need to specify the container image name and version to be deployed as part of the desired state.
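
To make the reconcile idea concrete, here’s a toy sketch in Python. This is not Kubernetes code: the names (PodSpec, converge, reconcile) are invented for illustration, and a real cluster stores its state in the control plane, not in a Python dict. The sketch just shows the core loop: repeatedly compare the actual state to the desired state and act on the difference.

```python
from dataclasses import dataclass, replace
import time

# Toy model of the Kubernetes idea: you declare a desired state, and a
# control loop drives the actual state toward it. All names are invented.

@dataclass
class PodSpec:
    image: str     # e.g. "myapp"
    version: str   # e.g. "v23"
    replicas: int

# In a real cluster this state lives in the control plane; here it's a dict.
cluster = {"actual": PodSpec(image="myapp", version="v22", replicas=2)}

def converge(desired: PodSpec) -> None:
    """Take one step that brings the actual state closer to the desired state."""
    actual = cluster["actual"]
    if actual.version != desired.version:
        print(f"rolling {actual.image} from {actual.version} to {desired.version}")
        cluster["actual"] = replace(actual, version=desired.version)
    elif actual.replicas != desired.replicas:
        print(f"scaling {actual.image} from {actual.replicas} to {desired.replicas} replicas")
        cluster["actual"] = replace(actual, replicas=desired.replicas)

def reconcile(desired: PodSpec, poll_seconds: float = 0.1) -> None:
    """Loop until the actual state matches the desired state."""
    while cluster["actual"] != desired:
        converge(desired)
        time.sleep(poll_seconds)
    print("actual state matches desired state")

reconcile(PodSpec(image="myapp", version="v23", replicas=3))
```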

When a developer pushes new code out, they need to change the desired state of a resource, specifically, the container image version. This means that a deployment system needs some mechanism for changing the desired state.

A common pattern we see is that service owners have a notion of an environment (e.g., test, staging, prod). For example, they’ll deploy the code to test and maybe run some automated tests against it; if it looks good, they’ll promote it to staging and maybe do some manual tests; and if they’re happy, they’ll promote it out to prod.

[Figure: Example of deployment environments]

Imagine test, staging, and prod all have version v23 of the code running in them. After version v24 is cut, it will first be deployed to test, then staging, then prod. That’s how each version propagates through these environments, assuming it meets the promotion constraints for each environment (e.g., tests pass, a human signs off).

You can think of this kind of promoting-code-versions-through-environments as a pattern for describing how the desired states of the environments change over time. And you can describe this pattern declaratively, rather than imperatively like you would with traditional pipelines.
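
To make the contrast concrete, here’s a hypothetical sketch of the imperative version, the kind of fixed script a traditional pipeline encodes. The function names are invented; this isn’t any real pipeline DSL.

```python
# Hypothetical sketch of a traditional, imperative pipeline: a fixed sequence
# of steps, run once per new version. All names are invented for illustration.

def deploy(environment: str, version: str) -> None:
    print(f"deploying {version} to {environment}")

def end_to_end_tests_pass(environment: str, version: str) -> bool:
    print(f"running end-to-end tests against {environment}")
    return True  # stand-in for a real automated test run

def approved_by_human(environment: str, version: str) -> bool:
    print(f"waiting for a human to sign off on {version} in {environment}")
    return True  # stand-in for a manual judgment call

def run_pipeline(version: str) -> None:
    # Each line is an instruction ("do this, then that"), not a description
    # of what the desired state of each environment should be.
    deploy("test", version)
    if not end_to_end_tests_pass("test", version):
        return
    deploy("staging", version)
    if not approved_by_human("staging", version):
        return
    deploy("prod", version)

run_pipeline("v24")
```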

And that’s what Managed Delivery is. It’s a way of declaratively describing how the desired state of the resources should evolve over time. To use a calculus analogy, you can think of Managed Delivery as representing the time-derivative of the desired state function.

If you think of Kubernetes as a system for specifying desired state, Managed Delivery is a system for specifying how desired state evolves over time.

With Managed Delivery, you can express concepts like:

  • for a code version to be promoted to the staging environment, it must
    • be successfully deployed to the test environment
    • pass a suite of end-to-end automated tests specified by the app owner

and then Managed Delivery uses these environment promotion specifications to shepherd the code through the environments.
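
Here’s a toy sketch of that idea in Python. To be clear, this is not the actual Managed Delivery configuration format or API; the names (Environment, deployed_in, shepherd, and so on) are invented for illustration. It just shows the shape: each environment declares the constraints a version must satisfy, and a shepherding step promotes a version only once those constraints hold.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical sketch, not the real Managed Delivery config format or API.
# Each environment declares the constraints a version must satisfy before
# it can be promoted into that environment.

@dataclass
class Environment:
    name: str
    constraints: list = field(default_factory=list)  # version -> bool checks
    current_version: str = ""

def deployed_in(env: Environment) -> Callable[[str], bool]:
    """Constraint: the version must already be running in an upstream environment."""
    return lambda version: env.current_version == version

def end_to_end_tests_pass(version: str) -> bool:
    """Stand-in for running the app owner's automated test suite."""
    print(f"running end-to-end tests against {version}")
    return True

test = Environment("test")
staging = Environment("staging", constraints=[deployed_in(test), end_to_end_tests_pass])
prod = Environment("prod", constraints=[deployed_in(staging)])

def shepherd(version: str, environments: list) -> None:
    """Promote the version into each environment whose constraints it satisfies."""
    for env in environments:
        if all(check(version) for check in env.constraints):
            env.current_version = version
            print(f"promoted {version} to {env.name}")
        else:
            print(f"{version} is not yet eligible for {env.name}")
            break  # downstream environments depend on upstream ones

shepherd("v24", [test, staging, prod])
```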

And that’s it. Managed Delivery is a system that lets users describe how the desired state changes over time, by letting them specify environments and the rules for promoting changes from one environment to the next.