Below is a screenshot of Vizceral, a tool that was built by a former teammate of mine at Netflix. It provides a visualization of the interactions between the various microservices.
Vizceral uses moving dots to depict how requests are currently flowing through the Netflix microservice architecture. Vizceral is able to do its thing because of the platform tooling, which provides support for generating a visualization like this by exporting a standard set of inter-process communication (IPC) metrics.
What you don’t see depicted here are the interactions between those microservices and the telemetry platform that ingests these metrics. There’s also logging and tracing data, which get shipped off-box via different channels, but none of those channels show up in this diagram.
In fact, this visualization doesn’t represent interactions with any of the platform services. You won’t see bubbles for the compute platform or the CI/CD platform in a diagram like this, even though those platform services all interact with these application services in important ways.
I call the first category of interactions, the ones between the application services, first-class, and the second category, the ones that involve platform services, second-class. It’s those second-class interactions that I want to say more about.
These second-class interactions tend to have a large blast radius, because successful platforms by their nature have a large blast radius. There’s a reason why there’s so much havoc out in the world when AWS’s us-east-1 region has a problem: because so many services out there are using us-east-1 as a platform. Similarly, if you have a successful platform within your organization, then by definition it’s going to see a lot of use, which means that if it experiences a problem, it can do a lot of damage.
These platforms are generally more reliable than the applications that run atop them, because they have to be: platforms naturally have higher reliability requirements than the applications that run atop them. They have these requirements because they have a large blast radius. A flaky platform is a platform that contributes to multiple high-severity outages, and systems that contribute to multiple high-severity outages are the systems where reliability work gets prioritized.
And a reliable system is a system whose details you aren’t aware of, because you don’t need to be. If my car is very reliable, then I’m not going to build an accurate mental model of how my car works, because I don’t need to: it just works. In her book Human-Machine Reconfigurations: Plans and Situated Actions, the anthropologist Lucy Suchman used the term representation to describe the activity of explicitly constructing a mental model of how a piece of technology works, and she noted that this type of cognitive work only happens when we run into trouble. As Suchman puts it:
[R]epresentation occurs when otherwise transparent activity becomes in some way problematic
Hence the irony: these second-class interactions tend not to be represented in our system models when we talk about reliability, because they are generally not problematic.
And so we are lulled into a false sense of security. We don’t think about how the plumbing works, because the plumbing just works. Until the plumbing breaks. And then we’re in big trouble.
There are two general approaches to decision-making. One way is to make a judgment call. Informally, you could call this “trusting your gut”. Formally, you could describe this as a subjective, implicit process. The other way is to use an explicit approach that relies on objective, quantitative data, for example, doing a return-on-investment (ROI) calculation on a proposed project to decide whether to undertake the project. We use the term rigorous to describe these types of approaches, and we generally regard them as superior.
Here, Porter argues that quantitative, rigorous decision-making in a field is not a sign of its maturity, but rather its political weakness. In fields where technical professionals enjoy a significant amount of trust, these professionals do decision-making using personal judgment. While professionals will use quantitative data as input, their decisions are ultimately based on their own subjective impressions. (For example, see Julie Gainsburg’s notion of skeptical reverence in The Mathematical Disposition of Structural Engineers). In Porter’s account, we witnessed an increase of rigorous decision-making approaches in the twentieth century because of a lack of trust in certain professional fields, not because the quantitative approaches yielded better results.
It’s only in fields where the public does not grant deference to professionals that they are compelled to use explicit, objective processes to make the decisions. They are forced to show their work in a public way because they aren’t trusted. In some cases, a weak field adopts rigor to strengthen itself in the eyes of the public, such as experimental psychology’s adoption of experimental rigor (in particular, ESP research). Most of the case studies in the book come from areas where a field was compelled to adopt objective approaches because there was explicit political pressure and the field did not have sufficient power to resist.
In some cases, professionals did have the political clout to push back. An early chapter of the book discusses a problem that the British parliament wrestled with in the late nineteenth century: unreliable insurance companies that would happily collect premiums but then would eventually fail and would hence be unable to pay out when their customers submitted claims. A parliamentary committee formed and heard testimony from actuaries about how the government could determine whether an insurance company was sound. The experienced actuaries from reputable companies argued that it was not possible to define an objective procedure for assessing a company. They insisted that “precision is not attainable through actuarial methods. A sound company depends on judgment and discretion.” They were concerned that a mechanical, rule-based approach wouldn’t work:
Uniform rules of calculation, imposed by the state, might yield “uniform errors.” Charles Ansell, testifying before another select committee a decade earlier, argued similarly, then expressed his fear that the office of government actuary would fall to “some gentlemen of high mathematical talents, recently removed from one of our Universities, but without any experience whatever, though of great mathematical reputation.” This “would not qualify him in any way whatever for expressing a sound opinion on a practical point like that of the premiums in a life assurance.”
Trust in Numbers, pp108-109
Porter tells a similar story about American accountants. To stave off having standardized rules imposed on them, the American Institute of Accountants defined standards for its members, but these were controversial. One accountant, Walter Wilcox, argued in 1941 that “Cost is not a simple fact, but is a very elusive concept… Like other aspects of accounting, costs give a false impression of accuracy.” Similarly, when it came to government-funded projects, the political pressure was simply too strong to defer to government civil engineers, such as the French civil engineers who had to help decide which rail projects should be funded, or the U.S. Army Corps of Engineers who had to help make similar decisions about waterway projects such as dams and reservoirs. In the U.S., they settled on a cost-benefit analysis process, where the return on investment had to exceed 1.0 in order to justify a project. But, unsurprisingly, there were conflicts over how benefits were quantified, as well as over how to classify costs. While the output may have been a number, and the process was ostensibly objective, because it needed to be, ultimately these numbers were negotiable and assessments changed as a function of political factors.
In education, teachers were opposed to standardized testing, but did not have the power to overcome it. On the other hand, doctors were able to retain the use of their personal judgment for diagnosing patients. However, the regulators had sufficient power that they were able to enforce the use of objective measures for evaluating drugs, and hence were able to oversee some aspect of medical practice.
This tug of war between rigorous, mechanical objectivity and élite professional autonomy continues to this day. Professionals say “This requires private knowledge; trust us”. Sometimes, the public says “We don’t trust you anymore. Make the knowledge public!”, and the professionals have no choice but to relent. On the subject of whether we are actually better off when we trade away judgment for rigor, Porter is skeptical. I agree.
“Thanks! I’m excited to be here. This is my first tech job, even if it is just an internship.”
“We’re going to start you off with some automated testing. You’re familiar with queues, right?”
“The data structure? Sure thing. First in, first out.”
“Great! We need some help validating that our queueing module is always working properly. We have a bunch of test scenarios written, and we need someone to check that the observed behavior of the queue is correct.”
“So, for input, do I get something like a history of interactions with the queue? Like this?”
q.add("A") -> OK
q.add("B") -> OK
q.pop() -> "A"
q.add("C") -> OK
q.pop() -> "B"
q.pop() -> "C"
“Exactly! That’s a nice example of a correct history for a queue. Can you write a program that takes a history like that as input and returns true if it’s a valid history?”
“Sure thing.”
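A checker like the one the intern is being asked for could be sketched roughly as follows. This is only an illustration: the tuple format for history entries and the function name `is_valid_queue_history` are assumptions of mine, and I've assumed that a pop on an empty queue counts as invalid.

```python
from collections import deque

def is_valid_queue_history(history):
    """Return True if `history` is a valid sequential FIFO queue history.

    `history` is a list of (operation, argument, result) tuples, e.g.
    ("add", "A", "OK") or ("pop", None, "A"). This format is an
    assumption made for this sketch.
    """
    model = deque()  # reference model of what the queue should contain
    for op, arg, result in history:
        if op == "add":
            if result != "OK":
                return False
            model.append(arg)
        elif op == "pop":
            # A pop must return the oldest element still in the queue.
            # Assumption: popping an empty queue is treated as invalid.
            if not model or model.popleft() != result:
                return False
        else:
            return False  # unknown operation
    return True

good = [("add", "A", "OK"), ("add", "B", "OK"), ("pop", None, "A"),
        ("add", "C", "OK"), ("pop", None, "B"), ("pop", None, "C")]
out_of_order = [("add", "A", "OK"), ("add", "B", "OK"), ("add", "C", "OK"),
                ("pop", None, "A"), ("pop", None, "C"), ("pop", None, "B")]
```

The checker replays the history against a trusted in-memory model and rejects the history at the first operation whose recorded result disagrees with the model.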
“Excellent. We’ll also need your help generating new test scenarios.”
A few days later
“I think I found a scenario where the queue is behaving incorrectly when it’s called by a multithreaded application. I got a behavior that looks like this:”
q.add("A") -> OK
q.add("B") -> OK
q.add("C") -> OK
q.pop() -> "A"
q.pop() -> "C"
q.pop() -> "B"
“Hmmm. That’s definitely incorrect behavior. Can you show me the code you used to generate the behavior?”
“Sure thing. I add the elements to the queue in one thread, and then I spawn a bunch of new threads and dequeue in the new threads. I’m using the Python bindings to call the queue. My program looks like this.”
from bigco import Queue
from threading import Thread
“Well, that’s certainly not the order I expect the output to be printed in, but how do you know the problem is that the queue is actually behaving incorrectly? It might be that the values were dequeued in the correct order, but because of the way the threads are scheduled, the print statements were simply executed in a different order than you expect.”
“Hmmm. I guess you’re right: just looking at the order of the printed output doesn’t give me enough information to tell if the queue is behaving correctly or not. Let me try printing out the thread ids and the timestamps.”
[id0] [t=1] before pop
[id0] [t=2] after pop
[id0] [t=3] output: A
[id1] [t=4] before pop
[id2] [t=5] before pop
[id2] [t=6] after pop
[id2] [t=7] output: C
[id1] [t=8] after pop
[id1] [t=9] output: B
“Oh, I see what happened! The operations of thread 1 and thread 2 were interleaved! I didn’t think about what might happen in that case. It must have been something like this:”
“Well, it looks like the behavior is still correct, the items got dequeued in the expected order, it’s just that they got printed out in a different order.”
The next day
“After thinking through some more multithreaded scenarios, I ran into a weird situation that I didn’t expect. It’s possible that the ‘pop’ operations overlap in time across the two different threads. For example, ‘pop’ might start on thread 1, and then in the middle of that pop operation, the operating system schedules thread 2, and its pop starts before the first one has finished.”
[id0]            [id1]            [id2]
q.pop(): start
q.pop(): end
print("A")
                 q.pop(): start
                 |                q.pop(): start
                 q.pop(): end     |
                                  q.pop(): end
                                  print("C")
                 print("B")
“Let’s think about this. If id1 and id2 overlap in time like this, what do you think the correct output should be? ‘ABC’ or ‘ACB’?”
“I have no idea. I guess we can’t say anything!”
“So, if the output was ‘ABB’, you’d consider that valid?”
“Wait, no… It can’t be just anything. It seems like either ‘ABC’ or ‘ACB’ should be valid, but not ‘ABB’.”
“How about ‘BCA’? Would that be valid here?”
“No, I don’t think so. There’s no overlap between the first pop operation and the others, so it feels like the pop in id0 should return ‘A’.”
“Right, that makes sense. So, in a concurrent world, we have potentially overlapping operations, and that program you wrote that checks queue behaviors doesn’t have any notion of overlap in it. So we need to be able to translate these potentially overlapping histories into the kind of sequential history your program can handle. Based on this conversation, we can use two rules:
1. If two operations don’t overlap (like the pop in id0 and the pop in id1) in time, then we use the time ordering (id0 happened before id1).
2. If two operations do overlap in time, then either ordering is valid.
“So, that means that when I check whether a multithreaded behavior is valid, I need to actually know the time overlap of the operations, and then generate multiple possible sequential behaviors, and check to see if the behavior that I witnessed corresponds to one of those?”
“Yes, exactly. This is a consistency model called linearizability. If our queue has linearizable consistency, that means that for any behavior you witness, you can define a linearization, an equivalent sequential behavior. Here’s an example.”
“The question is: can we generate a linearization based on the two rules above? We can! Because the ‘id1’ and ‘id2’ operations overlap, we can generate a linearization where the ‘id1’ operation happens first. One way to think about it is to identify a point in time between the start and end of the operation and pretend that’s when the operation really happens. I’ll mark these points in time with an ‘x’ in the diagram.”
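The two rules translate fairly directly into a brute-force check: try every sequential ordering of the operations, discard the ones that contradict the time ordering of non-overlapping operations, and see whether any remaining ordering is a valid queue history. The sketch below assumes each operation is recorded with start and end timestamps; the `op` helper and the record format are hypothetical, and trying every permutation is only practical for tiny histories.

```python
from collections import deque
from itertools import permutations

def op(name, start, end, arg=None, result=None):
    # Hypothetical record format: an operation plus its start/end timestamps.
    return {"name": name, "start": start, "end": end, "arg": arg, "result": result}

def is_valid_sequential(ops):
    """Replay ops in the given order against a model FIFO queue."""
    model = deque()
    for o in ops:
        if o["name"] == "add":
            model.append(o["arg"])
        elif not model or model.popleft() != o["result"]:
            return False
    return True

def respects_real_time(order):
    """Rule 1: if b finished before a started, b cannot be ordered after a."""
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            if b["end"] < a["start"]:
                return False
    return True

def is_linearizable(ops):
    """Rule 2: overlapping operations may appear in either order."""
    return any(respects_real_time(p) and is_valid_sequential(p)
               for p in permutations(ops))

adds = [op("add", 1, 2, arg="A"), op("add", 3, 4, arg="B"), op("add", 5, 6, arg="C")]
# The last two pops overlap in time, so "ABC" and "ACB" are both acceptable.
abc = adds + [op("pop", 7, 8, result="A"), op("pop", 10, 14, result="B"),
              op("pop", 11, 12, result="C")]
acb = adds + [op("pop", 7, 8, result="A"), op("pop", 10, 14, result="C"),
              op("pop", 11, 12, result="B")]
# The first pop does not overlap with the others, so it must return "A".
bca = adds + [op("pop", 7, 8, result="B"), op("pop", 10, 14, result="C"),
              op("pop", 11, 12, result="A")]
```

Here `is_linearizable(abc)` and `is_linearizable(acb)` both hold, while `is_linearizable(bca)` does not, which matches the intuition from the conversation.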
“We’re expanding our market. We’re building on our queue technology to build a distributed queue. We’re also providing a new operation: “get”. When you call “get” on a distributed queue, you get the entire contents of the queue, in queue order.”
“Oh, so a valid history would be something like this?”
“Exactly! One use case we’re targeting is using our queue for implementing online chat, so the contents of a queue might look like this:”
["Alice: How are you doing?", "Bob: I'm fine, Alice. How are you?", "Alice: I'm doing well, thank you."]
CAPd
“OK, I did some testing with the distributed queue and ran into a problem. Look at this history, it’s definitely wrong. Note that the ids here are process ids, not thread ids, because we’re running on different machines.
“When process 1 called ‘get’, it didn’t see the “Alice: Hello” entry, and that operation completed before the ‘get’ started! This history isn’t linearizable!”
“You’re right, our distributed queue isn’t linearizable. Note that we could modify this history to make it linearizable if process 0’s add operation did not complete until after the get:
[id0]                            [id1]
q.add("Alice: Hello"): start
                                 q.add("Bob: Hi"): start
                                 q.add(...) -> OK
                                 q.get(): start
                                 q.get() -> ["Bob: Hi"]
q.add(...) -> OK
“Now we can produce a valid linearization from the history”
“But look what we had to do: we had to delay the completion of that add operation. This is the lesson of the CAP theorem: if you want your distributed object to have linearizable consistency, then some operations might take an arbitrarily long time to complete. With our queue, we decided to prefer availability, so that all operations are guaranteed to complete within a certain period of time. Unfortunately, once we give up on linearizability, things can get pretty weird. Let’s see how many different types of weird things you can find.”
Monotonic reads
“Here’s a weird one. The ‘Hi’ message disappeared in the second read!”
“Yep, this violates a property called monotonic reads. Once process 0 has seen the effect of the add(“B: Hi”) operation, we expect that it will always see it in the future. This is an example of a session property. If the two gets happened on two different processes, this would not violate the monotonic reads property. For example, the following history doesn’t violate monotonic reads, even though the operations and ordering are the same. That’s because one of the gets is in process 0, and the other is in process 1, and the monotonic reads property only applies to reads within the same process.
“All right, let’s say we can guarantee monotonic reads. What other kinds of weirdness happen?”
Read your writes
[id0]
q.add("A: Hello")
q.get() -> []
“Read your writes is one of the more intuitive consistency properties. If a process writes data, and then does a read, it should be able to see the effect of the write. Here we did a write, but we didn’t see it.”
“Here’s a case where read-your-writes isn’t violated (in fact, we don’t do any reads after the write), but something very strange has happened. We saw the effect of our write before we actually did the write! This violates the writes follow reads property. This is also called session causality, and you can see why: when it was violated, we saw the effect before the cause!”
Monotonic writes
[id0]                        [id1]
q.add("A: Hi there!")
q.add("A: How are you?")
                             q.get() -> ["A: How are you?"]
“Hey, process 1 saw the ‘How are you?’ but not the ‘Hi there!’, even though they both came from process 0.”
“Yep. It’s weird that process 1 saw the second write from process 0, but it didn’t see the first write. This violates the monotonic writes property. Note that if the two writes were from different processes, this would not violate the property. For example, this would be fine:
[id0] [id1] q.add("A: Hi there!") q.add("A: How are you?") q.get() -> ["A: How are you?"]
“From process 1’s perspective, it looks like the history of the chat log changed! Somehow, ‘A: Hello’ snuck in before ‘B: Hi’, even though process 1 had already seen ‘B: Hi’.”
“Yes, this violates a property called consistent prefix. Note that this is different from monotonic reads, which is not violated in this case. (Sadly, the Jepsen consistency page doesn’t have an entry for consistent prefix).
Reasoning about correctness in a distributed world
One way to think about what it means for a data structure implementation to be correct is to:
1. Define what it means for a particular execution history to be correct.
2. Check that every possible execution history for the implementation satisfies these correctness criteria.
Step 2 requires doing a proof, because in general there are too many possible execution histories for us to check exhaustively. But, even if we don’t actually go ahead and do the formal proof, it’s still useful to think through step 1: what it means for a particular execution history to be correct.
As we move from sequential data structures to concurrent (multithreaded) ones and then distributed ones, things get a bit more complicated.
Recall that for the concurrent case, in order to check that a particular execution history was correct, we had to see if we could come up with a linearization. We had to try and identify specific points in time when operations took effect to come up with a sequential version of the history that met our sequential correctness criteria.
In Principles of Eventual Consistency, Sebastian Burckhardt proposed a similar type of approach for validating the execution history of a distributed data structure. (This is the approach that Viotti & Vukolic extended. Kyle Kingsbury references Viotti and Vukolic on the Jepsen consistency models page that I’ve linked to several times here).
Execution histories as a set of events
To understand Burckhardt’s approach, we first have to understand how he models a distributed data structure execution history. He models an execution history as a set of events, where each event has associated with it:
The operation (including arguments), e.g.:
get()
add(“Hi”)
A return value, e.g.
[“Hi”, “Hello”]
OK
He also defines two relations on these events, returns-before and same-session.
Returns-before
The returns-before (rb) relation models time. If there are two events, e1, e2, and (e1,e2) is in rb, that means that the operation associated with e1 returned before the operation associated with e2 started.
Let’s take this example, where the two add operations overlap in time:
Note that neither (e1,e2) nor (e2,e1) is in rb, because the two operations overlap in time. Neither one happens before the other.
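If we record a start and end time for each event, the returns-before relation can be computed mechanically. Here is a minimal sketch; the timestamps and the third event are made up for illustration, and the dictionary format is my own assumption rather than anything from Burckhardt's book.

```python
def returns_before(events):
    """Compute rb from a dict mapping event name -> (start, end).

    (a, b) is in rb when a returned before b started.
    """
    return {(a, b)
            for a, (_, a_end) in events.items()
            for b, (b_start, _) in events.items()
            if a_end < b_start}

# Hypothetical timestamps: e1 and e2 overlap, e3 starts after both return.
events = {"e1": (1, 4), "e2": (2, 5), "e3": (6, 7)}
rb = returns_before(events)
```

With these timestamps, `rb` contains (e1,e3) and (e2,e3), but neither (e1,e2) nor (e2,e1), because e1 and e2 overlap.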
Same-session
The same-session (ss) relation models the grouping of operations into processes. In the example above, there are three sessions (id0, id1, id2), and the same-session relation looks like this: ss={(e1,e1),(e1,e4),(e4,e1),(e4,e4),(e2,e2),(e3,e3)}. (Note: in this case, there are only two operations that are in the same session, e1 and e4.)
This is what the graph looks like with the returns-before (rb) and same-session (ss) relations shown.
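The same-session relation can be derived from a mapping of events to sessions. In this sketch I'm assuming that e1 and e4 belong to id0, e2 to id1, and e3 to id2. The text only says that e1 and e4 share a session, so the assignment of e2 and e3 is a guess that doesn't affect the result.

```python
def same_session(session_of):
    """Compute ss from a dict mapping event name -> session id."""
    return {(a, b)
            for a in session_of
            for b in session_of
            if session_of[a] == session_of[b]}

# Assumed assignment of events to sessions.
session_of = {"e1": "id0", "e2": "id1", "e3": "id2", "e4": "id0"}
ss = same_session(session_of)
```

Note that every event is in the same session as itself, which is why pairs like (e2,e2) show up in the relation.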
Explaining executions with visibility and arbitration
Here’s the idea behind Burckhardt’s approach. He defines consistency properties in terms of the returns-before (rb) relation, the same-session (ss) relation, and two other binary relations called visibility (vis) and arbitration (ar).
For example, an execution history satisfies read my writes if: (rb ∩ ss) ⊆ vis
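Because the relations are just sets of pairs, this condition is close to a one-liner with Python sets. The sketch below applies it to a two-event session like the earlier read-your-writes example, where e1 is the add and e2 is the get that follows it; the event names and function name are my own.

```python
def satisfies_read_my_writes(rb, ss, vis):
    """Check (rb ∩ ss) ⊆ vis using set intersection and subset."""
    return (rb & ss) <= vis

# One session: e1 = add("A: Hello"), then e2 = get().
rb = {("e1", "e2")}
ss = {("e1", "e1"), ("e1", "e2"), ("e2", "e1"), ("e2", "e2")}

vis_write_hidden = set()            # explains get() -> [], but breaks the property
vis_write_seen = {("e1", "e2")}     # satisfies the property
```

The only way to explain a `get()` that returned an empty list is a visibility relation in which the add is not visible, and that relation fails the check.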
In this scheme, an execution history is correct if we can come up with visibility and arbitration relations for the execution such that:
1. All of the consistency properties we care about are satisfied by our visibility and arbitration relations.
2. Our visibility and arbitration relations don’t violate any of our intuitions about causality.
You can think of coming up with visibility and arbitration relations for a history as coming up with an explanation for how the history makes sense. It’s a generalization of the process we used for linearizability where we picked a specific point in time where the operation took effect.
(1) tells us that we have to pick the right vis and ar (i.e., we have to pick a good explanation). (2) tells us that we don’t have complete freedom in picking vis and ar (i.e., our explanations have to make intuitive sense to human beings).
You can think of the visibility relation as capturing which write operations were visible to a read, and the arbitration relation as capturing how the data structure should reconcile conflicting writes.
Specifying behavior based on visibility and arbitration
Unfortunately, in a distributed world, we can no longer use the sequential specification for determining correct behavior. In the sequential world, writes are always totally ordered, but in the distributed world, we might have to deal with two different writes that aren’t ordered in a meaningful way.
What’s a valid value for ???. Let’s assume we’ve been told that: vis={(e1,e3),(e2,e3)}. This means that both writes are visible to process 3.
Based on our idea of how this data structure should work, e3 should either be: ["A","B"] or ["B","A"]. But the visibility relationship doesn’t provide enough information to tell us which one of these it was. We need some additional information to determine what the behavior should be.
This is where the arbitration relation comes in. This relation is always a total ordering. (For example, if ar specifies an ordering of e1->e2->e3, then the relation would be {(e1,e2),(e1,e3),(e2,e3)}.)
If we define the behavior of our distributed queue such that the writes should happen in arbitration order, and we set ar=e1->e2->e3, then e3 would have to be get()->["A","B"].
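To make that concrete, here is a rough sketch of how the expected result of a get could be computed from vis and ar: keep only the writes that are visible to the get, and order them by arbitration. The function name and the representation of ar as a list are assumptions for illustration.

```python
def expected_get(reader, writes, vis, ar):
    """Return the list a get() should produce.

    reader: the get event; writes: dict of add event -> value added;
    vis: set of (event, event) pairs; ar: list of events in arbitration order.
    """
    # Only visible writes count, and they are applied in arbitration order.
    return [writes[e] for e in ar if e in writes and (e, reader) in vis]

writes = {"e1": "A", "e2": "B"}
vis = {("e1", "e3"), ("e2", "e3")}

a_then_b = expected_get("e3", writes, vis, ["e1", "e2", "e3"])
b_then_a = expected_get("e3", writes, vis, ["e2", "e1", "e3"])
only_a = expected_get("e3", writes, {("e1", "e3")}, ["e1", "e2", "e3"])
```

The same visibility relation yields different answers under different arbitration orders, and a write that isn't visible is left out even if it comes earlier in ar.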
The above history is valid, because we can do: vis={(e1,e3),(e2,e4),(e3,e4)}, ar=e1->e2->e3->e4. Note that even though (e2,e3) is in ar, e2 is not visible to e3, and an operation only has to reflect the visible writes.
Note that we can come up with valid vis and ar relations for this history:
vis = {(e3,e2)}
ar = e1->e3->e2
But, despite the fact that we can come up with an explanation for this history, it doesn’t make sense to us, because e3 happened after e2. You can see why this is also referred to as session causality, because it violates our sense of causality: we read a write that happened in the future!
This is a great example of one of the differences between programming and formal modeling. It’s impossible to write a non-causal program (i.e., a program whose current output depends on future inputs). On the other hand, in formal modeling, we have no such restrictions, so we can always propose “impossible to actually happen in practice” behaviors to look at. So we often have to place additional constraints on the behaviors we generate with formal models to ensure that they’re actually realizable.
Sometimes we do encounter systems that record history in the wrong order, which makes the history look non-causal.
History is sometimes re-ordered in such a way that it looks like causality has been violated
Consistency as constraints on relations
The elegant thing about this relation-based model of execution histories is that the consistency models can be expressed in terms of them. Burckhardt conveniently defines two more relationships.
Session-order (so) is the ordering of events within each session, expressed as: so = rb ∩ ss
Happens-before (hb) is a causal ordering, in the sense of Lamport’s Time, Clocks, and the Ordering of Events in a Distributed System paper. (e1,e2) is in hb if (e1,e2) is in so (i.e., e1 comes before e2 in the same session), or if (e1,e2) is in vis (i.e., e1 is visible to e2), or if there’s some transitive relationship (e.g., there’s some e3 such that (e1,e3) and (e3,e2) are in so or vis).
Therefore, happens-before is the transitive closure of so ∪ vis, which we write as: hb = (so ∪ vis)⁺ . We can define no circular causality as no cycles in the hb relation or, as Burckhardt writes it: NoCircularCausality = acyclic(hb)
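These definitions can be sketched with plain Python sets. The example below uses a hypothetical two-event session in which a read (e2) is followed by a write (e3), along with a visibility relation that claims the write was visible to the earlier read. The helper names are my own, and the naive closure loop is only meant for small relations.

```python
def transitive_closure(relation):
    """Repeatedly add (a, d) whenever (a, b) and (b, d) are present."""
    closure = set(relation)
    while True:
        new_pairs = {(a, d)
                     for (a, b) in closure
                     for (c, d) in closure
                     if b == c}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs

def happens_before(rb, ss, vis):
    so = rb & ss                      # session order
    return transitive_closure(so | vis)

def no_circular_causality(hb):
    # In a transitively closed relation, a cycle shows up as an event
    # that happens before itself.
    return all(a != b for (a, b) in hb)

# Hypothetical session: e2 (a read) returns before e3 (a write) starts.
rb = {("e2", "e3")}
ss = {("e2", "e2"), ("e2", "e3"), ("e3", "e2"), ("e3", "e3")}

hb_cyclic = happens_before(rb, ss, {("e3", "e2")})   # the read "saw" the later write
hb_ok = happens_before(rb, ss, set())
```

With the write visible to the earlier read, hb contains both (e2,e3) and (e3,e2), so the closure includes (e2,e2) and the acyclicity check fails.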
If you made it all of the way here, I’d encourage you to check out Burckhardt’s Principles of Eventual Consistency book. You can get the PDF for free by clicking the “Publication” button on the web page.
What struck me in the NY Times article was this anecdote about Chang’s search for a job after he failed out of a Ph.D. program at MIT in 1955 (emphasis mine):
Two of the best offers arrived from Ford Motor Company and Sylvania, a lesser-known electronics firm. Ford offered Mr. Chang $479 a month for a job at its research and development center in Detroit. Though charmed by the company’s recruiters, Mr. Chang was surprised to find the offer was $1 less than the $480 a month that Sylvania offered.
When he called Ford to ask for a matching offer, the recruiter, who had previously been kind, turned hostile and told him he would not get a cent more. Mr. Chang took the engineering job with Sylvania. There, he learned about transistors, the microchip’s most basic component.
“That was the start of my semiconductor career,” he said. “In retrospect, it was a damn good thing.”
The course of history changed because an internal recruiter at Ford refused to offer him an additional dollar a month ($11.46 in 2023 dollars) to match a competing offer!
This is the sort of thing that historians call contingency.
Here’s a brief excerpt from a talk by David Woods on what he calls the component substitution fallacy (emphasis mine):
claim of root cause is ex. of component substitution fallacy. All incidents that threaten failure reveal component weaknesses due to finite resources & tradeoffs -> easy to miss the critical systemic/emergent factors see min 25 https://t.co/OsYy2U8fsA
Everybody is continuing to commit the component substitution fallacy.
Now, remember, everything has finite resources, and you have to make trade-offs. You’re under resource pressure, you’re under profitability pressure, you’re under schedule pressure. Those are real, they never go to zero.
So, as you develop things, you make trade offs, you prioritize some things over other things. What that means is that when a problem happens, it will reveal component or subsystem weaknesses. The trade offs and assumptions and resource decisions you made guarantee there are component weaknesses. We can’t afford to perfect all components.
Yes, improving them is great and that can be a lesson afterwards, but if you substitute component weaknesses for the systems-level understanding of what was driving the event … at a more fundamental level of understanding, you’re missing the real lessons.
Seeing component weaknesses is a nice way to block seeing the system properties, especially because this justifies a minimal response and avoids any struggle that systemic changes require.
Whenever an incident happens, we’re always able to point to different components in our system and say “there was the problem!” There was a microservice that didn’t handle a certain type of error gracefully, or there was bad data that had somehow gotten past our validation checks, or a particular cluster was under-resourced because it hadn’t been configured properly, and so on.
These are real issues that manifested as an outage, and they are worth spending the time to identify and follow up on. But these problems in isolation never tell the whole story of how the incident actually happened. As Woods explains in the excerpt of his talk above, because of the constraints we work under, we simply don’t have the time to harden the software we work on to the point where these problems don’t happen anymore. It’s just too expensive. And so, we make tradeoffs, we make judgments about where to best spend our time as we build, test, and roll out our stuff. The riskier we perceive a change, the more effort we’ll spend on validation and rollout of the change.
And so, if we focus only on issues with individual components, there’s so much we miss about the nature of failure in our systems. We miss looking at the unexpected interactions between the components that enabled the failure to happen. We miss how the organization’s prioritization decisions enabled the incident in the first place. We also don’t ask questions like “if we are going to do follow-up work to fix the component problems revealed by this incident, what are the things that we won’t be doing because we’re prioritizing this instead?” or “what new types of unexpected interactions might we be creating by making these changes?” Not to mention incident-handling questions like “how did we figure out something was wrong here?”
In the wake of an incident, if we focus only on the weaknesses of individual components then we won’t see the systemic issues. And it’s the systemic issues that will continue to bite us long after we’ve implemented all of those follow-up action items. We’ll never see the forest for the trees.
Unfortunately, the first few minutes were lost due to technical issues. You’ll just have to take my word for it that the missing part of my talk was truly astounding, a veritable tour de force.
Starting after World War II, the idea was culture is accelerating. Like the idea of an accelerated culture was just central to everything. I feel like I wrote about this in the nineties as a journalist constantly. And the internet seemed like, this is gonna be the ultimate accelerant of this. Like, nothing is going to accelerate the acceleration of culture like this mode of communication. Then when it became ubiquitous, it sort of stopped everything, or made it so difficult to get beyond the present moment in a creative way.
We software developers are infamous for our documentation deficiencies: the eternal lament is that we never write enough stuff down. If you join a new team, you will inevitably discover that, even if some important information is written down, there’s also a lot of important information that is tacit knowledge of the team, passed down as what’s sometimes referred to as tribal lore.
But writing things down has a cost beyond the time and effort required to do the writing: written documents are durable, which means that they’re harder to change. This durability is a strength of documentation, but it’s also a weakness. Writing things down has a tendency to ossify the content, because it’s much more expensive to update than tacit knowledge. Tacit knowledge is much more fluid: it adapts to changing circumstances much more quickly and easily than updating documentation, as anybody who has dealt with out-of-date written procedures can attest to.