TLA+ is hard to learn

I’m a fan of the formal specification language TLA+. With TLA+, you can build models of programs or systems, which helps to reason about their behavior.

TLA+ is particularly useful for reasoning about the behavior of multithread programs and distributed systems. By requiring you to specify behavior explicitly, it forces you to think about interleaving of events that you might not otherwise have considered.

The user base of TLA+ is quite small. I think one of the reasons that TLA+ isn’t very popular is that it’s difficult to learn. I think there are at least three concepts you need for TLA+ that give new users trouble:

  • The universe as a state machine
  • Modeling programs and systems with math
  • Mathematical syntax

The universe as state machine

TLA+ uses a state machine model. It treats the universe as a collection of variables whose values vary over time.

A state machine in sense that TLA+ uses it is similar, but not exactly the same, as the finite state machines that software engineers are used to. In particular:

  • A state machine in the TLA+ sense can have an infinite number of states.
  • When software engineers think about state machines, they think about a specific object or component being implemented as a finite state machine. In TLA+, everything is modeled as a state machine.

The state machine view of systems will feel familiar if you have a background in physics, because physicists use the same approach for system modeling: they define a state variable that evolves over time. If you squint, a TLA+ specification looks identical to a system of first-order differential equations, and associated boundary conditions. But, for the average software engineer, the notion of an entire system as an evolving state variable is a new way of thinking.

The state machine approach requires a set of concepts that you need to understand. In particular, you need to understand behaviors, which requires that you understand statessteps, and actions. Steps can stutter, and actions may or may not be enabled. For example, here’s the definition of “enabled” (I’m writing this from memory):

An action a is enabled for a state s if there exists a state t such that a is true for the step s→t.

It took me a long time to internalize these concepts to the point where I could just write that out without consulting a source. For a newcomer, who wants to get up and running as quickly as possible, each new concept that requires effort to understand decreases the likelihood of adoption.

Modeling programs and systems with math

One of the commonalities across engineering disciplines is that they all work with mathematical models. These models are abstractions, objects that are simplified versions of the artifacts that we intend to build. That’s one of the thing that attracts me about TLA+, it’s modeling for software engineering.

A mechanical engineer is never going to confuse the finite element model they have constructed on a computer with the physical artifact that they are building. Unfortunately, we software engineers aren’t so lucky. Our models superficially resemble the artifacts we build (a TLA+ model and a computer program both look like source code!). But models aren’t programs: a model is a completely different beast, and that trips people up.

Here’s a metaphor: You can think of writing a program as akin to painting, in that both are additive work: You start with nothing and you do work by adding content (statements to your program, paint to a canvas).

The simplest program, equivalent to an empty canvas, is one that doesn’t do anything at all. On Unix systems, there’s a program called true which does nothing but terminate successfully. You can implement this in shell as an empty file. (Incidentally, AT&T has copyrighted this implementation).

By contrast, when you implement a model, you do the work by adding constraints on the behavior of the state variables. It’s more like sculpting, where you start with everything, and then you chip away at it until you end up with what you want.

The simplest model, the one with no constraints at all, allows all possible behaviors. Where the simplest computer program does nothing, the simplest model does (really allows) everything. The work of modeling is adding constraints to the possible behaviors such that the model only describes the behaviors we are interested in.

When we write ordinary programs, the only kind of mistake we can really make is a bug, writing a program that doesn’t do what it’s supposed to. When we write a model of a program, we can also make that kind of mistake. But, we can make another kind of mistake, where our model allows some behavior that would never actually happen in the real world, or isn’t even physically possible in the real world.

Engineers and physicists understand this kind of mistake, where a mathematical model permits a behavior that isn’t possible in the real world. For example, electrical engineers talk about causal filters, which are filters whose outputs only depend on the past and present, not the future. You might ask why you even need a word for this, since it’s not possible to build a non-causal physical device. But it’s possible, and even useful, to describe non-causal filters mathematically. And, indeed, it turns out that filters that perfectly block out a range of frequencies, are not causal.

For a new TLA+ user who doesn’t understand the distinction between models and programs, this kind of mistake is inconceivable, since it can’t happen when writing a regular program. Creating non-causal specifications (the software folks use the term “machine-closed” instead of “causal”) is not a typical error for new users, but underspecifying the behavior some variable of interest is very common.

Mathematical syntax

Many elements of TLA+ are taken directly from mathematics and logic. For software engineers used to programming language syntax, these can be confusing at first. If you haven’t studied predicate logic before, the universal (∀) and extensional (∃) quantifiers will be new.

I don’t think TLA+’s syntax, by itself, is a significant obstacle to adoption: software engineers pick up new languages with unfamiliar syntax all of the time. The real difficulty is in understanding TLA+’s notion of a state machine, and that modeling is describing a computer program as permitted behaviors of a state machine. The new syntax is just one more hurdle.

Why we will forever suffer from missing timeouts, TTLs, and queue size bounds

If you’ve operated a software service, you will have inevitably hit one of the following problems:

A network call with a missing timeout.  Some kind of remote procedure call or other networking call is blocked waiting … forever, because there’s no timeout configured on the call.

Missing time-to-live (TTL). Some data that was intended to be ephemeral did not explicitly have a TTL set on it, and it didn’t get removed by normal means, and so its unexpected presence bit you.

A queue with no explicit size limit. A queue somewhere doesn’t have an explicitly configured upper bound on its size, and somehow the producers are consistently outnumbering the consumers, and then your queue eventually grows to some size that you never expected to happen.

Unfortunately, the only good solution to these problems is diligence. We have to remember to explicitly set timeouts, TTLs, and queue sizes. there are two reasons:

It’s impossible for a library author to define a reasonable default for these values. Appropriate timeouts, TTLs, and queue sizes will vary enormously from one use case to another, there simply isn’t a “reasonable” value to pick without picking one so large that it’s basically the same as being unbounded.

Forcing users to always specify values is a lousy user experience for new users. Library authors could make these values required rather than optional. However, this makes it more annoying for new users of these libraries, it’s an extra step that forces them to make a decision they don’t really want to think about. They probably don’t even know what a reasonable value is when they’re first setting out.

I think forcing users to specify these limits would lead to more robust software, but I can see many users complaining about being forced to set these limits rather than defaulting to infinity. At least, that’s my guess about why library authors don’t do it. 

The Equifax breach report

The House Oversight and Government Reform Committee released a report on the big Equifax data breach that happened last year. In a nutshell, a legacy application called ACIS contained a known vulnerability that attackers used to gain access to internal Equifax databases.

A very brief timeline is:

  • Day 0: (3/7/17) Apache Struts vulnerability CVE-2017-5638 is publicly announced
  • Day 1: (3/8/17) US-CERT sends an alert to Equifax about the vulnerability
  • Day 2: (3/9/17) Equifax’s Global Threat and Vulnerability Management (GTVM) team posts to an internal mailing list about the vulnerability and requests that app owners should patch within 48 hours
  • Day 37: (4/13/17) Attackers exploit the vulnerability in the ACIS app

The report itself is… frustrating. There is some good content here. The report lays out multiple factors that enabled the breach, including:

  • A scanner that was run but missed the vulnerable app because of the directory that the scan ran in
  • An expired SSL certificate that prevented Equifax from detecting malicious activity
  • The legacy nature of the vulnerable application (originally implemented in the 1970s)
  • A complex IT environment that was the product of multiple acquisitions.
  • An organizational structure where the chief security officer and the chief information officer were in separate reporting structures.

The last bullet, about the unconventional reporting structure for the chief security officer, along with the history of that structure, was particularly insightful. It would have been easy to leave out this sort of detail in a report like this.

On the other hand, the report exhibits some weapons-grade hindsight bias. To wit:

 Equifax, however, failed to implement an adequate security program to protect this sensitive data. As a result, Equifax allowed one of the largest data breaches in U.S. history. Such a breach was entirely preventable.

Equifax failed to fully appreciate and mitigate its cybersecurity risks. Had the company taken action to address its observable security issues prior to this cyberattack, the data breach could have been prevented.

Page 4

Equifax knew its patch management process was ineffective.501 The 2015 Patch Management Audit concluded “vulnerabilities were not remediated in a timely manner,” and “systems were not patched in a timely manner.” In short, Equifax recognized the patching process was not being properly implemented, but failed to take timely corrective action.

Page 80

The report highlights a number of issues that, if they had been addressed, would have prevented or mitigated the breach, including:

Lack of a clear owner of the vulnerable application. An email went out announcing the vulnerability, but nobody took action to patch the vulnerable app.

Lack of a comprehensive asset inventory. The company did not have a database where that they could query to check if any published vulnerabilities applied to any applications in use.

Lack of network segmentation in the environment where the vulnerable app ran. The vulnerable app ran a network that was not segmenting from unrelated databases. Once the app was compromised, it was used as a vector to reach these other databases.

Lack of integrity file monitoring (FIM). FIM could have detected malicious activity, but it wasn’t in place. 

Not prioritizing retiring the legacy system. This one is my favorite. From the report: “Equifax knew about the security risks inherent in its legacy IT systems, but failed to prioritize security and modernization for the ACIS environment”.

Use of NFS. The vulnerable system had an NFS mount, that allowed the attackers to access a number of files.

Frustratingly, the report does not go into any detail about how the system got into this state.  It simply lays them out like an indictment for criminal negligence. Look at all of these deficiencies! They should have known better! Even worse, they did know better and didn’t act!

In particular, the report doesn’t dig enough into the communication breakdown that resulted in ACIS not being patched. Here’s an exchange that gets close:

To determine who was responsible for applying the Apache Struts patch to the ACIS system, the Committee asked [former Senior Vice President and Chief Information Officer for Global Corporate Platforms Graeme] Payne to identify employees by the roles listed within the Patch Management Policy. Specifically, the Committee asked him to identify the business owner, system owner, and application owner responsible for the ACIS system. Payne testified:

Q. So the application owner for ACIS would have been who or what organization?
A. So I don’t believe there was any explicit designation of application owners. If you ask me who I think the application owner would be, I can probably answer that.

Q. That would be good.
A. So I believe – in my view, the application owner for ACIS – for the online dispute portal component because that was a component – was [Equifax IT Employee 1] and probably also [Equifax IT Employee 2]. So again, I don’t believe there were any specific designations, so these would be – if someone asked me, “Who do you think they would be?” that would probably be the two people I would look at.

**

Q. So would they have been the people that should have received the GTVM email saying you need to patch?
A. Yes, as well as the system owner.

Q. Okay. Who’s the system owner?
A. So again, those people weren’t designated. So I can –

Q. Tell me who you think?
A. My guess would be that the system owner would be someone in the infrastructure group probably under [Equifax IT Employee 3], since…as part of the global platform services group, his team ran the sort of the server operations

***

Q. If you look at the definition . . . it says: System owner is responsible for applying patch to electronic assets.

So would it be the case that [Equifax IT Employee 3] would have been the one responsible for actually applying the patch to ACIS?
A. Possibly. Again, we are talking at a level that I wasn’t involved in, so I can’t talk specifically about…who actually

Pages 65-66

Alas, the committee doesn’t seem to have interviewed any of the ICs referenced as Equifax IT Employees 1-3. How did they understand how ownership works? In addition, there’s also no context here about how ownership generally works inside of Equifax. Was ACIS a special case, or was it typical? 

There was also a theme that anyone who was worked in a software project would recognize:

[Former Chief Security Officer Susan] Mauldin stated Equifax was in the process of making the ACIS application Payment Card Industry (PCI) Data Security Standard (DSS) compliant when the data breach occurred.

Mauldin testified the PCI DSS implementation “plan fell behind and these items did not get addressed.” She stated:

A. The PCI preparation started about a year before, but it’s very complex. It was a very complex – very complex environment.

Q. year before, you mean August 2016?

A. Yes, in that timeframe.

Q. And it was scheduled to be complete by August 2017?

A. Right.

Q. But it fell behind?

A. It fell behind.

Q. Do you know why?

A. Well, what I recall from the application team is that it was very complicated, and they were having – it just took a lot longer to make the changes than they thought. And so they just were not able to get everything ready in time.

Pages 80-81

And, along the same lines:

So there were definitely risks associated with the ACIS environment that we were trying to remediate and that’s why we were doing the CCMS upgrade.

It was just – it was time consuming, it was risky . . . and also we were lucky that we still had the original developers of the system on staff.


So all of those were risks that I was concerned about when I came into this role. And security was probably also a risk, but it wasn’t the primary driver. The primary driver was to get off the old system because it was just hard to manage and maintain.

Graeme Payne, former Senior Vice President and Chief Information Officer for Global Corporate Platforms, page  82

Good luck finding a successful company that doesn’t face similar issues.

Finally, in a beautiful example of scapegoating, there’s the Senior VP that Equifax fired, ostensibly for failing to forward an email that had already been sent to an internal mailing list. In the scapegoat’s own words:

To assert that a senior vice president in the organization should be forwarding vulnerability alert information to people . . . sort of three or four layers down in the organization on every alert just doesn’t hold water, doesn’t make any sense. If that’s the process that the company has to rely on, then that’s a problem.

Graeme Payne, former Senior Vice President and Chief Information Officer for Global Corporate Platforms, page 51

Death of a pledge as systems failure

Caitlin Flanagan has written a fantastic and disturbing piece for the Atlantic entitled Death at a Penn State Fraternity.

This line really jumped out at me:

Fraternities do have a zero-tolerance policy regarding hazing. And that’s probably one of the reasons Tim Piazza is dead.

The official policy of the fraternities is that hazing is forbidden. Because this is the official policy, it is the individuals in a particular frat house that are held responsible if hazing happens, not the national fraternity organization.

This policy has had the effect of insulating the organizations from being liable, but it hasn’t stopped hazing from being widespread: according to Flanagan, 80% of fraternity members report being hazed.

Because individual fraternity members are the ones that are on the hook if something goes wrong during hazing, reporting an injury carries risk, which means the member must make a decision involving a tradeoff. In the case documented above, that tradeoff led to a nineteen year old dying of his injuries.

This example really reinforces ideas around systems thinking: the introduction of the zero-tolerance policy did not have the intended effect. Because the culture of hazing remains, the policy ended up making things worse.

Assumption of rationality

Matthew Reed wrote a post about Lisa Servon’s book “The Unbanking of America”. This comment stood out for me (emphasis mine):

By treating her various sources as intelligent people responding rationally to their circumstances, rather than as helpless victims of evil predators, [Servon] was able to stitch together a pretty good argument for why people make the choices they make.

In its approach, it reminded me a little of Tressie McMillan Cottom’s “Lower Ed” or Matthew Desmond’s “Evicted.”  In their different ways, each book addresses a policy question that is usually framed in terms of smart, crafty, evil people taking advantage of clueless, ignorant, poor people, and blows up the assumption.  In no case are predators let off the hook, but the “prey” are actually (mostly) capable and intelligent people doing the best they can.  Understanding why this is the best they can do, and what would give them better options, leads to a very different set of prescriptions.

 

Sidney Dekker calls this perspective the local rationality principle. It assumes that people make decisions that are reasonable given the constraints that they are working within, even though from the outside those decisions appear misguided.

I find this assumption of rationality to be a useful frame for explaining individual behavior. It’s worth putting in the effort to identify why a particular decision would have seemed rational within the context in which it was made.

A conjecture on why reliable systems fail

(Some of my co-workers call this Lorin’s Law)

Even highly reliable systems go down occasionally. After having read over the details of several incidents, I’ve started to notice a pattern, which has led me to the following conjecture:

Once a system reaches a certain level of reliability, most major incidents will involve:

  • A manual intervention that was intended to mitigate a minor incident, or
  • Unexpected behavior of a subsystem whose primary purpose was to improve reliability

Here are three examples from Amazon’s post-mortem write-ups of major AWS outages:

The S3 outage on February 28, 2017 involved a manual intervention to debug an issue that was causing the S3 billing system to progress more slowly than expected.

The DynamoDB outage on September 20, 2015 (which also affected SQS, auto scaling, and CloudWatch) involved healthy storage servers taking themselves out of service by executing a distributed protocol that was (presumably) designed that way for fault tolerance.

The EBS outage on October 22, 2012 (which also affected EC2, RDS, and ELBs) involved a memory leak bug in an agent that monitors the health of EBS servers.

Changing a complex system is hard

I’ve been reading Drift into Failure by Sidney Dekker, and it’s a fantastic book about applying systems thinking to understand how complex systems fail. One of the things that a systems thinking perspective teaches you is that we don’t know how a complex system will respond to a change in inputs. In particular, when you try an intervention that is intended to improve the system, it might make things worse.

There’s a great example of this phenomenon in a recent paper by Alpert et al. entitled Supply-Side Drug Policy in the Presence of Substitutes: Evidence from the Introduction of Abuse-Deterrent Opioids, that was mentioned on a recent episode of Vox’s The Weeds podcast.

The paper examined the impact of introducing an abuse-deterrent version of OxyContin on drug abuse. The FDA approved OxyContin pills that were more difficult to crush or dissolve. These pills were expected to act as a deterrent to abuse since addicts tend to consume the drug by chewing, snorting, or injecting it to increase its impact. The authors examined the rates of OxyContin abuse before and after these new abuse-deterrent drugs were introduced in different states.

What the authors found was that this intervention did have the effect of decreasing OxyContin abuse. Unfortunately, it also increased heroin-related deaths. The unexpected effect was that addicts  substituted one form of opiate for a more dangerous one.

The only way for us to know that our interventions have the desired effect on a complex system is to try and measure that effect.