Why we will forever suffer from missing timeouts, TTLs, and queue size bounds

If you’ve operated a software service, you will have inevitably hit one of the following problems:

A network call with a missing timeout. Some kind of remote procedure call or other networking call is blocked waiting … forever, because there’s no timeout configured on the call.

Missing time-to-live (TTL). Some data that was intended to be ephemeral did not explicitly have a TTL set on it, and it didn’t get removed by normal means, and so its unexpected presence bit you.

A queue with no explicit size limit. A queue somewhere doesn’t have an explicitly configured upper bound on its size, and somehow the producers are consistently outnumbering the consumers, and then your queue eventually grows to some size that you never expected to happen.

Unfortunately, the only good solution to these problems is diligence. We have to remember to explicitly set timeouts, TTLs, and queue sizes. there are two reasons:

It’s impossible for a library author to define a reasonable default for these values. Appropriate timeouts, TTLs, and queue sizes will vary enormously from one use case to another, there simply isn’t a “reasonable” value to pick without picking one so large that it’s basically the same as being unbounded.

Forcing users to always specify values is a lousy user experience for new users. Library authors could make these values required rather than optional. However, this makes it more annoying for new users of these libraries, it’s an extra step that forces them to make a decision they don’t really want to think about. They probably don’t even know what a reasonable value is when they’re first setting out.

I think forcing users to specify these limits would lead to more robust software, but I can see many users complaining about being forced to set these limits rather than defaulting to infinity. At least, that’s my guess about why library authors don’t do it.

The Equifax breach report

The House Oversight and Government Reform Committee released a report on the big Equifax data breach that happened last year. In a nutshell, a legacy application called ACIS contained a known vulnerability that attackers used to gain access to internal Equifax databases.

A very brief timeline is:

Day 0: (3/7/17) Apache Struts vulnerability CVE-2017-5638 is publicly announced
Day 1: (3/8/17) US-CERT sends an alert to Equifax about the vulnerability
Day 2: (3/9/17) Equifax’s Global Threat and Vulnerability Management (GTVM) team posts to an internal mailing list about the vulnerability and requests that app owners should patch within 48 hours
Day 37: (4/13/17) Attackers exploit the vulnerability in the ACIS app

The report itself is… frustrating. There is some good content here. The report lays out multiple factors that enabled the breach, including:

A scanner that was run but missed the vulnerable app because of the directory that the scan ran in
An expired SSL certificate that prevented Equifax from detecting malicious activity
The legacy nature of the vulnerable application (originally implemented in the 1970s)
A complex IT environment that was the product of multiple acquisitions.
An organizational structure where the chief security officer and the chief information officer were in separate reporting structures.

The last bullet, about the unconventional reporting structure for the chief security officer, along with the history of that structure, was particularly insightful. It would have been easy to leave out this sort of detail in a report like this.

On the other hand, the report exhibits some weapons-grade hindsight bias. To wit:

Equifax, however, failed to implement an adequate security program to protect this sensitive data. As a result, Equifax allowed one of the largest data breaches in U.S. history. Such a breach was entirely preventable.
…
Equifax failed to fully appreciate and mitigate its cybersecurity risks. Had the company taken action to address its observable security issues prior to this cyberattack, the data breach could have been prevented.
Page 4

Equifax knew its patch management process was ineffective.501 The 2015 Patch Management Audit concluded “vulnerabilities were not remediated in a timely manner,” and “systems were not patched in a timely manner.” In short, Equifax recognized the patching process was not being properly implemented, but failed to take timely corrective action.
Page 80

The report highlights a number of issues that, if they had been addressed, would have prevented or mitigated the breach, including:

Lack of a clear owner of the vulnerable application. An email went out announcing the vulnerability, but nobody took action to patch the vulnerable app.

Lack of a comprehensive asset inventory. The company did not have a database where that they could query to check if any published vulnerabilities applied to any applications in use.

Lack of network segmentation in the environment where the vulnerable app ran. The vulnerable app ran a network that was not segmenting from unrelated databases. Once the app was compromised, it was used as a vector to reach these other databases.

Lack of integrity file monitoring (FIM). FIM could have detected malicious activity, but it wasn’t in place.

Not prioritizing retiring the legacy system. This one is my favorite. From the report: “Equifax knew about the security risks inherent in its legacy IT systems, but failed to prioritize security and modernization for the ACIS environment”.

Use of NFS. The vulnerable system had an NFS mount, that allowed the attackers to access a number of files.

Frustratingly, the report does not go into any detail about how the system got into this state. It simply lays them out like an indictment for criminal negligence. Look at all of these deficiencies! They should have known better! Even worse, they did know better and didn’t act!

In particular, the report doesn’t dig enough into the communication breakdown that resulted in ACIS not being patched. Here’s an exchange that gets close:

To determine who was responsible for applying the Apache Struts patch to the ACIS system, the Committee asked [former Senior Vice President and Chief Information Officer for Global Corporate Platforms Graeme] Payne to identify employees by the roles listed within the Patch Management Policy. Specifically, the Committee asked him to identify the business owner, system owner, and application owner responsible for the ACIS system. Payne testified:
Q. So the application owner for ACIS would have been who or what organization?
A. So I don’t believe there was any explicit designation of application owners. If you ask me who I think the application owner would be, I can probably answer that.
Q. That would be good.
A. So I believe – in my view, the application owner for ACIS – for the online dispute portal component because that was a component – was [Equifax IT Employee 1] and probably also [Equifax IT Employee 2]. So again, I don’t believe there were any specific designations, so these would be – if someone asked me, “Who do you think they would be?” that would probably be the two people I would look at.
**

Q. So would they have been the people that should have received the GTVM email saying you need to patch?
A. Yes, as well as the system owner.
Q. Okay. Who’s the system owner?
A. So again, those people weren’t designated. So I can –
Q. Tell me who you think?
A. My guess would be that the system owner would be someone in the infrastructure group probably under [Equifax IT Employee 3], since…as part of the global platform services group, his team ran the sort of the server operations
***
Q. If you look at the definition . . . it says: System owner is responsible for applying patch to electronic assets.
So would it be the case that [Equifax IT Employee 3] would have been the one responsible for actually applying the patch to ACIS?
A. Possibly. Again, we are talking at a level that I wasn’t involved in, so I can’t talk specifically about…who actually
Pages 65-66

Alas, the committee doesn’t seem to have interviewed any of the ICs referenced as Equifax IT Employees 1-3. How did they understand how ownership works? In addition, there’s also no context here about how ownership generally works inside of Equifax. Was ACIS a special case, or was it typical?

There was also a theme that anyone who was worked in a software project would recognize:

[Former Chief Security Officer Susan] Mauldin stated Equifax was in the process of making the ACIS application Payment Card Industry (PCI) Data Security Standard (DSS) compliant when the data breach occurred.
…
Mauldin testified the PCI DSS implementation “plan fell behind and these items did not get addressed.” She stated:

A. The PCI preparation started about a year before, but it’s very complex. It was a very complex – very complex environment.

Q. year before, you mean August 2016?
A. Yes, in that timeframe.
Q. And it was scheduled to be complete by August 2017?
A. Right.
Q. But it fell behind?
A. It fell behind.
Q. Do you know why?
A. Well, what I recall from the application team is that it was very complicated, and they were having – it just took a lot longer to make the changes than they thought. And so they just were not able to get everything ready in time.
Pages 80-81

And, along the same lines:

So there were definitely risks associated with the ACIS environment that we were trying to remediate and that’s why we were doing the CCMS upgrade.
…
It was just – it was time consuming, it was risky . . . and also we were lucky that we still had the original developers of the system on staff.

So all of those were risks that I was concerned about when I came into this role. And security was probably also a risk, but it wasn’t the primary driver. The primary driver was to get off the old system because it was just hard to manage and maintain.
Graeme Payne, former Senior Vice President and Chief Information Officer for Global Corporate Platforms, page 82

Good luck finding a successful company that doesn’t face similar issues.

Finally, in a beautiful example of scapegoating, there’s the Senior VP that Equifax fired, ostensibly for failing to forward an email that had already been sent to an internal mailing list. In the scapegoat’s own words:

To assert that a senior vice president in the organization should be forwarding vulnerability alert information to people . . . sort of three or four layers down in the organization on every alert just doesn’t hold water, doesn’t make any sense. If that’s the process that the company has to rely on, then that’s a problem.
Graeme Payne, former Senior Vice President and Chief Information Officer for Global Corporate Platforms, page 51

Cloud software, fragility and Air France 447

The Human Factor by William Langewiesche is an account of how Air France Flight 447 crashed back in 2009. It’s a great piece of long-form journalism.

[Recently deceased engineer Earl] Wiener pointed out that the effect of automation is to reduce the cockpit workload when the workload is low and to increase it when the workload is high.

That’s basically Nassim Nicholas Taleb’s definition of fragility: where the potential downside is much greater than the potential upside.

But this is what really struck me:

[U. Michigan Industrial & Systems Engineering Professor Nadine] Sarter said, “… Complexity means you have a large number of subcomponents and they interact in sometimes unexpected ways. Pilots don’t know, because they haven’t experienced the fringe conditions that are built into the system. I was once in a room with five engineers who had been involved in building a particular airplane, and I started asking, ‘Well, how does this or that work?’ And they could not agree on the answers. So I was thinking, If these five engineers cannot agree, the poor pilot, if he ever encounters that particular situation … well, good luck.””

We also build software by connecting together different subcomponents. We know that the hardware underlying these subcomponents can fail, and in order to provide reliability for our customers we must be able to deal with these failures. If we’re sophisticated, we turn to automation: we try to build fault-tolerant systems that can automatically detect failures and compensate for them.

But the behavior of these types of systems is difficult to reason about, precisely because they take action automatically based on self-monitoring. These systems can handle hard failures of individual components: when something has obviously failed. But if it’s a soft failure, where the component is still partially working, or if two independent components fail simultaneously, then the system may not be able to handle it. Adding automation to handle failures introduces new failure modes even as it eliminates old ones, and the new ones may be much harder for an operator to understand.

Consider the AWS outage of October 2012, which took down multiple websites that deploy on Amazon’s cloud. The outage was a result of:

1. A hardware failure on a data collection server
2. A DNS update not propagating to all internal DNS servers
3. A memory leak bug in an agent that monitors the health of EBS (storage) servers

The agents collect operational data from the storage servers and transfer it to data collection servers. Some agents couldn’t reach the collection servers because of a stale DNS entry. Because the agents also had a memory leak bug, they consumed too much memory on the storage servers.

Lack of sufficient memory is a good way to create a soft failure: the component still works, but in a degraded fashion. These memory-poor servers couldn’t keep up with requests, and eventually became stuck. The system detected the stuck servers (a hard failure), and failed over to healthy servers, but there were so many stuck servers that the healthy servers couldn’t keep up, and also became stuck.

The irony is that this failure is entirely due to the monitoring subsystem, which is intended to increase system reliability! It happened because of the co-incidence of a hardware failure in one component of the monitoring subsystem (a data collection server) and a software bug in another component of the monitoring subsystem (data collection agent).

We can’t avoid automation. In the case of the airlines, the automation has significantly increased safety, even though it increases the chances of incidents like Air France 447. For cloud software, this type of fault-tolerance automation does result in more reliable systems.

But, while these systems mean failure is less likely, when it does happen, it’s much more difficult to understand what’s happening. The future of cloud software is systems that fail much less often, but much harder. Buckle up.