Our brittle serverless future

I’m really enjoying David Woods’s Resilience Engineering short course videos. In Lecture 9, Woods mentions an important ingredient in a resilient system: the ability to monitor how hard you are working to stay in control of the system.

I was thinking about this observation in the context of serverless computing. In serverless, software engineers offload the responsibility of resource management to a third-party organization, which handles it transparently for them. No more thinking in terms of servers, instance types, CPU utilization and memory usage!

The challenge is this: as a customer of a serverless provider, you don’t have visibility into how hard the provider is working to stay in control. If the underlying infrastructure is nearing some limit (e.g., the amount of incoming traffic it can handle), or if it’s operating in a degraded mode because of an internal failure, these challenges are invisible to you.

Woods calls this phenomenon the veil of fluency. From the customer’s perspective, everything is fine. Your SLOs are all still being met! However, from the provider’s perspective, the system may be very close to the boundary, the point where it falls over.

Woods also talks about the importance of reciprocity in resilient organizations: how different units of adaptive behavior synchronize effectively when a crunch happens and one of them comes under pressure. In a serverless environment, you lose reciprocity because there’s a hard boundary between the serverless provider and the customer. If your system is deployed in a serverless environment, and a major incident happens where the serverless system is a contributing factor, nobody from your serverless provider is going to be in the Slack channel or on the conference bridge.

I think Simon Wardley is correct in his prediction that serverless is the future of software deployment. The tools are still immature today, but they’ll get there. And systems built on serverless will likely be more robust, because the providers will have more expertise in resource management and fault tolerance than their customers do.

But every system eventually reaches its limit. One day a large-scale serverless-based software system is going to go past the limit of what it can handle. And when it breaks, I think that, from the customer’s perspective, it’s going to break quickly and without warning. And you won’t be able to coordinate with the engineers at your serverless provider to bring the system back into a good state, because all you’ll have is a set of APIs.
