SLAs Are A Crock
December 29, 2012

Posted by Peter Varhol in Software platforms, Strategy.

We would like to think that automated systems are predictable.  That assumption carries a number of desirable consequences, the foremost of which is that we can predict how long the system will stay fully functional, and how long any failures may last.

The problem is that automated systems are complex.  I realized this when Matt Heusser recently tweeted, “If Amazon can’t fulfill its SLAs, how can we?”  The answer is that we can’t.  We couldn’t do it with relatively simple pre-digital manual systems, and we certainly can’t do it with complex automated ones.  We can look at each component of the system – multiple servers and maybe large server farms, storage and other peripherals, network, OS, applications, database, power, environmental controls, and so on – and assign a probability of failure to each.
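
Even granting full independence, the arithmetic is sobering.  Here is a minimal sketch in Python – the component list and the availability figures are made-up illustrations, not measurements of any real system:

    # Naive model: every component is independent, and all must be up.
    components = {
        "servers": 0.999,
        "storage": 0.999,
        "network": 0.995,
        "os": 0.999,
        "applications": 0.99,
        "database": 0.995,
        "power": 0.9999,
        "environment": 0.9995,
    }

    system_availability = 1.0
    for availability in components.values():
        system_availability *= availability

    print(f"system availability: {system_availability:.4f}")
    # Prints roughly 0.9766 -- about 8.5 days of downtime per year,
    # even though every individual component looks highly reliable.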

But the problem is that we treat them as independent at best, and ignore their interactions at worst.  We understand that the individual parts of the system have high reliability, but we fail to understand their interrelationships.  We think that the failure rate of each component is somehow analogous to the failure rate of the system as a whole – as the sketch above shows, it isn’t.  Or we think that the individual components are largely independent of each other.  The latter is a more subtle fallacy, but it’s still a fallacy.

SLAs (Service Level Agreements, to the uninitiated) offer contractual guarantees to users of IT services on uptime and availability.  They usually promise over 99 percent uptime, and resolution of any problem within a certain limited amount of time.  But they are based on the reliability of the portion of the system within the provider’s control, rather than on the full system.  Or they are based on a misunderstanding of the risk of failure.
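
It is worth translating those nines into clock time.  A quick sketch of the arithmetic, assuming a 365-day year:

    # Convert a promised uptime percentage into downtime per year.
    HOURS_PER_YEAR = 365 * 24

    for label, uptime in [("99%", 0.99), ("99.9%", 0.999),
                          ("99.99%", 0.9999), ("99.999%", 0.99999)]:
        downtime_hours = (1 - uptime) * HOURS_PER_YEAR
        print(f"{label:>7} uptime allows {downtime_hours:6.2f} hours down per year")

    # 99%    allows  87.60 hours (more than three and a half days)
    # 99.9%  allows   8.76 hours
    # 99.99% allows   0.88 hours (about 53 minutes)

“Over 99 percent” sounds impressive until you notice how many hours of outage per year it still leaves room for.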

The worst thing is the Black Swan event, as Heusser notes.  The Black Swan is the purported once-in-a-lifetime event that surprises everyone when it occurs, either because it is considered very rare or because no one thought of it.  The meltdown of the Fukushima Nuclear Power Plant in Japan as a result of the undersea earthquake and tsunami is a well-known example of a Black Swan event.  It required a complex series of circumstances that seemed obvious only in retrospect.

We tend to think incorrectly when applying the Black Swan idea to complex systems.  We believe that the sequence of events that caused the unexpected failure is very rare, when in fact the system itself is complex and has multiple unknown and untested failure points.

In IT, we tend to counter rare events through redundancy.  We connect our systems to separate network segments, for example, and are then surprised when snow collapses the roof of the data center.  The physical plant is part of the system, but we simply treat it as a given, when it’s not.  I’ll discuss how we might test such systems in a later post.
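
Here is a back-of-the-envelope sketch of that trap, with invented numbers: each additional redundant network segment buys less and less, and the common-cause event – the roof collapse – caps availability no matter how many segments you add.

    # Redundancy versus a common-cause failure.  The per-segment
    # availability and the roof-collapse probability are invented.
    segment_availability = 0.99   # each independent network segment
    common_cause = 0.001          # e.g., the data center roof collapsing

    for n in range(1, 6):
        # All n independent segments must fail at once...
        independent_outage = (1 - segment_availability) ** n
        # ...but the common-cause event takes everything out regardless.
        total = (1 - common_cause) * (1 - independent_outage)
        print(f"{n} segment(s): availability {total:.6f}")

    # Availability plateaus just below 0.999: once the shared physical
    # plant dominates, more redundant segments buy essentially nothing.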

So the next time someone asks you to take their SLA bet, go ahead.  You will win sooner or later.

UPDATE:  Matt Heusser has his own post on this topic here.