
Testing and the Black Swan January 6, 2013

Posted by Peter Varhol in Software platforms.

I noted in a post a few days ago that complex systems are subject to what are known as Black Swan events. We think of Black Swan events as very rare, requiring a complex set of circumstances to occur in order for a disaster to happen.

That’s fallacious reasoning. A Black Swan event happens not because of an unusual sequence of events, but because the corresponding system is complex and has multiple unknown points of failure. It’s not out of our control, just out of our conception. And the events are “fat tail” ones: their distribution is not a traditional Gaussian (normal) curve but a flattened one, with far more probability in the tails.
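To put rough numbers on the “fat tail” idea, here is a small sketch comparing how much probability sits far from the center under a normal curve versus a fat-tailed one. The choice of the Cauchy distribution is my own assumption for illustration; nothing above says real systems follow it, and the function names are hypothetical.

```python
import math


def gaussian_tail(k: float) -> float:
    """P(X > k) for a standard normal variable."""
    return 0.5 * math.erfc(k / math.sqrt(2))


def cauchy_tail(k: float) -> float:
    """P(X > k) for a standard Cauchy variable (an assumed fat-tailed example)."""
    return 0.5 - math.atan(k) / math.pi


# Compare the probability of landing more than k units above the center.
for k in (2, 4, 6):
    print(f"k={k}: gaussian={gaussian_tail(k):.2e}  cauchy={cauchy_tail(k):.2e}")
```

The further out you go, the wider the gap between the two columns becomes, which is the sense in which tail events are badly underestimated if you assume a normal curve.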

I was speaking to my friend Jim Farley on this topic last night. He asked (no, demanded) that I distinguish between a complex system, where you can with sufficient foresight conceive of and control outcomes, and a chaotic one, which is highly dependent upon initial conditions and largely unpredictable if those conditions aren’t known.

His intent was to separate natural disasters from the definition. I’m not sure that the distinction is a worthwhile one to make, as the results seem more a matter of degree than a hard difference.

Can complex interrelated systems be tested? Not completely, of course; we don’t even completely test software that is very well defined. But what we need to do is to get away from the idea that catastrophic failures occur due to a complex sequence of highly unlikely events, and instead acknowledge that a complex system simply has a lot of points of failure.

This type of testing is similar to testing safety-critical software, where your goal is to map out the failure points and determine how best to make the system fail. That’s very different from how testers tend to work with software, which is usually quite methodical and planned. The problem is that most failures are catastrophic and unplanned (how can you plan a failure?).

James Bach talks about the Buccaneer Tester in his blog (actually, buccaneer scholar, but I’m being selective). While his point is much broader, I’d like to focus on the part of the buccaneer that takes measured risks for a high reward. We would like someone who thoroughly abuses our software, and risks ridicule and even censure as a result. But that person is more likely to understand the boundaries under which our software operates.

And in general it helps to think out of the box. You want someone to do what any user may try, without fear that it isn’t covered in the spec or even conceived of as an error. Most testers are very much in the box. When looking at what can go wrong with a complex system, it’s important to understand both the individual components of that system and what might happen outside of the system but within its ecosystem.

SLAs Are A Crock December 29, 2012

Posted by Peter Varhol in Software platforms, Strategy.

We would like to think that automated systems are predictable. That assumption leads to a number of desirable characteristics, the foremost of which is that we can predict how long the system will stay fully functional and how long any failures may last.

The problem is that automated systems are complex.  I realized this when Matt Heusser recently tweeted, “If Amazon can’t fulfill its SLAs, how can we?”  The answer is that we can’t.  We couldn’t do it with relatively simple pre-digital manual systems, and we certainly can’t do it with complex automated ones.  We can look at each of the components of the system – multiple servers and maybe large server farms, storage and other peripherals, network, OS, applications, database, power, environmental, and so on, and assign probabilities of failure to each.

But the problem is that we treat them as independent at best, and unrelated at worst.  We understand that individual parts of the system have a high reliability, but we fail to understand their interrelationships.  We think that the failure rate of each component is somehow analogous to the failure rate of the system as a whole.  Or we think that the individual components are large and independent of each other.  The latter is a more subtle fallacy, but it’s still a fallacy.
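A quick sketch of why component reliability is not system reliability. It assumes a simple “series” system in which every component must be up, and it assumes independent failures, which is exactly the assumption questioned above, so treat the result as an optimistic bound. The figures and the function name are made up for illustration.

```python
import math


def series_availability(component_availabilities: list[float]) -> float:
    """Availability of a system that needs every component up at once.

    Assumes failures are independent (an optimistic simplification).
    """
    return math.prod(component_availabilities)


# Ten hypothetical components, each available 99.9 percent of the time.
components = [0.999] * 10
system = series_availability(components)
print(f"each component: 99.9%  whole system: {system:.3%}")
```

Even under the friendly independence assumption, the system as a whole is less available than any one of its parts, and the gap widens with every component added.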

SLAs (Service Level Agreements, to the uninitiated) offer contractual guarantees to IT service users on uptime and availability. They usually promise over 99 percent uptime, and the resolution of any problem within a certain limited amount of time. But they are based on the reliability of the portion of the system within the control of the provider, rather than the full system. Or they are based on a misunderstanding of the risk of failure.
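For a sense of scale, an uptime percentage translates into an annual downtime allowance with simple arithmetic. This sketch assumes a 365-day year and a hypothetical helper name; actual SLAs define their own measurement windows and exclusions.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def allowed_downtime_hours(uptime: float) -> float:
    """Hours per year a service may be down and still meet the uptime figure."""
    return (1.0 - uptime) * HOURS_PER_YEAR


for uptime in (0.99, 0.999, 0.9999):
    print(f"{uptime:.2%} uptime allows {allowed_downtime_hours(uptime):.2f} hours down per year")
```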

The worst thing is the Black Swan event, as Heusser notes. The Black Swan event is the purported once-in-a-lifetime event that surprises everyone when it occurs, either because it is considered very rare or because no one thought of it. The nuclear meltdown of the Fukushima Nuclear Power Plant in Japan as a result of the undersea earthquake and tsunami is a well-known example of a Black Swan event. It required a complex series of circumstances that seemed obvious only in retrospect.

We tend to think incorrectly when applying Black Swan events to complex systems. We believe that the sequence of events that caused the unexpected failure is very rare, when in fact the system itself is complex and has multiple unknown and untested failure points.

In IT, we tend to counter rare events through redundancy.  We connect our systems to separate network segments, for example, and are then surprised when snow collapses the roof of the data center.  The physical plant is part of the system, but we simply treat it as a given, when it’s not.  I’ll discuss how we might test such systems in a later post.
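The roof example can be put in rough numbers. The sketch below assumes two redundant network segments that each fail independently with some probability, plus a single shared cause (the roof) that takes out both at once. The probabilities and function names are invented for illustration, not measured.

```python
def both_fail_independent(p_segment: float) -> float:
    """Chance both redundant segments fail, if failures are independent."""
    return p_segment ** 2


def both_fail_with_common_cause(p_segment: float, p_common: float) -> float:
    """Chance both segments are down when a shared event can kill both.

    Either the shared event happens, or it does not and both segments
    fail on their own.
    """
    return p_common + (1.0 - p_common) * p_segment ** 2


p_segment = 0.01   # hypothetical per-segment failure probability
p_common = 0.001   # hypothetical probability of the roof collapsing

print(f"independent only:  {both_fail_independent(p_segment):.6f}")
print(f"with common cause: {both_fail_with_common_cause(p_segment, p_common):.6f}")
```

With these made-up figures the shared cause dominates the result, so adding a third or fourth segment would barely change the outcome. The redundancy only protects against the failures we thought to make independent.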

So the next time someone asks you to take their SLA bet, go ahead.  You will win sooner or later.

UPDATE:  Matt Heusser has his own post on this topic here.