Pensieri di un lunatico minore

21 July 2008 Technology

Understanding availability

With the recent Amazon S3 outage of approximately 8 hours, a lot of people are spending a lot of energy lambasting Amazon for the downtime. While I think we’d all love to have systems that never go down, the probability of that happening in the “real world” is relatively small, unless you’re running some esoteric hardware that most people aren’t. Before we go any further, let’s quickly break down what we mean by availability.

When most people talk about availability, they often use the marketing-speak method of counting “nines”. Five nines, or 99.999%, is the “gold standard” that most people talk about but few actually achieve. To clarify, here’s what it means when you convert percentages to actual time spans.
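The conversion is plain arithmetic, so it can be sketched in a few lines of Python. This is only an illustration: the helper name `downtime_seconds` is made up for this example, and it assumes a 365-day year and that every second of the period counts toward availability.

```python
# Hypothetical helper: convert an availability percentage into allowed downtime.
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_seconds(availability_pct: float, period_days: float) -> float:
    """Seconds of downtime allowed in `period_days` at the given availability."""
    return (1 - availability_pct / 100) * period_days * SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumption: a 365-day year; leap years shift the yearly figure slightly.
    for pct in (99.0, 99.9, 99.99, 99.999):
        per_year_min = downtime_seconds(pct, 365) / 60  # minutes per year
        per_week_s = downtime_seconds(pct, 7)           # seconds per week
        per_day_s = downtime_seconds(pct, 1)            # seconds per day
        print(f"{pct:>7}%  {per_year_min:9.1f} min/year  "
              f"{per_week_s:8.1f} s/week  {per_day_s:7.2f} s/day")
```

Under those assumptions, five nines works out to roughly 5.3 minutes per year, while 99.9% allows about 8.76 hours per year, which is in the same range as the 8-hour S3 outage.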

This means that the vaunted “five nines” allows for only a bit more than five minutes of downtime in a year, or about 6 seconds per week, which translates to less than 1 second per day. Quite honestly, you can’t even reliably bounce an HTTP server in that amount of time. For example, if you’re running on a single server and you have to reboot it more than once a year, you’ll likely never hit 99.99%, even if nothing else breaks.

So what does this mean when we talk about systemic availability? It means that putting all your eggs in one basket, regardless of the quality of the basket, is silly. While many people think about drives failing and implement RAID or some other technique, and some think about CPU and memory, very few think about or plan for electrical or cooling failure. These kinds of problems, which strike entire data centers, are not uncommon, and cannot be waved away by saying that you have redundant infrastructure.
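The one-basket point can be made concrete with the standard series/parallel availability formulas. This is a rough sketch, not a model of any real site: the function names and the figures (99% per server, 99.9% for the shared power and cooling) are hypothetical, and the parallel formula assumes replicas fail independently, which is exactly the assumption a facility-wide failure breaks.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Every component must be up, so the availabilities multiply."""
    return prod(availabilities)


def parallel(availability: float, copies: int) -> float:
    """At least one of `copies` replicas must be up (assumes independent failures)."""
    return 1 - (1 - availability) ** copies


if __name__ == "__main__":
    server = 0.99     # hypothetical availability of one server
    facility = 0.999  # hypothetical availability of shared power and cooling

    redundant_servers = parallel(server, 3)
    one_site = series(redundant_servers, facility)
    two_sites = parallel(one_site, 2)  # assumes the two sites fail independently

    print(f"three servers, ignoring the building: {redundant_servers:.6f}")
    print(f"three servers in one data center:     {one_site:.6f}")
    print(f"same setup in two data centers:       {two_sites:.8f}")
```

However many servers are stacked inside one building, the series term caps the result below the facility's own availability; only a second, independent site moves the number past it.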

The vast majority of availability problems, however, are not hardware-driven, even though that’s all people think about. They come from a few areas:

All of these cost serious money to solve. They are solved through processes and planning and not just traditional technical operations. In examining failure modes of systems I’ve worked on, a vast majority are preventable. They are due to someone making an unplanned change that isn’t properly vetted. They’re based on software configuration errors, and they’re based on upgrades that simply weren’t tested first.

The silver lining here is this: a vast majority of sites, companies, etc., do not need this kind of availability. The pursuit of high availability tends to be a mental masturbation exercise by people who want to spend money, but aren’t willing to do the cost-benefit analysis. Before undertaking anything above 99.9%, you really need to understand your business to a level that will allow you to make a rational decision about risks. Often, it is cheaper to rebate money to people than it is to fix the problem.
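A back-of-the-envelope version of that cost-benefit check might look like the sketch below. Everything in it is an assumption made for illustration: the function names, the rebate rate, and the upgrade cost are invented, and it ignores harder-to-price effects such as reputation or lost customers.

```python
HOURS_PER_YEAR = 365 * 24


def expected_downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def cheaper_to_rebate(current_pct: float, target_pct: float,
                      rebate_per_hour: float, upgrade_cost_per_year: float) -> bool:
    """True if rebating the extra downtime costs less than paying for the upgrade."""
    hours_avoided = (expected_downtime_hours(current_pct)
                     - expected_downtime_hours(target_pct))
    return hours_avoided * rebate_per_hour < upgrade_cost_per_year


if __name__ == "__main__":
    # Hypothetical inputs: going from 99.9% to 99.99% avoids about 7.9 hours a year.
    print(cheaper_to_rebate(99.9, 99.99,
                            rebate_per_hour=5_000,
                            upgrade_cost_per_year=250_000))
```

With these made-up figures the rebates come to far less than the upgrade; at ten times the rebate rate the answer flips, which is why the real business numbers have to be known before deciding.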

So what do I say to those who puff up and say “I can do better”? I say “no you can’t”. At least, not likely. The cost of running exceptionally high availability systems is not just hardware. It is operational costs. It is staffing, monitoring infrastructure, planning and operational processes. It doesn’t happen when you’ve only got one machine. It doesn’t even happen when you have 50 machines.

Don’t delude yourself about your own ability to run systems at that level, any more than you should delude yourself into assuming that someone else can.

This entry was posted at 10:43 pm on 21 July 2008 and is filed under Technology.

Some people can. I work for , although in business systems not in actual telecommunications software. I asked one of the guys who works on the Real Telco side of the business when they last had a total outage in his area of responsibility. He knew the answer instantly: “four minutes in February 1999”. A different world entirely – although not, these days, using particularly exotic hardware; just lots of it, in a very redundant configuration.

I worked with some of the early softswitch implementations (SS7), and I can say that the telecom world, at least on the voice side, still operates with a level of conservatism that isn’t seen in many places. It’s not just the hardware that’s largely redundant (though a lot of it is); it’s also the approach to rolling out any change to the environment.

Having said that, when September 11th happened, my company’s VoIP environment survived, whereas Verizon’s didn’t. Resilience comes in a lot of forms.
