Pensieri di un lunatico minore

21 July 2008 Technology

Understanding availability

With the recent Amazon S3 outage of approximately 8 hours, a lot of people are spending a lot of energy lambasting Amazon for the downtime. While I think we’d all love to have systems that never go down, the probability of that happening in the “real world” is relatively small, unless you’re running some esoteric hardware that most people aren’t. Before we go any further, let’s quickly break down what we mean by availability.

When most people talk about availability, they often use the marketing-speak method of speaking in “nines”. Five nines, or 99.999%, is the “gold standard” that most people talk about but few actually achieve. To clarify, here’s what it means when you convert percentages to actual time spans.
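The conversion is plain arithmetic: the allowed downtime is the length of the period multiplied by the unavailable fraction. Here is a minimal Python sketch of that calculation. The function name and the 365-day year are my own assumptions for illustration, not part of any standard definition.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Period lengths in seconds. A 365-day year is assumed (no leap days).
PERIODS = {
    "day": SECONDS_PER_DAY,
    "week": 7 * SECONDS_PER_DAY,
    "year": 365 * SECONDS_PER_DAY,
}


def allowed_downtime_seconds(availability_percent, period_seconds):
    """Return the seconds of downtime permitted in a period at a given availability."""
    return period_seconds * (1 - availability_percent / 100)


# Print one line per availability level, from two nines to five nines.
for pct in (99.0, 99.9, 99.99, 99.999):
    cells = [
        f"{allowed_downtime_seconds(pct, secs):,.2f} s per {name}"
        for name, secs in PERIODS.items()
    ]
    print(f"{pct}%: " + ", ".join(cells))
```

For 99.999% this works out to roughly 315 seconds per year, about 6 seconds per week, and a little under 0.9 seconds per day.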

This means that the vaunted “five nines” allows for only a bit more than five minutes of downtime in a year, and only 6 seconds per week, which translates to less than 1 second per day. Quite honestly, you can’t even bounce an HTTP server in that period of time reliably. For example, if you’re running on a single server, and you have to reboot it more than once a year, you’ll likely never hit 99.99%, even if nothing else breaks.

So what does this mean when we talk about systemic availability? It means that putting all your eggs in one basket, regardless of the quality of the basket, is silly. While many people think about drives failing, and implement RAID or some other technique, and some think about CPU and memory, very few think about or plan for electrical system failure or cooling failure. These kinds of problems, which strike entire data centers, are not uncommon, and cannot be waved away by saying that you have redundant infrastructure.
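To put rough numbers on the basket problem, here is a small Python sketch that combines component availabilities. It assumes failures are independent, which real outages often are not, and the 99% server and 99.9% facility figures are made up purely for illustration. The helper names are hypothetical as well.

```python
def serial(availabilities):
    """Availability of a chain where every component must be up (independent failures assumed)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(availabilities):
    """Availability of redundant components where any single one is enough (independent failures assumed)."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1 - a
    return 1 - all_down


# Hypothetical figures, chosen only to illustrate the shape of the result.
server = 0.99     # one server
facility = 0.999  # power and cooling for one data center

# Two redundant servers sharing one facility: the facility caps the total.
two_servers_one_site = serial([facility, parallel([server, server])])

# One server in each of two separate facilities.
two_sites = parallel([serial([facility, server]), serial([facility, server])])

print(f"two servers, one site: {two_servers_one_site:.6f}")
print(f"one server, two sites: {two_sites:.6f}")
```

Under these assumptions, the redundant pair in a single facility can never be more available than the facility itself, while the same two servers split across two facilities come out ahead.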

The vast majority of availability problems, however, are not hardware-driven, even though hardware is all people think about. They come from a few areas: