The availability game
I’ve lost count of how often I’ve heard people talk about how important “availability” or “scalability” is in their systems without being able to quantify it or understand the implications of what they are saying. So, before we talk about availability, let’s put the nines on display.
| Availability | Downtime per month (30 days) | Downtime per year (365 days) |
|---|---|---|
| 98% | 14.4h | 7.3d |
| 99% | 7.2h | 3.65d |
| 99.5% | 3.6h | 1.83d |
| 99.9% | 43.2m | 8.76h |
| 99.99% | 4.32m | 52.6m |
| 99.999% | 25.9s | 5.26m |
| 99.9999% | 2.59s | 31.5s |
So, the mythical five-nines reliability that is so often bandied about by clueless people allows for only 25.9s of downtime in a month. Is this achievable? Sure, but if you want to achieve it based on a measurement that matters, it’s a lot harder.
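If you want to check the table or work out a budget for a figure that isn't in it, the arithmetic is simple enough to sketch. This is a minimal illustration, assuming a 30-day month and a 365-day year (which is what the table's numbers correspond to); the function names `downtime_seconds` and `humanize` are my own, not from any library.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_seconds(availability_pct: float, period_days: float) -> float:
    """Seconds of downtime allowed in a period at the given availability."""
    return (1 - availability_pct / 100) * period_days * SECONDS_PER_DAY


def humanize(seconds: float) -> str:
    """Format a duration using the largest unit it fills (d, h, m, else s)."""
    for unit, size in (("d", 86400), ("h", 3600), ("m", 60)):
        if seconds >= size:
            return f"{seconds / size:.3g}{unit}"
    return f"{seconds:.3g}s"


# Assumption: a month is 30 days and a year is 365 days.
for pct in (98, 99, 99.5, 99.9, 99.99, 99.999, 99.9999):
    per_month = humanize(downtime_seconds(pct, 30))
    per_year = humanize(downtime_seconds(pct, 365))
    print(f"{pct}%: {per_month} per month, {per_year} per year")
```

The point of writing it down is less the numbers than the habit: whenever someone quotes a percentage, convert it to minutes and ask whether anyone could actually detect and fix a failure in that time.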
How do you measure availability? At the end of the day, the only availability that truly matters is the one seen by your end user. I’ve often called this holistic uptime, as it requires a holistic approach to problem solving. It includes the hardware, but also the operating system, application components, network and the client platform. Often that last component is your limiting factor, along with networks that are less reliable than you think. 99.5% is a typical SLA in the ISP world, or 99.9% if you have redundant links. Surpassing that starts to cost serious money in many cases.
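To see why the end-user number is so much harder to hit, here is a rough sketch of how availability compounds across a chain of components. It assumes every component must be up for the user to get service and that failures are independent, which is a simplification. The per-component figures are hypothetical, made up purely for illustration, and the function names are my own.

```python
from math import prod


def serial_availability(components):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, which real systems rarely honour.
    """
    return prod(components)


def redundant_availability(single, copies=2):
    """Availability of N redundant copies when any one of them is enough.

    Also assumes independent failures.
    """
    return 1 - (1 - single) ** copies


# Hypothetical per-component availabilities, for illustration only.
chain = {
    "hardware": 0.9999,
    "operating system": 0.9995,
    "application": 0.999,
    "network": 0.995,
    "client platform": 0.99,
}

end_to_end = serial_availability(chain.values())
print(f"end-to-end availability: {end_to_end:.4%}")
print(f"two independent 99.5% links: {redundant_availability(0.995):.4%}")
```

Under these assumptions the chain is always worse than its weakest link. The idealized figure for two redundant links also comes out well above the 99.9% SLA quoted above, presumably because real links share failure modes and an SLA is a contractual promise rather than a prediction.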
It’s important to understand that not all downtime is the same. Scheduled downtime is generally more acceptable than unscheduled downtime, and is sometimes excluded from the calculation (although both should be measured). Downtime during primary user hours¹ is more painful than downtime in off hours. Whether you have any off hours, however, depends largely on who your user base is. If you are a global organization and a specific system supports all your users, then you might not have any quiet periods¹ in a regular working day. If you work with retail customers, then your working window is even larger, and includes weekends.
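One way to keep these distinctions honest is to record each outage with whether it was scheduled and roughly what share of users it hit, then report more than one number. The sketch below is only an illustration of that idea: the `Outage` record, the weighting by fraction of users affected, and the sample outages are all assumptions of mine, not an established method.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float
    scheduled: bool
    users_affected: float  # hypothetical: fraction of users impacted, 0.0 to 1.0


def availability(outages, period_minutes, include_scheduled=True):
    """Plain availability, optionally ignoring scheduled downtime."""
    down = sum(o.minutes for o in outages if include_scheduled or not o.scheduled)
    return 1 - down / period_minutes


def user_weighted_availability(outages, period_minutes):
    """Availability with each outage weighted by the share of users it hit."""
    down = sum(o.minutes * o.users_affected for o in outages)
    return 1 - down / period_minutes


MONTH_MINUTES = 30 * 24 * 60  # assumption: 30-day month

# Made-up sample: a long off-hours maintenance window and a short peak-hours failure.
outages = [
    Outage(minutes=120, scheduled=True, users_affected=0.05),
    Outage(minutes=30, scheduled=False, users_affected=0.90),
]

print(f"all downtime:     {availability(outages, MONTH_MINUTES):.4%}")
print(f"unscheduled only: {availability(outages, MONTH_MINUTES, include_scheduled=False):.4%}")
print(f"user-weighted:    {user_weighted_availability(outages, MONTH_MINUTES):.4%}")
```

In this made-up month the 30-minute failure at peak costs far more user-minutes than the two-hour maintenance window, which is exactly what a single raw percentage hides.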
To cut to the chase, since there is an entire book to be written about managing availability, it is important to set goals realistically and with a true understanding of the business impact. Very few organizations need five-nines availability, and many would be just fine with 99.5% or even lower for many systems. Heck, I’d be happy if our email was down more, as it would keep the distraction level down.
Money is also a major issue. I usually tell people that every additional nine costs 2x what the previous one cost, and sometimes that’s an understatement. Going from three nines to four is one thing; going from five to six is a totally different thing and might even be a 5-6x increase in cost. You best be able to write a business case for that.
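The rule of thumb compounds quickly, which a few lines of arithmetic make plain. This is a sketch, not a cost model: it takes two nines as a baseline cost of 1, applies 2x for each step up to five nines, and uses 5x for the last step as the low end of the 5-6x range mentioned above. The function name `cumulative_cost` and the choice of baseline are my own.

```python
def cumulative_cost(step_multipliers, baseline=1.0):
    """Relative cost after each step, given a multiplier per additional nine."""
    costs = [baseline]
    for multiplier in step_multipliers:
        costs.append(costs[-1] * multiplier)
    return costs


# Assumption: start at two nines; 2x per nine up to five nines, then 5x to reach six.
multipliers = [2, 2, 2, 5]
for nines, cost in enumerate(cumulative_cost(multipliers), start=2):
    print(f"{nines} nines: {cost:g}x the two-nines cost")
```

Under those assumptions six nines costs 40 times what two nines does, for a system that buys you about 43 fewer minutes of downtime a month than three nines would.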
¹ It’s important to understand that not every timezone has an equal distribution of people, and the number of people impacted is also a critical factor.
This entry was posted at 10:33 am on 11 December 2006 and is filed under Random.