Pensieri di un lunatico minore

15 February 2008 Technology

AWS, uptime, downtime and information

This “morning” Amazon experienced a massive failure of their web services infrastructure, specifically Simple Storage Service (S3) and Elastic Compute Cloud (EC2). Lots of people were quick to yell and scream about the issue, wringing their hands about trust and such.

What seems to be completely missed here is something very, very common: technology fails. It fails regularly, and often unpredictably. It’s the way the world works. You can build a system that doesn’t fail, but that doesn’t mean your software won’t break. I’m not trying to excuse Amazon for their issues. Without knowing what happened, it’s nearly impossible to know how foreseeable it was. What I am trying to say is that people need to get a grip.

Here’s how this downtime calculation works, assuming Amazon’s outage was 6 hours and that there are approximately 8,765.81 hours in a year. Dividing 6 by 8,765.81 gets you a downtime of roughly .0007 or, inverted to the normal view and converted to a percentage, an uptime of about 99.93%. Is that perfect? Nope. Is that “good enough”? Maybe. What I do know is that actually achieving this kind of uptime isn’t as easy as the armchair geeks would like to think.
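
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation above. It assumes a 6-hour outage measured against a full year of 8,765.81 hours; the function name `uptime_percent` and the constant `HOURS_PER_YEAR` are made up for illustration. Because nothing is rounded along the way, the result may differ from the figure quoted above by a hundredth of a percentage point.

```python
# Average hours in a year, as used in the text (an assumption carried over as-is).
HOURS_PER_YEAR = 8765.81


def uptime_percent(outage_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return uptime over the period as a percentage.

    Hypothetical helper for illustration only.
    """
    if not 0 <= outage_hours <= period_hours:
        raise ValueError("outage_hours must be between 0 and period_hours")
    downtime_fraction = outage_hours / period_hours  # share of the period spent down
    return (1 - downtime_fraction) * 100             # invert, then convert to percent


outage_hours = 6  # assumed outage length
print(f"downtime fraction: {outage_hours / HOURS_PER_YEAR:.6f}")
print(f"uptime: {uptime_percent(outage_hours):.2f}%")
```

Plugging in other outage lengths shows how quickly the “nines” disappear: under the same assumptions, every additional hour of downtime costs a little over a hundredth of a percentage point of yearly uptime.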

So what lessons should the users of Amazon’s web services take away from this?