AWS, uptime, downtime and information
This “morning” Amazon experienced a massive failure of their web services infrastructure, specifically Simple Storage Service (S3) and Elastic Compute Cloud (EC2). Lots of people jumped in to quickly yell and scream about the issue, wringing their hands about trust and such.
What seems to be completely missed here is something very, very common: technology fails. It fails regularly, and often unpredictably. It’s the way the world works. You can build a system that shouldn’t fail, but that doesn’t mean your software won’t break. I’m not trying to excuse Amazon for their issues. Without knowing what happened, it’s nearly impossible to know how foreseeable it was. What I am trying to say is that people need to get a grip.
Here’s how this downtime calculation works, assuming Amazon’s outage was 6 hours [1] and that there are approximately 8,765.81 hours in a year. Dividing 6 by 8,765.81 gets you a downtime fraction of about .0007, or, inverted to the normal view and converted to a percentage, an uptime [2] of about 99.93%. Is that perfect? Nope. Is that “good enough”? Maybe. What I do know is that actually achieving this kind of uptime isn’t as easy as the armchair geeks would like to think.
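For anyone who wants to check the arithmetic, here is a minimal Python sketch of the same calculation. The function names are mine and purely illustrative; the only inputs are the outage length and the hours-per-year figure used above. The second helper simply runs the math in reverse, turning an uptime target into a yearly downtime allowance.

```python
# Hours in a year, using the same figure as the post.
HOURS_PER_YEAR = 8765.81


def uptime_percent(outage_hours, period_hours=HOURS_PER_YEAR):
    """Return uptime as a percentage of the period, given total outage hours."""
    downtime_fraction = outage_hours / period_hours
    return (1 - downtime_fraction) * 100


def allowed_downtime_hours(target_percent, period_hours=HOURS_PER_YEAR):
    """Return how many hours of downtime a given uptime target permits."""
    return (1 - target_percent / 100) * period_hours


if __name__ == "__main__":
    # The two outage lengths discussed in the post and its footnotes.
    for hours in (2.5, 6):
        print(f"{hours} hours down: {uptime_percent(hours):.2f}% uptime")

    # Hypothetical uptime targets, just to show the reverse calculation.
    for target in (99.9, 99.99):
        print(f"{target}% uptime allows {allowed_downtime_hours(target):.2f} hours down per year")
```

Running the numbers the other way is a quick sanity check on any uptime figure somebody quotes at you.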
So what lessons should the users of Amazon’s web services take out of this?
- Always have a contingency plan, even if it consists of a dark blanket over your head and a six pack of your favorite beverage.
- Make sure you understand the true cost of downtime, and what the impact is on your organization/business/whatever.
- SLAs mean nothing.
- Nobody’s perfect.
- Don’t assume.
And what lessons should Amazon take away?
- Communication is key. The vast majority of the “complaints” centered on the lack of information.
- Information is confidence. Confidence is trust. Amazon should institute something similar to what Salesforce.com did after their big issues a few years ago.
- Never underestimate the anger of a mob.
All told, I still think that “cloud computing” is the future for a lot of applications.
[1] My understanding is that it was somewhere around 2.5 to 3 hours, but I’m including a few extra to cover stabilization and restart times.
[2] If we just allow 2.5 hours, it’s an uptime of 99.97%.
This entry was posted at 4:25 pm on 15 February 2008 and is filed under Technology.