Post Mortem of EBS Outage

Now that Amazon has fully recovered from the EBS outage last week, they've released an official explanation of what happened, and why.

Buried within this explanation is a note that the outage really did not drastically impact more than one Availability Zone. Because of that, they're going to be offering a series of webinars to help developers better understand how to properly architect their applications for cloud computing.

But what really happened?

Most users of Amazon's cloud computing services really want to know what happened, or more specifically, we want assurances that it won't happen again. In reality, what happened can be broken down into the following failures, which compounded into a complete and total collapse of EBS's services across the entire us-east-1 region:
  • A bad network upgrade caused several EBS nodes to think they were isolated, and panic
  • These panicked EBS nodes overwhelmed the regional EBS control system
  • Due to this failure, Amazon had to disable the entire EBS API
  • Users trying to launch new EBS volumes to replace the ones that had failed could not, because the EBS API had been disabled for the duration of the fix

How do I know it won't happen again?

Amazon is attacking the problem on several fronts, but the biggest positive I saw in their entire post is a frank admission that mistakes will happen. Although they're putting lots of automation and controls in place to hopefully prevent the original network problem that caused the outage, they're also addressing each step along the way that compounded it.

  • Controls will be put in place to prevent bad network upgrades in the future
  • EBS clusters will be modified to back off gracefully instead of wildly trying to recover from a failure and breaking the entire system
  • The EBS control system will be modified to be more zone-based, so that failures in one zone won't cross over to others
  • Amazon is making more services (such as VPC) available in all Availability Zones
  • Amazon is offering a series of webinars to help developers architect their systems more effectively

But why should I still trust Amazon?

The reality of this event is that, yes, Amazon did make a mistake, but everyone affected by it (and even some who weren't) will be getting a credit equal to 10 days' worth of EC2 and EBS usage on their account. The monetary compensation isn't the big part of this, and is very little comfort; what is comforting is that Amazon openly admitted they screwed up, and they quickly kept everyone updated to acknowledge there was a problem.

When my cable goes out (which happens almost every month), I get no notification other than from my own systems. Whenever I call my cable company, all I'm told is whether they already know about the problem, with no compensation in return, and never even an ETA or an apology. The same is true with most hosting providers. What's even more comforting is that really only those who limited themselves to a single Availability Zone were affected. Although you couldn't quickly launch in another AZ to solve your problems, you could launch in another Region, and more importantly, if you already had instances in other zones, you were most likely still operational.
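To give a rough idea of what "instances in other zones" looks like in practice, here's a minimal sketch using the boto3 Python library (which post-dates this outage). The AMI ID and instance type are placeholders, not real values, and it assumes AWS credentials are already configured.

```python
# Minimal sketch: spread identical instances across every Availability Zone
# in a region, so a single-AZ failure leaves the others serving traffic.
# Assumes boto3 and AWS credentials are configured; the AMI ID below is a
# placeholder, not a real image.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find every AZ that is currently available in this region.
zones = [
    z["ZoneName"]
    for z in ec2.describe_availability_zones()["AvailabilityZones"]
    if z["State"] == "available"
]

# Launch one instance per zone from the same image.
for zone in zones:
    ec2.run_instances(
        ImageId="ami-00000000",            # placeholder AMI
        InstanceType="t3.micro",           # placeholder instance type
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
    print(f"Launched an instance in {zone}")
```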

What can we learn from this?

The biggest thing to learn from this is that nobody, not even Amazon, is infallible. That being said, in Amazon's case you always have the option to recover from a failure more quickly than with any other solution. For example, if I had a single server break in a traditional hosting environment, I couldn't quickly launch a clone in another datacenter to serve from while that server was repaired. With Amazon, it's just a matter of a few API calls.
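To make "a few API calls" concrete, here's a rough sketch, again using boto3 (a library that didn't exist at the time of this outage). It assumes your AMI has already been copied or registered in the second region; all IDs are placeholders.

```python
# Rough sketch: launch a replacement "clone" of a server in a different
# region while the original is being repaired. Assumes the AMI has already
# been copied/registered in the target region; the IDs are placeholders.
import boto3

# Talk to EC2 in a region that is not affected by the outage.
backup_region = boto3.client("ec2", region_name="us-west-2")

# Launch the clone from the pre-copied image.
response = backup_region.run_instances(
    ImageId="ami-00000000",    # placeholder: your AMI in us-west-2
    InstanceType="t3.micro",   # placeholder instance type
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]

# Wait until the clone is running, then grab its public address
# so you can point DNS at it.
boto3.resource("ec2", region_name="us-west-2").Instance(instance_id).wait_until_running()
details = backup_region.describe_instances(InstanceIds=[instance_id])
public_ip = details["Reservations"][0]["Instances"][0].get("PublicIpAddress")
print(f"Clone {instance_id} is running at {public_ip}")
```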

Cloud computing is not dead; in fact, this is a shining moment for it. It shows that you'll get exactly the stability that you plan for.
