The Great EC2 Outage of 2011

There have been a lot of posts recently about the massive outage on EC2, which has now spanned two days of Amazon acknowledging a serious failure. Guess what, folks: it's not all of EC2 that's down, it's one part of EC2 called Elastic Block Storage (EBS). Unfortunately, that system exists because most people can't live by the disposable-instances philosophy and wanted something more like traditional Virtual Machines (VMs).


I've always been a huge fan of cloud computing, specifically Amazon Web Services, mostly because it's pretty much the only game in town. But, as I've always said, you have to prepare for something to go wrong. In most cases an outage is more related to network providers than to AWS itself, but this time it really was caused by some pretty massive failures on Amazon's side.


Amazon splits its systems into multiple Regions, and each Region is then split into multiple Availability Zones. The major outage affected only one Region, and at this point the lingering problem is limited to just one Zone within that Region. Amazon has always had the policy that any single Zone within a Region may be down without affecting their SLA (they always tell you to use multi-zone deployments), but in this case they've actively marked themselves as down, which means this outage will be counted against their SLA. Unlike traditional providers who try to cover things like this up, Amazon has been very transparent and open about all the problems they've been having.
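To make that concrete, here's a minimal sketch of what a multi-zone launch looks like with boto (the Python AWS library). The region, zone names, and AMI ID are placeholders, so substitute your own; the point is simply that spreading instances across zones is a few lines of code, not a project.

```python
import boto.ec2

# Placeholder region; any region with multiple Availability Zones works.
conn = boto.ec2.connect_to_region('us-east-1')

# Launch one instance in each of two zones so the loss of a single zone
# doesn't take the whole service down. The AMI ID below is a placeholder.
for zone in ('us-east-1a', 'us-east-1b'):
    conn.run_instances('ami-12345678',
                       instance_type='m1.small',
                       placement=zone)
```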


What I find even more amazing is that everyone yells that all of Amazon Web Services is down when this kind of outage happens, yet nobody even noticed that Amazon never went down in Japan despite the recent earthquake and tsunami, the nuclear disaster, and even the rolling blackouts that followed. They survive acts of God that anyone would forgive them for going down over, yet there's no forgiveness for one piece of AWS being down for any length of time.


It's not hard to launch systems across multiple regions for redundancy. In fact, Amazon has offered some suggestions for how to do this, and there's even more help available on the forums. I find it amazing that many people seem to simply call it a day when something like this happens, rather than try to figure out how to improve their disaster recovery (DR) situation. And if you really "don't have the resources to build a DR solution," then you probably don't have enough resources to matter in an outage like this.
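And just to show it really isn't hard, here's a rough sketch of the same idea spread across regions, again with boto. The region names and AMI IDs are placeholders; remember that AMIs are region-specific, so you need a copy of your image registered in each region you deploy to.

```python
import boto.ec2

# Placeholder mapping of region -> AMI ID registered in that region.
DEPLOYMENTS = {
    'us-east-1': 'ami-11111111',
    'us-west-1': 'ami-22222222',
}

# Launch an instance in each region so an outage in one region
# leaves the other copy of the service running.
for region, ami in DEPLOYMENTS.items():
    conn = boto.ec2.connect_to_region(region)
    conn.run_instances(ami, instance_type='m1.small')
```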

Comments

Brian Williams said…
Great article, I am going to post this on reddit. Oh wait...