Last night at about midnight Eastern time, Newstex began receiving alerts from PagerDuty warning about service issues it had detected. We always begin our investigation with a few simple steps.
The number one rule during any potential crisis is always, "Don't Panic". The worst thing that can happen is a panicked engineer rushing into a perceived crisis and causing more problems than were there originally, if there was anything wrong at all.
The first phase in any crisis situation is the Discovery Phase. This is where you monitor your systems and attempt to discover the cause of the alerts being sent out.
Verify the alert
First, it's important to verify that the alert wasn't erroneous, or that the problem wasn't already fixed by the time you received it. Temporary issues are quite common, and although alerts are designed to fire only when things have been confirmed down, there's always the chance that an alert was misconfigured or that the alerting system itself is having an issue. It's important to verify that what the alert is telling you is true. In our example, it was telling us that several of our FTP servers were having issues. This is easy to verify by simply attempting to log into those FTP servers with known good accounts (we have several testing accounts just for this purpose).
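A spot check like this is easy to script. Here's a minimal sketch using Python's ftplib; the hostnames and health-check credentials are placeholders, not our actual accounts:

```python
from ftplib import FTP, error_perm

def verify_ftp(host, user, password, timeout=10):
    """Return True if we can log in and run a simple command."""
    try:
        with FTP(host, timeout=timeout) as ftp:
            ftp.login(user, password)
            ftp.nlst()  # listing the directory proves the session really works
        return True
    except (OSError, error_perm):
        return False

# Placeholder hosts and a dedicated test account (hypothetical):
# for host in ["ftp1.example.com", "ftp2.example.com"]:
#     print(host, "OK" if verify_ftp(host, "healthcheck", "secret") else "DOWN")
```

Running a check like this from cron, independent of your main alerting system, gives you a second opinion when you suspect the alerts themselves are misbehaving.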
Check the AWS Status Page
After you've verified that it is an actual issue, it's important to check the AWS Status Page. In our case, this turned out to be the end of the discovery phase: we noticed immediately that SimpleDB was having major issues, and that was the cause of ours. If this is the problem, you move straight on to mitigation. If it isn't, check your own log files to identify what the issue might be.
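When the status page is clean and you're left combing logs, a quick tally of known error signatures can narrow things down fast. A minimal sketch; the patterns and log path here are illustrative, not our production config:

```python
import re
from collections import Counter

# Hypothetical error signatures worth counting during triage.
PATTERNS = {
    "simpledb": re.compile(r"SDBResponseError|ServiceUnavailable", re.I),
    "timeout":  re.compile(r"timed? ?out", re.I),
    "ftp":      re.compile(r"ftp.*(refused|reset)", re.I),
}

def triage(lines):
    """Count how often each known error signature appears in the log lines."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

# Usage: triage(open("/var/log/app.log").readlines())
```

A spike in one bucket usually points at the failing dependency far faster than reading the log top to bottom.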
Once you have discovered the issue, it's time to go about mitigating the impact and fixing the situation.
Mitigate the impact on customers
Customers generally don't care why a system is down; they only care that it is down. Blame may work in politics, but it doesn't comfort a user to know that you're down "because of Amazon". That just puts more work on your shoulders, as now they will simply question why you are running on AWS. Instead, it's more important to restore services as quickly as possible, or at the very least mitigate the impact on your customers. In our case, although we couldn't fully restore all services, we were able to keep our FTP servers accepting new connections and files, focusing on the real-time needs of our customers. We knew it wasn't important to keep our internal administration system operating, nor was it critical that our delayed feeds kept reading. The most important aspects of our system, the ones requiring real-time delivery, were our top and only priority.
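One general pattern for keeping the customer-facing path alive while a backend datastore is down is to spool writes locally and replay them later. A rough sketch of the idea; `BackendDown`, `save_record`, and the spool path are illustrative names, not our actual code:

```python
import json
import uuid
from pathlib import Path

class BackendDown(Exception):
    """Raised by the backend writer when the datastore is unavailable."""

def save_record(record, backend_write,
                spool_dir=Path("/var/spool/pending-writes")):
    """Try the primary datastore; on failure, spool the record to local
    disk so the customer-facing path (accepting the file) still succeeds.
    Spooled records can be replayed once the backend recovers."""
    try:
        backend_write(record)
        return "stored"
    except BackendDown:
        spool_dir.mkdir(parents=True, exist_ok=True)
        (spool_dir / f"{uuid.uuid4().hex}.json").write_text(json.dumps(record))
        return "spooled"
```

The customer sees a successful upload either way; only the internal bookkeeping is deferred.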
Monitor and Recover
During any crisis, it's important to continually monitor the situation and be prepared to escalate to the next step. In our example, we had already begun to prepare databases in us-west-1 in the event the issues extended beyond 4am. Fortunately, at around 3am Eastern time, the services were fully restored.
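It helps to decide these escalation thresholds before the crisis, so nobody is improvising at 2am. The decision logic can be as simple as comparing elapsed outage time against pre-agreed limits; the thresholds below are illustrative, not our actual runbook values:

```python
from datetime import datetime, timedelta

def escalation_step(outage_start, now,
                    prepare_after=timedelta(hours=1),
                    fail_over_after=timedelta(hours=4)):
    """Pick the next escalation step from how long the outage has run.
    Preparing standby databases early keeps the actual cut-over fast
    if it's ever needed."""
    elapsed = now - outage_start
    if elapsed >= fail_over_after:
        return "fail_over"
    if elapsed >= prepare_after:
        return "prepare_standby"
    return "monitor"
```

In our outage, services recovered while we were still in the prepare-standby stage, so the region cut-over never had to happen.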
After the services were fully restored, we went through each of our services one-by-one and verified manually that everything was working again. We continued to monitor through the morning hours to make sure everything stayed stable.
The last phase of any crisis situation is the Post-Mortem. This is the time to go back in detail through your logs and identify exactly why the issue escalated into an outage. It's at this point you want to take your time, find out exactly what triggered the domino effect, and work on potential solutions to those issues. You don't necessarily need to make any changes during this phase, but you do need to at least present the root cause and why it escalated into an actual outage.
After you have identified the issue and potential solutions, it's up to you to determine the cost-benefit of implementing those solutions. Remember that every issue comes with an associated risk, which can be calculated as a combination of how likely the issue is to occur and how big its impact will be. This means that something lower impact but much more likely to occur will be a higher priority than something extremely unlikely to occur but with a much higher impact.
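This prioritization reduces to a simple product of likelihood and impact. The numbers and issue names below are made up purely to illustrate the ranking:

```python
def risk_score(likelihood, impact):
    """Risk as a simple product of likelihood (0-1) and impact (1-10)."""
    return likelihood * impact

# Entirely illustrative estimates, not real Newstex figures:
issues = [
    ("regional datastore outage", 0.05, 9),  # rare but severe
    ("single FTP node crash",     0.60, 3),  # common but contained
]
ranked = sorted(issues, key=lambda i: risk_score(i[1], i[2]), reverse=True)
```

With these numbers the common-but-contained failure (0.60 × 3 = 1.8) outranks the rare catastrophe (0.05 × 9 = 0.45), which is exactly the point: frequency can outweigh severity.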
Great support staff is key
In our case, Newstex was fully prepared to roll over to a different region, but we determined the impact would be greater than simply waiting for Amazon to fix the root cause. Our monitoring and support staff were all over the situation and handled it incredibly well, minimizing the impact so that in most cases customers didn't even know there was an issue. Having an incredibly great support staff is absolutely key. You don't want to learn about an outage from your customers; you have to know about and resolve these issues before they notice.