Design for Failure

Failures are inevitable. No system is entirely without them; how you deal with them is what makes the difference. Many people will tell you to build your system so that it does not fail, but in practice that is simply not possible. Instead of trying to prevent every failure, it is often better to be prepared: have a plan for what to do when your system fails, minimize the impact of each failure, and determine the risk of failures so that you can mitigate those risks by appropriate means.


Split your system into Modules

The first thing to do with any system is to split different features apart into separate, self-reliant pieces. That way an outage in one piece has minimal impact on the others. Making an online game? Make sure that if your website goes down (say, due to a DDoS attack), it doesn't take down the system that serves your game to your clients. Writing an online video conversion tool? Make sure that even if your conversion servers are overloaded (or go down entirely), your users can still submit videos.
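As a rough illustration of the video conversion example, the sketch below keeps submissions working while the converter is down by putting a queue between the two pieces. All names here (SubmissionService, drain, ConversionUnavailable) are hypothetical, and the in-memory deque only stands in for whatever durable queue a real system would use.

```python
from collections import deque


class ConversionUnavailable(Exception):
    """Raised by a converter that is overloaded or down (hypothetical)."""


class SubmissionService:
    """Accepts uploads and only records them; it never calls the converter."""

    def __init__(self):
        self.pending = deque()  # stand-in for a durable queue

    def submit(self, video_name):
        self.pending.append(video_name)
        return "accepted"


def drain(service, convert):
    """Convert queued videos; leave them queued if the converter fails."""
    converted = []
    while service.pending:
        video = service.pending[0]
        try:
            convert(video)
        except ConversionUnavailable:
            break  # converter is down: stop and keep the remaining work queued
        service.pending.popleft()
        converted.append(video)
    return converted


def broken_converter(video):
    raise ConversionUnavailable(video)


service = SubmissionService()
status = service.submit("cat.mp4")                  # still accepted during the outage
converted_while_down = drain(service, broken_converter)
queued_during_outage = list(service.pending)        # nothing was lost
converted_after_recovery = drain(service, lambda video: None)
```

The point of the design is that a failure in conversion delays work instead of rejecting users.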

Monitor the health of your systems

Another very common practice to minimize the impact of failures is to make sure that you know when your service is down. You may have several different methods of determining the "health" of your system, but in the end you need to make sure that you're monitoring the end-to-end result of your system.

But who monitors the monitors? If you're writing your own checks to monitor your systems, make sure a third party is also involved. If your monitors run on the same hardware or platform as your service, what happens when that hardware or platform fails? How are you notified? If your notification system goes down, do you find out about it?
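A minimal sketch of an end-to-end check, assuming the service is a web page and that "healthy" means the page both loads and contains some text a real user would see. The function names, the URL, and the expected text are illustrative assumptions. A check like this should run from somewhere other than the platform it is checking.

```python
import urllib.request


def http_fetch(url, timeout=10):
    """Fetch a page body; raises on network errors and HTTP error statuses."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def check_end_to_end(url, expected_text, fetch=http_fetch):
    """Return (healthy, detail). Healthy only if real content came back."""
    try:
        body = fetch(url)
    except Exception as error:  # any failure to fetch counts as unhealthy
        return False, f"fetch failed: {error}"
    if expected_text not in body:
        return False, "page loaded but expected content is missing"
    return True, "ok"


# Example call (hypothetical URL and text, not executed here):
# healthy, detail = check_end_to_end("https://example.com/", "Sign in")
```

Checking for expected content, not just an open port, catches the case where the server answers but serves an error page.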

At Newstex, we use several different methods to monitor the health of our systems.

Pingdom

Pingdom offers simple port-based checks. They have monitoring servers located all across the world, and let you check things like web servers or any generic TCP port. We use this to monitor the health of all of our web-accessible services. We also have some otherwise internal services respond to HTTP purely so that Pingdom can verify their health. Pingdom provides a nice tie-in with other services like PagerDuty to notify your staff when there's a problem.

Papertrail

Papertrail is a nice solution for log management. Although they don't currently have much in the way of alerts, they can notify you via email or PagerDuty when certain events appear in your logs. They plan to add checks for the lack of events, as well as thresholds (for example, at least 15 events within the last 5 minutes). Papertrail also has a nice API that you can tie into to build your own custom alert monitoring. Just make sure this isn't the ONLY way you get notified of a problem.
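The threshold idea (at least 15 events within the last 5 minutes) is simple enough to sketch on its own. The function below is a hypothetical illustration that works on a plain list of event timestamps; it does not use or represent Papertrail's API, and how the timestamps are obtained is left out.

```python
from datetime import datetime, timedelta


def too_few_events(event_times, now, minimum=15, window=timedelta(minutes=5)):
    """Return True if fewer than `minimum` events fall in the last `window`."""
    cutoff = now - window
    recent = [t for t in event_times if cutoff <= t <= now]
    return len(recent) < minimum


now = datetime(2020, 1, 1, 12, 0, 0)
# 20 events spread over the last few minutes: the system looks alive.
busy = [now - timedelta(seconds=10 * i) for i in range(20)]
# A single event from 10 minutes ago: the system has gone quiet.
quiet = [now - timedelta(minutes=10)]

busy_alert = too_few_events(busy, now)    # False
quiet_alert = too_few_events(quiet, now)  # True
```

Alerting on the absence of events catches failures that never get the chance to log an error.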

Custom Scripts

For the issues specific to our systems that we like to track, we also use our own custom scripts. We verify that these scripts are working through other services (such as Pingdom). Even when the system as a whole is up, there may still be smaller problems worth catching. For example, you may want to track and log when your web page takes a particularly long time to render, or when a user receives a 5xx-level error for any reason. There's nothing wrong with using your own custom monitoring scripts, as long as that's not the only method you use to check your system.
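One possible shape for such a custom script is sketched below: it times a request handler and logs a warning when the response is slow or returns a 5xx status. The 2 second threshold, the logger name, and the handler signature (a function returning a status code and a body) are assumptions made for the example.

```python
import logging
import time

logger = logging.getLogger("site_monitor")  # hypothetical logger name


def find_problems(status, elapsed_seconds, slow_seconds=2.0):
    """Return labels for anything worth logging about one request."""
    problems = []
    if elapsed_seconds > slow_seconds:
        problems.append("slow")
    if 500 <= status <= 599:
        problems.append("server_error")
    return problems


def monitored_call(handler, request, clock=time.monotonic):
    """Run a handler, log any problems, and pass its result through unchanged."""
    started = clock()
    status, body = handler(request)
    elapsed = clock() - started
    for problem in find_problems(status, elapsed):
        logger.warning("%s: status=%s elapsed=%.2fs", problem, status, elapsed)
    return status, body
```

The wrapper only observes; it never changes what the user receives, so a bug in the monitoring is less likely to become an outage of its own.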


Determining Risk

Risk mitigation is one of the key aspects of a disaster recovery plan. Risk can be calculated as a function of how likely a system is to go down and how much impact that outage would have on you. If a system is very fragile and likely to go down, but an outage wouldn't cause any negative results for a few days, then it's probably not worth investing a lot of time and money in making it more stable. The same reasoning applies to security. What if you do get hacked? What information would the attacker get from a given account? If all they get to see is your name, it's probably not very important to spend a lot of time securing that system. If that system also contains phone numbers, social security numbers, or any sort of financial data, then it's more important to make a break-in less likely. Put the effort where it's needed. Don't spend 6 months building a secure system for people to see your company logo or read the lunch specials for the day.
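As a small worked example, one common way to turn "likelihood and impact" into a number is to multiply them. The text above only says risk is a function of the two, so the product used here is an assumption, and the systems and figures are made up for illustration.

```python
def risk_score(likelihood, impact):
    """likelihood: probability (0 to 1) of failure in some period.
    impact: cost of that failure, in whatever unit you choose."""
    if not 0.0 <= likelihood <= 1.0:
        raise ValueError("likelihood must be between 0 and 1")
    return likelihood * impact


# Hypothetical systems: (likelihood of failure, impact if it fails)
systems = {
    "lunch menu page": (0.5, 1),        # fragile, but nobody is hurt
    "billing database": (0.05, 1000),   # sturdy, but failure is costly
}

# Highest risk first: this is where effort should go.
ranked = sorted(systems, key=lambda name: risk_score(*systems[name]), reverse=True)
```

Even though the lunch menu page is ten times more likely to fail, the billing database ranks first because its impact is so much larger.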

Have a Plan

Ok, so your system failed and you've been notified. Do you know what to do about it? Who do you contact? What actually failed? What's the procedure for fixing it and who do you have to notify about it?

Monitoring your system is only half of a recovery plan. You also need detailed documentation on how to fix problems, and everyone must either know the fix or have a way to contact the person who does. It's also extremely important that you can reach this documentation while your system is down. If your system is a wiki, don't keep the only copy of your disaster recovery documentation in your wiki. Plans vary from system to system, but when an on-call support person is paged in the middle of the night, what they really want are answers to these few questions:

1. Is there really a problem?
2. Does this need to be fixed now?
3. How can I fix this?
4. If I can't fix it, who do I contact to fix it? (escalation)

The answer to the third question, "How can I fix this?", will often include a series of steps for identifying the problem and fixing it. It's important to document this well, so that you can avoid waking up everyone in the office for something that can be solved by a simple reboot of the servers. You also need monitoring in place so that, once the problem is resolved, the on-call person can verify that it has been fixed.
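One way to keep such documentation consistent is to give every alert an entry that answers the same four questions. The sketch below is only an illustration: the RunbookEntry name, its fields, and the example alert are hypothetical. It renders to plain text so that a copy can be stored somewhere outside the system it describes.

```python
from dataclasses import dataclass, field


@dataclass
class RunbookEntry:
    alert: str
    confirm: str                                   # 1. Is there really a problem?
    urgent: bool                                   # 2. Does this need to be fixed now?
    fix_steps: list = field(default_factory=list)  # 3. How can I fix this?
    escalate_to: str = ""                          # 4. Who do I contact if I can't?

    def render(self):
        """Return the entry as plain text, suitable for printing or emailing."""
        lines = [
            f"ALERT: {self.alert}",
            f"Confirm: {self.confirm}",
            f"Fix now: {'yes' if self.urgent else 'no, can wait'}",
        ]
        lines += [f"Step {n}: {step}" for n, step in enumerate(self.fix_steps, start=1)]
        lines.append(f"Escalate to: {self.escalate_to}")
        return "\n".join(lines)


entry = RunbookEntry(
    alert="website not responding",
    confirm="load the home page from outside the office network",
    urgent=True,
    fix_steps=["reboot the web servers", "confirm the external check passes"],
    escalate_to="head of operations",
)
text = entry.render()
```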

The most important thing to remember is that when there is an outage, it doesn't matter whose fault it is. What matters most is minimizing the impact and downtime of the outage.



Questions you should ask yourself (or your IT staff)

So how do you really know if you're prepared for an outage? Can you (or your IT staff) answer these questions?

1. What would you do if your database suddenly died?
2. What if your entire datacenter went down (or fell into the ocean)?
3. What if your user base suddenly increased by an order of magnitude?
4. What would happen if your head of operations or IT suddenly quit, or died?
5. If your system went down, how would you find out? (if your customers notice before you do, there's a problem).
6. Who would respond right now if your entire system was down?
7. What if an employee went rogue? How much damage could they do, and how quickly could you recover?


Remember, the most important thing isn't to prevent failures, but to handle them, and be ready for when they happen.