I've been spending a lot of time lately looking for a good replacement for Loggly, ever since they've been having so many problems with uptime and availability. The most important feature of any log management platform is obviously to make sure it's available when I need it, and always collecting my logs. If the service drops 50% of my logs, then it's not very useful in tracking down those little bugs, I still have to log into all of my servers to see everything.
Thanks very much to Jordan Sissel for his post on Shipping Some Logs, we've decided to switch away from Loggly, to something else
Features of Log Management Solutions
What we really need out of a log management solution is something that easily integrates with our services transparently. What I mean by that is not something that will take extra code-level development work in order to use. We use Python with the standard logging module, which we then push to syslog. Syslog (or rsyslog as we use now) allows us to ship off logs to another remote server, which is really what we like to do. That means that no matter what language we use, all of our logs are able to be stored both locally and shipped off to the log management solution, without any integration with our actual programs. We also like to see logs from native linux apps like SSH, so the only real solution for us is something that integrates with syslog
Scale and Search
It also Needs to scale. We follow the rule of Log everything, even if you don't think you'll need it. Its not abnormal to have over 1 million log events, or 1GB of log data, in a single day. If you log everything, you can sort through things later to more easily find out what happened.
Since we do log everything, what we really need is the ability to search for something across all of our systems. We need to be able to trace down something that may have gone wrong and figure out exactly where it went wrong
We also want to be able to use our logging solution to alert us if something is going wrong. Specifically we have two types of alerts:
- If more then X events appear in Y minutes (error threshold)
- If LESS then X events appear in Y minutes (heartbeat)
Specifically, we deliver to clients using FTP, HTTP, or other methods, and we need to be alerted if we haven't delivered in over 10 minutes, or if we've received more then a few errors per client in an hour. It's not abnormal to have a client system go down for a few minutes, or receive just a periodic error (the internet is not perfect, after all), but if there say 500 errors delivering to client Z in an hour, then someone should be alerted. It's also nice to be able to schedule timeframes for when these alerts run, but that's just icing on the cake
Another feature we would like to have is the ability to graph events. Take for example graphing how many errors you've had over the past day. Even better, how many deliveries have you had over the past month? This could show us what days we have the heaviest amount of load on our services, and we could then use that information to determine when we need to have more servers available (after all, we're in the cloud so that can be really automated).
Uptime is key
The most important piece of any log management solution though, is consistent uptime and availability. If your logs are lost, or if you can't get to them in the time of an emergency, or if alerts suddenly fail because the service isn't available, then the service is completely useless
What services have we looked at?
We've looked at quite a few different services to solve our logging problems. Obviously we don't want to roll our own service, since that's really not our core business, and we don't want to maintain yet another system that doesn't make us any money. Here's a list of what we've looked at
We previously were using Loggly. It's cheap, only about $200/month for up to 1GB/day and 90 days of search history. It gives you graphs, search, and it's got the easy integration. With the addition of AlertBirds, it also has very robust alert management. AlertBirds is actually not integrated directly with Loggly, but it's a free service that they do also provide.
So why did we recently switch away from them? They violated the first rule of Log management, they had a horrible uptime. Whenever we went to their service, they always had a message about "Sorry, we're working on backfilling the servers with your logs". This type of transparency is nice, but it doesn't make up for constant issues. The final straw was A huge outage which lost all logs for several days and a weak apology where they basically blamed AWS for reboots, which Amazon announces up front that are not uncommon. Amazon has always had the policy that they may have to reboot your instance at any time. While they try to let you know about it, sometimes it's just not possible (what if a server suddenly dies?). They weakly tried to say it was their fault while simultaneously blaming AWS. It's one of the major points that anyone seriously interested in cloud computing should handle from the beginning. It's one of the few rules of doing business in the cloud
While digging around, I also took a look at Loggr.net. While at first glance it does appear very nice, they offer tons of analytics tools, it again violates one of my primary rules. It requires you to use their API to push logs, it doesn't integrate seamlessly with Syslog or any other common standard. Additionally, they charge per log event, and the highest plan they appear to have is for 20,000 log events per day. We blow through that in a few minutes, so obviously they're not designed to scale to what we need.
Papertrailapp is the current log management system of choice. They have one feature that I really wanted from the beginning with Loggly, being able to see a live tail of your logs. They do lack alerting and graphing, but they also offer up a very nice API which means you could build it yourself, or simply wait for them to build it as they seem to be very concerned about their customers. I've had several email interactions with the folks there, which is quite frankly the entire reason we're still staying with them. What's even nicer is that unlike Loggly, I don't have to keep logging in every 5 minutes to view my logs. Automatic login that actually works is very nice.
Although you wouldn't think much of it, being able to look at a live tail of a saved search is very important to debugging issues, or simply watching how well an upgrade went. Although Papertrailapp doesn't support more then 4 weeks of search, having this live tail of events is very nice to be able to see to get a "live look" at how well things are going in the system. They do also offer a nice archival to your S3 bucket which means you could do your own work with your log events after the 30 days of search results in Papertrailapp are gone
What other solutions are people looking at?
Have a better solution or use something better? Please let me know in the comments!
I wanted to clarify that Papertrail does support alerts.
Specifically, when new events occur that match a key search, Papertrail can notify a different service (like Campfire chat or Librato Metrics), send an email, or hit a URL that you provide (webhooks).
That can happen immediately, hourly, or daily as a summary.
Here's the webhook JSON format: http://help.papertrailapp.com/kb/how-it-works/web-hooks
We also maintain an open-source services app that anyone can fork (and optionally, contribute new services that we'll run): https://github.com/papertrail/papertrail-services
The part that's not supported is a min/max velocity threshold. As a workaround, thresholds are possible by using the Librato Metrics alert in Papertrail and tying an alert to that.
Give up on Heroku, EC2, GitHub, and basically anything shy of dedicated servers in your own colo. In all cases, you rely on someone else to fix stuff quickly (and some are more mission-critical than logging). I'm not saying that's right or wrong, only that it's not unique to logging.
Personally, I'm okay paying people to do other stuff well as long as it performs as advertised.
Do you have any more info about "Librato Metrics" in papertrail?
From there, attach Librato alerts using their thresh_above_value and thresh_below_value constraints (docs).
Librato's alerts were inspired by Papertrail's (which was heavily inspired by GitHub's), so the functionality and structure is pretty similar. We're happy to help set these up, too.
As Troy pointed out, our KB is at support.metrics.librato.com and the API docs are at dev.librato.com.
There's a live demo running at http://demo.logzilla.pro