Chris Moyer
Author of "Building Applications in the Cloud: Concepts, Patterns, and Projects"
Monday, January 23, 2012
Amazon DynamoDB
One very interesting and confusing part that I discovered was how Amazon actually measures this provisioned throughput. When creating a table (or at any time in the future), you set up a provisioned amount of "Read" and "Write" units individually. At a minimum, you must have at least 5 Read and 5 Write units provisioned. What isn't as clear, however, is that read and write units are measured in terms of 1KB operations. That is, if you're reading a single value that's 5KB, that counts as 5 Read units (same with Write). If you choose to read in eventually consistent mode, you're charged for half of a read operation, so you can essentially get double your provisioned read throughput if you're willing to put up with only eventually consistent reads.
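The unit accounting described above can be sketched in a few lines. This is only an illustration of the arithmetic, not Amazon's billing code: the function name is made up, and rounding a partial kilobyte up to the next whole unit is an assumption.

```python
import math


def capacity_units(item_size_kb, eventually_consistent=False):
    """Units consumed by one operation on an item of the given size.

    Assumption: a partial kilobyte rounds up to a whole unit.
    The halving in eventually consistent mode follows the text above.
    """
    units = math.ceil(item_size_kb)
    return units / 2 if eventually_consistent else units


print(capacity_units(5))                              # 5
print(capacity_units(5, eventually_consistent=True))  # 2.5
```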
Ok, so read operations are essentially just look-up operations. This is a database after all, so we're probably not just going to be looking up items we already know, right?
Wrong.
Amazon does offer a "Scan" operation, but they state that it is very "expensive". This isn't just in terms of speed, but also in terms of provisioned throughput. A Scan operation iterates over every item in the table. It then filters the returned results based on some very crude filtering options, which are nothing close to what SDB or any relational database offers. What's worse, a single Scan operation can operate on up to 1MB of data at a time. Since Scan operates only in eventually consistent mode, that means it will use up to 500 Read units in a single operation (1,000KB of items / 2 (eventually consistent) = 500). If you have 5 provisioned Read units per second, that means you're going to have to wait 100 seconds (almost 2 minutes) before you can perform another Read operation of any sort again.
So, if you have 1 million 1KB records in your table, that's approximately 1,000 Scan operations to perform, or about 500,000 Read units at 500 units each. Assuming you provisioned 1,000 Read units per second, that's roughly 8 minutes to iterate through the entire database. Now yes, you could easily increase your provisioned read throughput to cut that time down significantly, but let's assume that a single Scan operation takes at least 10ms. That still means the fastest you could get through your meager 1 million records is 10 seconds. Now extend that out to a billion records. Scan just isn't effective.
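Here is a rough sketch of the Scan arithmetic, using only the figures quoted above (1MB per Scan call, half-price eventually consistent reads, a 10ms floor per call). The function names are invented for illustration. Note that with the halving applied, the throughput-bound time for the full pass is 500 seconds; without it, the same pass would need 1,000 seconds (about 17 minutes).

```python
import math

SCAN_PAGE_KB = 1000  # one Scan call touches up to ~1MB, per the text


def scan_read_units(page_kb=SCAN_PAGE_KB, eventually_consistent=True):
    """Read units consumed by one Scan call over page_kb of data."""
    return page_kb / 2 if eventually_consistent else page_kb


def seconds_to_recover(read_units, provisioned_per_second):
    """Seconds of provisioned throughput needed to cover read_units."""
    return read_units / provisioned_per_second


def full_scan(total_kb, provisioned_per_second, min_latency_s=0.010):
    """Return (calls, throughput-bound seconds, latency-bound seconds)."""
    calls = math.ceil(total_kb / SCAN_PAGE_KB)
    units = calls * scan_read_units()
    return calls, units / provisioned_per_second, calls * min_latency_s


# One Scan against a table provisioned at 5 Read units per second
print(seconds_to_recover(scan_read_units(), 5))

# 1 million 1KB records at 1,000 Read units per second
calls, throughput_s, latency_s = full_scan(1000000, 1000)
print(calls, throughput_s, latency_s)
```

The first figure is the 100 second wait from the text; the second line gives the number of Scan calls and the two lower bounds on how long the full pass takes.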
So what's the alternative? Well, there's this other very obscure ability that DynamoDB has: you may set your Primary Key to a Hash and Range key. When you use the Query operation, you always need to provide your Hash Key, but you may also constrain the Range Key with Greater than, Less than, Equal to, Greater than or equal to, Less than or equal to, Between, or Starts With.
Unlike Scan, Query only operates on matching records, not all records. This means that you only pay for the throughput of the items that match, not for everything scanned.
So how do you effectively use this operation? Simply put, you have to build your own special indexes. This lends itself to the concept of "Ghost Records", which simply point back to the original record, letting you keep a separate index of the original for specific attributes. Let's assume we're dealing with a record representing a Person. This Person may have several things that identify it, but let's use a unique identifier as their Hash key, with no Range key. Then we'll create several separate Ghost records, in a different table. Let's call this table "PersonIndex".
Now if we want to search for someone by their First Name, we simply issue a query with a Hash Key of property = "First Name", and a Range Key of the first name we're looking for, or even use "Starts With" so that "Sam" also matches "Samuel". We can also insert "alias" records, for things like "Dick" to match "Richard". Once we retrieve the Index Record, we can use the "Stories" property to go back and retrieve the Person records.
So now to search for a record it takes us 1 Read operation to search, and 1 Read operation for each matching record, which is a heck of a lot cheaper than one million! The only negative is that you also have to maintain this secondary table of indexes. Keeping these indexes up to date is the hardest part of maintaining your own separate indexes. However, if you can do this, you can search and return records within milliseconds instead of seconds, or even minutes.
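To make the Ghost Record idea concrete, here is a small in-memory sketch. It does not talk to DynamoDB at all: the class, its method names, and the person ids are all hypothetical, and plain dicts stand in for the "PersonIndex" table (hash key = property name, range key = value). The set of ids plays the role of the property that points back at the original records.

```python
from collections import defaultdict


class PersonIndex:
    """Toy stand-in for the index table: {hash_key: {range_key: ids}}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(set))

    def put(self, prop, value, person_id):
        """Add (or extend) the ghost record for this property/value pair."""
        self._rows[prop][value].add(person_id)

    def query_equal(self, prop, value):
        """Hash key plus an 'Equal to' condition on the range key."""
        return set(self._rows[prop].get(value, set()))

    def query_starts_with(self, prop, prefix):
        """Hash key plus a 'Starts With' condition on the range key."""
        matches = set()
        for value, ids in self._rows[prop].items():
            if value.startswith(prefix):
                matches |= ids
        return matches


# The "real" table: Person records keyed by a unique identifier
people = {
    "p1": {"First Name": "Samuel"},
    "p2": {"First Name": "Richard"},
}

index = PersonIndex()
for person_id, person in people.items():
    index.put("First Name", person["First Name"], person_id)
index.put("First Name", "Dick", "p2")  # alias record for Richard

# One lookup in the index, then one lookup per matching Person
for person_id in index.query_starts_with("First Name", "Sam"):
    print(person_id, people[person_id])
```

Every write to a Person has to be mirrored by a `put` on the index, which is exactly the maintenance burden mentioned above.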
How are you using or planning to use Amazon DynamoDB?
Friday, January 13, 2012
What's coming up for Amazon Web Services Cloud?
They've also been launching new services, such as ElastiCache, a memcached service offered up directly from Amazon. And of course it wouldn't be Amazon without dozens of improvements to existing services, such as finally enabling console logins for IAM users, and vast improvements to Amazon SNS.
So what then really can we expect at this new announcement? A few things have been on the radar for quite a while now:
- Location-Aware Route53 (DNS) - Make sure that users get the closest server to where they are
- Two-Factor Auth for IAM Users
- APN (Apple Push Notifications) support for SNS
- "Pass Through" option for CloudFront (i.e. don't cache certain URLs)
- Lots of improvements for SimpleDB, such as SQL functions and cross-domain joins.
Thursday, January 5, 2012
Monitor your SDB Domains
As you should be aware, SimpleDB has a limit of 10GB per domain. This limit is calculated as a sum of the bytes used by Item Names, Attribute Names (unique), and Attribute Values. Fortunately, Attribute Names are only stored once per name, so you don't pay for each name being used multiple times; you are only charged per unique name.
You can get all of the Usage information about your SDB Domain using the get_metadata function of a boto domain.
>>> import boto
>>> sdb = boto.connect_sdb()
>>> db = sdb.lookup("my-domain")
>>> md = db.get_metadata()
This "md" object then contains the following elements:
md.item_names_size
md.attr_names_size
md.attr_values_size
md.item_count
I wrote a simple script to check my domains, which takes an optional list of arguments for domain names to check. If you don't pass in a domain name, it will iterate over all of them and show you any domain that uses more than 3GB:
#!/usr/bin/env python
"""
Check script to make sure none of our domains are close to the size limit
"""
from __future__ import print_function

import sys

import boto

if __name__ == "__main__":
    sdb = boto.connect_sdb()
    if len(sys.argv) > 1:
        # Domains named on the command line are always reported
        query = [sdb.lookup(n) for n in sys.argv[1:]]
        limit = 0
    else:
        # Otherwise only report domains using more than 3GB
        query = sdb.get_all_domains()
        limit = 3000000000
    for db in query:
        if db is None:
            # lookup() returns None for a domain that does not exist
            continue
        md = db.get_metadata()
        # The size limit counts item names, unique attribute names and values
        total = int(md.item_names_size) + int(md.attr_names_size) + int(md.attr_values_size)
        if total > limit:
            print(db.name)
            print("\tItems:", md.item_count)
            print("\tItem Name Size:", md.item_names_size)
            print("\tAttribute Name Size:", md.attr_names_size)
            print("\tAttribute Values Size:", md.attr_values_size)
            print("\t---------------------------------------------")
            print("\tTOTAL:", total)
If your domain is approaching the 10GB limit, you can use this to track down what's using a lot of space. In my case, I was adding a lot of unnecessary items that were almost completely blank, so my item_count was huge, and my item_names_size was over 7GB.
Of course, if you do have need for all these items, you should consider sharding your domain into multiple sub-domains. This process is usually handled by taking one attribute that nicely splits your items into different segments, and using that value as the domain name. Unfortunately, you cannot query across multiple domains, so you have to be very careful what you choose as your Shard Key.
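As a rough sketch of that sharding scheme, the helper below picks a domain name from the value of a shard key. The function names, the "people" base name, and the "country" attribute are all hypothetical, and the character clean-up is a conservative choice made for this example rather than a statement of SimpleDB's naming rules.

```python
def shard_domain(base_name, item, shard_key):
    """Build the domain name for an item from the value of its shard key."""
    value = str(item[shard_key]).lower()
    # Conservative clean-up: keep ASCII letters, digits, "-", "_" and "."
    allowed = "abcdefghijklmnopqrstuvwxyz0123456789-_."
    safe = "".join(c if c in allowed else "-" for c in value)
    return "%s-%s" % (base_name, safe)


def group_by_shard(base_name, items, shard_key):
    """Group items by the domain they would be written to."""
    shards = {}
    for item in items:
        shards.setdefault(shard_domain(base_name, item, shard_key), []).append(item)
    return shards


items = [
    {"name": "Samuel", "country": "US"},
    {"name": "Richard", "country": "UK"},
    {"name": "Maria", "country": "US"},
]
for domain, members in sorted(group_by_shard("people", items, "country").items()):
    print(domain, len(members))
```

Because a query can only hit one domain, any search that doesn't include the shard key has to be repeated against every shard, which is why the choice of key matters so much.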
Friday, December 30, 2011
The Great search for Syslog services
I've been spending a lot of time lately looking for a good replacement for Loggly, ever since they've been having so many problems with uptime and availability. The most important feature of any log management platform is obviously to make sure it's available when I need it, and always collecting my logs. If the service drops 50% of my logs, then it's not very useful in tracking down those little bugs, because I still have to log into all of my servers to see everything.
Thanks very much to Jordan Sissel for his post on Shipping Some Logs. We've decided to switch away from Loggly to something else.
Features of Log Management Solutions
What we really need out of a log management solution is something that easily integrates with our services transparently. What I mean by that is something that won't take extra code-level development work in order to use. We use Python with the standard logging module, which we then push to syslog. Syslog (or rsyslog as we use now) allows us to ship off logs to another remote server, which is really what we like to do. That means that no matter what language we use, all of our logs are able to be stored both locally and shipped off to the log management solution, without any integration with our actual programs. We also like to see logs from native Linux apps like SSH, so the only real solution for us is something that integrates with syslog.
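For readers who haven't wired this up before, here is a minimal sketch of pointing Python's standard logging module at syslog. It is not our production configuration: the logger name, the message format, and the UDP address 127.0.0.1:514 are assumptions, and many Linux setups would use the local socket "/dev/log" instead. Forwarding to the remote service is then left entirely to syslog/rsyslog.

```python
import logging
import logging.handlers


def build_logger(name, address=("127.0.0.1", 514)):
    """Return a logger that hands its records to a syslog daemon over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Avoid stacking duplicate handlers if this is called more than once
    already = any(isinstance(h, logging.handlers.SysLogHandler)
                  for h in logger.handlers)
    if not already:
        handler = logging.handlers.SysLogHandler(address=address)
        handler.setFormatter(
            logging.Formatter("%(name)s: %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("delivery")
    log.info("delivered file to client Z")
```

The application only ever sees the logging module, so switching log management providers becomes a syslog configuration change rather than a code change.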
Scale and Search
It also needs to scale. We follow the rule of "log everything, even if you don't think you'll need it". It's not abnormal to have over 1 million log events, or 1GB of log data, in a single day. If you log everything, you can sort through things later to more easily find out what happened.
Since we do log everything, what we really need is the ability to search for something across all of our systems. We need to be able to trace down something that may have gone wrong and figure out exactly where it went wrong.
Alerts
We also want to be able to use our logging solution to alert us if something is going wrong. Specifically we have two types of alerts:
- If more than X events appear in Y minutes (error threshold)
- If LESS than X events appear in Y minutes (heartbeat)
Specifically, we deliver to clients using FTP, HTTP, or other methods, and we need to be alerted if we haven't delivered in over 10 minutes, or if we've received more than a few errors per client in an hour. It's not abnormal to have a client system go down for a few minutes, or receive just a periodic error (the internet is not perfect, after all), but if there are, say, 500 errors delivering to client Z in an hour, then someone should be alerted. It's also nice to be able to schedule timeframes for when these alerts run, but that's just icing on the cake.
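The two alert types above boil down to counting events in a sliding window. The sketch below shows that logic on plain lists of timestamps (in seconds); the function names and the sample numbers are made up for illustration, and a real service would of course run this against its own event store.

```python
def events_in_window(timestamps, now, window_seconds):
    """Count events with a timestamp in the window (now - window, now]."""
    start = now - window_seconds
    return sum(1 for t in timestamps if start < t <= now)


def threshold_alert(timestamps, now, max_events, window_seconds):
    """True when MORE than max_events appeared in the window."""
    return events_in_window(timestamps, now, window_seconds) > max_events


def heartbeat_alert(timestamps, now, min_events, window_seconds):
    """True when FEWER than min_events appeared in the window."""
    return events_in_window(timestamps, now, window_seconds) < min_events


# Heartbeat: no delivery in the last 10 minutes should raise an alert
deliveries = [0, 120, 300]
print(heartbeat_alert(deliveries, now=1000, min_events=1, window_seconds=600))

# Threshold: far more than 100 errors for one client within an hour
errors = list(range(1, 501))
print(threshold_alert(errors, now=3600, max_events=100, window_seconds=3600))
```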
Graphing
Another feature we would like to have is the ability to graph events. Take for example graphing how many errors you've had over the past day. Even better, how many deliveries have you had over the past month? This could show us what days we have the heaviest amount of load on our services, and we could then use that information to determine when we need to have more servers available (after all, we're in the cloud so that can be really automated).
Uptime is key
The most important piece of any log management solution, though, is consistent uptime and availability. If your logs are lost, or if you can't get to them in the time of an emergency, or if alerts suddenly fail because the service isn't available, then the service is completely useless.
What services have we looked at?
We've looked at quite a few different services to solve our logging problems. Obviously we don't want to roll our own service, since that's really not our core business, and we don't want to maintain yet another system that doesn't make us any money. Here's a list of what we've looked at:
Loggly
We previously were using Loggly. It's cheap, only about $200/month for up to 1GB/day and 90 days of search history. It gives you graphs, search, and it's got the easy integration. With the addition of AlertBirds, it also has very robust alert management. AlertBirds is actually not integrated directly with Loggly, but it's a free service that they do also provide.
So why did we recently switch away from them? They violated the first rule of log management: they had horrible uptime. Whenever we went to their service, they always had a message about "Sorry, we're working on backfilling the servers with your logs". This type of transparency is nice, but it doesn't make up for constant issues. The final straw was a huge outage which lost all logs for several days, followed by a weak apology where they basically blamed AWS for reboots, which Amazon says up front are not uncommon. Amazon has always had the policy that they may have to reboot your instance at any time. While they try to let you know about it, sometimes it's just not possible (what if a server suddenly dies?). Loggly weakly tried to say it was their fault while simultaneously blaming AWS. Handling reboots is one of the major points that anyone seriously interested in cloud computing should plan for from the beginning. It's one of the few rules of doing business in the cloud.
Loggr.net
While digging around, I also took a look at Loggr.net. At first glance it does appear very nice, with tons of analytics tools, but it again violates one of my primary rules: it requires you to use their API to push logs, and it doesn't integrate seamlessly with syslog or any other common standard. Additionally, they charge per log event, and the highest plan they appear to have is for 20,000 log events per day. We blow through that in a few minutes, so obviously they're not designed to scale to what we need.
Papertrailapp
Papertrailapp is the current log management system of choice. They have one feature that I really wanted from the beginning with Loggly, being able to see a live tail of your logs. They do lack alerting and graphing, but they also offer up a very nice API which means you could build it yourself, or simply wait for them to build it as they seem to be very concerned about their customers. I've had several email interactions with the folks there, which is quite frankly the entire reason we're still staying with them. What's even nicer is that unlike Loggly, I don't have to keep logging in every 5 minutes to view my logs. Automatic login that actually works is very nice.
Although you wouldn't think much of it, being able to look at a live tail of a saved search is very important to debugging issues, or simply watching how well an upgrade went. Although Papertrailapp doesn't support more than 4 weeks of search, having this live tail of events is very nice for getting a "live look" at how well things are going in the system. They do also offer a nice archival to your S3 bucket, which means you could do your own work with your log events after the 30 days of search results in Papertrailapp are gone.
What other solutions are people looking at?
Have a better solution or use something better? Please let me know in the comments!
ROI: Why aren't we building this ourselves?
The answer can be summed up in 3 words, Return on Investment. What exactly is the ROI for me building it myself, and how much would it actually cost to build and maintain the service, vs purchasing an external service that does the same thing?
Your time may be better spent elsewhere
Most likely your company, like mine, isn't based around Log management. If the task you're thinking about doing yourself is solved by other solutions out there already, and it's not part of your core business, then you always have to remember that your time could be better spent doing things that improve your product.
Let's take a look at log management. Yes, I could probably build a service that would be better than Papertrailapp, providing me the features that I want and need without having other features that maybe I don't (although Papertrailapp doesn't have a lot of extra features that I wouldn't need). Maybe I could add in some nifty graphs, and other sorts of alert management, but how much time would that take me? At a minimum, it would take me several months to build a system to do all that Papertrailapp does that I need, and it would take me several more months to add in everything I really want that they don't have. But what would that give me? Would our core product be any better? What value could I have added to our core products that I couldn't because I was working on this project instead?
Quite simply, you always have to ask yourself before you start on a project, would my time be better spent elsewhere? If the answer is yes, you probably shouldn't be building it yourself.
It's probably not cheaper to build it yourself
Let's face it, no project is ever completed, even after you're "finished" with it and it's being used. There's always regular maintenance, unexpected events, and updates that need to be done. That log management service may be costing you $200/month, but what's that in hours of your time? Let's say you get paid $20/hour; that's 10 hours of your time per month, or about half an hour per working day on average. Would it cost you more than that to maintain and update the system? What about adding new features? Even if you ignore the fact that you had to initially build the system, you're still getting quite a deal. How can they do this cheaper than you can? Quite simply, they spread the cost of the same product across many customers. You don't. If you think it would take you less than half an hour a day on average just to maintain the system even after it's built, you've probably never actually worked with a system this big. Do you think they spend less than that per day maintaining the system?
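If you want to plug in your own numbers, the break-even arithmetic looks like this. The figures are the ones from the paragraph above; the 20 working days per month is an assumption (spread over 30 calendar days, the same 10 hours comes to 20 minutes a day).

```python
def break_even_hours(service_cost_per_month, hourly_rate):
    """Hours of your own time that the monthly service fee would buy."""
    return service_cost_per_month / hourly_rate


def minutes_per_day(hours_per_month, days_per_month):
    """Spread a monthly hour budget evenly across the given number of days."""
    return hours_per_month * 60 / days_per_month


hours = break_even_hours(200, 20)
print(hours)                       # hours per month
print(minutes_per_day(hours, 20))  # per working day (assumed 20 per month)
print(minutes_per_day(hours, 30))  # per calendar day
```

If maintaining your own system would eat more than that budget, the external service is the cheaper option even before counting the time it took to build.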
Leave it to the experts
You don't do this for a living. Those that offer these services are focused completely on providing a service like this just for clients like you. If they're any good, they're also asking you for input on how you use the system and what new features you might be looking for. What happens if the system goes down? Are you monitoring this daily to make sure you respond quickly to an outage? If it's running for months and then suddenly breaks, then you have to stop working on your core business just to fix a problem with your log management system. Unless you're constantly working with it, it'll probably take you longer to "switch gears" and get back into the mode to fix and update the system. Leave it to the experts that work with their technology on a regular basis.
Anytime you think "I could write this better myself", don't forget to ask the question "but is it worth it?"
A quick note on Papertrailapp, they are missing a few features I would love to see, but they offer a very substantial API which can easily be used to add the features you may want, or integrate with your other systems.
Wednesday, December 21, 2011
Design for Failure
Split your system into Modules
Monitor the health of your systems
Another very common practice to minimize the impact of failures is to make sure that you know when your service is down. You may have several different methods of determining the "health" of your system, but in the end you need to make sure that you're monitoring the end-to-end result of your system. But who monitors the monitors? If you're writing your own checks to monitor your systems, make sure there's a third party also involved. If your monitors are running on the same hardware or platform as your service, what happens if you have a hardware/platform failure? How are you being notified? If your notification system goes down, do you find out about it?
At Newstex, we use several different methods to monitor the health of our systems.
Pingdom
Pingdom offers simple port-based checks. They have monitor servers located all across the world, and allow you to check things like web servers, or any generic TCP port. We use this to monitor the health of all of our web-accessible services, as well as check some services that simply respond to HTTP just to verify the health of the system. Pingdom provides a nice tie-in with other services like Pagerduty to notify your staff when there's a problem.
Papertrail
Papertrail is a nice solution which offers log management. Although they don't currently have much in the way of alerts, they can notify you via Email or Pagerduty if certain events appear within your log events. They have future plans to also include checking for the lack of events, as well as thresholds, (for example, at least 15 events within the last 5 minutes). Papertrail also has a nice API that you can tie into to make your own custom alert monitoring, just make sure this isn't the ONLY way you get notified of a problem.
Custom Scripts
For those things that are custom issues we like to track, we also use our own custom scripts. We verify these scripts are working through other services (such as Pingdom), but even if the system itself is working, that doesn't mean there might not be other minute problems. For example, you may want to track and log if it's taking a particularly long amount of time to render your webpage, or if a user receives a 5xx level error for any reason. There's nothing wrong with using your own custom monitoring scripts, just as long as that's not the only method you use to check your system.
Determining Risk
Have a Plan
The most important thing to remember is that when there is an outage, it doesn't matter whose fault it is. What matters most is minimizing the impact and downtime of the outage.
Questions you should ask yourself (or your IT staff)
6. Who would respond right now if your entire system was down?
7. What if an employee went rogue, how much damage could they do and how quickly could you recover?
Remember, the most important thing isn't to prevent failures, but to handle them, and be ready for when they happen.
Saturday, December 17, 2011
Bridging the communication barrier: describing what you do to your boss.
I don't blog a lot, but when I do blog it's always about something that I have real-world knowledge of, and I'm very passionate about. This is one thing that I'm very passionate about. It's not very useful to know something if you can't describe it to others. If you really want to get ahead in life, you're going to have to explain things to people that have absolutely no clue what you're talking about, and don't have any knowledge of the terminology you use with your colleagues. Bridging this communication barrier can be quite complex, and is something that few people really understand. It's the reason most companies have several layers of management, simply because an executive top-level person wouldn't ever be able to understand anything a bottom-level engineer is describing. If you're one of those middle management folks, it's probably because you're very good at bridging the communication barrier.
I recently read a post by Jordan Sissel, a former classmate of mine at college, where he details how to Speak the same language with others. While I do and always will have a lot of respect for Jordan, there's one thing that I strongly disagree with on this point. Effective communication isn't always about speaking the same language; it's usually about how to speak to someone who doesn't understand your language. Describing to a business person all the details they need to know to speak your language is usually impossible to do within any reasonable period of time (but don't worry, them teaching you to speak their language would be equally challenging). So how really do you communicate effectively to people who don't speak your same language?
Find a common language
When two or more people meet and try to communicate with each other, the first step is always to find the common language that both people can understand. This is true not only of two people who both speak English but use different vocabularies, but also of people that speak literally different languages. If you came across someone that spoke German, for example, and you really needed to use the bathroom, how would you go about describing that to them? Obviously you trying to teach them English wouldn't work, and them teaching you German before you pee yourself probably wouldn't work either. What you want to do is find some way to explain to them that you really need to go to the bathroom. You start off saying words like "I need to pee!" and they don't understand, ok, so that's not the common language you both speak. Then you start making hand gestures, maybe even that doesn't work. Finally you start dancing up and down holding your crotch and crossing your legs. They finally catch on and direct you to the bathroom. They don't tell you in German how to get there, they actually show you. Why? Because you've established a common language that you can both speak.
The same is true with any communication experience. You can speak in multiple different forms of communication, so what you do in any meeting is try to find a common language that you can speak and the intended audience can understand. I'm doing it right now; I'm typing out my thoughts into a written form which, hopefully, you as the reader will understand and be able to use in the future.
So how, really, do you go about finding a common language? You have to study your target audience. The first part of being able to communicate to your audience is to listen. If you're trying to explain something to your boss, find out what he likes. If he's a big baseball fan, and you know about the rules of baseball, you can speak the same language. If he likes to go fishing, learn more about fishing, and you can speak that language. If you're trying to say something to anyone else, the first thing you have to do is listen to what they say and then you can find a common language. Speaking louder, slower, or faster, doesn't help.
Use analogies and examples
The very title of this post is an example. No, this isn't just about how you talk to your boss, but that's something many people have a hard time doing. So what did I do here? I found a common language we can all understand (talking with our boss), and I've given that out as an example.
Analogies, no matter how silly they may seem, are generally very effective if done well. Take, for example, the long-criticized analogy The internet is a series of tubes. Yes, for you the reader this may seem like a horrible analogy, and a great reason why not to use analogies. Ok, so that's how you feel about this analogy, but you weren't the target audience. Take this analogy to your grandparents, or your redneck cousins who don't understand anything about the internet. After they read it, I bet they have a much better understanding of what the internet really is.
You have to target your analogies at your audience. Once you've found the common ground, speak it. Study it, and make sure you have it right.
The best example of an analogy in recent times that I've heard was actually (and I'm hesitant to say this) while watching the coverage of SOPA. I apologize in advance for not having the exact quote, or even the name of the senator who made it (and if you do please let me know so I can quote/add it here). For those of you who don't know, SOPA is the proposed bill that attempts to prevent online piracy of copyrighted works distributed through the internet, by using DNS blocking and filtering.
The basic gist of it is this:
When we identify a crack house, we send in a raid and shut it down, taking those responsible into custody and making sure that they are punished for selling crack. What this bill is essentially suggesting is that instead of arresting those responsible, we change the street signs and take their address out of the GPS and map systems. The house will still be there, and anyone who's smart enough to figure out how to get there will still have their crack. The problem isn't actually solved. Worse yet, those normal law-abiding citizens now have to deal with the fact that their street signs were all changed, and the GPS systems were all completely messed with. This is essentially what we're doing with DNS in SOPA.
It really is a great analogy if you think about it. By using DNS filters and blocking, we're not removing the site (it still has a public IP address), and the impact falls on the normal law-abiding citizen more than on those who are actively trying to reach that site. They went on to talk about how even a ten year old could follow instructions on how to change their DNS servers to use something housed outside of the US that has the right information. Back to the analogy, anyone could still get a copy of a map that has that crack house's address on it, and go off of that instead of anything official.
Listen to and make sure your audience understood
The final piece of any communication session is to listen again to your audience and make sure they understood what you said. If you gave an analogy or example that they didn't understand, you may need to explain things a little more clearly, give more details, or try a different analogy altogether. While switching analogies can be quite confusing, if you start out talking about baseball and they don't understand anything about baseball, it's time to switch and try something else. It's important at this stage to wrap up your point and make sure they come away with a better understanding of what you were trying to explain. Don't think of this as a quiz time, but it's good to allow them to follow up with questions. Be open minded and answer their questions with the same common language you've been using. Don't assume that if they don't understand it's their fault; in fact, if they don't understand, it's because you didn't communicate effectively.
There are always going to be some times where your target and you simply can't find any common language to speak, and even after you think they understood, they just look at you completely puzzled. If this happens, it's time to find someone who can bridge the gap. If you have trouble explaining something to your grandparents, try first explaining it to your parents, then perhaps they can explain it to your grandparents. You don't always have the ability to speak to everyone, in fact those that do usually end up in public jobs.
