Amazon CloudSearch

So you may remember that I recently posted about how to index using DynamoDB. At Newstex, we had just built a custom solution on top of this indexing scheme to help us search through our SimpleDB domains more quickly and pull up content within a matter of milliseconds. We were scheduled to go live with this change on Friday, April 13th. I even gave a presentation in front of Mitch Garnaat, with Jeff Barr also present in the building (although he did not attend my presentation).

On Thursday, April 12th, Amazon announced CloudSearch.

After a few hours of intense anger at myself for not realizing that this would be coming, and for wasting the last month of my life on a DynamoDB-based solution, I immediately dropped everything else and spent the next two days integrating CloudSearch into our system and rewriting my entire presentation for BarCampRochester.

Don't get me wrong, in the end I'm very happy that CloudSearch is available, but what upset me most is that I didn't see this coming. When you think of Amazon, you don't usually think of search, yet they are one of the top web-based retailers. And how do you find products on Amazon? By searching.

Yes, the exact same search engine, powered by A9, which runs Amazon.com's massive website and index of products, can now be integrated into any web application you want, for a price.

At Newstex, we currently have one search domain with over one million documents in it. Each document has dozens of indexed metadata fields, and still we retrieve search results in under 100ms. This includes facets, which is search-engine lingo for those little filters that let you narrow down your search results by things like department, brand, and features. Amazon makes all aspects of search incredibly simple with this new JSON-capable API.
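To give a feel for what a faceted search request looks like, here is a minimal sketch that only builds the request URL. It assumes the 2011-02-01 API version and its `q`, `facet`, `return-fields`, and `size` parameters; the endpoint hostname and the field names (`publisher`, `category`, `title`) are placeholders, not our real configuration.

```python
from urllib.parse import urlencode

API_VERSION = "2011-02-01"  # assumption: the API version CloudSearch launched with


def build_search_url(endpoint, query, facets=(), return_fields=(), size=10):
    """Build a search URL for a search endpoint, optionally requesting facets."""
    params = [("q", query), ("size", str(size))]
    if facets:
        # Facet counts are requested as a comma-separated list of field names.
        params.append(("facet", ",".join(facets)))
    if return_fields:
        # Only fields configured as result fields can be returned.
        params.append(("return-fields", ",".join(return_fields)))
    return "http://%s/%s/search?%s" % (endpoint, API_VERSION, urlencode(params))


url = build_search_url(
    "search-example-abc123.us-east-1.cloudsearch.amazonaws.com",  # placeholder
    "cloud computing",
    facets=["publisher", "category"],    # hypothetical field names
    return_fields=["title"],             # hypothetical field name
)
print(url)
```

Fetching that URL should return a JSON body containing the hits plus a count per facet value, which is what you would use to draw the filter sidebar.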

You can get started with CloudSearch by simply setting up a new domain via the Amazon Console. Adding documents is relatively painless and very well documented, and Mitch is working hard on integrating support into boto. Until then, however, you can use cloudsearch.py, provided by ex.fm.
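For reference, here is a rough sketch of what a document batch looks like before it is uploaded. This is not the cloudsearch.py code. It assumes the 2011-02-01 JSON batch format, where each add operation carries `type`, `id`, `version`, `lang`, and `fields`. The helper names, the ID rule in the regex, and the sample fields are my own illustrative assumptions.

```python
import json
import re
import time

# Assumed ID rule: lowercase letters, digits and underscores,
# not starting with an underscore.
_VALID_ID = re.compile(r"^[a-z0-9][a-z0-9_]*$")


def make_add_operation(doc_id, fields, version=None, lang="en"):
    """Return one 'add' operation as a plain dict (hypothetical helper)."""
    if not _VALID_ID.match(doc_id):
        raise ValueError("invalid document id: %r" % doc_id)
    if version is None:
        # A timestamp is a simple way to get an ever-increasing version.
        version = int(time.time())
    return {
        "type": "add",
        "id": doc_id,
        "version": version,
        "lang": lang,
        "fields": fields,
    }


def make_batch(operations):
    """Serialize a list of operations into a JSON batch body."""
    return json.dumps(list(operations))


batch = make_batch([
    make_add_operation(
        "story_1001",
        {"title": "Example headline", "publisher": "Example Wire"},  # sample data
        version=1,
    ),
])
print(batch)
```

The resulting string is what gets POSTed to the domain's document endpoint with a JSON content type. The upload itself is left out here, since that is the part cloudsearch.py handles.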

There were a few problems with this file, so I've put up my own fork at https://gist.github.com/2414564 with a few bug fixes and updates. Note that the support here is still very much a work in progress, and hopefully we'll eventually get this integrated into boto directly.


As for Newstex, we're now at just over a million documents indexed, but we're already on an Extra Large instance (search.m2.xlarge). We believe this is because we're indexing a large number of fields that are not literals. If you're encountering similar issues, think about which result fields you really need, and consider turning anything you can into a literal field (everything except a full-text index should be a literal or uint). We've also removed quite a bit of our usage of result fields (although we probably still have too many facets).
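If you want to audit your own domain along these lines, something like the sketch below can help. It is purely illustrative: it works on a plain dict describing a planned schema rather than calling any CloudSearch API, and the function name and field names are made up for the example.

```python
def fields_to_convert(schema, full_text_fields):
    """Return the 'text' fields that are not needed for full-text search.

    schema maps field name -> type ('text', 'literal' or 'uint').
    Anything returned is a candidate for becoming a literal (or uint).
    """
    keep = set(full_text_fields)
    return sorted(
        name
        for name, field_type in schema.items()
        if field_type == "text" and name not in keep
    )


# Hypothetical schema: only body and title really need full-text search.
schema = {
    "body": "text",
    "title": "text",
    "publisher": "text",
    "category": "literal",
    "published": "uint",
}

candidates = fields_to_convert(schema, full_text_fields=["body", "title"])
print(candidates)
```

With this sample schema, `publisher` is the only field flagged, since it is typed as text but is only ever matched exactly.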
