Indexing with DynamoDB

One of Amazon's coolest services recently announced was Amazon DynamoDB. With this new service, you can utilize the massive power of Amazon's Cluster of Solid State Disks, and computational power to store and search your data. What's interesting about DynamoDB is that it doesn't index any of the fields you provide (outside of the ID), so if you want to be able to retrieve your data, most likely it's going to have to be via ID and lookups, not by scanning or querying. If you want to really utilize DynamoDB, you have to re-think how you store your data.

The Concept

Whenever Amazon releases a new product this interesting, I always try to figure out how it would work into my current workflow. For DynamoDB, I realized this could very easily solve my indexing problems, by simply taking a few notes from other search systems, I created an algorithm which indexes the most common versions of any given string you provide to it. For example, if you want to the following string:

Learning by Doing:

This gets first split into it's tuple pairs, each combination of words that it appears in this string. This starts with the full string:

LEARNING BY DOING

Then we go to two word tuples:

LEARNING BY
BY DOING

Then we go into single word tuples:

LEARNING
BY
DOING

Then we also index each "stemmed" version of each tuple string:

LEARN BY DO
LEARN BY
BY DO
LEARN
BY
DO

This leaves us with a list, which we then make sure to remove any duplicates from, and then insert the corresponding records into DynamoDB. The "id" field is whatever we pass into the add function when indexing. Lets see how this works in the new botoweb.db.index.Index class.

Indexing with Botoweb

I decided that this isn't just something that might be useful for Newstex, but instead could be generally useful. So I made a generic Index class which can be used to generate full and complex Indexes. To get started, simply create a new Index object

>>> from botoweb.db.index import Index
>>> index = Index("test-search")

Then lets add something to the index:

>>> index.add("Learning by Doing", "learnbydoing.com")

Note that if this is the first time you've created this index, it may take a minute or two for the first item to be added. This is because the Index class is actually creating your DynamoDB table before adding the item to Dynamo. After the record is added, you'll be able to search for it by almost any of the terms you would think of:

>>> for item in index.search("learn"):
... print item['id']
...
learnbydoing.com

What's even more fun is when you start indexing longer "collapsed" words:

>>> index.add("coredumped.org", "cdump")
>>> for item in index.search("core dumped"):
... print item['id']
...
cdump

What you don't see is that behind the scenes here multiple different searches take effect (with fallbacks so your primary search is always given precedence). In this search since there is no match for "core dumped' as two separate words, it also checks for "coredumped" as a single collapsed version. Additionally, the indexer takes the "." as a word separator, so it's not required to match the search result. However, what will not match is just using the word "core".

I'm still trying to find a good way to match partial words typed in (short of indexing each letter), so if anyone has any ideas there please let me know!

Chris Moyer

Search This Blog