Tuesday, May 29, 2012

Take your Cloud With you: Amazon announces VM Export

While most people were getting ready for the long weekend, Amazon was busy releasing its first phase of being able to take your cloud with you: VM Export for Windows-based EC2 instances. While this might seem like a small bit of functionality to most, it shows that Amazon is thinking about its customers' concerns.

The biggest concern many businesses, big and small, have about moving to any cloud platform is the risk of vendor lock-in. Amazon is the biggest instance of this kind of potential problem, as it has more cloud services than any other cloud provider. This means that it's not very easy to migrate away from Amazon's cloud services (even though there are "compatible" solutions like Eucalyptus, which offers a compatible API). Recently Amazon announced that it was partnering with Eucalyptus in an effort to reduce the risk of vendor lock-in that many clients feared. Of course it's still in Amazon's best interest to keep you, but they want to make sure you don't feel like you're going to "get stuck" with them.

While currently all you can export is Windows-based AMIs, obviously the long-term plan is to make sure that you can export anything you want out to a VMware image. Since VMware images can be uploaded to a few cloud providers, and since you can run your own VMware solution, this really increases the flexibility you have when working in a Cloud Computing environment like AWS.

This also adds one additional feature which most people probably haven't thought of: you can now export your EC2 instance, work on it locally, then upload it again to EC2 when you're ready to run it in the cloud.

That concept is huge. Uploading a VM that was downloaded from an original EC2 AMI means that you can ensure compatibility with EC2, and you're suddenly able to really have full control over your AMIs. This may not be the original reasoning for launching this feature, but it certainly is a nice side-effect.

Thursday, May 17, 2012

Using MixPanel with Python

In my Recent post about my Great search for a decent analytics solution, I introduced MixPanel, the new Analytics system that we're test-driving at Newstex. I went over a little bit of how we've been testing it out by importing all the data we had in our old system into MixPanel so we could review it with our actual live data. While doing so, I decided to write something a little more generic and robust for tracking events on the server side. After all, it might be useful in other places within our system, so we can actually track events from the backend.

Tracking events server-side has a pretty big advantage in that it doesn't depend on a specific client application. Since we do everything as APIs with clients, this means that all of our clients log similar events no matter what platform they're running on.

We do everything in Python, and the documentation does give a relatively rudimentary API for pushing events to the server, but interacting directly with the REST API just seemed to be a lot easier. I wrote a very simple class that handles pushing events to the server, both asynchronously and synchronously. Instead of handing events off to a task queue, as mixpanel-celery does, this just spawns a new thread for each tracked event if you call it with track_async. It also allows you to pass in a callback function to be executed once the track event has fired, which helps if you need to be absolutely sure your event was saved properly.

But enough talking, on to the code:


"""
Event tracking, currently uses Mixpanel: https://mixpanel.com
"""
import urllib2
import json
import base64
import time

TRACK_BASE_URL = "http://api.mixpanel.com/track/?data=%s"
ARCHIVE_BASE_URL = "http://api.mixpanel.com/import/?data=%s&api_key=%s"

class EventTracker(object):
  """Simple Event Tracker
  Designed to be generic, but currently uses Mixpanel
  to actually handle the tracking of the events
  """
  def __init__(self, token, api_key=None):
    """Create a new event tracker
    :param token: The auth token to use to validate each request
    :type token: str
    """
    self.token = token
    self.api_key = api_key

  def track(self, event, properties=None, callback=None):
    """Track a single event
    :param event: The name of the event to track
    :type event: str
    :param properties: An optional dict of properties to describe the event
    :type properties: dict
    :param callback: An optional callback to execute when
      the event has been tracked.
      The callback function should accept two arguments, the event
      and properties, just as they are provided to this function 
      This is mostly used for handling Async operations
    :type callback: function
    """
    if properties is None:
      properties = {}
    # dict.has_key() is deprecated; the "in" operator does the same check
    if "token" not in properties:
      properties['token'] = self.token
    if "time" not in properties:
      properties['time'] = int(time.time())

    assert "distinct_id" in properties, "Must specify a distinct ID"

    params = {"event": event, "properties": properties}
    data = base64.b64encode(json.dumps(params))
    if self.api_key:
      resp = urllib2.urlopen(ARCHIVE_BASE_URL % (data, self.api_key))
    else:
      resp = urllib2.urlopen(TRACK_BASE_URL % data)
    resp.read()

    if callback is not None:
      callback(event, properties)

  def track_async(self, event, properties=None, callback=None):
    """Track an event asynchronously; essentially this runs the track
    event in a new thread
    :param event: The name of the event to track
    :type event: str
    :param properties: An optional dict of properties to describe the event
    :type properties: dict
    :param callback: An optional callback to execute when the event has been
      tracked. The callback function should accept two arguments, the event
      and properties, just as they are provided to this function
    :type callback: function

    :return: Thread object that will process this request
    :rtype: :class:`threading.Thread`
    """
    from threading import Thread
    t = Thread(target=self.track, kwargs={
      'event': event, 
      'properties': properties, 
      'callback': callback
    })
    t.start()
    return t

Usage is incredibly simple:

tracker = EventTracker(TOKEN)
tracker.track_async("My Event", {
   "distinct_id": "some_unique_id",
   "mp_tag_name": "My User Name",
   "my_property": "some value",
   "some_int_value": 0,
})

It's important to note that you should convert datetimes into integer Unix timestamps, as MixPanel handles those very well.
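As a minimal sketch of that conversion, the helper below turns a datetime into integer seconds since the epoch using only the standard library. It assumes the datetime is naive and already in UTC; the function name and the "published" property are made up for illustration.

```python
import calendar
from datetime import datetime

def to_unix_timestamp(dt):
    """Convert a naive datetime, assumed to be in UTC, to integer seconds."""
    # timegm() interprets the time tuple as UTC, unlike time.mktime().
    return calendar.timegm(dt.utctimetuple())

# "published" is a hypothetical property name, used only as an example.
properties = {
    "distinct_id": "some_unique_id",
    "published": to_unix_timestamp(datetime(2012, 5, 17, 0, 0, 0)),
}
print(properties["published"])  # 1337212800
```

If your datetimes are in local time rather than UTC, convert them to UTC first, otherwise the timestamps will be off by your UTC offset.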

See https://mixpanel.com/docs/properties-or-segments/special-or-reserved-properties for special properties that can be used.
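For anyone on Python 3, where urllib2 no longer exists, the encoding step inside track can be sketched roughly as follows. This only builds the request URL and sends nothing. The function name build_track_url and the now parameter are my own, and percent-encoding the base64 payload is an extra precaution that the class above does not take.

```python
import base64
import json
import time
from urllib.parse import quote, unquote

TRACK_BASE_URL = "http://api.mixpanel.com/track/?data=%s"

def build_track_url(event, properties, token, now=None):
    """Build the tracking URL for one event without sending it."""
    # Copy so the caller's dict is not modified.
    props = dict(properties)
    props.setdefault("token", token)
    props.setdefault("time", int(now if now is not None else time.time()))
    if "distinct_id" not in props:
        raise ValueError("Must specify a distinct ID")
    payload = json.dumps({"event": event, "properties": props})
    data = base64.b64encode(payload.encode("utf-8")).decode("ascii")
    # Base64 can contain "+", "/" and "=", so percent-encode it
    # before placing it in the query string.
    return TRACK_BASE_URL % quote(data, safe="")

url = build_track_url("My Event", {"distinct_id": "some_unique_id"},
                      "TOKEN", now=1337212800)
# Round-trip the payload to check what the server would receive.
decoded = json.loads(base64.b64decode(unquote(url.split("data=", 1)[1])))
print(decoded["event"])
```

Keeping URL construction separate from the network call also makes this part easy to test without touching the API.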

The Great search for Analytics services

In the spirit of The Great search for Syslog services that I posted last year, I've decided to talk about another very important area at Newstex that we've been struggling to find a good solution to: Analytics. First, it's important to note that I'm not an analytics person; I don't like building reports or going through logs to find out retention rates, conversion rates, or analyzing A/B testing results. It's not my core business, and it's not something that I want to devote a lot of time to. Every time I have to build a custom report for something because our analytics engine can't answer a question someone at Newstex has, I'm wasting valuable time that could be better spent implementing new features that users care about.

I don't like writing reports. I don't want to be an analytics engineer. I have better things to do.

When we started out building Mobile Applications, we knew we would need to have good reporting behind our application, not only to see what users were doing, but to report to our clients and pay our publishers. Reporting was always important, and we needed to find a solid solution to our problem.

At first we turned to Google Analytics. We use them for newstex.com, and for our internal web-based applications, so we decided to turn to them for mobile as well. The problem with Google Analytics is that it's very tied to Web applications, and it didn't really handle the different types of things that happen on a Mobile device. On a web application, you navigate through pages, and occasionally do things. On a mobile application, almost everything is an action; events are the core of analytical tracking. While Google Analytics for web did support event tracking (although it was very limited at the time), it only supported this for Web applications; the event library wasn't available for Mobile. We were building a native application: we knew the limitations of HTML5 and we wanted to make something without those limitations, an application capable of handling the thousands of stories per day that some of our applications require. We couldn't simply use the JavaScript version of Google Analytics, and without really good event tracking, it wouldn't work for us.

We began searching for replacements. We saw a lot about Flurry, but our biggest concern there was that there was very little actual support, and the product didn't seem to have any paid or premium services. We then found Localytics, which seemed to be very similar to Flurry in many respects, but they also offered premium services and support. Being new to the analytics game and not really wanting to figure everything out all on our own, we decided to give Localytics a try. Although incredibly expensive, they seemed to be the best, and best-supported, option we had.

Unfortunately, Localytics made us sign a rather long-term contract, and decided that they wanted to make their product more complicated, then sell "consulting services" on top. Not exactly how we wanted to go, and it took us quite a while to find something better. The problems with Localytics ran even deeper: answering the simplest of questions seemed to require you to download the export of the event log and build a custom report... exactly what we were trying to avoid by rolling out our own solution!

Then we discovered MixPanel. Immediately when looking at the UI I knew something was different, but I couldn't quite put my finger on what. Digging through the documentation we found the underlying differences in the core system: they expose the API on every aspect. It's not a service that then built APIs on top, it's a true cloud-service from the bottom up. This had the added advantage of not being locked into a specific platform. It didn't matter if it was web, mobile, desktop, or even backend applications, we could dump all of our data into MixPanel and see it all in one easy-to-use UI.

Obviously as a true proponent of cloud services I quickly dove in deeper to see if this really was a solution for us. What I found was that this was only the start. Not only did they allow us to push all these events into a unified location, but we could also do advanced segmentation and reporting, all based on whatever we saved to the system. They have ways to track unique users across events, and even tag users with human-readable names.

I quickly found the very nice import API which allowed me to copy all of the events I had in Localytics out to MixPanel. This was huge; it meant that while we were evaluating MixPanel, we could import our actual live data from our existing system. I quickly wrote up a script to import the last month's worth of data, and then added a script to run nightly to copy over our event data from Localytics into MixPanel. While this did mean we wouldn't get the "live" events from MixPanel, it at least meant we could really start to evaluate the system.

Next I started looking for support. It wasn't hard to find; in fact they found me, and I attended one of their weekly 101 webinars that help you quickly understand the power that you get out of MixPanel. What's more, their support is free, but that's because you probably won't even need it. The system is just so intuitive. After getting a bunch of data into the system, I invited my boss to take a look. He was creating funnels, viewing segmentation, and finding out answers to questions he'd been asking for a long time. He figured this out all on his own, without having to get support, or asking me any questions. That's a win in any scenario.

Let's take a simple example of the Funnels. These funnels are the coolest feature of MixPanel that I've discovered so far. It's a quick way of telling what percentage of your user base does a certain sequence of events. What's more, you can drill down into almost any report, including the funnels, to find out more details. In this simple funnel, I wanted to see how many users went from installing the app to viewing a story.


Not only could we quickly see that only 57% of our user base continues on to view a story, but we could see the actual breakdown by platform (an attribute on every event which we track, which MixPanel refers to as a "Super Property"). In our case, 68% of iPad users actually went on to view a story after installing the app, but only 52% of iPhone users did the same. We could also break this down by Application version, or even filter so that it only shows a specific application version. This was huge; all of these reports previously had to be run by hand.
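To make the idea of a two-step funnel concrete, here is a small sketch of the arithmetic behind it, broken down by platform. This is not how MixPanel implements funnels; it is just an illustration over an in-memory list, and the event names, tuple layout, and sample data are all invented for the example.

```python
from collections import defaultdict

def funnel_conversion(events, first_step, second_step):
    """Return, per platform, the share of users who did first_step
    and later did second_step.

    events: iterable of (timestamp, distinct_id, event_name, platform)
    """
    first_seen = {}    # distinct_id -> platform at the first first_step
    converted = set()  # distinct_ids that reached second_step afterwards
    for ts, uid, name, platform in sorted(events):
        if name == first_step and uid not in first_seen:
            first_seen[uid] = platform
        elif name == second_step and uid in first_seen:
            converted.add(uid)

    totals = defaultdict(int)
    hits = defaultdict(int)
    for uid, platform in first_seen.items():
        totals[platform] += 1
        if uid in converted:
            hits[platform] += 1
    return {p: hits[p] / totals[p] for p in totals}

# Made-up sample data: user "d" viewed a story before installing,
# so that view does not count towards the funnel.
sample = [
    (1, "a", "Install", "iPad"),
    (2, "a", "View Story", "iPad"),
    (3, "b", "Install", "iPhone"),
    (4, "c", "Install", "iPhone"),
    (5, "c", "View Story", "iPhone"),
    (0, "d", "View Story", "iPad"),
    (6, "d", "Install", "iPad"),
]
rates = funnel_conversion(sample, "Install", "View Story")
print(rates)
```

Writing and maintaining even this much by hand for every question is exactly the kind of report work a hosted funnel view saves.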

So what we've found so far is that MixPanel is way more than just an Analytics system; it's a question resolver. It does far more than just track your events; it allows anyone to view them in a smart and intuitive way. What's more, the amazing support is there if you need it, and out of the way when you don't.

The home-run of course was when we showed this to our financial and reporting guy and his exact response was "This is how I would have built it if I had designed it myself".

You know the service is good when you can see there's lots of usage but you don't get any questions about it.

Saturday, April 21, 2012

OpenSearch, how to get Google Chrome to search your site.

Ever wonder how sites like battle.net support things like this in Google Chrome?

Well I did, so I did a little bit of digging. It turns out Google Chrome supports an open standard called OpenSearch. This format is relatively simple, and very easy to add to your own site. I just added it to some of our systems in under 5 minutes.


Adding OpenSearch to your site is incredibly simple: you just have to add a link tag to your index HTML page, and add a small XML file that it points to. The link tag looks like this:

<link rel="search" type="application/opensearchdescription+xml" href="http://my-site.com/opensearch.xml" title="MySite Search" />


The opensearch.xml file looks something like this:


<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/" xmlns:moz="http://www.mozilla.org/2006/browser/search/">
   <ShortName>My Site</ShortName>
   <Description>Search My Site</Description>
   <Url type="text/html" method="get" template="http://my-site.com/search?{searchTerms}"/>
   <Image height="16" width="16" type="image/vnd.microsoft.icon">http://my-site.com/favicon.ico</Image>
   <moz:SearchForm>http://my-site.com/search</moz:SearchForm>
   <Url type="application/opensearchdescription+xml" rel="self" template="http://my-site.com/opensearch.xml"/>
</OpenSearchDescription>




And that's all. The first time someone visits your site, the link tag will be registered, and on later visits, typing your site's URL into the address bar will give them the option to search it (in Chrome). It will nicely integrate with other search services as well.
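To see what your search endpoint will actually receive, it helps to expand the Url template by hand. The sketch below does roughly what a browser does with the {searchTerms} placeholder: it URL-encodes the typed text and substitutes it in. The function name is my own, and real browsers may encode some characters slightly differently.

```python
from urllib.parse import quote_plus

def expand_template(template, search_terms):
    """Substitute URL-encoded search terms into an OpenSearch Url template."""
    return template.replace("{searchTerms}", quote_plus(search_terms))

# Template taken from the opensearch.xml example above.
print(expand_template("http://my-site.com/search?{searchTerms}", "cloud search"))
# http://my-site.com/search?cloud+search
```

If your search page expects a named parameter, put it in the template itself, for example search?q={searchTerms}.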

Wednesday, April 18, 2012

Amazon CloudSearch

So you may remember that recently I posted about how to Index using DynamoDB. At Newstex, we just rolled out a custom solution to use this Indexing scheme to help us search through our SimpleDB domains quicker, and pull up content within a matter of milliseconds. We were scheduled to go live with this change on Friday, April 13th. I even gave a presentation in front of Mitch Garnaat, with Jeff Barr also present in the building (although he did not attend my presentation).

On Thursday, April 12th, Amazon announced CloudSearch.

After a few hours of intense anger at myself for not realizing that this would be coming, and for wasting the last month of my life on a DynamoDB-based solution, I immediately dropped everything else and worked for the next two days to integrate CloudSearch into our system, and also to rewrite my entire presentation for BarCampRochester.

Don't get me wrong, in the end I'm very happy that CloudSearch is available, but what upset me most is that I didn't see this coming. When you think of Amazon, you don't usually think of search, yet they are one of the top retail web-based companies, and how do you find products on Amazon? Through searching.

Yes, the exact same search engine, powered by A9, which runs Amazon.com's massive website and index of products, can now be integrated into any web application you want, for a price.

At Newstex, we currently have one search domain with over one million documents in it. Each document has dozens of indexed metadata fields, and still we retrieve search results in under 100ms. This includes facets, which is search-engine-lingo for those little filters you see which let you narrow down your search result by things like department, brand, and features. Amazon makes all aspects of Search just incredibly simple with this new JSON-capable API.

You can get started with CloudSearch by simply setting up a new domain via the Amazon Console. Adding documents is relatively painless and very well documented, and Mitch is working hard on integrating a solution for boto. Until then, however, you can use cloudsearch.py provided by ex.fm.

There were a few problems with this file, which I've updated in my own fork here: https://gist.github.com/2414564 which provides a few bug fixes and updates. Note that the support here is still very much a work-in-progress, and hopefully we'll eventually get this integrated into boto directly.
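If you just want to see what "adding documents" involves before picking a client library, the sketch below builds a JSON document batch by hand. It assumes the batch layout of the 2011-02-01 CloudSearch API as I understand it (each operation carries type, id, version, and, for adds, lang and fields), so check the official documentation before relying on it. The helper names, document ID, and field names are invented for the example, and nothing is sent to any endpoint.

```python
import json

def make_add_operation(doc_id, fields, version, lang="en"):
    """Build one 'add' operation for a document batch (assumed layout)."""
    return {
        "type": "add",
        "id": doc_id,
        "version": version,
        "lang": lang,
        "fields": fields,
    }

def make_delete_operation(doc_id, version):
    """Build one 'delete' operation for a document batch (assumed layout)."""
    return {"type": "delete", "id": doc_id, "version": version}

# Hypothetical document; the field names are examples only.
batch = [
    make_add_operation(
        "story_1",
        {"title": "Amazon announces CloudSearch", "publisher": "example"},
        version=1,
    ),
    make_delete_operation("story_0", version=2),
]
body = json.dumps(batch)
print(len(batch), "operations,", len(body), "bytes")
```

The resulting body is what a client would then POST to the domain's document endpoint.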


As for Newstex, we're now just over a million documents indexed, but we're already on an Extra Large instance (search.m2.xlarge). We believe at this point that this is due to the large number of fields we're indexing that are not literals. If you're encountering similar issues, please think about which result fields you really need, and consider turning anything you can into a literal field (everything except a full-text index should be a literal or uint). We've also removed quite a bit of our usage of result fields (although we probably still have too many facets).

Thursday, March 22, 2012

Indexing with DynamoDB

One of Amazon's coolest services recently announced was Amazon DynamoDB. With this new service, you can utilize the massive power of Amazon's Cluster of Solid State Disks, and computational power to store and search your data. What's interesting about DynamoDB is that it doesn't index any of the fields you provide (outside of the ID), so if you want to be able to retrieve your data, most likely it's going to have to be via ID and lookups, not by scanning or querying. If you want to really utilize DynamoDB, you have to re-think how you store your data.

The Concept

Whenever Amazon releases a new product this interesting, I always try to figure out how it would work into my current workflow. For DynamoDB, I realized this could very easily solve my indexing problems. By simply taking a few notes from other search systems, I created an algorithm which indexes the most common versions of any given string you provide to it. For example, say you want to index the following string:

Learning by Doing:

This first gets split into its word tuples: each run of consecutive words that appears in the string. This starts with the full string:


  • LEARNING BY DOING


Then we go to two word tuples:


  • LEARNING BY
  • BY DOING


Then we go into single word tuples:


  • LEARNING
  • BY
  • DOING


Then we also index each "stemmed" version of each tuple string:

  • LEARN BY DO
  • LEARN BY
  • BY DO
  • LEARN
  • BY
  • DO


This leaves us with a list, from which we remove any duplicates before inserting the corresponding records into DynamoDB. The "id" field is whatever we pass into the add function when indexing. Let's see how this works in the new botoweb.db.index.Index class.
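Before turning to the real class, the term-generation steps described above can be sketched on their own. This is not the botoweb implementation, only an illustration of the idea: the function names are mine, and toy_stem is a deliberately crude stand-in for a real stemmer (such as a Porter stemmer) that merely strips a trailing "ING".

```python
import re

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a trailing 'ING'."""
    if word.endswith("ING") and len(word) > 4:
        return word[:-3]
    return word

def index_terms(text, stem=toy_stem):
    """Return every run of consecutive words, plus stemmed versions,
    longest first, with duplicates removed."""
    words = re.findall(r"[A-Z0-9]+", text.upper())
    terms = []
    # Walk from the full string down to single words.
    for size in range(len(words), 0, -1):
        for start in range(len(words) - size + 1):
            terms.append(" ".join(words[start:start + size]))
    stemmed = [" ".join(stem(w) for w in term.split(" ")) for term in terms]
    # Remove duplicates while keeping the original order.
    unique = []
    seen = set()
    for term in terms + stemmed:
        if term not in seen:
            seen.add(term)
            unique.append(term)
    return unique

for term in index_terms("Learning by Doing:"):
    print(term)
```

For the example string this yields eleven terms rather than twelve, because "BY" appears in both the plain and stemmed lists and is only kept once.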

Indexing with Botoweb

I decided that this isn't just something that might be useful for Newstex, but instead could be generally useful. So I made a generic Index class which can be used to generate full and complex Indexes. To get started, simply create a new Index object:


>>> from botoweb.db.index import Index
>>> index = Index("test-search")

Then let's add something to the index:

>>> index.add("Learning by Doing", "learnbydoing.com")


Note that if this is the first time you've created this index, it may take a minute or two for the first item to be added. This is because the Index class is actually creating your DynamoDB table before adding the item to Dynamo. After the record is added, you'll be able to search for it by almost any of the terms you would think of:


>>> for item in index.search("learn"):
...     print item['id']
... 
learnbydoing.com

What's even more fun is when you start indexing longer "collapsed" words:


>>> index.add("coredumped.org", "cdump")
>>> for item in index.search("core dumped"):
...     print item['id']
... 
cdump

What you don't see is that behind the scenes here multiple different searches take effect (with fallbacks so your primary search is always given precedence). In this search, since there is no match for "core dumped" as two separate words, it also checks for "coredumped" as a single collapsed version. Additionally, the indexer takes the "." as a word separator, so it's not required to match the search result. However, what will not match is just using the word "core".
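The fallback behaviour can be pictured with a small in-memory sketch. This is a guess at the general shape, not the botoweb code: a plain dict stands in for the DynamoDB table, and the function names are invented for the example.

```python
import re

def candidate_queries(query):
    """Return the lookups to try, in order of precedence."""
    words = re.findall(r"[A-Z0-9]+", query.upper())
    candidates = [" ".join(words)]
    if len(words) > 1:
        # Fallback: the same words collapsed into a single term.
        candidates.append("".join(words))
    return candidates

def search(index, query):
    """Return the ids for the first candidate that has any match."""
    for term in candidate_queries(query):
        ids = index.get(term)
        if ids:
            return ids
    return []

# A dict standing in for the table: term -> list of ids.
# These are the terms that indexing "coredumped.org" would produce
# if "." is treated as a word separator.
fake_index = {
    "COREDUMPED ORG": ["cdump"],
    "COREDUMPED": ["cdump"],
    "ORG": ["cdump"],
}

print(search(fake_index, "core dumped"))  # ['cdump']
print(search(fake_index, "core"))         # []
```

Because lookups are exact matches on whole terms, "core" alone finds nothing, which is the partial-word limitation described above.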

I'm still trying to find a good way to match partial words typed in (short of indexing each letter), so if anyone has any ideas there please let me know!

Thursday, March 15, 2012

Why Kindle will always live on

There's a lot that Amazon has done wrong recently with Kindle, most notably releasing the Kindle Fire. Although the Fire has been a much larger seller than most of us would have anticipated, it's constantly receiving negative reviews. There's a lot wrong with it, but the number one thing that's wrong is that it isn't a Kindle.

I recently purchased a new Kindle Touch, just to see what all the hype is really about.

Simple No-Frills Interface

The simpler, the more specific, the better. Imagine giving an iPad to your non-technical Grandmother who's just used to reading books, you know, those things on paper. Paper, you remember, the stuff made out of dead trees? Well guess what, it's not nearly as intuitive as a book. A book you just open and start reading. With an iPad, you have to click on all these things: first of all, hopefully you already pre-installed iBooks for her, and then you have to guide her through how to browse books and purchase them... and you can only purchase them on that device and read them on that device....

Then there's Kindle. Kindle not only looks and acts like a book when reading it, you also can purchase books on your computer, iPad, iPhone, Kindle, whatever, and just read them. Kindle isn't designed to be fancy; it's plain and simple. When you order one from Amazon, it's already activated on the account that you purchased it from. It starts up and guides you through everything. From the moment you turn it on, its goal is to help you use it, not to confuse you with all the different features you can use on it. And guess what, if you forget to turn it off? Oh well, its battery life is measured in months, not hours.

E-Ink

The E-Ink display behaves exactly like you would expect normal paper to. It's not backlit, there's no strain on your eyes, it's like reading a regular textbook. The only negatives so far are that the current version doesn't do color (although that is rumored to be released in the next refresh of Kindle in the middle of 2012), and it has a horrible refresh rate. While backlit displays measure refresh in single-digit milliseconds, the E-Ink display measures it in the hundreds of milliseconds.

Even the Color E-Ink display that is coming out doesn't have very good stats. It only displays just over 4,000 different colors (compared to the millions/billions that a traditional display can do). So what makes the E-Ink displays so much better?

Battery life. These things last for months longer because they use just a tiny fraction of the power a traditional backlit display does. There's nothing else on the market that even comes close.

You can read it in direct sunlight. That's right, on the beach where you'd normally need a traditional book, you can use an e-ink display. What's even weirder, you actually cannot view them without a light source. They don't provide their own light, so you need something like this if you want to read your Kindle in the dark.

No backlit display means less eye strain. How many times do you get headaches because of constantly staring at an LCD? Right now I'm typing this up on a traditional display, and already my eyes are hurting. No amount of styling can fix it; it's the fundamental flaw of a backlit display.

Come out from the dungeon! Perhaps the most important thing these e-ink displays will do is bring us out of our dungeons. Right now I'm sitting in a room that's mostly dark, because that's the only way I can see my screens. E-ink displays actually encourage you to get outside and see actual sunlight. Imagine the difference between "Work at home" and "Work outside wherever the hell you want".

It will get better. Remember when regular CRT monitors first came out? How many colors did they have? How bad was the refresh rate? For my money, within 10 years, I expect that we'll have e-ink displays that can compete with the best LCDs out there. Yes, there will still be some that hold out on the backlit displays, and I suspect the backlit displays will always be around in some manner, but E-Ink will start to creep into our lives more and more over the next decade, and it'll be a change for the better.


Free 3G

No, it doesn't work with the Experimental browser or any third-party stuff, but for syncing and most book-related things, the free 3G is an excellent selling point. You don't have to be on WiFi to buy a new book or sync your old books (remember there is a limited amount of space on the Kindle, but everything is archived to your "book cloud"). And there's no contract, no metering, no billing. You don't ever even have to talk to AT&T (unless you want to use it overseas). Yes, it's slow, but it's not designed for streaming videos; it's designed to synchronize your text documents, bookmarks, and notes.