My internship at Sun Microsystems Labs, which has been going on for about 15 months - 9 of those full time at their campus in the Boston area - is coming to an end. During the course of those months, I’ve met a lot of very smart and fun people, I’ve worked on very challenging and stimulating problems and I’ve discovered a bunch of really good New England beers.
All my work has been centered around the Aura datastore, an open-source, scalable and distributed recommendation platform. The datastore is designed to handle millions of users and items and can generate content-based recommendations based on each item’s aura (aka tag cloud).
Last summer, under the supervision of Paul Lamere, I worked a lot more on our music recommendation web application, called the Music Explaura, and designed a steerable recommendation interface. (We also have a Facebook companion app to the Explaura that was created by Jeff Alexander.)
This summer, I worked with Steve Green on many different things, including what I’d like to talk about in this post, pyaura, a Python interface to the datastore.
The idea behind pyaura is to get the best of both worlds. The datastore is very good at what it does - storing millions of items and computing similarity between all of them very quickly - but the Java framework surrounding it is a bit too rigid for quickly hacking research code on top of it. My actual goal was to experiment with ways of doing automatic cleanup and clustering of social tags, and I felt I was missing the flexibility I was used to getting from Python’s interactive environment.
Without going into details, since the datastore is distributed and has many different components, it uses a technology called Jini to automatically hook them all up together. Jini takes care of automatic service discovery so you don’t have to manually specify IP addresses and so on. It also allows a component to publicly export functions that remote components can call. A concrete example would be the datastore head component allowing the web server component to call its getSimilarity() function on two items. The computation happens in the datastore head, and the results then get shipped across the wire to the web server so it can serve its request. However, Jini only supports Java, leaving us no direct way to connect to the datastore using Python.
After looking around for a bit, I stumbled upon a project called JPype, which essentially allows you to launch a JVM inside Python. This lets you instantiate and use Java objects in a completely transparent way from within Python. Using JPype, I built two modules which, together, allow very simple access to the datastore through Python.
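To give a feel for what JPype makes possible, here is a minimal sketch that is not taken from pyaura itself. It assumes the third-party JPype1 package and a local Java installation, and it only touches java.util.ArrayList from the standard JDK. The helper name java_list_size is made up for this illustration.

```python
def java_list_size(items):
    """Copy `items` into a java.util.ArrayList and return the size Java reports.

    Hypothetical helper for illustration only; it is not part of pyaura.
    """
    # Assumption: the JPype1 package is installed. It is imported lazily so
    # that merely defining this function does not require it.
    import jpype

    if not jpype.isJVMStarted():
        # Launch a JVM inside the current Python process, using the
        # default JVM that JPype finds on this machine.
        jpype.startJVM()

    # Look up a Java class by its fully qualified name and use it
    # as if it were an ordinary Python class.
    ArrayList = jpype.JClass("java.util.ArrayList")
    java_list = ArrayList()
    for item in items:
        java_list.add(item)  # Python strings are converted to Java strings
    return int(java_list.size())
```

Calling java_list_size(["rock", "indie", "jazz"]) builds the list on the Java side and hands back a plain Python integer. The same pattern of starting the JVM once and then looking up classes by name is what lets Java objects be used from an interactive Python session.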
To demonstrate how easy things become when using pyaura, imagine you are running an Aura datastore and have collected a lot of artist and tag information from the web. You might be interested in quickly seeing how many artists have been tagged with each individual tag you know about. With these few lines of code, you can get a nice histogram that answers just that question:
import pyaura.bridge as B
import pylab as P
aB = B.AuraBridge()
counts = [len(tag.getTaggedArtist()) for tag
The above code produces the following plot:
This is the result we expect, as this was generated with a datastore containing 100,000 artists. As progressively less popular artists are added to the datastore, the effects of sparsity in social data kick in. Less popular artists are indeed tagged with fewer tags than popular artists, leading to the situation where very few tags were applied to more than 5000 artists.
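The bucketing behind a histogram like this can be sketched without a live datastore. The snippet below is only an illustration: the counts are made up and are not real Aura data, the bin edges are arbitrary, and bucket_counts is a hypothetical helper rather than anything provided by pyaura or pylab.

```python
def bucket_counts(counts, edges):
    """Count how many values fall into each half-open bin [edges[i], edges[i+1])."""
    bins = list(zip(edges, edges[1:]))
    hist = [0] * len(bins)
    for value in counts:
        for i, (low, high) in enumerate(bins):
            if low <= value < high:
                hist[i] += 1
                break  # each value belongs to at most one bin
    return hist


# Made-up "number of artists per tag" values, for illustration only.
counts = [1, 1, 2, 3, 3, 7, 12, 40, 150, 900, 6000]
# Arbitrary bin edges that grow by a factor of ten.
edges = [1, 10, 100, 1000, 10000]

hist = bucket_counts(counts, edges)
for (low, high), n in zip(zip(edges, edges[1:]), hist):
    # Text-mode bar: one '#' per tag that falls in this bin.
    print(f"{low:>5} to {high - 1:>5} artists: {'#' * n} ({n})")
```

With these invented numbers, most tags land in the smallest bin and only one reaches the largest, which is the same long-tail shape described above.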
This is a small example, but it shows the simplicity of using pyaura. With very few lines of code, you can do pretty much anything with the data stored in Aura. This will hopefully make the Aura datastore more accessible and attractive to projects that want to take advantage of its scalability and raw power while keeping the flexibility to quickly hack on top of it.