# Building an indoor NFT hydroponics system with Raspberry Pi monitoring

I love plants. I’m not quite sure why, but over recent years it’s been a growing love affair; pun intended.

I’m lucky to have room on my rooftop to grow lots of veggies in the summer. I’ve been building up capacity and I’m now up to 7 irrigated containers. I even managed to grow corn that was 9+ feet tall during my first summer! I’m now engaged in a ruthless battle with our beloved squirrels over domination of the garden’s bounty.

However, each year as our Montréal winter drew closer, I watched, powerless, as my once strong crops wrinkled away. This got me interested in finding out what my options were for growing veggies indoors, and that led me to hydroponics.

There is a sustainable development aspect to all of this which I find quite interesting. Growing most of the food we need locally seems like something we will have to do as a society pretty soon. I’ve been a big fan of Lufa Farms, who have been pioneers in that area, ever since they started. There are similar initiatives in most major cities. This technology is even being put in containers that are then shipped to the Arctic to grow fresh food in one of the harshest climates on the planet!

This post isn’t an exact step-by-step guide. There is lots to know and learn, and great resources already exist online. Here, I’m going to take you through my journey and point you to relevant resources along the way.


### Hydroponics basics

Let’s start with the basics: what’s hydroponics? Very simply, it’s a technique for growing plants without soil, using water enriched with a nutrient solution. The technique has been around for a very long time. There’s no need to build a hydro garden to grow veggies inside; I could have simply decided to grow veggies in pots filled with soil. But isn’t doing it the way NASA grows food in space way cooler? There are also advantages in terms of efficiency, like an increased growth rate and lower water usage.

There are different types of hydroponics systems, but the basic idea is always the same. The type of system I built is a nutrient film technique (NFT) system. The diagram below shows the major elements involved.

The water reservoir is filled with water, enriched with nutrients and pH corrected. The water is then pumped into the system, where plants sit in net cups. The water flows through the system, bringing nutrients to the roots of the plants, until it reaches the end and gets recycled back into the reservoir. An air stone is placed in the reservoir to pump air into the water and make sure it stays oxygenated (not shown on the diagram, but required).

### The plan

I first considered buying a pre-built small system like the ones from Aerogarden, but ended up not going for those because I was afraid the yields would be low and I would be locked into their proprietary seed system.

Then I took steps towards building a deep water culture system like the one described in this video. What made me reconsider that choice was thinking through the logistics of actually running the system, which is a crucial point to consider depending on where you’d place your garden. Since the water needs to be changed every 1 to 2 weeks, the reservoir has to be either close to a drain or transported to one, emptied and filled up again. Depending on the size of the system, the reservoir can be quite heavy when filled with water. Living in a condo with mostly wooden floors, I considered the required water operations too complicated and risky. I imagined having the container on a wheeled shelf and rolling it to my bathroom to change the water, but even that seemed overly complicated.

I then found a guide to building an NFT system with PVC pipes that seemed better adapted for my situation. Having the water reservoir separate from the system where the plants grow would make changing the water much easier as I would be able to simply carry it, leaving the plants where they are.

Inspiration for my system. Image taken from this post.

### Acquiring the parts

I got all the required parts from my local hardware store and online stores. I also dropped by a specialized hydroponics store and got some things from them, but I could have gotten them online.

Here are most of the parts that I ended up using:

• System
• Water
• Hydroponics grow medium and nutrients
• Seed starting
• Other

### Building the system

The first step is to cut the PVC pipe into sections with a saw and then drill holes at the right spots. You can use the net cups as guides and draw around them with a marker so you know where to drill.

I had a small piece of unused pipe that I used to make sure the size of the hole saw matched the net cups correctly. The size was good, but since I was using a round pipe, the sides of the net cups were exposed. I used aluminum foil to block off the light. I’ve seen some people use square fence posts instead of the type of pipe I used; more on that later.

Then I put a drain in the PVC cap by drilling a hole using a smaller hole saw than for the net cups:

The next step was to drill all the net cup holes. Drilling with the hole saw makes a pretty huge mess. You’ll get little pieces of plastic everywhere. I highly recommend using a file to finish the job cleanly and then properly cleaning the inside of the pipes so that there isn’t any plastic piece left inside. That will prevent any piece from making its way into your water pump later on.

At this point I could assemble the whole thing by dry fitting the different pieces to make sure it was all good:

Then came the time to glue the different pieces together so that no water would leak out. I watched some more YouTube videos and got going with the glue.

I moved the system into my bathroom and put it in the bathtub so I could test it with some water flowing through it. It seemed to be doing OK, which was somewhat of a surprise. The next step was a live-fire exercise: running the system with water pumped from the reservoir and having it drain back out.

Unfortunately, two problems became very apparent. First, I had leaks in my PVC joints. Second, the pump was too powerful for the capacity of the drain that I had put. This meant the pump was emptying the entire reservoir in the system and would eventually cause the water to overflow as well as burn out the pump once the reservoir emptied out. Even setting the pump to its lowest capacity was too much.

PVC cap with self-fusing silicone tape

For the leaks, it was a bit tricky. My normal job is in computer science, where mistakes can often be fixed with a simple ctrl+z. Things aren’t as simple when gluing plastic pipes together. One option was to saw off the joints and redo them, which would have reduced the size of the system and wasn’t ideal at all. The second option, which I ended up going with, was to apply self-fusing silicone tape around the joints. By putting a generous quantity over the problematic joints, it took care of all the leaks!

For the overflow problem, the ideal solution would have been to put in a bigger drain, but again, since the PVC cap was glued on, I would have had to saw it off, glue on a new one and install a bigger drain, which I didn’t have… So the fix was to put a second drain right next to the first one, as I did have that part.

At this point the system was good to go. But I was a bit scared that something unforeseen could go wrong and water would leak out. Since I’m on the top floor of a condo building, you can understand that it is not an acceptable scenario. I needed a contingency plan.

I thought about how I could put some kind of containment system in place. Inspired by small kids’ swimming pools, I decided to build a custom basin into which I could place the whole hydro system. If something were to break and all the water leaked out, it would be completely contained within the basin. The reservoir only holds 14L of water, which is not a huge quantity.

I went back to the hardware store to get some wood, metal shelf supports, tie wraps and a tarp.

Et voilà. Water shields up!

The final step was to hang the light fixture. Because I was placing the garden at the top of a staircase where I didn’t care too much about the look of it, I simply screwed some hooks in the beams in the ceiling and hung the lamp using adjustable ropes with ratchets.

### Preparing the water

A little preparation needs to go into the water. I’ve used tap water for the garden, even though recommendations online suggest using distilled water, or water that went through reverse osmosis. The reason for those recommendations is that tap water contains chemicals and minerals (e.g. chlorine) that you might not want in your garden. However, as I understand it, the important thing to do when using tap water is to let it stand, exposed to the sun’s UV rays, for about 24 hours. This allows the chlorine put in the water by the municipality to break down. For that purpose, I bought 2 identical buckets so that I could have the system running with the first bucket as the reservoir while the second one is filled with the replacement water. The buckets take turns being the reservoir.

Then you need to add nutrients. Some people optimize the type of nutrient based on what growth stage their plants are in. So one type when they’re simply growing and creating roots, and another when they’re producing fruit. I’m not that sophisticated yet and I’m using a simple 2-3-2 solution from my local hydro store. I also started putting in Calimagic solution as well after about 6 weeks in an attempt to help out my tomato plants that didn’t look too happy. Simply follow the instructions on the bottle in terms of quantity.

The final step is to control the pH, with a pH up/down solution and your pH meter. I aim for a pH of about 6, based on the many guides online. The pH up/down solution is pretty concentrated, so I use a pipette for greater control.

I haven’t paid too much attention to the PPM readings up to now so I can’t report much on that front, except that I’m getting OK-ish results without paying attention to it :)

### The first residents move in

Because the plants will be growing without soil, we need to start them from seeds in a growing medium. The most popular one is rock wool cubes (the yellow cubes below). The cubes are made from rock and sand and are similar to the mineral insulation used in houses. Because of their popularity, I started with them. However, as I read more on the subject, I found arguments that they’re not very eco-friendly, so I decided to try alternatives. I’m now using rapid rooter starter plugs (the brown cylinders), which are made from compost.

Whichever medium you use, the idea is the same: hold water, air and our seeds. You need to start them in a little greenhouse. This video explains how it’s done. I’ve found that using a heating mat helped a lot, so I recommend getting one.

My very first batch was 2 tomatoes, 2 peppers and 3 lettuce. Once our little veggies are big enough, we can put the rock wool cube in a net cup and fill it with clay pebbles so they are held in place. You also want to prevent as much light as possible from getting into the system, which the pebbles also help to do. If light gets in, you increase the chances of having algae grow in the system.

With all our baby plants in net cups placed in the system, and aluminum foil covering any remaining openings, the system was complete. I drilled holes in the reservoir’s cover to let the different pipes through and we were good to go.

This is a video of the system in action:

### Monitoring environmental conditions with a Raspberry Pi

My day job is doing machine learning and data science, so it was natural to think about how I could collect as much data as possible on the garden to track and automate things. One of the longer-term goals is to try correlating environmental conditions, and whatever actions I’m taking, with plant growth and yield. In terms of actions, everyone suggests keeping a journal, which I’m doing. But for metrics on the water, air and light, having something automated is much better.

To monitor the environment, I put together a small monitoring rig with a Raspberry Pi. I won’t go into too many details, but even though this was my first experience with a Pi, it wasn’t too complicated. I ordered a Pi, a breadboard and a few sensors. The first ones I hooked up were the following:

• Ambient light sensor: KY-018
• Ambient temperature sensor: DHT11
• Water temperature sensor: DS18B20
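To give an idea of what reading these sensors involves, here’s a sketch for the DS18B20 water temperature probe, which Linux exposes through the 1-Wire sysfs interface (the device ID below is made up):

```python
from pathlib import Path

def parse_w1_slave(raw):
    """Extract the temperature in Celsius from a DS18B20 w1_slave dump.

    The kernel exposes two lines; the second ends with 't=' followed by
    the temperature in millidegrees Celsius.
    """
    lines = raw.strip().splitlines()
    if not lines[0].endswith("YES"):  # first line carries the CRC check
        raise IOError("bad CRC from sensor")
    millidegrees = int(lines[1].rsplit("t=", 1)[1])
    return millidegrees / 1000.0

def read_water_temp(device_id="28-000005e2fdc3"):  # hypothetical device ID
    """Read the probe through the 1-Wire sysfs interface on the Pi."""
    raw = Path(f"/sys/bus/w1/devices/{device_id}/w1_slave").read_text()
    return parse_w1_slave(raw)

# Example of what the kernel file looks like:
sample = ("73 01 4b 46 7f ff 0d 10 41 : crc=41 YES\n"
          "73 01 4b 46 7f ff 0d 10 41 t=23187\n")
print(parse_w1_slave(sample))  # → 23.187
```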

A small Python server runs on the Pi and exposes an API with the readings from the sensors. The script is available on Github. The server also pulls weather information from the Open Weather Map API so I can track what’s going on outside. I have a Mac Mini running Prometheus, a time series database that collects and saves the readings, and I set up Grafana to get a nice dashboard.
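The real script is on Github; as a rough stdlib-only illustration of the idea, a tiny server exposing readings in Prometheus’s text exposition format (metric names and values here are placeholders) could look like:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def format_metrics(readings):
    """Render sensor readings in Prometheus's text exposition format."""
    return "".join(f"{name} {value}\n" for name, value in readings.items())

def get_readings():
    # Placeholder values; the real server queries the attached sensors.
    return {"ambient_temp_celsius": 21.5,
            "water_temp_celsius": 19.2,
            "ambient_light": 412}

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = format_metrics(get_readings()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Prometheus scrapes this endpoint on a fixed interval.
    HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```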

This is the result:

Some fun patterns:

• The oscillation in the room temperature (yellow line on the top graph) is the electric baseboard heating the room going on and off. I’ve got some wall insulation problems in that area of my condo, and the sensor sits on the floor, so the temperature reading at floor level is strongly correlated with the outside temperature.
• There is a daily cycle in the reservoir water temperature (green line on the top graph). By looking at both that graph and the bottom one, which shows when the lamp is on, you can see that the water temperature follows the cycle of the lamp. That was unexpected but makes total sense: the lamp heats up the water going through the PVC pipe, which in turn heats up the reservoir when it gets recycled. Because the outside temperature was dropping quickly during the period this graph covers, the pattern is harder to see, except on the 25th (left-hand side of the top graph).
• On the bottom graph, looking at the green line showing light intensity, you can see the lamp turning on before sunrise and staying on after the sun has set. During the day, you see a nice arc where the ambient light increases and then decreases, following the sun.

I also got a TP-Link HS110 smart plug with energy monitoring in order to track how much power I’m using, and to be able to programmatically turn the power off and on. I didn’t want to get into setting up a Pi-controlled relay, and there is an open source 3rd party library that can be used to control the HS110, so I went with that. As you can see from the screenshot, the light used a daily average of 1.39 kWh, and the electricity cost in Québec is $0.0877/kWh, meaning the cost of running the lamp was $3.66 for the last month. An acceptable price for hopefully yummy tomatoes!
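The monthly cost figure is just arithmetic on the plug’s readings. Here’s a sketch; the plug control uses pyHS100, which I believe is the kind of third-party library in question, though treat that API usage and the IP address as assumptions:

```python
def monthly_cost(kwh_per_day, rate_per_kwh, days=30):
    """Electricity cost over a month for a fixed average daily consumption."""
    return round(kwh_per_day * rate_per_kwh * days, 2)

# Numbers from the post: 1.39 kWh/day at Québec's $0.0877/kWh rate.
print(monthly_cost(1.39, 0.0877))  # → 3.66

def set_lamp(on, plug_ip="192.168.1.50"):  # hypothetical plug address
    """Toggle the HS110 smart plug (assumes the pyHS100 library)."""
    from pyHS100 import SmartPlug
    plug = SmartPlug(plug_ip)
    plug.turn_on() if on else plug.turn_off()
```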

I configured the HS110 to turn on the light from 6:00AM to 11:30PM. Plants need to sleep too.

### Growth history

Below is a slideshow showing the progression of the growth.

Up to now, I’ve eaten two lettuces, and 3 more are about ready for consumption.

The tomato and pepper plants took longer to establish themselves, and for a while their leaves were in what seemed like pretty bad shape. I realized I had miscalibrated my pH meter and was about 30% off from the desired acidity level. After fixing that, the leaves improved and flowers appeared on all 4 plants. There are now many baby tomatoes growing.

Since the garden is indoors, bees can’t help when it comes to pollination. Tomato plants can simply be shaken, since their flowers have both the male and female parts. Another way to do this is to have a fan blowing on the garden to constantly move the plants, simulating the wind. Pepper plants, however, need to be pollinated by hand. After doing some research, I tried doing it with a Q-tip. I think I’ve just succeeded, as I’m seeing tiny green balls in some pepper flowers.

### Lessons learned

One big lesson is to make sure your pH meter is properly calibrated. I hadn’t done it properly and was about 30% off from where the water’s acidity needed to be. That caused the leaves of the fruit plants to wrinkle in on themselves and the plants to not produce any flowers.

I’ve gotten away with changing the water about every 10-12 days. As the tomato and pepper plants have gotten bigger, they’ve started drinking much more water, which causes the reservoir to empty out. With less water in, the acidity goes up because the ratio of water to nutrient changes. So I need to do a quick check up every two days and fix the acidity with the pH+/- solutions. With bigger tomato plants, I also now have to top up the water once between every change. One big reason to change the water completely is to prevent algae from forming in the system. I’m being as careful as I can to prevent light from hitting the water (algae needs light to grow), so I haven’t had any problems up to now.

Another realization is that the system might be too small for tomato plants. As they are growing bigger, their root system also grows. After my last water change, the system started to leak because roots got into the drain and partially blocked it, causing the water level to rise too much in the system and leak through the net cup holes. This was the first real test for the emergency basin and everything worked great. I’ll have to monitor that closely and potentially remove the tomato plants, which would mean I can’t grow fruit plants in this 4′ PVC system.

Root system from a tomato plant too close to the drain

Finally, the next garden I build will probably not use round pipes, but square fence posts instead, like in this project. The big reason is that you can’t easily install drains on a curved surface, and the net cups are harder to put in place. That’s why I had to put the drains in the end caps of the pipes, and why I have to use aluminum foil around the cups. When changing the water, it’s hard to remove all the water from the system because the drain sits about 1 cm from the bottom, so I have to tilt the whole thing. Having a drain at the bottom would make it much easier.

### Next steps

I’ve bought an ultrasonic distance sensor for the Pi and have been planning to install it on the inside of the water reservoir’s cover. This would allow monitoring of the water level in the reservoir, letting me programmatically stop the pump if required. This could happen if the plants drink all the water or if there is a leak, and it would even be possible to differentiate between the two based on the rate of change. Taking an automated action in case of a leak would be a great second contingency system. Adding extra sensors to track how the water is doing, like a pH sensor, would also be great, so I don’t have to do it myself every other day. Professional sensor systems exist but they’re pretty expensive.
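A sketch of how an ultrasonic sensor could report the water level, assuming an HC-SR04-style sensor (pin numbers are placeholders): the sensor times an echo’s round trip, and sound travels at roughly 343 m/s in air.

```python
import time

SPEED_OF_SOUND_CM_S = 34300  # in air at ~20 °C

def echo_to_cm(round_trip_s):
    """Convert an echo round-trip time to a one-way distance in cm."""
    return round_trip_s * SPEED_OF_SOUND_CM_S / 2

def water_level_cm(reservoir_depth_cm, round_trip_s):
    """Water level = reservoir depth minus sensor-to-surface distance."""
    return reservoir_depth_cm - echo_to_cm(round_trip_s)

def measure_round_trip_s(trig=23, echo=24):
    """Trigger the sensor and time the echo pulse (runs on the Pi only)."""
    import RPi.GPIO as GPIO
    GPIO.setmode(GPIO.BCM)
    GPIO.setup(trig, GPIO.OUT)
    GPIO.setup(echo, GPIO.IN)
    GPIO.output(trig, True); time.sleep(10e-6); GPIO.output(trig, False)
    while GPIO.input(echo) == 0:
        start = time.time()
    while GPIO.input(echo) == 1:
        end = time.time()
    return end - start

# A 1 ms round trip means the water surface is ~17 cm from the sensor.
print(echo_to_cm(0.001))
```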

I also have a webcam ready and I’m planning to start taking pictures every few minutes to create a time-lapse video showing how things are changing over time. I’ve done that for seeds and it’s pretty fun to watch. The second thing I’d like to do with the images is to automatically track the growth of plants by doing some image analysis.

Having both extra sensors and a video feed that I can analyze automatically is really what will unlock optimizing the garden’s conditions to maximize yield.

### Closing remarks

I took a pretty complicated route to get to where I am. If you’re interested in trying out a hydro garden, a $100-200 hydroponic kit will certainly do the job. You could even go the soil route by only getting a lamp. Happy hydro gardening!

# Actually, Marty didn’t go Back To The Future: Graphing the train sequence of BTTF3

In a hurry? Go straight to the graphs. The dataset and notebook detailing how this was done are available in the companion repository.

Two weeks ago was Back To The Future Day. October 21st, 2015 is the day Marty and Doc Brown travel to at the beginning of the second movie. The future is now the past. There were worldwide celebrations and jokes, from the Queensland police deploying a hoverboard unit, to Universal Pictures releasing a Jaws 19 trailer, and even Health Canada issuing an official recall notice for the DeLorean DMC-12 because of a flux capacitor defect that could prevent the car from traveling through time.

I love the trilogy and, as many people probably did that week, I rewatched the movies. I also wondered if there was any fun BTTF data science project I could do. While watching the climactic sequence at the end of the third movie, I realized that as the steam locomotive pushes the DeLorean down the tracks, we get many data points as to the speed of the DeLorean. Marty is essentially reciting a dataset, all the way from 1885. That made me ask the 1.21-gigawatt question: do they really make it to 88 miles per hour before they run out of tracks?

#### Doc’s Plan

For those not familiar with the movies, Marty and Doc are trapped in the old West without any gas to power the gasoline engine of the DeLorean, their time machine. That means they can’t drive it to 88 miles per hour, the speed required to activate the flux capacitor, and travel back to 1985. The plan they come up with is to commandeer a steam locomotive and use it to push the DeLorean to the required speed.
Doc spells out the plan to make it back to the future:

> Tomorrow night, Sunday, we’ll load the DeLorean on to the tracks here on the spur right by the old abandoned silver mine. The switch track is where the spur runs off the main line 3 miles into Clayton… Shonash Ravine. The train leaves the station at 8:00 Monday morning. We’ll stop it here, uncouple the cars from the tender, throw the switch track, and hijack – borrow the locomotive and use it to push the time machine. According to my calculations we’ll hit 88 miles per hour just before we hit the edge of the ravine, at which point we’ll instantaneously arrive in 1985 and coast safely across the completed bridge.

If you think about it, it’s a shame Doc didn’t equip the DeLorean with Tesla electric motors when he visited 2015. That would have made things easier, considering the DeLorean was equipped with a working Mr. Fusion generator in 1885.

#### The dataset

To assemble the dataset, I simply watched the train sequence and took down each time Marty said the speed, or we saw it on the speedometer, along with the time in the movie. The tiny dataset is available in the companion repository. Also, Doc tells us twice in the movie that they have 3 miles of tracks before they hit the ravine.

Finally, as Jack Bauer would say, we assume events occur in real time. This is a critical assumption because I’ll base the distance calculations on it. So if they were to go at a steady 25 miles per hour for one hour in the movie, that would mean they traveled 25 miles during that period.

#### Graphing the sequence of events

For simplicity, I assumed a linear progression between each actual data point, meaning we’re assuming a uniform acceleration between data points. The following graph shows the sequence of events as they occur in the movie. The x-axis represents the number of minutes since the beginning of the train sequence and the y-axis is their speed.
The landmarks along the tracks have been labeled in red at the bottom. Finally, the period during which each of Doc’s 3 presto logs burnt has been marked as a horizontal line. We can see that the whole sequence lasts about 7 minutes and they successfully reach 88 miles per hour just before reaching the ravine, exactly as Doc predicted. But did they?

#### How far did they go?

The question we have is about the actual distance it takes them to get to that critical velocity. Since we know at what speed they were going and for how long, we can essentially integrate the time vs speed graph above to get the distance they really traveled. Doing this gives us the distance vs speed graph that we can use to determine if they really reached 88 miles per hour before having traveled 3 miles. In other words, does the blue speed line get to the green future line before reaching the red ravine line?

Great Scott! They actually run out of tracks just shy of 70 miles per hour. This means the ravine is rightfully renamed Eastwood Ravine, because Marty does end up at the bottom of it! Here is another way to look at it:

#### Can we fix this?

Looking at their acceleration over time sheds some light as to why they got in trouble. Remember we’re assuming a uniform acceleration, meaning that between each pair of actual data points, we assume a linear speed progression. The following graph shows the acceleration:

The very narrow spikes are when the speedometer is shown on screen and you see the speed go up by 1 mph within 2 seconds. The wider and lower periods are the result of the speedometer not being shown for a while and the speed not having gone up much in the meantime. My expectation would have been that as the yellow and especially red logs catch fire, we’d see higher and higher acceleration. In reality, acceleration correlates with dramatic moments in the story and with when the speedometer is shown on screen. It’s a movie; I know.
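The distance integration boils down to a trapezoidal rule over the (time, speed) samples; a sketch with illustrative numbers rather than the actual movie dataset:

```python
def distance_traveled_miles(times_min, speeds_mph):
    """Integrate a speed-vs-time curve with the trapezoidal rule.

    times_min: sample times in minutes since the start of the sequence.
    speeds_mph: speed at each sample, in miles per hour.
    Assumes linear speed progression (uniform acceleration) between samples.
    """
    total = 0.0
    for (t0, v0), (t1, v1) in zip(zip(times_min, speeds_mph),
                                  zip(times_min[1:], speeds_mph[1:])):
        hours = (t1 - t0) / 60.0          # convert minutes to hours
        total += 0.5 * (v0 + v1) * hours  # trapezoid area = mean speed * time
    return total

# Sanity check from the real-time assumption above: a steady 25 mph
# held for one hour covers exactly 25 miles.
print(distance_traveled_miles([0, 60], [25, 25]))  # → 25.0
```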
Let’s give Doc a hand and figure out the acceleration his presto logs would have needed to provide in order to make this work. By assuming that we can only influence the acceleration from the point where the green presto log catches fire, we can determine the acceleration needed to reach the right speed in time, and derive the following modified distance vs speed graph:

This allows us to plot the new speed curve on the initial graph that showed the sequence of events. In this scenario, Marty safely goes back to the future after 2 minutes and 8 seconds, a mere 39 seconds after the green log caught fire. Unfortunately, since Clara came on board exactly when the green log caught fire, she most probably would have made the jump with the locomotive. In the movie, it took her 1 minute and 51 seconds to get to the locomotive’s whistle, so she would not have had time to call for help. Doc, who had to put the presto logs in the firebox while the train was moving, would have had to rush to the DeLorean, but it’s possible he would have made it.

#### In the end…

We’re forced to conclude that Doc’s calculations were off and Marty couldn’t have made it back to the future. The fact that he did may mean that we are currently in a “time paradox, the results of which could cause a chain reaction that would unravel the very fabric of the space-time continuum and destroy the entire universe.” In any case, time travel can be a risky business. As a word of advice: maybe where you’re going, you don’t need roads… but where you came from, always make sure you have enough tracks.

More comments are available on:

# Mapping Press Releases in the 2015 Canadian Federal Election

The 2015 Canadian federal election is in its final stretch, and a colleague and I thought it would be a great opportunity to collect some data and do some machine learning. Citizen data science in action!
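The general idea behind such a press-release map is to embed each text as a vector and project the vectors to 2-D so similar texts land close together. A toy sketch of that idea using scikit-learn as a stand-in (the actual work used MLDB, and these documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# A few toy "press releases"; the real dataset had 620 of them.
docs = [
    "Our plan will cut taxes for the middle class",
    "We will lower taxes and balance the budget",
    "Investing in clean energy and the environment",
    "Protecting the environment for future generations",
    "A stronger public health care system for families",
    "More funding for hospitals and health care",
]

# Turn each document into a TF-IDF vector, then project to 2-D so that
# similar texts land close together, like the dots on the map.
vectors = TfidfVectorizer().fit_transform(docs).toarray()
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)
print(coords.shape)  # one (x, y) point per press release
```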
We looked at the press releases of non-regional Canadian federal political parties using Datacratic’s Machine Learning Database (MLDB). The image below is a map with 620 dots, each representing one English-language press release, colored by each party’s official color. The closer two dots are, the more similar the text of the press releases they represent. The white text labels were placed by hand to give a sense of what the various groupings mean.

A lot of interesting insights about each party’s communication strategy can be derived from the visualization. Check out the complete blog post for more details as well as an interactive version of the graph.

# Hacking an epic NHL goal celebration with a hue light show and real-time machine learning

See media coverage of this blog post.

In Montréal this time of year, the city literally stops and everyone starts talking, thinking and dreaming about a single thing: the Stanley Cup Playoffs. Even most of those who don’t normally care the least bit about hockey transform into die-hard fans of the Montréal Canadiens, or the Habs, as we also call them. Below is a Youtube clip of the epic goal celebration hack in action.

In a single sentence: I trained a machine learning model to detect in real-time, based on the live audio feed of a game, that a goal was just scored by the Habs, and to trigger a light show using Philips hues in my living room. The rest of this post explains each step that was involved in putting this together. A full architecture diagram is available if you want to follow along.

### The hack

The original goal (no pun intended) of this hack was to program a celebratory light show using Philips hue lights and play the Habs’ goal song when they scored. Everything would be triggered using a big Griffin PowerMate USB button that would need to be pushed by whoever was closest to it when the goal occurred. That is already pretty cool, but can we take it one step further?
Wouldn’t it be better if the celebratory sequence could be triggered automatically? As far as I could find, there is no API or website available online that can give me reliable notifications within a second or two of a goal being scored. So how can we do it very quickly? Imagine watching a hockey game blindfolded: I bet you would have no problem knowing when goals are scored, because a goal sounds a lot different than anything else in a game. There is of course the goal horn, if the home team scores, but also the commentator, who usually yells a very intense and passionate “GOOOAAAALLLLL!!!!!”. By hooking into the audio feed of the game and processing it in real-time using a machine learning model trained to detect when a goal occurs, we could trigger the lights and music automatically, allowing all the spectators to dance and do celebratory chest-bumps without having to worry about pushing a button.

### Some signal processing

The first step is to take a look at what a goal sound looks like. The Habs’ website has a listing of all previous games with ~4 minute video highlights of each game. I extracted the audio from a particular highlight and used librosa, a library for audio and music analysis, to do some simple signal processing.

If you’ve never played with sounds before, you can head over to Wikipedia to read about what a spectrogram is. You can also simply think of it as taking the waveform of an audio file and creating a heat map over time and audio frequencies (Hz). Low-pitched sounds are at the lower end of the y-axis and high-pitched sounds are at the upper end, while the color represents the intensity of the sound. We’re going to be using the mel power spectrogram (MPS), which is like a spectrogram with additional transformations applied on top of it. You can use the code below to display the MPS of a sound file.
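A minimal sketch of such a display function, using librosa’s standard melspectrogram API; the parameter choices are assumptions matching the 128 mel bands and 8 kHz ceiling used for the features:

```python
def plot_mps(path, sr=22050, n_mels=128, fmax=8000):
    """Load an audio file and display its mel power spectrogram."""
    import librosa
    import librosa.display
    import matplotlib.pyplot as plt
    import numpy as np

    y, sr = librosa.load(path, sr=sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, fmax=fmax)
    librosa.display.specshow(librosa.power_to_db(S, ref=np.max),
                             sr=sr, x_axis="time", y_axis="mel", fmax=fmax)
    plt.colorbar(format="%+02.0f dB")
    plt.title("mel power spectrogram")
    plt.show()

def mps_shape(duration_s, sr=22050, n_mels=128, hop_length=512):
    """Shape of the MPS matrix for a clip: (mel bands, frames)."""
    frames = 1 + int(duration_s * sr) // hop_length
    return (n_mels, frames)

# A 2-second clip at librosa's default 22,050 Hz sampling rate gives
# the 128x87 feature matrix described below.
print(mps_shape(2.0))  # → (128, 87)
```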
This is what the MPS of a 4-minute highlight of a game looks like:

mel power spectrogram of a 4-minute highlight

Now let’s take a look at an 8-second clip from that highlight, specifically when a goal occurred.

mel power spectrogram of a goal by the Canadiens

As you can see, there are very distinctive patterns when the commentator yells (the 4 big wavy lines) and when the goal horn goes off in the amphitheater (many straight lines). Being able to see the patterns with the naked eye is very encouraging in terms of being able to train a model to detect them.

There are tons of different audio features we could derive from the waveform to use as features for our classifier. However, I always try to start simple to create a working baseline and improve from there. So I decided to simply vectorize the MPS, which was created by using 2-second clips with frequencies up to 8 kHz, with 128 mel bands at a sampling rate of 22.05 kHz. The MPS has a shape of 128×87, which results in a feature vector of 11,136 elements when vectorized.

### The machine learning problem

If you’re not familiar with machine learning, think of it as building algorithms that can learn from data. The type of ML task we need to do for this project is binary classification, which means telling the difference between two classes of things:

• positive class: the Canadiens scored a goal
• negative class: the Canadiens did not score a goal

Put another way, we need to train a model that can give us the probability that the Canadiens scored a goal given the last 2 seconds of audio. A model learns to perform a task through training, which is looking at past examples of those two classes and figuring out what statistical regularities in the data allow it to separate the classes. However, it is easy for a computer to learn things by heart. The goal of machine learning is producing models that are able to generalize what they learn to data they have never seen, to new examples.
What this means for us is that we’ll be using past games to train the model, but what we obviously want are predictions for future games, in real time, as they are aired on TV.

### Building the dataset

As with any machine learning project, there is a time when you will feel like a monkey, and that is usually when you’re building, importing or cleaning a dataset. For this project, this took the form of recording the audio from multiple 4-minute highlights of games and noting the time in the clip when a goal was scored by the Habs or the opposing team.

Obviously, we’ll be using the Canadiens’ goals as positive examples for our classifier, since that is what we are trying to detect. Now what about negative examples? If you think about it, the very worst thing that could happen to this system is for it to produce false positives (falsely thinking there is a goal). Imagine we are playing against the Toronto Maple Leafs and they score a goal and the light show starts. Not only did we just get scored on and are bummed out, but on top of that the algorithm is trolling us about it by playing our own goal song! (This is naturally a fictitious example because the Leafs are obviously not making the playoffs once again this year.)

To make sure that doesn’t happen, we’ll be using all the opposing team’s goals as explicit negatives. The hope is that the model will be able to distinguish between goals for and against because the commentator is much more enthusiastic about Canadiens’ goals. To illustrate this, compare the MPS of the Habs’ goal above with the example below of a goal against the Habs. The commentator’s scream is much shorter, and the goal horn of the opposing team’s amphitheater is at very different frequencies than the one at the Bell Centre. The goal horn only goes off when the home team scores, so the MPS below is taken from a game not played in Montréal.
*mel power spectrogram of a goal against the Canadiens*

In addition to the opposing team’s goals, we’ll use 50 randomly selected segments from each highlight that are far enough from an actual goal as negatives, so that the model is exposed to what the uneventful portions of a game sound like. False negatives (missing an actual goal) are still bad, but we prefer them over false positives. We’ll talk about how we can deal with them later on. Note that I did not do any alignment of the sound files, meaning the commentator’s yelling does not start at exactly the same time in every clip.

The dataset ended up consisting of 10 games, with 34 goals by the Habs and 17 goals against them. The randomly selected negative clips added another 500 examples.

### Training and picking a classifier

As I mentioned earlier, the goal was to start simple. To that effect, the first models I tried were a simple logistic regression and an SVM with an RBF kernel over the raw vectorized MPS. I was a bit surprised that this trivial approach yielded usable results. The logistic regression got an AUC of 0.97 and an F1 score of 0.63, while the SVM got an AUC of 0.98 and an F1 score of 0.71. Those results were obtained by holding out 20% of the training data to test on.

At this point I ran a few complete game broadcasts through the system and, each time the model detected a goal, wrote the corresponding 2-second sound file to disk. A bunch were false positives that corresponded to commercials. The model had never seen commercials before because they are not included in game highlights. I added those false positives to the negative examples, retrained, and the problem went away.

However, the AUC/F1 scores were not an accurate estimate of the performance I could expect, because I was not necessarily planning to use a single prediction as the trigger for the light show. Since I’m scoring many times per second, I could try decision rules that look at the last n predictions to make a decision.
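The baseline training described above can be sketched with scikit-learn; the feature matrix `X` and labels `y` would come from the vectorized clips, and the 20% holdout matches the evaluation mentioned earlier:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score

def train_and_eval(model, X, y):
    """Fit on 80% of the data and report AUC / F1 on the held-out 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y)
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    f1 = f1_score(y_te, model.predict(X_te))
    return auc, f1

# a simple logistic regression and an SVM with an RBF kernel
baselines = [LogisticRegression(max_iter=1000),
             SVC(kernel='rbf', probability=True)]
```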
I ran a 10-fold cross-validation, holding out an entire game from the training set and stepping through the held-out game’s highlight as if it were the real-time audio stream of a live game. That way I could test out multi-prediction decision rules. I tried two decision rules:

1. the average of the last n predictions is over the threshold t
2. m positive votes in the last n predictions, where a YES vote requires a prediction over the threshold t

For each combination of decision rule, hyper-parameters and classifier, there were 4 metrics I was looking at:

1. real Canadiens goals that the model detected (true positives)
2. opposing team goals that the model detected (really bad false positives)
3. no goal but the model thought there was one (false positives)
4. Canadiens goals the model did not detect (false negatives)

SVMs ended up being able to get more true positives but did a worse job on false positives. What I ended up using was a logistic regression with the second decision rule: to trigger a goal, there need to be 5 positive votes out of the last 20, and a vote is cast if the probability of a goal is over 90%. The cross-validation results for that rule were 23 Habs goals detected, 11 not detected, 2 opposing team goals falsely detected, and no other false positives.

Looking at the Habs’ 2014-15 season statistics, they scored an average of 2.61 goals per game and were scored on 2.24 times. This means I can loosely expect the algorithm to miss 1 Habs goal per game (0.84 to be more precise) and to go off for a goal by the opposing team once every 4 games.

Note that the trained model only works for the specific TV station and commentator I trained on. I trained on regular season games aired on TVA Sports because they are airing the playoffs. I tried testing on a few games aired on another station and basically detected no goals at all. This means performance is likely to go down if the commentator catches a cold.
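The winning decision rule (5 votes over 90% among the last 20 predictions) is easy to implement with a fixed-size deque; a sketch:

```python
from collections import deque

class GoalTrigger:
    """Fire when at least m of the last n predictions exceed threshold t."""

    def __init__(self, m=5, n=20, t=0.9):
        self.m = m
        self.t = t
        self.votes = deque(maxlen=n)  # old votes fall off automatically

    def update(self, prob):
        # Cast a YES vote if this prediction is confident enough,
        # then check whether enough recent votes agree.
        self.votes.append(1 if prob >= self.t else 0)
        return sum(self.votes) >= self.m
```

Each call to `update()` would happen once per scoring pass; when it returns True, the celebration gets triggered.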
### Philips hue light show

Now that we’re able to do a reasonable job of identifying goals, it was time to create a light show that rivals those crazy Christmas ones we’ve all seen. This has 2 components: playing the Habs’ goal song and flashing the lights to the music.

The goal song I play is not the current one in use at the Bell Centre, but the one they used in the 2000s. It is called “Le Goal Song” by the Montréal band L’Oreille Cassée. To the best of my knowledge, the song is not available for sale and can only be found on Youtube.

Philips hues are smart LED multicolor lights that can be controlled using an iPhone app. The app talks to the hue bridge that is connected to your wifi network, and the bridge talks to the lights over the ZigBee Light Link protocol. In my living room, I have the 3 starter-kit hue lights, a light-strip under my kitchen island and a Bloom pointing at the wall behind my TV. Hues are not specifically meant for light shows; I usually use them to create an interesting atmosphere in my living room.

I realized the lights can be controlled using a REST API that runs on the bridge. Using the very effective phue library, we can interface with the hue bridge API from python. At that point, it was simply a question of programming a sequence of color and intensity calls that would roughly go along with the goal song I wanted to play. Below is an example of using phue to make each light cycle through the colors blue, white and red 10 times.

I deployed this as a simple REST API using bottle. This way, the celebratory light show is decoupled from the trigger, and the lights can be triggered easily by calling the /goal endpoint.

### Hooking up to the live audio stream

My classifier was trained on audio clips offline. To make this whole thing come together, the missing piece was the real-time scoring of a live audio feed.
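Stepping back to the light show for a moment, here is a sketch of the phue color cycle mentioned in the previous section. The bridge IP and the CIE x,y chromaticity values are assumptions, and the bridge button must have been pressed once to authorize the connection:

```python
import time

# Approximate CIE x,y chromaticity values for blue, white and red (assumed)
COLORS = {'blue': (0.136, 0.040), 'white': (0.323, 0.329), 'red': (0.675, 0.322)}

def goal_light_show(bridge_ip='192.168.1.100', cycles=10):
    from phue import Bridge  # imported here so the module loads without phue
    b = Bridge(bridge_ip)
    b.connect()
    for _ in range(cycles):
        for name in ('blue', 'white', 'red'):
            for light in b.get_light_objects():
                light.on = True
                light.brightness = 254
                light.xy = COLORS[name]
            time.sleep(0.4)
```

In my setup, a function like this would sit behind the bottle /goal route so any trigger, model or button, reaches the same code path.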
I’m running all of this on OSX, and to get the live audio into my python program, I needed two components: Soundflower and pyaudio. Soundflower acts as a virtual audio device and allows audio to be passed between applications, while pyaudio is a library that can be used to play and record audio in python.

The way things need to be configured is that the system audio is first set to the Soundflower virtual audio device. At that point, no sound will be heard because nothing is being sent to the output device. In python, you can then configure pyaudio to capture audio coming into the virtual audio device, process it, and resend it out to the normal output device. In my case, that is the HDMI output going to the TV.

As you can see from the code snippet below, you start listening to the stream by giving pyaudio a callback function that will be called each time the captured frames buffer is full. In the callback, I add the frames to a ring buffer that keeps 2 seconds worth of audio, because that is the size of the training examples I used to train the model. The callback gets called many times per second. Each time, I take the contents of the ring buffer and score it using the classifier. When a goal is detected by the model, this triggers a REST call to the /goal endpoint of the light show API.

### Full architecture

My TV subscription allows me to stream the hockey games on a computer in HD. I hooked up a Mac Mini to my TV, and that Mac is responsible for running all the components of the system:

1. displaying the game on the TV
2. sending the game’s audio feed to the Soundflower virtual audio device
3. running the python goal detector that captures the sound from Soundflower, analyses it, calls the goal endpoint if necessary and resends the audio out to the HDMI output
4. running the light show API that listens for calls to the goal endpoint

Since the algorithm is not perfect, I also hooked up the Griffin USB button that I mentioned at the very beginning of the post.
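As for the capture-and-score loop itself, the pyaudio callback described above could be sketched as follows. The classifier call is left as a comment, and the wiring assumes Soundflower is set as the default input device:

```python
import numpy as np
from collections import deque

RATE = 22050      # matches the training sample rate
CHUNK = 1024      # frames delivered per callback
ring = deque(maxlen=RATE * 2)  # keeps the last 2 seconds of samples

def callback(in_data, frame_count, time_info, status):
    # pyaudio hands us raw 16-bit frames; normalize to floats in [-1, 1]
    frames = np.frombuffer(in_data, dtype=np.int16).astype(np.float32) / 32768.0
    ring.extend(frames)
    clip = np.array(ring, dtype=np.float32)
    # ...score `clip` with the trained classifier here and, if the decision
    # rule fires, POST to the light show's /goal endpoint...
    return (in_data, 0)  # 0 == pyaudio.paContinue: pass the audio through

def listen():
    import pyaudio  # imported here; needs pyaudio + Soundflower configured
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, output=True, frames_per_buffer=CHUNK,
                     stream_callback=callback)
    stream.start_stream()
    return stream
```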
The button can be used to either start or stop the light show, in case we get a false negative or false positive respectively. This was very easy to do because a push of the button simply calls the /goal endpoint of the API, which can decide what to do with the trigger.

### Production results and beyond

After two playoff games against the Ottawa Senators, the model successfully detected 75% of the goals (missing 1 per game) and got no false positives. This is in line with the expected performance, and the USB button was there to save the day when the detection did not work.

This was done in a relatively short amount of time and represents the simplest approach at each step. To make this work better, there are a number of things that could be done: aligning the audio files of the positive examples, trying different example lengths, trying more powerful classifiers like a convolutional neural net, doing simple image analysis of the video feed to try to determine on which side of the ice we are, and so on.

In the meantime, enjoy the playoffs and Go Habs Go!

# Efficient log processing

I’ve recently learned a couple of neat tricks for processing large amounts of text files more efficiently from my new co-worker @nicolaskruchten. Our use case is efficiently going through tens of gigabytes of logs to extract specific lines and do some operation on them. Here are a couple of things we’ve done to speed things up.

#### Keep everything gzipped

Often, the bottleneck will be IO. This is especially true on modern servers that have a lot of cores and RAM. By keeping the files we want to process gzipped, we can use zcat to directly read the compressed files and pipe the output to whichever script we need. This reduces the amount of data that needs to be read from the disk.
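As a quick sketch (the log file is generated here just for the demo; real file names would differ):

```shell
# Create a small gzipped log for the demo
printf 'reject a@gmail.com\naccept b@example.com\n' | gzip > exim_reject_1.gz

# Stream the compressed file straight into grep -- no decompressed
# copy ever touches the disk
zcat exim_reject_1.gz | grep gmail.com
```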
If you’re piping into a Python script, you can easily loop over lines coming from the standard input by using the fileinput module.

#### Use parallel to use all available cores

GNU parallel is the coolest utility I’ve discovered recently. It allows you to execute a script that needs to act on a list of files in parallel. For example, suppose we have a list of 4 log files (exim_reject_1.gz, exim_reject_2.gz, etc.) and that we need to extract the lines that contain gmail.com. We could run a grep on each of those files sequentially, but if our machine has 4 cores, why not run all the greps at once? It can be done like this using parallel:

Breaking down the previous command, we tell parallel to run, using 4 cores, the command zcat {} | grep gmail.com, where {} will be substituted with each of the files matching the selector exim_reject*.gz. Each resulting command from the substitutions of {} will be run in parallel. What’s great about parallel is that you can also collect all the results from the parallel executions and pipe them into another command. We could, for example, decide to keep the resulting lines in a new file.

#### Use a ramdisk

If you’ll be doing a lot of reading and writing to the same files and have lots of RAM, you should consider using a ramdisk. Doing so will undoubtedly save you lots of IO time. On Linux, it is very easy to do: a single mount command will create an 8GB ramdisk.

#### In the end…

By using all the tricks above, we were able to considerably improve the overall runtime of our scripts. Well worth the time it took to refactor our initial naive pipeline.

# 64-bit Scientific Python on Windows

Getting a 64-bit installation of Python with scientific packages on our dear Windows isn’t as simple as running an apt-get or port command. There is an official 64-bit Python build available, but extensions like numpy, scipy or matplotlib only have official 32-bit builds.
There are commercial distributions such as Enthought that offer all the packages built in 64-bit, but at around $200 per license, this was not an option for me.

I stumbled upon the Python Extension Packages for Windows page, which contains dozens of extensions compiled for Python 2.5, 2.6 and 2.7, in 32- and 64-bit versions. With these packages, I was able to get a working installation in no time.

# Boston Music Hackday

I was thrilled to attend the Boston Music Hackday this weekend. A lot of people hacked up some pretty cool projects, many of us coding until the very early morning Sunday (aka 4am), only to get back up a few hours later (aka 8am) to keep at it until the dreaded 15h45 deadline, when we all had to submit our demos. The organisers did a wonderful job and the event was a success at every level.

The hack I did was called the PartyLister. The goal was mainly to come up with a way to generate steerable playlists that would also be personalized for a group of people (i.e., taking into account each of their musical tastes and making sure everyone gets a song he likes once in a while). Given the very limited amount of time available to hack this up, I had to keep things simple, so I decided to use only social tags to do all the similarity computation. I expected the quality of the playlists would suffer, but the goal was really to develop a way to include multiple listeners in the track selection process. The algorithm could then be used in conjunction with something like the playlist generation model I presented at this year’s ISMIR.

### My hack: PartyLister

Imagine you’re hosting a party and using the PartyLister as DJ for the night. Each of your guests will need to supply the software with his last.fm username and we’ll be good to go.

We go out and fetch from the last.fm API the social tags associated with the artists (and their top tracks) that our listeners know about. We also use the EchoNest API to get similar artists so we can present new artists to our listeners. From a user’s top artists, we can create a tag cloud that represents the user’s general musical taste (UMT). We’re also allowing each user to specify a set of tags that represent their current musical taste using a steerable tag cloud.

Suppose you have 3 guests at your party, where two like pop and the other likes metal. By doing a naive combination of the users’ musical taste, we’ll probably end up playing pop music, leaving our metalhead friend bored. To solve this, I added a user weight term which is determined by looking at the last 5 songs that played and computing the average similarity between the user’s musical taste and those songs. If we’re only playing pop songs, the metalhead will have a very low similarity between his taste and what played and so we’ll increase his weight and lower the pop lovers’ weights. When we pick the next song, this weighting scheme will allow the metalhead’s taste to count more than the pop lovers’, even if there are more of them. This will make us play a more metal-like track. After a while the weights will equal out and we’ll start playing pop music again.

For sparseness reasons, I operated on artists instead of tracks. A simplified version of how I weighted each candidate artist is below. Lambda is simply a knob to determine how much the users’ musical tastes count, cd() represents the cosine distance, and UMT represents a combination of the user’s general musical taste and his steerable cloud.

$\text{score}(\text{cand}_i) = cd(\text{seed},\text{cand}_i) + \frac{\lambda}{\sum_{\text{users}} [\text{user weight}]}\sum_{\text{users}} [\text{user weight}] \cdot cd(\text{cand}_i,\text{UMT})$
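A literal transcription of this scoring in Python, assuming every tag cloud has already been projected onto a shared tag vocabulary as a dense numpy vector:

```python
import numpy as np

def cd(a, b):
    """Cosine distance between two tag-cloud vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def score_candidate(seed, cand, user_weights, user_umts, lam=0.5):
    """Lower scores are better, since cd() is a distance.

    user_weights: one weight per user, higher for neglected listeners
    user_umts: each user's combined general + steerable tag cloud (UMT)
    """
    user_term = sum(w * cd(cand, umt)
                    for w, umt in zip(user_weights, user_umts))
    return cd(seed, cand) + (lam / sum(user_weights)) * user_term
```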

The following plot represents a running average of the cosine distance (dissimilarity) between users’ musical taste and the last 5 songs that played. It represents a 160 songs playlist with 3 listeners in the system.

As you can see, as a user’s running average increases, his weight is also increased, so we start playing more songs that fit his taste. His average then decreases as the other users’ weights go up, forcing a return to music that fits their taste a little more. The plot shows that the system seems to be doing what we want, that is, taking into account the musical taste of multiple users and playing music that each person will like once in a while. Integrated into a real playlist generation model, I believe this could produce interesting results.

I also played with a discovery setting, where users could specify if they wanted to discover new songs or stick to what they know. This was achieved by adding a bonus or penalizing each candidate’s score, based on the discovery setting (float between 0 and 1) and the proportion of users who knew (had already listened to) the artist in question.
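One hypothetical way to shape that bonus/penalty (the exact formula I used is not reproduced here): penalize a candidate by how far the fraction of users who already know it sits from what the discovery setting asks for.

```python
def discovery_penalty(discovery, known_fraction, scale=0.1):
    """Penalty added to a candidate's (distance-based) score.

    discovery: 0.0 = stick to known music, 1.0 = seek new music
    known_fraction: share of users who already listened to the artist

    The penalty is zero when the candidate matches the setting perfectly
    (e.g. nobody knows the artist and discovery is 1.0) and grows with
    the mismatch. This is an illustrative stand-in, not the original rule.
    """
    desired_unknown = discovery
    actual_unknown = 1.0 - known_fraction
    return scale * abs(desired_unknown - actual_unknown)
```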

PartyLister was not a very visually or sonically attractive hack like some of the others, but I still managed to win a prize based on popular vote. Thanks to all the great sponsors, there were a lot of prizes and so lots of winners.

Below is the Université de Montréal delegation, Mike Mandel (who also won a prize for his Bowie S-S-S-Similarities) and myself, with our bounty.

I really hope to attend another hackday soon as it was all a lot of fun. Time to go get some sleep now.

# My time at Sun Labs and pyaura

My internship at Sun Microsystems Labs, which has been going on for about 15 months – 9 of those full time at their campus in the Boston area – is coming to an end. During the course of those months, I’ve met a lot of very smart and fun people, I’ve worked on very challenging and stimulating problems and I’ve discovered a bunch of really good New England beers.

All my work has been centered around the Aura datastore, an open-source, scalable and distributed recommendation platform. The datastore is designed to handle millions of users and items and can generate content-based recommendations based on each item’s aura (aka tag cloud).

Last summer, under the supervision of Paul Lamere, I worked a lot more on our music recommendation web application, called the Music Explaura and designed a steerable recommendation interface. (We also have a Facebook companion app to the Explaura that was created by Jeff Alexander.)

This summer, I worked with Steve Green on many different things, including what I’d like to talk about in this post, pyaura, a Python interface to the datastore.

## pyaura

The idea behind pyaura is to get the best of both worlds. While the datastore is very good at what it does – storing millions of items and being able to compute similarity between all of them very quickly – the Java framework surrounding it is a bit too rigid to quickly hack random research code on top of it. While my actual goal was to experiment with ways of doing automatic cleanup and clustering of social tags, I felt I was missing the flexibility I wanted and was used to getting when working on projects using Python’s interactive environment.

Without going into details: since the datastore is distributed and has many different components, it uses a technology called Jini to automatically hook them all up together. Jini takes care of automatic service discovery so you don’t have to manually specify IP addresses and so on. It also allows components to publicly export functions that remote components can call. A concrete example would be the datastore head component allowing the web server component to call its getSimilarity() function on two items. The computation happens in the datastore head, and the results then get shipped across the wire to the web server so it can serve its request. However, Jini only supports Java, leaving us no direct way to connect to the datastore using Python.

After looking around for a bit, I stumbled upon a project called JPype, which essentially allows you to launch a JVM inside Python. This lets you instantiate and use Java objects in a completely transparent way from within Python. Using JPype, I built two modules which, together, allow very simple access to the datastore through Python.

• AuraBridge: A Java implementation of the Aura datastore interface. The bridge knows about the actual datastore because it can locate it and talk to it using Jini.
• pyaura: A set of Python helper functions (mostly automatic type conversion). pyaura instantiates an AuraBridge instance using JPype and uses it as a proxy to get data to and from the datastore.

### Example

To demonstrate how things become easy when using pyaura, imagine you are running an Aura datastore and have collected a lot of artist and tag information from the web. You might be interested in quickly seeing the number of artists that have been tagged with each individual tag you know about. With a few lines of code, you can get a nice histogram that answers just that question:
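Since the exact AuraBridge calls depend on the datastore interface, here is only the plotting half, assuming the tag-to-artist-count mapping has already been fetched through pyaura (the data-fetching step is left as a hypothetical):

```python
import matplotlib.pyplot as plt

def tag_popularity_histogram(tag_counts, bins=50):
    """Histogram of how many artists each tag was applied to.

    tag_counts: dict mapping tag name -> number of artists carrying it,
    e.g. built by iterating over the datastore's tag items via pyaura.
    """
    counts = list(tag_counts.values())
    plt.hist(counts, bins=bins, log=True)  # log scale: tag use is long-tailed
    plt.xlabel('number of artists tagged')
    plt.ylabel('number of tags')
    plt.title('tag popularity over %d tags' % len(counts))
    plt.show()
```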

The above code produces the following plot:

This is the result we expect, as the plot was generated with a datastore containing 100,000 artists. As progressively less popular artists are added to the datastore, the effects of sparsity in social data kick in: less popular artists are tagged with fewer tags than popular artists, leading to the situation where very few tags were applied to more than 5,000 artists.

This is a small example, but it shows the simplicity of using pyaura. With very few lines of code, you can do pretty much anything with the data stored in Aura. This will hopefully make the Aura datastore more accessible and attractive to projects looking to take advantage of its scalability and raw power while keeping the flexibility to quickly hack on top of it.