See media coverage of this blog post.
In Montréal this time of year, the city literally stops and everyone starts talking, thinking and dreaming about a single thing: the Stanley Cup Playoffs. Even most of those who don’t normally care the least bit about hockey transform into die hard fans of the Montréal Canadiens, or the Habs like we also call them.
Below is a Youtube clip of the epic goal celebration hack in action. In a single sentence, I trained a machine learning model to detect in real-time that a goal was just scored by the Habs based on the live audio feed of a game and to trigger a light show using Philips hues in my living room.
The rest of this post explains each step that was involved in putting this together. A full architecture diagram is available if you want to follow along.
The original goal (no pun intended) of this hack was to program a celebratory light show using Philips hue lights and play the Habs’ goal song when they scored a goal. Everything would be triggered using a big Griffin PowerMate USB button that would need to be pushed by whoever was the closest to it when the goal occurred.
That is already pretty cool, but can we take it one step further? Wouldn’t it be better if the celebratory sequence could be triggered automatically?
As far as I could find, there is no API or website available online that can give me reliable notifications within a second or two that a goal was scored. So how can we do it very quickly?
Imagine you watch a hockey game blindfolded, I bet you would have no problem knowing when goals are scored because a goal sounds a lot different that anything else in a game. There is of course the goal horn, if the home team scores, but also the commentator who usually yells a very intense and passionate “GOOOAAAALLLLL!!!!!”. By hooking up into the audio feed of the game and processing it in real-time using a machine learning model trained to detect when a goal occurs, we could trigger the lights and music automatically, allowing all the spectators to dance and do celebratory chest-bumps without having to worry about pushing a button.
Some signal processing
The first step is to take a look at what a goal sound looks like. The Habs’ website has a listing of all previous games with ~4 minutes video highlights of each game. I extracted the audio from a particular highlight and used librosa, a library for audio and music analysis, to do some simple signal processing. If you’ve never played with sounds before, you can head over to Wikipedia to read about what a spectrogram is. You can also simply think of it as taking the waveform of an audio file and creating a simple heat map over time and audio frequencies (Hz). Low-pitched sounds are at the lower end of the y-axis and high-pitched sounds are on the upper end, while the color represents the intensity of the sound.
We’re going to be using the mel power spectrogram (MPS), which is like a spectrogram with additional transformations applied on top of it.
You can use the code below to display the MPS of a sound file.
This is what the MPS of a 4 minutes highlight of a game looks like:
mel power spectogram of a 4 minutes highlight
Now let’s take a look at an 8 seconds clip from that highlight, specifically when a goal occurred.
mel power spectrogram of a goal by the Canadiens
As you can see, there are very distinctive patterns when the commentator yells (the 4 big wavy lines), and when the goal horn goes off in the amphitheater (many straight lines). Being able to see the patterns with the naked eye is very encouraging in terms of being able to train a model to detect it.
There are tons of different audio features we could derive from the waveform to use as features for our classifier. However, I always try to start simple to create a working baseline and improve from there. So I decided to simply vectorize the MPS, which was created by using 2 second clips with frequencies up to 8KHz with 128 Mel bands at a sampling rate of 22.5KHz. The MPS have a shape of 128×87, which results in a feature vector of 11,136 elements when vectorized.
The machine learning problem
If you’re not familiar with machine learning, think of it as building algorithms that can learn from data. The type of ML task we need to do for this project is binary classification, which means making the difference between two classes of things:
- positive class: the Canadiens scored a goal
- negative class: the Canadiens did not score a goal
Put another way, we need to train a model that can give us the probability that the Canadiens scored a goal given the last 2 seconds of audio.
A model learns to perform a task through training, which is looking at past examples of those two classes and figuring out what are the statistical regularities in the data that allow it to separate the classes. However, it is easy for a computer to learn things by heart. The goal of machine learning is producing models that are able to generalize what they learn to data they have never seen, to new examples. What this means for us is that we’ll be using past games to train the model but what we obviously want to do are predictions for future games in real-time as they are aired on TV.
Building the dataset
As with any machine learning project, there is a time when you will feel like a monkey, and that is usually when you’re either building, importing or cleaning a dataset. For this project, this took the form of recording the audio from multiple 4 minute highlights of games and noting the time in the clip when a goal was scored by the Habs or the opposing team.
Obviously, we’ll be using the Canadiens’ goals as positive examples for our classifier, since that is what we are trying to detect.
Now what about negative examples? If you think about it, the very worst thing that could happen to this system is for it to get false positives (falsely thinking there is a goal). Imagine we are playing against the Toronto Maple Leafs and they score a goal and the light show starts. Not only did we just get scored and are bummed out, but on top of that the algorithm is trolling us about it by playing our own goal song! (This is naturally a fictitious example because the Leafs are obviously not making the playoffs once again this year) To make sure that doesn’t happen, we’ll be using all the opposing team’s goals as explicit negatives. The hope is that the model will be able to distinguish between goals for and against because the commentator is much more enthusiastic for Canadiens’ goals.
To illustrate this, compare the MSP of the Habs’ goal above with the example below of a goal against the Habs. The commentator’s scream is much shorter and the goal horn of the opponent’s team amphitheater is at very different frequencies than the one at the Bell Center. The goal horn only goes off when the home team scores so the MSP below is taken from a game not played in Montréal.
mel power spectrogram of a goal against the Canadiens
In addition to the opposing team’s goals, we’ll use 50 randomly selected segments from each highlight that are far enough from an actual goal as negatives, so that the model is exposed to what the uneventful portions of a game sound like.
False negatives (missing an actual goal) are still bad, but we prefer them over false positives. We’ll talk about how we can deal with them later on.
Note that I did not do any alignment of the sound files, meaning the commentator yelling does not start at exactly the same time in every clip. The dataset ended up consisting of 10 games, with 34 goals by the Habs and 17 goals against them. The randomly selected negative clips added another 500 examples.
Training and picking a classifier
As I mentioned earlier, the goal was to start simple. To that effect, the first models I tried were a simple logistic regression and an SVM with an rbf kernel over the raw vectorized MPS.
I was a bit surprised that this trivial approach yielded usable results. The logistic regression got an AUC of 0.97 and an F1 score of 0.63, while the SVM got an AUC of 0.98 and an F1 score of 0.71. Those results were obtained by holding out 20% of the training data to test on.
At this point I ran a few complete game broadcasts through the system and each time the model detected a goal, I wrote out the 2 seconds corresponding sound file to disk. A bunch were false positives that corresponded to commercials. The model had never seen commercials before because they are not included in game highlights. I added those false positives to the negative examples, retrained and the problem went away.
However the AUC/F1 score were not an accurate estimation of the performance I could expect because I was not necessarily planning to use a single prediction as the trigger for the light show. Since I’m scoring many times per second, I could try decision rules that would look at the last n predictions to make a decision.
I ran a 10-fold cross-validation, holding out an entire game from the training set, and actually stepping through the held out game’s highlight as if it was the real-time audio stream of a live game. That way I could test out multi-prediction decision rules.
I tried two decision rules:
- average of last n predictions over the threshold t
- m positive votes in the last n predictions, where a YES vote requires a prediction over the threshold t
For each combination of decision rule, hyper-parameters and classifier, there were 4 metrics I was looking at:
- Real Canadiens goal that the model detected (true positive)
- Opposing team goal that the model detected (really bad false positive)
- No goal but the model thought there was one (false positive)
- Canadiens goal the model did not detect (false negative)
SVMs ended up being able to get more true positives but did a worst job on false positives. What I ended up using was a logistic regression with the second decision rule. To trigger a goal, there needs to be 5 positives votes out of the last 20 and votes are cast if the probability of a goal is over 90%. The cross-validation results for that rule were 23 Habs goals detected, 11 not detected, 2 opposing team goals falsely detected and no other false positives.
Looking at the Habs’ 2014-15 season statistics, they scored an average of 2.61 goals per game and got scored 2.24 times. This means I can loosely expect the algorithm to not detect 1 Habs goal per game (0.84 to be more precise) and to go off for a goal by the opposing team once every 4 games.
Note that the trained model only works for the specific TV station and commentator I trained on. I trained on regular season games aired on TVA Sports because they are airing the playoffs. I tried testing on a few games aired on another station and basically detected no goals at all. This means performance is likely to go down if the commentator catches a cold.
Philips hue light show
Now that we’re able to do a reasonable job at identifying goals, it was time to create a light show that rivals those crazy Christmas ones we’ve all seen. This has 2 components: playing the Habs’ goal song and flashing the lights to the music.
The goal song I play is not the current one in use at the Bell Center, but the one that they used in the 2000s. It is called “Le Goal Song” by the Montréal band L’Oreille Cassée. To the best of my knowledge, the song is not available for sale and is only available on Youtube.
Philips hues are smart LED multicolor lights that can be controlled using an iPhone app. The app talks to the hue bridge that is connected to your wifi network and the bridge talks to the lights over the ZigBee Light Link protocol. In my living room, I have the 3 starter-kit hue lights, a light-strip under my kitchen island and a Bloom pointing at the wall behind my TV. Hues are not specifically meant for light shows; I usually use them to create an interesting atmosphere in my living room.
I realized the lights can be controller using a REST API that runs on the bridge. Using the very effective phue library, we can interface with the hue bridge API from python. At that point, it was simply a question of programming a sequence of color and intensity calls that would roughly go along with the goal song I wanted to play.
Below is an example of using phue to make each light cycle through the colors blue, white and red 10 times.
I deployed this up as a simple REST API using bottle. This way, the celebratory light show is decoupled from the trigger. The lights can be triggered easily by calling the /goal endpoint.
Hooking up to the live audio stream
My classifier was trained on audio clips offline. To make this whole thing come together, the missing piece was the real-time scoring of a live audio feed.
I’m running all of this on OSX and to get the live audio into my python program, I needed two components: Soundflower and pyaudio. Soundflower acts as a virtual audio device and allows audio to be passed between applications, while pyaudio is a library that can be used to play an record audio in python.
The way things need to be configured is the system audio is first set to the Soundflower virtual audio device. At that point, no sound will be heard because nothing is being sent to the output device. In python, you can then configure pyaudio to capture audio coming into the virtual audio device, process it, and then resend it out to the normal output device. In my case, that is the HDMI output going to the TV.
As you can see from the code snippet below, you start listening to the stream by giving pyaudio a callback function that will be called each time the captured frames buffer is full. In the callback, I add the frames to a ring buffer that keeps 2 seconds worth of audio, because that is the size of the training examples I used to train the model. The callback gets called many times per second. Each time, I take the contents of the ring buffer and score it using the classifier. When a goal is detected by the model, this triggers a REST call to the /goal endpoint of the light show API.
My TV subscription allows me to stream the hockey games on a computer in HD. I hooked up a Mac Mini to my TV and that Mac will be responsible for running all the components of the system:
- displaying the game on the TV
- sending the game’s audio feed to the Soundflower virtual audio device
- running the python goal detector that capture the sound from Soundflower, analyses it, calls the goal endpoint if necessary and resends the audio out to the HDMI output
- running the light show API that listens for calls to the goal endpoint
Since the algorithm is not perfect, I also hooked up the Griffin USB button that I mentioned at the very beginning of the post. It can be used to either start or stop the light show in case we get a false negative or false positive respectively. It was very easy to do this because a push of the button simply calls the /goal endpoint of the API that can decide what it should do with the trigger.
Production results and beyond
After two playoff games against the Ottawa Senators, the model successfully detected 75% of the goals (missing 1 per game) and got no false positives. This is in line with the expected performance, and the USB button was there to save the day when the detection did not work.
This was done in a relatively short amount of time and represents the simplest approach at each step. To make this work better, there are a number of things that could be done. For instance, aligning the audio files of the positive examples, trying different example length, trying more powerful classifiers like a convolutional neural net, doing simple image analysis of the video feed to try to determine on which side of the ice we are, etc.
In the mean-time, enjoy the playoffs and Go Habs Go!
In the media