# Building an indoor NFT hydroponics system with Raspberry Pi monitoring

I love plants. I’m not quite sure why, but over recent years it’s been a growing love affair; pun intended.

I’m lucky to have room on my rooftop to grow lots of veggies in the summer. I’ve been building up capacity and I’m now up to 7 irrigated containers. I even managed to grow corn that was 9+ feet tall during my first summer! I’m now engaged in a ruthless battle with our beloved squirrels over domination of the garden’s bounty.

However, each year as our Montréal winter drew closer, I watched, powerless, as my once strong crops wrinkled away. This got me interested in finding out what my options were for growing veggies indoors, and that led me to hydroponics.

There is a sustainable development aspect to all of this which I find quite interesting. Growing most of the food we need locally seems like something we will have to do as a society pretty soon. I’ve been a big fan of Lufa Farms, who have been pioneers in that area, ever since they started. There are similar initiatives in most major cities. This technology is even being put in containers that are then shipped to the Arctic to grow fresh food in one of the harshest climates on the planet!

This post isn’t an exact step-by-step guide. There is lots to know and learn, and great resources already exist online. Here, I’m going to take you through my journey and point you to relevant resources along the way.


### Hydroponics basics

Let’s start with the basics: what’s hydroponics? Very simply, it’s a technique for growing plants without soil, using water enriched with a nutrient solution. The technique has been around for a very long time. There’s no need to build a hydro garden to grow veggies inside; I could have simply decided to grow veggies in pots filled with soil. But isn’t doing it the way NASA grows food in space way cooler? There are also advantages in terms of efficiency, like an increased growth rate and lower water usage.

There are different types of hydroponics systems, but the basic idea is always the same. The type of system I built is a nutrient film technique (NFT) system. The diagram below shows the major elements involved.

The water reservoir is filled with water, enriched with nutrients and pH corrected. The water is then pumped into the system, where plants sit in net cups. The water flows through the system, bringing nutrients to the roots of the plants, until it reaches the end and gets recycled back into the reservoir. An air stone is placed in the reservoir to pump air into the water and make sure it stays oxygenated (not shown on the diagram, but required).

### The plan

I first considered buying a pre-built small system like the ones from Aerogarden, but ended up not going for those because I was afraid the yields would be low and I would be locked into their proprietary seed system.

Then I took steps towards building a deep water culture system like the one described in this video. What made me reconsider that choice was thinking through the logistics of actually running the system, which is a crucial point to consider depending on where you’d place your garden. Since the water needs to be changed every 1 to 2 weeks, the reservoir has to be either close to a drain or transported to one, emptied and filled up again. Depending on the size of the system, the reservoir can be quite heavy when filled with water. Living in a condo with mostly wooden floors, I considered the required water operations too complicated and risky. I imagined having the container on a wheeled shelf and rolling it to my bathroom to change the water, but even that seemed overly complicated.

I then found a guide to building an NFT system with PVC pipes that seemed better adapted for my situation. Having the water reservoir separate from the system where the plants grow would make changing the water much easier as I would be able to simply carry it, leaving the plants where they are.

Inspiration for my system. Image taken from this post.

### Acquiring the parts

I got all the required parts from my local hardware store and online stores. I also dropped by a specialized hydroponics store and got some things from them, but I could have gotten them online.

Here are most of the parts that I ended up using:

• System
• Water
• Hydroponics grow medium and nutrients
• Seed starting
• Other

### Building the system

The first step is to cut the PVC pipe into sections with a saw and then drill holes at the right spots. You can use the net cups as guides and draw around them with a marker so you know where to drill.

I had a small piece of unused pipe that I used to make sure the size of the hole saw matched the net cups correctly. The size was good, but since I was using a round pipe, the sides of the net cups were exposed. I used aluminum foil to block off the light. I’ve seen some people use square fence posts instead of the type of pipe I used; more on that later.

Then I put a drain in the PVC cap by drilling a hole using a smaller hole saw than for the net cups:

The next step was to drill all the net cup holes. Drilling with the hole saw makes a pretty huge mess. You’ll get little pieces of plastic everywhere. I highly recommend using a file to finish the job cleanly and then properly cleaning the inside of the pipes so that there isn’t any plastic piece left inside. That will prevent any piece from making its way into your water pump later on.

At this point I could assemble the whole thing by dry fitting the different pieces to make sure it was all good:

Then came the time to glue the different pieces together so that no water would leak out. I watched some more YouTube videos and got going with the glue.

I moved the system into my bathroom and put it in the bathtub so I could test it with some water flowing through it. It seemed to be doing OK, which was somewhat of a surprise. The next step was a live-fire exercise: running the system with water pumped from the reservoir and having it drain back out.

Unfortunately, two problems became very apparent. First, I had leaks in my PVC joints. Second, the pump was too powerful for the capacity of the drain that I had put. This meant the pump was emptying the entire reservoir in the system and would eventually cause the water to overflow as well as burn out the pump once the reservoir emptied out. Even setting the pump to its lowest capacity was too much.

PVC cap with self-fusing silicone tape

For the leaks, it was a bit tricky. My normal job is in computer science, where mistakes can often be fixed with a simple ctrl+z. Things aren’t as simple when gluing plastic pipes together. One option was to saw off the joints and redo them, which would have reduced the size of the system and wasn’t ideal at all. The second option, which I ended up going with, was to apply self-fusing silicone tape around the joints. By putting a generous quantity over the problematic joints, it took care of all the leaks!

For the overflow problem, the ideal solution would have been to put in a bigger drain, but again, since the PVC cap was glued on, I would have had to saw it off, glue on a new one and install a bigger drain, which I didn’t have… So the fix was to put a second drain right next to the first one, as I did have that part.

At this point the system was good to go. But I was a bit scared that something unforeseen could go wrong and water would leak out. Since I’m on the top floor of a condo building, you can understand that it is not an acceptable scenario. I needed a contingency plan.

I thought about how I could put some kind of containment system in place. Inspired by small kids’ swimming pools, I decided to build a custom basin into which I could place the whole hydro system. If something were to break and all the water leaked out, it would be completely contained within the basin. The reservoir only holds 14L of water, which is not a huge quantity.

I went back to the hardware store to get some wood, metal shelf supports, tie wraps and a tarp.

Et voilà. Water shields up!

The final step was to hang the light fixture. Because I was placing the garden at the top of a staircase where I didn’t care too much about the look of it, I simply screwed some hooks in the beams in the ceiling and hung the lamp using adjustable ropes with ratchets.

### Preparing the water

A little preparation needs to go into the water. I’ve used tap water for the garden, even though recommendations online suggest using distilled water, or water that went through reverse osmosis. The reason for those recommendations is that tap water contains chemicals and minerals (e.g. chlorine) that you might not want in your garden. However, as I understand it, the important thing to do when using tap water is to let it stand, exposed to the sun’s UV rays, for about 24 hours. This allows the chlorine put in the water by the municipality to break down. For that purpose, I bought 2 identical buckets so that I could have the system running with the first bucket as the reservoir while the second one is filled with the replacement water. The buckets take turns being the reservoir.

Then you need to add nutrients. Some people optimize the type of nutrient based on what growth stage their plants are in. So one type when they’re simply growing and creating roots, and another when they’re producing fruit. I’m not that sophisticated yet and I’m using a simple 2-3-2 solution from my local hydro store. I also started putting in Calimagic solution as well after about 6 weeks in an attempt to help out my tomato plants that didn’t look too happy. Simply follow the instructions on the bottle in terms of quantity.

The final step is to control the pH, with a pH up/down solution and your pH meter. I aim for a pH of about 6, based on the many guides online. The pH up/down solution is pretty concentrated, so I use a pipette for greater control.

I haven’t paid too much attention to the PPM readings up to now so I can’t report much on that front, except that I’m getting OK-ish results without paying attention to it :)

### The first residents move in

Because the plants will be growing without soil, we need to start them from seeds in a growing medium. The most popular one is rock wool cubes (the yellow cubes below). The cubes are made from rock and sand and are similar to the mineral insulation used in houses. Because of their popularity, I started with them. However, as I read more on the subject, I found arguments that they’re not very eco-friendly, so I decided to try alternatives. I’m now using rapid rooter starter plugs (the brown cylinders), which are made from compost.

Whichever medium you use, the idea is the same: hold water, air and our seeds. You need to start them in a little greenhouse. This video explains how it’s done. I’ve found that using a heating mat helped a lot, so I recommend getting one.

My very first batch was 2 tomatoes, 2 peppers and 3 lettuce. Once our little veggies are big enough, we can put the rock wool cube in a net cup and fill it with clay pebbles so they are held in place. You also want to prevent as much light as possible from getting into the system, which the pebbles also help to do. If light gets in, you increase the chances of having algae grow in the system.

With all our baby plants in net cups placed in the system, and aluminum foil covering any remaining openings, the system was complete. I drilled holes in the reservoir’s cover to let the different pipes through and we were good to go.

This is a video of the system in action:

### Monitoring environmental conditions with a Raspberry Pi

My day job is doing machine learning and data science, so it was natural to think about how I could collect as much data as possible on the garden to track and automate things. One of the longer-term goals is to try correlating environmental conditions, and whatever actions I’m taking, with plant growth and yield. In terms of actions, everyone suggests keeping a journal, which I’m doing. But for metrics on the water, air and light, having something automated is much better.

To monitor the environment, I put together a small monitoring rig with a Raspberry Pi. I won’t go into too many details, but even though this was my first experience with a Pi, it wasn’t too complicated. I ordered a Pi, a breadboard and a few sensors. The first ones I hooked up were the following:

• Ambient light sensor: KY-018
• Ambient temperature sensor: DHT11
• Water temperature sensor: DS18B20
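To give an idea of what reading these sensors involves, here’s a sketch for the DS18B20 water temperature probe, which Linux exposes through the 1-Wire sysfs interface (the device ID below is made up):

```python
from pathlib import Path

def parse_w1_slave(raw):
    """Extract the temperature in Celsius from a DS18B20 w1_slave dump.

    The kernel exposes two lines; the second ends with 't=' followed by
    the temperature in millidegrees Celsius.
    """
    lines = raw.strip().splitlines()
    if not lines[0].endswith("YES"):  # first line carries the CRC check
        raise IOError("bad CRC from sensor")
    millidegrees = int(lines[1].rsplit("t=", 1)[1])
    return millidegrees / 1000.0

def read_water_temp(device_id="28-000005e2fdc3"):  # hypothetical device ID
    """Read the probe through the 1-Wire sysfs interface on the Pi."""
    raw = Path(f"/sys/bus/w1/devices/{device_id}/w1_slave").read_text()
    return parse_w1_slave(raw)

# Example of what the kernel file looks like:
sample = ("73 01 4b 46 7f ff 0d 10 41 : crc=41 YES\n"
          "73 01 4b 46 7f ff 0d 10 41 t=23187\n")
print(parse_w1_slave(sample))  # → 23.187
```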

A small Python server runs on the Pi and exposes an API with the readings from the sensors. The script is available on Github. The server also pulls weather information from the Open Weather Map API so I can track what’s going on outside. I have a Mac Mini running Prometheus, a time series database that collects and saves the readings, and I set up Grafana to get a nice dashboard.
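The real script is on Github; as a rough stdlib-only illustration of the idea, a tiny server exposing readings in Prometheus’s text exposition format (metric names and values here are placeholders) could look like:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def format_metrics(readings):
    """Render sensor readings in Prometheus's text exposition format."""
    return "".join(f"{name} {value}\n" for name, value in readings.items())

def get_readings():
    # Placeholder values; the real server queries the attached sensors.
    return {"ambient_temp_celsius": 21.5,
            "water_temp_celsius": 19.2,
            "ambient_light": 412}

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = format_metrics(get_readings()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Prometheus scrapes this endpoint on a fixed interval.
    HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```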

This is the result:

Some fun patterns:

• The oscillation in the room temperature (yellow line on the top graph) is the electric baseboard heating the room going on and off. I’ve got some wall insulation problems in that area of my condo, and the sensor sits on the floor, so the temperature reading at floor level is strongly correlated with the outside temperature.
• There is a daily cycle in the reservoir water temperature (green line on the top graph). By looking at both that graph and the bottom one, which shows when the lamp is on, you can see that the water temperature follows the cycle of the lamp. That was unexpected but makes total sense: the lamp heats up the water going through the PVC pipe, which in turn heats up the reservoir when it gets recycled. Because the outside temperature was dropping quickly during the period this graph covers, the pattern is harder to see, except on the 25th (left-hand side of the top graph).
• On the bottom graph, looking at the green line showing light intensity, you can see the lamp turning on before sunrise and staying on after the sun has set. During the day, you see a nice arc where the ambient light increases and then decreases, following the sun.

I also got a TP-Link HS110 smart plug with energy monitoring in order to track how much power I’m using, and to be able to programmatically turn the power off and on. I didn’t want to get into setting up a Pi-controlled relay, and there is an open source 3rd party library that can be used to control the HS110, so I went with that. As you can see from the screenshot, the light used a daily average of 1.39 kWh, and the electricity cost in Québec is $0.0877/kWh, meaning the cost of running the lamp was $3.66 for the last month. An acceptable price for hopefully yummy tomatoes!
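The monthly cost figure is just arithmetic on the plug’s readings. Here’s a sketch; the plug control uses pyHS100, which I believe is the kind of third-party library in question, though treat that API usage and the IP address as assumptions:

```python
def monthly_cost(kwh_per_day, rate_per_kwh, days=30):
    """Electricity cost over a month for a fixed average daily consumption."""
    return round(kwh_per_day * rate_per_kwh * days, 2)

# Numbers from the post: 1.39 kWh/day at Québec's $0.0877/kWh rate.
print(monthly_cost(1.39, 0.0877))  # → 3.66

def set_lamp(on, plug_ip="192.168.1.50"):  # hypothetical plug address
    """Toggle the HS110 smart plug (assumes the pyHS100 library)."""
    from pyHS100 import SmartPlug
    plug = SmartPlug(plug_ip)
    plug.turn_on() if on else plug.turn_off()
```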

I configured the HS110 to turn on the light from 6:00AM to 11:30PM. Plants need to sleep too.

### Growth history

Below is a slideshow showing the progression of the growth.

Up to now, I’ve eaten two lettuces, and 3 more are about ready for consumption.

The tomato and pepper plants took longer to establish themselves, and for a while their leaves were in what seemed like pretty bad shape. I realized I had miscalibrated my pH meter and was about 30% off from the desired acidity level. After fixing that, the leaves improved and flowers appeared on all 4 plants. There are now many baby tomatoes growing.

Since the garden is indoors, bees can’t help when it comes to pollination. Tomato plants can simply be shaken, since their flowers have both the male and female parts. Another way to do this is to have a fan blowing on the garden to constantly move the plants, simulating the wind. Pepper plants, however, need to be pollinated by hand. After doing some research, I tried doing it with a Q-tip. I think I’ve just succeeded, as I’m seeing tiny green balls in some pepper flowers.

### Lessons learned

One big lesson is to make sure your pH meter is properly calibrated. I hadn’t done it properly and was about 30% off from where the water’s acidity needed to be. That caused the leaves of the fruit plants to wrinkle in on themselves and the plants to not produce any flowers.

I’ve gotten away with changing the water about every 10-12 days. As the tomato and pepper plants have gotten bigger, they’ve started drinking much more water, which causes the reservoir to empty out. With less water in, the acidity goes up because the ratio of water to nutrient changes. So I need to do a quick check up every two days and fix the acidity with the pH+/- solutions. With bigger tomato plants, I also now have to top up the water once between every change. One big reason to change the water completely is to prevent algae from forming in the system. I’m being as careful as I can to prevent light from hitting the water (algae needs light to grow), so I haven’t had any problems up to now.

Another realization is that the system might be too small for tomato plants. As they are growing bigger, their root system also grows. After my last water change, the system started to leak because roots got into the drain and partially blocked it, causing the water level to rise too much in the system and leak through the net cup holes. This was the first real test for the emergency basin and everything worked great. I’ll have to monitor that closely and potentially remove the tomato plants, which would mean I can’t grow fruit plants in this 4′ PVC system.

Root system from a tomato plant too close to the drain

Finally, the next garden I build will probably not use round pipes, but square fence posts instead, like in this project. The big reason is that you can’t easily install drains on a curved surface, and the net cups are harder to put in place. That’s why I had to put the drains in the end caps of the pipes, and why I have to use aluminum foil around the cups. When changing the water, it’s hard to remove all the water from the system because the drain sits about 1 cm from the bottom, so I have to tilt the whole thing. Having a drain at the bottom would make it much easier.

### Next steps

I’ve bought an ultrasonic distance sensor for the Pi and have been planning to install it on the inside of the water reservoir’s cover. This would allow monitoring of the water level in the reservoir, letting me programmatically stop the pump if required. This could happen if the plants drink all the water or if there is a leak, and it would even be possible to differentiate between the two based on the rate of change. Taking an automated action in case of a leak would be a great second contingency system. Adding extra sensors to track how the water is doing, like a pH sensor, would also be great, so I don’t have to do it myself every other day. Professional sensor systems exist but they’re pretty expensive.
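A sketch of how an ultrasonic sensor could report the water level, assuming an HC-SR04-style sensor (pin numbers are placeholders): the sensor times an echo’s round trip, and sound travels at roughly 343 m/s in air.

```python
import time

SPEED_OF_SOUND_CM_S = 34300  # in air at ~20 °C

def echo_to_cm(round_trip_s):
    """Convert an echo round-trip time to a one-way distance in cm."""
    return round_trip_s * SPEED_OF_SOUND_CM_S / 2

def water_level_cm(reservoir_depth_cm, round_trip_s):
    """Water level = reservoir depth minus sensor-to-surface distance."""
    return reservoir_depth_cm - echo_to_cm(round_trip_s)

def measure_round_trip_s(trig=23, echo=24):
    """Trigger the sensor and time the echo pulse (runs on the Pi only)."""
    import RPi.GPIO as GPIO
    GPIO.setmode(GPIO.BCM)
    GPIO.setup(trig, GPIO.OUT)
    GPIO.setup(echo, GPIO.IN)
    GPIO.output(trig, True); time.sleep(10e-6); GPIO.output(trig, False)
    while GPIO.input(echo) == 0:
        start = time.time()
    while GPIO.input(echo) == 1:
        end = time.time()
    return end - start

# A 1 ms round trip means the water surface is ~17 cm from the sensor.
print(echo_to_cm(0.001))
```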

I also have a webcam ready and I’m planning to start taking pictures every few minutes to create a time-lapse video showing how things are changing over time. I’ve done that for seeds and it’s pretty fun to watch. The second thing I’d like to do with the images is to automatically track the growth of plants by doing some image analysis.

Having both extra sensors and a video feed that I can analyze automatically is really what will unlock optimizing the garden’s conditions to maximize yield.

### Closing remarks

I took a pretty complicated route to get to where I am. If you’re interested in trying out a hydro garden, a $100-200 hydroponic kit will certainly do the job. You could even go the soil route by only getting a lamp. Happy hydro gardening!

# Actually, Marty didn’t go Back To The Future: Graphing the train sequence of BTTF3

In a hurry? Go straight to the graphs. The dataset and notebook detailing how this was done are available in the companion repository.

Two weeks ago was Back To The Future Day. October 21st, 2015 is the day Marty and Doc Brown travel to at the beginning of the second movie. The future is now the past. There were worldwide celebrations and jokes, from the Queensland police deploying a hoverboard unit, to Universal Pictures releasing a Jaws 19 trailer, and even Health Canada issuing an official recall notice for the DeLorean DMC-12 because of a flux capacitor defect that could prevent the car from traveling through time.

I love the trilogy and, as many people probably did that week, I rewatched the movies. I also wondered if there was any fun BTTF data science project I could do. While watching the climactic sequence at the end of the third movie, I realized that as the steam locomotive pushes the DeLorean down the tracks, we get many data points as to the speed of the DeLorean. Marty is essentially reciting a dataset, all the way from 1885. That made me ask the 1.21-gigawatt question: do they really make it to 88 miles per hour before they run out of tracks?

#### Doc’s Plan

For those not familiar with the movies, Marty and Doc are trapped in the old West without any gas to power the gasoline engine of the DeLorean, their time machine. That means they can’t drive it to 88 miles per hour, the speed required to activate the flux capacitor, and travel back to 1985. The plan they come up with is to commandeer a steam locomotive and use it to push the DeLorean to the required speed.
Doc spells out the plan to make it back to the future:

> Tomorrow night, Sunday, we’ll load the DeLorean on to the tracks here on the spur right by the old abandoned silver mine. The switch track is where the spur runs off the main line 3 miles into Clayton… Shonash Ravine. The train leaves the station at 8:00 Monday morning. We’ll stop it here, uncouple the cars from the tender, throw the switch track, and hijack – borrow the locomotive and use it to push the time machine. According to my calculations we’ll hit 88 miles per hour just before we hit the edge of the ravine, at which point we’ll instantaneously arrive in 1985 and coast safely across the completed bridge.

If you think about it, it’s a shame Doc didn’t equip the DeLorean with Tesla electric motors when he visited 2015. That would have made things easier, considering the DeLorean was equipped with a working Mr. Fusion generator in 1885.

#### The dataset

To assemble the dataset, I simply watched the train sequence and took down each time Marty said the speed, or we saw it on the speedometer, along with the time in the movie. The tiny dataset is available in the companion repository. Also, Doc tells us twice in the movie that they have 3 miles of tracks before they hit the ravine.

Finally, as Jack Bauer would say, we assume events occur in real time. This is a critical assumption because I’ll base the distance calculations on it. So if they were to go at a steady 25 miles per hour for one hour in the movie, that would mean they traveled 25 miles during that period.

#### Graphing the sequence of events

For simplicity, I assumed a linear progression between each actual data point, meaning we’re assuming a uniform acceleration between data points. The following graph shows the sequence of events as they occur in the movie. The x-axis represents the number of minutes since the beginning of the train sequence and the y-axis is their speed.
The landmarks along the tracks have been labeled in red at the bottom. Finally, the period during which each of Doc’s 3 presto logs burnt has been marked as a horizontal line. We can see that the whole sequence lasts about 7 minutes and they successfully reach 88 miles per hour just before reaching the ravine, exactly as Doc predicted. But did they?

#### How far did they go?

The question we have is about the actual distance it takes them to get to that critical velocity. Since we know at what speed they were going and for how long, we can essentially integrate the time vs speed graph above to get the distance they really traveled. Doing this gives us the distance vs speed graph that we can use to determine if they really reached 88 miles per hour before having traveled 3 miles. In other words, does the blue speed line get to the green future line before reaching the red ravine line?

Great Scott! They actually run out of tracks just shy of 70 miles per hour. This means the ravine is rightfully renamed Eastwood Ravine, because Marty does end up at the bottom of it! Here is another way to look at it:

#### Can we fix this?

Looking at their acceleration over time sheds some light as to why they got in trouble. Remember we’re assuming a uniform acceleration, meaning that between each pair of actual data points, we assume a linear speed progression. The following graph shows the acceleration:

The very narrow spikes are when the speedometer is shown on screen and you see the speed go up by 1 mph within 2 seconds. The wider and lower periods are the result of the speedometer not being shown for a while and the speed not having gone up much in the meantime. My expectation would have been that as the yellow and especially red logs catch fire, we’d see higher and higher acceleration. In reality, acceleration correlates with dramatic moments in the story and with when the speedometer is shown on screen. It’s a movie; I know.
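The distance integration boils down to a trapezoidal rule over the (time, speed) samples; a sketch with illustrative numbers rather than the actual movie dataset:

```python
def distance_traveled_miles(times_min, speeds_mph):
    """Integrate a speed-vs-time curve with the trapezoidal rule.

    times_min: sample times in minutes since the start of the sequence.
    speeds_mph: speed at each sample, in miles per hour.
    Assumes linear speed progression (uniform acceleration) between samples.
    """
    total = 0.0
    for (t0, v0), (t1, v1) in zip(zip(times_min, speeds_mph),
                                  zip(times_min[1:], speeds_mph[1:])):
        hours = (t1 - t0) / 60.0          # convert minutes to hours
        total += 0.5 * (v0 + v1) * hours  # trapezoid area = mean speed * time
    return total

# Sanity check from the real-time assumption above: a steady 25 mph
# held for one hour covers exactly 25 miles.
print(distance_traveled_miles([0, 60], [25, 25]))  # → 25.0
```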
Let’s give Doc a hand and figure out the acceleration his presto logs would have needed to provide in order to make this work. By assuming that we can only influence the acceleration from the point where the green presto log catches fire, we can determine the acceleration needed to reach the right speed in time, and derive the following modified distance vs speed graph:

This allows us to plot the new speed curve on the initial graph that showed the sequence of events. In this scenario, Marty safely goes back to the future after 2 minutes and 8 seconds, a mere 39 seconds after the green log caught fire. Unfortunately, since Clara came on board exactly when the green log caught fire, she most probably would have made the jump with the locomotive. In the movie, it took her 1 minute and 51 seconds to get to the locomotive’s whistle, so she would not have had time to call for help. Doc, who had to put the presto logs in the firebox while the train was moving, would have had to rush to the DeLorean, but it’s possible he would have made it.

#### In the end…

We’re forced to conclude that Doc’s calculations were off and Marty couldn’t have made it back to the future. The fact that he did may mean that we are currently in a “time paradox, the results of which could cause a chain reaction that would unravel the very fabric of the space-time continuum and destroy the entire universe.” In any case, time travel can be a risky business. As a word of advice: maybe where you’re going, you don’t need roads… but where you came from, always make sure you have enough tracks.

More comments are available on:

# Mapping Press Releases in the 2015 Canadian Federal Election

The 2015 Canadian federal election is in its final stretch, and a colleague and I thought it would be a great opportunity to collect some data and do some machine learning. Citizen data science in action!
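The general idea behind such a press-release map is to embed each text as a vector and project the vectors to 2-D so similar texts land close together. A toy sketch of that idea using scikit-learn as a stand-in (the actual work used MLDB, and these documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# A few toy "press releases"; the real dataset had 620 of them.
docs = [
    "Our plan will cut taxes for the middle class",
    "We will lower taxes and balance the budget",
    "Investing in clean energy and the environment",
    "Protecting the environment for future generations",
    "A stronger public health care system for families",
    "More funding for hospitals and health care",
]

# Turn each document into a TF-IDF vector, then project to 2-D so that
# similar texts land close together, like the dots on the map.
vectors = TfidfVectorizer().fit_transform(docs).toarray()
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)
print(coords.shape)  # one (x, y) point per press release
```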
We looked at the press releases of non-regional Canadian federal political parties using Datacratic’s Machine Learning Database (MLDB). The image below is a map with 620 dots, each representing one English-language press release, colored by each party’s official color. The closer two dots are, the more similar the text of the press releases they represent. The white text labels were placed by hand to give a sense of what the various groupings mean.

A lot of interesting insights about each party’s communication strategy can be derived from the visualization. Check out the complete blog post for more details as well as an interactive version of the graph.

# Hacking an epic NHL goal celebration with a hue light show and real-time machine learning

See media coverage of this blog post.

In Montréal this time of year, the city literally stops and everyone starts talking, thinking and dreaming about a single thing: the Stanley Cup Playoffs. Even most of those who don’t normally care the least bit about hockey transform into die-hard fans of the Montréal Canadiens, or the Habs, as we also call them. Below is a Youtube clip of the epic goal celebration hack in action.

In a single sentence: I trained a machine learning model to detect in real-time, based on the live audio feed of a game, that a goal was just scored by the Habs, and to trigger a light show using Philips hues in my living room. The rest of this post explains each step that was involved in putting this together. A full architecture diagram is available if you want to follow along.

### The hack

The original goal (no pun intended) of this hack was to program a celebratory light show using Philips hue lights and play the Habs’ goal song when they scored. Everything would be triggered using a big Griffin PowerMate USB button that would need to be pushed by whoever was closest to it when the goal occurred. That is already pretty cool, but can we take it one step further?
Wouldn’t it be better if the celebratory sequence could be triggered automatically? As far as I could find, there is no API or website available online that can give me reliable notifications within a second or two of a goal being scored. So how can we do it very quickly? Imagine watching a hockey game blindfolded: I bet you would have no problem knowing when goals are scored, because a goal sounds a lot different than anything else in a game. There is of course the goal horn, if the home team scores, but also the commentator, who usually yells a very intense and passionate “GOOOAAAALLLLL!!!!!”. By hooking into the audio feed of the game and processing it in real-time using a machine learning model trained to detect when a goal occurs, we could trigger the lights and music automatically, allowing all the spectators to dance and do celebratory chest-bumps without having to worry about pushing a button.

### Some signal processing

The first step is to take a look at what a goal sound looks like. The Habs’ website has a listing of all previous games with ~4 minute video highlights of each game. I extracted the audio from a particular highlight and used librosa, a library for audio and music analysis, to do some simple signal processing.

If you’ve never played with sounds before, you can head over to Wikipedia to read about what a spectrogram is. You can also simply think of it as taking the waveform of an audio file and creating a heat map over time and audio frequencies (Hz). Low-pitched sounds are at the lower end of the y-axis and high-pitched sounds are at the upper end, while the color represents the intensity of the sound. We’re going to be using the mel power spectrogram (MPS), which is like a spectrogram with additional transformations applied on top of it. You can use the code below to display the MPS of a sound file.
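A minimal sketch of such a display function, using librosa’s standard melspectrogram API; the parameter choices are assumptions matching the 128 mel bands and 8 kHz ceiling used for the features:

```python
def plot_mps(path, sr=22050, n_mels=128, fmax=8000):
    """Load an audio file and display its mel power spectrogram."""
    import librosa
    import librosa.display
    import matplotlib.pyplot as plt
    import numpy as np

    y, sr = librosa.load(path, sr=sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, fmax=fmax)
    librosa.display.specshow(librosa.power_to_db(S, ref=np.max),
                             sr=sr, x_axis="time", y_axis="mel", fmax=fmax)
    plt.colorbar(format="%+02.0f dB")
    plt.title("mel power spectrogram")
    plt.show()

def mps_shape(duration_s, sr=22050, n_mels=128, hop_length=512):
    """Shape of the MPS matrix for a clip: (mel bands, frames)."""
    frames = 1 + int(duration_s * sr) // hop_length
    return (n_mels, frames)

# A 2-second clip at librosa's default 22,050 Hz sampling rate gives
# the 128x87 feature matrix described below.
print(mps_shape(2.0))  # → (128, 87)
```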
This is what the MPS of a 4-minute highlight of a game looks like:

mel power spectrogram of a 4-minute highlight

Now let’s take a look at an 8-second clip from that highlight, specifically when a goal occurred.

mel power spectrogram of a goal by the Canadiens

As you can see, there are very distinctive patterns when the commentator yells (the 4 big wavy lines) and when the goal horn goes off in the amphitheater (many straight lines). Being able to see the patterns with the naked eye is very encouraging in terms of being able to train a model to detect them.

There are tons of different audio features we could derive from the waveform to use as features for our classifier. However, I always try to start simple to create a working baseline and improve from there. So I decided to simply vectorize the MPS, which was created by using 2-second clips with frequencies up to 8 kHz, with 128 mel bands at a sampling rate of 22.05 kHz. The MPS has a shape of 128×87, which results in a feature vector of 11,136 elements when vectorized.

### The machine learning problem

If you’re not familiar with machine learning, think of it as building algorithms that can learn from data. The type of ML task we need to do for this project is binary classification, which means telling the difference between two classes of things:

• positive class: the Canadiens scored a goal
• negative class: the Canadiens did not score a goal

Put another way, we need to train a model that can give us the probability that the Canadiens scored a goal given the last 2 seconds of audio. A model learns to perform a task through training, which is looking at past examples of those two classes and figuring out what statistical regularities in the data allow it to separate the classes. However, it is easy for a computer to learn things by heart. The goal of machine learning is producing models that are able to generalize what they learn to data they have never seen, to new examples.
What this means for us is that we’ll be using past games to train the model, but what we obviously want are predictions for future games, in real time, as they are aired on TV.

### Building the dataset

As with any machine learning project, there is a time when you will feel like a monkey, and that is usually when you’re building, importing or cleaning a dataset. For this project, this took the form of recording the audio from multiple 4-minute highlights of games and noting the time in the clip when a goal was scored by the Habs or the opposing team.

Obviously, we’ll be using the Canadiens’ goals as positive examples for our classifier, since that is what we are trying to detect. Now what about negative examples? If you think about it, the very worst thing that could happen to this system is for it to produce false positives (falsely thinking there is a goal). Imagine we are playing against the Toronto Maple Leafs and they score a goal and the light show starts. Not only did we just get scored on and are bummed out, but on top of that the algorithm is trolling us about it by playing our own goal song! (This is naturally a fictitious example because the Leafs are obviously not making the playoffs once again this year.)

To make sure that doesn’t happen, we’ll be using all the opposing team’s goals as explicit negatives. The hope is that the model will be able to distinguish between goals for and against because the commentator is much more enthusiastic about Canadiens’ goals. To illustrate this, compare the MPS of the Habs’ goal above with the example below of a goal against the Habs. The commentator’s scream is much shorter, and the goal horn of the opposing team’s amphitheater is at very different frequencies than the one at the Bell Centre. The goal horn only goes off when the home team scores, so the MPS below is taken from a game not played in Montréal.
*mel power spectrogram of a goal against the Canadiens*

In addition to the opposing team’s goals, we’ll use 50 randomly selected segments from each highlight that are far enough from an actual goal as negatives, so that the model is exposed to what the uneventful portions of a game sound like. False negatives (missing an actual goal) are still bad, but we prefer them over false positives. We’ll talk about how we can deal with them later on. Note that I did not do any alignment of the sound files, meaning the commentator’s yelling does not start at exactly the same time in every clip.

The dataset ended up consisting of 10 games, with 34 goals by the Habs and 17 goals against them. The randomly selected negative clips added another 500 examples.

### Training and picking a classifier

As I mentioned earlier, the goal was to start simple. To that effect, the first models I tried were a simple logistic regression and an SVM with an RBF kernel over the raw vectorized MPS. I was a bit surprised that this trivial approach yielded usable results. The logistic regression got an AUC of 0.97 and an F1 score of 0.63, while the SVM got an AUC of 0.98 and an F1 score of 0.71. Those results were obtained by holding out 20% of the training data to test on.

At this point I ran a few complete game broadcasts through the system and, each time the model detected a goal, wrote the corresponding 2-second sound file to disk. A bunch were false positives that corresponded to commercials. The model had never seen commercials before because they are not included in game highlights. I added those false positives to the negative examples, retrained, and the problem went away.

However, the AUC/F1 scores were not an accurate estimate of the performance I could expect, because I was not necessarily planning to use a single prediction as the trigger for the light show. Since I’m scoring many times per second, I could try decision rules that look at the last n predictions to make a decision.
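The baseline training described above can be sketched with scikit-learn; the feature matrix `X` and labels `y` would come from the vectorized clips, and the 20% holdout matches the evaluation mentioned earlier:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score

def train_and_eval(model, X, y):
    """Fit on 80% of the data and report AUC / F1 on the held-out 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y)
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    f1 = f1_score(y_te, model.predict(X_te))
    return auc, f1

# a simple logistic regression and an SVM with an RBF kernel
baselines = [LogisticRegression(max_iter=1000),
             SVC(kernel='rbf', probability=True)]
```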
I ran a 10-fold cross-validation, holding out an entire game from the training set and stepping through the held-out game’s highlight as if it were the real-time audio stream of a live game. That way I could test out multi-prediction decision rules. I tried two decision rules:

1. the average of the last n predictions is over the threshold t
2. m positive votes in the last n predictions, where a YES vote requires a prediction over the threshold t

For each combination of decision rule, hyper-parameters and classifier, there were 4 metrics I was looking at:

1. real Canadiens goals that the model detected (true positives)
2. opposing team goals that the model detected (really bad false positives)
3. no goal but the model thought there was one (false positives)
4. Canadiens goals the model did not detect (false negatives)

SVMs ended up being able to get more true positives but did a worse job on false positives. What I ended up using was a logistic regression with the second decision rule: to trigger a goal, there need to be 5 positive votes out of the last 20, and a vote is cast if the probability of a goal is over 90%. The cross-validation results for that rule were 23 Habs goals detected, 11 not detected, 2 opposing team goals falsely detected, and no other false positives.

Looking at the Habs’ 2014-15 season statistics, they scored an average of 2.61 goals per game and were scored on 2.24 times. This means I can loosely expect the algorithm to miss 1 Habs goal per game (0.84 to be more precise) and to go off for a goal by the opposing team once every 4 games.

Note that the trained model only works for the specific TV station and commentator I trained on. I trained on regular season games aired on TVA Sports because they are airing the playoffs. I tried testing on a few games aired on another station and basically detected no goals at all. This means performance is likely to go down if the commentator catches a cold.
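The winning decision rule (5 votes over 90% among the last 20 predictions) is easy to implement with a fixed-size deque; a sketch:

```python
from collections import deque

class GoalTrigger:
    """Fire when at least m of the last n predictions exceed threshold t."""

    def __init__(self, m=5, n=20, t=0.9):
        self.m = m
        self.t = t
        self.votes = deque(maxlen=n)  # old votes fall off automatically

    def update(self, prob):
        # Cast a YES vote if this prediction is confident enough,
        # then check whether enough recent votes agree.
        self.votes.append(1 if prob >= self.t else 0)
        return sum(self.votes) >= self.m
```

Each call to `update()` would happen once per scoring pass; when it returns True, the celebration gets triggered.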
### Philips hue light show

Now that we’re able to do a reasonable job of identifying goals, it was time to create a light show that rivals those crazy Christmas ones we’ve all seen. This has 2 components: playing the Habs’ goal song and flashing the lights to the music.

The goal song I play is not the current one in use at the Bell Centre, but the one they used in the 2000s. It is called “Le Goal Song” by the Montréal band L’Oreille Cassée. To the best of my knowledge, the song is not available for sale and can only be found on Youtube.

Philips hues are smart LED multicolor lights that can be controlled using an iPhone app. The app talks to the hue bridge that is connected to your wifi network, and the bridge talks to the lights over the ZigBee Light Link protocol. In my living room, I have the 3 starter-kit hue lights, a light-strip under my kitchen island and a Bloom pointing at the wall behind my TV. Hues are not specifically meant for light shows; I usually use them to create an interesting atmosphere in my living room.

I realized the lights can be controlled using a REST API that runs on the bridge. Using the very effective phue library, we can interface with the hue bridge API from python. At that point, it was simply a question of programming a sequence of color and intensity calls that would roughly go along with the goal song I wanted to play. Below is an example of using phue to make each light cycle through the colors blue, white and red 10 times.

I deployed this as a simple REST API using bottle. This way, the celebratory light show is decoupled from the trigger, and the lights can be triggered easily by calling the /goal endpoint.

### Hooking up to the live audio stream

My classifier was trained on audio clips offline. To make this whole thing come together, the missing piece was the real-time scoring of a live audio feed.
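Stepping back to the light show for a moment, here is a sketch of the phue color cycle mentioned in the previous section. The bridge IP and the CIE x,y chromaticity values are assumptions, and the bridge button must have been pressed once to authorize the connection:

```python
import time

# Approximate CIE x,y chromaticity values for blue, white and red (assumed)
COLORS = {'blue': (0.136, 0.040), 'white': (0.323, 0.329), 'red': (0.675, 0.322)}

def goal_light_show(bridge_ip='192.168.1.100', cycles=10):
    from phue import Bridge  # imported here so the module loads without phue
    b = Bridge(bridge_ip)
    b.connect()
    for _ in range(cycles):
        for name in ('blue', 'white', 'red'):
            for light in b.get_light_objects():
                light.on = True
                light.brightness = 254
                light.xy = COLORS[name]
            time.sleep(0.4)
```

In my setup, a function like this would sit behind the bottle /goal route so any trigger, model or button, reaches the same code path.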
I’m running all of this on OSX, and to get the live audio into my python program, I needed two components: Soundflower and pyaudio. Soundflower acts as a virtual audio device and allows audio to be passed between applications, while pyaudio is a library that can be used to play and record audio in python.

The way things need to be configured is that the system audio is first set to the Soundflower virtual audio device. At that point, no sound will be heard because nothing is being sent to the output device. In python, you can then configure pyaudio to capture audio coming into the virtual audio device, process it, and resend it out to the normal output device. In my case, that is the HDMI output going to the TV.

As you can see from the code snippet below, you start listening to the stream by giving pyaudio a callback function that will be called each time the captured frames buffer is full. In the callback, I add the frames to a ring buffer that keeps 2 seconds worth of audio, because that is the size of the training examples I used to train the model. The callback gets called many times per second. Each time, I take the contents of the ring buffer and score it using the classifier. When a goal is detected by the model, this triggers a REST call to the /goal endpoint of the light show API.

### Full architecture

My TV subscription allows me to stream the hockey games on a computer in HD. I hooked up a Mac Mini to my TV, and that Mac is responsible for running all the components of the system:

1. displaying the game on the TV
2. sending the game’s audio feed to the Soundflower virtual audio device
3. running the python goal detector that captures the sound from Soundflower, analyses it, calls the goal endpoint if necessary and resends the audio out to the HDMI output
4. running the light show API that listens for calls to the goal endpoint

Since the algorithm is not perfect, I also hooked up the Griffin USB button that I mentioned at the very beginning of the post.
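As for the capture-and-score loop itself, the pyaudio callback described above could be sketched as follows. The classifier call is left as a comment, and the wiring assumes Soundflower is set as the default input device:

```python
import numpy as np
from collections import deque

RATE = 22050      # matches the training sample rate
CHUNK = 1024      # frames delivered per callback
ring = deque(maxlen=RATE * 2)  # keeps the last 2 seconds of samples

def callback(in_data, frame_count, time_info, status):
    # pyaudio hands us raw 16-bit frames; normalize to floats in [-1, 1]
    frames = np.frombuffer(in_data, dtype=np.int16).astype(np.float32) / 32768.0
    ring.extend(frames)
    clip = np.array(ring, dtype=np.float32)
    # ...score `clip` with the trained classifier here and, if the decision
    # rule fires, POST to the light show's /goal endpoint...
    return (in_data, 0)  # 0 == pyaudio.paContinue: pass the audio through

def listen():
    import pyaudio  # imported here; needs pyaudio + Soundflower configured
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, output=True, frames_per_buffer=CHUNK,
                     stream_callback=callback)
    stream.start_stream()
    return stream
```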
The button can be used to either start or stop the light show, in case we get a false negative or false positive respectively. This was very easy to do because a push of the button simply calls the /goal endpoint of the API, which can decide what to do with the trigger.

### Production results and beyond

After two playoff games against the Ottawa Senators, the model successfully detected 75% of the goals (missing 1 per game) and got no false positives. This is in line with the expected performance, and the USB button was there to save the day when the detection did not work.

This was done in a relatively short amount of time and represents the simplest approach at each step. To make this work better, there are a number of things that could be done: aligning the audio files of the positive examples, trying different example lengths, trying more powerful classifiers like a convolutional neural net, doing simple image analysis of the video feed to try to determine on which side of the ice we are, and so on.

In the meantime, enjoy the playoffs and Go Habs Go!

# Efficient log processing

I’ve recently learned a couple of neat tricks for processing large amounts of text files more efficiently from my new co-worker @nicolaskruchten. Our use case is efficiently going through tens of gigabytes of logs to extract specific lines and do some operation on them. Here are a couple of things we’ve done to speed things up.

#### Keep everything gzipped

Often, the bottleneck will be IO. This is especially true on modern servers that have a lot of cores and RAM. By keeping the files we want to process gzipped, we can use zcat to directly read the compressed files and pipe the output to whichever script we need. This reduces the amount of data that needs to be read from the disk.
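As a quick sketch (the log file is generated here just for the demo; real file names would differ):

```shell
# Create a small gzipped log for the demo
printf 'reject a@gmail.com\naccept b@example.com\n' | gzip > exim_reject_1.gz

# Stream the compressed file straight into grep -- no decompressed
# copy ever touches the disk
zcat exim_reject_1.gz | grep gmail.com
```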
If you’re piping into a Python script, you can easily loop over lines coming from the standard input by using the fileinput module.

#### Use parallel to use all available cores

GNU parallel is the coolest utility I’ve discovered recently. It allows you to execute a script that needs to act on a list of files in parallel. For example, suppose we have a list of 4 log files (exim_reject_1.gz, exim_reject_2.gz, etc.) and that we need to extract the lines that contain gmail.com. We could run a grep on each of those files sequentially, but if our machine has 4 cores, why not run all the greps at once? It can be done like this using parallel:

Breaking down the previous command, we tell parallel to run, using 4 cores, the command zcat {} | grep gmail.com, where {} will be substituted with each of the files matching the selector exim_reject*.gz. Each resulting command from the substitutions of {} will be run in parallel. What’s great about parallel is that you can also collect all the results from the parallel executions and pipe them into another command. We could, for example, decide to keep the resulting lines in a new file.

#### Use a ramdisk

If you’ll be doing a lot of reading and writing to the same files and have lots of RAM, you should consider using a ramdisk. Doing so will undoubtedly save you lots of IO time. On Linux, it is very easy to do: a single mount command will create an 8GB ramdisk.

#### In the end…

By using all the tricks above, we were able to considerably improve the overall runtime of our scripts. Well worth the time it took to refactor our initial naive pipeline.

# 64-bit Scientific Python on Windows

Getting a 64-bit installation of Python with scientific packages on our dear Windows isn’t as simple as running an apt-get or port command. There is an official 64-bit Python build available, but extensions like numpy, scipy or matplotlib only have official 32-bit builds.
There are commercial distributions such as Enthought that offer all the packages built in 64-bit, but at around $200 per license, this was not an option for me.

I stumbled upon the Python Extension Packages for Windows page, which contains dozens of extensions compiled for Python 2.5, 2.6 and 2.7, in 32- and 64-bit versions. With these packages, I was able to get a working installation in no time.

# Boston Music Hackday

I was thrilled to attend the Boston Music Hackday this weekend. A lot of people hacked up some pretty cool projects, many of us coding until the very early morning Sunday (aka 4am), only to get back up a few hours later (aka 8am) to keep at it until the dreaded 15h45 deadline, when we all had to submit our demos. The organisers did a wonderful job and the event was a success at every level.

The hack I did was called the PartyLister. The goal was mainly to come up with a way to generate steerable playlists that would also be personalized for a group of people (i.e., taking into account each of their musical tastes and making sure everyone gets a song he likes once in a while). Given the very limited amount of time available to hack this up, I had to keep things simple, so I decided to use only social tags to do all the similarity computation. I expected the quality of the playlists would suffer, but the goal was really to develop a way to include multiple listeners in the track selection process. The algorithm could then be used in conjunction with something like the playlist generation model I presented at this year’s ISMIR.

### My hack: PartyLister

Imagine you’re hosting a party and using the PartyLister as DJ for the night. Each of your guests will need to supply the software with his last.fm username and we’ll be good to go.

We go out and fetch from the last.fm API the social tags associated with the artists (and their top tracks) that our listeners know about. We also use the EchoNest API to get similar artists so we can present new artists to our listeners. From a user’s top artists, we can create a tag cloud that represents the user’s general musical taste (UMT). We’re also allowing each user to specify a set of tags that represent their current musical taste using a steerable tag cloud.

Suppose you have 3 guests at your party, where two like pop and the other likes metal. By doing a naive combination of the users’ musical taste, we’ll probably end up playing pop music, leaving our metalhead friend bored. To solve this, I added a user weight term which is determined by looking at the last 5 songs that played and computing the average similarity between the user’s musical taste and those songs. If we’re only playing pop songs, the metalhead will have a very low similarity between his taste and what played and so we’ll increase his weight and lower the pop lovers’ weights. When we pick the next song, this weighting scheme will allow the metalhead’s taste to count more than the pop lovers’, even if there are more of them. This will make us play a more metal-like track. After a while the weights will equal out and we’ll start playing pop music again.

For sparseness reasons, I operated on artists instead of tracks. A simplified version of how I weighted each candidate artist is below. Lambda is simply a knob to determine how much the users’ musical tastes count, cd() represents the cosine distance, and UMT represents a combination of the user’s general musical taste and his steerable cloud.

$\text{score}(\text{cand}_i) = cd(\text{seed},\text{cand}_i) + \frac{\lambda}{\sum_{\text{users}} [\text{user weight}]}\sum_{\text{users}} [\text{user weight}] \cdot cd(\text{cand}_i,\text{UMT})$
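A literal transcription of this scoring in Python, assuming every tag cloud has already been projected onto a shared tag vocabulary as a dense numpy vector:

```python
import numpy as np

def cd(a, b):
    """Cosine distance between two tag-cloud vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def score_candidate(seed, cand, user_weights, user_umts, lam=0.5):
    """Lower scores are better, since cd() is a distance.

    user_weights: one weight per user, higher for neglected listeners
    user_umts: each user's combined general + steerable tag cloud (UMT)
    """
    user_term = sum(w * cd(cand, umt)
                    for w, umt in zip(user_weights, user_umts))
    return cd(seed, cand) + (lam / sum(user_weights)) * user_term
```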

The following plot represents a running average of the cosine distance (dissimilarity) between users’ musical taste and the last 5 songs that played. It represents a 160 songs playlist with 3 listeners in the system.

As you can see, as a user’s running average increases, his weight is also increased, so we start playing more songs that fit his taste. His average then decreases as the other users’ weights go up, forcing a return to music that fits their taste a little more. The plot shows that the system seems to be doing what we want, that is, taking into account the musical taste of multiple users and playing music that each person will like once in a while. Integrated into a real playlist generation model, I believe this could produce interesting results.

I also played with a discovery setting, where users could specify if they wanted to discover new songs or stick to what they know. This was achieved by adding a bonus or penalizing each candidate’s score, based on the discovery setting (float between 0 and 1) and the proportion of users who knew (had already listened to) the artist in question.
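One hypothetical way to shape that bonus/penalty (the exact formula I used is not reproduced here): penalize a candidate by how far the fraction of users who already know it sits from what the discovery setting asks for.

```python
def discovery_penalty(discovery, known_fraction, scale=0.1):
    """Penalty added to a candidate's (distance-based) score.

    discovery: 0.0 = stick to known music, 1.0 = seek new music
    known_fraction: share of users who already listened to the artist

    The penalty is zero when the candidate matches the setting perfectly
    (e.g. nobody knows the artist and discovery is 1.0) and grows with
    the mismatch. This is an illustrative stand-in, not the original rule.
    """
    desired_unknown = discovery
    actual_unknown = 1.0 - known_fraction
    return scale * abs(desired_unknown - actual_unknown)
```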

PartyLister was not a very visually or sonically attractive hack like some of the others, but I still managed to win a prize based on popular vote. Thanks to all the great sponsors, there were a lot of prizes and so lots of winners.

Below is the Université de Montréal delegation, Mike Mandel (who also won a prize for his Bowie S-S-S-Similarities) and myself, with our bounty.

I really hope to attend another hackday soon as it was all a lot of fun. Time to go get some sleep now.

# My time at Sun Labs and pyaura

My internship at Sun Microsystems Labs, which has been going on for about 15 months – 9 of those full time at their campus in the Boston area – is coming to an end. During the course of those months, I’ve met a lot of very smart and fun people, I’ve worked on very challenging and stimulating problems and I’ve discovered a bunch of really good New England beers.

All my work has been centered around the Aura datastore, an open-source, scalable and distributed recommendation platform. The datastore is designed to handle millions of users and items and can generate content-based recommendations based on each item’s aura (aka tag cloud).

Last summer, under the supervision of Paul Lamere, I worked a lot more on our music recommendation web application, called the Music Explaura and designed a steerable recommendation interface. (We also have a Facebook companion app to the Explaura that was created by Jeff Alexander.)

This summer, I worked with Steve Green on many different things, including what I’d like to talk about in this post, pyaura, a Python interface to the datastore.

## pyaura

The idea behind pyaura is to get the best of both worlds. While the datastore is very good at what it does – storing millions of items and being able to compute similarity between all of them very quickly – the Java framework surrounding it is a bit too rigid to quickly hack random research code on top of it. While my actual goal was to experiment with ways of doing automatic cleanup and clustering of social tags, I felt I was missing the flexibility I wanted and was used to getting when working on projects using Python’s interactive environment.

Without going into details: since the datastore is distributed and has many different components, it uses a technology called Jini to automatically hook them all up together. Jini takes care of automatic service discovery so you don’t have to manually specify IP addresses and so on. It also allows components to publicly export functions that remote components can call. A concrete example would be the datastore head component allowing the web server component to call its getSimilarity() function on two items. The computation happens in the datastore head, and the results then get shipped across the wire to the web server so it can serve its request. However, Jini only supports Java, leaving us no direct way to connect to the datastore using Python.

After looking around for a bit, I stumbled upon a project called JPype, which essentially allows you to launch a JVM inside Python. This lets you instantiate and use Java objects in a completely transparent way from within Python. Using JPype, I built two modules which, together, allow very simple access to the datastore through Python.

• AuraBridge: A Java implementation of the Aura datastore interface. The bridge knows about the actual datastore because it can locate it and talk to it using Jini.
• pyaura: A set of Python helper functions (mostly automatic type conversion). pyaura instantiates an AuraBridge instance using JPype and uses it as a proxy to get data to and from the datastore.

### Example

To demonstrate how things become easy when using pyaura, imagine you are running an Aura datastore and have collected a lot of artist and tag information from the web. You might be interested in quickly seeing the number of artists that have been tagged with each individual tag you know about. With a few lines of code, you can get a nice histogram that answers just that question:
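Since the exact AuraBridge calls depend on the datastore interface, here is only the plotting half, assuming the tag-to-artist-count mapping has already been fetched through pyaura (the data-fetching step is left as a hypothetical):

```python
import matplotlib.pyplot as plt

def tag_popularity_histogram(tag_counts, bins=50):
    """Histogram of how many artists each tag was applied to.

    tag_counts: dict mapping tag name -> number of artists carrying it,
    e.g. built by iterating over the datastore's tag items via pyaura.
    """
    counts = list(tag_counts.values())
    plt.hist(counts, bins=bins, log=True)  # log scale: tag use is long-tailed
    plt.xlabel('number of artists tagged')
    plt.ylabel('number of tags')
    plt.title('tag popularity over %d tags' % len(counts))
    plt.show()
```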

The above code produces the following plot:

This is the result we expect, as the plot was generated with a datastore containing 100,000 artists. As progressively less popular artists are added to the datastore, the effects of sparsity in social data kick in: less popular artists are tagged with fewer tags than popular artists, leading to the situation where very few tags were applied to more than 5,000 artists.

This is a small example, but it shows the simplicity of using pyaura. With very few lines of code, you can do pretty much anything with the data stored in Aura. This will hopefully make the Aura datastore more accessible and attractive to projects looking to take advantage of its scalability and raw power while keeping the flexibility to quickly hack on top of it.