Deer Detection with Machine Learning (Part 4)

It’s time again for another episode of “Murderous Deer Machine Learning Mayhem”. Last time, I used a neural network to detect raccoons, since they frequented my backyard more often in the early days. In that post, my preliminary results revealed an accuracy (F1 measure) of approximately 72%. In this post, I’ll talk about some of the experiments I performed to try and increase the overall effectiveness of the neural network.

Seeing in Color

My first idea seemed fairly obvious – color information should lead to better performance. This is an example where my gut feeling and a quick and dirty exploration of the data set seemed to point in a promising direction. Notice the word seemed – that’s because I ultimately went down a bad path. This is a perfect example of why you should dig into the data much more – and have a better grounding in the domain – before investing too much time in feature engineering. Past-self was warning current-self about this in previous blog posts here and here. Bad Craig for not listening, bad! But before I reveal why this didn’t work, let me describe how I adapted the architecture of the neural net to see in color.

Changing the Network

Background – one type of computer image ultimately boils down to a combination of three different color components – Red, Green and Blue (RGB). By mixing these components together with varying intensities of each, you can produce a wide variety of colors. For example, if each component can take on an intensity value between 0 and 255, you can produce 16,777,216 different colors. Cool!

My idea for the neural network was simple – break down each pixel into its component values, and feed all of that to the neural network. A picture will probably describe this much better. My original network looked like this, where I took the average of the RGB intensities and stored it in a single value (creating a grayscale image):

network_diagram

I modified the network (by modify, I mean built in an additional option to the program) to subdivide each pixel into its R, G and B components, and fed those to the neural network like so:

network_diagram_color

This means that I now had 3,600 x 3 = 10,800 inputs to the neural network. That triples the number of connections between the input layer and the hidden layer, but heck, memory and computing power are cheap these days.
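Here is a minimal Python sketch of the two input layouts. It is an illustration only, not the project's actual code: it assumes an image is a flat list of 3,600 (R, G, B) tuples, and the function names are made up for this example.

```python
def to_grayscale_inputs(pixels):
    """Average the R, G and B intensities of each pixel into one value."""
    return [(r + g + b) / 3 for r, g, b in pixels]


def to_color_inputs(pixels):
    """Keep every channel, giving three network inputs per pixel."""
    return [channel for pixel in pixels for channel in pixel]


# A made-up 3,600-pixel image where every pixel is the same orange color.
image = [(255, 128, 0)] * 3600

gray = to_grayscale_inputs(image)
color = to_color_inputs(image)
print(len(gray), len(color))  # 3600 10800
```

The grayscale version hands the network one number per pixel, while the color version hands it three numbers per pixel with nothing tying them together.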

Some Experiments

I ran quite a number of experiments on the raccoon data with more control over how the testing and training data sets were built, both to re-affirm my original results (which hold, ~70% accuracy is right on the button), and to try out a few different options, which included color information. I’ll explain the different options in a moment, but first, here are some results. Note, these results are after training the network over 1,500 iterations to convergence, and then performing a 10-fold cross validation.
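For anyone unfamiliar with 10-fold cross validation, here is a rough Python sketch of the idea: the data is cut into 10 folds, and each fold takes one turn as the testing set while the other 9 are used for training. The function name, the shuffling seed and the stand-in data are assumptions for illustration, not the project's actual code.

```python
import random


def k_fold_splits(examples, k=10, seed=0):
    """Yield (training, testing) pairs, using each fold once for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled examples into k folds, like dealing cards.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        testing = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield training, testing


examples = list(range(100))  # stand-ins for 100 labelled images
splits = list(k_fold_splits(examples))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 90 10
```

The reported measurement is then typically the average over the 10 runs, and the spread between runs gives a feel for how stable the model is.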

Precision

Precision (also known as the positive predictive value) is a measurement of how well the machine learning algorithm is doing when it labels something as a positive example. You basically look at all of the things that the classifier labels as a raccoon, and see how many of them are actually raccoons. High precision is good – it means the algorithm is accurately identifying positive examples.
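As a small Python sketch of that definition (the counts are made up for illustration, and returning 0.0 when nothing is labelled positive is just a convention picked for this example):

```python
def precision(true_positives, false_positives):
    """Fraction of positive predictions that were actually positive."""
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        return 0.0  # nothing was labelled positive; a convention, not a rule
    return true_positives / predicted_positive


# Hypothetical: 10 images are labelled as raccoons, and 7 of them really are.
print(precision(true_positives=7, false_positives=3))  # 0.7
```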

color-raccoons-precision

In the case of the Baseline, you can see that it does fairly well – roughly 70% of the things it labelled a raccoon were actually raccoons. This is good. Notice however, that the Hyperbolic did better (more on that in a bit). Notice too that the precision with Color information was significantly lower.

Recall

Recall (also known as sensitivity) measures how well the algorithm is remembering the actual examples it was given (I always think of the movie Total Recall – the original 90’s version with the Governator). Put another way, recall asks: of all of the things that were actually raccoons, how many did it properly identify? As an example, if there are 10 raccoons in the data set, and the algorithm correctly identifies 6 raccoons, then the recall is 60% (total recall would be identifying all 10 raccoons).
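The 10 raccoon example can be written out as a short Python sketch. The function name and the list-of-booleans layout are assumptions made for illustration:

```python
def recall(actual, predicted):
    """Fraction of the actual positives that the classifier found."""
    true_positives = sum(1 for a, p in zip(actual, predicted) if a and p)
    false_negatives = sum(1 for a, p in zip(actual, predicted) if a and not p)
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


# 10 raccoons in the data set; the classifier only finds the first 6.
actual = [True] * 10
predicted = [True] * 6 + [False] * 4
print(recall(actual, predicted))  # 0.6
```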

color-raccoons-recall

As we can see, the Baseline recall is roughly 70%, with the Hyperbolic performing equally. Again, it took a hit with Color, but this time, the Hyperbolic Color does better. Notice the variance between the different options (the black lines that indicate a range). Both Color and Hyperbolic Color have quite significant variances. This means that those models are fairly fragile.

F1 Measure

The F1 measure is a way of combining both the Precision and Recall of a model into a single measurement (it is the harmonic mean of the two). Why? Well, sometimes it’s nice to be able to judge performance based on a single number, rather than comparing the two measurements separately. In a previous post, I used the F1 measure and accuracy to mean the same thing (strictly speaking, they are different measurements).
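In Python, the harmonic mean of precision and recall looks like this. The input values below are made up to show how the measure behaves; they are not results from the experiments:

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Balanced precision and recall give an F1 equal to both.
print(round(f1_measure(0.7, 0.7), 2))  # 0.7
# A lopsided model is punished: the F1 sits close to the weaker value.
print(round(f1_measure(0.9, 0.1), 2))  # 0.18
```

That second case is why the harmonic mean is used instead of a plain average, which would report a misleadingly comfortable 0.5.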

color-raccoons-f1

As you can see here, the Hyperbolic wins the race for overall performance. Let’s talk about what is happening here.

The Hyperbolic Tangent

In a previous post, I talked about how neural networks work. One of the key features is the activation function for the network. When a signal is passed from one end of the network to the other, it must pass through several layers in the network. The signal must be a certain “strength” before it is passed on. The thing that measures the strength and “decides” whether to pass on a new signal is the activation function.

The Hyperbolic Tangent is an activation function with a slightly different profile than the one I was using in the Baseline model. The Baseline model uses the Sigmoid as an activation function. Here is a plot of how the two functions stack up with some values between -10 and 10:

sigmoid_tanh

As you can see, the Hyperbolic Tangent maps values between -1 and 1, and is triggered slightly differently than the Sigmoid.
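To get a feel for the numbers behind the plot, here is a short Python sketch that evaluates both functions at a few points between -10 and 10. It uses the standard definitions of the two functions, and the sample points are arbitrary:

```python
import math


def sigmoid(x):
    """Standard logistic Sigmoid: output is always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


for x in (-10, -2, 0, 2, 10):
    # math.tanh is the Hyperbolic Tangent: output is between -1 and 1.
    print(f"{x:>4}  sigmoid={sigmoid(x):.4f}  tanh={math.tanh(x):.4f}")
```

At 0 the Sigmoid sits at 0.5 while the Hyperbolic Tangent sits at 0, so the tanh output is centered on zero.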

Long story short, recall for the Hyperbolic Tangent and Sigmoid models was basically identical. The difference was with precision. Basically, the Hyperbolic Tangent resulted in a model that was more precise than the Sigmoid model, which is why the F1 measure is slightly better (because it is a combination of both precision and recall). Note that both the Baseline and Hyperbolic models were produced using grayscale images. So, what is going on with color?

Color Made Things Worse

Yup. That’s right. Performance tanked with color information (okay, it’s not abysmal, but it is considerably worse than my original run with grayscale images). At first I thought I had an error somewhere in my code. But looking at the data more, as well as false positives and false negatives, I started to hypothesize that the way I was using color information was probably hurting me very badly. The reason – I think – is due to the fact that I’m losing shape information when I separate out each color band.

To explain a bit more, in both the Baseline and Hyperbolic (without Color) models, for each pixel in each image, I take an average of all the different color bands. This does two things:

  1. It converts the color image into a grayscale image.
  2. It normalizes and accentuates edges.

Point 2 I think is the important part. To demonstrate this a bit more clearly, I looked at the following false negative (what was a raccoon, but what the algorithm thought was not):

fn37

I separated it into its R, G, and B components and looked at the intensity of each component.

Here is the R component:

red_fn37

Here is the G component:

green_fn37

Here is the B component:

blue_fn37

Notice how difficult it is to make out shapes when you treat each color band separately (especially the blue component). The raccoon tends to blend in with the background. Aside from the white on its face and front, it is very difficult to make it out. For a comparison, here is what the picture looks like if I convert it to a proper grayscale:

i-fn37

Notice how it becomes much easier to separate the actual shape of the raccoon out of the background.

The long and the short of it is that while I thought I was giving the neural network more data to work with, in actual fact, I was making it harder for the network to distinguish shapes. I need to do some reading to find a new strategy for dealing with color information. Feel free to reply in the comments section if you have an idea for an approach!

In the Meantime – Deer!

With some preliminary analysis out of the way, and with a functioning neural network, I think it is finally time to perform deer detection. I now have over 5,500 images of deer to train with. It will take some time to process that many images in order to generate a training set. Half of those images were taken during the night, and half during the day. Many of them are of the deer lying down, like so:

deer_laying_down

Notice that pesky post in the way of the deer! Plus, eventually those deer are going to lose their antlers (ha-ha – one less weapon with which to murder me). All in all, there are going to be some unique challenges to identifying deer.

Wrapping Up

In this post, I talked about the performance of the neural network with respect to identifying raccoons, and talked about why color information didn’t work out as I originally planned. Tune in next time when I’ll look at identifying deer and some of the challenges that brings.

Deer Detection Diversion 4 – Neural Network Setup

For those of you following along at home, I’ve been attempting to make a deer detector using a Machine Learning algorithm. In my last post on the topic, I discussed why I wanted to use a neural network to perform the actual machine learning in the project. In that post, I glossed over a lot of the details. In this post, I want to provide a little bit more information about neural networks and how I came to write my own version.

Which Features??

I had been doing a lot of reading about Neural Networks, and how they are useful for various machine learning tasks. More importantly, I knew that they had been used in tasks involving computer vision. One aspect of neural networks really stood out to me as being useful. The hidden layer of the network has the ability to act as a sort of automated feature selector. Put another way, the neural net builds its own set of features when it is learning from examples. This is incredibly useful, since in an earlier post, I spoke about how feature engineering and discovery is really key to get a lot of machine learning algorithms working well.

Having never undertaken a computer vision task before, I really felt that trying to come up with my own sets of features would be taking me down a proverbial rabbit hole, putting me at risk of getting lost on the web while trying to educate myself in an entire field (something came out of the mirror, and it wasn’t Alice!). As I have learned in the past, it pays more to spend time digging into the data set and running a few simple tests than it does to embark on feature engineering right off the bat.

Validating the Approach First

To start out, I wrote a small Python program to strip out the pixel data from my pictures, and shove it into a CSV file (oof… very clumsy!). The result was a somewhat portable data set that I could read in just about any language or tool very quickly. My thinking was that I could use the data set to run a few tests to see if I could tune an off-the-shelf neural net implementation to perform somewhat reasonably. The downside was that I couldn’t easily turn the samples back into a set of images without some manual work, making it harder to see what images were causing problems with the machine learning approach.
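A rough sketch of what such a pixels-to-CSV step can look like is below. It is not the original program: it assumes the pixel values have already been read out of each picture (for example with an imaging library), and the row layout of label first, then one column per pixel, is an assumption made for this example.

```python
import csv
import io


def write_pixels_csv(images, out_file):
    """Write one row per image: the label followed by every pixel value."""
    writer = csv.writer(out_file)
    for label, pixels in images:
        writer.writerow([label] + list(pixels))


# Two made-up 2x2 grayscale "images" (1 = raccoon, 0 = no raccoon).
images = [
    (1, [12, 200, 34, 90]),
    (0, [0, 0, 255, 255]),
]

# StringIO stands in for a real file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_pixels_csv(images, buffer)
print(buffer.getvalue())
```

A flat file like this is easy to load in R, Weka or Java, which is the portability described above, but it also shows the downside: nothing in a row says which picture it came from.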

My first attempt was using R to write a quick and dirty neural network. You can actually do this in about 5 lines of code! I wanted to experiment with various neural network parameters (learning rate, number of iterations, layer sizes, activation functions, etc). However, something was going wrong with my approach – probably in the way I was specifying the features I wanted to use. It would ramp up my CPU to full blast furnace for about 10 minutes, and then error out with a decidedly non-helpful message. After spending a day trying to figure it out (too much time to invest for a simple fact finding mission), I decided to step away to a different tool.

I then moved to Weka. 10 minutes of setting up a neural net, and cross validating my data set gave me error rates that actually looked quite good. At this point, I decided to go for broke and write my own Java implementation of a neural network, since it appeared that the approach was going to give me results that were better than just random guessing. My own approach would give me the freedom to experiment with many different parameters. I decided to write one from scratch – I’m crazy like that. You can check it out in my GitHub repository – and yes, there are unit tests that verify that it is actually running correctly!

How do Neural Networks Work?

A neural network is based off of a human wetware equivalent – the neuron. Without getting into a biology lesson, it is well known that a neuron receives an electrical signal, reacts to it, and forwards on its own electrical signal to other neurons. Depending on the structural configuration of these networks, different things may happen when the network receives various input signals.

A computer neural network is set up in much the same way. Usually, a neural network has several layers, each consisting of a bunch of neurons (nodes). Each node in any given layer is connected to nodes in another layer in various structural configurations. Here’s an example from my machine learning post:

network_diagram

The connections are regulated by mathematical weights (called theta values). The first layer is usually called the input layer, and the last layer is the output layer. Layers in between are called hidden layers. Every time an input is given to the network, each node multiplies its incoming values by the weights on its connections, adds up the results, triggers an activation function, and passes the resulting signal forward, until the signal hits the output node. At this point, the neural network has actually made a prediction.

This process is known as forward propagation. The first time the network performs this task, the output values won’t be very good. This is because the weights connecting each of the nodes together are randomly generated – in other words, they are quite meaningless!
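Forward propagation can be sketched in a few lines of Python. This is a simplified illustration rather than the Java implementation: the 4-3-1 layer sizes, the input values and the seed are made up, and bias units are left out to keep it short.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, layers):
    """Propagate inputs through each layer of weights and return the outputs.

    Each layer is a list of nodes; each node is a list of weights, one per
    value arriving from the previous layer.
    """
    signal = inputs
    for layer in layers:
        # Weighted sum of the incoming signal, then the activation function.
        signal = [sigmoid(sum(w * s for w, s in zip(node, signal)))
                  for node in layer]
    return signal


rng = random.Random(42)
# A made-up 4-3-1 network with randomly generated weights (theta values).
hidden = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
output = [[rng.uniform(-1, 1) for _ in range(3)]]

prediction = forward([0.1, 0.9, 0.4, 0.0], [hidden, output])
print(prediction)
```

The printed value is a legitimate prediction between 0 and 1, but with random weights it carries no meaning yet.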

Learning

To “learn”, the network needs examples of what the correct output value should be given the input values it had. It then traverses the network in reverse, calculating the size of the error that each theta value caused (called a delta value). This process is known as back propagation.

Using some calculus (you must calculate the derivative of the activation function) and a trick called gradient descent, it becomes easy to see whether the error grows or shrinks as each theta value changes. The whole purpose of the network is to minimize the size of the error between a set of known inputs and known outputs (see my other post for my thoughts on machine learning). If increasing a theta value increases the error, then the theta value is decreased by a small amount (this is because the weight is too big). On the other hand, if increasing a theta value decreases the error, then the theta value is increased by a small amount. The size of each adjustment is controlled by the learning rate.

A single pass of both forward and back propagation is called an iteration. Multiple learning iterations are performed until the delta value reaches a fairly stable point known as convergence. For large networks, thousands of iterations may be required until the network stabilizes.
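To make the learning loop concrete, here is a deliberately tiny Python sketch: a single Sigmoid node learning the logical OR function. It is not the Java implementation. The squared error measure, the learning rate, the stopping tolerance and the seed are all assumptions chosen for illustration, and a real network repeats the same idea across every layer.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(examples, learning_rate=0.5, max_iterations=5000, tolerance=1e-6):
    """Train one Sigmoid node with gradient descent.

    Returns the learned theta values and the total error at each iteration.
    """
    rng = random.Random(0)
    n_inputs = len(examples[0][0])
    thetas = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    errors = []
    for _ in range(max_iterations):
        gradients = [0.0] * n_inputs
        total_error = 0.0
        for inputs, target in examples:
            # Forward propagation.
            output = sigmoid(sum(t * x for t, x in zip(thetas, inputs)))
            total_error += 0.5 * (output - target) ** 2
            # Back propagation: error times the derivative of the Sigmoid.
            delta = (output - target) * output * (1.0 - output)
            for i, x in enumerate(inputs):
                gradients[i] += delta * x
        # Gradient descent: nudge each theta against its gradient.
        thetas = [t - learning_rate * g for t, g in zip(thetas, gradients)]
        errors.append(total_error)
        # Stop once the error has stabilised (convergence).
        if len(errors) > 1 and abs(errors[-2] - errors[-1]) < tolerance:
            break
    return thetas, errors


# Logical OR. The constant 1.0 in front of each example acts as a bias input.
examples = [
    ([1.0, 0.0, 0.0], 0.0),
    ([1.0, 0.0, 1.0], 1.0),
    ([1.0, 1.0, 0.0], 1.0),
    ([1.0, 1.0, 1.0], 1.0),
]
thetas, errors = train(examples)
print(len(errors), errors[0], errors[-1])
```

Printing the first and last error shows the point of the whole exercise: the error after training is much smaller than the error produced by the random starting weights.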

Activation Functions

The activation function acts to map values into a particular range. It is “triggered” based upon its input value. One common activation function is the Sigmoid function:

sigmoid_activation

The Sigmoid is interesting since it maps values into the range between 0 and 1, with most of the change happening over a fairly narrow domain. As you can see, it flattens out quickly for both large positive and large negative values of X. Another useful activation function is the Hyperbolic Tangent, which maps values between -1 and 1. Here it is in contrast to the Sigmoid:

sigmoid_tanh

Usually, input values should be scaled properly so that they fall into the ranges generated by the activation function. So if you plan on using a Sigmoid function, you should scale the inputs to be between 0 and 1.
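For pixel data, that scaling can be as simple as the Python sketch below. It assumes intensities run from 0 to 255; the function name is made up, and the -1 to 1 variant for the Hyperbolic Tangent simply applies the same rule of thumb.

```python
def scale_pixels(pixels, low=0.0, high=1.0, max_intensity=255):
    """Linearly map 0..max_intensity intensities into the range low..high."""
    span = high - low
    return [low + span * (p / max_intensity) for p in pixels]


pixels = [0, 51, 255]
print(scale_pixels(pixels))                      # [0.0, 0.2, 1.0]
print(scale_pixels(pixels, low=-1.0, high=1.0))  # range for the Hyperbolic Tangent
```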

Prediction Thresholds

When using a set of pre-computed theta values (weights) for a neural network, you can make predictions using forward propagation. Another optimization problem is then deciding what output values should be associated with a positive and negative result. Erm, what?

At the end of forward propagation, using the Sigmoid activation function, the value at the output layer will be between 0 and 1. Usually 1 is used to represent a positive result, while 0 is used to indicate a negative result. However, most predictions will be greater than 0 and less than 1. So, the optimization problem is deciding what range of values should be interpreted as a positive result (say >= 0.7), and what values should be interpreted as a negative result (say < 0.7). Depending on how sensitive the network is, the choice of threshold can significantly impact the performance of the network. Looking at performance metrics such as precision and recall can help determine what a good value should be.
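One simple way to explore that trade-off is to try several thresholds and look at precision and recall for each, as in this Python sketch. The network outputs and labels are invented for illustration, and the function name is hypothetical:

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when scores >= threshold count as positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up network outputs and the true labels (1 = raccoon).
scores = [0.95, 0.80, 0.72, 0.65, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0]

for threshold in (0.3, 0.5, 0.7, 0.9):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

With this toy data, raising the threshold to 0.9 gives perfect precision but finds only one raccoon in four, while dropping it to 0.3 finds every raccoon at the cost of more false alarms.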

That’s Mostly It – Except for the Math

That’s all there really is to a neural network (except for bias units). I have, however, glossed over the math involved in building one. There are many online courses that will provide the basics of how a neural network is built. Feel free to check out my implementation for learning purposes.

Next time, I’ll talk about precision and recall, and how to figure out what steps to take next when tackling a learning problem.