For those of you following along at home, I’ve been attempting to make a deer detector using a Machine Learning algorithm. In my last post on the topic, I discussed why I wanted to use a neural network to perform the actual machine learning in the project. In that post, I glossed over a lot of the details. In this post, I want to provide a little bit more information about neural networks and how I came to write my own version.
I had been doing a lot of reading about neural networks, and how they are useful for various machine learning tasks. More importantly, I knew that they had been used in tasks involving computer vision. One aspect of neural networks really stood out to me as being useful. The hidden layer of the network has the ability to act as a sort of automated feature selector. Put another way, the neural net builds its own set of features as it learns from examples. This is incredibly useful, since in an earlier post, I spoke about how feature engineering and discovery is really key to getting a lot of machine learning algorithms to work well.
Having never undertaken a computer vision task before, I really felt that trying to come up with my own sets of features would be taking me down a proverbial rabbit hole, putting me at risk of getting lost on the web while trying to educate myself on an entire field (something came out of the mirror, and it wasn’t Alice!). As I have learned in the past, it pays more to spend time digging into the data set and running a few simple tests than it does to embark on feature engineering right off the bat.
Validating the Approach First
To start out, I wrote a small Python program to strip out the pixel data from my pictures, and shove it into a CSV file (oof... very clumsy!). The result was a somewhat portable data set that I could read in just about any language or tool very quickly. My thinking was that I could use the data set to run a few tests to see if I could tune an off-the-shelf neural net implementation to perform somewhat reasonably. The downside was that I couldn’t easily turn the samples back into a set of images without some manual work, making it harder to see what images were causing problems with the machine learning approach.
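For the curious, the pixels-to-CSV step can be sketched in a few lines. This is a minimal illustration, not my actual script – the function name and the flattened-pixels-plus-label row layout are assumptions:

```python
import csv

# Illustrative sketch: flatten each image's pixel values into one CSV row,
# with the class label tacked on as the last column.
def pixels_to_csv(samples, out_path):
    """samples: list of (pixel_values, label) pairs."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for pixels, label in samples:
            writer.writerow(list(pixels) + [label])

# e.g. two tiny 2x2 "images" flattened to 4 pixel values each
samples = [([0, 128, 255, 64], 1), ([10, 20, 30, 40], 0)]
pixels_to_csv(samples, "dataset.csv")
```

In practice you would pull the pixel values out with an imaging library first; the payoff is that the resulting CSV loads trivially in R, Weka, or Java.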
My first attempt was using R to write a quick and dirty neural network. You can actually do this in about 5 lines of code! I wanted to experiment with various neural network parameters (learning rate, number of iterations, layer sizes, activation functions, etc). However, something was going wrong with my approach – probably in the way I was specifying the features I wanted to use. It would ramp up my CPU to full blast furnace for about 10 minutes, and then error out with a decidedly non-helpful message. After spending a day trying to figure it out (too much time to invest for a simple fact finding mission), I decided to step away to a different tool.
I then moved to Weka. After 10 minutes of setting up a neural net and cross-validating it against my data set, I was seeing error rates that actually looked quite good. At this point, I decided to go for broke and write my own Java implementation of a neural network, since it appeared that the approach was going to give me results that were better than just random guessing. My own approach would give me the freedom to experiment with many different parameters. I decided to write one from scratch – I’m crazy like that. You can check it out in my GitHub repository – and yes, there are unit tests that verify that it is actually running correctly!
How do Neural Networks Work?
A neural network is modeled on its human wetware equivalent – the neuron. Without getting into a biology lesson, it is well known that a neuron receives an electrical signal, reacts to it, and forwards on its own electrical signal to other neurons. Depending on the structural configuration of these networks, different things may happen when the network receives various input signals.
A computer neural network is set up in much the same way. Usually, a neural network has several layers, each consisting of a bunch of neurons (nodes). Each node in any given layer is connected to nodes in another layer in various structural configurations. Here’s an example from my machine learning post:
The connections are regulated by mathematical weights (called theta values). The first layer is usually called the input layer, and the last layer is the output layer. Layers in between are called hidden layers. Every time an input is given to the network, each node multiplies its incoming signals by the corresponding weights, sums them, triggers an activation function, and continues to propagate the resulting signal forward until it reaches the output layer. At this point, the neural network has actually made a prediction.
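The forward pass described above can be sketched in a few lines of Python. This is a simplified illustration (it omits bias units, and the function names are my own, not from my Java implementation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, layers):
    """layers: a list of weight matrices (theta values); layers[k][j]
    holds the weights feeding node j of the next layer."""
    signal = inputs
    for theta in layers:
        # each node: weighted sum of incoming signals, then activation
        signal = [sigmoid(sum(w * s for w, s in zip(row, signal)))
                  for row in theta]
    return signal

# a tiny 2-1-1 network with all-zero weights always predicts 0.5
print(forward([0.3, 0.7], [[[0.0, 0.0]], [[0.0]]]))  # → [0.5]
```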
This process is known as forward propagation. The first time the network performs this task, the output values won’t be very good. This is because the weights connecting each of the nodes together are randomly generated – in other words, they are quite meaningless!
To “learn”, the network needs examples of what the correct output value should be given the input values it had. It then traverses the network in reverse, calculating the size of the error that each theta value caused (called a delta value). This process is known as back propagation.
Using some calculus (you must calculate the derivative of the activation function) and a technique called gradient descent, it becomes easy to see whether the error grows or shrinks as each theta value changes. The whole purpose of the network is to minimize the size of the error between a set of known inputs and known outputs (see my other post for my thoughts on machine learning). If the error grows as theta increases, then the theta value is decreased by a small amount (the weight is too big). On the other hand, if the error shrinks as theta increases, then the theta value is increased by a small amount. The size of the increase or decrease is controlled by a value called the learning rate.
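The two pieces of math involved – the sigmoid derivative and the weight nudge – are small enough to sketch. Again, this is an illustration rather than my actual implementation:

```python
def sigmoid_prime(a):
    # derivative of the sigmoid, conveniently expressed in terms of
    # the sigmoid's own output a: a * (1 - a)
    return a * (1.0 - a)

def update_weight(theta, delta, activation, learning_rate):
    # move the weight a small step against the gradient of the error;
    # delta is the error attributed to this weight during back propagation
    return theta - learning_rate * delta * activation
```

A larger learning rate takes bigger steps (faster, but it can overshoot); a smaller one converges more slowly but more reliably.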
A single pass of both forward and back propagation is called an iteration. Multiple learning iterations are performed until the error reaches a fairly stable point known as convergence. For large networks, thousands of iterations may be required until the network stabilizes.
The activation function acts to map values into a particular range. It is “triggered” based upon its input value. One common activation function is the Sigmoid function:
Sigmoid is interesting since it maps any input to a value between 0 and 1, with most of the change happening over a fairly narrow domain. As you can see, it flattens out drastically for both large positive and large negative values of X. Another useful activation function is the Hyperbolic Tangent, which maps values between -1 and 1. Here it is in contrast to the Sigmoid:
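You can see both behaviors numerically with a few sample points – the outputs saturate quickly once you move away from zero:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# compare the two activation functions at a handful of inputs
for z in (-5, -1, 0, 1, 5):
    print(f"z={z:+d}  sigmoid={sigmoid(z):.4f}  tanh={math.tanh(z):.4f}")
```

At z = 0, sigmoid outputs 0.5 and tanh outputs 0; by z = ±5, both are already pressed up against their limits.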
Usually, input values should be scaled so that they fall into a range compatible with the activation function. So if you plan on using a Sigmoid function, you should scale the inputs to be between 0 and 1.
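For pixel data this is just min-max scaling. A minimal sketch, assuming 8-bit pixel values (the function name is illustrative):

```python
def scale_pixels(pixels, lo=0.0, hi=255.0):
    # map 8-bit pixel values into [0, 1] for a sigmoid-activated network
    return [(p - lo) / (hi - lo) for p in pixels]

print(scale_pixels([0, 51, 255]))  # → [0.0, 0.2, 1.0]
```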
When using a set of pre-computed theta values (weights) for a neural network, you can make predictions using forward propagation. Another optimization problem is then deciding what output values should be associated with a positive and negative result. Erm, what?
At the end of forward propagation, using the Sigmoid activation function, the value at the output layer will be between 0 and 1. Usually 1 is used to represent a positive result, while a 0 is used to indicate a negative result. However, most predictions will be greater than 0 and less than 1. So, the optimization problem is deciding what range of values should be interpreted as a positive result (say >= 0.7), and what values should be interpreted as a negative result (say < 0.7). Depending on how sensitive the network is, the choice of threshold can significantly impact the performance of the network. Looking at performance metrics such as precision and recall can help determine what a good value should be.
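The thresholding step itself is trivial; the hard part is picking the cutoff. A sketch, using the 0.7 example above (the value is illustrative and should be tuned against precision and recall):

```python
def classify(output, threshold=0.7):
    # interpret the network's raw sigmoid output as a yes/no prediction;
    # 0.7 is an example cutoff, not a recommendation
    return 1 if output >= threshold else 0

print(classify(0.92))  # → 1
print(classify(0.41))  # → 0
```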
That’s Mostly It – Except for the Math
That’s all there really is to a neural network (except for bias units). I have, however, glossed over the math involved in building one. There are many online courses that will provide the basics of how a neural network is built. Feel free to check out my implementation for learning purposes.
Next time, I’ll talk about precision and recall, and how to figure out what steps to take next when tackling a learning problem.