I want to dispel a little bit of the magic behind Machine Learning (awww). In my mind, the name *Machine Learning* makes it sound like these algorithms are capable of actually *understanding* the information that you give them, on the same level which a human understands that *kittens are cute, winter is coming*, and *bacon is delicious*.

Unfortunately, the computer really isn’t aware of the information on that kind of a sentient level. At their core, most Machine Learning algorithms are really only attempting to solve an optimization problem. What does that really mean? Well, let’s cook up an example to demonstrate!

### Linear Regression – Simplified

Let’s say I want to buy a house, but I’m worried that murderous deer are going to eat all of the vegetables in my garden (sound familiar?).

The city borders a forest where deer stew all day and plot to eat people’s vegetables. Luckily, deer have one weakness – they are lazy (yup, I’m calling them out!). The deer don’t want to walk very far if they don’t have to. So, there might be a sweet spot in the city where I can buy a house that doesn’t get visited by deer because it’s too far for them to walk.

But how do I find this deer-free spot? I can go around various neighborhoods and ask people what percentage of their garden has been eaten by deer. Then, I can plot out the distance each of those houses to the forest. My fictitious data might look something like this:

Tools like linear regression attempt to find a line or curve that fit these data. Here’s an example of a line that I fit to the data:

This particular curve is produced by a *quadratic function*. For those of you who don’t remember, the quadratic formula is:

Y = *a*X^{2} + *b*X + *c*

To generate this line, the algorithm steps through different values for each parameter (*a, b*, and *c*). For each set of parameters, it calculates the sum of the distances between each data point and the generated line, and through repeated experimentation, picks the parameters *a, b*, and *c* that make the sum as small as possible. Visually, here is what the algorithm measures:

The algorithm draws those imaginary blue dotted lines for every data point. Overall, the algorithm is trying to minimize the distance between each data point and the line.

The tricky part is knowing what values of* a, b* and *c* to try next. This is where techniques such as gradient descent come into play. With each new set of parameters, the algorithm can see whether or not the change caused the overall set of distances to increase or decrease (this is usually called a *cost* or *objective*). More usefully, gradient descent uses some calculus to determine what a better guess at the parameters would be in order to produce a better fit (i.e. what would cost less). The algorithm adjusts the parameters again, re-calculates the cost of the line, etc, etc, until the parameters with the smallest cost is found. Lots of machine learning algorithms operate on a premise similar to this.

### Prediction

The really cool thing of course is that you can then use the best fit line to make predictions. For example, let’s say I found a house that was 1.4 kilometers away from the forest. Using the line of best fit, I can gauge how much of my garden is likely to be eaten by a deer:

The curve basically says that I’m likely to have about 32% of my garden eaten by deer if I live 1.4 kilometers away. Cool! Note however, there is a certain amount of *error* associated with this prediction.

### Other Features

The model above also doesn’t take into consideration things like deer fences or proximity to households with dogs. These are examples of two different *features* that might influence how much of the garden gets eaten. The real power of some machine learning algorithms comes from being able to look at data that have large numbers of features, and being able to find the best way of fitting parameters.

### K-means

Another example of a Machine Learning algorithm that uses this concept of finding minimal distances is the K-means algorithm. Let’s say I wanted to classify how dangerous a deer might be based on its general mass, and number of antlers (note that I really mean antlers, not horns!). Bigger deer = more dangerous. More antlers = pokier (also more dangerous). I want to use this to make a sort of deer danger scale.

Let’s say I observed a bunch of deer entering my yard over the past year. For each deer, I figured out what their general mass was, as well as the pokiness factor of their antlers. The data might look like this:

Now what I want to do is identify groups within that data set, and give them names so that I know what I’m up against if one of them happens to come into the yard. This is where K-means comes in. The algorithm places a number of additional dots on the plot called centroids which will represent the center point of any given *cluster* (centroids <-> center, get it?).

For this example, I asked K-means to place 3 centroids. After K-means finished running, I had the following clusters, represented by the colors black, green, and red. The circles with the cross-hairs represent the center of each cluster:

This is great! I can label the black cluster *mostly harmless*, the green cluster *somewhat pokey*, and the red cluster *avoid at all costs*.

Like linear regression, the algorithm is trying to figure out which cluster a data point belongs to, and does so by measuring the distance between the data point and each centroid. Every data point is assigned to a centroid that it is closest to (i.e. has the shortest distance). Here’s a single data point example:

Notice that the distance to the black centroid is the closest for that data point, which is why it ended up in the black cluster.

### Figuring Out Centroid Locations

At first, the centroid locations are chosen at random – usually a data point in the actual data is chosen to act as a centroid.

Next, K-means loops through the data, and figures out which centroid is closest to which data point, assigning every data point to its closest centroid. Once all data points are assigned to centroids, K-means then updates the centroid positions by calculating a new position that would minimize the distance that each point in the centroid cluster would be to the centroid. K-means does this over and over again, until the centroid positions no longer change, or the maximum number of iterations is reached.

### K-Means Predictions

Predictions using K-means is easy – for any new data point, calculate the distance between the point and each of the centroids. The data point belongs to the centroid that has the smallest distance. For example, I saw a deer that had 4 points on its antlers, and was about 260 pounds:

Here, the data point is closest to the green centroid – meaning that deer is *somewhat pokey*.

### Try Them Out!

I’ve left out a lot of the math on the above processes. My main point is that the intuition behind most of these algorithms is actually quite straight-forward when you start to dig into them. There are many excellent packages (such as Weka) and languages (such as R) that make it easy to run and experiment with a wide variety of Machine Learning algorithms to see how they operate.

### … But Know that Feature Engineering is Key

While running a Machine Learning algorithm is easy, in practice I have found that the biggest challenge is finding or building the features to give to the algorithms. Some data lend themselves readily to Machine Learning techniques, while other data require much more thought, study, and creative pre-processing! Plus, there are so many different issues that plague the machine learning process. For example, rarity can throw a monkey wrench into many algorithms – you really have to *study your data*, and *know the algorithms* in order to squeeze the most out of a machine learning technique.

Tune in next time, when I’ll have some interesting wildlife photos gathered during my data collection phase of the Deer Detection project!