I recently completed successfully a Machine Learning online course provided by Pr. Andrew Ng from Stanford University on Coursera. So I decided to share my knowledge through this post.
In this post we will get an introduction on to Linear Regression & Gradient Descent.
First, let’s define what’s machine learning? How can Machine Learning helps us?
As a short definition provided by Pr. Andrew Ng “Machine learning is the science of getting computers to act without being explicitly programmed.”
In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it.
In simple linear regression, we predict scores on one variable from the scores on a second variable. The variable we are predicting is called the criterion variable and is referred to as Y. The variable we are basing our predictions on is called the predictor variable (or feature) and is referred to as X. When there is only one predictor variable, the prediction method is called simple regression. In simple linear regression, the predictions of Y when plotted as a function of X form a straight line.
The example data in Table 1 are plotted in Figure 1. You can see that there is a positive relationship between X and Y. If you were going to predict Y from X, the higher the value of X, the higher your prediction of Y.
Linear regression consists of finding the best-fitting straight line through the points. The black diagonal line in Figure 2 is the regression line and consists of the predicted score on Y for each possible value of X. The vertical lines from the points to the regression line represent the errors of prediction. As you can see, the red point is very near the regression line, its error of prediction is small. On the other hand, the yellow point is much higher than the regression line and therefore its error of prediction is large.
To make our prediction as correct as possible, we must minimize the sum of distances between all points (X, Y) and the regression line. The error of prediction for a point is the value of the point minus the predicted value. Table 2 shows the predicted values (let’s call it Y’) and the errors of prediction (Y-Y’).
For example, the first point has a Y of 1.00 and Y’ of 1.21. Therefore, its error of prediction is –0.21.
You may have noticed that we did not specify what is meant by “best-fitting line”. The most commonly-used criterion for the best-fitting line is the line that minimizes the sum of the squared errors of prediction. That is the criterion that was used to find the line in Figure 2. The last column in Table 2 shows the squared errors of prediction. The sum of the squared errors of prediction shown in Table 2 is lower than it would be for any other regression line.
So, the formula for a regression line (also known as hypothesis) is:
Where m is the line’s slope and b is the line’s y-intercept.
A standard approach to solving this type of problem is to define an error function (also called a cost function) that measures how good a given line is. This function will take in a (m,b) pair and return an error value based on how well the line fits our data. To compute this error for a given line, we’ll iterate through each (X, Y) point in our data set and sum the square distances between each point’s Y value and the candidate line’s Y’ value (computed at mx + b). It’s conventional to square this distance to ensure that it is positive and to make our error function differentiable.
Formally, this error function looks like:
Theoretically, gradient descent is an algorithm that minimizes functions. Given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function.
Lines that fit our data better (where better is defined by our error function) will result in lower error values. If we minimize this function, we will get the best line for our data. Since our error function consists of two parameters (m and b) we can visualize it as a two-dimensional surface. This is what it looks like for our data set:
When we run gradient descent search, we will start from some location on this surface and move downhill to find the line with the lowest error.
To run gradient descent on this error function, we first need to compute its gradient. The gradient will act like a compass and always point us downhill. To compute it, we will need to differentiate our error function. Since our function is defined by two parameters (m and b), we will need to compute a partial derivative for each. These derivatives work out to be:
We now have all the tools needed to run gradient descent. We can initialize our search to start at any pair of m and b values (i.e., any line) and let the gradient descent algorithm march downhill on our error function towards the best line. Each iteration will update m and b to a line that yields slightly lower error than the previous iteration.
Below are some snapshots of gradient descent running for 2000 iterations for our example problem. Each iteration m and b are updated to values that yield slightly lower error than the previous iteration. The left plot displays the current location of the gradient descent search (blue dot) and the path taken to get there (black line). The right plot displays the corresponding line for the current search location. Eventually we ended up with a pretty accurate fit.
Finally, for more information about gradient descent, linear regression, and other machine learning topics, I would strongly recommend the Machine Learning course on Coursera.