Least Squares Regression

What is Least Squares Regression?

The method of least squares is a standard approach in regression analysis for approximating the solution of overdetermined systems, i.e., sets of equations in which there are more equations than unknowns. "Least squares" means that the overall solution minimizes the sum of the squared residuals of every single equation.

The most important application is in data fitting. The best fit in the least-squares sense minimizes the sum of squared residuals (a residual being the difference between an observed value and the fitted value provided by a model). When the problem has substantial uncertainties in the independent variable (the x variable), simple regression and least-squares methods run into trouble; in such cases, the methodology for fitting errors-in-variables models may be considered instead of that for least squares.
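To make the quantity being minimized concrete, here is a minimal Python sketch that computes the sum of squared residuals for a candidate line y = m*x + b; the sample points and parameter names are illustrative assumptions, not part of the app.

```python
# Sketch: residual sum of squares (RSS) for a candidate line y = m*x + b.
# The sample points and the parameters m, b are illustrative assumptions.

def residual_sum_of_squares(points, m, b):
    """Sum of squared differences between observed y and fitted y."""
    rss = 0.0
    for x, y in points:
        fitted = m * x + b        # value predicted by the model
        residual = y - fitted     # observed minus fitted
        rss += residual ** 2
    return rss

points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
print(residual_sum_of_squares(points, m=2.0, b=0.0))
```

The best-fit line is the choice of m and b that makes this quantity as small as possible.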

The app below demonstrates the ordinary least squares method. Click on the canvas to generate data points, and the app will draw the line of best fit (minimizing the least-squares error) through the points. Since this example is in 2D, we can use a closed-form formula to derive the coefficients of this best-fit line; more information about how those coefficients are calculated can be found here.
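For the 2D case, that closed-form solution can be written down directly. The sketch below (assuming the same list-of-points representation as above, and not taken from the app's source) computes the slope and intercept from the means, covariance, and variance of the data.

```python
# Sketch: closed-form ordinary least squares fit for 2D data.
# Returns the slope m and intercept b that minimize the RSS above.

def fit_line(points):
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # slope = covariance(x, y) / variance(x)
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    m = num / den
    b = mean_y - m * mean_x   # the best-fit line passes through (mean_x, mean_y)
    return m, b

m, b = fit_line([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)])
print(f"y = {m:.3f}x + {b:.3f}")
```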

Gradient Descent

What is Gradient Descent?

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.
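The update rule can be illustrated with a minimal sketch on a one-variable function; the function, learning rate, starting point, and iteration count below are arbitrary assumptions chosen for clarity.

```python
# Sketch: gradient descent on f(x) = (x - 3)^2, whose minimum is at x = 3.
# The learning rate and iteration count are illustrative choices.

def gradient(x):
    return 2 * (x - 3)   # derivative of (x - 3)^2

x = 0.0                  # arbitrary starting point
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * gradient(x)   # step opposite the gradient

print(x)  # approaches 3
```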

In the 2D case, the parameters being optimized (the model's weights and biases) are nothing more than the slope and y-intercept of the best-fit line. The app below demonstrates stochastic gradient descent (SGD), which differs slightly from standard gradient descent: SGD is an iterative method for optimizing a differentiable objective function in which samples are selected randomly (or shuffled) rather than processed as a single group (as in standard gradient descent) or in the order they appear in the training set. In the app below, the gradient is computed from the error due to a single data point at a time, and the line parameters are updated after each such step.

The size of each update is scaled by a hyperparameter known as the learning rate, which controls how quickly the parameters change. The learning rate is an important parameter to tune because, especially in higher-dimensional settings, a learning rate that is too high can cause the algorithm to 'overshoot' the minimum of the objective function, while one that is too low can cause the algorithm to take too long to produce results. In this simple example, however, it is not much of an issue. More information on gradient descent can be found here.
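The per-point update described above can be sketched as follows. This is not the app's actual source; it is a minimal Python illustration in which the data, learning rate, and epoch count are assumptions.

```python
import random

# Sketch: stochastic gradient descent for fitting y = m*x + b.
# One randomly chosen data point drives each parameter update.

def sgd_fit(points, learning_rate=0.05, epochs=200):
    m, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(points)               # stochastic: random order each pass
        for x, y in points:
            error = (m * x + b) - y          # error from a single data point
            m -= learning_rate * error * x   # gradient of 0.5 * error**2 w.r.t. m
            b -= learning_rate * error       # gradient of 0.5 * error**2 w.r.t. b
    return m, b

m, b = sgd_fit([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)])
print(f"y = {m:.3f}x + {b:.3f}")
```

With a well-chosen learning rate, the parameters settle near the same slope and intercept that the closed-form least-squares formula produces.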
