# Why do we use a Least Squares fit?

I’ve been wondering for a while now if there’s any deep mathematical or statistical significance to finding the line that minimizes the square of the errors between the line and the data points.

If we use a less common method like LAD, where we just consider the absolute deviation, then outliers make less difference to the final model, while if we take the cube of the error (or any other power higher than 2), then outliers are far more significant than with the least squares model.

I suppose what I’m really asking is: mathematically, is raising the error to the power of 2 really that special? Is it, say, more “accurate” in some sense than raising the error to the power of 1.95 or 2.05?

Thanks!

Carl Gauss (the most famous person to live on earth in the 19th century, except for people who did not work in the physical and mathematical sciences) showed that least squares estimates coincide with maximum-likelihood estimates when one assumes independent normally distributed errors with $$0$$ mean and equal variances.
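
To see why the two coincide, here is a short sketch of the standard argument. The notation is an assumption made for illustration: a straight-line model $$Y_i = \alpha + \beta x_i + \varepsilon_i$$ for $$i = 1, \dots, n$$, with independent errors $$\varepsilon_i \sim N(0, \sigma^2)$$. The log-likelihood is then

$$
\log L(\alpha, \beta, \sigma^2) = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(Y_i - \alpha - \beta x_i\right)^2 .
$$

For any fixed $$\sigma^2$$, the first term does not involve $$\alpha$$ or $$\beta$$, so maximizing the likelihood over $$\alpha$$ and $$\beta$$ is the same as minimizing $$\sum_i (Y_i - \alpha - \beta x_i)^2$$. The exponent $$2$$ comes straight from the exponent in the normal density. Running the same argument with a Laplace (double-exponential) error density gives the sum of absolute deviations instead.
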
Here are a couple of other points about raising errors to the power $$2$$ instead of $$1.95$$ or $$2.05$$ or whatever.
• The variance is the mean squared deviation from the average. The variance of the sum of ten thousand independent random variables is the sum of their variances. That doesn’t work for other powers of the absolute value of the deviation. That means if you roll a die $$6000$$ times, so that the expected number of $$1$$s you get is $$1000$$, then you also know that the variance of the number of $$1$$s is $$6000\times\frac 1 6\times\frac 5 6$$, so if you want the probability that the number of $$1$$s is between $$990$$ and $$1020$$, you can approximate the distribution by the normal distribution with the same mean and the same variance. You couldn’t do that if you didn’t know the variance, and you couldn’t know the variance without additivity of variances, and if the exponent is anything besides $$2$$, then you don’t have that. (Oddly, you do have additivity with the $$3$$rd powers of the deviations, but not with the $$3$$rd powers of the absolute values of the deviations.)
• Suppose the errors are not necessarily independent but are uncorrelated, and are not necessarily identically distributed but have identical variances and expected value $$0$$. You have $$Y_i = \alpha + \beta x_i + \text{error}_i$$. The $$Y$$s and $$x$$s are observed; the $$x$$s are treated as non-random (hence the lower-case letter); the coefficients $$\alpha$$ and $$\beta$$ are to be estimated. The least-squares estimate of $$\beta$$ is $$\widehat\beta = \frac{\sum_i (x_i-\bar x)(Y_i-\bar Y)}{\sum_i(x_i-\bar x)^2} \tag 1$$ where $$\bar x$$ and $$\bar Y$$ are the respective averages of the observed $$x$$ and $$Y$$ values. Notice that $$(1)$$ is linear in the vector of observed $$Y$$ values. Then among all unbiased estimators of $$\beta$$ that are linear in the vector of $$Y$$ values, the one with the smallest variance is the least-squares estimator. And similarly for $$\widehat\alpha$$. That is the Gauss–Markov theorem.
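
Both points can be checked numerically. The following is a minimal sketch using only the Python standard library, not anything taken from the answer itself. The function names are hypothetical, "between $$990$$ and $$1020$$" is assumed to include both endpoints, and the continuity correction of $$0.5$$ is an added assumption. It compares the normal approximation for the die example with the exact binomial probability, and it computes the least-squares slope from formula $$(1)$$ together with the usual intercept $$\widehat\alpha = \bar Y - \widehat\beta\,\bar x$$.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_approx_prob(n, p, lo, hi):
    """Approximate P(lo <= X <= hi) for X ~ Binomial(n, p).

    Uses a normal distribution with the same mean and variance. The variance
    n*p*(1-p) is available because variances of independent trials add.
    The +/- 0.5 continuity correction is an assumption added for this sketch.
    """
    mean = n * p
    sd = math.sqrt(n * p * (1.0 - p))
    return normal_cdf((hi + 0.5 - mean) / sd) - normal_cdf((lo - 0.5 - mean) / sd)


def exact_binomial_prob(n, p, lo, hi):
    """Exact P(lo <= X <= hi), summing the binomial pmf in log space."""
    def log_pmf(k):
        log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        return log_choose + k * math.log(p) + (n - k) * math.log(1.0 - p)

    return sum(math.exp(log_pmf(k)) for k in range(lo, hi + 1))


def least_squares_line(xs, ys):
    """Return (alpha_hat, beta_hat) for the model Y = alpha + beta * x + error."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope from formula (1): sum of cross-deviations over sum of squared x-deviations.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = numerator / denominator
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


if __name__ == "__main__":
    # Die example: 6000 rolls, probability 1/6 of a 1 on each roll.
    print("normal approximation:", normal_approx_prob(6000, 1 / 6, 990, 1020))
    print("exact binomial:      ", exact_binomial_prob(6000, 1 / 6, 990, 1020))
    # Points lying exactly on y = 2 + 3x, so the fit should recover that line.
    print("fitted (alpha, beta):", least_squares_line([0, 1, 2, 3], [2, 5, 8, 11]))
```

Note that `beta_hat` is a weighted sum of the `ys` with weights that depend only on the `xs`, which is the linearity in the $$Y$$ values that the Gauss–Markov theorem relies on.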