I’ve been wondering for a while now if there’s any deep mathematical or statistical significance to finding the line that minimizes the square of the errors between the line and the data points.
If we use a less common method like LAD, where we just consider the absolute deviation, then outliers make less difference to the final model, while if we take the cube of the error (or any other power higher than 2), then outliers are far more significant than with the least squares model.
I suppose what I’m really asking is: mathematically, is raising the error to the power of 2 really that special? Is it, say, more “accurate” in some sense than raising the error to the power of 1.95 or 2.05?
Carl Gauss (the most famous person to live on earth in the 19th century, except for people who did not work in the physical and mathematical sciences) showed that least squares estimates coincide with maximum-likelihood estimates when one assumes independent normally distributed errors with 0 mean and equal variances.
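To see why the two estimates coincide, here is a sketch of the standard argument. The notation is my own assumption, not part of the statement above: $\mu_i(\theta)$ stands for the model's prediction for the $i$th observation $y_i$ (for a line, $\mu_i(\theta)=\alpha+\beta x_i$), and $\sigma^2$ is the common error variance. With independent normal errors, the likelihood and its logarithm are

$$
L(\theta,\sigma^2)=\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y_i-\mu_i(\theta))^2}{2\sigma^2}\right),
\qquad
\log L(\theta,\sigma^2)=-\frac n2\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\mu_i(\theta))^2 .
$$

For any fixed $\sigma^2>0$, the only term that depends on $\theta$ is the sum of squares, and it enters with a negative sign, so maximizing the likelihood over $\theta$ is the same as minimizing $\sum_i (y_i-\mu_i(\theta))^2$. The exponent 2 comes directly from the square in the normal density.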
POSTSCRIPT four years later:
Here are a couple of other points about raising errors to the power 2 instead of 1.95 or 2.05 or whatever.
The variance is the mean squared deviation from the average. The variance of the sum of ten thousand independent random variables is the sum of their variances. That doesn’t work for other powers of the absolute value of the deviation. That means if you roll a die 6000 times, so that the expected number of 1s you get is 1000, then you also know that the variance of the number of 1s is $6000\times\frac16\times\frac56$, so if you want the probability that the number of 1s is between 990 and 1020, you can approximate the distribution by the normal distribution with the same mean and the same variance. You couldn’t do that if you didn’t know the variance, and you couldn’t know the variance without additivity of variances, and if the exponent is anything besides 2, then you don’t have that. (Oddly, you do have additivity with the 3rd powers of the deviations, but not with the 3rd powers of the absolute values of the deviations.)
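The die-rolling calculation can be checked numerically. Below is a minimal sketch using only Python's standard library. It computes the variance as 6000 × (1/6) × (5/6) by additivity, then compares the normal approximation for "between 990 and 1020 ones" against the exact binomial sum. The helper names `normal_cdf` and `binomial_pmf` are my own, and the half-unit continuity correction is an assumption I added; it is not mentioned in the text above.

```python
import math

def normal_cdf(x, mean, sd):
    """CDF of a normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def binomial_pmf(k, n, p):
    """Binomial probability P(X = k), computed in log space to avoid underflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)

n, p = 6000, 1.0 / 6.0

# Each roll contributes variance p * (1 - p); additivity gives the total.
mean = n * p
variance = n * p * (1.0 - p)
sd = math.sqrt(variance)

lo, hi = 990, 1020

# Normal approximation with the same mean and variance.
# Assumption: a half-unit continuity correction on both ends.
approx = normal_cdf(hi + 0.5, mean, sd) - normal_cdf(lo - 0.5, mean, sd)

# Exact probability from the binomial distribution, for comparison.
exact = sum(binomial_pmf(k, n, p) for k in range(lo, hi + 1))

print(f"mean = {mean:.1f}, variance = {variance:.2f}")
print(f"normal approximation: {approx:.4f}")
print(f"exact binomial sum:   {exact:.4f}")
```

The two probabilities should agree closely, which is the point: once additivity hands you the variance, the normal approximation does the rest.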
Suppose the errors are not necessarily independent but are uncorrelated, and are not necessarily identically distributed but have identical variances and expected value 0. You have $Y_i=\alpha+\beta x_i+\text{error}_i$. The $Y$s and $x$s are observed; the $x$s are treated as non-random (hence the lower-case letter); the coefficients $\alpha$ and $\beta$ are to be estimated. The least-squares estimate of $\beta$ is
$$
\hat\beta=\frac{\sum_i (x_i-\bar x)(Y_i-\bar Y)}{\sum_i (x_i-\bar x)^2} \tag{1}
$$
where $\bar x$ and $\bar Y$ are the respective averages of the observed $x$ and $Y$ values. Notice that $(1)$ is linear in the vector of observed $Y$ values. Then among all unbiased estimators of $\beta$ that are linear in the vector of $Y$ values, the one with the smallest variance is the least-squares estimator. And similarly for $\hat\alpha$. That is the Gauss–Markov theorem.
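As a small illustration of the slope formula and of what "linear in the vector of Y values" means, here is a sketch in plain Python. The function names and the data are made up for the example. The intercept uses the standard formula (average of Y minus the estimated slope times the average of x), which is not written out above. The second function exposes the fixed weights that depend only on the xs, so the slope estimate is just a weighted sum of the Ys.

```python
def least_squares_line(xs, ys):
    """Return (alpha_hat, beta_hat) for the least-squares line through the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar  # standard intercept formula
    return alpha_hat, beta_hat

def slope_weights(xs):
    """Weights w_i with beta_hat = sum(w_i * Y_i); they depend only on the xs."""
    x_bar = sum(xs) / len(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [(x - x_bar) / sxx for x in xs]

# Hypothetical data, roughly following Y = 2 + 3x plus small errors.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.9, 8.2, 10.8, 14.0]

alpha_hat, beta_hat = least_squares_line(xs, ys)

# The same slope, written explicitly as a linear function of the Y values.
weights = slope_weights(xs)
beta_from_weights = sum(w * y for w, y in zip(weights, ys))

print(f"alpha_hat = {alpha_hat:.4f}, beta_hat = {beta_hat:.4f}")
print(f"slope as a weighted sum of Ys = {beta_from_weights:.4f}")
```

The weighted-sum form works because the weights sum to zero, so subtracting the average of Y from each Y value changes nothing. The Gauss–Markov theorem compares this particular choice of weights with every other choice that gives an unbiased estimator.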