Correlation and Regression

Understanding of correlation and regression originated from looking at data and noticing patterns. Around the beginning of the 19th century, scientists were looking for a way to describe the relationship between two quantitative variables (variables measured with numbers). A prevalent idea was to determine a line that best fit the data. There are multiple ways to imagine fitting such a line by minimizing the distance between the points and the line: for example, one could minimize the sum of the absolute vertical distances, or the sum of the perpendicular (shortest) distances. Among the approaches that scientists explored for fitting bivariate data was least squares regression, in which a line is fit to a dataset by minimizing the sum of the squared vertical distances between each observed $y$ value and the $y$ value the line predicts for the corresponding $x$ value.
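In modern notation, the least squares criterion can be sketched explicitly; the symbols $a$ (intercept), $b$ (slope), and the data pairs $(x_i, y_i)$ below are the standard presentation rather than the notation of the original 19th-century texts. The fitted line minimizes

$$
\min_{a,\,b}\;\sum_{i=1}^{n}\bigl(y_i - (a + b\,x_i)\bigr)^2,
$$

which has the well-known solution

$$
b = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}.
$$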

It is unclear who originally came up with the idea, since several people published the method within a few years of one another: Adrien-Marie Legendre in 1805, Robert Adrain in 1808 or 1809, and Carl Friedrich Gauss in 1809. Although Gauss published last of the three, he claimed to have been using the method since 1795, and his published work discussed the topic more deeply than the others'.

Pierre-Simon de Laplace (1749-1827) used systems of equations and linear regression methods in his study of astronomy to address the question of why Jupiter's orbit appeared to be accelerating and Saturn's decelerating relative to predicted trends; the deviation from what had been expected was due to the planets' mutual gravitational attraction. He later applied the same procedure to study the cause of tides, correctly speculating that they could be due to the relative positions of the earth, moon, and sun.

[Figure: Human inheritance chart published by the Galton Biometrical Laboratory, from the Treasury of Human Inheritance (1909)]

A further development and application of regression occurred when Francis Galton (1822-1911) used an inherited fortune to open the Galton Biometrical Laboratory for collecting human data, often from multiple members of the same family. He noticed a phenomenon relating fathers' heights to sons' heights: the tallest fathers tended to have tall sons, but the sons tended not to be as tall as their fathers. Similarly, sons of short fathers were shorter than average but tended not to be as short as their fathers. He named the phenomenon regression to the mean. In general, for extreme values of the independent variable, the predicted outcome is also extreme, but less extreme than the original value.

One can see that this must be true in Galton's example of fathers' and sons' heights. If regression to the mean did not occur, so that the tallest fathers tended to have equal numbers of sons taller and shorter than themselves, then their taller sons would in turn have equal numbers of sons taller and shorter than themselves, and the tallest of those sons would have still taller sons. Similarly, short fathers would have shorter sons, who would have even shorter sons. This pattern would continue through the generations, producing extremely tall and extremely short men, with the range expanding in each generation. Though Galton discovered regression to the mean in family characteristics, it occurs with any bivariate data. Acknowledging regression to the mean explains why a regression line for predicting variable $B$ from variable $A$ cannot be used to predict variable $A$ from variable $B$.
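The asymmetry can be sketched in modern notation, which is not Galton's own: if $r$ is the correlation coefficient and $s_x$, $s_y$ are the standard deviations of the two variables, the least squares predictions in standardized units are

$$
\frac{\hat{y}-\bar{y}}{s_y} = r\,\frac{x-\bar{x}}{s_x}, \qquad \frac{\hat{x}-\bar{x}}{s_x} = r\,\frac{y-\bar{y}}{s_y}.
$$

Since $|r| \le 1$, a father whose height is, say, two standard deviations above the mean is predicted to have a son only $2r$ standard deviations above the mean, which is regression to the mean. The two equations above describe different lines unless $|r| = 1$, which is why the line fitted for predicting one variable cannot simply be reused to predict the other.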