Correlation and Regression
Understanding of correlation and regression originated from looking at data and noticing patterns. Around the beginning of the 19th century, scientists were looking for a way to describe the relationship between two quantitative variables—variables measured with numbers. A prevalent idea was determining a line that best fit the data. There are multiple ways to imagine creating a line based on minimizing the distance between the points and the line. For example, one could minimize the absolute vertical distance or the absolute shortest distance. Among the approaches that scientists explored to fit bivariate data was least squares regression. This is where a line is fit to a dataset by minimizing the sum of the squared vertical distances between the observed $y$ values and the predicted $y$ values at the corresponding $x$ values.
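The least-squares criterion can be sketched in a few lines of Python using the closed-form solution; the data points below are made up purely for illustration:

```python
# Minimal sketch of least squares regression: fit a line y = a + b*x
# by minimizing the sum of squared vertical distances (residuals).
def least_squares(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution: b = Sxy / Sxx, a = mean_y - b * mean_x
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Example: points that lie exactly on y = 1 + 2x
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # prints 1.0 2.0
```

Because the squared residuals form a smooth function of the slope and intercept, this minimization has a unique closed-form answer, which is one reason least squares won out over minimizing absolute distances.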
Pierre Simon de Laplace (1749-1827) used a system of equations and linear regression methods in his study of astronomy to address the question of why Jupiter's orbit appeared to be accelerating while Saturn's appeared to be decelerating relative to predicted trends. The discrepancy from what had been expected was due to the two planets' mutual gravitational attraction. He later applied the same procedure to study the cause of tides, correctly speculating that they could be due to the relative positions of the earth, moon, and sun.
A further development and application of regression occurred
when Francis Galton (1822-1911) used an inherited fortune to open
the Galton Biometrical Laboratory for collecting human data, often
from multiple members of the same family.
He noticed a phenomenon relating fathers' heights to sons' heights.
He observed that the tallest fathers tended to have tall sons, but
the sons tended not to be as tall as the fathers. Similarly, sons of
short fathers were shorter than average but tended not to be as
short as their short fathers. He named this
phenomenon regression to the mean. In general, for extreme cases
of the independent variable, the predicted outcome is also extreme,
but less extreme than the original extreme case. One can see that
this must be true in Galton's example of fathers' and sons' heights.
If regression to the mean did not occur and the tallest fathers
tended to have an equal number of sons taller and shorter than they
were, then their taller sons would have an equal number of sons
taller and shorter than they are, and the tallest of those sons
would also have even taller sons. Similarly, short fathers would
have shorter sons who would have even shorter sons. This pattern
would continue through generations until there were extremely tall
and extremely short men, and the range would expand with each
generation. Though Galton discovered regression to the mean in
family characteristics, it occurs with any bivariate data.
Acknowledging regression to the mean also explains why the
regression line for predicting variable $B$ from variable $A$
cannot simply be inverted to predict variable $A$ from
variable $B$: the two prediction problems yield different lines.