Hypothesis Testing
The main idea behind hypothesis testing predates any formal hypothesis testing procedure. In the Middle Ages, English coins were made with precious metals, so those making the coins had ample opportunity for fraud: they could replace some of the precious metal with another, less valuable, metal and keep the difference for themselves. Around 1247, the English government began comparing the weight of coins to trial plates to ensure the correct quality, in an annual event called the Trial of the Pyx. A Master of the Mint was imprisoned in 1318 when it was determined that the gold alloy was not of high enough quality, and another Master of the Mint was reprimanded in 1423 for producing coins of too high a quality. The purpose of the Trial was to evaluate whether any difference from the expected weight was large enough to indicate that the coins were not of the expected quality.
![ONL (1887) 1.355 - Trial of the Pix](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b1/ONL_%281887%29_1.355_-_Trial_of_the_Pix.jpg/512px-ONL_%281887%29_1.355_-_Trial_of_the_Pix.jpg)
John Arbuthnot (1667-1735) considered the number of males and females born between 1629 and 1710. More males than females were born in every year in that interval, and he calculated that the probability of this happening by chance alone was too small; there must be another explanation. Since men tended to have shorter lifespans than women, he reasoned that some being must have had a hand in determining the proportion born of each sex. He argued that this showed that God exists.
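Arbuthnot's reasoning can be reproduced with a quick calculation. If male-majority and female-majority years were equally likely under chance, then 82 consecutive male-majority years (1629 through 1710, inclusive) would have probability $(1/2)^{82}$. A minimal sketch in Python:

```python
# Arbuthnot-style calculation: if male- and female-majority years were
# equally likely, the chance of 82 straight male-majority years is (1/2)^82.
n_years = 1710 - 1629 + 1          # 82 years, inclusive
p_chance = 0.5 ** n_years
print(f"{n_years} years, P = {p_chance:.3e}")  # astronomically small
```

This probability is on the order of $10^{-25}$, which is why Arbuthnot concluded that chance alone was not a plausible explanation.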
Pierre Simon de Laplace (1749-1827) contributed to many areas of science, including multiple areas within statistics, but most relevant here is his study of astronomy. He examined measurements and used what he knew about probability to assess the reliability of conclusions drawn from data. As he wrote in Celestial Mechanics (1799), this meant calculating the probability that an observed astronomical phenomenon was caused by some definite explanation as opposed to random chance.
While the above examples were not formal hypothesis tests, they
reflect the idea behind all hypothesis tests; researchers want to
know if any pattern in the data has a meaningful explanation or if
it is caused by random chance. Karl Pearson (1857-1935) came close to a formal
hypothesis testing method when he developed a goodness of fit test,
which was intended to determine if a particular distribution fit a
dataset. At that time, the word *significant* had referred to something's meaning, but it came to be used when the probability of the results occurring only by chance was low enough to assume that any observed pattern had a meaningful explanation. While the colloquial meaning of *significant* has since shifted, this definition remains the same in a statistical context.
Formal methods to determine whether an observed pattern in data
is due to chance or something else were developed simultaneously by
Jerzy Neyman (1894-1981) and Ronald A. Fisher (1890-1962). Neyman
worked with Egon Pearson (1895-1980) and William Gosset (1876-1937),
but as Neyman wrote up the final product, their work is generally
attributed to him. Their method included two hypotheses, a null and
an alternative, and the researcher used data and analysis to infer
which hypothesis was true. The mathematics supporting the inference
also allowed them to calculate the probability of accepting the
alternative hypothesis if it were true, a quantity they called
*power*. The hypotheses tended to be set up so that the alternative
was what the researcher hoped to show, and the null was what to
compare against.
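As a hedged illustration of this notion of power, consider a one-sided z-test with known standard deviation; the numbers below are hypothetical, chosen only to show the calculation. A sketch using only the Python standard library:

```python
from statistics import NormalDist

# Power of a one-sided z-test of H0: mu = 0 vs Ha: mu = mu1 > 0,
# with known sigma, sample size n, and significance level alpha.
# Power = P(reject H0 | Ha is true). All numbers here are hypothetical.
def z_test_power(mu1, sigma, n, alpha=0.05):
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)      # rejection threshold for the z statistic
    shift = mu1 * n ** 0.5 / sigma     # mean of the z statistic under Ha
    return 1 - z.cdf(z_crit - shift)

power = z_test_power(mu1=0.5, sigma=1.0, n=25)
print(round(power, 3))  # roughly 0.80 for these inputs
```

In Neyman and Pearson's framework, a researcher would choose the sample size `n` so that power like this is acceptably high before running the study.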
Fisher's method involved only one hypothesis. After collecting
data, his method was to calculate the probability of observing the
results of the data, or something more extreme, just by chance if
the hypothesis were true. This probability is known as the p-value.
When the p-value is very small, Fisher would say that there is
evidence against the hypothesis. He gave a rule of thumb: "It is a
common practice to judge a result significant if it is of such a
magnitude that it would have been produced by chance not more
frequently than once in twenty trials" (Salsburg, 2001, p. 99). This
was intended as a rough guideline, but many statisticians now treat
it as a sharp cutoff. Fisher supported using hypothesis testing only
for controlled experiments, which led him to reject conclusions that
smoking causes lung cancer, a question approachable only through
observational studies. He published several papers, compiled into a
pamphlet, arguing that the data showing that smoking causes lung
cancer were faulty. This may or may not have had anything to do with
the fact that he was himself a smoker.
There were certain assumptions required for the mathematical theorems supporting the hypothesis testing process, such as the assumption that a sampling distribution follows a normal distribution. Fisher's and Neyman's approaches had differing philosophies supporting their logic. While the two men vehemently opposed each other, the hypothesis testing method we use today is a mix of the two, using both Neyman's null and alternative hypotheses and Fisher's p-values. The following types of tests are often discussed in introductory statistics: the t-test, analysis of variance (ANOVA), and chi-square tests. The applet below shows the distributions used in each test: the t-distribution for t-tests, the F-distribution for ANOVA, and the $\chi^2$-distribution for chi-square tests.
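This hybrid procedure can be sketched end to end with a simple example; the coin-flip numbers are invented purely for illustration. State a null and alternative hypothesis as Neyman and Pearson did, then compute Fisher's p-value and compare it to the conventional 0.05 cutoff:

```python
from math import comb

# Hypothetical example: a coin flipped 100 times lands heads 61 times.
# H0: the coin is fair (p = 0.5);  Ha: it favors heads (p > 0.5).
n, heads = 100, 61

# Fisher's p-value: probability of 61 or more heads by chance if H0 is true.
p_value = sum(comb(n, k) for k in range(heads, n + 1)) / 2 ** n

# Neyman-style decision against the conventional 0.05 cutoff.
reject_null = p_value < 0.05
print(f"p-value = {p_value:.4f}, reject H0: {reject_null}")
```

Here the p-value falls below 0.05, so the hybrid procedure would reject the null hypothesis of a fair coin.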
t-Test
As editor of the journal *Biometrika*, Karl Pearson operated under the prevalent assumption that a great deal of data was required to learn anything useful about a population. The journal became a place for people to submit and publish large sets of data. During the time that Pearson's assumption was commonly accepted, William S. Gosset (1876-1937) was hired by Guinness Brewery. His job was to ensure consistency in beer brewing. Gosset recognized that in many cases a large sample size is not practical. He found a distribution that reflected the distribution of sample means when the sample size is very small and the standard deviation is unknown. He checked his work by using a population of the left middle finger lengths of 3000 criminals and considering all possible random samples of size four. He then plotted the distribution of the sample means. This distribution assumed that the population was normally distributed, but later research determined that the same distribution could arise from other, less normal, distributions. This sampling distribution can be used to find the probability that the observed results, or something more extreme, would occur if the assumed null hypothesis were true, a process known as a t-test.
Gosset detailed his work in "The Probable Error of the Mean,"
published in 1908 in *Biometrika* under the pseudonym Student,
because Guinness had a strict no-publishing policy. It is likely that
Guinness only found out about the publication at the time of his sudden
death.
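Gosset's test is available in standard statistical libraries. Assuming SciPy is installed, a one-sample t-test on a hypothetical sample of four measurements looks like this (the data and the hypothesized mean are invented for illustration):

```python
from scipy import stats

# Hypothetical sample of n = 4 measurements (units arbitrary).
sample = [11.2, 11.8, 12.1, 11.5]

# One-sample t-test of H0: population mean = 11.5 vs Ha: mean != 11.5.
# With so few observations, the t-distribution (df = 3) is essential;
# a normal approximation would understate the uncertainty.
result = stats.ttest_1samp(sample, popmean=11.5)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

For this sample the p-value is large, so there is no evidence against the hypothesized mean, exactly the kind of small-sample judgment Gosset's distribution was built for.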
Analysis of Variance (ANOVA)
While working at the Rothamsted Experimental Station, Ronald A. Fisher (1890-1962) recognized the need to compare the means of a numeric variable separated by the levels of a categorical variable, adjusting the comparison for how much variability there is within each group. This is now called analysis of variance, or ANOVA. He published these ideas in 1923 with W. A. McKenzie as second author, but the test did not gain much attention until he published it in *Statistical Methods for Research Workers* in 1925. Other mathematicians and statisticians did not speak highly of his 1925 description because he did not give any proofs, probably because his audience consisted of those who needed to conduct research rather than those looking for rigorous proofs. It was more a description of how to set up a study and evaluate the collected data.
After Fisher's research methods became available, many research companies, government agencies, and research councils sent people to act as volunteers in analyzing Rothamsted data. In turn, they discussed their own research endeavors and struggles with Fisher. By both learning the techniques and using them in their own research, the visitors helped popularize Fisher's method for the analysis of variance. This method, however, could only reveal whether any of the level means differ from one another, not which ones. John Tukey (1915-2000) saw this need and developed pairwise comparisons to identify which pairs of levels differ.
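Assuming SciPy is available, Fisher's one-way ANOVA can be run with `scipy.stats.f_oneway`. The three groups below are fabricated so that between-group differences clearly dominate within-group variability; Tukey-style pairwise follow-ups are provided separately, for example in statsmodels or recent SciPy versions.

```python
from scipy import stats

# Three hypothetical treatment groups: means differ, within-group spread is small.
group_a = [10, 11, 12]
group_b = [20, 21, 22]
group_c = [30, 31, 32]

# One-way ANOVA: H0 is that all three group means are equal.
# F compares between-group variability to within-group variability;
# for these numbers F works out to exactly 300.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.1f}, p = {p_value:.2e}")
```

The tiny p-value says only that at least two group means differ; identifying which pairs differ is precisely the question Tukey's pairwise comparisons were designed to answer.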
Chi-Square Tests
Karl Pearson (1857-1935) defined a new distribution, called the
skew distribution, and he incorrectly speculated that all possible
distributions could be described by a skew distribution by changing
its four parameters: mean, standard deviation, kurtosis, and
skew. However, data from a sample do not follow a given distribution
perfectly, even if the population does. This led Pearson to recognize
the importance of testing whether a given distribution, specifically
the skew distribution with chosen parameters, could be a good fit
for an observed sample distribution. He called this process the
*goodness of fit test*. This test produced a test statistic whose
distribution belongs to the chi-square family of distributions.
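Pearson's goodness of fit test survives today as the chi-square test. Assuming SciPy is available, here is a sketch testing whether a hypothetical set of 120 die rolls is consistent with a fair die:

```python
from scipy import stats

# Hypothetical counts for faces 1-6 from 120 rolls of a die.
observed = [18, 22, 21, 19, 20, 20]

# Goodness of fit against a fair die (expected count 20 per face;
# scipy.stats.chisquare defaults to a uniform expectation).
# The statistic is sum((observed - expected)^2 / expected) = 0.5 here.
chi2, p_value = stats.chisquare(observed)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")
```

The large p-value means the observed counts are well within what chance alone would produce for a fair die, so the hypothesized distribution is a good fit.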