Hypothesis Testing

The main idea behind hypothesis testing had been around long before any formal hypothesis testing procedure. In the Middle Ages, English coins were made with precious metals, so those making the coins had ample opportunity for fraud: they could replace some of the precious metal with a less valuable metal and keep the difference for themselves. Around 1247, the English government began comparing the weight of coins against trial plates to verify their quality, in an annual event called the Trial of the Pyx. A Master of the Mint was imprisoned in 1318 when it was determined that the gold alloy was not of high enough quality, and another Master of the Mint was reprimanded in 1423 for producing coins of too high a quality. The purpose of the Trial was to evaluate whether any difference from the expected weight was large enough to indicate that the coins were not of the expected quality.

[Illustration: The Trial of the Pyx]

John Arbuthnot (1667-1735) considered the numbers of males and females born in each year from 1629 to 1710. More males than females were born in every one of those 82 years, and he calculated that the probability of this happening by chance alone was far too small. He concluded that there must be another explanation: some being had a hand in determining the proportion born of each sex, producing extra males to offset men's shorter lifespans. He argued that this demonstrated the existence of God.
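In modern terms, Arbuthnot's argument amounts to a sign test. A minimal sketch of the arithmetic, assuming each year is equally likely to favor either sex:

```python
from fractions import Fraction

# Under pure chance, a male-majority year is like a fair coin landing heads.
# The probability that all 82 years from 1629 to 1710 favor males is (1/2)^82.
years = 1710 - 1629 + 1                 # 82 years of records
p_all_male_majority = Fraction(1, 2) ** years

print(float(p_all_male_majority))       # about 2.1e-25
```

A probability this small is what led Arbuthnot to reject chance as an explanation.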

Pierre Simon de Laplace (1749-1827) contributed to many areas of science, including multiple areas within statistics, but most relevant here is his work in astronomy. He examined astronomical measurements and used what he knew about probability to assess the reliability of conclusions drawn from data. As he wrote in Celestial Mechanics (1799), this meant calculating the probability that an observed astronomical phenomenon had some definite cause rather than arising from random chance.

While the above examples were not formal hypothesis tests, they reflect the idea behind all hypothesis tests: researchers want to know whether a pattern in the data has a meaningful explanation or is caused by random chance. Karl Pearson (1857-1935) came close to a formal hypothesis testing method when he developed a goodness of fit test, intended to determine whether a particular distribution fit a dataset. At that time, the word significant referred to something having meaning; it came to be used when the probability of the results occurring by chance alone was low enough to suggest that an observed pattern had a meaningful explanation. While the colloquial sense of significant has since shifted, this definition is unchanged in a statistical context.

Formal methods to determine whether an observed pattern in data is due to chance or something else were developed simultaneously by Jerzy Neyman (1894-1981) and Ronald A. Fisher (1890-1962). Neyman worked with Egon Pearson (1895-1980) and William Gosset (1876-1937), but because Neyman wrote up the final product, the work is generally attributed to him. Their method involved two hypotheses, a null and an alternative, and the researcher used data and analysis to decide which hypothesis to accept. The mathematics supporting the inference also allowed them to calculate the probability of accepting the alternative hypothesis when it is true, which they called power. The hypotheses tended to be set up so that the alternative was what the researcher hoped to show and the null was the baseline to compare against.
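As an illustration of power (not drawn from the original sources), consider a one-sided z-test with known standard deviation; the effect size, sample size, and significance level below are hypothetical:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical one-sided z-test: H0: mu = 0 vs H1: mu = 0.5,
# with known sigma = 1, n = 25, and significance level alpha = 0.05.
mu0, mu1, sigma, n, alpha = 0.0, 0.5, 1.0, 25, 0.05

se = sigma / sqrt(n)                   # standard error of the sample mean
crit = mu0 + norm.ppf(1 - alpha) * se  # reject H0 when the sample mean exceeds this
power = norm.sf((crit - mu1) / se)     # P(reject H0 | H1 is true)

print(round(power, 2))                 # about 0.80
```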

Fisher's method involved only one hypothesis. After collecting data, his method was to calculate the probability of observing the results of the data just by chance if the hypothesis were true. This probability is known as the p-value. When the p-value is very small, Fisher would say that there is evidence against the hypothesis. He gave a rule of thumb: "It is a common practice to judge a result significant if it is of such a magnitude that it would have been produced by chance not more frequently than once in twenty trials" (Salsburg, 2001, p. 99). This was intended as a rough guideline, but it is now treated as a sharp cutoff by many statisticians. Fisher supported using hypothesis testing only for controlled experiments, which led him to reject conclusions drawn about smoking causing lung cancer, a question only approachable with observational studies. He published several papers and compiled them into a pamphlet arguing that the data showing that smoking causes lung cancer were faulty. This may or may not have had anything to do with the fact that he was a smoker.
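A minimal sketch of Fisher's logic on made-up data: state a single hypothesis, compute the probability of results at least as extreme as those observed, and weigh it against his one-in-twenty benchmark. The coin-flipping scenario and counts here are hypothetical:

```python
from scipy.stats import binomtest

# Hypothetical data: 16 successes in 20 trials under the hypothesis p = 0.5.
result = binomtest(16, n=20, p=0.5, alternative='greater')

print(round(result.pvalue, 4))   # about 0.0059
print(result.pvalue < 1 / 20)    # True: "significant" by Fisher's rule of thumb
```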

There were certain assumptions required for the mathematical theorems supporting the hypothesis testing process, such as the assumption that a sampling distribution follows a normal distribution. Fisher's and Neyman's approaches had differing philosophies supporting their logic. While the two vehemently opposed each other, the hypothesis testing method we use currently is a mix of the two, using both Neyman's null and alternative hypotheses and Fisher's p-values. The following types of tests are often discussed in introductory statistics: the t-test, analysis of variance (ANOVA), and chi-square tests. The sketch below shows the distributions used in each test: the t-distribution for t-tests, the F-distribution for ANOVA, and the $\chi^2$-distribution for chi-square tests.
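A minimal plotting sketch of the three reference distributions, with arbitrarily chosen degrees of freedom:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t, f, chi2

x = np.linspace(-4, 8, 500)

# Degrees of freedom below are arbitrary examples, not canonical choices.
plt.plot(x, t.pdf(x, df=5), label='t (df = 5)')
plt.plot(x, f.pdf(x, dfn=4, dfd=20), label='F (dfn = 4, dfd = 20)')
plt.plot(x, chi2.pdf(x, df=4), label='chi-square (df = 4)')

plt.xlabel('test statistic')
plt.ylabel('density')
plt.legend()
plt.show()
```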

t-Test

As editor of the journal Biometrika, Karl Pearson operated under the prevalent assumption that large amounts of data were required to learn anything useful about a population. The journal became a place for people to submit and publish large sets of data. During the time that Pearson's assumption was commonly accepted, William S. Gosset (1876-1937) was hired by Guinness Brewery, where his job was to ensure consistency in beer brewing. Gosset recognized that in many cases a large sample size is not practical. He found a distribution that reflected the distribution of sample means when the sample size is very small and the standard deviation is unknown. He checked his work by using a population of the left middle finger lengths of 3000 criminals, drawing many random samples of size four and plotting the distribution of the sample means. This distribution assumed that the population was normally distributed, but later research determined that the same distribution could arise from other, less normal, distributions. This sampling distribution can be used to find the probability that the observed results, or something more extreme, would occur if the assumed null hypothesis is true, a process known as a t-test.
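A minimal simulation in the spirit of Gosset's check, with a synthetic normal population standing in for the finger-length data (all numbers below are made up):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)

# Synthetic normal "population" standing in for Gosset's 3000 measurements.
population = rng.normal(loc=11.5, scale=0.6, size=3000)
mu = population.mean()

n, reps = 4, 100_000
samples = rng.choice(population, size=(reps, n), replace=True)

# Studentized sample means: (mean - mu) / (s / sqrt(n)), with s the sample SD.
t_stats = (samples.mean(axis=1) - mu) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# Tail frequencies track the t-distribution with n - 1 = 3 degrees of freedom.
print((t_stats > 2.0).mean())   # simulated tail probability
print(t.sf(2.0, df=n - 1))      # theoretical value, about 0.0697
```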

Gosset detailed his work in The Probable Error of the Mean, published in 1908 in Biometrika under the pseudonym Student, because Guinness had a strict no-publishing policy. It is likely that Guinness only found out about the publication at the time of Gosset's sudden death.

Analysis of Variance (ANOVA)

While working at the Rothamsted Experimental Station, Ronald A. Fisher (1890-1962) recognized the need to compare the means of a numeric variable separated by the levels of a categorical variable, adjusting the comparison for how much variability there is within each group. This is now called analysis of variance, or ANOVA. He published these ideas in 1923 with W. A. Mackenzie as a second author, but the test did not gain much attention until he published it in Statistical Methods for Research Workers in 1925. Other mathematicians and statisticians did not speak highly of his 1925 description because he did not give any proofs, probably because his audience consisted of those who needed to conduct research rather than those looking for rigorous proofs. It was more a description of how to set up a study and evaluate the collected data.
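A minimal sketch of a one-way ANOVA on made-up data for three hypothetical treatment groups, using SciPy:

```python
from scipy.stats import f_oneway

# Hypothetical yields: one numeric variable split by a three-level
# categorical variable (the numbers are invented for illustration).
group_a = [20.1, 21.4, 19.8, 22.0, 20.6]
group_b = [22.7, 23.1, 24.0, 22.5, 23.6]
group_c = [19.5, 20.2, 18.9, 19.9, 20.4]

# The F statistic compares variation between group means
# with variation within the groups.
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
```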

After Fisher's research methods became available, many research companies, government agencies, and research councils sent people to act as volunteers in analyzing Rothamsted data. In turn, these visitors discussed their own research endeavors and struggles with Fisher. By both learning the techniques and using them in their own research, the visitors helped popularize Fisher's method for the analysis of variance. This method, however, could only reveal whether a difference existed somewhere among the level means, not which levels differed from each other. John Tukey (1915-2000) saw this need and developed pairwise comparisons to identify which pairs of levels differ, as in the sketch below.
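Continuing the hypothetical groups from the ANOVA sketch above, Tukey's pairwise comparisons are available in SciPy (tukey_hsd, added in SciPy 1.8):

```python
from scipy.stats import tukey_hsd

group_a = [20.1, 21.4, 19.8, 22.0, 20.6]
group_b = [22.7, 23.1, 24.0, 22.5, 23.6]
group_c = [19.5, 20.2, 18.9, 19.9, 20.4]

# Tests every pair of groups while controlling the family-wise error rate.
result = tukey_hsd(group_a, group_b, group_c)
print(result)   # confidence interval and p-value for each pair
```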

Chi-Square Tests

Karl Pearson (1857-1935) defined a new family of distributions, called skew distributions, and incorrectly speculated that all possible distributions could be described by a skew distribution by varying four parameters: mean, standard deviation, kurtosis, and skew. However, data from a sample do not follow a given distribution perfectly, even if the population does. This caused Pearson to recognize the importance of testing whether a given distribution, specifically a skew distribution with chosen parameters, could be a good fit for an observed sample distribution. He called this process the goodness of fit test. This test produced a test statistic with a distribution belonging to the chi-square family of distributions.
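In the notation now standard, with observed counts $O_i$ and expected counts $E_i$ across $k$ categories, Pearson's test statistic is

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}.$$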

Goodness of fit tests have grown in their use since the time of Pearson. One use helped reveal apparent fraud by Gregor Mendel (1822-1884), who published a paper about experiments carried out with green and yellow peas. About $25$ percent of second-generation seeds yielded green peas and $75$ percent yellow, confirming his conjectures about dominant and recessive genes. However, his results were too close to the expected values. Ronald A. Fisher (1890-1962) published a paper analyzing Mendel's results using Pearson's goodness of fit test. He found that the probability of getting data as close to or closer than Mendel's data to the predicted values was only $0.00004$, suggesting that an assistant adjusted or misrecorded the data. The chi-square goodness of fit test has since been formalized, following the pattern of a null and alternative hypothesis, and has been generalized to also test whether two categorical variables are independent.
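A sketch of Fisher's "too good to be true" reasoning on hypothetical counts (not Mendel's actual aggregated data): instead of the usual upper-tail p-value, the question is how often agreement this close to the expected 3:1 ratio would occur by chance.

```python
from scipy.stats import chisquare, chi2

# Hypothetical second-generation counts: yellow, green (made-up numbers).
observed = [751, 249]
total = sum(observed)
expected = [0.75 * total, 0.25 * total]   # expected 3:1 ratio

stat, upper_p = chisquare(observed, f_exp=expected)

# The usual test asks whether the fit is too poor (upper tail);
# Fisher asked whether the fit was too good (lower tail).
lower_p = chi2.cdf(stat, df=1)
print(round(lower_p, 3))   # about 0.06: agreement this close arises ~6% of the time
```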