How Did the Concept of Statistical Significance Arise?
The history and origin of p < .05 and statistical significance.
Posted January 29, 2023 Reviewed by Jessica Schrader
Key points
- The history of statistical significance can be traced to statistician Ronald Fisher in the early 1920s.
- The concept of statistical significance arose through mere happenstance.
- Replication is the essence of statistical significance: If we get the same result 19 out of 20 times, the results can be trusted.
It may be common knowledge that p < .05 indicates statistical significance. Psychology students (and others) are often taught that p < .05 means that the probability (p) of rejecting the null hypothesis when it is actually true is less than 5%. The null hypothesis is typically a conservative position, that is, this new method of doing something, new treatment, or new drug does not work any better than chance. The alternative hypothesis is that this new method, treatment, or drug does work better than chance. Experimenters begin with the null hypothesis. After the experiment, and based on a statistical analysis of the data, they either retain the null hypothesis (it does not work better than chance) or reject the null hypothesis in favor of an alternative hypothesis (it does work better than chance). Note one important characteristic of hypothesis testing: Statisticians can only attach a probability of being wrong to their claim (p < .05).
The Type 1 Error
To claim something is true, when it is not, is considered the most severe or grievous error in science. This is the essence of the Type 1 error in statistics: rejecting the null hypothesis when it is true. It calls to mind singer Paul Simon’s lyric, "I would not give you false hope," which I interpret as a fear of committing the Type 1 error. As an example of the Type 1 error, reggae musician Bob Marley had skin cancer (melanoma) of the toe. His religious beliefs forbade him from having it amputated. A European doctor was offering an alternative diet-based "cure." As a result, Marley spent the last precious months of his life on an empirically baseless treatment. The doctor had sold him false hope and committed the Type 1 error. He had rejected the null hypothesis (the treatment does not work) when it was true. Ironically, the doctor was exonerated of manslaughter because he really believed his treatments worked, despite his patients dying and the complete lack of experimental support. One interesting observation: Imagine the lack of imagination of the statistician who labeled this dangerous position a Type 1 error. It is like Tarzan calling his child "boy." But I digress.
The value p < .05 is also interpreted as the probability of committing the Type 1 error, which should therefore be less than 5 chances out of 100. When we do reject the null hypothesis, who ultimately knows whether we are right or wrong? If one is a theist, one might say, “God.” If one is a non-theist, one might say, “No one.” And again, an important characteristic of hypothesis testing is that statisticians can only attach a probability of being wrong if they reject the null hypothesis. But where did less than 5 chances out of 100 actually come from?
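For readers who like to see the idea in action, here is a minimal simulation sketch of the "5 chances out of 100" claim. It is not taken from any of the texts mentioned in this post. It assumes a simple two-sided z-test with a known standard deviation of 1, and the names `two_sided_p_value` and `false_positive_rate`, the sample size of 30, and the number of trials are all hypothetical choices made for illustration. Every simulated experiment is run with the null hypothesis true, so each rejection is a Type 1 error.

```python
import random
from statistics import NormalDist

ALPHA = 0.05                   # the conventional significance cutoff
STANDARD_NORMAL = NormalDist() # mean 0, standard deviation 1


def two_sided_p_value(sample):
    """p-value of a z-test of 'true mean is 0', assuming a known SD of 1."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


def false_positive_rate(trials=5000, n=30, seed=1):
    """Share of experiments that reject a null hypothesis that is true."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    rejections = 0
    for _ in range(trials):
        # The null is true by construction: the data really have mean 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if two_sided_p_value(sample) < ALPHA:
            rejections += 1    # a Type 1 error
    return rejections / trials


if __name__ == "__main__":
    print(f"Type 1 error rate: {false_positive_rate():.3f}")
```

Under these assumptions the printed rate should land close to 0.05: When nothing is really going on, roughly 1 experiment in 20 still comes out "significant."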
Teatime at Rothamsted
Interestingly, the origin of the null hypothesis and statistical significance can be traced to an agricultural research center in Rothamsted (England). In about 1919, a young Cambridge University-trained mathematician, Ronald Fisher, joined Rothamsted to conduct field research (likely on varieties of barley). At teatime, Fisher, Muriel Bristol (a Ph.D. in algae studies), and William Roach (her soon-to-be husband) were having tea when she exclaimed that the milk had been added to the tea, and she preferred the opposite. Fisher protested that such discrimination was impossible. Roach suggested that they offer her eight cups of tea, where four cups had tea added to the milk and four cups had milk added to the tea. Bristol knew that the cups would be presented in random order and knew the nature of the eight cups (half tea to milk, half milk to tea). Fisher (1935), in one of his classic texts, The Design of Experiments, noted that the probability of getting all eight cups right was 1 out of 70 (1.4%), because there are 70 ways to choose which four of the eight cups had the milk poured first. Thus, if she was simply lucky, and was able to discriminate all eight cups successfully, then the probability by chance was about 1.4%. In this way, Fisher established the concept of the null hypothesis: that she could not identify the cups any better than chance. He also noted in that classic text that the null hypothesis is never "proven," but it is "possible" for it to be disproven.
Remember, Bristol was aware that the eight cups were evenly divided between four cups of milk then tea and four cups of tea then milk, so she was unlikely to choose all eight cups as tea with milk or vice versa. And curiously, Fisher did not report the outcome of the tea experiment! However, in an interesting book about the history of 20th-century statistics, David Salsburg (2002) reported that one of Fisher’s colleagues, H. Fairfield Smith, claimed that Dr. Bristol correctly identified all of the cups of tea.
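The 1-out-of-70 figure can be checked by counting. The short sketch below is my own illustration, not Fisher's notation, and the function names are hypothetical. It assumes, as described above, that the taster knows exactly four of the eight cups are milk-first, so a guess amounts to picking four cups. It computes the count two ways: with the combinations formula and by listing every possible guess.

```python
from itertools import combinations
from math import comb

CUPS = 8        # total cups presented
MILK_FIRST = 4  # cups with milk poured first (known to the taster)


def count_selections(cups=CUPS, chosen=MILK_FIRST):
    """Number of distinct ways to pick which cups are milk-first."""
    return comb(cups, chosen)


def chance_of_perfect_score(cups=CUPS, chosen=MILK_FIRST):
    """Probability of naming every cup correctly by pure guessing."""
    return 1 / count_selections(cups, chosen)


def enumerate_perfect(cups=CUPS, chosen=MILK_FIRST):
    """List every possible guess and count how many match the truth."""
    # Assumption for illustration: cups 0..3 are the milk-first cups.
    truth = frozenset(range(chosen))
    guesses = list(combinations(range(cups), chosen))
    hits = sum(1 for guess in guesses if frozenset(guess) == truth)
    return hits, len(guesses)


if __name__ == "__main__":
    print(count_selections())                  # 70
    print(f"{chance_of_perfect_score():.4f}")  # 0.0143
    print(enumerate_perfect())                 # (1, 70)
```

Only one of the 70 equally likely guesses gets every cup right, which is where the roughly 1.4% comes from. Notice that this is comfortably below the 5% cutoff.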
Replication
I like asking my colleagues: If you had to define the essence of statistics in one word, what would it be? One said, “Measurement.” I like that answer because so often we read, “The 10 best cities to live in,” or “The 10 worst cars,” and so on. And the question arises, what were their criteria? For example, if they said, “Detroit,” would you move there if you liked warm weather and oceans, and your family lived in Miami? Thus, what were their criteria for the best city or best car? Another colleague said, “Practical.” I also like that answer because it has the connotation of doing something that works or something that will be successful or effective. For example, sniffing camphor or taking ivermectin or oleandrin were not "practical" treatments for the COVID-19 virus. In other words, they did not work, and those who promoted them as cures or treatments were committing the Type 1 error.
Fisher, through his 1925 and 1935 classic texts, came up with the theoretical foundation of p < .05, and it is the essence of replication. If one were to conduct an experiment 20 times and only once got a negative result (1/20 = 5%), then one could be fairly certain of the results, that is, 19/20 = 95%.
The Alternative Hypothesis
Fisher did not define an alternative to the null hypothesis. Statisticians Egon Pearson and Jerzy Neyman did a bit later. It is said that Fisher opposed the idea, but that is another story.
References
If you wish to learn more about statistics and some of its history, please see Statistics: A Gentle Introduction (4th ed. 2021, Sage Inc).