Replication Crisis
The replication crisis in psychology refers to concerns about the credibility of findings in psychological science. The term, which originated in the early 2010s, denotes that findings in behavioral science often cannot be replicated: Researchers do not obtain results comparable to the original, peer-reviewed study when repeating that study using similar procedures. For this reason, many scientists question the accuracy of published findings and now call for increased scrutiny of research practices in psychology.
Some scientists have warned for years that certain ways of collecting, analyzing, and reporting data, often referred to as questionable research practices, make it more likely that results will appear to be statistically meaningful even though they are not. Flawed study designs and a “publication bias” that favors confirmatory results are other longtime sources of concern.
A series of replication projects in the mid-2010s amplified these worries. In one major project, fewer than half of the studies that replicators tried to recreate yielded similar results, suggesting that at least some of the original findings were false positives.
A variety of findings have come into question following replication attempts, including well-known ones suggesting that certain types of priming, physical poses, and other simple interventions could affect behavior in surprising or beneficial ways. Psychology is not alone, however: Other fields, such as cancer research and economics, have faced similar questions about methodological rigor.
The growing awareness of how research practices can lead to false positives has coincided with extreme instances of willful misrepresentation and falsification—resulting, in some cases, in the removal or resignation of prominent scientists.
The field of psychology began to reckon with reproducibility around 2010 when a particularly dubious paper claimed to provide evidence of “precognition,” or the ability to perceive events in the future. Scientists increasingly began to discuss methodological concerns and to repeat experiments to corroborate published studies. The failure to consistently replicate those findings propelled the movement forward.
Journals are incentivized to publish interesting and surprising findings. This leads to publication bias, the tendency to publish positive findings rather than studies that find no effect. Researchers are incentivized to publish as often as possible to advance their careers. Therefore, they may exercise flexibility in their data analysis to achieve statistical significance.
A landmark paper in 2015 revealed that of 97 attempts to replicate previous research findings, fewer than 40 percent were deemed successful. Another large-scale project in 2018 tested 28 findings dating from the 1970s through 2014. It found evidence for about half. An examination of 21 findings published in top-tier journals found that two-thirds replicated successfully. These results are not necessarily representative of psychology as a whole, however, and certain areas of the field have likely amassed stronger evidence than others.
Other fields, such as economics and medicine, have struggled with reproducibility as well. Yet psychology may face distinct challenges: Measuring human behavior can be less precise than measuring, for example, physiological markers such as blood pressure or heart rate.
The validity of psychological research matters both for the pursuit of knowledge about human behavior and for the real-world interventions that research informs in mental health care, medicine, education, business, and politics.
Despite these challenges to reliability, even skeptical scientists still believe in a range of findings about human behavior. Examples include that personality traits remain fairly stable in adulthood, that individual beliefs are shaped by group beliefs, and that people seek to confirm their preexisting beliefs.
To better grasp the replication crisis, it’s worth exploring some of the statistical methods used in psychology experiments. Flexibility in research methodology can help explain why researchers unconsciously (and sometimes consciously) produce unreliable results.
When conducting an experiment, a researcher develops a hypothesis. For example, they may hypothesize that spending time with friends makes people happier. They then seek to reject the null hypothesis—the possibility that there is no association or effect of the sort the researchers propose. In this case, the null hypothesis would be that there is no relationship between happiness and spending time with friends.
A finding is said to be statistically significant if the results of a study, based on a particular sample of people, are judged unlikely to have arisen by chance alone and are therefore thought to generalize to the broader population of interest. A traditional benchmark of statistical significance in psychology is a p-value of .05, though more stringent benchmarks have recently been proposed.
The p-value is the measure used to determine statistical significance. Roughly speaking, a p-value is the probability of obtaining a study result by random chance if the null hypothesis is true. The smaller the p-value, the less likely it is that an observed result would be found in the absence of a real effect or association between variables. The threshold for significance is traditionally a p-value of less than .05, although the replication crisis has led researchers to rethink relying on p-values or to propose lowering the threshold for what counts as “significant” (to .005, for example). The fact that .05 is an arbitrary benchmark is, for some, further evidence that p-values are given too much credence.
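As a concrete illustration, here is a minimal sketch of the procedure described above, applied to the friends-and-happiness example. The group sizes, means, and the 0-to-10 happiness scale are all invented for illustration; Python with NumPy and SciPy is assumed.

```python
# A minimal sketch of null-hypothesis significance testing,
# using invented data for the friends-and-happiness example.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Hypothetical happiness ratings (0-10 scale) for two groups:
# people who spent time with friends vs. people who did not.
# The group means (6.5 vs. 6.0) are assumptions, not real data.
friends = rng.normal(loc=6.5, scale=1.5, size=40)
alone = rng.normal(loc=6.0, scale=1.5, size=40)

# Two-sample t-test: the null hypothesis is that both groups
# share the same population mean (no effect of time with friends).
t_stat, p_value = stats.ttest_ind(friends, alone)

print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# Under the traditional benchmark, "significant" means p < .05.
print("statistically significant at .05:", p_value < 0.05)
```

With the assumed half-point difference between group means and 40 people per group, some simulated samples will cross the .05 threshold and others will not, which is one reason a single significant result is weak evidence on its own.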
A Type I error occurs when the null hypothesis is rejected even though it’s actually true; this is commonly called a false positive. The lower the p-value, the lower the likelihood of a Type I error. A Type II error occurs when the null hypothesis is wrongly accepted even though it’s false; this is called a false negative. Greater statistical power in a study (which is related to factors such as the sample size) means a lower likelihood of a Type II error.
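To make these error rates tangible, the simulation sketch below (under the same illustrative assumptions as the previous example) estimates the Type I error rate by comparing groups drawn from identical populations, and the power, one minus the Type II error rate, for a modest assumed effect at two sample sizes.

```python
# Simulation sketch: estimating Type I and Type II error rates.
# Effect sizes, sample sizes, and trial counts are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
ALPHA = 0.05
TRIALS = 5_000

def false_positive_rate(n):
    """Share of 'significant' results when the null is actually true."""
    hits = 0
    for _ in range(TRIALS):
        a = rng.normal(0.0, 1.0, n)  # both groups drawn from
        b = rng.normal(0.0, 1.0, n)  # the same population
        if stats.ttest_ind(a, b).pvalue < ALPHA:
            hits += 1
    return hits / TRIALS

def power(n, effect=0.3):
    """Share of 'significant' results when a real effect exists."""
    hits = 0
    for _ in range(TRIALS):
        a = rng.normal(effect, 1.0, n)  # true mean difference = effect
        b = rng.normal(0.0, 1.0, n)
        if stats.ttest_ind(a, b).pvalue < ALPHA:
            hits += 1
    return hits / TRIALS

# The Type I error rate should sit near the .05 threshold regardless of n.
print("Type I error rate (n=30):", false_positive_rate(30))

# Power rises with sample size; the Type II error rate is 1 - power.
for n in (30, 200):
    p = power(n)
    print(f"power at n={n}: {p:.2f}  (Type II error rate: {1 - p:.2f})")
```

Enlarging the sample drives the Type II error rate down while leaving the Type I rate pinned near the .05 threshold, which is one reason reformers push for larger samples.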
The replication crisis provoked heated internal debate in the field, with some arguing that it called for an overhaul of psychological science and others maintaining that the “crisis” was unreal or overblown. Nevertheless, psychologists interested in reform have pressed ahead with efforts to make the claims of psychological research more credible.
The reformers’ immediate aims include greater transparency in study planning and data analysis, more routine follow-up testing of results to make sure they can be reliably observed, and study designs that are well suited to the scientific questions at hand. It remains to be seen which approaches will ultimately be most useful in increasing the trustworthiness of psychological findings.
Psychologists have developed an array of strategies to help ensure that future findings have greater credibility. These include conducting more replications of emerging findings, relying on larger sample sizes, and using thoroughly tested measures. Another is preregistration: delineating one’s hypotheses and analysis plans before conducting a study. Yet another is Registered Reports, in which journals agree to publish a transparently planned-out study no matter the results.
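Preregistration targets a specific statistical problem: When analyses are chosen after seeing the data, the nominal 5 percent false-positive rate no longer holds. The hedged sketch below, under the same illustrative assumptions as the earlier examples, simulates a researcher who measures several unrelated outcomes with no true effects and reports any one that reaches p < .05.

```python
# Sketch: how analytic flexibility inflates false positives.
# A hypothetical researcher measures several outcomes where NO
# real effect exists and reports whichever one reaches p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
ALPHA, TRIALS, N = 0.05, 2_000, 30

def chance_of_a_hit(n_outcomes):
    """Estimated probability of at least one 'significant' result
    across independent outcomes, all with no true effect."""
    hits = 0
    for _ in range(TRIALS):
        for _outcome in range(n_outcomes):
            a = rng.normal(0.0, 1.0, N)
            b = rng.normal(0.0, 1.0, N)
            if stats.ttest_ind(a, b).pvalue < ALPHA:
                hits += 1
                break  # the researcher stops and reports this outcome
    return hits / TRIALS

for k in (1, 5, 20):
    print(f"{k} outcome(s) tested -> false positive rate ~ {chance_of_a_hit(k):.2f}")
```

The estimated rates land near .05, .23, and .64, matching the analytic value 1 − 0.95^k for k independent tests at the .05 threshold. Committing to a single primary outcome and analysis plan in advance keeps the error rate at its nominal level, which is what preregistration and Registered Reports are designed to enforce.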
In addition to specific procedures that curb unreliable research practices, many organizations devoted to credibility and transparency have sprung up in the wake of the replication crisis. These initiatives include the Open Science Collaboration, the Society for the Improvement of Psychological Science, and the Psychological Science Accelerator.