
The ordeal of the one-tailed test

Statistics without fear: A tale of tails

Editor's Note: This post references research by Diederik Stapel. Many of his studies have since been found to be fraudulent.



Tomás de Torquemada, "the hammer of heretics," would have loved to hate the one-tailed test (OTT) of statistical significance. The OTT has been reviled, and those who use it dismissed as soft-headed, self-serving pseudo-scientists, by those bent on upholding the ritual orthodoxy of the two-tailed test. O.k., this phrase came out a bit hyperbolic, and I don't have the apt scientific reference to back it up empirically beyond a reasonable doubt (p < .05, ___-tailed). Nonetheless, after observing the scene for a quarter of a century and after reviewing innumerable manuscripts, I stand behind my assessment. There's just nothing like an experience-grounded appeal to one's own authority. Well, let me at least give an example from recent research and blogging thereof. Bem (2011) tested the presence of retroactive causation (that's right!) using OTTs, and skeptics have complained about it (see here). Incidentally, I believe that Bem, regardless of my limited sympathy for his research project, was right in choosing the OTT. But I'm getting ahead of the story.

Let me first describe two legitimate uses of the two-tailed test. Say you have set up a null hypothesis that the coin in your hand is fair, and you have tossed it a bunch of times. The alternative to the null is that the coin is biased, and it could be biased by producing a number of heads or tails (no pun intended) that would be improbable if the null hypothesis were true. If that happens, you reject the assumption that the coin is fair and look for a better one. What this example illustrates is the testing of human-made devices that are supposed to yield a particular result with a particular probability. Gaming devices in general are illustrative, but one could look to industrial production as well. Sugar, to take a prosaic case, is put in bags that are supposed to contain a pound. Production may be biased, thus cheating either the consumer or the producer. Taking samples and testing the null hypothesis of no systematic error in either direction is a fine example of the use of the two-tailed test. In fact, the quality (or rather "quantity") control scenario was among the inspirations for the Neyman-Pearson decision-theoretic approach to hypothesis testing.
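For the concretely minded, here is a minimal sketch of the coin scenario in Python (the toss counts are invented for illustration; the test itself is scipy's exact binomial test):

```python
from scipy.stats import binomtest

# Invented data: 100 tosses of the suspect coin come up heads 61 times.
n_heads, n_tosses = 61, 100

# The null says p(heads) = .5; bias in EITHER direction discredits it,
# so the rejection region sits in both tails of the distribution.
result = binomtest(n_heads, n_tosses, p=0.5, alternative='two-sided')
print(f"two-tailed p = {result.pvalue:.3f}")  # about .035: reject fairness
```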

In psychological research, precise hypotheses that could be discredited in either tail of the distribution are rare. Only the strongest theories generate point-specific predictions that are not trivial and not confounded with chance (as the nil-null hypothesis is). Suppose you hypothesize that jumbo eggs are exactly half a standard deviation larger in diameter than large eggs. This hypothesis is a bold one because it can be nullified at either end of the distribution, and you have no idea which type of refutation is more probable. A two-tailed test is in order. This is good practice because your substantive claim, being the null hypothesis itself, will be corroborated by a failure to reject it.
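To see the logic in miniature, here is a sketch with made-up egg diameters. The trick is to shift one sample by the predicted half standard deviation, which turns the point prediction into an ordinary null of no difference:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Made-up diameters in millimeters; the numbers are purely illustrative.
large = rng.normal(loc=55.0, scale=2.0, size=40)
jumbo = rng.normal(loc=56.0, scale=2.0, size=40)

# Point prediction: jumbo exceeds large by exactly 0.5 pooled SDs.
sd_pooled = np.sqrt((large.var(ddof=1) + jumbo.var(ddof=1)) / 2)
shift = 0.5 * sd_pooled

# Subtract the predicted shift; under the hypothesis the adjusted means
# are equal, and the theory can fail in either tail, so two tails it is.
t_stat, p_two = ttest_ind(jumbo - shift, large)
print(f"t = {t_stat:.2f}, two-tailed p = {p_two:.3f}")
# A p above .05 here means the point prediction survives:
# corroboration by non-rejection.
```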

As these two examples show, the two-tailed test enters the fray when the researcher has a strong claim in place and ventures to falsify it. Alas, this is not the typical situation for the psychological researcher. Often, we can only speculate that the effect will go in a certain direction. Most experiments are designed to test whether a presumed causal variable has an effect. Will increasing self-esteem induce greater happiness? Will long hours of guitar practice improve performance? Will holding a warm cup of coffee make you more agreeable? This is the kind of scenario envisioned in Fisher's conceptualization of significance testing. You look to reject a null hypothesis by finding an intervention that works as intended. Now, a rational person would not intend an intervention to work by either increasing or decreasing an outcome, be it happiness, virtuosity, or agreeableness. What kind of a theory predicts that the null hypothesis is false without caring in which direction it is false? Such a theory would be confirmed by a certain state of nature and also by its complete opposite. Nuts! Incidentally, this is why I prefer the phrase "global warming" over the noncommittal "climate change."

Ergo, the OTT is the one to call when testing a directional hypothesis. Why is it used so rarely? I think the reason is that the use of the two-tailed test has become ritualized (Gigerenzer, 2004). If the two-tailed test is the right choice for the two types of strong hypothesis testing described above, why should it not be serviceable for the testing of an ordinary, directional research question? Being ritualized, the two-tailed test has become the default option. Use it, they say, unless there is a clear need not to. Unfortunately, such a need often becomes apparent only after the data are in. Suppose you wanted to show that loud music makes you more (not less) irritable. After properly running a treatment condition in which loud music is delivered and a control condition in which it is not, you find that p = .06, two-tailed, for the comparison of the two means. The old-school, hard-shell orthodoxians who live by the sword would also die by the sword. They would pronounce the experiment inconclusive, and they would not reject the null hypothesis. To the rest of us, who can resist anything but temptation, it seems that the finding is "marginally significant," which sounds more significant than "marginally attractive" sounds attractive. We see a "trend" in the desired direction, or we even see the data "trending" that way, as if they were moving. All this sounds shady, and that's where Torquemada and his henchmen come in. Reconsidering the use of the OTT after the data are in, the statistics police say, is (self-)deception. Loosening the belt after the meal is to relax a noble standard.
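The arithmetic of the temptation is simple: because the t distribution is symmetric, the one-tailed p in the predicted direction is exactly half the two-tailed p. A sketch, with made-up numbers chosen to land near the example:

```python
from scipy.stats import t

# Suppose the music experiment yields t = 1.93 on 60 degrees of freedom
# (invented values, picked to give roughly p = .06 two-tailed).
t_obs, df = 1.93, 60

p_two = 2 * t.sf(t_obs, df)  # both tails count against the null
p_one = t.sf(t_obs, df)      # only the predicted tail counts
print(f"two-tailed p = {p_two:.3f}, one-tailed p = {p_one:.3f}")
# Roughly .06 vs. .03: the same data clear the .05 bar only one-tailed.
```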

The charge is that going with the OTT doubles the chances of a false positive result (i.e., rejecting the null hypothesis when it is true), and that is the worst sin against scientific conservatism. It seems to me, though, that this is not a problem of the OTT per se. If a two-tailed rejection region were set up with p = .025 on either side while a directional hypothesis was being entertained, it would boil down to the same thing as an OTT set up with p = .025. An OTT is not automatically less conservative than a two-tailed test; the difference appears only when a two-tailed test is replaced with an OTT post hoc. With p = .05 being the conventional level of conservatism, why not attach all of it to the tail of the distribution that matters? If the data turn out significant on the other side of the distribution, the null hypothesis is rejected as well, but so is your directional hypothesis of interest, and a fortiori so.
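The equivalence can be checked directly against the critical values of the t distribution (60 degrees of freedom, chosen arbitrarily for illustration):

```python
from scipy.stats import t

df = 60
print(t.ppf(0.975, df))  # two-tailed alpha = .05: reject if |t| > ~2.00
print(t.ppf(0.950, df))  # one-tailed alpha = .05: reject if t > ~1.67
print(t.ppf(0.975, df))  # one-tailed alpha = .025: reject if t > ~2.00
# The last cutoff matches the two-tailed one: an OTT at .025 is exactly
# as conservative as a two-tailed test at .05.
```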

To you researchers out there, my advice is to go with the OTT from the outset if your research hypothesis is directional. That's how you can keep it straight (and somewhat liberal), and along the way you avoid being led into temptation by "trending" data.

L'esprit de l'escalier. After writing this post, I was waiting for a good example to come along. It took only one day. In an otherwise interesting article, Lammers, Stapel, and Galinsky (2010) present the results of several two-by-two factorial analyses of variance (ANOVAs), in which the interesting result is the interaction between interpersonal power (high vs. low) and morality (judgment vs. behavior). Compared with low-power individuals, high-power individuals set stricter moral standards but act in a more relaxed fashion (i.e., they cheat more). Well, that's the narrative anyway; the statistics are arguable.

Instead of following up these interactions with simple-effects F tests (Keppel, 1991), Lammers et al. use two-tailed t tests and promptly find some p values greater than .05. They don't bother with the post-hoc move to a one-tailed test or the rhetoric of marginal significance; they simply describe the results as significant. Chutzpah. Incidentally, the proper simple-effects F test avoids the how-many-tails headache; the F distribution has only one tail.
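For a contrast with one degree of freedom in the numerator, the simple-effects F test and the two-tailed t test are the same test in disguise: F equals t squared, and the single right tail of F collects both tails of t. A quick check:

```python
from scipy.stats import f, t

df_err = 57
t_crit = t.ppf(0.975, df_err)    # two-tailed .05 cutoff for t(57)
f_crit = f.ppf(0.95, 1, df_err)  # .05 cutoff for F(1, 57), one tail only
print(t_crit ** 2, f_crit)       # both ~4.01: F = t^2, tails and all
```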

So, what to make of the claim that compared with low-power individuals, high-power individuals "overreport travel expenses, t(57) = 1.78, p = .08" (p. 739)? And the head-scratching isn't over. If 61 subjects are randomized over a four-condition design, the degrees of freedom for comparing two of the four means should be a bit less than half the total sample size; the reported 57 happens to equal 61 − 4, the error degrees of freedom of the full four-cell ANOVA. The final puzzler: How do you squeeze measures of moral judgment and moral behavior into the same ANOVA when the two sets of data are represented on different metrics (a 9-point scale vs. a 100-point scale)? Lammers et al. choose standardization, so that overall, morality judgments and the index of cheating each have a mean of 0 and a standard deviation of 1. They then report that there is no main effect for the comparison between judgment and behavior, "F < .01." Of course there isn't: standardization sets both means to 0, so the absence of a main effect is guaranteed by construction, not discovered in the data. The statistical ritual is running amok! As Jacob Cohen (1994, p. 997) would say, "Would you believe it? It could happen."
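A five-line simulation, with invented scores standing in for the two measures, shows why that F was never going to be anything but tiny:

```python
import numpy as np

rng = np.random.default_rng(1)
# Invented raw scores on the two incommensurate metrics.
judgments = rng.uniform(1, 9, size=30)    # 9-point judgment scale
behavior = rng.uniform(0, 100, size=30)   # 100-point cheating index

def standardize(x):
    return (x - x.mean()) / x.std(ddof=1)

z_j, z_b = standardize(judgments), standardize(behavior)
# Each standardized variable has mean 0 by construction, so the "main
# effect" of judgment vs. behavior is zero before the data say anything.
print(z_j.mean(), z_b.mean())  # both ~0: F < .01 is guaranteed, not found
```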

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407-425.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33, 587-606.

Keppel, G. (1991). Design and analysis: a researcher's handbook (3rd edn). Englewood Cliffs, NJ: Prentice-Hall.

Lammers, J., Stapel, D. A., & Galinsky, A. (2010). Power increases hypocrisy: Moralizing in reasoning, immorality in behavior. Psychological Science, 21, 737-744.
