In: Statistics and Probability
Critically assess the observation that ‘Traditional use of significance testing is an inherently misleading process that should be abandoned in favour of other approaches’ (Cohen, 1994)
"After four decades of severe criticism, the ritual of null hypothesis significance testing---mechanical dichotomous decisions around a sacred .05 criterion---still persist. This article reviews the problems with this practice..." ... "What's wrong with [null hypothesis significance testing]? Well, among many other things, it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!" (Cohen 1994)
WHY ARE HYPOTHESIS TESTS USED?
With all the deficiencies of statistical hypothesis tests, it is reasonable to wonder why they remain so widely used. Nester (1996) suggested several reasons: (1) they appear to be objective and exact; (2) they are readily available and easily invoked in many commercial statistics packages; (3) everyone else seems to use them; (4) students, statisticians, and scientists are taught to use them; and (5) some journal editors and thesis supervisors demand them. Carver (1978) recognized that statistical significance is generally interpreted as having some relation to replication, which is the cornerstone of science. More cynically, Carver (1978) suggested that complicated mathematical procedures lend an air of scientific objectivity to conclusions. Shaver (1993) noted that social scientists equate being quantitative with being scientific. D. V. Lindley (quoted in Matthews 1997) observed that "People like conventional hypothesis tests because it's so easy to get significant results from them."
I attribute the heavy use of statistical hypothesis testing, not just in the wildlife field but in other "soft" sciences such as psychology, sociology, and education, to "physics envy." Physicists and other researchers in the "hard" sciences are widely respected for their ability to learn things about the real world (and universe) that are solid and incontrovertible, and also yield results that translate into products that we see daily. Psychologists, for 1 group, have difficulty developing tests that are able to distinguish 2 competing theories.
In the hard sciences, hypotheses are tested; that process is an integral component of the hypothetico-deductive scientific method. Under that method, a theory is postulated, which generates several predictions. These predictions are treated as scientific hypotheses, and an experiment is conducted to try to falsify each hypothesis. If the results of the experiment refute the hypothesis, that outcome implies that the theory is incorrect and should be modified or scrapped. If the results do not refute the hypothesis, the theory stands and may gain support, depending on how critical the experiment was.
In contrast, the hypotheses usually tested by wildlife ecologists do not devolve from general theories about how the real world operates. More typically they are statistical hypotheses (i.e., statements about properties of populations; Simberloff 1990). Unlike scientific hypotheses, the truth of which is truly in question, most statistical hypotheses are known a priori to be false. The confusion of the 2 types of hypotheses has been attributed to the pervasive influence of R. A. Fisher, who did not distinguish them (Schmidt and Hunter 1997).
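To see why a point null is effectively always rejectable, consider a minimal simulation (my illustration, not from the cited literature): the true mean of 0.02 is an arbitrary assumption, trivially different from the null value of 0, yet a large enough sample is all but certain to declare it "significant."

```python
# Sketch: a point null that is even slightly false is rejected once the
# sample is large enough. TRUE_MEAN = 0.02 is an assumed, trivial effect.
import math
import random
import statistics

random.seed(1)

TRUE_MEAN = 0.02  # assumed: barely different from the null value of 0


def z_test_p_value(sample, null_mean=0.0):
    """Two-sided z-test p-value for H0: population mean == null_mean."""
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    z = (statistics.fmean(sample) - null_mean) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area


for n in (100, 10_000, 1_000_000):
    sample = [random.gauss(TRUE_MEAN, 1.0) for _ in range(n)]
    print(f"n = {n:>9,}: p = {z_test_p_value(sample):.4f}")
```

At n = 100 the test finds nothing; at n = 1,000,000 the same trivial effect is overwhelmingly "significant." The verdict tracks sample size, not biological importance.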
Scientific hypothesis testing dates back at least to the 17th century: in 1620, Francis Bacon discussed the role of proposing alternative explanations and conducting explicit tests to distinguish between them as the most direct route to scientific understanding (Quinn and Dunham 1983). This concept is related to Popperian inference, which seeks to develop and test hypotheses that can clearly be falsified (Popper 1959), because a falsified hypothesis provides greater advance in understanding than does a hypothesis that is supported. Also similar is Platt's (1964) notion of strong inference, which emphasizes developing alternative hypotheses that lead to different predictions. In such a case, results inconsistent with predictions from a hypothesis cast doubt on its validity.
Examples of scientific hypotheses, which were considered credible, include Copernicus' notion H_A: the Earth revolves around the sun, versus the conventional wisdom of the time, H_0: the sun revolves around the Earth. Another example is Fermat's last theorem, which states that for integers n, X, Y, and Z, X^n + Y^n = Z^n implies n ≤ 2. Alternatively, a physicist may make specific predictions about a parameter based on a theory, and the theory is provisionally accepted only if the outcomes are within measurement error of the predicted value, and no other theories make predictions that also fall within that range (Mulaik et al. 1997). Contrast these hypotheses, which involve phenomena in nature, with the statistical hypotheses presented in The Journal of Wildlife Management, which were mentioned above, and which involve properties of populations.
WHAT ARE THE ALTERNATIVES?
What should we do instead of testing hypotheses? As Quinn and Dunham (1983) pointed out, it is more fruitful to determine the relative importance of the contributions of, and interactions between, a number of processes. For this purpose, estimation is far more appropriate than hypothesis testing (Campbell 1992). For certain other situations, decision theory is an appropriate tool. For either of these applications, as well as for hypothesis testing itself, the Bayesian approach offers some distinct advantages over the traditional methods. These alternatives are briefly outlined below. Although the alternatives will not meet all potential needs, they do offer attractive choices in many frequently encountered situations.
Estimates and Confidence Intervals
Four decades ago, Anscombe (1956) observed that statistical hypothesis tests were totally irrelevant, and that what was needed were estimates of magnitudes of effects, with standard errors. Yates (1964) indicated that "The most commonly occurring weakness in the application of Fisherian methods is undue emphasis on tests of significance, and failure to recognize that in many types of experimental work estimates of the treatment effects, together with estimates of the errors to which they are subject, are the quantities of primary interest." Further, because wildlife ecologists want to influence management practices, Johnson (1995) noted that, "If ecologists are to be taken seriously by decision makers, they must provide information useful for deciding on a course of action, as opposed to addressing purely academic questions." To enforce that point, several education and psychology journals have adopted editorial policies requiring that parameter estimates accompany any P-values presented (McLean and Ernest 1998).
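As a sketch of the reporting style these authors advocate, the following computes an estimated treatment effect together with its standard error and an approximate 95% confidence interval; the data and the wetland scenario are invented for illustration.

```python
# Sketch: report the estimate and its uncertainty, not just a P-value.
# The measurements below are hypothetical.
import math
import statistics

treated = [6.1, 7.4, 5.9, 8.2, 6.8, 7.7]  # hypothetical treated wetlands
control = [5.2, 6.0, 4.8, 5.5, 6.3, 5.1]  # hypothetical control wetlands

effect = statistics.fmean(treated) - statistics.fmean(control)
se = math.sqrt(statistics.variance(treated) / len(treated)
               + statistics.variance(control) / len(control))
lo, hi = effect - 1.96 * se, effect + 1.96 * se  # normal-approximation 95% CI

print(f"estimated effect: {effect:.2f} (SE {se:.2f}, 95% CI {lo:.2f} to {hi:.2f})")
```

A reader of that single line learns the direction, magnitude, and precision of the effect, which no P-value alone conveys.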
Decision Theory
Often experiments or surveys are conducted to help make some decision, such as what limits to set on hunting seasons, whether a forest stand should be logged, or whether a pesticide should be approved. In those cases, hypothesis testing is inadequate, for it does not take into consideration the costs of alternative actions. Here a useful tool is statistical decision theory: the theory of acting rationally, with respect to anticipated gains and losses, in the face of uncertainty. Hypothesis testing generally limits the probability of a Type I error (rejecting a true null hypothesis), often arbitrarily set at α = 0.05, while letting the probability of a Type II error (accepting a false null hypothesis) fall where it may. In ecological situations, however, a Type II error may be far more costly than a Type I error (Toft and Shea 1983). As an example, approving a pesticide that reduces the survival rate of an endangered species by 5% may be disastrous to that species, even if that change is not statistically detectable. As another, continued overharvest in marine fisheries may result in the collapse of the ecosystem even while statistical tests are unable to reject the null hypothesis that fishing has no effect (Dayton 1998). Details on decision theory can be found in DeGroot (1970), Berger (1985), and Pratt et al. (1995).
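The pesticide example can be put in decision-theoretic terms: compute the expected loss of each action and choose the smaller. The probability of harm and the loss values in this sketch are invented; in practice they would come from the data and from the costs the manager actually faces.

```python
# Sketch: choose the action with the smaller expected loss.
# p_harm and all loss values are assumed for illustration.
p_harm = 0.2  # assumed probability that the pesticide is actually harmful

# Hypothetical losses (arbitrary units): a needless ban is cheap relative
# to the collapse of an endangered population.
loss = {
    ("approve", "harmful"): 1000.0,  # analogue of a Type II error: disastrous
    ("approve", "safe"): 0.0,
    ("reject", "harmful"): 0.0,
    ("reject", "safe"): 10.0,        # analogue of a Type I error: modest cost
}

expected_loss = {
    action: p_harm * loss[(action, "harmful")]
    + (1 - p_harm) * loss[(action, "safe")]
    for action in ("approve", "reject")
}

for action, e in expected_loss.items():
    print(f"{action}: expected loss = {e:.1f}")
print("rational choice:", min(expected_loss, key=expected_loss.get))
```

Here the expected loss of approval (200 units) dwarfs that of rejection (8 units), so rejection is rational even though a significance test on the survival data might well fail to detect the harm.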
Model Selection
Statistical tests can play a useful role in diagnostic checks and evaluations of tentative statistical models (Box 1980). But even for this application, competing tools are superior. Information criteria, such as Akaike's, provide objective measures for selecting among different models fitted to a dataset. Burnham and Anderson (1998) provided a detailed overview of model selection procedures based on information criteria. In addition, for many applications it is not advisable to select a "best" model and then proceed as if that model were correct. There may be a group of models entertained, and the data will provide a different strength of evidence for each model. Rather than basing decisions or conclusions on the single model most strongly supported by the data, one should acknowledge model uncertainty by considering the entire set of models, each perhaps weighted by its own strength of evidence (Buckland et al. 1997).
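A minimal sketch of this weighting idea, in the spirit of Buckland et al. (1997), follows; the candidate models, log-likelihoods, and parameter counts are hypothetical. Each model's AIC is 2k - 2 log L, and the Akaike weights rescale the AIC differences into relative strengths of evidence.

```python
# Sketch: AIC and Akaike weights for a set of candidate models.
# Model names, log-likelihoods, and parameter counts are hypothetical.
import math

models = [  # (name, maximized log-likelihood, number of parameters k)
    ("habitat", -102.3, 3),
    ("habitat + weather", -100.9, 5),
    ("intercept only", -110.4, 1),
]

aic = {name: 2 * k - 2 * loglik for name, loglik, k in models}
best = min(aic.values())

# Akaike weights: relative strength of evidence for each model
raw = {name: math.exp(-0.5 * (a - best)) for name, a in aic.items()}
total = sum(raw.values())
weights = {name: r / total for name, r in raw.items()}

for name, a in aic.items():
    print(f"{name:>18}: AIC = {a:6.1f}, weight = {weights[name]:.3f}")
```

Predictions can then be averaged across the whole set using these weights, rather than committed to whichever single model happens to score best.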
Bayesian Approaches
Bayesian approaches offer some alternatives preferable to the ordinary (often called frequentist, because they invoke the idea of the long-term frequency of outcomes in imagined repeats of experiments or samples) methods for hypothesis testing, as well as for estimation and decision-making. Space limitations preclude a detailed review of the approach here; see Box and Tiao (1973), Berger (1985), and Carlin and Louis (1996) for longer expositions, and Schmitt (1969) for an elementary introduction.
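As one elementary taste of the approach (my sketch, not an example from the cited texts), a conjugate Beta-Binomial update combines a prior with observed survival counts to give a full posterior distribution, from which a credible interval follows directly; the counts and the uniform Beta(1, 1) prior are assumptions of the example.

```python
# Sketch: Bayesian estimate of a survival rate via a conjugate
# Beta-Binomial update. Counts and the uniform prior are assumed.
import random
import statistics

random.seed(1)

survived, died = 46, 14        # hypothetical radio-tracking outcome
a, b = 1 + survived, 1 + died  # Beta(1, 1) prior -> Beta(a, b) posterior

# Sample the posterior with the standard library's beta generator
draws = sorted(random.betavariate(a, b) for _ in range(100_000))
mean = statistics.fmean(draws)
lo = draws[int(0.025 * len(draws))]
hi = draws[int(0.975 * len(draws))]

print(f"posterior mean survival rate: {mean:.3f}")
print(f"95% credible interval: ({lo:.3f}, {hi:.3f})")
```

The output is a direct probability statement about the parameter, which is usually what the investigator wanted from a significance test in the first place.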
CONCLUSIONS
Editors of scientific journals, along with the referees they rely on, are really the arbiters of scientific practice. They need to understand how statistical methods can be used to reach sound conclusions from data that have been gathered. It is not sufficient to insist that authors use statistical methods; the methods must be appropriate to the application. The most common and flagrant misuse of statistics, in my view, is the testing of hypotheses, especially the vast majority of them known beforehand to be false.

With the hundreds of articles already published that decry the use of statistical hypothesis testing, I was somewhat hesitant about writing another. It contains nothing new. But still, reading The Journal of Wildlife Management makes me realize that the message has not really reached the audience of wildlife biologists. Our work is important, so we should use the best tools we have available. Rarely, however, is that tool statistical hypothesis testing.