
Psychological Methods, 1996, Vol. 1, No. 2, 115-129

Copyright 1996 by the American Psychological Association, Inc. 1082-989X/96/$3.00

Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers

Frank L. Schmidt
University of Iowa

Data analysis methods in psychology still emphasize statistical significance testing, despite numerous articles demonstrating its severe deficiencies. It is now possible to use meta-analysis to show that reliance on significance testing retards the development of cumulative knowledge. But reform of teaching and practice will also require that researchers learn that the benefits that they believe flow from use of significance testing are illusory. Teachers must revamp their courses to bring students to understand that (a) reliance on significance testing retards the growth of cumulative research knowledge; (b) benefits widely believed to flow from significance testing do not in fact exist; and (c) significance testing methods must be replaced with point estimates and confidence intervals in individual studies and with meta-analyses in the integration of multiple studies. This reform is essential to the future progress of cumulative knowledge in psychological research.

In 1990, Aiken, West, Sechrest, and Reno published an important article surveying the teaching of quantitative methods in graduate psychology programs. They were concerned about what was not being taught or was being inadequately taught to future researchers and the harm this might cause to research progress in psychology. For example, they found that new and important quantitative methods such as causal modeling, confirmatory factor analysis, and meta-analysis were not being taught in the majority of graduate programs. This is indeed a legitimate cause for concern. But in this article, I am concerned about the opposite:

An earlier version of this article was presented as the presidential address to the Division of Evaluation, Measurement and Statistics (Division 5 of the American Psychological Association) at the 102nd Annual Convention of the American Psychological Association, August 13, 1994, Los Angeles, California. This article is largely drawn from work that John Hunter and I have done on meta-analysis over the years, and I would like to thank John Hunter for his comments on an earlier version of this article.

Correspondence concerning this article should be addressed to Frank L. Schmidt, Department of Management and Organizations, College of Business, University of Iowa, Iowa City, Iowa 52242. Electronic mail may be sent via Internet to [email protected].

what is being taught and the harm that this is doing. Aiken et al. found that the vast majority of programs were teaching, on a rather thorough basis, what they referred to as "the old standards of statistics": traditional inferential statistics. This includes the t test, the F test, the chi-square test, analysis of variance (ANOVA), and other methods of statistical significance testing. Hypothesis testing based on the statistical significance test has been the main feature of graduate training in statistics in psychology for over 40 years, and the Aiken et al. study showed that it still is.

Methods of data analysis and interpretation have a major effect on the development of cumulative knowledge. I demonstrate in this article that reliance on statistical significance testing in the analysis and interpretation of research data has systematically retarded the growth of cumulative knowledge in psychology (Hunter & Schmidt, 1990b; Schmidt, 1992). This conclusion is not new. It has been articulated in different ways by Rozeboom (1960), Meehl (1967), Carver (1978), Guttman (1985), Oakes (1986), Loftus (1991, 1994), and others, and most recently by Cohen (1994). Jack Hunter and I have used meta-analysis methods to show that these traditional data analysis methods militate against the discovery of the underlying regularities and relationships that are the


foundation for scientific progress (Hunter & Schmidt, 1990b). Those of us who are the keepers of the methodological and quantitative flame for the field of psychology bear the major responsibility for this failure because we have continued to emphasize significance testing in the training of graduate students despite clear demonstrations of the deficiencies of this approach to data analysis. We correctly decry the fact that quantitative methods are given inadequate attention in graduate programs, and we worry that this signals a future decline in research quality. Yet it was our excessive emphasis on so-called inferential statistical methods that caused a much more serious problem. And we ignore this fact.

My conclusion is that we must abandon the statistical significance test. In our graduate programs we must teach that for analysis of data from individual studies, the appropriate statistics are point estimates of effect sizes and confidence intervals around these point estimates. We must teach that for analysis of data from multiple studies, the appropriate method is meta-analysis. I am not the first to reach the conclusion that significance testing should be replaced by point estimates and confidence intervals. Jones stated this conclusion as early as 1955, and Kish in 1959. Rozeboom reached this conclusion in 1960. Carver stated this conclusion in 1978, as did Hunter in 1979 in an invited American Psychological Association (APA) address, and Oakes in his excellent 1986 book. So far, these individuals (and others) have all been voices crying in the wilderness.

Why then is the situation any different today? If the closely reasoned and logically flawless arguments of Kish, Rozeboom, Carver, and Hunter have been ignored all these years—and they have—what reason is there to believe that this will not continue to be the case? There is in fact a reason to be optimistic that in the future we will see reform of data analysis methods in psychology. That reason is the development and widespread use of meta-analysis methods. These methods have revealed more clearly than ever before the extent to which reliance on significance testing has retarded the growth of cumulative knowledge in psychology. These demonstrations based on meta-analysis methods are what is new. As conclusions from research literature come more and more to be based on findings from meta-analysis (Cooper & Hedges, 1994; Lipsey & Wilson, 1993;

Schmidt, 1992), the significance test necessarily becomes less and less important. At worst, significance tests will become progressively deemphasized. At best, their use will be discontinued and replaced in individual studies by point estimation of effect sizes and confidence intervals.

The reader's reaction to this might be that this is just one opinion and that there are defenses of statistical significance testing that are as convincing as the arguments and demonstrations I present in this article. This is not true. As Oakes (1986) stated, it is "extraordinarily difficult to find a statistician who argues explicitly in favor of the retention of significance tests" (p. 71). A few psychologists have so argued. But Oakes (1986) and Carver (1978) have carefully considered all such arguments and shown them to be logically flawed and hence false. Also, even these few defenders of significance testing (e.g., Winch & Campbell, 1969) agree that the dominant usages of such tests in data analysis in psychology are misuses, and they hold that the role of significance tests in data analysis should be greatly reduced. As you read this article, I want you to consider this challenge: Can you articulate even one legitimate contribution that significance testing has made (or makes) to the research enterprise (i.e., any way in which it contributes to the development of cumulative scientific knowledge)? I believe you will not be able to do so.

Traditional Methods Versus Meta-Analysis

Psychology and the other social sciences have traditionally relied heavily on the statistical significance test in interpreting the meaning of data, both in individual studies and in research literature. Following the fateful lead of Fisher (1932), null hypothesis significance testing has been the dominant data analysis procedure. The prevailing decision rule, as Oakes (1986) has demonstrated empirically, has been this: If the statistic (t, F, etc.) is significant, there is an effect (or a relation); if it is not significant, then there is no effect (or relation). These prevailing interpretational procedures have focused heavily on the control of Type I errors, with little attention being paid to the control of Type II errors. A Type I error (alpha error) consists of concluding that there is a relation or an effect when there is not. A Type II error (beta error) consists of the opposite, concluding


[Figure 1. Null distribution of d values in a series of experiments. Total N = 30; SE = .38. Required for significance: d_c = 1.645(0.38) = 0.62 (one-tailed test, α = .05).]

that there is no relation or effect when there is. Alpha levels have been controlled at the .05 or .01 levels, but beta levels have by default been allowed to climb to high levels, often in the 50% to 80% range (Cohen, 1962, 1988, 1990, 1994; Schmidt, Hunter, & Urry, 1976). To illustrate this, let us look at an example from a hypothetical but statistically typical area of experimental psychology.

Suppose the research question is the effect of a certain drug on learning, and suppose the actual effect of a particular dosage is an increase of one half of a standard deviation in the amount learned. An effect size of .50 is considered medium-sized by Cohen (1988) and corresponds to the difference between the 50th and 69th percentiles in a normal distribution. With an effect size of this magnitude, 69% of the experimental group would exceed the mean of the control group, if both were normally distributed. Many reviews of various literatures have found relations of this general magnitude (Hunter & Schmidt, 1990b). Now suppose that a large number of studies are conducted on this dosage, each with 15 rats in the experimental group and 15 in the control group.

Figure 1 shows the distribution of effect sizes (d values) expected under the null hypothesis. All variability around the mean value of zero is due to sampling error. To be significant at the .05 level (with a one-tailed test), the effect size must be .62 or larger. If the null hypothesis is true, only 5% will be that large or larger. In analyzing their data, researchers in psychology typically focus only on the information in Figure 1. Most believe that their significance test limits the probability of an error to 5%.

Actually, in this example the probability of a Type I error is zero, not 5%. Because the actual effect size is always .50, the null hypothesis is always false, and therefore there is no possibility of a Type I error. One cannot falsely conclude that there is an effect when in fact there is an effect. When the null hypothesis is false, the only kind of error that can occur is a Type II error: failure to detect the effect that is present (and the total error rate for the study is therefore the Type II error rate). The only type of error that can occur is the type that is not controlled.

Figure 2 shows not only the irrelevant null distribution but also the actual distribution of effect sizes across these studies. The mean of this distribution is the true value of .50, but because of sampling error, there is substantial variation in observed effect sizes. Again, to be significant, the effect size must be .62 or larger. Only 37% of studies conducted will obtain a significant effect size; thus statistical power for each of these studies is only .37. That is, the true (population) effect size of the drug is always .50; it is never zero. Yet it is only detected as significant in 37% of the studies. The error rate in this research literature is 63%, not 5%, as many would mistakenly believe.
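
To make the arithmetic concrete, the critical value and power figures above can be reproduced with a few lines of code. The following Python sketch is purely illustrative: it takes the standard error of .38 reported in Figure 1 as given rather than deriving it, and it uses the normal approximation.

```python
# Hedged sketch: reproduce the critical d value and power shown in Figures 1
# and 2, taking SE = .38 (the article's figure for n = 15 per group) as given.
from scipy.stats import norm

SE = 0.38        # standard error of d reported in Figure 1
true_d = 0.50    # population effect size assumed in the example
alpha = 0.05     # one-tailed significance level

# Critical d value under the null hypothesis: 1.645 * .38, about .62
d_crit = norm.ppf(1 - alpha) * SE

# Power: probability that an observed d exceeds the critical value
# when the true effect is .50
power = 1 - norm.cdf((d_crit - true_d) / SE)

print(f"critical d = {d_crit:.3f}")   # about 0.62
print(f"power = {power:.2f}")         # about 0.37; Type II error rate about 63%
```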

In actuality, the error rate would be even higher. Most researchers in experimental psychology would traditionally have used F tests from an ANOVA to analyze these data. This means the significance test would be two-tailed rather than one-tailed as in our example. With a two-tailed

[Figure 2. Statistical power in a series of experiments. N_E = N_C = 15; total N = 30; δ = .50; SE = .38. Required for significance: d_c = 0.62 (one-tailed test, α = .05); statistical power = .37; Type II error rate = 63%; Type I error rate = 0%.]


test (i.e., one-way ANOVA), statistical power is even lower: .26 instead of .37. The Type II error rate (and hence the overall error rate) would be 74%. Also, this example assumes use of a z test; any researchers not using an ANOVA would probably use a t test. For a one-tailed t test with α = .05 and df = 28, the effect size (d value) must be .65 to be significant. (The t value must be at least 1.70, instead of the 1.645 required for the z test.) With the t test, statistical power would also be lower: .35 instead of .37. Thus both commonly used alternative significance tests would yield even lower statistical power and produce even higher error rates.

Also, note in Figure 2 that the studies that are significant yield distorted estimates of effect sizes. The true effect size is always .50; all departures from .50 are due solely to sampling error. But the minimum value required for significance is .62. The obtained d value must be .12 above its true value—24% larger than its real value—to be significant. The average of the significant d values is .89, which is 78% larger than the true value of .50.
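
The remaining figures in this example (the two-tailed power of .26, the t-test power of .35, and the mean significant d of .89) can be checked the same way. The sketch below again treats SE = .38 as given and relies on the normal approximation, so the values are approximate.

```python
# Hedged sketch: two-tailed and t-test power, and the mean of the d values
# that reach significance, using the same SE = .38 normal approximation.
from scipy.stats import norm, t as t_dist

SE, true_d = 0.38, 0.50

# Two-tailed z test at alpha = .05: critical d is about 1.96 * .38 = .74
power_two_tailed = 1 - norm.cdf((norm.ppf(0.975) * SE - true_d) / SE)      # about .26

# One-tailed t test with df = 28: critical t is about 1.70, so critical d is about .65
power_t_test = 1 - norm.cdf((t_dist.ppf(0.95, df=28) * SE - true_d) / SE)  # about .35

# Mean of the significant d values: mean of a normal distribution truncated
# at the one-tailed critical value of about .62
d_crit = norm.ppf(0.95) * SE
z = (d_crit - true_d) / SE
mean_significant_d = true_d + SE * norm.pdf(z) / (1 - norm.cdf(z))         # about .89

print(round(power_two_tailed, 2), round(power_t_test, 2), round(mean_significant_d, 2))
```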

In any study in this research literature that by chance yields the correct value of .50, the conclusion under the prevailing decision rule is that there is no relationship. That is, it is only the studies that by chance are quite inaccurate that lead to the correct conclusion that a relationship exists.

How would this body of studies be interpreted as a research literature? There are two interpretations that would have traditionally been accepted. The first is based on the traditional voting method (critiqued by Light & Smith, 1971, and by Hedges & Olkin, 1980). Using this method, one would note that 63% of the studies found "no relationship." Since this is a majority of the studies, the conclusion would be that no relation exists. This conclusion is completely false, yet many reviews in the past have been conducted in just this manner (Hedges & Olkin, 1980).

The second interpretation is as follows: In 63% of the studies, the drug had no effect. However, in 37% of the studies, the drug did have an effect. (Moreover, when the drug did have an effect, the effect was quite large, averaging .89.) Research is needed to identify the moderator variables (interactions) that cause the drug to have an effect in some studies but not in others. For example, perhaps the strain of rat used or the mode of injecting the drug affects study outcomes.

[Figure 3. Meta-analysis of drug studies.
I. Compute actual variance of effect sizes:
   1. S²_d = .1444 (observed variance of d values)
   2. S²_e = .1444 (variance predicted from sampling error)
   3. S²_δ = .1444 - .1444 = 0 (true variance of δ values)
II. Compute mean effect size:
   1. Mean observed d value = .50
   2. δ = .50 (estimated population effect size)
   3. SD_δ = 0
III. Conclusion: There is only one effect size, and its value is .50 standard deviation.]

This interpretation is also completely erroneous. In addition, it leads to wasted research efforts to identify nonexistent moderator variables.

Both traditional interpretations fail to reveal the true meaning of the studies and hence fail to lead to cumulative knowledge. In fact, the traditional methods based on significance testing make it impossible to reach correct conclusions about the meaning of these studies. This is what is meant by the statement that traditional data analysis methods militate against the development of cumulative knowledge.

How would meta-analysis interpret these studies? Different approaches to meta-analysis use somewhat different quantitative procedures (Bangert-Drowns, 1986; Glass, McGaw, & Smith, 1981; Hedges & Olkin, 1985; Hunter, Schmidt, & Jackson, 1982; Hunter & Schmidt, 1990b; Rosenthal, 1984, 1991). I illustrate this example using the methods presented by Hunter, Schmidt, and Jackson (1982) and Hunter and Schmidt (1990b). Figure 3 shows that meta-analysis reaches the correct conclusion. Meta-analysis first computes the variance of the observed d values (using the ordinary formula for the variance of any set of numbers). Next, it uses the standard formula for the sampling error variance of d values (e.g., see


Hunter & Schmidt, 1990b, chap. 7) to determine how much variance would be expected in observed d values from sampling error alone. The amount of real variance in population d values (δ values) is estimated as the difference between the two. In our example, this difference is zero, indicating correctly that there is only one population value. This single population value is estimated as the average observed value, which is .50 here, the correct value. If the number of studies is large, the average d value will be close to the true (population) value, because sampling errors are random and hence average out to zero.¹
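
The logic of this calculation can be written out in a few lines. The sketch below is a bare-bones illustration with hypothetical d values; it uses a common large-sample approximation for the sampling error variance of d, whereas Hunter and Schmidt (1990b, chap. 7) give the exact formulas.

```python
# Bare-bones meta-analysis of d values (illustrative sketch, not the exact
# Hunter-Schmidt procedure): subtract the expected sampling error variance
# from the observed variance and estimate the mean population effect size.
import numpy as np

def bare_bones_meta_d(d_values, n1, n2):
    d = np.asarray(d_values, dtype=float)
    mean_d = d.mean()                  # estimate of the population effect size
    observed_var = d.var()             # ordinary variance of the observed d values
    # Approximate sampling error variance of a single study's d
    sampling_var = (n1 + n2) / (n1 * n2) + mean_d ** 2 / (2 * (n1 + n2))
    true_var = max(observed_var - sampling_var, 0.0)   # estimated variance of delta
    return mean_d, true_var

# Hypothetical observed d values from five studies with 15 rats per group
d_values = [0.62, 0.44, 0.89, 0.30, 0.55]
mean_d, true_var = bare_bones_meta_d(d_values, 15, 15)
print(round(mean_d, 2), round(true_var, 4))
# -> 0.56 0.0: all of the observed variation is attributable to sampling error
```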

Note that these meta-analysis methods do not rely on statistical significance tests. Only effect sizes are used, and significance tests are not used in analyzing the effect sizes. Unlike traditional methods based on significance tests, meta-analysis leads to correct conclusions and hence leads to cumulative knowledge.

The data in this example are hypothetical. However, if one accepts the validity of basic statistical formulas for sampling error, one will have no reservations about this example. But the same principles do apply to real data, as shown next by an example from research in personnel selection. Table 1 shows observed validity coefficients (correlations) from 21 studies of a single clerical test and a single measure of job performance. Each study has n = 68 (the median n in the literature in personnel psychology), and every study is a random draw (without replacement) from a single larger validity study with 1,428 subjects. The correlation in the large study (uncorrected for measurement error, range restriction, or other artifacts) is .22 (Schmidt, Ocasio, Hillery, & Hunter, 1985).

The validity is significant in 8 (or 38%) of these studies, for an error rate of 62%. The traditional conclusion would be that this test is valid in 38% of the organizations and invalid in the rest, and that in organizations in which it is valid, its mean observed validity is .33 (which is 50% larger than its real value). Meta-analysis of these validities indicates that the mean is .22 and that all variance in the coefficients is due solely to sampling error. The meta-analysis conclusions are correct; the traditional conclusions are false.
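
Applied to the 21 correlations in Table 1, the same logic can be sketched as follows; the sampling error variance formula used here is the standard first-order approximation, so the result is illustrative rather than an exact reproduction of the published analysis.

```python
# Bare-bones meta-analysis of the observed validity coefficients in Table 1
# (N = 68 in each study). Illustrative sketch only.
import numpy as np

r = np.array([.04, .14, .31, .12, .38, .27, .15, .36, .20, .02, .23,
              .11, .21, .37, .14, .29, .26, .17, .39, .22, .21])
N = 68

mean_r = r.mean()                                 # about .22
observed_var = r.var()
sampling_var = (1 - mean_r ** 2) ** 2 / (N - 1)   # variance expected from sampling error alone
true_var = max(observed_var - sampling_var, 0.0)  # 0: no real variation across studies

print(round(mean_r, 2), round(true_var, 4))       # -> 0.22 0.0
```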

In these examples, the only type of error that is controlled—Type I error—is the type that cannot occur. In most areas of research, as time goes by, researchers gain a better and better understanding

Table 1
21 Validity Studies, N = 68 for Each

Study    Observed validity correlation
  1      .04
  2      .14
  3      .31*
  4      .12
  5      .38*
  6      .27*
  7      .15
  8      .36*
  9      .20
 10      .02
 11      .23
 12      .11
 13      .21
 14      .37*
 15      .14
 16      .29*
 17      .26*
 18      .17
 19      .39*
 20      .22
 21      .21

* p < .05 (two tailed).

of the processes they are studying; as a result, it is less and less frequently the case that the null hypothesis is "true" and more and more likely that the null hypothesis is false. Thus Type I error decreases in importance, and Type II error increases in importance. This means that as time goes by, researchers should be paying increasing attention to Type II error and to statistical power and increasingly less attention to Type I error. However, a recent review in Psychological Bulletin (Sedlmeier & Gigerenzer, 1989) concluded that the average statistical power of studies in one APA journal had declined from 46% to 37% over a 22-

¹ Actually the d statistic has a slight positive (upward) bias as the estimator of δ, the population value. Formulas are available to correct observed d values for this bias and are given in Hedges and Olkin (1985, p. 81) and Hunter and Schmidt (1990b, p. 262). This example assumes that this correction has been made. This bias is trivial if the sample size is 10 or greater in both the experimental and control groups.


year period (despite the earlier appeal in that journal by Cohen in 1962 for attention to statistical power). Only 2 of the 64 experiments reviewed even mentioned statistical power, and none computed estimates of power. The review concluded that the decline in power was due to increased use of alpha-adjusted procedures (such as the Newman-Keuls, Duncan, and Scheffé procedures). That is, instead of attempting to reduce the Type II error rate, researchers had been imposing increasingly stringent controls on Type I errors, which probably cannot occur in most studies. The result is a further increase in the Type II error rate, an average increase of 17%. This trend illustrates the deep illogic embedded in the use of significance tests.

These examples have examined only the effects of sampling error. There are other statistical and measurement artifacts that cause artifactual variation in effect sizes and correlations across studies, for example, differences between studies in amount of measurement error, in degree of range restriction, and in dichotomization of measures. Also, in meta-analysis, d values and correlations must be corrected for downward bias due to such research artifacts as measurement error and dichotomization of measures. These artifacts are beyond the scope of this presentation but are covered in detail elsewhere (Hunter & Schmidt, 1990a, 1990b; Schmidt & Hunter, 1996). My purpose here is to demonstrate only that traditional data analysis and interpretation methods logically lead to erroneous conclusions and to demonstrate that meta-analysis solves these problems at the level of aggregate research literatures.

For almost 50 years, reliance on statistical significance testing in psychology and the other social sciences has led to frequent serious errors in interpreting the meaning of data (Hunter & Schmidt, 1990b, pp. 29-42 and 483-484), errors that have systematically retarded the growth of cumulative knowledge. Despite the best efforts of such individuals as Kish (1959), Rozeboom (1960), Meehl (1967), Carver (1978), Hunter (1979), Guttman (1985), and Oakes (1986), it has not been possible to wean researchers away from their entrancement with significance testing. Can we now at least hope that the lessons from meta-analysis will finally stimulate change? I would like to answer in the affirmative, but later in this article I present reasons why I do not believe these demonstrations alone will be sufficient to bring about reform. Other steps are also needed.

In my introduction, I state that the appropriate method for analyzing data from multiple studies is meta-analysis. These two examples illustrate that point dramatically. I also state that the appropriate way to analyze data in a single study is by means of point estimation of the effect size and use of a confidence interval around this point estimate. If this had been done in the studies in these two examples, what would these two research literatures have looked like prior to application of meta-analysis?

In the first example, from the experimental psychology literature, the traditional practice would have been to report only the F statistic values and their associated significance levels. Anyone looking at this literature would see that 26% of these F ratios are significant and 74% are nonsignificant. This would create at best the impression of a contradictory set of studies. With appropriate data analysis methods, the observed d value is computed in each study; this is the point estimate of the effect size. Anyone looking at this literature would quickly see that the vast majority of these effect sizes—91%—are positive. This gives a very different and much more accurate impression than does the observation that 74% of the effects are nonsignificant. Next, the confidence interval around each effect size is computed and presented. A glance at these confidence intervals would reveal that almost all of them overlap with almost all of the other confidence intervals. This again correctly suggests that the studies are in substantial agreement, contrary to the false impression given by the traditional information that 26% are significant and 74% are nonsignificant. (These studies use simple one-way ANOVA designs; however, d values and confidence intervals can also be computed when factorial ANOVA designs or repeated measures designs are used.)
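
For readers who want to see what this single-study analysis looks like in practice, the following sketch computes a d value and its 95% confidence interval from group summary statistics. The group means, standard deviations, and sample sizes are hypothetical, not data from the article.

```python
# Point estimate of d with a 95% confidence interval for one study
# (illustrative sketch; the group statistics below are hypothetical).
import math
from scipy.stats import norm

def d_with_ci(mean_e, mean_c, sd_e, sd_c, n_e, n_c, conf=0.95):
    # Standardized mean difference based on the pooled within-group SD
    sd_pooled = math.sqrt(((n_e - 1) * sd_e**2 + (n_c - 1) * sd_c**2) / (n_e + n_c - 2))
    d = (mean_e - mean_c) / sd_pooled
    # Large-sample standard error of d
    se = math.sqrt((n_e + n_c) / (n_e * n_c) + d**2 / (2 * (n_e + n_c)))
    z = norm.ppf(0.5 + conf / 2)
    return d, (d - z * se, d + z * se)

d, ci = d_with_ci(mean_e=52.0, mean_c=48.0, sd_e=8.0, sd_c=8.0, n_e=15, n_c=15)
print(round(d, 2), tuple(round(x, 2) for x in ci))   # -> 0.5 (-0.23, 1.23)
```

A wide interval such as this one conveys directly how little a single small-sample study can establish on its own, which is precisely the information a lone p value conceals.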

To see this point more clearly, let us consider again the observed correlations in Table 1. The observed correlation is an index of effect size, and therefore in a truly traditional analysis it would not be reported; only significance levels would be reported. So all we would know is that in 62% of the studies there was "no significant relationship," and in 38% of the studies there


Table 2
95% Confidence Intervals for Correlations From Table 1, N = 68 for Each

                                 95% confidence interval
Study    Observed correlation     Lower     Upper
  1      .39                       .19       .59
  2      .38                       .18       .58
  3      .37                       .16       .58
  4      .36                       .15       .57
  5      .31                       .09       .53
  6      .29                       .07       .51
  7      .27                       .05       .49
  8      .26                       .04       .48
  9      .23                       .00       .46
 10      .22                      -.01       .45
 11      .21                      -.02       .44
 12      .21                      -.02       .44
 13      .20                      -.03       .43
 14      .17                      -.06       .40
 15      .15                      -.08       .38
 16      .14                      -.09       .37
 17      .14                      -.09       .37
 18      .12                      -.12       .36
 19      .11                      -.13       .35
 20      .04                      -.20       .28
 21      .02                      -.22       .26

was a significant relationship. Table 2 shows the information that would be provided by use of point estimates of effect size and confidence intervals. In Table 2, the observed correlations are arranged in order of size with their 95% confidence intervals.

The first thing that is obvious is that all the correlations are positive. It can also be seen that every confidence interval overlaps every other confidence interval, indicating that these studies could all be estimating the same population parameter, which indeed they are. This is true for even the largest and smallest correlations. The confidence interval for the largest correlation (.39) is .19 to .59. The confidence interval for the smallest correlation (.02) is -.22 to .26. These confidence intervals have an overlap of .07. Thus in contrast to the picture provided by null hypothesis significance testing, point estimates and confidence intervals provide a much more correct picture, a picture

that correctly indicates substantial agreement among the studies.²

There are also other reasons for preferring confidence intervals (see Carver, 1978; Hunter & Schmidt, 1990b, pp. 29-33; Kish, 1959; Oakes, 1986, p. 67; and Rozeboom, 1960). One important reason is that, unlike the significance test, the confidence interval does hold the overall error rate to the desired level. In the example from experimental psychology, we saw that many researchers believed that the error rate for the significance test was held to 5% because the alpha level used was .05, when in fact the error rate was really 63% (74% if F tests from an ANOVA are used). However, if the 95% confidence interval is used, the overall error rate is in fact held to 5%. Only 5% of such computed confidence intervals will be expected to not include the population (true) effect size, and 95% will.
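
As a concrete check, the intervals in Table 2 can be reproduced from the standard error formula given in Footnote 2. The sketch below does this for the largest and smallest correlations; it is illustrative only.

```python
# Reproduce the 95% confidence intervals in Table 2 from the formula in
# Footnote 2: SE = (1 - r^2) / sqrt(N - 1), with symmetric limits.
import math

def r_confidence_interval(r, n, z=1.96):
    se = (1 - r ** 2) / math.sqrt(n - 1)
    return round(r - z * se, 2), round(r + z * se, 2)

print(r_confidence_interval(.39, 68))   # (0.19, 0.59), the largest correlation
print(r_confidence_interval(.02, 68))   # (-0.22, 0.26), the smallest correlation
```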

To many researchers today, the idea of substituting point estimates and confidence intervals for significance tests might seem radical. Therefore it is important to remember that prior to the appearance of Fisher's 1932 and 1935 texts, data analysis in individual studies was typically conducted using point estimates and confidence intervals (Oakes, 1986). The point estimates were usually accompanied by estimates of the "probable error," the 50% confidence interval. Significance tests were rarely used (and confidence intervals were not interpreted in terms of statistical significance). Most of us rarely look at the psychological journals of the 1920s and early 1930s, but if we did, this is what we would see. As can be seen both in the psychology research journals and in psychology statistics textbooks, during the latter half of the 1930s and during the 1940s, under the influence

² The confidence intervals in Table 2 have been computed using the usual formula for the standard error of the sample correlation: SE = (1 - r²)/√(N - 1). Hence these confidence intervals are symmetrical. Some would advocate the use of Fisher's Z transformation of r in computing confidence intervals for r. This position is typically based on the belief that the sampling distribution of Fisher's Z transformation of r is normally distributed, while r itself is skewed. Actually, both are skewed and both approach normality as N increases, and Fisher's Z transformation approaches normality only marginally faster than r as N increases. For a population correlation of .22 and sample sizes in the ranges considered here, the differences are trivial.


of Fisher, psychological researchers adopted en masse Fisher's null hypothesis significance testing approach to analysis of data in individual studies (Huberty, 1993). This was a major mistake. It was Sir Ronald Fisher who led psychological researchers down the primrose path of significance testing. All the other social sciences were similarly deceived, as were researchers in medicine, finance, marketing, and other areas.

Fisher's influence not only explains this unfortunate change, it also suggests one reason why psychologists for so long gave virtually no attention to the question of statistical power. The concept of statistical power does not exist in Fisherian statistics. In Fisherian statistics, the focus of attention is solely on the null hypothesis. No alternative hypothesis is introduced. Without an alternative hypothesis, there can be no concept of statistical power. When Neyman and Pearson (1932, 1933) later introduced the concepts of the alternate hypothesis and statistical power, Fisher argued that statistical power was irrelevant to statistical significance testing as used in scientific inference (Oakes, 1986). We have seen in our two examples how untrue that statement is.

Thus it is clear that even if meta-analysis had never been developed, use of point estimates of effect size and confidence intervals in interpreting data in individual studies would have made our research literatures far less confusing, far less apparently contradictory, and far more informative than those that have been produced by the dominant practice of reliance on significance tests. Indeed, the fact of almost universal reliance on significance tests in data analysis in individual studies is a major factor in making meta-analysis absolutely essential to making sense of research literatures (Hunter & Schmidt, 1990b, chap. 1).

However, it is important to understand that meta-analysis would still be useful even had researchers always relied only on point estimates and confidence intervals, because the very large numbers of studies characteristic of many of today's literatures create information overload even if each study has been appropriately analyzed. Indeed, we saw earlier that applying meta-analysis to the studies in Table 1 produces an even clearer and more accurate picture of the meaning of these studies than the application of point estimates and confidence intervals shown in Table 2. In our example, meta-analysis tells us that there is only one population correlation and that that value is .22. Confidence intervals tell us only that there may be only one population value; they do not specify what that value might be. In addition, in more complex applications, meta-analysis makes possible corrections for the effects of other artifacts—both systematic and unsystematic—that bias effect size estimates and cause false variation in such estimates across studies (Hunter & Schmidt, 1990b).

Consequences of Traditional Significance Testing

As we have seen, traditional reliance on statistical significance testing leads to the false appearance of conflicting and internally contradictory research literatures. This has a debilitating effect on the general research effort to develop cumulative theoretical knowledge and understanding. However, it is also important to note that it destroys the usefulness of psychological research as a means for solving practical problems in society.

The sequence of events has been much the same in one applied research area after another. First, there is initial optimism about using social science research to answer socially important questions that arise. Do government-sponsored job-training programs work? One will do studies to find out. Does integration increase the school achievement of Black children? Research will provide the answer. Next, several studies on the question are conducted, but the results are conflicting. There is some disappointment that the question has not been answered, but policymakers—and people in general—are still optimistic. They, along with the researchers, conclude that more research is needed to identify the interactions (moderators) that have caused the conflicting findings. For example, perhaps whether job training works depends on the age and education of the trainees. Maybe smaller classes in the schools are beneficial only for lower IQ children. Researchers may hypothesize that psychotherapy works for middle-class patients but not lower-class patients.

In the third phase, a large number of research studies are funded and conducted to test these moderator hypotheses. When they are completed, there is now a large body of studies, but instead


of being resolved, the number of conflicts increases. The moderator hypotheses from the initial studies are not borne out. No one can make much sense out of the conflicting findings. Researchers conclude that the phenomenon that was selected for study in this particular case has turned out to be hopelessly complex, and so they turn to the investigation of another question, hoping that this time the question will turn out to be more tractable. Research sponsors, government officials, and the public become disenchanted and cynical. Research funding agencies cut money for research in this area and in related areas. After this cycle has been repeated enough times, social and behavioral scientists themselves become cynical about the value of their own work, and they begin to express doubts about whether behavioral and social science research is capable in principle of developing cumulative knowledge and providing general answers to socially important questions (e.g., see Cronbach, 1975; Gergen, 1982; Meehl, 1978). Cronbach's (1975) article "The Two Disciplines of Scientific Psychology Revisited" is a clear statement of this sense of hopelessness.

Clearly, at this point the need is not for more primary research studies but for some means of making sense of the vast number of accumulated study findings. This is the purpose of meta-analysis. Applications of meta-analysis to accumulated research literatures have generally shown that research findings are not nearly as conflicting as we had thought and that useful general conclusions can be drawn from past research. I have summarized some of these findings in a recent article (Schmidt, 1992; see also Hunter & Schmidt, in press). Thus, socially important applied questions can be answered.

Even more important, it means that scientific progress is possible. It means that cumulative understanding and progress in theory development is possible after all. It means that the behavioral and social sciences can attain the status of true sciences; they are not doomed forever to the status of quasi-sciences or pseudosciences. One result of this is that the gloom, cynicism, and nihilism that have enveloped many in the behavioral and social sciences is lifting. Young people starting out in the behavioral and social sciences today can hope for a much brighter future.

These are among the considerable benefits of abandoning statistical significance testing in favor of point estimates of effect sizes and confidence intervals in individual studies and meta-analysis for combining findings across multiple studies.

Is Statistical Power the Solution?

So far in this article, the deficiencies of significance testing that I have emphasized are those stemming from low statistical power in typical studies. Significance testing has other important problems, and I discuss some of these later. However, in our work on meta-analysis methods, John Hunter and I have repeatedly been confronted by researchers who state that the only problem with significance testing is low power and that if this problem could be solved, there would be no problems with reliance on significance testing in data analysis and interpretation. Almost invariably, these individuals see the solution as larger sample sizes. They believe that the problem would be solved if every researcher before conducting each study would calculate the number of subjects needed for "adequate" power (usually taken as power of .80), given the expected effect size and the desired alpha level, and then use that sample size.

What this position overlooks is that this requirement would make it impossible for most studies ever to be conducted. At the inception of research in a given area, the questions are often of the form, "Does Treatment A have an effect?" If Treatment A indeed has a substantial effect, the sample size needed for adequate power may not be prohibitively large. But as research develops, subsequent questions tend to take the form, "Does Treatment A have a larger effect than does Treatment B?" The effect size then becomes the difference between the two effects. A similar progression occurs in correlational research. Such effect sizes will often be much smaller, and the required sample sizes are therefore often quite large, often 1,000 or more (Schmidt & Hunter, 1978). This is just to attain power of .80, which still allows a 20% Type II error rate when the null hypothesis is false. Many researchers cannot obtain that many subjects, no matter how hard they try; either it is beyond their resources or the subjects are just unavailable at any cost. Thus the upshot of this position would be that many—perhaps most—studies would not be conducted at all.
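
The sample size arithmetic behind this point can be sketched with the usual normal approximation for a two-group comparison at α = .05 (two-tailed) and power of .80. The specific effect sizes below are illustrative, not taken from Schmidt and Hunter (1978).

```python
# Approximate per-group sample size for power = .80, alpha = .05 (two-tailed),
# two independent groups. Illustrative sketch using the normal approximation.
import math

def n_per_group(d, z_alpha=1.96, z_beta=0.84):
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

print(n_per_group(0.50))   # about 63 per group for a medium-sized main effect
print(n_per_group(0.15))   # about 697 per group (roughly 1,400 subjects) for a
                           # small difference between two treatments
```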


People advocating the position being critiqued here would say this would be no loss at all. They argue that a study with inadequate power contributes nothing and therefore should not be conducted. But such studies do contain valuable information when combined with others like them in a meta-analysis. In fact, very precise meta-analysis results can be obtained on the basis of studies that all have inadequate statistical power individually. The information in these studies is lost if these studies are never conducted.

The belief that such studies are worthless is based on two false assumptions: (a) the assumption that each individual study must be able to support and justify a conclusion, and (b) the assumption that every study should be analyzed with significance tests. In fact, meta-analysis has made clear that any single study is rarely adequate by itself to answer a scientific question. Therefore each study should be considered as a data point to be contributed to a later meta-analysis, and individual studies should be analyzed using not significance tests but point estimates of effect sizes and confidence intervals.

How, then, can we solve the problem of statistical power in individual studies? Actually, this problem is a pseudoproblem. It can be "solved" by discontinuing the significance test. As Oakes (1986, p. 68) noted, statistical power is a legitimate concept only within the context of statistical significance testing. If significance testing is no longer used, then the concept of statistical power has no place and is not meaningful. In particular, there need be no concern with statistical power when point estimates and confidence intervals are used to analyze data in studies and when meta-analysis is used to integrate findings across studies.³ Thus when there is no significance testing, there are no statistical power problems.

Why Are Researchers Addicted to Significance Testing?

Time after time, even in recent years, I have seen researchers who have learned to understand the deceptiveness of significance testing sink back into the habit of reliance on significance testing. I have occasionally done it myself. Why is it so hard for us to break our addiction to significance testing? Methodologists such as Bakan (1966), Meehl (1967), Rozeboom (1960), Oakes (1986), Carver (1978), and others have explored the various possible reasons why researchers seem to be unable to give up significance testing.

Significance testing creates an illusion of objectivity, and objectivity is a critical value in science. But objectivity makes a negative contribution when it sabotages the research enterprise by making it impossible to reach correct conclusions about the meaning of data.

Researchers conform to the dominant practice of reliance on significance testing because they fear that failure to follow these conventional practices would cause their studies to be rejected by journal editors. But the solution to this problem is not conformity to counterproductive practices but education of editors and reviewers.

There is also a feeling that, as bad as significance testing is, there is no satisfactory alternative; just looking at the data and making interpretations will not do. But as we have seen, there is a good statistical alternative: point estimates and confidence intervals.

However, I do not believe that these and similar reasons are the whole story. An important part of the explanation is that researchers hold false beliefs about significance testing, beliefs that tell them that significance testing offers important benefits to researchers that it in fact does not. Three of these beliefs are particularly important.

The first is the false belief that the significance level of a study indicates the probability of successful replication of the study. Oakes (1986, pp. 79-82) empirically studied the beliefs about the meaning of significance tests of 70 research psychologists and advanced graduate students. They were presented with the following scenario:

Suppose you have a treatment which you suspect may alter performance on a certain task. You compare the means of your control and experimental

³ Some state that confidence intervals are the same as significance tests, because if the confidence interval does not include zero, that fact indicates that the effect size estimate is statistically significant. But the fact that the confidence interval can be interpreted as a significance test does not mean that it must be so interpreted. There is no necessity for such an interpretation, and as noted earlier, the probable errors (50% confidence intervals) popularly used in the literature up until the mid 1930s were never interpreted as significance tests.


groups (20 subjects in each). Further, suppose you use a simple independent means t test and your result is t = 2.7, d.f. = 38, p = .01. (Oakes, 1986, p. 79)

He then asked them to indicate whether each of several statements was true or false. One of these statements was this:

You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result in 99% of such studies. (Oakes, 1986, p. 79)

Sixty percent of the researchers indicated that this false statement is true. The significance level gives no information about the probability of replication. This statement confuses significance level with power. The probability of replication is the power of the study; the power of this study is not .99, but rather .43.⁴ If this study is repeated many times, the best estimate is that less than half of all such studies will be significant at the chosen alpha level of .01. Yet 60% of the researchers endorsed the belief that 99% of such studies would be significant. This false belief may help to explain the traditional indifference to power among researchers. Many researchers believe a power analysis does not provide any information not already given by the significance level. Furthermore, this belief leads to the false conclusion that statistical power for every statistically significant finding is very high, at least .95.

That many researchers hold this false belief has been known for decades. Bakan criticized this error in 1966, and Lykken discussed it at some length in 1968. The following statement from an introductory statistics textbook by Nunnally (1975) is a clear statement of this belief:

If the statistical significance is at the .05 level, it is more informative to talk about the statistical confidence as being at the .95 level. This means that the investigator can be confident with odds of 95 out of 100 that the observed difference will hold up in future investigations. (p. 195)

Most researchers, however, do not usually explicitly state this belief. The fact that they hold it is revealed by their description of statistically significant findings. Researchers obtaining a statistically significant result often refer to it as "a reliable difference," meaning one that is replicable. In fact, a false argument frequently heard in favor

of significance testing is that we must have significance tests in order to know whether our findings are reliable or not. As Carver (1978) pointed out, the popularity of statistical significance testing would be greatly reduced if researchers could be made to realize that the statistical significance level does not indicate the replicability of research data. So it is critical that this false belief be eradicated from the minds of researchers.

A second false belief widely held by researchers is that statistical significance level provides an index of the importance or size of a difference or relation (Bakan, 1966). A difference significant at the .001 level is regarded as theoretically (or practically) more important or larger than a difference significant at only the .05 level. In research reports in the literature, one sees statements such as the following: "Moreover, this difference is highly significant (p < .001)," implying that the difference is therefore large or important. This belief ignores the fact that significance level depends on sample size; highly significant differences in large sample studies may be smaller than even nonsignificant differences in smaller sample studies. This belief also ignores the fact that even if sample sizes were equal across studies compared, the p values would still provide no index of the actual size of the difference or effect. Only effect size indices can do that.
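
A small numerical illustration makes the point. In the sketch below, the correlations and sample sizes are hypothetical: a weaker relation in a large sample is "highly significant," whereas a stronger relation in a small sample is not significant at all.

```python
# Illustrative sketch: p values depend on sample size, not on effect size alone.
import math
from scipy.stats import t as t_dist

def two_tailed_p_for_r(r, n):
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return 2 * t_dist.sf(t, df=n - 2)

print(round(two_tailed_p_for_r(0.10, 1000), 3))   # r = .10, N = 1000: about .002
print(round(two_tailed_p_for_r(0.30, 30), 2))     # r = .30, N = 30: about .11
```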

Because of the influence of meta-analysis, the practice of computing effect sizes has become more frequent in some research literatures, thus mitigating the pernicious effects of this false belief. But in other areas, especially in many areas of experimental psychology, effect sizes are rarely computed, and it remains the practice to infer size or importance of obtained findings from statistical significance levels. In an empirical study, Oakes (1986, pp. 86-88) found that psychological researchers infer grossly overestimated effect sizes from significance levels. When the study p values

⁴ The statistical power for future replications of this study is estimated as follows. The best estimate of the population effect size is the effect size (d value) observed in this study. This observed d value is 2t/√N (Hunter & Schmidt, 1990b, p. 272), which is .85 here. With 20 subjects each in the experimental and control groups, an alpha level of .01 (two tailed), and a population d value of .85, the power of the t test is .43.


were .01, they estimated effect sizes as five times as large as they actually were.

The size or importance of findings is information important to researchers. Researchers who continue to believe that statistical significance levels reveal the size or importance of differences or relations will continue to refuse to abandon significance testing in favor of point estimates and confidence intervals. So this is a second false belief that must be eradicated.

The third false belief held by many researchers is the most devastating of all to the research enterprise. This is the belief that if a difference or relation is not statistically significant, then it is zero, or at least so small that it can safely be considered to be zero. This is the belief that if the null hypothesis is not rejected, then it is to be accepted. This is the belief that a major benefit from significance tests is that they tell us whether a difference or effect is real or "probably just occurred by chance." If a difference is not significant, then we know that it is probably just due to chance. The two examples discussed earlier show how detrimental this false operational decision rule is to the attainment of cumulative knowledge in psychology. This belief makes it impossible to discern the real meaning of research literatures.

Although some of his writings are ambiguous on this point, Fisher himself probably did not advocate this decision rule. In his 1935 book he stated,

It should be noted that this null hypothesis is never proved or established, but is possibly disproved in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis. (p. 19)

If the null hypothesis is not rejected, Fisher's position was that nothing could be concluded. But researchers find it hard to go to all the trouble of conducting a study only to conclude that nothing can be concluded. Oakes (1986) has shown empirically that the operational decision rule used by researchers is indeed "if it is not significant, it is zero." Use of this decision rule amounts to an implicit belief on the part of researchers that the power of significance tests is perfect or nearly perfect. Such a belief would account for the surprise typically expressed by researchers when informed of the low level of statistical power in most studies.

The confidence of researchers in a research finding is not a linear function of its significance level. Rosenthal and Gaito (1963) studied the confidence that researchers have that a difference is real as a function of the p value of the significance test. They found a precipitous decline in confidence as the p value increased from .05 to .06 or .07. There was no similar "cliff effect" as the p value increased from .01 to .05. This finding suggests that researchers believe that any finding significant at the .05 level or beyond is real and that any finding with a larger p value, even one only marginally larger, is zero.

Researchers must be disabused of the false belief that if a finding is not significant, it is zero. This belief has probably done more than any of the other false beliefs about significance testing to retard the growth of cumulative knowledge in psychology. Those of us concerned with the development of meta-analysis methods hope that demonstrations of the sort given earlier in this article will effectively eliminate this false belief.

I believe that these false beliefs are a major cause of the addiction of researchers to significance tests. Many researchers believe that statistical significance testing confers important benefits that are in fact completely imaginary. If we were clairvoyant and could enter the mind of a typical researcher, we might eavesdrop on the following thoughts:

Significance tests have been repeatedly criticized by methodological specialists, but I find them very useful in interpreting my research data, and I have no intention of giving them up. If my findings are not significant, then I know that they probably just occurred by chance and that the true difference is probably zero. If the result is significant, then I know I have a reliable finding. The p values from the significance tests tell me whether the relationships in my data are large enough to be important or not. I can also determine from the p value what the chances are that these findings would replicate if I conducted a new study. These are very valuable things for a researcher to know. I wish the critics of significance testing would recognize this fact.

Every one of these thoughts about the benefits of significance testing is false. I ask the reader to ponder this question: Does this describe your thoughts about the significance test?

Analysis of Costs and Benefits

We saw earlier that meta-analysis reveals clearly the horrendous costs, in failure to attain cumulative knowledge, that psychology pays as the price for its addiction to significance testing. I expressed the hope that the appreciation of these massive costs will do what 40 years of logical demonstrations of the deficiencies of significance testing have failed to do: convince researchers to abandon the significance test in favor of point estimates of effect sizes and confidence intervals. But it seems unlikely to me that even these graphic demonstrations of costs will alone lead researchers to give up statistical significance testing. We must also consider the perceived benefits of significance testing. Researchers believe that significance testing confers important benefits, benefits that are in fact imaginary. Many researchers may believe that these "benefits" are important enough to outweigh even the terrible costs that significance testing exacts from the research enterprise. It is unlikely that researchers will abandon significance testing unless and until they are educated to see that they are not getting the benefits they believe they are getting from significance testing. This means that quantitative psychologists and teachers of statistics and other methodological courses have the responsibility to teach researchers not only the high costs of significance testing but also the fact that the benefits typically ascribed to it are illusory. The failure to do the latter has been a major oversight for almost 50 years.

Current Situation in Data Analysis in Psychology

There is a fundamental contradiction in the current situation with respect to quantitative methods. The research literatures and conclusions in our journals are now being shaped by the results and findings of meta-analyses, and this development is solving many of the problems created by reliance on significance testing (Cooper & Hedges, 1994; Hunter & Schmidt, 1990b). Yet the content of our basic graduate statistics courses has not changed (Aiken et al., 1990); we are training our young researchers in the discredited practices and methods of the past. Let us examine this anomaly in more detail.

Meta-analysis has explicated the critical role of sampling error, measurement error, and other artifacts in determining the observed findings and the statistical power of individual studies. In doing so, it has revealed how little information there typically is in any single study. It has shown that, contrary to widespread belief, a single primary study can rarely resolve an issue or answer a question. Any individual study must be considered a data point to be contributed to a future meta-analysis. Thus the scientific status and value of the individual study are necessarily lower than has typically been imagined in the past.
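
One way to see how little information a single small study carries is to compute the confidence interval around its effect size estimate. The sketch below is an added illustration with assumed values for d and the group sizes; it uses the standard large-sample variance approximation for a standardized mean difference, under which an interval based on 20 subjects per group spans more than a full standard deviation unit.

```python
# Hedged sketch: approximate 95% confidence interval for an observed
# standardized mean difference d, using the large-sample variance
# approximation var(d) ≈ (n1 + n2)/(n1 * n2) + d^2 / (2 * (n1 + n2)).
import math


def d_confidence_interval(d, n1, n2, z=1.96):
    """Return an approximate 95% confidence interval for d."""
    var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    half_width = z * math.sqrt(var_d)
    return d - half_width, d + half_width


# Illustrative values: observed d = .85 with 20 subjects per group.
low, high = d_confidence_interval(0.85, 20, 20)
print(round(low, 2), round(high, 2))  # roughly 0.20 to 1.50
```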

As a result, there has been a shift of the focus of scientific discovery in our research literatures from the individual primary study to the meta-analysis, creating a major change in the relative status of reviews. Journals that formerly published only primary studies and refused to publish reviews are now publishing meta-analytic reviews in large numbers. Today, many discoveries and advances in cumulative knowledge are being made not by those who do primary research studies but by those who use meta-analysis to discover the latent meaning of existing research literatures. This is apparent not only in the number of meta-analyses being published but also, and perhaps more important, in the shifting pattern of citations in the literature and in textbooks from primary studies to meta-analyses. The same is true in education, social psychology, medicine, finance, accounting, marketing, and other areas (Hunter & Schmidt, 1990a, chap. 1).

In my own substantive area of industrial/organizational psychology there is even some evidence of reduced reliance on significance testing in analyses of data within individual studies. Studies are much more likely today than in the past to report effect sizes and more likely to report confidence intervals. Results of significance tests are usually still reported, but they are now often sandwiched into parentheses almost as an afterthought and are often given appropriately minimal attention. It is rare today in industrial/organizational psychology for a finding to be touted as important solely on the basis of its p value.

Thus when we look at the research enterprise being conducted by the established researchers of our field, we see major improvements over the situation that prevailed even 10 years ago. However (and this is the worrisome part), there have been no similar improvements in the teaching of quantitative methods in graduate and undergraduate programs. Our younger generations of upcoming researchers are still being inculcated with the old, discredited methods of reliance on statistical significance testing. When we teach students how to analyze and interpret data in individual studies, we are still teaching them to apply t tests, F tests, chi-square tests, and ANOVAs. We are teaching them the same methods that for over 40 years made it impossible to discern the real meaning of data and research literatures and have therefore retarded the development of cumulative knowledge in psychology and the social sciences. We must introduce the reforms needed to solve this serious problem.

It will not be easy. At Michigan State University, John Hunter and Ralph Levine reformed the graduate statistics course sequence in psychology over the last 2 years along the general lines indicated in this article. The result was protests from significance-testing traditionalists among the faculty. These faculty did not contend that the new methods were erroneous; rather, they were concerned that their graduate students might not be able to get their research published unless they used traditional significance-testing-based methods of data analysis. They did not succeed in derailing the reform, but it has not been easy for these two pioneers. But this must be done, and done everywhere. We can no longer tolerate a situation in which our upcoming generation of researchers is being trained to use discredited data analysis methods while the broader research enterprise of which they are to become a part has moved toward improved methods.

References

Aiken, L. S., West, S. G., Sechrest, L., & Reno, R. R. (1990). Graduate training in statistics, methodology, and measurement in psychology: A survey of PhD programs in North America. American Psychologist, 45, 721-734.

Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423-437.

Bangert-Drowns, R. L. (1986). Review of developments in meta-analytic method. Psychological Bulletin, 99, 388-399.

Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

Cooper, H. M., & Hedges, L. V. (1994). Handbook of research synthesis. New York: Russell Sage Foundation.

Cronbach, L. J. (1975). The two disciplines of scientific psychology revisited. American Psychologist, 30, 116-127.

Fisher, R. A. (1932). Statistical methods for research workers (4th ed.). Edinburgh, Scotland: Oliver & Boyd.

Fisher, R. A. (1935). The design of experiments. Edinburgh, Scotland: Oliver & Boyd.

Gergen, K. J. (1982). Toward transformation in social knowledge. New York: Springer-Verlag.

Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.

Guttman, L. (1985). The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis, 1, 3-10.

Hedges, L. V., & Olkin, I. (1980). Vote counting methods in research synthesis. Psychological Bulletin, 88, 359-369.

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.

Huberty, C. J. (1993). Historical origins of statistical testing practices. Journal of Experimental Education, 61, 317-333.

Hunter, J. E. (1979, September). Cumulating results across studies: A critique of factor analysis, canonical correlation, MANOVA, and statistical significance testing. Invited address presented at the 86th Annual Convention of the American Psychological Association, New York, NY.

Hunter, J. E., & Schmidt, F. L. (1990a). Dichotomization of continuous variables: The implications for meta-analysis. Journal of Applied Psychology, 75, 334-349.

Hunter, J. E., & Schmidt, F. L. (1990b). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.

Hunter, J. E., & Schmidt, F. L. (in press). Cumulative research knowledge and social policy formulation: The critical role of meta-analysis. Psychology, Public Policy, and Law.

Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis: Cumulating research findings across studies. Beverly Hills, CA: Sage.

Jones, L. V. (1955). Statistics and research design. Annual Review of Psychology, 6, 405-430.

Kish, L. (1959). Some statistical problems in research design. American Sociological Review, 24, 328-338.

Light, R. J., & Smith, P. V. (1971). Accumulating evidence: Procedures for resolving contradictions among different research studies. Harvard Educational Review, 41, 429-471.

Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment. American Psychologist, 48, 1181-1209.

Loftus, G. R. (1991). On the tyranny of hypothesis testing in the social sciences. Contemporary Psychology, 36, 102-105.

Loftus, G. R. (1994, August). Why psychology will never be a real science until we change the way we analyze data. Address presented at the American Psychological Association 102nd annual convention, Los Angeles, CA.

Lykken, D. (1968). Statistical significance in psychological research. Psychological Bulletin, 70, 151-159.

Meehl, P. E. (1967). Theory testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald and the slow process of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.

Neyman, J., & Pearson, E. S. (1932). The testing of statistical hypotheses in relation to probabilities a priori. Proceedings of the Cambridge Philosophical Society, 29, 492-516.

Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, A231, 289-337.

Nunnally, J. C. (1975). Introduction to statistics for psychology and education. New York: McGraw-Hill.

Oakes, M. L. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley.

Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage.

Rosenthal, R. (1991). Meta-analytic procedures for social research (2nd ed.). Newbury Park, CA: Sage.

Rosenthal, R., & Gaito, J. (1963). The interpretation of levels of significance by psychological researchers. Journal of Psychology, 55, 33-38.

Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416-428.

Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47, 1173-1181.

Schmidt, F. L., & Hunter, J. E. (1978). Moderator research and the law of small numbers. Personnel Psychology, 31, 215-232.

Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199-223.

Schmidt, F. L., Hunter, J. E., & Urry, V. E. (1976). Statistical power in criterion-related validation studies. Journal of Applied Psychology, 61, 473-485.

Schmidt, F. L., Ocasio, B. P., Hillery, J. M., & Hunter, J. E. (1985). Further within-setting empirical tests of the situational specificity hypothesis in personnel selection. Personnel Psychology, 38, 509-524.

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316.

Winch, R. F., & Campbell, D. T. (1969). Proof? No. Evidence? Yes. The significance of tests of significance. American Sociologist, 4, 140-143.

Received April 27, 1995
Revision received September 7, 1995

Accepted September 25, 1995