TRANSCRIPT
Replication Research Under an Error Statistical Philosophy
Deborah Mayo
Around a year ago on my blog: "There are some ironic twists in the way psychology is dealing with its replication crisis that may well threaten even the most sincere efforts to put the field on firmer scientific footing."
Philosopher's talk: I see a rich source of problems that cry out for the ministrations of philosophers of science and of statistics.
Three main philosophical tasks:
#1 Clarify concepts and presuppositions
#2 Reveal inconsistencies, puzzles, tensions ("ironies")
#3 Solve problems, improve on methodology
• Philosophers usually stop with the first two, but I think going on to solve problems is important. This presentation is programmatic: what might replication research under an error statistical philosophy look like? My interest grew thanks to Caitlin Parker, whose MA thesis was on the topic.
Example of a conceptual clarification (#1)
The editors of the journal Basic and Applied Social Psychology announced that they are banning statistical hypothesis testing because it is "invalid". It's invalid, they say, because it does not supply "the probability of the null hypothesis, given the finding" (the posterior probability of H0) (Trafimow and Marks 2015).
• Since the methodology of testing explicitly rejects the very mode of inference it is charged with failing to supply, it is incorrect to claim the methods are invalid.
• A simple conceptual job of the sort philosophers are good at.
Example of revealing inconsistencies and tensions (#2)
Critic: It's too easy to satisfy standard significance thresholds.
You: Why do replicationists find it so hard to achieve significance thresholds?
Critic: Obviously the initial studies were guilty of p-hacking, cherry-picking, significance seeking, QRPs.
You: So the replication researchers want methods that pick up on and block these biasing selection effects.
Critic: Actually the "reforms" recommend methods where selection effects and data dredging make no difference.
Whether this can be resolved or not is a separate matter.
• We are constantly hearing of how the "reward structure" leads to taking advantage of researcher flexibility.
• As philosophers, we can at least show how to hold their feet to the fire, and warn of the perils of accounts that bury the finagling.
The philosopher is the curmudgeon (it takes chutzpah!). I'll give examples of #1 clarifying terms, #2 inconsistencies, and #3 proposed solutions (though I won't always number them).
Demarcation: Bad Methodology/Bad Statistics
• A lot of the recent attention grew out of the case of Diederik Stapel, the social psychologist who fabricated his data.
• Kahneman in 2012: "I see a train-wreck looming," proposing a "daisy chain" of replication.
• The Stapel investigators (2012 Tilburg Report, "Flawed Science") do a good job of characterizing pseudoscience.
• Philosophers tend to have cold feet when it comes to saying anything general about science versus pseudoscience.
Items in their list of "dirty laundry" include:
"An experiment fails to yield the expected statistically significant results. The experimenters try and try again until they find something (multiple testing, multiple modeling, post-data search of endpoint or subgroups), and the only experiment subsequently reported is the one that did yield the expected results. … [C]ontinuing an experiment until it works as desired, or excluding unwelcome experimental subjects or results, inevitably tends to confirm the researcher's research hypotheses, and essentially render the hypotheses immune to the facts." (Report, p. 48)
They had walked into a "culture of verification bias".
Bad Statistics
Severity Requirement: If data x0 agree with a hypothesis H, but the test procedure had little or no capability, i.e., little or no probability, of finding flaws with H (even if H is incorrect), then x0 provide poor evidence for H.
Such a test, we would say, fails a minimal requirement for a stringent or severe test.
• This seems utterly uncontroversial.
• Methods that scrutinize a test's capabilities, according to their severity, I call error statistical.
• Existing error probabilities (confidence levels, significance levels) may, but need not, provide severity assessments.
• A new name is needed: "frequentist," "sampling theory," "Fisherian," "Neyman-Pearsonian" are too associated with hard-line views and personality conflicts. ("It's the methods, stupid.")
(An example of new solutions, #3.)
Are philosophies about science relevant? One of the final recommendations in the Report is this:
"In the training program for PhD students, the relevant basic principles of philosophy of science, methodology, ethics and statistics that enable the responsible practice of science must be covered." (p. 57)
A critic might protest: "There's nothing philosophical about my criticism of significance tests: a small p-value is invariably, and erroneously, interpreted as giving a small probability to the null hypothesis that the observed difference is mere chance."
Really? P-values are not intended to be used this way; presupposing they should be stems from a conception of the role of probability in statistical inference, and this conception is philosophical. (Of course, criticizing the tests merely because they might be misinterpreted is just silly.)
Two main views of the role of probability in inference:
Probabilism. To provide a post-data assignment of degree of probability, confirmation, support or belief in a hypothesis, absolute or comparative, given data x0.
Performance. To ensure long-run reliability of methods: coverage probabilities, control of the relative frequency of erroneous inferences in a long-run series of trials.
What happened to the goal of scrutinizing bad science by the severity criterion?
• Neither "probabilism" nor "performance" directly captures it.
• Good long-run performance is a necessary, not a sufficient, condition for avoiding insevere tests.
• The problems with selective reporting, multiple testing, and stopping when the data look good are not problems about long runs.
• It's that we cannot say, about the case at hand, that it has done a good job of avoiding the sources of misinterpretation.
• Probabilism says H is not justified unless it's true or probable (or made firmer).
• Error statistics (probativism) says H is not justified unless something has been done (a good job) to probe the ways we can be wrong about H.
• If it's assumed probabilism is required for inference, error probabilities could be relevant only by misinterpretation. False!
• Error probabilities have a crucial role in appraising well-testedness (a new philosophy for probability, #3).
• Both H and not-H can be poorly tested, so a severe-testing assessment violates the probability calculus.
Understanding the Replication Crisis Requires Understanding How It Intermingles with PhilStat Controversies
• It's not that I'm keen to defend many common uses of significance tests.
• It's just that the criticisms (in psychology and elsewhere) are based on serious misunderstandings of the nature and role of these methods; consequently, so are many "reforms".
• How can you be confident the reforms are better if you might be mistaken about existing methods?
Criticisms concern a kind of Fisherian significance test:
(i) Sample: Let X = (X1, …, Xn) be n iid (independent and identically distributed) outcomes from a Normal distribution with standard deviation σ.
(ii) A null hypothesis, H0: µ = 0 (e.g., a difference Δ = µT − µC = 0).
(iii) Test statistic: a function of the sample, d(X), reflecting the difference between the data x0 = (x1, …, xn) and H0. The larger d(x0), the further the outcome from what's expected under H0, with respect to the particular question.
(iv) The sampling distribution of the test statistic d(X).
The P-value is the probability of a difference larger than d(x0), under the assumption that H0 is true:
p(x0) = Pr(d(X) > d(x0); H0). If p(x0) is sufficiently small, there's an indication of discrepancy from the null. (Even Fisher had implicit alternatives, by the way.)
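To fix ideas, here is a minimal sketch (Python; purely illustrative, assuming σ known and the one-sample Normal setup of (i)-(iv) above) of the test statistic and its P-value:

```python
import math

def d(xbar, mu0, sigma, n):
    """Test statistic: standardized distance of the sample mean from H0."""
    return (xbar - mu0) / (sigma / math.sqrt(n))

def p_value(xbar, mu0, sigma, n):
    """Pr(d(X) > d(x0); H0): upper tail of the standard normal."""
    z = d(xbar, mu0, sigma, n)
    return 0.5 * math.erfc(z / math.sqrt(2))  # = 1 - Phi(z)

# Example: n = 100 outcomes with sigma = 1 and observed mean 0.196
print(p_value(0.196, 0.0, 1.0, 100))  # ~0.025
```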
P-value reasoning: from high capacity to curb enthusiasm
If the hypothesis H0 is correct then, with high probability 1 − p, the data would not be statistically significant at level p.
x0 is statistically significant at level p.
____________________________
Thus, x0 indicates a discrepancy from H0.
That merely indicates some discrepancy!
A genuine experimental effect is needed
"[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result." (Fisher 1935, p. 14)
(A low P-value does not entail H: a genuine statistical effect.)
"[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter... requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions." (Gigerenzer 1989, pp. 95-6)
(And H does not entail H*, the substantive research hypothesis.)
Still, simple Fisherian tests have important uses:
• Testing assumptions
• Fraudbusting and forensics: finding data too good to be true (Simonsohn)
• Finding whether data are consistent with a model
Gelman and Shalizi (a meeting of minds between a Bayesian and an error statistician): "What we are advocating, then, is what Cox and Hinkley (1974) call 'pure significance testing', in which certain of the model's implications are compared directly to the data, rather than entering into a contest with some alternative model." (p. 20)
Fallacy of Rejection (H → H*): erroneously taking statistical significance as evidence for a substantive research hypothesis H*.
The fallacy is explicated by severity: flaws in alternative H* have not been probed by the test, so the inference from a statistically significant result to H* fails to pass with severity.
Merely refuting the null hypothesis is too weak to corroborate substantive H*: "we have to have 'Popperian risk', 'severe test' [as in Mayo], or what philosopher Wesley Salmon called 'a highly improbable coincidence'." (Meehl and Waller 2002, p. 184)
(Meehl was wrong to blame Fisher.)
NHSTs are pseudostatistical: why do psychologists speak of NHSTs, tests that supposedly allow moving directly from statistical to substantive conclusions? So defined, they exist only as abuses of tests: they exist as something you're never supposed to do.
Psychologists tend to ignore Neyman-Pearson (N-P) tests: N-P supplemented Fisher's tests with explicit alternatives.
Neyman-Pearson (N-P) Tests: a null and alternative hypothesis, H0 and H1, that exhaust the parameter space.
So the fallacy of rejection (H → H*) is impossible: rejecting the null only indicates statistical alternatives.
This scotches criticisms that P-values consider only the null.
Example, Test T+: the sampling distribution of d(X) is considered under the null and under alternatives.
H0: µ ≤ µ0 vs. H1: µ > µ0
If d(x0) > cα, "reject" H0; if d(x0) < cα, "do not reject" or "accept" H0 (e.g., cα = 1.96 for α = .025).
The sampling distribution yields error probabilities:
Probability of a Type I error = P(d(X) > cα; H0) ≤ α.
Probability of a Type II error = P(d(X) < cα; µ1) = β(µ1), for any µ1 > µ0.
The complement of the Type II error probability is the power against µ1:
POW(µ1) = P(d(X) > cα; µ1)
Even without "best" tests, there are "good" tests.
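As a check on these definitions, a small sketch (Python; illustrative values, assuming the Normal model of Test T+ with σ known):

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def power(mu1, mu0=0.0, sigma=1.0, n=100, c_alpha=1.96):
    """POW(mu1) = P(d(X) > c_alpha; mu1) for Test T+.
    Under mu1, d(X) is Normal with mean (mu1 - mu0)/(sigma/sqrt(n)), sd 1."""
    shift = (mu1 - mu0) / (sigma / math.sqrt(n))
    return 1 - phi(c_alpha - shift)

print(power(0.0))    # = alpha = 0.025 at the null itself
print(power(0.196))  # ~0.50: a discrepancy the test detects only half the time
print(power(0.5))    # ~0.999: high power against larger discrepancies
```

The Type II error probability β(µ1) is simply 1 minus each of these values.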
The N-P test in terms of the P-value: reject H0 iff the P-value < .025.
• Even N-P report the attained significance level, or P-value (Lehmann).
• "Reject"/"do not reject" are uninterpreted parts of the mathematical apparatus.
• "Reject" could be read: "declare statistically significant at the p-level".
• "The tests… must be used with discretion and understanding" (N-P 1928, p. 58).
("It's the methods, stupid.")
Why "Inductive Behavior"?
N-P justify tests (and confidence intervals) by performance: control of long-run error (and coverage) probabilities.
They called this inductive behavior. Why?
• They were reaching conclusions beyond the data (inductive).
• If inductive inference is probabilist, then they needed a new term.
In Popperian spirit, they (mostly Neyman) called it inductive behavior: adjust how we'd act rather than our beliefs.
(I'm not knocking performance, but error probabilities also serve for particular inferences: they are evidential.)
N-P tests can still commit a type of fallacy of rejection: inferring a discrepancy beyond what's warranted, especially with n sufficiently large (the large-n problem).
• Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2), as the sketch below illustrates.
Which is more indicative of a large effect (a fire): a fire alarm that goes off with burnt toast, or one so insensitive that it doesn't go off unless the house is fully ablaze? (The larger sample size is like the alarm that goes off with burnt toast.)
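A sketch of the large-n point (Python; µ0 = 0, σ = 1, numbers illustrative), using the severity computation made explicit a few slides below: take a just-.025-significant result and ask how well it indicates µ > 0.1.

```python
import math

def phi(z):
    return 0.5 * math.erfc(-z / math.sqrt(2))

def sev_greater(gamma, xbar, sigma, n):
    """SEV(mu > gamma) for Test T+: P(d(X) < d(x0); mu = gamma)."""
    return phi((xbar - gamma) / (sigma / math.sqrt(n)))

# The same just-significant z = 1.96 at two sample sizes:
for n in (100, 10_000):
    xbar = 1.96 / math.sqrt(n)  # observed mean yielding z = 1.96
    print(n, round(sev_greater(0.1, xbar, 1.0, n), 3))
# n = 100:    SEV(mu > 0.1) ~ 0.83  (some warrant for a discrepancy > 0.1)
# n = 10000:  SEV(mu > 0.1) ~ 0.0   (the same z warrants far less)
```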
Fallacy of Non-Significant Results: insensitive tests
• Negative results may not warrant inferring a zero discrepancy from the null, but we can use severity to rule out discrepancies that, with high probability, would have resulted in a larger difference than was observed.
• Similar to Cohen's power analysis, but sensitive to the actual outcome: the P-value distribution (#3).
• I hear some replicationists say negative results are uninformative: not so (#2, ironies).
There's no point in running replication research if your account views negative results as uninformative.
Error statistics gives an evidential interpretation to tests (#3): use results to infer discrepancies from a null that are well ruled out, and those which are not. I'd never just report a P-value.
Mayo (1996); Mayo and Cox (2010): Frequentist Principle of Evidence (FEV)
Mayo and Spanos (2006): SEV
One-sided Test T+: H0: µ ≤ µ0 vs. H1: µ > µ0
d(x0) is statistically significant (set lower bounds):
(i) If the test had high capacity to warn us (by producing a less significant result) if µ ≤ µ0 + γ, then d(x0) is a good indication of µ > µ0 + γ.
(ii) If the test had little (or even moderate) capacity (e.g., < .5) to produce a less significant result even if µ ≤ µ0 + γ, then d(x0) is a poor indication of µ > µ0 + γ.
(If an even more impressive result is probable, due to guppies, it's not a good indication of a great whale.)
d(x0) is not statistically significant (set upper bounds):
(i) If the test had a high probability of producing a more statistically significant difference if µ > µ0 + γ, then d(x0) is a good indication that µ ≤ µ0 + γ.
(ii) If the test had a low probability of producing a more statistically significant difference if µ > µ0 + γ, then d(x0) is a poor indication that µ ≤ µ0 + γ (too insensitive to rule out discrepancy γ).
If you set an overly stringent significance level in order to block rejecting a null, we can determine the discrepancies you can't detect (e.g., risks of concern).
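In the Normal case, both sets of clauses come down to one computation. A minimal sketch (Python; σ known, values illustrative, names mine):

```python
import math

def phi(z):
    return 0.5 * math.erfc(-z / math.sqrt(2))

def sev(claim, gamma, xbar, mu0=0.0, sigma=1.0, n=100):
    """Severity for 'mu > mu0 + gamma' or 'mu <= mu0 + gamma' in Test T+:
    the probability of a result less (or more) discordant with H0 than the
    one observed, computed under mu = mu0 + gamma."""
    z_at_gamma = (xbar - (mu0 + gamma)) / (sigma / math.sqrt(n))
    if claim == ">":     # significant result: lower bound mu > mu0 + gamma
        return phi(z_at_gamma)        # P(d(X) < d(x0); mu0 + gamma)
    else:                # non-significant: upper bound mu <= mu0 + gamma
        return 1 - phi(z_at_gamma)    # P(d(X) > d(x0); mu0 + gamma)

# Significant result, xbar = 0.196 (z = 1.96), n = 100:
print(round(sev(">", 0.05, 0.196), 3))   # ~0.93: mu > 0.05 well indicated
print(round(sev(">", 0.15, 0.196), 3))   # ~0.68: mu > 0.15 poorly indicated
# Non-significant result, xbar = 0.05 (z = 0.5), n = 100:
print(round(sev("<=", 0.2, 0.05), 3))    # ~0.93: discrepancy 0.2 well ruled out
```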
Confidence intervals also require supplementing.
Duality between tests and intervals: the values within a (1 − α) CI are non-rejectable at the α level.
• Still too dichotomous: in/out, plausible/not plausible (permitting fallacies of rejection/non-rejection).
• Justified in terms of long-run coverage (performance).
• All members of the CI are treated on a par.
• A fixed confidence level (SEV needs several benchmarks).
• Estimation is important, but we need tests for distinguishing real and spurious effects, and for checking the assumptions of statistical models.
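The duality can be checked directly. A sketch (Python; same Normal setup, illustrative numbers): the two-sided 95% CI is exactly the set of µ0 values whose test is not rejected at α = .05.

```python
import math

def ci_95(xbar, sigma, n):
    """Two-sided 95% CI for mu."""
    half = 1.96 * sigma / math.sqrt(n)
    return xbar - half, xbar + half

def rejected_at_05(mu0, xbar, sigma, n):
    """Two-sided test of H0: mu = mu0 at alpha = .05."""
    return abs(xbar - mu0) / (sigma / math.sqrt(n)) > 1.96

xbar, sigma, n = 0.25, 1.0, 100
lo, hi = ci_95(xbar, sigma, n)                     # (0.054, 0.446)
print(rejected_at_05(lo + 1e-9, xbar, sigma, n))   # False: just inside the CI
print(rejected_at_05(lo - 1e-9, xbar, sigma, n))   # True: just outside the CI
```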
The evidential interpretation is crucial, but error probabilities can be violated by selection effects (and also by violated model assumptions).
One function of severity is to identify which selection effects are problematic; not all are (#3).
Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified) in such a way that the minimal severity requirement is violated, seriously altered, or incapable of being assessed.
Nominal vs. actual significance levels
"Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be 'significant at the 5 percent level.' … The actual level of significance is not 5 percent, but 64 percent!" (Selvin 1970, p. 104)
• They were clear on the fallacy: blurring the "computed" or "nominal" significance level with the "actual" level.
• There are many more ways you can be wrong with hunting (a different sample space), as the sketch below shows.
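Selvin's 64 percent is just the probability that at least one of twenty 5-percent-level tests reaches significance when all the nulls are true. A sketch (Python; assuming independent tests, which the hypothetical grants):

```python
import random

# Closed form: P(at least one of 20 true nulls 'significant' at .05)
print(1 - 0.95**20)  # ~0.64, Selvin's 'actual' level

# The same by simulation: hunt through 20 true-null p-values, report only
# the smallest, and see how often it clears the .05 threshold
trials, hits = 100_000, 0
for _ in range(trials):
    smallest_p = min(random.random() for _ in range(20))  # null p ~ Uniform(0,1)
    hits += smallest_p < 0.05
print(hits / trials)  # ~0.64 again: nominal .05, actual ~.64
```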
This is a genuine example of an invalid or unsound method.
You report: such results would be difficult to achieve under the assumption of H0, when in fact such results are common under the assumption of H0.
(Formally): You say Pr(P-value < pobs; H0) ≈ α (small), but in fact Pr(P-value < pobs; H0) is high, if not guaranteed.
• Nowadays, we're likely to see the tests blamed for permitting such misuses (instead of the testers).
• Worse are those accounts where the abuse vanishes!
What defies scientific sense? On some views, biasing selection effects are irrelevant…
Stephen Goodman (epidemiologist): "Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of 'objectivity' that is often made for the P-value." (1999, p. 1010)
Likelihood Principle (LP)
The vanishing act takes us to the pivot point around which much of the debate in philosophy of statistics revolves. In probabilisms, the import of the data enters via the ratios of likelihoods of hypotheses:
P(x0; H1)/P(x0; H0)
Different forms: posterior probabilities, Bayes factors. (Inference is comparative: the data favor this hypothesis over that one. Is that even inference?)
All error probabilities violate the LP (even without selection effects):
"Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space." (Lindley 1971, p. 436)
On this view, the information is just a matter of our "intentions":
"The LP implies… the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects." (Rosenkrantz 1977, p. 122)
Many current reforms are probabilist.
Probabilist reforms that replace tests (and CIs) with likelihood ratios, Bayes factors, or HPD intervals, or that just lower the P-value (so that the maximally likely alternative gets a .95 posterior), while ignoring biasing selection effects, will fail. The same p-hacked hypothesis can occur in Bayes factors; optional stopping can exclude true nulls from HPD intervals.
With one big difference: your direct basis for criticism and possible adjustments has just vanished. (Lots of #2 inconsistencies.) The sketch below illustrates the optional stopping point.
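A sketch of the optional stopping point (Python; simulation under a true null, illustrative sizes): keep sampling until the nominal z test looks significant or a cap is reached. Any LP-compliant measure computed from the final likelihood alone is blind to the stopping rule, yet the actual error rate balloons past the nominal .05.

```python
import random, math

def stops_significant(max_n=1000, z_crit=1.96, min_n=10):
    """Sample N(0,1) data (the null is true), peeking after each observation;
    stop as soon as |z| > z_crit. Returns True if we ever 'reject'."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += random.gauss(0.0, 1.0)
        if n >= min_n and abs(total / math.sqrt(n)) > z_crit:
            return True
    return False

trials = 2_000
rate = sum(stops_significant() for _ in range(trials)) / trials
print(rate)  # well above the nominal .05; it approaches 1 as max_n grows
```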
How might probabilists block intuitively unwarranted inferences? (Consider the subjective Bayesian first.)
When we hear there's statistical evidence of some unbelievable claim (distinguishing shades of grey and being politically moderate; ovulation and voting preferences), some probabilists say: you see, if our beliefs were mixed into the interpretation of the evidence, we wouldn't be fooled.
We know these things are unbelievable, a subjective Bayesian might say. That could work in some cases (though it still wouldn't show what the researchers had done wrong): a battle of beliefs.
It wouldn't help with our most important problem:
• How to distinguish the warrant for a single hypothesis H arrived at by different methods (e.g., one with biasing selection effects, another with registered results and precautions)?
So now you've got two sources of flexibility: priors and biasing selection effects (which can no longer be criticized). Besides, researchers really do believe their hypotheses.
Diederik Stapel says he always read the research literature extensively to generate his hypotheses: "So that it was believable and could be argued that this was the only logical thing you would find." (E.g., eating meat causes aggression.) (In "The Mind of a Con Man," The New York Times, April 26, 2013.)
Conventional Bayesians
The most popular probabilisms these days are "non-subjective" (reference, default) or conventional priors, designed to prevent prior beliefs from influencing the posteriors: "The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…" (Cox and Mayo 2010, p. 299)
How might they avoid too-easy rejections of a null?
Cult of the Holy Spike
Give a spike prior of .5 to H0, the remaining .5 probability being spread out over the alternative parameter space (Jeffreys). This "spiked concentration of belief in the null" is at odds with the prevailing view that "we know all nulls are false" (#2).
Bottom line: by convenient choices of priors and alternatives, statistically significant differences can become evidence for the null.
The conflict usually considers the two-sided test H0: µ = 0 versus H1: µ ≠ 0.
Posterior probability of H0, by sample size n:

p       z       n=50    n=100   n=1000
.10     1.645   .65     .72     .89
.05     1.960   .52     .60     .82
.01     2.576   .22     .27     .53
.001    3.291   .034    .045    .124

If n = 1000, a result statistically significant at the .05 level leads to a posterior on the null of .82! From Berger and Sellke (1987), based on a Jeffreys prior.
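The table can be reproduced with the spike-and-smear setup. A sketch (Python; assuming a .5 spike on H0: µ = 0 and, under H1, a zero-mean normal prior with τ = σ, one common way of implementing the Jeffreys-type prior behind such tables):

```python
import math

def posterior_null(z, n, pi0=0.5):
    """P(H0 | z) with spike pi0 on H0: mu = 0 and mu ~ N(0, sigma^2) under H1.
    Marginal of xbar: N(0, sigma^2/n) under H0, N(0, sigma^2 (1 + 1/n)) under H1.
    At xbar = z * sigma / sqrt(n) the ratio of marginal densities simplifies to:"""
    lr = math.sqrt(n + 1) * math.exp(-(z**2 / 2) * (n / (n + 1)))
    odds = (pi0 / (1 - pi0)) * lr
    return odds / (1 + odds)

for n in (50, 100, 1000):
    print(n, round(posterior_null(1.96, n), 2))  # .52, .60, .82 as in the table
```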
• With a z = 1.96 difference, the 95% CI (two-sided), or the .975 one-sided CI, excludes the null (0) from the interval.
• Severity reasoning: were H0 true, the probability of getting d(X) < d(x0) is high (~.975), so SEV(µ > 0) ≈ .975.
• But they give P(H0 | z = 1.96) = .82.
• Error statistical critique: there's a high probability that they assign the posterior probability of .82 to H0: µ = 0 erroneously.
• The onus is on probabilists to show that a high posterior for H constitutes having passed a good test.
Informal and Quasi-Formal Severity: H → H*
• Error statisticians avoid the fallacy of going directly from a statistical hypothesis to a research hypothesis H*.
• Can we say nothing about this link?
• I think we can and must, and informal severity assessments are relevant (#3).
I will not discuss straw-man studies ("chump effects").
This one is believable: do men react more negatively to the successes of their partners than to their failures (compared to women)?
Studies have shown: H: a partner's success lowers self-esteem in men.
Macho Men
H*: a partner's success lowers self-esteem in men.
I have no doubt that certain types of men feel threatened by the success of their female partners, wives or girlfriends; I've even known a few.
Can this be studied in the lab? Ratliff and Oishi (2013) tried:
H*: "men's implicit self-esteem is lower when a partner succeeds than when a partner fails."
Not so for women. Their example does a good job, given the standards in place.
Treatments: subjects are randomly assigned to five "treatments": think and write about a time your partner succeeded, failed, succeeded when you failed (partner beats me), failed when you succeeded (I beat partner), or a typical day (control).
Effects: a measure of "self-esteem".
Explicit: "How do you feel about yourself?"
Implicit: a test of word associations with "me" versus "other".
None of the comparisons showed statistical significance in explicit self-esteem, so consider just the implicit measures.
Some null hypotheses (these are statistical hypotheses): the average self-esteem score is no different
a) when partner succeeds (rather than failing);
b) when partner beats (surpasses) me, or I beat her;
c) control: when she succeeds, fails, or it's a regular day.
There are at least double this number, given that self-esteem could be "explicit" or "implicit" (and others too, e.g., varying the area of success).
Only null (a) was statistically rejected!
Should they have taken the research hypothesis as disconfirmed by the negative cases? Or as casting doubt on their test?
Or should they just focus on the null hypotheses that were rejected, in particular null (a) for implicit self-esteem?
They opt for the third. It's not that they should have regarded their research hypothesis H* as disconfirmed, much less falsified. This is precisely the nub of the problem! I'm saying the hypothesis that the study isn't well run needs to be considered:
• Is the artificial writing assignment sufficiently relevant to the phenomenon of interest? (Look at proxy variables.)
• Is the measure of implicit self-esteem (word associations) a valid measure of the effect? (Measurements of effects.)
Take null hypothesis (b): the average self-esteem score is no different when partner beats (surpasses) me or I beat her.
Clearly they expected "she beat me in X" to have a greater negative impact on self-esteem than "she succeeded at X".
Still, they could view it as lending "some support to the idea that men interpret 'my partner is successful' as 'my partner is more successful than me'" (p. 698), as the authors do. That is, any success of hers is always construed by Macho Man as "she beat me".
Bending Over Backwards
For the stringent self-critic, this skirts too close to viewing the data through the theory, a kind of "self-sealing fallacy". I want to be clear that this is not a criticism of them given existing standards.
"I'm talking about a specific, extra type of integrity... bending over backwards to show how you're maybe wrong, that you ought to have when acting as a scientist." (R. Feynman 1974)
I'm describing what's needed to show one is "sincerely trying to find flaws" under the austere account I recommend.
The most interesting information was never reported! Perhaps it was never even looked at: what the subjects wrote about.
Conclusion: Replication Research in Psychology Under an Error Statistical Philosophy
Replication problems can't be solved without correctly understanding their sources. The biggest sources of problems in replication crises are (a) moving from a statistical hypothesis H to a research hypothesis H*, and (b) biasing selection effects.
Reasons for (a): a focus on P-values and Fisherian tests while ignoring N-P tests (and the illicit NHST that goes directly from H to H*).
Another reason is a false dilemma: probabilism or long-run performance, plus assuming that N-P tests can only give the latter. I argue for a third use of probability. Rather than report on believability, researchers need to report the properties of the methods they used:
What was their capacity to have identified, avoided, or admitted bias?
What's wanted is not a high posterior probability in H (however construed), but a high probability that the procedure would have unearthed flaws in H (a reinterpretation of N-P methods).
What's replicable? Discrepancies that are severely warranted.
Reasons for (b) (embracing accounts that formally ignore selection effects): accepting probabilisms that embrace the likelihood principle (LP).
There's no point in raising thresholds for significance if your methodology does not pick up on biasing selection effects.
Informal assessments of probativeness are needed to scrutinize statistical inferences in relation to research hypotheses (H → H*). One hypothesis must always be: our results point to the inability of our study to severely probe the phenomenon of interest (problems with proxy variables, measurements, etc.).
The scientific status of an inquiry is questionable if it cannot or will not distinguish the correctness of inferences from problems stemming from a poorly run study.
If ordinary research reports adopted the Feynman "bending over backwards" scrutiny, the interpretation of replication efforts would be more informative (or perhaps not even needed).
REFERENCES
Baggerly, K. A., Coombes, K. R. & Neeley, E. S. (2008). "Run Batch Effects Potentially Compromise the Usefulness of Genomic Signatures for Ovarian Cancer." Journal of Clinical Oncology 26(7): 1186-1187.
Bartlett, T. (2012). "Daniel Kahneman Sees 'Train-Wreck Looming' for Social Psychology". Chronicle of Higher Education Blog (Oct. 4, 2012); article with links to an email D. Kahneman sent to several social psychologists. http://chronicle.com/blogs/percolator/daniel-kahneman-sees-train-wreck-looming-for-social-psychology/31338
Berger, J. O. (2006). "The Case for Objective Bayesian Analysis." Bayesian Analysis 1(3): 385-402.
Berger, J. O. & Sellke, T. (1987). "Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence (with Discussion)." Journal of the American Statistical Association 82(397): 112-122.
Bhattacharjee, Y. (2013). "The Mind of a Con Man". The New York Times Magazine (4/28/2013), p. 44.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Erlbaum.
Coombes, K. R., Wang, J. & Baggerly, K. A. (2007). "Microarrays: Retracing Steps." Nature Medicine 13(11): 1276-7.
Cox, D. R. & Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.
Cox, D. R. & Mayo, D. G. (2010). "Objectivity and Conditionality in Frequentist Inference." In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. Mayo and Aris Spanos, 276-304. Cambridge: Cambridge University Press.
Diaconis, P. (1978). "Statistical Problems in ESP Research". Science 201(4351): 131-136. (Letters in response can be found in the Dec. 15, 1978 issue, pp. 1145-6.)
Dienes, Z. (2011). "Bayesian versus Orthodox Statistics: Which Side Are You On?" Perspectives on Psychological Science 6(3): 274-290.
Feynman, R. (1974). "Cargo Cult Science." Caltech Commencement Speech.
Fisher, R. A. (1947). The Design of Experiments, 4th ed. Edinburgh: Oliver and Boyd.
Gelman, A. (2011). "Induction and Deduction in Bayesian Data Analysis." In Rationality, Markets and Morals: Studies at the Intersection of Philosophy and Economics 2 (Special Topic: Statistical Science and Philosophy of Science), edited by Deborah G. Mayo, Aris Spanos, and Kent W. Staley: 67-78.
Gelman, A. & Shalizi, C. (2013). "Philosophy and the Practice of Bayesian Statistics." British Journal of Mathematical and Statistical Psychology 66(1): 8-38.
Gigerenzer, G. (2000). "The Superego, the Ego, and the Id in Statistical Reasoning." In Adaptive Thinking: Rationality in the Real World. Oxford: Oxford University Press.
Goodman, S. N. (1999). "Toward Evidence-Based Medical Statistics. 2: The Bayes Factor." Annals of Internal Medicine 130: 1005-1013.
Howson, C. & Urbach, P. (1993). Scientific Reasoning: The Bayesian Approach, 2nd ed. La Salle, IL: Open Court.
Johansson, T. (2010). "Hail the Impossible: P-values, Evidence, and Likelihood." Scandinavian Journal of Psychology 52: 113-125.
Kruschke, J. K. (2010). "What to Believe: Bayesian Methods for Data Analysis". Trends in Cognitive Sciences 14(7): 297-300.
Lehmann, E. L. (1993). "The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?" Journal of the American Statistical Association 88(424): 1242-1249.
Levelt Committee, Noort Committee, Drenth Committee (2012). "Flawed Science: The Fraudulent Research Practices of Social Psychologist Diederik Stapel". Stapel Investigation: Joint Tilburg/Groningen/Amsterdam investigation of the publications by Mr. Stapel. https://www.commissielevelt.nl/
Lindley, D. V. (1971). "The Estimation of Many Parameters." In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435-455. Toronto: Holt, Rinehart and Winston.
Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundations. Chicago: University of Chicago Press.
Mayo, D. G. & Cox, D. R. (2010). "Frequentist Statistics as a Theory of Inductive Inference". In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by D. Mayo and A. Spanos, 1-27. Cambridge: Cambridge University Press. The paper first appeared in The Second Erich L. Lehmann Symposium: Optimality (2006), Lecture Notes-Monograph Series, Vol. 49, Institute of Mathematical Statistics, pp. 247-275.
Mayo, D. G. & Spanos, A. (2006). "Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction." British Journal for the Philosophy of Science 57(2): 323-357.
Mayo, D. G. & Spanos, A. (2011). "Error Statistics." In Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, Vol. 7 of Handbook of the Philosophy of Science, 152-198. The Netherlands: Elsevier.
Meehl, P. E. & Waller, N. G. (2002). "The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude." Psychological Methods 7(3): 283-300.
Morrison, D. E. & Henkel, R. E. (Eds.) (1970). The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.
Micheel, C. M., Nass, S. J. & Omenn, G. S. (Eds.), Committee on the Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials; Board on Health Care Services; Board on Health Sciences Policy; Institute of Medicine (2012). Evolution of Translational Omics: Lessons Learned and the Path Forward. National Academies Press.
Neyman, J. (1957). "'Inductive Behavior' as a Basic Concept of Philosophy of Science." Revue de l'Institut International de Statistique / Review of the International Statistical Institute 25(1/3): 7-22.
Neyman, J. & Pearson, E. S. (1928). "On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Part I." Biometrika 20A: 175-240. (Reprinted in Joint Statistical Papers, Berkeley: University of California Press, 1967, pp. 1-66.)
Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.
Potti, A., Dressman, H. K., Bild, A., Riedel, R. F., Chan, G., Sayer, R., Cragun, J., Cottrill, H., Kelley, M. J., Petersen, R., Harpole, D., Marks, J., Berchuck, A., Ginsburg, G. S., Febbo, P., Lancaster, J. & Nevins, J. R. (2006). "Genomic Signatures to Guide the Use of Chemotherapeutics." Nature Medicine 12(11): 1294-1300.
Potti, A. & Nevins, J. R. (2007). "Reply to Coombes, Wang & Baggerly." Nature Medicine 13(11): 1277-8.
Ratliff, K. A. & Oishi, S. (2013). "Gender Differences in Implicit Self-Esteem Following a Romantic Partner's Success or Failure". Journal of Personality and Social Psychology 105(4): 688-702.
Rosenkrantz, R. (1977). Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.
Savage, L. J. (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen.
Savage, L. J. (1964). "The Foundations of Statistics Reconsidered." In Studies in Subjective Probability, edited by H. Kyburg & H. Smokler, 173-188. New York: John Wiley & Sons.
Selvin, H. (1970). "A Critique of Tests of Significance in Survey Research." In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.
Trafimow, D. & Marks, M. (2015). "Editorial". Basic and Applied Social Psychology 37(1): 1-2.
Wagenmakers, E.-J. (2007). "A Practical Solution to the Pervasive Problems of P Values". Psychonomic Bulletin & Review 14(5): 779-804.