

STATISTICAL QUESTION

Multiple significance tests: the Bonferroni correction

Philip Sedgwick, senior lecturer in medical statistics

Centre for Medical and Healthcare Education, St George’s, University of London, Tooting, London, UK

Researchers assessed the effects of hormone replacement therapy, consisting of combined oestrogen and progestogen, on health related quality of life. A randomised, placebo controlled, double blind trial design was used. Women were recruited if they were postmenopausal, had a uterus, and were aged 50-69 at randomisation. Outcome measures included health related quality of life and psychological wellbeing. The study period was one year.[1]

The researchers investigated the effects of combined hormone replacement therapy compared with placebo at one year, using a 0.05 (5%) critical level of significance and adjusting this with the Bonferroni correction. The researchers concluded that combined hormone replacement therapy started many years after the menopause can improve health related quality of life.

For which of the following errors does the Bonferroni correction reduce the probability of occurrence?

a) Type I error
b) Type II error

Answers

The Bonferroni correction reduces the probability of making a type I error (answer a) but not a type II error (answer b).

Combined hormone replacement therapy was compared with placebo using statistical hypothesis testing, the purpose of which was to make inferences about the population on the basis of the sample. However, if the sample was not representative of the population then errors could have been committed in the hypothesis testing. Two types of error were possible, type I and type II, described in a previous question.[2] The purpose of the Bonferroni correction was to limit the probability of committing a type I error (answer a).

Type I and II errors would both result in an incorrect inference being made about the effectiveness of the combined hormone replacement therapy. A type I error would occur if the null hypothesis was incorrectly rejected in favour of the alternative; that is, if there was a difference in outcome between combined hormone treatment and placebo in the sample but not in the population. A type I error would occur because of sampling error: only a proportion of the population was studied, possibly resulting in an unrepresentative sample. Sampling error can also result in a type II error, which is when the null hypothesis is not rejected in favour of the alternative when it should have been; that is, there is a difference in outcome in the population between combined hormone treatment and placebo but the difference was not seen in the sample. However, the Bonferroni correction does not limit the probability of a type II error occurring (answer b is false). Sampling error can be reduced by increasing sample size, which gives a more representative sample and therefore increases the power of the statistical test.[3]
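The link between sample size and power can be illustrated with a minimal sketch. It assumes a two-sided, two-sample z test with equal group sizes and a known common standard deviation, which is a simplification and not the analysis used in the trial. The function name `two_sample_power` and the standardised effect size of 0.5 are hypothetical choices made only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z test.

    Assumptions (illustrative only): equal group sizes, known common
    standard deviation, and effect_size expressed in standard deviation units.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)    # two-sided critical value
    shift = effect_size * sqrt(n_per_group / 2)   # mean of the test statistic
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical effect size of 0.5 standard deviations.
for n in (20, 40, 80, 160):
    print(f"n per group = {n:>3}: power = {two_sample_power(0.5, n):.2f}")
```

Under these assumptions the printed power rises as the number per group increases, so the probability of a type II error (one minus the power) falls.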

For each hypothesis test in the study, the P value can be understood by hypothetically repeating the study an infinite number of times under the null hypothesis. The P value is the proportion of these hypothetical studies that would have produced a test statistic with an absolute value greater than or equal to that calculated in the study above. The critical level of significance was set at 0.05 (5%). Therefore, for each hypothesis test the null hypothesis would be rejected in favour of the alternative for those 5% of the hypothetical studies with the largest test statistics; hence, for any single hypothesis test, the maximum probability of rejecting a true null hypothesis was 0.05. Since any hypothesis test could result in a type I error, the probability of one occurring for each test was at most 0.05. When multiple hypothesis tests are performed, the probability that at least one type I error occurs is greater than 0.05.[4]
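How quickly this probability grows can be sketched with a short calculation. The sketch assumes that the tests are independent and that every null hypothesis is true, in which case the probability of at least one type I error among m tests is 1 - (1 - 0.05)^m. Independence is an assumption made here for illustration; the outcome measures in a real trial are usually correlated. The function name `familywise_error_rate` is hypothetical.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one type I error across n_tests tests.

    Assumes the tests are independent and every null hypothesis is true
    (an illustrative simplification).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return 1.0 - (1.0 - alpha) ** n_tests


# 41 is the number of tests the researchers reported performing.
for m in (1, 5, 10, 41):
    print(f"{m:>2} tests: {familywise_error_rate(0.05, m):.3f}")
```

Under these assumptions, 41 tests each performed at the 0.05 level would give a probability of roughly 0.88 that at least one significant finding is a type I error.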

Care must be taken when studies undertake a large number of statistical tests, because ultimately some of these will result in a type I error. However, we will not know which significant findings are type I errors. Various approaches have been suggested to reduce the number of type I errors when undertaking multiple testing, including the Bonferroni correction.

The Bonferroni correction involved adjusting the critical significance level of 0.05 by dividing it by the number of statistical tests performed. The researchers reported performing 41 statistical tests, so statistical significance was achieved if P was less than 0.05 ÷ 41, or approximately 0.001. The correction is conservative and not recommended if a large number of tests are performed, since few if any tests will be significant after the correction has been applied.
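The arithmetic of the correction can be sketched as follows. The function names and the three P values are hypothetical and are not taken from the trial; only the 0.05 level and the 41 tests come from the text above.

```python
from typing import List, Optional, Sequence


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Critical significance level after the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def bonferroni_significant(
    p_values: Sequence[float],
    alpha: float = 0.05,
    n_tests: Optional[int] = None,
) -> List[bool]:
    """Flag which P values fall below the corrected threshold.

    n_tests defaults to the number of P values supplied; pass it explicitly
    when the values shown are only a subset of all tests performed.
    """
    m = len(p_values) if n_tests is None else n_tests
    threshold = bonferroni_threshold(alpha, m)
    return [p < threshold for p in p_values]


threshold = bonferroni_threshold(0.05, 41)
print(f"corrected threshold: {threshold:.5f}")  # 0.00122

# Hypothetical P values, for illustration only.
example_p_values = [0.0004, 0.003, 0.04]
print(bonferroni_significant(example_p_values, n_tests=41))  # [True, False, False]
```

In this hypothetical example, two of the three P values would be significant at the uncorrected 0.05 level but not after the correction, which illustrates why the correction is described as conservative.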


For personal use only: See rights and reprints http://www.bmj.com/permissions Subscribe: http://www.bmj.com/subscribe

BMJ 2012;344:e509 doi: 10.1136/bmj.e509 (Published 25 January 2012) Page 1 of 2


Competing interests: None declared.

1 Welton AJ, Vickers MR, Kim J, Ford D, Lawton BA, MacLennan AH, et al, on behalf of the WISDOM team. Health related quality of life after combined hormone replacement therapy: randomised controlled trial. BMJ 2008;337:a1190.

2 Sedgwick P. Errors when statistical hypothesis testing. BMJ 2010;340:c2348.

3 Sedgwick P. Sample size calculations I. BMJ 2010;340:c3104.

4 Sedgwick P. Multiple significance tests. BMJ 2010;340:c2963.

Cite this as: BMJ 2012;344:e509

© BMJ Publishing Group Ltd 2012
