What To Do About the Multiple Comparisons Problem?
Peter Z. Schochet
February 2008
Overview of Presentation
Background
Suggested testing guidelines
Background
Overview of the Problem
Multiple hypothesis tests are often conducted in impact studies
– Outcomes
– Subgroups
– Treatment groups
Standard testing methods could yield:
– Spurious significant impacts
– Incorrect policy conclusions
Assume a Classical Hypothesis Testing Framework
True impacts are fixed for the study population
Test H_0j: Impact_j = 0
Reject H_0j if the p-value of the t-test < α = .05
Chance of finding a spurious impact is 5 percent for each test alone
But Suppose No True Impacts and the Tests Are Considered Together
Number of Tests(a)    Probability That at Least 1 t-test Is Statistically Significant
1                     .05
5                     .23
10                    .40
20                    .64
50                    .92
(a) Assumes independent tests
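The probabilities in the table above follow directly from the independence assumption: if each test has a 5 percent chance of a spurious rejection, the chance of at least one spurious rejection among n tests is 1 − (1 − .05)^n. A minimal Python sketch of that arithmetic (the function name is illustrative and not from the presentation):

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests rejects
    when there are no true impacts (every null hypothesis is true)."""
    return 1 - (1 - alpha) ** n_tests

# Recompute the table rows for 1, 5, 10, 20, and 50 tests.
for n in (1, 5, 10, 20, 50):
    print(n, round(prob_any_false_positive(n), 2))
```

With correlated tests the true probability would differ, so the figures are best read as an illustration of how quickly the combined error rate grows.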
Impact Findings Can Be Misrepresented
Publishing bias
A focus on “stars”
Adjustment Procedures Lower α Levels for Individual Tests
Control the “combined” error rate
Many available methods:
– Bonferroni: Compare p-values to (.05 / # of tests)
– Fisher’s LSD, Holm (1979), Sidak (1967), Scheffe (1959), Hochberg (1988), Rom (1990), Tukey (1953)
– Resampling methods (Westfall and Young 1993)
– Benjamini-Hochberg (1995)
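To make three of the listed procedures concrete, here is a minimal Python sketch of Bonferroni, Holm, and Benjamini-Hochberg applied to a list of p-values. The function names and the example p-values are made up for illustration. Bonferroni and Holm control the chance of any false rejection, whereas Benjamini-Hochberg controls the false discovery rate, so the three are not interchangeable.

```python
def bonferroni(pvals, alpha=0.05):
    """Compare every p-value to alpha / (number of tests)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Step-down: test the smallest p-value at alpha/m, the next at
    alpha/(m-1), and so on; stop at the first non-rejection."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up: find the largest rank k with p_(k) <= k * alpha / m
    and reject the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject

# Hypothetical p-values from five tests.
pvals = [0.001, 0.012, 0.02, 0.035, 0.30]
print(sum(bonferroni(pvals)), sum(holm(pvals)), sum(benjamini_hochberg(pvals)))
```

On these made-up p-values the three procedures reject 1, 2, and 4 hypotheses respectively, which shows how the choice of method changes the conclusions.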
These Methods Reduce Statistical Power: The Chances of Finding Real Effects
Simulated Statistical Power(a)
Number of Tests    Unadjusted    Bonferroni
5                  .80           .59
10                 .80           .50
20                 .80           .41
50                 .80           .31
(a) Assumes 1,000 treatments and 1,000 controls, 20 percent of all null hypotheses are true, and independent tests
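The presentation reports simulated power. As a rough cross-check, not a reproduction of the author's simulation, the loss of power can be approximated analytically. The sketch below assumes a two-sided test, a normal approximation to the t-test (reasonable with 2,000 sample members), and a true impact sized so that a single unadjusted test has 80 percent power. The function name is illustrative, and the share of true null hypotheses is not used because power concerns only the tests with real effects.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def bonferroni_power(n_tests, alpha=0.05, unadjusted_power=0.80):
    """Approximate power of one two-sided test at level alpha / n_tests."""
    # Standardized impact giving the stated power for one unadjusted
    # test (an assumption about how the table was calibrated).
    delta = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(unadjusted_power)
    # Critical value after the Bonferroni correction.
    crit = Z.inv_cdf(1 - alpha / (2 * n_tests))
    # Probability that the test statistic lands in either rejection region.
    return Z.cdf(delta - crit) + Z.cdf(-delta - crit)

for n in (5, 10, 20, 50):
    print(n, round(bonferroni_power(n), 2))
```

Under these assumptions the approximation should come out close to the Bonferroni column in the table above.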
Big Debate on Whether To Use Adjustment Procedures
What is the proper balance between Type I and Type II errors?
To Adjust or Not To Adjust?
February, July, December 2007 Advisory Panel Meetings Held at IES
Chairs:
Phoebe Cottingham, IES
Rob Hollister, Swarthmore
Rebecca Maynard, U. of PA
Participants:
Steve Bell, Abt
Howard Bloom, MDRC
John Burghardt, MPR
Mark Dynarski, MPR
Andrew Gelman, Columbia
David Judkins, Westat
Jeff Kling, Brookings
David Myers, AIR
Larry Orr, Abt
Peter Schochet, MPR
Basic Principles for a Testing Strategy
The Multiplicity Problem Should Not Be Ignored
Erroneous conclusions can result otherwise
But need a strategy that balances Type I and II errors
Limiting the Number of Outcomes and Subgroups Can Help
But not always possible or desirable
Need flexible strategy for confirmatory and exploratory analyses
Problem Should Be Addressed by First Structuring the Data
Structure will depend on the research questions
Adjustments should not be conducted blindly across all contrasts
Suggested Testing Guidelines
The Plan Must Be Specified Up Front
Rigor requires that the strategy be documented prior to data analysis
Delineate Separate Outcome Domains
Based on a conceptual framework that relates the intervention to the outcomes
Represent key clusters of constructs
Domain “items” are likely to measure the same underlying trait
– Test scores
– Teacher practices
– School attendance
Testing Strategy: Both Confirmatory and Exploratory Components
Confirmatory component
– Addresses central study hypotheses
– Must adjust for multiple comparisons
– Must be specified in advance
Exploratory component
– Identify impacts or relationships for future study
– Findings should be regarded as preliminary
Confirmatory Analysis Has Two Potential Parts
1. Domain-specific analysis
2. Between-domain analysis
Domain-Specific Analysis
Test Impacts for Outcomes as a Group
Create a composite domain outcome
– Weighted average of standardized outcomes
  – Simple average
  – Index
  – Latent factor
Conduct a t-test on the composite
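As an illustration of the composite approach, the sketch below builds a simple-average composite from standardized outcomes and tests the treatment-control difference. Several choices here are assumptions rather than the presentation's prescription: outcomes are standardized with the control group's mean and standard deviation, the weights are equal, and the p-value uses a large-sample normal approximation in place of the exact t distribution. All names and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def composite_scores(rows, control_rows):
    """Simple average of outcomes, each standardized by the
    control-group mean and SD (an assumed convention)."""
    k = len(control_rows[0])
    mus = [mean([r[j] for r in control_rows]) for j in range(k)]
    sds = [stdev([r[j] for r in control_rows]) for j in range(k)]
    return [mean([(r[j] - mus[j]) / sds[j] for j in range(k)]) for r in rows]

def composite_test(treatment_rows, control_rows):
    """Return (impact, test statistic, approximate two-sided p-value)."""
    t = composite_scores(treatment_rows, control_rows)
    c = composite_scores(control_rows, control_rows)
    impact = mean(t) - mean(c)
    se = sqrt(stdev(t) ** 2 / len(t) + stdev(c) ** 2 / len(c))
    stat = impact / se
    # Normal approximation to the t-test; adequate only for large samples.
    p = 2 * (1 - NormalDist().cdf(abs(stat)))
    return impact, stat, p

# Made-up data: two outcomes per person in one domain.
control = [(float(i), float(i % 5)) for i in range(20)]
treatment = [(a + 5.0, b + 1.0) for a, b in control]
impact, stat, p = composite_test(treatment, control)
print(round(impact, 2), p < 0.05)
```

An index or latent-factor composite would replace the equal weights in `composite_scores` with weights chosen in advance or estimated from the data.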
What About Tests for Individual Domain Outcomes?
If impact on composite is significant
– Test impacts for individual domain outcomes without multiplicity corrections
– Use only for interpretation
If impact on composite is not significant
– Further tests are not warranted
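The two-step rule above can be written as a small gatekeeping function. This is only a sketch of the decision logic; the function name, the dictionary layout, and the example p-values are hypothetical.

```python
def domain_decision(composite_p, outcome_pvals, alpha=0.05):
    """Look at individual outcomes only if the composite is significant."""
    if composite_p >= alpha:
        # Composite not significant: further tests are not warranted.
        return {"composite_significant": False, "interpretive_tests": None}
    # Composite significant: unadjusted tests, used only for interpretation.
    flags = {name: p < alpha for name, p in outcome_pvals.items()}
    return {"composite_significant": True, "interpretive_tests": flags}

# Hypothetical p-values for a test-score domain.
pvals = {"reading": 0.01, "math": 0.20}
print(domain_decision(0.03, pvals))
print(domain_decision(0.12, pvals))
```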
Between-Domain Analysis
Applicable If Studies Require Summative Evidence of Impacts
Constructing “unified” composites may not make sense
– Domains measure different latent traits
Test domain composites individually using adjustment procedures
Testing Strategy Will Depend on the Research Questions
Are impacts significant in all domains?
– No adjustments are needed
Are impacts significant in any domain?
– Adjustments are needed
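The difference between the two research questions can be sketched in a few lines of Python. Requiring significance in every domain only makes the criterion harder to meet, so each composite can be tested at the unadjusted level. Declaring success when any domain is significant gives several chances for a spurious finding, so an adjustment is applied. Bonferroni is used here purely as one example of an adjustment, and the function names and p-values are hypothetical.

```python
def significant_in_all_domains(composite_pvals, alpha=0.05):
    """'All domains' question: every composite must pass an unadjusted test."""
    return all(p < alpha for p in composite_pvals)

def significant_in_any_domain(composite_pvals, alpha=0.05):
    """'Any domain' question: at least one composite must pass an
    adjusted test (Bonferroni shown as one possible choice)."""
    m = len(composite_pvals)
    return any(p < alpha / m for p in composite_pvals)

# Hypothetical composite p-values for three domains.
domain_pvals = [0.03, 0.04, 0.02]
print(significant_in_all_domains(domain_pvals))  # each p-value is below .05
print(significant_in_any_domain(domain_pvals))   # none is below .05 / 3
```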
Other Situations That Require Multiplicity Adjustments
1. Designs with multiple treatment groups
– Apply Tukey-Kramer, Dunnett, or resampling methods to domain composites
2. Subgroup analyses that are part of the confirmatory analysis
– Conduct F-tests for differences across subgroup impacts
Statistical Power
Studies must be designed to have sufficient statistical power for all confirmatory analyses
– Includes subgroup analyses
Reporting Must Link to the Study Protocols
Qualify confirmatory and exploratory analysis findings in reports
– No one way to present adjusted and unadjusted p-values
– Confidence intervals may be helpful
– Emphasize confirmatory analysis results in the executive summary
Testing Approach Summary
Pre-specify plan in the study protocols
Structure the data
– Delineate outcome domains
Confirmatory analysis
– Within and between domains
Exploratory analysis
Qualify findings appropriately