what to do about the multiple comparisons problem? peter z. schochet february 2008

What To Do About the Multiple Comparisons Problem?

Peter Z. Schochet

February 2008

Overview of Presentation

Background

Suggested testing guidelines

Background

Overview of the Problem

Multiple hypothesis tests are often conducted in impact studies

– Outcomes

– Subgroups

– Treatment groups

Standard testing methods could yield:

– Spurious significant impacts

– Incorrect policy conclusions

Assume a Classical Hypothesis Testing Framework

True impacts are fixed for the study population

Test H_0j: Impact_j = 0

Reject H_0j if the p-value of the t-test < α = .05

Chance of finding a spurious impact is 5 percent for each test alone

But Suppose No True Impacts and the Tests Are Considered Together

Number of Tests (a)    Probability That at Least 1 t-test Is Statistically Significant
1                      .05
5                      .23
10                     .40
20                     .64
50                     .92

(a) Assumes independent tests
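
The figures in this table follow from the independence assumption: if every null hypothesis is true and each test has a 5 percent false-positive rate, the chance that at least one of n tests rejects is 1 - (1 - .05)^n. The short sketch below reproduces that calculation; the function name is illustrative only and does not come from the presentation.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of `num_tests` independent tests rejects
    when every null hypothesis is true (each test run at level `alpha`)."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for n in (1, 5, 10, 20, 50):
        print(n, round(prob_at_least_one_false_positive(n), 2))
```

With correlated tests the combined rate would generally differ from these values, so the numbers apply only under the stated independence assumption.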

Impact Findings Can Be Misrepresented

Publishing bias

A focus on “stars”

Adjustment Procedures Lower α Levels for Individual Tests

Control the “combined” error rate

Many available methods:

– Bonferroni: Compare p-values to (.05 / # of tests)

– Fisher’s LSD, Holm (1979), Sidak (1967), Scheffe (1959), Hochberg (1988), Rom (1990), Tukey (1953)

– Resampling methods (Westfall and Young 1993)

– Benjamini-Hochberg (1995)
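
To make three of the listed procedures concrete, the sketch below implements Bonferroni, Holm, and Benjamini-Hochberg as plain Python functions that take a list of p-values and return a reject/keep decision for each. The function names and the example p-values are illustrative assumptions, and the sketch uses "less than or equal to" at the threshold, a boundary convention the slides do not specify.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Compare every p-value to alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: test the smallest p-value at alpha/m, the next at
    alpha/(m-1), and so on; stop at the first p-value that fails."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up: find the largest rank k whose sorted
    p-value is at most k*alpha/m, then reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            largest_k = rank
    reject = [False] * m
    for i in order[:largest_k]:
        reject[i] = True
    return reject


if __name__ == "__main__":
    example = [0.001, 0.012, 0.02, 0.035, 0.30]  # made-up p-values
    print(bonferroni_reject(example))
    print(holm_reject(example))
    print(benjamini_hochberg_reject(example))
```

On these made-up p-values Bonferroni rejects one hypothesis, Holm two, and Benjamini-Hochberg four. Benjamini-Hochberg controls the false discovery rate rather than the chance of any false rejection, which is why it is less conservative.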

These Methods Reduce Statistical Power: The Chances of Finding Real Effects

Simulated Statistical Power (a)

Number of Tests    Unadjusted    Bonferroni
5                  .80           .59
10                 .80           .50
20                 .80           .41
50                 .80           .31

(a) Assumes 1,000 treatments and 1,000 controls, 20 percent of all null hypotheses are true, and independent tests
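
The table reports simulated power. As a rough cross-check, and not the author's simulation, the sketch below uses a normal approximation: it takes the effect size implied by 80 percent power for a single two-sided test at the .05 level, then asks how often that same effect would clear the stricter Bonferroni threshold of .05 divided by the number of tests. The function name and the approximation are assumptions of this sketch.

```python
from statistics import NormalDist


def bonferroni_power(num_tests, alpha=0.05, unadjusted_power=0.80):
    """Approximate power of one two-sided z-test run at alpha/num_tests,
    for an effect sized so the unadjusted test has `unadjusted_power`.
    The negligible chance of rejecting in the wrong tail is ignored."""
    z = NormalDist()
    # Standardized effect (in standard-error units) implied by the
    # unadjusted test's power.
    delta = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(unadjusted_power)
    # Critical value after the Bonferroni correction.
    critical = z.inv_cdf(1 - alpha / (2 * num_tests))
    return z.cdf(delta - critical)


if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        print(n, round(bonferroni_power(n), 2))
```

Under these assumptions the approximation gives values close to the Bonferroni column of the table.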

Big Debate on Whether To Use Adjustment Procedures

What is the proper balance between Type I and Type II errors?

To Adjust or Not To Adjust?

February, July, December 2007 Advisory Panel Meetings Held at IES

Chairs:

Phoebe Cottingham, IES

Rob Hollister, Swarthmore

Rebecca Maynard, U. of PA

Participants:

Steve Bell, Abt

Howard Bloom, MDRC

John Burghardt, MPR

Mark Dynarski, MPR

Andrew Gelman, Columbia

David Judkins, Westat

Jeff Kling, Brookings

David Myers, AIR

Larry Orr, Abt

Peter Schochet, MPR

Basic Principles for a Testing Strategy

The Multiplicity Problem Should Not Be Ignored

Erroneous conclusions can result otherwise

But need a strategy that balances Type I and II errors

Limiting the Number of Outcomes and Subgroups Can Help

But not always possible or desirable

Need flexible strategy for confirmatory and exploratory analyses

Problem Should Be Addressed by First Structuring the Data

Structure will depend on the research questions

Adjustments should not be conducted blindly across all contrasts

Suggested Testing Guidelines

The Plan Must Be Specified Up Front

Rigor requires that the strategy be documented prior to data analysis

Delineate Separate Outcome Domains

Based on a conceptual framework that relates the intervention to the outcomes

Represent key clusters of constructs

Domain “items” are likely to measure the same underlying trait

– Test scores

– Teacher practices

– School attendance

Testing Strategy: Both Confirmatory and Exploratory Components

Confirmatory component

– Addresses central study hypotheses

– Must adjust for multiple comparisons

– Must be specified in advance

Exploratory component

– Identify impacts or relationships for future study

– Findings should be regarded as preliminary

Confirmatory Analysis Has Two Potential Parts

1. Domain-specific analysis

2. Between-domain analysis

Domain-Specific Analysis

Test Impacts for Outcomes as a Group

Create a composite domain outcome

– Weighted average of standardized outcomes

  – Simple average

  – Index

  – Latent factor

Conduct a t-test on the composite
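
As one way to picture the composite approach, the sketch below standardizes each outcome in a domain, averages the standardized values into a single composite per person, and computes a two-sample t-statistic on that composite. Several choices here are assumptions of the sketch rather than recommendations from the slides: outcomes are standardized with the control group's mean and standard deviation, equal weights are the default (the simple-average case), the standard error uses unequal variances, and the p-value uses a normal approximation suited to large samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def composite_t_test(treatment, control, weights=None):
    """treatment, control: lists of rows, one row of outcome values per
    person, with outcomes in the same order in every row.
    Returns (impact on composite, t-statistic, approximate p-value)."""
    k = len(control[0])
    weights = weights or [1.0 / k] * k  # simple average by default

    # Assumption: standardize each outcome using the control group.
    ctrl_means = [mean([row[j] for row in control]) for j in range(k)]
    ctrl_sds = [stdev([row[j] for row in control]) for j in range(k)]

    def composite(row):
        return sum(
            w * (row[j] - ctrl_means[j]) / ctrl_sds[j]
            for j, w in enumerate(weights)
        )

    t_scores = [composite(row) for row in treatment]
    c_scores = [composite(row) for row in control]

    impact = mean(t_scores) - mean(c_scores)
    se = sqrt(
        stdev(t_scores) ** 2 / len(t_scores)
        + stdev(c_scores) ** 2 / len(c_scores)
    )
    t_stat = impact / se
    # Normal approximation to the t distribution (large-sample assumption).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return impact, t_stat, p_value
```

An index or latent-factor composite would replace the equal weights with weights chosen by the analyst or estimated from the data; the `weights` argument is where such values would be passed.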

What About Tests for Individual Domain Outcomes?

If impact on composite is significant

– Test impacts for individual domain outcomes without multiplicity corrections

– Use only for interpretation

If impact on composite is not significant

– Further tests are not warranted
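
The rule above works as a gate: individual outcomes are examined only when the composite test is significant, and then only to interpret the composite result. A minimal sketch of that logic follows; the function name, dictionary keys, and example outcome names are hypothetical.

```python
def domain_findings(composite_p, item_p_values, alpha=0.05):
    """item_p_values: dict mapping outcome name -> unadjusted p-value.
    Individual outcomes are reported only if the composite is significant,
    and their tests carry no multiplicity correction."""
    if composite_p >= alpha:
        # Composite not significant: further tests are not warranted.
        return {"composite_significant": False, "items_for_interpretation": {}}
    return {
        "composite_significant": True,
        "items_for_interpretation": {
            name: p < alpha for name, p in item_p_values.items()
        },
    }


if __name__ == "__main__":
    items = {"reading score": 0.03, "math score": 0.20}  # made-up values
    print(domain_findings(0.01, items))
    print(domain_findings(0.40, items))
```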

Between-Domain Analysis

Applicable If Studies Require Summative Evidence of Impacts

Constructing “unified” composites may not make sense

– Domains measure different latent traits

Test domain composites individually using adjustment procedures

Testing Strategy Will Depend on the Research Questions

Are impacts significant in all domains?

– No adjustments are needed

Are impacts significant in any domain?

– Adjustments are needed
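
The difference between the two questions can be seen in how the decision is computed. An "all domains" claim requires every domain composite to pass its own unadjusted test, so adding domains only makes a false claim harder. An "any domain" claim takes the best of several tests and therefore needs a stricter threshold. The sketch below uses Bonferroni for the "any" case purely as a stand-in, since the slides do not name a specific adjustment here; the function names are illustrative.

```python
def significant_in_all_domains(p_values, alpha=0.05):
    """'All domains' question: every composite must be significant,
    so the largest p-value is compared to the unadjusted alpha."""
    return max(p_values) < alpha


def significant_in_any_domain(p_values, alpha=0.05):
    """'Any domain' question: the smallest p-value is compared to a
    Bonferroni-adjusted threshold (stand-in adjustment for this sketch)."""
    return min(p_values) < alpha / len(p_values)


if __name__ == "__main__":
    domain_p = [0.02, 0.04, 0.03]  # made-up composite p-values
    print(significant_in_all_domains(domain_p))
    print(significant_in_any_domain(domain_p))
```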

Other Situations That Require Multiplicity Adjustments

1. Designs with multiple treatment groups

– Apply Tukey-Kramer, Dunnett, or resampling methods to domain composites

2. Subgroup analyses that are part of the confirmatory analysis

– Conduct F-tests for differences across subgroup impacts

Statistical Power

Studies must be designed to have sufficient statistical power for all confirmatory analyses

– Includes subgroup analyses
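
One way to see what this requirement implies at the design stage is a textbook sample-size calculation with the significance level already divided by the number of confirmatory tests. The sketch below assumes a simple design that the slides do not specify: individual random assignment, equal-sized treatment and control groups, no clustering or covariates, a two-sided test, a normal approximation, and a Bonferroni adjustment. The function and parameter names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, num_confirmatory_tests=1, alpha=0.05, power=0.80):
    """Approximate sample size per group needed to detect a standardized
    mean difference `effect_size` when each confirmatory test is run at
    alpha / num_confirmatory_tests."""
    z = NormalDist()
    adjusted_alpha = alpha / num_confirmatory_tests
    z_total = z.inv_cdf(1 - adjusted_alpha / 2) + z.inv_cdf(power)
    return ceil(2 * (z_total / effect_size) ** 2)


if __name__ == "__main__":
    for m in (1, 5, 10):
        print(m, n_per_group(0.2, num_confirmatory_tests=m))
```

Under these assumptions a single test of a 0.2 standard deviation effect needs 393 people per group, and the requirement grows as confirmatory tests are added. Subgroup analyses raise it further because each subgroup contains only part of the sample.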

Reporting Must Link to the Study Protocols

Qualify confirmatory and exploratory analysis findings in reports

– No one way to present adjusted and unadjusted p-values

– Confidence intervals may be helpful

– Emphasize confirmatory analysis results in the executive summary

Testing Approach Summary

Pre-specify plan in the study protocols

Structure the data

– Delineate outcome domains

Confirmatory analysis

– Within and between domains

Exploratory analysis

Qualify findings appropriately
