Spurious skill? Varying event frequencies, non-collapsibility and Simpson’s Paradox in forecast verification

Roger Harbord, roger.harbord@metoffice.gov.uk

15th EMS & 12th ECAM, 10 September 2015, Sofia

Who was Simpson?

Journal of the Royal Statistical Society, Series B (Methodological), Vol. 13, No. 2, pp. 238‒241

© Royal Statistical Society

© Crown copyright Met Office

Verification of binary forecasts

Event forecast?       Event observed?
                      Yes       No
Yes                   Hits      False alarms
No                    Misses    Correct rejections

• Hit rate H = Hits / (Hits + Misses)

• False alarm rate F = False alarms / (False alarms + Correct rejections)

• Peirce skill score (true skill statistic, Hanssen & Kuipers’ discriminant, Youden’s index …)

PSS = H − F
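The three definitions above translate directly into code. The following is a minimal sketch; the class name `Table2x2` and the example counts are made up for illustration and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table2x2:
    """Counts from a 2x2 forecast/observation contingency table."""
    hits: int
    false_alarms: int
    misses: int
    correct_rejections: int

    def hit_rate(self) -> float:
        # Fraction of observed events that were forecast.
        return self.hits / (self.hits + self.misses)

    def false_alarm_rate(self) -> float:
        # Fraction of observed non-events that were (wrongly) forecast.
        return self.false_alarms / (self.false_alarms + self.correct_rejections)

    def pss(self) -> float:
        # Peirce skill score: hit rate minus false alarm rate.
        return self.hit_rate() - self.false_alarm_rate()


# Hypothetical counts, purely for illustration.
example = Table2x2(hits=30, false_alarms=10, misses=20, correct_rejections=40)
print(example.hit_rate(), example.false_alarm_rate(), example.pss())
```

PSS is 1 for a perfect forecast (H = 1, F = 0) and 0 when the forecast is issued equally often for events and non-events.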

Homer’s performance

December to May

Low vis. forecast?    Low visibility observed?
                      Yes     No      Total
Yes                   35      56      91
No                    21      70      91
Total                 56      126     182

• Hit rate H = 35 / 56 = 0.625

• False alarm rate F = 56 / 126 = 0.444

• Peirce skill score PSS = H − F = 0.625 − 0.444 = 0.18

two-sided P = 0.025
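The slide does not say which test produced the P-value. As a sketch, assuming Pearson's chi-square test with one degree of freedom and no continuity correction, the figure can be checked from the four counts using only the standard library; the function name is illustrative.

```python
import math


def chi_square_p_value(hits, false_alarms, misses, correct_rejections):
    """Pearson chi-square statistic and two-sided P-value for a 2x2 table.

    Assumes 1 degree of freedom and no continuity correction.
    """
    a, b, c, d = hits, false_alarms, misses, correct_rejections
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


chi2, p = chi_square_p_value(35, 56, 21, 70)
print(f"chi-square = {chi2:.3f}, two-sided P = {p:.4f}")
```

Under that assumption the result is close to the quoted 0.025. The test treats the 182 cases as independent, which forecast-observation pairs on consecutive days may not be.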

December to February

Low vis. forecast?    Low visibility observed?
                      Yes     No      Total
Yes                   33      43      76
No                    7       7       14
Total                 40      50      90

• Hit rate H = 33 / 40 = 0.825

• False alarm rate F = 43 / 50 = 0.860

• Peirce skill score PSS = H − F = 0.825 − 0.860 = −0.035

March to May

Low vis. forecast?    Low visibility observed?
                      Yes     No      Total
Yes                   2       13      15
No                    14      63      77
Total                 16      76      92

• Hit rate H = 2 / 16 = 0.125

• False alarm rate F = 13 / 76 = 0.171

• Peirce skill score PSS = H − F = 0.125 − 0.171 = −0.046
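The reversal can be reproduced by adding the two seasonal tables cell by cell and recomputing the score. This sketch uses the counts from the tables above; the function and variable names are illustrative.

```python
def rates(hits, false_alarms, misses, correct_rejections):
    """Return (hit rate, false alarm rate, Peirce skill score)."""
    h = hits / (hits + misses)
    f = false_alarms / (false_alarms + correct_rejections)
    return h, f, h - f


# Counts in the order (hits, false alarms, misses, correct rejections).
dec_feb = (33, 43, 7, 7)
mar_may = (2, 13, 14, 63)

# Pool the two seasons by simply adding up the counts.
pooled = tuple(a + b for a, b in zip(dec_feb, mar_may))

for label, counts in [("Dec-Feb", dec_feb), ("Mar-May", mar_may), ("Pooled", pooled)]:
    h, f, pss = rates(*counts)
    print(f"{label}: H = {h:.3f}, F = {f:.3f}, PSS = {pss:+.3f}")
```

Both seasonal scores are negative, yet the pooled score is positive. Low visibility is observed in 40 of 90 cases in December to February but only 16 of 92 in March to May, and it is forecast far more often in the first season, so the forecast and the event both track the season rather than each other.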

Collapsing contingency tables: A geometric approach

Shapiro SH (1982). The American Statistician, 36 (1): 43-46.

[Figure: plot with axes false alarm rate and hit rate, each running from 0 to 1]
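One possible reading of the geometry, offered as a sketch rather than as Shapiro's own construction: each subgroup is a point (F, H) in the unit square, and the pooled point is a weighted average of the subgroup points. The pooled hit rate is weighted by the number of observed events in each subgroup, while the pooled false alarm rate is weighted by the number of observed non-events. Because the two sets of weights differ, the pooled point need not lie on the line joining the subgroup points. The values below come from Homer's seasonal tables; the helper name is illustrative.

```python
def weighted_mean(values, weights):
    """Weighted average of values with non-negative weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Subgroup rates and sample sizes from the December-February and March-May tables.
hit_rates = [33 / 40, 2 / 16]
false_alarm_rates = [43 / 50, 13 / 76]
events = [40, 16]        # cases with low visibility observed
non_events = [50, 76]    # cases without low visibility observed

pooled_h = weighted_mean(hit_rates, events)              # weights: events
pooled_f = weighted_mean(false_alarm_rates, non_events)  # weights: non-events

print(f"pooled H = {pooled_h:.3f}, pooled F = {pooled_f:.3f}")
```

December to February supplies 71% of the events but only 40% of the non-events, so the pooled hit rate is pulled towards that season's high value more strongly than the pooled false alarm rate is.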

Simpson’s paradox (‘the reversal paradox’)

• Not limited to the Peirce skill score: the reversal is independent of the measure used, as all sensible performance measures agree on the direction of effect (positive or negative)

• Not limited to deterministic forecasts of dichotomous events. Analogous phenomena occur for:

• Continuous variables (‘Spurious correlation’: Pearson, 1899)

• Probabilistic forecasts

Non-collapsibility (Greenland, Robins & Pearl 1999)

• More generally, the value of a measure overall, compared to the same measure in two or more subgroups, can:

• Reverse

• Change from zero to non-zero or vice versa

• Increase or decrease in magnitude

• In general, conditions for collapsibility do depend on the measure chosen (Shapiro 1982)

Some real data

• Observations from UK surface stations

• Equitable threat score (ETS), also known as the Gilbert skill score

1. Precipitation ≥ 0.5 mm in 6 hours. Met Office global model. Combining groups of stations

2. Visibility ≤ 1000 m. Met Office ‘UKV’ model. Combining dates and times
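The slide names the ETS without defining it. A minimal sketch using the standard textbook definition follows: hits are compared with the number expected by chance given the marginal totals. The counts passed in are Homer's pooled table from earlier in the talk, used only as a convenient example, not the real station data.

```python
def equitable_threat_score(hits, false_alarms, misses, correct_rejections):
    """Equitable threat score (Gilbert skill score) for a 2x2 table."""
    n = hits + false_alarms + misses + correct_rejections
    # Hits expected by chance if forecasts and observations were independent.
    hits_random = (hits + misses) * (hits + false_alarms) / n
    return (hits - hits_random) / (hits + false_alarms + misses - hits_random)


ets = equitable_threat_score(35, 56, 21, 70)
print(f"ETS = {ets:.3f}")
```

Like the counts in the Homer example, the inputs to the ETS are often summed over stations or dates before the score is computed, which is where non-collapsibility can enter.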

Combining over areas

Combining over times

[Figure: ETS score for visibility <1000 m, UK sites, plotted against forecast lead time (axis ticks 6 to 36). Vertical axis from 0 to 0.12. Series: mean, median and 36-month pooled.]

Time of day

[Figure: base rate against time (hours)]

Is this new to verification?

• Non-collapsibility isn’t:

• Hamill TM & Juras J (2006). Measuring forecast skill: is it real skill or is it the varying climatology? Quarterly Journal of the Royal Meteorological Society, Vol. 132, No. 621C, pp. 2905-2923

• Mason I (1989). Dependence of the critical success index on sample climate and threshold probability. Australian Meteorological Magazine, Vol. 37, pp. 75-81

• I have yet to find any mention or description of Simpson’s paradox in the forecast verification literature.

Alternatives

• We have seen that it may be misleading to pool data over large areas or time periods simply by adding up the numbers (whether counts, mean squares, Brier scores…)

• But what’s the alternative?

1. Report performance measures only in homogeneous subgroups

• But there may be rather a lot of them, so we often want a summary measure

2. Percentile thresholds

• The issue of non-collapsibility goes away if the base rate (climatological event frequency) is the same for all samples

• This holds if percentile thresholds are used, i.e. quantiles of the local climatological distribution

• But such thresholds can be harder to interpret
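As a sketch of the idea, the example below compares a fixed threshold with a local percentile threshold at two sites. The site data are invented for illustration, and `statistics.quantiles` with the `inclusive` method is just one of several reasonable quantile estimators.

```python
import statistics


def percentile_threshold(climatology, percentile):
    """Local threshold: the given percentile (1 to 99) of a site's own record."""
    cut_points = statistics.quantiles(climatology, n=100, method="inclusive")
    return cut_points[percentile - 1]


def base_rate(values, threshold):
    """Fraction of cases at or below the threshold (the event frequency)."""
    return sum(v <= threshold for v in values) / len(values)


# Hypothetical visibility records (metres) at a foggy site and a clear site.
foggy_site = list(range(100, 10100, 100))    # 100 m to 10 km
clear_site = list(range(500, 50500, 500))    # 500 m to 50 km

# A fixed 1000 m threshold gives different base rates at the two sites.
print(base_rate(foggy_site, 1000), base_rate(clear_site, 1000))

# A 10th-percentile threshold gives the same base rate at both sites.
foggy_threshold = percentile_threshold(foggy_site, 10)
clear_threshold = percentile_threshold(clear_site, 10)
print(foggy_threshold, base_rate(foggy_site, foggy_threshold))
print(clear_threshold, base_rate(clear_site, clear_threshold))
```

The cost is visible in the output: the event now means a different visibility at each site, which is the interpretability problem noted in the last bullet.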

3. Weighted averaging

Estimate the measure within each homogeneous subgroup (e.g. coastal stations in autumn 2013)

1. Summarise these estimates graphically

2. If the estimates are fairly homogeneous: report a single summary measure by taking a weighted average of the estimates in each subgroup

In statistics, this is known as meta-analysis

(There is a whole literature on how to choose weights, derive confidence intervals ...)
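The talk does not prescribe a weighting scheme. As one common choice from the meta-analysis literature, the sketch below uses fixed-effect inverse-variance weights applied to the Peirce skill score in Homer's two seasons. The variance formula assumes binomial sampling with independent cases, which is an assumption of this example and may not hold for serially correlated forecasts.

```python
import math


def pss_with_variance(hits, false_alarms, misses, correct_rejections):
    """Peirce skill score and its approximate sampling variance."""
    n_events = hits + misses
    n_non_events = false_alarms + correct_rejections
    h = hits / n_events
    f = false_alarms / n_non_events
    # H and F are estimated from disjoint sets of cases, so variances add.
    variance = h * (1 - h) / n_events + f * (1 - f) / n_non_events
    return h - f, variance


def inverse_variance_average(estimates):
    """Fixed-effect summary: weight each estimate by 1 / variance."""
    weights = [1 / variance for _, variance in estimates]
    summary = sum(w * est for w, (est, _) in zip(weights, estimates)) / sum(weights)
    standard_error = math.sqrt(1 / sum(weights))
    return summary, standard_error


subgroups = [
    pss_with_variance(33, 43, 7, 7),    # December to February
    pss_with_variance(2, 13, 14, 63),   # March to May
]
summary, se = inverse_variance_average(subgroups)
print(f"summary PSS = {summary:+.3f} (standard error {se:.3f})")
```

Unlike pooling the counts, a weighted average of the subgroup estimates always lies between the smallest and largest of them, so it cannot reverse sign when all subgroups agree.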

4. Paired comparisons

• Skill scores relative to persistence account for variation with time of year and time of day (24-h persistence) (Mittermaier 2008)

• Commonly used for continuous variables

• Less common for dichotomous variables?

• Can define e.g. ETS skill score relative to persistence in the usual way
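A sketch of that definition, assuming "the usual way" means the generic skill-score form (score − reference) / (perfect − reference), with a perfect ETS of 1. The counts for the model and for the persistence forecast are hypothetical, chosen so that both tables share the same observations.

```python
def equitable_threat_score(hits, false_alarms, misses, correct_rejections):
    """Equitable threat score (Gilbert skill score) for a 2x2 table."""
    n = hits + false_alarms + misses + correct_rejections
    hits_random = (hits + misses) * (hits + false_alarms) / n
    return (hits - hits_random) / (hits + false_alarms + misses - hits_random)


def skill_score(score, reference_score, perfect_score=1.0):
    """Generic skill score of a forecast relative to a reference forecast."""
    return (score - reference_score) / (perfect_score - reference_score)


# Hypothetical counts: (hits, false alarms, misses, correct rejections).
# Both forecasts are verified against the same 40 events and 160 non-events.
model = (30, 20, 10, 140)
persistence = (20, 20, 20, 140)

ets_model = equitable_threat_score(*model)
ets_persistence = equitable_threat_score(*persistence)
relative_skill = skill_score(ets_model, ets_persistence)
print(f"ETS model = {ets_model:.3f}, ETS persistence = {ets_persistence:.3f}, "
      f"skill relative to persistence = {relative_skill:.2f}")
```

Because the persistence forecast is verified on the same cases as the model, it shares their base rate, which is what lets the comparison absorb seasonal and diurnal variation in event frequency.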

Summary & future work

• “Pooling over heterogeneous regions can easily produce misleading results”

• Not clear how big an issue this is in practice in a small region such as the UK

• Produce some empirical results on frequency of Simpson’s Paradox and of substantial non-collapsibility in real data

• Solutions include:

• Weighted averaging (meta-analysis)

• Paired comparisons (e.g. to persistence)

Discussion / Questions

• Anything I’ve missed?

• Should we worry more about these issues?

• What do you do in practice?

References

Greenland S, Robins JM, Pearl J (1999). Confounding and collapsibility in causal inference. Statistical Science 14(1), 29-46.

Hogan RJ, Mason IB (2012). Deterministic forecasts of binary events. Chapter 3 in Jolliffe IT, Stephenson DB. Forecast Verification: A Practitioner's Guide in Atmospheric Science. 2nd edition. John Wiley & Sons.

Mittermaier MP (2008). The potential impact of using persistence as a reference forecast on perceived forecast skill. Weather & Forecasting 23, 1022‒1031.

Pearson K (1899). Mathematical Contributions to the Theory of Evolution. VI. Genetic (Reproductive) Selection. Philosophical Transactions of the Royal Society of London. Series A. 192, 259-278. (Specifically, Proposition VI pp. 277‒278 “On the spurious correlation produced by forming a mixture of heterogeneous but uncorrelated materials”)

Shapiro SH (1982). Collapsing Contingency Tables—A Geometric Approach. The American Statistician 36(1), 43-46.

Simpson EH (1951). The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society. Series B (Methodological) 13(2), 238-241.

Simpson EH (2010). Edward Simpson: Bayes at Bletchley Park. Significance 7(2), 76-80.

Wikipedia contributors. Simpson's paradox. In Wikipedia, The Free Encyclopedia. (accessed 2015-09-10).

Wikipedia contributors. Edward H. Simpson. In Wikipedia, The Free Encyclopedia. (accessed 2015-09-10)

Historical aside: Stigler's law of eponymy

“No scientific discovery is named after its original discoverer.”

(Stephen Stigler, 1980)

• Simpson’s paper doesn’t describe reversal, but rather a fictitious example in which the measure is equal and non-zero in both of two categories, but zero when the categories are collapsed

• Reversal phenomenon named ‘Simpson’s Paradox’ by Colin Blyth in 1972

• Described (with a real-data example) as early as 1934 by Cohen & Nagel

Blyth CR (1972). On Simpson’s Paradox and the Sure-Thing Principle. Journal of the American Statistical Association 67(338), 364‒366.

Cohen MR, Nagel E (1934). An Introduction to Logic and Scientific Method. (Harcourt, Brace & Co.)

Stigler SM (1980). Stigler's law of eponymy. In: Gieryn TF, ed. Science and social structure: a festschrift for Robert K. Merton. (New York Academy of Sciences) pp. 147–57. Republished in Stigler's collection Statistics on the Table: The History of Statistical Concepts and Methods (1999, Harvard University Press)