A Prediction Interval for the Misclassification Rate
E.B. Laber & S.A. Murphy


Page 1:

A Prediction Interval for the Misclassification Rate

E.B. Laber & S.A. Murphy

Page 2:

Outline

– Review– Three challenges in constructing PIs– Combining a statistical approach with a

learning theory approach to constructing PIs– Relevance to confidence measures for the value

of a dynamic treatment regime.

Page 3:

Review

– X is the vector of features in R^q; Y is the binary label in {-1,1}
– Misclassification Rate: the probability that the classifier mislabels a new observation, P(Y ≠ f(X))
– Data: N iid observations of (Y,X)
– Given a space of classifiers and the data, use some method to construct a classifier
– The goal is to provide a PI for the misclassification rate of the constructed classifier

Page 4:

Review

– Since the 0-1 loss is not smooth, one commonly uses a smooth surrogate loss to estimate the classifier

– Surrogate Loss: L(Y,f(X))
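As a concrete illustration (not part of the slides), the 0-1 loss and a few common smooth surrogates can be written as follows; the function names are ours and purely illustrative.

```python
import numpy as np

def zero_one_loss(y, fx):
    """0-1 loss: 1 when sign(f(x)) disagrees with the label y in {-1, 1} (ties count as errors)."""
    return (y * np.sign(fx) <= 0).astype(float)

def squared_loss(y, fx):
    """Least-squares surrogate (the one used later in the talk's implementation)."""
    return (y - fx) ** 2

def hinge_loss(y, fx):
    """Hinge surrogate (SVM-style)."""
    return np.maximum(0.0, 1.0 - y * fx)

def logistic_loss(y, fx):
    """Logistic (logit) surrogate."""
    return np.log1p(np.exp(-y * fx))
```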

Page 5:

Review

General approach to providing a PI:

– Estimate the misclassification rate using the data, yielding an estimator
– Derive an approximate distribution for this estimator
– Use this approximate distribution to construct a prediction interval for the misclassification rate

(A naive instance of this recipe is sketched below.)
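The sketch below instantiates this recipe in the simplest possible way: treat the error estimate like a binomial proportion and apply a normal approximation. The function name and the binomial-style standard error are illustrative assumptions, not the talk's method; the later slides explain why such approximations break down for the error of a fitted classifier.

```python
import numpy as np
from scipy.stats import norm

def normal_approx_pi(error_estimate, n, alpha=0.05):
    """Naive PI: treat the error estimate like a binomial proportion and use
    a normal approximation.  Illustrates the generic recipe only."""
    se = np.sqrt(error_estimate * (1.0 - error_estimate) / n)
    z = norm.ppf(1.0 - alpha / 2.0)
    return max(0.0, error_estimate - z * se), min(1.0, error_estimate + z * se)

print(normal_approx_pi(0.22, n=100))   # roughly (0.139, 0.301)
```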

Page 6:

Review

A common choice for the estimator is the resubstitution error or training error:

(1/N) Σ_i 1{Y_i ≠ f(X_i)}

evaluated at the fitted classifier; e.g., if the fitted classifier is f(x) = sign(x'β), then the training error is the fraction of observations with Y_i ≠ sign(X_i'β). (A small numerical sketch follows.)
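A minimal numerical sketch of the resubstitution error for a least-squares linear classifier. The simulated data-generating model below is an assumption for illustration; it is not the talk's three point distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative simulated data: X in R^2, Y in {-1, 1}.
N = 100
X = rng.normal(size=(N, 2))
Y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=N))
Y[Y == 0] = 1

# Fit a linear classifier f(x) = sign(x'beta) using the least-squares surrogate.
Xd = np.column_stack([np.ones(N), X])                 # add an intercept
beta_hat = np.linalg.lstsq(Xd, Y, rcond=None)[0]

# Resubstitution (training) error: (1/N) * sum of 1{Y_i != f(X_i)} at the fit.
train_err = np.mean(Y != np.sign(Xd @ beta_hat))
print(f"training error = {train_err:.3f}")
```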

Page 7:

Three challenges

1) The space of classifiers is too large, leading to over-fitting and negative bias in the training error
2) The misclassification rate is a non-smooth function of f
3) The error estimate may behave like an extreme quantity

There is no assumption that the fitted classifier is close to optimal.

Page 8:

A Challenge

2) The misclassification rate is a non-smooth function of f.

Example: The unknown optimal classifier has a quadratic decision boundary. We fit, by least squares, a linear decision boundary

f(x) = sign(β0 + β1 x)

(A small simulation sketch of this setting follows.)
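A minimal simulation sketch of this setting, under an assumed data-generating model (the quadratic form of P(Y = 1 | x) below is illustrative): the optimal boundary is quadratic, the fitted rule is linear, and the training error is a step function of the coefficients, which is the non-smoothness at issue.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed model: P(Y = 1 | x) depends on x^2, so the Bayes boundary is quadratic,
# yet we fit a linear rule f(x) = sign(b0 + b1 * x) by least squares.
N = 200
x = rng.uniform(-2.0, 2.0, size=N)
p_pos = 1.0 / (1.0 + np.exp(-(x ** 2 - 1.0)))
y = np.where(rng.uniform(size=N) < p_pos, 1.0, -1.0)

Xd = np.column_stack([np.ones(N), x])
b0, b1 = np.linalg.lstsq(Xd, y, rcond=None)[0]

# The training error of sign(b0 + b1 * x) is a step function of (b0, b1):
# small perturbations may leave it unchanged, then it jumps by 1/N.
train_err = lambda a0, a1: np.mean(y != np.sign(a0 + a1 * x))
print(train_err(b0, b1), train_err(b0 + 1e-4, b1), train_err(b0 + 0.2, b1))
```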

Page 9:

Figure: Density of the error estimate; Three Point Dist. (n=30) and Three Point Dist. (n=100).

Page 10:

Figure: Bias of the common (resubstitution) estimate on the Three Point Example.

Page 11:

Figure: Coverage of the Bootstrap PI in the Three Point Example (goal = 95%).

Page 12:

Figure: Coverage of the correctly centered Bootstrap PI (goal = 95%).

Page 13:

Coverage of 95% PI (Three Point Example)

Sample Size   Bootstrap Percentile   Yang CV   CUD-Bound
30            .79                    .75       .97
50            .79                    .62       .97
100           .78                    .46       .96
200           .78                    .35       .96

Page 14:

PIs from Learning Theory

Given a result of the form: for all N, with probability at least 1-δ, the training error is within ε(δ, N) of the misclassification rate uniformly over the classifier space, and the fitted classifier is known to belong to that space, then the interval

training error ± ε(δ, N)

forms a conservative 1-δ PI. (A toy instance is sketched below.)
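To make the shape of such a result concrete, here is a minimal sketch that instantiates it with Hoeffding's inequality and a union bound over a finite classifier class. It only illustrates how a uniform deviation bound yields a conservative PI; it is not the bound used in the talk.

```python
import numpy as np

def conservative_pi(training_errors, chosen, n, delta=0.05):
    """Conservative 1 - delta PI for the misclassification rate of a classifier
    chosen from a finite class of size M, via Hoeffding's inequality plus a
    union bound: with probability >= 1 - delta, every classifier's true error
    is within eps = sqrt(log(2 * M / delta) / (2 * n)) of its training error."""
    M = len(training_errors)
    eps = np.sqrt(np.log(2.0 * M / delta) / (2.0 * n))
    err_hat = training_errors[chosen]
    return max(0.0, err_hat - eps), min(1.0, err_hat + eps)

# Toy example: 50 candidate classifiers, each with a training error on n = 200 points.
rng = np.random.default_rng(2)
errs = rng.uniform(0.2, 0.4, size=50)
print(conservative_pi(errs, chosen=int(np.argmin(errs)), n=200))
```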

Page 15:

Combine statistical ideas with learning theory ideas

Construct a prediction interval based on a set of classifiers that is chosen to be small yet to contain the limiting classifier.

--- from this PI, deduce a conservative PI for the misclassification rate of the fitted classifier
--- use the surrogate loss to perform estimation and to construct the set

Page 16:

Construct a prediction interval based on this set of classifiers:

--- the set should contain all classifiers that are close to the estimated classifier
--- i.e., all f for which a surrogate-loss criterion is within a small tolerance
--- here the limiting classifier means the "limiting value" of the estimated classifier

Page 17:

Prediction Interval

Construct a prediction interval for the misclassification rate based on this set.

Page 18:

Prediction Interval

Page 19:

Bootstrap

We use the bootstrap to obtain an estimate of an upper percentile of the distribution of the bounding quantity; this gives bU, and similarly bL for a lower percentile. The PI is then formed from the error estimate and these two percentiles. (A generic sketch of the bootstrap step follows.)
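A generic sketch of this bootstrap step, written for an arbitrary centered statistic; the statistic passed in at the end (a scaled deviation of a mean) is purely illustrative and stands in for the talk's bounding quantity.

```python
import numpy as np

def bootstrap_percentiles(data, centered_stat, n_boot=1000, alpha=0.05, seed=0):
    """Nonparametric bootstrap estimate of an upper percentile (bU) and a lower
    percentile (bL) of a centered statistic.  `centered_stat(boot, orig)` is a
    stand-in for the talk's bounding quantity, not its actual form."""
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = np.array([
        centered_stat(data[rng.integers(0, n, size=n)], data)
        for _ in range(n_boot)
    ])
    bL = np.quantile(stats, alpha / 2.0)
    bU = np.quantile(stats, 1.0 - alpha / 2.0)
    return bL, bU

# Toy usage: sqrt(n)-scaled deviation of a bootstrap mean from the sample mean.
data = np.random.default_rng(3).normal(size=100)
stat = lambda boot, orig: np.sqrt(len(orig)) * (boot.mean() - orig.mean())
print(bootstrap_percentiles(data, stat))
```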

Page 20:

Implementation

• Approximation space for the classifier is linear: f(x) = sign(x'β)
• Surrogate loss is least squares: L(Y, f(X)) = (Y - x'β)^2

Page 21:

Implementation

With these choices the bounding quantity becomes an explicit function of β.

Page 22:

Implementation

• Bootstrap version of the bound
• The expectation appearing in it is taken with respect to the bootstrap distribution

Page 23:

Figure: CUD-Bound level sets (n = 30), Three Point Dist.

Page 24:

Computational Issues

• Partition R^q into equivalence classes defined by the 2N possible values of the first term.
• Each equivalence class can be written as a set of β satisfying linear constraints.
• The first term is constant on each equivalence class.

Page 25:

Computational Issues

On each equivalence class, the optimization can be rewritten in a simpler form, since g is non-decreasing.

Page 26:

Computational Issues

• Reduced the problem to the computation of at most 2N mixed integer quadratic programming problems.
• Using commercial solvers (e.g. CPLEX), the CUD bound can be computed for moderately sized data sets in a few minutes on a standard desktop (2.8 GHz processor, 2 GB RAM).

Page 27:

Comparisons, 95% PI

Data      CUD   BS    M     Y
Magic     1.0   .92   .98   .99
Mamm.     1.0   .68   .43   .98
Ion.      1.0   .61   .76   .99
Donut     1.0   .88   .63   .94
3-Pt      .97   .83   .90   .75
Balance   .95   .91   .61   .99
Liver     1.0   .96   1.0   1.0

Sample size = 30 (1000 data sets)

Page 28:

Comparisons, Length of PI

Data      CUD   BS    M     Y
Magic     .60   .31   .28   .46
Mamm.     .46   .53   .32   .42
Ion.      .42   .43   .30   .50
Donut     .47   .59   .32   .41
3-Pt      .38   .48   .32   .46
Balance   .38   .09   .29   .48
Liver     .62   .37   .33   .49

Sample size = 30 (1000 data sets)

Page 29:

Intuition

In large samples, the bounding quantity behaves like its limiting counterpart.

Page 30:

Intuition

The large-sample distribution is the same as the distribution of a limiting random quantity.

Page 31:

Intuition

If no mass falls on the limiting decision boundary, then the distribution is approximately that of a normal (the limiting distribution for a binomial, as expected).

Page 32:

Intuition

If some mass falls on the limiting decision boundary, the distribution is approximately that of a non-normal limiting quantity.

Page 33:

Discussion

• Further reduce the conservatism of the CUD-bound.
  – Replace parts of the bound by other quantities.
  – Other surrogates (exponential, logit, hinge).
• Construct a principle for minimizing the length of the conservative PI?
• The real goal is to produce PIs for the Value of a policy.

Page 34:

The simplest dynamic treatment regime (i.e., policy) is a single decision rule, used when there is only one stage of treatment.

One stage for each individual:
– Observation available at the stage
– Action at the stage (usually a treatment)
– Primary Outcome: Y

Page 35:

Goal:

Construct decision rules that input patient information and output a recommended action; these decision rules should lead to a maximal mean Y.

In the future, one selects the action recommended by the decision rule.

Page 36:

Single Stage

• Find a confidence interval for the mean outcome if a particular estimated policy (here one decision rule) is employed.
• Treatment A is randomized in {-1,1}.
• Suppose the decision rule is of the form: recommend treatment according to the sign of a linear function of the patient information.
• We do not assume the optimal decision boundary is linear.

(A sketch of the corresponding value estimate follows.)
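A minimal sketch of estimating the mean outcome (Value) of such a linear decision rule from a randomized study via inverse probability weighting. The function, names, and simulated data model are illustrative assumptions; a confidence interval for this quantity would still need to address the non-regularity the talk describes, not just a naive bootstrap.

```python
import numpy as np

def ipw_value(y, a, x, beta, p_treat=0.5):
    """Inverse-probability-weighted estimate of the mean outcome under the
    linear decision rule d(x) = sign(x'beta), when treatment A in {-1, 1} was
    randomized with known P(A = 1) = p_treat.  Standard IPW estimator; the
    notation is illustrative, not the talk's."""
    d = np.sign(x @ beta)
    prob_obs = np.where(a == 1, p_treat, 1.0 - p_treat)
    return np.mean(y * (a == d) / prob_obs)

# Toy randomized trial: one covariate plus intercept, outcome depends on a * x.
rng = np.random.default_rng(4)
n = 500
x = np.column_stack([np.ones(n), rng.normal(size=n)])
a = rng.choice([-1.0, 1.0], size=n)
y = 1.0 + 0.5 * a * x[:, 1] + rng.normal(size=n)
print(ipw_value(y, a, x, beta=np.array([0.0, 1.0])))
```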

Page 37:

Single Stage

The mean outcome following this policy can be written as an expectation that involves the randomization probability of the observed treatment.
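For reference, a standard inverse-probability-weighted representation of this mean outcome (the Value of a decision rule d), written in generic notation that may differ from the slide's:

\[ V(d) = E\left[ \frac{\mathbf{1}\{A = d(X)\}}{p(A)} \, Y \right], \]

where p(a) denotes the randomization probability of treatment a.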

Page 38:

STAR*D "Sequenced Treatment Alternatives to Relieve Depression"

[Trial schematic with columns: Preference, Treatment Two, Intermediate Outcome, Preference, Treatment Three, Follow-up. Initial treatment CIT; by preference, randomization (R) to augmentation (CIT + BUS, CIT + BUP-SR; then L2-Tx + THY, L2-Tx + LI) or to a switch (BUP-SR, SER, VEN; then MIRT, NTP); remission leads to follow-up, non-remission to the next treatment.]

Page 39:

Oslin ExTENd

[Trial schematic: random assignment to an early trigger or a late trigger for nonresponse; naltrexone for 8 weeks; responders are randomly assigned to TDM + Naltrexone or Naltrexone alone; nonresponders are randomly assigned to CBI or CBI + Naltrexone.]

Page 40:

This seminar can be found at:
http://www.stat.lsa.umich.edu/~samurphy/seminars/ColumbiaBioStat02.09.ppt

Email Eric or me with questions or if you would like a copy of the associated paper:
[email protected] or [email protected]

Page 41:

Figure: Bias of the common (resubstitution) estimate on the Three Point Example.

Page 42:

Intuition

Consider the large-sample variance of the error estimate. The variance involves an indicator that is non-smooth at 0: if, in place of the limiting quantity, we put a value that is close to 0, then, due to this non-smoothness, we can get jittering.