Cross-Validation Methods


Page 1: Cross-Validation Methods

Cross-Validation Methods

Issue: How to estimate test error rate from training data?

Resampling: Repeatedly draw samples from a dataset and apply the learning method of interest to each sample to estimate its accuracy.

Cross-validation (CV): A resampling method to estimate test error. We will consider 3 approaches for CV:

Validation set approach

Leave-one-out cross-validation (LOOCV)

k-fold cross-validation

1 / 28

Page 2: Cross-Validation Methods

Validation Set Approach

This approach involves 4 steps:

1 Randomly divide the available data into two (often equal) parts — a training set and a validation set

2 Estimate the learning method f from the training set as f̂

3 Use f̂ to predict responses in the validation set

4 Compute Ave{I(y0 ≠ ŷ0)} over all validation set observations (x0, y0) — the validation estimate of test error

Essentially, a test set is created by randomly splitting the data.
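To make the four steps concrete, here is a minimal sketch of the validation set approach in Python. It assumes only NumPy; the nearest class-mean classifier and the names (validation_set_error, fit, predict) are illustrative stand-ins for whatever learning method is of interest, not part of the course materials.

```python
import numpy as np

def validation_set_error(X, y, fit, predict, seed=0):
    """Randomly split the data in half, fit on the training part,
    and return the zero-one error on the validation part."""
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = rng.permutation(n)
    train, val = idx[: n // 2], idx[n // 2 :]
    model = fit(X[train], y[train])
    y_hat = predict(model, X[val])
    return np.mean(y_hat != y[val])        # Ave{I(y0 != yhat0)}

# Toy learning method: nearest class-mean classifier (stand-in for any method).
def fit(X, y):
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    return classes, means

def predict(model, X):
    classes, means = model
    d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
    y = np.repeat([0, 1], 100)
    print("validation estimate of test error:",
          validation_set_error(X, y, fit, predict))
```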

Q: Recall how we created the training and test sets for the Stock Market data. Was the test set a validation set?

A: No, it was not based on a random split.

2 / 28

Page 3: Cross-Validation Methods

Drawbacks of the validation set approach:

The estimated test error may have high variability, limiting its usefulness

The validation set observations — a substantial portion of the data — are not used in training the method

Methods trained on larger datasets tend to perform better

Validation estimate of test error may be overestimating the true test error

3 / 28

Page 4: Cross-Validation Methods

Leave-One-Out Cross-Validation (LOOCV)

For i = 1, . . . , n,

leave out observation i, i.e., (xi, yi),

train the method on the remaining n − 1 observations,

predict observation i as ŷi, and

compute Erri = I(yi ≠ ŷi)

LOOCV estimate of test error: Average of the n errors:

$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Err}_i$$

Each term uses almost the entire data

Provides an unbiased estimate of test error

For a given dataset, LOOCV estimate is constant

Addresses drawbacks of the validation set approach

May be computationally intensive if n is large

For linear models, closed-form expression available based on one least-squares fit
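A minimal LOOCV sketch in Python, under the same assumptions as the earlier validation-set example (it reuses that example's illustrative fit/predict pair, which can be swapped for any learning method):

```python
import numpy as np

def loocv_error(X, y, fit, predict):
    """LOOCV estimate of test error: CV(n) = (1/n) * sum of Err_i,
    where Err_i = I(y_i != yhat_i) and yhat_i comes from a model
    trained without observation i."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # leave out observation i
        model = fit(X[keep], y[keep])            # train on the remaining n - 1
        y_hat_i = predict(model, X[i:i + 1])[0]  # predict the held-out point
        errs[i] = (y_hat_i != y[i])
    return errs.mean()
```

Note the n separate model fits, which is the computational cost mentioned above; for linear models fit by least squares this loop can be avoided via the closed-form shortcut.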

4 / 28

Page 5: Cross-Validation Methods



FIGURE 5.3. A schematic display of LOOCV. A set of n data points is repeatedly split into a training set (shown in blue) containing all but one observation, and a validation set that contains only that observation (shown in beige). The test error is then estimated by averaging the n resulting MSE's. The first training set contains all but observation 1, the second training set contains all but observation 2, and so forth.

MSE1 = (y1 − ŷ1)² provides an approximately unbiased estimate for the test error.

But even though MSE1 is unbiased for the test error, it is a poor estimate because it is highly variable, since it is based upon a single observation (x1, y1).

We can repeat the procedure by selecting (x2, y2) for the validation data, training the statistical learning procedure on the n − 1 observations {(x1, y1), (x3, y3), . . . , (xn, yn)}, and computing MSE2 = (y2 − ŷ2)². Repeating this approach n times produces n squared errors, MSE1, . . . , MSEn. The LOOCV estimate for the test MSE is the average of these n test error estimates:

$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{MSE}_i. \tag{5.1}$$

A schematic of the LOOCV approach is illustrated in Figure 5.3. LOOCV has a couple of major advantages over the validation set approach. First, it has far less bias. In LOOCV, we repeatedly fit the statistical learning method using training sets that contain n − 1 observations, almost as many as are in the entire data set. This is in contrast to the validation set approach, in which the training set is typically around half the size of the original data set. Consequently, the LOOCV approach tends not to overestimate the test error rate as much as the validation set approach does. Second, in contrast to the validation approach which will yield different results when applied repeatedly due to randomness in the training/validation set splits, performing LOOCV multiple times will always yield the same results: there is no randomness in the training/validation set splits.

Source: ISL

5 / 28

Page 6: Cross-Validation Methods

k-Fold Cross-Validation

Randomly divide the data into k groups or folds of approximately equal size. Then, for i = 1, . . . , k,

leave out the ith fold as a validation set,

train the method on the remaining k − 1 folds,

predict the observations in the ith fold, and compute Erri.

k-fold CV estimate of test error: Average the k errors,

$$\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{Err}_i$$

In practice, k = 5, 10 are common

Requires fitting the model only k times

LOOCV is a special case of k-fold CV with k = n

Starts with a random split so has some variability but not as much as validation set error

Computationally much simpler than LOOCV

6 / 28
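A hedged sketch of k-fold CV with the same illustrative fit/predict interface as the earlier examples; np.array_split is used to get folds of approximately equal size:

```python
import numpy as np

def kfold_cv_error(X, y, fit, predict, k=5, seed=0):
    """k-fold CV estimate of test error: CV(k) = (1/k) * sum of Err_i,
    where Err_i is the zero-one error on the i-th held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # k folds of ~equal size
    errs = []
    for i in range(k):
        val = folds[i]                                    # i-th fold held out
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errs.append(np.mean(predict(model, X[val]) != y[val]))
    return np.mean(errs)
```

Setting k = len(y) recovers LOOCV, while k = 5 or 10 keeps the number of model fits small.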

Page 7: Cross-Validation Methods



FIGURE 5.5. A schematic display of 5-fold CV. A set of n observations is randomly split into five non-overlapping groups. Each of these fifths acts as a validation set (shown in beige), and the remainder as a training set (shown in blue). The test error is estimated by averaging the five resulting MSE estimates.

chapters. The magic formula (5.2) does not hold in general, in which case the model has to be refit n times.

5.1.3 k-Fold Cross-Validation

An alternative to LOOCV is k-fold CV. This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error, MSE1, MSE2, . . . , MSEk. The k-fold CV estimate is computed by averaging these values,

$$\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{MSE}_i. \tag{5.3}$$

Figure 5.5 illustrates the k-fold CV approach. It is not hard to see that LOOCV is a special case of k-fold CV in which k is set to equal n. In practice, one typically performs k-fold CV using k = 5 or k = 10. What is the advantage of using k = 5 or k = 10 rather than k = n? The most obvious advantage is computational. LOOCV requires fitting the statistical learning method n times. This has the potential to be computationally expensive (except for linear models fit by least squares, in which case formula (5.2) can be used). But cross-validation is a very general approach that can be applied to almost any statistical learning method. Some statistical learning methods have computationally intensive fitting procedures, and so performing LOOCV may pose computational problems, especially if n is extremely large. In contrast, performing 10-fold CV requires fitting the learning procedure only ten times, which may be much more feasible.

Source: ISL

7 / 28

Page 8: Cross-Validation Methods

Bias-Variance Trade-Off for k-Fold CV

From the perspective of bias:

validation set estimate overestimates test error

LOOCV estimate is unbiased

k-fold CV estimate has an intermediate level of bias

In practice, k-fold CV with k = 5 or 10 is often preferable over LOOCV — the resulting estimate does not suffer from excessively high bias or variance.

8 / 28

Page 9: Cross-Validation Methods

Takeaways

To estimate test error using training data, we can use cross-validation methods.

LOOCV and k-fold CV with k = 5 or k = 10 are commonchoices.

These work even for regression problems — just change the definition of Erri from zero-one error, i.e., I(yi ≠ ŷi), to squared error, i.e., (yi − ŷi)²
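As a small illustration of that last point, the CV machinery above stays the same and only the per-fold error function changes; these two helpers (illustrative names) could be passed to the earlier k-fold sketch in place of its hard-coded zero-one error:

```python
import numpy as np

def zero_one_error(y, y_hat):
    return np.mean(y_hat != y)            # classification: Err_i = I(y_i != yhat_i)

def squared_error(y, y_hat):
    return np.mean((y - y_hat) ** 2)      # regression: Err_i = (y_i - yhat_i)^2
```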

9 / 28

Page 10: Cross-Validation Methods

Part II:

Logistic Regression and Discriminant Analysis

10 / 28

Page 11: Cross-Validation Methods

Additional Classification Approaches

Values of Y : Class labels k = 1, . . . ,K — unordered

Training data: (Yi, Xi), i = 1, . . . , n

Predicted value for a given x: Ŷ (a class label)

Error: Zero-one error, i.e.,

$$I(\hat{Y} \neq Y) = \begin{cases} 1, & \text{if } \hat{Y} \neq Y \\ 0, & \text{if } \hat{Y} = Y \end{cases}$$

Expected error rate: E{I(Ŷ ≠ Y)} = P(Ŷ ≠ Y) — probability of misclassification

Bayes classifier: Predicts the most likely class, i.e., the class k for which pk(x) = P(Y = k|x) is maximum — optimal in that it minimizes the expected error rate

11 / 28

Page 12: Cross-Validation Methods

Issue: pk(x) is unknown — need to estimate it from training data so that pk(x) can be used for classification.

Posterior probability: pk(x) = P(Y = k|X = x) — probability of falling in class k conditional on predictor x; ∑k pk(x) = 1

Prior probability: πk = P(Y = k) — marginal probability of falling in class k (aka prevalence); ∑k πk = 1

Class-conditional distribution: fk(x) — joint pdf/pmf of X|Y = k, i.e., distribution of X for an observation that comes from class k. In other words, if X is discrete, fk(x) represents P(X = x|Y = k), whereas if X is continuous, fk(x) represents the pdf of X conditional on Y = k.

Q: Why do the class-conditional distributions matter?

A: If they are well-separated, the classification is easier.

12 / 28

Page 13: Cross-Validation Methods



FIGURE 4.1. The Default data set. Left: The annual incomes and monthly credit card balances of a number of individuals. The individuals who defaulted on their credit card payments are shown in orange, and those who did not are shown in blue. Center: Boxplots of balance as a function of default status. Right: Boxplots of income as a function of default status.

4.2 Why Not Linear Regression?

We have stated that linear regression is not appropriate in the case of a qualitative response. Why not?

Suppose that we are trying to predict the medical condition of a patient in the emergency room on the basis of her symptoms. In this simplified example, there are three possible diagnoses: stroke, drug overdose, and epileptic seizure. We could consider encoding these values as a quantitative response variable, Y, as follows:

$$Y = \begin{cases} 1 & \text{if stroke;} \\ 2 & \text{if drug overdose;} \\ 3 & \text{if epileptic seizure.} \end{cases}$$

Using this coding, least squares could be used to fit a linear regression model to predict Y on the basis of a set of predictors X1, . . . , Xp. Unfortunately, this coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure. In practice there is no particular reason that this needs to be the case. For instance, one could choose an equally reasonable coding,

$$Y = \begin{cases} 1 & \text{if epileptic seizure;} \\ 2 & \text{if stroke;} \\ 3 & \text{if drug overdose.} \end{cases}$$

Source: ISL

13 / 28

Page 14: Cross-Validation Methods

Bayes theorem:

$$p_k(x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

Note: The denominator is a normalization factor that is common to all classes. Therefore, it does not play any role in the comparison of posterior probabilities of any two classes (because it gets cancelled).
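A small numerical illustration of Bayes theorem in Python, assuming (purely for illustration) univariate normal class-conditionals with the means and equal priors used in the ISL example later in these slides (µ1 = −1.25, µ2 = 1.25, π1 = π2 = 0.5); all function names are hypothetical:

```python
import numpy as np
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma2):
    """Univariate normal density f_k(x) with mean mu and variance sigma2."""
    return exp(-(x - mu) ** 2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

def posteriors(x, priors, means, sigma2=1.0):
    """Bayes theorem: p_k(x) = pi_k f_k(x) / sum_l pi_l f_l(x)."""
    numer = np.array([pk * normal_pdf(x, mk, sigma2)
                      for pk, mk in zip(priors, means)])
    return numer / numer.sum()      # denominator = common normalization factor

# Illustrative numbers: two classes with means -1.25 and 1.25, equal priors.
p = posteriors(x=0.4, priors=[0.5, 0.5], means=[-1.25, 1.25])
print(p, "-> predict class", int(np.argmax(p)) + 1)
```

Because the denominator is common to all classes, dropping it would not change which class attains the maximum posterior.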

Two approaches for classification:

Approach 1: Model fk(x) and use Bayes theorem to get pk(x) — discriminant analysis — primarily for prediction

Approach 2: Model pk(x) directly — logistic regression — for both prediction and inference

Let’s begin with discriminant analysis for K = 2 classes.

14 / 28

Page 15: Cross-Validation Methods

Discriminant Analysis for K = 2 Classes

Bayes classifier: Assign x to class 1 if

p1(x) > p2(x) ≡ π1f1(x) > π2f2(x) ≡ δ1(x) > δ2(x),

where δk(x) is called the discriminant function. It is obtained by starting with log(πkfk(x)) and dropping terms that are common to both class-conditionals because they cancel out upon differencing.

Both pk(x) and δk(x) induce the same ordering of classes, implying that we can focus on the latter

Bayes decision boundary: {x : δ1(x) = δ2(x)}

Assign x to class 2 if δ1(x) < δ2(x)

If x is such that δ1(x) = δ2(x), break tie randomly

15 / 28

Page 16: Cross-Validation Methods

Assumption: X|Y = k ∼ N(µk,Σk), i.e.,

$$f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k)\right\}$$

Class-conditionals are p-dimensional normal

E(X|Y = k) = µk, var(X|Y = k) = Σk

|Σk| = determinant of Σk

Verify:

$$\delta_k(x) = -\frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}x^T\Sigma_k^{-1}x + x^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k + \log(\pi_k)$$

In addition, if Σ1 = Σ2 = Σ is assumed,

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log(\pi_k)$$

Note:

Class-specific cov matrix: δk(x) is quadratic in x

Common cov matrix for all classes: δk(x) is linear in x

16 / 28

Page 17: Cross-Validation Methods

Linear Discriminant Analysis (LDA), X|Y = k ∼ N(µk, Σ)

To use δk(x) in practice, we need to estimate µk and Σ (and also πk if unknown) from the training data. For class k, let

nk = total number of observations

x̄k and Sk = sample mean and covariance matrix of predictor values

Estimates:

$$\hat{\pi}_k = \frac{n_k}{n}, \qquad \hat{\mu}_k = \bar{x}_k, \qquad \hat{\Sigma} = \frac{(n_1 - 1)S_1 + (n_2 - 1)S_2}{n - 2}$$

“Natural” estimators

Σ̂ is the pooled sample covariance matrix

Plug in to get:

$$\hat{\delta}_k(x) = \hat{\mu}_k^T\hat{\Sigma}^{-1}x - \frac{1}{2}\hat{\mu}_k^T\hat{\Sigma}^{-1}\hat{\mu}_k + \log(\hat{\pi}_k)$$
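A minimal sketch of these LDA estimates in Python (NumPy only, illustrative function names); it computes the class proportions, class means, the pooled covariance matrix, and the resulting plug-in discriminant scores:

```python
import numpy as np

def lda_fit(X, y):
    """'Natural' LDA estimates: class proportions, class means,
    and the pooled sample covariance matrix."""
    classes = np.unique(y)
    n = len(y)
    pis, mus, pooled = [], [], np.zeros((X.shape[1], X.shape[1]))
    for k in classes:
        Xk = X[y == k]
        pis.append(len(Xk) / n)                        # pi_k = n_k / n
        mus.append(Xk.mean(axis=0))                    # mu_k = xbar_k
        pooled += (len(Xk) - 1) * np.cov(Xk, rowvar=False)
    Sigma = pooled / (n - len(classes))                # (n - 2) when K = 2
    return classes, np.array(pis), np.array(mus), Sigma

def lda_discriminants(x, pis, mus, Sigma):
    """delta_k(x) = mu_k' Sigma^{-1} x - 0.5 mu_k' Sigma^{-1} mu_k + log(pi_k)."""
    Sinv = np.linalg.inv(Sigma)
    return np.array([m @ Sinv @ x - 0.5 * m @ Sinv @ m + np.log(p)
                     for p, m in zip(pis, mus)])

def lda_predict(x, fitted):
    """Assign x to the class with the largest discriminant value."""
    classes, pis, mus, Sigma = fitted
    return classes[np.argmax(lda_discriminants(x, pis, mus, Sigma))]
```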

17 / 28

Page 18: Cross-Validation Methods

LDA Decision Rule

Assign x to class 1 if δ̂1(x) − δ̂2(x) > 0, or equivalently (verify)

$$(\hat{\mu}_1 - \hat{\mu}_2)^T \hat{\Sigma}^{-1} x > c,$$

where

$$c = \frac{1}{2}(\hat{\mu}_1 - \hat{\mu}_2)^T \hat{\Sigma}^{-1}(\hat{\mu}_1 + \hat{\mu}_2) + \log\left(\frac{\hat{\pi}_2}{\hat{\pi}_1}\right)$$

LDA decision boundary: {x : (µ̂1 − µ̂2)T Σ̂−1x = c} — linear

Estimated Bayes classifier under the assumption of normality with equal variance matrix

Does not share the optimality property of the Bayes classifier because the unknowns therein are replaced with their estimates. However, when the assumptions hold, it often tends to approximate the Bayes classifier quite well

If π1 = π2, the log(πk) term drops out from δk(x).

18 / 28

Page 19: Cross-Validation Methods



FIGURE 4.4. Left: Two one-dimensional normal density functions are shown. The dashed vertical line represents the Bayes decision boundary. Right: 20 observations were drawn from each of the two classes, and are shown as histograms. The Bayes decision boundary is again shown as a dashed vertical line. The solid vertical line represents the LDA decision boundary estimated from the training data.

X = x to the class for which (4.12) is largest. Taking the log of (4.12) and rearranging the terms, it is not hard to show that this is equivalent to assigning the observation to the class for which

$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k) \tag{4.13}$$

is largest. For instance, if K = 2 and π1 = π2, then the Bayes classifier assigns an observation to class 1 if 2x(µ1 − µ2) > µ1² − µ2², and to class 2 otherwise. In this case, the Bayes decision boundary corresponds to the point where

$$x = \frac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)} = \frac{\mu_1 + \mu_2}{2}. \tag{4.14}$$

An example is shown in the left-hand panel of Figure 4.4. The two normal density functions that are displayed, f1(x) and f2(x), represent two distinct classes. The mean and variance parameters for the two density functions are µ1 = −1.25, µ2 = 1.25, and σ1² = σ2² = 1. The two densities overlap,

and so given that X = x, there is some uncertainty about the class to which the observation belongs. If we assume that an observation is equally likely to come from either class—that is, π1 = π2 = 0.5—then by inspection of (4.14), we see that the Bayes classifier assigns the observation to class 1 if x < 0 and class 2 otherwise. Note that in this case, we can compute the Bayes classifier because we know that X is drawn from a Gaussian distribution within each class, and we know all of the parameters involved. In a real-life situation, we are not able to calculate the Bayes classifier.

In practice, even if we are quite certain of our assumption that X is drawn from a Gaussian distribution within each class, we still have to estimate the parameters µ1, . . . , µK, π1, . . . , πK, and σ². The linear discriminant analysis (LDA) method approximates the Bayes classifier by plugging estimates for πk, µk, and σ² into (4.13).

Source: ISL

19 / 28

Page 20: Cross-Validation Methods

Quadratic Discriminant Analysis (QDA), X|Y = k ∼ N(µk, Σk)

The unknowns πk and µk are estimated as in LDA, and Σ̂k = Sk. Plugging in gives the estimated discriminant function as:

$$\hat{\delta}_k(x) = -\frac{1}{2}\log(|\hat{\Sigma}_k|) - \frac{1}{2}x^T\hat{\Sigma}_k^{-1}x + \hat{\mu}_k^T\hat{\Sigma}_k^{-1}x - \frac{1}{2}\hat{\mu}_k^T\hat{\Sigma}_k^{-1}\hat{\mu}_k + \log(\hat{\pi}_k)$$

QDA decision rule: Assign x to class 1 if δ̂1(x) − δ̂2(x) > 0 or equivalently (verify)

$$-\frac{1}{2}x^T\left(\hat{\Sigma}_1^{-1} - \hat{\Sigma}_2^{-1}\right)x + \left(\hat{\mu}_1^T\hat{\Sigma}_1^{-1} - \hat{\mu}_2^T\hat{\Sigma}_2^{-1}\right)x > c,$$

where

$$c = \frac{1}{2}\log\left(\frac{|\hat{\Sigma}_1|}{|\hat{\Sigma}_2|}\right) + \frac{1}{2}\left(\hat{\mu}_1^T\hat{\Sigma}_1^{-1}\hat{\mu}_1 - \hat{\mu}_2^T\hat{\Sigma}_2^{-1}\hat{\mu}_2\right) + \log\left(\frac{\hat{\pi}_2}{\hat{\pi}_1}\right)$$
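For comparison with the earlier LDA sketch, here is an analogous (illustrative, NumPy-only) QDA sketch: the only change is that each class gets its own sample covariance matrix, and the discriminant picks up the log-determinant and quadratic terms shown above.

```python
import numpy as np

def qda_fit(X, y):
    """QDA estimates: pi_k and mu_k as in LDA, plus a class-specific
    covariance matrix Sigma_k = S_k (sample covariance within class k)."""
    classes = np.unique(y)
    pis = np.array([np.mean(y == k) for k in classes])
    mus = np.array([X[y == k].mean(axis=0) for k in classes])
    Sigmas = np.array([np.cov(X[y == k], rowvar=False) for k in classes])
    return classes, pis, mus, Sigmas

def qda_discriminant(x, pi_k, mu_k, Sigma_k):
    """delta_k(x) = -0.5 log|Sigma_k| - 0.5 x'Sigma_k^{-1}x
                    + mu_k'Sigma_k^{-1}x - 0.5 mu_k'Sigma_k^{-1}mu_k + log(pi_k)."""
    Sinv = np.linalg.inv(Sigma_k)
    _, logdet = np.linalg.slogdet(Sigma_k)
    return (-0.5 * logdet - 0.5 * x @ Sinv @ x
            + mu_k @ Sinv @ x - 0.5 * mu_k @ Sinv @ mu_k + np.log(pi_k))

def qda_predict(x, fitted):
    """Assign x to the class with the largest quadratic discriminant."""
    classes, pis, mus, Sigmas = fitted
    scores = [qda_discriminant(x, p, m, S) for p, m, S in zip(pis, mus, Sigmas)]
    return classes[int(np.argmax(scores))]
```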

20 / 28

Page 21: Cross-Validation Methods

Decision boundary: (quadratic)

$$\left\{x : -\frac{1}{2}x^T\left(\hat{\Sigma}_1^{-1} - \hat{\Sigma}_2^{-1}\right)x + \left(\hat{\mu}_1^T\hat{\Sigma}_1^{-1} - \hat{\mu}_2^T\hat{\Sigma}_2^{-1}\right)x = c\right\}$$

Estimated Bayes classifier under the assumption of normality with unequal variance matrix

Not optimal as the unknowns in the Bayes classifier are replaced by estimates. However, if the assumptions hold, it approximates the Bayes classifier quite well.

If π1 = π2, the log(πk) term drops out from δk(x)

Reduces to LDA when Σk is replaced by a common Σ

21 / 28

Page 22: Cross-Validation Methods

Discriminant Analysis for K > 2 Classes

The extension to the K > 2 case is straightforward keeping in mind that the Bayes classifier assigns x to the class for which pk(x), and hence δk(x), is largest. Just let the index k run from 1 to K, and replace the comparison of δ1(x) and δ2(x) with that of δk(x) and δl(x) for k ≠ l.

Decision boundary: A given point x will fall in class k if

$$p_k(x) \geq p_l(x)\ \text{for all}\ l \neq k \quad \equiv \quad p_k^*(x) = p_k(x) - \max_{l \neq k} p_l(x) \geq 0.$$

The boundary for class k is {x : p∗k(x) = 0}. Thus, to draw the decision boundary for a multiclass classifier, just draw the class boundaries {x : p∗k(x) = 0} for k = 1, . . . , K − 1.

Notes:

pk(x) can be replaced with δk(x) in the above.

In case of LDA with p = 2, the decision boundaries will be formed by intersecting lines.

In application, we will use the estimated δk(x).

22 / 28

Page 23: Cross-Validation Methods



FIGURE 4.6. An example with three classes. The observations from each class are drawn from a multivariate Gaussian distribution with p = 2, with a class-specific mean vector and a common covariance matrix. Left: Ellipses that contain 95% of the probability for each of the three classes are shown. The dashed lines are the Bayes decision boundaries. Right: 20 observations were generated from each class, and the corresponding LDA decision boundaries are indicated using solid black lines. The Bayes decision boundaries are once again shown as dashed lines.

shape. To indicate that a p-dimensional random variable X has a multivariate Gaussian distribution, we write X ∼ N(µ, Σ). Here E(X) = µ is the mean of X (a vector with p components), and Cov(X) = Σ is the p × p covariance matrix of X. Formally, the multivariate Gaussian density is defined as

$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu)\right). \tag{4.18}$$

In the case of p > 1 predictors, the LDA classifier assumes that the observations in the kth class are drawn from a multivariate Gaussian distribution N(µk, Σ), where µk is a class-specific mean vector, and Σ is a covariance matrix that is common to all K classes. Plugging the density function for the kth class, fk(X = x), into (4.10) and performing a little bit of algebra reveals that the Bayes classifier assigns an observation X = x to the class for which

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log \pi_k \tag{4.19}$$

is largest. This is the vector/matrix version of (4.13). An example is shown in the left-hand panel of Figure 4.6. Three equally-sized Gaussian classes are shown with class-specific mean vectors and a common covariance matrix. The three ellipses represent regions that contain 95% of the probability for each of the three classes. The dashed lines are the Bayes decision boundaries.

Source: ISL

23 / 28

Page 24: Cross-Validation Methods

LDA vs QDA

Just one difference: common covariance matrix or class-specific covariance matrices

Bias-variance tradeoff — Kp(p + 1)/2 cov parameters for QDA whereas only p(p + 1)/2 for LDA — both have Kp mean parameters

LDA is simple and is less flexible (and tends to have higher bias but lower variance) than QDA

In practice, QDA generally works better than LDA when the training set is large (so that the variance of the classifier is not a major concern) or if the equal variance assumption is clearly wrong

Sometimes QDA leads to rather odd decision boundaries(e.g., disjointed intervals in case of p = 1)

QDA is more sensitive than LDA to normality assumption

Proof is in the pudding — try both and compare!

24 / 28

Page 25: Cross-Validation Methods



FIGURE 4.9. Left: The Bayes (purple dashed), LDA (black dotted), and QDA (green solid) decision boundaries for a two-class problem with Σ1 = Σ2. The shading indicates the QDA decision rule. Since the Bayes decision boundary is linear, it is more accurately approximated by LDA than by QDA. Right: Details are as given in the left-hand panel, except that Σ1 ≠ Σ2. Since the Bayes decision boundary is non-linear, it is more accurately approximated by QDA than by LDA.

is some multiple of 1,275, which is a lot of parameters. By instead assuming that the K classes share a common covariance matrix, the LDA model becomes linear in x, which means there are Kp linear coefficients to estimate. Consequently, LDA is a much less flexible classifier than QDA, and so has substantially lower variance. This can potentially lead to improved prediction performance. But there is a trade-off: if LDA's assumption that the K classes share a common covariance matrix is badly off, then LDA can suffer from high bias. Roughly speaking, LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the K classes is clearly untenable.

Figure 4.9 illustrates the performances of LDA and QDA in two scenarios. In the left-hand panel, the two Gaussian classes have a common correlation of 0.7 between X1 and X2. As a result, the Bayes decision boundary is linear and is accurately approximated by the LDA decision boundary. The QDA decision boundary is inferior, because it suffers from higher variance without a corresponding decrease in bias. In contrast, the right-hand panel displays a situation in which the orange class has a correlation of 0.7 between the variables and the blue class has a correlation of −0.7. Now the Bayes decision boundary is quadratic, and so QDA more accurately approximates this boundary than does LDA.

Source: ISL

25 / 28

Page 26: Cross-Validation Methods

Default data

class ‘+’: default = yes, class ‘−’: default = no

Predictors: balance and student status — note: using LDA with a qualitative predictor (a common practice)

sensitivity = P(+|+) = P(correctly predict a defaulter)

specificity = P(−|−) = P(correctly predict a non-defaulter)


                               True default status
                                No       Yes     Total
Predicted        No          9,644       252     9,896
default status   Yes            23        81       104
                 Total       9,667       333    10,000

TABLE 4.4. A confusion matrix compares the LDA predictions to the true default statuses for the 10,000 training observations in the Default data set. Elements on the diagonal of the matrix represent individuals whose default statuses were correctly predicted, while off-diagonal elements represent individuals that were misclassified. LDA made incorrect predictions for 23 individuals who did not default and for 252 individuals who did default.

each individual will not default, regardless of his or her credit card balance and student status, will result in an error rate of 3.33%. In other words, the trivial null classifier will achieve an error rate that is only a bit higher than the LDA training set error rate.

In practice, a binary classifier such as this one can make two types of errors: it can incorrectly assign an individual who defaults to the no default category, or it can incorrectly assign an individual who does not default to the default category. It is often of interest to determine which of these two types of errors are being made. A confusion matrix, shown for the Default data in Table 4.4, is a convenient way to display this information. The table reveals that LDA predicted that a total of 104 people would default. Of these people, 81 actually defaulted and 23 did not. Hence only 23 out of 9,667 of the individuals who did not default were incorrectly labeled. This looks like a pretty low error rate! However, of the 333 individuals who defaulted, 252 (or 75.7%) were missed by LDA. So while the overall error rate is low, the error rate among individuals who defaulted is very high. From the perspective of a credit card company that is trying to identify high-risk individuals, an error rate of 252/333 = 75.7% among individuals who default may well be unacceptable.

Class-specific performance is also important in medicine and biology, where the terms sensitivity and specificity characterize the performance of a classifier or screening test. In this case the sensitivity is the percentage of true defaulters that are identified, a low 24.3% in this case. The specificity is the percentage of non-defaulters that are correctly identified, here (1 − 23/9,667) × 100 = 99.8%.

Why does LDA do such a poor job of classifying the customers who default? In other words, why does it have such a low sensitivity? As we have seen, LDA is trying to approximate the Bayes classifier, which has the lowest total error rate out of all classifiers (if the Gaussian model is correct). That is, the Bayes classifier will yield the smallest possible total number of misclassified observations, irrespective of which class the errors come from. That is, some misclassifications will result from incorrectly assigning

With p = 0.5: overall error rate = (23 + 252)/10000 = 2.8%, sensitivity = 81/333 = 24.3%, specificity = 9644/9667 = 99.8% — sensitivity too low
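A quick arithmetic check of these quantities from the confusion matrix in Table 4.4 (plain Python; the variable names are illustrative):

```python
# Confusion matrix from ISL Table 4.4: rows = predicted No/Yes, columns = true No/Yes.
conf = [[9644, 252],   # predicted No:  9,644 true No, 252 true Yes (missed defaulters)
        [  23,  81]]   # predicted Yes:    23 true No,  81 true Yes

tn, fn = conf[0]
fp, tp = conf[1]
n = tn + fn + fp + tp

overall_error = (fp + fn) / n      # (23 + 252) / 10000 = 0.0275
sensitivity = tp / (tp + fn)       # 81 / 333  ~ 0.243
specificity = tn / (tn + fp)       # 9,644 / 9,667 ~ 0.998
print(overall_error, sensitivity, specificity)
```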

Source: ISL

26 / 28

Page 27: Cross-Validation Methods


                               True default status
                                No       Yes     Total
Predicted        No          9,432       138     9,570
default status   Yes           235       195       430
                 Total       9,667       333    10,000

TABLE 4.5. A confusion matrix compares the LDA predictions to the true default statuses for the 10,000 training observations in the Default data set, using a modified threshold value that predicts default for any individuals whose posterior default probability exceeds 20%.

a customer who does not default to the default class, and others will result from incorrectly assigning a customer who defaults to the non-default class. In contrast, a credit card company might particularly wish to avoid incorrectly classifying an individual who will default, whereas incorrectly classifying an individual who will not default, though still to be avoided, is less problematic. We will now see that it is possible to modify LDA in order to develop a classifier that better meets the credit card company's needs.

The Bayes classifier works by assigning an observation to the class for which the posterior probability pk(X) is greatest. In the two-class case, this amounts to assigning an observation to the default class if

Pr(default = Yes|X = x) > 0.5. (4.21)

Thus, the Bayes classifier, and by extension LDA, uses a threshold of 50% for the posterior probability of default in order to assign an observation to the default class. However, if we are concerned about incorrectly predicting the default status for individuals who default, then we can consider lowering this threshold. For instance, we might label any customer with a posterior probability of default above 20% to the default class. In other words, instead of assigning an observation to the default class if (4.21) holds, we could instead assign an observation to this class if

Pr(default = Yes|X = x) > 0.2. (4.22)

The error rates that result from taking this approach are shown in Table 4.5. Now LDA predicts that 430 individuals will default. Of the 333 individuals who default, LDA correctly predicts all but 138, or 41.4%. This is a vast improvement over the error rate of 75.7% that resulted from using the threshold of 50%. However, this improvement comes at a cost: now 235 individuals who do not default are incorrectly classified. As a result, the overall error rate has increased slightly to 3.73%. But a credit card company may consider this slight increase in the total error rate to be a small price to pay for more accurate identification of individuals who do indeed default.

Figure 4.7 illustrates the trade-off that results from modifying the threshold value for the posterior probability of default. Various error rates are shown as a function of the threshold value.

With p = 0.2: overall error rate = (235 + 138)/10000 = 3.7%, sensitivity = 195/333 = 58.6%, specificity = 9432/9667 = 97.6% — may be more acceptable

Source: ISL

27 / 28

Page 28: Cross-Validation Methods

Takeaways

Discriminant analysis

Multivariate normality assumption of X, even when some predictors may be categorical

LDA vs QDA

LDA — linear decision boundary

QDA — quadratic decision boundary

If a non-linear decision boundary is called for, one may use non-linear transformations for X before applying discriminant analysis

Note: If X1, . . . , Xp have very different scales, then generally it's a good idea to standardize them before applying a learning method.
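A minimal sketch of such standardization (illustrative, NumPy-only). Computing the centering and scaling from the training data alone, and reusing the same transformation on new data, keeps validation or test observations out of the preprocessing:

```python
import numpy as np

def standardize(X_train, X_new):
    """Center and scale each predictor using training-set statistics only,
    then apply the same transformation to new data (e.g., a validation fold)."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0, ddof=1)
    return (X_train - mu) / sd, (X_new - mu) / sd
```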

28 / 28