
Statistics 202: Data Mining
Week 5

Based in part on slides from the textbook and slides of Susan Holmes

© Jonathan Taylor
October 29, 2012


Part I

Linear Discriminant Analysis


Discriminant analysis

Nearest centroid rule

Suppose we break our data matrix down by the labels, yielding submatrices $(X_j)_{1 \le j \le k}$ with sizes $\text{nrow}(X_j) = n_j$.

A simple rule for classification is:

Assign a new observation with features $x$ to

$$f(x) = \operatorname{argmin}_{1 \le j \le k} d(x, X_j).$$

What do we mean by distance here?


Discriminant analysis

Nearest centroid rule

If we can assign a central point or centroid $\mu_j$ to each $X_j$, then we can define the distance above as the distance to the centroid $\mu_j$.

This yields the nearest centroid rule

$$f(x) = \operatorname{argmin}_{1 \le j \le k} d(x, \mu_j).$$


Discriminant analysis

Nearest centroid rule

This rule is described completely by the functions

$$h_{ij}(x) = \frac{d(x, \mu_j)}{d(x, \mu_i)}$$

with $f(x)$ being any $i$ such that

$$h_{ij}(x) \ge 1 \quad \forall j.$$


Discriminant analysis

Nearest centroid rule

If $d(x, y) = \|x - y\|_2$, then the natural centroid is

$$\mu_j = \frac{1}{n_j} \sum_{l=1}^{n_j} X_{jl}.$$

This rule just classifies points to the nearest $\mu_j$.

But if the covariance matrix of our data is not $I$, this rule ignores structure in the data . . .
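As a concrete illustration, here is a minimal R sketch of the nearest centroid rule with Euclidean distance on the iris data; the two-feature choice and the helper name nearest_centroid are purely illustrative.

```r
## Nearest centroid rule on the iris data, Euclidean distance.
X <- as.matrix(iris[, c("Petal.Width", "Petal.Length")])
y <- iris$Species

## Centroid of each group: the mean of its rows.
centroids <- t(sapply(split(as.data.frame(X), y), colMeans))

## Classify a new point to the group with the nearest centroid.
nearest_centroid <- function(x) {
  d <- apply(centroids, 1, function(mu) sqrt(sum((x - mu)^2)))
  rownames(centroids)[which.min(d)]
}

nearest_centroid(c(1.5, 4.5))   # e.g. "versicolor"
```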


Why we should use covariance

[Figure: a two-dimensional data plot motivating the use of the covariance structure when measuring distance to a centroid.]


Discriminant analysis

Choice of distance

Often, there is some background model for our data that is equivalent to a given procedure.

For this nearest centroid rule, using the Euclidean distance effectively assumes that within each set of points $X_j$, the rows are multivariate Gaussian with covariance matrix proportional to $I$.

It also implicitly assumes that the $n_j$'s are roughly equal.

For instance, if one $n_j$ is very small, the fact that a data point is close to that $\mu_j$ doesn't necessarily mean we should conclude it has label $j$: there might be a huge number of points with label $i$ near that $\mu_j$.


Discriminant analysis

Gaussian discriminant functions

Suppose each group with label $j$ has its own mean $\mu_j$ and covariance matrix $\Sigma_j$, as well as proportion $\pi_j$.

The Gaussian discriminant functions are defined as

$$h_{ij}(x) = h_i(x) - h_j(x),$$
$$h_i(x) = \log \pi_i - \log|\Sigma_i|/2 - (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)/2.$$

The first term weights the prior probability; the last two terms are $\log \phi_{\mu_i, \Sigma_i}(x) = \log L(\mu_i, \Sigma_i \mid x)$ (up to an additive constant).

Ignoring the first two terms, the $h_i$'s are essentially within-group Mahalanobis distances . . .
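A small R sketch of evaluating one of these discriminant functions directly, assuming the group's mean, covariance, and proportion have already been estimated (the argument names are illustrative); stats::mahalanobis computes the squared Mahalanobis distance.

```r
## Gaussian discriminant function h_i(x) for one group, evaluated
## from that group's (estimated) mean mu, covariance Sigma and
## proportion pi_i.
h <- function(x, mu, Sigma, pi_i) {
  log(pi_i) - log(det(Sigma)) / 2 -
    mahalanobis(x, center = mu, cov = Sigma) / 2
}

## The classifier then assigns x to the label i with the largest h_i(x).
```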


Discriminant analysis

Gaussian discriminant functions

The classifier assigns $x$ the label $i$ if $h_i(x) \ge h_j(x)$ for all $j$. Or,

$$f(x) = \operatorname{argmax}_{1 \le i \le k} h_i(x).$$

This is equivalent to a Bayesian rule. We'll see more Bayesian rules when we talk about naïve Bayes . . .

When all $\Sigma_i$'s and $\pi_i$'s are identical, the classifier is just nearest centroid using Mahalanobis distance instead of Euclidean distance.


Discriminant analysis

Estimating discriminant functions

In practice, we will have to estimate $\pi_j, \mu_j, \Sigma_j$.

Obvious estimates:

$$\pi_j = \frac{n_j}{\sum_{l=1}^k n_l}, \qquad
\mu_j = \frac{1}{n_j} \sum_{l=1}^{n_j} X_{jl}, \qquad
\Sigma_j = \frac{1}{n_j - 1}(X_j - \mathbf{1}\mu_j^T)^T (X_j - \mathbf{1}\mu_j^T).$$
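In R these estimates are one line each; a minimal sketch on the iris data, with the two-feature choice again illustrative.

```r
## "Obvious" estimates of pi_j, mu_j and Sigma_j, computed per group.
X <- iris[, c("Petal.Width", "Petal.Length")]
groups <- split(X, iris$Species)

n_j   <- sapply(groups, nrow)
pi_j  <- n_j / sum(n_j)            # group proportions
mu_j  <- lapply(groups, colMeans)  # group centroids
Sig_j <- lapply(groups, cov)       # per-group covariances (divisor n_j - 1)
```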


Discriminant analysis

Estimating discriminant functions

If we assume that the covariance matrix is the same within groups, then we might also form the pooled estimate

$$\Sigma_P = \frac{\sum_{j=1}^k (n_j - 1)\,\Sigma_j}{\sum_{j=1}^k (n_j - 1)}.$$

If we use the pooled estimate $\Sigma_j = \Sigma_P$ and plug it into the Gaussian discriminants, the functions $h_{ij}(x)$ are linear (or affine) functions of $x$.

This is called Linear Discriminant Analysis (LDA).

Not to be confused with the other LDA (Latent Dirichlet Allocation) . . .
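A minimal R sketch fitting LDA on two of the iris features (MASS::lda pools the within-group covariance by default; the feature choice matches the plot that follows):

```r
library(MASS)

fit_lda <- lda(Species ~ Petal.Width + Petal.Length, data = iris)
pred    <- predict(fit_lda, iris)$class

table(Predicted = pred, Actual = iris$Species)  # resubstitution confusion matrix
```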


Linear Discriminant Analysis using (petal.width, petal.length)

[Figure: LDA classification regions for the iris data in the (petal.width, petal.length) plane.]


Linear Discriminant Analysis using (sepal.width, sepal.length)

[Figure: LDA classification regions for the iris data in the (sepal.width, sepal.length) plane.]


Discriminant analysis

Quadratic Discriminant Analysis

If we use don’t use pooled estimate Σj = Σj and plugthese into the Gaussian discrimants, the functions hij(xxx)are quadratic functions of xxx .

This is called Quadratic Discriminant Analysis (QDA).
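The corresponding R sketch, with MASS::qda estimating a separate covariance matrix per group (feature choice again illustrative):

```r
library(MASS)

fit_qda <- qda(Species ~ Petal.Width + Petal.Length, data = iris)
table(Predicted = predict(fit_qda, iris)$class, Actual = iris$Species)
```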


Quadratic Discriminant Analysis using (petal.width, petal.length)

[Figure: QDA classification regions for the iris data in the (petal.width, petal.length) plane.]


Quadratic Discriminant Analysis using (sepal.width, sepal.length)

[Figure: QDA classification regions for the iris data in the (sepal.width, sepal.length) plane.]


Motivation for Fisher’s rule

[Figure: two-dimensional data plot motivating Fisher's rule.]


Discriminant analysis

Fisher’s discriminant function

Fisher proposed to classify using a linear rule.

He first decomposed

$$(X - \mathbf{1}\mu^T)^T (X - \mathbf{1}\mu^T) = SS_B + SS_W.$$

Then, he proposed

$$v = \operatorname{argmax}_{v:\, v^T SS_W v = 1} v^T SS_B v.$$

Having found $v$, form $X_j v$ and the centroid $\eta_j = \text{mean}(X_j v)$.

In the two-class problem ($k = 2$), this is the same as LDA.


Discriminant analysis

Fisher’s discriminant functions

The direction $v_1$ is an eigenvector of some matrix. There are others, up to $k - 2$ more.

Suppose we find all $k - 1$ vectors and form $X_j V^T$, each one an $n_j \times (k - 1)$ matrix with centroid $\eta_j \in \mathbb{R}^{k-1}$.

The matrix $V$ determines a map from $\mathbb{R}^p$ to $\mathbb{R}^{k-1}$, so given a new data point we can compute $Vx \in \mathbb{R}^{k-1}$.

This gives rise to a classifier

$$f(x) = \text{nearest centroid}(Vx).$$

This is LDA (assuming $\pi_j = \frac{1}{k}$) . . .
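A rough R sketch of this construction on the iris data: build the within- and between-group scatter matrices and take leading eigenvectors of $SS_W^{-1} SS_B$ as the discriminant directions. This only illustrates the idea; it is not how MASS::lda is implemented.

```r
X <- as.matrix(iris[, 1:4])
y <- iris$Species
mu_all <- colMeans(X)

p <- ncol(X)
SSW <- matrix(0, p, p)   # within-group scatter
SSB <- matrix(0, p, p)   # between-group scatter
for (g in levels(y)) {
  Xg   <- X[y == g, , drop = FALSE]
  mu_g <- colMeans(Xg)
  SSW  <- SSW + crossprod(sweep(Xg, 2, mu_g))
  SSB  <- SSB + nrow(Xg) * outer(mu_g - mu_all, mu_g - mu_all)
}

## Discriminant directions: leading eigenvectors of SSW^{-1} SSB
## (at most k - 1 of them; Re() guards against tiny imaginary parts).
V <- Re(eigen(solve(SSW) %*% SSB)$vectors[, 1:2])
scores <- X %*% V        # data projected onto the discriminant directions
```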


Discriminant analysis

Discriminant models in general

A discriminant model is generally a model that estimates

$$P(Y = j \mid x), \quad 1 \le j \le k.$$

That is, given that the features I observe are $x$, the probability I think the label is $j$ . . .

LDA and QDA are actually generative models, since they specify

$$P(X = x \mid Y = j).$$

There are lots of discriminant models . . .


Discriminant analysis

Logistic regression

The logistic regression model is ubiquitous in binary classification (two-class) problems.

Model:

$$P(Y = 1 \mid x) = \frac{e^{x^T\beta}}{1 + e^{x^T\beta}} = \pi(\beta, x).$$


Discriminant analysis

Logistic regression

Software that fits a logistic regression model produces an estimate of $\beta$ based on a data matrix $X_{n \times p}$ and binary labels $Y_{n \times 1} \in \{0, 1\}^n$.

It fits the model by minimizing what we call the deviance

$$DEV(\beta) = -2 \sum_{i=1}^n \bigl( Y_i \log \pi(\beta, X_i) + (1 - Y_i)\log(1 - \pi(\beta, X_i)) \bigr).$$

While not immediately obvious, this is a convex minimization problem, hence fairly easy to solve.

Unlike trees, the convexity yields a globally optimal solution.
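In R, glm with family = binomial fits this model. A sketch distinguishing virginica from the other two iris species on the sepal features (one of the class splits shown in the plots that follow), with the deviance recomputed by hand:

```r
y   <- as.integer(iris$Species == "virginica")
fit <- glm(y ~ Sepal.Width + Sepal.Length, data = iris, family = binomial)

p <- fitted(fit)                                 # pi(beta, x_i) for each row
-2 * sum(y * log(p) + (1 - y) * log(1 - p))      # matches deviance(fit)
```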


Logistic regression, setosa vs virginica, versicolor, using (sepal.width, sepal.length)

[Figure: fitted logistic classification of setosa vs the other two species in the (sepal.width, sepal.length) plane.]


Logistic regression, virginica vs setosa, versicolor, using (sepal.width, sepal.length)

[Figure: fitted logistic classification of virginica vs the other two species in the (sepal.width, sepal.length) plane.]


Logistic regression, versicolor vs setosa, virginica, using (sepal.width, sepal.length)

[Figure: fitted logistic classification of versicolor vs the other two species in the (sepal.width, sepal.length) plane.]


Discriminant analysis

Logistic regression

Logistic regression produces an estimate of

$$P(Y = 1 \mid x) = \pi(x).$$

Typically, we classify as 1 if $\pi(x) > 0.5$.

This yields a 2 × 2 confusion matrix:

              Predicted: 0   Predicted: 1
  Actual: 0        TN             FP
  Actual: 1        FN             TP

From the 2 × 2 confusion matrix, we can compute Sensitivity, Specificity, etc.
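Continuing the glm sketch above, threshold the fitted probabilities at 0.5 and tabulate; the explicit factor levels just guarantee a full 2 × 2 table.

```r
pred <- factor(as.integer(fitted(fit) > 0.5), levels = c(0, 1))
tab  <- table(Actual = factor(y, levels = c(0, 1)), Predicted = pred)
tab

TN <- tab["0", "0"]; FP <- tab["0", "1"]
FN <- tab["1", "0"]; TP <- tab["1", "1"]
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate
```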


Discriminant analysis

Logistic regression

However, we could choose the threshold differently, perhaps related to estimates of the prior probabilities of 0's and 1's.

Now, each threshold $0 \le t \le 1$ yields a new 2 × 2 confusion matrix:

              Predicted: 0   Predicted: 1
  Actual: 0       TN(t)          FP(t)
  Actual: 1       FN(t)          TP(t)


Discriminant analysis

ROC curve

Generally speaking, we prefer classifiers that are both highly sensitive and highly specific.

These confusion matrices can be summarized using an ROC (Receiver Operating Characteristic) curve.

This is a plot of the curve

$$\bigl(1 - \text{Specificity}(t),\ \text{Sensitivity}(t)\bigr)_{0 \le t \le 1}.$$

Often, Specificity is referred to as TNR (True Negative Rate), and $1 - \text{Specificity}(t)$ as FPR.

Often, Sensitivity is referred to as TPR (True Positive Rate), and $1 - \text{Sensitivity}(t)$ as FNR.
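An R sketch of tracing this curve for the glm fit above, sweeping the threshold over a grid (the grid spacing is arbitrary):

```r
p  <- fitted(fit)
ts <- seq(0, 1, by = 0.01)

roc <- t(sapply(ts, function(t) {
  pred <- as.integer(p > t)
  c(FPR = mean(pred[y == 0] == 1),   # 1 - specificity(t)
    TPR = mean(pred[y == 1] == 1))   # sensitivity(t)
}))

plot(roc[, "FPR"], roc[, "TPR"], type = "l",
     xlab = "False Positive Rate", ylab = "True Positive Rate")
abline(0, 1, lty = 2)                # random guessing
```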


AUC: Area under ROC curve

[Figure: ROC curve (True Positive Rate vs False Positive Rate) with the random-guessing diagonal; the point at threshold t = 0.5 is marked.]


Discriminant analysis

ROC curve

Points in the upper left of the ROC curve are good.

Any point on the diagonal line represents a classifier that is guessing randomly.

A tree classifier, as we've discussed it, seems to correspond to only one point in the ROC plot.

But one can estimate probabilities based on frequencies in terminal nodes.


Discriminant analysis

ROC curve

A common numeric summary of the ROC curve is

AUC(ROC curve) = area under the ROC curve.

It can be interpreted as an estimate of the probability that the classifier will give a random positive instance a higher score than a random negative instance.

The maximum value is 1.

For a random guesser, the AUC is 0.5.
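That probabilistic interpretation gives a direct way to compute the AUC for the glm fit above; a sketch in which ties between scores count as one half:

```r
pos <- fitted(fit)[y == 1]   # scores of actual positives
neg <- fitted(fit)[y == 0]   # scores of actual negatives

## P(random positive scores higher than random negative), ties = 1/2
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc                          # 1 for a perfect ranking, 0.5 for random guessing
```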


AUC: Area under ROC curve

[Figure: ROC curves (legend: Logistic-Random, Random), plotted as True Positive Rate vs False Positive Rate.]


ROC curve: logistic, rpart, lda


Part II

Midterm review


Midterm review

Overview

General goals of data mining.

Datatypes.

Preprocessing & dimension reduction.

Distances.

Multidimensional scaling.

Multidimensional arrays.

Decision trees.

Performance measures for classifiers.

Discriminant analysis.


Midterm review

General goals

Definition: what is and isn’t data mining.

Different types of problems:

Unsupervised problems.
Supervised problems.
Semi-supervised problems.


Midterm review

Datatypes

Continuous, discrete, etc.

How data is represented in R.

Descriptive statistics for different datatypes.

General characteristics:

Is it observational or experimental?
Is it very noisy?
Is there spatial or temporal structure?

Preprocessing

General tasks: aggregation, transformation, discretization.

Feature extraction: wavelet / FFT transform.

Another type of feature extraction: dimension reduction.


Midterm review

Dimension reduction

PCA as a dimension reduction tool.

PCA in terms of the SVD.

PCA loadings, scores.

Graphical summaries

Various plots of univariate continuous data:

stem-leaf;
histogram;
density (using kernel estimate);
ECDF;
quantile.
Boxplot.
Pairs plot.
Correlation / similarity matrix (cases or features).


Midterm review

Multidimensional scaling

Relation between distances and similarities.

Graphical (Euclidean) representation of the data that is “closest” to the original dissimilarity.

Relation to PCA when similarity is Euclidean.

Multidimensional arrays

Data cubes.

Standard types of operations on data cubes.

A few examples in R.


Midterm review

Decision trees

Definition of a classifier.

Specific form of a decision tree classifier.

Applying the decision tree model.

Fitting a decision tree using Hunt’s algorithm.

Various measures of impurity: Gini, entropy, misclassification rate.

Pre- and post-pruning to simplify the tree.


Midterm review

Performance measures

Sensitivity, specificity, true/false negatives, true/false positives.

Confusion matrix.

Using a cost matrix to evaluate a classifier.

Discriminant analysis

Linear / Quadratic Discriminant Analysis;

Logistic regression.

ROC curve.
