SML Week 4-6: Linear Methods for Classification


Page 1: SML Week 4-6: Linear Methods for Classification


SML Week 4-6: Linear Methods for Classification

C. Andy Tsao

Dept. of Applied Math, National Dong Hwa University

March 28, 2019

Page 2: SML Week 4-6: Linear Methods for Classification

Outline

1 Refine LSE

2 Introduction

3 Regression and Discriminant Analysis

4 Linear Discrimination Analysis

5 Logistic Regression

6 Recap

7 Separating hyperplane

8 Digression on Iterative Methods for finding roots

Reference: §3.4, Chapter 4 of HTF

Page 3: SML Week 4-6: Linear Methods for Classification

Why is LSE not satisfactory?

Prediction accuracy can often be improved by shrinking or zeroing some coefficients.

A large number of predictors helps neither interpretation nor computation.

Page 4: SML Week 4-6: Linear Methods for Classification

Approaches to var. selection and coef. shrinkage

Subset selection: best, forward (stepwise), backward (stepwise)

Shrinkage methods

1 Bridge regression includes the lasso ($q = 1$) and ridge regression ($q = 2$), in which $\hat\beta$ minimizes $L(\beta) = \|Y - \beta_0 - X\beta\|^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q$, where $\lambda, q > 0$.

2 For $q = 2$, $\hat\beta^{\mathrm{ridge}}$ equivalently minimizes $\|Y - \beta_0 - X\beta\|^2$ subject to $\sum_{j=1}^{p} |\beta_j|^2 \le s$; there is a one-to-one correspondence between $\lambda$ and $s$.

3 Standardization: $x'_{ij} = (x_{ij} - \bar{x}_{\cdot j})/s_j$, where $\bar{x}_{\cdot j} = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$ and $s_j^2 = \sum_{i=1}^{n}(x_{ij} - \bar{x}_{\cdot j})^2/(n-1)$. Then $\hat\beta_0 = \bar{y}$.
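As an illustration of the ridge case ($q = 2$), here is a minimal R sketch of the closed-form ridge solution on standardized predictors; the simulated data and the function name ridge_fit are purely hypothetical.

    ## Ridge regression in closed form: beta = (X'X + lambda I)^{-1} X'(y - ybar),
    ## computed on standardized predictors, with beta0 = ybar.
    ridge_fit <- function(X, y, lambda) {
      Xs <- scale(X)                    # (x_ij - xbar_.j)/s_j, with s_j using n - 1
      p  <- ncol(Xs)
      beta <- solve(t(Xs) %*% Xs + lambda * diag(p), t(Xs) %*% (y - mean(y)))
      list(beta0 = mean(y), beta = drop(beta))
    }

    set.seed(1)
    X <- matrix(rnorm(100 * 5), 100, 5)
    y <- drop(X %*% c(2, 0, 0, 1, 0)) + rnorm(100)
    ridge_fit(X, y, lambda = 10)        # shrinks the coefficients toward zero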

Page 5: SML Week 4-6: Linear Methods for Classification

Other Methods

Methods using derived inputs: Principal Components Regression (PCR), Partial Least Squares (PLS)

Bayesian shrinkage/variable selection

Page 6: SML Week 4-6: Linear Methods for Classification

Discrete Y classification

Closer look at the task of classification: find F : X → Y.

Linear methods: linear decision boundaries.

There is nothing completely new under the sun.

Naive approach: regression. What's wrong and what can be right?

Hypothesis Testing

Page 8: SML Week 4-6: Linear Methods for Classification

Regression for classification

Naive approach: regression. Y = X β + ε.

What’s wrong and what can be right?

Point estimation: $\hat{Y} = X(X'X)^{-1}X'Y$.

Inference on Y : Confidence Interval for Y ?

Gauss-Markov Condition

Normality assumption

Is it a linear method?

Page 12: SML Week 4-6: Linear Methods for Classification

Testing Hypotheses as a classification problem

Recall the hypothesis testing problem: formulation, criteria/guarantee, (recommended) procedure, supplemental measures.

Two classes ≈ two subspaces

Matching criteria: TE/GE vs power function

Accuracy estimation approach (cf. Hwang, Casella, Robert, Wells and Farrell (1992, Annals of Statistics)).

Is it a linear method?

Page 13: SML Week 4-6: Linear Methods for Classification

Logistic Regression

For classifying K classes: estimate $P[G = k \mid X = x]$.

For two classes, $\mathcal{G} = \{1, 2\}$:

$P(G = 1 \mid X = x) = \frac{\exp(\beta_0 + x'\beta)}{1 + \exp(\beta_0 + x'\beta)}$, $\quad P(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta_0 + x'\beta)}$.

$g(E(Y \mid x)) = \beta_0 + x'\beta$, where $g(p) = \log[p/(1-p)]$ is the logit transformation.

In the ordinary (Gaussian) linear model case of the GLM, Y is normally distributed and g is the identity transformation.

Strength and weakness
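A minimal R sketch of fitting this two-class model with glm(); the simulated data frame dat and its variables are hypothetical.

    ## Two-class logistic regression: logit P(Y = 1 | x) = beta0 + x'beta
    set.seed(1)
    dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
    dat$y <- rbinom(200, 1, plogis(-0.5 + 1.5 * dat$x1 - dat$x2))

    fit  <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = dat)
    summary(fit)                              # estimated beta0, beta and Wald tests
    phat <- predict(fit, type = "response")   # estimated P(Y = 1 | X = x)
    ghat <- ifelse(phat > 0.5, 1, 2)          # classify to the more probable class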

Page 15: SML Week 4-6: Linear Methods for Classification

Linear Regression of an Indicator Matrix

For classifying K classes, code the response via indicator variables.

If $\#\mathcal{G} = K$, define $Y_k = 1_{[G = k]}$ and $\mathbf{Y} = (Y_1, \cdots, Y_K)$; $\mathbf{Y}$ is an $N \times K$ indicator response matrix.

$\hat{\mathbf{Y}} = X(X'X)^{-1}X'\mathbf{Y}$, an $N \times K$ matrix.

$\hat{B} = (X'X)^{-1}X'\mathbf{Y}$, a $(p+1) \times K$ matrix.

A new observation with input x is classified by computing $\hat{f}(x) = (1, x)\hat{B} = (\hat{f}_1(x), \cdots, \hat{f}_K(x))$, a $1 \times K$ row vector, and $\hat{G}(x) = \arg\max_{k \in \mathcal{G}} \hat{f}_k(x)$.

Strength and weakness

Work out special cases when K = 2, 3.
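A minimal R sketch of this indicator-matrix regression; the simulated X, g and the classification step are only illustrative.

    ## Linear regression of an N x K indicator response matrix
    set.seed(1)
    N <- 150; K <- 3
    X <- matrix(rnorm(N * 2), N, 2)
    g <- factor(sample(1:K, N, replace = TRUE))

    Y  <- model.matrix(~ g - 1)               # N x K indicator response matrix
    X1 <- cbind(1, X)                         # prepend the intercept column
    B  <- solve(t(X1) %*% X1, t(X1) %*% Y)    # (p+1) x K coefficient matrix
    Fhat <- X1 %*% B                          # fitted f_k(x), an N x K matrix
    Ghat <- levels(g)[max.col(Fhat)]          # classify each row to argmax_k f_k(x)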

Page 16: SML Week 4-6: Linear Methods for Classification

P(G = k | X = x) (E(G | x))

Let $\pi_k$ be the prior probability of class k, with $\sum_k \pi_k = 1$.

By Bayes' theorem, $P(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{j=1}^{K} f_j(x)\,\pi_j}$, where $f_k$ is the class-conditional density of X in class $G = k$.

Various techniques are based on models for the $f_k$:

Gaussian densities: linear and quadratic discriminant analysis

Mixtures of Gaussians, nonparametric density estimates

Naive Bayes: a variant of the previous models assuming conditional independence among the explanatory variables
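A small R sketch of this posterior formula for one-dimensional Gaussian class densities with a common variance; the means, standard deviation and priors below are made-up numbers.

    ## Posterior P(G = k | X = x) = f_k(x) * pi_k / sum_j f_j(x) * pi_j
    mu    <- c(-1, 0, 2)        # hypothetical class means
    sigma <- 1                  # common standard deviation
    prior <- c(0.3, 0.5, 0.2)   # pi_k, summing to 1

    posterior <- function(x) {
      fx <- dnorm(x, mean = mu, sd = sigma)   # class densities f_k(x)
      fx * prior / sum(fx * prior)
    }
    posterior(0.5)              # posterior probabilities of the three classes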

Page 17: SML Week 4-6: Linear Methods for Classification

Gaussian class densities: LDA

$f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu_k)'\Sigma_k^{-1}(x - \mu_k)\right)$

Linear Discriminant Analysis (LDA): $\Sigma_k = \Sigma$ for all k.

$\log \frac{P(G = k \mid X = x)}{P(G = l \mid X = x)} = \log \frac{f_k(x)}{f_l(x)} + \log \frac{\pi_k}{\pi_l} = \log \frac{\pi_k}{\pi_l} - \tfrac{1}{2}(\mu_k + \mu_l)'\Sigma^{-1}(\mu_k - \mu_l) + x'\Sigma^{-1}(\mu_k - \mu_l)$

Linear discriminant functions: $\delta_k(x) = x'\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k'\Sigma^{-1}\mu_k + \log\pi_k$, with $\hat{G}(x) = \arg\max_k \delta_k(x)$.

Plug-in estimates: $\hat{\pi}_k = N_k/N$, where $N_k$ is the number of class-k observations; $\hat{\mu}_k = \sum_{g_i = k} x_i / N_k$; $\hat{\Sigma} = \sum_{k}\sum_{g_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)'/(N - K)$.
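A minimal R sketch computing these discriminant functions from the plug-in estimates for two simulated classes; MASS::lda(X, grouping = g) would fit the same model, and all data and names here are illustrative only.

    ## LDA by hand: delta_k(x) = x' Sigma^{-1} mu_k - 0.5 mu_k' Sigma^{-1} mu_k + log(pi_k)
    set.seed(1)
    N1 <- 60; N2 <- 40; p <- 2
    X <- rbind(matrix(rnorm(N1 * p, mean = 0),   N1, p),
               matrix(rnorm(N2 * p, mean = 1.5), N2, p))
    g <- rep(1:2, c(N1, N2)); N <- N1 + N2; K <- 2

    pi_hat <- as.numeric(table(g)) / N
    mu_hat <- rbind(colMeans(X[g == 1, ]), colMeans(X[g == 2, ]))
    S_hat  <- (crossprod(sweep(X[g == 1, ], 2, mu_hat[1, ])) +
               crossprod(sweep(X[g == 2, ], 2, mu_hat[2, ]))) / (N - K)

    delta <- function(x, k) {
      drop(x %*% solve(S_hat, mu_hat[k, ])) -
        0.5 * drop(mu_hat[k, ] %*% solve(S_hat, mu_hat[k, ])) + log(pi_hat[k])
    }
    x0 <- c(1, 1)
    which.max(c(delta(x0, 1), delta(x0, 2)))    # class assigned to x0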

Page 18: SML Week 4-6: Linear Methods for Classification

LDA vs. LSE

For K = 2, the LDA classification rule is

x ′Σ−1(µ2 − µ1) >1

2µ2′Σ−1µ2 −

1

2µ1′Σ−1µ1

+ log(N1/N)− log(N2/N)

Page 19: SML Week 4-6: Linear Methods for Classification

Quadratic Discriminant Analysis

QDA

Unequal $\Sigma_k$:

$\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)'\Sigma_k^{-1}(x - \mu_k) + \log\pi_k$
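A brief R sketch contrasting QDA with LDA, assuming the MASS package; the simulated data, in which class 2 has a different covariance, are placeholders.

    ## LDA (common Sigma) vs. QDA (class-specific Sigma_k) with MASS
    library(MASS)
    set.seed(1)
    X <- rbind(mvrnorm(60, c(0, 0), diag(2)),
               mvrnorm(60, c(2, 0), matrix(c(1, 0.8, 0.8, 2), 2)))
    g <- factor(rep(1:2, each = 60))

    fit_lda <- lda(X, grouping = g)     # pools a single covariance estimate
    fit_qda <- qda(X, grouping = g)     # estimates a separate Sigma_k per class
    predict(fit_qda, newdata = X)$class[1:5]    # predicted classes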

Page 20: SML Week 4-6: Linear Methods for Classification

Interpretation of FDA

For easy exposition, let K = 2

For LDA, assuming $\pi_2 = \pi_1$, the decision rule is $\hat{G}(x) = \arg\max_k f_k(x)$.

Note: $-2\log f_k(x) = (x - \mu_k)'\Sigma^{-1}(x - \mu_k) + \text{constant}$, so maximizing $f_k(x)$ classifies x to the class whose centroid $\mu_k$ is nearest in Mahalanobis distance.

Connection with logistic regression

Page 21: SML Week 4-6: Linear Methods for Classification

Model and Estimation

$\log \frac{P(G = 1 \mid X = x)}{P(G = K \mid X = x)} = \beta_{1,0} + \beta_1' x$

$\log \frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \beta_{k,0} + \beta_k' x$, for $k = 1, \cdots, K - 1$.

That is,

$P(G = k \mid X = x) = \frac{\exp(\beta_{k,0} + \beta_k' x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l,0} + \beta_l' x)}$, $k = 1, \cdots, K - 1$,

$P(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l,0} + \beta_l' x)}$.

Newton-Raphson Method (R)

A large number of predictors demands heavy computation, and the iteration might not converge at all.
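A minimal sketch in R of the Newton–Raphson iteration for the two-class case (equivalently, iteratively reweighted least squares); glm() with family = binomial performs essentially this fit, and the simulated data are placeholders.

    ## Newton-Raphson / IRLS for binary logistic regression
    set.seed(1)
    n <- 200
    X <- cbind(1, rnorm(n), rnorm(n))          # design matrix with intercept column
    y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1, -1)))

    beta <- rep(0, ncol(X))
    for (it in 1:25) {
      p <- drop(plogis(X %*% beta))            # current P(Y = 1 | x)
      W <- p * (1 - p)                         # weights on the diagonal of W
      score   <- t(X) %*% (y - p)              # gradient of the log-likelihood
      hessian <- t(X) %*% (X * W)              # X'WX (negative Hessian)
      step <- drop(solve(hessian, score))      # Newton step
      beta <- beta + step
      if (max(abs(step)) < 1e-8) break         # stop when the step is negligible
    }
    beta    # should agree with coef(glm(y ~ X[, -1], family = binomial))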

Page 22: SML Week 4-6: Linear Methods for Classification

Logistic Regression or LDA?

For logistic regression, by construction,

$\log \frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \beta_{k0} + \beta_k' x.$

Recall that in LDA,

$\log \frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \log \frac{\pi_k}{\pi_K} - \tfrac{1}{2}(\mu_k + \mu_K)'\Sigma^{-1}(\mu_k - \mu_K) + x'\Sigma^{-1}(\mu_k - \mu_K) = \alpha_{k0} + \alpha_k' x.$

Page 23: SML Week 4-6: Linear Methods for Classification

Compare and contrast

Both share the same linear form for the log conditional probability ratio $\log\left[P(G = k \mid X = x)/P(G = K \mid X = x)\right]$.

They differ in how the unknown parameters are estimated: logistic regression maximizes the conditional likelihood of G given X, whereas LDA maximizes the full (joint) likelihood based on $P(X, G = k) = f_k(x)\,\pi_k$.

Page 24: SML Week 4-6: Linear Methods for Classification

Variables: Continuous or Categorical

R summary of data set (data.frame)

Continuous or categorical variable, a practical understanding

Continuous: detailed information; effective yet sensitive modelling

Categorical: rough information; complex (when the number of levels is large) yet robust modelling
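A short R illustration of inspecting variable types and declaring a categorical variable as a factor; iris is a built-in data set and the small data frame is made up.

    ## Continuous vs. categorical variables in a data.frame
    str(iris)         # four numeric (continuous) variables plus the factor Species
    summary(iris)     # quartiles for continuous variables, level counts for factors

    dat <- data.frame(y = rnorm(6), grp = c(1, 1, 2, 2, 3, 3))
    dat$grp <- factor(dat$grp)    # treat the numeric code as categorical
    summary(dat)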

Page 25: SML Week 4-6: Linear Methods for Classification

Hyperplane

A hyperplane or affine set L is defined by the equation $f(x) = \beta_0 + \beta' x = 0$. For $\mathbb{R}^2$ this is a line. Perceptrons (Rosenblatt, 1958).

For any x1, x2 ∈ L, β′(x1 − x2) = 0.

β∗ = β/||β|| is the (unit) normal vector of L

For any x0 in L, β′x0 = −β0

The signed distance of any x to L is

$\beta^{*\prime}(x - x_0) = \frac{1}{\|\beta\|}(\beta' x + \beta_0) = \frac{f(x)}{\|f'(x)\|}$

Figure 4.14.

Thus f(x) is proportional to the signed distance from x to the hyperplane f(x) = 0.
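A few lines of R illustrating the signed-distance formula for an arbitrary made-up hyperplane in the plane.

    ## Signed distance from x to the hyperplane f(x) = beta0 + beta'x = 0
    beta0 <- -1
    beta  <- c(2, 1)
    f <- function(x) beta0 + sum(beta * x)
    signed_dist <- function(x) f(x) / sqrt(sum(beta^2))   # f(x) / ||beta||

    signed_dist(c(1, 1))    # positive: x is on the side the normal beta points toward
    signed_dist(c(0, 0))    # negative: x is on the other side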

Page 26: SML Week 4-6: Linear Methods for Classification

Bisection Method

Given: f(x) is continuous on $[a_0, b_0]$ and $f(a_0)f(b_0) < 0$.

For $n = 0, 1, 2, \dots$ until satisfied, do:

Set $m = (a_n + b_n)/2$

If $f(a_n)f(m) < 0$, set $a_{n+1} = a_n$, $b_{n+1} = m$

Otherwise set $a_{n+1} = m$, $b_{n+1} = b_n$

Then f(x) has a root in $[a_{n+1}, b_{n+1}]$.

Conte and de Boor (1980). Elementary Numerical Analysis: An Algorithmic Approach, 3rd edition.
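A direct R implementation of the iteration above; the example equation cos(x) = x is just an illustration.

    ## Bisection: f continuous on [a, b] with f(a) * f(b) < 0
    bisect <- function(f, a, b, tol = 1e-8, maxit = 100) {
      stopifnot(f(a) * f(b) < 0)
      for (n in 1:maxit) {
        m <- (a + b) / 2
        if (f(a) * f(m) < 0) b <- m else a <- m   # keep the sign-change interval
        if ((b - a) / 2 < tol) break
      }
      (a + b) / 2          # midpoint of the final bracketing interval
    }

    bisect(function(x) cos(x) - x, 0, 1)   # root of cos(x) = x, roughly 0.739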

Page 27: SML Week 4-6: Linear Methods for Classification

Perceptron Learning Algorithm–1

Goal: Minimize the distance of misclassified points to the decisionboundary. Consider the binary problem when Y = ±1.

$\min_{\beta, \beta_0} D(\beta, \beta_0) = -\sum_{i \in \mathcal{M}} y_i(\beta_0 + x_i'\beta) \qquad (4.37)$

where $\mathcal{M}$ indexes the misclassified observations. $D \ge 0$, and D is proportional to the sum of the distances of the misclassified points to the decision boundary defined by $\beta_0 + x'\beta = 0$. Assuming $\mathcal{M}$ is fixed, the gradient is

$\frac{\partial D(\beta, \beta_0)}{\partial \beta} = -\sum_{i \in \mathcal{M}} y_i x_i, \qquad \frac{\partial D(\beta, \beta_0)}{\partial \beta_0} = -\sum_{i \in \mathcal{M}} y_i$

Page 28: SML Week 4-6: Linear Methods for Classification

Perceptron Learning Algorithm–2

A step is taken after each misclassified observation rather than after summing over all of them; this is stochastic gradient descent applied to a piecewise-linear criterion:

$\beta \leftarrow \beta + \rho\, y_i x_i$

$\beta_0 \leftarrow \beta_0 + \rho\, y_i$

where $\rho$ is the learning rate; WLOG take $\rho = 1$.

When the cases are linearly separable, this algorithm will converge to a separating hyperplane in a finite number of steps.

Remark: Contrast to population version
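A minimal R sketch of these updates with ρ = 1, cycling through misclassified points on simulated, linearly separable data; all names and data are illustrative.

    ## Perceptron learning: update (beta, beta0) using one misclassified point at a time
    set.seed(1)
    N <- 100
    X <- matrix(rnorm(N * 2), N, 2)
    y <- ifelse(X[, 1] + X[, 2] > 0, 1, -1)
    X <- X + y %o% c(0.5, 0.5)               # push the classes apart: separable with a margin

    beta <- c(0, 0); beta0 <- 0; rho <- 1
    for (step in 1:10000) {
      mis <- which(y * (beta0 + X %*% beta) <= 0)   # misclassified set M
      if (length(mis) == 0) break                   # a separating hyperplane has been found
      i <- mis[1]
      beta  <- beta  + rho * y[i] * X[i, ]          # beta  <- beta  + rho * y_i * x_i
      beta0 <- beta0 + rho * y[i]                   # beta0 <- beta0 + rho * y_i
    }
    c(beta0 = beta0, beta = beta)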

Page 29: SML Week 4-6: Linear Methods for Classification

Perceptron Learning Algorithm–3

Problems with the algorithm, Ripley (1996)

When the data are separable, the solutions are non-unique and depend on the starting points.

"Finite" can be very large.

When the cases are not linearly separable, the algorithm can develop long, hard-to-detect cycles.

Page 30: SML Week 4-6: Linear Methods for Classification

Optimal Separating Hyperplane

The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point from either class (Vapnik, 1996).

$\max_{\beta_0, \beta, \|\beta\| = 1} C$

subject to $y_i(\beta_0 + x_i'\beta) \ge C$, $i = 1, \cdots, N$. Equivalently, dropping $\|\beta\| = 1$ (and redefining $\beta_0$),

$\frac{1}{\|\beta\|}\, y_i(\beta_0 + x_i'\beta) \ge C \iff y_i(\beta_0 + x_i'\beta) \ge C\|\beta\|.$

WLOG, rescale and set $\|\beta\| = 1/C$; the optimization problem becomes

$\min_{\beta_0, \beta} \tfrac{1}{2}\|\beta\|^2$ subject to $y_i(\beta_0 + x_i'\beta) \ge 1$, $i = 1, \cdots, N$.

Further transformation of this optimization problem leads to the support vector machine (SVM).
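This quadratic program can be handed to an off-the-shelf solver; a hedged R sketch, assuming the e1071 package, approximates the optimal separating hyperplane with a linear SVM and a very large cost, on simulated separable data.

    ## Approximate the optimal separating hyperplane with a hard-margin-like linear SVM
    library(e1071)
    set.seed(1)
    N <- 100
    X <- matrix(rnorm(N * 2), N, 2)
    y <- ifelse(X[, 1] + X[, 2] > 0, 1, -1)
    X <- X + y %o% c(0.5, 0.5)                  # make the two classes separable

    fit   <- svm(X, factor(y), kernel = "linear", cost = 1e5, scale = FALSE)
    beta  <- drop(t(fit$coefs) %*% fit$SV)      # recover beta from the support vectors
    beta0 <- -fit$rho
    c(beta0 = beta0, beta = beta)               # hyperplane beta0 + x'beta = 0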
