SML Week 4-6: Linear Methods for Classification


Page 1: SML Week 4-6: Linear Methods for Classification


SML Week 4-6: Linear Methods for Classification

C. Andy Tsao

Dept. of Applied Math, National Dong Hwa University

March 28, 2019

Page 2: SML Week 4-6: Linear Methods for Classification

Outline

1 Refine LSE

2 Introduction

3 Regression and Discriminant Analysis

4 Linear Discrimination Analysis

5 Logistic Regression

6 Recap

7 Separating hyperplane

8 Digression on Iterative Methods for finding roots

Reference: §3.4, Chapter 4 of HTF

Page 3: SML Week 4-6: Linear Methods for Classification

Why is LSE not satisfactory?

Prediction accuracy can often be improved by shrinking or zeroing some coefficients.

A large number of predictors helps neither interpretation nor computation.

Page 4: SML Week 4-6: Linear Methods for Classification

Approaches to var. selection and coef. shrinkage

Subset selection: best, forward (stepwise), backward (stepwise)

Shrinkage methods

1 Bridge regression includes the lasso ($q = 1$) and ridge regression ($q = 2$), in which $\hat\beta$ minimizes $L(\beta) = \|Y - \beta_0 - X\beta\|^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q$, where $\lambda, q > 0$.

2 For $q = 2$, $\hat\beta^{\mathrm{ridge}}$ equivalently minimizes $\|Y - \beta_0 - X\beta\|^2$ subject to $\sum_{j=1}^{p} |\beta_j|^2 \le s$; there is a one-to-one correspondence between $\lambda$ and $s$.

3 Standardization: $x'_{ij} = (x_{ij} - \bar{x}_{\cdot j})/s_j$, where $\bar{x}_{\cdot j} = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$ and $s_j^2 = \sum_{i=1}^{n}(x_{ij} - \bar{x}_{\cdot j})^2/(n-1)$. Then $\hat\beta_0 = \bar{y}$.
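As an illustration of the ridge case ($q = 2$), here is a minimal R sketch of the closed-form ridge solution on standardized predictors; the simulated data and the function name ridge_fit are purely hypothetical.

    ## Ridge regression in closed form: beta = (X'X + lambda I)^{-1} X'(y - ybar),
    ## computed on standardized predictors, with beta0 = ybar.
    ridge_fit <- function(X, y, lambda) {
      Xs <- scale(X)                    # (x_ij - xbar_.j)/s_j, with s_j using n - 1
      p  <- ncol(Xs)
      beta <- solve(t(Xs) %*% Xs + lambda * diag(p), t(Xs) %*% (y - mean(y)))
      list(beta0 = mean(y), beta = drop(beta))
    }

    set.seed(1)
    X <- matrix(rnorm(100 * 5), 100, 5)
    y <- drop(X %*% c(2, 0, 0, 1, 0)) + rnorm(100)
    ridge_fit(X, y, lambda = 10)        # shrinks the coefficients toward zero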

Page 5: SML Week 4-6: Linear Methods for Classification

Other Methods

Methods using derived inputs: Principal Components Regression (PCR), Partial Least Squares (PLS)

Bayesian shrinkage/variable selection

Page 6: SML Week 4-6: Linear Methods for Classification

Discrete Y classification

Closer look at the task of classification: find F : X → Y.

Linear methods: linear decision boundaries.

There is nothing completely new under the sun.

Naive approach: regression. What's wrong and what can be right?

Hypothesis Testing

Page 8: SML Week 4-6: Linear Methods for Classification

Regression for classification

Naive approach: regression. Y = X β + ε.

What’s wrong and what can be right?

Point estimation: $\hat{Y} = X(X'X)^{-1}X'Y$.

Inference on Y : Confidence Interval for Y ?

Gauss-Markov Condition

Normality assumption

Is it a linear method?

Page 12: SML Week 4-6: Linear Methods for Classification

Testing Hypotheses as a classification problem

Recall the hypothesis testing problem: formulation, criteria/guarantee, (recommended) procedure, supplemental measures.

Two classes ≈ two subspaces

Matching criteria: TE/GE vs power function

Accuracy estimation approach (cf. Hwang, Casella, Robert, Wells and Farrell (1992, Annals of Statistics)).

Is it a linear method?

Page 13: SML Week 4-6: Linear Methods for Classification

Logistic Regression

For classifying K classes: estimate $P[G = k \mid X = x]$.

For two classes, $\mathcal{G} = \{1, 2\}$:

$P(G = 1 \mid X = x) = \frac{\exp(\beta_0 + x'\beta)}{1 + \exp(\beta_0 + x'\beta)}$, $\quad P(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta_0 + x'\beta)}$.

$g(E(Y \mid x)) = \beta_0 + x'\beta$, where $g(p) = \log[p/(1-p)]$ is the logit transformation.

In the ordinary (Gaussian) linear model case of the GLM, Y is normally distributed and g is the identity transformation.

Strength and weakness
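A minimal R sketch of fitting this two-class model with glm(); the simulated data frame dat and its variables are hypothetical.

    ## Two-class logistic regression: logit P(Y = 1 | x) = beta0 + x'beta
    set.seed(1)
    dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
    dat$y <- rbinom(200, 1, plogis(-0.5 + 1.5 * dat$x1 - dat$x2))

    fit  <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = dat)
    summary(fit)                              # estimated beta0, beta and Wald tests
    phat <- predict(fit, type = "response")   # estimated P(Y = 1 | X = x)
    ghat <- ifelse(phat > 0.5, 1, 2)          # classify to the more probable class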

Page 15: SML Week 4-6: Linear Methods for Classification

Linear Regression of an Indicator Matrix

For classifying K classes, code the response via indicator variables.

If $\#\mathcal{G} = K$, define $Y_k = 1_{[G = k]}$ and $\mathbf{Y} = (Y_1, \cdots, Y_K)$; $\mathbf{Y}$ is an $N \times K$ indicator response matrix.

$\hat{\mathbf{Y}} = X(X'X)^{-1}X'\mathbf{Y}$, an $N \times K$ matrix.

$\hat{B} = (X'X)^{-1}X'\mathbf{Y}$, a $(p+1) \times K$ matrix.

A new observation with input x is classified by computing $\hat{f}(x) = (1, x)\hat{B} = (\hat{f}_1(x), \cdots, \hat{f}_K(x))$, a $1 \times K$ row vector, and $\hat{G}(x) = \arg\max_{k \in \mathcal{G}} \hat{f}_k(x)$.

Strength and weakness

Work out special cases when K = 2, 3.
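A minimal R sketch of this indicator-matrix regression; the simulated X, g and the classification step are only illustrative.

    ## Linear regression of an N x K indicator response matrix
    set.seed(1)
    N <- 150; K <- 3
    X <- matrix(rnorm(N * 2), N, 2)
    g <- factor(sample(1:K, N, replace = TRUE))

    Y  <- model.matrix(~ g - 1)               # N x K indicator response matrix
    X1 <- cbind(1, X)                         # prepend the intercept column
    B  <- solve(t(X1) %*% X1, t(X1) %*% Y)    # (p+1) x K coefficient matrix
    Fhat <- X1 %*% B                          # fitted f_k(x), an N x K matrix
    Ghat <- levels(g)[max.col(Fhat)]          # classify each row to argmax_k f_k(x)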

Page 16: SML Week 4-6: Linear Methods for Classification

P(G = k | X = x) (E(G | x))

Let $\pi_k$ be the prior probability of class k, with $\sum_k \pi_k = 1$.

By Bayes' theorem, $P(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{j=1}^{K} f_j(x)\,\pi_j}$, where $f_k$ is the class-conditional density of X in class $G = k$.

Various techniques are based on models for the $f_k$:

Gaussian densities: linear and quadratic discriminant analysis

Mixtures of Gaussians, nonparametric density estimates

Naive Bayes: a variant of the previous models assuming conditional independence among the explanatory variables
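A small R sketch of this posterior formula for one-dimensional Gaussian class densities with a common variance; the means, standard deviation and priors below are made-up numbers.

    ## Posterior P(G = k | X = x) = f_k(x) * pi_k / sum_j f_j(x) * pi_j
    mu    <- c(-1, 0, 2)        # hypothetical class means
    sigma <- 1                  # common standard deviation
    prior <- c(0.3, 0.5, 0.2)   # pi_k, summing to 1

    posterior <- function(x) {
      fx <- dnorm(x, mean = mu, sd = sigma)   # class densities f_k(x)
      fx * prior / sum(fx * prior)
    }
    posterior(0.5)              # posterior probabilities of the three classes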

Page 17: SML Week 4-6: Linear Methods for Classification

Gaussian class densities: LDA

$f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu_k)'\Sigma_k^{-1}(x - \mu_k)\right)$

Linear Discriminant Analysis (LDA): $\Sigma_k = \Sigma$ for all k.

$\log \frac{P(G = k \mid X = x)}{P(G = l \mid X = x)} = \log \frac{f_k(x)}{f_l(x)} + \log \frac{\pi_k}{\pi_l} = \log \frac{\pi_k}{\pi_l} - \tfrac{1}{2}(\mu_k + \mu_l)'\Sigma^{-1}(\mu_k - \mu_l) + x'\Sigma^{-1}(\mu_k - \mu_l)$

Linear discriminant functions: $\delta_k(x) = x'\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k'\Sigma^{-1}\mu_k + \log\pi_k$, with $\hat{G}(x) = \arg\max_k \delta_k(x)$.

Plug-in estimates: $\hat{\pi}_k = N_k/N$, where $N_k$ is the number of class-k observations; $\hat{\mu}_k = \sum_{g_i = k} x_i / N_k$; $\hat{\Sigma} = \sum_{k}\sum_{g_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)'/(N - K)$.
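A minimal R sketch computing these discriminant functions from the plug-in estimates for two simulated classes; MASS::lda(X, grouping = g) would fit the same model, and all data and names here are illustrative only.

    ## LDA by hand: delta_k(x) = x' Sigma^{-1} mu_k - 0.5 mu_k' Sigma^{-1} mu_k + log(pi_k)
    set.seed(1)
    N1 <- 60; N2 <- 40; p <- 2
    X <- rbind(matrix(rnorm(N1 * p, mean = 0),   N1, p),
               matrix(rnorm(N2 * p, mean = 1.5), N2, p))
    g <- rep(1:2, c(N1, N2)); N <- N1 + N2; K <- 2

    pi_hat <- as.numeric(table(g)) / N
    mu_hat <- rbind(colMeans(X[g == 1, ]), colMeans(X[g == 2, ]))
    S_hat  <- (crossprod(sweep(X[g == 1, ], 2, mu_hat[1, ])) +
               crossprod(sweep(X[g == 2, ], 2, mu_hat[2, ]))) / (N - K)

    delta <- function(x, k) {
      drop(x %*% solve(S_hat, mu_hat[k, ])) -
        0.5 * drop(mu_hat[k, ] %*% solve(S_hat, mu_hat[k, ])) + log(pi_hat[k])
    }
    x0 <- c(1, 1)
    which.max(c(delta(x0, 1), delta(x0, 2)))    # class assigned to x0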

Page 18: SML Week 4-6: Linear Methods for Classification

LDA vs. LSE

For K = 2, the LDA classification rule is

x ′Σ−1(µ2 − µ1) >1

2µ2′Σ−1µ2 −

1

2µ1′Σ−1µ1

+ log(N1/N)− log(N2/N)

Page 19: SML Week 4-6: Linear Methods for Classification

Quadratic Discriminant Analysis

QDA

Unequal $\Sigma_k$:

$\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)'\Sigma_k^{-1}(x - \mu_k) + \log\pi_k$
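A brief R sketch contrasting QDA with LDA, assuming the MASS package; the simulated data, in which class 2 has a different covariance, are placeholders.

    ## LDA (common Sigma) vs. QDA (class-specific Sigma_k) with MASS
    library(MASS)
    set.seed(1)
    X <- rbind(mvrnorm(60, c(0, 0), diag(2)),
               mvrnorm(60, c(2, 0), matrix(c(1, 0.8, 0.8, 2), 2)))
    g <- factor(rep(1:2, each = 60))

    fit_lda <- lda(X, grouping = g)     # pools a single covariance estimate
    fit_qda <- qda(X, grouping = g)     # estimates a separate Sigma_k per class
    predict(fit_qda, newdata = X)$class[1:5]    # predicted classes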

Page 20: SML Week 4-6: Linear Methods for Classification

Interpretation of FDA

For easy exposition, let K = 2

For LDA, assuming $\pi_2 = \pi_1$, the decision rule is $\hat{G}(x) = \arg\max_k f_k(x)$.

Note: $-2\log f_k(x) = (x - \mu_k)'\Sigma^{-1}(x - \mu_k) + \text{constant}$, so maximizing $f_k(x)$ classifies x to the class whose centroid $\mu_k$ is nearest in Mahalanobis distance.

Connection with logistic regression

Page 21: SML Week 4-6: Linear Methods for Classification

Model and Estimation

$\log \frac{P(G = 1 \mid X = x)}{P(G = K \mid X = x)} = \beta_{1,0} + \beta_1' x$

$\log \frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \beta_{k,0} + \beta_k' x$, for $k = 1, \cdots, K - 1$.

That is,

$P(G = k \mid X = x) = \frac{\exp(\beta_{k,0} + \beta_k' x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l,0} + \beta_l' x)}$, $k = 1, \cdots, K - 1$,

$P(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l,0} + \beta_l' x)}$.

Newton-Raphson Method (R)

A large number of predictors demands heavy computation, and the iteration might not converge at all.
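A minimal sketch in R of the Newton–Raphson iteration for the two-class case (equivalently, iteratively reweighted least squares); glm() with family = binomial performs essentially this fit, and the simulated data are placeholders.

    ## Newton-Raphson / IRLS for binary logistic regression
    set.seed(1)
    n <- 200
    X <- cbind(1, rnorm(n), rnorm(n))          # design matrix with intercept column
    y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1, -1)))

    beta <- rep(0, ncol(X))
    for (it in 1:25) {
      p <- drop(plogis(X %*% beta))            # current P(Y = 1 | x)
      W <- p * (1 - p)                         # weights on the diagonal of W
      score   <- t(X) %*% (y - p)              # gradient of the log-likelihood
      hessian <- t(X) %*% (X * W)              # X'WX (negative Hessian)
      step <- drop(solve(hessian, score))      # Newton step
      beta <- beta + step
      if (max(abs(step)) < 1e-8) break         # stop when the step is negligible
    }
    beta    # should agree with coef(glm(y ~ X[, -1], family = binomial))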

Page 22: SML Week 4-6: Linear Methods for Classification

Logistic Regression or LDA?

For logistic regression, by construction,

$\log \frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \beta_{k0} + \beta_k' x.$

Recall that in LDA,

$\log \frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \log \frac{\pi_k}{\pi_K} - \tfrac{1}{2}(\mu_k + \mu_K)'\Sigma^{-1}(\mu_k - \mu_K) + x'\Sigma^{-1}(\mu_k - \mu_K) = \alpha_{k0} + \alpha_k' x.$

Page 23: SML Week 4-6: Linear Methods for Classification

Compare and contrast

Both share the same linear form for the log conditional probability ratio $\log\left[P(G = k \mid X = x)/P(G = K \mid X = x)\right]$.

They differ in how the unknown parameters are estimated: logistic regression maximizes the conditional likelihood of G given X, whereas LDA maximizes the full (joint) likelihood based on $P(X, G = k) = f_k(x)\,\pi_k$.

Page 24: SML Week 4-6: Linear Methods for Classification

Variables: Continuous or Categorical

R summary of data set (data.frame)

Continuous or categorical variable, a practical understanding

Continuous: detailed information; effective yet sensitive modelling

Categorical: rough information; complex (when the number of levels is large) yet robust modelling
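A short R illustration of inspecting variable types and declaring a categorical variable as a factor; iris is a built-in data set and the small data frame is made up.

    ## Continuous vs. categorical variables in a data.frame
    str(iris)         # four numeric (continuous) variables plus the factor Species
    summary(iris)     # quartiles for continuous variables, level counts for factors

    dat <- data.frame(y = rnorm(6), grp = c(1, 1, 2, 2, 3, 3))
    dat$grp <- factor(dat$grp)    # treat the numeric code as categorical
    summary(dat)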

Page 25: SML Week 4-6: Linear Methods for Classification

Hyperplane

A hyperplane or affine set L is defined by the equation $f(x) = \beta_0 + \beta' x = 0$. For $\mathbb{R}^2$ this is a line. Perceptrons (Rosenblatt, 1958).

For any x1, x2 ∈ L, β′(x1 − x2) = 0.

β∗ = β/||β|| is the (unit) normal vector of L

For any x0 in L, β′x0 = −β0

The signed distance of any x to L is

$\beta^{*\prime}(x - x_0) = \frac{1}{\|\beta\|}(\beta' x + \beta_0) = \frac{f(x)}{\|f'(x)\|}$

Figure 4.14.

Thus f(x) is proportional to the signed distance from x to the hyperplane f(x) = 0.
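A few lines of R illustrating the signed-distance formula for an arbitrary made-up hyperplane in the plane.

    ## Signed distance from x to the hyperplane f(x) = beta0 + beta'x = 0
    beta0 <- -1
    beta  <- c(2, 1)
    f <- function(x) beta0 + sum(beta * x)
    signed_dist <- function(x) f(x) / sqrt(sum(beta^2))   # f(x) / ||beta||

    signed_dist(c(1, 1))    # positive: x is on the side the normal beta points toward
    signed_dist(c(0, 0))    # negative: x is on the other side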

Page 26: SML Week 4-6: Linear Methods for Classification

Bisection Method

Given: f(x) is continuous on $[a_0, b_0]$ and $f(a_0)f(b_0) < 0$.

For $n = 0, 1, 2, \dots$ until satisfied, do:

Set $m = (a_n + b_n)/2$

If $f(a_n)f(m) < 0$, set $a_{n+1} = a_n$, $b_{n+1} = m$

Otherwise set $a_{n+1} = m$, $b_{n+1} = b_n$

Then f(x) has a root in $[a_{n+1}, b_{n+1}]$.

Conte and de Boor (1980). Elementary Numerical Analysis: An Algorithmic Approach, 3rd edition.
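A direct R implementation of the iteration above; the example equation cos(x) = x is just an illustration.

    ## Bisection: f continuous on [a, b] with f(a) * f(b) < 0
    bisect <- function(f, a, b, tol = 1e-8, maxit = 100) {
      stopifnot(f(a) * f(b) < 0)
      for (n in 1:maxit) {
        m <- (a + b) / 2
        if (f(a) * f(m) < 0) b <- m else a <- m   # keep the sign-change interval
        if ((b - a) / 2 < tol) break
      }
      (a + b) / 2          # midpoint of the final bracketing interval
    }

    bisect(function(x) cos(x) - x, 0, 1)   # root of cos(x) = x, roughly 0.739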

Page 27: SML Week 4-6: Linear Methods for Classification

Perceptron Learning Algorithm–1

Goal: Minimize the distance of misclassified points to the decisionboundary. Consider the binary problem when Y = ±1.

$\min_{\beta, \beta_0} D(\beta, \beta_0) = -\sum_{i \in \mathcal{M}} y_i(\beta_0 + x_i'\beta) \qquad (4.37)$

where $\mathcal{M}$ indexes the misclassified observations. $D \ge 0$, and D is proportional to the sum of the distances of the misclassified points to the decision boundary defined by $\beta_0 + x'\beta = 0$. Assuming $\mathcal{M}$ is fixed, the gradient is

$\frac{\partial D(\beta, \beta_0)}{\partial \beta} = -\sum_{i \in \mathcal{M}} y_i x_i, \qquad \frac{\partial D(\beta, \beta_0)}{\partial \beta_0} = -\sum_{i \in \mathcal{M}} y_i$

Page 28: SML Week 4-6: Linear Methods for Classification

Perceptron Learning Algorithm–2

A step is taken after each misclassified observation rather than after summing over all of them; this is stochastic gradient descent applied to a piecewise-linear criterion:

$\beta \leftarrow \beta + \rho\, y_i x_i$

$\beta_0 \leftarrow \beta_0 + \rho\, y_i$

where $\rho$ is the learning rate; WLOG take $\rho = 1$.

When the cases are linearly separable, this algorithm will converge to a separating hyperplane in a finite number of steps.

Remark: Contrast to population version
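A minimal R sketch of these updates with ρ = 1, cycling through misclassified points on simulated, linearly separable data; all names and data are illustrative.

    ## Perceptron learning: update (beta, beta0) using one misclassified point at a time
    set.seed(1)
    N <- 100
    X <- matrix(rnorm(N * 2), N, 2)
    y <- ifelse(X[, 1] + X[, 2] > 0, 1, -1)
    X <- X + y %o% c(0.5, 0.5)               # push the classes apart: separable with a margin

    beta <- c(0, 0); beta0 <- 0; rho <- 1
    for (step in 1:10000) {
      mis <- which(y * (beta0 + X %*% beta) <= 0)   # misclassified set M
      if (length(mis) == 0) break                   # a separating hyperplane has been found
      i <- mis[1]
      beta  <- beta  + rho * y[i] * X[i, ]          # beta  <- beta  + rho * y_i * x_i
      beta0 <- beta0 + rho * y[i]                   # beta0 <- beta0 + rho * y_i
    }
    c(beta0 = beta0, beta = beta)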

Page 29: SML Week 4-6: Linear Methods for Classification

Perceptron Learning Algorithm–3

Problems with the algorithm, Ripley (1996)

When the data are separable, the solutions are non-unique and depend on the starting points.

"Finite" can be very large.

When the cases are not linearly separable, the algorithm can develop long, hard-to-detect cycles.

Page 30: SML Week 4-6: Linear Methods for Classification

Optimal Separating Hyperplane

The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point from either class (Vapnik, 1996).

$\max_{\beta_0, \beta, \|\beta\| = 1} C$

subject to $y_i(\beta_0 + x_i'\beta) \ge C$, $i = 1, \cdots, N$. Equivalently, dropping $\|\beta\| = 1$ (and redefining $\beta_0$),

$\frac{1}{\|\beta\|}\, y_i(\beta_0 + x_i'\beta) \ge C \iff y_i(\beta_0 + x_i'\beta) \ge C\|\beta\|.$

WLOG, rescale and set $\|\beta\| = 1/C$; the optimization problem becomes

$\min_{\beta_0, \beta} \tfrac{1}{2}\|\beta\|^2$ subject to $y_i(\beta_0 + x_i'\beta) \ge 1$, $i = 1, \cdots, N$.

Further transformation of this optimization problem leads to the support vector machine (SVM).
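This quadratic program can be handed to an off-the-shelf solver; a hedged R sketch, assuming the e1071 package, approximates the optimal separating hyperplane with a linear SVM and a very large cost, on simulated separable data.

    ## Approximate the optimal separating hyperplane with a hard-margin-like linear SVM
    library(e1071)
    set.seed(1)
    N <- 100
    X <- matrix(rnorm(N * 2), N, 2)
    y <- ifelse(X[, 1] + X[, 2] > 0, 1, -1)
    X <- X + y %o% c(0.5, 0.5)                  # make the two classes separable

    fit   <- svm(X, factor(y), kernel = "linear", cost = 1e5, scale = FALSE)
    beta  <- drop(t(fit$coefs) %*% fit$SV)      # recover beta from the support vectors
    beta0 <- -fit$rho
    c(beta0 = beta0, beta = beta)               # hyperplane beta0 + x'beta = 0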
