SML Week 4-6: Linear Methods for Classification
C. Andy Tsao
Dept. of Applied Math, National Dong Hwa University
March 28, 2019
Outline
1 Refine LSE
2 Introduction
3 Regression and Discriminant Analysis
4 Linear Discriminant Analysis
5 Logistic Regression
6 Recap
7 Separating hyperplane
8 Digression on Iterative Methods for finding roots
Reference: §3.4, Chapter 4 of HTF
Why is LSE not satisfactory?
Prediction can be improved by shrinking or zeroing some coefficients.
A large number of predictors does not help interpretation or computation.
Approaches to variable selection and coefficient shrinkage
Subset selection: best, forward (stepwise), backward (stepwise)
Shrinkage methods
1 Bridge regression includes the lasso ($q = 1$) and ridge regression ($q = 2$), in which $\beta$ minimizes
$L(\beta) = \|Y - \beta_0 - X\beta\|^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q$, where $\lambda, q > 0$.
2 For $q = 2$, $\hat{\beta}^{\text{ridge}}$ equivalently minimizes $\|Y - \beta_0 - X\beta\|^2$ subject to $\sum_{j=1}^{p} |\beta_j|^2 \le s$; there is a one-to-one correspondence between $\lambda$ and $s$.
3 Standardization: $x_{ij}' = (x_{ij} - \bar{x}_{\cdot j})/s_j$, where $\bar{x}_{\cdot j} = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$ and $s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_{\cdot j})^2$. Then $\hat{\beta}_0 = \bar{y}$.
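As an illustration (not on the original slides), both penalties can be fit in R with the glmnet package: alpha = 0 gives ridge ($q = 2$) and alpha = 1 the lasso ($q = 1$). A minimal sketch on simulated data, assuming glmnet is installed:

```r
# Ridge and lasso on simulated data; glmnet standardizes predictors by
# default, matching the standardization step above.
library(glmnet)

set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)       # predictors
beta <- c(3, -2, rep(0, p - 2))       # sparse true coefficients
y <- X %*% beta + rnorm(n)

fit_ridge <- glmnet(X, y, alpha = 0)  # q = 2: ridge
fit_lasso <- glmnet(X, y, alpha = 1)  # q = 1: lasso

cv <- cv.glmnet(X, y, alpha = 1)      # pick lambda by cross-validation
coef(cv, s = "lambda.min")            # coefficients at the selected lambda
```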
Other Methods
Methods using derived inputs: Principal Component Regression (PCR), Partial Least Squares (PLS)
Bayesian shrinkage/variable selection
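A quick sketch of the derived-input methods (my addition, assuming the pls package is installed; data simulated, names illustrative):

```r
# PCR and PLS regress y on a few derived components of X.
library(pls)

set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 2] + rnorm(n)
d <- data.frame(y = y, X = I(X))             # keep X as a matrix column

fit_pcr <- pcr(y ~ X, ncomp = 3, data = d)   # principal component regression
fit_pls <- plsr(y ~ X, ncomp = 3, data = d)  # partial least squares
summary(fit_pls)                             # variance explained per component
```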
Discrete Y classification
A closer look at the task of classification: find $F: \mathcal{X} \to \mathcal{Y}$.
Linear methods: linear decision boundaries.
There is nothing completely new under the sun.
Naive approach: regression. What's wrong and what can be right?
Hypothesis testing
Regression for classification
Naive approach: regression. $Y = X\beta + \varepsilon$.
What's wrong and what can be right?
Point estimation: $\hat{Y} = X(X'X)^{-1}X'Y$.
Inference on $Y$: confidence interval for $Y$?
Gauss-Markov conditions
Normality assumption
Is it a linear method?
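A minimal sketch of this naive approach for two classes, coding $Y$ as 0/1 and thresholding the least-squares fit at 1/2 (simulated data, illustrative names):

```r
# Regress a 0/1 class indicator on x; classify by thresholding the fit.
set.seed(1)
n <- 200
x <- c(rnorm(n / 2, mean = -1), rnorm(n / 2, mean = 1))
y <- rep(c(0, 1), each = n / 2)          # class labels coded 0/1

fit  <- lm(y ~ x)                        # least squares on the indicator
yhat <- as.numeric(fitted(fit) > 0.5)    # decision boundary at fitted value 1/2
mean(yhat == y)                          # training accuracy
```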
Testing Hypotheses as a classification problem
Recall the hypothesis testing problem: formulation, criteria/guarantees, (recommended) procedure, supplemental measures.
Two classes ≈ two subspaces
Matching criteria: training/generalization error (TE/GE) vs. the power function
Accuracy estimation approach (cf. Hwang, Casella, Robert, Wells and Farrell (1992), Annals of Statistics)
Is it a linear method?
Logistic Regression
For classifying $K$ classes: estimation of $P[G = k \mid X = x]$.
For two classes, $\mathcal{G} = \{1, 2\}$:
$P(G = 1 \mid X = x) = \dfrac{\exp(\beta_0 + x'\beta)}{1 + \exp(\beta_0 + x'\beta)}$, $P(G = 2 \mid X = x) = \dfrac{1}{1 + \exp(\beta_0 + x'\beta)}$.
$g(E(Y \mid x)) = \beta_0 + x'\beta$, where $g(p) = \log[p/(1 - p)]$ is the logit transformation.
For the ordinary (Gaussian) linear model, $Y$ is normally distributed and $g$ is the identity transformation.
Strengths and weaknesses
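A minimal two-class sketch with R's glm(), which fits the logit model by maximum likelihood (simulated data, illustrative names):

```r
# Two-class logistic regression:
# P(G = 1 | x) = exp(b0 + b1 x) / (1 + exp(b0 + b1 x)).
set.seed(1)
n <- 200
x <- rnorm(n)
g <- rbinom(n, 1, plogis(-0.5 + 2 * x))  # labels drawn from a true logit model

fit <- glm(g ~ x, family = binomial)     # ML fit (IRLS / Newton-Raphson)
coef(fit)                                # estimates of (beta0, beta)
head(predict(fit, type = "response"))    # fitted P(G = 1 | x)
```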
Linear Regression of an Indicator Matrix
For classifying $K$ classes, code the response via indicator variables.
If $\#\mathcal{G} = K$, define $Y_k = 1_{[G = k]}$ and $\mathbf{Y} = (Y_1, \ldots, Y_K)$, an $N \times K$ indicator response matrix.
$\hat{\mathbf{Y}} = X(X'X)^{-1}X'\mathbf{Y}$, an $N \times K$ matrix.
$\hat{B} = (X'X)^{-1}X'\mathbf{Y}$, a $(p+1) \times K$ matrix.
A new observation with input $x$ is classified via
$\hat{f}(x) = [(1, x')\hat{B}]' = (\hat{f}_1(x), \ldots, \hat{f}_K(x))$, a $K$-vector, and $\hat{G}(x) = \arg\max_{k \in \mathcal{G}} \hat{f}_k(x)$.
Strengths and weaknesses
Work out the special cases $K = 2, 3$.
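A direct R sketch of this construction (simulated data; names illustrative):

```r
# Linear regression of an N x K indicator matrix; classify by the largest fit.
set.seed(1)
N <- 150; K <- 3
g <- sample(1:K, N, replace = TRUE)                   # class labels
X <- matrix(rnorm(N * 2), N, 2) + 2 * cbind(g == 2, g == 3)

Y  <- sapply(1:K, function(k) as.numeric(g == k))     # N x K indicator matrix
X1 <- cbind(1, X)                                     # prepend intercept column
Bhat <- solve(t(X1) %*% X1, t(X1) %*% Y)              # (p+1) x K coefficients

fhat <- X1 %*% Bhat                                   # N x K fitted values
Ghat <- max.col(fhat)                                 # argmax_k of fhat_k(x)
mean(Ghat == g)                                       # training accuracy
```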
$P(G = k \mid X = x)$ ($E(G \mid x)$)
Let $\pi_k$ be the prior probability of class $k$, with $\sum_k \pi_k = 1$, and let $f_k(x)$ be the class-conditional density of $X$ in class $k$. By Bayes' theorem,
$P(G = k \mid X = x) = \dfrac{f_k(x)\,\pi_k}{\sum_{j=1}^{K} f_j(x)\,\pi_j}.$
Various techniques are based on models for the $f_k$:
Gaussian: linear and quadratic discriminant analysis
Mixtures of Gaussians, nonparametric density estimates
Naive Bayes: a variant of the previous models assuming conditional independence among the explanatory variables
Gaussian class densities: LDA
$f_k(x) = \dfrac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\left(-\dfrac{1}{2}(x - \mu_k)'\Sigma_k^{-1}(x - \mu_k)\right)$
Linear Discriminant Analysis (LDA): $\Sigma_k = \Sigma$ for all $k$. Then
$\log \dfrac{P(G = k \mid X = x)}{P(G = l \mid X = x)} = \log \dfrac{f_k(x)}{f_l(x)} + \log \dfrac{\pi_k}{\pi_l} = \log \dfrac{\pi_k}{\pi_l} - \dfrac{1}{2}(\mu_k + \mu_l)'\Sigma^{-1}(\mu_k - \mu_l) + x'\Sigma^{-1}(\mu_k - \mu_l)$
Linear discriminant functions: $\delta_k(x) = x'\Sigma^{-1}\mu_k - \dfrac{1}{2}\mu_k'\Sigma^{-1}\mu_k + \log \pi_k$, and $\hat{G}(x) = \arg\max_k \delta_k(x)$.
Estimates: $\hat{\pi}_k = N_k/N$, where $N_k$ is the number of class-$k$ observations; $\hat{\mu}_k = \sum_{g_i = k} x_i / N_k$; $\hat{\Sigma} = \sum_{k} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)' / (N - K)$.
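These quantities can be estimated in R with MASS::lda(); a minimal sketch on the built-in iris data (MASS ships with R):

```r
# LDA via MASS::lda(), which returns the estimated priors and centroids.
library(MASS)

fit <- lda(Species ~ ., data = iris)   # K = 3 classes, p = 4 predictors
fit$prior                              # pi_k = N_k / N
fit$means                              # class centroids mu_k
pred <- predict(fit, iris)
table(pred$class, iris$Species)        # confusion matrix (training data)
```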
LDA vs. LSE
For $K = 2$, the LDA rule classifies to class 2 if
$x'\hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) > \dfrac{1}{2}\hat{\mu}_2'\hat{\Sigma}^{-1}\hat{\mu}_2 - \dfrac{1}{2}\hat{\mu}_1'\hat{\Sigma}^{-1}\hat{\mu}_1 + \log(N_1/N) - \log(N_2/N).$
Quadratic Discriminant Analysis
QDA: the $\Sigma_k$ are not assumed equal. The quadratic discriminant functions are
$\delta_k(x) = -\dfrac{1}{2}\log|\Sigma_k| - \dfrac{1}{2}(x - \mu_k)'\Sigma_k^{-1}(x - \mu_k) + \log \pi_k.$
Interpretation of LDA
For easy exposition, let $K = 2$.
For LDA, assuming $\pi_1 = \pi_2$, the decision rule is $\hat{G}(x) = \arg\max_k f_k(x)$.
Note: maximizing $f_k(x)$ amounts to minimizing the Mahalanobis distance $(x - \mu_k)'\Sigma^{-1}(x - \mu_k)$, i.e., classifying to the nearest centroid in this metric.
Connection with logistic regression
Model and Estimation
The multinomial logit model specifies, for a baseline class $K$,
$\log\left(\dfrac{P(G = k \mid X = x)}{P(G = K \mid X = x)}\right) = \beta_{k,0} + \beta_k' x, \quad k = 1, \ldots, K - 1.$
That is,
$P(G = k \mid X = x) = \dfrac{\exp(\beta_{k,0} + \beta_k' x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l,0} + \beta_l' x)}, \quad k = 1, \ldots, K - 1,$
$P(G = K \mid X = x) = \dfrac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l,0} + \beta_l' x)}.$
Fitting: Newton-Raphson method (in R).
A large number of predictors demands heavy computation, and the iterations might not converge at all.
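As an aside (not on the original slides), the $K$-class model can be fit in R with nnet::multinom, which maximizes the conditional likelihood numerically:

```r
# Multinomial logistic regression: K-1 coefficient vectors relative to a
# baseline class (the first factor level, in multinom's convention).
library(nnet)

fit <- multinom(Species ~ ., data = iris, trace = FALSE)
coef(fit)                           # (K-1) x (p+1) coefficient matrix
head(predict(fit, type = "probs"))  # fitted P(G = k | X = x)
```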
Logistic Regression or LDA?
For logistic regression,
$\log \dfrac{P(G = k \mid X = x)}{P(G = l \mid X = x)} = \beta_{k0} + \beta_k' x.$
Recall that in LDA,
$\log \dfrac{P(G = k \mid X = x)}{P(G = l \mid X = x)} = \log \dfrac{\pi_k}{\pi_l} - \dfrac{1}{2}(\mu_k + \mu_l)'\Sigma^{-1}(\mu_k - \mu_l) + x'\Sigma^{-1}(\mu_k - \mu_l) = \alpha_{k0} + \alpha_k' x.$
Compare and contrast
Same linear form for the log conditional probability ratio $\log\left(\dfrac{P(G = k \mid X = x)}{P(G = K \mid X = x)}\right)$.
Different ways of estimating the unknown parameters: logistic regression maximizes the conditional likelihood, while LDA maximizes the full likelihood based on $P(X, G = k) = f_k(x)\,\pi_k$.
Variables: Continuous or Categorical
R summary of a data set (data.frame)
Continuous or categorical variable: a practical understanding
Continuous: detailed information; effective yet sensitive modelling
Categorical: rough information; complex (when the number of levels is large) yet robust modelling
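For example (using the built-in iris data), str() and summary() show which variables R treats as continuous or categorical:

```r
# Inspect variable types: numeric (continuous) vs. factor (categorical).
str(iris)       # Species is a factor with 3 levels; the others are numeric
summary(iris)   # numeric columns: quartiles; factor columns: level counts

# Coerce a numeric code to a categorical variable when appropriate.
x <- c(1, 2, 2, 3, 1)
f <- factor(x, labels = c("low", "mid", "high"))
summary(f)
```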
Hyperplane
A hyperplane, or affine set, $L$ is defined by the equation $f(x) = \beta_0 + \beta' x = 0$; in $\mathbb{R}^2$ this is a line. (Perceptrons: Rosenblatt, 1958.)
For any $x_1, x_2 \in L$, $\beta'(x_1 - x_2) = 0$.
$\beta^* = \beta/\|\beta\|$ is the (unit) normal vector of $L$.
For any $x_0 \in L$, $\beta' x_0 = -\beta_0$.
The signed distance of any $x$ to $L$ is
$\beta^{*\prime}(x - x_0) = \dfrac{1}{\|\beta\|}(\beta' x + \beta_0) = \dfrac{f(x)}{\|f'(x)\|}.$
See Figure 4.14.
Hence $f(x)$ is proportional to the signed distance from $x$ to the hyperplane $f(x) = 0$.
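A small sketch of the signed-distance formula (the helper function and names are my illustrative additions):

```r
# Signed distance from rows of X to the hyperplane f(x) = beta0 + beta'x = 0.
signed_dist <- function(X, beta, beta0) {
  (X %*% beta + beta0) / sqrt(sum(beta^2))   # (beta'x + beta0) / ||beta||
}

beta <- c(1, 2); beta0 <- -1                 # a line in R^2
X <- rbind(c(1, 0), c(0, 0.5), c(-1, 1))     # three test points
signed_dist(X, beta, beta0)                  # sign tells which side of the line
```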
Bisection Method
Given: $f$ is continuous on $[a_0, b_0]$ and $f(a_0)f(b_0) < 0$.
For $n = 0, 1, 2, \ldots$ until satisfied, do:
Set $m = (a_n + b_n)/2$.
If $f(a_n)f(m) < 0$, set $a_{n+1} = a_n$, $b_{n+1} = m$;
otherwise set $a_{n+1} = m$, $b_{n+1} = b_n$.
Then $f$ has a root in $[a_{n+1}, b_{n+1}]$.
Conte and de Boor (1980). Elementary Numerical Analysis: An Algorithmic Approach, 3rd edition.
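A direct R transcription of the algorithm (the tolerance and iteration cap are my additions):

```r
# Bisection: halve the bracket while preserving the sign change.
bisect <- function(f, a, b, tol = 1e-8, maxit = 100) {
  stopifnot(f(a) * f(b) < 0)           # need f(a0) f(b0) < 0
  for (n in 1:maxit) {
    m <- (a + b) / 2
    if (f(a) * f(m) < 0) b <- m else a <- m
    if (b - a < tol) break             # "until satisfied"
  }
  (a + b) / 2                          # a root lies in [a_{n+1}, b_{n+1}]
}

bisect(function(x) x^2 - 2, 0, 2)      # ~ sqrt(2) = 1.414214
```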
Perceptron Learning Algorithm–1
Goal: minimize the distance of misclassified points to the decision boundary. Consider the binary problem with $Y = \pm 1$:
$\min_{\beta, \beta_0} D(\beta, \beta_0) = -\sum_{i \in \mathcal{M}} y_i(\beta_0 + x_i'\beta) \quad (4.37)$
$D \ge 0$, and $D$ is proportional to the sum of the distances of the misclassified points to the decision boundary $\beta_0 + x'\beta = 0$. Assuming the misclassified set $\mathcal{M}$ is fixed, the gradient is
$\dfrac{\partial D(\beta, \beta_0)}{\partial \beta} = -\sum_{i \in \mathcal{M}} y_i x_i, \quad \dfrac{\partial D(\beta, \beta_0)}{\partial \beta_0} = -\sum_{i \in \mathcal{M}} y_i.$
Perceptron Learning Algorithm–2
A step is taken after each observation rather than after summing over all of them; this is stochastic gradient descent, minimizing a piecewise-linear criterion. For a misclassified $(x_i, y_i)$:
$\beta \leftarrow \beta + \rho y_i x_i, \quad \beta_0 \leftarrow \beta_0 + \rho y_i,$
where $\rho$ is the learning rate; WLOG $\rho = 1$.
When the classes are linearly separable, this algorithm converges to a separating hyperplane in a finite number of steps.
Remark: contrast with the population version.
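A minimal R sketch of these updates on simulated separable data (the function and all names are illustrative):

```r
# Perceptron learning: sweep the data, updating only on misclassified points.
perceptron <- function(X, y, rho = 1, max_epochs = 100) {
  beta <- rep(0, ncol(X)); beta0 <- 0
  for (epoch in 1:max_epochs) {
    mistakes <- 0
    for (i in sample(nrow(X))) {                       # random order each sweep
      if (y[i] * (beta0 + sum(X[i, ] * beta)) <= 0) {  # x_i misclassified
        beta  <- beta  + rho * y[i] * X[i, ]           # beta  <- beta + rho y_i x_i
        beta0 <- beta0 + rho * y[i]                    # beta0 <- beta0 + rho y_i
        mistakes <- mistakes + 1
      }
    }
    if (mistakes == 0) break           # separating hyperplane found
  }
  list(beta = beta, beta0 = beta0, epochs = epoch)
}

set.seed(1)
X <- rbind(matrix(rnorm(40, -2), 20, 2), matrix(rnorm(40, 2), 20, 2))
y <- rep(c(-1, 1), each = 20)          # labels in {-1, +1}
perceptron(X, y)
```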
Perceptron Learning Algorithm–3
Problems with the algorithm (Ripley, 1996):
When the data are separable, the solutions are non-unique and depend on the starting values.
"Finite" can be very large.
When the cases are not linearly separable, the algorithm can develop long, hard-to-detect cycles.
Optimal Separating Hyperplane
The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point from either class (Vapnik, 1996).
$\max_{\beta_0, \beta, \|\beta\| = 1} C$
subject to $y_i(\beta_0 + x_i'\beta) \ge C$, $i = 1, \ldots, N$.
Equivalently, dropping the constraint $\|\beta\| = 1$ (and redefining $\beta_0$):
$\dfrac{1}{\|\beta\|} y_i(\beta_0 + x_i'\beta) \ge C \iff y_i(\beta_0 + x_i'\beta) \ge C\|\beta\|.$
WLOG rescale and set $\|\beta\| = 1/C$; the optimization problem becomes
$\min_{\beta_0, \beta} \dfrac{1}{2}\|\beta\|^2$ subject to $y_i(\beta_0 + x_i'\beta) \ge 1$, $i = 1, \ldots, N$.
Further transformation of this optimization problem leads to the SVM.
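A sketch (my addition, not the slides'): on separable data, a linear SVM with a very large cost approximates the hard-margin problem above; here via the e1071 package, assuming it is installed:

```r
# Approximate the optimal separating hyperplane with a large-cost linear SVM.
library(e1071)

set.seed(1)
X <- rbind(matrix(rnorm(40, -2), 20, 2), matrix(rnorm(40, 2), 20, 2))
y <- factor(rep(c(-1, 1), each = 20))

fit <- svm(X, y, kernel = "linear", cost = 1e4, scale = FALSE)
beta  <- colSums(fit$coefs[, 1] * fit$SV)   # beta recovered from support vectors
beta0 <- -fit$rho                           # intercept (e1071 stores -beta0 as rho)
c(beta0 = beta0, beta = beta)
```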