Linear Methods for Classification
Lecture Notes for CMPUT 466/551
Nilanjan Ray
Linear Classification
• What is meant by linear classification?
  – The decision boundaries in the feature (input) space are linear
• Should the regions be contiguous?
[Figure: regions R1, R2, R3, R4 in the (X1, X2) plane: piecewise linear decision boundaries in 2D input space]
Linear Classification…
• There is a discriminant function $\delta_k(x)$ for each class k
• Classification rule: $R_k = \{x : k = \arg\max_j \delta_j(x)\}$
• In higher-dimensional spaces the decision boundaries are piecewise hyperplanar
• Remember that the 0-1 loss function led to the classification rule: $R_k = \{x : k = \arg\max_j \Pr(G = j \mid X = x)\}$
• So, $\Pr(G = k \mid X = x)$ can serve as $\delta_k(x)$
Linear Classification…
• All we require here is that the class boundaries $\{x : \delta_k(x) = \delta_j(x)\}$ be linear for every (k, j) pair
• One can achieve this if the $\delta_k(x)$ themselves are linear, or if any monotone transform of $\delta_k(x)$ is linear. An example:

$$\Pr(G = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}, \qquad \Pr(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta_0 + \beta^T x)}$$

So that

$$\log\!\left[\frac{\Pr(G = 1 \mid X = x)}{\Pr(G = 2 \mid X = x)}\right] = \beta_0 + \beta^T x \qquad \text{(linear)}$$
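A quick numerical check of this example (the coefficient values β₀ and β below are arbitrary assumptions, not from the notes): the two posteriors sum to one, and the log-odds is exactly the linear function of x.

```python
import numpy as np

# Arbitrary illustrative parameters for the two-class logistic example.
rng = np.random.default_rng(0)
beta0, beta = -0.5, np.array([1.0, 2.0])

x = rng.normal(size=2)
eta = beta0 + beta @ x                   # the linear function of x
p1 = np.exp(eta) / (1.0 + np.exp(eta))   # Pr(G=1 | X=x)
p2 = 1.0 / (1.0 + np.exp(eta))           # Pr(G=2 | X=x)

assert np.isclose(p1 + p2, 1.0)          # posteriors sum to one
assert np.isclose(np.log(p1 / p2), eta)  # log-odds is linear in x
```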
Linear Classification as Linear Regression
2D input space: X = (X1, X2)
Number of classes/categories K = 3, so output Y = (Y1, Y2, Y3)
Training sample, size N = 5:

$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \\ 1 & x_{41} & x_{42} \\ 1 & x_{51} & x_{52} \end{bmatrix}, \qquad \mathbf{Y} = \begin{bmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \\ y_{31} & y_{32} & y_{33} \\ y_{41} & y_{42} & y_{43} \\ y_{51} & y_{52} & y_{53} \end{bmatrix}$$

$\mathbf{Y}$ is the indicator matrix: each row has exactly one 1, indicating the category/class.

Regression output:

$$\hat{Y}((x_1, x_2)) = (1 \;\; x_1 \;\; x_2)\,(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = \big(\hat{Y}_1((x_1, x_2)),\; \hat{Y}_2((x_1, x_2)),\; \hat{Y}_3((x_1, x_2))\big)$$

Or, classification rule:

$$\hat{G}((x_1, x_2)) = \arg\max_k \hat{Y}_k((x_1, x_2))$$
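A sketch of this indicator-matrix regression; the five training points and their labels are made-up values for illustration.

```python
import numpy as np

# N=5 training points in 2D, K=3 classes, as on the slide.
X = np.array([[1, 0.2, 0.1],   # each row: (1, x1, x2)
              [1, 0.9, 0.2],
              [1, 0.4, 0.9],
              [1, 0.1, 0.3],
              [1, 0.8, 0.8]])
g = np.array([0, 1, 2, 0, 1])  # made-up class labels

Y = np.zeros((5, 3))           # indicator matrix: one 1 per row
Y[np.arange(5), g] = 1.0

# Least-squares solution (X^T X)^{-1} X^T Y
B = np.linalg.lstsq(X, Y, rcond=None)[0]

def classify(x1, x2):
    yhat = np.array([1.0, x1, x2]) @ B   # \hat{Y}((x1, x2))
    return int(np.argmax(yhat))          # argmax_k rule

pred = classify(0.2, 0.1)
```

Because the rows of Y sum to 1 and the model contains an intercept, the fitted components $\hat{Y}_k$ sum to 1 at every x, even though each can fall outside [0, 1].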
The Masking
Linear regression of the indicator matrix can lead to masking: viewed along the viewing direction, the fitted component for the middle class never attains the maximum, so that class is never predicted. The fitted components are

$$\hat{Y}_k((x_1, x_2)) = (1 \;\; x_1 \;\; x_2)\,\hat\beta_k, \qquad k = 1, 2, 3$$

LDA can avoid this masking.

[Figure: 2D input space with three classes; projecting $\hat{Y}_1, \hat{Y}_2, \hat{Y}_3$ along the viewing direction illustrates masking]
Linear Discriminant Analysis
Posterior probability (application of Bayes' rule):

$$\Pr(G = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

$\pi_k$ is the prior probability for class k; $f_k(x)$ is the class-conditional density or likelihood:

$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1}(x - \mu_k)\right)$$

Essentially the minimum-error Bayes classifier:
• Assumes that the class-conditional densities are (multivariate) Gaussian
• Assumes equal covariance for every class
LDA…
$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log\frac{\pi_k}{\pi_l} + \log\frac{f_k(x)}{f_l(x)} = \log\frac{\pi_k}{\pi_l} + x^T\Sigma^{-1}(\mu_k - \mu_l) - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \frac{1}{2}\mu_l^T\Sigma^{-1}\mu_l = \delta_k(x) - \delta_l(x)$$

with the linear discriminant functions

$$\delta_k(x) = \log\pi_k + x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k$$

Classification rule: $\hat{G}(x) = \arg\max_k \Pr(G = k \mid X = x)$ is equivalent to $\hat{G}(x) = \arg\max_k \delta_k(x)$.

The good old Bayes classifier!
LDA…
When are we going to use the training data?

$(x_i, g_i),\; i = 1{:}N$: total N input-output pairs, $N_k$ pairs in class k, total number of classes K.

The training data are utilized to estimate:

Prior probabilities: $\hat\pi_k = N_k / N$

Means: $\hat\mu_k = \sum_{g_i = k} x_i / N_k$

Covariance matrix: $\hat\Sigma = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T \big/ (N - K)$
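The estimators on this slide can be coded directly; the two-class toy data below (Gaussian clouds around assumed means) are an illustration, not from the notes.

```python
import numpy as np

# Two well-separated Gaussian classes; means (0,0) and (2,2) are assumptions.
rng = np.random.default_rng(1)
X0 = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
X1 = rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))
X = np.vstack([X0, X1])
g = np.array([0] * 50 + [1] * 50)
K, N = 2, len(g)

pi_hat = np.array([(g == k).mean() for k in range(K)])         # N_k / N
mu_hat = np.array([X[g == k].mean(axis=0) for k in range(K)])  # class means

# Pooled covariance: within-class scatter summed over classes, divided by N-K
Sigma = sum((X[g == k] - mu_hat[k]).T @ (X[g == k] - mu_hat[k])
            for k in range(K)) / (N - K)
Sinv = np.linalg.inv(Sigma)

def delta(x, k):
    # delta_k(x) = log pi_k + x^T S^-1 mu_k - 0.5 mu_k^T S^-1 mu_k
    return np.log(pi_hat[k]) + x @ Sinv @ mu_hat[k] - 0.5 * mu_hat[k] @ Sinv @ mu_hat[k]

def predict(x):
    return int(np.argmax([delta(x, k) for k in range(K)]))

assert predict(np.array([0.0, 0.0])) == 0
assert predict(np.array([2.0, 2.0])) == 1
```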
LDA: Example
LDA was able to avoid masking here
Quadratic Discriminant Analysis
• Relaxes the equal-covariance assumption:
  – the class-conditional probability densities (still multivariate Gaussians) are allowed to have different covariance matrices
• The class decision boundaries are no longer linear but quadratic

$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log\frac{\pi_k}{\pi_l} + \log\frac{f_k(x)}{f_l(x)} = \delta_k(x) - \delta_l(x)$$

with the quadratic discriminant functions

$$\delta_k(x) = \log\pi_k - \frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)$$
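A sketch of these quadratic discriminants, with made-up toy data (one tight class, one spread-out class) and equal priors assumed:

```python
import numpy as np

# Illustrative data: class 0 is tight around (0,0), class 1 is diffuse at (3,0).
rng = np.random.default_rng(2)
Xa = rng.normal([0, 0], [0.3, 0.3], size=(80, 2))
Xb = rng.normal([3, 0], [1.5, 1.5], size=(80, 2))
classes = [(Xa, np.log(0.5)), (Xb, np.log(0.5))]   # equal priors assumed

params = []
for Xk, logpi in classes:
    mu = Xk.mean(axis=0)
    Sk = np.cov(Xk, rowvar=False)                  # class-specific covariance
    params.append((mu, np.linalg.inv(Sk), np.log(np.linalg.det(Sk)), logpi))

def delta(x, k):
    mu, Sinv, logdet, logpi = params[k]
    d = x - mu
    # log pi_k - 0.5 log|Sigma_k| - 0.5 (x-mu_k)^T Sigma_k^-1 (x-mu_k)
    return logpi - 0.5 * logdet - 0.5 * d @ Sinv @ d

x = np.array([0.1, -0.1])
pred = int(np.argmax([delta(x, k) for k in range(2)]))
assert pred == 0
```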
QDA and Masking
Better than linear regression in terms of handling masking.
Usually computationally more expensive than LDA.
Fisher’s Linear Discriminant [DHS]
From the training set we want to find a direction along which the separation between the class means is high and the overlap between the classes is small.
Fisher’s LD…
Projection of a vector x on a unit vector w: $w^T x$

[Figure: geometric interpretation of the projection $w^T x$ of x onto w]

From the training set we want to find a direction w along which the separation between the projections of the class means is high and the overlap between the projections of the classes is small.
Fisher’s LD…
Class means:

$$m_1 = \frac{1}{N_1}\sum_{x_i \in R_1} x_i, \qquad m_2 = \frac{1}{N_2}\sum_{x_i \in R_2} x_i$$

Projected class means:

$$\tilde{m}_1 = \frac{1}{N_1}\sum_{x_i \in R_1} w^T x_i = w^T m_1, \qquad \tilde{m}_2 = \frac{1}{N_2}\sum_{x_i \in R_2} w^T x_i = w^T m_2$$

Difference between projected class means:

$$\tilde{m}_2 - \tilde{m}_1 = w^T(m_2 - m_1)$$

Scatter of the projected data (this will indicate overlap between the classes):

$$\tilde{s}_1^2 = \sum_{y_i:\, x_i \in R_1} (y_i - \tilde{m}_1)^2 = \sum_{x_i \in R_1} (w^T x_i - w^T m_1)^2 = w^T S_1 w, \qquad S_1 = \sum_{x_i \in R_1} (x_i - m_1)(x_i - m_1)^T$$

$$\tilde{s}_2^2 = \sum_{y_i:\, x_i \in R_2} (y_i - \tilde{m}_2)^2 = \sum_{x_i \in R_2} (w^T x_i - w^T m_2)^2 = w^T S_2 w, \qquad S_2 = \sum_{x_i \in R_2} (x_i - m_2)(x_i - m_2)^T$$
Fisher’s LD…
Ratio of the difference of projected means over the total scatter (a Rayleigh quotient):

$$r(w) = \frac{(\tilde{m}_2 - \tilde{m}_1)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{w^T S_B w}{w^T S_W w}$$

where

$$S_W = S_1 + S_2, \qquad S_B = (m_2 - m_1)(m_2 - m_1)^T$$

We want to maximize r(w). The solution is

$$w = S_W^{-1}(m_2 - m_1)$$
Fisher’s LD: Classifier
So far so good. However, how do we get the classifier?

All we know at this point is that the direction $w = S_W^{-1}(m_2 - m_1)$ separates the projected data very well.

Since we know that the projected class means are well separated, we can choose the average of the two projected means as a threshold for classification.

Classification rule: x in R2 if y(x) > 0, else x in R1, where

$$y(x) = w^T x - \frac{1}{2}(\tilde{m}_1 + \tilde{m}_2) = (m_2 - m_1)^T S_W^{-1}\left(x - \frac{1}{2}(m_1 + m_2)\right)$$
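The whole procedure (Fisher direction, then midpoint threshold) can be sketched as follows; the toy class means and data are assumptions:

```python
import numpy as np

# Illustrative two-class data around assumed means (0,0) and (2,1).
rng = np.random.default_rng(3)
R1 = rng.normal([0, 0], 0.4, size=(60, 2))
R2 = rng.normal([2, 1], 0.4, size=(60, 2))

m1, m2 = R1.mean(axis=0), R2.mean(axis=0)
S1 = (R1 - m1).T @ (R1 - m1)      # within-class scatter, class 1
S2 = (R2 - m2).T @ (R2 - m2)      # within-class scatter, class 2
Sw = S1 + S2

w = np.linalg.solve(Sw, m2 - m1)  # Fisher direction w = S_W^{-1}(m2 - m1)

def y(x):
    # y(x) > 0 -> class R2, else R1 (midpoint of projected means as threshold)
    return w @ x - 0.5 * (w @ m1 + w @ m2)

assert y(np.array([0.0, 0.0])) < 0   # near the R1 mean
assert y(np.array([2.0, 1.0])) > 0   # near the R2 mean
```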
Fisher’s LD and LDA
They become the same when:
(1) Prior probabilities are the same
(2) Common covariance matrix for the class-conditional densities
(3) Both class-conditional densities are multivariate Gaussian
Ex. Show that Fisher’s LD classifier and LDA produce the same classification rule given the above assumptions.
Note: (1) Fisher’s LD does not assume Gaussian densities. (2) Fisher’s LD can be used for dimension reduction in a multiple-class scenario.
Logistic Regression
• The output of regression is the posterior probability, i.e., Pr(output | input)
• Always ensures that the sum of the output variables is 1 and each output is non-negative
• A linear classification method
• We need to know about two concepts to understand logistic regression:
  – Newton-Raphson method
  – Maximum likelihood estimation
Newton-Raphson Method
A technique for solving a non-linear equation f(x) = 0.

Taylor series:

$$f(x_{n+1}) \approx f(x_n) + (x_{n+1} - x_n)\,f'(x_n)$$

If $x_{n+1}$ is a root or very close to the root, then $f(x_{n+1}) \approx 0$.

After rearrangement:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

This is the rule for iteration; we need an initial guess $x_0$.
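A minimal sketch of the iteration, using f(x) = x² − 2 as an assumed example (root √2):

```python
# One-dimensional Newton-Raphson; the function and starting point are
# illustrative choices, not from the notes.
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)   # x_{n+1} = x_n - f(x_n) / f'(x_n)
        x -= step
        if abs(step) < tol:
            break
    return x

root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
assert abs(root - 2.0 ** 0.5) < 1e-10
```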
Newton-Raphson in Multi-dimensions
We want to solve the system of equations:

$$f_j(x_1, x_2, \ldots, x_N) = 0, \qquad j = 1, \ldots, N$$

Taylor series:

$$f_j(x + \Delta x) \approx f_j(x) + \sum_{k=1}^{N} \frac{\partial f_j}{\partial x_k}\,\Delta x_k, \qquad j = 1, \ldots, N$$

After some rearrangement, the rule for iteration (need an initial guess) is

$$\begin{bmatrix} x_1^{n+1} \\ \vdots \\ x_N^{n+1} \end{bmatrix} = \begin{bmatrix} x_1^{n} \\ \vdots \\ x_N^{n} \end{bmatrix} - \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \dfrac{\partial f_N}{\partial x_1} & \cdots & \dfrac{\partial f_N}{\partial x_N} \end{bmatrix}^{-1}_{x = x^n} \begin{bmatrix} f_1(x^n) \\ \vdots \\ f_N(x^n) \end{bmatrix}$$

The matrix of partial derivatives is the Jacobian matrix.
Newton-Raphson : Example
Solve a non-linear system in two unknowns, one equation involving cos and one involving sin:

$$f_1(x_1, x_2) = 0, \qquad f_2(x_1, x_2) = 0$$

Iteration rule (needs an initial guess):

$$\begin{bmatrix} x_1^{n+1} \\ x_2^{n+1} \end{bmatrix} = \begin{bmatrix} x_1^{n} \\ x_2^{n} \end{bmatrix} - \begin{bmatrix} \partial f_1/\partial x_1 & \partial f_1/\partial x_2 \\ \partial f_2/\partial x_1 & \partial f_2/\partial x_2 \end{bmatrix}^{-1}_{x = x^n} \begin{bmatrix} f_1(x_1^n, x_2^n) \\ f_2(x_1^n, x_2^n) \end{bmatrix}$$
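A sketch of the iteration with an assumed stand-in system (not the slide's): f₁ = x₁² + x₂² − 4, f₂ = x₁ − x₂, whose positive root is x₁ = x₂ = √2.

```python
import numpy as np

# Assumed illustrative system, chosen so the root is known in closed form.
def F(x):
    return np.array([x[0] ** 2 + x[1] ** 2 - 4.0,  # f1
                     x[0] - x[1]])                 # f2

def J(x):
    # Jacobian of F
    return np.array([[2 * x[0], 2 * x[1]],
                     [1.0, -1.0]])

x = np.array([1.0, 0.5])                  # initial guess
for _ in range(30):
    x = x - np.linalg.solve(J(x), F(x))   # x <- x - J^{-1} F(x)

assert np.allclose(x, [2 ** 0.5, 2 ** 0.5], atol=1e-8)
```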
Maximum Likelihood Parameter Estimation
Let’s start with an example. We want to find the unknown parameters, the mean and standard deviation of a Gaussian pdf, given N independent samples from it:

$$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Samples: $x_1, \ldots, x_N$

Form the likelihood function:

$$L(\mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$

Estimate the parameters that maximize the likelihood function:

$$(\hat\mu, \hat\sigma) = \arg\max_{\mu,\sigma} L(\mu, \sigma)$$

Let’s find out $(\hat\mu, \hat\sigma)$.
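Carrying out the maximization (a standard derivation, filling in the step the slide leaves as an exercise): take logs, differentiate, and set to zero.

```latex
\log L(\mu,\sigma) = -N\log\!\big(\sqrt{2\pi}\,\sigma\big)
  - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2
\\[4pt]
\frac{\partial \log L}{\partial \mu}
  = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i-\mu) = 0
  \;\Rightarrow\;
  \hat\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\\[4pt]
\frac{\partial \log L}{\partial \sigma}
  = -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{N}(x_i-\mu)^2 = 0
  \;\Rightarrow\;
  \hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\hat\mu)^2
```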
Logistic Regression Model
The method directly models the posterior probabilities as the output of regression:

$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}, \qquad k = 1, \ldots, K-1$$

$$\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}$$

x is a p-dimensional input vector; $\beta_k$ is a p-dimensional vector for each k; the total number of parameters is (K−1)(p+1).

Note that the class boundaries are linear. How can we show this linear nature? What is the discriminant function for every class in this model?
Logistic Regression Computation
Let’s fit the logistic regression model for K = 2, i.e., the number of classes is 2.

Training set: $(x_i, g_i),\; i = 1, \ldots, N$. The $x_i$ are (p+1)-dimensional input vectors with leading entry 1; $\beta$ is a (p+1)-dimensional vector; $y_i = 1$ if $g_i = 1$, $y_i = 0$ if $g_i = 2$.

Log-likelihood:

$$l(\beta) = \sum_{i=1}^{N} \log \Pr(y_i \mid X = x_i) = \sum_{i=1}^{N} \Big( y_i \log\Pr(y_i = 1 \mid x_i) + (1 - y_i)\log\Pr(y_i = 0 \mid x_i) \Big) = \sum_{i=1}^{N} \Big( y_i\,\beta^T x_i - \log\big(1 + \exp(\beta^T x_i)\big) \Big)$$

We want to maximize the log-likelihood in order to estimate $\beta$.
Newton-Raphson for LR
$$\frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^{N}\left(y_i - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}\right) x_i = 0$$

(p+1) non-linear equations to solve for (p+1) unknowns.

Solve by the Newton-Raphson method:

$$\beta^{new} = \beta^{old} - \left[\text{Jacobian}\!\left(\frac{\partial l(\beta)}{\partial \beta}\right)\right]^{-1} \frac{\partial l(\beta)}{\partial \beta}$$

where

$$\text{Jacobian}\!\left(\frac{\partial l(\beta)}{\partial \beta}\right) = -\sum_{i=1}^{N} x_i x_i^T\, \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \cdot \frac{1}{1 + \exp(\beta^T x_i)}$$
Newton-Raphson for LR…
In matrix notation:

$$\frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^{N}\left(y_i - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}\right) x_i = \mathbf{X}^T(\mathbf{y} - \mathbf{p}), \qquad \text{Jacobian}\!\left(\frac{\partial l(\beta)}{\partial \beta}\right) = -\mathbf{X}^T\mathbf{W}\mathbf{X}$$

So, the NR rule becomes:

$$\beta^{new} = \beta^{old} + (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y} - \mathbf{p})$$

where

$$\mathbf{X} = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \; \big(N \times (p+1)\big), \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \; (N \times 1), \qquad \mathbf{p} = \begin{bmatrix} \exp(\beta^T x_1)/(1 + \exp(\beta^T x_1)) \\ \exp(\beta^T x_2)/(1 + \exp(\beta^T x_2)) \\ \vdots \\ \exp(\beta^T x_N)/(1 + \exp(\beta^T x_N)) \end{bmatrix} \; (N \times 1)$$

W is an N-by-N diagonal matrix with i-th diagonal entry:

$$\frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}\left(1 - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}\right)$$
Newton-Raphson for LR…
• Newton-Raphson:

$$\beta^{new} = \beta^{old} + (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y} - \mathbf{p}) = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\big(\mathbf{X}\beta^{old} + \mathbf{W}^{-1}(\mathbf{y} - \mathbf{p})\big) = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z}$$

• Adjusted response:

$$\mathbf{z} = \mathbf{X}\beta^{old} + \mathbf{W}^{-1}(\mathbf{y} - \mathbf{p})$$

• Iteratively reweighted least squares (IRLS):

$$\beta^{new} = \arg\min_{\beta}\, (\mathbf{z} - \mathbf{X}\beta)^T\mathbf{W}(\mathbf{z} - \mathbf{X}\beta) = \arg\min_{\beta}\, (\mathbf{y} - \mathbf{p})^T\mathbf{W}^{-1}(\mathbf{y} - \mathbf{p})$$
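Following the update above, a minimal two-class IRLS fit might look like this; the toy data and the generating coefficients are assumptions for illustration:

```python
import numpy as np

# Simulated data from an assumed "true" logistic model.
rng = np.random.default_rng(4)
N = 200
x = rng.normal(size=(N, 1))
X = np.hstack([np.ones((N, 1)), x])       # leading entry 1
true_beta = np.array([-0.5, 2.0])         # assumed generating coefficients
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))       # fitted probabilities
    W = p * (1 - p)                       # diagonal of W
    z = X @ beta + (y - p) / W            # adjusted response
    # Weighted least squares step: beta = (X^T W X)^{-1} X^T W z
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
```

With 200 samples the maximum-likelihood estimate should land near the generating coefficients, though not exactly on them.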
Example: South African Heart Disease
Example: South African Heart Disease…
After data fitting in the logistic regression model:

$$\Pr(\text{MI} = \text{yes} \mid x) = \frac{\exp(\eta(x))}{1 + \exp(\eta(x))}$$

where

$$\eta(x) = -4.130 + 0.006\,x_{\text{sbp}} + 0.080\,x_{\text{tobacco}} + 0.185\,x_{\text{ldl}} + 0.939\,x_{\text{famhist}} - 0.035\,x_{\text{obesity}} + 0.001\,x_{\text{alcohol}} + 0.043\,x_{\text{age}}$$
|             | Coefficient | Std. Error | Z Score |
|-------------|-------------|------------|---------|
| (Intercept) | -4.130      | 0.964      | -4.285  |
| sbp         | 0.006       | 0.006      | 1.023   |
| tobacco     | 0.080       | 0.026      | 3.034   |
| ldl         | 0.185       | 0.057      | 3.219   |
| famhist     | 0.939       | 0.225      | 4.178   |
| obesity     | -0.035      | 0.029      | -1.187  |
| alcohol     | 0.001       | 0.004      | 0.136   |
| age         | 0.043       | 0.010      | 4.184   |
Example: South African Heart Disease…
After ignoring negligible coefficients:

$$\Pr(\text{MI} = \text{yes} \mid x) = \frac{\exp(-4.204 + 0.081\,x_{\text{tobacco}} + 0.168\,x_{\text{ldl}} + 0.924\,x_{\text{famhist}} + 0.044\,x_{\text{age}})}{1 + \exp(-4.204 + 0.081\,x_{\text{tobacco}} + 0.168\,x_{\text{ldl}} + 0.924\,x_{\text{famhist}} + 0.044\,x_{\text{age}})}$$

What happened to systolic blood pressure? Obesity?
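Plugging hypothetical predictor values into the reduced model (the subject's values below are made up; the coefficients are the slide's):

```python
import math

# Coefficients of the reduced model from the slide.
coef = {"intercept": -4.204, "tobacco": 0.081, "ldl": 0.168,
        "famhist": 0.924, "age": 0.044}

def prob_mi(tobacco, ldl, famhist, age):
    eta = (coef["intercept"] + coef["tobacco"] * tobacco +
           coef["ldl"] * ldl + coef["famhist"] * famhist +
           coef["age"] * age)
    return math.exp(eta) / (1 + math.exp(eta))   # Pr(MI = yes | x)

# Hypothetical subject: these input values are assumptions, not data.
p = prob_mi(tobacco=5.0, ldl=4.0, famhist=1, age=50.0)
assert 0.0 < p < 1.0
```

Note that with a positive famhist coefficient, switching famhist from 0 to 1 always raises the predicted risk, holding everything else fixed.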
Multi-Class Logistic Regression
NR update:

$$\tilde\beta^{new} = \tilde\beta^{old} + (\tilde{\mathbf{X}}^T\tilde{\mathbf{W}}\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T(\tilde{\mathbf{y}} - \tilde{\mathbf{p}})$$

where

$$\tilde\beta = \begin{bmatrix} \beta_{10} \\ \beta_1 \\ \vdots \\ \beta_{(K-1)0} \\ \beta_{K-1} \end{bmatrix} \;\; \big((K-1)(p+1) \times 1\big), \qquad \tilde{\mathbf{X}} = \begin{bmatrix} \mathbf{X} & & \\ & \ddots & \\ & & \mathbf{X} \end{bmatrix} \;\; \big(N(K-1) \times (K-1)(p+1)\big)$$

and $\mathbf{X}$ is the $N \times (p+1)$ matrix with rows $x_i^T$, as before.
Multi-Class LR…
$\tilde{\mathbf{y}}$ is an N(K−1)-dimensional vector:

$$\tilde{\mathbf{y}} = \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_{K-1} \end{bmatrix}, \qquad \mathbf{y}_k = \begin{bmatrix} \theta(g_1 - k) \\ \vdots \\ \theta(g_N - k) \end{bmatrix}, \quad 1 \le k \le K-1$$

$\tilde{\mathbf{p}}$ is an N(K−1)-dimensional vector:

$$\tilde{\mathbf{p}} = \begin{bmatrix} \mathbf{p}_1 \\ \mathbf{p}_2 \\ \vdots \\ \mathbf{p}_{K-1} \end{bmatrix}, \qquad \mathbf{p}_k = \begin{bmatrix} \exp(\beta_{k0} + \beta_k^T x_1)\big/\big(1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x_1)\big) \\ \vdots \\ \exp(\beta_{k0} + \beta_k^T x_N)\big/\big(1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x_N)\big) \end{bmatrix}, \quad 1 \le k \le K-1$$

$\theta(z)$ is an indicator function: $\theta(z) = 1$ if $z = 0$, and 0 otherwise.
MC-LR…
$$\tilde{\mathbf{W}} = \begin{bmatrix} \mathbf{W}_{11} & \mathbf{W}_{12} & \cdots & \mathbf{W}_{1(K-1)} \\ \mathbf{W}_{21} & \mathbf{W}_{22} & \cdots & \mathbf{W}_{2(K-1)} \\ \vdots & & & \vdots \\ \mathbf{W}_{(K-1)1} & \mathbf{W}_{(K-1)2} & \cdots & \mathbf{W}_{(K-1)(K-1)} \end{bmatrix} \;\; \big(N(K-1) \times N(K-1)\big)$$

where each $\mathbf{W}_{km}$, $1 \le k, m \le K-1$, is an $N \times N$ diagonal matrix:

• if $k = m$, the i-th diagonal entry is $p_k(x_i)\,\big(1 - p_k(x_i)\big)$;
• if $k \ne m$, the i-th diagonal entry is $-\,p_k(x_i)\,p_m(x_i)$,

with $p_k(x_i) = \dfrac{\exp(\beta_{k0} + \beta_k^T x_i)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x_i)}$.
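A sketch of these multi-class quantities with illustrative random parameters (K, N, and the β values are assumptions): compute $p_k(x_i)$ for all i and k, check that the K posteriors sum to one, and form the two kinds of $\mathbf{W}_{km}$ diagonal entries.

```python
import numpy as np

# Illustrative sizes and parameters, not from the notes.
K, p_dim, N = 3, 2, 4
rng = np.random.default_rng(5)
B = rng.normal(size=(K - 1, p_dim + 1))   # row k: (beta_k0, beta_k)
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p_dim))])

eta = X @ B.T                             # N x (K-1) linear scores
den = 1 + np.exp(eta).sum(axis=1, keepdims=True)
P = np.exp(eta) / den                     # P[i, k-1] = p_k(x_i)

# The K posteriors sum to one (class K gets probability 1/den).
assert np.allclose(P.sum(axis=1) + 1 / den.ravel(), 1.0)

# Diagonal entries of the blocks W_km for the i-th observation:
i = 0
W_kk = P[i, 0] * (1 - P[i, 0])            # k = m case
W_km = -P[i, 0] * P[i, 1]                 # k != m case
assert W_kk > 0 and W_km < 0
```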
LDA vs. Logistic Regression
• LDA (generative model)
  – Assumes Gaussian class-conditional densities and a common covariance
  – Model parameters are estimated by maximizing the full log-likelihood; parameters for each class are estimated independently of the other classes; Kp + p(p+1)/2 + (K−1) parameters
  – Makes use of marginal density information Pr(X)
  – Easier to train, low variance, more efficient if the model is correct
  – Higher asymptotic error, but converges faster
• Logistic Regression (discriminative model)
  – Assumes the class-conditional densities are members of the (same) exponential family distribution
  – Model parameters are estimated by maximizing the conditional log-likelihood, with simultaneous consideration of all other classes; (K−1)(p+1) parameters
  – Ignores marginal density information Pr(X)
  – Harder to train; robust to uncertainty about the data-generation process
  – Lower asymptotic error, but converges more slowly
Generative vs. Discriminative Learning
|                      | Generative                                              | Discriminative                                  |
|----------------------|---------------------------------------------------------|-------------------------------------------------|
| Example              | Linear Discriminant Analysis                            | Logistic Regression                             |
| Objective function   | Full log-likelihood: $\sum_i \log p(x_i, y_i)$          | Conditional log-likelihood: $\sum_i \log p(y_i \mid x_i)$ |
| Model assumptions    | Class densities $p(x \mid y = k)$, e.g. Gaussian in LDA | Discriminant functions $\delta_k(x)$            |
| Parameter estimation | “Easy”: one single sweep                                | “Hard”: iterative optimization                  |
| Advantages           | More efficient if model correct; borrows strength from p(x) | More flexible, robust because fewer assumptions |
| Disadvantages        | Bias if model is incorrect                              | May also be biased; ignores information in p(x) |