Linear Discriminant Analysis (LDA)
Yang Xiaozhou
March 18, 2020
Industrial Systems Engineering and Management, NUS
Table of contents
1. LDA
2. Reduced-rank LDA
3. Fisher’s LDA
4. Flexible Discriminant Analysis
LDA
LDA and its applications
LDA is used as a tool for classification.
• Bankruptcy prediction: Edward Altman’s 1968 model
• Face recognition: learnt features are called Fisher faces
• Biomedical studies: discriminate different stages of a disease
• and many more
It has produced strong results:
• Among the top 3 classifiers for 11 of the 22 datasets studied in the STATLOG project (Michie et al. 1994)
Classification by discriminant analysis
Consider a generic classification problem:
• K groups: G = 1, . . . , K, each with density f_k(x) on R^p.
• A discriminant rule divides the space into K disjoint regions R_1, . . . , R_K and allocates x to Π_j if x ∈ R_j.
• Maximum likelihood rule: allocate x to Π_j if j = arg max_i f_i(x).
• Bayesian rule with class priors π_1, . . . , π_K: allocate x to Π_j if j = arg max_i π_i f_i(x).
Gaussian as class density
If we assume the data come from a Gaussian distribution:
f_k(x) = (1 / ((2π)^{p/2} |Σ_k|^{1/2})) exp(−(1/2)(x − μ_k)^T Σ_k^{-1} (x − μ_k))
• Parameters are estimated from training data: π_k, μ_k, Σ_k.
• Looking at the log-likelihood:
  allocate x to Π_j if j = arg max_i δ_i(x).
  δ_i(x) = log f_i(x) + log π_i is called the discriminant function.
Assume equal covariance among the K classes: LDA
δ_k(x) = x^T Σ^{-1} μ_k − (1/2) μ_k^T Σ^{-1} μ_k + log π_k
Without that assumption on class covariance: QDA.
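A minimal NumPy sketch of these LDA discriminant functions (function and variable names are illustrative, not from the slides):

    import numpy as np

    def fit_lda(X, y):
        # Estimate priors, class means, and the pooled covariance from training data.
        classes = np.unique(y)
        priors = np.array([np.mean(y == k) for k in classes])
        means = np.array([X[y == k].mean(axis=0) for k in classes])
        # Pooled within-class covariance: the equal-covariance assumption of LDA.
        Sigma = sum((np.sum(y == k) - 1) * np.cov(X[y == k], rowvar=False)
                    for k in classes) / (len(y) - len(classes))
        return classes, priors, means, Sigma

    def lda_predict(X, classes, priors, means, Sigma):
        # delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log pi_k
        Sinv = np.linalg.inv(Sigma)
        scores = X @ Sinv @ means.T - 0.5 * np.sum((means @ Sinv) * means, axis=1) + np.log(priors)
        return classes[np.argmax(scores, axis=1)]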
Decision boundary: LDA vs QDA
• Between any pair of classes k and ℓ, the decision boundary is: {x : δ_k(x) = δ_ℓ(x)}
• LDA: linear boundary; QDA: quadratic boundary.
• The number of parameters to estimate rises quickly in QDA:
  • LDA: (K − 1)(p + 1)
  • QDA: (K − 1){p(p + 3)/2 + 1}
  For example, with K = 10 and p = 64, LDA needs 585 parameters while QDA needs 19,305.
Reduced-rank LDA
Reduced-rank LDA
Computation for LDA:
• Sphere the data:
  x* ← D^{-1/2} U^T x, where Σ = U D U^T.
• Classify x to the closest centroid in the transformed space:
  δ_k(x*) = x*^T μ_k − (1/2) μ_k^T μ_k + log π_k.
Inherent dimension reduction in LDA:
• The K centroids lie in a subspace of dimension at most (K − 1):
  H_{K−1} = μ_1 ⊕ span{μ_i − μ_1, 2 ≤ i ≤ K}
• Classification is done by distance comparison in H_{K−1}.
• p → K − 1 dimension reduction, assuming p > K.
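A short NumPy sketch of the sphering step and the nearest-centroid rule, continuing the illustrative names (classes, priors, means, Sigma) estimated above:

    import numpy as np

    # Sphere the data using the pooled covariance Sigma = U D U^T.
    D, U = np.linalg.eigh(Sigma)
    X_star = X @ U @ np.diag(D ** -0.5)        # x* = D^{-1/2} U^T x, one row per sample
    mu_star = means @ U @ np.diag(D ** -0.5)   # class centroids in the sphered space

    # delta_k(x*) = x*^T mu*_k - (1/2) mu*_k^T mu*_k + log pi_k
    scores = X_star @ mu_star.T - 0.5 * np.sum(mu_star ** 2, axis=1) + np.log(priors)
    pred = classes[np.argmax(scores, axis=1)]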
Reduced-rank LDA
We can look for an even smaller subspace H_L ⊆ H_{K−1}:
• Rule: class centroids of the sphered data have maximum separation in this subspace in terms of variance.
PCA on the class centroids gives the coordinates of H_L:
1. Find the class means and the pooled within-class covariance: M, W.
2. Sphere the centroids: M* = M W^{-1/2}.
3. Obtain the eigenvectors v*_ℓ (columns of V*) of cov(M*) = V* D_B V*^T.
4. Obtain the new (discriminant) variables Z_ℓ = (W^{-1/2} v*_ℓ)^T X, ℓ = 1, . . . , L.
Dimension reduction: X_{N×p} → Z_{N×L}.
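A sketch of these four steps in NumPy, assuming M (K × p matrix of class means), W (pooled within-class covariance), X (N × p data), and K classes; names are illustrative:

    import numpy as np

    # 1-2. Symmetric W^{-1/2} and sphered centroids.
    Dw, Uw = np.linalg.eigh(W)
    W_inv_half = Uw @ np.diag(Dw ** -0.5) @ Uw.T
    M_star = M @ W_inv_half

    # 3. Eigenvectors of the covariance of the sphered centroids.
    eigval, V_star = np.linalg.eigh(np.cov(M_star, rowvar=False))
    V_star = V_star[:, np.argsort(eigval)[::-1]]   # leading directions first

    # 4. Discriminant variables Z_l = (W^{-1/2} v*_l)^T x for each sample.
    L = K - 1                                      # at most K - 1 informative directions
    Z = X @ W_inv_half @ V_star[:, :L]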
Reduced-rank LDA vs PCA
Wine dataset: 13 variables to distinguish three types of wines.
[Figure: two scatter plots of the Wine data — "LDA for Wine dataset" (Discriminant Coordinates 1 and 2) and "PCA for Wine dataset" (PC 1 and PC 2).]
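The two panels can be reproduced with scikit-learn; a minimal sketch (plotting code omitted):

    from sklearn.datasets import load_wine
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_wine(return_X_y=True)
    Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # uses the labels
    Z_pca = PCA(n_components=2).fit_transform(X)                            # ignores the labels
    # Scatter-plot Z_lda and Z_pca coloured by y to obtain the two panels.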
Fisher’s LDA
Fisher’s LDA
The previous rule was proposed by Fisher:
• Find a linear combination Z = a^T X that has maximum between-class variance relative to its within-class variance:
  a = arg max_a (a^T B a) / (a^T W a).
• The optimization is solved by a generalized eigenvalue problem:
  W^{-1} B a = λ a.
• The eigenvectors a_ℓ of W^{-1} B are the same as the W^{-1/2} v*_ℓ above. Fisher arrives at this without the Gaussian assumption.
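The generalized eigenvalue problem can be solved directly with SciPy; a small sketch assuming B and W are the between- and within-class covariance matrices computed from the data:

    import numpy as np
    from scipy.linalg import eigh

    # eigh(B, W) solves B a = lambda W a, i.e. the eigenproblem of W^{-1} B.
    eigvals, A = eigh(B, W)
    A = A[:, np.argsort(eigvals)[::-1]]   # leading discriminant directions first
    a1 = A[:, 0]                          # maximizes a^T B a / a^T W a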
Fisher’s LDA
Digit dataset: 64 variables to distinguish 10 handwritten digits.
• The top 4 of Fisher's discriminant variables are shown.
• For example, coordinate 1 contrasts 4s with 2s and 3s.
[Figure: two scatter plots of the digit data projected onto Discriminant Coordinates 1–2 and 3–4.]
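A sketch of how these coordinates can be obtained with scikit-learn (illustrative; plotting omitted):

    from sklearn.datasets import load_digits
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_digits(return_X_y=True)                  # 64 features, 10 digit classes
    Z = LinearDiscriminantAnalysis(n_components=4).fit_transform(X, y)
    # Columns of Z are the first four discriminant coordinates; plot Z[:, 0] vs Z[:, 1]
    # and Z[:, 2] vs Z[:, 3], coloured by y, to reproduce the two panels.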
Summary of LDA
Virtues of LDA:
1. Simple prototype classifier: simple to interpret.
2. Decision boundary is linear: simple to describe and implement.
3. Dimension reduction: provides an informative low-dimensional view of the data.
Shortcomings of LDA:
1. Linear decision boundaries may not adequately separate the classes.
Support for more general boundaries is desired.
2. In high-dimensional settings, LDA uses too many parameters. A regularized version of LDA is desired.
Flexible Discriminant Analysis
Beyond linear boundaries: FDA
Flexible discriminant analysis (FDA) can tackle the first shortcoming.
[Figure: LDA and QDA decision boundaries for a three-class example on features X1, X2.]
Idea: recast LDA as a regression problem, then apply the same techniques used to generalize linear regression.
LDA as a regression problem
We can recast LDA as a regression problem via optimal scoring.
Set up:
• Response G falls into one of K classes, G = {1, . . . , K}.
• X is the p-dimensional feature vector.
Suppose a scoring function
θ : G ↦ R^1
such that the scores are optimally predicted by regressing on X, e.g. a linear map η(X) = X^T β.
In general, select L ≤ K − 1 such scoring functions and find the optimal {score, linear map} pairs that minimize:
ASR = (1/N) Σ_{ℓ=1}^{L} [ Σ_{i=1}^{N} (θ_ℓ(g_i) − x_i^T β_ℓ)^2 ]
LDA via optimal scoring
Procedure for LDA via optimal scoring:
1. Initialize. Build the response indicator matrix Y_{N×K}, where Y_{ij} = 1 if the ith sample comes from the jth class, and 0 otherwise.
2. Multivariate regression. Regress Y on X under ASR to get the fit Ŷ = P_X Y and the regression coefficients B.
3. Optimal scores. Obtain the eigenvectors Θ associated with the L largest eigenvalues of Y^T P_X Y.
4. Update. Update the coefficients: B ← BΘ.
• The optimal linear map is η(X) = B^T X.
• The columns of B, β_1, . . . , β_L, are the same as the a_ℓ's in LDA up to a constant.
This equivalence with a regression problem provides a starting point for generalizing LDA to a more flexible, nonparametric version.
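A compact NumPy sketch of these four steps (names illustrative; an intercept is omitted by assuming centred X, and the scores are left unnormalized, so the result matches the LDA coordinates only up to constants):

    import numpy as np

    # X: N x p feature matrix (assumed centred), y: integer labels in {0, ..., K-1}.
    N, p = X.shape
    K = int(y.max()) + 1

    # 1. Indicator response matrix Y.
    Y = np.zeros((N, K))
    Y[np.arange(N), y] = 1.0

    # 2. Multivariate linear regression of Y on X: fitted values P_X Y.
    Bhat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    Yhat = X @ Bhat

    # 3. Optimal scores: leading eigenvectors of Y^T P_X Y.
    eigval, Theta = np.linalg.eigh(Y.T @ Yhat)
    Theta = Theta[:, np.argsort(eigval)[::-1][:K - 1]]

    # 4. Update the coefficients; the optimal map is eta(x) = B_os^T x.
    B_os = Bhat @ Theta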
From LDA to FDA
Extend LDA by generalizing the linear map
η(X) = B^T X
to
η(X) = B^T h(X), where h(X) could come from:
• Generalized additive fits
• Spline functions
• MARS models
• Projection pursuit
• Neural networks
The idea behind FDA: LDA in an enlarged space.
FDA via optimal scoring
The procedure for FDA is the same as LDA via optimal scoring, with one change:
• Replace P_X with S_{h(X)}, the nonparametric regression operator.
Initialize → Multivariate regression → Optimal scores → Update.
• Optimal fit: η(X).
• Fitted class centroids: η̄_k = Σ_{g_i = k} η(x_i) / N_k.
A new observation x is classified to the class k that minimizes:
δ(x, k) = ||D(η(x) − η̄_k)||^2
where D is the constant factor linking the optimal fits and the LDA coordinates.
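Continuing the optimal-scoring sketch above (reusing its X, y, Y, K), a minimal FDA variant replaces the linear regression on X with a regression on a basis expansion h(X); a spline basis from scikit-learn is one possible, purely illustrative, choice:

    import numpy as np
    from sklearn.preprocessing import SplineTransformer

    # h(X): spline basis expansion of each feature (one illustrative choice of h).
    H = SplineTransformer(degree=3, n_knots=6).fit_transform(X)

    # Same optimal-scoring steps, with S_{h(X)} in place of P_X.
    Bh, *_ = np.linalg.lstsq(H, Y, rcond=None)
    Yhat = H @ Bh                                   # S_{h(X)} Y
    eigval, Theta = np.linalg.eigh(Y.T @ Yhat)
    Theta = Theta[:, np.argsort(eigval)[::-1][:K - 1]]
    eta = H @ (Bh @ Theta)                          # fitted discriminant variables eta(x_i)

    # Fitted class centroids; a new point is assigned to the nearest centroid
    # (up to the constant factor D linking these fits to the LDA coordinates).
    centroids = np.array([eta[y == k].mean(axis=0) for k in range(K)])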
LDA vs FDA
Data: three classes with Gaussian mixture densities.
FDA fits an additive model using smoothing splines of the form:
α + Σ_{j=1}^{p} f_j(X_j)
[Figure: LDA and FDA decision boundaries for the three-class mixture data on features X1, X2.]
References
1. Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1, No. 10). New York: Springer Series in Statistics.
2. Hastie, T., Tibshirani, R., & Buja, A. (1994). Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428), 1255-1270.
3. Hastie, T., Tibshirani, R., & Buja, A. (1995). Flexible discriminant and mixture models.
4. Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate Analysis. Probability and Mathematical Statistics. Academic Press.