Kernelized Discriminant Analysis and Adaptive Methods for Discriminant Analysis
Haesun Park
Georgia Institute of Technology,
Atlanta, GA, USA
(joint work with C. Park)
KAIST, Korea, June 2007
Clustering
Clustering: grouping of data based on similarity measures
Classification: assigning a class label to new, unseen data
Classification
Data Mining
• Mining or discovery of new information (patterns or rules) from large databases
Data Preparation
• Preprocessing
• Data Reduction: Dimension reduction (Feature Selection, Feature Extraction)
• Classification • Clustering • Association Analysis • Regression • Probabilistic modeling …
Feature Extraction
• Optimal feature extraction
- Reduce the dimensionality of data space
- Minimize effects of redundant features and noise
Apply a classifier to predict a class label of new data
[Figure: feature extraction; number of features; new data]
Curse of dimensionality
Linear dimension reduction
Maximize class separability
in the reduced dimensional space
What if the data is not linearly separable?
Nonlinear Dimension Reduction
Contents
• Linear Discriminant Analysis
• Nonlinear Dimension Reduction based on Kernel Methods
- Nonlinear Discriminant Analysis
• Application to Fingerprint Classification
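Linear Discriminant Analysis (LDA)
For a given data set $\{a_1, \ldots, a_n\}$ partitioned into $r$ classes, with $n_i$ items in class $i$:
• Centroids: $c_i = \frac{1}{n_i} \sum_{a \in \text{class } i} a$, $\quad c = \frac{1}{n} \sum_{i=1}^{n} a_i$
• Within-class scatter matrix: $S_w = \sum_{i=1}^{r} \sum_{a \in \text{class } i} (a - c_i)(a - c_i)^T$
• $\mathrm{trace}(S_w) = \sum_{i=1}^{r} \sum_{a \in \text{class } i} \|a - c_i\|^2$
• Between-class scatter matrix: $S_b = \sum_{i=1}^{r} n_i (c_i - c)(c_i - c)^T$
• $\mathrm{trace}(S_b) = \sum_{i=1}^{r} n_i \|c_i - c\|^2$
Linear transformation $G^T$: $[a_1 \cdots a_n] \rightarrow [G^T a_1 \cdots G^T a_n]$
• maximize $\mathrm{trace}(G^T S_b G)$, minimize $\mathrm{trace}(G^T S_w G)$
• $J(G) = \max_G \mathrm{trace}\big((G^T S_w G)^{-1} (G^T S_b G)\big)$
• Eigenvalue problem: $S_w^{-1} S_b x = \lambda x$, i.e. $S_w^{-1} S_b X = X \Lambda$; the columns of $G$ are the leading eigenvectors of $S_w^{-1} S_b$
• $\mathrm{rank}(S_b) \le$ number of classes $- 1$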
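The scatter matrices and the eigenvalue problem above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names `scatter_matrices` and `lda` and the synthetic data are assumptions, data items are stored as columns, and $S_w$ is assumed nonsingular.

```python
import numpy as np

def scatter_matrices(A, labels):
    """Return (S_w, S_b) for data A with one item per column."""
    m = A.shape[0]
    c = A.mean(axis=1, keepdims=True)              # global centroid
    S_w, S_b = np.zeros((m, m)), np.zeros((m, m))
    for k in np.unique(labels):
        A_k = A[:, labels == k]                    # items in class k
        c_k = A_k.mean(axis=1, keepdims=True)      # class centroid
        S_w += (A_k - c_k) @ (A_k - c_k).T
        S_b += A_k.shape[1] * (c_k - c) @ (c_k - c).T
    return S_w, S_b

def lda(A, labels):
    """Return G whose columns are the leading eigenvectors of S_w^{-1} S_b."""
    S_w, S_b = scatter_matrices(A, labels)         # S_w assumed nonsingular
    q = len(np.unique(labels)) - 1                 # rank(S_b) <= classes - 1
    evals, evecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(-evals.real)[:q]            # largest eigenvalues first
    return evecs[:, order].real

# Synthetic data (an assumption): 3 classes, 4 features, 30 items per class.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 30)
A = rng.normal(size=(4, 90)) + 3.0 * np.eye(4)[:, labels]
G = lda(A, labels)
Y = G.T @ A                                        # reduced data, 2 x 90
```

With three classes the reduced dimension is at most two, which is why `lda` keeps only `classes - 1` eigenvectors.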
Face Recognition
• Each 92 x 112 face image is a vector in a 10304-dim. space
• $G^T$: dimension reduction to maximize the distances among classes
Text Classification
• A bag of words: each document is represented by the frequencies of the words it contains
• Education: Faculty, Student, Syllabus, Grade, Tuition, …
• Recreation: Movie, Music, Sport, Hollywood, Theater, …
Generalized LDA Algorithms
$S_w^{-1} S_b x = \lambda x \;\rightarrow\; S_b x = \lambda S_w x$
• Undersampled problems: high dimensionality and a small number of data items
- $S_w$ is singular, so $S_w^{-1} S_b$ cannot be computed
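A small numerical check of the undersampled case, as a sketch on synthetic data (the sizes 50, 12 and 3 are arbitrary assumptions): with fewer data items than features, the within-class scatter matrix cannot have full rank.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 50, 12, 3                       # 50 features, 12 items, 3 classes
labels = np.repeat(np.arange(r), n // r)
A = rng.normal(size=(m, n))               # one data item per column

S_w = np.zeros((m, m))
for k in range(r):
    A_k = A[:, labels == k]
    D = A_k - A_k.mean(axis=1, keepdims=True)   # center each class
    S_w += D @ D.T

# Each centered class contributes rank at most n_i - 1, so rank(S_w) <= n - r.
rank_Sw = np.linalg.matrix_rank(S_w, tol=1e-8)
```

Here `rank_Sw` is at most `n - r`, far below `m`, so `np.linalg.inv(S_w)` would not give a meaningful result.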
Nonlinear Dimension Reduction based on Kernel Methods
Nonlinear Dimension Reduction
nonlinear mapping $\Phi$, followed by linear dimension reduction $G^T$
Example: $\Phi(x_1, x_2) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$
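For the quadratic mapping on this slide, the inner product in the mapped space equals the squared inner product in the original space. A minimal check, with `phi` and `k` as illustrative names:

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map from R^2 to R^3."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def k(x, y):
    """Kernel that computes the same inner product without forming phi."""
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
lhs = float(np.dot(phi(x), phi(y)))   # inner product in the mapped space
rhs = k(x, y)                         # (x . y)^2 in the original space
```

Both `lhs` and `rhs` evaluate to 1 for these two vectors, up to round-off.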
Kernel Method
• If a kernel function $k(x,y)$ satisfies Mercer's condition, then there exists a mapping $\Phi$ for which $\langle \Phi(x), \Phi(y) \rangle = k(x,y)$ holds
$A \rightarrow \Phi(A)$, $\quad \langle x, y \rangle \rightarrow \langle \Phi(x), \Phi(y) \rangle = k(x,y)$
• For a finite data set $A = [a_1, \ldots, a_n]$, Mercer's condition can be rephrased as: the kernel matrix $K = [k(a_i, a_j)]_{1 \le i,j \le n}$ is positive semi-definite.
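The finite-data form of Mercer's condition can be checked numerically. The sketch below builds the kernel matrix for a few synthetic items and tests its eigenvalues; the helper names and the tolerance are assumptions for illustration.

```python
import numpy as np

def kernel_matrix(A, k):
    """K = [k(a_i, a_j)] for the columns a_1, ..., a_n of A."""
    n = A.shape[1]
    return np.array([[k(A[:, i], A[:, j]) for j in range(n)] for i in range(n)])

def is_positive_semidefinite(K, tol=1e-10):
    # Symmetric with no eigenvalue below -tol (the tolerance absorbs round-off).
    return bool(np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol)

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 8))               # 8 synthetic items with 3 features
K_good = kernel_matrix(A, lambda x, y: np.dot(x, y) ** 2)      # valid kernel
K_bad = kernel_matrix(A, lambda x, y: -np.sum((x - y) ** 2))   # not a kernel
```

`K_bad` has a zero diagonal and nonzero off-diagonal entries, so its eigenvalues sum to zero and some must be negative; it fails the test.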
Nonlinear Dimension Reduction by Kernel Methods
Given a kernel function $k(x,y)$ with $\langle \Phi(x), \Phi(y) \rangle = k(x,y)$:
nonlinear mapping $\Phi$, then linear dimension reduction $G^T$
Positive Definite Kernel Functions
• Gaussian kernel: $k(x,y) = \exp(-\|x - y\|^2 / \sigma)$
• Polynomial kernel: $k(x,y) = (\beta_1 \langle x, y \rangle + \beta_2)^d$, $\quad (d > 0,\ \beta_1, \beta_2 \in R)$
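Both kernels can be evaluated for all pairs of data items at once. A vectorized sketch, assuming items are stored as columns; the parameter names `sigma`, `beta1`, `beta2` and `d` and their default values are illustrative choices:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """exp(-||x - y||^2 / sigma) for every column pair of X and Y."""
    sq = (X**2).sum(axis=0)[:, None] + (Y**2).sum(axis=0)[None, :] - 2.0 * X.T @ Y
    return np.exp(-np.maximum(sq, 0.0) / sigma)   # clip tiny negative round-off

def polynomial_kernel(X, Y, beta1=1.0, beta2=1.0, d=2):
    """(beta1 <x, y> + beta2)^d for every column pair of X and Y."""
    return (beta1 * X.T @ Y + beta2) ** d

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))                       # 5 synthetic items
K_gauss = gaussian_kernel(A, A, sigma=2.0)        # 5 x 5 kernel matrix
K_poly = polynomial_kernel(A, A, beta1=1.0, beta2=0.0, d=2)
```

Passing training data as `X` and new data as `Y` gives the kernel values needed to map unseen items.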
Nonlinear Discriminant Analysis using Kernel Methods
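$\{a_1, a_2, \ldots, a_n\} \rightarrow \{\Phi(a_1), \ldots, \Phi(a_n)\}$, $\quad \langle \Phi(x), \Phi(y) \rangle = k(x,y)$
Want to apply LDA in the mapped space: $S_b x = \lambda S_w x$
Kernel matrix: $K = \begin{bmatrix} k(a_1,a_1) & \cdots & k(a_1,a_n) \\ \vdots & & \vdots \\ k(a_n,a_1) & \cdots & k(a_n,a_n) \end{bmatrix}$
$S_b u = \lambda S_w u$
Apply Generalized LDA Algorithms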
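One way to read this slide is that each column of the kernel matrix stands in for a mapped data item, and the scatter matrices are formed from those columns. The sketch below follows that reading on two concentric rings, which are not linearly separable. It is an illustration only: the slides use generalized LDA algorithms such as LDA/GSVD, whereas this sketch swaps in a small regularization term `alpha`, and the Gaussian kernel, `sigma` and the synthetic data are assumptions.

```python
import numpy as np

def gaussian_gram(X, Y, sigma):
    sq = (X**2).sum(axis=0)[:, None] + (Y**2).sum(axis=0)[None, :] - 2.0 * X.T @ Y
    return np.exp(-np.maximum(sq, 0.0) / sigma)

def kernel_da(A, labels, sigma=1.0, alpha=1e-3):
    """Return U (n x (classes - 1)) solving S_b u = lambda (S_w + alpha I) u."""
    K = gaussian_gram(A, A, sigma)                 # column j represents Phi(a_j)
    n = K.shape[0]
    c = K.mean(axis=1, keepdims=True)
    S_w, S_b = np.zeros((n, n)), np.zeros((n, n))
    classes = np.unique(labels)
    for k in classes:
        K_k = K[:, labels == k]
        c_k = K_k.mean(axis=1, keepdims=True)
        S_w += (K_k - c_k) @ (K_k - c_k).T
        S_b += K_k.shape[1] * (c_k - c) @ (c_k - c).T
    # Symmetric reduction: with S_w + alpha I = L L^T, solve L^{-1} S_b L^{-T}.
    L = np.linalg.cholesky(S_w + alpha * np.eye(n))
    M = np.linalg.solve(L, np.linalg.solve(L, S_b).T)
    _, V = np.linalg.eigh((M + M.T) / 2.0)         # eigenvalues ascending
    V = V[:, ::-1][:, : len(classes) - 1]          # keep the largest ones
    return np.linalg.solve(L.T, V)

# Synthetic data (an assumption): two concentric rings of 40 points each.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=80)
labels = np.repeat([0, 1], 40)
radii = np.where(labels == 0, 1.0, 3.0)
A = radii * np.vstack([np.cos(angles), np.sin(angles)])
U = kernel_da(A, labels)
Y = U.T @ gaussian_gram(A, A, 1.0)                 # reduced training data, 1 x 80
```

A new item `z` (as a column) would be reduced with `U.T @ gaussian_gram(A, z, 1.0)`.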
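Generalized LDA Algorithms
$S_b x = \lambda S_w x$
• Minimize $\mathrm{trace}(x^T S_w x)$: $\quad x^T S_w x = 0$, $\quad x \in \mathrm{null}(S_w)$
• Maximize $\mathrm{trace}(x^T S_b x)$: $\quad x^T S_b x \ne 0$, $\quad x \in \mathrm{range}(S_b)$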
Generalized LDA algorithms
• RLDA: Add a positive diagonal matrix $\alpha I$ to $S_w$ so that $S_w + \alpha I$ is nonsingular
• LDA/GSVD: Apply the generalized singular value decomposition (GSVD) to $\{H_w, H_b\}$, where $S_b = H_b H_b^T$ and $S_w = H_w H_w^T$
• To-N(Sw): Projection to the null space of $S_w$; maximize between-class scatter in the projected space
• To-R(Sb): Transformation to the range space of $S_b$; diagonalize the within-class scatter matrix in the transformed space
• To-NR(Sw): Reduce data dimension by PCA; maximize between-class scatter in $\mathrm{range}(S_w)$ and $\mathrm{null}(S_w)$
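As an illustration of the null-space idea (To-N(Sw)), the sketch below projects onto the null space of $S_w$ and then maximizes between-class scatter there. It is a simplified reading of the bullet, not the published algorithm: the name `to_null_sw`, the rank threshold and the synthetic undersampled data are assumptions.

```python
import numpy as np

def to_null_sw(A, labels):
    """Return G with G^T S_w G = 0 that maximizes trace(G^T S_b G)."""
    classes = np.unique(labels)
    c = A.mean(axis=1, keepdims=True)
    H_w_parts, H_b_parts = [], []
    for k in classes:
        A_k = A[:, labels == k]
        c_k = A_k.mean(axis=1, keepdims=True)
        H_w_parts.append(A_k - c_k)                          # S_w = H_w H_w^T
        H_b_parts.append(np.sqrt(A_k.shape[1]) * (c_k - c))  # S_b = H_b H_b^T
    H_w, H_b = np.hstack(H_w_parts), np.hstack(H_b_parts)
    U, s, _ = np.linalg.svd(H_w, full_matrices=True)
    rank = int((s > 1e-10 * s[0]).sum())
    N = U[:, rank:]                                # orthonormal basis of null(S_w)
    _, V = np.linalg.eigh(N.T @ H_b @ H_b.T @ N)   # eigenvalues ascending
    return N @ V[:, ::-1][:, : len(classes) - 1]

# Synthetic undersampled data (an assumption): 50 features, 12 items, 3 classes.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 4)
A = rng.normal(size=(50, 12))
G = to_null_sw(A, labels)
Y = G.T @ A                                        # 2 x 12 reduced data
```

Because the within-class scatter vanishes in the projected space, all training items of one class land on the same point in `Y`.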
Data sets
Data        dim    no. of data    no. of classes
Musk        166    6599           2
Isolet      617    7797           26
Car         6      1728           4
Mfeature    649    2000           10
Bcancer     9      699            2
Bscale      4      625            3
From the Machine Learning Repository database
Experimental Settings
• Split the original data into training data and test data
• Dimension reduction: a kernel function $k$ and a linear transformation $G^T$
• Predict class labels of test data using training data
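The protocol above can be sketched end to end. Several choices here are assumptions, since the slides do not specify them: a 70/30 random split, a regularized linear LDA as the dimension reducer (no kernel, for brevity), a nearest-centroid classifier, and synthetic data.

```python
import numpy as np

def regularized_lda(A, labels, alpha=1e-3):
    """Leading eigenvectors of (S_w + alpha I)^{-1} S_b, items as columns."""
    m = A.shape[0]
    c = A.mean(axis=1, keepdims=True)
    S_w, S_b = alpha * np.eye(m), np.zeros((m, m))
    classes = np.unique(labels)
    for k in classes:
        A_k = A[:, labels == k]
        c_k = A_k.mean(axis=1, keepdims=True)
        S_w += (A_k - c_k) @ (A_k - c_k).T
        S_b += A_k.shape[1] * (c_k - c) @ (c_k - c).T
    evals, evecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    return evecs[:, np.argsort(-evals.real)[: len(classes) - 1]].real

def nearest_centroid(Y_train, labels_train, Y_test):
    """Assign each test column to the class with the closest training centroid."""
    classes = np.unique(labels_train)
    C = np.stack([Y_train[:, labels_train == k].mean(axis=1) for k in classes], axis=1)
    d = ((Y_test[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)  # classes x tests
    return classes[np.argmin(d, axis=0)]

def run_experiment(A, labels, test_fraction=0.3, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(A.shape[1])
    n_test = int(round(test_fraction * A.shape[1]))
    test, train = perm[:n_test], perm[n_test:]
    G = regularized_lda(A[:, train], labels[train])   # fit on training data only
    pred = nearest_centroid(G.T @ A[:, train], labels[train], G.T @ A[:, test])
    return float((pred == labels[test]).mean())       # prediction accuracy

rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2], 40)
A = rng.normal(size=(4, 120)) + 3.0 * np.eye(4)[:, labels]
accuracy = run_experiment(A, labels)
```

The transformation is fitted on the training columns only, so the test data plays no part in choosing `G`.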
[Figure: prediction accuracies for each method; each color represents a different data set]
[Figure: Linear and Nonlinear Discriminant Analysis, prediction accuracies across data sets]
Face Recognition
Application of Nonlinear Discriminant Analysis to Fingerprint Classification
Fingerprint Classification
• Classes: Left Loop, Right Loop, Whorl, Arch, Tented Arch
• From NIST Fingerprint Database 4
Previous Works in Fingerprint Classification
• Feature representation: minutiae, Gabor filtering, directional partitioning
• Apply classifiers: neural networks, support vector machines, probabilistic NN
Our Approach
• Construct core directional images by DFT
• Dimension reduction by Nonlinear Discriminant Analysis
Construction of Core Directional Images
[Figure: examples for Left Loop, Right Loop, and Whorl; core point; discrete Fourier transform (DFT)]
Construction of Directional Images
• Computation of local dominant directions by DFT and directional filtering
- Fast computation of DFT by FFT
- Reliable for low quality images
• Core point detection
• Reconstruction of core directional images
[Figure: 512 x 512 image -> 105 x 105 core directional image]
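The slides do not spell out how the DFT yields a local dominant direction. One generic way, shown here purely as an assumption and not as the authors' procedure, is to take the 2-D FFT of a small image block and read the orientation from the location of the strongest frequency component; ridges run perpendicular to that frequency vector. The block size and the synthetic grating are illustrative.

```python
import numpy as np

def dominant_direction(block):
    """Estimate ridge orientation in [0, pi) from the peak of the 2-D DFT."""
    F = np.fft.fftshift(np.fft.fft2(block - block.mean()))   # drop the DC term
    mag = np.abs(F)
    fy = np.fft.fftshift(np.fft.fftfreq(block.shape[0]))     # row frequencies
    fx = np.fft.fftshift(np.fft.fftfreq(block.shape[1]))     # column frequencies
    iy, ix = np.unravel_index(np.argmax(mag), mag.shape)
    normal = np.arctan2(fy[iy], fx[ix])    # direction of fastest intensity change
    return (normal + np.pi / 2.0) % np.pi  # ridges are perpendicular to it

# Synthetic 32 x 32 block: a grating with 4 cycles along x and 4 along y.
y, x = np.mgrid[0:32, 0:32]
block = np.cos(2.0 * np.pi * (4 * x + 4 * y) / 32.0)
theta = dominant_direction(block)
```

For this grating the intensity is constant along lines where `x + y` is constant, so `theta` comes out as 3π/4 in image coordinates.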
Nonlinear Discriminant Analysis
• Each 105 x 105 core directional image is a vector in an 11025-dim. space
• $G^T$: maximizing class separability in the reduced dimensional space (4-dim. space)
• Classes: Left loop, Right loop, Whorl, Arch, Tented arch
Comparison of Experimental Results
NIST Database 4
Prediction accuracies (%)
Rejection rate (%)           0       1.8     8.5     20.0
Nonlinear LDA/GSVD           90.7    91.3    92.8    95.3
PCASYS +                     89.7    90.5    92.8    95.6
Jain et al. [1999, TPAMI]    -       90.0    91.2    93.5
Yao et al. [2003, PR]        -       90.0    92.2    95.6
Summary
• Nonlinear Feature Extraction based on Kernel Methods
- Nonlinear Discriminant Analysis
- Kernel Orthogonal Centroid Method (KOC)
• A comparison of Generalized Linear and Nonlinear Discriminant Analysis Algorithms
• Application to Fingerprint Classification
• Dimension reduction by feature transformation: linear combination of original features
• Feature selection: select a subset of the original features
- e.g. gene selection in gene expression microarray data analysis
• Visualization of high dimensional data
• Visual data mining
Core point detection
• $\theta_{i,j}$: dominant direction on the neighborhood centered at $(i, j)$
• Measure consistency of local dominant directions: $\left| \sum_{i,j=-1,0,1} [\cos(2\theta_{i,j}), \sin(2\theta_{i,j})] \right|$, the distance from the starting point to the finishing point
• The lowest value -> core point
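The consistency measure can be sketched over a grid of dominant directions. In this illustration `theta` is a 2-D array of angles, the 3 x 3 sum is taken around every interior grid point, and border points are excluded; the function name and the synthetic direction field with one singular point are assumptions.

```python
import numpy as np

def direction_consistency(theta):
    """|sum over the 3 x 3 neighborhood of [cos 2theta, sin 2theta]| per point."""
    c, s = np.cos(2.0 * theta), np.sin(2.0 * theta)
    h, w = theta.shape
    sum_c, sum_s = np.zeros((h - 2, w - 2)), np.zeros((h - 2, w - 2))
    for di in range(3):                       # row offsets -1, 0, 1
        for dj in range(3):                   # column offsets -1, 0, 1
            sum_c += c[di:di + h - 2, dj:dj + w - 2]
            sum_s += s[di:di + h - 2, dj:dj + w - 2]
    out = np.full((h, w), np.inf)             # borders are never selected
    out[1:-1, 1:-1] = np.hypot(sum_c, sum_s)
    return out

# Synthetic direction field (an assumption) that swirls around grid point (7, 7).
yy, xx = np.mgrid[0:15, 0:15]
theta = 0.5 * np.arctan2(yy - 7, xx - 7)
consistency = direction_consistency(theta)
core = np.unravel_index(np.argmin(consistency), consistency.shape)
```

Where neighboring directions agree the value approaches 9, and it drops where they disagree, so the minimum marks the candidate core point, here `(7, 7)`.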
References
• L. Chen et al., A new LDA-based face recognition system which can solve the small sample size problem, Pattern Recognition, 33:1713-1726, 2000
• P. Howland et al., Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition, SIMAX, 25(1):165-179, 2003
• H. Yu and J. Yang, A direct LDA algorithm for high-dimensional data with application to face recognition, Pattern Recognition, 34:2067-2070, 2001
• J. Yang and J.-Y. Yang, Why can LDA be performed in PCA transformed space?, Pattern Recognition, 36:563-566, 2003
• H. Park et al., Lower dimensional representation of text data based on centroids and least squares, BIT Numerical Mathematics, 43(2):1-22, 2003
• S. Mika et al., Fisher discriminant analysis with kernels, Neural Networks for Signal Processing IX, J. Larsen and S. Douglas (eds.), pp. 41-48, IEEE, 1999
• B. Scholkopf et al., Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation, 10:1299-1319, 1998
• G. Baudat and F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation, 12:2385-2404, 2000
• V. Roth and V. Steinhage, Nonlinear discriminant analysis using kernel functions, Advances in Neural Information Processing Systems, 12:568-574, 2000
• S.A. Billings and K.L. Lee, Nonlinear Fisher discriminant analysis using a minimum squared error cost function and the orthogonal least squares algorithm, Neural Networks, 15(2):263-270, 2002
• C.H. Park and H. Park, Nonlinear discriminant analysis based on generalized singular value decomposition, SIMAX, 27(1):98-102, 2005
• A.K. Jain et al., A multichannel approach to fingerprint classification, IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(4):348-359, 1999
• Y. Yao et al., Combining flat and structural representations for fingerprint classification with recursive neural networks and support vector machines, Pattern Recognition, 36(2):397-406, 2003
• C.H. Park and H. Park, Nonlinear feature extraction based on centroids and kernel functions, Pattern Recognition, 37(4):801-810
• C.H. Park and H. Park, A comparison of generalized LDA algorithms for undersampled problems, Pattern Recognition, to appear
• C.H. Park and H. Park, Fingerprint classification using fast Fourier transform and nonlinear discriminant analysis, Pattern Recognition, 2006