Advanced Machine Learning & Perception
DESCRIPTION
Advanced Machine Learning & Perception. Instructor: Tony Jebara. Topic 12: Manifold Learning (Unsupervised). Beyond Principal Components Analysis (PCA), Multidimensional Scaling (MDS), Generative Topographic Map (GTM), Locally Linear Embedding (LLE), Convex Invariance Learning (CoIL).
TRANSCRIPT
Tony Jebara, Columbia University
Advanced Machine Learning & Perception
Instructor: Tony Jebara
Topic 12
•Manifold Learning (Unsupervised)
•Beyond Principal Components Analysis (PCA)
•Multidimensional Scaling (MDS)
•Generative Topographic Map (GTM)
•Locally Linear Embedding (LLE)
•Convex Invariance Learning (CoIL)
•Kernel PCA (KPCA)
Manifolds
•Data is often embedded in a lower dimensional space
•Consider an image of a face being translated from left-to-right
•How to capture the true coordinates of the data on the manifold or embedding space and represent it compactly?
•Open problem: many possible approaches…
•PCA: linear manifold
•MDS: get inter-point distances, find 2D data with the same distances
•LLE: mimic neighborhoods using low dimensional vectors
•GTM: fit a grid of Gaussians to data via a nonlinear warp
•Linear after nonlinear normalization/invariance of data
•Linear in Hilbert space (kernels)
•The translated image is a transformation of the original: $\vec{x}_t = T_t\,\vec{x}_0$
Principal Components Analysis
•If we have eigenvectors, mean and coefficients:
  $\vec{x}_i \approx \vec{\mu} + \sum_j c_{ij}\,\vec{v}_j$
•Getting eigenvectors (i.e. approximating the covariance): $\Sigma = V \Lambda V^T$, i.e.
  $\begin{bmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\ \Sigma_{12} & \Sigma_{22} & \Sigma_{23} \\ \Sigma_{13} & \Sigma_{23} & \Sigma_{33} \end{bmatrix} = \begin{bmatrix} \vec{v}_1 & \vec{v}_2 & \vec{v}_3 \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix} \begin{bmatrix} \vec{v}_1^T \\ \vec{v}_2^T \\ \vec{v}_3^T \end{bmatrix}$
•Eigenvectors are orthonormal: $\vec{v}_i^T \vec{v}_j = \delta_{ij}$
•In the coordinates of v, the Gaussian is diagonal, cov = $\Lambda$
•All eigenvalues are non-negative: $\lambda_i \geq 0$
•Higher eigenvalues are higher variance, use those first: $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \lambda_4 \geq \cdots$
•To compute the coefficients: $c_{ij} = (\vec{x}_i - \vec{\mu})^T \vec{v}_j$
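The recipe above (mean, covariance eigenvectors sorted by eigenvalue, coefficients by projection) can be sketched in numpy; the function name `pca` and the return convention are my own choices, not from the slides:

```python
import numpy as np

def pca(X, d):
    """Project rows of X (N x D) onto the top-d principal components."""
    mu = X.mean(axis=0)
    Xc = X - mu                                # center the data
    cov = Xc.T @ Xc / len(X)                   # approximate the covariance
    lam, V = np.linalg.eigh(cov)               # eigh: ascending eigenvalues, orthonormal V
    order = np.argsort(lam)[::-1]              # highest variance first
    lam, V = lam[order], V[:, order]
    C = Xc @ V[:, :d]                          # coefficients c_ij = (x_i - mu)^T v_j
    return mu, V[:, :d], lam[:d], C

# Reconstruction: x_i ~ mu + sum_j c_ij v_j, i.e. mu + C @ V.T
```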
Multidimensional Scaling (MDS)
•Idea: capture only the distances between points X in the original space
•Construct another set of low dim or 2D Y points having the same distances
•A Dissimilarity d(x,y) is a function of two objects x and y such that
  $d(x,y) \geq 0$
  $d(x,x) = 0$
  $d(x,y) = d(y,x)$
•A Metric also has to satisfy the triangle inequality:
  $d(x,z) \leq d(x,y) + d(y,z)$
•Standard example: Euclidean l2 metric
  $d(x,y) = \tfrac{1}{2} \left\| x - y \right\|^2$
•Assume for N objects, we compute a dissimilarity matrix which tells us how far apart they are:
  $\Delta_{ij} = d(X_i, X_j)$
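The dissimilarity matrix under the slide's $\frac{1}{2}\|x-y\|^2$ metric can be computed in one vectorized pass; the helper name `dissimilarity_matrix` is my own:

```python
import numpy as np

def dissimilarity_matrix(X):
    """Delta_ij = 0.5 * ||x_i - x_j||^2 for the rows of X (N x D)."""
    sq = (X ** 2).sum(axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return 0.5 * np.maximum(D2, 0.0)           # clip tiny negatives from round-off
```

The result is symmetric with a zero diagonal, matching the dissimilarity axioms above.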
Multidimensional Scaling
•Given dissimilarity $\Delta$ between original X points under the original d() metric, find Y points with dissimilarity D under another d'() metric such that D is similar to $\Delta$:
  $\Delta_{ij} = d(X_i, X_j) \qquad D_{ij} = d'(Y_i, Y_j)$
•Want to find Y's that minimize some difference from D to $\Delta$
•E.g. Least Squares Stress: $\mathrm{Stress}(Y_1, \ldots, Y_N) = \sum_{ij} \left( D_{ij} - \Delta_{ij} \right)^2$
•E.g. Invariant Stress: $\mathrm{InvStress} = \frac{\mathrm{Stress}(Y)}{\sum_{i<j} D_{ij}^2}$
•E.g. Sammon Mapping: $\sum_{ij} \frac{\left( D_{ij} - \Delta_{ij} \right)^2}{\Delta_{ij}}$
•E.g. Strain: $\mathrm{trace}\left( J (D^2 - \Delta^2) J (D^2 - \Delta^2) \right)$ where $J = I - \frac{1}{N} \vec{1} \vec{1}^T$
•Some are global, some are local; minimize by gradient descent
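As a sketch of the gradient-descent route, here is least-squares stress minimized over 2D points, using plain Euclidean distance for d'(); the function name, learning rate, and step count are illustrative choices, not from the lecture:

```python
import numpy as np

def mds_stress(delta, dim=2, steps=500, lr=0.01, seed=0):
    """Gradient descent on least-squares stress sum_ij (D_ij - Delta_ij)^2."""
    n = delta.shape[0]
    Y = np.random.RandomState(seed).randn(n, dim)  # random initial embedding
    for _ in range(steps):
        diff = Y[:, None, :] - Y[None, :, :]       # (n, n, dim) pairwise y_i - y_j
        D = np.sqrt((diff ** 2).sum(-1) + 1e-12)   # current embedded distances
        np.fill_diagonal(D, 1.0)                   # avoid divide-by-zero on the diagonal
        resid = D - delta
        np.fill_diagonal(resid, 0.0)
        # d Stress / d y_i = 4 * sum_j (D_ij - Delta_ij) (y_i - y_j) / D_ij
        grad = 4.0 * (resid / D)[:, :, None] * diff
        Y -= lr * grad.sum(axis=1)
    return Y
```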
•Have distances from cities to cities; these lie on the surface of a sphere (Earth) in 3D space
•Reconstructed 2D points on a plane capture the essential properties (poles?)
MDS Example 3D to 2D
•More elaborate example
•Have a correlation matrix between crimes. These are of arbitrary dimensionality.
•Hack: convert correlation to dissimilarity and show the reconstructed Y
MDS Example Multi-D to 2D
Locally Linear Embedding
•Instead of distance, look at the neighborhood of each point. Preserve the reconstruction of each point from its neighbors in low dim
•Find K nearest neighbors for each point
•Describe the neighborhood as the best weights on the neighbors to reconstruct the point:
  $\varepsilon(W) = \sum_i \left\| \vec{X}_i - \sum_j W_{ij} \vec{X}_j \right\|^2 \quad \text{subject to} \quad \sum_j W_{ij} = 1 \ \forall i$
•Find best vectors that still have the same weights:
  $\Phi(Y) = \sum_i \left\| \vec{Y}_i - \sum_j W_{ij} \vec{Y}_j \right\|^2 \quad \text{subject to} \quad E\{Y\} = 0, \ \mathrm{Cov}\{Y\} = I$
Why?
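The first step above, finding each point's K nearest neighbors, can be done from the pairwise squared distances; the helper name `knn_indices` is my own:

```python
import numpy as np

def knn_indices(X, K):
    """Indices of the K nearest neighbors of each row of X (excluding itself)."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                     # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :K]             # K closest, nearest first
```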
Locally Linear Embedding
•Finding W's (convex combination of weights on neighbors):
  $\varepsilon(W) = \sum_i \varepsilon^i(W^{i\cdot}) \quad \text{where} \quad \varepsilon^i(W^{i\cdot}) = \left\| \vec{X}_i - \sum_j W_{ij} \vec{X}_j \right\|^2$
  $\varepsilon^i(W^{i\cdot}) = \left\| \vec{X}_i - \sum_j W_{ij} \vec{X}_j \right\|^2 = \left\| \sum_j W_{ij} \left( \vec{X}_i - \vec{X}_j \right) \right\|^2$
  $= \sum_{jk} W_{ij} W_{ik} \left( \vec{X}_i - \vec{X}_j \right)^T \left( \vec{X}_i - \vec{X}_k \right)$
  $= \sum_{jk} W_{ij} W_{ik} C_{jk} \quad \text{and recall} \quad \sum_j W_{ij} = 1$
•Minimize each row's error with a Lagrange multiplier enforcing the sum-to-one constraint:
  $W^{i\cdot *} = \arg\min_w \tfrac{1}{2} w^T C w - \lambda \left( \vec{1}^T w - 1 \right)$
  1) Take derivative & set to 0: $C w - \lambda \vec{1} = 0$
  2) Solve linear system: $w = \lambda C^{-1} \vec{1}$
  3) Find $\lambda$: $\vec{1}^T w = 1 \;\Rightarrow\; \lambda = \frac{1}{\vec{1}^T C^{-1} \vec{1}}$
  4) Find w
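Steps 1-4 amount to solving $Cw = \vec{1}$ and rescaling so the weights sum to one (the normalization absorbs $\lambda$). A sketch, where the ridge term `reg` is a common stabilizer for singular C (e.g. more neighbors than dimensions) rather than something from the slides:

```python
import numpy as np

def lle_weights(X, i, neighbors, reg=1e-3):
    """Reconstruction weights for point i from its neighbors, summing to one."""
    Z = X[neighbors] - X[i]                     # rows are X_j - X_i
    C = Z @ Z.T                                 # local Gram matrix C_jk
    C = C + reg * np.trace(C) * np.eye(len(neighbors))   # regularize if C is singular
    w = np.linalg.solve(C, np.ones(len(neighbors)))      # solve C w = 1
    return w / w.sum()                          # enforce sum_j w_j = 1
```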
Locally Linear Embedding
•Finding Y's (new low-D points that agree with the W's):
  $\Phi(Y) = \sum_i \left\| \vec{Y}_i - \sum_j W_{ij} \vec{Y}_j \right\|^2$
  $= \sum_i \left( \vec{Y}_i - \sum_j W_{ij} \vec{Y}_j \right)^T \left( \vec{Y}_i - \sum_k W_{ik} \vec{Y}_k \right)$
  $= \sum_i \left( \vec{Y}_i^T \vec{Y}_i - \sum_j W_{ij} \vec{Y}_i^T \vec{Y}_j - \sum_k W_{ik} \vec{Y}_k^T \vec{Y}_i + \sum_{jk} W_{ij} W_{ik} \vec{Y}_j^T \vec{Y}_k \right)$
  $= \sum_{jk} \left( \delta_{jk} - W_{jk} - W_{kj} + \sum_i W_{ij} W_{ik} \right) \vec{Y}_j^T \vec{Y}_k$
  $= \sum_{jk} M_{jk} \vec{Y}_j^T \vec{Y}_k \quad \text{subject to } Y \text{ being white}$
•Solve for Y as the bottom d+1 eigenvectors of M
•Plot the Y values
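Since $M_{jk} = \delta_{jk} - W_{jk} - W_{kj} + \sum_i W_{ij} W_{ik}$ is exactly $(I-W)^T(I-W)$, the embedding step can be sketched as below; the $\sqrt{N}$ scaling (to satisfy the whiteness constraint) and the function name are my own choices:

```python
import numpy as np

def lle_embed(W, d):
    """Embed via the bottom eigenvectors of M = (I-W)^T (I-W)."""
    n = W.shape[0]
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    lam, V = np.linalg.eigh(M)          # eigenvalues in ascending order
    # Because rows of W sum to one, the very bottom eigenvector is the
    # constant vector (eigenvalue 0); discard it and keep the next d.
    return V[:, 1:d + 1] * np.sqrt(n)   # scale so Cov(Y) ~ I
```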
LLE Examples
•Original X data are raw images
•Dots are reconstructed two-dimensional Y points
[Figure: a raw image mapped to its low-dimensional coordinate vector]
LLEs
•Top = PCA
•Bottom = LLE
Generative Topographic Map
•A principled alternative to the Kohonen map
•Forms a generative model of the manifold. Can sample from it, etc.
•Find a nonlinear mapping y() from a 2D grid of Gaussians.
•Pick params W of the mapping such that the mapped Gaussians in data space maximize the likelihood of the observed data.
•Have two spaces, the data space t (old notation was X) and the hidden latent space x (old notation was Y).
•The mapping goes from latent space to observed space:
  $\vec{t}_i \approx y(\vec{x}_i, W)$
GTM as a Grid of Gaussians
•We choose our priors and conditionals for all variables of interest
•Assume Gaussian noise on the y() mapping:
  $p(t \mid x, W, \beta) = \left( \frac{\beta}{2\pi} \right)^{D/2} \exp\left( -\frac{\beta}{2} \left\| y(x, W) - t \right\|^2 \right)$
•Assume our prior latent variables are a grid model equally spaced in latent space:
  $p(x) = \frac{1}{K} \sum_{k=1}^{K} \delta(x - x_k)$
•Can now write out the full likelihood:
  $L(W, \beta) = \sum_{n=1}^{N} \log p(t_n \mid W, \beta) = \sum_{n=1}^{N} \log \int p(t_n \mid x, W, \beta)\, p(x)\, dx$
GTM Distribution Model
•Integrating over delta functions makes a summation:
  $L(W, \beta) = \sum_{n=1}^{N} \log \int p(t_n \mid x, W, \beta)\, p(x)\, dx = \sum_{n=1}^{N} \log \frac{1}{K} \sum_{k=1}^{K} p(t_n \mid x_k, W, \beta)$
•Note the log-sum; need to apply EM to maximize
•Also, use the following parametric (linear in the basis) form of the mapping:
  $y(x, W) = W \phi(x)$
•Examples of manifolds for randomly chosen W mappings
•Typically, we are given the data and want to find the maximum likelihood mapping W for it…
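The log-sum objective above can be evaluated directly (this is the quantity EM would increase); a sketch with $y(x,W)=W\phi(x)$, where `Phi` holds the basis vectors $\phi(x_k)$ of the grid as rows and the function name is my own:

```python
import numpy as np

def gtm_loglik(T, Phi, W, beta):
    """L(W,beta) = sum_n log (1/K) sum_k p(t_n | x_k, W, beta).
    T: data (N x D), Phi: basis at grid points (K x M), W: mapping (D x M)."""
    Y = Phi @ W.T                                         # grid centers in data space (K x D)
    D = T.shape[1]
    d2 = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # ||t_n - y(x_k,W)||^2, (N x K)
    log_p = 0.5 * D * np.log(beta / (2 * np.pi)) - 0.5 * beta * d2
    # log-sum-exp over the K mixture components, each with weight 1/K
    m = log_p.max(axis=1, keepdims=True)
    lse = m.squeeze(1) + np.log(np.exp(log_p - m).sum(axis=1))
    return (lse - np.log(Phi.shape[0])).sum()
```

Exponentiating `log_p - lse` per row would give the EM responsibilities of each grid Gaussian for each data point.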
GTM Examples
•Recover the non-linear manifold by warping the grid with the W params
•Synthetic Example: Left = Initialized, Right = Converged
•Real Example: Oil Data, 3 Classes. Left = GTM, Right = PCA