Lecture 16. Manifold Learning
COMP90051 Statistical Machine Learning
Semester 2, 2017. Lecturer: Andrey Kan
Copyright: University of Melbourne. Swiss roll image: Evan-Amos, Wikimedia Commons, CC0
This lecture
• Introduction to manifold learning
  – Motivation
  – Focus on data transformation
• Unfolding the manifold
  – Geodesic distances
  – Isomap algorithm
• Spectral clustering
  – Laplacian eigenmaps
  – Spectral clustering pipeline
Manifold Learning
Recovering a low dimensional data representation non-linearly embedded within a higher dimensional space
The limitation of k-means and GMM
• The k-means algorithm can find spherical clusters
• GMM can find elliptical clusters
• These algorithms will struggle in cases like the one below
[Figure: k-means clustering vs the desired result. Figure from Murphy]
Focusing on data geometry
• We are not dismissing the k-means algorithm yet, but we are going to put it aside for a moment
• One approach to address the problem on the previous slide would be to introduce improvements to algorithms such as k-means
• Instead, let's focus on the geometry of the data and see if we can transform the data to make it amenable to simpler algorithms
  – Recall the "transform the data vs modify the model" discussion in supervised learning
Non-linear data embedding
• Recall the example with 3D GPS coordinates that denote a car's location on a 2D surface
• In a similar example, consider the coordinates of items on a picnic blanket, which is approximately a plane
  – In this example, the data resides on a plane embedded in 3D
• A low dimensional surface can be quite curved in a higher dimensional space
  – A plane of dough (2D) rolled up in a Swiss roll (3D)
Picnic blanket image: Theo Wright, Flickr, CC2. Swiss roll image: Evan-Amos, Wikimedia Commons, CC0
Key assumption: it's simpler than it looks!
• Key assumption: high dimensional data actually resides in a lower dimensional space that is locally Euclidean
• Informally, a manifold is a subset of points in the high-dimensional space that locally looks like a low-dimensional space
Manifold example
• Informally, a manifold is a subset of points in the high-dimensional space that locally looks like a low-dimensional space
• Example: arc of a circle
  – consider a tiny bit of the circumference (2D); it can be treated as a line (1D)
[Figure: points A, B, C on an arc of a circle, where $AC \approx AB + BC$]
$d$-dimensional manifold
• Definition from Guillemin and Pollack, Differential Topology, 1974
• A mapping $f$ on an open set $U \subseteq \mathbb{R}^n$ is called smooth if it has continuous partial derivatives of all orders
• A map $f: X \to \mathbb{R}^m$ is called smooth if around each point $x \in X$ there is an open set $U \subseteq \mathbb{R}^n$ and a smooth map $F: U \to \mathbb{R}^m$ such that $F$ equals $f$ on $U \cap X$
• A smooth map $f: X \to Y$ of subsets of two Euclidean spaces is a diffeomorphism if it is one to one and onto, and if the inverse map $f^{-1}: Y \to X$ is also smooth. $X$ and $Y$ are diffeomorphic if such a map exists
• Suppose that $X$ is a subset of some ambient Euclidean space $\mathbb{R}^N$. Then $X$ is a $d$-dimensional manifold if each point $x \in X$ possesses a neighbourhood $V \subseteq X$ which is diffeomorphic to an open set $U \subseteq \mathbb{R}^d$
Manifold examples
• A few examples of manifolds are shown below
• In all cases, the idea is that (hopefully) once the manifold is "unfolded", analysis such as clustering becomes easy
• How do we "unfold" a manifold?
Geodesic Distances and Isomap
A non-linear dimensionality reduction algorithm that preserves locality information using geodesic distances
General idea: dimensionality reduction
• Find a lower dimensional representation of the data that preserves distances between points (MDS)
• Do visualisation, clustering, etc. on the lower dimensional representation. Problems?
[Figure: points A and B shown in the original space and in the lower dimensional representation]
"Global distances" vs geodesic distances
• "Global distances" cause a problem: we may not want to preserve them
• We are interested in preserving distances along the manifold (geodesic distances)
[Figure: geodesic distance between points C and D along the manifold]
Images: ClkerFreeVectorImages and Kaz @pixabay.com (CC0)
MDS and similarity matrix
• In essence, "unfolding" a manifold is achieved via dimensionality reduction, using methods such as MDS (a sketch of one MDS variant follows below)
• Recall that the input of an MDS algorithm is a similarity (aka proximity) matrix, where each element $w_{ij}$ denotes how similar data points $i$ and $j$ are
• Replacing distances with geodesic distances simply means constructing a different similarity matrix without changing the MDS algorithm
  – Compare this to the idea of modular learning in kernel methods
• As you will see shortly, there is a close connection between similarity matrices and graphs; the next slide reviews basic definitions from graph theory
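As a concrete reference point, here is a minimal NumPy sketch of classical MDS from a pairwise distance matrix. This is one common MDS variant; the lecture does not fix a specific algorithm, and the function name classical_mds and its arguments are illustrative, not from the slides.

```python
import numpy as np

def classical_mds(D, l=2):
    """Classical MDS sketch: embed n points in l dimensions
    given an n x n matrix D of pairwise distances."""
    n = D.shape[0]
    # Double centering: B = -1/2 * J D^2 J, with J = I - (1/n) 11'
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # Eigendecomposition of the symmetric matrix B (ascending eigenvalues)
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:l]      # l largest eigenvalues
    lam = np.maximum(eigvals[idx], 0.0)      # clip tiny negative values
    return eigvecs[:, idx] * np.sqrt(lam)    # n x l coordinates
```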
Refresher on graph terminology
• A graph is a tuple $G = \{V, E\}$, where $V$ is a set of vertices and $E \subseteq V \times V$ is a set of edges. Each edge is a pair of vertices
  – Undirected graph: pairs are unordered
  – Directed graph: pairs are ordered
• Graphs model pairwise relations between objects
  – Similarity or distance between the data points
• In a weighted graph, each edge $(v_i, v_j)$ has an associated weight $w_{ij}$
  – Weights capture the strength of the relation between objects
Weighted adjacency matrix
• We will consider weighted undirected graphs with non-negative weights $w_{ij} \ge 0$. Moreover, we will assume that $w_{ij} = 0$ if and only if vertices $i$ and $j$ are not connected
• The degree of a vertex $v_i \in V$ is defined as $\deg(i) \equiv \sum_{j=1}^{n} w_{ij}$
• A weighted undirected graph can be represented with a weighted adjacency matrix $W$ that contains the weights $w_{ij}$ as its elements
Similarity graph models data geometry
• Geodesic distances can be approximated using a graph in which vertices represent data points
• Let $d(i, j)$ be the Euclidean distance between points $i$ and $j$ in the original space
• Option 1: define some local radius $\varepsilon$. Connect vertices $i$ and $j$ with an edge if $d(i, j) \le \varepsilon$
• Option 2: define a nearest neighbour threshold $k$. Connect vertices $i$ and $j$ if $i$ is among the $k$ nearest neighbours of $j$, OR $j$ is among the $k$ nearest neighbours of $i$
• Set the weight of each edge to $d(i, j)$ (see the sketch below)
[Figure: geodesic distance between points C and D approximated along graph edges]
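Both graph constructions can be written down directly. Below is a minimal NumPy/SciPy sketch, assuming a data matrix X with one point per row; the function name similarity_graph and its keyword arguments are hypothetical, not from the slides.

```python
import numpy as np
from scipy.spatial.distance import cdist

def similarity_graph(X, eps=None, k=None):
    """Weighted adjacency matrix of the similarity graph.
    Option 1: connect i and j if d(i, j) <= eps.
    Option 2: connect i and j if either is among the other's
    k nearest neighbours."""
    d = cdist(X, X)                              # Euclidean distances d(i, j)
    n = d.shape[0]
    if eps is not None:                          # Option 1: local radius
        mask = d <= eps
    else:                                        # Option 2: kNN with the OR rule
        nn = np.argsort(d, axis=1)[:, 1:k + 1]   # k nearest, skipping self
        mask = np.zeros((n, n), dtype=bool)
        mask[np.repeat(np.arange(n), k), nn.ravel()] = True
        mask |= mask.T                           # symmetrise (the OR)
    np.fill_diagonal(mask, False)                # no self-loops
    return np.where(mask, d, 0.0)                # edge weight = d(i, j)
```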
Computing geodesic distances
• Given the similarity graph, compute the shortest paths between each pair of points
  – E.g., using the Floyd-Warshall algorithm in $O(n^3)$ time (see the sketch below)
• Set the geodesic distance between vertices $i$ and $j$ to the length (sum of weights) of the shortest path between them
• Define a new similarity matrix based on geodesic distances
[Figure: geodesic distance between points C and D along the shortest path in the graph]
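A sketch of this step using SciPy's shortest-path routine, where method='FW' selects Floyd-Warshall. In SciPy's dense graph format, zero entries of W are interpreted as missing edges, matching the convention above; pairs in disconnected components come back as inf.

```python
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(W):
    """All-pairs shortest-path lengths on the similarity graph,
    used as approximate geodesic distances."""
    # method='FW' is Floyd-Warshall, O(n^3) as on the slide
    return shortest_path(W, method='FW', directed=False)
```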
Isomap: summary
1. Construct the similarity graph
2. Compute shortest paths
3. Geodesic distances are the lengths of the shortest paths
4. Construct a similarity matrix using geodesic distances
5. Apply MDS (a sketch combining all the steps follows below)
[Figure: geodesic distance between points C and D]
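Chaining the pieces gives the whole algorithm. A sketch under the same assumptions, reusing the hypothetical helpers similarity_graph, geodesic_distances and classical_mds from the earlier sketches:

```python
def isomap(X, k=10, l=2):
    """Isomap sketch: steps 1-5 of the summary above."""
    W = similarity_graph(X, k=k)    # 1. similarity graph (Option 2 here)
    G = geodesic_distances(W)       # 2-3. shortest paths = geodesic distances
    return classical_mds(G, l=l)    # 4-5. MDS on the geodesic distance matrix
```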
Spectral Clustering
A spectral graph theory approach to non-linear dimensionality reduction
Data processing pipelines
• The Isomap algorithm can be considered a pipeline in the sense that it combines different processing blocks, such as graph construction and MDS
• Here MDS serves as a core sub-routine of Isomap
• Spectral clustering is similar to Isomap in that it also comprises a few standard blocks, including k-means clustering
• In contrast to Isomap, spectral clustering uses a different non-linear mapping technique called the Laplacian eigenmap
Spectral clustering algorithm
1. Construct a similarity graph, and use the corresponding adjacency matrix as a new similarity matrix
  – Just as in Isomap, the graph captures local geometry and breaks long distance relations
  – Unlike Isomap, the adjacency matrix is used "as is"; shortest paths are not computed
2. Map the data to a lower dimensional space using Laplacian eigenmaps on the adjacency matrix
  – This uses results from spectral graph theory
3. Apply k-means clustering to the mapped points
Similarity graph for spectral clustering
• Again, we start by constructing a similarity graph. This can be done in the same way as for Isomap (but there is no need to compute shortest paths)
• Recall that Option 1 was to connect points that are closer than $\varepsilon$, and Option 2 was to connect points within a $k$-neighbourhood
• There is also Option 3, usually considered for spectral clustering. Here all points are connected to each other (the graph is fully connected). The weights are assigned using a Gaussian kernel (aka heat kernel) with width parameter $\sigma$:

$$w_{ij} = \exp\left(-\frac{1}{\sigma}\left\|x_i - x_j\right\|^2\right)$$
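Option 3 in code, again as a sketch with an illustrative function name; sigma is the width parameter from the formula above.

```python
import numpy as np
from scipy.spatial.distance import cdist

def heat_kernel_similarity(X, sigma=1.0):
    """Fully connected similarity graph with Gaussian (heat) kernel
    weights w_ij = exp(-||x_i - x_j||^2 / sigma)."""
    sq = cdist(X, X, 'sqeuclidean')   # squared Euclidean distances
    W = np.exp(-sq / sigma)
    np.fill_diagonal(W, 0.0)          # a common convention: no self-loops
    return W
```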
Graph Laplacian
• Recall that $W$ denotes the weighted adjacency matrix, which contains all the weights $w_{ij}$
• Next, the degree matrix $D$ is defined as a diagonal matrix with the vertex degrees on the diagonal. Recall that a vertex degree is $\deg(i) = \sum_{j=1}^{n} w_{ij}$
• Finally, another special matrix associated with each graph is the unnormalised graph Laplacian, defined as $L \equiv D - W$
  – For simplicity, here we introduce spectral clustering using the unnormalised Laplacian. In practice, it is common to use a Laplacian normalised in a certain way, e.g., $L_{\mathrm{norm}} \equiv I - D^{-1}W$, where $I$ is the identity matrix
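Both Laplacians are one-liners given W. A sketch (the helper name is illustrative; it assumes every vertex has non-zero degree, so that D is invertible):

```python
import numpy as np

def graph_laplacians(W):
    """Unnormalised Laplacian L = D - W and normalised L_norm = I - D^{-1} W."""
    deg = W.sum(axis=1)                            # vertex degrees
    L = np.diag(deg) - W                           # L = D - W
    L_norm = np.eye(len(deg)) - W / deg[:, None]   # row i of W scaled by 1/deg(i)
    return L, L_norm
```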
Laplacian eigenmaps
• Laplacian eigenmaps, a central sub-routine of spectral clustering, is a non-linear dimensionality reduction method
• Similar to MDS, the idea is to map the original data points $x_i \in \mathbb{R}^m$, $i = 1, \dots, n$, to a set of low-dimensional points $z_i \in \mathbb{R}^l$, $l < m$, that "best represent" the original data
• Laplacian eigenmaps use a similarity matrix $W$ rather than the original data coordinates as a starting point
  – Here the similarity matrix $W$ is the weighted adjacency matrix of the similarity graph
• Earlier, we've seen examples of how the "best represent" criterion is formalised in MDS methods
• Laplacian eigenmaps use a different criterion, namely the aim is to minimise (subject to some constraints)

$$\sum_{i,j} \left\|z_i - z_j\right\|^2 w_{ij}$$
Alternative representation of the mapping
• This minimisation problem is solved using results from spectral graph theory
• Instead of the mapped points $z_i$, the output can be viewed as a set of $n$-dimensional vectors $f_j$, $j = 1, \dots, l$. The solution eigenmap is expressed in terms of these $f_j$
  – For example, if the mapping is onto a 1D line, $f_1 = f$ is just a collection of coordinates for all $n$ points
  – If the mapping is onto 2D, $f_1$ is a collection of all the first coordinates, and $f_2$ is a collection of all the second coordinates
• For illustrative purposes, we will consider a simple example of mapping to 1D
Problem formulation for a 1D eigenmap
• Given an $n \times n$ similarity matrix $W$, our aim is to find a 1D mapping $f$, such that $f_i$ is the coordinate of the mapped $i$-th point. We are looking for a mapping that minimises $\frac{1}{2}\sum_{i,j}\left(f_i - f_j\right)^2 w_{ij}$
• Clearly, for any $f$, this can be made smaller by multiplying $f$ by a small constant, so we need to introduce a scaling constraint, e.g., $\left\|f\right\|^2 = f'f = 1$
• Next, recall that $L \equiv D - W$
Re-writing the objective in vector form

$\frac{1}{2}\sum_{i,j}\left(f_i - f_j\right)^2 w_{ij}$

$= \frac{1}{2}\left(\sum_{i,j} f_i^2 w_{ij} - 2\sum_{i,j} f_i f_j w_{ij} + \sum_{i,j} f_j^2 w_{ij}\right)$

$= \frac{1}{2}\left(\sum_{i=1}^{n} f_i^2 \sum_{j=1}^{n} w_{ij} - 2\sum_{i,j} f_i f_j w_{ij} + \sum_{j=1}^{n} f_j^2 \sum_{i=1}^{n} w_{ij}\right)$

$= \frac{1}{2}\left(\sum_{i=1}^{n} f_i^2 \deg(i) - 2\sum_{i,j} f_i f_j w_{ij} + \sum_{j=1}^{n} f_j^2 \deg(j)\right)$

$= \sum_{i=1}^{n} f_i^2 \deg(i) - \sum_{i,j} f_i f_j w_{ij}$

$= f'Df - f'Wf$

$= f'Lf$
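The identity $f'Lf = \frac{1}{2}\sum_{i,j}(f_i - f_j)^2 w_{ij}$ is easy to sanity-check numerically; a small sketch with random symmetric weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.random((n, n))
W = (A + A.T) / 2                    # symmetric non-negative weights
np.fill_diagonal(W, 0.0)             # no self-loops
L = np.diag(W.sum(axis=1)) - W       # L = D - W
f = rng.standard_normal(n)

lhs = 0.5 * sum((f[i] - f[j]) ** 2 * W[i, j]
                for i in range(n) for j in range(n))
rhs = f @ L @ f
assert np.isclose(lhs, rhs)          # the two forms of the objective agree
```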
Laplace meets Lagrange
• Our problem becomes to minimise $f'Lf$, subject to $f'f = 1$. Recall the method of Lagrange multipliers: introduce a Lagrange multiplier $\lambda$, and set the derivatives of the Lagrangian to zero
• $\mathcal{L} = f'Lf - \lambda\left(f'f - 1\right)$
• $2f'L' - 2\lambda f' = 0$
• $Lf = \lambda f$ (using the symmetry of $L$, i.e., $L' = L$)
• The latter is precisely the definition of an eigenvector, with $\lambda$ being the corresponding eigenvalue!
• Critical points of our objective function $f'Lf = \frac{1}{2}\sum_{i,j}\left(f_i - f_j\right)^2 w_{ij}$ are eigenvectors of $L$
• Note that the objective is actually minimised by the constant eigenvector, whose eigenvalue is $0$; this is not useful. Therefore, for a 1D mapping we use the eigenvector with the second smallest eigenvalue
Déjà vu?
Laplacian eigenmaps: summary
• Start with points $x_i \in \mathbb{R}^m$. Construct a similarity graph using one of the 3 options
• Construct the weighted adjacency matrix $W$ (do not compute shortest paths) and the corresponding Laplacian matrix $L$
• Compute the eigenvectors of $L$, and arrange them in the order of the corresponding eigenvalues $0 = \lambda_1 < \lambda_2 < \dots < \lambda_n$
• Take the eigenvectors corresponding to $\lambda_2$ to $\lambda_{l+1}$ as $f_1, \dots, f_l$, $l < m$, where each $f_j$ corresponds to one of the new dimensions
• Combine all these vectors into an $n \times l$ matrix, with the $f_j$ in columns. The mapped points are the rows of this matrix
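This summary translates into a few lines of NumPy. A sketch with an illustrative function name; np.linalg.eigh returns eigenvalues in ascending order, so dropping the first column discards the constant eigenvector with $\lambda_1 = 0$:

```python
import numpy as np

def laplacian_eigenmap(W, l=2):
    """Map n points to l dimensions from the weighted adjacency matrix W."""
    L = np.diag(W.sum(axis=1)) - W         # unnormalised Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # ascending eigenvalues
    return eigvecs[:, 1:l + 1]             # eigenvectors for lambda_2..lambda_{l+1}
```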
Spectral clustering: summary
1. Construct a similarity graph
2. Map the data to a lower dimensional space using Laplacian eigenmaps on the adjacency matrix
3. Apply k-means clustering to the mapped points
[Figure: spectral clustering result]
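Putting the three steps together, with scikit-learn's KMeans as the final block and reusing the hypothetical heat_kernel_similarity and laplacian_eigenmap sketches above:

```python
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters=2, sigma=1.0, l=2):
    """Spectral clustering sketch: similarity graph -> eigenmap -> k-means."""
    W = heat_kernel_similarity(X, sigma=sigma)   # 1. similarity graph (Option 3)
    Z = laplacian_eigenmap(W, l=l)               # 2. Laplacian eigenmap
    # 3. k-means on the mapped points; returns the cluster labels
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)
```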
This lecture
• Introduction to manifold learning
  – Motivation
  – Focus on data transformation
• Unfolding the manifold
  – Geodesic distances
  – Isomap algorithm
• Spectral clustering
  – Laplacian eigenmaps
  – Spectral clustering pipeline