Atul Singh, Junior Undergraduate, CSE, IIT Kanpur


Page 1

Atul Singh
Junior Undergraduate

CSE, IIT Kanpur

Page 2

Dimension reduction is a technique used to represent high-dimensional data in a more compact way using some alternate representation

Many data-generation processes produce large data sets that lie on a low-dimensional manifold embedded in a high-dimensional space. Dimension reduction can be applied to such data sets

Page 3

In statistical pattern recognition, we often encounter data sets in a high-dimensional space

Often the data is correlated in such a manner that there are very few independent dimensions

It is then possible to represent the data using much lower dimensions. Some benefits are:
◦ Compact representation
◦ Less processing time
◦ Visualization of high-dimensional data
◦ Interpolation along meaningful dimensions

Page 4

Linear Methods
◦ Principal Component Analysis (PCA)
◦ Independent Component Analysis (ICA)
◦ Multi-dimensional Scaling (MDS)

Non-linear Methods
◦ Global
 Isomap and its variants
◦ Local
 Locally Linear Embedding (LLE)
 Laplacian Eigenmaps

Page 5

Principal Component Analysis
◦ Involves finding the directions of largest covariance
◦ Expresses the data as a linear combination of the eigenvectors along those directions

Multidimensional Scaling
◦ Keeps inter-point distances invariant
◦ Again a linear methodology
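As a minimal sketch of the PCA recipe above (not from the slides; the data here is synthetic), the principal directions are the top eigenvectors of the data's covariance matrix:

```python
import numpy as np

def pca(X, d):
    """Project X (n samples x D features) onto its top-d principal directions."""
    Xc = X - X.mean(axis=0)                    # center the data
    C = np.cov(Xc, rowvar=False)               # D x D covariance matrix
    vals, vecs = np.linalg.eigh(C)             # eigh returns ascending eigenvalues
    top = vecs[:, np.argsort(vals)[::-1][:d]]  # directions of largest covariance
    return Xc @ top                            # data as combinations along them

# Toy data: 3-D points that actually vary only along a 2-D plane
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 3))
Y = pca(X, 2)
```

Because the toy data is an exact linear mixture of two latent coordinates, the two retained components capture essentially all of its variance, which is the sense in which PCA is a linear method.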

Page 6

Many data sets can't be represented as linear combinations of some vectors
◦ Examples – Swiss roll, faces data, etc.

In general, the low-dimensional data is embedded in some non-linear manifold

It is not possible to transform these manifolds into a low-dimensional space using only translation, rotation, and rescaling

Page 7

Linear methods have a preconceived notion of dimension reduction

The goal is to automate both the estimation process (infer the degrees of freedom of the data manifold) and the embedding process (embed the data in a low-dimensional space)

So we need to go beyond linear methods, to non-linear methods:
◦ Isomap
◦ Locally Linear Embedding

Page 8

Isomap is a global manifold-learning algorithm

A Swiss roll (fig. a) embedded as a manifold in a high-dimensional space; the goal is to reduce it to two dimensions (fig. c)

Page 9

Consider a human face:
◦ How is it represented/stored in the brain?
◦ How is it represented/stored in a computer?

Do we need to store all the information (every pixel)?

Just need to figure out some important structures

Page 10

The basic steps involved are:
◦ Construct the neighborhood graph: determine the neighbors of each point and assign edge weights dX(i, j) to the graph thus formed
◦ Compute geodesic distances: estimate the geodesic distances dM(i, j) using Dijkstra's algorithm on the graph
◦ Use MDS to lower the dimensions: apply MDS to the computed shortest-path distance matrix and thus reduce the dimensionality

So Isomap is basically MDS, just using geodesic distances rather than Euclidean ones.
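The three steps above can be sketched as follows (an illustrative toy implementation, not the authors' code; a 1-D curve in 3-D stands in for the Swiss roll as test data):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def isomap(X, k=7, d=2):
    """Sketch of Isomap: k-NN graph -> geodesic distances -> classical MDS."""
    n = len(X)
    D = cdist(X, X)                       # pairwise Euclidean distances dX(i, j)
    # 1. Neighborhood graph: keep edges only to each point's k nearest neighbors
    G = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]
        G[i, nbrs] = D[i, nbrs]
    # 2. Geodesic distances dM(i, j): shortest paths through the graph (Dijkstra)
    DM = shortest_path(csr_matrix(G), method="D", directed=False)
    # 3. Classical MDS on the geodesic (shortest-path) distance matrix
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (DM ** 2) @ J          # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:d]      # top-d eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# A noiseless "roll": a curve in 3-D, flattened into the plane
t = np.linspace(0, 3 * np.pi, 120)
X = np.c_[np.cos(t), np.sin(t), 0.2 * t]
Y = isomap(X, k=6, d=2)
```

Replacing `DM` with the raw Euclidean matrix `D` in step 3 would give plain MDS, which is exactly the difference the slide describes.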

Page 11

Page 12

Locally Linear Embedding (LLE) is another non-linear dimension reduction algorithm

Follows a local approach to reduce the dimensions

The idea is based on the assumption that each point on the manifold lies on a hyperplane determined by the point and some of its nearest neighbors

Key question – how do we combine these hyperplane patches and map them to a low-dimensional space?

Page 13

The basic steps involved are:
◦ Assign neighbors: to each point, assign neighbors using a nearest-neighbor approach
◦ Weight calculation: compute weights Wij such that Xi is best reconstructed from its neighbors
◦ Compute the low-dimensional embedding: using the weight matrix computed above, find the corresponding embedding vectors Yi in the lower-dimensional space by minimizing an error function
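The weight-calculation step can be sketched for a single point as follows (an illustrative implementation; the regularization term `reg` is an assumption added here to keep the local Gram matrix invertible):

```python
import numpy as np

def lle_weights(X, i, nbrs, reg=1e-3):
    """Reconstruction weights Wij for one point Xi (the LLE weight step).

    Minimizes ||Xi - sum_j Wij Xj||^2 subject to sum_j Wij = 1, which
    reduces to a linear system on the local Gram matrix of the neighbors."""
    Z = X[nbrs] - X[i]                             # shift neighbors to the origin
    G = Z @ Z.T                                    # local Gram matrix
    G = G + reg * np.trace(G) * np.eye(len(nbrs))  # regularize (G may be singular)
    w = np.linalg.solve(G, np.ones(len(nbrs)))
    return w / w.sum()                             # enforce the sum-to-one constraint

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
D = np.linalg.norm(X - X[0], axis=1)
nbrs = np.argsort(D)[1:6]          # 5 nearest neighbors of point 0
w = lle_weights(X, 0, nbrs)
```

The sum-to-one constraint is what makes these weights invariant to translation, as the next slide notes.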

Page 14

The weights computed for reconstruction are invariant with respect to translation, rotation and rescaling

The same weights should reconstruct the map in reduced dimensions

So we can conclude that the local geometry is preserved

Page 15

Page 16

Shortcomings of Isomap
◦ Needs a dense sampling of data on the manifold
◦ If k is chosen very small, the residual error will be too large
◦ If k is chosen very large, short-circuiting may happen

Shortcomings of LLE
◦ Due to its local nature, it doesn't give a complete picture of the data
◦ Again, problems with the selection of k

Page 17

Short-circuiting
◦ "When the distance between the folds is very small, or there is noise such that a point from a different fold is chosen as a neighbour of the point, the computed distance does not represent the geodesic distance and hence the algorithm fails"

Insight
◦ This problem arises from selecting neighbors solely on the basis of their Euclidean distance. The basic selection criteria are:
 Select all the points within a ball of radius ε
 Select the K nearest neighbors
◦ Locally Linear Isomap overcomes this problem by modifying the neighbor-selection criterion
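The two basic selection criteria can be sketched as follows (toy helper functions, not from the slides):

```python
import numpy as np

def neighbors_eps(X, i, eps):
    """All points within a ball of radius eps around X[i] (excluding i itself)."""
    D = np.linalg.norm(X - X[i], axis=1)
    return np.flatnonzero((D > 0) & (D <= eps))

def neighbors_knn(X, i, k):
    """The k nearest neighbors of X[i] by Euclidean distance."""
    D = np.linalg.norm(X - X[i], axis=1)
    return np.argsort(D)[1:k + 1]

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
close = neighbors_eps(X, 0, 1.5)   # points within distance 1.5 of point 0
knn = neighbors_knn(X, 0, 2)       # the 2 nearest neighbors of point 0
```

Both criteria look only at Euclidean distance, so a point from a nearby fold can slip into the neighborhood, which is exactly the short-circuiting failure described above.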

Proposed algorithm – KLL Isomaps

Page 18

Similar to Tenenbaum's Isomap except for the selection of nearest neighbors

Previous algorithms (Isomap, LLE) consider only the Euclidean distance as the neighborhood criterion

The proposed algorithm is:
◦ Find a candidate neighborhood using the K-nn approach
◦ Reconstruct the data point from the candidate neighbors (as in LLE) so as to minimize the reconstruction error
◦ Neighbours whose Euclidean distance is small, and those lying on the locally linear patch of the manifold, get higher weights and hence are selected preferentially
◦ Now KLL ≤ K neighbors are chosen based on the reconstruction weights
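A sketch of this neighbor-selection step (the function name and the regularization term are illustrative assumptions, not from the paper):

```python
import numpy as np

def kll_neighbors(X, i, K=10, K_LL=5, reg=1e-3):
    """K-nn candidate neighborhood, pruned to the K_LL highest-weight neighbors."""
    D = np.linalg.norm(X - X[i], axis=1)
    cand = np.argsort(D)[1:K + 1]             # step 1: K-nn candidate neighborhood
    Z = X[cand] - X[i]                        # step 2: LLE-style reconstruction
    G = Z @ Z.T                               # local Gram matrix of the candidates
    G = G + reg * np.trace(G) * np.eye(K)     # regularize so G is invertible
    w = np.linalg.solve(G, np.ones(K))
    w = w / w.sum()                           # reconstruction weights (sum to one)
    return cand[np.argsort(w)[::-1][:K_LL]]   # step 3: keep the top K_LL by weight

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
kept = kll_neighbors(X, 0, K=10, K_LL=4)
```

Candidates off the locally linear patch receive low (or negative) reconstruction weights and are discarded, which is how the modified criterion avoids short-circuit edges.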

Page 19

KLL Isomap has been demonstrated to perform better than Isomap on:
◦ Sparsely sampled data
◦ Noisy data
◦ Dense data without noise

The metrics used for the analysis are "short-circuit edges" and "residual variance"

It remains to establish a formal proof of the better performance of this algorithm

Page 20

Tenenbaum J.B., de Silva V., Langford J.C.: A Global Geometric Framework for Nonlinear Dimensionality Reduction

Roweis S.T., Saul L.K.: Nonlinear Dimensionality Reduction by Locally Linear Embedding

Balasubramanian M., Schwartz E.L., Tenenbaum J.B., de Silva V., Langford J.C.: The Isomap Algorithm and Topological Stability

de Silva V., Tenenbaum J.B.: Global versus local methods in nonlinear dimensionality reduction

Roweis S.T., Saul L.K.: An Introduction to Locally Linear Embedding

Saxena A., Gupta A., Mukherjee A.: Non-linear Dimensionality Reduction by Locally Linear Isomaps

Page 21

Page 22