Atul Singh, Junior Undergraduate, CSE, IIT Kanpur


Page 1

Atul Singh
Junior Undergraduate

CSE, IIT Kanpur

Page 2

Dimension reduction is a technique used to represent high-dimensional data in a more compact way using some alternate representation

Many data-generation processes produce large data sets that lie on a low-dimensional manifold embedded in a high-dimensional space. Dimension reduction can be applied to such data sets

Page 3

In statistical pattern recognition, we often encounter data sets in a high-dimensional space

Often the data is correlated in such a manner that there are very few independent dimensions

It is then possible to represent the data using much lower dimensions. Some benefits are:
◦ Compact representation
◦ Less processing time
◦ Visualization of high-dimensional data
◦ Interpolation along meaningful dimensions

Page 4

Linear Methods
◦ Principal Component Analysis (PCA)
◦ Independent Component Analysis (ICA)
◦ Multi-dimensional Scaling (MDS)

Non-linear Methods
◦ Global
 Isomap and its variants
◦ Local
 Locally Linear Embedding (LLE)
 Laplacian Eigenmaps

Page 5

Principal Component Analysis
◦ Involves finding the directions of largest covariance
◦ Expresses the data as a linear combination of the eigenvectors along those directions

Multidimensional Scaling
◦ Keeps inter-point distances invariant
◦ Again a linear methodology
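As a minimal sketch of the PCA recipe above (not from the slides; the data here is synthetic), the principal directions are the top eigenvectors of the data's covariance matrix:

```python
import numpy as np

def pca(X, d):
    """Project X (n samples x D features) onto its top-d principal directions."""
    Xc = X - X.mean(axis=0)                    # center the data
    C = np.cov(Xc, rowvar=False)               # D x D covariance matrix
    vals, vecs = np.linalg.eigh(C)             # eigh returns ascending eigenvalues
    top = vecs[:, np.argsort(vals)[::-1][:d]]  # directions of largest covariance
    return Xc @ top                            # data as combinations along them

# Toy data: 3-D points that actually vary only along a 2-D plane
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 3))
Y = pca(X, 2)
```

Because the toy data is an exact linear mixture of two latent coordinates, the two retained components capture essentially all of its variance, which is the sense in which PCA is a linear method.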

Page 6

Many data sets can't be represented as linear combinations of some vectors
◦ Examples – Swiss roll, faces data, etc.

In general, the low-dimensional data is embedded in some non-linear manifold

It is not possible to transform these manifolds into a low-dimensional space using only translation, rotation, and rescaling

Page 7

Linear methods have a preconceived notion of dimension reduction

The goal is to automate both the estimation process (infer the degrees of freedom of the data manifold) and the embedding process (embed the data in a low-dimensional space)

So we need to go beyond linear methods, to non-linear methods:
◦ Isomap
◦ Locally Linear Embedding

Page 8

Isomap is a global manifold-learning algorithm

A Swiss roll (fig. a) embedded as a manifold in a high-dimensional space; the goal is to reduce it to two dimensions (fig. c)

Page 9

Consider a human face:
◦ How is it represented/stored in the brain?
◦ How is it represented/stored in a computer?

Do we need to store all the information (every pixel)?

Just need to figure out some important structures

Page 10

The basic steps involved are:
◦ Construct the neighborhood graph: determine the neighbors of each point and assign edge weights dX(i, j) to the graph thus formed
◦ Compute geodesic distances: estimate the geodesic distances dM(i, j) using Dijkstra's algorithm on the graph
◦ Use MDS to lower the dimensions: apply MDS to the computed shortest-path distance matrix and thus reduce the dimensionality

So Isomap is basically MDS, just using geodesic distances rather than Euclidean ones.
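The three steps above can be sketched as follows (an illustrative toy implementation, not the authors' code; a 1-D curve in 3-D stands in for the Swiss roll as test data):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def isomap(X, k=7, d=2):
    """Sketch of Isomap: k-NN graph -> geodesic distances -> classical MDS."""
    n = len(X)
    D = cdist(X, X)                       # pairwise Euclidean distances dX(i, j)
    # 1. Neighborhood graph: keep edges only to each point's k nearest neighbors
    G = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]
        G[i, nbrs] = D[i, nbrs]
    # 2. Geodesic distances dM(i, j): shortest paths through the graph (Dijkstra)
    DM = shortest_path(csr_matrix(G), method="D", directed=False)
    # 3. Classical MDS on the geodesic (shortest-path) distance matrix
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (DM ** 2) @ J          # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:d]      # top-d eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# A noiseless "roll": a curve in 3-D, flattened into the plane
t = np.linspace(0, 3 * np.pi, 120)
X = np.c_[np.cos(t), np.sin(t), 0.2 * t]
Y = isomap(X, k=6, d=2)
```

Replacing `DM` with the raw Euclidean matrix `D` in step 3 would give plain MDS, which is exactly the difference the slide describes.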

Page 11

Page 12

Locally Linear Embedding (LLE) is another non-linear dimension reduction algorithm

Follows a local approach to reduce the dimensions

The idea is based on the assumption that each point on the manifold lies on a hyperplane determined by the point and some of its nearest neighbors

Key question – how do we combine these hyperplane patches and map them to a low-dimensional space?

Page 13

The basic steps involved are:
◦ Assign neighbors: to each point, assign neighbors using a nearest-neighbor approach
◦ Weight calculation: compute weights Wij such that Xi is best reconstructed from its neighbors
◦ Compute the low-dimensional embedding: using the weight matrix computed above, find the corresponding embedding vectors Yi in the lower-dimensional space by minimizing an error function
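The weight-calculation step can be sketched for a single point as follows (an illustrative implementation; the regularization term `reg` is an assumption added here to keep the local Gram matrix invertible):

```python
import numpy as np

def lle_weights(X, i, nbrs, reg=1e-3):
    """Reconstruction weights Wij for one point Xi (the LLE weight step).

    Minimizes ||Xi - sum_j Wij Xj||^2 subject to sum_j Wij = 1, which
    reduces to a linear system on the local Gram matrix of the neighbors."""
    Z = X[nbrs] - X[i]                             # shift neighbors to the origin
    G = Z @ Z.T                                    # local Gram matrix
    G = G + reg * np.trace(G) * np.eye(len(nbrs))  # regularize (G may be singular)
    w = np.linalg.solve(G, np.ones(len(nbrs)))
    return w / w.sum()                             # enforce the sum-to-one constraint

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
D = np.linalg.norm(X - X[0], axis=1)
nbrs = np.argsort(D)[1:6]          # 5 nearest neighbors of point 0
w = lle_weights(X, 0, nbrs)
```

The sum-to-one constraint is what makes these weights invariant to translation, as the next slide notes.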

Page 14

The weights computed for reconstruction are invariant with respect to translation, rotation and rescaling

The same weights should reconstruct the map in reduced dimensions

So we can conclude that the local geometry is preserved

Page 15

Page 16

Shortcomings of Isomap
◦ Needs a dense sampling of data on the manifold
◦ If k is chosen very small, the residual error will be too large
◦ If k is chosen very large, short-circuiting may happen

Shortcomings of LLE
◦ Due to its local nature, it doesn't give a complete picture of the data
◦ Again, problems with the selection of k

Page 17

Short-circuiting
◦ "When the distance between the folds is very small, or there is noise such that a point from a different fold is chosen as a neighbour of the point, the computed distance does not represent the geodesic distance and hence the algorithm fails"

Insight
◦ This problem arises from selecting neighbors solely on the basis of their Euclidean distance. The basic selection criteria are:
 Select all the points within a ball of radius ε
 Select the K nearest neighbors
◦ Locally Linear Isomap overcomes this problem by modifying the neighbor-selection criterion
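The two basic selection criteria can be sketched as follows (toy helper functions, not from the slides):

```python
import numpy as np

def neighbors_eps(X, i, eps):
    """All points within a ball of radius eps around X[i] (excluding i itself)."""
    D = np.linalg.norm(X - X[i], axis=1)
    return np.flatnonzero((D > 0) & (D <= eps))

def neighbors_knn(X, i, k):
    """The k nearest neighbors of X[i] by Euclidean distance."""
    D = np.linalg.norm(X - X[i], axis=1)
    return np.argsort(D)[1:k + 1]

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
close = neighbors_eps(X, 0, 1.5)   # points within distance 1.5 of point 0
knn = neighbors_knn(X, 0, 2)       # the 2 nearest neighbors of point 0
```

Both criteria look only at Euclidean distance, so a point from a nearby fold can slip into the neighborhood, which is exactly the short-circuiting failure described above.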

Proposed algorithm – KLL Isomaps

Page 18

Similar to Tenenbaum's Isomap except for the selection of nearest neighbors

Previous algorithms (Isomap, LLE) consider only the Euclidean distance as the neighborhood criterion

The proposed algorithm is:
◦ Find a candidate neighborhood using the K-nn approach
◦ Reconstruct the data point from the candidate neighbors (as in LLE) so as to minimize the reconstruction error
◦ Neighbours whose Euclidean distance is small, and those lying on the locally linear patch of the manifold, get higher weights and hence are selected preferentially
◦ Now KLL ≤ K neighbors are chosen based on the reconstruction weights
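A sketch of this neighbor-selection step (the function name and the regularization term are illustrative assumptions, not from the paper):

```python
import numpy as np

def kll_neighbors(X, i, K=10, K_LL=5, reg=1e-3):
    """K-nn candidate neighborhood, pruned to the K_LL highest-weight neighbors."""
    D = np.linalg.norm(X - X[i], axis=1)
    cand = np.argsort(D)[1:K + 1]             # step 1: K-nn candidate neighborhood
    Z = X[cand] - X[i]                        # step 2: LLE-style reconstruction
    G = Z @ Z.T                               # local Gram matrix of the candidates
    G = G + reg * np.trace(G) * np.eye(K)     # regularize so G is invertible
    w = np.linalg.solve(G, np.ones(K))
    w = w / w.sum()                           # reconstruction weights (sum to one)
    return cand[np.argsort(w)[::-1][:K_LL]]   # step 3: keep the top K_LL by weight

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
kept = kll_neighbors(X, 0, K=10, K_LL=4)
```

Candidates off the locally linear patch receive low (or negative) reconstruction weights and are discarded, which is how the modified criterion avoids short-circuit edges.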

Page 19

KLL Isomap has been demonstrated to perform better than Isomap on:
◦ Sparsely sampled data
◦ Noisy data
◦ Dense data without noise

The metrics used for the analysis are "short-circuit edges" and "residual variance"

It remains to establish a formal proof of the better performance of this algorithm

Page 20

Tenenbaum J.B., de Silva V., Langford J.C.: A Global Geometric Framework for Nonlinear Dimensionality Reduction

Roweis S.T., Saul L.K.: Nonlinear Dimensionality Reduction by Locally Linear Embedding

Balasubramanian M., Schwartz E.L., Tenenbaum J.B., de Silva V., Langford J.C.: The Isomap Algorithm and Topological Stability

de Silva V., Tenenbaum J.B.: Global versus local methods in nonlinear dimensionality reduction

Roweis S.T., Saul L.K.: An Introduction to Locally Linear Embedding

Saxena A., Gupta A., Mukherjee A.: Non-linear Dimensionality Reduction by Locally Linear Isomaps

Page 21

Page 22