Multidimensional Scaling
Applied Multivariate Statistics – Spring 2013
Outline
Fundamental Idea
Classical Multidimensional Scaling
Non-metric Multidimensional Scaling
Basic Idea
How to represent in two dimensions?
Idea 1: Projection
Idea 2: Squeeze on table
Close points stay close
Which idea is better?
Idea of MDS
Represent high-dimensional point cloud in few (usually 2)
dimensions keeping distances between points similar
Classical/Metric MDS: Use a clever projection
R: cmdscale
Non-metric MDS: Squeeze data onto the table; only the ranks of the
distances are conserved
R: isoMDS
Classical MDS
Problem: Given Euclidean distances among points, recover
the positions of the points!
Example: Road distances between 21 European cities
(almost Euclidean, but not quite)
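A minimal R sketch of this example. It assumes that the 21-city road distances are the `eurodist` data set that ships with R; the slides do not name the data object, so treat that as an assumption.

```r
# Road distances between 21 European cities, a 'dist' object from R's
# built-in datasets (assumed here to be the data shown on the slide).
d <- eurodist

# Classical MDS: recover 2-D coordinates from the distances alone.
pos <- cmdscale(d, k = 2)

# Plot the recovered configuration, labelling each point with its city.
plot(pos, type = "n", asp = 1,
     xlab = "Coordinate 1", ylab = "Coordinate 2")
text(pos, labels = rownames(pos), cex = 0.7)
```

The orientation of the resulting map is arbitrary, so north need not point up.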
…
Classical MDS
First try:
Classical MDS
Flip axes:
Can identify points up to
- shift
- rotation
- reflection
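Why positions are only identified up to shift, rotation and reflection can be checked numerically. A small sketch on simulated points (the data, the angle and the shift vector are arbitrary choices for illustration):

```r
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)          # 10 synthetic points in 2-D

angle <- pi / 3
Rot  <- matrix(c(cos(angle), sin(angle),
                 -sin(angle), cos(angle)), 2, 2)   # rotation matrix
Flip <- diag(c(1, -1))                              # flip the second axis

# Rotate, reflect, then shift every point by the same vector.
Y <- sweep(X %*% Rot %*% Flip, 2, c(5, -3), "+")

# The pairwise distances agree up to rounding error, so distances alone
# cannot distinguish configuration X from configuration Y.
max(abs(dist(X) - dist(Y)))
```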
Classical MDS
Another example: Air pollution in US cities
The ranges of manu and popul are much bigger than the range of
wind
Need to standardize to give every variable equal weight
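A sketch of the standardization step. The data frame below is simulated, with arbitrary ranges chosen only to mimic "two wide variables, one narrow variable"; it is not the real air pollution data, and only the column names are taken from the slide.

```r
set.seed(1)
n <- 30
# Simulated stand-ins for the three variables named on the slide.
air <- data.frame(
  manu  = runif(n, 30, 3000),
  popul = runif(n, 70, 3400),
  wind  = runif(n, 6, 13)
)

d_raw <- dist(air)          # dominated by manu and popul
d_std <- dist(scale(air))   # every variable centered and scaled to sd 1

# Classical MDS on the standardized distances.
pos <- cmdscale(d_std, k = 2)

# After scale(), each column has standard deviation 1.
apply(scale(air), 2, sd)
```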
Classical MDS
Classical MDS: Theory
Input: Euclidean distances between n objects in q
dimensions
Output: Position of points up to rotation, reflection, shift
Two steps:
- Compute inner products matrix B from the distances
- Compute positions from B
Classical MDS: Theory – Step 1
Inner products matrix $B = XX^T$, where $X$ is the $n \times q$ data matrix:
$b_{ij} = \sum_{k=1}^{q} x_{ik} x_{jk}$
Connect to distance:
$d_{ij}^2 = \sum_{k=1}^{q} (x_{ik} - x_{jk})^2 = \ldots = b_{ii} + b_{jj} - 2 b_{ij}$
Center points to avoid shift invariance:
$\bar{x} = 0 \rightarrow \sum_{i=1}^{n} x_{ik} = 0 \rightarrow \sum_{i} b_{ij} = 0$ and $\sum_{j} b_{ij} = 0$
Invert relationship:
$b_{ij} = -\frac{1}{2} \left( d_{ij}^2 - d_{i\cdot}^2 - d_{\cdot j}^2 + d_{\cdot\cdot}^2 \right)$
("doubly centered"; a dot denotes the mean over that index)
(Hint for middle of page 108: Plug in (4.3) and equations on
top of page 108 to show that the expression involving d's is
equal to $b_{ij}$)
Thus, we obtained B from the distance matrix
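Step 1 can be checked numerically. The sketch below uses simulated, centered points (an assumption for illustration) and verifies that double centering of the squared distances gives back the inner products.

```r
set.seed(1)
X <- matrix(rnorm(15), nrow = 5)              # 5 synthetic points, q = 3
X <- scale(X, center = TRUE, scale = FALSE)   # center: column means are 0
n <- nrow(X)

D2 <- as.matrix(dist(X))^2                    # squared distances d_ij^2

# b_ij = -1/2 (d_ij^2 - row mean - column mean + grand mean)
B <- -0.5 * (D2 -
             outer(rowMeans(D2), rep(1, n)) -
             outer(rep(1, n), colMeans(D2)) +
             mean(D2))

# B should agree with X X^T up to rounding error.
max(abs(B - X %*% t(X)))
```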
Classical MDS: Theory – Step 2
Since $B = XX^T$, we need the "square root" of B
B is a symmetric and positive semidefinite $n \times n$ matrix
Thus, B can be diagonalized: $B = V \Lambda V^T$
$\Lambda$ is a diagonal matrix with $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$ on the diagonal
("eigenvalues")
V contains as columns normalized eigenvectors
Some eigenvalues will be zero; drop them: $B = V_1 \Lambda_1 V_1^T$
Take "square root": $X = V_1 \Lambda_1^{1/2}$
Thus we obtained the position of points from the distances
between all points
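Step 2 in R, again on simulated points. The threshold `1e-8` for "numerically zero" eigenvalues is an arbitrary choice made for this sketch.

```r
set.seed(1)
X <- matrix(rnorm(30), nrow = 10)     # 10 synthetic points in q = 3 dimensions
n <- nrow(X)

D2 <- as.matrix(dist(X))^2
J  <- diag(n) - matrix(1 / n, n, n)   # centering matrix
B  <- -0.5 * J %*% D2 %*% J           # doubly centered inner products

e    <- eigen(B, symmetric = TRUE)    # eigenvalues in decreasing order
keep <- e$values > 1e-8               # drop the (numerically) zero eigenvalues
V1   <- e$vectors[, keep, drop = FALSE]
L1   <- e$values[keep]

# X = V1 Lambda1^(1/2)
X_hat <- V1 %*% diag(sqrt(L1), nrow = length(L1))

# The recovered points reproduce the original distances up to rounding.
max(abs(dist(X_hat) - dist(X)))
```

The recovered `X_hat` generally differs from `X` by a rotation, reflection and shift; only the distances are compared.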
Classical MDS: Low-dim representation
Keep only few (e.g. 2) largest eigenvalues and
corresponding eigenvectors
The resulting X will be the low-dimensional representation
we were looking for
Goodness of fit (GOF) if we reduce to m dimensions:
$\mathrm{GOF} = \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{n} \lambda_i}$
(should be at least 0.8)
Finds "optimal" low-dim representation: Minimizes
$S = \sum_{i=1}^{n} \sum_{j=1}^{n} \left( d_{ij}^2 - (d_{ij}^{(m)})^2 \right)$
where $d_{ij}^{(m)}$ is the distance between points i and j in the m-dimensional representation
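A sketch of the goodness-of-fit computation for Euclidean input. The point cloud is simulated, with column scales chosen arbitrarily so that two directions dominate.

```r
set.seed(1)
# 50 synthetic points in 4 dimensions with unequal spread per column.
X <- matrix(rnorm(200), ncol = 4) %*% diag(c(5, 3, 1, 0.5))

m   <- 2
fit <- cmdscale(dist(X), k = m, eig = TRUE)

lambda <- fit$eig                        # all n eigenvalues, decreasing
gof    <- sum(lambda[1:m]) / sum(lambda) # GOF formula from the slide

# Criterion S: squared input distances minus squared distances in the
# m-dimensional representation, summed over all pairs i, j.
D2  <- as.matrix(dist(X))^2
D2m <- as.matrix(dist(fit$points))^2
S   <- sum(D2 - D2m)

gof
S
```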
Classical MDS: Pros and Cons
+ Optimal for Euclidean input data
+ Still optimal, if B has non-negative eigenvalues
(pos. semidefinite)
+ Very fast
- No guarantees if B has negative eigenvalues
However, in practice, it is still used then. New measures for
Goodness of fit:
$\mathrm{GOF} = \frac{\sum_{i=1}^{m} |\lambda_i|}{\sum_{i=1}^{n} |\lambda_i|}$
$\mathrm{GOF} = \frac{\sum_{i=1}^{m} \lambda_i^2}{\sum_{i=1}^{n} \lambda_i^2}$
$\mathrm{GOF} = \frac{\sum_{i=1}^{m} \max(0, \lambda_i)}{\sum_{i=1}^{n} \max(0, \lambda_i)}$
Measures based on $|\lambda_i|$ and $\max(0, \lambda_i)$ are used in R function "cmdscale"
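A sketch comparing the three measures on distances that are not exactly Euclidean. It assumes the built-in `eurodist` road distances as input. The `GOF` component returned by `cmdscale` is a vector of two values, which are compared with the hand-computed versions at the end.

```r
m   <- 2
fit <- cmdscale(eurodist, k = m, eig = TRUE)
lambda <- fit$eig                 # all n eigenvalues, decreasing

range(lambda)                     # negative values signal non-Euclidean input

gof_abs <- sum(abs(lambda[1:m]))     / sum(abs(lambda))
gof_sq  <- sum(lambda[1:m]^2)        / sum(lambda^2)
gof_pos <- sum(pmax(lambda[1:m], 0)) / sum(pmax(lambda, 0))

c(abs = gof_abs, squared = gof_sq, positive = gof_pos)
fit$GOF                           # the two values reported by cmdscale
```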
Non-metric MDS: Idea
Sometimes, there is no strict metric on original points
Example: How beautiful are these persons?
(1: Not at all, 10: Very much)
Ratings 2, 6, 9 OR 1, 5, 10 ??
Non-metric MDS: Idea
Absolute values are not that meaningful
Ranking is important
Non-metric MDS finds a low-dimensional representation which
respects the ranking of distances
Non-metric MDS: Theory
$\delta_{ij}$ is the true dissimilarity, $d_{ij}$ is the distance of representation
Minimize STRESS ($\theta$ is an increasing function):
$S = \frac{\sum_{i<j} (\theta(\delta_{ij}) - d_{ij})^2}{\sum_{i<j} d_{ij}^2}$
Optimize over both position of points and $\theta$
$\hat{d}_{ij} = \theta(\delta_{ij})$ is called "disparity"
Solved numerically (isotonic regression);
Classical MDS as starting value;
very time consuming
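A sketch of how S could be evaluated for a fitted configuration, assuming the built-in `eurodist` distances as input. `isoMDS` and `Shepard` come from package MASS; `Shepard` returns the sorted dissimilarities, the fitted distances and the isotonic regression fit (the disparities).

```r
library(MASS)

d   <- eurodist
fit <- isoMDS(d, k = 2, trace = FALSE)   # default start: cmdscale(d, k)

sh <- Shepard(d, fit$points)
# sh$x : dissimilarities delta_ij, sorted increasingly
# sh$y : distances d_ij in the fitted configuration
# sh$yf: disparities, the isotonic regression of d_ij on delta_ij

# STRESS as written on the slide.
S <- sum((sh$yf - sh$y)^2) / sum(sh$y^2)

S
fit$stress   # isoMDS reports its own stress value in percent
```

The value in `fit$stress` is on a percent scale, and its exact definition may differ from S as written here, so the two numbers need not coincide.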
Non-metric MDS: Example for intuition (only)
[Figure: true points A, B, C in high-dimensional space with $\delta_{AB} < \delta_{BC} < \delta_{AC}$; edge labels 3, 2, 5; compute best representation: STRESS = 19.7]
Non-metric MDS: Example for intuition (only)
[Figure: true points A, B, C in high-dimensional space with $\delta_{AB} < \delta_{BC} < \delta_{AC}$; edge labels 2.7, 2, 4.8; compute best representation: STRESS = 20.1]
Non-metric MDS: Example for intuition (only)
[Figure: true points A, B, C in high-dimensional space with $\delta_{AB} < \delta_{BC} < \delta_{AC}$; edge labels 2.9, 2, 5.2; compute best representation: STRESS = 18.9]
We will finally represent the "transformed true distances" (called disparities):
$\hat{d}_{AB} = 2$, $\hat{d}_{BC} = 2.9$, $\hat{d}_{AC} = 5.2$
instead of the true distances:
$\delta_{AB} = 2$, $\delta_{BC} = 3$, $\delta_{AC} = 5$
Stop if minimal STRESS is found.
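How disparities arise from isotonic regression can be seen on three numbers. The distances below are hypothetical and invented for this sketch; they are not meant to reproduce the STRESS values on the slides.

```r
# Hypothetical distances in some candidate configuration, listed in order
# of increasing dissimilarity: AB, BC, AC. Invented for illustration.
d_conf <- c(3, 2, 5)

# Isotonic regression: the non-decreasing sequence closest to d_conf in
# the least-squares sense. AB and BC violate the order and get pooled.
disparity <- isoreg(d_conf)$yf

# STRESS for this configuration, given the disparities.
S <- sum((disparity - d_conf)^2) / sum(d_conf^2)

disparity   # 2.5 2.5 5.0
S
```

The optimizer then moves the points to reduce S, recomputes the disparities, and repeats.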
Non-metric MDS: Pros and Cons
+ Fulfills a clear objective without many assumptions
(minimize STRESS)
+ Results don’t change with rescaling or monotonic variable
transformation
+ Works even if you only have rank information
- Slow in large problems
- Usually only local (not global) optimum found
- Only gets ranks of distances right
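The invariance under monotonic transformation can be checked directly. The sketch assumes the built-in `eurodist` distances and gives both runs the same starting configuration, because the default start (`cmdscale` of the input) would otherwise differ between the two inputs.

```r
library(MASS)

d    <- eurodist
init <- cmdscale(d, k = 2)     # common starting configuration

fit_raw  <- isoMDS(d,       y = init, trace = FALSE)
fit_sqrt <- isoMDS(sqrt(d), y = init, trace = FALSE)  # monotone transform

# Only the ranks of the dissimilarities enter the fit, so the two
# configurations should agree.
max(abs(fit_raw$points - fit_sqrt$points))
```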
Non-metric MDS: Example
Do people in the same party vote alike?
Data: for each pair of 15 congressmen, the number of votes
(out of 19) on which they disagreed
…
Non-metric MDS: Example
Concepts to know
Classical MDS:
- Finds low-dim projection that respects distances
- Optimal for Euclidean distances
- No clear guarantees for other distances
- Fast
Non-metric MDS:
- Squeezes data points on table
- Respects only rankings of distances
- (Locally) solves clear objective
- Slow
R commands to know
cmdscale included in standard R distribution
isoMDS from package “MASS”