Advanced Studies in Applied Statistics (WBL), ETH Zurich
Applied Multivariate Statistics, Week 3
Lecturer: Beate Sick
Remark: Much of the material has been developed together with Oliver Dürr.
Topics of today
• Similarity and Distances
• Numeric data
• Categorical data
• Mixed data types
• Multidimensional Scaling
• 2D plots of high dimensional data starting from pair-wise distances
• Outlier detection
• Univariate outlier detection by visual checks and additional tests
• Multivariate outlier detection
• Parametric: squared Mahalanobis distances and Chi-Square test
• Non-parametric: Robust PCA for multi-variate outlier detection
What is Similarity?
"The quality or state of being similar; likeness; resemblance; as, a similarity of features." (Webster's Dictionary)
Similarity is hard to define, but "we know it when we see it". The real meaning of similarity is a philosophical question; we will take a more pragmatic approach.
Defining Distance Measures (Recap)
Definition: Let O1 and O2 be two objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by d(O1, O2).
(Figure: example distance values between the strings "Peter" and "Piotr", e.g. 0.23, 3, 342.7, depending on the chosen measure.)
(Dis-)similarity / Distance
For pairs of objects we distinguish:
• Similarity (large ⇒ similar): vague definition
• Dissimilarity (small ⇒ similar): Rules 1-3
• Distance, metric (small ⇒ similar): Rule 4 in addition
Rules:
1. d(O1, O2) ≥ 0 (non-negativity)
2. d(O1, O1) = 0
3. d(O1, O2) = d(O2, O1) (symmetry)
4. d(O1, O3) ≤ d(O1, O2) + d(O2, O3) (triangle inequality)
Examples of metrics (more follow with the examples):
• Euclidean and other Lp metrics
• Jaccard distance (1 - Jaccard index)
• Graph distance (shortest path)
Example of a Metric
Task 1
• Draw 3 objects on a piece of paper and measure their pairwise distances (e.g. with a ruler).
• Is this a proper distance? Are Axioms 1-4 fulfilled?
Task 2
• The 3 entities A, B, C have the dissimilarities:
d(A,B) = 1, d(B,C) = 1, d(A,C) = 3
• Is this dissimilarity a distance?
• Can you draw them on a piece of paper? (Hint: check the triangle inequality.)
A problematic case: word maps
Try to make a word map with: Bank, Finance, Sitting.
Triangle inequality: not just a mathematical gimmick!
The triangle inequality would imply:
d("sitting", "finance") ≤ d("sitting", "bank") + d("bank", "finance")
Since "bank" is similar to both "finance" (a financial institution) and "sitting" (a bench), the right-hand side is small, yet "sitting" and "finance" are very dissimilar: such semantic dissimilarities can violate the triangle inequality.
We live in a Euclidean space
If we are presented objects in the two-dimensional plane, we intuitively assume Euclidean distances between the objects.
(Figure: scatter plot of four points p1, p2, p3, p4 in the plane.)
Euclidean Distance and its Generalization
Each observation is described by p features. The Euclidean distance between two observations $o_i$, $o_j$ described by p numeric features is
$d_2(o_i, o_j) = \sqrt{\sum_{k=1}^{p} (o_{ik} - o_{jk})^2}$
The Minkowski distance is its generalization:
$d_r(o_i, o_j) = \left( \sum_{k=1}^{p} |o_{ik} - o_{jk}|^r \right)^{1/r}$
2D example (2 features per observation):
obs       x1  x2
o1 = p1    0   2
o2 = p2    2   0
o3 = p3    3   1
o4 = p4    5   1
(Figure: the four points plotted in the (x1, x2) plane.)
Example: $d_2(o_2, o_3) = \sqrt{(2-3)^2 + (0-1)^2} = \sqrt{2}$
L1: Manhattan Distance
(Figure: city-block grid with two points A and B; one block is one unit. Image from Wikipedia.)
• How many blocks do you have to walk from A to B?
• What is the L1 distance (Minkowski with r = 1) from A to B?
• What is the Euclidean distance?
Recall: $d_r(o_i, o_j) = \left( \sum_{k=1}^{p} |o_{ik} - o_{jk}|^r \right)^{1/r}$
Minkowski Distances
r = 1: City block (Manhattan, taxicab, L1 norm) distance:
$d_1(o_i, o_j) = \sum_{k=1}^{p} |o_{ik} - o_{jk}|$
r = 2: Euclidean distance (L2 norm):
$d_2(o_i, o_j) = \sqrt{\sum_{k=1}^{p} (o_{ik} - o_{jk})^2}$
r = ∞: "Supremum" or maximum (Lmax norm, L∞ norm) distance; this is the maximum difference between any component of the vectors:
$d_\infty(o_i, o_j) = \max_{k=1,\dots,p} |o_{ik} - o_{jk}|$
Distance matrix
As discussed on the last couple of slides, there are different possibilities to determine the pairwise distance between two observations $o_i$ and $o_j$.
We can collect all these pairwise distances $d_{ij} = d(o_i, o_j)$ in a distance matrix:
$\mathbf{D} = \begin{pmatrix} 0 & d_{12} & \cdots & d_{1n} \\ d_{21} & 0 & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & 0 \end{pmatrix}$
All diagonal elements are 0: $d_{kk} = d(o_k, o_k) = 0$.
Symmetry: $d_{ij} = d(o_i, o_j) = d(o_j, o_i) = d_{ji}$.
(A small R sketch with dist() follows below.)
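A minimal R sketch (using the four example points from the 2D example above): the base function dist() computes the Minkowski-type distances discussed so far and returns the lower triangle of the distance matrix.

X = rbind(p1 = c(0, 2), p2 = c(2, 0), p3 = c(3, 1), p4 = c(5, 1))
dist(X, method = "euclidean")  # L2 distances, e.g. d(p2, p3) = sqrt(2)
dist(X, method = "manhattan")  # L1 / city-block distances
dist(X, method = "maximum")    # L-infinity / supremum distances
round(as.matrix(dist(X)), 2)   # full symmetric matrix with 0 diagonal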
How to calculate dissimilarities with categorical variables?
Similarity measures for binary data
A common situation is that the objects $o_1 = (o_{11}, o_{12}, \dots, o_{1p})$ and $o_2 = (o_{21}, o_{22}, \dots, o_{2p})$ have only binary attributes, for example gender (f/m), driving license (yes/no), Nobel prize holder (yes/no).
We distinguish between symmetric and asymmetric binary variables.
In symmetric binary variables, both levels have roughly comparable frequencies. Example: gender.
In asymmetric binary variables, the two levels have very different frequencies. Example: Nobel prize holder.
Similarity measures for "symmetric" binary vectors: Simple Matching Coefficient
The objects $o_1 = (o_{11}, o_{12}, \dots, o_{1p})$ and $o_2 = (o_{21}, o_{22}, \dots, o_{2p})$ have only binary attributes.
The Simple Matching Coefficient (SMC) for symmetric binary variables (could be only a subset of the p binary variables) is defined as:
SMC = # matches / # attributes = (M11 + M00) / (M01 + M10 + M11 + M00)
corresponding to the proportion of matching attributes over all attributes, where
M01 = number of attributes where o1i is 0 and o2i is 1
M10 = number of attributes where o1i is 1 and o2i is 0
M00 = number of attributes where o1i is 0 and o2i is 0
M11 = number of attributes where o1i is 1 and o2i is 1
Similarity measures for "asymmetric" binary vectors: Jaccard Coefficient
The objects $o_1 = (o_{11}, o_{12}, \dots, o_{1p})$ and $o_2 = (o_{21}, o_{22}, \dots, o_{2p})$ have only binary attributes.
The Jaccard Coefficient for asymmetric binary variables (could be only a subset of the p binary variables) is defined as:
J = # both-1 matches / # not-both-zero attributes = M11 / (M01 + M10 + M11)
corresponding to the proportion of matching attributes over those attributes which are 1 in at least one of the two observations, where
M01 = number of attributes where o1i is 0 and o2i is 1
M10 = number of attributes where o1i is 1 and o2i is 0
M11 = number of attributes where o1i is 1 and o2i is 1
Example: How similar are two given binary vectors?
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p is 0 and q is 1)
M10 = 1 (the number of attributes where p is 1 and q is 0)
M00 = 7 (the number of attributes where p is 0 and q is 0)
M11 = 0 (the number of attributes where p is 1 and q is 1)
SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
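A minimal R sketch reproducing this example: count the four match types for the binary vectors p and q, then compute SMC and Jaccard from them.

p = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
M11 = sum(p == 1 & q == 1)  # 0
M00 = sum(p == 0 & q == 0)  # 7
M10 = sum(p == 1 & q == 0)  # 1
M01 = sum(p == 0 & q == 1)  # 2
SMC = (M11 + M00) / (M01 + M10 + M11 + M00)  # 0.7
J   = M11 / (M01 + M10 + M11)                # 0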
More than 2 levels (Nominal Data)
Simple mismatching coefficient (ranges between 0 and 1):
$d_{ij} = \frac{mm}{p}$ = proportion of features where the observations differ
mm: number of variables where objects i and j do not match
p: number of features
Character strings can be understood as nominal data. If the strings are of equal length, this is also called the Hamming distance (sometimes without dividing by p).
What is the Hamming distance between HOUSE and MOUSE? (A small R sketch follows below.)
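A minimal R sketch (hamming() is a small helper written here, not a base function): the Hamming distance between two equal-length strings, optionally divided by p to get the mismatching coefficient.

hamming = function(s1, s2) {
  a = strsplit(s1, "")[[1]]
  b = strsplit(s2, "")[[1]]
  stopifnot(length(a) == length(b))  # Hamming distance needs equal lengths
  sum(a != b)                        # number of mismatching positions
}
hamming("HOUSE", "MOUSE")      # 1 (only the first character differs)
hamming("HOUSE", "MOUSE") / 5  # mismatching coefficient: 0.2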
Gower's dissimilarity for mixed data types
Idea: Use a distance measure $d_{ij}^{(k)}$ between 0 and 1 for each corresponding pair of variables (features) in two observations.
- If the kth variable is binary or nominal, use the methods discussed above, e.g.
$d_{ij}^{(k)} = 1 - \frac{M_{11} + M_{00}}{M_{01} + M_{10} + M_{11} + M_{00}}$
- If the kth variable is numeric:
$d_{ij}^{(k)} = \frac{|x_{ik} - x_{jk}|}{R_k}$
where $x_{ik}$ is the value of object i in variable k and $R_k$ is the range of variable k over all objects.
- If the kth variable is ordinal, use normalized ranks and then proceed as with numeric variables.
Aggregate the distance measures over all variables/features/dimensions:
$d_{ij} = \frac{1}{p} \sum_{k=1}^{p} d_{ij}^{(k)}$
Dissimilarity for mixed data types with the R function "daisy", calculating Gower's dissimilarity

> str(flower)
'data.frame': 18 obs. of 8 variables:
 $ V1: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 2 2 ...
 $ V2: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 1 2 2 ...
 $ V3: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
 $ V4: Factor w/ 5 levels "1","2","3",..: 4 2 3 4 5 4 4 2 3 5 ...
 $ V5: Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 2 2 3 3 2 1 2 ...
 $ V6: Ord.factor w/ 18 levels "1"<"2"<"3"<..: 15 3 1 16 2 12 ...
 $ V7: num 25 150 150 125 20 50 40 100 25 100 ...
 $ V8: num 15 50 50 50 15 40 20 15 15 60 ...
> library(cluster)
> dist = daisy(flower, type = list(asymm = c(1, 3), symm = 2, ordratio = 7))
> str(dist)
Classes 'dissimilarity', 'dist' atomic [1:153] 0.901 0.618 ...
 ..- attr(*, "Size")= int 18
 ..- attr(*, "Metric")= chr "mixed"
 ..- attr(*, "Types")= chr [1:8] "A" "S" "A" "N" ...
Visualize the distance matrix: heatmaps are great!

library(cluster)
dist = daisy(flower)
mdist = as.matrix(dist)
library(pheatmap)
pheatmap(mdist)
How to visualize multivariate observations of mixed data types in 2D?
Goal of multidimensional scaling
MDS takes as input the distances between observations (data points) and produces a visualization of the points in 2D.
(Figure: 2D point configuration; the bars between points represent the given distances between them.)
As input of MDS we only know the distances, and we look for a low-dimensional point configuration in which the points have the same, or similar, pairwise distances.
Example for metric MDS
Problem: Given the Euclidean distances among points, recover the positions of the points!
Example: Road distances between 21 European cities (almost Euclidean, but not quite).
(Figure: the distance matrix of the eurodist data.)
The eurodist data set contains the road distances between the 21 cities.
MDS in R:

res.cmd = cmdscale(eurodist)
plot(res.cmd, pch = "")
text(res.cmd, labels = rownames(res.cmd))

(Figure: the resulting 2D configuration res.cmd[,1] vs res.cmd[,2], with city labels from Athens, Barcelona, ... to Stockholm and Vienna.)
The configuration can be
- shifted
- rotated
- reflected
without changing the distances.
Example for metric MDS (continued)
After flipping the vertical axis:
(Figure: the flipped MDS configuration, which now matches the usual geographic orientation of the map of Europe.)
Equivalence of PCA and MDS with Euclidean distance
PCA representation of the data matrix = MDS representation of the Euclidean distance matrix.
MDS on Euclidean distances results in a low-dimensional representation that is equivalent (up to rotation, flipping, and shifts) to PCA on the data matrix (however, within MDS the data matrix must first be derived from the distance matrix). A minimal sketch of this equivalence in R follows below.
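A minimal R sketch of this equivalence (the iris measurements serve only as an assumed example data set): classical MDS via cmdscale() on Euclidean distances reproduces the PCA scores up to sign flips.

X = scale(iris[, 1:4], center = TRUE, scale = FALSE)  # centered data matrix
pca = prcomp(X)
mds = cmdscale(dist(X), k = 2)      # classical (metric) MDS in 2 dimensions
head(round(abs(pca$x[, 1:2]), 3))   # first two PCA scores (absolute values)
head(round(abs(mds), 3))            # the same values, up to sign/reflection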
Distance matrix and 2D plot of multivariate mixed data

library(cluster)
dist = daisy(flower)              # Gower dissimilarities for the mixed-type data
mdist = as.matrix(dist)
library(pheatmap)
pheatmap(mdist)                   # heatmap of the distance matrix

library(MASS)
mds = isoMDS(mdist, k = 2)        # non-metric MDS into 2 dimensions
d.mds = as.data.frame(mds$points)
names(d.mds) = c("c1", "c2")
library(ggplot2)
ggplot(data = d.mds, aes(x = c1, y = c2)) +
  geom_point() +
  geom_text(label = row.names(mdist), hjust = 1.2)
Outlier detection
How much does an observation differ from the average?
Let's standardize and look at the z-score
We start from a variable X with $E(X) = \bar{X} = \mu_X$ and $\mathrm{Var}(X) = \sigma_X^2$ and apply the z-transformation:
$Z = \frac{X - \mu_X}{\sigma_X}$
The standardized variable Z has mean zero and variance one: $E(Z) = 0$, $\mathrm{sd}(Z) = 1$.
Often the z-transformation is applied to different univariate features to make them "comparable". A z-score of -2 always means that the observation is two standard deviations smaller than the average.
In the case of a normally distributed X, we know that $Z \sim N(0,1)$.
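A minimal R sketch of the z-transformation (using the sample mean and sample SD, as the base function scale() does):

x = c(45, 56, 54, 34, 32, 45, 67, 45, 67, 65)
z = scale(x)                   # subtracts mean(x) and divides by sd(x)
round(c(mean(z), sd(z)), 10)   # 0 and 1, as expected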
The z-score has the unit "standard deviation"
(Figure: an axis marked 0, 1sd, 2sd, ..., 10sd: how much is my IQ above/below average?)
Remark: Mean and SD can also be determined for non-normally distributed variables, but the intuition is lost and we might prefer to work with quantiles.
The multivariate normal distribution
$\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, with density
$f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^k |\boldsymbol{\Sigma}|}} \exp\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$
(Figure: density surface and iso-density ellipses of a bivariate normal example $(X, Y)^t \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.)
Remarks:
• All marginal distributions of a multivariate normal distribution are univariate normal distributions.
• All conditional distributions of a multivariate normal distribution are univariate normal distributions.
• Each iso-density line is an ellipse or its higher-dimensional generalization.
Mahalanobis distance is the multivariate z-score
$MD(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$
(Figure: iso-distance ellipses with MD = 1 and MD = 2; all points on the same ellipse have the same Mahalanobis distance to the center.)
The Mahalanobis distance MD(x) measures the distance of x to the mean of the multivariate normal distribution in units of standard deviations.
In the case of a multivariate normally distributed x, the squared Mahalanobis distance follows a chi-square distribution with df = p: $MD^2(\mathbf{x}) \sim \chi^2_p$ (see below).
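A minimal R sketch (the iris measurements are only an assumed example); note that the base function mahalanobis() returns the squared Mahalanobis distance.

X = as.matrix(iris[, 1:4])
md2 = mahalanobis(X, center = colMeans(X), cov = cov(X))  # squared MD
md  = sqrt(md2)                                           # MD itself
head(round(md, 2))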
We need expectations or a model to identify an outlier!
Outlier detection using a boxplot representation
(Figure: boxplot with one point beyond the whiskers marked as an outlier.)
All points beyond the whiskers are called "extreme" values.
Is there any model?
The model behind the "extreme" value definition in boxplots
About 99% of N(0,1) data lie within the whiskers.
(Figure: boxplot of 100k data points simulated from a N(0,1).)
When visualizing non-normally distributed data, this model is not valid.
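A minimal R sketch checking this model: the fraction of N(0,1) draws falling beyond the whiskers (the 1.5 * IQR rule) is indeed below 1%.

set.seed(1)
x = rnorm(1e5)                # 100k draws from N(0,1)
b = boxplot(x, plot = FALSE)  # b$out holds the points beyond the whiskers
length(b$out) / length(x)     # roughly 0.007, i.e. ~99% within the whiskers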
Outlier detection in the univariate case via the Grubbs test

library(outliers)
x = c(45, 56, 54, 34, 32, 45, 67, 45, 67, 65, 154)  # 154: potential outlier
grubbs.test(x)
# Grubbs test for one outlier
#
# data: x
# G = 2.80490, U = 0.13459, p-value = 0.0001816
# alternative hypothesis: highest value 154 is an outlier

Grubbs developed this test statistic in 1950 (assuming a normal distribution, as in the t-test, for small n) to investigate whether "some time during the experiment something might have happened to cause an extraneous variation on the high side or on the low side". It is nowadays also routinely used in regression model checking procedures (e.g. to find outliers in Cook's distance values or standardized residuals).
Outlier detection in the multivariate case via the $\chi^2$ test
The Mahalanobis distance MD(x) measures the distance of x to the mean of the multivariate normal distribution in units of SD:
$MD^2(\mathbf{x}) = (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$
If $\mathbf{x}_{p \times 1} \sim N(\boldsymbol{\mu}_{p \times 1}, \boldsymbol{\Sigma}_{p \times p})$, then $MD^2(\mathbf{x}) \sim \chi^2_{df=p}$.
Outlier detection via the Mahalanobis distance can be performed for data for which the multivariate normal assumption is reasonable, by checking whether the MD² of a p-dimensional observation is "sticking out" of the $\chi^2$ distribution with df = p.
Outlier detection via $\chi^2$ quantiles
• Compute for each p-dimensional observation x the (robust version of the) squared Mahalanobis distance from the assumed normal-distribution center: MD(x)².
• Generate a quantile-quantile plot against the expected $\chi^2_{df=p}$ distribution to identify observations that stick out (e.g. MD(x)² > 97.5% quantile of $\chi^2_p$).
• Use in addition "adjusted quantiles" that are estimated by simulations from the expected chi-square distribution without outliers.
• Use (robust) PCA to visualize the data in a 2D score plot.
A minimal base-R sketch of the quantile check follows below.
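A minimal base-R sketch of this procedure (the iris measurements are only an assumed example): robust center and scatter via the MCD estimator, then comparison against the 97.5% chi-square quantile.

library(MASS)                        # cov.rob() for robust center/covariance
dat = as.matrix(iris[, 1:4])
rob = cov.rob(dat, method = "mcd")   # minimum covariance determinant estimate
md2 = mahalanobis(dat, center = rob$center, cov = rob$cov)
cutoff = qchisq(0.975, df = ncol(dat))
which(md2 > cutoff)                  # indices of flagged outliers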
Extreme quantiles of the $\chi^2$ distribution indicate outliers
(Figures: adjusted quantile plot and PC1-PC2 score plots with flagged outliers. The adjusted quantile is estimated via simulation: the point where the ECDF leaves the "plausible" range defines an adaptive cutoff.)

library(mvoutlier)
aq.plot(dat)     # to get the shown adjusted quantile plot
chisq.plot(dat)  # to get an interactive QQ plot to select outliers

Slide credit: Markus Kalisch
Outlier detection via robust PCA
imagine 784 dimensions ;-)
Assumption: The manifold hypothesis holds.
Dimension reduction via PCA
The PCA rotation can be achieved by multiplying X with an orthogonal rotation matrix A:
$\mathbf{Y}_{n \times p} = \mathbf{X}_{n \times p}\, \mathbf{A}_{p \times p}$   (PCA representation)
$\mathbf{X}_{n \times p} = \mathbf{Y}_{n \times p}\, \mathbf{A}^t_{p \times p}$   (full reconstruction)
Partly reconstruct X with only k < p PCs:
$\hat{\mathbf{X}}_{n \times p} = \left[ \mathbf{Y}_{n \times k},\, \mathbf{0}_{n \times (p-k)} \right] \mathbf{A}^t$
How good is the data representation? PCA minimizes the reconstruction error over all available n data points:
$\sum_{i=1}^{n} \| \mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)} \|^2 = \| \mathbf{X} - \hat{\mathbf{X}} \|^2$
The reconstruction error of a single data point is the squared orthogonal distance between the data point and its projection on the plane spanned by the first k PCs. A minimal R sketch follows below.
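A minimal R sketch of the partial reconstruction and its errors (the iris measurements are only an assumed example):

X = scale(iris[, 1:4], center = TRUE, scale = FALSE)
pca = prcomp(X)
A = pca$rotation                          # orthogonal p x p rotation matrix
Y = pca$x                                 # scores: Y = X A
k = 2
Y[, (k + 1):ncol(Y)] = 0                  # keep only the first k PCs
X.hat = Y %*% t(A)                        # partial reconstruction
rec.err = rowSums((X - X.hat)^2)          # squared orthogonal distances
head(sort(rec.err, decreasing = TRUE))    # largest errors = outlier candidates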
PCA is not robust against outliers
The first two PCs point in the directions of maximal variance. Since the variance is not robust against outliers, the result of PCA is also not robust against outliers.
We can instead use a robust version of PCA which is resistant to outliers.
(Figure: a point cloud with a few outliers; PC1 from classical PCA is pulled towards the outliers, while PC1 from robust PCA follows the bulk of the data.)
PCA can be used for outlier detection
(Figure: data points with the first PC; the reconstruction of the red point is its projection onto the PC, the green point.)
The reconstruction of the red point has a reconstruction error that corresponds to the squared distance between the red and green points; PCA minimizes the sum of these squared distances.
Points with extreme reconstruction errors are identified as outliers.
We should use robust PCA to identify outliers via reconstruction errors
In robust PCA the directions of the PCs are not heavily influenced by the positions of some outliers. Hence outliers have larger distances to the hyperplane which is spanned by the first couple of PCs and which captures large parts of the variance of the non-outlying points.
(Figure: score plot with the outliers clearly separated from the bulk of the data.)
PCA in R
There are two major R implementations of PCA: prcomp() and princomp().
- prcomp() is numerically more stable and therefore preferred (see chapter 2.7 in the StDM script).
- princomp() has a few more options and is therefore sometimes used.
For robust PCA and outlier detection we can use the package rrcov:
- PcaHubert() performs robust PCA; a minimal sketch follows below.
Summary
• We can use different measures to quantify the dissimilarity between two
observations described by the same features, e.g.
• Euclidean and other Lp metrics for quantitative data
• Matching coefficient for (symmetric) categorical data
• Jaccard coefficient for (asymmetric) categorical data
• Gower dissimilarities for mixed data types (see the R function daisy in the package cluster)
• A distance matrix holds the pairwise distances between several
observation units and can be visualized by a heatmap.
• Multidimensional scaling (MDS) yields a 2D plot for high-dimensional data:
• MDS starts with a distance matrix.
• The pairwise distances are preserved as well as possible in the 2D plot.
• PCA yields the same PC1-PC2 2D plot as MDS on the Euclidean distances.
• Outlier detection in high-dimensional data can be tackled by
• quantile plots of the $\chi^2$-distributed squared Mahalanobis distances from the
assumed normal-distribution center,
• robust PCA and the distances to the hyperplane spanned by the first two PCs.