Advanced Studies in Applied Statistics (WBL), ETH Zurich
Applied Multivariate Statistics, Week 3
Lecturer: Beate Sick
Remark: Much of the material has been developed together with Oliver Dürr.
Topics of today
• Similarity and Distances
• Numeric data
• Categorical data
• Mixed data types
• Multidimensional Scaling
• 2D plots of high dimensional data starting from pair-wise distances
• Outlier detection
• Univariate outlier detection by visual checks and additional tests
• Multivariate outlier detection
• Parametric: squared Mahalanobis distances and Chi-Square test
• Non-parametric: Robust PCA for multi-variate outlier detection
What is Similarity?
"The quality or state of being similar; likeness; resemblance; as, a similarity of features." (Webster's Dictionary)
Similarity is hard to define, but "we know it when we see it". The real meaning of similarity is a philosophical question; we will take a more pragmatic approach.
Defining Distance Measures (Recap)
Definition: Let O1 and O2 be two objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by d(O1, O2).
(Figure: example distance values between the strings "Peter" and "Piotr", e.g. 0.23, 3, 342.7, depending on the chosen measure.)
(Dis-)similarity / Distance
For pairs of objects we distinguish:
• Similarity (large ⇒ similar): vague definition
• Dissimilarity (small ⇒ similar): Rules 1-3
• Distance, metric (small ⇒ similar): Rule 4 in addition
Rules:
1. d(O1, O2) ≥ 0 (non-negativity)
2. d(O1, O1) = 0
3. d(O1, O2) = d(O2, O1) (symmetry)
4. d(O1, O3) ≤ d(O1, O2) + d(O2, O3) (triangle inequality)
Examples of metrics (more follow with the examples):
• Euclidean and other Lp metrics
• Jaccard distance (1 - Jaccard index)
• Graph distance (shortest path)
Example of a Metric
Task 1
• Draw 3 objects on a piece of paper and measure their pairwise distances (e.g. with a ruler).
• Is this a proper distance? Are Axioms 1-4 fulfilled?
Task 2
• The 3 entities A, B, C have the dissimilarities:
d(A,B) = 1, d(B,C) = 1, d(A,C) = 3
• Is this dissimilarity a distance?
• Can you draw them on a piece of paper? (Hint: check the triangle inequality.)
A problematic case: word maps
Try to make a word map with: Bank, Finance, Sitting.
Triangle inequality: not just a mathematical gimmick!
The triangle inequality would imply:
d("sitting", "finance") ≤ d("sitting", "bank") + d("bank", "finance")
Since "bank" is similar to both "finance" (a financial institution) and "sitting" (a bench), the right-hand side is small, yet "sitting" and "finance" are very dissimilar: such semantic dissimilarities can violate the triangle inequality.
We live in a Euclidean space
If we are presented objects in the two-dimensional plane, we intuitively assume Euclidean distances between the objects.
(Figure: scatter plot of four points p1, p2, p3, p4 in the plane.)
Euclidean Distance and its Generalization
Each observation is described by p features. The Euclidean distance between two observations $o_i$, $o_j$ described by p numeric features is
$d_2(o_i, o_j) = \sqrt{\sum_{k=1}^{p} (o_{ik} - o_{jk})^2}$
The Minkowski distance is its generalization:
$d_r(o_i, o_j) = \left( \sum_{k=1}^{p} |o_{ik} - o_{jk}|^r \right)^{1/r}$
2D example (2 features per observation):
obs       x1  x2
o1 = p1    0   2
o2 = p2    2   0
o3 = p3    3   1
o4 = p4    5   1
(Figure: the four points plotted in the (x1, x2) plane.)
Example: $d_2(o_2, o_3) = \sqrt{(2-3)^2 + (0-1)^2} = \sqrt{2}$
L1: Manhattan Distance
(Figure: city-block grid with two points A and B; one block is one unit. Image from Wikipedia.)
• How many blocks do you have to walk from A to B?
• What is the L1 distance (Minkowski with r = 1) from A to B?
• What is the Euclidean distance?
Recall: $d_r(o_i, o_j) = \left( \sum_{k=1}^{p} |o_{ik} - o_{jk}|^r \right)^{1/r}$
Minkowski Distances
r = 1: City block (Manhattan, taxicab, L1 norm) distance:
$d_1(o_i, o_j) = \sum_{k=1}^{p} |o_{ik} - o_{jk}|$
r = 2: Euclidean distance (L2 norm):
$d_2(o_i, o_j) = \sqrt{\sum_{k=1}^{p} (o_{ik} - o_{jk})^2}$
r = ∞: "Supremum" or maximum (Lmax norm, L∞ norm) distance; this is the maximum difference between any component of the vectors:
$d_\infty(o_i, o_j) = \max_{k=1,\dots,p} |o_{ik} - o_{jk}|$
Distance matrix
As discussed on the last couple of slides, there are different possibilities to determine the pairwise distance between two observations $o_i$ and $o_j$.
We can collect all these pairwise distances $d_{ij} = d(o_i, o_j)$ in a distance matrix:
$\mathbf{D} = \begin{pmatrix} 0 & d_{12} & \cdots & d_{1n} \\ d_{21} & 0 & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & 0 \end{pmatrix}$
All diagonal elements are 0: $d_{kk} = d(o_k, o_k) = 0$.
Symmetry: $d_{ij} = d(o_i, o_j) = d(o_j, o_i) = d_{ji}$.
(A small R sketch with dist() follows below.)
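A minimal R sketch (using the four example points from the 2D example above): the base function dist() computes the Minkowski-type distances discussed so far and returns the lower triangle of the distance matrix.

X = rbind(p1 = c(0, 2), p2 = c(2, 0), p3 = c(3, 1), p4 = c(5, 1))
dist(X, method = "euclidean")  # L2 distances, e.g. d(p2, p3) = sqrt(2)
dist(X, method = "manhattan")  # L1 / city-block distances
dist(X, method = "maximum")    # L-infinity / supremum distances
round(as.matrix(dist(X)), 2)   # full symmetric matrix with 0 diagonal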
How to calculate dissimilarities with categorical variables?
Similarity measures for binary data
A common situation is that the objects $o_1 = (o_{11}, o_{12}, \dots, o_{1p})$ and $o_2 = (o_{21}, o_{22}, \dots, o_{2p})$ have only binary attributes, for example gender (f/m), driving license (yes/no), Nobel prize holder (yes/no).
We distinguish between symmetric and asymmetric binary variables.
In symmetric binary variables, both levels have roughly comparable frequencies. Example: gender.
In asymmetric binary variables, the two levels have very different frequencies. Example: Nobel prize holder.
Similarity measures for "symmetric" binary vectors: Simple Matching Coefficient
The objects $o_1 = (o_{11}, o_{12}, \dots, o_{1p})$ and $o_2 = (o_{21}, o_{22}, \dots, o_{2p})$ have only binary attributes.
The Simple Matching Coefficient (SMC) for symmetric binary variables (could be only a subset of the p binary variables) is defined as:
SMC = # matches / # attributes = (M11 + M00) / (M01 + M10 + M11 + M00)
corresponding to the proportion of matching attributes over all attributes, where
M01 = number of attributes where o1i is 0 and o2i is 1
M10 = number of attributes where o1i is 1 and o2i is 0
M00 = number of attributes where o1i is 0 and o2i is 0
M11 = number of attributes where o1i is 1 and o2i is 1
Similarity measures for "asymmetric" binary vectors: Jaccard Coefficient
The objects $o_1 = (o_{11}, o_{12}, \dots, o_{1p})$ and $o_2 = (o_{21}, o_{22}, \dots, o_{2p})$ have only binary attributes.
The Jaccard Coefficient for asymmetric binary variables (could be only a subset of the p binary variables) is defined as:
J = # both-1 matches / # not-both-zero attributes = M11 / (M01 + M10 + M11)
corresponding to the proportion of matching attributes over those attributes which are 1 in at least one of the two observations, where
M01 = number of attributes where o1i is 0 and o2i is 1
M10 = number of attributes where o1i is 1 and o2i is 0
M11 = number of attributes where o1i is 1 and o2i is 1
Example: How similar are two given binary vectors?
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p is 0 and q is 1)
M10 = 1 (the number of attributes where p is 1 and q is 0)
M00 = 7 (the number of attributes where p is 0 and q is 0)
M11 = 0 (the number of attributes where p is 1 and q is 1)
SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
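A minimal R sketch reproducing this example: count the four match types for the binary vectors p and q, then compute SMC and Jaccard from them.

p = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
M11 = sum(p == 1 & q == 1)  # 0
M00 = sum(p == 0 & q == 0)  # 7
M10 = sum(p == 1 & q == 0)  # 1
M01 = sum(p == 0 & q == 1)  # 2
SMC = (M11 + M00) / (M01 + M10 + M11 + M00)  # 0.7
J   = M11 / (M01 + M10 + M11)                # 0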
More than 2 levels (Nominal Data)
Simple mismatching coefficient (ranges between 0 and 1):
$d_{ij} = \frac{mm}{p}$ = proportion of features where the observations differ
mm: number of variables where objects i and j do not match
p: number of features
Character strings can be understood as nominal data. If the strings are of equal length, this is also called the Hamming distance (sometimes without dividing by p).
What is the Hamming distance between HOUSE and MOUSE? (A small R sketch follows below.)
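A minimal R sketch (hamming() is a small helper written here, not a base function): the Hamming distance between two equal-length strings, optionally divided by p to get the mismatching coefficient.

hamming = function(s1, s2) {
  a = strsplit(s1, "")[[1]]
  b = strsplit(s2, "")[[1]]
  stopifnot(length(a) == length(b))  # Hamming distance needs equal lengths
  sum(a != b)                        # number of mismatching positions
}
hamming("HOUSE", "MOUSE")      # 1 (only the first character differs)
hamming("HOUSE", "MOUSE") / 5  # mismatching coefficient: 0.2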
Gower's dissimilarity for mixed data types
Idea: Use a distance measure $d_{ij}^{(k)}$ between 0 and 1 for each corresponding pair of variables (features) in two observations.
- If the kth variable is binary or nominal, use the methods discussed above, e.g.
$d_{ij}^{(k)} = 1 - \frac{M_{11} + M_{00}}{M_{01} + M_{10} + M_{11} + M_{00}}$
- If the kth variable is numeric:
$d_{ij}^{(k)} = \frac{|x_{ik} - x_{jk}|}{R_k}$
where $x_{ik}$ is the value of object i in variable k and $R_k$ is the range of variable k over all objects.
- If the kth variable is ordinal, use normalized ranks and then proceed as with numeric variables.
Aggregate the distance measures over all variables/features/dimensions:
$d_{ij} = \frac{1}{p} \sum_{k=1}^{p} d_{ij}^{(k)}$
Dissimilarity for mixed data types with the R function "daisy", calculating Gower's dissimilarity

> str(flower)
'data.frame': 18 obs. of 8 variables:
 $ V1: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 2 2 ...
 $ V2: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 1 2 2 ...
 $ V3: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
 $ V4: Factor w/ 5 levels "1","2","3",..: 4 2 3 4 5 4 4 2 3 5 ...
 $ V5: Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 2 2 3 3 2 1 2 ...
 $ V6: Ord.factor w/ 18 levels "1"<"2"<"3"<..: 15 3 1 16 2 12 ...
 $ V7: num 25 150 150 125 20 50 40 100 25 100 ...
 $ V8: num 15 50 50 50 15 40 20 15 15 60 ...
> library(cluster)
> dist = daisy(flower, type = list(asymm = c(1, 3), symm = 2, ordratio = 7))
> str(dist)
Classes 'dissimilarity', 'dist' atomic [1:153] 0.901 0.618 ...
 ..- attr(*, "Size")= int 18
 ..- attr(*, "Metric")= chr "mixed"
 ..- attr(*, "Types")= chr [1:8] "A" "S" "A" "N" ...
Visualize the distance matrix: heatmaps are great!

library(cluster)
dist = daisy(flower)
mdist = as.matrix(dist)
library(pheatmap)
pheatmap(mdist)
How to visualize multivariate observations of mixed data types in 2D?
Goal of multidimensional scaling
MDS takes as input the distances between observations (data points) and produces a visualization of the points in 2D.
(Figure: 2D point configuration; the bars between points represent the given distances between them.)
As input of MDS we only know the distances, and we look for a low-dimensional point configuration in which the points have the same, or similar, pairwise distances.
Example for metric MDS
Problem: Given the Euclidean distances among points, recover the positions of the points!
Example: Road distances between 21 European cities (almost Euclidean, but not quite).
(Figure: the distance matrix of the eurodist data.)
The eurodist data set contains the road distances between the 21 cities.
MDS in R:

res.cmd = cmdscale(eurodist)
plot(res.cmd, pch = "")
text(res.cmd, labels = rownames(res.cmd))

(Figure: the resulting 2D configuration res.cmd[,1] vs res.cmd[,2], with city labels from Athens, Barcelona, ... to Stockholm and Vienna.)
The configuration can be
- shifted
- rotated
- reflected
without changing the distances.
Example for metric MDS (continued)
After flipping the vertical axis:
(Figure: the flipped MDS configuration, which now matches the usual geographic orientation of the map of Europe.)
Equivalence of PCA and MDS with Euclidean distance
PCA representation of the data matrix = MDS representation of the Euclidean distance matrix.
MDS on Euclidean distances results in a low-dimensional representation that is equivalent (up to rotation, flipping, and shifts) to PCA on the data matrix (however, within MDS the data matrix must first be derived from the distance matrix). A minimal sketch of this equivalence in R follows below.
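A minimal R sketch of this equivalence (the iris measurements serve only as an assumed example data set): classical MDS via cmdscale() on Euclidean distances reproduces the PCA scores up to sign flips.

X = scale(iris[, 1:4], center = TRUE, scale = FALSE)  # centered data matrix
pca = prcomp(X)
mds = cmdscale(dist(X), k = 2)      # classical (metric) MDS in 2 dimensions
head(round(abs(pca$x[, 1:2]), 3))   # first two PCA scores (absolute values)
head(round(abs(mds), 3))            # the same values, up to sign/reflection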
Distance matrix and 2D plot of multivariate mixed data

library(cluster)
dist = daisy(flower)              # Gower dissimilarities for the mixed-type data
mdist = as.matrix(dist)
library(pheatmap)
pheatmap(mdist)                   # heatmap of the distance matrix

library(MASS)
mds = isoMDS(mdist, k = 2)        # non-metric MDS into 2 dimensions
d.mds = as.data.frame(mds$points)
names(d.mds) = c("c1", "c2")
library(ggplot2)
ggplot(data = d.mds, aes(x = c1, y = c2)) +
  geom_point() +
  geom_text(label = row.names(mdist), hjust = 1.2)
Outlier detection
How much does an observation differ from the average?
Let's standardize and look at the z-score
We start from a variable X with $E(X) = \bar{X} = \mu_X$ and $\mathrm{Var}(X) = \sigma_X^2$ and apply the z-transformation:
$Z = \frac{X - \mu_X}{\sigma_X}$
The standardized variable Z has mean zero and variance one: $E(Z) = 0$, $\mathrm{sd}(Z) = 1$.
Often the z-transformation is applied to different univariate features to make them "comparable". A z-score of -2 always means that the observation is two standard deviations smaller than the average.
In the case of a normally distributed X, we know that $Z \sim N(0,1)$.
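A minimal R sketch of the z-transformation (using the sample mean and sample SD, as the base function scale() does):

x = c(45, 56, 54, 34, 32, 45, 67, 45, 67, 65)
z = scale(x)                   # subtracts mean(x) and divides by sd(x)
round(c(mean(z), sd(z)), 10)   # 0 and 1, as expected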
The z-score has the unit "standard deviation"
(Figure: an axis marked 0, 1sd, 2sd, ..., 10sd: how much is my IQ above/below average?)
Remark: Mean and SD can also be determined for non-normally distributed variables, but the intuition is lost and we might prefer to work with quantiles.
The multivariate normal distribution
$\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, with density
$f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^k |\boldsymbol{\Sigma}|}} \exp\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$
(Figure: density surface and iso-density ellipses of a bivariate normal example $(X, Y)^t \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.)
Remarks:
• All marginal distributions of a multivariate normal distribution are univariate normal distributions.
• All conditional distributions of a multivariate normal distribution are univariate normal distributions.
• Each iso-density line is an ellipse or its higher-dimensional generalization.
Mahalanobis distance is the multivariate z-score
$MD(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$
(Figure: iso-distance ellipses with MD = 1 and MD = 2; all points on the same ellipse have the same Mahalanobis distance to the center.)
The Mahalanobis distance MD(x) measures the distance of x to the mean of the multivariate normal distribution in units of standard deviations.
In the case of a multivariate normally distributed x, the squared Mahalanobis distance follows a chi-square distribution with df = p: $MD^2(\mathbf{x}) \sim \chi^2_p$ (see below).
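A minimal R sketch (the iris measurements are only an assumed example); note that the base function mahalanobis() returns the squared Mahalanobis distance.

X = as.matrix(iris[, 1:4])
md2 = mahalanobis(X, center = colMeans(X), cov = cov(X))  # squared MD
md  = sqrt(md2)                                           # MD itself
head(round(md, 2))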
We need expectations or a model to identify an outlier!
Outlier detection using a boxplot representation
(Figure: boxplot with one point beyond the whiskers marked as an outlier.)
All points beyond the whiskers are called "extreme" values.
Is there any model?
The model behind the "extreme" value definition in boxplots
About 99% of N(0,1) data lie within the whiskers.
(Figure: boxplot of 100k data points simulated from a N(0,1).)
When visualizing non-normally distributed data, this model is not valid.
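A minimal R sketch checking this model: the fraction of N(0,1) draws falling beyond the whiskers (the 1.5 * IQR rule) is indeed below 1%.

set.seed(1)
x = rnorm(1e5)                # 100k draws from N(0,1)
b = boxplot(x, plot = FALSE)  # b$out holds the points beyond the whiskers
length(b$out) / length(x)     # roughly 0.007, i.e. ~99% within the whiskers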
Outlier detection in the univariate case via the Grubbs test

library(outliers)
x = c(45, 56, 54, 34, 32, 45, 67, 45, 67, 65, 154)  # 154: potential outlier
grubbs.test(x)
# Grubbs test for one outlier
#
# data: x
# G = 2.80490, U = 0.13459, p-value = 0.0001816
# alternative hypothesis: highest value 154 is an outlier

Grubbs developed this test statistic in 1950 (assuming a normal distribution, as in the t-test, for small n) to investigate whether "some time during the experiment something might have happened to cause an extraneous variation on the high side or on the low side". It is nowadays also routinely used in regression model checking procedures (e.g. to find outliers in Cook's distance values or standardized residuals).
Outlier detection in the multivariate case via the $\chi^2$ test
The Mahalanobis distance MD(x) measures the distance of x to the mean of the multivariate normal distribution in units of SD:
$MD^2(\mathbf{x}) = (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$
If $\mathbf{x}_{p \times 1} \sim N(\boldsymbol{\mu}_{p \times 1}, \boldsymbol{\Sigma}_{p \times p})$, then $MD^2(\mathbf{x}) \sim \chi^2_{df=p}$.
Outlier detection via the Mahalanobis distance can be performed for data for which the multivariate normal assumption is reasonable, by checking whether the MD² of a p-dimensional observation is "sticking out" of the $\chi^2$ distribution with df = p.
Outlier detection via $\chi^2$ quantiles
• Compute for each p-dimensional observation x the (robust version of the) squared Mahalanobis distance from the assumed normal-distribution center: MD(x)².
• Generate a quantile-quantile plot against the expected $\chi^2_{df=p}$ distribution to identify observations that stick out (e.g. MD(x)² > 97.5% quantile of $\chi^2_p$).
• Use in addition "adjusted quantiles" that are estimated by simulations from the expected chi-square distribution without outliers.
• Use (robust) PCA to visualize the data in a 2D score plot.
A minimal base-R sketch of the quantile check follows below.
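A minimal base-R sketch of this procedure (the iris measurements are only an assumed example): robust center and scatter via the MCD estimator, then comparison against the 97.5% chi-square quantile.

library(MASS)                        # cov.rob() for robust center/covariance
dat = as.matrix(iris[, 1:4])
rob = cov.rob(dat, method = "mcd")   # minimum covariance determinant estimate
md2 = mahalanobis(dat, center = rob$center, cov = rob$cov)
cutoff = qchisq(0.975, df = ncol(dat))
which(md2 > cutoff)                  # indices of flagged outliers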
Extreme quantiles of the $\chi^2$ distribution indicate outliers
(Figures: adjusted quantile plot and PC1-PC2 score plots with flagged outliers. The adjusted quantile is estimated via simulation: the point where the ECDF leaves the "plausible" range defines an adaptive cutoff.)

library(mvoutlier)
aq.plot(dat)     # to get the shown adjusted quantile plot
chisq.plot(dat)  # to get an interactive QQ plot to select outliers

Slide credit: Markus Kalisch
Outlier detection via robust PCA
imagine 784 dimensions ;-)
Assumption: The manifold hypothesis holds.
Dimension reduction via PCA
The PCA rotation can be achieved by multiplying X with an orthogonal rotation matrix A:
$\mathbf{Y}_{n \times p} = \mathbf{X}_{n \times p}\, \mathbf{A}_{p \times p}$   (PCA representation)
$\mathbf{X}_{n \times p} = \mathbf{Y}_{n \times p}\, \mathbf{A}^t_{p \times p}$   (full reconstruction)
Partly reconstruct X with only k < p PCs:
$\hat{\mathbf{X}}_{n \times p} = \left[ \mathbf{Y}_{n \times k},\, \mathbf{0}_{n \times (p-k)} \right] \mathbf{A}^t$
How good is the data representation? PCA minimizes the reconstruction error over all available n data points:
$\sum_{i=1}^{n} \| \mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)} \|^2 = \| \mathbf{X} - \hat{\mathbf{X}} \|^2$
The reconstruction error of a single data point is the squared orthogonal distance between the data point and its projection on the plane spanned by the first k PCs. A minimal R sketch follows below.
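A minimal R sketch of the partial reconstruction and its errors (the iris measurements are only an assumed example):

X = scale(iris[, 1:4], center = TRUE, scale = FALSE)
pca = prcomp(X)
A = pca$rotation                          # orthogonal p x p rotation matrix
Y = pca$x                                 # scores: Y = X A
k = 2
Y[, (k + 1):ncol(Y)] = 0                  # keep only the first k PCs
X.hat = Y %*% t(A)                        # partial reconstruction
rec.err = rowSums((X - X.hat)^2)          # squared orthogonal distances
head(sort(rec.err, decreasing = TRUE))    # largest errors = outlier candidates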
PCA is not robust against outliers
The first two PCs point in the directions of maximal variance. Since the variance is not robust against outliers, the result of PCA is also not robust against outliers.
We can instead use a robust version of PCA which is resistant to outliers.
(Figure: a point cloud with a few outliers; PC1 from classical PCA is pulled towards the outliers, while PC1 from robust PCA follows the bulk of the data.)
PCA can be used for outlier detection
(Figure: data points with the first PC; the reconstruction of the red point is its projection onto the PC, the green point.)
The reconstruction of the red point has a reconstruction error that corresponds to the squared distance between the red and green points; PCA minimizes the sum of these squared distances.
Points with extreme reconstruction errors are identified as outliers.
We should use robust PCA to identify outliers via reconstruction errors
In robust PCA the directions of the PCs are not heavily influenced by the positions of some outliers. Hence outliers have larger distances to the hyperplane which is spanned by the first couple of PCs and which captures large parts of the variance of the non-outlying points.
(Figure: score plot with the outliers clearly separated from the bulk of the data.)
PCA in R
There are two major R implementations of PCA: prcomp() and princomp().
- prcomp() is numerically more stable and therefore preferred (see chapter 2.7 in the StDM script).
- princomp() has a few more options and is therefore sometimes used.
For robust PCA and outlier detection we can use the package rrcov:
- PcaHubert() performs robust PCA; a minimal sketch follows below.
Summary
• We can use different measures to quantify the dissimilarity between two
observations described by the same features, e.g.
• Euclidean and other Lp metrics for quantitative data
• Matching coefficient for (symmetric) categorical data
• Jaccard coefficient for (asymmetric) categorical data
• Gower dissimilarities for mixed data types (see the R function daisy in the package cluster)
• A distance matrix holds the pairwise distances between several
observation units and can be visualized by a heatmap.
• Multidimensional scaling (MDS) yields a 2D plot for high-dimensional data:
• MDS starts with a distance matrix.
• The pairwise distances are preserved as well as possible in the 2D plot.
• PCA yields the same PC1-PC2 2D plot as MDS on the Euclidean distances.
• Outlier detection in high-dimensional data can be tackled by
• quantile plots of the $\chi^2$-distributed squared Mahalanobis distances from the
assumed normal-distribution center,
• robust PCA and the distances to the hyperplane spanned by the first two PCs.