
ETHEM ALPAYDIN © The MIT Press, 2010
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

Lecture Slides for Introduction to Machine Learning 2e

Dimensionality Reduction


Why Reduce Dimensionality?

- Reduces time complexity: less computation
- Reduces space complexity: fewer parameters
- Saves the cost of observing the feature
- Simpler models are more robust on small datasets
- More interpretable; simpler explanation
- Data visualization (structure, groups, outliers, etc.) when plotted in 2 or 3 dimensions


Feature Selection vs Extraction

- Feature selection: choose k < d important features, ignoring the remaining d − k. Subset selection algorithms do this.
- Feature extraction: project the original x_i, i = 1, ..., d dimensions to new k < d dimensions, z_j, j = 1, ..., k. Examples: principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA).


Subset Selection

There are 2^d possible subsets of d features.

Forward search: add the best feature at each step.
- The set of features F is initially empty. At each iteration, find the best new feature j = argmin_i E(F ∪ x_i).
- Add x_j to F if E(F ∪ x_j) < E(F).
- This is a hill-climbing O(d^2) algorithm; a sketch follows below.

Backward search: start with all features and remove them one at a time, if possible.

Floating search: add k, remove l.
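To make the forward-search loop concrete, here is a minimal Python sketch (not from the slides). The error function E is an assumed, user-supplied callback that estimates validation error for a given feature subset.

```python
# Hypothetical sketch of greedy forward selection. `error(subset)` is an
# assumed, user-supplied estimate of validation error E for a feature subset.
def forward_select(d, error):
    """Greedy O(d^2) forward search over feature indices 0..d-1."""
    F = []                      # selected features; F is initially empty
    best = error(F)             # E(F) for the empty set
    while len(F) < d:
        rest = [i for i in range(d) if i not in F]
        j = min(rest, key=lambda i: error(F + [i]))   # j = argmin_i E(F ∪ x_i)
        if error(F + [j]) >= best:                    # add x_j only if E decreases
            break
        F.append(j)
        best = error(F)
    return F
```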


Principal Components Analysis (PCA)

Find a low-dimensional space such that when x is projected there, information loss is minimized.

The projection of x on the direction of w is z = w^T x.

Find w such that Var(z) is maximized:

  Var(z) = Var(w^T x) = E[(w^T x − w^T μ)^2]
         = E[(w^T x − w^T μ)(w^T x − w^T μ)]
         = E[w^T (x − μ)(x − μ)^T w]
         = w^T E[(x − μ)(x − μ)^T] w = w^T Σ w

where Var(x) = E[(x − μ)(x − μ)^T] = Σ.


Maximize Var(z) subject to ||w|| = 1. With a Lagrange multiplier α, the first principal component solves

  max_{w_1}  w_1^T Σ w_1 − α (w_1^T w_1 − 1)

Setting the derivative to zero gives Σ w_1 = α w_1; that is, w_1 is an eigenvector of Σ. Choose the eigenvector with the largest eigenvalue for Var(z) to be maximum.

Second principal component: maximize Var(z_2) subject to ||w_2|| = 1 and w_2 orthogonal to w_1:

  max_{w_2}  w_2^T Σ w_2 − α (w_2^T w_2 − 1) − β (w_2^T w_1 − 0)

which again gives Σ w_2 = α w_2; that is, w_2 is another eigenvector of Σ (the one with the second largest eigenvalue), and so on.


What PCA Does

  z = W^T (x − m)

where the columns of W are the eigenvectors of Σ and m is the sample mean.

PCA centers the data at the origin and rotates the axes; a minimal sketch follows below.
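A minimal NumPy sketch of this computation (mine, not the book's code): estimate Σ from the sample, take the top-k eigenvectors as the columns of W, and project the centered data.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (N x d) onto the first k principal components."""
    m = X.mean(axis=0)                 # sample mean m
    Xc = X - m                         # center the data at the origin
    S = np.cov(Xc, rowvar=False)       # d x d sample estimate of Σ
    lam, vecs = np.linalg.eigh(S)      # eigenvalues ascending (S is symmetric)
    order = np.argsort(lam)[::-1]      # re-sort in descending order
    W = vecs[:, order[:k]]             # columns of W = top-k eigenvectors
    return Xc @ W, lam[order]          # z = W^T (x - m), plus sorted eigenvalues
```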


Proportion of Variance (PoV) Explained

How do we choose k? When the eigenvalues λ_i are sorted in descending order,

  PoV(k) = (λ_1 + λ_2 + ... + λ_k) / (λ_1 + λ_2 + ... + λ_d)

Typically, stop at PoV > 0.9. A scree graph plots PoV vs k; stop at the "elbow". A sketch follows below.
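As a follow-up sketch, assuming the descending-sorted eigenvalues returned by the hypothetical pca() above, pick the smallest k whose cumulative PoV exceeds the threshold:

```python
import numpy as np

def choose_k(lam_sorted, threshold=0.9):
    """Smallest k with PoV(k) > threshold; assumes the threshold is reachable."""
    pov = np.cumsum(lam_sorted) / lam_sorted.sum()   # PoV(k) for k = 1..d
    return int(np.argmax(pov > threshold)) + 1       # first k where PoV > threshold
```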


Factor Analysis

Find a small number of factors z which, when combined, generate x:

  x_i − μ_i = v_i1 z_1 + v_i2 z_2 + ... + v_ik z_k + ε_i

where the z_j, j = 1, ..., k are latent factors with E[z_j] = 0, Var(z_j) = 1, Cov(z_i, z_j) = 0 for i ≠ j; the ε_i are noise sources with Var(ε_i) = ψ_i, Cov(ε_i, ε_j) = 0 for i ≠ j, Cov(ε_i, z_j) = 0; and the v_ij are the factor loadings. A generative sketch follows below.
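A minimal sketch of sampling from this generative model (illustrative sizes and values of my choosing, not from the slides); the final check confirms that the model implies Cov(x) = V V^T + diag(ψ).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 5, 2, 100_000                 # illustrative sizes (assumption)
V = rng.normal(size=(d, k))             # factor loadings v_ij
psi = rng.uniform(0.1, 0.5, size=d)     # noise variances psi_i
mu = np.zeros(d)

z = rng.normal(size=(N, k))             # latent factors: E[z_j]=0, Var(z_j)=1
eps = rng.normal(size=(N, d)) * np.sqrt(psi)   # independent noise sources eps_i
X = mu + z @ V.T + eps                  # x - mu = V z + eps, one sample per row

# The implied covariance of x is V V^T + diag(psi):
assert np.allclose(np.cov(X, rowvar=False), V @ V.T + np.diag(psi), atol=0.05)
```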


PCA vs FA

- PCA goes from x to z:  z = W^T (x − μ)
- FA goes from z to x:  x − μ = V z + ε

Factor Analysis (continued)

In FA, the factors z_j are stretched, rotated, and translated to generate x.

Multidimensional Scaling (MDS)

Given the pairwise distances between N points,

  d_ij, i, j = 1, ..., N

place the points on a low-dimensional map such that the distances are preserved. With the mapping z = g(x | θ), find θ that minimizes the Sammon stress:

  E(θ | X) = Σ_{r,s} ( ||z^r − z^s|| − ||x^r − x^s|| )^2 / ||x^r − x^s||^2
           = Σ_{r,s} ( ||g(x^r | θ) − g(x^s | θ)|| − ||x^r − x^s|| )^2 / ||x^r − x^s||^2

A sketch of the stress follows below.
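Here is a minimal sketch of the stress itself (the optimization over θ is left out); X and Z are assumed to hold the original and mapped points row by row, with no duplicate rows in X.

```python
import numpy as np

def sammon_stress(X, Z):
    """Sammon stress between original points X (N x d) and their map Z (N x k)."""
    dx = np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # ||x^r - x^s||
    dz = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)   # ||z^r - z^s||
    iu = np.triu_indices(len(X), k=1)        # count each pair (r, s) once
    return float(np.sum((dz[iu] - dx[iu]) ** 2 / dx[iu] ** 2))
```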


Map of Europe by MDS

[Figure: a map of Europe obtained by MDS from pairwise distances between cities. Map from CIA – The World Factbook: http://www.cia.gov/]

Linear Discriminant Analysis (LDA)

Find a low-dimensional space such that when x is projected onto it, the classes are well-separated.

For two classes with labels r^t ∈ {0, 1}, project onto w: z = w^T x. The projected class means and scatters are

  m_1 = Σ_t w^T x^t r^t / Σ_t r^t                 s_1^2 = Σ_t (w^T x^t − m_1)^2 r^t
  m_2 = Σ_t w^T x^t (1 − r^t) / Σ_t (1 − r^t)     s_2^2 = Σ_t (w^T x^t − m_2)^2 (1 − r^t)

Find w that maximizes

  J(w) = (m_1 − m_2)^2 / (s_1^2 + s_2^2)


Writing m_1, m_2 for the d-dimensional class means (so the projected means are w^T m_1, w^T m_2):

Between-class scatter:

  (m_1 − m_2)^2 = (w^T m_1 − w^T m_2)^2
                = w^T (m_1 − m_2)(m_1 − m_2)^T w
                = w^T S_B w    where S_B = (m_1 − m_2)(m_1 − m_2)^T

Within-class scatter:

  s_1^2 = Σ_t (w^T x^t − m_1)^2 r^t
        = Σ_t w^T (x^t − m_1)(x^t − m_1)^T w r^t
        = w^T S_1 w    where S_1 = Σ_t r^t (x^t − m_1)(x^t − m_1)^T

  s_1^2 + s_2^2 = w^T S_W w    where S_W = S_1 + S_2

Fisher's Linear Discriminant

Find w that maximizes

  J(w) = (w^T S_B w) / (w^T S_W w) = |w^T (m_1 − m_2)|^2 / (w^T S_W w)

LDA solution:

  w = c · S_W^{-1} (m_1 − m_2)

Parametric solution, when p(x | C_i) ~ N(μ_i, Σ):

  w = Σ^{-1} (μ_1 − μ_2)

A sketch follows below.
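A minimal NumPy sketch of the two-class solution (mine, not the book's code); the arbitrary constant c is handled by normalizing w, and X1, X2 are assumed to hold each class's samples row by row.

```python
import numpy as np

def fisher_lda(X1, X2):
    """Fisher direction w proportional to S_W^{-1} (m_1 - m_2) for two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)   # class means m_1, m_2
    S1 = (X1 - m1).T @ (X1 - m1)                # scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)                # scatter of class 2
    Sw = S1 + S2                                # within-class scatter S_W
    w = np.linalg.solve(Sw, m1 - m2)            # S_W^{-1} (m_1 - m_2)
    return w / np.linalg.norm(w)                # absorb the constant c
```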


K > 2 Classes

Within-class scatter:

  S_W = Σ_{i=1}^{K} S_i    where S_i = Σ_t r_i^t (x^t − m_i)(x^t − m_i)^T

Between-class scatter:

  S_B = Σ_{i=1}^{K} N_i (m_i − m)(m_i − m)^T    where m = (1/K) Σ_{i=1}^{K} m_i

Find W that maximizes

  J(W) = |W^T S_B W| / |W^T S_W W|

The solution is given by the largest eigenvectors of S_W^{-1} S_B; S_B has maximum rank K − 1. A sketch follows below.
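A sketch of the K-class case under the same definitions (my own, with y assumed to be a vector of class labels); it keeps at most K − 1 directions, matching the rank of S_B.

```python
import numpy as np

def lda_multiclass(X, y, k_out):
    """Columns of W = leading eigenvectors of S_W^{-1} S_B (k_out <= K-1)."""
    d = X.shape[1]
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}   # class means m_i
    m = np.mean(list(means.values()), axis=0)              # m = (1/K) sum_i m_i
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        Sw += (Xc - means[c]).T @ (Xc - means[c])             # add S_i
        Sb += len(Xc) * np.outer(means[c] - m, means[c] - m)  # N_i (m_i-m)(m_i-m)^T
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))       # eig of S_W^{-1} S_B
    order = np.argsort(vals.real)[::-1]                       # largest first
    return vecs[:, order[:k_out]].real
```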


Isomap

Geodesic distance is the distance along the manifold that the data lies on, as opposed to the Euclidean distance in the input space.

Isomap (continued)

- Instances r and s are connected in the graph if ||x^r − x^s|| < ε, or if x^s is one of the k nearest neighbors of x^r. The edge length is ||x^r − x^s||.
- For two nodes r and s that are not directly connected, the distance is the length of the shortest path between them.
- Once the N × N distance matrix is formed in this way, use MDS to find a lower-dimensional mapping; a sketch follows below.
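A compact sketch of the whole pipeline (not the authors' Matlab), assuming SciPy for the shortest-path step and using classical, eigendecomposition-based MDS on the geodesic distances; the k-nearest-neighbor rule provides connectivity, and the graph is assumed connected.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=5, k_out=2):
    """Embed the rows of X (N x d) into k_out dims via geodesic distances + MDS."""
    N = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # Euclidean ||x^r - x^s||
    G = np.full((N, N), np.inf)                  # inf marks "no edge"
    nn = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    for r in range(N):
        G[r, nn[r]] = D[r, nn[r]]                # edge length ||x^r - x^s||
    G = np.minimum(G, G.T)                       # symmetrize the graph
    geo = shortest_path(G, method='D')           # geodesic = shortest path (Dijkstra)
    J = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    B = -0.5 * J @ (geo ** 2) @ J                # double centering (classical MDS)
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:k_out]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))
```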


[Figure: Optdigits after Isomap (with neighborhood graph); the digits are plotted at their embedded 2-D coordinates.]

Matlab source from http://web.mit.edu/cocosci/isomap/isomap.html


Locally Linear Embedding (LLE)

1. Given x^r, find its neighbors x^s_(r).
2. Find the weights W_rs that minimize the reconstruction error

  E(W | X) = Σ_r || x^r − Σ_s W_rs x^s_(r) ||^2

3. Find the new coordinates z^r that minimize

  E(z | W) = Σ_r || z^r − Σ_s W_rs z^s ||^2

A sketch follows below.
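A minimal sketch of the three steps (mine, not the authors' Matlab); step 3 takes the standard route of computing the bottom eigenvectors of M = (I − W)^T (I − W), skipping the constant one.

```python
import numpy as np

def lle(X, n_neighbors=10, k_out=2, reg=1e-3):
    """Locally linear embedding of the rows of X (N x d) into k_out dims."""
    N = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    nbrs = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]   # step 1: neighbors of x^r
    W = np.zeros((N, N))
    for r in range(N):                                   # step 2: solve for W_rs
        Z = X[nbrs[r]] - X[r]                            # neighbors centered on x^r
        G = Z @ Z.T                                      # local Gram matrix
        G += reg * np.trace(G) * np.eye(n_neighbors)     # regularize near-singular G
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[r, nbrs[r]] = w / w.sum()                      # weights sum to 1
    M = (np.eye(N) - W).T @ (np.eye(N) - W)              # step 3: embedding matrix
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:k_out + 1]                          # skip the constant eigenvector
```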


LLE on Optdigits

[Figure: Optdigits after LLE; the digits are plotted at their embedded 2-D coordinates.]

Matlab source from http://www.cs.toronto.edu/~roweis/lle/code.html