
Page 1:

• Reduces time complexity: less computation
• Reduces space complexity: fewer parameters
• Simpler models are more robust on small datasets
• More interpretable; simpler explanation
• Data visualization (beyond 2 attributes, it gets complicated)

Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Why Reduce Dimensionality?

Page 2:

Feature Selection vs Extraction

Feature selection: choose k < d important features and ignore the remaining d − k.

Examples: data snooping, genetic algorithms

Feature extraction: project the original d attributes onto a new k < d dimensional feature space.

Examples: principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA), auto-association ANN
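To make the contrast concrete, here is a minimal MATLAB sketch (not from the slides; the data matrix X and the selected column indices are hypothetical): selection keeps k of the original columns, while extraction builds k new features as linear combinations of all d attributes.

% Minimal sketch: feature selection vs. feature extraction (hypothetical data)
X = randn(100, 5);                      % 100 samples, d = 5 attributes
k = 2;

% Selection: keep k of the original attributes, discard the rest
Xsel = X(:, [1 3]);                     % e.g. keep attributes 1 and 3

% Extraction (PCA): project onto the k directions of largest variance
Xc = bsxfun(@minus, X, mean(X, 1));     % center the attributes
[V, D] = eig(cov(Xc));                  % eigenvectors/eigenvalues of the covariance
[~, order] = sort(diag(D), 'descend');  % largest-variance directions first
Z = Xc * V(:, order(1:k));              % k extracted features (principal components)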


Page 3:

Principal Components Analysis (PCA)

Assume that the attributes in the dataset are drawn from a multivariate normal distribution, p(x) = N(μ, Σ).


Mean: μ = E[x] = [μ1, ..., μd]^T

Covariance (a d×d matrix, from the d×1 by 1×d outer product):

$$\boldsymbol{\Sigma} = E\big[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\big] =
\begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2
\end{pmatrix}$$

The variance is now a matrix, called the "covariance" matrix. Its diagonal elements are the variances σi² of the individual attributes; the off-diagonal elements describe how fluctuations in one attribute relate to fluctuations in another.
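A minimal MATLAB sketch of estimating these quantities from data (the data matrix X here is hypothetical, one sample per row):

% Minimal sketch: sample mean and covariance (hypothetical data)
X  = randn(200, 4);       % 200 samples, d = 4 attributes
mu = mean(X, 1);          % estimate of the mean vector (1 x d)
S  = cov(X);              % estimate of the d x d covariance matrix
disp(diag(S)');           % diagonal elements: variances of the individual attributes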

Page 4:

$$\boldsymbol{\Sigma} = E\big[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\big]$$

(the same d×d covariance matrix as above, with variances σi² on the diagonal and covariances σij off the diagonal)

Dividing each off-diagonal element by the product of the corresponding standard deviations gives the "correlation coefficients":

$$\mathrm{Corr}(x_i, x_j) \equiv \rho_{ij} = \frac{\sigma_{ij}}{\sigma_i\,\sigma_j}, \qquad |\rho_{ij}| \le 1$$

Correlation among attributes makes it difficult to say how any one attribute contributes to an effect.
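A minimal MATLAB sketch of this normalization (continuing with a hypothetical data matrix X):

% Minimal sketch: correlation coefficients from the covariance matrix
S  = cov(X);
sd = sqrt(diag(S));        % standard deviations of the attributes
R  = S ./ (sd * sd');      % rho_ij = sigma_ij / (sigma_i * sigma_j)
% MATLAB's built-in corrcoef(X) should give the same matrix:
Rchk = corrcoef(X);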

Page 5:

Consider a linear transformation of the attributes, z = Mx, where M is a d×d matrix. The d features z will also be normally distributed (proof later).

A choice of M that results in a diagonal covariance matrix in feature space has the following advantages:
1. Interpretation of uncorrelated features is easier.
2. The total variance of the features is the sum of the diagonal elements.

Page 6:

Diagonalization of the covariance matrix:

The transformation z = Mx that leads to a diagonal feature-space covariance has M = W^T, where the columns of W are the eigenvectors of the covariance matrix Σ.

The collection of eigenvalue equations Σwk = λk wk

can be written as ΣW = WD, where D = diag(λ1, ..., λd) and W is formed from the column vectors [w1 ... wd].

Because the eigenvectors of a symmetric matrix are orthonormal, W^T = W^(−1), so W^T Σ W = W^(−1) W D = D.

If we arrange the eigenvectors so that the eigenvalues λ1, ..., λd are in decreasing order of magnitude, then zi = wi^T x, i = 1, ..., k < d, are the "principal components".
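As a quick numerical check (a sketch with a hypothetical data matrix; not from the slides), MATLAB's eig confirms that W^T Σ W is diagonal:

% Minimal sketch: diagonalizing a sample covariance matrix (hypothetical data)
X = randn(200, 4);
S = cov(X);
[W, D] = eig(S);          % columns of W are eigenvectors, D is diagonal
disp(W' * W);             % ~ identity: the eigenvectors are orthonormal
disp(W' * S * W);         % ~ D: the feature-space covariance is diagonal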

Page 7:

Proportion of Variance (PoV) explained by k principal components (λi sorted in descending order) is

$$\mathrm{PoV} = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_k + \cdots + \lambda_d}$$


A plot of PoV vs. k shows how many eigenvalues are required to capture a given fraction of the total variance.

How many principal components?
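A minimal MATLAB sketch of this choice (assuming lambda already holds the eigenvalues sorted in descending order):

% Minimal sketch: proportion of variance and choice of k
PoV = cumsum(lambda) / sum(lambda);    % PoV after k components, k = 1..d
k   = find(PoV >= 0.90, 1);            % smallest k that captures >= 90% of the variance
plot(PoV, 'o-'); xlabel('k'); ylabel('PoV');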

Page 8:

Proof that if the attributes x are normally distributed with mean μ and covariance Σ, then z = w^T x is normally distributed with mean w^T μ and variance w^T Σ w:

Var(z) = Var(w^T x) = E[(w^T x − w^T μ)²] = E[(w^T x − w^T μ)(x^T w − μ^T w)]
       = E[w^T (x − μ)(x − μ)^T w] = w^T E[(x − μ)(x − μ)^T] w = w^T Σ w

The objective of PCA is to maximize Var(z) = w^T Σ w. This must be done subject to the constraint ||w1|| = 1, i.e. w1^T w1 = 1.
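The identity can also be checked by simulation (a sketch with hypothetical μ, Σ, and w, sampling through a Cholesky factor; not from the slides):

% Minimal sketch: numerical check that Var(w'x) = w' * Sigma * w
mu    = [1; 2];
Sigma = [2 0.8; 0.8 1];
w     = [0.6; -0.4];
n     = 1e5;
X     = repmat(mu', n, 1) + randn(n, 2) * chol(Sigma);   % rows x ~ N(mu, Sigma)
z     = X * w;                                           % z = w' * x for each sample
[mean(z), w' * mu]                 % should agree: E[z] = w' * mu
[var(z),  w' * Sigma * w]          % should agree: Var(z) = w' * Sigma * w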


Page 9:

Review: constrained optimization by Lagrange multipliers

Find the stationary point of f(x1, x2) = 1 − x1² − x2² subject to the constraint g(x1, x2) = x1 + x2 = 1.

Constrained optimization

Page 10:

Form the Lagrangian

L(x, λ) = f(x1, x2) + λ(g(x1, x2) − c)

L(x, λ) = 1 − x1² − x2² + λ(x1 + x2 − 1)

Page 11:

Set the partial derivatives of L with respect to x1, x2, and λ equal to zero:

∂L/∂x1 = −2x1 + λ = 0
∂L/∂x2 = −2x2 + λ = 0
∂L/∂λ = x1 + x2 − 1 = 0

Solve for x1 and x2.

Page 12:

In this case it is not necessary to find λ; λ is sometimes called an "undetermined multiplier".

The solution is x1* = x2* = ½.
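The stationarity conditions are linear, so the solution can be verified in a few lines of MATLAB (a sketch, not from the slides):

% Minimal sketch: solve the stationarity conditions
%   -2*x1 + lambda = 0,   -2*x2 + lambda = 0,   x1 + x2 = 1
A = [-2  0  1;
      0 -2  1;
      1  1  0];
b = [0; 0; 1];
sol = A \ b;              % sol = [x1; x2; lambda] = [0.5; 0.5; 1]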

Page 13:

Application of Lagrange multipliers in PCA

Find w1 such that w1^T Σ w1 is maximized, subject to the constraint ||w1|| = w1^T w1 = 1.

Maximize the Lagrangian L = w1^T Σ w1 + c(w1^T w1 − 1).

Setting the gradient of L to zero gives 2Σw1 + 2c w1 = 0, so Σw1 = −c w1.

Therefore w1 is an eigenvector of the covariance matrix. Let c = −λ1; then λ1 is the eigenvalue associated with w1.


Page 14:

Prove that λ1 is the variance of principal component 1:

z1 = w1^T x

Σw1 = λ1 w1

Var(z1) = w1^T Σ w1 = λ1 w1^T w1 = λ1

To maximize Var(z1), choose λ1 to be the largest eigenvalue.


Page 15:

More principal components:

If Σ has (at least) two distinct eigenvalues, define the 2nd principal component by maximizing Var(z2), such that ||w2|| = 1 and w2 is orthogonal to w1.

Introduce Lagrange multipliers α and β:

$$L = \mathbf{w}_2^T \boldsymbol{\Sigma}\, \mathbf{w}_2 - \alpha\,(\mathbf{w}_2^T \mathbf{w}_2 - 1) - \beta\,(\mathbf{w}_2^T \mathbf{w}_1 - 0)$$

Setting the gradient of L with respect to w2 to zero: 2Σw2 − 2αw2 − βw1 = 0. Premultiplying by w1^T and using w1^T w2 = 0 (so that w1^T Σ w2 = λ1 w1^T w2 = 0) and w1^T w1 = 1 shows that β = 0; with α = λ2 this gives Σw2 = λ2 w2.

To maximize Var(z2), choose λ2 to be the second largest eigenvalue.

Page 16:

For any d×d matrix M, z = M^T x is a linear transformation of the attributes x that defines the features z.

If the attributes x are normally distributed with mean μ and covariance Σ, then z is normally distributed with mean M^T μ and covariance M^T Σ M (proof: slide 8).

If M = W, a matrix whose columns are the normalized eigenvectors of Σ, then the covariance of z is diagonal, with elements equal to the eigenvalues of Σ (proof: slide 6).

Arrange the eigenvalues in decreasing order of magnitude and find λ1, ..., λk that account for most (e.g. 90%) of the total variance; then zi = wi^T x are the "principal components".


Review

Page 17:

More review

MATLAB's [V,D] = eig(A) returns both the eigenvectors (the columns of V) and the eigenvalues (the diagonal entries of D); for a covariance matrix they come out in increasing order. Invert the order and construct:

$$\mathrm{PoV} = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_k + \cdots + \lambda_d}$$

Choose k to capture the desired amount of the total variance.
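A minimal MATLAB sketch of this step (Xc is assumed to be a centered n×d data matrix; not code from the slides):

% Minimal sketch: eigendecomposition, order inversion, and choice of k
S      = cov(Xc);
[V, D] = eig(S);               % eigenvalues come out in increasing order
lambda = flipud(diag(D));      % invert the order: largest eigenvalue first
W      = fliplr(V);            % reorder the eigenvector columns to match
                               % (use sort(diag(D),'descend') if the order is not guaranteed)
PoV    = cumsum(lambda) / sum(lambda);
k      = find(PoV >= 0.90, 1); % desired amount of total variance, e.g. 90%
Z      = Xc * W(:, 1:k);       % the k principal components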

Page 18:

Example: cancer diagnostics

• Metabonomics data
• 94 samples
• 35 metabolites in each sample = d
• 60 control samples
• 34 diseased samples

Page 19:

[Figure: ranked eigenvalues and proportion of variance (PoV) plot for the metabonomics data; the first 3 PCs capture > 95% of the total variance.]

Page 20:

Scatter plot of PCs 1 and 2

[Figure: scatter plot of the first two principal components; samples 1–34 are cancer, samples 35 and above are control. The samples from cancer patients cluster together.]

Page 21:

Assignment 5 due 10-30-15

Find the accuracy of a model that classifies all 6 types of beer bottles in glassdata.csv by multivariate linear regression. Find the eigenvalues and eigenvectors of the covariance matrix for the full beer-bottle data set. How many eigenvalues are required to capture more than 90% of the variance? Transform the attribute data by the eigenvectors of the 3 largest eigenvalues. What is the accuracy of a linear model that uses these features?

Plot the accuracy when you successively extend the linear model by including z1², z2², z3², z1z2, z1z3, and z2z3.

Page 22:

PCA code for glass data
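The code on this slide is an image in the original; the following is a hedged MATLAB sketch along the same lines (the file name glassdata.csv comes from the assignment; the column layout, with the class label in the last column and no header row, is an assumption):

% Minimal sketch: PCA on the glass data (class label assumed to be the last column)
data   = csvread('glassdata.csv');        % or readmatrix in newer MATLAB versions
X      = data(:, 1:end-1);                % attributes
y      = data(:, end);                    % class labels
Xc     = bsxfun(@minus, X, mean(X, 1));   % center the attributes
S      = cov(Xc);
[V, D] = eig(S);
[lambda, order] = sort(diag(D), 'descend');   % eigenvalues by decreasing magnitude
W      = V(:, order);
PoV    = cumsum(lambda) / sum(lambda);
k      = find(PoV >= 0.90, 1)             % eigenvalues needed for > 90% of the variance
Z      = Xc * W(:, 1:3);                  % features from the 3 largest eigenvalues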

Page 23:

[Output: the eigenvalues, indexed by decreasing magnitude.]

Page 24:

[Output: PoV, indexed by decreasing magnitude.]

Page 25:

Extend MLR with PCA features
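This slide's code is also an image; a hedged sketch of one way to set it up (a one-hot indicator matrix fitted by least squares, reusing y and Z from the sketch above):

% Minimal sketch: multivariate linear regression on the PCA features,
% extended with the squared and cross-product terms from the assignment
classes = unique(y);
Y  = double(bsxfun(@eq, y, classes'));      % one-hot indicator matrix, one column per class
z1 = Z(:,1); z2 = Z(:,2); z3 = Z(:,3);
F  = [ones(size(Z,1),1), Z, ...             % intercept and linear terms
      z1.^2, z2.^2, z3.^2, ...              % squared terms
      z1.*z2, z1.*z3, z2.*z3];              % cross-product terms
B  = F \ Y;                                 % least-squares fit
[~, pred] = max(F * B, [], 2);              % predicted class = largest output
accuracy  = mean(classes(pred) == y)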

Page 26:

[Figure: accuracy of the linear model as it is successively extended with the squared and cross-product PCA features.]