
Lecture Notes for E. Alpaydın (2010), Introduction to Machine Learning 2e © The MIT Press (V1.0)

Why Reduce Dimensionality?

• Reduces time complexity: less computation
• Reduces space complexity: fewer parameters
• Simpler models are more robust on small datasets
• More interpretable; simpler explanation
• Data visualization (beyond 2 attributes, it gets complicated)

Feature Selection vs Extraction

Feature selection: choose k < d important features, ignore the remaining d − k.
Examples: data snooping, genetic algorithms.

Feature extraction: project the original d attributes onto a new k < d dimensional feature space.
Examples: principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA), auto-associative ANN.

Principal Components Analysis (PCA)

Assume that the attributes in the dataset are drawn from a multivariate normal distribution: p(x) = N(μ, Σ).

Mean: E[x] = μ = [μ1, ..., μd]^T

Covariance: Σ = E[(x − μ)(x − μ)^T], a (d×1)(1×d) = d×d matrix:

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}$$

The variance of a multivariate distribution is a matrix, called the "covariance". The diagonal elements are the variances σi² of the individual attributes. The off-diagonal elements describe how fluctuations in one attribute affect fluctuations in another.


Dividing the off-diagonal elements by the product of the corresponding standard deviations gives the "correlation coefficients":

Corr(xi, xj) = ρij = σij / (σi σj),  −1 ≤ ρij ≤ 1

Correlation among attributes makes it difficult to say how any one attribute contributes to an effect.
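As a concrete illustration, here is a minimal MATLAB sketch (the data matrix X below is an arbitrary synthetic example, not from the lecture):

% Sample covariance and correlation of an N x d attribute matrix X.
rng(1);                      % reproducible synthetic data
A = [1 0.9; 0 0.5];
X = randn(200, 2) * A;       % N = 200 samples of d = 2 correlated attributes
S = cov(X);                  % d x d covariance matrix; diagonal holds the variances
R = corrcoef(X);             % correlation coefficients rho_ij = sigma_ij / (sigma_i * sigma_j)
disp(S); disp(R);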

Consider a linear transformation of the attributes, z = Mx, where M is a d×d matrix. The d features z will also be normally distributed (proof later).

A choice of M that results in a diagonal covariance matrix in feature space has the following advantages:
1. Interpretation of uncorrelated features is easier.
2. The total variance of the features is the sum of the diagonal elements.

Diagonalization of the covariance matrix:

The transformation z = Mx that leads to a diagonal feature-space covariance has M = W^T, where the columns of W are the eigenvectors of the covariance matrix Σ.

The collection of eigenvalue equations Σwk = λk wk can be written as ΣW = WD, where D = diag(λ1, ..., λd) and W is formed from the column vectors [w1 ... wd].

Because Σ is symmetric, W is orthogonal: W^T = W^(-1), so W^T Σ W = W^(-1) Σ W = W^(-1) W D = D.

If we arrange the eigenvectors so that the eigenvalues λ1, ..., λd are in decreasing order of magnitude, then zi = wi^T x, i = 1, ..., k < d, are the "principal components".
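A minimal MATLAB sketch of this diagonalization (the data matrix X and its size are illustrative assumptions):

% Diagonalize the sample covariance with its eigenvectors.
rng(2);
X = randn(500, 3) * [1 0.6 0.2; 0 0.8 0.3; 0 0 0.5];  % example N x d data, d = 3
S = cov(X);                 % d x d covariance matrix
[W, D] = eig(S);            % S*W = W*D; columns of W are the eigenvectors
Z = X * W;                  % each row of Z is z = W^T x for one sample
disp(cov(Z));               % approximately diagonal: the features are uncorrelated
disp(diag(D)');             % its diagonal elements are the eigenvalues of S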

How many principal components?

The proportion of variance (PoV) explained by k principal components (λi sorted in descending order) is

$$\mathrm{PoV}(k) = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_d}$$

A plot of PoV vs. k shows how many eigenvalues are required to capture a given fraction of the total variance.
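A sketch of how that curve can be produced (assuming S is the covariance matrix from the sketch above):

% Proportion of variance captured by the first k principal components.
lambda = sort(eig(S), 'descend');    % eigenvalues of the covariance matrix, largest first
PoV = cumsum(lambda) / sum(lambda);  % PoV(k) = (lambda_1 + ... + lambda_k) / total variance
plot(1:numel(lambda), PoV, 'o-');
xlabel('k (number of principal components)');
ylabel('Proportion of variance');
k90 = find(PoV >= 0.90, 1);          % smallest k capturing at least 90% of the total variance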

Proof that if the attributes x are normally distributed with mean μ and covariance Σ, then z = w^T x is normally distributed with mean w^T μ and variance w^T Σ w:

Var(z) = Var(w^T x) = E[(w^T x − w^T μ)²]
= E[(w^T x − w^T μ)(x^T w − μ^T w)]
= E[w^T (x − μ)(x − μ)^T w]
= w^T E[(x − μ)(x − μ)^T] w = w^T Σ w

The objective of PCA is to maximize Var(z) = w^T Σ w. This must be done subject to the constraint ||w1|| = 1, i.e. w1^T w1 = 1.


Review: constrained optimization by Lagrange multipliers

Constrained optimization: find the stationary point of f(x1, x2) = 1 − x1² − x2² subject to the constraint g(x1, x2) = x1 + x2 = 1.

Form the Lagrangian:

L(x, λ) = f(x1, x2) + λ(g(x1, x2) − c)
L(x, λ) = 1 − x1² − x2² + λ(x1 + x2 − 1)

Set the partial derivatives of L with respect to x1, x2, and λ equal to zero:

∂L/∂x1 = −2x1 + λ = 0
∂L/∂x2 = −2x2 + λ = 0
∂L/∂λ = x1 + x2 − 1 = 0

Solve for x1 and x2. In this case it is not necessary to find λ; λ is sometimes called the "undetermined multiplier".

The solution is x1* = x2* = ½.
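Writing out the elimination step between the stationarity conditions and the stated solution:

$$-2x_1 + \lambda = 0 \Rightarrow x_1 = \tfrac{\lambda}{2}, \qquad -2x_2 + \lambda = 0 \Rightarrow x_2 = \tfrac{\lambda}{2}$$

$$x_1 + x_2 = 1 \Rightarrow \tfrac{\lambda}{2} + \tfrac{\lambda}{2} = 1 \Rightarrow \lambda = 1, \quad x_1^* = x_2^* = \tfrac{1}{2}$$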

Application of Lagrange multipliers in PCA

Find w1 such that w1^T Σ w1 is maximum subject to the constraint ||w1|| = 1, i.e. w1^T w1 = 1.

Maximize L = w1^T Σ w1 + c(w1^T w1 − 1).

The gradient of L is 2Σw1 + 2c w1 = 0, so Σw1 = −c w1.

Therefore w1 is an eigenvector of the covariance matrix. Let c = −λ1; λ1 is the eigenvalue associated with w1.


Prove that λ1 is the variance of principal component 1:

z1 = w1^T x
Σw1 = λ1 w1
Var(z1) = w1^T Σ w1 = λ1 w1^T w1 = λ1

To maximize Var(z1), choose λ1 as the largest eigenvalue.
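A quick numerical check of this result in MATLAB (a minimal sketch; the 2-D Gaussian sample below is synthetic, not from the lecture):

% Verify that the variance of the first principal component equals the largest eigenvalue.
rng(0);                                 % reproducible synthetic data
X = randn(1000, 2) * [2 0.8; 0 0.5];    % a correlated 2-D Gaussian sample
S = cov(X);
[W, D] = eig(S);
[lambda, idx] = sort(diag(D), 'descend');
w1 = W(:, idx(1));                      % eigenvector of the largest eigenvalue
z1 = X * w1;                            % scores of the first principal component
fprintf('var(z1) = %.4f, lambda_1 = %.4f\n', var(z1), lambda(1));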


More principal components:

If Σ has 2 distinct eigenvalues, define the 2nd principal component by maximizing Var(z2), subject to ||w2|| = 1 and w2 orthogonal to w1.

Introduce Lagrange multipliers α and β:

$$L = \mathbf{w}_2^T \Sigma \mathbf{w}_2 - \alpha(\mathbf{w}_2^T \mathbf{w}_2 - 1) - \beta(\mathbf{w}_2^T \mathbf{w}_1 - 0)$$

Set the gradient of L with respect to w2 to zero: 2Σw2 − 2αw2 − βw1 = 0. Premultiplying by w1^T and using w1^T w2 = 0 gives β = 0; with α = λ2 this reduces to Σw2 = λ2 w2.

To maximize Var(z2), choose λ2 as the second largest eigenvalue.

For any d×d matrix M, z = M^T x is a linear transformation of the attributes x that defines the features z.

If the attributes x are normally distributed with mean μ and covariance Σ, then z is normally distributed with mean M^T μ and covariance M^T Σ M (proof: slide 8).

If M = W, a matrix whose columns are the normalized eigenvectors of Σ, then the covariance of z is diagonal with elements equal to the eigenvalues of Σ (proof: slide 6).

Arrange the eigenvalues in decreasing order of magnitude and find λ1, ..., λk that account for most (e.g. 90%) of the total variance; then zi = wi^T x are the "principal components".


Review

MATLAB's [V,D] = eig(A) returns both the eigenvectors (columns of V) and the eigenvalues (diagonal of D) in increasing order. Invert the order and construct W from the reordered eigenvectors.
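A minimal sketch of that reordering (assuming S holds the covariance matrix; the variable names are illustrative):

[V, D] = eig(S);                          % eigenvalues arrive in increasing order
[lambda, idx] = sort(diag(D), 'descend'); % invert the order: largest eigenvalue first
W = V(:, idx);                            % eigenvectors rearranged to match
PoV = cumsum(lambda) / sum(lambda);       % proportion of variance for k = 1, ..., d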


More review

$$\mathrm{PoV}(k) = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_d}$$

Choose k so that it captures the desired amount of the total variance.

Example: cancer diagnostics

• Metabonomics data
• 94 samples
• 35 metabolites in each sample = d
• 60 control samples
• 34 diseased samples

$$\mathrm{PoV}(k) = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_d}$$

Ranked eigenvalues: 73.6809, 18.7491, 2.8856, 1.9068, 0.7278, 0.5444, 0.4238, 0.3501, 0.1631
Proportion of variance plot: 3 PCs capture > 95% of the total variance.

Scatter plot of PCs 1 and 2: samples 1-34 (cancer), samples 35 and above (control); samples from cancer patients cluster.

Assignment 5 due 10-30-15

Find the accuracy of a model that classifies all 6 types of beer bottles in glassdata.csv by multivariate linear regression. Find the eigenvalues and eigenvectors of the covariance matrix for the full beer-bottle data set. How many eigenvalues are required to capture more than 90% of the variance? Transform the attribute data by the eigenvectors of the 3 largest eigenvalues. What is the accuracy of a linear model that uses these features?

Plot the accuracy when you successively extend the linear model by including z1², z2², z3², z1z2, z1z3, and z2z3.

PCA code for glass data

Eigenvalues and eigenvectors indexed by decreasing magnitude; PoV.
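A possible sketch of that code (it assumes glassdata.csv is a purely numeric file with the attribute columns first and the class label in the last column; that layout is an assumption, not something stated above):

% PCA of the glass data: eigen-decomposition of the attribute covariance.
T = csvread('glassdata.csv');             % assumes a purely numeric file with no header row
X = T(:, 1:end-1);                        % assumed: attribute columns come first
y = T(:, end);                            % assumed: class label in the last column
S = cov(X);                               % covariance matrix of the attributes
[V, D] = eig(S);
[lambda, idx] = sort(diag(D), 'descend'); % eigenvalues indexed by decreasing magnitude
W = V(:, idx);
PoV = cumsum(lambda) / sum(lambda);       % proportion of variance captured by the first k PCs
k90 = find(PoV > 0.90, 1);                % eigenvalues needed to capture more than 90% of the variance
Xc = X - ones(size(X,1),1) * mean(X);     % center the attributes
Z = Xc * W(:, 1:3);                       % features from the eigenvectors of the 3 largest eigenvalues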

Extend MLR with PCA features

L + x1² + x2² + x3² + x1x2 + x1x3 + x2x3 (the squared and cross terms of the three PCA features, as listed in the assignment)
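One way this extension might look in MATLAB (a sketch; Z is the N×3 matrix of PCA features and y the numeric class labels from the previous listing, and rounding the regression output to the nearest label is only an illustrative classification rule):

% Extend the linear model with squared and cross terms of the PCA features.
Phi = [ones(size(Z,1),1), Z, ...                         % bias term plus the linear features z1, z2, z3
       Z(:,1).^2, Z(:,2).^2, Z(:,3).^2, ...              % squared terms
       Z(:,1).*Z(:,2), Z(:,1).*Z(:,3), Z(:,2).*Z(:,3)];  % cross terms
b = Phi \ y;                               % least-squares fit of the extended linear model
yhat = round(Phi * b);                     % crude prediction: round to the nearest numeric label (assumption)
accuracy = mean(yhat == y);                % fraction of correctly classified samples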
