
Knowledge Discovery and Data Mining 1 (VO) (707.003): Dimensionality Reduction

Denis Helic

KTI, TU Graz

Dec 12, 2013


Big picture: KDDM

[Diagram: the Knowledge Discovery Process, supported by mathematical tools (Linear Algebra, Probability Theory, Information Theory, Statistical Inference) and infrastructure (Map-Reduce), spanning the Preprocessing, Transformation, and Data Mining steps.]


Outline

1 Introduction

2 UV Decomposition

3 Eigenvalues and Eigenvectors

4 Principal-Component Analysis

5 PCA Example

Slides

Slides are partially based on “Mining Massive Datasets” Chapters 9 and 11


Introduction

Representing data as matrices

There are many sources of data that can be represented as large matrices

Vector Space Model: documents are represented as term vectors

The complete document collection is represented as a large document-term matrix

In recommender systems, we represent users' ratings of items as a utility matrix

We also use matrices to represent social networks, etc.


Introduction

Representing data as matrices

These matrices are huge and have hundreds of thousands, even millions of rows and columns

E.g. the document-term matrix of Wikipedia: 10 million rows and 100,000 columns

Utility matrix at Amazon

The number of customers times the number of articles

The curse of dimensionality


Introduction

Representing data as matrices

In many cases we can summarize these matrices by finding narrower matrices that are in some sense close to the original

These narrow matrices have small numbers of rows and columns

We can use them more efficiently than the original matrices

The process of finding those narrow matrices is called dimensionality reduction


UV Decomposition

Recommender systems

Recommender systems predict user responses to options

E.g. recommend news articles based on prediction of user interests

E.g. recommend products based on predictions of what a user might like

Content-based systems make predictions based on the content of the items

Collaborative filtering systems make predictions based on the similarity between users/items


UV Decomposition

The Utility Matrix

In a recommender system there are two classes of entities: users and items

Users have preferences for certain items and we have to mine for these preferences

The data is represented by a utility matrix M ∈ R^{n×m}, where n is the number of users and m the number of items

For each user-item pair, the matrix gives a value that represents what is known about the preference of that user for that item

E.g. the values can come from an ordered set (1 to 5) and represent a rating that a user gave for an item


UV Decomposition

The Utility Matrix

We assume that the matrix is sparse

This means that most entries are unknown

The majority of the user preferences for specific items is unknown

An unknown rating means that we do not have explicit information

It does not mean that the rating is low

The goal of a recommender system is to predict the blank entries in the utility matrix


UV Decomposition

The Utility Matrix: example

User \ Movie   HP1   HP2   HP3   Hobbit   SW1   SW2   SW3

A               4                   5      1
B               5     5                           4
C                                   2      4      5
D               3                                        3


UV Decomposition

Item profiles

The utility matrix itself does not offer a lot of evidence

Typically in practice, the utility matrix is a very sparse matrix

Also, we might think about the utility matrix as a final result of a rating process

For example, items have some (general) characteristics that users like/dislike, and because of these characteristics they rate them in one way or another


UV Decomposition

Item profiles: example

Movies have certain features which describe important characteristics of every movie

The set of actors

The director

Year

The genre

Technical characteristics


UV Decomposition

User profiles

The same set of features might be used to represent the preferences of the users

We might represent the preferences as the feature weights

E.g. a feature which the user prefers gets a higher weight

The final rating of a user for an item might then be the weighted sum of the features from the item profile

This can be used to predict the missing values in the utility matrix


UV Decomposition

Representing profiles

Let us represent item and user profiles as vectors with features as dimensions

Suppose we have the following features: Julia Roberts, Edward Norton, Martin Scorsese, Ridley Scott, Western, Drama, Thriller

Let us denote the features with a feature vector x, where each element (dimension) corresponds to a feature

x_1 = Julia Roberts, x_2 = Edward Norton, ..., x_7 = Thriller

Now we might represent the users and items with vectors u ∈ R^d and v ∈ R^d with corresponding values for each feature


UV Decomposition

Representing profiles: example

E.g. we might represent a movie starring Julia Roberts, directed by Martin Scorsese, and with a mixture of drama and thriller elements as a vector:

$$v_0 = \begin{pmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0.5 \\ 0.5 \end{pmatrix}$$


UV Decomposition

Representing profiles: example

Another thriller movie starring Edward Norton and directed by Ridley Scott:

$$v_1 = \begin{pmatrix} 0 \\ 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 1 \end{pmatrix}$$


UV Decomposition

Representing profiles: example

Let us represent the users as vectors u with weights for the features representing the user preferences

E.g. a user that prefers thrillers and Edward Norton might be represented as the following vector

$$u_0 = \begin{pmatrix} 0 \\ 2 \\ 0 \\ 0 \\ 0 \\ 0 \\ 2 \end{pmatrix}$$


UV Decomposition

Representing profiles: example

Note that with the weights we might also express that a user dislikes a particular feature

$$u_1 = \begin{pmatrix} -2 \\ 0 \\ 3 \\ 0 \\ 0 \\ 2 \\ 2 \end{pmatrix}$$


UV Decomposition

Representing profiles: example

Now, we can calculate a rating that a user gives for a certain movie by calculating u^T v

$$u_0^T v_0 = \begin{pmatrix} 0 & 2 & 0 & 0 & 0 & 0 & 2 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0.5 \\ 0.5 \end{pmatrix} = 1$$


UV Decomposition

Representing profiles: example

$$u_1^T v_0 = \begin{pmatrix} -2 & 0 & 3 & 0 & 0 & 2 & 2 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0.5 \\ 0.5 \end{pmatrix} = 3$$


UV Decomposition

Representing profiles: example

$$u_0^T v_1 = \begin{pmatrix} 0 & 2 & 0 & 0 & 0 & 0 & 2 \end{pmatrix} \begin{pmatrix} 0 \\ 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 1 \end{pmatrix} = 4$$


UV Decomposition

Representing profiles: example

$$u_1^T v_1 = \begin{pmatrix} -2 & 0 & 3 & 0 & 0 & 2 & 2 \end{pmatrix} \begin{pmatrix} 0 \\ 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 1 \end{pmatrix} = 2$$
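
As a quick check, these four ratings can be reproduced with a few lines of NumPy (a minimal sketch; the profile vectors are exactly the ones defined on the previous slides):

```python
import numpy as np

# Item profiles (Julia Roberts, Edward Norton, Scorsese, Ridley Scott, Western, Drama, Thriller)
v0 = np.array([1, 0, 1, 0, 0, 0.5, 0.5])   # Scorsese drama/thriller with Julia Roberts
v1 = np.array([0, 1, 0, 1, 0, 0, 1.0])     # Ridley Scott thriller with Edward Norton

# User profiles (feature weights; a negative weight expresses dislike)
u0 = np.array([0, 2, 0, 0, 0, 0, 2.0])
u1 = np.array([-2, 0, 3, 0, 0, 2, 2.0])

# Predicted ratings are inner products u^T v
for name, u, v in [("u0.v0", u0, v0), ("u1.v0", u1, v0),
                   ("u0.v1", u0, v1), ("u1.v1", u1, v1)]:
    print(name, "=", u @ v)   # expected: 1, 3, 4, 2
```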


UV Decomposition

Representing users: The U Matrix

We can now group all user vectors u ∈ R^d into a matrix U ∈ R^{n×d}

n is the number of users, and d is the number of features

$$U = \begin{pmatrix} u_1^T \\ u_2^T \\ \vdots \\ u_n^T \end{pmatrix}$$


UV Decomposition

Representing items: The V Matrix

We can now group all item vectors v ∈ R^d into a matrix V ∈ R^{d×m}

m is the number of items, and d is the number of features

$$V = \begin{pmatrix} v_1 & v_2 & \dots & v_m \end{pmatrix}$$


UV Decomposition

The product: UV

Now, the utility matrix M is given by:

M = UV

We will decompose M to obtain U and V

We will reduce the dimensions of M

We can also use U and V to predict missing values in M


UV Decomposition

UV Decomposition

We start with M ∈ R^{n×m} and want to find U ∈ R^{n×d} and V ∈ R^{d×m}

UV closely approximates M

If we are able to find this decomposition then we have established that there are d dimensions that allow us to characterize both users and items closely

This process is called UV decomposition


UV Decomposition

UV Decomposition: example

$$\begin{pmatrix} 5 & 2 & 4 & 4 & 3 \\ 3 & 1 & 2 & 4 & 1 \\ 2 & & 3 & 1 & 4 \\ 2 & 5 & 4 & 3 & 5 \\ 4 & 4 & 5 & 4 & \end{pmatrix} = \begin{pmatrix} u_{11} & u_{12} \\ u_{21} & u_{22} \\ u_{31} & u_{32} \\ u_{41} & u_{42} \\ u_{51} & u_{52} \end{pmatrix} \times \begin{pmatrix} v_{11} & v_{12} & v_{13} & v_{14} & v_{15} \\ v_{21} & v_{22} & v_{23} & v_{24} & v_{25} \end{pmatrix}$$


UV Decomposition

Root-Mean-Square-Error

We approximate M → we need to measure the approximation error

We can pick among several measures for this error

A typical choice is the root-mean-square-error (RMSE):

1 Sum over all nonblank entries in M the square of the difference between that entry and the corresponding entry in the product UV
2 Take the average of these squares by dividing by the number of terms in the sum (i.e. the number of nonblank entries in M)
3 Take the square root of the mean

Minimizing the sum of the squares is equivalent to minimizing the square root of the average square, thus we can omit the last two steps
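
A small sketch of this error measure in NumPy, assuming the unknown entries of M are stored as NaN (that representation is my assumption, not part of the slides):

```python
import numpy as np

def rmse(M, U, V):
    """RMSE between the nonblank entries of M and the corresponding entries of U @ V."""
    P = U @ V
    mask = ~np.isnan(M)                      # nonblank entries only
    return np.sqrt(np.mean((M[mask] - P[mask]) ** 2))

def sse(M, U, V):
    """Sum of squared errors; minimizing this is equivalent to minimizing the RMSE."""
    P = U @ V
    mask = ~np.isnan(M)
    return np.sum((M[mask] - P[mask]) ** 2)

# The running example: 5x5 utility matrix, U and V initialized with all ones
nan = np.nan
M = np.array([[5, 2, 4, 4, 3],
              [3, 1, 2, 4, 1],
              [2, nan, 3, 1, 4],
              [2, 5, 4, 3, 5],
              [4, 4, 5, 4, nan]])
U, V = np.ones((5, 2)), np.ones((2, 5))
print(sse(M, U, V))    # 75.0, matching the total sum of squares on the next slides
```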


UV Decomposition

Root-Mean-Square-Error: example

Suppose we start with U and V with all ones:

$$\begin{pmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{pmatrix} \times \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 2 & 2 & 2 & 2 & 2 \\ 2 & 2 & 2 & 2 & 2 \\ 2 & 2 & 2 & 2 & 2 \\ 2 & 2 & 2 & 2 & 2 \\ 2 & 2 & 2 & 2 & 2 \end{pmatrix}$$


UV Decomposition

Root-Mean-Square-Error: example

$$\begin{pmatrix} 5 & 2 & 4 & 4 & 3 \\ 3 & 1 & 2 & 4 & 1 \\ 2 & & 3 & 1 & 4 \\ 2 & 5 & 4 & 3 & 5 \\ 4 & 4 & 5 & 4 & \end{pmatrix} - \begin{pmatrix} 2 & 2 & 2 & 2 & 2 \\ 2 & 2 & 2 & 2 & 2 \\ 2 & 2 & 2 & 2 & 2 \\ 2 & 2 & 2 & 2 & 2 \\ 2 & 2 & 2 & 2 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 0 & 2 & 2 & 1 \\ 1 & -1 & 0 & 2 & -1 \\ 0 & & 1 & -1 & 2 \\ 0 & 3 & 2 & 1 & 3 \\ 2 & 2 & 3 & 2 & \end{pmatrix}$$


UV Decomposition

Root-Mean-Square-Error: example

Sum of squares:

Row 1: 18
Row 2: 7
Row 3: 6
Row 4: 23
Row 5: 21

Total sum: 75

We can already stop at this point


UV Decomposition

Incremental computation

Finding the decomposition with the least RMSE involves starting with some arbitrarily chosen U and V and iteratively adapting the matrices to make the RMSE smaller

We consider only adjustments to a single element of U or V

In principle we could also make more complex adjustments

In a typical example we will encounter many local minima

In that case no allowable adjustments to U or V will make the RMSE smaller


UV Decomposition

Incremental computation

Only one of these will be the global minimum

That is the least possible RMSE

To increase the chances of finding the global minimum we may start the iteration many times with different starting points

However, there is no guarantee that we will find the global minimum


UV Decomposition

Incremental computation: example

Suppose we start with U and V with all ones and make a single adjustment (u_11):

$$\begin{pmatrix} x & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{pmatrix} \times \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} x+1 & x+1 & x+1 & x+1 & x+1 \\ 2 & 2 & 2 & 2 & 2 \\ 2 & 2 & 2 & 2 & 2 \\ 2 & 2 & 2 & 2 & 2 \\ 2 & 2 & 2 & 2 & 2 \end{pmatrix}$$


UV Decomposition

Incremental computation: example

Sum of squares:

Row 1: (5 − (x+1))² + (2 − (x+1))² + (4 − (x+1))² + (4 − (x+1))² + (3 − (x+1))²

This simplifies to: (4 − x)² + (1 − x)² + (3 − x)² + (3 − x)² + (2 − x)²

We are looking for the x that minimizes this sum s:

$$\frac{ds}{dx} = 0$$


UV Decomposition

Incremental computation: example

$$\frac{ds}{dx} = -2\big((4 - x) + (1 - x) + (3 - x) + (3 - x) + (2 - x)\big) = 0$$

This gives x = 2.6

$$\begin{pmatrix} 2.6 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{pmatrix} \times \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 3.6 & 3.6 & 3.6 & 3.6 & 3.6 \\ 2 & 2 & 2 & 2 & 2 \\ 2 & 2 & 2 & 2 & 2 \\ 2 & 2 & 2 & 2 & 2 \\ 2 & 2 & 2 & 2 & 2 \end{pmatrix}$$
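
Setting the derivative to zero simply makes x the mean of the five constants 4, 1, 3, 3, 2, which is easy to verify numerically (a minimal check, not part of the original slides):

```python
import numpy as np

targets = np.array([4, 1, 3, 3, 2])   # the constants in (4-x)^2 + (1-x)^2 + ...
x = targets.mean()                    # minimizer of the sum of squares
print(x)                              # 2.6
```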


UV Decomposition

Incremental computation: example

Now we would again make a single adjustment (v_11) and repeat the process

$$\begin{pmatrix} 2.6 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{pmatrix} \times \begin{pmatrix} y & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 2.6y+1 & 3.6 & 3.6 & 3.6 & 3.6 \\ y+1 & 2 & 2 & 2 & 2 \\ y+1 & 2 & 2 & 2 & 2 \\ y+1 & 2 & 2 & 2 & 2 \\ y+1 & 2 & 2 & 2 & 2 \end{pmatrix}$$


UV Decomposition

Optimizing an arbitrary element

What does the general formula look like?

We denote with P = UV the current product of matrices U and V

Suppose we want to vary u_rs and find the value of this element that minimizes the RMSE

Note that u_rs only affects the elements in the r-th row of P:

$$p_{rj} = \sum_{k=1}^{d} u_{rk} v_{kj} = \sum_{k \neq s} u_{rk} v_{kj} + x v_{sj}$$

We sum over all nonblank values m_rj

We replaced u_rs with x


UV Decomposition

Optimizing an arbitrary element

If m_rj is a nonblank element then the contribution of this element to the RMSE is given by:

$$(m_{rj} - p_{rj})^2 = \Big(m_{rj} - \sum_{k \neq s} u_{rk} v_{kj} - x v_{sj}\Big)^2$$

Now, we can sum over all squares of errors on nonblank entries of M

$$\sum_{j} \Big(m_{rj} - \sum_{k \neq s} u_{rk} v_{kj} - x v_{sj}\Big)^2$$


UV Decomposition

Optimizing an arbitrary element

We take the derivative with respect to x and set it equal to 0:

$$\sum_{j} -2 v_{sj} \Big(m_{rj} - \sum_{k \neq s} u_{rk} v_{kj} - x v_{sj}\Big) = 0$$

We then solve for x:

$$x = \frac{\sum_{j} v_{sj} \big(m_{rj} - \sum_{k \neq s} u_{rk} v_{kj}\big)}{\sum_{j} v_{sj}^2}$$


UV Decomposition

Optimizing an arbitrary element

Similarly, we can derive a formula for the element y when we vary v_rs:

$$y = \frac{\sum_{i} u_{ir} \big(m_{is} - \sum_{k \neq r} u_{ik} v_{ks}\big)}{\sum_{i} u_{ir}^2}$$


UV Decomposition

The complete algorithm

Preprocessing: adjusting scales, e.g. by subtracting the average in rows and then columns

Initialization: many different initializations are possible, e.g. choose the elements of U and V so that the entries of the product UV equal the average of the nonblank elements in the utility matrix

Optimization: e.g. we always change a single element and pick an order of change (row-by-row, etc.)

Convergence: when the improvements in RMSE fall below a threshold we may stop
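
Putting the pieces together, here is a small NumPy sketch of the procedure: element-wise updates using the two formulas above, with NaN marking blank entries (the function names and the stopping rule are my own choices, not prescribed by the slides):

```python
import numpy as np

def update_u(M, U, V, r, s):
    """Optimal new value x for U[r, s] from d(SSE)/dx = 0."""
    cols = ~np.isnan(M[r, :])                      # nonblank entries in row r
    pred_wo = U[r, :] @ V - U[r, s] * V[s, :]      # row r of UV without the (r, s) term
    num = np.sum(V[s, cols] * (M[r, cols] - pred_wo[cols]))
    den = np.sum(V[s, cols] ** 2)
    return num / den if den > 0 else U[r, s]

def update_v(M, U, V, r, s):
    """Optimal new value y for V[r, s]."""
    rows = ~np.isnan(M[:, s])                      # nonblank entries in column s
    pred_wo = U @ V[:, s] - U[:, r] * V[r, s]      # column s of UV without the (r, s) term
    num = np.sum(U[rows, r] * (M[rows, s] - pred_wo[rows]))
    den = np.sum(U[rows, r] ** 2)
    return num / den if den > 0 else V[r, s]

def uv_decompose(M, d=2, sweeps=50):
    """Visit the elements of U and V row by row and set each to its optimal value."""
    n, m = M.shape
    U, V = np.ones((n, d)), np.ones((d, m))
    for _ in range(sweeps):
        for r in range(n):
            for s in range(d):
                U[r, s] = update_u(M, U, V, r, s)
        for r in range(d):
            for s in range(m):
                V[r, s] = update_v(M, U, V, r, s)
    return U, V

# Example: the 5x5 utility matrix from the previous slides.
# With all-ones starting matrices, the very first update of U[0, 0] gives 2.6, as above.
nan = np.nan
M = np.array([[5, 2, 4, 4, 3], [3, 1, 2, 4, 1], [2, nan, 3, 1, 4],
              [2, 5, 4, 3, 5], [4, 4, 5, 4, nan]])
U, V = uv_decompose(M)
mask = ~np.isnan(M)
print(np.sqrt(np.mean((M - U @ V)[mask] ** 2)))    # RMSE of the fit on the nonblank entries
```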


UV Decomposition

Gradient Descent

This technique for finding the decomposition is an example of gradient descent

We are given some data points: nonblank entries of the utility matrix

For each data point we find the direction of change that most decreases the RMSE

If the utility matrix is too large to visit each nonblank point several times

We might randomly select a fraction of the data

Stochastic gradient descent


UV Decomposition

Overfitting

One problem that may arise

We arrive at a local minimum that fits the given data very well

But it fails to reflect the underlying process that generates the data

In other words, the RMSE is small on the given data, but it does not do well predicting future data

This problem is called overfitting


UV Decomposition

Avoid overfitting

Move the values only a fraction of the way towards their optimized values (in the beginning)

Stop revisiting elements of U and V well before the process has converged

Take several different decompositions, and when predicting, average the results of using each decomposition


Eigenvalues and Eigenvectors

Eigenvalues and eigenvectors

Eigenvalues and eigenvectors

Given a square matrix A ∈ R^{n×n}, we say that λ ∈ C is an eigenvalue of A and x ∈ C^n is the corresponding eigenvector if

$$Ax = \lambda x, \quad x \neq 0$$

There are two important properties of eigenvalues and eigenvectors of symmetric matrices

All eigenvalues are real

The eigenvectors are orthonormal


Eigenvalues and Eigenvectors

Power method

Typically, we would calculate the leading eigenvector and leading eigenvalue iteratively

A standard approach is the power method

We make an initial guess about the eigenvector x_0

Then we iteratively calculate x_t (which converges to the leading eigenvector)

$$x_t = \frac{A x_{t-1}}{\lVert A x_{t-1} \rVert_2}$$


Eigenvalues and Eigenvectors

Power method

In other words, the limiting vector is approximately equal to the leading eigenvector of the matrix

At the end of the iteration the leading (principal) eigenvalue can be calculated as:

$$\lambda_1 = x^T A x$$
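
A minimal sketch of the power iteration as just described (the starting guess and the stopping rule are my own assumptions, not prescribed by the slides):

```python
import numpy as np

def power_iteration(A, iters=1000, tol=1e-10):
    """Leading eigenpair of A via x_t = A x_{t-1} / ||A x_{t-1}||_2."""
    x = np.ones(A.shape[0])          # initial guess x_0
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x_new = A @ x
        x_new /= np.linalg.norm(x_new)
        if np.linalg.norm(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    lam = x @ A @ x                  # leading eigenvalue: lambda_1 = x^T A x
    return lam, x

# Example: the 2x2 matrix that appears in the PCA example later in these slides
A = np.array([[30.0, 28.0], [28.0, 30.0]])
print(power_iteration(A))            # approximately (58.0, [0.7071, 0.7071])
```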


Eigenvalues and Eigenvectors

Power method

To find the second eigenpair we create a new matrix A* = A − λ_1 x x^T

We then again use the power iteration to calculate the leading eigenpair of A*

This leading eigenpair corresponds to the second largest eigenpair of the original matrix A

Intuitively, we have eliminated the influence of a given eigenvector by setting its associated eigenvalue to zero


Eigenvalues and Eigenvectors

Power method

More formally, if A* = A − λ_1 x x^T, where λ_1 is the leading eigenvalue of A and x is the leading eigenvector of A, then

1 x is also an eigenvector of A*, where the corresponding eigenvalue is 0.
2 If v and λ_v are an eigenpair of A other than the principal eigenpair, then they are also an eigenpair of A*.


Eigenvalues and Eigenvectors

Power method

Proof.

We assume that A is a symmetric matrix

1 $A^* x = (A - \lambda_1 x x^T)x = Ax - \lambda_1 x x^T x = Ax - \lambda_1 x = 0 = 0x$ (using $x^T x = 1$)

2 $A^* v = (A^*)^T v = (A - \lambda_1 x x^T)^T v = A^T v - \lambda_1 x x^T v = A^T v = Av = \lambda_v v$ (using $x^T v = 0$, since the eigenvectors of a symmetric matrix are orthogonal)
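
A quick numerical check of these two claims (a small sketch; np.linalg.eigvalsh is used only to inspect the deflated matrix):

```python
import numpy as np

A = np.array([[30.0, 28.0], [28.0, 30.0]])          # symmetric example matrix
lam1, x = 58.0, np.array([1.0, 1.0]) / np.sqrt(2)   # its leading eigenpair

A_star = A - lam1 * np.outer(x, x)                   # deflated matrix A* = A - lambda_1 x x^T

print(A_star @ x)                # ~[0, 0]: x is an eigenvector of A* with eigenvalue 0
print(np.linalg.eigvalsh(A_star))  # ~[0, 2]: the remaining eigenvalue 2 of A is preserved
```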


Principal-Component Analysis

PCA

Principal-component analysis or PCA is a technique for transforming points from a high-dimensional space by finding the directions along which the points line up best

The idea is to treat the data as a matrix X and find the eigenvectors of the matrix proportional to the covariance matrix, XX^T or X^T X

The matrix of these eigenvectors may be thought of as a rigid rotation in a high-dimensional space

The axis corresponding to the principal eigenvector is the one with the maximal variance

It carries most of the signal


Principal-Component Analysis

PCA

The axis corresponding to the second eigenvector is the axis along which the variance of distances from the first axis is greatest, and so on

Thus, we can replace the original high-dimensional data by its projection onto the most important axes

These axes are the ones corresponding to the largest eigenvalues

Thus, the original data is approximated by data with fewer dimensions

The new data summarizes the original data well


Principal-Component Analysis

Maximizing the variance

We can specify an axis by a unit vector w lying on that axis

The projection of another (centered) vector x onto the axis specified by w is given by the inner product of those two vectors:

x^T w

A centered vector is a vector from which the average has been subtracted

If we combine all (centered) data vectors into a matrix X then the projection of the matrix onto the axis specified by w is given by:

Xw


Principal-Component Analysis

Maximizing the variance

The variance of a single row from the matrix is given by:

(x^T w)²

The variance of the complete projection is then given by:

$$\sigma^2 = \frac{1}{m} \sum_{i} (x_i^T w)^2$$


Principal-Component Analysis

Maximizing the variance

In matrix form the variance is given by:

$$\sigma^2 = \frac{1}{m} (Xw)^T (Xw) = \frac{1}{m} w^T X^T X w = w^T \frac{X^T X}{m} w = w^T V w$$

Now, we want to choose a unit vector w that maximizes σ²

It must be a unit vector, thus the constraint w^T w = 1 must be satisfied
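
A small numerical sanity check of the identity above: the variance of the projected, centered data equals w^T V w for any unit vector w (the random data and the vector w are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)            # center each feature

w = rng.normal(size=3)
w /= np.linalg.norm(w)            # unit vector

V = X.T @ X / X.shape[0]          # covariance matrix (1/m) X^T X
proj = X @ w                      # projections x_i^T w

print(np.mean(proj ** 2))         # sigma^2 computed directly
print(w @ V @ w)                  # w^T V w -- the same number
```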


Principal-Component Analysis

Constrained optimization: Lagrange multipliers

Original objective function that we want to maximize: w^T V w

This function is subject to a constraint: constrained optimization

Typically solved by the method of Lagrange multipliers

Objective function: f(w) = w^T V w

Subject to: w^T w = 1


Principal-Component Analysis

Lagrange multipliers

For each constraint we need one Lagrange multiplier, e.g. λ

The Lagrange formulation of the optimization problem will be a new objective function that is a function of w and λ

$$L(w, \lambda) = w^T V w - \lambda (w^T w - 1)$$


Principal-Component Analysis

Constrained optimization

To optimize L we find w and λ that make its gradient 0

∇L = 0:

$$\frac{\partial L}{\partial w} = 0, \qquad \frac{\partial L}{\partial \lambda} = 0$$


Principal-Component Analysis

Constrained optimization

∂L/∂λ = 0 gives back the constraint

$$\frac{\partial L}{\partial w} = 2Vw - 2\lambda w = 0$$

$$Vw = \lambda w$$


Principal-Component Analysis

Constrained optimization

Thus, the desired vector w is an eigenvector of the covariance matrix V

The maximizing vector will be the one associated with the largest eigenvalue λ

V is a covariance matrix, thus it will be symmetric

The eigenvectors are orthogonal and can be found by the power method

They are called principal components


PCA Example

PCA example

We can view PCA as a data-mining technique. The high-dimensional data can be replaced by its projection onto the most important axes. These axes are the ones corresponding to the largest eigenvalues. Thus, the original data is approximated by data with many fewer dimensions, which summarizes well the original data.

11.2.1 An Illustrative Example

We shall start the exposition with a contrived and simple example. In this example, the data is two-dimensional, a number of dimensions that is too small to make PCA really useful. Moreover, the data, shown in Fig. 11.1, has only four points, and they are arranged in a simple pattern along the 45-degree line to make our calculations easy to follow. That is, to anticipate the result, the points can best be viewed as lying along the axis that is at a 45-degree angle, with small deviations in the perpendicular direction.

[Figure 11.1: Four points in a two-dimensional space. The points are (1,2), (2,1), (3,4), and (4,3).]

To begin, let us represent the points by a matrix M with four rows, one for each point, and two columns, corresponding to the x-axis and y-axis. This matrix is

$$M = \begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{pmatrix}$$

Compute M^T M, which is

$$M^T M = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 4 & 3 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{pmatrix} = \begin{pmatrix} 30 & 28 \\ 28 & 30 \end{pmatrix}$$

We may find the eigenvalues of the matrix above by solving the equation

$$(30 - \lambda)(30 - \lambda) - 28 \times 28 = 0$$


PCA Example

PCA example

In this example the data is two-dimensional and we want to reduce it to a single dimension

The data has only four points, and they are arranged in a simple pattern along the 45-degree line

To anticipate the result: the points are lying on this line

Small deviations in the orthogonal directions

We would expect the 45-degree line to maximize the variance


PCA Example

PCA example

Let us represent the data in a matrix form:

$$X = \begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{pmatrix}$$


PCA Example

PCA example

We compute X^T X:

$$X^T X = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 4 & 3 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{pmatrix} = \begin{pmatrix} 30 & 28 \\ 28 & 30 \end{pmatrix}$$


PCA Example

PCA: example

$$A = \begin{pmatrix} 30 & 28 \\ 28 & 30 \end{pmatrix}$$

$$\det(\lambda I - A) = \det \begin{pmatrix} \lambda - 30 & -28 \\ -28 & \lambda - 30 \end{pmatrix} = (\lambda - 30)(\lambda - 30) - 784 = \lambda^2 - 60\lambda + 900 - 784 = \lambda^2 - 60\lambda + 116 = (\lambda - 58)(\lambda - 2)$$

Thus, λ_1 = 58 and λ_2 = 2 are the eigenvalues of A

We now solve (λ_i I − A)x = 0 for each eigenvalue to find the corresponding eigenvectors
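
These values are easy to confirm with NumPy (a quick check; np.linalg.eigh returns the eigenvalues of a symmetric matrix in ascending order):

```python
import numpy as np

A = np.array([[30.0, 28.0], [28.0, 30.0]])
eigvals, eigvecs = np.linalg.eigh(A)   # eigh: eigendecomposition for symmetric matrices

print(eigvals)    # [ 2. 58.]
print(eigvecs)    # columns are the eigenvectors (up to sign): [-0.7071, 0.7071] and [0.7071, 0.7071]
```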


PCA Example

PCA: example

For λ_1 = 58:

$$\begin{pmatrix} 28 & -28 \\ -28 & 28 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

$$28 x_1 - 28 x_2 = 0$$

$$-28 x_1 + 28 x_2 = 0$$

Thus, x_1 = x_2, and we might pick

$$x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$$

And we normalize it to:

$$\begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$$


PCA Example

PCA: example

For λ_2 = 2:

$$\begin{pmatrix} -28 & -28 \\ -28 & -28 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

$$-28 x_1 - 28 x_2 = 0$$

(both rows give the same equation)

Thus, x_1 = -x_2, and we might pick

$$x = \begin{pmatrix} -1 \\ 1 \end{pmatrix}$$

And we normalize it to:

$$\begin{pmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$$


PCA Example

PCA: example

Now let us construct E, which is the (orthogonal) matrix of eigenvectors for the matrix X^T X

$$E = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$$

Any orthogonal matrix represents a rotation of the axes of a Euclidean space

In the example: rotation 45 degrees counterclockwise


PCA Example

PCA: example

Now let us construct E, which is the (orthogonal) matrix of eigenvectors for the matrix X^T X

$$XE = \begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} = \begin{pmatrix} 3/\sqrt{2} & 1/\sqrt{2} \\ 3/\sqrt{2} & -1/\sqrt{2} \\ 7/\sqrt{2} & 1/\sqrt{2} \\ 7/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$$


PCA Example

PCA example

[Figure 11.2: Figure 11.1 with the axes rotated 45 degrees counterclockwise. The projections of the points onto the new x-axis are (1.5,1.5) and (3.5,3.5).]

For example, the first point, [1, 2], has been transformed into the point [3/√2, 1/√2].

If we examine Fig. 11.2, with the dashed line representing the new x-axis, we see that the projection of the first point onto that axis places it at distance 3/√2 from the origin. To check that fact, notice that the point of projection for both the first and second points is [1.5, 1.5] in the original coordinate system, and the distance from the origin to this point is

$$\sqrt{(1.5)^2 + (1.5)^2} = \sqrt{9/2} = 3/\sqrt{2}$$

Moreover, the new y-axis is, of course, perpendicular to the dashed line. The first point is at distance 1/√2 above the new x-axis in the direction of the y-axis. That is, the distance between the points [1, 2] and [1.5, 1.5] is

$$\sqrt{(1 - 1.5)^2 + (2 - 1.5)^2} = \sqrt{(-1/2)^2 + (1/2)^2} = \sqrt{1/2} = 1/\sqrt{2}$$

Figure 11.3 shows the four points in the rotated coordinate system.

[Figure 11.3: The points of Fig. 11.1 in the new coordinate system: (3/√2, 1/√2), (3/√2, −1/√2), (7/√2, 1/√2), and (7/√2, −1/√2).]

The second point, [2, 1], happens by coincidence to project onto the same point of the new x-axis. It is 1/√2 below that axis along the new y-axis, as is ...


PCA Example

PCA: example

For example, the point [1, 2] has been transformed into the point [3/√2, 1/√2]

The point of projection for both the first and the second points is [1.5, 1.5] or [3/2, 3/2]

Then the distance from the origin in the new coordinate space is:

$$\sqrt{(3/2)^2 + (3/2)^2} = \sqrt{9/2} = 3/\sqrt{2}$$


PCA Example

PCA: example

From the example we also see the general principle

The matrix XE contains the transformed points

Each column represents an axis in the new space

The variance along the axes decays with each new axis, thus each new axis is less significant than the previous one

Since the axes are orthogonal, the values along the axes are linearly uncorrelated

We might drop less significant axes


PCA Example

PCA: example

Thus, we reduce dimensions

It can also be seen as a kind of data compression

We remove (reduce) the values where the information content is small

You can relate PCA to information theory

It is possible to show that if the data is Gaussian then PCA is also optimal from the information-theoretic point of view, i.e. the most significant axes have the maximal information content


PCA Example

PCA: example

$$XE = \begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} = \begin{pmatrix} 3/\sqrt{2} & 1/\sqrt{2} \\ 3/\sqrt{2} & -1/\sqrt{2} \\ 7/\sqrt{2} & 1/\sqrt{2} \\ 7/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$$


PCA Example

PCA: example

$$\begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix} = \begin{pmatrix} 3/\sqrt{2} \\ 3/\sqrt{2} \\ 7/\sqrt{2} \\ 7/\sqrt{2} \end{pmatrix}$$


PCA Example

PCA: Algorithm

Organize data as an m × n matrix, with m entities and n features

Subtract the average for each feature to obtain the centered data matrix X

Calculate the covariance matrix (1/m) X^T X

Calculate the eigenvalues and the eigenvectors of the covariance matrix

Select the top r eigenvectors

Project the data to the new space spanned by those r eigenvectors: XE ∈ R^{m×r}, where E ∈ R^{n×r}
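
The steps above written out as a small NumPy sketch (the function name and return values are my own choices; note that this algorithm centers the data first, so the numbers for the four-point example differ from the uncentered worked example above):

```python
import numpy as np

def pca(data, r):
    """Project an (m x n) data matrix onto its top r principal components."""
    X = data - data.mean(axis=0)            # center each feature
    m = X.shape[0]
    cov = X.T @ X / m                       # covariance matrix (1/m) X^T X
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues of a symmetric matrix
    order = np.argsort(eigvals)[::-1]       # sort eigenpairs by decreasing eigenvalue
    E = eigvecs[:, order[:r]]               # top r eigenvectors (n x r)
    return X @ E, E                         # projected data (m x r) and the components

# The four-point example: one component captures almost all of the variance
data = np.array([[1, 2], [2, 1], [3, 4], [4, 3]], dtype=float)
projected, components = pca(data, r=1)
print(projected.ravel())   # approximately [-1.41, -1.41, 1.41, 1.41] (sign depends on eigenvector orientation)
```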


PCA Example

PCA example

[Scatter plot of an example two-dimensional dataset; both axes range from -20 to 20.]


PCA Example

PCA example

[A second scatter plot of the example dataset; both axes range from -20 to 20.]


PCA Example

PCA example

IPython Notebook examples

http://kti.tugraz.at/staff/denis/courses/kddm1/pca.ipynb

Command Line

ipython notebook --pylab=inline pca.ipynb


PCA Example

PCA: Limitations

PCA transforms a set of correlated observations into a set of linearly uncorrelated observations

I.e. the goal of the analysis is to decorrelate the data

In other words, the goal is to remove second-order dependencies in the data

However, if higher-order dependencies exist in the data, removing only the second-order dependencies will not completely decorrelate the data

First workaround: apply a nonlinear (kernel) transformation first

Second workaround: require the data to be statistically independent rather than linearly independent along the dimensions
