TRANSCRIPT
Knowledge Discovery and Data Mining 1 (VO) (707.003): Dimensionality Reduction
Denis Helic
KTI, TU Graz
Dec 12, 2013
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 1 / 82
Big picture: KDDM
[Diagram: the knowledge discovery process (preprocessing, transformation, data mining) resting on mathematical tools (linear algebra, probability theory, information theory, statistical inference) and infrastructure (Map-Reduce).]
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 2 / 82
Outline
1 Introduction
2 UV Decomposition
3 Eigenvalues and Eigenvectors
4 Principal-Component Analysis
5 PCA Example
Slides
Slides are partially based on “Mining Massive Datasets” Chapters 9 and 11
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 3 / 82
Introduction
Representing data as matrices
There are many sources of data that can be represented as large matrices
Vector Space Model: documents are represented as term vectors
The complete document collection is represented as a large document-term matrix
In recommender systems, we represent users' ratings of items as a utility matrix
We also use matrices to represent social networks, etc.
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 4 / 82
Introduction
Representing data as matrices
These matrices are huge and have hundreds of thousands, even millions of rows and columns
E.g. the document-term matrix of Wikipedia: 10 million rows and 100,000 columns
Utility matrix at Amazon
The number of customers times the number of articles
The curse of dimensionality
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 5 / 82
Introduction
Representing data as matrices
In many cases we can summarize these matrices by finding narrower matrices that are in some sense close to the original
These narrow matrices have a small number of rows and columns
We can use them more efficiently than the original matrices
The process of finding those narrow matrices is called dimensionality reduction
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 6 / 82
UV Decomposition
Recommender systems
Recommender systems predict user responses to options
E.g. recommend news articles based on prediction of user interests
E.g. recommend products based on predictions of what a user might like
Content-based systems make predictions based on the content of the items
Collaborative filtering systems make predictions based on the similarity between users/items
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 7 / 82
UV Decomposition
The Utility Matrix
In a recommender system there are two classes of entities: users and items
Users have preferences for certain items and we have to mine for these preferences
The data is represented by a utility matrix M ∈ R^(n×m), where n is the number of users and m the number of items
For each user-item pair, the matrix gives a value representing what is known about the preference of that user for that item
E.g. the values can come from an ordered set (1 to 5) and represent a rating that a user gave for an item
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 8 / 82
UV Decomposition
The Utility Matrix
We assume that the matrix is sparse
This means that most entries are unknown
The majority of the user preferences for specific items is unknown
An unknown rating means that we do not have explicit information
It does not mean that the rating is low
The goal of a recommender system is to predict the blank entries in the utility matrix
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 9 / 82
UV Decomposition
The Utility Matrix: example
Users × Movies (HP1, HP2, HP3, Hobbit, SW1, SW2, SW3); blank entries are unknown ratings
User A's known ratings: 4, 5, 1
User B's known ratings: 5, 5, 4
User C's known ratings: 2, 4, 5
User D's known ratings: 3, 3
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 10 / 82
UV Decomposition
Item profiles
The utility matrix itself does not offer a lot of evidence
Typically in practice, the utility matrix is a very sparse matrix
Also, we might think about the utility matrix as the final result of a rating process
For example, items have some (general) characteristics that users like/dislike, and because of these characteristics they rate them in one way or another
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 11 / 82
UV Decomposition
Item profiles: example
Movies have certain features which describe important characteristics of every movie
The set of actors
The director
Year
The genre
Technical characteristics
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 12 / 82
UV Decomposition
User profiles
The same set of features might be used to represent the preferences of the users
We might represent the preferences as feature weights
E.g. a feature which the user prefers gets a higher weight
The final rating of a user for an item might then be the weighted sum of the features from the item profile
This can be used to predict the missing values in the utility matrix
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 13 / 82
UV Decomposition
Representing profiles
Let us represent item and user profiles as vectors with features as dimensions
Suppose we have the following features: Julia Roberts, Edward Norton, Martin Scorsese, Ridley Scott, Western, Drama, Thriller
Let us denote the features with a feature vector x, where each element (dimension) corresponds to a feature
x_1 = Julia Roberts, x_2 = Edward Norton, ..., x_7 = Thriller
Now we might represent the users and items with vectors u ∈ R^d and v ∈ R^d with corresponding values for each feature
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 14 / 82
UV Decomposition
Representing profiles: example
E.g. we might represent a movie starring Julia Roberts, directed by Martin Scorsese, and with a mixture of drama and thriller elements as a vector:

v_0 = (1, 0, 1, 0, 0, 0.5, 0.5)^T
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 15 / 82
UV Decomposition
Representing profiles: example
Another thriller movie starring Edward Norton and directed by Ridley Scott:

v_1 = (0, 1, 0, 1, 0, 0, 1)^T
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 16 / 82
UV Decomposition
Representing profiles: example
Let us represent the users as vectors u with weights for the features representing the user preferences
E.g. a user that prefers thrillers and Edward Norton might be represented as the following vector:

u_0 = (0, 2, 0, 0, 0, 0, 2)^T
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 17 / 82
UV Decomposition
Representing profiles: example
Note that with the weights we might also express that a user dislikes a particular feature:

u_1 = (−2, 0, 3, 0, 0, 2, 2)^T
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 18 / 82
UV Decomposition
Representing profiles: example
Now, we can calculate the rating that a user gives for a certain movie by computing u^T v

u_0^T v_0 = (0, 2, 0, 0, 0, 0, 2) · (1, 0, 1, 0, 0, 0.5, 0.5)^T = 1
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 19 / 82
UV Decomposition
Representing profiles: example
u_1^T v_0 = (−2, 0, 3, 0, 0, 2, 2) · (1, 0, 1, 0, 0, 0.5, 0.5)^T = 3
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 20 / 82
UV Decomposition
Representing profiles: example
u_0^T v_1 = (0, 2, 0, 0, 0, 0, 2) · (0, 1, 0, 1, 0, 0, 1)^T = 4
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 21 / 82
UV Decomposition
Representing profiles: example
u_1^T v_1 = (−2, 0, 3, 0, 0, 2, 2) · (0, 1, 0, 1, 0, 0, 1)^T = 2
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 22 / 82
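The four ratings above can be reproduced in a few lines of Python. This is a small illustrative sketch; the variable names u0, u1, v0, v1 are simply the vectors from the preceding slides:

import numpy as np

# Feature order: Julia Roberts, Edward Norton, Scorsese, Scott, Western, Drama, Thriller
v0 = np.array([1, 0, 1, 0, 0, 0.5, 0.5])            # Roberts/Scorsese movie, half drama, half thriller
v1 = np.array([0, 1, 0, 1, 0, 0, 1], dtype=float)   # Norton/Scott thriller
u0 = np.array([0, 2, 0, 0, 0, 0, 2], dtype=float)   # likes Edward Norton and thrillers
u1 = np.array([-2, 0, 3, 0, 0, 2, 2], dtype=float)  # dislikes Roberts, likes Scorsese, drama, thrillers

print(u0 @ v0, u1 @ v0, u0 @ v1, u1 @ v1)           # 1.0 3.0 4.0 2.0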
UV Decomposition
Representing users: The U Matrix
We can now group all user vectors u ∈ R^d into a matrix U ∈ R^(n×d)
n is the number of users, and d is the number of features

U = | u_1^T |
    | u_2^T |
    |  ...  |
    | u_n^T |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 23 / 82
UV Decomposition
Representing items: The V Matrix
We can now group all item vectors v ∈ R^d into a matrix V ∈ R^(d×m)
m is the number of items, and d is the number of features

V = ( v_1  v_2  ...  v_m )
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 24 / 82
UV Decomposition
The product: UV
Now, the utility matrix M is given by:
M = UV
We will decompose M to obtain U and V
We will reduce the dimensions of M
We can also use U and V to predict missing values in M
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 25 / 82
UV Decomposition
UV Decomposition
We start with M ∈ R^(n×m) and want to find U ∈ R^(n×d) and V ∈ R^(d×m)
UV closely approximates M
If we are able to find this decomposition, then we have established that there are d dimensions that allow us to characterize both users and items closely
This process is called UV decomposition
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 26 / 82
UV Decomposition
UV Decomposition: example
| 5 2 4 4 3 |   | u11 u12 |
| 3 1 2 4 1 |   | u21 u22 |
| 2 . 3 1 4 | = | u31 u32 | × | v11 v12 v13 v14 v15 |
| 2 5 4 3 5 |   | u41 u42 |   | v21 v22 v23 v24 v25 |
| 4 4 5 4 . |   | u51 u52 |

(blank entries of the utility matrix are marked with a dot)
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 27 / 82
UV Decomposition
Root-Mean-Square-Error
We approximate M → we need to measure the approximation error
We can pick among several measures for this error
A typical choice is the root-mean-square error (RMSE):
1 Sum, over all nonblank entries in M, the square of the difference between that entry and the corresponding entry in the product UV
2 Take the average of these squares by dividing by the number of terms in the sum (i.e. the number of nonblank entries in M)
3 Take the square root of the mean
Minimizing the sum of the squares is equivalent to minimizing the square root of the average square, thus we can omit the last two steps
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 28 / 82
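As an illustration, the RMSE over the nonblank entries can be computed as in the following minimal sketch. It assumes unknown entries are encoded as NaN; M is the example utility matrix from the previous slides:

import numpy as np

def rmse(M, U, V):
    """RMSE between the nonblank entries of M and the corresponding entries of U V."""
    P = U @ V                    # current approximation of the utility matrix
    known = ~np.isnan(M)         # nonblank entries only
    diff = M[known] - P[known]
    return np.sqrt(np.mean(diff ** 2))

M = np.array([[5, 2,      4, 4, 3],
              [3, 1,      2, 4, 1],
              [2, np.nan, 3, 1, 4],
              [2, 5,      4, 3, 5],
              [4, 4,      5, 4, np.nan]])
U = np.ones((5, 2))
V = np.ones((2, 5))
print(rmse(M, U, V))             # sqrt(75 / 23) ~ 1.81; 75 is the sum of squares computed on the following slides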
UV Decomposition
Root-Mean-Square-Error: example
Suppose we start with U and V with all ones:

| 1 1 |                   | 2 2 2 2 2 |
| 1 1 |   | 1 1 1 1 1 |   | 2 2 2 2 2 |
| 1 1 | × | 1 1 1 1 1 | = | 2 2 2 2 2 |
| 1 1 |                   | 2 2 2 2 2 |
| 1 1 |                   | 2 2 2 2 2 |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 29 / 82
UV Decomposition
Root-Mean-Square-Error: example
| 5 2 4 4 3 |   | 2 2 2 2 2 |   | 3  0  2  2  1 |
| 3 1 2 4 1 |   | 2 2 2 2 2 |   | 1 −1  0  2 −1 |
| 2 . 3 1 4 | − | 2 2 2 2 2 | = | 0  .  1 −1  2 |
| 2 5 4 3 5 |   | 2 2 2 2 2 |   | 0  3  2  1  3 |
| 4 4 5 4 . |   | 2 2 2 2 2 |   | 2  2  3  2  . |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 30 / 82
UV Decomposition
Root-Mean-Square-Error: example
Sum of squares:
Row 1: 18, Row 2: 7, Row 3: 6, Row 4: 23, Row 5: 21
Total sum: 75
We can already stop at this point (minimizing the total sum of squares is equivalent to minimizing the RMSE)
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 31 / 82
UV Decomposition
Incremental computation
Finding the decomposition with the least RMSE involves starting with some arbitrarily chosen U and V and iteratively adapting the matrices to make the RMSE smaller
We consider only adjustments to a single element of U or V
In principle we could also make more complex adjustments
In a typical example we will encounter many local minima
At a local minimum, no allowable adjustment to U or V will make the RMSE smaller
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 32 / 82
UV Decomposition
Incremental computation
Only one of these local minima will be the global minimum
That is the one with the least possible RMSE
To increase the chances of finding the global minimum we may start the iteration many times with different starting points
However, there is no guarantee that we will find the global minimum
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 33 / 82
UV Decomposition
Incremental computation: example
Suppose we start with U and V with all ones and make a single adjustment (u_11):

| x 1 |                   | x+1 x+1 x+1 x+1 x+1 |
| 1 1 |   | 1 1 1 1 1 |   |  2   2   2   2   2  |
| 1 1 | × | 1 1 1 1 1 | = |  2   2   2   2   2  |
| 1 1 |                   |  2   2   2   2   2  |
| 1 1 |                   |  2   2   2   2   2  |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 34 / 82
UV Decomposition
Incremental computation: example
Sum of squares for Row 1:

(5 − (x+1))² + (2 − (x+1))² + (4 − (x+1))² + (4 − (x+1))² + (3 − (x+1))²

This simplifies to: (4 − x)² + (1 − x)² + (3 − x)² + (3 − x)² + (2 − x)²
We are looking for the x that minimizes this sum s:

ds/dx = 0
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 35 / 82
UV Decomposition
Incremental computation: example
ds/dx = −2((4 − x) + (1 − x) + (3 − x) + (3 − x) + (2 − x)) = 0

This gives x = 2.6

| 2.6 1 |                   | 3.6 3.6 3.6 3.6 3.6 |
|  1  1 |   | 1 1 1 1 1 |   |  2   2   2   2   2  |
|  1  1 | × | 1 1 1 1 1 | = |  2   2   2   2   2  |
|  1  1 |                   |  2   2   2   2   2  |
|  1  1 |                   |  2   2   2   2   2  |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 36 / 82
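A quick numeric check of the x = 2.6 result (illustrative only; the helper row_error is not part of the lecture material, it just evaluates the row-1 sum of squares from the previous slide):

import numpy as np

m_row = np.array([5, 2, 4, 4, 3])                 # first row of M
def row_error(x):
    # with u_11 = x and all other entries 1, the first row of UV is (x+1, ..., x+1)
    return np.sum((m_row - (x + 1)) ** 2)

xs = np.linspace(0, 5, 501)                       # grid with step 0.01
print(xs[np.argmin([row_error(x) for x in xs])])  # 2.6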
UV Decomposition
Incremental computation: example
Now we would again make a single adjustment (v_11) and repeat the process:

| 2.6 1 |                   | 2.6y+1 3.6 3.6 3.6 3.6 |
|  1  1 |   | y 1 1 1 1 |   |  y+1    2   2   2   2  |
|  1  1 | × | 1 1 1 1 1 | = |  y+1    2   2   2   2  |
|  1  1 |                   |  y+1    2   2   2   2  |
|  1  1 |                   |  y+1    2   2   2   2  |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 37 / 82
UV Decomposition
Optimizing an arbitrary element
What does the general formula look like?
We denote with P = UV the current product of the matrices U and V
Suppose we want to vary u_rs and find the value of this element that minimizes the RMSE
Note that u_rs only affects the elements in the r-th row of P:

p_rj = Σ_{k=1..d} u_rk v_kj = Σ_{k≠s} u_rk v_kj + x v_sj

We sum over all nonblank values m_rj
We replaced u_rs with x
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 38 / 82
UV Decomposition
Optimizing an arbitrary element
If m_rj is a nonblank element, then the contribution of this element to the RMSE is given by:

(m_rj − p_rj)² = (m_rj − Σ_{k≠s} u_rk v_kj − x v_sj)²

Now, we can sum over all squared errors on the nonblank entries of M:

Σ_j (m_rj − Σ_{k≠s} u_rk v_kj − x v_sj)²
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 39 / 82
UV Decomposition
Optimizing an arbitrary element
We take the derivative with respect to x and set it equal to 0:

Σ_j −2 v_sj (m_rj − Σ_{k≠s} u_rk v_kj − x v_sj) = 0

We then solve for x:

x = Σ_j v_sj (m_rj − Σ_{k≠s} u_rk v_kj) / Σ_j v_sj²
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 40 / 82
UV Decomposition
Optimizing an arbitrary element
Similarly, we can derive a formula for the element y when we vary v_rs:

y = Σ_i u_ir (m_is − Σ_{k≠r} u_ik v_ks) / Σ_i u_ir²
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 41 / 82
UV Decomposition
The complete algorithm
Preprocessing: adjust scales, e.g. by subtracting the average in rows and then in columns
Initialization: many different initializations are possible, e.g. choose the elements so that the product UV matches the average of the elements in the utility matrix
Optimization: e.g. we always change a single element and pick an order of change (row-by-row, etc.); a code sketch of this procedure follows below
Convergence: when the improvements in RMSE fall below a threshold we may stop
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 42 / 82
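The following is a compact sketch of the algorithm described above, using the single-element update formulas for x and y from the previous slides. It is illustrative only, not the exact implementation from the lecture; the function name uv_decompose and the encoding of blank entries as NaN are assumptions:

import numpy as np

def uv_decompose(M, d=2, sweeps=50):
    """Element-wise UV decomposition of M (blank entries encoded as np.nan)."""
    n, m = M.shape
    known = ~np.isnan(M)
    avg = np.nanmean(M)
    # Initialization: pick entries so that the product UV starts near the average rating
    U = np.full((n, d), np.sqrt(avg / d))
    V = np.full((d, m), np.sqrt(avg / d))
    for _ in range(sweeps):
        # Optimize every element u_rs (the formula for x)
        for r in range(n):
            for s in range(d):
                j = known[r, :]
                rest = U[r, :] @ V[:, j] - U[r, s] * V[s, j]   # sum over k != s
                denom = np.sum(V[s, j] ** 2)
                if denom > 0:
                    U[r, s] = np.sum(V[s, j] * (M[r, j] - rest)) / denom
        # Optimize every element v_rs (the formula for y)
        for r in range(d):
            for s in range(m):
                i = known[:, s]
                rest = U[i, :] @ V[:, s] - U[i, r] * V[r, s]
                denom = np.sum(U[i, r] ** 2)
                if denom > 0:
                    V[r, s] = np.sum(U[i, r] * (M[i, s] - rest)) / denom
    return U, V

A fuller implementation would also track the RMSE after each sweep and stop once the improvement falls below a threshold, as described on the slide above.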
UV Decomposition
Gradient Descent
This technique for finding the decomposition is an example of gradient descent
We are given some data points: the nonblank entries of the utility matrix
For each data point we find the direction of change that most decreases the RMSE
If the utility matrix is too large to visit each nonblank entry several times, we might randomly select a fraction of the data
This is called stochastic gradient descent
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 43 / 82
UV Decomposition
Overfitting
One problem that may arise:
We arrive at a local minimum that fits the given data very well
But it fails to reflect the underlying process that generates the data
In other words, the RMSE is small on the given data, but the model does not do well predicting future data
This problem is called overfitting
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 44 / 82
UV Decomposition
Avoid overfitting
Move each value only a fraction of the way towards its optimized value (in the beginning)
Stop revisiting elements of U and V well before the process has converged
Take several different decompositions and, when predicting, average the results of the individual decompositions
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 45 / 82
Eigenvalues and Eigenvectors
Eigenvalues and eigenvectors
Eigenvalues and eigenvectors
Given a square matrix A ∈ R^(n×n), we say that λ ∈ C is an eigenvalue of A and x ∈ C^n is the corresponding eigenvector if

Ax = λx,  x ≠ 0

There are two important properties of eigenvalues and eigenvectors of symmetric matrices
All eigenvalues are real
The eigenvectors are orthonormal
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 46 / 82
Eigenvalues and Eigenvectors
Power method
Typically, we would calculate the leading eigenvector and leading eigenvalue iteratively
A standard approach is the power method
We make an initial guess about the eigenvector, x_0
Then we iteratively calculate x_t (which converges to the leading eigenvector)

x_t = A x_{t−1} / ||A x_{t−1}||_2
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 47 / 82
Eigenvalues and Eigenvectors
Power method
In other words, the limiting vector is approximately equal to the leading eigenvector of the matrix
At the end of the iteration the leading (principal) eigenvalue can be calculated as:

λ_1 = x^T A x
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 48 / 82
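A minimal sketch of the power method in code (illustrative; the fixed iteration count and seeded random initial guess are arbitrary choices, and the test matrix is the one that appears later in the PCA example):

import numpy as np

def power_iteration(A, iters=1000, seed=0):
    """Approximate the leading eigenpair of a symmetric matrix A."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=A.shape[0])
    x /= np.linalg.norm(x)                  # initial guess x_0
    for _ in range(iters):
        y = A @ x
        x = y / np.linalg.norm(y)           # x_t = A x_{t-1} / ||A x_{t-1}||_2
    lam = x @ A @ x                         # leading eigenvalue: lambda_1 = x^T A x
    return lam, x

A = np.array([[30.0, 28.0], [28.0, 30.0]])
lam1, x1 = power_iteration(A)
print(lam1, x1)                             # ~58 and ~(1/sqrt(2), 1/sqrt(2)), up to sign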
Eigenvalues and Eigenvectors
Power method
To find the second eigenpair we create a new matrix A* = A − λ_1 x x^T
We then again use the power iteration to calculate the leading eigenpair of A*
This leading eigenpair corresponds to the second largest eigenpair of the original matrix A
Intuitively, we have eliminated the influence of a given eigenvector by setting its associated eigenvalue to zero
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 49 / 82
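Continuing the sketch above, the deflation step and the second eigenpair can be obtained as follows (again illustrative, reusing A, lam1, x1 from the previous block):

A_star = A - lam1 * np.outer(x1, x1)        # A* = A - lambda_1 x x^T
lam2, x2 = power_iteration(A_star)          # leading eigenpair of A* = second eigenpair of A
print(lam2, x2)                             # ~2 and ~(-1/sqrt(2), 1/sqrt(2)), up to sign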
Eigenvalues and Eigenvectors
Power method
More formally, if A* = A − λ_1 x x^T, where λ_1 is the leading eigenvalue of A and x is the leading eigenvector of A, then
1 x is also an eigenvector of A*, where the corresponding eigenvalue is 0
2 If v and λ_v are an eigenpair of A other than the principal eigenpair, then they are also an eigenpair of A*
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 50 / 82
Eigenvalues and Eigenvectors
Power method
Proof.
We assume that A is a symmetric matrix (its eigenvectors are orthonormal, so x^T x = 1 and x^T v = 0)
1 A*x = (A − λ_1 x x^T) x = Ax − λ_1 x x^T x = Ax − λ_1 x = 0 = 0x
2 A*v = (A*)^T v = (A − λ_1 x x^T)^T v = A^T v − λ_1 x x^T v = A^T v = Av = λ_v v
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 51 / 82
Principal-Component Analysis
PCA
Principal-component analysis or PCA is a technique for transforming points from a high-dimensional space by finding the directions along which the points line up best
The idea is to treat the data as a matrix X and find the eigenvectors of the matrix proportional to the covariance matrix, XX^T or X^T X
The matrix of these eigenvectors may be thought of as a rigid rotation in a high-dimensional space
The axis corresponding to the principal eigenvector is the one with the maximal variance
It carries most of the signal
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 52 / 82
Principal-Component Analysis
PCA
The axis corresponding to the second eigenvector is the axis along which the variance of distances from the first axis is greatest, and so on
Thus, we can replace the original high-dimensional data by its projection onto the most important axes
These axes are the ones corresponding to the largest eigenvalues
Thus, the original data is approximated by data with fewer dimensions
The new data summarizes the original data well
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 53 / 82
Principal-Component Analysis
Maximizing the variance
We can specify an axis by a unit vector w lying on that axis
The projection of another (centered) vector x onto the axis specified by w is given by the inner product of those two vectors:

x^T w

A centered vector is one from which the average has been subtracted
If we combine all (centered) data vectors into a matrix X, then the projection of the matrix onto the axis specified by w is given by:

Xw
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 54 / 82
Principal-Component Analysis
Maximizing the variance
The variance of a single row from the matrix is given by:

(x_i^T w)²

The variance of the complete projection is then given by:

σ² = (1/m) Σ_i (x_i^T w)²
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 55 / 82
Principal-Component Analysis
Maximizing the variance
In matrix form the variance is given by:

σ² = (1/m) (Xw)^T (Xw) = (1/m) w^T X^T X w = w^T (X^T X / m) w = w^T V w

Now, we want to choose a unit vector w that maximizes σ²
It must be a unit vector, thus the constraint w^T w = 1 must be satisfied
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 56 / 82
Principal-Component Analysis
Constrained optimization: Lagrange multipliers
Original objective function that we want to maximize: w^T V w
This function is subject to a constraint: constrained optimization
Typically solved by the method of Lagrange multipliers
Objective function: f(w) = w^T V w
Subject to: w^T w = 1
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 57 / 82
Principal-Component Analysis
Lagrange multipliers
For each constraint we need one Lagrange multiplier, e.g. λ
The Lagrange formulation of the optimization problem will be a new objective function that is a function of w and λ

L(w, λ) = w^T V w − λ(w^T w − 1)
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 58 / 82
Principal-Component Analysis
Constrained optimization
To optimize L we find w and λ that make its gradient 0

∇L = 0:

∂L/∂w = 0
∂L/∂λ = 0
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 59 / 82
Principal-Component Analysis
Constrained optimization
∂L/∂λ = 0 gives back the constraint

∂L/∂w = 2Vw − 2λw = 0

Vw = λw
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 60 / 82
Principal-Component Analysis
Constrained optimization
Thus, the desired vector w is an eigenvector of the covariance matrix V
The maximizing vector will be the one associated with the largest eigenvalue λ
V is a covariance matrix, thus it will be symmetric
The eigenvectors are orthogonal and can be found by the power method
They are called principal components
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 61 / 82
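A quick numeric sanity check of this derivation on random data (illustrative only; any dataset would do, and the seed is arbitrary):

import numpy as np

X = np.random.default_rng(1).normal(size=(100, 3))
Xc = X - X.mean(axis=0)               # center the data
V = Xc.T @ Xc / Xc.shape[0]           # covariance matrix V = X^T X / m
lams, W = np.linalg.eigh(V)           # symmetric: real eigenvalues, orthonormal eigenvectors
w = W[:, -1]                          # unit eigenvector with the largest eigenvalue
print(np.allclose(V @ w, lams[-1] * w))   # True: V w = lambda w
print(w @ V @ w, np.var(Xc @ w))          # both equal the maximized variance w^T V w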
PCA Example
PCA example
[Excerpt from "Mining Massive Datasets", Section 11.2.1 (An Illustrative Example): the data is two-dimensional and consists of only four points, (1,2), (2,1), (3,4), (4,3), arranged along the 45-degree line with small deviations in the perpendicular direction (Figure 11.1). The points are collected into a matrix M with one row per point and two columns; M^T M = (30 28; 28 30), and its eigenvalues are found by solving (30 − λ)(30 − λ) − 28 × 28 = 0.]
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 62 / 82
PCA Example
PCA example
In this example the data is two-dimensional and we want to reduce it to a single dimension
The data has only four points, and they are arranged in a simple pattern along the 45-degree line
To anticipate the result: the points lie approximately along this line
Small deviations in the orthogonal direction
We would expect the 45-degree line to maximize the variance
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 63 / 82
PCA Example
PCA example
Let us represent the data in matrix form:

X = | 1 2 |
    | 2 1 |
    | 3 4 |
    | 4 3 |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 64 / 82
PCA Example
PCA example
We compute X^T X:

X^T X = | 1 2 3 4 |   | 1 2 |
        | 2 1 4 3 | × | 2 1 | = | 30 28 |
                      | 3 4 |   | 28 30 |
                      | 4 3 |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 65 / 82
PCA Example
PCA: example
A = | 30 28 |
    | 28 30 |

det(λI − A) = det | λ−30   −28 | = (λ − 30)(λ − 30) − 784
                  |  −28  λ−30 |
            = λ² − 60λ + 900 − 784 = λ² − 60λ + 116
            = (λ − 58)(λ − 2)

Thus, λ_1 = 58 and λ_2 = 2 are the eigenvalues of A
We now solve (λ_i I − A)x = 0 for each eigenvalue to find the corresponding eigenvectors
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 66 / 82
PCA Example
PCA: example
For λ_1 = 58:

|  28 −28 | | x_1 |   | 0 |
| −28  28 | | x_2 | = | 0 |

28 x_1 − 28 x_2 = 0 (both rows give the same condition)

Thus x_1 = x_2, and we might pick x = (1, 1)^T
We normalize it to: (1/√2, 1/√2)^T
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 67 / 82
PCA Example
PCA: example
For λ_2 = 2:

| −28 −28 | | x_1 |   | 0 |
| −28 −28 | | x_2 | = | 0 |

−28 x_1 − 28 x_2 = 0 (both rows give the same condition)

Thus x_1 = −x_2, and we might pick x = (−1, 1)^T
We normalize it to: (−1/√2, 1/√2)^T
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 68 / 82
PCA Example
PCA: example
Now let us construct E, which is the (orthogonal) matrix of eigenvectors of the matrix X^T X

E = | 1/√2  −1/√2 |
    | 1/√2   1/√2 |

Any orthogonal matrix represents a rotation of the axes of a Euclidean space
In the example: a rotation by 45 degrees counterclockwise
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 69 / 82
PCA Example
PCA: example
We now transform the data into the new coordinate system by computing XE:

     | 1 2 |                      | 3/√2   1/√2 |
XE = | 2 1 |   | 1/√2  −1/√2 |    | 3/√2  −1/√2 |
     | 3 4 | × | 1/√2   1/√2 |  = | 7/√2   1/√2 |
     | 4 3 |                      | 7/√2  −1/√2 |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 70 / 82
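The whole example can be checked numerically with a small sketch. Note that np.linalg.eigh returns the eigenvalues in ascending order and fixes the signs of the eigenvectors arbitrarily, so the columns of E (and of XE) may differ from the slides by a sign:

import numpy as np

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3]], dtype=float)
A = X.T @ X
print(A)                       # [[30. 28.] [28. 30.]]
lams, E = np.linalg.eigh(A)
print(lams)                    # approximately [2, 58]
E = E[:, ::-1]                 # put the eigenvector of the largest eigenvalue first
print(X @ E)                   # rows ~ [3/sqrt(2), +-1/sqrt(2)] and [7/sqrt(2), +-1/sqrt(2)]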
PCA Example
PCA example
[Excerpt from "Mining Massive Datasets", Section 11.2: Figure 11.2 shows Figure 11.1 with the axes rotated 45 degrees counterclockwise. The first point, [1, 2], is transformed into [3/√2, 1/√2]; both the first and second points project onto [1.5, 1.5] in the original coordinates, at distance √(1.5² + 1.5²) = √(9/2) = 3/√2 from the origin, and the first point lies at distance √((1 − 1.5)² + (2 − 1.5)²) = 1/√2 above the new x-axis. Figure 11.3 shows the four points in the new coordinate system: (3/√2, ±1/√2) and (7/√2, ±1/√2).]
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 71 / 82
PCA Example
PCA: example
For example, the point [1, 2] has been transformed into the point [3/√2, 1/√2]
The point of projection for both the first and the second points is [1.5, 1.5], i.e. [3/2, 3/2]
The distance of this point from the origin in the new coordinate space is:

√((3/2)² + (3/2)²) = √(9/2) = 3/√2
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 72 / 82
PCA Example
PCA example
[The same excerpt and figures shown again: the four points in the rotated coordinate system lie at (3/√2, ±1/√2) and (7/√2, ±1/√2); the second point, [2, 1], happens to project onto the same point of the new x-axis and lies 1/√2 below it.]
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 73 / 82
PCA Example
PCA: example
From the example we can also see the general principle
The matrix XE contains the transformed points
Each column represents an axis in the new space
The variance along the axes decreases with each new axis, thus each new axis is less significant than the previous one
Since the axes are orthogonal, the values along the axes are linearly uncorrelated
We might drop the less significant axes
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 74 / 82
PCA Example
PCA: example
Thus, we reduce dimensions
It can also be seen as a kind of data compression
We remove (reduce) the values where the information content is small
You can relate PCA to information theory
It is possible to show that if the data is Gaussian, then PCA is also optimal from the information-theoretic point of view, i.e. the most significant axes have the maximal information content
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 75 / 82
PCA Example
PCA: example
     | 1 2 |                      | 3/√2   1/√2 |
XE = | 2 1 |   | 1/√2  −1/√2 |    | 3/√2  −1/√2 |
     | 3 4 | × | 1/√2   1/√2 |  = | 7/√2   1/√2 |
     | 4 3 |                      | 7/√2  −1/√2 |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 76 / 82
PCA Example
PCA: example
Projecting onto the first principal axis only:

| 1 2 |               | 3/√2 |
| 2 1 |   | 1/√2 |    | 3/√2 |
| 3 4 | × | 1/√2 |  = | 7/√2 |
| 4 3 |               | 7/√2 |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 77 / 82
PCA Example
PCA: Algorithm
Organize the data as an m × n matrix, with m entities and n features
Subtract the average of each feature to obtain the centered data matrix X
Calculate the covariance matrix (1/m) X^T X
Calculate the eigenvalues and the eigenvectors of the covariance matrix
Select the top r eigenvectors
Project the data onto the new space spanned by those r eigenvectors: XE ∈ R^(m×r), where E ∈ R^(n×r)
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 78 / 82
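The algorithm above translates directly into a few lines of numpy. This is an illustrative sketch (the function name pca is an assumption, not code from the course notebook):

import numpy as np

def pca(X, r):
    """Project the m x n data matrix X onto its top r principal components."""
    Xc = X - X.mean(axis=0)              # center each feature
    cov = Xc.T @ Xc / Xc.shape[0]        # covariance matrix (1/m) X^T X
    lams, vecs = np.linalg.eigh(cov)     # eigenpairs of the symmetric covariance matrix
    order = np.argsort(lams)[::-1][:r]   # indices of the top r eigenvalues
    E = vecs[:, order]                   # n x r matrix of the top r eigenvectors
    return Xc @ E, E                     # projected data (m x r) and the principal axes

Note that, unlike the small worked example earlier (which used the raw points), this sketch centers the data first, as the algorithm prescribes.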
PCA Example
PCA example
[Figure: scatter plot of a two-dimensional data set; both axes range from −20 to 20.]
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 79 / 82
PCA Example
PCA example
[Figure: scatter plot of a two-dimensional data set; both axes range from −20 to 20.]
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 80 / 82
PCA Example
PCA example
IPython Notebook examples
http://kti.tugraz.at/staff/denis/courses/kddm1/pca.ipynb
Command Line
ipython notebook --pylab=inline pca.ipynb
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 81 / 82
PCA Example
PCA: Limitations
PCA transforms a set of correlated observations into a set of linearly uncorrelated observations
I.e. the goal of the analysis is to decorrelate the data
In other words, the goal is to remove second-order dependencies in the data
However, if higher-order dependencies exist in the data, removing only the second-order dependencies will not remove all of the dependencies in the data
First workaround: apply a nonlinear (kernel) transformation first
Second workaround: require the data to be statistically independent rather than only linearly uncorrelated along the dimensions
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 82 / 82