TRANSCRIPT
Knowledge Discovery and Data Mining 1 (VO) (707.003): Dimensionality Reduction
Denis Helic
KTI, TU Graz
Dec 12, 2013
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 1 / 82
Big picture: KDDM
[Diagram: the knowledge discovery process (preprocessing, transformation, data mining) resting on mathematical tools (linear algebra, probability theory, information theory, statistical inference) and infrastructure (Map-Reduce).]
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 2 / 82
Outline
1 Introduction
2 UV Decomposition
3 Eigenvalues and Eigenvectors
4 Principal-Component Analysis
5 PCA Example
Slides
Slides are partially based on “Mining Massive Datasets” Chapters 9 and 11
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 3 / 82
Introduction
Representing data as matrices
There are many sources of data that can be represented as large matrices
Vector Space Model: documents are represented as term vectors
The complete document collection is represented as a large document-term matrix
In recommender systems, we represent users' ratings of items as a utility matrix
We also use matrices to represent social networks, etc.
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 4 / 82
Introduction
Representing data as matrices
These matrices are huge and have hundreds of thousands, even millions of rows and columns
E.g. the document-term matrix of Wikipedia: 10 million rows and 100,000 columns
Utility matrix at Amazon
The number of customers times the number of articles
The curse of dimensionality
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 5 / 82
Introduction
Representing data as matrices
In many cases we can summarize these matrices by finding narrower matrices that are in some sense close to the original
These narrow matrices have a small number of rows and columns
We can use them more efficiently than the original matrices
The process of finding those narrow matrices is called dimensionality reduction
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 6 / 82
UV Decomposition
Recommender systems
Recommender systems predict user responses to options
E.g. recommend news articles based on prediction of user interests
E.g. recommend products based on predictions of what a user might like
Content-based systems make predictions based on the content of the items
Collaborative filtering systems make predictions based on the similarity between users/items
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 7 / 82
UV Decomposition
The Utility Matrix
In a recommender system there are two classes of entities: users and items
Users have preferences for certain items and we have to mine for these preferences
The data is represented by a utility matrix M ∈ R^(n×m), where n is the number of users and m the number of items
For each user-item pair, the matrix gives a value representing what is known about the preference of that user for that item
E.g. the values can come from an ordered set (1 to 5) and represent a rating that a user gave for an item
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 8 / 82
UV Decomposition
The Utility Matrix
We assume that the matrix is sparse
This means that most entries are unknown
The majority of the user preferences for specific items is unknown
An unknown rating means that we do not have explicit information
It does not mean that the rating is low
The goal of a recommender system is to predict the blank entries in the utility matrix
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 9 / 82
UV Decomposition
The Utility Matrix: example
Users × Movies (HP1, HP2, HP3, Hobbit, SW1, SW2, SW3); blank entries are unknown ratings
User A's known ratings: 4, 5, 1
User B's known ratings: 5, 5, 4
User C's known ratings: 2, 4, 5
User D's known ratings: 3, 3
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 10 / 82
UV Decomposition
Item profiles
The utility matrix itself does not offer a lot of evidence
Typically in practice, the utility matrix is a very sparse matrix
Also, we might think about the utility matrix as the final result of a rating process
For example, items have some (general) characteristics that users like/dislike, and because of these characteristics they rate them in one way or another
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 11 / 82
UV Decomposition
Item profiles: example
Movies have certain features which describe important characteristics of every movie
The set of actors
The director
Year
The genre
Technical characteristics
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 12 / 82
UV Decomposition
User profiles
The same set of features might be used to represent the preferences of the users
We might represent the preferences as feature weights
E.g. a feature which the user prefers gets a higher weight
The final rating of a user for an item might then be the weighted sum of the features from the item profile
This can be used to predict the missing values in the utility matrix
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 13 / 82
UV Decomposition
Representing profiles
Let us represent item and user profiles as vectors with features as dimensions
Suppose we have the following features: Julia Roberts, Edward Norton, Martin Scorsese, Ridley Scott, Western, Drama, Thriller
Let us denote the features with a feature vector x, where each element (dimension) corresponds to a feature
x_1 = Julia Roberts, x_2 = Edward Norton, ..., x_7 = Thriller
Now we might represent the users and items with vectors u ∈ R^d and v ∈ R^d with corresponding values for each feature
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 14 / 82
UV Decomposition
Representing profiles: example
E.g. we might represent a movie starring Julia Roberts, directed by Martin Scorsese, and with a mixture of drama and thriller elements as a vector:

v_0 = (1, 0, 1, 0, 0, 0.5, 0.5)^T
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 15 / 82
UV Decomposition
Representing profiles: example
Another thriller movie starring Edward Norton and directed by Ridley Scott:

v_1 = (0, 1, 0, 1, 0, 0, 1)^T
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 16 / 82
UV Decomposition
Representing profiles: example
Let us represent the users as vectors u with weights for the features representing the user preferences
E.g. a user that prefers thrillers and Edward Norton might be represented as the following vector:

u_0 = (0, 2, 0, 0, 0, 0, 2)^T
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 17 / 82
UV Decomposition
Representing profiles: example
Note that with the weights we might also express that a user dislikes a particular feature:

u_1 = (−2, 0, 3, 0, 0, 2, 2)^T
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 18 / 82
UV Decomposition
Representing profiles: example
Now, we can calculate the rating that a user gives for a certain movie by computing u^T v

u_0^T v_0 = (0, 2, 0, 0, 0, 0, 2) · (1, 0, 1, 0, 0, 0.5, 0.5)^T = 1
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 19 / 82
UV Decomposition
Representing profiles: example
u_1^T v_0 = (−2, 0, 3, 0, 0, 2, 2) · (1, 0, 1, 0, 0, 0.5, 0.5)^T = 3
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 20 / 82
UV Decomposition
Representing profiles: example
u_0^T v_1 = (0, 2, 0, 0, 0, 0, 2) · (0, 1, 0, 1, 0, 0, 1)^T = 4
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 21 / 82
UV Decomposition
Representing profiles: example
u_1^T v_1 = (−2, 0, 3, 0, 0, 2, 2) · (0, 1, 0, 1, 0, 0, 1)^T = 2
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 22 / 82
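The four ratings above can be reproduced in a few lines of Python. This is a small illustrative sketch; the variable names u0, u1, v0, v1 are simply the vectors from the preceding slides:

import numpy as np

# Feature order: Julia Roberts, Edward Norton, Scorsese, Scott, Western, Drama, Thriller
v0 = np.array([1, 0, 1, 0, 0, 0.5, 0.5])            # Roberts/Scorsese movie, half drama, half thriller
v1 = np.array([0, 1, 0, 1, 0, 0, 1], dtype=float)   # Norton/Scott thriller
u0 = np.array([0, 2, 0, 0, 0, 0, 2], dtype=float)   # likes Edward Norton and thrillers
u1 = np.array([-2, 0, 3, 0, 0, 2, 2], dtype=float)  # dislikes Roberts, likes Scorsese, drama, thrillers

print(u0 @ v0, u1 @ v0, u0 @ v1, u1 @ v1)           # 1.0 3.0 4.0 2.0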
UV Decomposition
Representing users: The U Matrix
We can now group all user vectors u ∈ R^d into a matrix U ∈ R^(n×d)
n is the number of users, and d is the number of features

U = | u_1^T |
    | u_2^T |
    |  ...  |
    | u_n^T |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 23 / 82
UV Decomposition
Representing items: The V Matrix
We can now group all item vectors v ∈ R^d into a matrix V ∈ R^(d×m)
m is the number of items, and d is the number of features

V = ( v_1  v_2  ...  v_m )
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 24 / 82
UV Decomposition
The product: UV
Now, the utility matrix M is given by:
M = UV
We will decompose M to obtain U and V
We will reduce the dimensions of M
We can also use U and V to predict missing values in M
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 25 / 82
UV Decomposition
UV Decomposition
We start with M ∈ R^(n×m) and want to find U ∈ R^(n×d) and V ∈ R^(d×m)
UV closely approximates M
If we are able to find this decomposition, then we have established that there are d dimensions that allow us to characterize both users and items closely
This process is called UV decomposition
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 26 / 82
UV Decomposition
UV Decomposition: example
| 5 2 4 4 3 |   | u11 u12 |
| 3 1 2 4 1 |   | u21 u22 |
| 2 . 3 1 4 | = | u31 u32 | × | v11 v12 v13 v14 v15 |
| 2 5 4 3 5 |   | u41 u42 |   | v21 v22 v23 v24 v25 |
| 4 4 5 4 . |   | u51 u52 |

(blank entries of the utility matrix are marked with a dot)
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 27 / 82
UV Decomposition
Root-Mean-Square-Error
We approximate M → we need to measure the approximation error
We can pick among several measures for this error
A typical choice is the root-mean-square error (RMSE):
1 Sum, over all nonblank entries in M, the square of the difference between that entry and the corresponding entry in the product UV
2 Take the average of these squares by dividing by the number of terms in the sum (i.e. the number of nonblank entries in M)
3 Take the square root of the mean
Minimizing the sum of the squares is equivalent to minimizing the square root of the average square, thus we can omit the last two steps
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 28 / 82
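As an illustration, the RMSE over the nonblank entries can be computed as in the following minimal sketch. It assumes unknown entries are encoded as NaN; M is the example utility matrix from the previous slides:

import numpy as np

def rmse(M, U, V):
    """RMSE between the nonblank entries of M and the corresponding entries of U V."""
    P = U @ V                    # current approximation of the utility matrix
    known = ~np.isnan(M)         # nonblank entries only
    diff = M[known] - P[known]
    return np.sqrt(np.mean(diff ** 2))

M = np.array([[5, 2,      4, 4, 3],
              [3, 1,      2, 4, 1],
              [2, np.nan, 3, 1, 4],
              [2, 5,      4, 3, 5],
              [4, 4,      5, 4, np.nan]])
U = np.ones((5, 2))
V = np.ones((2, 5))
print(rmse(M, U, V))             # sqrt(75 / 23) ~ 1.81; 75 is the sum of squares computed on the following slides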
UV Decomposition
Root-Mean-Square-Error: example
Suppose we start with U and V with all ones:

| 1 1 |                   | 2 2 2 2 2 |
| 1 1 |   | 1 1 1 1 1 |   | 2 2 2 2 2 |
| 1 1 | × | 1 1 1 1 1 | = | 2 2 2 2 2 |
| 1 1 |                   | 2 2 2 2 2 |
| 1 1 |                   | 2 2 2 2 2 |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 29 / 82
UV Decomposition
Root-Mean-Square-Error: example
| 5 2 4 4 3 |   | 2 2 2 2 2 |   | 3  0  2  2  1 |
| 3 1 2 4 1 |   | 2 2 2 2 2 |   | 1 −1  0  2 −1 |
| 2 . 3 1 4 | − | 2 2 2 2 2 | = | 0  .  1 −1  2 |
| 2 5 4 3 5 |   | 2 2 2 2 2 |   | 0  3  2  1  3 |
| 4 4 5 4 . |   | 2 2 2 2 2 |   | 2  2  3  2  . |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 30 / 82
UV Decomposition
Root-Mean-Square-Error: example
Sum of squares:
Row 1: 18, Row 2: 7, Row 3: 6, Row 4: 23, Row 5: 21
Total sum: 75
We can already stop at this point (minimizing the total sum of squares is equivalent to minimizing the RMSE)
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 31 / 82
UV Decomposition
Incremental computation
Finding the decomposition with the least RMSE involves starting with some arbitrarily chosen U and V and iteratively adapting the matrices to make the RMSE smaller
We consider only adjustments to a single element of U or V
In principle we could also make more complex adjustments
In a typical example we will encounter many local minima
At a local minimum, no allowable adjustment to U or V will make the RMSE smaller
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 32 / 82
UV Decomposition
Incremental computation
Only one of these local minima will be the global minimum
That is the one with the least possible RMSE
To increase the chances of finding the global minimum we may start the iteration many times with different starting points
However, there is no guarantee that we will find the global minimum
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 33 / 82
UV Decomposition
Incremental computation: example
Suppose we start with U and V with all ones and make a single adjustment (u_11):

| x 1 |                   | x+1 x+1 x+1 x+1 x+1 |
| 1 1 |   | 1 1 1 1 1 |   |  2   2   2   2   2  |
| 1 1 | × | 1 1 1 1 1 | = |  2   2   2   2   2  |
| 1 1 |                   |  2   2   2   2   2  |
| 1 1 |                   |  2   2   2   2   2  |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 34 / 82
UV Decomposition
Incremental computation: example
Sum of squares for Row 1:

(5 − (x+1))² + (2 − (x+1))² + (4 − (x+1))² + (4 − (x+1))² + (3 − (x+1))²

This simplifies to: (4 − x)² + (1 − x)² + (3 − x)² + (3 − x)² + (2 − x)²
We are looking for the x that minimizes this sum s:

ds/dx = 0
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 35 / 82
UV Decomposition
Incremental computation: example
ds/dx = −2((4 − x) + (1 − x) + (3 − x) + (3 − x) + (2 − x)) = 0

This gives x = 2.6

| 2.6 1 |                   | 3.6 3.6 3.6 3.6 3.6 |
|  1  1 |   | 1 1 1 1 1 |   |  2   2   2   2   2  |
|  1  1 | × | 1 1 1 1 1 | = |  2   2   2   2   2  |
|  1  1 |                   |  2   2   2   2   2  |
|  1  1 |                   |  2   2   2   2   2  |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 36 / 82
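A quick numeric check of the x = 2.6 result (illustrative only; the helper row_error is not part of the lecture material, it just evaluates the row-1 sum of squares from the previous slide):

import numpy as np

m_row = np.array([5, 2, 4, 4, 3])                 # first row of M
def row_error(x):
    # with u_11 = x and all other entries 1, the first row of UV is (x+1, ..., x+1)
    return np.sum((m_row - (x + 1)) ** 2)

xs = np.linspace(0, 5, 501)                       # grid with step 0.01
print(xs[np.argmin([row_error(x) for x in xs])])  # 2.6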
UV Decomposition
Incremental computation: example
Now we would again make a single adjustment (v_11) and repeat the process:

| 2.6 1 |                   | 2.6y+1 3.6 3.6 3.6 3.6 |
|  1  1 |   | y 1 1 1 1 |   |  y+1    2   2   2   2  |
|  1  1 | × | 1 1 1 1 1 | = |  y+1    2   2   2   2  |
|  1  1 |                   |  y+1    2   2   2   2  |
|  1  1 |                   |  y+1    2   2   2   2  |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 37 / 82
UV Decomposition
Optimizing an arbitrary element
What does the general formula look like?
We denote with P = UV the current product of the matrices U and V
Suppose we want to vary u_rs and find the value of this element that minimizes the RMSE
Note that u_rs only affects the elements in the r-th row of P:

p_rj = Σ_{k=1..d} u_rk v_kj = Σ_{k≠s} u_rk v_kj + x v_sj

We sum over all nonblank values m_rj
We replaced u_rs with x
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 38 / 82
UV Decomposition
Optimizing an arbitrary element
If m_rj is a nonblank element, then the contribution of this element to the RMSE is given by:

(m_rj − p_rj)² = (m_rj − Σ_{k≠s} u_rk v_kj − x v_sj)²

Now, we can sum over all squared errors on the nonblank entries of M:

Σ_j (m_rj − Σ_{k≠s} u_rk v_kj − x v_sj)²
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 39 / 82
UV Decomposition
Optimizing an arbitrary element
We take the derivative with respect to x and set it equal to 0:

Σ_j −2 v_sj (m_rj − Σ_{k≠s} u_rk v_kj − x v_sj) = 0

We then solve for x:

x = Σ_j v_sj (m_rj − Σ_{k≠s} u_rk v_kj) / Σ_j v_sj²
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 40 / 82
UV Decomposition
Optimizing an arbitrary element
Similarly, we can derive a formula for the element y when we vary v_rs:

y = Σ_i u_ir (m_is − Σ_{k≠r} u_ik v_ks) / Σ_i u_ir²
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 41 / 82
UV Decomposition
The complete algorithm
Preprocessing: adjust scales, e.g. by subtracting the average in rows and then in columns
Initialization: many different initializations are possible, e.g. choose the elements so that the product UV matches the average of the elements in the utility matrix
Optimization: e.g. we always change a single element and pick an order of change (row-by-row, etc.); a code sketch of this procedure follows below
Convergence: when the improvements in RMSE fall below a threshold we may stop
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 42 / 82
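The following is a compact sketch of the algorithm described above, using the single-element update formulas for x and y from the previous slides. It is illustrative only, not the exact implementation from the lecture; the function name uv_decompose and the encoding of blank entries as NaN are assumptions:

import numpy as np

def uv_decompose(M, d=2, sweeps=50):
    """Element-wise UV decomposition of M (blank entries encoded as np.nan)."""
    n, m = M.shape
    known = ~np.isnan(M)
    avg = np.nanmean(M)
    # Initialization: pick entries so that the product UV starts near the average rating
    U = np.full((n, d), np.sqrt(avg / d))
    V = np.full((d, m), np.sqrt(avg / d))
    for _ in range(sweeps):
        # Optimize every element u_rs (the formula for x)
        for r in range(n):
            for s in range(d):
                j = known[r, :]
                rest = U[r, :] @ V[:, j] - U[r, s] * V[s, j]   # sum over k != s
                denom = np.sum(V[s, j] ** 2)
                if denom > 0:
                    U[r, s] = np.sum(V[s, j] * (M[r, j] - rest)) / denom
        # Optimize every element v_rs (the formula for y)
        for r in range(d):
            for s in range(m):
                i = known[:, s]
                rest = U[i, :] @ V[:, s] - U[i, r] * V[r, s]
                denom = np.sum(U[i, r] ** 2)
                if denom > 0:
                    V[r, s] = np.sum(U[i, r] * (M[i, s] - rest)) / denom
    return U, V

A fuller implementation would also track the RMSE after each sweep and stop once the improvement falls below a threshold, as described on the slide above.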
UV Decomposition
Gradient Descent
This technique for finding the decomposition is an example of gradient descent
We are given some data points: the nonblank entries of the utility matrix
For each data point we find the direction of change that most decreases the RMSE
If the utility matrix is too large to visit each nonblank entry several times, we might randomly select a fraction of the data
This is called stochastic gradient descent
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 43 / 82
UV Decomposition
Overfitting
One problem that may arise:
We arrive at a local minimum that fits the given data very well
But it fails to reflect the underlying process that generates the data
In other words, the RMSE is small on the given data, but the model does not do well predicting future data
This problem is called overfitting
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 44 / 82
UV Decomposition
Avoid overfitting
Move each value only a fraction of the way towards its optimized value (in the beginning)
Stop revisiting elements of U and V well before the process has converged
Take several different decompositions and, when predicting, average the results of the individual decompositions
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 45 / 82
Eigenvalues and Eigenvectors
Eigenvalues and eigenvectors
Eigenvalues and eigenvectors
Given a square matrix A ∈ R^(n×n), we say that λ ∈ C is an eigenvalue of A and x ∈ C^n is the corresponding eigenvector if

Ax = λx,  x ≠ 0

There are two important properties of eigenvalues and eigenvectors of symmetric matrices
All eigenvalues are real
The eigenvectors are orthonormal
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 46 / 82
Eigenvalues and Eigenvectors
Power method
Typically, we would calculate the leading eigenvector and leading eigenvalue iteratively
A standard approach is the power method
We make an initial guess about the eigenvector, x_0
Then we iteratively calculate x_t (which converges to the leading eigenvector)

x_t = A x_{t−1} / ||A x_{t−1}||_2
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 47 / 82
Eigenvalues and Eigenvectors
Power method
In other words, the limiting vector is approximately equal to the leading eigenvector of the matrix
At the end of the iteration the leading (principal) eigenvalue can be calculated as:

λ_1 = x^T A x
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 48 / 82
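A minimal sketch of the power method in code (illustrative; the fixed iteration count and seeded random initial guess are arbitrary choices, and the test matrix is the one that appears later in the PCA example):

import numpy as np

def power_iteration(A, iters=1000, seed=0):
    """Approximate the leading eigenpair of a symmetric matrix A."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=A.shape[0])
    x /= np.linalg.norm(x)                  # initial guess x_0
    for _ in range(iters):
        y = A @ x
        x = y / np.linalg.norm(y)           # x_t = A x_{t-1} / ||A x_{t-1}||_2
    lam = x @ A @ x                         # leading eigenvalue: lambda_1 = x^T A x
    return lam, x

A = np.array([[30.0, 28.0], [28.0, 30.0]])
lam1, x1 = power_iteration(A)
print(lam1, x1)                             # ~58 and ~(1/sqrt(2), 1/sqrt(2)), up to sign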
Eigenvalues and Eigenvectors
Power method
To find the second eigenpair we create a new matrix A* = A − λ_1 x x^T
We then again use the power iteration to calculate the leading eigenpair of A*
This leading eigenpair corresponds to the second largest eigenpair of the original matrix A
Intuitively, we have eliminated the influence of a given eigenvector by setting its associated eigenvalue to zero
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 49 / 82
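Continuing the sketch above, the deflation step and the second eigenpair can be obtained as follows (again illustrative, reusing A, lam1, x1 from the previous block):

A_star = A - lam1 * np.outer(x1, x1)        # A* = A - lambda_1 x x^T
lam2, x2 = power_iteration(A_star)          # leading eigenpair of A* = second eigenpair of A
print(lam2, x2)                             # ~2 and ~(-1/sqrt(2), 1/sqrt(2)), up to sign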
Eigenvalues and Eigenvectors
Power method
More formally, if A* = A − λ_1 x x^T, where λ_1 is the leading eigenvalue of A and x is the leading eigenvector of A, then
1 x is also an eigenvector of A*, where the corresponding eigenvalue is 0
2 If v and λ_v are an eigenpair of A other than the principal eigenpair, then they are also an eigenpair of A*
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 50 / 82
Eigenvalues and Eigenvectors
Power method
Proof.
We assume that A is a symmetric matrix (its eigenvectors are orthonormal, so x^T x = 1 and x^T v = 0)
1 A*x = (A − λ_1 x x^T) x = Ax − λ_1 x x^T x = Ax − λ_1 x = 0 = 0x
2 A*v = (A*)^T v = (A − λ_1 x x^T)^T v = A^T v − λ_1 x x^T v = A^T v = Av = λ_v v
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 51 / 82
Principal-Component Analysis
PCA
Principal-component analysis or PCA is a technique for transforming points from a high-dimensional space by finding the directions along which the points line up best
The idea is to treat the data as a matrix X and find the eigenvectors of the matrix proportional to the covariance matrix, XX^T or X^T X
The matrix of these eigenvectors may be thought of as a rigid rotation in a high-dimensional space
The axis corresponding to the principal eigenvector is the one with the maximal variance
It carries most of the signal
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 52 / 82
Principal-Component Analysis
PCA
The axis corresponding to the second eigenvector is the axis along which the variance of distances from the first axis is greatest, and so on
Thus, we can replace the original high-dimensional data by its projection onto the most important axes
These axes are the ones corresponding to the largest eigenvalues
Thus, the original data is approximated by data with fewer dimensions
The new data summarizes the original data well
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 53 / 82
Principal-Component Analysis
Maximizing the variance
We can specify an axis by a unit vector w lying on that axis
The projection of another (centered) vector x onto the axis specified by w is given by the inner product of those two vectors:

x^T w

A centered vector is one from which the average has been subtracted
If we combine all (centered) data vectors into a matrix X, then the projection of the matrix onto the axis specified by w is given by:

Xw
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 54 / 82
Principal-Component Analysis
Maximizing the variance
The variance of a single row from the matrix is given by:

(x_i^T w)²

The variance of the complete projection is then given by:

σ² = (1/m) Σ_i (x_i^T w)²
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 55 / 82
Principal-Component Analysis
Maximizing the variance
In matrix form the variance is given by:

σ² = (1/m) (Xw)^T (Xw) = (1/m) w^T X^T X w = w^T (X^T X / m) w = w^T V w

Now, we want to choose a unit vector w that maximizes σ²
It must be a unit vector, thus the constraint w^T w = 1 must be satisfied
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 56 / 82
Principal-Component Analysis
Constrained optimization: Lagrange multipliers
Original objective function that we want to maximize: w^T V w
This function is subject to a constraint: constrained optimization
Typically solved by the method of Lagrange multipliers
Objective function: f(w) = w^T V w
Subject to: w^T w = 1
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 57 / 82
Principal-Component Analysis
Lagrange multipliers
For each constraint we need one Lagrange multiplier, e.g. λ
The Lagrange formulation of the optimization problem will be a new objective function that is a function of w and λ

L(w, λ) = w^T V w − λ(w^T w − 1)
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 58 / 82
Principal-Component Analysis
Constrained optimization
To optimize L we find w and λ that make its gradient 0

∇L = 0:

∂L/∂w = 0
∂L/∂λ = 0
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 59 / 82
Principal-Component Analysis
Constrained optimization
∂L/∂λ = 0 gives back the constraint

∂L/∂w = 2Vw − 2λw = 0

Vw = λw
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 60 / 82
Principal-Component Analysis
Constrained optimization
Thus, the desired vector w is an eigenvector of the covariance matrix V
The maximizing vector will be the one associated with the largest eigenvalue λ
V is a covariance matrix, thus it will be symmetric
The eigenvectors are orthogonal and can be found by the power method
They are called principal components
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 61 / 82
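A quick numeric sanity check of this derivation on random data (illustrative only; any dataset would do, and the seed is arbitrary):

import numpy as np

X = np.random.default_rng(1).normal(size=(100, 3))
Xc = X - X.mean(axis=0)               # center the data
V = Xc.T @ Xc / Xc.shape[0]           # covariance matrix V = X^T X / m
lams, W = np.linalg.eigh(V)           # symmetric: real eigenvalues, orthonormal eigenvectors
w = W[:, -1]                          # unit eigenvector with the largest eigenvalue
print(np.allclose(V @ w, lams[-1] * w))   # True: V w = lambda w
print(w @ V @ w, np.var(Xc @ w))          # both equal the maximized variance w^T V w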
PCA Example
PCA example
[Excerpt from "Mining Massive Datasets", Section 11.2.1 (An Illustrative Example): the data is two-dimensional and consists of only four points, (1,2), (2,1), (3,4), (4,3), arranged along the 45-degree line with small deviations in the perpendicular direction (Figure 11.1). The points are collected into a matrix M with one row per point and two columns; M^T M = (30 28; 28 30), and its eigenvalues are found by solving (30 − λ)(30 − λ) − 28 × 28 = 0.]
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 62 / 82
PCA Example
PCA example
In this example the data is two-dimensional and we want to reduce it to a single dimension
The data has only four points, and they are arranged in a simple pattern along the 45-degree line
To anticipate the result: the points lie approximately along this line
Small deviations in the orthogonal direction
We would expect the 45-degree line to maximize the variance
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 63 / 82
PCA Example
PCA example
Let us represent the data in matrix form:

X = | 1 2 |
    | 2 1 |
    | 3 4 |
    | 4 3 |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 64 / 82
PCA Example
PCA example
We compute X^T X:

X^T X = | 1 2 3 4 |   | 1 2 |
        | 2 1 4 3 | × | 2 1 | = | 30 28 |
                      | 3 4 |   | 28 30 |
                      | 4 3 |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 65 / 82
PCA Example
PCA: example
A = | 30 28 |
    | 28 30 |

det(λI − A) = det | λ−30   −28 | = (λ − 30)(λ − 30) − 784
                  |  −28  λ−30 |
            = λ² − 60λ + 900 − 784 = λ² − 60λ + 116
            = (λ − 58)(λ − 2)

Thus, λ_1 = 58 and λ_2 = 2 are the eigenvalues of A
We now solve (λ_i I − A)x = 0 for each eigenvalue to find the corresponding eigenvectors
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 66 / 82
PCA Example
PCA: example
For λ_1 = 58:

|  28 −28 | | x_1 |   | 0 |
| −28  28 | | x_2 | = | 0 |

28 x_1 − 28 x_2 = 0 (both rows give the same condition)

Thus x_1 = x_2, and we might pick x = (1, 1)^T
We normalize it to: (1/√2, 1/√2)^T
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 67 / 82
PCA Example
PCA: example
For λ_2 = 2:

| −28 −28 | | x_1 |   | 0 |
| −28 −28 | | x_2 | = | 0 |

−28 x_1 − 28 x_2 = 0 (both rows give the same condition)

Thus x_1 = −x_2, and we might pick x = (−1, 1)^T
We normalize it to: (−1/√2, 1/√2)^T
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 68 / 82
PCA Example
PCA: example
Now let us construct E, which is the (orthogonal) matrix of eigenvectors of the matrix X^T X

E = | 1/√2  −1/√2 |
    | 1/√2   1/√2 |

Any orthogonal matrix represents a rotation of the axes of a Euclidean space
In the example: a rotation by 45 degrees counterclockwise
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 69 / 82
PCA Example
PCA: example
We now transform the data into the new coordinate system by computing XE:

     | 1 2 |                      | 3/√2   1/√2 |
XE = | 2 1 |   | 1/√2  −1/√2 |    | 3/√2  −1/√2 |
     | 3 4 | × | 1/√2   1/√2 |  = | 7/√2   1/√2 |
     | 4 3 |                      | 7/√2  −1/√2 |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 70 / 82
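The whole example can be checked numerically with a small sketch. Note that np.linalg.eigh returns the eigenvalues in ascending order and fixes the signs of the eigenvectors arbitrarily, so the columns of E (and of XE) may differ from the slides by a sign:

import numpy as np

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3]], dtype=float)
A = X.T @ X
print(A)                       # [[30. 28.] [28. 30.]]
lams, E = np.linalg.eigh(A)
print(lams)                    # approximately [2, 58]
E = E[:, ::-1]                 # put the eigenvector of the largest eigenvalue first
print(X @ E)                   # rows ~ [3/sqrt(2), +-1/sqrt(2)] and [7/sqrt(2), +-1/sqrt(2)]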
PCA Example
PCA example
[Excerpt from "Mining Massive Datasets", Section 11.2: Figure 11.2 shows Figure 11.1 with the axes rotated 45 degrees counterclockwise. The first point, [1, 2], is transformed into [3/√2, 1/√2]; both the first and second points project onto [1.5, 1.5] in the original coordinates, at distance √(1.5² + 1.5²) = √(9/2) = 3/√2 from the origin, and the first point lies at distance √((1 − 1.5)² + (2 − 1.5)²) = 1/√2 above the new x-axis. Figure 11.3 shows the four points in the new coordinate system: (3/√2, ±1/√2) and (7/√2, ±1/√2).]
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 71 / 82
PCA Example
PCA: example
For example, the point [1, 2] has been transformed into the point [3/√2, 1/√2]
The point of projection for both the first and the second points is [1.5, 1.5], i.e. [3/2, 3/2]
The distance of this point from the origin in the new coordinate space is:

√((3/2)² + (3/2)²) = √(9/2) = 3/√2
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 72 / 82
PCA Example
PCA example
[The same excerpt and figures shown again: the four points in the rotated coordinate system lie at (3/√2, ±1/√2) and (7/√2, ±1/√2); the second point, [2, 1], happens to project onto the same point of the new x-axis and lies 1/√2 below it.]
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 73 / 82
PCA Example
PCA: example
From the example we can also see the general principle
The matrix XE contains the transformed points
Each column represents an axis in the new space
The variance along the axes decreases with each new axis, thus each new axis is less significant than the previous one
Since the axes are orthogonal, the values along the axes are linearly uncorrelated
We might drop the less significant axes
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 74 / 82
PCA Example
PCA: example
Thus, we reduce dimensions
It can also be seen as a kind of data compression
We remove (reduce) the values where the information content is small
You can relate PCA to information theory
It is possible to show that if the data is Gaussian, then PCA is also optimal from the information-theoretic point of view, i.e. the most significant axes have the maximal information content
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 75 / 82
PCA Example
PCA: example
     | 1 2 |                      | 3/√2   1/√2 |
XE = | 2 1 |   | 1/√2  −1/√2 |    | 3/√2  −1/√2 |
     | 3 4 | × | 1/√2   1/√2 |  = | 7/√2   1/√2 |
     | 4 3 |                      | 7/√2  −1/√2 |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 76 / 82
PCA Example
PCA: example
Projecting onto the first principal axis only:

| 1 2 |               | 3/√2 |
| 2 1 |   | 1/√2 |    | 3/√2 |
| 3 4 | × | 1/√2 |  = | 7/√2 |
| 4 3 |               | 7/√2 |
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 77 / 82
PCA Example
PCA: Algorithm
Organize the data as an m × n matrix, with m entities and n features
Subtract the average of each feature to obtain the centered data matrix X
Calculate the covariance matrix (1/m) X^T X
Calculate the eigenvalues and the eigenvectors of the covariance matrix
Select the top r eigenvectors
Project the data onto the new space spanned by those r eigenvectors: XE ∈ R^(m×r), where E ∈ R^(n×r)
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 78 / 82
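The algorithm above translates directly into a few lines of numpy. This is an illustrative sketch (the function name pca is an assumption, not code from the course notebook):

import numpy as np

def pca(X, r):
    """Project the m x n data matrix X onto its top r principal components."""
    Xc = X - X.mean(axis=0)              # center each feature
    cov = Xc.T @ Xc / Xc.shape[0]        # covariance matrix (1/m) X^T X
    lams, vecs = np.linalg.eigh(cov)     # eigenpairs of the symmetric covariance matrix
    order = np.argsort(lams)[::-1][:r]   # indices of the top r eigenvalues
    E = vecs[:, order]                   # n x r matrix of the top r eigenvectors
    return Xc @ E, E                     # projected data (m x r) and the principal axes

Note that, unlike the small worked example earlier (which used the raw points), this sketch centers the data first, as the algorithm prescribes.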
PCA Example
PCA example
[Figure: scatter plot of a two-dimensional data set; both axes range from −20 to 20.]
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 79 / 82
PCA Example
PCA example
[Figure: scatter plot of a two-dimensional data set; both axes range from −20 to 20.]
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 80 / 82
PCA Example
PCA example
IPython Notebook examples
http://kti.tugraz.at/staff/denis/courses/kddm1/pca.ipynb
Command Line
ipython notebook --pylab=inline pca.ipynb
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 81 / 82
PCA Example
PCA: Limitations
PCA transforms a set of correlated observations into a set of linearly uncorrelated observations
I.e. the goal of the analysis is to decorrelate the data
In other words, the goal is to remove second-order dependencies in the data
However, if higher-order dependencies exist in the data, removing only the second-order dependencies will not remove all of the dependencies in the data
First workaround: apply a nonlinear (kernel) transformation first
Second workaround: require the data to be statistically independent rather than only linearly uncorrelated along the dimensions
Denis Helic (KTI, TU Graz) KDDM1 Dec 12, 2013 82 / 82