

Curse-of-Dimensionality

• Random sample of size N ~ uniform distribution in the q-dimensional unit hypercube.

• Diameter of a neighborhood capturing K = 1 point, using Euclidean distance: $d(q,N) = O\!\left((K/N)^{1/q}\right)$

q        4      4      6      6      10     10     20     20      20
N        100    1000   100    1000   1000   10000  10000  10^6    10^10
d(q,N)   0.42   0.23   0.71   0.48   0.91   0.72   1.51   1.20    0.76

• As dimensionality increases, the distance from the closest point increases faster.

• Large $d(q,N)$ ⇒ highly biased estimations.

Curse-of-Dimensionality

In high-dimensional spaces data become extremely sparse and are far apart from each other.

The curse of dimensionality affects any estimation problem with high dimensionality.
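The trend in the table above is easy to reproduce numerically. Below is a minimal sketch (assuming NumPy; the particular q and N values and the choice of the hypercube centre as query point are illustrative, and the simulation shows only the qualitative trend, not the constant hidden in the O(·) bound):

    import numpy as np

    rng = np.random.default_rng(0)

    def nn_distance(q, N, trials=50):
        """Average Euclidean distance from the centre of [0,1]^q to the nearest
        of N points drawn uniformly in the unit hypercube."""
        dists = []
        for _ in range(trials):
            X = rng.random((N, q))                  # N uniform points in [0,1]^q
            d = np.linalg.norm(X - 0.5, axis=1)     # distance of each point to the centre
            dists.append(d.min())                   # nearest-neighbour distance
        return np.mean(dists)

    for q, N in [(4, 100), (4, 1000), (10, 1000), (20, 10000)]:
        print(f"q={q:2d}  N={N:6d}  avg nearest-neighbour distance = {nn_distance(q, N):.2f}")

Even with N growing by orders of magnitude, the nearest point drifts away as q increases, which is exactly why local neighborhoods stop being "local" in high dimensions.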


Curse-of-Dimensionality

It is a serious problem in many real-world applications:

• Microarray data: 3,000-4,000 genes;

• Documents: 10,000-20,000 words in the dictionary;

• Images, face recognition, etc.

How can we deal with the curse of dimensionality?


Curse-of-Dimensionality

Effective techniques applicable to high-dimensional spaces exist. The reasons are twofold:

• Real data are often confined to regions of lower dimensionality.

• Real data typically exhibit smoothness properties (at least locally). Local interpolation techniques can be used to make predictions.

Covariance matrix (example):
$$\Sigma = \begin{bmatrix} 68.7 & 92.2 \\ 92.2 & 191.5 \end{bmatrix}$$
The covariance measures the extent to which the two variables vary together. If they are independent, the covariance vanishes.


For a 2-dimensional random vector $\mathbf{x} = (x_1, x_2)^T$ with mean $\boldsymbol{\mu} = (\mu_1, \mu_2)^T$, and a sample $\mathbf{x}_1, \dots, \mathbf{x}_N$, the covariance matrix is

$$\Sigma = E\!\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\right]
= E\!\begin{bmatrix} (x_1-\mu_1)^2 & (x_1-\mu_1)(x_2-\mu_2) \\ (x_2-\mu_2)(x_1-\mu_1) & (x_2-\mu_2)^2 \end{bmatrix}
\approx \frac{1}{N}\sum_{i=1}^{N} (\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})^T$$

$$= \begin{bmatrix}
\frac{1}{N}\sum_{i=1}^{N}(x_{i1}-\mu_1)^2 & \frac{1}{N}\sum_{i=1}^{N}(x_{i1}-\mu_1)(x_{i2}-\mu_2) \\[4pt]
\frac{1}{N}\sum_{i=1}^{N}(x_{i2}-\mu_2)(x_{i1}-\mu_1) & \frac{1}{N}\sum_{i=1}^{N}(x_{i2}-\mu_2)^2
\end{bmatrix}$$

The diagonal entries are the variances of $x_1$ and $x_2$; the off-diagonal entries are the covariances.
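As a quick check of the estimator above, the sample covariance matrix can be computed directly from data. A minimal sketch (assuming NumPy; the 2-D data are synthetic and generated only for illustration):

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic 2-D sample: x2 is positively correlated with x1.
    N = 1000
    x1 = rng.normal(0.0, 2.0, size=N)
    x2 = 0.8 * x1 + rng.normal(0.0, 1.0, size=N)
    X = np.column_stack([x1, x2])                   # shape (N, 2)

    mu = X.mean(axis=0)                             # sample mean (mu_1, mu_2)
    diff = X - mu
    Sigma = (diff.T @ diff) / N                     # (1/N) * sum_i (x_i - mu)(x_i - mu)^T

    print("sample mean:", mu)
    print("sample covariance matrix:\n", Sigma)
    print("numpy check:\n", np.cov(X, rowvar=False, bias=True))   # same 1/N estimator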


Example sample covariance matrices, illustrating different degrees of correlation between the two variables:
$$\begin{bmatrix} 0.99 & -0.5 \\ -0.5 & 1.06 \end{bmatrix}, \quad
\begin{bmatrix} 1.04 & -1.05 \\ -1.05 & 1.15 \end{bmatrix}, \quad
\begin{bmatrix} 0.93 & 0.01 \\ 0.01 & 1.05 \end{bmatrix}, \quad
\begin{bmatrix} 0.97 & 0.49 \\ 0.49 & 1.04 \end{bmatrix}, \quad
\begin{bmatrix} 0.94 & 0.93 \\ 0.93 & 1.03 \end{bmatrix}$$
(moderate negative, strong negative, near zero, moderate positive, and strong positive correlation, respectively).

Dimensionality Reduction

• Many dimensions are often interdependent (correlated);

We can:

• Reduce the dimensionality of problems;

• Transform interdependent coordinates into significant and independent ones;


Bayesian Probabilities

A key issue in pattern recognition is uncertainty. It is due to incomplete and/or ambiguous information, i.e. finite and noisy data.

Probability theory and decision theory provide the tools to make optimal predictions given the limited available information.

In particular, the Bayesian interpretation of probability allows us to quantify uncertainty and to make precise revisions of uncertainty in light of new evidence.

Bayes’ Theorem

$$p(Y=y \mid X=x) = \frac{p(X=x \mid Y=y)\; p(Y=y)}{p(X=x)}$$

$p(Y=y)$ is the prior probability: it expresses the probability before we observe any data.

$p(Y=y \mid X=x)$ is the posterior probability: it expresses the probability after we have observed the data.

The effect of the observed data is captured through the conditional probability $p(X=x \mid Y=y)$.
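A small numeric illustration of the theorem (the prior and conditional values below are invented purely for illustration):

    # Bayes' theorem for a binary variable Y and one observed value x.
    prior = {"y1": 0.3, "y2": 0.7}                  # p(Y = y)
    likelihood = {"y1": 0.8, "y2": 0.1}             # p(X = x | Y = y)

    evidence = sum(likelihood[y] * prior[y] for y in prior)             # p(X = x)
    posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}

    print(posterior)    # {'y1': 0.774..., 'y2': 0.225...}: observing x revises the prior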


Curve fitting re-visited

We can adopt a Bayesian approach when estimating the parameters for polynomial curve fitting.

$p(\mathbf{w})$ captures our assumptions about $\mathbf{w}$ before observing the data. The effect of the observed data $D$ is captured by the conditional probability $p(D \mid \mathbf{w})$. Bayes' theorem allows us to evaluate the uncertainty in $\mathbf{w}$ after we have observed the data $D$ (in the form of the posterior probability):

$$p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w})\; p(\mathbf{w})}{p(D)}$$

$p(D \mid \mathbf{w})$ is the likelihood function. Maximum likelihood approach: set $\mathbf{w}$ to the value that maximizes $p(D \mid \mathbf{w})$.

Curve fitting re-visited: ML approach

Training data: $\mathbf{x} = (x_1, \dots, x_N)^T$, $\mathbf{t} = (t_1, \dots, t_N)^T$

We can express our uncertainty over the value of the target variable using a probability distribution.

Assumption: given a value of $x$, the corresponding value of $t$ has a Gaussian distribution with mean equal to
$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$

Thus:
$$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\!\left(t \mid y(x, \mathbf{w}),\, \beta^{-1}\right)$$


Curve fitting re-visited: ML approach

We use the training data $\{\mathbf{x}, \mathbf{t}\}$ to estimate $\mathbf{w}, \beta$ by maximum likelihood.

Assuming the data are drawn independently, the likelihood function can be written as the product of the marginal distributions:
$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \mid y(x_n, \mathbf{w}),\, \beta^{-1}\right)$$


Curve fitting re-visited: ML approach

Gaussian distribution:
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad \text{here } \mu = y(x, \mathbf{w}), \;\; \sigma^2 = \beta^{-1}$$

Thus the log likelihood is
$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}\!\left(t_n \mid y(x_n, \mathbf{w}),\, \beta^{-1}\right)
= -\frac{\beta}{2}\sum_{n=1}^{N}\left(y(x_n, \mathbf{w}) - t_n\right)^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)$$

Curve fitting re-visited: ML approach

Maximum likelihood solution for the polynomial coefficients: maximize the log likelihood with respect to $\mathbf{w}$:
$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\left(y(x_n, \mathbf{w}) - t_n\right)^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)$$

It is equivalent to minimizing the negative log likelihood:
$$\mathbf{w}_{ML} = \arg\min_{\mathbf{w}} \left\{\frac{1}{2}\sum_{n=1}^{N}\left(y(x_n, \mathbf{w}) - t_n\right)^2\right\}$$

Thus: the sum-of-squares error function results from maximizing the likelihood under the assumption of a Gaussian noise distribution.


Curve fitting re-visited: ML approach

Maximum likelihood solution for the precision parameter $\beta$: maximize the log likelihood with respect to $\beta$:
$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\left(y(x_n, \mathbf{w}) - t_n\right)^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)$$

which gives
$$\frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^{N}\left(y(x_n, \mathbf{w}_{ML}) - t_n\right)^2$$

Curve fitting re-visited: ML approach

We now have the maximum likelihood solutions for the parameters $\mathbf{w}_{ML}, \beta_{ML}$.

We can now make predictions for new values of $x$ by using the resulting probability distribution over $t$ (predictive distribution):
$$p(t \mid x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}\!\left(t \mid y(x, \mathbf{w}_{ML}),\, \beta_{ML}^{-1}\right)$$
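A minimal sketch of the whole maximum-likelihood pipeline (assuming NumPy; the sinusoidal target, the degree M and the noise level are illustrative choices, not part of the slides): $\mathbf{w}_{ML}$ is the least-squares solution, $\beta_{ML}$ comes from the averaged squared residuals, and the predictive distribution is the Gaussian above.

    import numpy as np

    rng = np.random.default_rng(2)

    # Illustrative training set: noisy samples of a smooth function.
    N, M = 20, 3                                    # number of points, polynomial degree
    x = np.linspace(0.0, 1.0, N)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

    # Design matrix Phi[n, j] = x_n^j, so y(x_n, w) = (Phi @ w)[n].
    Phi = np.vander(x, M + 1, increasing=True)

    # Maximum-likelihood weights = least-squares solution.
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

    # 1 / beta_ML = (1/N) * sum_n (y(x_n, w_ML) - t_n)^2
    residuals = Phi @ w_ml - t
    beta_ml = 1.0 / np.mean(residuals ** 2)

    # Predictive distribution at a new input x0: N(t | y(x0, w_ML), 1/beta_ML).
    x0 = 0.25
    mean = np.vander(np.array([x0]), M + 1, increasing=True) @ w_ml
    print("w_ML =", w_ml)
    print("beta_ML =", beta_ml)
    print(f"p(t | x0={x0}) = N(t | {mean[0]:.3f}, {1.0 / beta_ml:.3f})")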


Maximum a Posteriori (MAP) approach

Let us introduce a prior distribution over the polynomial coefficients $\mathbf{w}$. Recall the Gaussian distribution of a D-dimensional vector $\mathbf{x}$:
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}\, \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

Prior distribution:
$$p(\mathbf{w} \mid \alpha) = \mathcal{N}\!\left(\mathbf{w} \mid \mathbf{0},\, \alpha^{-1} I\right) = \left(\frac{\alpha}{2\pi}\right)^{(M+1)/2} \exp\!\left(-\frac{\alpha}{2}\,\mathbf{w}^T\mathbf{w}\right)$$

Using Bayes' theorem:
$$p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \;\propto\; p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\; p(\mathbf{w} \mid \alpha)$$

MAP approach

Maximum a Posteriori solution for the parameters $\mathbf{w}$: maximize the posterior distribution
$$p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \;\propto\; p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\; p(\mathbf{w} \mid \alpha)$$

It is equivalent to minimizing the negative log posterior distribution:
$$\mathbf{w}_{MAP} = \arg\min_{\mathbf{w}} \left\{\frac{\beta}{2}\sum_{n=1}^{N}\left(y(x_n, \mathbf{w}) - t_n\right)^2 + \frac{\alpha}{2}\,\mathbf{w}^T\mathbf{w}\right\}$$

Thus: maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function.
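Setting the gradient of the regularized error to zero gives a closed-form MAP estimate for the polynomial basis, $\mathbf{w}_{MAP} = \left(\Phi^T\Phi + \frac{\alpha}{\beta}I\right)^{-1}\Phi^T\mathbf{t}$, i.e. ridge regression with $\lambda = \alpha/\beta$. A minimal sketch (assuming NumPy; the data, the degree M = 9 and the values of $\alpha$ and $\beta$ are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)

    N, M = 20, 9
    x = np.linspace(0.0, 1.0, N)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)
    Phi = np.vander(x, M + 1, increasing=True)

    alpha, beta = 5e-3, 25.0                        # illustrative prior / noise precisions
    lam = alpha / beta                              # effective regularization coefficient

    # w_MAP minimizes (beta/2)*||Phi w - t||^2 + (alpha/2)*||w||^2.
    w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

    print("||w_ML||  =", np.linalg.norm(w_ml))      # large coefficients: over-fitting at M = 9
    print("||w_MAP|| =", np.linalg.norm(w_map))     # the Gaussian prior shrinks the coefficients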


Decision Theory

• Decision theory, when combined with probability theory, allows us to make optimal decisions in situations involving uncertainty.

• Training data: input vector $\mathbf{x}$, target vector $\mathbf{t}$.

• Inference: determine the joint probability distribution $p(\mathbf{x}, \mathbf{t})$.

• Decision step: make the optimal decision.

Decision Theory

Classification example: medical diagnosis problem.

• $\mathbf{x}$: set of pixel intensities in an image.

• Two classes: $C_1 = 0$, absence of cancer; $C_2 = 1$, presence of cancer.

• Inference step: estimate $p(\mathbf{x}, C_k)$.

• Decision step: given $\mathbf{x}$, predict $C_k$ so that a measure of error is minimized according to the given probabilities.


Decision Theory

How do probabilities play a role in decision making?

• Decision step: given $\mathbf{x}$, predict $C_k$. Thus, we are interested in $p(C_k \mid \mathbf{x})$:
$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\; p(C_k)}{p(\mathbf{x})}$$

Intuitively: we want to minimize the chance of assigning $\mathbf{x}$ to the wrong class. Thus, choose the class that gives the higher posterior probability.

Minimizing the misclassification rate

• Goal: minimize the number of misclassifications.

• We need a rule that assigns each input vector $\mathbf{x}$ to one of the possible classes $C_k$.

• Such a rule divides the input space into regions $R_k$ so that all points in $R_k$ are assigned to $C_k$.

• Boundaries between regions are called decision boundaries.


Minimizing the misclassification rate

• Goal: minimize the number of misclassifications. For two classes:
$$p(\text{mistake}) = p(\mathbf{x} \in R_1, C_2) + p(\mathbf{x} \in R_2, C_1) = \int_{R_1} p(\mathbf{x}, C_2)\, d\mathbf{x} + \int_{R_2} p(\mathbf{x}, C_1)\, d\mathbf{x}$$

• Assign $\mathbf{x}$ to the class that gives the smaller value of the integrand:
  – Choose $C_1$ if $p(\mathbf{x}, C_1) > p(\mathbf{x}, C_2)$
  – Choose $C_2$ if $p(\mathbf{x}, C_2) > p(\mathbf{x}, C_1)$

Minimizing the misclassification rate

– Choose $C_1$ if $p(\mathbf{x}, C_1) > p(\mathbf{x}, C_2)$
– Choose $C_2$ if $p(\mathbf{x}, C_2) > p(\mathbf{x}, C_1)$

Since $p(\mathbf{x}, C_k) = p(C_k \mid \mathbf{x})\, p(\mathbf{x})$, this is equivalent to:

– Choose $C_1$ if $p(C_1 \mid \mathbf{x}) > p(C_2 \mid \mathbf{x})$
– Choose $C_2$ if $p(C_2 \mid \mathbf{x}) > p(C_1 \mid \mathbf{x})$


Minimizing the misclassification rate

Optimal decision boundary: $\hat{x} = x_0$

Minimizing the misclassification rate

Thus: choose the $C_k$ that gives the largest $p(C_k \mid \mathbf{x})$.

General case of K classes (maximize the probability of being correct):
$$p(\text{correct}) = \sum_{k=1}^{K} p(\mathbf{x} \in R_k, C_k) = \sum_{k=1}^{K} \int_{R_k} p(\mathbf{x}, C_k)\, d\mathbf{x}$$


Minimizing the expected loss

Some mistakes are more costly than others. Loss function (cost function): overall measure of the loss incurred in taking any of the available decisions.

$L_{kj}$: loss incurred when we assign $\mathbf{x}$ to class $C_j$ and the true class is $C_k$.

Example (cancer diagnosis), with rows indexed by the true class (cancer, normal) and columns by the decision (cancer, normal):
$$L = \begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix}$$

The optimal solution is the one that minimizes the loss function.

Minimizing the expected loss

The loss function depends on the true class, which is unknown. The uncertainty in the true class is expressed through the joint probability $p(\mathbf{x}, C_k)$.

We minimize the expected loss:
$$E[L] = \sum_k \sum_j \int_{R_j} L_{kj}\, p(\mathbf{x}, C_k)\, d\mathbf{x}$$

For each $\mathbf{x}$ we should minimize $\sum_k L_{kj}\, p(\mathbf{x}, C_k) = \left(\sum_k L_{kj}\, p(C_k \mid \mathbf{x})\right) p(\mathbf{x})$.


Minimizing the expected loss

For each $\mathbf{x}$ we should minimize $\sum_k L_{kj}\, p(\mathbf{x}, C_k) = \left(\sum_k L_{kj}\, p(C_k \mid \mathbf{x})\right) p(\mathbf{x})$.

Thus, to minimize the expected loss: assign each $\mathbf{x}$ to the class $j$ that minimizes $\sum_k L_{kj}\, p(C_k \mid \mathbf{x})$.

The Reject Option


Inference and Decision

Inference stage: use the training data to learn a model for $p(C_k \mid \mathbf{x})$.

Decision stage: use the resulting posterior probabilities to make optimal class assignments.

Generative Methods

Solve the inference problem of estimating the class-conditional densities $p(\mathbf{x} \mid C_k)$ for each class $C_k$.

Infer the prior class probabilities $p(C_k)$.

Use Bayes' theorem to find the class posterior probabilities:
$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\; p(C_k)}{p(\mathbf{x})}, \qquad \text{where } p(\mathbf{x}) = \sum_k p(\mathbf{x} \mid C_k)\, p(C_k)$$

Use decision theory to determine class membership for each new input $\mathbf{x}$.
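A minimal sketch of a generative classifier in this spirit (the choice of Gaussian class-conditional densities and the synthetic 2-D data are assumptions made here for brevity, not something the slides prescribe): estimate $p(\mathbf{x} \mid C_k)$ and $p(C_k)$ from training data and combine them with Bayes' theorem.

    import numpy as np

    rng = np.random.default_rng(4)

    # Illustrative 2-D training data: one Gaussian blob per class.
    X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))    # class C_1
    X2 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(150, 2))    # class C_2

    def fit_gaussian(X):
        """Model the class-conditional density p(x | C_k) as a Gaussian."""
        return X.mean(axis=0), np.cov(X, rowvar=False)

    def gaussian_pdf(x, mu, Sigma):
        d = len(mu)
        diff = x - mu
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
        return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

    params = [fit_gaussian(X1), fit_gaussian(X2)]
    priors = np.array([len(X1), len(X2)], dtype=float)
    priors /= priors.sum()                                       # p(C_k) from class frequencies

    x_new = np.array([1.0, 1.5])
    joint = np.array([gaussian_pdf(x_new, mu, S) * p for (mu, S), p in zip(params, priors)])
    posterior = joint / joint.sum()                              # Bayes' theorem: p(C_k | x)
    print("p(C_k | x) =", posterior, "-> assign to class", int(np.argmax(posterior)) + 1)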


Discriminative Methods

Solve directly the inference problem of estimating the class posterior probabilities $p(C_k \mid \mathbf{x})$.

Use decision theory to determine class membership for each new input $\mathbf{x}$.

Discriminant Functions

Find a function $f(\mathbf{x})$ which maps each input directly onto a class label. Probabilities play no role here.

Use decision theory to determine class membership for each new input $\mathbf{x}$.


Example

Linear Models for Classification

Classification: given an input vector $\mathbf{x}$, assign it to one of K classes $C_k$, where $k = 1, \dots, K$.

The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces.

Linear models: the decision surfaces are linear functions of the input vector $\mathbf{x}$. They are defined by (D−1)-dimensional hyperplanes within the D-dimensional input space.


Linear Models for Classification

For regression: $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$

For classification, we want to predict class labels, or more generally class posterior probabilities. We transform the linear function of $\mathbf{w}$ using a nonlinear function $f(\cdot)$, so that
$$y(\mathbf{x}) = f\!\left(\mathbf{w}^T\mathbf{x} + w_0\right)$$
These are called Generalized Linear Models.

Linear Discriminant Functions

Two classes: $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$

If $y(\mathbf{x}) \ge 0$ assign $\mathbf{x}$ to $C_1$, otherwise assign $\mathbf{x}$ to $C_2$.

Decision boundary: $y(\mathbf{x}) = 0$


Linear Discriminant Functions

Geometrical properties:

Decision boundary: $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0 = 0$

Let $\mathbf{x}_1, \mathbf{x}_2$ be two points which lie on the decision boundary:
$$y(\mathbf{x}_1) = \mathbf{w}^T\mathbf{x}_1 + w_0 = 0, \qquad y(\mathbf{x}_2) = \mathbf{w}^T\mathbf{x}_2 + w_0 = 0 \;\;\Rightarrow\;\; \mathbf{w}^T(\mathbf{x}_1 - \mathbf{x}_2) = 0$$

Thus $\mathbf{w}$ represents the orthogonal direction to the decision boundary.

Geometrical properties (cont'd):

Let $\mathbf{w}^* = \mathbf{w}/\|\mathbf{w}\|$ be the unit vector in the direction of $\mathbf{w}$; for any $\mathbf{x}$, $\mathbf{w}^{*T}\mathbf{x} = \mathbf{w}^T\mathbf{x}/\|\mathbf{w}\|$ is the projection of $\mathbf{x}$ onto the direction of $\mathbf{w}$.

Let $\mathbf{x}_0$ be a point on the decision surface, so that $y(\mathbf{x}_0) = \mathbf{w}^T\mathbf{x}_0 + w_0 = 0$, i.e. $\mathbf{w}^T\mathbf{x}_0 = -w_0$. Its projection onto the direction of $\mathbf{w}$ is
$$\mathbf{w}^{*T}\mathbf{x}_0 = \frac{\mathbf{w}^T\mathbf{x}_0}{\|\mathbf{w}\|} = -\frac{w_0}{\|\mathbf{w}\|}$$

Thus $-\dfrac{w_0}{\|\mathbf{w}\|}$ is the signed orthogonal distance of the origin from the decision surface.


Linear Discriminant Functions

Multiple classes

one-versus-the-rest: K−1 classifiers, each of which solves a two-class problem of separating points of $C_k$ from points not in that class.

Linear Discriminant Functions

Multiple classes

one-versus-one: K(K−1)/2 binary discriminant functions, one for every possible pair of classes.


Linear Discriminant Functions

Multiple classes

Solution: consider a single K-class discriminant comprising K linear functions of the form
$$y_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}$$

Assign a point $\mathbf{x}$ to class $C_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \ne k$.

The decision boundary between class $C_k$ and class $C_j$ is given by $y_k(\mathbf{x}) = y_j(\mathbf{x})$, that is
$$(\mathbf{w}_k - \mathbf{w}_j)^T\mathbf{x} + (w_{k0} - w_{j0}) = 0$$
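A minimal sketch of such a K-class linear discriminant (the weight vectors and biases below are arbitrary illustrative values, not learned from data):

    import numpy as np

    # Illustrative parameters for K = 3 classes in a D = 2 dimensional input space.
    W = np.array([[ 1.0,  0.5],       # w_1
                  [-1.0,  1.0],       # w_2
                  [ 0.2, -1.5]])      # w_3
    w0 = np.array([0.0, 0.5, -0.3])   # biases w_k0

    def classify(x):
        scores = W @ x + w0           # y_k(x) = w_k^T x + w_k0 for k = 1..K
        return int(np.argmax(scores)) + 1

    print(classify(np.array([2.0, 1.0])))   # assign x to the class with the largest score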

Linear Discriminant Functions

Two approaches:

Fisher’s linear discriminant

Perceptron algorithm


Fisher's Linear Discriminant

One way to view a linear classification model is in terms of dimensionality reduction.

Two-class case: suppose we project $\mathbf{x}$ onto one dimension: $y = \mathbf{w}^T\mathbf{x}$

Set a threshold $t$: if $y \ge t$ assign $\mathbf{x}$ to $C_1$, otherwise assign $\mathbf{x}$ to $C_2$.

Fisher’s Linear Discriminant

• Find an orientation along which the projected samples are well separated;

• This is exactly the goal of linear discriminant analysis (LDA);

• In other words: we are after the linear projection that best separates the data, i.e. best discriminates data of different classes.

How can we find such a discriminant direction?


LDA

• $N_1$ samples of class $C_1$ and $N_2$ samples of class $C_2$: $\{(\mathbf{x}_i, C_i)\}_{i=1}^{N}$, with $\mathbf{x}_i \in \mathbb{R}^q$ and $C_i \in \{C_1, C_2\}$.

• Consider $\mathbf{w} \in \mathbb{R}^q$ with $\|\mathbf{w}\| = 1$.

• Then $\mathbf{w}^T\mathbf{x}$ is the projection of $\mathbf{x}$ along the direction of $\mathbf{w}$.

• We want the projections $\mathbf{w}^T\mathbf{x}$ with $\mathbf{x} \in C_1$ to be well separated from the projections $\mathbf{w}^T\mathbf{x}$ with $\mathbf{x} \in C_2$.

LDA

• A measure of the separation between the projected points is the difference of the sample means:

$$\mathbf{m}_i = \frac{1}{N_i}\sum_{\mathbf{x} \in C_i} \mathbf{x} \qquad \text{(sample mean of class } C_i\text{)}$$

$$m_i = \frac{1}{N_i}\sum_{\mathbf{x} \in C_i} \mathbf{w}^T\mathbf{x} = \mathbf{w}^T\mathbf{m}_i \qquad \text{(sample mean of the projected points)}$$

$$m_1 - m_2 = \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)$$

We wish to make the above difference as large as we can. In addition…


LDA

• To obtain good separation of the projected data we really want the difference between the means to be large relative to some measure of the standard deviation of each class:

$$s_i^2 = \sum_{\mathbf{x} \in C_i} \left(\mathbf{w}^T\mathbf{x} - m_i\right)^2 \qquad \text{(scatter of the projected samples of class } C_i\text{)}$$

$$s_1^2 + s_2^2 \qquad \text{(total within-class scatter of the projected samples)}$$

Fisher linear discriminant analysis:
$$\arg\max_{\mathbf{w}} \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$

LDA


LDA

To obtain
$$J(\mathbf{w}) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$
as an explicit function of $\mathbf{w}$, we define the following matrices:
$$S_i = \sum_{\mathbf{x} \in C_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T, \qquad S_W = S_1 + S_2 \quad \text{(within-class scatter matrix)}$$

Then:
$$s_i^2 = \sum_{\mathbf{x} \in C_i} \left(\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\mathbf{m}_i\right)^2 = \sum_{\mathbf{x} \in C_i} \mathbf{w}^T(\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T\mathbf{w} = \mathbf{w}^T S_i\, \mathbf{w}$$

LDA

So: $s_1^2 = \mathbf{w}^T S_1 \mathbf{w}$ and $s_2^2 = \mathbf{w}^T S_2 \mathbf{w}$

Thus:
$$s_1^2 + s_2^2 = \mathbf{w}^T S_1 \mathbf{w} + \mathbf{w}^T S_2 \mathbf{w} = \mathbf{w}^T (S_1 + S_2)\, \mathbf{w} = \mathbf{w}^T S_W \mathbf{w}$$

Similarly:
$$(m_1 - m_2)^2 = \left(\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{m}_2\right)^2 = \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w} = \mathbf{w}^T S_B \mathbf{w}$$

where $S_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$ is the between-class scatter matrix.


LDA

We have obtained:
$$s_1^2 + s_2^2 = \mathbf{w}^T S_W \mathbf{w}, \qquad (m_1 - m_2)^2 = \mathbf{w}^T S_B \mathbf{w}$$

Thus:
$$J(\mathbf{w}) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2} = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}, \qquad \arg\max_{\mathbf{w}} \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}$$

LDA

We observe that $S_B \mathbf{w} = (\mathbf{m}_1 - \mathbf{m}_2)\,\left[(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}\right]$, and $(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}$ is a scalar, so $S_B\mathbf{w}$ is always in the direction of $(\mathbf{m}_1 - \mathbf{m}_2)$.

$J(\mathbf{w}) = \dfrac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}$ is maximized when $\left(\mathbf{w}^T S_B \mathbf{w}\right) S_W \mathbf{w} = \left(\mathbf{w}^T S_W \mathbf{w}\right) S_B \mathbf{w}$,

which gives the solution
$$\mathbf{w} = S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$


LDA

Projection onto the line joining the class means

LDA

Solution of LDA


LDA

$$\mathbf{w} = S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$

• Gives the linear function with the maximum ratio of between-class scatter to within-class scatter.

• The problem, e.g. classification, has been reduced from a q-dimensional problem to a more manageable one-dimensional problem.

• Optimal for multivariate normal class-conditional densities.

LDA

• The analysis can be extended to multiple classes.

• LDA is a linear technique for dimensionality reduction: it projects the data along directions that can be expressed as linear combinations of the input features.

• Non-linear extensions of LDA exist (e.g., generalized LDA).

• The "appropriate" transformation depends on the data and on the task we want to perform on the data. Note that LDA uses class labels.
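A minimal sketch of the two-class Fisher direction derived above (assuming NumPy; the synthetic 2-D data and the choice of threshold, the midpoint of the projected class means, are illustrative): compute the class means, the within-class scatter $S_W$, and $\mathbf{w} = S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$, then classify by thresholding the projection $\mathbf{w}^T\mathbf{x}$.

    import numpy as np

    rng = np.random.default_rng(5)

    # Illustrative 2-D data for two classes.
    X1 = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.6], [0.6, 1.0]], size=100)
    X2 = rng.multivariate_normal(mean=[2.5, 2.0], cov=[[1.0, 0.6], [0.6, 1.0]], size=120)

    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                 # class means

    def scatter(X, m):
        diff = X - m
        return diff.T @ diff                                  # S_i = sum_x (x - m_i)(x - m_i)^T

    S_W = scatter(X1, m1) + scatter(X2, m2)                   # within-class scatter matrix
    w = np.linalg.solve(S_W, m1 - m2)                         # Fisher direction S_W^{-1}(m_1 - m_2)
    w /= np.linalg.norm(w)

    t = 0.5 * (w @ m1 + w @ m2)                               # one simple threshold choice
    assign = lambda x: 1 if w @ x >= t else 2                 # C_1 if the projection exceeds t

    print("w =", w)
    print("projected class means:", w @ m1, w @ m2)
    print("new point ->", assign(np.array([0.5, 0.3])))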


The Perceptron Algorithm

Perceptron (Frank Rosenblatt, 1957)

• First learning algorithm for neural networks;

• Originally introduced for character classification, where each character is represented as an image;


Perceptron (contd.)

(Network diagram: n input units feeding a single output unit.)

Total input to the output node: $\sum_{j=1}^{n} w_j x_j$

The output unit applies the activation function
$$H(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$$

Perceptron: Learning Algorithm

• Goal: we want to define a learning algorithm for the weights in order to compute a mapping from the inputs to the outputs: the perceptron needs to learn $f : \mathbb{R}^n \to \{0, 1\}$.

• Example: two-class character recognition problem.

  – Training set: set of images representing either the character 'a' or the character 'b' (supervised learning);

  – Learning task: learn the weights so that when a new unlabelled image comes in, the network can predict its label.

  – Settings: class 'a' → 1 (class C1); class 'b' → 0 (class C2); n input units (the intensity level of a pixel); 1 output unit.


Perceptron: Learning Algorithm

The algorithm proceeds as follows:

• Initial random setting of the weights;

• The input is a random sequence $\{\mathbf{x}_k\}_{k \in \mathbb{N}}$;

• For each element of class C1, if output = 1 (correct) do nothing, otherwise update the weights;

• For each element of class C2, if output = 0 (correct) do nothing, otherwise update the weights.

Perceptron: Learning Algorithm

A bit more formally:

$\mathbf{x} = (x_1, x_2, \dots, x_n)$, $\mathbf{w} = (w_1, w_2, \dots, w_n)$, $\theta$: threshold of the output unit.

$$\mathbf{w}^T\mathbf{x} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

The output is 1 if $\mathbf{w}^T\mathbf{x} - \theta \ge 0$.

To eliminate the explicit dependence on $\theta$, add a constant input $x_{n+1} = 1$ with weight $w_{n+1} = -\theta$, and write $\hat{\mathbf{x}} = (x_1, \dots, x_n, 1)$, $\hat{\mathbf{w}} = (w_1, \dots, w_n, -\theta)$. Then the output is 1 if
$$\hat{\mathbf{w}}^T\hat{\mathbf{x}} = \sum_{i=1}^{n+1} \hat{w}_i \hat{x}_i \ge 0$$


Perceptron: Learning Algorithm

• We want to learn values of the weights so that the perceptron correctly discriminates elements of C1 from elements of C2.

• Given $\mathbf{x}$ in input, if $\mathbf{x}$ is classified correctly the weights are unchanged; otherwise:
$$\mathbf{w}' = \begin{cases} \mathbf{w} + \mathbf{x} & \text{if an element of class } C_1 \text{ (target 1) was classified as in } C_2 \text{ (output 0)} \\ \mathbf{w} - \mathbf{x} & \text{if an element of class } C_2 \text{ (target 0) was classified as in } C_1 \text{ (output 1)} \end{cases}$$

Perceptron: Learning Algorithm

• 1st case: $\mathbf{x} \in C_1$ and was classified in $C_2$. The correct answer is 1, which corresponds to $\hat{\mathbf{w}}^T\hat{\mathbf{x}} \ge 0$; we have instead $\hat{\mathbf{w}}^T\hat{\mathbf{x}} < 0$.

We want to get closer to the correct answer, i.e. $\mathbf{w}'^T\mathbf{x} > \mathbf{w}^T\mathbf{x}$, and indeed with $\mathbf{w}' = \mathbf{w} + \mathbf{x}$:
$$(\mathbf{w} + \mathbf{x})^T\mathbf{x} = \mathbf{w}^T\mathbf{x} + \mathbf{x}^T\mathbf{x} = \mathbf{w}^T\mathbf{x} + \|\mathbf{x}\|^2 \ge \mathbf{w}^T\mathbf{x}$$
because $\|\mathbf{x}\|^2 \ge 0$, so the condition is verified.


Perceptron: Learning Algorithm

• 2nd case: $\mathbf{x} \in C_2$ and was classified in $C_1$. The correct answer is 0, which corresponds to $\hat{\mathbf{w}}^T\hat{\mathbf{x}} < 0$; we have instead $\hat{\mathbf{w}}^T\hat{\mathbf{x}} \ge 0$.

We want to get closer to the correct answer, i.e. $\mathbf{w}'^T\mathbf{x} < \mathbf{w}^T\mathbf{x}$, and indeed with $\mathbf{w}' = \mathbf{w} - \mathbf{x}$:
$$(\mathbf{w} - \mathbf{x})^T\mathbf{x} = \mathbf{w}^T\mathbf{x} - \mathbf{x}^T\mathbf{x} = \mathbf{w}^T\mathbf{x} - \|\mathbf{x}\|^2 \le \mathbf{w}^T\mathbf{x}$$
because $\|\mathbf{x}\|^2 \ge 0$, so the condition is verified.

This rule allows the network to get closer to the correct answer whenever it makes an error.

Perceptron: Learning Algorithm

In summary:

1. A random sequence $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_k, \dots$ is generated such that $\mathbf{x}_i \in C_1 \cup C_2$;

2. If $\mathbf{x}_k$ is correctly classified, then $\mathbf{w}_{k+1} = \mathbf{w}_k$; otherwise
$$\mathbf{w}_{k+1} = \begin{cases} \mathbf{w}_k + \mathbf{x}_k & \text{if } \mathbf{x}_k \in C_1 \\ \mathbf{w}_k - \mathbf{x}_k & \text{if } \mathbf{x}_k \in C_2 \end{cases}$$
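A minimal sketch of this rule in the augmented notation $\hat{\mathbf{x}} = (x_1, \dots, x_n, 1)$, $\hat{\mathbf{w}} = (w_1, \dots, w_n, -\theta)$ (assuming NumPy; the linearly separable toy data are illustrative, and the random sequence is replaced by repeated passes over the training set):

    import numpy as np

    rng = np.random.default_rng(6)

    # Linearly separable toy data: class C1 (label 1) around (2, 2), class C2 (label 0) around (-2, -2).
    X = np.vstack([rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2)),
                   rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))])
    labels = np.array([1] * 50 + [0] * 50)

    # Augmented inputs: append a constant 1 so the threshold becomes the last weight (-theta).
    X_hat = np.hstack([X, np.ones((len(X), 1))])
    w_hat = np.zeros(X_hat.shape[1])

    for epoch in range(100):
        errors = 0
        for x, label in zip(X_hat, labels):
            output = 1 if w_hat @ x >= 0 else 0
            if output == label:
                continue                      # correctly classified: weights unchanged
            w_hat += x if label == 1 else -x  # w + x for a missed C1, w - x for a missed C2
            errors += 1
        if errors == 0:                       # every point classified correctly: stop
            break

    print("weights (w1, w2, -theta):", w_hat, "after", epoch + 1, "passes")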


Perceptron: Learning Algorithm

Does the learning algorithm converge?

Convergence theorem: regardless of the initial choice of weights, if the two classes are linearly separable, i.e. there exists $\hat{\mathbf{w}}$ such that
$$\hat{\mathbf{w}}^T\hat{\mathbf{x}} \ge 0 \;\text{ if } \mathbf{x} \in C_1, \qquad \hat{\mathbf{w}}^T\hat{\mathbf{x}} < 0 \;\text{ if } \mathbf{x} \in C_2,$$
then the learning rule will find such a solution after a finite number of steps.

Representational Power of Perceptrons

• Marvin Minsky and Seymour Papert, "Perceptrons", 1969: "The perceptron can solve only problems with linearly separable classes."

• Examples of linearly separable Boolean functions: AND and OR.

  x1  x2 | AND | OR
   0   0 |  0  |  0
   0   1 |  0  |  1
   1   0 |  0  |  1
   1   1 |  1  |  1

In both cases the inputs with output 1 can be separated from those with output 0 by a line in the (x1, x2) plane.


Representational Power of Perceptrons

Perceptron that computes the AND function: weights $w_1 = w_2 = 1$, bias $-1.5$ (i.e. threshold $\theta = 1.5$).

Perceptron that computes the OR function: weights $w_1 = w_2 = 1$, bias $-0.5$ (i.e. threshold $\theta = 0.5$).
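These two parameter settings can be checked directly on all four Boolean inputs; a quick verification using the step activation H defined earlier:

    # Verify the AND and OR perceptrons above on all Boolean inputs.
    def perceptron(x1, x2, w1, w2, bias):
        return 1 if w1 * x1 + w2 * x2 + bias >= 0 else 0

    for x1 in (0, 1):
        for x2 in (0, 1):
            and_out = perceptron(x1, x2, 1, 1, -1.5)   # fires only when both inputs are 1
            or_out = perceptron(x1, x2, 1, 1, -0.5)    # fires when at least one input is 1
            print(x1, x2, "AND:", and_out, "OR:", or_out)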

Representational Power of Perceptrons

• Example of a Boolean function that is not linearly separable: XOR.

  x1  x2 | XOR
   0   0 |  0
   0   1 |  1
   1   0 |  1
   1   1 |  0

The XOR function cannot be computed by a perceptron.