580.691 Learning Theory Reza Shadmehr Linear and quadratic decision boundaries Kernel estimates of density Missing data

Page 1

580.691 Learning Theory
Reza Shadmehr

Linear and quadratic decision boundaries
Kernel estimates of density
Missing data

Page 2

Bayesian classification

• Suppose we wish to classify a vector x as belonging to one of L classes, {1, …, L}. We are given labeled data and need to form a classification function:

$$D = \left\{ \left(\mathbf{x}^{(1)}, c^{(1)}\right), \ldots, \left(\mathbf{x}^{(n)}, c^{(n)}\right) \right\}, \qquad c^{(i)} \in \{1, \ldots, L\}$$

$$\hat{c} = \hat{c}(\mathbf{x}), \qquad \hat{c} \in \{1, \ldots, L\}$$

$$\hat{c}(\mathbf{x}) = \arg\max_{l \in \{1, \ldots, L\}} P(c = l \mid \mathbf{x})$$

Classify x into the class l that maximizes the posterior probability.

$$P(c = l \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid c = l)\, P(c = l)}{p(\mathbf{x})}, \qquad p(\mathbf{x}) = \sum_{l=1}^{L} p(\mathbf{x} \mid c = l)\, P(c = l)$$

where $p(\mathbf{x} \mid c = l)$ is the likelihood, $P(c = l)$ is the prior, and $p(\mathbf{x})$ is the marginal.

Page 3

Classification when distributions have equal variance

• Suppose we wish to classify a person as male or female based on height.

$$p(x \mid c = 0) = N(\mu_1, \sigma^2), \qquad p(x \mid c = 1) = N(\mu_2, \sigma^2), \qquad P(c = 1) = q$$

$$\hat{c} = \hat{c}(x), \qquad \hat{c} = 1 \ \text{if} \ P(c = 1 \mid x) > 0.5; \quad \hat{c} = 0 \ \text{otherwise}$$

What we have: the class-conditional densities. What we want: the classification function $\hat{c}(x)$.

[Figure: the densities $p(x \mid c = 0)$ (female) and $p(x \mid c = 1)$ (male) plotted against height (cm). Note that the two densities have equal variance.]

Assume equal probability of being male or female: $P(c = 1) = 0.5$.

$$p(x) = \sum_{i=0}^{1} p(x \mid c = i)\, P(c = i)$$

[Figure: the scaled densities $p(x \mid c = 0)P(c = 0)$ and $p(x \mid c = 1)P(c = 1)$, and the marginal $p(x)$, plotted against height (cm).]
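Below is a minimal numerical sketch (Python/NumPy) of this two-class rule. The means, standard deviation, and prior are made-up values for illustration only; they are not taken from the figures.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Illustrative parameters (not from the slides): female ~ N(165, 7^2),
# male ~ N(178, 7^2), prior P(c = 1) = q = 0.5.
mu_female, mu_male, sigma, q = 165.0, 178.0, 7.0, 0.5

def posterior_male(x):
    """P(c = 1 | x) via Bayes rule; the marginal p(x) is the sum over classes."""
    scaled_female = gaussian_pdf(x, mu_female, sigma) * (1.0 - q)
    scaled_male   = gaussian_pdf(x, mu_male, sigma) * q
    return scaled_male / (scaled_female + scaled_male)

def classify(x):
    """c_hat = 1 if P(c = 1 | x) > 0.5, else 0."""
    return int(posterior_male(x) > 0.5)

for height in [160.0, 171.5, 185.0]:
    print(height, posterior_male(height), classify(height))
```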

Page 4

Classification when distributions have equal variance

$$\log \frac{p(x \mid c = 0)\, P(c = 0)}{p(x \mid c = 1)\, P(c = 1)}$$

[Figure: left, the scaled densities $p(x \mid c = 0)P(c = 0)$ and $p(x \mid c = 1)P(c = 1)$ vs. height (cm); right, the log ratio above vs. height (cm). It crosses zero at the decision boundary.]

Page 5

Estimating the decision boundary between data of equal variance

• Suppose the distribution of the data in each class is Gaussian.

$$p(\mathbf{x} \mid c = 1) = \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1)\right), \qquad P(c = 1) = q$$

$$\log\left[p(\mathbf{x} \mid c = 1)\, P(c = 1)\right] = -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) - \frac{m}{2}\log 2\pi - \frac{1}{2}\log|\Sigma| + \log q$$

$$\log \frac{p(\mathbf{x} \mid c = 1)\, P(c = 1)}{p(\mathbf{x} \mid c = 2)\, P(c = 2)} = -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2) + \log\frac{q}{1-q}$$

$$= \mathbf{x}^T \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) - \frac{1}{2}\boldsymbol{\mu}_1^T \Sigma^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T \Sigma^{-1}\boldsymbol{\mu}_2 + \log\frac{q}{1-q}$$

$$= w_0 + \mathbf{w}^T \mathbf{x}, \qquad \mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad w_0 = -\frac{1}{2}\boldsymbol{\mu}_1^T \Sigma^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T \Sigma^{-1}\boldsymbol{\mu}_2 + \log\frac{q}{1-q}$$

The decision boundary between any two classes is where the log of the ratio is zero. If the data in each class has a Gaussian density with equal variance, then the boundary between any two classes is a line.
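A short sketch (Python/NumPy) of this result: given two class means, a shared covariance, and the prior q, it assembles the weights w and w0 of the linear boundary. The example parameters are invented for illustration.

```python
import numpy as np

def linear_boundary(mu1, mu2, Sigma, q=0.5):
    """Coefficients of the linear discriminant
    log[p(x|c=1)P(c=1) / (p(x|c=2)P(c=2))] = w0 + w^T x
    for two Gaussian classes sharing the covariance Sigma, with P(c=1) = q."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(q / (1.0 - q)))
    return w, w0

# Made-up 2-D parameters: the decision boundary is the line w0 + w^T x = 0.
mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
w, w0 = linear_boundary(mu1, mu2, Sigma)
print("w =", w, " w0 =", w0)
print("log ratio at the midpoint:", w0 + w @ ((mu1 + mu2) / 2))  # 0 when q = 0.5
```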

Page 6

Estimating the decision boundary from estimated densities

• From the data we can get an ML estimate of Gaussian parameters

$$\hat{\boldsymbol{\mu}}_1 = \frac{1}{n_1} \sum_{i \in \text{class } 1} \mathbf{x}^{(i)}, \qquad \hat{\Sigma}_1 = \frac{1}{n_1} \sum_{i \in \text{class } 1} \left(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}_1\right)\left(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}_1\right)^T$$

$$\hat{\Sigma} = \frac{n_1 \hat{\Sigma}_1 + n_2 \hat{\Sigma}_2}{n_1 + n_2}$$

The boundary between each pair of classes is where the corresponding log ratio is zero:

$$\log \frac{p(\mathbf{x} \mid c = 1)\, P(c = 1)}{p(\mathbf{x} \mid c = 2)\, P(c = 2)} = 0, \qquad \log \frac{p(\mathbf{x} \mid c = 2)\, P(c = 2)}{p(\mathbf{x} \mid c = 3)\, P(c = 3)} = 0, \qquad \log \frac{p(\mathbf{x} \mid c = 1)\, P(c = 1)}{p(\mathbf{x} \mid c = 3)\, P(c = 3)} = 0$$

[Figure: data from three classes (Class 1, Class 2, Class 3) with the three pairwise linear decision boundaries.]
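A sketch of these ML estimates in Python/NumPy, assuming the data arrive as an array X with one class label per row; the sample data are synthetic.

```python
import numpy as np

def ml_estimates(X, labels):
    """ML estimate of each class mean and of the pooled (shared) covariance
    Sigma_hat = (n1*Sigma1_hat + n2*Sigma2_hat + ...) / (n1 + n2 + ...)."""
    mus, Sigma_pooled, n_total = {}, 0.0, 0
    for k in np.unique(labels):
        Xk = X[labels == k]
        nk = len(Xk)
        mu_k = Xk.mean(axis=0)
        Sigma_k = (Xk - mu_k).T @ (Xk - mu_k) / nk   # ML estimate (divides by n_k)
        mus[k] = mu_k
        Sigma_pooled = Sigma_pooled + nk * Sigma_k
        n_total += nk
    return mus, Sigma_pooled / n_total

# Hypothetical labeled data drawn from two 2-D Gaussians with a shared covariance.
rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 100),
               rng.multivariate_normal([3, 1], np.eye(2), 150)])
labels = np.array([1] * 100 + [2] * 150)
mus, Sigma_hat = ml_estimates(X, labels)
print(mus[1], mus[2])
print(Sigma_hat)
```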

Page 7

Relationship between Bayesian classification and Fisher discriminant

The Fisher linear discriminant for two classes is $y = w_0 + \mathbf{w}^T \mathbf{x}$, with $\mathbf{w}$ chosen to maximize

$$J(\mathbf{w}) = \frac{\left(\mathbf{w}^T \hat{\boldsymbol{\mu}}_0 - \mathbf{w}^T \hat{\boldsymbol{\mu}}_1\right)^2}{\mathbf{w}^T \left(n_0 \hat{\Sigma}_0 + n_1 \hat{\Sigma}_1\right) \mathbf{w}}, \qquad \mathbf{w} = a\left(n_0 \hat{\Sigma}_0 + n_1 \hat{\Sigma}_1\right)^{-1}\left(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1\right)$$

for some constant $a$.

If we have two classes, class −1 and class +1, then the decision boundary is where the discriminant is zero: $w_0 + \mathbf{w}^T \mathbf{x} = 0$.

For the Bayesian classifier, under assumption of equal variance, the decision boundary is at:

$$\log \frac{p(\mathbf{x} \mid c = 1)\, P(c = 1)}{p(\mathbf{x} \mid c = 2)\, P(c = 2)} = \mathbf{x}^T \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) - \frac{1}{2}\boldsymbol{\mu}_1^T \Sigma^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T \Sigma^{-1}\boldsymbol{\mu}_2 + \log\frac{q}{1-q} = 0$$

$$w_0 + \mathbf{w}^T \mathbf{x} = 0, \qquad \mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

The Fisher decision boundary is the same as the Bayesian boundary when the two classes have equal variance and equal prior probability.
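A quick numerical check of this equivalence, with made-up data: when the two classes share a covariance, the Fisher direction and the Bayesian weight vector should agree up to scale (approximately, since the scatter matrices are estimated from samples).

```python
import numpy as np

# Fisher direction: w_F ∝ (n0*Sigma0_hat + n1*Sigma1_hat)^(-1) (mu0_hat - mu1_hat).
# Bayesian weight (equal-variance case): w = Sigma^(-1) (mu0 - mu1).
rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], Sigma, 500)
X1 = rng.multivariate_normal([2.0, 1.0], Sigma, 500)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S0 = (X0 - mu0).T @ (X0 - mu0)          # n0 * Sigma0_hat
S1 = (X1 - mu1).T @ (X1 - mu1)          # n1 * Sigma1_hat

w_fisher = np.linalg.solve(S0 + S1, mu0 - mu1)
w_bayes = np.linalg.solve(Sigma, mu0 - mu1)   # using the true shared covariance

# Up to scale the two directions agree (approximately, given sampling noise).
print(w_fisher / np.linalg.norm(w_fisher))
print(w_bayes / np.linalg.norm(w_bayes))
```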

Page 8

Classification when distributions have unequal variance

$$p(x \mid c = 0) = N(\mu_1, \sigma_1^2), \qquad p(x \mid c = 1) = N(\mu_2, \sigma_2^2), \qquad P(c = 1) = q$$

$$\hat{c} = 1 \ \text{if} \ P(c = 1 \mid x) > 0.5; \qquad \hat{c} = 0 \ \text{otherwise}$$

Assume: $P(c = 1) = 0.5$.

$$p(x) = \sum_{i=0}^{1} p(x \mid c = i)\, P(c = i)$$

[Figure: "What we have" — the scaled densities $p(x \mid c = 0)P(c = 0)$ and $p(x \mid c = 1)P(c = 1)$ and the marginal $p(x)$ vs. height (cm). "Classification" — the posteriors $P(c = 0 \mid x)$ and $P(c = 1 \mid x)$ and the posterior variance $\mathrm{var}(c \mid x)$ vs. height (cm).]

Page 9

$$\log \frac{p(x \mid c = 0)\, P(c = 0)}{p(x \mid c = 1)\, P(c = 1)}$$

[Figure: left, the scaled densities $p(x \mid c = 0)P(c = 0)$ and $p(x \mid c = 1)P(c = 1)$ vs. height (cm); right, the log ratio above vs. height (cm).]

Page 10

Quadratic discriminant: when data comes from unequal variance Gaussians

$$p(\mathbf{x} \mid c = 1) = \frac{1}{(2\pi)^{m/2} |\Sigma_1|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^T \Sigma_1^{-1} (\mathbf{x} - \boldsymbol{\mu}_1)\right), \qquad P(c = 1) = q$$

$$\log\left[p(\mathbf{x} \mid c = 1)\, P(c = 1)\right] = -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^T \Sigma_1^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) - \frac{m}{2}\log 2\pi - \frac{1}{2}\log|\Sigma_1| + \log q$$

$$\log \frac{p(\mathbf{x} \mid c = 1)\, P(c = 1)}{p(\mathbf{x} \mid c = 2)\, P(c = 2)} = -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^T \Sigma_1^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)^T \Sigma_2^{-1} (\mathbf{x} - \boldsymbol{\mu}_2) - \frac{1}{2}\log\frac{|\Sigma_1|}{|\Sigma_2|} + \log\frac{q}{1-q}$$

$$= \mathbf{x}^T W \mathbf{x} + \mathbf{w}^T \mathbf{x} + w_0$$

with

$$W = -\frac{1}{2}\left(\Sigma_1^{-1} - \Sigma_2^{-1}\right), \qquad \mathbf{w} = \Sigma_1^{-1}\boldsymbol{\mu}_1 - \Sigma_2^{-1}\boldsymbol{\mu}_2, \qquad w_0 = -\frac{1}{2}\boldsymbol{\mu}_1^T \Sigma_1^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T \Sigma_2^{-1}\boldsymbol{\mu}_2 - \frac{1}{2}\log\frac{|\Sigma_1|}{|\Sigma_2|} + \log\frac{q}{1-q}$$

The decision boundary between any two classes is where the log of the ratio is zero. If the data in each class has a Gaussian density with unequal variance, then the boundary between any two classes is a quadratic function of x.

[Figure: data from three classes (green, red, blue) and the quadratic decision boundaries between them.]
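A sketch (Python/NumPy) that assembles the quadratic coefficients W, w, and w0 from the two class densities; the parameters below are invented for illustration.

```python
import numpy as np

def quadratic_discriminant(mu1, Sigma1, mu2, Sigma2, q=0.5):
    """Coefficients of
    log[p(x|c=1)P(c=1) / (p(x|c=2)P(c=2))] = x^T W x + w^T x + w0
    for two Gaussian classes with unequal covariances."""
    S1_inv, S2_inv = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
    W = -0.5 * (S1_inv - S2_inv)
    w = S1_inv @ mu1 - S2_inv @ mu2
    w0 = (-0.5 * mu1 @ S1_inv @ mu1 + 0.5 * mu2 @ S2_inv @ mu2
          - 0.5 * np.log(np.linalg.det(Sigma1) / np.linalg.det(Sigma2))
          + np.log(q / (1.0 - q)))
    return W, w, w0

def log_ratio(x, W, w, w0):
    """The log ratio; the decision boundary is where it equals zero."""
    return x @ W @ x + w @ x + w0

# Made-up 2-D example; note that when Sigma1 == Sigma2, W is zero and the
# boundary reduces to the linear case of Page 5.
mu1, Sigma1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu2, Sigma2 = np.array([2.0, 1.0]), np.array([[2.0, 0.3], [0.3, 0.5]])
W, w, w0 = quadratic_discriminant(mu1, Sigma1, mu2, Sigma2)
print(W, w, w0)
print(log_ratio(np.array([1.0, 0.5]), W, w, w0))
```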

Page 11

Non-parametric estimate of densities: Kernel density estimate

$$p(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x^{(i)}}{h}\right), \qquad K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{1}{2} u^2\right)$$

Suppose we have points x(i) that belong to class l, and we cannot assume that these points come from a Gaussian distribution. To estimate the density, we need to form a function that assigns a weight to each point x in our space, with the integral of this function equal to 1. Intuitively, the more data points x(i) we find near x, the larger the estimated density at x should be.

The kernel density estimate puts a Gaussian centered at each data point. Where there are more data points, there are more Gaussians, and the sum is the density.

[Figure: histogram of the sampled data belonging to class l; the Gaussian kernel at widths h and 1.5h; the ML estimate of a Gaussian density; and the density estimate using a Gaussian kernel, all plotted against x.]
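A minimal sketch of the kernel density estimate above, with synthetic samples standing in for the class-l data; h is the kernel width.

```python
import numpy as np

def kde(x, samples, h):
    """Kernel density estimate p_hat(x) = (1/(n*h)) * sum_i K((x - x_i)/h)
    with a standard Gaussian kernel K."""
    u = (np.asarray(x)[..., None] - samples) / h          # shape (..., n)
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=-1) / (len(samples) * h)

# Hypothetical samples from class l (a bimodal distribution that a single
# Gaussian would fit poorly); h controls the smoothness of the estimate.
rng = np.random.default_rng(2)
samples = np.concatenate([rng.normal(-8.0, 2.0, 60), rng.normal(5.0, 3.0, 40)])
grid = np.linspace(-20.0, 20.0, 9)
print(kde(grid, samples, h=1.0))
print(kde(grid, samples, h=1.5))
```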

Page 12

Non-parametric estimate of densities: Kernel density estimate

$$p(\mathbf{x} \mid c = l) = \frac{1}{n_l h^m} \sum_{i \in \text{class } l} K\!\left(\frac{\mathbf{x} - \mathbf{x}^{(i)}}{h}\right), \qquad K(\mathbf{u}) = \frac{1}{(2\pi)^{m/2}} \exp\!\left(-\frac{1}{2}\mathbf{u}^T\mathbf{u}\right)$$

[Figure: data and kernel density estimates for three classes (green, red, blue).]
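A sketch of the multivariate version used for classification: estimate p(x | c = l) with the kernel formula for each class, multiply by the prior, and pick the largest. The data, priors, and bandwidth below are made up.

```python
import numpy as np

def kde_class(x, X_class, h):
    """p_hat(x | c = l) = (1/(n_l * h^m)) * sum_i K((x - x_i)/h) with the
    m-dimensional Gaussian kernel K(u) = (2*pi)^(-m/2) exp(-u^T u / 2)."""
    n_l, m = X_class.shape
    U = (x - X_class) / h                                  # shape (n_l, m)
    K = np.exp(-0.5 * np.sum(U ** 2, axis=1)) / (2.0 * np.pi) ** (m / 2)
    return K.sum() / (n_l * h ** m)

def classify_kde(x, class_data, priors, h):
    """Pick the class with the largest p_hat(x | c = l) * P(c = l)."""
    scores = {l: kde_class(x, X, h) * priors[l] for l, X in class_data.items()}
    return max(scores, key=scores.get)

# Made-up 2-D data for two classes, equal priors, bandwidth h = 1.
rng = np.random.default_rng(3)
class_data = {1: rng.multivariate_normal([0, 0], np.eye(2), 80),
              2: rng.multivariate_normal([3, 2], np.eye(2), 80)}
priors = {1: 0.5, 2: 0.5}
print(classify_kde(np.array([0.5, 0.2]), class_data, priors, h=1.0))
print(classify_kde(np.array([2.8, 2.1]), class_data, priors, h=1.0))
```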

Page 13

Classification with missing data

Suppose that we have built a Bayesian classifier and are now given a new data point to classify, but that this new data point is missing some of the “features” that we normally expect to see. In the example below, we have two features (x1 and x2), and four classes. The likelihood function is plotted.

Suppose that we are given data point (*,-1) to classify. This data point is missing a value for x1. If we assume the missing value is the average of the previously observed x1, then we would estimate it to be about 1. Assuming that the prior probabilities are equal among the four classes, we classify (1,-1) as class c2.

However, c4 is a better choice: when x2 = -1, c4 is the most likely class, because once we account for all possible values of x1 it has the highest likelihood.

[Figure: the likelihoods $p(\mathbf{x} \mid c)$ for the four classes $c_1, \ldots, c_4$ plotted over the $(x_1, x_2)$ plane, with $E[x_1]$ marked.]

Page 14

Classification with missing data

Partition the feature vector as $\mathbf{x} = (\mathbf{x}_g, \mathbf{x}_b)$, where $\mathbf{x}_g$ is the good (observed) data and $\mathbf{x}_b$ is the bad (or missing) data. To classify from the good data alone, marginalize out the missing features:

$$P(c_i \mid \mathbf{x}_g) = \frac{p(c_i, \mathbf{x}_g)}{p(\mathbf{x}_g)} = \frac{\int p(c_i, \mathbf{x}_g, \mathbf{x}_b)\, d\mathbf{x}_b}{p(\mathbf{x}_g)} = \frac{\int P(c_i \mid \mathbf{x}_g, \mathbf{x}_b)\, p(\mathbf{x}_g, \mathbf{x}_b)\, d\mathbf{x}_b}{p(\mathbf{x}_g)} = \frac{\int p(\mathbf{x}_g, \mathbf{x}_b \mid c_i)\, P(c_i)\, d\mathbf{x}_b}{p(\mathbf{x}_g)}$$
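A numerical sketch of this marginalization for the two-feature example, assuming Gaussian class-conditional densities; all parameters below are invented and only loosely mimic the figure on Page 13.

```python
import numpy as np

def gauss2(x, mu, Sigma):
    """Density of a 2-D Gaussian N(mu, Sigma) at the point x."""
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / (2.0 * np.pi * np.sqrt(np.linalg.det(Sigma)))

def posterior_given_x2(x2, classes, x1_grid):
    """P(c_i | x_g) when x_g = x2 is observed and x_b = x1 is missing:
    numerically integrate p(x1, x2 | c_i) P(c_i) over x1, then normalize
    across classes (the common normalizer is p(x_g))."""
    dx1 = x1_grid[1] - x1_grid[0]
    scores = []
    for mu, Sigma, prior in classes:
        vals = np.array([gauss2(np.array([x1, x2]), mu, Sigma) for x1 in x1_grid])
        scores.append(vals.sum() * dx1 * prior)   # ≈ ∫ p(x1, x2 | c_i) dx1 * P(c_i)
    scores = np.array(scores)
    return scores / scores.sum()

# Four hypothetical classes over (x1, x2); means, covariances, and priors are made up.
classes = [(np.array([0.0,  1.5]), 0.5 * np.eye(2), 0.25),                      # c1
           (np.array([1.0, -1.0]), 0.5 * np.eye(2), 0.25),                      # c2
           (np.array([3.0,  1.0]), 0.5 * np.eye(2), 0.25),                      # c3
           (np.array([6.0, -1.0]), np.array([[4.0, 0.0], [0.0, 0.3]]), 0.25)]   # c4

x1_grid = np.linspace(-10.0, 15.0, 2001)
print(posterior_given_x2(-1.0, classes, x1_grid))   # posterior over c1..c4 given x2 = -1
```

With these illustrative numbers, the marginalized posterior favors c4 at x2 = -1, mirroring the argument on Page 13.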