580.691 Learning Theory
Reza Shadmehr
Linear and quadratic decision boundaries
Kernel estimates of density
Missing data
Bayesian classification
• Suppose we wish to classify vector x as belonging to a class in {1, …, L}. We are given labeled data and need to form a classification function:

$$D = \left\{ \left(\mathbf{x}^{(1)}, c^{(1)}\right), \ldots, \left(\mathbf{x}^{(n)}, c^{(n)}\right) \right\}, \qquad c^{(i)} \in \{1, \ldots, L\}$$

$$\hat{c} = \hat{c}(\mathbf{x}), \qquad \hat{c} \in \{1, \ldots, L\}$$

$$\hat{c}(\mathbf{x}) = \arg\max_{l = 1, \ldots, L} P(c = l \mid \mathbf{x})$$
Classify x into the class l that maximizes the posterior probability.
$$P(c = l \mid \mathbf{x}) = \frac{\overbrace{p(\mathbf{x} \mid c = l)}^{\text{likelihood}} \; \overbrace{P(c = l)}^{\text{prior}}}{\underbrace{p(\mathbf{x})}_{\text{marginal}}}, \qquad p(\mathbf{x}) = \sum_{l = 1}^{L} p(\mathbf{x} \mid c = l) \, P(c = l)$$
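As a minimal numeric sketch of this rule (the class priors and the likelihood values at a particular observation are made up for illustration):

```python
import numpy as np

# Hypothetical example: three classes with assumed priors and with the
# likelihoods p(x | c = l) already evaluated at one observation x.
priors = np.array([0.5, 0.3, 0.2])          # P(c = l)
likelihoods = np.array([0.01, 0.04, 0.02])  # p(x | c = l) at this x

# Bayes rule: posterior = likelihood * prior / marginal
joint = likelihoods * priors
marginal = joint.sum()            # p(x) = sum_l p(x | c = l) P(c = l)
posterior = joint / marginal      # P(c = l | x), sums to 1

c_hat = int(np.argmax(posterior))  # classify into the most probable class
```

Note that the marginal p(x) does not affect which class wins the arg max; it only normalizes the posterior.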
Classification when distributions have equal variance
• Suppose we wish to classify a person as male or female based on height.
$$p(x \mid c = 0) = N(\mu_1, \sigma^2), \qquad p(x \mid c = 1) = N(\mu_2, \sigma^2), \qquad P(c = 1) = q$$

$$\hat{c}(x) = 1 \ \text{if} \ P(c = 1 \mid x) \ge 0.5; \qquad \hat{c}(x) = 0 \ \text{otherwise}$$
[Figure, "What we have": the class-conditional densities p(x|c=0) (female) and p(x|c=1) (male) versus height x (cm). Note that the two densities have equal variance. "What we want": the classification rule ĉ(x).]

Assume equal probability of being male or female: P(c = 1) = 0.5.
$$p(x) = \sum_{i = 0}^{1} p(x \mid c = i) \, P(c = i)$$

[Figure: the weighted densities p(x|c=0)P(c=0) and p(x|c=1)P(c=1), and the marginal p(x), versus height x (cm).]
Classification when distributions have equal variance
$$\log \frac{p(x \mid c = 0) \, P(c = 0)}{p(x \mid c = 1) \, P(c = 1)}$$

[Figure: left, the weighted densities p(x|c=0)P(c=0) and p(x|c=1)P(c=1) versus height x (cm); right, the log ratio above versus height, which crosses zero at the decision boundary.]
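A quick numeric check of the equal-variance case (the two means, the shared standard deviation, and the prior below are assumed values): because the variances are equal, the quadratic terms cancel and the log ratio is linear in height.

```python
import numpy as np

# Assumed values: class-conditional heights (cm) with a SHARED variance.
mu_f, mu_m, sigma, q = 165.0, 180.0, 7.0, 0.5   # female/male means, sd, P(c=1)

def log_gauss(x, mu, sigma):
    # log of the Gaussian density N(mu, sigma^2) evaluated at x
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

x = np.linspace(150, 200, 201)
# log [ p(x|c=0) P(c=0) / (p(x|c=1) P(c=1)) ]
log_ratio = (log_gauss(x, mu_f, sigma) + np.log(1 - q)) \
          - (log_gauss(x, mu_m, sigma) + np.log(q))

# Since the quadratic terms cancel, the slope is constant across the grid:
slopes = np.diff(log_ratio) / np.diff(x)
```

With q = 0.5 the log ratio crosses zero exactly at the midpoint of the two means, here 172.5 cm.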
Estimating the decision boundary between data of equal variance
• Suppose the distribution of the data in each class is Gaussian.
$$p(\mathbf{x} \mid c = 1) = \frac{1}{(2\pi)^{m/2} \left| \Sigma \right|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) \right), \qquad P(c = 1) = q$$

$$\log \left[ p(\mathbf{x} \mid c = 1) \, P(c = 1) \right] = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) - \frac{m}{2} \log 2\pi - \frac{1}{2} \log \left| \Sigma \right| + \log q$$

$$= -\frac{1}{2} \mathbf{x}^T \Sigma^{-1} \mathbf{x} + \boldsymbol{\mu}_1^T \Sigma^{-1} \mathbf{x} - \frac{1}{2} \boldsymbol{\mu}_1^T \Sigma^{-1} \boldsymbol{\mu}_1 - \frac{m}{2} \log 2\pi - \frac{1}{2} \log \left| \Sigma \right| + \log q$$
$$\log \frac{p(\mathbf{x} \mid c = 1) \, P(c = 1)}{p(\mathbf{x} \mid c = 2) \, P(c = 2)} = \mathbf{x}^T \Sigma^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) - \frac{1}{2} \boldsymbol{\mu}_1^T \Sigma^{-1} \boldsymbol{\mu}_1 + \frac{1}{2} \boldsymbol{\mu}_2^T \Sigma^{-1} \boldsymbol{\mu}_2 + \log \frac{q}{1 - q} = \mathbf{w}^T \mathbf{x} + w_0$$

$$\mathbf{w} = \Sigma^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

(The quadratic terms $-\frac{1}{2}\mathbf{x}^T \Sigma^{-1} \mathbf{x}$ cancel in the ratio because the two classes share the same Σ.)
The decision boundary between any two classes is where the log of the ratio is zero. If the data in each class has a Gaussian density with equal variance, then the boundary between any two classes is a line.
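A sketch of computing the linear boundary parameters w and w0 from the expressions above, using assumed means, a shared covariance, and an assumed prior:

```python
import numpy as np

# Hypothetical 2-D example with a SHARED covariance (equal-variance case).
mu1 = np.array([0.0, 0.0])
mu2 = np.array([3.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
q = 0.5                                   # P(c = 1)

Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu1 - mu2)                    # normal vector of the linear boundary
w0 = (-0.5 * mu1 @ Sinv @ mu1
      + 0.5 * mu2 @ Sinv @ mu2
      + np.log(q / (1 - q)))

def log_ratio(x):
    """log [ p(x|c=1)P(c=1) / (p(x|c=2)P(c=2)) ] = w.x + w0 (linear in x)."""
    return w @ x + w0

# Points with log_ratio(x) > 0 are classified as class 1.
```

With equal priors (q = 0.5), the boundary passes through the midpoint of the two means.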
Estimating the decision boundary from estimated densities
• From the data we can get an ML estimate of Gaussian parameters
$$\hat{\boldsymbol{\mu}}_1 = \frac{1}{n_1} \sum_{i \in \text{class } 1} \mathbf{x}^{(i)}$$

$$\hat{\Sigma}_1 = \frac{1}{n_1} \sum_{i \in \text{class } 1} \left( \mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}_1 \right) \left( \mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}_1 \right)^T$$

$$\hat{\Sigma} = \frac{n_1 \hat{\Sigma}_1 + n_2 \hat{\Sigma}_2}{n_1 + n_2}$$
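The ML and pooled estimates above can be sketched as follows (the two classes, their means, and their sample sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical labeled 2-D data for two classes:
X1 = rng.normal([0, 0], 1.0, size=(200, 2))   # class 1 samples
X2 = rng.normal([3, 1], 1.0, size=(300, 2))   # class 2 samples

def ml_gaussian(X):
    """ML estimates: sample mean and (biased, 1/n) sample covariance."""
    mu = X.mean(axis=0)
    Z = X - mu
    Sigma = (Z.T @ Z) / len(X)
    return mu, Sigma

mu1_hat, S1_hat = ml_gaussian(X1)
mu2_hat, S2_hat = ml_gaussian(X2)

# Pooled (shared) covariance, weighted by the class sample sizes:
n1, n2 = len(X1), len(X2)
S_hat = (n1 * S1_hat + n2 * S2_hat) / (n1 + n2)
```

The pooled S_hat is what the equal-variance (linear) classifier plugs in for Σ.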
The boundary between each pair of classes is where the corresponding log ratio is zero:

$$\log \frac{p(\mathbf{x} \mid c = 1) \, P(c = 1)}{p(\mathbf{x} \mid c = 2) \, P(c = 2)} = 0, \qquad \log \frac{p(\mathbf{x} \mid c = 2) \, P(c = 2)}{p(\mathbf{x} \mid c = 3) \, P(c = 3)} = 0, \qquad \log \frac{p(\mathbf{x} \mid c = 1) \, P(c = 1)}{p(\mathbf{x} \mid c = 3) \, P(c = 3)} = 0$$

[Figure: data from Class 1, Class 2, and Class 3 with the three linear pairwise decision boundaries.]
Relationship between Bayesian classification and Fisher discriminant

If we have two classes, class −1 and class +1, the Fisher discriminant projects the data onto

$$y = \mathbf{w}^T \mathbf{x} + w_0$$

choosing w to maximize the ratio of between-class to within-class scatter,

$$J(\mathbf{w}) = \frac{\left( \mathbf{w}^T \hat{\boldsymbol{\mu}}_0 - \mathbf{w}^T \hat{\boldsymbol{\mu}}_1 \right)^2}{\mathbf{w}^T \left( n_0 \hat{\Sigma}_0 + n_1 \hat{\Sigma}_1 \right) \mathbf{w}}, \qquad \mathbf{w} = a \left( n_0 \hat{\Sigma}_0 + n_1 \hat{\Sigma}_1 \right)^{-1} \left( \hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1 \right)$$

for some scalar a. The decision boundary is at y = 0:

$$\mathbf{w}^T \mathbf{x} + w_0 = 0$$

For the Bayesian classifier, under the assumption of equal variance, the decision boundary is at:

$$\log \frac{p(\mathbf{x} \mid c = 1) \, P(c = 1)}{p(\mathbf{x} \mid c = 2) \, P(c = 2)} = \mathbf{x}^T \Sigma^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) - \frac{1}{2} \boldsymbol{\mu}_1^T \Sigma^{-1} \boldsymbol{\mu}_1 + \frac{1}{2} \boldsymbol{\mu}_2^T \Sigma^{-1} \boldsymbol{\mu}_2 + \log \frac{q}{1 - q} = \mathbf{w}^T \mathbf{x} + w_0 = 0$$

$$\mathbf{w} = \Sigma^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

The Fisher decision boundary is the same as the Bayesian one when the two classes have equal variance and equal prior probability.
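A numeric check of this equivalence, with assumed means and a covariance shared by both classes: the Fisher direction (n0 Σ̂0 + n1 Σ̂1)⁻¹(μ̂0 − μ̂1) is parallel to the Bayesian direction Σ⁻¹(μ0 − μ1).

```python
import numpy as np

# Assumed values: both classes share the covariance Sigma (equal-variance case).
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 1.0])
Sigma = np.array([[1.5, 0.3],
                  [0.3, 0.8]])
n0, n1 = 100, 100                         # equal class sizes

# Fisher direction (within-class scatter is n0*Sigma + n1*Sigma here):
w_fisher = np.linalg.inv(n0 * Sigma + n1 * Sigma) @ (mu0 - mu1)
# Bayesian direction under equal variance:
w_bayes = np.linalg.inv(Sigma) @ (mu0 - mu1)

# The two vectors are parallel, so they define the same boundary orientation:
cos = w_fisher @ w_bayes / (np.linalg.norm(w_fisher) * np.linalg.norm(w_bayes))
```

The scalar a only rescales w, so the zero-crossing of y is unchanged.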
Classification when distributions have unequal variance

$$p(x \mid c = 0) = N(\mu_1, \sigma_1^2), \qquad p(x \mid c = 1) = N(\mu_2, \sigma_2^2), \qquad P(c = 1) = q$$

$$\hat{c}(x) = 1 \ \text{if} \ P(c = 1 \mid x) \ge 0.5; \qquad \hat{c}(x) = 0 \ \text{otherwise}$$
[Figure, "What we have": the weighted densities p(x|c=0)P(c=0) and p(x|c=1)P(c=1) versus height x (cm); the two densities have unequal variance.]

Assume: P(c = 1) = 0.5.

$$p(x) = \sum_{i = 0}^{1} p(x \mid c = i) \, P(c = i)$$

[Figure, "Classification": the posteriors P(c=0|x) and P(c=1|x), the variance var(c|x), and the log ratio log[p(x|c=0)P(c=0) / (p(x|c=1)P(c=1))], each versus height x (cm).]
Quadratic discriminant: when data come from unequal-variance Gaussians
$$p(\mathbf{x} \mid c = 1) = \frac{1}{(2\pi)^{m/2} \left| \Sigma_1 \right|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)^T \Sigma_1^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) \right), \qquad P(c = 1) = q$$

$$\log \left[ p(\mathbf{x} \mid c = 1) \, P(c = 1) \right] = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)^T \Sigma_1^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) - \frac{m}{2} \log 2\pi - \frac{1}{2} \log \left| \Sigma_1 \right| + \log q$$

$$= -\frac{1}{2} \mathbf{x}^T \Sigma_1^{-1} \mathbf{x} + \boldsymbol{\mu}_1^T \Sigma_1^{-1} \mathbf{x} - \frac{1}{2} \boldsymbol{\mu}_1^T \Sigma_1^{-1} \boldsymbol{\mu}_1 - \frac{m}{2} \log 2\pi - \frac{1}{2} \log \left| \Sigma_1 \right| + \log q$$

$$\log \frac{p(\mathbf{x} \mid c = 1) \, P(c = 1)}{p(\mathbf{x} \mid c = 2) \, P(c = 2)} = \mathbf{x}^T W \mathbf{x} + \mathbf{w}^T \mathbf{x} + w_0$$

$$W = -\frac{1}{2} \left( \Sigma_1^{-1} - \Sigma_2^{-1} \right), \qquad \mathbf{w} = \Sigma_1^{-1} \boldsymbol{\mu}_1 - \Sigma_2^{-1} \boldsymbol{\mu}_2$$

$$w_0 = -\frac{1}{2} \boldsymbol{\mu}_1^T \Sigma_1^{-1} \boldsymbol{\mu}_1 + \frac{1}{2} \boldsymbol{\mu}_2^T \Sigma_2^{-1} \boldsymbol{\mu}_2 - \frac{1}{2} \log \frac{\left| \Sigma_1 \right|}{\left| \Sigma_2 \right|} + \log \frac{q}{1 - q}$$
The decision boundary between any two classes is where the log of the ratio is zero. If the data in each class has a Gaussian density with unequal variance, then the boundary between any two classes is a quadratic function of x.
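A numeric sketch of the unequal-variance case in one dimension (the means, standard deviations, and prior are assumed values): the second differences of the log ratio are constant and nonzero, confirming it is quadratic rather than linear in x.

```python
import numpy as np

# Assumed values: two 1-D Gaussians with UNEQUAL variances.
mu1, s1 = 165.0, 5.0
mu2, s2 = 180.0, 9.0
q = 0.5                                   # P(c = 1)

def log_weighted(x, mu, s, prior):
    # log [ p(x | c) * P(c) ] for a 1-D Gaussian class model
    return (-0.5 * ((x - mu) / s) ** 2
            - np.log(s * np.sqrt(2 * np.pi)) + np.log(prior))

x = np.linspace(140, 210, 701)
log_ratio = log_weighted(x, mu1, s1, q) - log_weighted(x, mu2, s2, 1 - q)

# Second differences of a quadratic are a nonzero constant (zero for a line):
second = np.diff(log_ratio, 2)
```

A quadratic log ratio can cross zero twice, so the class-1 region need not be a single interval.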
[Figure: class-conditional densities for three classes (green, red, blue) with unequal variances; the resulting pairwise decision boundaries are quadratic.]
Non-parametric estimate of densities: Kernel density estimate
$$\hat{p}(x) = \frac{1}{n h} \sum_{i = 1}^{n} K\!\left( \frac{x - x^{(i)}}{h} \right), \qquad K(u) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} u^2 \right)$$
Suppose we have points x(i) that belong to class l, and suppose we cannot assume that these points come from a Gaussian distribution. To estimate the density, we need a function that assigns a weight to each point x in our space, with the integral of this function equal to 1. Intuitively, the more data points x(i) we find near x, the larger the estimated density at x should be.
The kernel density estimate puts a Gaussian centered at each data point. Where there are more data points, there are more Gaussians, and the sum is the density.
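A minimal sketch of the 1-D kernel density estimate above, on hypothetical bimodal data (the sample means, spreads, and bandwidth are assumptions):

```python
import numpy as np

def kde(x, samples, h):
    """Gaussian kernel density estimate: (1/(n h)) * sum_i K((x - x_i)/h)."""
    u = (x[:, None] - samples[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (len(samples) * h)

# Hypothetical samples from a bimodal (clearly non-Gaussian) class density:
rng = np.random.default_rng(1)
samples = np.concatenate([rng.normal(-8, 2, 150), rng.normal(6, 3, 150)])

x = np.linspace(-20, 20, 401)
p_hat = kde(x, samples, h=1.5)

# The estimate is a proper density: nonnegative and integrating to ~1
# (Riemann sum over a grid wide enough to cover the tails).
area = p_hat.sum() * (x[1] - x[0])
```

A single ML Gaussian fit to the same samples would place its mode between the two bumps; the kernel estimate keeps both.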
[Figure: histogram of the sampled data belonging to class l; kernel density estimates with bandwidths h and 1.5h; the ML estimate of a single Gaussian density compared with the density estimate using a Gaussian kernel.]
Non-parametric estimate of densities: Kernel density estimate
$$\hat{p}(\mathbf{x} \mid c = l) = \frac{1}{n_l h^m} \sum_{i \in \text{class } l} K\!\left( \frac{\mathbf{x} - \mathbf{x}^{(i)}}{h} \right), \qquad K(\mathbf{u}) = \frac{1}{(2\pi)^{m/2}} \exp\left( -\frac{1}{2} \mathbf{u}^T \mathbf{u} \right)$$
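A sketch of the multivariate kernel estimate used as a classifier, assuming equal priors and hypothetical 2-D data for two classes:

```python
import numpy as np

def kde_multivariate(x, samples, h):
    """p̂(x | class) = 1/(n h^m) * sum_i K((x - x_i)/h), Gaussian kernel K."""
    n, m = samples.shape
    u = (x - samples) / h                              # shape (n, m)
    K = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (m / 2)
    return K.sum() / (n * h ** m)

rng = np.random.default_rng(2)
# Hypothetical 2-D training samples for two classes:
class_samples = {0: rng.normal([0, 0], 1.0, (100, 2)),
                 1: rng.normal([4, 4], 1.0, (100, 2))}

def classify(x, h=0.8):
    # Equal priors assumed: pick the class with the larger density estimate.
    return max(class_samples, key=lambda l: kde_multivariate(x, class_samples[l], h))
```

Here the decision boundary is whatever surface the two kernel estimates happen to cross on; it need not be linear or quadratic.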
[Figure: kernel density estimates and the resulting decision regions for three classes (green, red, blue).]
Classification with missing data
Suppose that we have built a Bayesian classifier and are now given a new data point to classify, but this new data point is missing some of the "features" that we normally expect to see. In the example below, we have two features (x1 and x2) and four classes. The likelihood function is plotted.
Suppose that we are given the data point (*, -1) to classify. This data point is missing a value for x1. If we assume the missing value is the average of the previously observed x1, then we would estimate it to be about 1. Assuming that the prior probabilities are equal among the four classes, we classify (1, -1) as class c2.
However, c4 is a better choice: when x2 = -1, c4 is the most likely class, as it has the highest likelihood.
[Figure: contours of the likelihoods p(x | c) for the four classes c1, c2, c3, c4 in the (x1, x2) plane, with the line x1 = E[x1] marked.]
Classification with missing data

Partition the features into the good (observed) data x_g and the bad (or missing) data x_b, so that x = (x_g, x_b). To classify from the good data alone, marginalize the joint density over the missing features:

$$P(c_i \mid \mathbf{x}_g) = \frac{p(c_i, \mathbf{x}_g)}{p(\mathbf{x}_g)} = \frac{\int p(c_i, \mathbf{x}_g, \mathbf{x}_b) \, d\mathbf{x}_b}{p(\mathbf{x}_g)} = \frac{\int p(\mathbf{x}_g, \mathbf{x}_b \mid c_i) \, P(c_i) \, d\mathbf{x}_b}{p(\mathbf{x}_g)} = \frac{P(c_i) \int p(\mathbf{x} \mid c_i) \, d\mathbf{x}_b}{p(\mathbf{x}_g)}$$
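A sketch of the imputation-versus-marginalization contrast described above, with made-up axis-aligned Gaussian likelihoods for three of the classes, chosen so that imputing x1 = 1 picks c2 while marginalizing over x1 picks c4 (equal priors assumed throughout):

```python
import numpy as np

# ASSUMED class models (means/variances made up for illustration): each class
# has an axis-aligned 2-D Gaussian likelihood p(x1, x2 | c).
classes = {
    'c1': ((-2.0,  2.0), (1.0, 1.0)),   # (mean_x1, mean_x2), (var_x1, var_x2)
    'c2': (( 1.0, -1.0), (0.3, 1.0)),
    'c4': (( 5.0, -1.0), (1.0, 0.2)),
}

def gauss1d(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def classify_imputed(x1, x2):
    # Fill in the missing x1 with a guess, then use the full 2-D likelihood.
    return max(classes,
               key=lambda c: gauss1d(x1, classes[c][0][0], classes[c][1][0])
                           * gauss1d(x2, classes[c][0][1], classes[c][1][1]))

def classify_marginal(x2):
    # Marginalize: for an axis-aligned Gaussian, the integral of p(x1, x2 | c)
    # over x1 is just the 1-D Gaussian in x2.
    return max(classes,
               key=lambda c: gauss1d(x2, classes[c][0][1], classes[c][1][1]))
```

Imputation evaluates the likelihood along one slice of x1 and can miss a class whose mass lies elsewhere; marginalization integrates over all possible x1.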