
Giansalvo EXIN Cirrincione

unit #3

PROBABILITY DENSITY ESTIMATION (finite number of training samples)

• labelled / unlabelled data
• parametric methods
• non-parametric methods
• semi-parametric methods

Parametric methods: a specific functional form for the density model is assumed. This form contains a number of parameters, which are then optimized by fitting the model to the training set. Drawback: the chosen form may not be correct.


Non-parametric methods: no particular functional form is assumed; the form of the density is determined entirely by the data. Drawback: the number of parameters grows with the size of the training set (TS).


Semi-parametric methods: a very general class of functional forms is allowed, in which the number of adaptive parameters can be increased in a systematic way to build ever more flexible models, but where the total number of parameters in the model can be varied independently of the size of the data set.

Parametric methods: maximum likelihood · Bayesian inference · stochastic techniques for on-line learning

Parametric model: normal or Gaussian distribution

p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp{ -(1/2) (x - μ)^T Σ^{-1} (x - μ) }

Number of independent parameters: d(d+3)/2 (d for the mean μ, d(d+1)/2 for the symmetric covariance matrix Σ).


Parametric model: normal or Gaussian distribution

The Mahalanobis distance is Δ² = (x - μ)^T Σ^{-1} (x - μ). Surfaces of constant Δ² are contours of constant probability density; on the surface Δ² = 1 the density is smaller than its peak value by a factor exp(-1/2).


Parametric model: normal or Gaussian distribution

Σ u_i = λ_i u_i : the eigenvectors u_i of the covariance matrix give the principal axes of the constant-density ellipsoids, and the eigenvalues λ_i give the variances along those axes.


Parametric model: normal or Gaussian distribution

If the covariance matrix is diagonal, the components of x are statistically independent, p(x) = ∏_i p(x_i), and the number of independent parameters drops to 2d.


Parametric model: normal or Gaussian distribution

If, further, Σ = σ²I (isotropic covariance), the number of independent parameters is d + 1.


Parametric model: normal or Gaussian distribution

Some properties:
• any moment can be expressed as a function of μ and Σ
• under general assumptions, the mean of M random variables tends to be distributed normally in the limit as M tends to infinity (central limit theorem); example: the sum of a set of variables drawn independently from the same distribution
• under any non-singular linear transformation of the coordinate system, the pdf is again normal, but with different parameters
• the marginal and conditional densities are normal.


Parametric model: normal or Gaussian distribution

Discriminant functions: with independent normal class-conditional pdf's, y_k(x) = ln p(x|C_k) + ln P(C_k) is quadratic in x, so the decision boundary is in general quadratic.


Parametric model: normal or Gaussian distribution

Independent normal class-conditional pdf's with a shared covariance matrix, Σ_k = Σ: the quadratic terms cancel, the discriminant functions become linear in x, and the decision boundary is linear (a hyperplane).


Parametric model: normal or Gaussian distribution

[Figure: decision boundary for two Gaussian classes with equal priors, P(C1) = P(C2).]


Parametric model: normal or Gaussian distribution

[Figure: decision boundaries for three Gaussian classes with equal priors, P(C1) = P(C2) = P(C3).]


Parametric model: normal or Gaussian distribution

Template matching: if, in addition, Σ_k = σ²I and the priors are equal, the classifier assigns x to the class whose mean μ_k is closest in Euclidean distance; the class means act as templates.
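As an illustration of the template-matching limit, here is a minimal nearest-mean classifier sketch in Python/NumPy; the class means and the test point are made-up values, not data from the slides:

```python
import numpy as np

def nearest_mean_classify(x, class_means):
    """Template matching: assign x to the class whose mean is closest
    in Euclidean distance (equal priors, shared isotropic covariance)."""
    dists = [np.linalg.norm(x - mu) for mu in class_means]
    return int(np.argmin(dists))

# Hypothetical 2-D example with three class "templates"
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
x_new = np.array([2.4, 0.5])
print(nearest_mean_classify(x_new, means))  # -> 1 (closest to [3, 0])
```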


Maximum likelihood (ML) finds the optimum values of the parameters θ by maximizing a likelihood function derived from the training data X = {x^1, …, x^N}, drawn independently from the required distribution p(x|θ).


The joint probability density of the TS is L(θ) = p(X|θ) = ∏_n p(x^n|θ); viewed as a function of θ for the given TS, this is the likelihood of θ. ML chooses the θ that maximizes L(θ).


It is convenient to minimize the negative log-likelihood as an error function, E = -ln L = -Σ_n ln p(x^n|θ).

Homework: for a Gaussian pdf, show that the ML estimates of the mean and covariance are the sample averages, μ̂ = (1/N) Σ_n x^n and Σ̂ = (1/N) Σ_n (x^n - μ̂)(x^n - μ̂)^T.
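A small NumPy sketch of these sample-average ML estimates; the synthetic data and the "true" parameters are illustrative, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D Gaussian data (illustrative parameters)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=5000)

# ML estimates = sample averages
mu_hat = X.mean(axis=0)
diff = X - mu_hat
cov_hat = diff.T @ diff / len(X)   # note: divides by N, not N-1

print(mu_hat)   # close to [1, -2]
print(cov_hat)  # close to [[2, 0.6], [0.6, 1]]
```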


Bayesian inference: the uncertainty in the values of the parameters is expressed by a prior distribution p(θ), which is converted into a posterior distribution p(θ|X) once the training data have been observed.


The density at a new point is obtained by integrating over the parameters, p(x|X) = ∫ p(x|θ) p(θ|X) dθ, with the posterior distribution p(θ|X) acting as a weighting factor. By Bayes' theorem, p(θ|X) = p(X|θ) p(θ) / p(X), and since the TS points are drawn independently from the underlying distribution, p(X|θ) = ∏_n p(x^n|θ).


A prior which gives rise to a posterior having the same functional form is said to be a conjugate prior (reproducing densities, e.g. Gaussian).

For large numbers of observations, the Bayesian representation of the density approaches the maximum likelihood solution.

Example: the mean of a normal distribution with known variance. Assume σ² is known; find μ given X = {x^1, …, x^N} drawn from the normal distribution. Take the prior p(μ) to be normal, N(μ_0, σ_0²).

Homework: show that the posterior p(μ|X) is again normal, N(μ_N, σ_N²), with

μ_N = (N σ_0² x̄ + σ² μ_0) / (N σ_0² + σ²),   1/σ_N² = 1/σ_0² + N/σ²,

where x̄ is the sample mean. As N grows, μ_N tends to the sample mean (the ML estimate) and σ_N² tends to zero.
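A short sketch of this Bayesian update, assuming the Gaussian prior/posterior formulas above; all numeric values are made up for illustration:

```python
import numpy as np

def posterior_mean_known_variance(x, mu0, sigma0_sq, sigma_sq):
    """Posterior N(mu_N, sigma_N^2) for the mean of a Gaussian with
    known variance sigma_sq, given a Gaussian prior N(mu0, sigma0_sq)."""
    N = len(x)
    x_bar = np.mean(x)
    mu_N = (N * sigma0_sq * x_bar + sigma_sq * mu0) / (N * sigma0_sq + sigma_sq)
    sigma_N_sq = 1.0 / (1.0 / sigma0_sq + N / sigma_sq)
    return mu_N, sigma_N_sq

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=50)      # true mean 2, sigma^2 = 1
print(posterior_mean_known_variance(data, mu0=0.0, sigma0_sq=10.0, sigma_sq=1.0))
# posterior mean close to the sample mean, posterior variance ~ 1/50
```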


Iterative techniques:
• no storage of a complete TS
• on-line learning in real-time adaptive systems
• tracking of slowly varying systems

From the ML estimate of the mean of a normal distribution, a sequential update follows directly: μ̂_N = μ̂_{N-1} + (1/N)(x^N - μ̂_{N-1}).

The Robbins-Monro algorithm

Consider a pair of random variables g and θ which are correlated. The regression function is f(θ) ≡ E[g | θ]. Assume g has finite variance: E[(g - f)² | θ] < ∞. The goal is to find the root θ* of f(θ) = 0.

The Robbins-Monro algorithm updates the estimate sequentially, θ_N = θ_{N-1} + a_{N-1} g(θ_{N-1}), where the coefficients a_N are positive and satisfy, for convergence:
• lim_{N→∞} a_N = 0 : successive corrections decrease in magnitude
• Σ_N a_N = ∞ : corrections are sufficiently large that the root is found
• Σ_N a_N² < ∞ : the accumulated noise has finite variance (noise doesn't spoil convergence).

The ML parameter estimate can be formulated as a sequential update method using the Robbins-Monro formula.

homework

Consider the case where the pdf is taken to be a normal distribution with known standard deviation σ and unknown mean μ. Show that, by choosing a_N = σ²/(N+1), the one-dimensional iterative version of the ML estimate of the mean is recovered from the Robbins-Monro formula for sequential ML. Obtain the corresponding formula for the iterative estimate of σ² and repeat the same analysis.

Here g ≡ ∂ ln p(x|μ̂)/∂μ̂ = (x - μ̂)/σ², and its regression function f(μ̂) = E[g | μ̂] vanishes at the ML solution.
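A minimal sketch of this sequential ML / Robbins-Monro estimate of a Gaussian mean, assuming the choice a_N = σ²/(N+1) from the homework above; the data values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.5                                  # known standard deviation
data = rng.normal(3.0, sigma, size=2000)     # unknown true mean = 3.0

mu_hat = 0.0                                 # initial guess
for N, x in enumerate(data):
    a_N = sigma**2 / (N + 1)                 # Robbins-Monro coefficients
    g = (x - mu_hat) / sigma**2              # d/dmu of the log-likelihood of x
    mu_hat = mu_hat + a_N * g                # same as mu + (x - mu)/(N+1)

print(mu_hat)            # converges to the sample mean (~3.0)
print(data.mean())
```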

NON-PARAMETRIC METHODS

SUPERVISED LEARNING

Histograms: we can choose both the number of bins M and their starting position on the axis. The number of bins (i.e. the bin width) acts as a smoothing parameter.

Curse of dimensionality: in d dimensions the number of bins grows as M^d.
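A quick histogram density estimate in NumPy, illustrating the bin count M as a smoothing parameter; the sample data and the choice M = 10 are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=500)        # 1-D sample

M = 10                                    # number of bins (smoothing parameter)
counts, edges = np.histogram(x, bins=M)
widths = np.diff(edges)
density = counts / (len(x) * widths)      # normalize so the histogram integrates to 1

print(np.sum(density * widths))           # ~1.0
```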

Density estimation in general

The probability that a new vector x, drawn from the unknown pdf p(x), will fall inside some region R of x-space is P = ∫_R p(x') dx'.

If we have N points drawn independently from p(x), the probability that K of them fall within R is given by the binomial law P(K) = (N choose K) P^K (1 - P)^{N-K}. This distribution is sharply peaked as N tends to infinity, so K ≈ N P.

Assume p(x) is continuous and varies only slightly over the region R, of volume V; then P ≈ p(x) V, which gives the estimate p(x) ≈ K / (N V).

Density estimation in general

Assumption #1: R relatively large, so that P is large and the binomial distribution is sharply peaked.
Assumption #2: R small, which justifies treating p(x) as nearly constant inside the integration region.
The two assumptions pull in opposite directions, so in practice one quantity is fixed and the other is determined from the data:

K FIXED, V DETERMINED FROM THE DATA → K-nearest-neighbours

Conversely:

V FIXED, K DETERMINED FROM THE DATA → kernel-based methods

Kernel-based methods

Take R to be a hypercube of side h centred on x, so that V = h^d. We can find an expression for K by defining a kernel function H(u), also known as a Parzen window:

H(u) = 1 if |u_j| < 1/2 for all j = 1, …, d; 0 otherwise.

Then K = Σ_n H((x - x^n)/h), and the density estimate becomes

p̂(x) = (1/N) Σ_n (1/h^d) H((x - x^n)/h),

i.e. the superposition of N hypercubes of side h, each centred on one of the data points. H acts as a zero-order-hold (ZOH) interpolation function.

Kernel-based methods

A smoother estimate is obtained with a Gaussian kernel: p̂(x) = (1/N) Σ_n (2πh²)^{-d/2} exp(-||x - x^n||²/(2h²)).

[Figure: kernel-based estimates from 30 samples, comparing the ZOH (hypercube) kernel with the Gaussian kernel.]
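A minimal Parzen-window sketch in 1-D, with both the box (ZOH) kernel and the Gaussian kernel; the sample size, bandwidth h, and evaluation grid are arbitrary illustrations:

```python
import numpy as np

def parzen_box(x_eval, data, h):
    """ZOH / hypercube kernel estimate in 1-D: average of boxes of width h."""
    u = (x_eval[:, None] - data[None, :]) / h
    return np.mean((np.abs(u) < 0.5) / h, axis=1)

def parzen_gauss(x_eval, data, h):
    """Gaussian kernel estimate in 1-D with bandwidth h."""
    u = (x_eval[:, None] - data[None, :]) / h
    return np.mean(np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h), axis=1)

rng = np.random.default_rng(4)
data = rng.normal(0.0, 1.0, size=30)          # 30 samples, as in the figure
grid = np.linspace(-4, 4, 201)
h = 0.4
p_box, p_gauss = parzen_box(grid, data, h), parzen_gauss(grid, data, h)
print(np.trapz(p_box, grid), np.trapz(p_gauss, grid))   # both ~1
```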

Kernel-based methods

Averaging over different selections of the data points x^n, the expectation of the estimated density is a convolution of the true pdf with the kernel function, and so represents a smoothed version of the true pdf.

All of the data points must be stored!

For a finite data set, there is no non-negative estimator which is unbiased for all continuous pdf's (Rosenblatt, 1956).

K-nearest neighbours

One potential problem with the kernel-based approach is the use of a fixed width parameter h for all of the data points. If h is too large, there may be regions of x-space in which the estimate is over-smoothed; reducing h may make the estimate noisy in regions of lower density. The optimum choice of h may therefore be a function of position.

Consider a small hypersphere centred at a point x and allow its radius to grow until it contains precisely K data points. The estimate of the density is then p̂(x) = K / (N V), where V is the volume of the resulting sphere.
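A small K-nearest-neighbour density estimate in 1-D, following p̂(x) = K/(N V) with V the length of the interval containing the K nearest samples; the sample data and K are arbitrary:

```python
import numpy as np

def knn_density_1d(x_eval, data, K):
    """K-NN density estimate: grow an interval around each query point
    until it holds K samples, then return K / (N * V)."""
    N = len(data)
    p = np.empty_like(x_eval)
    for i, x in enumerate(x_eval):
        r = np.sort(np.abs(data - x))[K - 1]   # distance to the K-th neighbour
        V = 2.0 * r                            # "volume" of the 1-D sphere
        p[i] = K / (N * V)
    return p

rng = np.random.default_rng(5)
data = rng.normal(0.0, 1.0, size=200)
grid = np.linspace(-3, 3, 7)
print(knn_density_1d(grid, data, K=10))
```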

K-nearest neighbours

The estimate is not a true probability density, since its integral over all x-space diverges.

All of the data points must be stored! (Branch-and-bound search can reduce the cost of finding the nearest neighbours.)

K-nearest neighbour classification rule

The data set contains N_k points in class C_k and N points in total. Draw a hypersphere around x which encompasses K points irrespective of their class; suppose it has volume V and contains K_k points from class C_k.

p(x|C_k) = K_k / (N_k V)    p(x) = K / (N V)    P(C_k) = N_k / N

P(C_k|x) = p(x|C_k) P(C_k) / p(x) = K_k / K

K-nearest neighbour classification rule

Find a hypersphere around x which contains K points, and then assign x to the class having the majority inside the hypersphere.

K = 1 : nearest-neighbour rule

K-nearest neighbour classification rule

Samples that are close in feature space likely belong to the same class.

K = 1 : nearest-neighbour rule

1-NNR
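A compact K-NN classifier sketch (majority vote among the K nearest training points); the toy data, K = 3, and the query point are invented for illustration:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, K=3):
    """Assign x to the class with the majority among its K nearest neighbours."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy 2-D training set: class 0 near the origin, class 1 near (3, 3)
rng = np.random.default_rng(6)
X0 = rng.normal([0, 0], 0.5, size=(20, 2))
X1 = rng.normal([3, 3], 0.5, size=(20, 2))
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 20 + [1] * 20)

print(knn_classify(np.array([2.5, 2.8]), X_train, y_train, K=3))  # -> 1
```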

A measure of the distance between two density functions is the Kullback-Leibler distance, or asymmetric divergence:

L = -∫ p(x) ln( p̃(x) / p(x) ) dx

L ≥ 0, with equality iff the two pdf's are equal.

homework
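A numerical sanity check of the KL distance on a 1-D grid, using two hypothetical Gaussians as the true density p and the model p̃; the parameters are arbitrary:

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def kl_distance(p, p_model, x):
    """L = -integral of p(x) ln(p_model(x)/p(x)) dx, approximated on a grid."""
    ratio = np.log(p_model / p)
    return -np.trapz(p * ratio, x)

x = np.linspace(-10, 10, 4001)
p_true  = gauss(x, 0.0, 1.0)             # true density
p_model = gauss(x, 0.5, 1.5)             # model density
print(kl_distance(p_true, p_model, x))   # > 0
print(kl_distance(p_true, p_true, x))    # ~ 0
```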

SEMI-PARAMETRIC METHODS

Techniques not restricted to specific functional forms, in which the size of the model grows only with the complexity of the problem being solved, and not simply with the size of the data set. They are computationally intensive.

MIXTURE MODEL

Training methods based on ML:
• nonlinear optimization
• re-estimation (EM algorithm)
• stochastic sequential estimation

MIXTURE DISTRIBUTION

p(x) = Σ_{j=1}^{M} p(x|j) P(j)

The mixing parameters P(j) satisfy Σ_j P(j) = 1 and 0 ≤ P(j) ≤ 1; P(j) is the prior probability of the data point having been generated from component j of the mixture.

To generate a data point from the pdf, one of the components j is first selected at random with probability P(j), and then a data point is generated from the corresponding component density p(x|j).

A mixture can approximate any CONTINUOUS density to arbitrary accuracy, provided the model has a sufficiently large number of components and the parameters of the model are chosen correctly.

The TS is incomplete data (no component label is attached to each point). The corresponding posterior probability follows from Bayes' theorem: P(j|x) = p(x|j) P(j) / p(x), with Σ_j P(j|x) = 1.

Spherical Gaussian component densities:

p(x|j) = (2π σ_j²)^{-d/2} exp( -||x - μ_j||² / (2σ_j²) )
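A small sketch that generates data from, and evaluates, a spherical Gaussian mixture as described above; the component parameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical 2-component spherical Gaussian mixture in d = 2
P = np.array([0.3, 0.7])                      # mixing parameters P(j)
mus = np.array([[0.0, 0.0], [4.0, 1.0]])      # component means
sigmas = np.array([1.0, 0.5])                 # component std deviations

def sample_mixture(n):
    """First pick component j with probability P(j), then draw from p(x|j)."""
    js = rng.choice(len(P), size=n, p=P)
    return mus[js] + sigmas[js, None] * rng.standard_normal((n, 2)), js

def mixture_pdf(x):
    """p(x) = sum_j p(x|j) P(j) with spherical Gaussian components."""
    d = x.shape[-1]
    sq = np.sum((x[:, None, :] - mus[None, :, :]) ** 2, axis=-1)
    comp = np.exp(-sq / (2 * sigmas**2)) / (2 * np.pi * sigmas**2) ** (d / 2)
    return comp @ P

X, labels = sample_mixture(1000)
print(mixture_pdf(X[:3]))
```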

MAXIMUM LIKELIHOOD

Adjustable parameters: P(j), μ_j, σ_j for j = 1, …, M.

Problems:
• singular solutions (the likelihood goes to infinity): one of the Gaussian components collapses onto one of the data points
• local minima

Possible solutions:
• constrain the components to have equal variance
• impose a minimum (underflow) threshold on the variance

To keep the mixing parameters within range during optimization, they can be expressed through a softmax, or normalized exponential: P(j) = exp(γ_j) / Σ_{k=1}^{M} exp(γ_k).

Expressions for the parameters at a minimum of E

μ̂_j = Σ_n P(j|x^n) x^n / Σ_n P(j|x^n) : the mean of the data vectors, weighted by the posterior probabilities that the corresponding data points were generated from that component.

σ̂_j² = (1/d) Σ_n P(j|x^n) ||x^n - μ̂_j||² / Σ_n P(j|x^n) : the variance of the data with respect to the mean of that component, again weighted by the posterior probabilities.

P̂(j) = (1/N) Σ_n P(j|x^n) : the posterior probabilities for that component, averaged over the data set.

These are highly non-linear coupled equations: the parameters enter implicitly through the posteriors P(j|x^n), so they must be solved iteratively.

Expectation-maximization (EM) algorithm

The error function decreases at each iteration until a local minimum is found.

Re-estimation formulas (new parameters computed from the posteriors evaluated with the old parameters):

P^new(j) = (1/N) Σ_n P^old(j|x^n)
μ_j^new = Σ_n P^old(j|x^n) x^n / Σ_n P^old(j|x^n)
(σ_j^new)² = (1/d) Σ_n P^old(j|x^n) ||x^n - μ_j^new||² / Σ_n P^old(j|x^n)
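A compact EM sketch for a spherical Gaussian mixture, implementing the re-estimation formulas above; the synthetic data, M = 2 components, and the initialization are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(8)

# Synthetic 2-D data from two clusters (illustrative)
X = np.vstack([rng.normal([0, 0], 1.0, size=(300, 2)),
               rng.normal([5, 2], 0.7, size=(300, 2))])
N, d = X.shape
M = 2

# Initialization
P = np.full(M, 1.0 / M)
mu = X[rng.choice(N, M, replace=False)].copy()
sig2 = np.full(M, X.var())

for _ in range(50):
    # E-step: posteriors P_old(j|x^n) with the current ("old") parameters
    sq = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=-1)   # (N, M)
    comp = np.exp(-sq / (2 * sig2)) / (2 * np.pi * sig2) ** (d / 2)
    post = comp * P
    post /= post.sum(axis=1, keepdims=True)

    # M-step: re-estimation formulas for the "new" parameters
    Nj = post.sum(axis=0)                       # sum_n P(j|x^n)
    P = Nj / N
    mu = (post.T @ X) / Nj[:, None]
    sq_new = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=-1)
    sig2 = (post * sq_new).sum(axis=0) / (d * Nj)

print(P, mu, sig2, sep="\n")   # should recover the two clusters
```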

proof

Given a set of non-negative numbers λ_j that sum to one, Jensen's inequality states that ln( Σ_j λ_j x_j ) ≥ Σ_j λ_j ln x_j.

Applying it to the change in the error gives an upper bound Q with E^new - E^old ≤ Q. Minimizing Q leads to a decrease in the value of E^new, unless E^new is already at a local minimum.

Gaussian mixture model: minimizing Q with respect to the new parameters (subject to Σ_j P^new(j) = 1) yields the re-estimation formulas given above.

end proof

example

EM algorithm:
• 1000 data points
• uniform distribution
• seven components

[Figure: the fitted mixture after 20 cycles of EM.]

[Figure: contours of constant probability density, p(x) = Σ_{k=1}^{c} p(x|C_k) P(C_k).]

Why expectation-maximization?

Hypothetical complete data set: for each x^n, introduce z^n, an integer in the range (1, M) specifying which component of the mixture generated x^n. The corresponding complete-data error function is E^comp = -Σ_n ln[ P(z^n) p(x^n|z^n) ]. The distribution of the z^n is unknown.

Why expectation-maximization?

First we guess some values for the parameters of the mixture model (the old parameter values) and then we use these, together with Bayes’ theorem, to find the probability distribution of the {zn}. We then compute the expectation of Ecomp w.r.t. this distribution. This is the E-step of the EM algorithm. The new parameter values are then found by minimizing this expected error w.r.t. the parameters. This is the maximization or M-step of the EM algorithm (min E = ML).

Why expectation-maximization?

P^old(z^n|x^n) is the probability for z^n given the value of x^n and the old parameter values; the probability distribution for the whole set {z^n} is ∏_n P^old(z^n|x^n). Thus the expectation of E^comp over the {z^n} values is

E[E^comp] = -Σ_n Σ_{j=1}^{M} P^old(j|x^n) ln[ P^new(j) p^new(x^n|j) ]

homework: show that this expectation is equal to Q̃, which differs from Q only by terms independent of the new parameters, so the M-step of EM minimizes the same quantity.

Stochastic estimation of parameters

An exact sequential version of the re-estimation formulas requires the storage of all previous data points; a stochastic update in the style of Robbins-Monro processes each new data point and then discards it.

No singular solutions arise in on-line problems.
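For illustration only, a minimal on-line update sketch for the component means of a spherical Gaussian mixture; the decaying per-component step size and the exact update form are assumptions in the spirit of Robbins-Monro, not necessarily the precise formulas from the slides:

```python
import numpy as np

rng = np.random.default_rng(9)

# Fixed illustrative mixture settings (only the means are adapted here)
P = np.array([0.5, 0.5])
sig2 = np.array([1.0, 1.0])
mu = np.array([[-1.0, 0.0], [1.0, 0.0]])      # rough initial means
counts = np.ones(2)                           # per-component effective counts

def posteriors(x):
    sq = np.sum((x - mu) ** 2, axis=1)
    comp = np.exp(-sq / (2 * sig2)) / (2 * np.pi * sig2)   # d = 2
    w = comp * P
    return w / w.sum()

# Stream of data points from two well-separated clusters (illustrative)
for _ in range(5000):
    j_true = rng.integers(2)
    x = rng.normal([-4.0, 0.0] if j_true == 0 else [4.0, 0.0], 1.0)
    r = posteriors(x)                         # responsibilities P(j|x)
    counts += r
    eta = r / counts                          # decaying, Robbins-Monro-style steps
    mu += eta[:, None] * (x - mu)             # nudge each mean toward x

print(mu)    # means drift toward the cluster centres (~[-4,0] and [4,0])
```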
