
Machine Learning for Data Mining
Maximum A Posteriori (MAP)

Andres Mendez-Vazquez

June 23, 2015

Outline

1 Introduction
  A first solution
  Example
  Properties of the MAP

2 Example of Application of MAP and EM
  Example
  Linear Regression
  The Gaussian Noise
  A Hierarchical-Bayes View of the Laplacian Prior
  Sparse Regression via EM
  Jeffrey's Prior

Introduction

We go back to Bayes' rule

p(\Theta|X) = \frac{p(X|\Theta)\, p(\Theta)}{p(X)}    (1)

We now seek the value of Θ, called Θ̂_MAP, that maximizes the posterior p(Θ|X).


Development of the solution

We look for Θ̂_MAP, the maximizer of the posterior

\hat{\Theta}_{MAP} = \arg\max_{\Theta} p(\Theta|X)
                   = \arg\max_{\Theta} \frac{p(X|\Theta)\, p(\Theta)}{p(X)}
                   \approx \arg\max_{\Theta} p(X|\Theta)\, p(\Theta)
                   = \arg\max_{\Theta} \prod_{x_i \in X} p(x_i|\Theta)\, p(\Theta)

p(X) can be removed because it has no functional relation with Θ.

We can make this easier

Use logarithms

\hat{\Theta}_{MAP} = \arg\max_{\Theta} \left[ \sum_{x_i \in X} \log p(x_i|\Theta) + \log p(\Theta) \right]    (2)


What Does the MAP Estimate Get Us?

Something Notable
The MAP estimate allows us to inject into the estimation calculation our prior beliefs regarding the parameter values in Θ.

For example
Let us conduct N independent trials of the following Bernoulli experiment with parameter q:

- We ask each individual we run into in the hallway whether they will vote PRI or PAN in the next presidential election.

Each individual votes PRI with probability q, and the value of each x_i is either PRI or PAN.

First, the ML estimate

Samples

X = \{ x_i \mid x_i \in \{PAN, PRI\},\ i = 1, \ldots, N \}    (3)

The log-likelihood function

\log P(X|q) = \sum_{i=1}^{N} \log p(x_i|q)
            = \sum_{i:\, x_i = PRI} \log q + \sum_{i:\, x_i = PAN} \log(1-q)
            = n_{PRI}\log q + (N - n_{PRI})\log(1-q)

where n_{PRI} is the number of individuals who are planning to vote PRI this fall.

We use our classic tricks

By setting

L = \log P(X|q)    (4)

we require

\frac{\partial L}{\partial q} = 0    (5)

Thus

\frac{n_{PRI}}{q} - \frac{N - n_{PRI}}{1-q} = 0    (6)

Final Solution of ML

We get

\hat{q}_{PRI} = \frac{n_{PRI}}{N}    (7)

Thus
If N = 20 and 12 individuals say they will vote PRI, we get q̂_PRI = 0.6.

Building the MAP estimate

Obviously, we need a prior belief distribution
We have the following constraints:

- The prior for q must be zero outside the [0, 1] interval.
- Within the [0, 1] interval, we are free to specify our beliefs in any way we wish.
- In most cases, we would want to choose a prior distribution that peaks somewhere in the [0, 1] interval.

We assume the following
The state of Colima has traditionally voted PRI in presidential elections. However, on account of the prevailing economic conditions, the voters are more likely to vote PAN in the election in question.

What prior distribution can we use?

We could use a Beta distribution, parametrized by two values α and β

P(q) = \frac{1}{B(\alpha, \beta)}\, q^{\alpha-1} (1-q)^{\beta-1}    (8)

Where
B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} is the beta function, and Γ is the generalization of the factorial to the real numbers.

Properties
When α, β > 1, the beta distribution has its mode (maximum value) at

\frac{\alpha - 1}{\alpha + \beta - 2}    (9)

We then do the following

We do the following
We choose α = β so that the beta prior peaks at 0.5.

As a further expression of our belief
We make the choice α = β = 5.

Why? Look at the variance of the beta distribution

\mathrm{Var}(q) = \frac{\alpha\beta}{(\alpha+\beta)^2 (\alpha+\beta+1)}    (10)

Thus, we have the following nice properties

The variance with α = β = 5
Var(q) ≈ 0.023

Thus, the standard deviation
sd ≈ 0.15, which is a nice dispersion around the peak!!!
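As a quick numerical check of these numbers, here is a small sketch assuming SciPy is available (`alpha_p` and `beta_p` are just local names for α and β):

```python
# Numerical check of the Beta(5, 5) prior used above.
from scipy.stats import beta

alpha_p, beta_p = 5, 5
prior = beta(alpha_p, beta_p)

mode = (alpha_p - 1) / (alpha_p + beta_p - 2)   # (alpha - 1) / (alpha + beta - 2)
print(f"mode     = {mode:.3f}")                  # 0.500
print(f"variance = {prior.var():.4f}")           # ~0.0227
print(f"std dev  = {prior.std():.3f}")           # ~0.151
```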

Now, our MAP estimate for q...

We have then

\hat{q}_{MAP} = \arg\max_{q} \left[ \sum_{x_i \in X} \log p(x_i|q) + \log p(q) \right]    (11)

Plugging in the log-likelihood

\hat{q}_{MAP} = \arg\max_{q} \left[ n_{PRI}\log q + (N - n_{PRI})\log(1-q) + \log p(q) \right]    (12)

Where

\log p(q) = \log\left( \frac{1}{B(\alpha, \beta)}\, q^{\alpha-1}(1-q)^{\beta-1} \right)    (13)

The log of p(q)

We have that

\log p(q) = (\alpha - 1)\log q + (\beta - 1)\log(1-q) - \log B(\alpha, \beta)    (14)

Now, taking the derivative with respect to q and setting it to zero, we get

\frac{n_{PRI}}{q} - \frac{N - n_{PRI}}{1-q} - \frac{\beta - 1}{1-q} + \frac{\alpha - 1}{q} = 0    (15)

Thus

\hat{q}_{MAP} = \frac{n_{PRI} + \alpha - 1}{N + \alpha + \beta - 2}    (16)

Now

With N = 20, n_PRI = 12, and α = β = 5

\hat{q}_{MAP} = \frac{12 + 4}{20 + 8} \approx 0.571
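A minimal numerical sketch of this running example (plain Python; the function names `q_ml` and `q_map` are my own):

```python
def q_ml(n_pri, n_total):
    """Maximum-likelihood estimate of the Bernoulli parameter q (Eq. 7)."""
    return n_pri / n_total

def q_map(n_pri, n_total, alpha=5.0, beta=5.0):
    """MAP estimate under a Beta(alpha, beta) prior (Eq. 16)."""
    return (n_pri + alpha - 1.0) / (n_total + alpha + beta - 2.0)

N, n_pri = 20, 12
print(q_ml(n_pri, N))    # 0.6
print(q_map(n_pri, N))   # 0.571..., pulled toward the prior mode 0.5
```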


Properties

First
MAP estimation "pulls" the estimate toward the prior.

Second
The more focused our prior belief, the larger the pull toward the prior.

Example
If α = β is set to a large value, the MAP estimate moves closer to the prior.

Properties

Third
In the expression we derived for q̂_MAP, the parameters α and β play a "smoothing" role vis-à-vis the measurement n_PRI.

Fourth
Since we refer to q as the parameter to be estimated, we refer to α and β as the hyper-parameters of the estimation calculations.

Beyond simple differentiation

In the previous technique
We took the logarithm of the likelihood × the prior to obtain a function that can be differentiated in order to obtain each of the parameters to be estimated.

What if we cannot differentiate?
For example, when the objective contains terms like |θ_i|.

We can try the following
EM + MAP to estimate the sought parameters.


Example

This application comes from
"Adaptive Sparseness for Supervised Learning" by Mário A. T. Figueiredo

In
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 9, September 2003


Introduction: Linear Regression with a Gaussian Prior

We consider regression functions that are linear with respect to the parameter vector β

f(x, \beta) = \sum_{i=1}^{k} \beta_i h_i(x) = \beta^T h(x)

Where
h(x) = [h_1(x), \ldots, h_k(x)]^T is a vector of k fixed functions of the input, often called features.

Actually, it can be...

Linear Regression
Linear regression, in which h(x) = [1, x_1, \ldots, x_d]^T; in this case, k = d + 1.

Non-Linear Regression
Here, you have fixed basis functions, where h(x) = [\phi_1(x), \phi_2(x), \ldots, \phi_k(x)]^T with \phi_1(x) = 1.

Kernel Regression
Here h(x) = [1, K(x, x_1), K(x, x_2), \ldots, K(x, x_n)]^T, where K(x, x_i) is some kernel function.


Gaussian Noise

We assume that the training set is contaminated by additive white Gaussian noise

y_i = f(x_i, \beta) + \omega_i    (17)

for i = 1, ..., N, where [ω_1, ..., ω_N] is a set of independent zero-mean Gaussian samples with variance σ².

Thus, for y = [y_1, ..., y_N], we have the following likelihood

p(y|\beta) = N(H\beta, \sigma^2 I)    (18)

Under the zero-mean Gaussian prior on β (here with unit variance), the posterior p(β|y) is still Gaussian and its mode is given by

\hat{\beta} = \left(\sigma^2 I + H^T H\right)^{-1} H^T y    (19)

Remark: this is ridge regression.
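A minimal sketch of the posterior-mode computation in Eq. (19), assuming NumPy and a small synthetic data set (the design matrix `H`, the noise level, and `beta_true` are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 50, 5
H = rng.normal(size=(N, k))                      # design matrix of features h(x_i)
beta_true = np.array([1.5, 0.0, -2.0, 0.0, 0.7])
sigma2 = 0.25
y = H @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=N)

# Posterior mode under the zero-mean, unit-variance Gaussian prior (Eq. 19):
# beta_hat = (sigma^2 I + H^T H)^{-1} H^T y
beta_hat = np.linalg.solve(sigma2 * np.eye(k) + H.T @ H, H.T @ y)
print(beta_hat)
```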

Regression with a Laplacian Prior

In order to favor sparse estimates

p(\beta|\alpha) = \prod_{i=1}^{k} \frac{\alpha}{2}\exp\{-\alpha|\beta_i|\} = \left(\frac{\alpha}{2}\right)^k \exp\{-\alpha\|\beta\|_1\}    (20)

Thus, the MAP estimate of β looks like

\hat{\beta} = \arg\min_{\beta} \left\{ \|H\beta - y\|_2^2 + 2\sigma^2\alpha\|\beta\|_1 \right\}    (21)

Remark

This criterion is known as the LASSO
The l1 norm induces sparsity in the weight terms.

How?
For example, \|[1, 0]^T\|_2 = \|[1/\sqrt{2}, 1/\sqrt{2}]^T\|_2 = 1.

In the other case, \|[1, 0]^T\|_1 = 1 < \|[1/\sqrt{2}, 1/\sqrt{2}]^T\|_1 = \sqrt{2}.

That is, the l2 norm does not distinguish the sparse vector from the dense one of the same length, while the l1 norm penalizes the dense one more heavily.
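A quick check of these norms, assuming NumPy:

```python
import numpy as np

sparse = np.array([1.0, 0.0])
dense = np.array([1.0, 1.0]) / np.sqrt(2.0)

# Both vectors have the same Euclidean length...
print(np.linalg.norm(sparse, 2), np.linalg.norm(dense, 2))   # 1.0 1.0
# ...but the l1 norm is larger for the dense one, so an l1 penalty favors sparsity.
print(np.linalg.norm(sparse, 1), np.linalg.norm(dense, 1))   # 1.0 1.414...
```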

An example

What if H is an orthogonal matrix?
In this case H^T H = I.

Thus

\hat{\beta} = \arg\min_{\beta} \left\{ \|H\beta - y\|_2^2 + 2\sigma^2\alpha\|\beta\|_1 \right\}
            = \arg\min_{\beta} \left\{ (H\beta - y)^T(H\beta - y) + 2\sigma^2\alpha\sum_{i=1}^{k}|\beta_i| \right\}
            = \arg\min_{\beta} \left\{ \beta^T H^T H\beta - 2\beta^T H^T y + y^T y + 2\sigma^2\alpha\sum_{i=1}^{k}|\beta_i| \right\}
            = \arg\min_{\beta} \left\{ \beta^T\beta - 2\beta^T H^T y + y^T y + 2\sigma^2\alpha\sum_{i=1}^{k}|\beta_i| \right\}

We can solve this last part as follows

We can group the terms for each β_i

\beta_i^2 - 2\beta_i (H^T y)_i + 2\sigma^2\alpha|\beta_i| + y_i^2    (22)

If we minimize each group separately, we obtain the full solution

\hat{\beta}_i = \arg\min_{\beta_i} \left\{ \beta_i^2 - 2\beta_i (H^T y)_i + 2\sigma^2\alpha|\beta_i| \right\}    (23)

We have two cases
- β_i > 0
- β_i < 0

If β_i > 0

We differentiate with respect to β_i

\frac{\partial\left(\beta_i^2 - 2\beta_i (H^T y)_i + 2\sigma^2\alpha\beta_i\right)}{\partial\beta_i} = 2\beta_i - 2(H^T y)_i + 2\sigma^2\alpha

Setting this to zero, we have

\hat{\beta}_i = (H^T y)_i - \sigma^2\alpha    (24)

If β_i < 0

We differentiate with respect to β_i

\frac{\partial\left(\beta_i^2 - 2\beta_i (H^T y)_i - 2\sigma^2\alpha\beta_i\right)}{\partial\beta_i} = 2\beta_i - 2(H^T y)_i - 2\sigma^2\alpha

Setting this to zero, we have

\hat{\beta}_i = (H^T y)_i + \sigma^2\alpha    (25)

The value of (H^T y)_i

We have that:

- if β_i > 0, then (H^T y)_i > σ²α
- if β_i < 0, then (H^T y)_i < −σ²α

We can put all this together

A compact version

\hat{\beta}_i = \mathrm{sgn}\left((H^T y)_i\right)\left( \left|(H^T y)_i\right| - \sigma^2\alpha \right)_+    (26)

With

(a)_+ = \begin{cases} a & \text{if } a \ge 0 \\ 0 & \text{if } a < 0 \end{cases}

and "sgn" is the sign function.

This rule is known as
The soft threshold!!!
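A small sketch of the soft-threshold rule in Eq. (26), using NumPy; the array `z` stands for the correlations (H^T y)_i and `t` for the threshold σ²α, both chosen arbitrarily here:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-threshold rule (Eq. 26): sgn(z) * max(|z| - t, 0), applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

z = np.array([-3.0, -0.5, 0.0, 0.4, 2.5])   # stand-ins for the values (H^T y)_i
print(soft_threshold(z, t=1.0))             # [-2.  -0.   0.   0.   1.5]
```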

Example

We have the dotted line as the soft threshold:

[Figure: the soft-threshold rule β̂_i as a function of a = (H^T y)_i, plotted over the range −3 to 3.]


Now

Given that each β_i has a zero-mean Gaussian prior

p(\beta_i|\tau_i) = N(\beta_i|0, \tau_i)    (27)

where τ_i has the following exponential hyperprior

p(\tau_i|\gamma) = \frac{\gamma}{2}\exp\left\{-\frac{\gamma}{2}\tau_i\right\} \quad \text{for } \tau_i \ge 0    (28)

This is a tough integral, so we take the result on faith

p(\beta_i|\gamma) = \int_0^{\infty} p(\beta_i|\tau_i)\, p(\tau_i|\gamma)\, d\tau_i = \frac{\sqrt{\gamma}}{2}\exp\{-\sqrt{\gamma}|\beta_i|\}    (29)
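Equation (29) says that the Laplacian prior is a scale mixture of Gaussians with an exponential mixing density. A quick numerical check of that identity, sketched with SciPy quadrature (the value of γ and the grid of β values are arbitrary):

```python
import numpy as np
from scipy import integrate, stats

gamma = 4.0

def marginal(beta_i):
    """Integrate N(beta_i | 0, tau) * (gamma/2) exp(-gamma*tau/2) over tau (Eq. 29)."""
    integrand = lambda tau: stats.norm.pdf(beta_i, loc=0.0, scale=np.sqrt(tau)) \
                            * (gamma / 2.0) * np.exp(-gamma * tau / 2.0)
    value, _ = integrate.quad(integrand, 0.0, np.inf)
    return value

for b in [0.25, 0.5, 1.0, 2.0]:
    laplacian = (np.sqrt(gamma) / 2.0) * np.exp(-np.sqrt(gamma) * abs(b))
    print(b, marginal(b), laplacian)   # the two columns should agree
```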

Example

The double exponential

[Figure: the double-exponential (Laplacian) density p(β_i|γ).]

This is equivalent to the use of the L1 norm for regularization

[Figure: L1 (left) is better for sparsity promotion than L2 (right).]


The EM trick

How do we do this?
This is done by regarding τ = [τ_1, ..., τ_k] as the hidden/missing data.

Then, if we could observe τ, the complete log-posterior log p(β, σ²|y, τ) could be easily calculated

p(\beta, \sigma^2|y, \tau) \propto p(y|\beta, \sigma^2)\, p(\beta|\tau)\, p(\sigma^2)    (30)

Where
- p(y|β, σ²) ∼ N(Hβ, σ²I)
- p(β_i|τ_i) ∼ N(0, τ_i)

What about p(σ²)?

We select
p(σ²) as a constant.

However
We could adopt a conjugate inverse-Gamma prior for σ², but for a large number of samples the effect of the prior on the estimate of σ² is very small.

In the constant case
We can use the MAP idea; however, we have hidden parameters, so we resort to EM.

E-step

Computes the expected value of the complete log-posterior

Q\left(\beta, \sigma^2|\hat{\beta}_{(t)}, \hat{\sigma}^2_{(t)}\right) = \int \log p(\beta, \sigma^2|y, \tau)\, p\left(\tau|\hat{\beta}_{(t)}, \hat{\sigma}^2_{(t)}, y\right) d\tau    (31)

M-step

Updates the parameter estimates by maximizing the Q-function

\left(\hat{\beta}_{(t+1)}, \hat{\sigma}^2_{(t+1)}\right) = \arg\max_{\beta, \sigma^2} Q\left(\beta, \sigma^2|\hat{\beta}_{(t)}, \hat{\sigma}^2_{(t)}\right)    (32)

Remark

First
The EM algorithm converges to a local maximum of the a posteriori probability density function

p(\beta, \sigma^2|y) \propto p(y|\beta, \sigma^2)\, p(\beta|\gamma)    (33)

without using the marginal prior p(β|γ), which is not Gaussian
Instead, we use the conditional Gaussian prior p(β|τ).

The Final Model

We have

p(y|\beta, \sigma^2) = N(H\beta, \sigma^2 I)

p(\sigma^2) \propto \text{constant}

p(\beta|\tau) = \prod_{i=1}^{k} N(\beta_i|0, \tau_i) = N\left(\beta|0, (\Upsilon(\tau))^{-1}\right)

p(\tau|\gamma) = \left(\frac{\gamma}{2}\right)^{k} \prod_{i=1}^{k} \exp\left\{-\frac{\gamma}{2}\tau_i\right\}

where Υ(τ) = diag(τ_1^{-1}, ..., τ_k^{-1}) is the diagonal matrix with the inverse variances of all the β_i's.

Now, we find the Q function

First

\log p(\beta, \sigma^2|y, \tau) \propto \log p(y|\beta, \sigma^2) + \log p(\beta|\tau)
                                \propto -n\log\sigma^2 - \frac{\|y - H\beta\|_2^2}{\sigma^2} - \beta^T\Upsilon(\tau)\beta

How can we get this?
Remember

N(y|\mu, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(y-\mu)^T\Sigma^{-1}(y-\mu)\right\}    (34)

Volunteers?
Please, to the blackboard.

Thus

Second
Did you notice that the term βᵀΥ(τ)β is linear with respect to Υ(τ) and that the other terms do not depend on τ?

Thus, the E-step is reduced to the computation of

V_{(t)} = E\left(\Upsilon(\tau)|y, \hat{\beta}_{(t)}, \hat{\sigma}^2_{(t)}\right) = \mathrm{diag}\left(E\left[\tau_1^{-1}|y, \hat{\beta}_{(t)}, \hat{\sigma}^2_{(t)}\right], \ldots, E\left[\tau_k^{-1}|y, \hat{\beta}_{(t)}, \hat{\sigma}^2_{(t)}\right]\right)

Now

What do we need to calculate each of these expectations?

p\left(\tau_i|y, \hat{\beta}_{(t)}, \hat{\sigma}^2_{(t)}\right) = p\left(\tau_i|y, \hat{\beta}_{i,(t)}, \hat{\sigma}^2_{(t)}\right)    (35)

Then

p\left(\tau_i|y, \hat{\beta}_{i,(t)}, \hat{\sigma}^2_{(t)}\right)
    = \frac{p\left(\tau_i, \hat{\beta}_{i,(t)}|y, \hat{\sigma}^2_{(t)}\right)}{p\left(\hat{\beta}_{i,(t)}|y, \hat{\sigma}^2_{(t)}\right)}
    \propto p\left(\hat{\beta}_{i,(t)}|\tau_i, y, \hat{\sigma}^2_{(t)}\right) p\left(\tau_i|y, \hat{\sigma}^2_{(t)}\right)
    = p\left(\hat{\beta}_{i,(t)}|\tau_i\right) p(\tau_i)

Here is the interesting part

We have the following probability density

p\left(\tau_i|y, \hat{\beta}_{i,(t)}, \hat{\sigma}^2_{(t)}\right) = \frac{N\left(\hat{\beta}_{i,(t)}|0, \tau_i\right)\frac{\gamma}{2}\exp\left\{-\frac{\gamma}{2}\tau_i\right\}}{\int_0^{\infty} N\left(\hat{\beta}_{i,(t)}|0, \tau_i\right)\frac{\gamma}{2}\exp\left\{-\frac{\gamma}{2}\tau_i\right\} d\tau_i}    (36)

Then

E\left[\tau_i^{-1}|y, \hat{\beta}_{i,(t)}, \hat{\sigma}^2_{(t)}\right] = \frac{\int_0^{\infty}\frac{1}{\tau_i} N\left(\hat{\beta}_{i,(t)}|0, \tau_i\right)\frac{\gamma}{2}\exp\left\{-\frac{\gamma}{2}\tau_i\right\} d\tau_i}{\int_0^{\infty} N\left(\hat{\beta}_{i,(t)}|0, \tau_i\right)\frac{\gamma}{2}\exp\left\{-\frac{\gamma}{2}\tau_i\right\} d\tau_i}    (37)

Now
I leave it to you to prove the result (it can come up in the test).

Here is the interesting part

Thus

E\left[\tau_i^{-1}|y, \hat{\beta}_{i,(t)}, \hat{\sigma}^2_{(t)}\right] = \frac{\sqrt{\gamma}}{\left|\hat{\beta}_{i,(t)}\right|}    (38)

Finally

V_{(t)} = \sqrt{\gamma}\, \mathrm{diag}\left(\left|\hat{\beta}_{1,(t)}\right|^{-1}, \ldots, \left|\hat{\beta}_{k,(t)}\right|^{-1}\right)    (39)

The Final Q function

Something Notable

Q\left(\beta, \sigma^2|\hat{\beta}_{(t)}, \hat{\sigma}^2_{(t)}\right)
    = \int \log p(\beta, \sigma^2|y, \tau)\, p\left(\tau|\hat{\beta}_{(t)}, \hat{\sigma}^2_{(t)}, y\right) d\tau
    = \int \left[-n\log\sigma^2 - \frac{\|y - H\beta\|_2^2}{\sigma^2} - \beta^T\Upsilon(\tau)\beta\right] p\left(\tau|\hat{\beta}_{(t)}, \hat{\sigma}^2_{(t)}, y\right) d\tau
    = -n\log\sigma^2 - \frac{\|y - H\beta\|_2^2}{\sigma^2} - \beta^T\left[\int \Upsilon(\tau)\, p\left(\tau|\hat{\beta}_{(t)}, \hat{\sigma}^2_{(t)}, y\right) d\tau\right]\beta
    = -n\log\sigma^2 - \frac{\|y - H\beta\|_2^2}{\sigma^2} - \beta^T V_{(t)}\beta

Finally, the M-step

First

\hat{\sigma}^2_{(t+1)} = \arg\max_{\sigma^2}\left\{-n\log\sigma^2 - \frac{\|y - H\beta\|_2^2}{\sigma^2}\right\} = \frac{\|y - H\beta\|_2^2}{n}

Second

\hat{\beta}_{(t+1)} = \arg\max_{\beta}\left\{-\frac{\|y - H\beta\|_2^2}{\sigma^2} - \beta^T V_{(t)}\beta\right\} = \left(\hat{\sigma}^2_{(t+1)} V_{(t)} + H^T H\right)^{-1} H^T y

This also I leave to you
It can come up in the test.


However

We need to deal in some way with the γ term
It controls the degree of sparseness!!!

We can do so by assuming a Jeffrey's prior
J. Berger, Statistical Decision Theory and Bayesian Analysis. New York: Springer-Verlag, 1980.

We use instead

p(\tau) \propto \frac{1}{\tau}    (40)

Properties of the Jeffrey's Prior

Important
This prior expresses ignorance with respect to scale and is parameter-free.

Why scale invariant?
Imagine we change the scale of τ by τ' = Kτ, where K is a constant expressing that change.

Thus, we have that

p(\tau') = \frac{1}{\tau'} = \frac{1}{K\tau} \propto \frac{1}{\tau}    (41)

However

Something Notable
This prior is known as an improper prior.

In addition
This prior does not lead to a Laplacian prior on β.

Nevertheless
This prior induces sparseness and yields good performance for the estimates of β.

Introducing this prior into the equations

The matrix V_(t) is now

V_{(t)} = \mathrm{diag}\left(\left|\hat{\beta}_{1,(t)}\right|^{-2}, \ldots, \left|\hat{\beta}_{k,(t)}\right|^{-2}\right)    (42)

Quite interesting!!!
We no longer have the free γ parameter.

Here, we can see the new threshold

The dotted line is the soft threshold, the dashed line the hard threshold, and the solid blue line the Jeffrey's-prior rule:

[Figure: the three thresholding rules as functions of a = (H^T y)_i, plotted over the range −3 to 3.]

Observations

The new rule is between
- The soft threshold rule.
- The hard threshold rule.

Something Notable
For large values of (H^T y)_i, the new rule approaches the hard threshold.

Once (H^T y)_i gets smaller
The estimate becomes progressively smaller, approaching the behavior of the soft rule.

Finally, an implementation detail

Since several elements of β̂ will go to zero

V_{(t)} = \mathrm{diag}\left(\left|\hat{\beta}_{1,(t)}\right|^{-2}, \ldots, \left|\hat{\beta}_{k,(t)}\right|^{-2}\right)

will have several elements going to very large numbers.

Something Notable
If we define U_{(t)} = \mathrm{diag}\left(\left|\hat{\beta}_{1,(t)}\right|, \ldots, \left|\hat{\beta}_{k,(t)}\right|\right), then we have that

V_{(t)} = U_{(t)}^{-1} U_{(t)}^{-1}    (43)

Thus

We have that

\hat{\beta}_{(t+1)} = \left(\hat{\sigma}^2_{(t+1)} V_{(t)} + H^T H\right)^{-1} H^T y
    = \left(\hat{\sigma}^2_{(t+1)} U_{(t)}^{-1} U_{(t)}^{-1} + H^T H\right)^{-1} H^T y
    = \left(\hat{\sigma}^2_{(t+1)} U_{(t)}^{-1}\, I\, U_{(t)}^{-1} + I\, H^T H\, I\right)^{-1} H^T y
    = \left(\hat{\sigma}^2_{(t+1)} U_{(t)}^{-1}\, I\, U_{(t)}^{-1} + U_{(t)}^{-1} U_{(t)} H^T H U_{(t)} U_{(t)}^{-1}\right)^{-1} H^T y
    = U_{(t)}\left(\hat{\sigma}^2_{(t+1)} I + U_{(t)} H^T H U_{(t)}\right)^{-1} U_{(t)} H^T y
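Putting the pieces together, here is a compact sketch of the resulting EM iteration with the Jeffrey's prior, using the U-form update above (my own reading of Eqs. (42)-(43) and this derivation, not the paper's reference implementation; the synthetic data, the fixed iteration count, and the pruning tolerance are arbitrary choices):

```python
import numpy as np

def sparse_regression_em(H, y, n_iter=50, tol=1e-6):
    """EM sketch for sparse regression with the Jeffrey's hyperprior.

    E-step:  U_(t) = diag(|beta_1|, ..., |beta_k|), so V_(t) = U^-1 U^-1 (Eqs. 42-43)
    M-step:  sigma2 = ||y - H beta||^2 / n
             beta   = U (sigma2 I + U H^T H U)^-1 U H^T y
    """
    n, k = H.shape
    beta = np.linalg.lstsq(H, y, rcond=None)[0]       # start from least squares
    sigma2 = np.sum((y - H @ beta) ** 2) / n
    for _ in range(n_iter):
        sigma2 = np.sum((y - H @ beta) ** 2) / n      # M-step for sigma^2
        U = np.diag(np.abs(beta))                     # E-step, expressed via U_(t)
        M = sigma2 * np.eye(k) + U @ H.T @ H @ U
        beta = U @ np.linalg.solve(M, U @ (H.T @ y))  # M-step for beta (U-form update)
        beta[np.abs(beta) < tol] = 0.0                # coefficients that collapse stay at zero
    return beta, sigma2

# A toy run with a sparse ground truth (hypothetical data).
rng = np.random.default_rng(1)
n, k = 100, 10
H = rng.normal(size=(n, k))
beta_true = np.zeros(k)
beta_true[[0, 3, 7]] = [2.0, -1.5, 0.75]
y = H @ beta_true + 0.1 * rng.normal(size=n)

beta_hat, sigma2_hat = sparse_regression_em(H, y)
print(np.round(beta_hat, 3))
```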

Advantages!!!

Quite Important
We avoid inverting the elements of β̂_(t).

We can avoid computing the inverse matrix
We simply solve the corresponding linear system, whose dimension is only the number of nonzero elements in U_(t). Why?

Remember, you want to maximize

Q\left(\beta, \sigma^2|\hat{\beta}_{(t)}, \hat{\sigma}^2_{(t)}\right) = -n\log\sigma^2 - \frac{\|y - H\beta\|_2^2}{\sigma^2} - \beta^T V_{(t)}\beta