
Hierarchical Bayesian Methods

Bhaskar D Rao¹

University of California, San Diego

¹Thanks to David Wipf, Jason Palmer, Zhilin Zhang, R. Prasad and Ritwik Giri


Bayesian Methods

1. MAP Estimation Framework (Type I)

2. Hierarchical Bayesian Framework (Type II)


MAP Estimation Framework (Type I)

Problem Statement

\hat{x} = \arg\max_x p(x \mid y) = \arg\max_x p(y \mid x)\, p(x)

Choosing the Laplacian prior p(x) = \frac{a}{2} e^{-a|x|} and a Gaussian likelihood p(y \mid x) leads to the familiar LASSO framework.
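As a concrete illustration of the Type I route, here is a minimal sketch (ours, not from the slides) that solves the resulting LASSO problem \min_x \tfrac{1}{2}\lVert y - Ax\rVert^2 + \lambda\lVert x\rVert_1 with plain ISTA (proximal gradient); the function name, step size, and iteration count are illustrative choices.

```python
import numpy as np

def lasso_ista(A, y, lam, n_iter=500):
    """Type I MAP estimate with a Laplacian prior and Gaussian likelihood,
    i.e. LASSO, solved by ISTA (proximal gradient with soft thresholding)."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)             # gradient of 0.5 * ||y - A x||^2
        z = x - grad / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return x
```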


Hierarchical Bayes (Type II): Sparse Bayesian Learning (SBL)

MAP methods are concerned only with the mode of the posterior p(x|y), whereas SBL uses posterior information beyond the mode, i.e. the full posterior distribution p(x|y).

Problem

For sparse priors it is generally not possible to compute the normalized posterior p(x|y) in closed form, so approximations are needed.


Hierarchical Bayesian Framework (Type II)

Problem Statement

\hat{\gamma} = \arg\max_\gamma p(\gamma \mid y) = \arg\max_\gamma \int p(y \mid x)\, p(x \mid \gamma)\, p(\gamma)\, dx

Using this estimate of γ, one can compute the desired posterior p(x|y; γ̂).


Hierarchical Bayesian Framework (Type II)

Potential Advantages

Averaging over x leads to fewer minima in p(γ|y).

A single γ can be tied to several parameters, reducing the number of parameters to estimate.

Example: Bayesian LASSO

The Laplacian prior p(x) can be represented as a Gaussian scale mixture as follows:

p(x) = \int p(x \mid \gamma)\, p(\gamma)\, d\gamma
     = \int \frac{1}{\sqrt{2\pi\gamma}} \exp\!\left(-\frac{x^2}{2\gamma}\right) \cdot \frac{a^2}{2} \exp\!\left(-\frac{a^2}{2}\gamma\right) d\gamma
     = \frac{a}{2} \exp(-a|x|)
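As a quick numerical sanity check (ours, not from the slides), the snippet below evaluates the scale-mixture integral by quadrature and compares it with the Laplacian density; the value of a is arbitrary.

```python
import numpy as np
from scipy.integrate import quad

a = 2.0

def laplacian(x):
    return 0.5 * a * np.exp(-a * abs(x))

def gsm(x):
    # integrate N(x; 0, gamma) * (a^2/2) exp(-(a^2/2) gamma) over gamma > 0
    integrand = lambda g: (np.exp(-x**2 / (2 * g)) / np.sqrt(2 * np.pi * g)
                           * 0.5 * a**2 * np.exp(-0.5 * a**2 * g))
    value, _ = quad(integrand, 0, np.inf)
    return value

for x in [0.1, 0.5, 1.0, 2.0]:
    print(x, laplacian(x), gsm(x))   # the two columns should agree
```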


Hierarchical Bayes: Sparse Bayesian Learning (SBL)

Instead of solving a MAP problem in x, this framework estimates the hyperparameters γ and then an estimate of the posterior distribution for x, i.e. p(x|y; γ̂) (Sparse Bayesian Learning).


Useful Representation for Sparse priors

In order for this framework to be useful, we need tractable representations: Gaussian Scale Mixtures (GSM).

Gaussian Scale Mixture: model for a random variable X

X = \sqrt{\gamma}\, G, \quad G \sim N(0, 1)

γ is a positive random variable, independent of G.

p(x) = \int p(x \mid \gamma)\, p(\gamma)\, d\gamma = \int N(x; 0, \gamma)\, p(\gamma)\, d\gamma
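A minimal sampling sketch (ours) of this construction, using the exponential mixing density from the Bayesian LASSO example: draw γ from p(γ), then x|γ ~ N(0, γ). The variance of x should match the Laplacian variance 2/a².

```python
import numpy as np

rng = np.random.default_rng(0)
a, n = 2.0, 100_000

# gamma ~ Exp(rate = a^2 / 2), the Laplacian's mixing density
gamma = rng.exponential(scale=2.0 / a**2, size=n)
# x | gamma ~ N(0, gamma), i.e. x = sqrt(gamma) * g with g ~ N(0, 1)
x = np.sqrt(gamma) * rng.standard_normal(n)

print(x.var(), 2.0 / a**2)   # empirical vs. theoretical Laplacian variance
```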


Gaussian Scale Mixtures

Most sparse priors over x (including those with concave g) can be represented in this GSM form; different scale mixing densities p(γᵢ) lead to different sparse priors [Palmer et al., 2006].

Example: Laplacian density

p(x; a) = \frac{a}{2} \exp(-a|x|)

Scale mixing density: p(\gamma) = \frac{a^2}{2} \exp\!\left(-\frac{a^2}{2}\gamma\right), \quad \gamma \ge 0.


Examples of Gaussian Scale Mixtures

Student-t Distribution

p(x; a, b) = \frac{b^a\, \Gamma(a + 1/2)}{(2\pi)^{0.5}\, \Gamma(a)} \cdot \frac{1}{(b + x^2/2)^{a + 1/2}}

Scale mixing density: Inverse Gamma distribution

p(\gamma) = \frac{b^a}{\Gamma(a)}\, \frac{1}{\gamma^{a+1}}\, e^{-b/\gamma}\, u(\gamma)

Generalized Gaussian

p(x; p) = \frac{1}{2\, \Gamma(1 + \frac{1}{p})}\, e^{-|x|^p}

Scale mixing density: Positive alpha stable density of order p/2.

Generalized logistic density

p(x; \alpha) = \frac{\Gamma(2\alpha)}{\Gamma(\alpha)^2}\, \frac{e^{-\alpha x}}{(1 + e^{-x})^{2\alpha}}

Scale mixing density: related to Kolmogorov-Smirnov distance statistics.


Sparse Bayesian Learning (Tipping)

y = Ax + v

Solving for MAP estimate of γ

\hat{\gamma} = \arg\max_\gamma p(\gamma \mid y) = \arg\max_\gamma p(y, \gamma) = \arg\max_\gamma p(y \mid \gamma)\, p(\gamma)

What is p(y|γ)?

Given γ, x is Gaussian with zero mean and covariance matrix Γ = diag(γ), i.e. p(x|γ) = N(x; 0, Γ) = ∏ᵢ N(xᵢ; 0, γᵢ).

Then p(y|γ) = N(y; 0, Σ_y), where Σ_y = σ²I + AΓAᵀ, and

p(y \mid \gamma) = \frac{1}{\sqrt{(2\pi)^N |\Sigma_y|}} \exp\!\left(-\frac{1}{2} y^T \Sigma_y^{-1} y\right)
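A small sketch (function name ours) of the Gaussian evidence that SBL maximizes; it evaluates log p(y|γ) for a given γ and noise variance σ².

```python
import numpy as np

def log_evidence(y, A, gamma, sigma2):
    """log p(y | gamma) for y = A x + v, x ~ N(0, diag(gamma)), v ~ N(0, sigma2 I)."""
    N = len(y)
    Sigma_y = sigma2 * np.eye(N) + A @ np.diag(gamma) @ A.T
    _, logdet = np.linalg.slogdet(Sigma_y)           # stable log-determinant
    quad_term = y @ np.linalg.solve(Sigma_y, y)      # y^T Sigma_y^{-1} y
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad_term)
```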


MAP estimate of γ

\hat{\gamma} = \arg\min_\gamma \left( \log|\Sigma_y| + y^T \Sigma_y^{-1} y - 2\sum_i \log p(\gamma_i) \right)

Computational Methods

Many options exist for solving the above optimization problem, e.g. Majorization-Minimization (MM) and Expectation-Maximization (EM).


Sparse Bayesian Learning

y = Ax + v

Computing Posterior

Because of the convenient GSM choice, the posterior can be easily computed: p(x|y; γ̂) = N(μ_x, Σ_x), where

\mu_x = E[x \mid y; \hat{\gamma}] = \hat{\Gamma} A^T (\sigma^2 I + A \hat{\Gamma} A^T)^{-1} y

\Sigma_x = \mathrm{Cov}[x \mid y; \hat{\gamma}] = \hat{\Gamma} - \hat{\Gamma} A^T (\sigma^2 I + A \hat{\Gamma} A^T)^{-1} A \hat{\Gamma}

µx can be used as a point estimate.

Sparsity of µx is achieved through sparsity in γ.

Another quantity of interest for the EM algorithm:

E[x_i^2 \mid y, \hat{\gamma}] = \mu_x(i)^2 + \Sigma_x(i, i)
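A sketch (naming ours) that computes these posterior moments for a fixed γ̂:

```python
import numpy as np

def sbl_posterior(y, A, gamma, sigma2):
    """Posterior mean and covariance of x given y, for fixed gamma."""
    Gamma = np.diag(gamma)
    Sigma_y = sigma2 * np.eye(len(y)) + A @ Gamma @ A.T
    K = Gamma @ A.T @ np.linalg.inv(Sigma_y)
    mu_x = K @ y                       # posterior mean, usable as a point estimate
    Sigma_x = Gamma - K @ A @ Gamma    # posterior covariance
    return mu_x, Sigma_x
```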


EM algorithm: Updating γ

Treat (y, x) as the complete data, with the vector x as the hidden variable.

\log p(y, x, \gamma) = \log p(y \mid x) + \log p(x \mid \gamma) + \log p(\gamma)

E step

Q(\gamma \mid \gamma^k) = E_{x \mid y; \gamma^k}\big[\log p(y \mid x) + \log p(x \mid \gamma) + \log p(\gamma)\big]

M step

\gamma^{k+1} = \arg\max_\gamma Q(\gamma \mid \gamma^k) = \arg\max_\gamma E_{x \mid y; \gamma^k}\big[\log p(x \mid \gamma) + \log p(\gamma)\big]
            = \arg\min_\gamma E_{x \mid y; \gamma^k}\left[\sum_{i=1}^{M} \left(\frac{x_i^2}{2\gamma_i} + \frac{1}{2}\log\gamma_i\right) - \log p(\gamma)\right]

Solving this optimization problem with a non-informative prior p(γ),

\gamma_i^{k+1} = E[x_i^2 \mid y, \gamma^k] = \mu_x(i)^2 + \Sigma_x(i, i)
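Putting the pieces together, a minimal EM-SBL loop (naming, initialization, and stopping rule are ours); the E-step reuses the posterior moments above and the M-step applies the update just derived.

```python
import numpy as np

def sbl_em(y, A, sigma2, n_iter=200, tol=1e-6):
    """EM iterations for SBL with a non-informative prior on gamma."""
    M = A.shape[1]
    gamma = np.ones(M)
    mu_x = np.zeros(M)
    for _ in range(n_iter):
        # E-step: posterior moments of x for the current gamma
        Gamma = np.diag(gamma)
        Sigma_y = sigma2 * np.eye(len(y)) + A @ Gamma @ A.T
        K = Gamma @ A.T @ np.linalg.inv(Sigma_y)
        mu_x = K @ y
        Sigma_x = Gamma - K @ A @ Gamma
        # M-step: gamma_i <- E[x_i^2 | y, gamma] = mu_x(i)^2 + Sigma_x(i, i)
        gamma_new = mu_x**2 + np.diag(Sigma_x)
        if np.max(np.abs(gamma_new - gamma)) < tol:
            gamma = gamma_new
            break
        gamma = gamma_new
    return mu_x, gamma
```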


SBL properties

Local minima are sparse, i.e. they have at most N nonzero γᵢ.

The cost function p(γ|y) is generally much smoother than the associated MAP estimation objective p(x|y), so there are fewer local minima.

At high signal-to-noise ratio, the global minimum is the sparsest solution; there are no structural problems.

SBL attempts to approximate the posterior distribution p(x|y) in the region with significant mass.


Algorithmic Variants

Fixed-point iteration based on setting the derivative of the objective function to zero (Tipping)

Sequential search for the significant γ's (Tipping and Faul)

Majorization-Minimization based approach (Wipf and Nagarajan)

Reweighted ℓ1 and ℓ2 algorithms (Wipf and Nagarajan)

Approximate Message Passing (Al-Shoukairi and Rao)


Empirical Comparison

For each test case (a code sketch of this setup follows the list):

1. Generate a random dictionary A with 50 rows and 250 columns from the normal distribution and normalize each column to unit 2-norm.

2. Select the support of the true sparse coefficient vector x0 at random.

3. Generate the non-zero components of x0 from the normal distribution.

4. Compute the signal y = Ax0 (noiseless case).

5. Compare SBL with the previous methods with regard to estimating x0.

6. Average over 1000 independent trials.
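A sketch of this experimental setup (the dimensions come from the slide; the sparsity level k and random seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, k = 50, 250, 10                      # rows, columns, number of nonzeros (k illustrative)

A = rng.standard_normal((N, M))
A /= np.linalg.norm(A, axis=0)             # unit 2-norm columns

support = rng.choice(M, size=k, replace=False)   # random support
x0 = np.zeros(M)
x0[support] = rng.standard_normal(k)       # Gaussian nonzero entries

y = A @ x0                                 # noiseless measurements
```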


Empirical Comparison: 1000 trials

Figure: Probability of successful recovery vs. number of nonzero coefficients


Useful Extensions

Multiple Measurement Vectors (MMV)

Block Sparsity

Block MMV

MMV with time varying sparsity


Multiple Measurement Vectors (MMV)

Multiple measurements: L measurements

Common Sparsity Profile: k nonzero rows


Block Sparsity

Variations include equal blocks, unequal blocks, and block boundaries that are known or unknown.


MMV solutions

Greedy Search Algorithms: Extend MP, OMP to search for row sparsity.

Regularization methods

\hat{X} = \arg\min_X \left[\, \lVert Y - AX \rVert_F^2 + \lambda\, G(X) \,\right]

Choice of G(X) (a small code sketch of these penalties follows the list):

G(X) = \sum_{i=1}^{M} \lVert X_{i,\cdot} \rVert_2, where X_{i,\cdot} is the i-th row of matrix X (extension of \ell_1)

G(X) = \sum_{i=1}^{M} \log(\lVert X_{i,\cdot} \rVert_2 + \epsilon) (extension of the Candes, Wakin and Boyd penalty)

G(X) = \sum_{i=1}^{M} \log(\lVert X_{i,\cdot} \rVert_2^2 + \epsilon) (extension of the Chartrand and Yin penalty)
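A small NumPy sketch (ours) of these row-sparsity penalties; eps is the usual smoothing constant.

```python
import numpy as np

def row_norms(X):
    return np.linalg.norm(X, axis=1)          # ||X_{i,.}||_2 for each row

def g_l21(X):                                 # extension of l1
    return row_norms(X).sum()

def g_log(X, eps=1e-3):                       # Candes-Wakin-Boyd style extension
    return np.log(row_norms(X) + eps).sum()

def g_log_sq(X, eps=1e-3):                    # Chartrand-Yin style extension
    return np.log(row_norms(X)**2 + eps).sum()
```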


Bayesian Methods

Representation for Random Vectors

X = \sqrt{\gamma}\, G, \quad G \sim N(0, B)

γ is a positive random variable, independent of G.

p(x) = \int p(x \mid \gamma)\, p(\gamma)\, d\gamma = \int N(x; 0, \gamma B)\, p(\gamma)\, d\gamma

B = I if the row entries are assumed independent.


EM algorithm: Updating γ

Treat (Y, X) as the complete data, with the matrix X as the hidden variable.

\log p(Y, X, \gamma) = \log p(Y \mid X) + \log p(X \mid \gamma) + \log p(\gamma)

E step

Q(\gamma \mid \gamma^k) = E_{X \mid Y; \gamma^k}\big[\log p(Y \mid X) + \log p(X \mid \gamma) + \log p(\gamma)\big]

M step

\gamma^{k+1} = \arg\max_\gamma Q(\gamma \mid \gamma^k) = \arg\max_\gamma E_{X \mid Y; \gamma^k}\big[\log p(X \mid \gamma) + \log p(\gamma)\big]
            = \arg\min_\gamma E_{X \mid Y; \gamma^k}\left[\sum_{i=1}^{M} \left(\frac{\lVert X_{i,\cdot} \rVert_2^2}{2\gamma_i} + \frac{1}{2}\log\gamma_i\right) - \log p(\gamma)\right]

Solving this optimization problem with a non-informative prior p(γ),

\gamma_i^{k+1} = E\big[\lVert X_{i,\cdot} \rVert_2^2 \mid Y, \gamma^k\big] = \sum_{l=1}^{L} \big(\mu_x(i,l)^2 + \Sigma_x(i,i,l)\big)
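A minimal MMV (M-SBL) EM sketch under the B = I assumption; naming and initialization are ours, and the M-step uses the common convention of averaging the per-column second moments over the L measurement vectors.

```python
import numpy as np

def msbl_em(Y, A, sigma2, n_iter=200):
    """EM iterations for the MMV model Y = A X + V with a common sparsity
    profile: the i-th row of X is N(0, gamma_i * I)."""
    N, L = Y.shape
    M = A.shape[1]
    gamma = np.ones(M)
    for _ in range(n_iter):
        Gamma = np.diag(gamma)
        Sigma_y = sigma2 * np.eye(N) + A @ Gamma @ A.T
        K = Gamma @ A.T @ np.linalg.inv(Sigma_y)
        Mu = K @ Y                            # posterior means, one column per measurement
        Sigma_x = Gamma - K @ A @ Gamma       # posterior covariance, shared across columns
        # M-step: average the second moments over the L columns
        gamma = (Mu**2).mean(axis=1) + np.diag(Sigma_x)
    return Mu, gamma
```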


Empirical Comparison: Multiple Measurement Vectors (MMV)

Generate the data matrix via Y = ΦX0 (noiseless), where:

1. X0 is 100-by-5 with randomly chosen non-zero rows.

2. Φ is 50-by-100 with i.i.d. Gaussian entries.


MMV Empirical Comparison: 1000 trials


Summary

Sparse Signal Recovery (SSR) and Compressed Sensing (CS) are interesting new signal processing tools with many potential applications.

Many algorithmic options exist for solving the underlying sparse signal recovery problem: greedy search techniques, regularization methods, and Bayesian methods, among others.

Nice theoretical results exist, particularly for the greedy search algorithms and ℓ1 recovery methods.

Bayesian methods offer interesting algorithmic options for the sparse signal recovery problem:

  MAP methods (reweighted ℓ1 and ℓ2 methods)
  Hierarchical Bayesian Methods (Sparse Bayesian Learning)
  Versatile and can be more easily employed in problems with structure
  Algorithms can often be justified by studying the resulting objective functions

More applications are to come, enriching the field.
