Introduction to Bayesian Data Analysis and Markov Chain Monte Carlo

Jeffrey S. Morris
University of Texas M.D. Anderson Cancer Center
Department of Biostatistics
[email protected]

September 20, 2002

Abstract: The purpose of this talk is to give a brief overview of Bayesian inference and Markov Chain Monte Carlo methods, including the Gibbs sampler and the Metropolis-Hastings algorithm.

Outline

• Bayesian vs. frequentist paradigm

• Bayesian inference and MCMC
  – Gibbs sampler
  – Metropolis-Hastings algorithm

• Assessing convergence of MCMC

• Hierarchical model example

• MCMC: benefits and cautions

Frequentist vs. Bayesian paradigms

• Data: X; Parameters: Θ

• To a frequentist:
  – The data X are random, and the parameters Θ are fixed.
  – (ML) Inference is performed by finding the Θ such that f(X|Θ) is maximized.
  – We cannot make probability statements about parameters; we can only make statements about the performance of estimators over repeated sampling (e.g. confidence intervals).

• To a Bayesian:
  – The current data X are fixed, and the unknown parameters Θ are random.
  – Inference is performed via the posterior distribution f(Θ|X).
  – We can make probability statements about parameters, since they are random quantities (e.g. credible intervals).

Bayes' Rule

• The posterior distribution is computed by applying Bayes' Rule:

  f(Θ|X) = f(X|Θ) f(Θ) / f(X)

• f(X|Θ) = Likelihood

• f(Θ) = Prior distribution
  – Reflects prior knowledge about Θ
  – Sometimes controversial
  – If little information is available, just use diffuse priors (avoid improper priors)

• f(X) = Marginal distribution = ∫ f(X|Θ) f(Θ) dΘ
  – Difficult to compute (usually an intractable integral)
  – Often not necessary to compute

Conjugate priors

• Conjugate priors: f(Θ) and f(Θ|X) have the same distributional form.

• Examples: Normal-Normal, Beta-Binomial, Gamma-Poisson

• Ex: (X|θ) ∼ Binomial(n, θ); θ ∼ Beta(α, β)

  f(θ|X) ∝ f(X|θ) f(θ)
         ∝ θ^X (1−θ)^(n−X) · θ^(α−1) (1−θ)^(β−1)
         = θ^(α+X−1) (1−θ)^(β+n−X−1)
         = kernel of Beta(α+X, β+n−X)

• For single-parameter problems, conjugate priors yield closed-form posterior distributions.

• What if we don't want to use conjugate priors? What if we have multiple parameters?
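The Beta-Binomial update above amounts to adding the observed successes and failures to the prior's two shape parameters. A minimal sketch in Python (function names are illustrative, not from the talk):

```python
# Beta-Binomial conjugate update: observing X successes in n Bernoulli(theta)
# trials turns a Beta(alpha, beta) prior into a Beta(alpha + X, beta + n - X)
# posterior, so the update is pure bookkeeping on the shape parameters.

def beta_binomial_update(alpha, beta, x, n):
    """Return posterior (alpha, beta) after observing x successes in n trials."""
    return alpha + x, beta + n - x

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Example: flat Beta(1, 1) prior, 7 successes in 10 trials.
a_post, b_post = beta_binomial_update(1.0, 1.0, x=7, n=10)
post_mean = beta_mean(a_post, b_post)
print(a_post, b_post)          # 8.0 4.0
print(round(post_mean, 3))     # 0.667
```

The posterior mean 8/12 sits between the prior mean 1/2 and the sample proportion 7/10, a shrinkage effect typical of conjugate updates.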

Non-conjugate Case

• Suppose we are interested in the posterior mean:

  E(Θ|X) = ∫ Θ f(Θ|X) dΘ = [∫ Θ f(X|Θ) f(Θ) dΘ] / [∫ f(X|Θ) f(Θ) dΘ]

• How do we compute this integral if it is intractable?
  – Numerical integration (quadrature): may not work if there are many parameters.
  – Monte Carlo integration

Markov Chain Monte Carlo: Monte Carlo Integration

• Monte Carlo integration: estimate integrals by randomly drawing samples from the required distribution.

  E(Θ|X) = ∫ Θ f(Θ|X) dΘ ≈ (1/n) ∑_{t=1}^{n} Θ_t, where Θ_t ∼ f(Θ|X)

• We still need a method for drawing samples from the posterior distribution:
  – Rejection sampling
  – Importance sampling
  – Markov chains
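The approximation above is easy to check on a case where the posterior is known exactly. A minimal sketch, assuming a Beta(8, 4) posterior (as in the conjugacy example) so the Monte Carlo average can be compared with the exact mean 8/12:

```python
# Monte Carlo integration: approximate E(theta | X) by averaging draws from
# the posterior. Here the posterior is Beta(8, 4), whose exact mean is 8/12,
# so the quality of the approximation can be verified directly.
import random

random.seed(1)
n = 100_000
draws = [random.betavariate(8, 4) for _ in range(n)]
mc_mean = sum(draws) / n

print(round(mc_mean, 3))   # close to the exact value 8/12 ≈ 0.667
```

The Monte Carlo error shrinks at rate 1/√n, independent of the dimension of Θ, which is why this approach scales where quadrature does not.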

Markov Chain Monte Carlo: Markov Chains

• Markov chain: a method to draw samples from a desired stationary distribution.

• Steps:
  1. Obtain starting values Θ_0.
  2. Sample Θ_1 from a suitably chosen transition kernel P(Θ_1|Θ_0).
  3. Repeat the second step n times to obtain the chain {Θ_0, Θ_1, ..., Θ_n}.

• Theorems show that, under certain regularity conditions, the chain will converge to a particular stationary distribution after a suitable burn-in period.

• End result: a (correlated) sample from the stationary distribution.

Markov Chain Monte Carlo

• Given a Markov chain {Θ_0, Θ_1, ..., Θ_n} with stationary distribution f(Θ|X) and burn-in m, we can estimate the posterior mean using Monte Carlo integration:

  E(Θ|X) ≈ (1/(n−m)) ∑_{t=m+1}^{n} Θ_t.

• Other quantities can also be computed from the Markov chain:
  – Standard errors
  – Quantiles
  – Density estimates

• The samples can be used to perform any Bayesian inference of interest.

• How do we generate the Markov chain?
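The burn-in-discarded average and quantiles above can be sketched on a toy correlated chain; here an AR(1) process stands in for MCMC output, with N(0, 1) as its stationary distribution (the target, correlation, and deliberately bad starting value are illustrative assumptions):

```python
# Posterior summaries from a (correlated) chain: discard a burn-in of m draws,
# then summarize the rest. The AR(1) chain below has stationary distribution
# N(0, 1) and is started far from the target (at 10) so burn-in visibly matters.
import random

random.seed(2)
rho, n, m = 0.9, 50_000, 500
theta = [10.0]                               # deliberately bad starting value
for _ in range(n):
    theta.append(rho * theta[-1] + (1 - rho**2) ** 0.5 * random.gauss(0, 1))

kept = sorted(theta[m + 1:])                 # discard burn-in, sort for quantiles
post_mean = sum(kept) / len(kept)
q025 = kept[int(0.025 * len(kept))]
q975 = kept[int(0.975 * len(kept))]
print(round(post_mean, 2), round(q025, 2), round(q975, 2))  # ≈ 0.0, -1.96, 1.96
```

The (q025, q975) pair is a 95% interval read straight off the sorted draws; with a genuine posterior chain it would be a credible interval for Θ.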

Gibbs Sampler

• Gibbs sampler (Geman and Geman, 1984): the Markov transition kernel consists of drawing from the full conditional distributions.

• Suppose Θ = {θ_1, θ_2, ..., θ_p}^T. The full conditional distribution for parameter i is f(θ_i | X, Θ_{−i}), which conditions on:
  – the data X
  – the values of all other parameters, Θ_{−i}

Gibbs Sampler

• Steps of the Gibbs sampler:

  1. Choose a set of starting values Θ^(0).
  2. Generate (Θ^(1) | Θ^(0)) by sampling each parameter from its full conditional, given the most recently drawn values of the others:
       θ_1^(1) from f(θ_1 | X, θ_2^(0), ..., θ_p^(0))
       θ_2^(1) from f(θ_2 | X, θ_1^(1), θ_3^(0), ..., θ_p^(0))
       ...
       θ_p^(1) from f(θ_p | X, θ_1^(1), ..., θ_{p−1}^(1))
  3. Repeat step two to get a chain of length n: {Θ^(0), Θ^(1), ..., Θ^(n)}.
  4. Assuming convergence by iteration m, compute posterior means, quantiles, etc. using samples m through n.

• Many variations possible:
  – Which parameters to update each iteration, and the order of updating
  – 'Blocking' parameters together, working with marginalized distributions

• If conjugate priors are used for all parameters, the full conditionals are available in closed form.

• What if we don't have closed-form distributions for the full conditionals?

Metropolis-Hastings Algorithm

• Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970): a method to construct a Markov chain for θ even if a closed-form expression for its distribution is not available.

• π(θ): the kernel of the distribution of interest for θ, e.g. the full conditional f(θ_i^(t) | X, Θ_{−i}^(t−1)).

• Steps:
  1. Get θ^(0), a starting value for θ.
  2. Get θ*, a proposed value for θ^(1), by sampling from a proposal density q(θ | X, θ^(0)).
  3. Compute α(θ^(0), θ*) = min{ 1, [π(θ*) q(θ^(0)|θ*)] / [π(θ^(0)) q(θ*|θ^(0))] }.
  4. Generate u ∼ Uniform(0,1). If u < α, let θ^(1) = θ*; otherwise let θ^(1) = θ^(0).

• Types of proposals: random walk, independence, symmetric
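The steps above can be sketched with a symmetric random-walk proposal, for which the q terms in step 3 cancel and α = min(1, π(θ*)/π(θ^(t))). The standard-normal target and the step size below are illustrative assumptions; note π only needs to be known up to a constant:

```python
# Random-walk Metropolis sketch. Target: pi(theta) ∝ exp(-theta^2 / 2), i.e. a
# standard normal known only up to its normalizing constant. The proposal is
# symmetric (Gaussian step), so the acceptance ratio is just pi(new)/pi(old).
import math
import random

def pi_kernel(theta):
    return math.exp(-0.5 * theta * theta)   # unnormalized target density

random.seed(4)
theta, step, n_iter = 3.0, 1.0, 50_000      # step 1: starting value
chain = []
for _ in range(n_iter):
    proposal = theta + random.gauss(0, step)              # step 2: propose
    alpha = min(1.0, pi_kernel(proposal) / pi_kernel(theta))  # step 3
    if random.random() < alpha:                           # step 4: accept/reject
        theta = proposal
    chain.append(theta)                     # rejected moves repeat old value

kept = chain[1000:]                         # discard burn-in
mh_mean = sum(kept) / len(kept)
print(round(mh_mean, 2))                    # ≈ 0.0, the target mean
```

Rejected proposals still append the current value; that repetition is what makes the acceptance probability come out right, and it is why MCMC samples are correlated.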

Assessing Convergence of Markov Chains

• The Markov chain is known to converge to the stationary distribution of interest, but how do I know when convergence has been achieved? That is, how do I decide how long the burn-in should be?

  1. Look at time series plots for the parameters.
  2. Run multiple chains with divergent starting values.
  3. Run formal diagnostics (Gelman and Rubin 1992; Geweke 1992).

• Other issues:
  – Length of chain
  – Thinning to decrease autocorrelation
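Points 2 and 3 work together: the Gelman-Rubin diagnostic compares between-chain variance to within-chain variance across chains started from divergent values, with values near 1 suggesting the chains have mixed. A sketch of the basic statistic (the toy chains are illustrative):

```python
# Gelman-Rubin potential scale reduction factor (R-hat) for m chains of
# length n: pool a between-chain variance B and a within-chain variance W,
# and report sqrt(var_hat / W), which approaches 1 as the chains mix.

def gelman_rubin(chains):
    """R-hat for a list of equal-length chains of a scalar parameter."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)   # between-chain
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m               # within-chain
    var_hat = (n - 1) / n * w + b / n    # pooled variance estimate
    return (var_hat / w) ** 0.5

import random
random.seed(5)
# Two chains sampling the same target: R-hat should sit near 1.
mixed = [[random.gauss(0, 1) for _ in range(5000)] for _ in range(2)]
# Two chains stuck around different values: R-hat should be well above 1.
stuck = [[random.gauss(0, 1) for _ in range(5000)],
         [random.gauss(3, 1) for _ in range(5000)]]
rhat_mixed, rhat_stuck = gelman_rubin(mixed), gelman_rubin(stuck)
print(round(rhat_mixed, 2), round(rhat_stuck, 2))   # ≈ 1.0, and clearly > 1
```

In practice the diagnostic is run per parameter, and the burn-in is taken long enough that R-hat is close to 1 for all of them.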

Example: Hierarchical Models

• Example: growth curves for rats.
• Data Y_ij consist of weights for 30 rats over 5 weeks.

[Figure: "Rat Growth Model Data" — observed weight (g), roughly 100-350 g, plotted against time (days), roughly 10-35 days, one curve per rat.]

• We can estimate the mean growth curve by linear regression, but growth curve models are necessary to get the standard errors right.

Example: Hierarchical Models

• Model: Y_ij ∼ Normal(μ_ij, τ_c)

  μ_ij = α_i + β_i (x_j − x̄)
  α_i ∼ Normal(α_c, τ_α)
  β_i ∼ Normal(β_c, τ_β)

  (Here the τ's denote precisions, following the BUGS parameterization.)

• The model could be fit using a linear mixed model or a Bayesian hierarchical model.

• Priors (conjugate and vague):

  α_c, β_c ∼ Normal(0, 10^−6)
  τ_c, τ_α, τ_β ∼ Gamma(0.001, 0.001)

• Gibbs sampler: since conjugate priors were used, the full conditionals are all available in closed form and can be derived using some algebra.

• WinBUGS: statistical software to perform MCMC in general problems.
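With conjugate priors, the "some algebra" for, say, a single random intercept α_i is a normal-normal precision-weighted update. A sketch of that one Gibbs step (the data values and parameter settings below are hypothetical, and the τ's are precisions as in BUGS):

```python
# Full conditional for one random intercept alpha_i in the rats model. With
# Y_ij ~ Normal(mu_ij, precision tau_c) and alpha_i ~ Normal(alpha_c, precision
# tau_alpha), completing the square gives a normal full conditional with
#   precision: tau_alpha + J * tau_c
#   mean: (tau_alpha*alpha_c + tau_c * sum_j (Y_ij - beta_i*(x_j - xbar))) / precision
import random

def draw_alpha_i(y_i, x, xbar, beta_i, alpha_c, tau_alpha, tau_c):
    """One Gibbs update for alpha_i; all tau arguments are precisions."""
    resid = sum(y - beta_i * (xj - xbar) for y, xj in zip(y_i, x))
    prec = tau_alpha + len(y_i) * tau_c
    mean = (tau_alpha * alpha_c + tau_c * resid) / prec
    return random.gauss(mean, prec ** -0.5)

# Hypothetical weights for one rat at five measurement times.
random.seed(6)
y_i = [151.0, 199.0, 246.0, 283.0, 320.0]
x = [8.0, 15.0, 22.0, 29.0, 36.0]
a = draw_alpha_i(y_i, x, xbar=22.0, beta_i=6.0, alpha_c=240.0,
                 tau_alpha=1e-4, tau_c=0.01)
print(round(a, 1))   # a draw concentrated near this rat's own mean level
```

A full Gibbs sampler for this model cycles updates like this one over every α_i, β_i, and the hyperparameters; WinBUGS automates exactly that derivation.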

Conclusions

• Why use MCMC?
  – A flexible computing tool with the ability to fit complex models.
  – No need to make simplifying modeling assumptions out of convenience.
  – Given posterior samples, we get all the benefits of Bayesian inference.

• Words of caution:
  – Monitor convergence!
    ∗ Unfortunately, the most complex models tend to converge very slowly.
    ∗ Blocking and marginalization can help decrease the correlation of model parameters in the MCMC.
  – Check whether your answers make sense; compare with plots and simple methods.
  – Perform sensitivity analysis on the priors.

• Other book: Gelman, Carlin, Stern, & Rubin (1995), Bayesian Data Analysis.

References

Gelman A and Rubin DB (1992). Inference from iterative simulation using multiple sequences. Statistical Science 7, 457-472.

Gelman A, Carlin JB, Stern HS and Rubin DB (1995). Bayesian Data Analysis. Chapman and Hall.

Geman S and Geman D (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721-741.

Geweke J (1992). Evaluation of accuracy of sampling-based approaches to the calculation of posterior moments. In Bayesian Statistics 4 (ed. JM Bernardo, J Berger, AP Dawid and AFM Smith), pp. 169-193. Oxford University Press.

Gilks WR, Richardson S, and Spiegelhalter DJ (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall.

Hastings WK (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97-109.

Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH and Teller E (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics 21, 1087-1092.