§❹ The Bayesian Revolution: Markov Chain Monte Carlo (MCMC)


The Bayesian Revolution: Markov Chain Monte Carlo (MCMC) methods


Robert J. Tempelman
Applied Bayesian Inference, KSU, April 29, 2012

Simulation-based inference

Suppose you're interested in the following integral/expectation, where f(x) is a density and g(x) is a function:

E[g(x)] = ∫ g(x) f(x) dx

You can draw random samples x1, x2, ..., xn from f(x) and then compute the sample mean of g(x1), ..., g(xn) as the Monte Carlo estimate of E[g(x)], together with its Monte Carlo standard error (see the sketch below). As n → ∞, the estimate converges to E[g(x)].
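A minimal sketch of the usual Monte Carlo estimator and its standard error, consistent with the description above:

\hat{g}_n = \frac{1}{n}\sum_{i=1}^{n} g(x_i),
\qquad
\mathrm{MCSE}(\hat{g}_n) = \sqrt{\frac{1}{n(n-1)}\sum_{i=1}^{n}\bigl(g(x_i)-\hat{g}_n\bigr)^{2}} = \frac{s_g}{\sqrt{n}},

and by the law of large numbers \hat{g}_n \rightarrow E[g(x)] as n \rightarrow \infty.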

Beauty of Monte Carlo methods

You can determine the distribution of any function of the random variable(s). Distribution summaries include means, medians, key percentiles (2.5%, 97.5%), standard deviations, etc. Generally more reliable than using the Delta method, especially for highly non-normal distributions.

Using method of composition for sampling (Tanner, 1996)

Involves two stages of sampling. Example: suppose Yi | λi ~ Poisson(λi) and, in turn, λi | α, β ~ Gamma(α, β). Then, marginally, Yi | α, β follows a negative binomial distribution with mean α/β and variance (α/β)(1 + 1/β).

Using method of composition for sampling from the negative binomial:

data new;
   seed1 = 2; alpha = 2; beta = 0.25;
   do j = 1 to 10000;
      call rangam(seed1,alpha,x);
      lambda = x/beta;
      call ranpoi(seed1,lambda,y);
      output;
   end;
run;
proc means mean var;
   var y;
run;
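For reference, a sketch of the marginalization behind the negative binomial claim above, assuming (as in the data step) the Gamma(α, β) parameterization with rate β, so E(λ) = α/β:

p(y \mid \alpha,\beta)
= \int_{0}^{\infty} \frac{e^{-\lambda}\lambda^{y}}{y!}\cdot
\frac{\beta^{\alpha}\lambda^{\alpha-1}e^{-\beta\lambda}}{\Gamma(\alpha)}\,d\lambda
= \frac{\Gamma(y+\alpha)}{\Gamma(\alpha)\,y!}
\left(\frac{\beta}{\beta+1}\right)^{\alpha}\left(\frac{1}{\beta+1}\right)^{y},
\qquad y = 0,1,2,\ldots

with E(y) = E(\lambda) = \alpha/\beta and \mathrm{Var}(y) = E(\lambda) + \mathrm{Var}(\lambda) = \alpha/\beta + \alpha/\beta^{2} = (\alpha/\beta)(1+1/\beta).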

The two stages: draw λi | α, β ~ Gamma(α, β), then draw Yi ~ Poisson(λi).

The MEANS Procedure
Variable: y    Mean: 7.9749    Variance: 39.2638

E(y) = α/β = 2/0.25 = 8

Var(y) = (α/β)(1 + 1/β) = 8*(1+4) = 40

Another example? Student t.

data new;
   seed1 = 29523; df = 4;
   do j = 1 to 100000;
      call rangam(seed1,df/2,x);
      lambda = x/(df/2);
      t = rannor(seed1)/sqrt(lambda);
      output;
   end;
run;
proc means mean var p5 p95;
   var t;
run;
data new;
   t5 = tinv(.05,4);
   t95 = tinv(.95,4);
run;
proc print;
run;

Draw λi | ν ~ Gamma(ν/2, ν/2), then draw ti | λi ~ Normal(0, 1/λi).
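A brief sketch of why these two stages yield a Student t (a gamma scale mixture of normals), under the rate parameterization used in the code above, so E(λ) = 1:

\lambda \mid \nu \sim \mathrm{Gamma}\!\left(\tfrac{\nu}{2},\tfrac{\nu}{2}\right)
\;\Rightarrow\; \lambda \stackrel{d}{=} \chi^{2}_{\nu}/\nu,
\qquad
t \mid \lambda \sim N(0,1/\lambda)
\;\Rightarrow\; t = \frac{Z}{\sqrt{\chi^{2}_{\nu}/\nu}} \sim t_{\nu},
\quad Z \sim N(0,1).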

Then t ~ Student t(ν).

Variable: t    Mean: -0.00524    Variance: 2.011365    5th Pctl: -2.1376    95th Pctl: 2.122201
Obs 1:   t5 = -2.1319    t95 = 2.13185

Expectation-Maximization (EM)

OK, I know that EM is NOT a simulation-based inference procedure. However, it is based on data augmentation, and it is an important progenitor of Markov Chain Monte Carlo (MCMC) methods. Recall the plant genetics example.

Data augmentation

Augment the data by splitting the first cell into two cells with probabilities 1/2 and θ/4, giving 5 categories. Looks like a Beta distribution to me!

Data augmentation (cont'd)

So the joint distribution of the complete data is the product over the five augmented cells (see the sketch below). Considering the part including just the missing data: it is binomial.

Expectation-Maximization

Start with the complete log-likelihood:
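A reconstruction of the augmented model, consistent with the split just described and with the PROC IML code given further below; the symbols x1 and x2 for the two augmented counts are notation introduced here:

p(x_1,x_2,y_2,y_3,y_4 \mid \theta) \;\propto\;
\left(\tfrac{1}{2}\right)^{x_1}\left(\tfrac{\theta}{4}\right)^{x_2}
\left(\tfrac{1-\theta}{4}\right)^{y_2+y_3}\left(\tfrac{\theta}{4}\right)^{y_4},
\qquad x_1 + x_2 = y_1 .

The θ kernel is \theta^{x_2+y_4}(1-\theta)^{y_2+y_3} (a Beta shape), and the missing-data part is
x_2 \mid y_1,\theta \sim \mathrm{Binomial}\!\left(y_1,\ \tfrac{\theta/4}{1/2+\theta/4}\right)
= \mathrm{Binomial}\!\left(y_1,\ \tfrac{\theta}{2+\theta}\right).

The complete log-likelihood is therefore
\ell_c(\theta) = (x_2+y_4)\log\theta + (y_2+y_3)\log(1-\theta) + \text{const}.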

1. Expectation (E-step): replace x2 in the complete log-likelihood by its conditional expectation given the observed data and the current value of θ.

2. Maximization step

Use first- or second-derivative methods to maximize the expected complete log-likelihood. Set its first derivative to 0:
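A sketch of the two steps, matching the E-step and M-step lines of the PROC IML code below:

\text{E-step:}\quad
E\!\left[x_2 \mid \mathbf{y},\theta^{(t)}\right] = \frac{y_1\,\theta^{(t)}}{2+\theta^{(t)}}

\text{M-step:}\quad
\frac{\partial}{\partial\theta}\Bigl[(E[x_2]+y_4)\log\theta + (y_2+y_3)\log(1-\theta)\Bigr] = 0
\;\Rightarrow\;
\theta^{(t+1)} = \frac{E[x_2]+y_4}{E[x_2]+y_2+y_3+y_4}.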

Recall the data

Genotype   Probability                  Data (counts)
A_B_       Prob(A_B_) = (2 + θ)/4       y1 = 1997
aaB_       Prob(aaB_) = (1 - θ)/4       y2 = 906
A_bb       Prob(A_bb) = (1 - θ)/4       y3 = 904
aabb       Prob(aabb) = θ/4             y4 = 32

0 ≤ θ ≤ 1; θ = 0: close linkage in repulsion; θ = 1: close linkage in coupling.

PROC IML code:

proc iml;
   y1 = 1997; y2 = 906; y3 = 904; y4 = 32;
   theta = 0.20;                       /* starting value */
   do iter = 1 to 20;
      Ex2 = y1*(theta)/(theta+2);      /* E-step */
      theta = (Ex2+y4)/(Ex2+y2+y3+y4); /* M-step */
      print iter theta;
   end;
run;

iter   theta
  1    0.1055303
  2    0.0680147
  3    0.0512031
  4    0.0432646
  5    0.0394234
  6    0.0375429
  7    0.036617
  8    0.0361598
  9    0.0359338
 10    0.0358219
 11    0.0357666
 12    0.0357392
 13    0.0357256
 14    0.0357189
 15    0.0357156
 16    0.0357139
 17    0.0357131
 18    0.0357127
 19    0.0357125
 20    0.0357124

EM is slower than Newton-Raphson/Fisher scoring but generally more robust to poorer starting values.

How to derive an asymptotic standard error using EM? From Louis (1982):

Given the complete-data score and information from the complete log-likelihood above, Louis's method combines their conditional expectations given the observed data to recover the observed information (see the sketch below).

Finish off

Evaluating the observed information at the converged estimate and inverting it gives the asymptotic variance, and hence the asymptotic standard error of the MLE.
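A sketch using the general identity from Louis (1982) together with the complete-data quantities above (the intermediate algebra is a reconstruction); the result agrees with the σ = 0.0060 used for the normal overlay later:

I(\hat\theta;\mathbf{y}) \;=\;
E\!\left[-\frac{\partial^{2}\ell_c}{\partial\theta^{2}}\,\Big|\,\mathbf{y},\hat\theta\right]
\;-\;
\mathrm{Var}\!\left[\frac{\partial\ell_c}{\partial\theta}\,\Big|\,\mathbf{y},\hat\theta\right],

where here
E\!\left[-\frac{\partial^{2}\ell_c}{\partial\theta^{2}}\,\Big|\,\mathbf{y}\right]
= \frac{E[x_2\mid\mathbf{y}]+y_4}{\theta^{2}} + \frac{y_2+y_3}{(1-\theta)^{2}},
\qquad
\mathrm{Var}\!\left[\frac{\partial\ell_c}{\partial\theta}\,\Big|\,\mathbf{y}\right]
= \frac{\mathrm{Var}(x_2\mid\mathbf{y})}{\theta^{2}},
\quad
\mathrm{Var}(x_2\mid\mathbf{y}) = y_1\,\frac{\theta}{2+\theta}\cdot\frac{2}{2+\theta}.

At \hat\theta \approx 0.0357 this gives I(\hat\theta;\mathbf{y}) \approx 2.75\times 10^{4}, so
\mathrm{SE}(\hat\theta) = I(\hat\theta;\mathbf{y})^{-1/2} \approx 0.0060.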

Stochastic Data Augmentation (Tanner, 1996)

Posterior identity: p(θ | y) is the average of the conditional posterior p(θ | x, y) over the predictive distribution p(x | y) of the missing data.

Predictive identity: p(x | y) is the average of p(x | θ, y) over the posterior p(θ | y).

Substituting the predictive identity into the posterior identity implies an integral equation for p(θ | y):
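A sketch of the two identities and their implication, in standard notation (x denotes the augmented data, φ a dummy copy of θ):

\text{Posterior identity:}\quad
p(\theta\mid\mathbf{y}) = \int p(\theta\mid x,\mathbf{y})\,p(x\mid\mathbf{y})\,dx

\text{Predictive identity:}\quad
p(x\mid\mathbf{y}) = \int p(x\mid\phi,\mathbf{y})\,p(\phi\mid\mathbf{y})\,d\phi

\text{Implies:}\quad
p(\theta\mid\mathbf{y}) = \int
\underbrace{\left[\int p(\theta\mid x,\mathbf{y})\,p(x\mid\phi,\mathbf{y})\,dx\right]}_{K(\theta,\phi)}
p(\phi\mid\mathbf{y})\,d\phi .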

The inner integral is the transition function for a Markov chain, which suggests an iterative method-of-composition approach for sampling.

Sampling strategy from p(θ | y)

Start somewhere (starting value θ = θ[0]). In cycle 1, sample x[1] from p(x | θ[0], y), then sample θ[1] from p(θ | x[1], y). In cycle 2, sample x[2] from p(x | θ[1], y), then sample θ[2] from p(θ | x[2], y), etc. It's like sampling from E-steps and M-steps.

What are these Full Conditional Densities (FCD)?

Recall the complete likelihood function, and assume the prior on θ is flat: p(θ) ∝ 1. The FCDs are then

θ | x, y ~ Beta(α = y1 - x + y4 + 1, β = y2 + y3 + 1)

x | θ, y ~ Binomial(n = y1, p = 2/(θ + 2)),

where x is the augmented count in the first split cell (the one with probability 1/2).
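A sketch of the derivation, under the flat prior and the complete-data likelihood assumed above (x is the count in the probability-1/2 cell):

p(\theta, x \mid \mathbf{y}) \;\propto\;
\binom{y_1}{x}\left(\tfrac{1}{2}\right)^{x}\left(\tfrac{\theta}{4}\right)^{y_1-x}
\left(\tfrac{1-\theta}{4}\right)^{y_2+y_3}\left(\tfrac{\theta}{4}\right)^{y_4}

\Rightarrow\quad
p(\theta \mid x,\mathbf{y}) \propto \theta^{(y_1-x)+y_4}(1-\theta)^{y_2+y_3}
\;\sim\; \mathrm{Beta}(y_1-x+y_4+1,\; y_2+y_3+1),

p(x \mid \theta,\mathbf{y}) \propto \binom{y_1}{x}\left(\tfrac{1}{2}\right)^{x}\left(\tfrac{\theta}{4}\right)^{y_1-x}
\;\sim\; \mathrm{Binomial}\!\left(y_1,\ \tfrac{1/2}{1/2+\theta/4}\right)
= \mathrm{Binomial}\!\left(y_1,\ \tfrac{2}{2+\theta}\right).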

IML code for Chained Data Augmentation Example

proc iml;
   seed1 = 4;
   ncycle = 10000;              /* total number of samples */
   theta = j(ncycle,1,0);
   y1 = 1997; y2 = 906; y3 = 904; y4 = 32;
   beta = y2+y3+1;
   theta[1] = ranuni(seed1);    /* initial draw between 0 and 1 */
   do cycle = 2 to ncycle;
      p = 2/(2+theta[cycle-1]);
      xvar = ranbin(seed1,y1,p);
      alpha = y1+y4-xvar+1;
      xalpha = rangam(seed1,alpha);
      xbeta = rangam(seed1,beta);
      theta[cycle] = xalpha/(xalpha+xbeta);
   end;
   create parmdata var {theta xvar};
   append;
run;
data parmdata;
   set parmdata;
   cycle = _n_;
run;

Trace Plot

proc gplot data=parmdata;
   plot theta*cycle;
run;

Burn-in? With a bad starting value, one should discard the first few samples to ensure that one is truly sampling from p(θ | y); the starting value should have no impact. This is convergence in distribution. How to decide on this stuff? See Cowles and Carlin (1996). Here, throw away the first 1000 samples as burn-in.

Histogram of samples, post burn-in

proc univariate data=parmdata;
   where cycle > 1000;
   var theta;
   histogram / normal(color=red mu=0.0357 sigma=0.0060);
run;

Bayesian inference: N = 9000, Posterior Mean = 0.03671503, Posterior Std Deviation = 0.00607971.

Quantiles for Normal Distribution
Percent   Quantile Observed (Bayesian)   Quantile Asymptotic (Likelihood)
 5.0      0.02702                        0.02583
95.0      0.04728                        0.04557

The right-hand column is the asymptotic likelihood inference (normal approximation at the MLE).

Zooming in on Trace Plot

Hints of autocorrelation, as expected with Markov Chain Monte Carlo simulation schemes. The number of drawn samples is NOT equal to the number of independent draws: the greater the autocorrelation, the greater the problem (need more samples!).

Sample autocorrelation

proc arima data=parmdata;
   where cycle > 1000;
   identify var=theta nlag=1000 outcov=autocov;
run;

Autocorrelation Check for White Noise
To Lag   Chi-Square   DF   Pr > ChiSq
6        3061.39      6    <.0001

How to estimate the effective number of independent samples (ESS)

Consider the posterior mean based on m samples.

Its variance can be estimated with the initial positive sequence estimator (Geyer, 1992; Sorensen and Gianola, 1995), which is built from the lag autocovariances of the chain, summed in adjacent pairs (see the sketch below).

Initial positive sequence estimator

Choose the truncation point t such that all pairwise sums of adjacent autocovariances up to t remain positive. SAS PROC MCMC chooses a slightly different cutoff (see its documentation).
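A sketch of the estimator as implemented in the %ESS1 macro below, where γ̂k denotes the estimated lag-k autocovariance of the sampled values and m the number of samples:

\bar\theta = \frac{1}{m}\sum_{j=1}^{m}\theta^{(j)},
\qquad
\widehat{\mathrm{Var}}(\bar\theta)
= \frac{1}{m}\left(-\hat\gamma_0 + 2\sum_{t=0}^{T}\hat\Gamma_t\right),
\qquad
\hat\Gamma_t = \hat\gamma_{2t} + \hat\gamma_{2t+1},

where the sum is truncated once a \hat\Gamma_t turns negative. Then

\mathrm{MCSE}(\bar\theta) = \sqrt{\widehat{\mathrm{Var}}(\bar\theta)},
\qquad
\mathrm{ESS} = \frac{\hat\gamma_0}{\widehat{\mathrm{Var}}(\bar\theta)}.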

Extensive autocorrelation across lags leads to a smaller ESS.

SAS code

%macro ESS1(data,variable,startcycle,maxlag);
data _null_;
   set &data nobs=_n;
   call symputx('nsample',_n);
run;
proc arima data=&data;
   where cycle > &startcycle;
   identify var=&variable nlag=&maxlag outcov=autocov;
run;
proc iml;
   use autocov;
   read all var{'COV'} into cov;
   nsample = &nsample;
   nlag2 = nrow(cov)/2;
   Gamma = j(nlag2,1,0);
   cutoff = 0;
   t = 0;
   do while (cutoff = 0);
      t = t+1;
      Gamma[t] = cov[2*(t-1)+1] + cov[2*(t-1)+2];
      if Gamma[t] < 0 then cutoff = 1;
      if t = nlag2 then do;
         print "Too much autocorrelation";
         print "Specify a larger max lag";
         stop;
      end;
   end;
   varm = (-Cov[1] + 2*sum(Gamma)) / nsample;
   ESS = Cov[1]/varm;      /* effective sample size */
   stdm = sqrt(varm);      /* Monte Carlo standard error */
   parameter = "&variable";
   print parameter stdm ESS;
run;
%mend ESS1;

Recall: 9000 MCMC post-burn-in cycles.

Executing %ESS1

%ESS1(parmdata,theta,1000,1000);

Recall: 1000 MCMC burn-in cycles.

parameter   stdm        ESS
theta       0.0001116   2967.1289

i.e., information equivalent to drawing 2967 independent draws from the density.

How large of an ESS should I target?

Routinely in the thousands or greater; it depends on what you want to estimate. Recommend no less than 100 for estimating typical location parameters (mean, median, etc.), and several times that for typical dispersion parameters like variances. Want to provide key percentiles, i.e., the 2.5th and 97.5th percentiles? Then you need an ESS in the thousands! See Raftery and Lewis (1992) for further direction.
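Since those percentiles are read directly off the posterior sample, here is a minimal SAS sketch for extracting such summaries from the post-burn-in draws in parmdata; the output data set name postsum and the statistic names pmean, pmedian, psd are illustrative choices, not from the original slides:

proc univariate data=parmdata noprint;
   where cycle > 1000;                  /* post-burn-in draws only */
   var theta;
   output out=postsum mean=pmean median=pmedian std=psd
          pctlpts=2.5 97.5 pctlpre=p;   /* creates p2_5 and p97_5 */
run;
proc print data=postsum;
run;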

Worthwhile to consider this sampling strategy?

Here, not too much difference, if any, from likelihood inference. But how about smaller samples, e.g., y1=200, y2=91, y3=90, y4=3? A different story.

Gibbs sampling: origins (Geman and Geman, 1984)

Gibbs sampling was first developed in statistical physics in relation to spatial inference problems. Problem: a true image was corrupted by a stochastic process to produce an observable image y (the data). Objective: restore the true image.
