Applied Bayesian Inference, KSU, April 29, 2012 / The Bayesian Revolution: Markov Chain Monte Carlo (MCMC). Robert J. Tempelman

TRANSCRIPT

  • Slide 1
  • The Bayesian Revolution: Markov Chain Monte Carlo (MCMC). Robert J. Tempelman. Applied Bayesian Inference, KSU, April 29, 2012.
  • Slide 2
  • Simulation-based inference. Suppose you are interested in the following integral/expectation, where $f(x)$ is a density and $g(x)$ is a function: $E[g(x)] = \int g(x)\,f(x)\,dx$. You can draw random samples $x_1, x_2, \ldots, x_n$ from $f(x)$ and then compute $\hat{g} = \frac{1}{n}\sum_{i=1}^{n} g(x_i)$, with Monte Carlo standard error $\sqrt{\frac{1}{n(n-1)}\sum_{i=1}^{n}\left(g(x_i)-\hat{g}\right)^2}$. As $n \to \infty$, $\hat{g} \to E[g(x)]$; a minimal sketch follows below.
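    A minimal SAS sketch of this idea (not from the original slides; the target $E[e^x]$ for $x \sim N(0,1)$, the seed, and $n = 100{,}000$ are arbitrary choices). The exact answer is $e^{0.5} \approx 1.649$, so the Monte Carlo mean and its standard error reported by PROC MEANS can be checked directly.

    data mc;
       seed1 = 1234;
       do i = 1 to 100000;
          x = rannor(seed1);   /* draw x ~ N(0,1), i.e., from the density f(x) */
          g = exp(x);          /* g(x), the function whose expectation we want */
          output;
       end;
    run;
    /* Mean = Monte Carlo estimate; Std Error = Monte Carlo standard error */
    proc means data=mc mean stderr;
       var g;
    run;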
  • Slide 3
  • The beauty of Monte Carlo methods: you can determine the distribution of any function of the random variable(s). Distribution summaries include means, medians, key percentiles (2.5%, 97.5%), standard deviations, etc. This is generally more reliable than the delta method, especially for highly non-normal distributions; a small illustration follows below.
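    A small hypothetical illustration (not from the slides): the distribution of a nonlinear function $h = x_1/x_2$ of two independent gamma draws is summarized by its mean, median, standard deviation, and 2.5%/97.5% percentiles, quantities the delta method could only approximate.

    data ratio;
       seed1 = 5678;
       do i = 1 to 100000;
          call rangam(seed1, 4, x1);   /* x1 ~ Gamma(shape=4, scale=1) */
          call rangam(seed1, 8, x2);   /* x2 ~ Gamma(shape=8, scale=1) */
          h = x1/x2;                   /* any function of the draws */
          output;
       end;
    run;
    proc univariate data=ratio noprint;
       var h;
       output out=summary mean=mean median=median std=std
              pctlpts=2.5 97.5 pctlpre=pct;
    run;
    proc print data=summary; run;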
  • Slide 4
  • Using the method of composition for sampling (Tanner, 1996). It involves two stages of sampling. Example: suppose $Y_i \mid \lambda_i \sim \text{Poisson}(\lambda_i)$ and, in turn, $\lambda_i \mid \alpha, \beta \sim \text{Gamma}(\alpha, \beta)$. Then, marginally, $Y_i$ follows a negative binomial distribution with mean $\alpha/\beta$ and variance $(\alpha/\beta)(1+\beta^{-1})$.
  • Slide 5
  • Using the method of composition for sampling from the negative binomial:
    1. Draw $\lambda_i \mid \alpha, \beta \sim \text{Gamma}(\alpha, \beta)$.
    2. Draw $Y_i \mid \lambda_i \sim \text{Poisson}(\lambda_i)$.

    data new;
       seed1 = 2; alpha = 2; beta = 0.25;
       do j = 1 to 10000;
          call rangam(seed1,alpha,x);
          lambda = x/beta;
          call ranpoi(seed1,lambda,y);
          output;
       end;
    run;
    proc means mean var; var y; run;

    The MEANS Procedure reports Mean = 7.9749 and Variance = 39.2638 for y, close to the theoretical $E(y) = \alpha/\beta = 8$ and $\text{Var}(y) = (\alpha/\beta)(1+\beta^{-1}) = 8 \times (1+4) = 40$.
  • Slide 6
  • Another example? Student t.
    1. Draw $\lambda_i \mid \nu \sim \text{Gamma}(\nu/2, \nu/2)$.
    2. Draw $t_i \mid \lambda_i \sim \text{Normal}(0, 1/\lambda_i)$.
    Then $t_i \sim$ Student $t$ with $\nu$ degrees of freedom.

    data new;
       seed1 = 29523; df = 4;
       do j = 1 to 100000;
          call rangam(seed1,df/2,x);
          lambda = x/(df/2);
          t = rannor(seed1)/sqrt(lambda);
          output;
       end;
    run;
    proc means mean var p5 p95; var t; run;

    data new; t5 = tinv(.05,4); t95 = tinv(.95,4); run;
    proc print; run;

    Simulated summaries for t: Mean = -0.00524, Variance = 2.011365, 5th percentile = -2.1376, 95th percentile = 2.122201; the exact t(4) quantiles are t5 = -2.1319 and t95 = 2.13185.
  • Slide 7
  • Expectation-Maximization (EM). OK, I know that EM is not a simulation-based inference procedure; however, it is based on data augmentation and is an important progenitor of Markov chain Monte Carlo (MCMC) methods. Recall the plant genetics example.
  • Slide 8
  • Data augmentation. Augment the data by splitting the first cell into two cells with probabilities $1/2$ and $\theta/4$, giving 5 categories: $\left(\tfrac{1}{2},\ \tfrac{\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{\theta}{4}\right)$. As a function of $\theta$, the augmented likelihood looks like a Beta distribution to me!
  • Slide 9
  • Data augmentation (cont'd). So the joint distribution of the complete data is $p(x_2, \mathbf{y} \mid \theta) \propto \left(\tfrac{1}{2}\right)^{y_1-x_2}\left(\tfrac{\theta}{4}\right)^{x_2}\left(\tfrac{1-\theta}{4}\right)^{y_2+y_3}\left(\tfrac{\theta}{4}\right)^{y_4}$. Considering the part including just the missing data: $x_2 \mid y_1, \theta \sim \text{Binomial}\!\left(y_1,\ \tfrac{\theta}{\theta+2}\right)$.
  • Slide 10
  • Expectation-Maximization. Start with the complete-data log-likelihood, $\ell(\theta \mid \mathbf{y}, x_2) = (x_2+y_4)\log\theta + (y_2+y_3)\log(1-\theta) + \text{constant}$. 1. Expectation (E-step): replace $x_2$ by its conditional expectation given the data and the current estimate, $E\!\left[x_2 \mid y_1, \theta^{[t]}\right] = \dfrac{y_1\,\theta^{[t]}}{\theta^{[t]}+2}$.
  • Slide 11
  • 2. Maximization (M-step): use first- or second-derivative methods to maximize the expected complete-data log-likelihood. Setting the derivative to 0, $\dfrac{E[x_2]+y_4}{\theta} - \dfrac{y_2+y_3}{1-\theta} = 0$, gives $\theta^{[t+1]} = \dfrac{E[x_2]+y_4}{E[x_2]+y_2+y_3+y_4}$.
  • Slide 12
  • Recall the data (probability, genotype, counts):
    Prob(A_B_) = $(2+\theta)/4$, $y_1 = 1997$
    Prob(aaB_) = $(1-\theta)/4$, $y_2 = 906$
    Prob(A_bb) = $(1-\theta)/4$, $y_3 = 904$
    Prob(aabb) = $\theta/4$, $y_4 = 32$
    with $0 \le \theta \le 1$; $\theta = 0$: close linkage in repulsion; $\theta = 1$: close linkage in coupling.
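    For reference (this display is reconstructed from the cell probabilities above, not taken from the transcript), the observed-data likelihood and log-likelihood that EM is maximizing are
    $$L(\theta \mid \mathbf{y}) \propto (2+\theta)^{y_1}\,(1-\theta)^{y_2+y_3}\,\theta^{y_4}, \qquad \ell(\theta \mid \mathbf{y}) = y_1\log(2+\theta) + (y_2+y_3)\log(1-\theta) + y_4\log\theta + \text{const}.$$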
  • Slide 13
  • PROC IML code:

    proc iml;
       y1 = 1997; y2 = 906; y3 = 904; y4 = 32;
       theta = 0.20;                        /* starting value */
       do iter = 1 to 20;
          Ex2 = y1*(theta)/(theta+2);       /* E-step */
          theta = (Ex2+y4)/(Ex2+y2+y3+y4);  /* M-step */
          print iter theta;
       end;
    run;

    iter  theta
      1   0.1055303
      2   0.0680147
      3   0.0512031
      4   0.0432646
      5   0.0394234
      6   0.0375429
      7   0.0366170
      8   0.0361598
      9   0.0359338
     10   0.0358219
     11   0.0357666
     12   0.0357392
     13   0.0357256
     14   0.0357189
     15   0.0357156
     16   0.0357139
     17   0.0357131
     18   0.0357127
     19   0.0357125
     20   0.0357124

    Slower than Newton-Raphson/Fisher scoring, but generally more robust to poor starting values.
  • Slide 14
  • How to derive an asymptotic standard error using EM? From Louis (1982), the observed-data information is the complete-data information minus the missing information. Given: $I(\theta \mid \mathbf{y}) = E_{x_2 \mid \mathbf{y},\theta}\!\left[I_c(\theta \mid \mathbf{y}, x_2)\right] - \operatorname{Var}_{x_2 \mid \mathbf{y},\theta}\!\left[S_c(\theta \mid \mathbf{y}, x_2)\right]$, where $I_c$ and $S_c$ are the complete-data information and score.
  • Slide 15
  • Finish off. Now, for this example, $I_c(\theta) = \dfrac{x_2+y_4}{\theta^2} + \dfrac{y_2+y_3}{(1-\theta)^2}$, $E[x_2 \mid y_1,\theta] = \dfrac{y_1\theta}{\theta+2}$, and $\operatorname{Var}\!\left[S_c(\theta)\right] = \dfrac{\operatorname{Var}(x_2 \mid y_1,\theta)}{\theta^2} = \dfrac{2y_1}{\theta(\theta+2)^2}$. Hence $I(\hat\theta \mid \mathbf{y}) = \dfrac{E[x_2]+y_4}{\hat\theta^2} + \dfrac{y_2+y_3}{(1-\hat\theta)^2} - \dfrac{2y_1}{\hat\theta(\hat\theta+2)^2}$, giving an asymptotic standard error $\sqrt{1/I(\hat\theta \mid \mathbf{y})} \approx 0.0060$.
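    As a numerical check (this code is not from the original slides), one can also compute the observed information directly from the incomplete-data log-likelihood $\ell(\theta \mid \mathbf{y}) = y_1\log(2+\theta) + (y_2+y_3)\log(1-\theta) + y_4\log\theta$ shown earlier and evaluate it at the EM estimate:

    proc iml;
       y1 = 1997; y2 = 906; y3 = 904; y4 = 32;
       thetahat = 0.0357124;      /* EM estimate from the earlier PROC IML run */
       /* observed information = minus the second derivative of the incomplete-data log-likelihood */
       info = y1/((2+thetahat)**2) + (y2+y3)/((1-thetahat)**2) + y4/(thetahat**2);
       se = sqrt(1/info);         /* asymptotic standard error, approximately 0.0060 */
       print thetahat info se;
    quit;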
  • Slide 16
  • Stochastic data augmentation (Tanner, 1996). Posterior identity: $p(\theta \mid \mathbf{y}) = \int p(\theta \mid x, \mathbf{y})\, p(x \mid \mathbf{y})\, dx$. Predictive identity: $p(x \mid \mathbf{y}) = \int p(x \mid \theta, \mathbf{y})\, p(\theta \mid \mathbf{y})\, d\theta$. Together these imply a transition function for a Markov chain, $K(\theta^*, \theta) = \int p(\theta \mid x, \mathbf{y})\, p(x \mid \theta^*, \mathbf{y})\, dx$, which suggests an iterative method-of-composition approach for sampling.
  • Slide 17
  • Sampling strategy for $p(\theta \mid \mathbf{y})$. Start somewhere (starting value $\theta^{[0]}$). Cycle 1: sample $x^{[1]}$ from $p(x \mid \theta^{[0]}, \mathbf{y})$, then sample $\theta^{[1]}$ from $p(\theta \mid x^{[1]}, \mathbf{y})$. Cycle 2: sample $x^{[2]}$ from $p(x \mid \theta^{[1]}, \mathbf{y})$, then sample $\theta^{[2]}$ from $p(\theta \mid x^{[2]}, \mathbf{y})$; etc. It's like sampling from E-steps and M-steps.
  • Slide 18
  • What are these full conditional densities (FCD)? Recall the complete-data likelihood $L(\theta \mid \mathbf{y}, x) \propto \left(\tfrac{1}{2}\right)^{x}\left(\tfrac{\theta}{4}\right)^{y_1-x}\left(\tfrac{1-\theta}{4}\right)^{y_2+y_3}\left(\tfrac{\theta}{4}\right)^{y_4}$, where $x$ is the count in the probability-$1/2$ cell, and assume the prior on $\theta$ is flat. Then the FCDs are $\theta \mid x, \mathbf{y} \sim \text{Beta}\!\left(\alpha = y_1 - x + y_4 + 1,\ \beta = y_2 + y_3 + 1\right)$ and $x \mid \theta, \mathbf{y} \sim \text{Binomial}\!\left(n = y_1,\ p = \tfrac{2}{\theta+2}\right)$.
  • Slide 19
  • IML code for the chained data augmentation example:

    proc iml;
       seed1 = 4;
       ncycle = 10000;                  /* total number of samples */
       theta = j(ncycle,1,0);
       y1 = 1997; y2 = 906; y3 = 904; y4 = 32;
       beta = y2+y3+1;
       theta[1] = ranuni(seed1);        /* starting value: initial draw between 0 and 1 */
       do cycle = 2 to ncycle;
          p = 2/(2+theta[cycle-1]);
          xvar = ranbin(seed1,y1,p);              /* draw x from its FCD */
          alpha = y1+y4-xvar+1;
          xalpha = rangam(seed1,alpha);
          xbeta = rangam(seed1,beta);
          theta[cycle] = xalpha/(xalpha+xbeta);   /* Beta(alpha,beta) draw via two gammas */
       end;
       create parmdata var {theta xvar};
       append;
    run;

    data parmdata;
       set parmdata;
       cycle = _n_;
    run;
  • Slide 20
  • Trace plot:

    proc gplot data=parmdata;
       plot theta*cycle;
    run;

    Burn-in? With a bad starting value, one should discard the first few samples to ensure that one is truly sampling from $p(\theta \mid \mathbf{y})$; the starting value should have no impact (convergence in distribution). How to decide on this? See Cowles and Carlin (1996). Here, throw away the first 1000 samples as burn-in.
  • Slide 21
  • Histogram of samples post burn-in:

    proc univariate data=parmdata;
       where cycle > 1000;
       var theta;
       histogram / normal(color=red mu=0.0357 sigma=0.0060);
    run;

    Bayesian inference: N = 9000, posterior mean = 0.03671503, posterior standard deviation = 0.00607971. Quantiles: at 5.0%, observed (Bayesian) = 0.02702 versus asymptotic (likelihood) = 0.02583; at 95.0%, observed (Bayesian) = 0.04728 versus asymptotic (likelihood) = 0.04557. The normal overlay (mu = 0.0357, sigma = 0.0060) corresponds to the asymptotic likelihood inference.
  • Slide 22
  • Zooming in on the trace plot: hints of autocorrelation, which is expected with Markov chain Monte Carlo simulation schemes. The number of drawn samples is NOT equal to the number of independent draws; the greater the autocorrelation, the greater the problem, and the more samples are needed!
  • Slide 23
  • Sample autocorrelation:

    proc arima data=parmdata;
       where cycle > 1000;
       identify var=theta nlag=1000 outcov=autocov;
    run;

    The autocorrelation check for white noise rejects independence (to lag 6: chi-square = 3061.39, DF = 6, Pr > ChiSq < .0001): the draws are strongly autocorrelated.
  • Slide 24
  • How to estimate the effective number of independent samples (effective sample size, ESS)? Consider the posterior mean based on $m$ samples, $\bar\theta = \frac{1}{m}\sum_{j=1}^{m}\theta^{[j]}$. The initial positive sequence estimator (Geyer, 1992; Sorensen and Gianola, 1995) of its Monte Carlo variance is $\widehat{\operatorname{Var}}(\bar\theta) = \frac{1}{m}\left(-\hat\gamma_0 + 2\sum_{t=1}^{T}\hat\Gamma_t\right)$, where $\hat\gamma_k$ is the lag-$k$ autocovariance ($\hat\gamma_0$ is the sample variance) and $\hat\Gamma_t = \hat\gamma_{2(t-1)} + \hat\gamma_{2t-1}$ is the sum of adjacent lag autocovariances. Then $\text{ESS} = \hat\gamma_0 / \widehat{\operatorname{Var}}(\bar\theta)$.
  • Slide 25
  • Initial positive sequence estimator: choose the cutoff $T$ such that all $\hat\Gamma_t > 0$ for $t = 1, \ldots, T$ (i.e., stop summing at the first negative $\hat\Gamma_t$). SAS PROC MCMC chooses a slightly different cutoff (see its documentation). Extensive autocorrelation across lags leads to a smaller ESS.
  • Slide 26
  • SAS code:

    %macro ESS1(data,variable,startcycle,maxlag);
       data _null_;
          set &data nobs=_n;
          call symputx('nsample',_n);
       run;
       proc arima data=&data;
          where cycle > &startcycle;
          identify var=&variable nlag=&maxlag outcov=autocov;
       run;
       proc iml;
          use autocov;
          read all var{'COV'} into cov;
          nsample = &nsample;
          nlag2 = nrow(cov)/2;
          Gamma = j(nlag2,1,0);
          cutoff = 0;
          t = 0;
          do while (cutoff = 0);
             t = t+1;
             Gamma[t] = cov[2*(t-1)+1] + cov[2*(t-1)+2];   /* sum of adjacent lag autocovariances */
             if Gamma[t] < 0 then cutoff = 1;
             if t = nlag2 then do;
                print "Too much autocorrelation";
                print "Specify a larger max lag";
                stop;
             end;
          end;
          varm = (-Cov[1] + 2*sum(Gamma)) / nsample;   /* Monte Carlo variance of the posterior mean */
          ESS = Cov[1]/varm;                           /* effective sample size */
          stdm = sqrt(varm);                           /* Monte Carlo standard error */
          parameter = "&variable";
          print parameter stdm ESS;
       run;
    %mend ESS1;

    Recall: 9000 MCMC post-burn-in cycles.
  • Slide 27
  • Executing %ESS1:

    %ESS1(parmdata,theta,1000,1000);

    Recall: 1000 MCMC burn-in cycles. Output: parameter = theta, stdm = 0.0001116 (Monte Carlo standard error of the posterior mean), ESS = 2967.1289; i.e., the retained samples carry information equivalent to drawing about 2967 independent draws from the posterior density.
  • Slide 28
  • How large an ESS should I target? Routinely, in the thousands or greater; it depends on what you want to estimate. Recommend no less than
