
Markov Chain Monte Carlo in R

Thanks to Lindon Eaves for first getting me interested (and for some of the slides herein) and Scott Lynch for writing an excellent introduction which I use throughout

Markov Chain

Markov chain = a chain of events (e.g., sampling steps) in which the present state of the chain depends only upon the previous state, not on the rest of the history


Monte Carlo:

1. Administrative area of Monaco, famous for gambling and James Bond movies

2. Algorithms that rely on repeated random sampling to compute their results

Markov Chain Monte Carlo:

Repeatedly sample from probability distributions, each of which is determined by the previous step (the Markov Chain part) and by some random noise (the Monte Carlo part). Given a good method of determining whether the current state stays the same or moves to a new place, the chain approximates the true sampling distribution of the statistic.

MCMC is Bayesian

Bayesians interpret probability as a measure of a state of knowledge

Frequentists interpret probability as a measure of how often a thing occurs out of a large number of trials

MCMC is inherently Bayesian because the parameters of interest (e.g., beta coefficients in regression) are treated as random variables which we sample in order to understand their distribution (i.e., how uncertain we are about where the true beta lies).

What good is it?

“Model Liberation” - we can do a lot of cool things much more easily and naturally

It’s easy: coding models using Maximum Likelihood can be extraordinarily cumbersome. Doing so in MCMC is very simple. Basically, if you can simulate it, you can do MCMC with it.

Provides non-parametric sampling distributions ‘for free’ - no need to make assumptions about the shape of the sampling distribution.

Obtain subject-level parameters (genetic and environmental factor scores, imputed data, etc.) at no extra cost

Maximum Likelihood

1. Make a guess at a parameter Ω

2. Find the likelihood that this is the true parameter given your data: Likelihood = P(Data|Ω) where Data is X = (x1, x2…xn)

3. This means multiplying lots of very tiny numbers, so in practice you take the logs of these small numbers and add them: Log Likelihood = ∑log(P(x[i]|Ω))

4. Figure out some way to move the next guess in the right direction. Refigure 3 above.

5. Keep doing 3 & 4 until your results don’t change much and the second derivative is negative (i.e., you are at a maximum). This is your maximum likelihood solution

6. Obtain standard errors from the curvature (second derivative) of the LL at the maximum. The shape of the LL around the maximum gives information, under assumptions (such as normality), about the standard error of your estimate
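The lecture works in R, but the ML steps above can be sketched in Python. This is a minimal illustration, not anyone's production code: the data, the known sigma, and the crude hill-climbing search are all assumptions made for the example.

```python
import numpy as np

# Illustrative example: ML estimate of a normal mean (sigma assumed known).
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=500)  # observed data
sigma = 2.0

def log_lik(mu):
    # Log Likelihood = sum of log P(x[i] | mu): we add logs instead of
    # multiplying many tiny probabilities (step 3).
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

# Steps 4-5: move the guess in whichever direction raises the
# log-likelihood, shrinking the step until the result stops changing.
mu, step = 0.0, 1.0
while step > 1e-8:
    if log_lik(mu + step) > log_lik(mu):
        mu += step
    elif log_lik(mu - step) > log_lik(mu):
        mu -= step
    else:
        step /= 2.0

# Step 6: the SE comes from the curvature of the LL at the maximum; for
# this model the second derivative is -n/sigma^2, so SE = sigma/sqrt(n).
se = sigma / np.sqrt(len(x))
print(mu, se)  # mu converges to the sample mean, the analytic MLE
```

For the normal mean the MLE is just the sample average, which is why the hill-climb lands there; the point of the sketch is the generic loop, which applies even when no closed form exists.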

Problems with ML

1. Many models require integration over values of latent variables (evaluating each likelihood and the derivative of each parameter while holding all others constant). This can be prohibitively expensive when the number of dimensions is large.

2. It can be much harder to code a problem in ML than to simulate from it (and use MCMC)

3. It is sensitive to distributional assumptions (e.g., normality), and assumed distributions of statistics may not reflect the true distribution of statistics

Type of MCMC: the Metropolis algorithm

1. Set things up:

a) Provide priors (Ω1) for the Ω’s of interest. Priors don’t have to be good, but they can affect how fast the chain converges

b) Provide a scale, ∂ (a variance), for the proposal distribution, q()

c) Provide a posterior distribution function (likelihood function), π()

2. Draw a new Ω’ (Ωi+1) from the proposal distribution q(·|Ωi, scale = ∂)

3. Find the likelihood of both Ωi+1 & Ωi: π(Ωi+1 ) & π(Ωi )

4. If π(Ωi+1) > π(Ωi), accept Ωi+1. Otherwise, accept Ωi+1 with probability = π(Ωi+1)/π(Ωi)

5. Redo steps 2-4 for i = 1…R where R is some large number.

6. “Eventually,” a sample of your chosen statistics, Ω’, will converge to the true distribution of Ω. This may require some time (called the ‘burn in’ period).
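Steps 1–6 above can be sketched directly in code. The lecture uses R; this Python version is an illustrative translation in which the target (the mean of normal data with known sigma and a flat prior, so π() reduces to the likelihood), the scale, and the chain length are all assumed values chosen for the example.

```python
import numpy as np

# Metropolis sampler for the mean of normal data (sigma known, flat prior).
rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=200)
sigma = 2.0

def log_pi(omega):
    # log posterior up to a constant: sum of normal log-densities
    return np.sum(-(x - omega)**2 / (2 * sigma**2))

R = 20000       # number of iterations (step 5)
scale = 0.5     # proposal scale, the "∂" of step 1b
omega = 0.0     # starting guess Omega_1 (step 1a; doesn't have to be good)
chain = np.empty(R)

for i in range(R):
    proposal = rng.normal(omega, scale)           # step 2: draw Omega'
    log_ratio = log_pi(proposal) - log_pi(omega)  # steps 3-4, on log scale
    if np.log(rng.uniform()) < log_ratio:         # accept w.p. pi'/pi
        omega = proposal
    chain[i] = omega

burn = 2000                      # discard the burn-in period (step 6)
post = chain[burn:]
print(post.mean(), post.std())   # approximate posterior mean and SD
```

Working on the log scale avoids overflow in the ratio π(Ωi+1)/π(Ωi): comparing log(u) with the log ratio is the same acceptance rule as step 4.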

Facts about MCMC

1. Choosing too large a scale, ∂, will mean the proposal Ωi+1 is accepted too rarely. Choosing too small a scale will mean it is accepted too often (but each move is tiny). Both make burn-in times too long. The optimal acceptance rate is between .3 and .7.

2. Remarkably, a sample of your chosen statistics, Ω’, will converge to the true distribution of Ω regardless of the shape of the proposal distribution, q(), that you choose.

3. The statistics Ω’ will show very high autocorrelation (by the nature of the Markov Chain!). Nevertheless, the property in 2 above is maintained so long as R is large.

4. Difficult problems (high multicollinearity, many latent variables, hierarchical models) can take a long time (very large R).
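Fact 1 is easy to see empirically. This small Python experiment (an illustrative sketch, with a standard normal target chosen for simplicity) runs the Metropolis rule at three proposal scales and reports the fraction of proposals accepted.

```python
import numpy as np

# Acceptance rate vs. proposal scale for a standard-normal target.
rng = np.random.default_rng(3)

def acceptance_rate(scale, R=20000):
    omega, accepted = 0.0, 0
    for _ in range(R):
        proposal = rng.normal(omega, scale)
        # log pi() for a N(0,1) target, up to a constant, is -omega^2/2,
        # so the log acceptance ratio is (omega^2 - proposal^2)/2.
        if np.log(rng.uniform()) < (omega**2 - proposal**2) / 2:
            omega = proposal
            accepted += 1
    return accepted / R

rates = {s: acceptance_rate(s) for s in (0.1, 1.0, 10.0)}
print(rates)  # small scale -> high acceptance, large scale -> low
```

A tiny scale accepts nearly everything but crawls through the space; a huge scale mostly proposes into the tails and gets rejected. Both extremes inflate the autocorrelation described in fact 3.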

The normal distribution π() function

The normal distribution:

π(x|µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))

The normal distribution for multiple regression: each yi is normal with mean xiβ, so

π(β, σ²|y) ∝ ∏ (1/√(2πσ²)) exp(−(yi − xiβ)²/(2σ²))
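A π() function of this kind is all the Metropolis algorithm needs for regression. The Python sketch below (simulated data and fixed sigma are assumptions for the example) sums the normal log-densities of the residuals and checks that it peaks at the least-squares solution.

```python
import numpy as np

# log pi() for multiple regression: y[i] ~ Normal(X[i] @ beta, sigma^2),
# flat priors, so the log posterior is the sum of normal log-densities.
rng = np.random.default_rng(4)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + slope
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

def log_pi(beta, sigma):
    resid = y - X @ beta
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - resid**2 / (2 * sigma**2))

# For fixed sigma the log posterior is maximized by the OLS solution,
# since it minimizes the residual sum of squares.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(log_pi(beta_ols, 1.5) > log_pi(beta_true + 0.5, 1.5))  # True
```

Dropping this `log_pi` into the Metropolis loop shown earlier (proposing moves in each beta, and in sigma) yields draws from the joint posterior of the regression parameters.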


Excellent MCMC introductions