Sampling and Markov Chain Monte Carlo Techniques
TRANSCRIPT
![Page 1: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/1.jpg)
Part 1: 2016-01-20
Part 2: 2016-02-10
Tomasz Kuśmierczyk
Session 5: Sampling & MCMC
Approximate and Scalable Inference for Complex Probabilistic Models in Recommender Systems
Part 2: Inference Techniques
![Page 2: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/2.jpg)
MCMC = Markov Chain Monte Carlo
MCMC ⊂ Sampling
![Page 3: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/3.jpg)
Literature / Credits
● Szymon Jaroszewicz, lectures on “Selected Advanced Topics in Machine Learning”
● Daphne Koller, lectures on “Probabilistic Graphical Models” (https://class.coursera.org/pgm-003/lecture)
● Patrick Lam, slides: http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf
● Bishop’s book, ch. 11
● MacKay, David J. C. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003. (http://www.inference.phy.cam.ac.uk/itprnn/book.pdf)
● R & JAGS online tutorials…
● …
![Page 4: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/4.jpg)
Basics & motivation
![Page 5: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/5.jpg)
Motivation: Monte Carlo for integrating
http://mlg.eng.cam.ac.uk/zoubin/tut06/mcmc.pdf
Non-trivial posterior distribution (e.g., for BNs)
![Page 6: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/6.jpg)
Sampling vs Variational Inference (previous seminar)
http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
![Page 7: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/7.jpg)
Sampling continued ...
● The accuracy of sampling-based estimates depends only on the variance of the quantity being estimated
● It does not depend directly on the dimensionality (having many variables is not in itself a problem)
● In some cases we are able to break the curse of dimensionality
but
● Sampling gets much more difficult in higher dimensions
● Variance often increases as the dimension grows
● The accuracy of sampling-based methods improves only with the square root of the number of samples (the error shrinks as 1/√N)
Jaroszewicz
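The square-root behaviour can be checked numerically. The sketch below is only an illustration, not part of the original slides: the integrand x^2, the uniform sampler and the helper name `mc_estimate` are assumptions chosen for the example. It estimates E[x^2] for x ~ Uniform(0, 1) (exact value 1/3) and reports the estimated standard error, which should shrink roughly tenfold when the number of samples grows a hundredfold.

```python
import math
import random

def mc_estimate(f, sampler, n, rng):
    """Average f over n draws from sampler; returns (estimate, standard error)."""
    values = [f(sampler(rng)) for _ in range(n)]
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)

rng = random.Random(0)
# E[x^2] for x ~ Uniform(0, 1); the exact value is 1/3.
for n in (100, 10_000):
    est, se = mc_estimate(lambda x: x * x, lambda r: r.random(), n, rng)
    print(n, round(est, 4), round(se, 4))
```

Note that the dimensionality of x never enters `mc_estimate`; only the variance of f(x) does.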
![Page 8: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/8.jpg)
Sampling techniques - basic cases
● uniform -> pseudo-random number generator
● discrete distributions -> range matching with the help of uniform samples (in time logarithmic in the number of outcomes)
● continuous -> inverse of the CDF
● various ‘tricks’
● ...
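A minimal sketch of the two basic cases, assuming nothing beyond the Python standard library. The function names and the choice of the exponential distribution for the inverse-CDF case are illustrative assumptions. Range matching precomputes cumulative sums once, so each draw is a binary search over the outcomes.

```python
import bisect
import itertools
import math
import random

def make_discrete_sampler(probs):
    """Precompute cumulative sums once; each draw is a binary search, O(log K)."""
    cumulative = list(itertools.accumulate(probs))
    total = cumulative[-1]

    def draw(rng):
        u = rng.random() * total
        # The min() guards against floating-point rounding at the upper end.
        return min(bisect.bisect_right(cumulative, u), len(probs) - 1)

    return draw

def sample_exponential(rate, rng):
    """Inverse-CDF sampling: F(x) = 1 - exp(-rate * x), so x = -ln(1 - u) / rate."""
    u = rng.random()
    return -math.log(1.0 - u) / rate

rng = random.Random(0)
draw = make_discrete_sampler([0.2, 0.5, 0.3])
print([draw(rng) for _ in range(10)])
print(sample_exponential(2.0, rng))
```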
![Page 9: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/9.jpg)
Sampling techniques (e.g., for BNs posterior)
● Ancestral Sampling (no evidence)
● Probabilistic Logic Sampling (like AS, but samples not consistent with evidence are discarded -> low number of samples generated)
● Likelihood weighting (estimates may be inaccurate + other problems)
● Importance Sampling
● (Adaptive) Rejection Sampling
● Sampling-Importance-Resampling
● Metropolis
● Metropolis-Hastings
● Gibbs Sampling
● Hamiltonian (hybrid) Sampling
● Slice sampling
● and more...
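To make the difference between the first entries of this list concrete, here is a hedged sketch on a hypothetical two-node network Rain -> WetGrass with evidence WetGrass = true. The network, its probabilities and the function names are assumptions made up for this example. Logic sampling draws ancestral samples and throws away those inconsistent with the evidence; likelihood weighting keeps every sample but weights it by the probability of the evidence. The exact posterior here is 0.18 / 0.26, about 0.69.

```python
import random

P_RAIN = 0.2                               # P(rain = True), assumed
P_WET_GIVEN_RAIN = {True: 0.9, False: 0.1}  # P(wet = True | rain), assumed

def logic_sampling(n, rng):
    """Ancestral sampling of (rain, wet); keep only samples where wet is True."""
    kept = rainy = 0
    for _ in range(n):
        rain = rng.random() < P_RAIN
        wet = rng.random() < P_WET_GIVEN_RAIN[rain]
        if wet:
            kept += 1
            rainy += rain
    return rainy / kept, kept

def likelihood_weighting(n, rng):
    """Sample rain from its prior and weight each sample by P(wet = True | rain)."""
    weighted_rain = total_weight = 0.0
    for _ in range(n):
        rain = rng.random() < P_RAIN
        w = P_WET_GIVEN_RAIN[rain]
        total_weight += w
        weighted_rain += w * rain
    return weighted_rain / total_weight

rng = random.Random(0)
print(logic_sampling(20_000, rng))
print(likelihood_weighting(20_000, rng))
```

The second value returned by `logic_sampling` shows how many samples survive; with rarer evidence that number drops quickly.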
![Page 10: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/10.jpg)
Monte Carlo without Markov Chains
![Page 11: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/11.jpg)
A few remarks
● there is no difference between sampling from normalized and non-normalized distributions
● non-normalized distributions are easy to evaluate for BNs
● in most cases (e.g. rejection sampling) we work with non-normalized distributions
● for simplicity p(x) is used in the notation, but nothing changes for complicated posterior distributions
● the 1D case is presented, but the methods also work in the multi-dimensional case
![Page 12: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/12.jpg)
Rejection sampling
Jaroszewicz, Bishop
c q(x)
p(x)
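A minimal sketch of rejection sampling with an envelope c q(x) over an unnormalized target. The target x(1 - x) on [0, 1], the uniform proposal and the function name are assumptions chosen for illustration, not taken from the slide figure. A proposal x drawn from q is accepted with probability p(x) / (c q(x)).

```python
import random

def rejection_sample(p_tilde, q_sample, q_pdf, c, n, rng):
    """Draw n samples from the density proportional to p_tilde.

    Requires p_tilde(x) <= c * q_pdf(x) for all x.
    Returns the samples and the observed acceptance rate.
    """
    samples, proposals = [], 0
    while len(samples) < n:
        x = q_sample(rng)
        proposals += 1
        # Accept x with probability p_tilde(x) / (c * q(x)).
        if rng.random() * c * q_pdf(x) <= p_tilde(x):
            samples.append(x)
    return samples, len(samples) / proposals

rng = random.Random(0)
# Target proportional to x * (1 - x) on [0, 1]; its maximum is 0.25, so c = 0.25 is tight.
xs, rate = rejection_sample(lambda x: x * (1 - x), lambda r: r.random(),
                            lambda x: 1.0, 0.25, 5_000, rng)
print(sum(xs) / len(xs), rate)
```

For this target the expected acceptance rate is the area under p divided by c, that is (1/6) / 0.25 = 2/3; a larger c lowers it.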
![Page 13: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/13.jpg)
Rejection sampling - proof
Jaroszewicz
![Page 14: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/14.jpg)
Selection of c?
● c should be as small as possible to keep the rejection rate low
● but p <= c q must hold
● Adaptive Rejection Sampling for log-concave distributions
  ○ log-concave = logarithm of the distribution is concave
![Page 15: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/15.jpg)
Adaptive Rejection Sampling
Jaroszewicz
![Page 16: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/16.jpg)
Rejection Sampling problems
● part of the samples are rejected
● tight “envelope” helps a bit
but
● in many dimensions (when there are many variables) the curse of dimensionality must be taken into account
● see Bishop’s example (for rejection sampling):
  ○ p(x) ~ N(0, s1)
  ○ q(x) ~ N(0, 1.01*s1)
  ○ D=1000
  ○ -> acceptance ratio of about 1/20000
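The figure in Bishop's example can be reproduced with one line of arithmetic. Assuming s1 denotes the standard deviation of an isotropic zero-mean Gaussian in D dimensions, the ratio p(x)/q(x) of the normalized densities is largest at x = 0, where it equals (s_q / s_p)^D. That is the smallest valid c, and the acceptance rate is 1/c. The function name below is an assumption for illustration.

```python
def gaussian_acceptance_rate(sigma_p, sigma_q, dim):
    """Acceptance rate 1/c for isotropic zero-mean Gaussians, using the
    tightest constant c = (sigma_q / sigma_p) ** dim (needs sigma_q >= sigma_p)."""
    return (sigma_p / sigma_q) ** dim

rate = gaussian_acceptance_rate(1.0, 1.01, 1000)
print(f"acceptance rate is about 1/{1 / rate:.0f}")
```

A proposal that is only 1% wider than the target in each dimension is therefore almost useless once D reaches 1000.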
![Page 17: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/17.jpg)
Markov Chains
![Page 18: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/18.jpg)
What is a Markov Chain?
● A triple <possibly infinite set S of possible states, initial distribution over states P0, transition matrix P (T)>
● transition matrix - a matrix of probabilities Pij (Tij) that, being in state si at time t, we move to state sj at time t+1
● Markov property = the next state depends only on the current state
Jaroszewicz
![Page 19: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/19.jpg)
Markov Chains - distribution over states
Jaroszewicz
![Page 20: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/20.jpg)
Markov Chains - stationary distribution
Jaroszewicz
![Page 21: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/21.jpg)
Stationarity example
Daphne Koller
![Page 22: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/22.jpg)
Stationarity from regularity
● If there exists k such that, for every two states <si, sj>, the probability of getting from si to sj in exactly k steps is > 0 (MC is regular) → MC converges to a unique stationary distribution
● Sufficient conditions for regularity:
  ○ there is a path between every pair of states
  ○ for every state, there is a self-transition
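Convergence to a unique stationary distribution can be observed directly by repeatedly multiplying a distribution over states by the transition matrix. The 3-state matrix below is an illustrative assumption (it is not claimed to be the one shown on the slides); it is irreducible and aperiodic, so iteration from any starting distribution should reach the same fixed point, here (0.2, 0.5, 0.3).

```python
def step(dist, T):
    """One step of the chain: new_j = sum_i dist_i * T[i][j]."""
    n = len(T)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]

def stationary(T, dist, tol=1e-12, max_iter=10_000):
    """Iterate dist <- dist T until it stops changing."""
    for _ in range(max_iter):
        new = step(dist, T)
        if max(abs(a - b) for a, b in zip(new, dist)) < tol:
            return new
        dist = new
    raise RuntimeError("no convergence; the chain may be periodic or reducible")

# Hypothetical transition matrix; each row sums to 1.
T = [[0.25, 0.0, 0.75],
     [0.0, 0.7, 0.3],
     [0.5, 0.5, 0.0]]

print(stationary(T, [1.0, 0.0, 0.0]))
print(stationary(T, [0.0, 0.0, 1.0]))
```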
![Page 23: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/23.jpg)
Stationarity of irreducible, aperiodic MC
● Irreducible, aperiodic Markov chains (with a finite state space) always converge to a unique stationary distribution
![Page 24: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/24.jpg)
Reducibility
Jaroszewicz
![Page 25: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/25.jpg)
Periodicity
Jaroszewicz
![Page 26: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/26.jpg)
Why I talk about Markov Chains -> MCMC
the idea is that:
● Markov Chain “jumps” over states
● states determine (BN) samples (that are later used for Monte Carlo)
  ○ for example: state ⇔ sample
but we need:
● Markov Chain converges to a stationary distribution (to be proved every time)
● the distribution of generated samples is equal to the required distribution (BNs posterior)
![Page 27: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/27.jpg)
Properties
● Very general purpose
● Often easy to implement
● Good theoretical guarantees as t -> ∞
but:
● Lots of tunable parameters / design choices
● Can be quite slow to converge
● Difficult to tell whether it’s working
![Page 28: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/28.jpg)
Metropolis-Hastings derivation on the blackboard:
1. From detailed balance to stationarity
2. Proposal distribution and acceptance probability
3. From detailed balance to conditions on acceptance probability
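Since the derivation itself was done on the blackboard, the sketch below only checks its conclusion numerically on a small discrete state space; it is not a reconstruction of the blackboard material. It assumes the standard Metropolis-Hastings acceptance probability min(1, p(j) Q(j,i) / (p(i) Q(i,j))) and hypothetical values for the unnormalized target `p_tilde` and the proposal matrix `Q`. Detailed balance, p(i) T(i,j) = p(j) T(j,i), implies stationarity: summing both sides over i gives sum_i p(i) T(i,j) = p(j).

```python
def mh_transition_matrix(p_tilde, Q):
    """Metropolis-Hastings kernel for an unnormalized, strictly positive
    target p_tilde and a proposal matrix Q whose rows sum to 1."""
    n = len(p_tilde)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and Q[i][j] > 0:
                accept = min(1.0, (p_tilde[j] * Q[j][i]) / (p_tilde[i] * Q[i][j]))
                T[i][j] = Q[i][j] * accept
        # Rejected proposals (and proposals of i itself) keep the chain in state i.
        T[i][i] = 1.0 - sum(T[i])
    return T

p_tilde = [1.0, 2.0, 3.0]       # hypothetical unnormalized target
Q = [[0.0, 0.5, 0.5],           # hypothetical asymmetric proposal
     [0.9, 0.0, 0.1],
     [0.3, 0.7, 0.0]]
T = mh_transition_matrix(p_tilde, Q)

for i in range(3):
    for j in range(3):
        # Detailed balance holds for every pair of states.
        assert abs(p_tilde[i] * T[i][j] - p_tilde[j] * T[j][i]) < 1e-12
```

Only ratios of `p_tilde` appear, which is why the normalizing constant is never needed.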
![Page 29: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/29.jpg)
Part 2
Dawn of Statistical Renaissance
![Page 30: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/30.jpg)
Gibbs sampling
![Page 31: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/31.jpg)
Gibbs sampling: Algorithm
Daphne Koller
![Page 32: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/32.jpg)
Does it work? - often
Under certain conditions, the stationary distribution of this Markov chain is the joint distribution of the Bayesian network:
● A probability distribution P(X) is positive if P(X = x) > 0 for all x ∈ Dom(X).
● Theorem: If all conditional distributions in a Bayesian network are positive (all probabilities are > 0), then a Gibbs sampler converges to the joint distribution of the Bayesian network.
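A small Gibbs sampler illustrates the theorem. The sketch assumes a hypothetical, strictly positive joint table over two binary variables (the table and the function name are made up for this example). Each variable is resampled in turn from its conditional given the other, and the empirical state frequencies should approach the joint table.

```python
import random

# Hypothetical strictly positive joint distribution P(a, b).
JOINT = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def gibbs(joint, n_samples, rng, burn_in=1000):
    """Alternately resample a | b and b | a; return empirical state frequencies."""
    a, b = 0, 0
    counts = {state: 0 for state in joint}
    for t in range(burn_in + n_samples):
        # P(a = 1 | b) is proportional to joint[(1, b)].
        p_a1 = joint[(1, b)] / (joint[(0, b)] + joint[(1, b)])
        a = 1 if rng.random() < p_a1 else 0
        # P(b = 1 | a) is proportional to joint[(a, 1)].
        p_b1 = joint[(a, 1)] / (joint[(a, 0)] + joint[(a, 1)])
        b = 1 if rng.random() < p_b1 else 0
        if t >= burn_in:
            counts[(a, b)] += 1
    return {state: c / n_samples for state, c in counts.items()}

print(gibbs(JOINT, 50_000, random.Random(0)))
```

If one entry of the table were zero, some conditionals could become deterministic and the chain might never leave part of the state space, which is why positivity matters.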
![Page 33: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/33.jpg)
Gibbs properties
● Can handle evidence even with very low probability
● Works for all kinds of models, e.g. Markov networks, continuous variables
● Works very well in many practical cases
● overall is a very powerful and useful technique
● very popular nowadays
● has become another Swiss army knife for probabilistic inference
but
● Samples not statistically independent (statistics gets difficult)
● Hard to give guarantees on results
Jaroszewicz
![Page 34: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/34.jpg)
Gibbs problems - more exploratory chains needed
Jaroszewicz
![Page 35: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/35.jpg)
Gibbs sampling: example
![Page 36: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/36.jpg)
Bayesian PMF using MCMC
https://www.cs.toronto.edu/~amnih/papers/bpmf.pdf
![Page 37: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/37.jpg)
Bayesian PMF using MCMC
Some useful formulas:
on the blackboard ...
![Page 38: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/38.jpg)
Diagnostics
![Page 39: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/39.jpg)
You never know with randomness...
![Page 40: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/40.jpg)
Practical problems
● We only want to use samples that are drawn from a distribution close to p(x) - when the chain is already ‘mixing’
● At early iterations (before the chain has converged) we may be far from p(x) - we need ‘burn-in’ iterations
● Samples are correlated - we need thinning (take only every n-th sample)
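Burn-in and thinning amount to simple post-processing of the stored chain. A minimal sketch, assuming the chain is kept as a Python list (the helper name is an assumption):

```python
def burn_and_thin(chain, burn_in, thin):
    """Drop the first burn_in draws, then keep every thin-th remaining draw."""
    if burn_in < 0 or thin < 1:
        raise ValueError("burn_in must be >= 0 and thin must be >= 1")
    return chain[burn_in::thin]

print(burn_and_thin(list(range(10)), burn_in=4, thin=3))  # → [4, 7]
```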
![Page 41: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/41.jpg)
Diagnostics
● Visual Inspection
● Geweke Diagnostic
  ○ tests whether the burn-in is sufficient
● Gelman and Rubin Diagnostic
  ○ may detect problems with disconnected sample spaces
● Raftery and Lewis Diagnostic
  ○ calculates the number of iterations and burn-in needed by first running
● Heidelberger and Welch Diagnostic
  ○ test statistic for stationarity of the distribution
http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf
![Page 42: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/42.jpg)
Visual inspection
http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf
Multimodal distribution, hard to get from one mode to another.
The chain is not mixing.
![Page 43: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/43.jpg)
Autocorrelation (correlation between delayed samples)
http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf
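The autocorrelation at a given lag can be estimated from a stored chain as sketched below. This uses the usual sample estimator normalized by the lag-0 sum of squares; the function name is an assumption. Values that stay close to 1 for many lags indicate slow mixing.

```python
def autocorrelation(chain, lag):
    """Sample autocorrelation of chain at the given lag (lag 0 gives 1.0)."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain)
    cov = sum((chain[t] - mean) * (chain[t + lag] - mean) for t in range(n - lag))
    return cov / var

# A perfectly alternating chain is strongly negatively correlated at lag 1.
alternating = [1.0, -1.0] * 50
print(autocorrelation(alternating, 1))  # → -0.99
```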
![Page 44: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/44.jpg)
Geweke Diagnostic
● takes two nonoverlapping parts of the Markov chain
● compares the means of both parts, using a difference of means test
● to see if the two parts of the chain are from the same distribution (null hypothesis).
● the test statistic is a standard Z-score with the standard errors adjusted for autocorrelation.
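A simplified, Geweke-style sketch of the comparison. Two things here are assumptions and differ from the full diagnostic: the split into the first 10% and last 50% of the chain is a common default rather than something stated on the slide, and the standard errors are the naive ones, with no adjustment for autocorrelation. The result is therefore only indicative for correlated chains.

```python
import math

def geweke_z(chain, first=0.1, last=0.5):
    """Z-score for the difference between the means of the first and last
    parts of the chain. Uses naive standard errors (no autocorrelation fix)."""
    n = len(chain)
    a = chain[: int(round(first * n))]
    b = chain[n - int(round(last * n)):]

    def mean_and_sq_se(xs):
        m = sum(xs) / len(xs)
        var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
        return m, var / len(xs)

    mean_a, se2_a = mean_and_sq_se(a)
    mean_b, se2_b = mean_and_sq_se(b)
    return (mean_a - mean_b) / math.sqrt(se2_a + se2_b)

print(geweke_z([0.0, 1.0] * 500))                 # both parts have the same mean
print(geweke_z([float(t) for t in range(1000)]))  # a drifting chain
```

A large absolute value (say, above 2) suggests that the early part of the chain still differs from the late part, so more burn-in is needed.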
![Page 45: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/45.jpg)
Gelman and Rubin Diagnostic
1. Run m ≥ 2 chains of length 2n from overdispersed starting values.
2. Discard the first n draws in each chain.
3. Calculate the within-chain and between-chain variance.
http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf
![Page 46: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/46.jpg)
Gelman and Rubin Diagnostic 2
4. Calculate the estimated variance of the parameter as a weighted sum of the within-chain and between-chain variance.
5. Calculate the potential scale reduction factor.
When R is high (perhaps greater than 1.1 or 1.2), then we should run our chains out longer to improve convergence to the stationary distribution.
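The steps can be sketched as follows, assuming the commonly used formulas: W is the mean of the per-chain sample variances, B is n/(m-1) times the sum of squared deviations of the chain means from the grand mean, the estimated variance is (1 - 1/n) W + B/n, and R is the square root of that estimate divided by W. The function name is an assumption, and the input is taken to be the draws that remain after the first half of each chain has been discarded.

```python
import math

def gelman_rubin(chains):
    """Potential scale reduction factor for m chains of equal length n."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Within-chain variance W: average of the per-chain sample variances.
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    # Between-chain variance B.
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    var_hat = (1 - 1 / n) * W + B / n
    return math.sqrt(var_hat / W)

agree = [[0.0, 1.0] * 50, [0.0, 1.0] * 50]       # chains exploring the same region
disagree = [[0.0, 1.0] * 50, [10.0, 11.0] * 50]  # chains stuck in different regions
print(gelman_rubin(agree), gelman_rubin(disagree))
```

Chains that agree give a value close to 1, while chains stuck in different regions give a value far above the 1.1 to 1.2 threshold.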
http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf
![Page 47: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/47.jpg)
Probabilistic programming
![Page 48: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/48.jpg)
Probabilistic programming language
programming language designed to:
● describe probabilistic models
● perform inference automatically even on complicated models
for example:
● PyMC
● BUGS / JAGS
● BayesPy
https://en.wikipedia.org/wiki/Probabilistic_programming_language
![Page 49: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/49.jpg)
What’s inside?
● BUGS - Adaptive Rejection (AR) sampling
● JAGS - Slice Sampler (one variable at once)
![Page 50: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/50.jpg)
JAGS PMF-like example: model file
model{
#########START###########

  sv ~ dunif(0,100)
  su ~ dunif(0,100)
  s ~ dunif(0,100)
  tau <- 1/(s*s)
  tauv <- 1/(sv*sv)
  tauu <- 1/(su*su)
  ...

  ...

  for (j in 1:M) {
    for (d in 1:D) {
      v[j,d] ~ dnorm(0, tauv)
    }
  }
  for (i in 1:N) {
    for (d in 1:D) {
      u[i,d] ~ dnorm(0, tauu)
    }
  }
  for (j in 1:M) {
    for (i in 1:N) {
      mu[i,j] <- inprod(u[i,], v[j,])
      r3[i,j] <- 1/(1+exp(-mu[i,j]))
      r[i,j] ~ dnorm(r3[i,j], tau)
    }
  }

}
#############END############
![Page 51: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/51.jpg)
JAGS PMF-like example: Parameters preparation
n.chains = 1
n.iter = 5000
n.burnin = n.iter
n.thin = 1 #max(1, floor((n.iter - n.burnin)/1000))
D = 10
lu = 0.05
lv = 0.05
n.cluster=n.chains
model.file = "models/pmf_hypnorm3.bug"

N = dim(train)[1]
M = dim(train)[2]
start.s = sd(train[!is.na(train)])
start.su = sqrt(start.s^2/lu)
start.sv = sqrt(start.s^2/lv)

jags.data = list(N=N, M=M, D=D, r=train)
jags.params = c("u", "v", "s", "su", "sv")
jags.inits = list(s=start.s, su=start.su, sv=start.sv,
                  u=matrix(rnorm(N*D, mean=0, sd=start.su), N, D),
                  v=matrix(rnorm(M*D, mean=0, sd=start.sv), M, D))
![Page 52: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/52.jpg)
JAGS PMF-like example: running (sampling)
library(rjags)
model = jags.model(model.file, jags.data, n.chains=n.chains, n.adapt=n.burnin)
#update(model)
samples = jags.samples(model, jags.params, n.iter=n.iter, thin=n.thin)
![Page 53: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/53.jpg)
JAGS PMF-like example: retrieving samples
per.chain = dim(samples$u)[3]
iterations = per.chain * dim(samples$u)[4]
user_sample = function(i, k) {samples$u[i, , (k-1)%%per.chain+1, ceiling(k/per.chain)]}
item_sample = function(j, k) {samples$v[j, , (k-1)%%per.chain+1, ceiling(k/per.chain)]}
![Page 54: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/54.jpg)
Why it’s good, why it’s bad?
● fast prototyping
● less control
![Page 55: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/55.jpg)
Results on MovieLens 100k
RMSE = 0.943 (~SGD)
More on https://github.com/tkusmierczyk/pmf-jags
![Page 56: Sampling and Markov Chain Monte Carlo Techniques](https://reader034.vdocuments.site/reader034/viewer/2022042907/586f71b01a28ab10258b508b/html5/thumbnails/56.jpg)
Thank you!