
STAT 425: Introduction to Bayesian Analysis

Marina Vannucci

Rice University, USA

Fall 2017


Lecture 2: Introduction to Bayesian Statistics

Bayes' rule and an example

Bayesian inference (prior, likelihood, posterior)



The tenets of Bayesian Analysis

Bayesian statistics starts by using (prior) probabilities to describe your current state of knowledge: prior elicitation is a very important step of the analysis.

It then incorporates information through the collection of data: modeling the data-generating process is also an essential step of the analysis.

By combining the prior probabilities with the data, we obtain new (posterior) probabilities that describe an updated state of knowledge.

In Bayesian statistics, all uncertainty and all information are incorporated through the use of probability distributions, and all conclusions obey the laws of probability theory!


The scope of Data Analysis

We typically perform data analysis to:

Draw conclusions based on currently available data

Predict the future based on currently available data

Statistical models are useful tools for scientific discovery and prediction!

“All models are wrong but some are useful” (George Box).


Discrete Random Variables

The set of possible values is either finite or countably infinite. Associated with a random variable $Y$, with possible values $y_1, y_2, \ldots$, there is a probability mass function, $p(y_i) = P(Y = y_i)$, such that

$$p(y_i) \geq 0 \quad \text{and} \quad \sum_i p(y_i) = 1,$$

... and a cumulative distribution function (cdf), $F(y) = P(Y \leq y)$, $-\infty < y < \infty$. The cdf is non-decreasing and satisfies

$$\lim_{y \to -\infty} F(y) = 0 \quad \text{and} \quad \lim_{y \to \infty} F(y) = 1.$$

If $Y$ is a discrete random variable, $F(y)$ is a step function, with jumps occurring at the values of $y$ for which $p(y) > 0$.

The mean or expected value of a discrete random variable $Y$ with probability mass function $p(y)$ is given by

$$\mu = E(Y) = \sum_i y_i \, p(y_i).$$
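As a quick check of these definitions, here is a minimal Python sketch (the support and probabilities are made up for illustration) that verifies the pmf conditions and computes the expected value.

```python
# Minimal sketch: a made-up discrete pmf, checking the two pmf conditions
# and computing mu = E(Y) = sum_i y_i p(y_i).
pmf = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

assert all(p >= 0 for p in pmf.values())        # p(y_i) >= 0
assert abs(sum(pmf.values()) - 1.0) < 1e-12     # sum_i p(y_i) = 1

mean = sum(y * p for y, p in pmf.items())
print("E(Y) =", mean)                           # 1.7 for these probabilities
```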


Continuous random variables

For continuous random variables, the set of possible values is uncountable. The probability density function (pdf), $f(y)$, of a continuous random variable $Y$ with support $S$ is an integrable function such that:

(a) $f(y) > 0$ if $y \in S$, and $f(y) = 0$ if $y \notin S$;

(b) $\int_S f(y) \, dy = 1$;

(c) $P(a \leq Y \leq b) = \int_a^b f(y) \, dy$.
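These properties can be checked numerically for any specific density; the sketch below is my own illustration using an Exponential(1) pdf, not an example from the lecture.

```python
# Minimal sketch: checking pdf properties numerically for f(y) = exp(-y), y >= 0.
import numpy as np
from scipy.integrate import quad

f = lambda y: np.exp(-y)              # Exponential(1) pdf, support S = [0, inf)

total, _ = quad(f, 0, np.inf)         # property (b): should integrate to 1
prob, _ = quad(f, 0.5, 2.0)           # property (c): P(0.5 <= Y <= 2)

print(round(total, 6))                # ~1.0
print(round(prob, 6))                 # exp(-0.5) - exp(-2) ~ 0.4712
```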


For a continuous random variable, the cumulative distribution function (cdf) is given by

$$P(Y \leq a) = F(a) = \int_{-\infty}^{a} f(y) \, dy.$$

The cdf of a continuous random variable, $F(y)$, is continuous and monotonically non-decreasing.

If $a < b$, we obtain

$$P(a \leq Y \leq b) = \int_a^b f(y) \, dy = F(b) - F(a).$$

The mean or expected value of a continuous random variable $Y$ with pdf $f(y)$ is

$$\mu = E(Y) = \int_{-\infty}^{\infty} y f(y) \, dy.$$
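The identity $P(a \leq Y \leq b) = F(b) - F(a)$ is easy to confirm numerically; the sketch below is my own illustration with a standard normal $Y$.

```python
# Minimal sketch: F(b) - F(a) versus direct integration of the pdf, Y ~ N(0, 1).
from scipy.integrate import quad
from scipy.stats import norm

a, b = -1.0, 2.0
via_cdf = norm.cdf(b) - norm.cdf(a)                 # F(b) - F(a)
via_pdf, _ = quad(norm.pdf, a, b)                   # integral of f(y) over [a, b]
mean, _ = quad(lambda y: y * norm.pdf(y), -10, 10)  # E(Y), ~0 for the standard normal

print(via_cdf, via_pdf, mean)                       # first two agree; mean ~ 0
```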


Joint distributions

$X, Y$ discrete random variables with values $x_1, x_2, \ldots$ and $y_1, y_2, \ldots$, respectively. Their joint probability mass function $p_{X,Y}(x, y)$ is

$$p_{X,Y}(x_i, y_j) = P(X = x_i, Y = y_j).$$

The marginal probability mass function of one random variable is obtained from the joint frequency distribution by

$$p_X(x) = \sum_j p_{X,Y}(x, y_j), \qquad p_Y(y) = \sum_i p_{X,Y}(x_i, y).$$

The conditional probability that $X = x_i$, given that $Y = y_j$, is

$$p_{X|Y}(x|y) = P(X = x_i \mid Y = y_j) = \frac{P(X = x_i, Y = y_j)}{P(Y = y_j)} = \frac{p_{X,Y}(x_i, y_j)}{p_Y(y_j)}.$$

This can be re-expressed as $p_{X,Y}(x, y) = p_{X|Y}(x|y) \, p_Y(y)$. Summing both sides over all values of $y$, we get a very useful application of the law of total probability:

$$p_X(x) = \sum_y p_{X|Y}(x|y) \, p_Y(y).$$
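These identities are easy to verify on a small table; the joint pmf below is made up for illustration.

```python
# Minimal sketch: marginals, conditionals, and the law of total probability
# for a made-up joint pmf over X in {0, 1} (rows) and Y in {0, 1, 2} (columns).
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.25, 0.20]])

p_x = p_xy.sum(axis=1)            # p_X(x) = sum_j p_{X,Y}(x, y_j)
p_y = p_xy.sum(axis=0)            # p_Y(y) = sum_i p_{X,Y}(x_i, y)
p_x_given_y = p_xy / p_y          # p_{X|Y}(x|y) = p_{X,Y}(x, y) / p_Y(y)

# Law of total probability: p_X(x) = sum_y p_{X|Y}(x|y) p_Y(y)
print(np.allclose(p_x, (p_x_given_y * p_y).sum(axis=1)))   # True
```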


Continuous random variables:

Suppose that $X$ and $Y$ are two continuous random variables. Their joint probability density function, $f(x, y)$, is the surface such that for any region $A$ in the $xy$-plane,

$$P((X, Y) \in A) = \iint_A f(x, y) \, dx \, dy.$$

The marginal probability density function of one random variable is obtained from the joint pdf:

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy, \qquad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dx.$$

The conditional density function of $Y$ given $X$ is defined to be

$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)}, \quad \text{if } 0 < f_X(x) < \infty.$$


The joint density can be expressed in terms of the marginal and conditional densities as

$$f_{X,Y}(x, y) = f_{Y|X}(y|x) \, f_X(x) = f_{X|Y}(x|y) \, f_Y(y).$$

Integrating the second factorization over $y$ allows the marginal density of $X$ to be expressed as

$$f_X(x) = \int_{-\infty}^{\infty} f_{X|Y}(x|y) \, f_Y(y) \, dy,$$

which is the law of total probability for the continuous case.
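A concrete instance (my own example, not from the slides): if $Y \sim N(0, 1)$ and $X \mid Y = y \sim N(y, 1)$, then integrating $f(x|y) f(y)$ over $y$ gives the $N(0, 2)$ marginal density of $X$. The sketch below checks this numerically at a few points.

```python
# Minimal sketch: law of total probability with Y ~ N(0,1) and X|Y=y ~ N(y,1),
# so that marginally X ~ N(0, 2).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def f_x(x):
    # f_X(x) = integral of f_{X|Y}(x|y) f_Y(y) dy
    val, _ = quad(lambda y: norm.pdf(x, loc=y) * norm.pdf(y), -np.inf, np.inf)
    return val

for x in (-1.0, 0.0, 2.5):
    print(x, f_x(x), norm.pdf(x, scale=np.sqrt(2.0)))   # last two columns agree
```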


Bayes Rule for continuous random variables

Rule of total probability:

$$f(x) = \int f(x|y) \, f(y) \, dy$$

Bayes' rule:

$$f(y|x) = \frac{f(x|y) \, f(y)}{f(x)} = \frac{f(x|y) \, f(y)}{\int f(x|y) \, f(y) \, dy}$$

Note on notation: from now on, we write $p(x)$ for a generic $X$ (discrete or continuous).
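As a simple illustration of Bayes' rule for a continuous unknown (my own example, not from the lecture): take a $N(0, 1)$ prior on $y$ and a $N(y, 1)$ likelihood for one observed $x$, and compute the posterior by normalizing $f(x|y) f(y)$ on a grid. In this conjugate case the exact posterior is $N(x/2, 1/2)$, which the grid result matches.

```python
# Minimal sketch: Bayes' rule f(y|x) = f(x|y) f(y) / f(x) via a grid approximation.
import numpy as np
from scipy.stats import norm

x_obs = 1.2                               # one observed data point (illustrative)
y = np.linspace(-6, 6, 2001)              # grid over the unknown y
dy = y[1] - y[0]

prior = norm.pdf(y)                       # f(y): N(0, 1)
likelihood = norm.pdf(x_obs, loc=y)       # f(x|y): N(y, 1)
unnorm = likelihood * prior
posterior = unnorm / (unnorm.sum() * dy)  # divide by f(x) = integral f(x|y) f(y) dy

print((y * posterior * dy).sum())         # posterior mean ~ 0.6 = x_obs / 2
```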


Bayesian Inference - formal notation

Motivation is to combine inference from data with prior information.

The probability model determines the likelihood of the data as a function of $\theta$, i.e., $x_1, \ldots, x_n \sim p(\cdot|\theta)$, then $L(\theta) \propto \prod_i p(x_i|\theta)$ (exchangeability).

In the Bayesian point of view, $\theta$ has a probability distribution, $\theta \sim \pi(\theta)$, that reflects our uncertainty about it.

Inference on the unknown, $\theta$, is made conditional on all relevant known information (e.g., the data $x = (x_1, \ldots, x_n)$). Bayes' theorem allows us to condition upon the data to calculate a posterior distribution

$$p(\theta|x) = \frac{p(x|\theta) \, \pi(\theta)}{\int p(x|\theta) \, \pi(\theta) \, d\theta} \propto L(\theta) \, \pi(\theta).$$

$p(x) = \int p(x|\theta) \, \pi(\theta) \, d\theta$ is the normalizing constant that makes $p(\theta|x)$ integrate to 1. It is also the marginal distribution of the data.

All inference about $\theta$ must be based on the posterior distribution, often summarized through point estimates (mean, median, mode) or interval estimates (lower and upper $\alpha/2$ percentiles).
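A standard worked example of this machinery (a common textbook illustration, not taken from this slide) is the Beta-Binomial model: with $x$ successes in $n$ Bernoulli trials and a Beta$(a, b)$ prior on $\theta$, the posterior is Beta$(a + x, b + n - x)$, so the point and interval summaries above are available in closed form.

```python
# Minimal sketch: conjugate Beta-Binomial posterior and its summaries.
from scipy.stats import beta

a, b = 2.0, 2.0                    # Beta(2, 2) prior on theta (illustrative choice)
n, x = 20, 14                      # 14 successes in 20 trials (made-up data)

post = beta(a + x, b + n - x)      # posterior: Beta(a + x, b + n - x)

print("posterior mean:  ", post.mean())          # (a + x) / (a + b + n) = 2/3
print("posterior median:", post.median())
print("95% interval:    ", post.interval(0.95))  # 2.5% and 97.5% quantiles
```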


Differences between frequentist and Bayesian statistics

Bayesian statistics: uncertainty is quantified by determining how prior opinion about parameter values changes in light of the observed data.

Classical view: uncertainty about, for example, parameter estimates is quantified by investigating how such estimates vary in repeated sampling from the same population.

Data sets which might have been observed, but were not, are irrelevant to a Bayesian in making inference. The only relevant data set is the one observed. Bayesians, on the other hand, need to specify their priors.

The Bayesian approach has deep historical roots but required the algorithmic developments of the late 1980s before it became widely practical.

The old sterile Bayesian-frequentist debates are a thing of the past. Most data analysts take a pragmatic point of view and use whatever is most useful.


Difficulties with the Bayesian Approach

It requires the specification of a prior distribution for all unknowns.

A Bayesian analysis is subjective: two people with different priors observe the same data and yet reach different conclusions on θ. Counters:
- When there is concrete prior knowledge, it should be used!
- Use objective priors, for example "noninformative" or vague priors that express ignorance or little knowledge.
- When a large amount of data is available, the prior has little influence on the posterior, unless it is very "peaked".
- "Reality": scientists often disagree due to the different knowledge they have. Bayesian methods provide a way of formally incorporating this information in the decision-making process.

Bayesian methods involve high-dimensional integrals. This is no longer a serious concern after the advent of MCMC methods (time consuming but often worth the effort, as they allow fitting complex models without resorting to large-sample approximations).


Reverend Bayes

The term BAYESIAN derives from Thomas Bayes, a British mathematician and a Presbyterian minister (ca. 1702–1761) who lived in Tunbridge Wells (Kent).

His sole probability paper, "Essay Towards Solving a Problem in the Doctrine of Chances", published posthumously in 1763 by Richard Price, contains the seeds of Bayes' Theorem.

SUMMARY: Bayes theorem applied to statistical models

Bayesian inference is based on the following premise/axiom: "Uncertainties about all unknown quantities are expressed by a joint probability distribution (prior/posterior distributions). Statistical inference about an unknown, $\theta$, is made conditional on all relevant known information (e.g., data)."

Motivation: combine inference from data with prior information.

Probability model:

$$x|\theta \sim p(x|\theta)$$

$\theta$ unknown - model parameters, missing data, events we did not observe directly or exactly (latent variables).

In the Bayesian point of view, $\theta$ has a probability distribution

$$\theta \sim \pi(\theta)$$

that reflects our uncertainty about it.


The data, $x$, is known, so we should condition on it. Bayes' theorem allows us to calculate a posterior distribution

$$p(\theta|x) = \frac{p(x|\theta) \, \pi(\theta)}{\int p(x|\theta) \, \pi(\theta) \, d\theta} \propto p(x|\theta) \, \pi(\theta)$$

as the conditional distribution of the unobserved variables (e.g., $\theta$) given the observed ones (e.g., the data $x$).

- $\pi(\theta)$ is our uncertainty about $\theta$ before seeing the data.
- $p(\theta|x)$ is our uncertainty about $\theta$ after seeing the data.
- The quantity $p(x) = \int p(x|\theta) \, \pi(\theta) \, d\theta$ is the normalizing constant that makes $p(\theta|x)$ integrate to 1. It is also called the marginal distribution of the data.

All inference about $\theta$ must be based on the posterior distribution, often summarised through:
- point estimates (mean, median, mode)
- interval estimates (regions of highest posterior density, or simply lower and upper $\alpha/2$ quantiles)
- hypothesis testing (often through Bayes factors)
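On the hypothesis-testing bullet: a Bayes factor compares two models through their marginal likelihoods, $BF_{12} = p(x \mid M_1) / p(x \mid M_2)$. The sketch below is my own illustration for binomial data under two different Beta priors, where $p(x)$ has a closed form involving Beta functions.

```python
# Minimal sketch: Bayes factor for binomial data under two Beta priors, using the
# closed-form marginal likelihood p(x) = C(n, x) B(a + x, b + n - x) / B(a, b).
import numpy as np
from scipy.special import betaln, gammaln

def log_marginal(x, n, a, b):
    log_binom = gammaln(n + 1) - gammaln(x + 1) - gammaln(n - x + 1)
    return log_binom + betaln(a + x, b + n - x) - betaln(a, b)

n, x = 20, 18                                  # made-up data: 18 successes in 20 trials
log_m1 = log_marginal(x, n, 1.0, 1.0)          # M1: uniform Beta(1, 1) prior
log_m2 = log_marginal(x, n, 20.0, 20.0)        # M2: prior concentrated near 0.5

print(np.exp(log_m1 - log_m2))                 # BF_12 ~ 36, favouring M1 here
```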


What we will learn

Posterior ∝ Likelihood × Prior

How do I quantify my prior information?
- conjugate choices
- diffuse choices, which assign probability more or less evenly over large regions of the parameter space

How do I assess the effect of my prior beliefs?
- sensitivity analyses across alternative specifications can reveal stability (or not) with respect to the prior model

How do I do integrals? For most problems p(x) does not have a closed form.
- conjugate choices
- Markov chain Monte Carlo methods (a minimal sketch follows below)
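As a preview of the MCMC bullet, here is a minimal random-walk Metropolis sketch (my own illustration, not the course's implementation). It samples from a posterior known only up to its normalizing constant, using the Beta-Binomial setup from earlier so the result can be checked against the exact posterior mean.

```python
# Minimal sketch: random-walk Metropolis for a posterior known up to a constant.
# Target: p(theta | x) proportional to theta^(a+x-1) * (1-theta)^(b+n-x-1).
import numpy as np

rng = np.random.default_rng(0)
n, x = 20, 14                        # made-up data
a, b = 2.0, 2.0                      # Beta(2, 2) prior (illustrative)

def log_post(theta):
    if not 0.0 < theta < 1.0:
        return -np.inf
    return (a + x - 1) * np.log(theta) + (b + n - x - 1) * np.log(1 - theta)

theta, draws = 0.5, []
for _ in range(20000):
    prop = theta + 0.1 * rng.standard_normal()                 # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop                                           # accept; else keep theta
    draws.append(theta)

print(np.mean(draws[2000:]))         # ~0.667, the exact posterior mean 16/24
```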
