Probability Theory

Luc Demortier, The Rockefeller University

INFN School of Statistics, Vietri sul Mare, June 3-7, 2013


Page 2: Probability Theory 2013

Goals of this Lecture

This lecture is a brief overview of the results and techniques of probability theory that are most relevant for statistical inference as practiced in high energy physics today.

There will be lots of dry definitions and a few exciting theorems.

Results will be stated without proof, but with attention to their conditions of validity.

I will also attempt to describe various contexts in which each of these results is typically applied.

2 / 61

Page 3: Useful References

Useful References

1. Alan F. Karr, "Probability," Springer-Verlag New York, Inc., 1993, 282pp. A lucid, graduate-level introduction to the subject.

2. George Casella and Roger L. Berger, "Statistical Inference," 2nd ed., Duxbury, 2002, 660pp. Covers probability as an introduction to statistical inference, with good examples and clear explanations.

3. David Pollard, "A User's Guide to Measure Theoretic Probability," Cambridge University Press, 2002, 351pp. A modern, abstract treatment.

4. Bruno de Finetti, "Theory of Probability: A Critical Introductory Treatment," translated by Antonio Machi and Adrian Smith, John Wiley & Sons Ltd., 1974, in two volumes (300pp. and 375pp.). "One of the great books of the world"; "This book is about life: about a way of thinking that embraces all human activities" (D. V. Lindley in the Foreword).

3 / 61

Page 4: Useful References

Useful References

Most HEP introductions to statistics contain material about probability theory; see for example:

5. Glen Cowan, "Statistical Data Analysis," Clarendon Press, Oxford, 1998, 197pp.

6. Frederick James, "Statistical Methods in Experimental Physics," 2nd ed., World Scientific, 2006, 345pp.

And then there is always Wikipedia of course...

4 / 61

Page 5: Outline

Outline

1. Probability
2. Random variables
3. Conditional probability
4. Classical limit theorems

5 / 61

Page 6: What Is Probability?

What Is Probability?

For the purposes of theory, it doesn't really matter what probability "is", or whether it even exists in the real world. All we need is a few definitions and axioms:

1. The set S of all possible outcomes of a particular experiment is called the sample space for the experiment.

2. An event is any collection of possible outcomes of an experiment, that is, any subset of S (including S itself).

To each event A in the sample space we would like to associate a number between zero and one, called the probability of A, or P(A). For technical reasons one cannot simply define the domain of P as "all subsets of the sample space S". Care is required...

3. A collection of subsets of S is called a sigma algebra, denoted by $\mathcal{B}$, if it has the following properties:

   - $\emptyset \in \mathcal{B}$ (the empty set is an element of $\mathcal{B}$).
   - If $A \in \mathcal{B}$, then $A^c \in \mathcal{B}$ ($\mathcal{B}$ is closed under complementation).
   - If $A_1, A_2, \ldots \in \mathcal{B}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{B}$ ($\mathcal{B}$ is closed under countable unions).

6 / 61

Page 7: What Is Probability?

What Is Probability?

4. Given a sample space S and an associated sigma algebra $\mathcal{B}$, a probability function is a function P with domain $\mathcal{B}$ that satisfies:

   - $P(A) \geq 0$ for all $A \in \mathcal{B}$.
   - $P(S) = 1$.
   - If $A_1, A_2, \ldots \in \mathcal{B}$ are pairwise disjoint, then $P\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.

   This last property is known as the axiom of countable additivity. Some statisticians find it more plausible to work with finite additivity: if $A \in \mathcal{B}$ and $B \in \mathcal{B}$ are disjoint, then $P(A \cup B) = P(A) + P(B)$. Countable additivity implies finite additivity, but accepting only the latter makes statistical theory more complicated.

These three properties are usually referred to as the axioms of probability, or the Kolmogorov axioms. They define probability but do not specify how it should be interpreted or chosen.

7 / 61

Page 8: What Is Probability?

What Is Probability?

There are two main philosophies for the interpretation of probability:

1. Frequentism
If the same experiment is performed a number of times, different outcomes may occur, each with its own relative rate or frequency. This "frequency of occurrence" of an outcome can be thought of as a probability. In the frequentist school of statistics, the only valid interpretation of probability is as the long-run frequency of an event. Thus, measurements and observations have probabilities insofar as they are repeatable, but constants of nature do not. The Big Bang does not have a probability.

2. Bayesianism
Here probability is equated with uncertainty. Since uncertainty is always someone's uncertainty about something, probability is a property of someone's relationship to an event, not an objective property of that event itself. Probability is someone's informed degree of belief about something. Measurements, constants of nature and other parameters can all be assigned probabilities in this paradigm.

Frequentism has a more "objective" flavor to it than Bayesianism, and is the main paradigm used in HEP. On the other hand, astrophysics deals with unique cosmic events and tends to use the Bayesian methodology.

8 / 61

Page 9: What Is Probability?

What Is Probability?

From Bruno de Finetti (p. x): "My thesis, paradoxically, and a little provocatively, but nonetheless genuinely, is simply this:

PROBABILITY DOES NOT EXIST.

The abandonment of superstitious beliefs about the existence of Phlogiston, the Cosmic Ether, Absolute Space and Time, ..., or Fairies and Witches, was an essential step along the road to scientific thinking. Probability, too, if regarded as something endowed with some kind of objective existence, is no less a misleading misconception, an illusory attempt to exteriorize or materialize our true probabilistic beliefs."

"In investigating the reasonableness of our own modes of thought and behaviour under uncertainty, all we require, and all that we are reasonably entitled to, is consistency among these beliefs, and their reasonable relation to any kind of relevant objective data ('relevant' in as much as subjectively deemed to be so). This is Probability Theory. In its mathematical formulation we have the Calculus of Probability, with all its important off-shoots and related theories like Statistics, Decision Theory, Games Theory, Operations Research and so on."

9 / 61

Page 10: Random Variables

Random Variables

1. A random variable X is a mapping from the sample space S into the real numbers. Given a probability function on S, it is straightforward to define a probability function on the range of X. For any set A in that range:

$$P_X(X \in A) = P(\{s \in S : X(s) \in A\}).$$

The induced probability $P_X$ satisfies the probability axioms.

2. A random variable N is discrete if there exists a countable set C such that $P_N(C) = 1$.

3. A random variable X is absolutely continuous if there exists a positive function $f_X$ on $\mathbb{R}$, the probability density function of X, such that for every interval $(a, b]$,

$$P((a, b]) = \int_a^b f_X(t)\, dt.$$

4. With every random variable X is associated a cumulative distribution function:

$$F_X(x) \equiv P_X(X \leq x).$$

10 / 61

Page 11: Cumulative Distribution Functions

Cumulative Distribution Functions

Example: In a coin-tossing experiment, let p be the probability of heads, and define the random variable X to be the number of tosses required to get a head. This yields a geometric distribution: $P(X = x) = (1-p)^{x-1} p$, and

$$F_X(x) = P(X \leq x) = \sum_{i=1}^{x} P(X = i) = \sum_{i=1}^{x} (1-p)^{i-1} p = 1 - (1-p)^x.$$

[Figure: plot of the geometric cdf for p = 0.3.]

Note the three properties of a cdf:
1. $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$.
2. $F(x)$ is a non-decreasing function of x.
3. $F(x)$ is right-continuous: $\lim_{x \downarrow x_0} F(x) = F(x_0)$ for every $x_0$.

11 / 61

Page 12: Quantiles

Quantiles

The inverse of a cumulative distribution function F is called the quantile function:

$$F^{-1}(\alpha) \equiv \inf\{x : F(x) \geq \alpha\}, \quad \text{for } 0 < \alpha < 1.$$

In words, the quantile of order $\alpha$ of a continuous cdf F, or its $\alpha$-quantile, is the value of x to the left of which lies a fraction $\alpha$ of the total probability under F. Instead of an $\alpha$-quantile, one sometimes speaks of a $100\alpha$-percentile.

Some related concepts include:

- The median of a distribution, which is its 50th percentile.
- The lower quartile (25th percentile) and the upper quartile (75th percentile).
- A measure of dispersion that is sometimes used is the interquartile range, the distance between the lower and upper quartiles.
- A random variable with distribution function F can be constructed by applying $F^{-1}$ to a random variable uniformly distributed on [0, 1], a process known as the quantile transformation.
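As a quick illustration of the quantile transformation mentioned in the last bullet, here is a short sketch (my own, not from the slides) that generates exponential random variables by applying the inverse cdf to uniform draws:

```python
import numpy as np

rng = np.random.default_rng(42)

lam = 2.0                      # rate parameter of the exponential distribution
u = rng.uniform(size=100_000)  # U ~ Uniform(0, 1)

# Inverse of F(x) = 1 - exp(-lam*x) is F^{-1}(u) = -ln(1 - u)/lam.
x = -np.log(1.0 - u) / lam     # X = F^{-1}(U) ~ Exponential(lam)

print(x.mean())   # should be close to 1/lam = 0.5
print(np.median(x), np.log(2)/lam)   # sample median vs. theoretical ln(2)/lam
```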

12 / 61

Page 13: Probability Densities and Probability Mass Functions

Probability Densities and Probability Mass Functions

We have already seen that for a continuous random variable one can write probabilities as integrals of a probability density function (pdf):

$$P((a, b]) = \int_a^b f_X(t)\, dt.$$

The cdf is a special case of this equation:

$$F_X(x) = P(X \leq x) = \int_{-\infty}^{x} f_X(t)\, dt.$$

The equivalent equation for a discrete random variable involves a probability mass function (pmf) and a summation instead of an integration:

$$F_N(n) = P(N \leq n) = \sum_{i=0}^{n} f_N(i),$$

where $f_N(i) \equiv P(N = i)$.
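As a small numerical check of the discrete relation above, here is a sketch (assumptions: SciPy's standard `poisson` distribution object) that compares the summed pmf with the cdf:

```python
from scipy.stats import poisson

mu = 3.5   # Poisson mean
n = 5

# cdf as a sum of pmf values, F_N(n) = sum_{i=0}^{n} f_N(i)
cdf_from_sum = sum(poisson.pmf(i, mu) for i in range(n + 1))
cdf_direct = poisson.cdf(n, mu)

print(cdf_from_sum, cdf_direct)   # the two agree to floating-point precision
```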

13 / 61

Page 14: Random Vectors

Random Vectors

Random d-vectors generalize random variables: they are mappings from the sample space into $\mathbb{R}^d$. Distributional concepts describing random vectors include the following:

1. The distribution of $X = (X_1, \ldots, X_d)$ is the probability $P_X(B) = P(X \in B)$ on $\mathbb{R}^d$.

2. The joint distribution function of $X_1, \ldots, X_d$ is the function $F_X : \mathbb{R}^d \to [0, 1]$ given by:

$$F_X(x_1, \ldots, x_d) = P(X_1 \leq x_1, \ldots, X_d \leq x_d).$$

3. Let X be a random d-vector. Then for each i the distribution function of component i can be recovered as follows:

$$F_{X_i}(t) = \lim_{t_j \to \infty,\, j \neq i} F_X(t_1, \ldots, t_{i-1}, t, t_{i+1}, \ldots, t_d).$$

4. A random vector X is discrete if there is a countable subset C of $\mathbb{R}^d$ such that $P(X \in C) = 1$, and absolutely continuous if there is a function $f_X : \mathbb{R}^d \to \mathbb{R}^+$, the joint density of $X_1, \ldots, X_d$, such that:

$$P(X_1 \leq t_1, \ldots, X_d \leq t_d) = \int_{-\infty}^{t_1} \cdots \int_{-\infty}^{t_d} f_X(y_1, \ldots, y_d)\, dy_1 \cdots dy_d.$$

14 / 61

Page 15: Random Vectors

Random Vectors

5. If $X = (X_1, \ldots, X_d)$ is absolutely continuous, then for each i, $X_i$ is absolutely continuous, and

$$f_{X_i}(x) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f_X(y_1, \ldots, y_{i-1}, x, y_{i+1}, \ldots, y_d)\, dy_1 \cdots dy_{i-1}\, dy_{i+1} \cdots dy_d.$$

Integrating out the variables of the joint density, other than that for $X_i$, yields the marginal density function of $X_i$. The procedure can be generalized to any subvector of X.
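To make marginalization concrete, here is a small sketch (my own construction, assuming a simple discretized joint density on a grid) that recovers a marginal by numerically integrating out the other variable:

```python
import numpy as np

# Discretize a bivariate standard normal with correlation rho on a grid.
rho = 0.6
x = np.linspace(-5, 5, 501)
y = np.linspace(-5, 5, 501)
X, Y = np.meshgrid(x, y, indexing="ij")
norm = 1.0 / (2 * np.pi * np.sqrt(1 - rho**2))
joint = norm * np.exp(-(X**2 - 2*rho*X*Y + Y**2) / (2 * (1 - rho**2)))

# Marginal of X: integrate the joint density over y (trapezoidal rule).
marginal_x = np.trapz(joint, y, axis=1)

# Compare with the exact standard normal density.
exact = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
print(np.max(np.abs(marginal_x - exact)))   # small numerical error
```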

15 / 61

Page 16: Expectation, Covariance, Correlation, and Independence

Expectation, Covariance, Correlation, and Independence

Let X and Y be two continuous random variables with joint pdf $f_{XY}(x, y)$ and marginals $f_X(x)$ and $f_Y(y)$, respectively. We define:

1. The expectation of X: $\mu_X = E(X) \equiv \int x\, f_X(x)\, dx$.

2. The variance of X: $\sigma_X^2 = \mathrm{Var}(X) \equiv E[(X - E(X))^2] = \int (x - \mu_X)^2\, f_X(x)\, dx$.

3. The covariance of X and Y: $\mathrm{Cov}(X, Y) \equiv E[(X - \mu_X)(Y - \mu_Y)] = \int (x - \mu_X)(y - \mu_Y)\, f_{XY}(x, y)\, dx\, dy$.

4. The correlation of X and Y: $\rho_{XY} \equiv \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$. This is a number between $-1$ and $+1$.

In addition, we say that X and Y are independent random variables if, for every x and y,

$$f_{XY}(x, y) = f_X(x)\, f_Y(y).$$

Note that if X and Y are independent random variables, then $\mathrm{Cov}(X, Y) = 0$ and $\rho_{XY} = 0$. However, the converse is not true: it is possible to find uncorrelated, dependent random variables. This is because covariance and correlation only measure a particular kind of linear relationship.

The above definitions can be adapted to discrete random variables by replacing the integrals with sums.

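The classic counterexample for "uncorrelated does not imply independent" is $Y = X^2$ with X symmetric about zero. A quick sketch (my own) that demonstrates this numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x**2   # Y is a deterministic function of X, hence maximally dependent

# Covariance is E[X^3] - E[X]E[X^2] = 0 for a symmetric X.
cov = np.cov(x, y)[0, 1]
print(cov)   # close to 0: uncorrelated, yet clearly not independent
```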
16 / 61

Page 17: Transformation Theory

Transformation Theory

If X is a random variable, then any function of X, say g(X), is also a random variable. Often g(X) itself is of interest and we write Y = g(X). The probability distribution of Y is then given by

$$P(Y \in A) = P(g(X) \in A) = P(\{x : g(x) \in A\}).$$

This is all that is needed in practice. For example, if X and Y are discrete random variables, the formula can be applied directly to the probability mass functions:

$$f_Y(y) = \sum_{x \in g^{-1}(y)} f_X(x),$$

keeping in mind that $g^{-1}(y)$ is a set that may contain more than one point.
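A minimal sketch of the discrete case (names are my own): summing the pmf over all preimages of each value of $g$.

```python
from collections import defaultdict

def transform_pmf(pmf, g):
    """Push a discrete pmf {x: P(X=x)} through y = g(x),
    summing probabilities over all x with g(x) = y."""
    out = defaultdict(float)
    for x, p in pmf.items():
        out[g(x)] += p
    return dict(out)

# Example: X uniform on {-2, -1, 0, 1, 2}, Y = X^2.
pmf_x = {x: 0.2 for x in (-2, -1, 0, 1, 2)}
print(transform_pmf(pmf_x, lambda x: x**2))
# {4: 0.4, 1: 0.4, 0: 0.2} -- the preimages {-2, 2} and {-1, 1} are merged
```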

For the case where X and Y are continuous, an example will illustrate the issues. Suppose $g(X) = X^2$. We can write:

$$F_Y(y) = P(Y \leq y) = P(X^2 \leq y) = P(-\sqrt{y} \leq X \leq \sqrt{y}) = P(X \leq \sqrt{y}) - P(X \leq -\sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}).$$

The pdf of Y can now be obtained from the cdf by differentiation:

$$f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{1}{2\sqrt{y}} f_X(\sqrt{y}) + \frac{1}{2\sqrt{y}} f_X(-\sqrt{y}).$$

17 / 61

Page 18: Transformation Theory

Transformation Theory

This example can be generalized as follows. Suppose the sample space of X can be partitioned into k sets, such that g(x) is a one-to-one transformation from each set onto the sample space of Y. Then:

$$f_Y(y) = \sum_{i=1}^{k} f_X\!\left(g_i^{-1}(y)\right) \left| \frac{d}{dy} g_i^{-1}(y) \right|,$$

where $g_i$ is the restriction of g to the i-th set.
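A quick numerical check of this change-of-variables formula for $Y = X^2$ with X standard normal (my own sketch, using SciPy's `norm` object):

```python
import numpy as np
from scipy.stats import norm, chi2

y = np.linspace(0.05, 5, 100)

# Piecewise formula: two branches g_1^{-1}(y) = sqrt(y), g_2^{-1}(y) = -sqrt(y),
# each with Jacobian |d/dy g_i^{-1}(y)| = 1 / (2*sqrt(y)).
f_y = (norm.pdf(np.sqrt(y)) + norm.pdf(-np.sqrt(y))) / (2 * np.sqrt(y))

# Y = X^2 with X ~ N(0,1) is chi-squared with 1 degree of freedom.
print(np.max(np.abs(f_y - chi2.pdf(y, df=1))))   # essentially zero
```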

18 / 61

Page 19: Characterization of Probability Distributions

Characterization of Probability Distributions

There are several ways to characterize the probability distribution of a random variable X.

1. Functional characterizations:
   - The probability density function (pdf) for a continuous random variable, or the probability mass function (pmf) for a discrete random variable.
   - The cumulative distribution function (cdf).
   - The characteristic function, $\varphi_X(t) = E(e^{itX})$ (a Fourier transform).

2. Measures of location:
   - The mean: the expectation value of X, E(X).
   - The median: the point at which the cdf reaches 50%.
   - The mode: the location of the maximum of the pdf or pmf, when unique.

3. Measures of dispersion:
   - The variance: the expectation of $(X - \mu)^2$, where $\mu$ is the mean.
   - The standard deviation, the square root of the variance.

4. Measures of shape:
   - The skewness, defined by $\mu_3 / \mu_2^{3/2}$, where $\mu_n \equiv E[(X - E(X))^n]$ is the n-th central moment.
   - The excess kurtosis, $\mu_4 / \mu_2^2 - 3$.

19 / 61

Page 20: Examples of Probability Distributions

Examples of Probability Distributions

In HEP we encounter the following distributions (among others!):

1. Binomial;
2. Multinomial;
3. Poisson;
4. Uniform;
5. Exponential;
6. Gaussian;
7. Log-Normal;
8. Gamma;
9. Chi-squared;
10. Cauchy (Breit-Wigner);
11. Student's t;
12. Fisher-Snedecor.

20 / 61

Page 21: The Binomial Distribution

The Binomial Distribution

Parameters: number of trials n, success probability p
Support: $k \in \{0, \ldots, n\}$
pmf: $\binom{n}{k} p^k (1-p)^{n-k}$
cdf: $I_{1-p}(n-k, 1+k)$
Mean: $np$
Median: $\lfloor np \rfloor$ or $\lceil np \rceil$
Mode: $\lfloor (n+1)p \rfloor$ or $\lfloor (n+1)p \rfloor - 1$
Variance: $np(1-p)$
Skewness: $\dfrac{1-2p}{\sqrt{np(1-p)}}$
Excess kurtosis: $\dfrac{1 - 6p(1-p)}{np(1-p)}$

(http://en.wikipedia.org/wiki/Binomial_distribution)

21 / 61

Page 22: The Binomial Distribution

The Binomial Distribution

The binomial distribution comes up mostly when calculating the efficiency of an event selection. With the total number of events fixed, one counts the number of events that passed the selection and draws inferences about the "probability of success".

Another example is a study of forward-backward asymmetry. One collects N events from a given process and looks for an asymmetry between the numbers F and B of events in the forward and backward hemispheres, respectively. The distribution of F is binomial:

$$\binom{N}{F} p^F (1-p)^B,$$

with mean $pN$ and standard deviation $\sqrt{Np(1-p)} \approx \sqrt{F(1-p)}$. The forward-backward asymmetry is usually defined as

$$R \equiv \frac{F - B}{F + B} = \frac{2F}{N} - 1$$

and has variance

$$\mathrm{Var}(R) = \frac{4p(1-p)}{N} \approx \frac{4FB}{N^3}.$$

22 / 61

Page 23: The Multinomial Distribution

The Multinomial Distribution

Parameters: number of trials n, event probabilities $p_1, \ldots, p_k$ with $\sum_{i=1}^{k} p_i = 1$
Support: $x_i \in \{0, \ldots, n\}$, $\sum_{i=1}^{k} x_i = n$
pmf: $\dfrac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}$
Mean: $E(X_i) = np_i$
Variance: $\mathrm{Var}(X_i) = np_i(1-p_i)$
Covariance: $\mathrm{Cov}(X_i, X_j) = -np_i p_j$ ($i \neq j$)

This is the distribution of the contents of the bins of a histogram, when the total number of events n is fixed.

(http://en.wikipedia.org/wiki/Multinomial_distribution)

23 / 61

Page 24: The Poisson Distribution

The Poisson Distribution

Parameters: $\lambda > 0$
Support: $k \in \{0, 1, 2, 3, \ldots\}$
pmf: $\dfrac{\lambda^k}{k!}\, e^{-\lambda}$
cdf: $\dfrac{\Gamma(\lfloor k+1 \rfloor, \lambda)}{\lfloor k \rfloor!}$
Mean: $\lambda$
Median: $\approx \lfloor \lambda + \tfrac{1}{3} - \tfrac{0.02}{\lambda} \rfloor$
Mode: $\lfloor \lambda \rfloor$, $\lceil \lambda \rceil - 1$
Variance: $\lambda$
Skewness: $\lambda^{-1/2}$
Excess kurtosis: $\lambda^{-1}$

(http://en.wikipedia.org/wiki/Poisson_distribution)

24 / 61

Page 25: The Poisson Distribution

The Poisson distribution is ubiquitous in HEP. It can be derived from the so-called Poisson postulates:

For each $t \geq 0$, let $N_t$ be an integer-valued random variable with the following properties (think of $N_t$ as the number of arrivals from time 0 to time t):

1. Start with no arrivals: $N_0 = 0$.
2. Arrivals in disjoint time periods are independent: $s < t \Rightarrow N_s$ and $N_t - N_s$ are independent.
3. The number of arrivals depends only on the period length: $N_s$ and $N_{t+s} - N_t$ are identically distributed.
4. The arrival probability is proportional to the period length, if the length is small: $\lim_{t \to 0} \dfrac{P(N_t = 1)}{t} = \lambda$.
5. No simultaneous arrivals: $\lim_{t \to 0} \dfrac{P(N_t > 1)}{t} = 0$.

Then, for any integer n:

$$P(N_t = n) = e^{-\lambda t}\, \frac{(\lambda t)^n}{n!},$$

that is, $N_t \sim \mathrm{Poisson}(\lambda t)$.

Example: the number of particles emitted in a fixed time interval t from a radioactive source.
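A small simulation sketch (my own) of the postulates: exponential waiting times between arrivals produce Poisson-distributed counts in a fixed window.

```python
import numpy as np

rng = np.random.default_rng(7)

lam, t_max, n_toys = 4.0, 1.0, 100_000
counts = np.empty(n_toys, dtype=int)

for i in range(n_toys):
    # Arrival times: cumulative sum of exponential inter-arrival times.
    gaps = rng.exponential(1.0 / lam, size=50)
    arrivals = np.cumsum(gaps)
    counts[i] = np.sum(arrivals <= t_max)   # number of arrivals in [0, t_max]

print(counts.mean(), counts.var())   # both close to lam * t_max = 4.0
```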

Example: The number of particles emitted in a fixed time interval t from aradioactive source.

25 / 61

Page 26: The Uniform Distribution

The Uniform Distribution

Parameters: $a, b \in \mathbb{R}$
Support: $x \in [a, b]$
pdf: $\dfrac{1}{b-a}$ for $x \in [a, b]$, 0 otherwise
cdf: 0 for $x < a$, $\dfrac{x-a}{b-a}$ for $a \leq x < b$, 1 for $x \geq b$
Mean: $\tfrac{1}{2}(a + b)$
Median: $\tfrac{1}{2}(a + b)$
Mode: any value in $[a, b]$
Variance: $\tfrac{1}{12}(b - a)^2$
Skewness: 0
Excess kurtosis: $-\tfrac{6}{5}$

(http://en.wikipedia.org/wiki/Uniform_distribution_(continuous))

26 / 61

Page 27: The Exponential Distribution

The Exponential Distribution

Parameters: $\lambda > 0$
Support: $x \geq 0$
pdf: $\lambda e^{-\lambda x}$
cdf: $1 - e^{-\lambda x}$
Mean: $\dfrac{1}{\lambda}$
Median: $\dfrac{\ln 2}{\lambda}$
Mode: 0
Variance: $\dfrac{1}{\lambda^2}$
Skewness: 2
Excess kurtosis: 6

(http://en.wikipedia.org/wiki/Exponential_distribution)

27 / 61

Page 28: The Exponential Distribution

The Exponential Distribution

In a Poisson process with a mean of $\lambda$ events per unit time,

$$P(N) = e^{-\lambda t}\, \frac{(\lambda t)^N}{N!},$$

the probability of no events in time t is the exponential, $e^{-\lambda t}$.

For the time interval Z between two successive Poisson events, one therefore has:

$$P(Z > t) = e^{-\lambda t}.$$

28 / 61
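To tie the two previous slides together, here is a short sketch (my own) showing that the gaps between consecutive Poisson arrivals are indeed exponentially distributed:

```python
import numpy as np

rng = np.random.default_rng(3)

lam = 2.0
# Simulate many arrival times by thinning a fine time grid.
dt = 1e-3
n_steps = 2_000_000
hits = rng.random(n_steps) < lam * dt        # arrival in each small bin
arrival_times = np.nonzero(hits)[0] * dt
gaps = np.diff(arrival_times)                # waiting times between events

print(gaps.mean(), 1 / lam)                  # mean gap ~ 1/lambda
print(np.mean(gaps > 1.0), np.exp(-lam))     # P(Z > 1) ~ exp(-lambda)
```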

Page 29: Probability Theory 2013

The Gaussian (or Normal) Distribution

Parameters: mean $\mu$, variance $\sigma^2 > 0$
Support: $x \in \mathbb{R}$
pdf: $\dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left\{-\dfrac{(x-\mu)^2}{2\sigma^2}\right\}$
cdf: $\dfrac{1}{2}\left[1 + \mathrm{erf}\!\left(\dfrac{x-\mu}{\sqrt{2}\,\sigma}\right)\right]$
Mean: $\mu$
Median: $\mu$
Mode: $\mu$
Variance: $\sigma^2$
Skewness: 0
Excess kurtosis: 0

(http://en.wikipedia.org/wiki/Normal_distribution)

29 / 61

Page 30: The Gaussian (or Normal) Distribution

The Gaussian (or Normal) Distribution

This is the most important distribution in all of statistics, in large part due to the limit theorems (see later).

The probability content of some commonly used intervals:

$$P\!\left(-1.00 \leq \frac{X-\mu}{\sigma} \leq 1.00\right) = 0.68 \qquad P\!\left(-1.64 \leq \frac{X-\mu}{\sigma} \leq 1.64\right) = 0.90$$

$$P\!\left(-1.96 \leq \frac{X-\mu}{\sigma} \leq 1.96\right) = 0.95 \qquad P\!\left(-2.58 \leq \frac{X-\mu}{\sigma} \leq 2.58\right) = 0.99$$

$$P\!\left(-3.29 \leq \frac{X-\mu}{\sigma} \leq 3.29\right) = 0.999$$

The bivariate Normal distribution has pdf

$$f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1-\rho^2}} \exp\!\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_X)^2}{\sigma_X^2} - \frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X \sigma_Y} + \frac{(y-\mu_Y)^2}{\sigma_Y^2}\right]\right\}.$$

30 / 61

Page 31: The Log-Normal Distribution

The Log-Normal Distribution

Parameters: log-scale $\mu$, shape $\sigma^2 > 0$
Support: $x > 0$
pdf: $\dfrac{1}{x\sqrt{2\pi}\,\sigma} \exp\!\left\{-\dfrac{(\ln x - \mu)^2}{2\sigma^2}\right\}$
cdf: $\dfrac{1}{2}\left[1 + \mathrm{erf}\!\left(\dfrac{\ln x - \mu}{\sqrt{2}\,\sigma}\right)\right]$
Mean: $e^{\mu + \sigma^2/2}$
Median: $e^{\mu}$
Mode: $e^{\mu - \sigma^2}$
Variance: $(e^{\sigma^2} - 1)\, e^{2\mu + \sigma^2}$
Skewness: $(e^{\sigma^2} + 2)\sqrt{e^{\sigma^2} - 1}$
Excess kurtosis: $e^{4\sigma^2} + 2e^{3\sigma^2} + 3e^{2\sigma^2} - 6$

The logarithm of a Lognormal random variable follows a Normal distribution.

(http://en.wikipedia.org/wiki/Log-normal_distribution)

31 / 61

Page 32: The Gamma Distribution

Parameters: shape $\alpha > 0$, rate $\beta > 0$; or shape $k \equiv \alpha$, scale $\theta \equiv 1/\beta$
Support: $x > 0$
pdf: $\dfrac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}$
cdf: $\dfrac{\gamma(\alpha, \beta x)}{\Gamma(\alpha)}$
Mean: $\dfrac{\alpha}{\beta}$
Median: no simple closed form
Mode: $\dfrac{\alpha - 1}{\beta}$ for $\alpha > 1$
Variance: $\dfrac{\alpha}{\beta^2}$
Skewness: $\dfrac{2}{\sqrt{\alpha}}$
Excess kurtosis: $\dfrac{6}{\alpha}$

(http://en.wikipedia.org/wiki/Gamma_distribution)

32 / 61

Page 33: The Gamma Distribution

The Gamma Distribution

The exponential and chi-squared distributions are special cases of the Gamma distribution.

There is a special relationship between the Gamma and Poisson distributions. If X is a Gamma($\alpha$, $\beta$) random variable where $\alpha$ is an integer, then

$$P(X \leq x) = P(Y \geq \alpha),$$

where Y is a Poisson($x\beta$) random variable.
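A quick numerical check of this Gamma-Poisson identity (my own sketch, using standard SciPy distribution objects):

```python
from scipy.stats import gamma, poisson

alpha, beta, x = 4, 2.0, 1.5

lhs = gamma.cdf(x, a=alpha, scale=1.0 / beta)   # P(X <= x), X ~ Gamma(alpha, beta)
rhs = poisson.sf(alpha - 1, mu=x * beta)        # P(Y >= alpha), Y ~ Poisson(x*beta)

print(lhs, rhs)   # identical up to floating-point precision
```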

33 / 61

Page 34: The Chi-Squared Distribution

The Chi-Squared Distribution

Parameters: degrees of freedom $k \in \mathbb{N}$
Support: $x \geq 0$
pdf: $\dfrac{1}{2^{k/2}\, \Gamma(k/2)}\, x^{k/2 - 1}\, e^{-x/2}$
cdf: $\dfrac{1}{\Gamma(k/2)}\, \gamma\!\left(\dfrac{k}{2}, \dfrac{x}{2}\right)$
Mean: $k$
Median: $\approx k\left(1 - \dfrac{2}{9k}\right)^3$
Mode: $\max(k - 2, 0)$
Variance: $2k$
Skewness: $\sqrt{8/k}$
Excess kurtosis: $12/k$

(http://en.wikipedia.org/wiki/Chi-squared_distribution)

34 / 61

Page 35: The Chi-Squared Distribution

The Chi-Squared Distribution

If $X_1, \ldots, X_n$ are independent, standard Normal random variables ($N(0,1)$), then the sum of their squares,

$$X^2_{(n)} \equiv \sum_{i=1}^{n} X_i^2,$$

follows a chi-squared distribution with n degrees of freedom.

At large n, the quantities

$$Z_n = \frac{X^2_{(n)} - n}{\sqrt{2n}}, \qquad Z'_n = \sqrt{2 X^2_{(n)}} - \sqrt{2n - 1},$$

are approximately standard Normal.
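A short sketch (my own) comparing the two large-n normal approximations against the exact chi-squared quantiles:

```python
import numpy as np
from scipy.stats import chi2, norm

n = 50
# Exact 95th percentile of chi-squared with n degrees of freedom.
exact = chi2.ppf(0.95, df=n)

z = norm.ppf(0.95)
approx1 = n + z * np.sqrt(2 * n)                 # from Z_n
approx2 = 0.5 * (z + np.sqrt(2 * n - 1))**2      # from Z'_n (Wilson-Hilferty style)

print(exact, approx1, approx2)   # approx2 is noticeably closer to exact
```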

35 / 61

Page 36: The Cauchy (or Lorentz, or Breit-Wigner) Distribution

The Cauchy (or Lorentz, or Breit-Wigner) Distribution

Parameters: location $x_0$, scale $\gamma > 0$
Support: $x \in \mathbb{R}$
pdf: $\dfrac{1}{\pi\gamma \left[1 + \left(\dfrac{x - x_0}{\gamma}\right)^2\right]}$
cdf: $\dfrac{1}{\pi} \arctan\!\left(\dfrac{x - x_0}{\gamma}\right) + \dfrac{1}{2}$
Mean: undefined
Median: $x_0$
Mode: $x_0$
Variance: undefined
Skewness: undefined
Excess kurtosis: undefined

(http://en.wikipedia.org/wiki/Cauchy_distribution)

36 / 61

Page 37: The Cauchy (or Lorentz, or Breit-Wigner) Distribution

The Cauchy (or Lorentz, or Breit-Wigner) Distribution

Although the Cauchy distribution is rather pathological, since none of its moments exist, it "has a way of turning up when you least expect it" (Casella & Berger).

For example, the ratio of two independent standard normal random variables has a Cauchy distribution.

37 / 61

Page 38: Student's t-Distribution

Student’s t-Distribution

Parameters: degrees of freedom $\nu > 0$
Support: $x \in \mathbb{R}$
pdf: $\dfrac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\, \Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \dfrac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$
cdf: $1 - \dfrac{1}{2} I_{\frac{\nu}{x^2 + \nu}}\!\left(\dfrac{\nu}{2}, \dfrac{1}{2}\right)$, for $x > 0$
Mean: 0 for $\nu > 1$
Median: 0
Mode: 0
Variance: $\dfrac{\nu}{\nu - 2}$ for $\nu > 2$, $\infty$ for $1 < \nu \leq 2$
Skewness: 0 for $\nu > 3$
Excess kurtosis: $\dfrac{6}{\nu - 4}$ for $\nu > 4$, $\infty$ for $2 < \nu \leq 4$

(http://en.wikipedia.org/wiki/Student's_t-distribution)

38 / 61

Page 39: Student's t-Distribution

Student’s t-Distribution

Let $X_1, X_2, \ldots, X_n$ be Normal random variables with mean $\mu$ and standard deviation $\sigma$. Then the quantities

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

are independently distributed, $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ as standard normal, and $(n-1)S^2/\sigma^2$ as chi-squared with $n-1$ degrees of freedom.

To test whether $\bar{X}$ is statistically consistent with $\mu$, when $\sigma$ is unknown, the correct test statistic to use is the ratio

$$t = \frac{(\bar{X} - \mu)/(\sigma/\sqrt{n})}{S/\sigma} = \frac{\sqrt{n}\,(\bar{X} - \mu)}{S},$$

which has a Student's t-distribution with $n-1$ degrees of freedom.

Note that for $\nu = 1$ the t-distribution becomes the Cauchy distribution.
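As a sanity check of the t statistic's distribution, a quick toy-experiment sketch (my own):

```python
import numpy as np
from scipy.stats import t as t_dist, kstest

rng = np.random.default_rng(5)

n, mu, sigma = 8, 10.0, 2.0
n_toys = 100_000

samples = rng.normal(mu, sigma, size=(n_toys, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)            # sample standard deviation
t_values = np.sqrt(n) * (xbar - mu) / s    # Student's t with n-1 dof

# Kolmogorov-Smirnov test against the t-distribution with n-1 dof.
print(kstest(t_values, t_dist(df=n - 1).cdf))   # large p-value: consistent
```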

39 / 61

Page 40: The Fisher-Snedecor (or F-) Distribution

The Fisher-Snedecor (or F-) Distribution

Parameters: degrees of freedom $d_1, d_2 > 0$
Support: $x \geq 0$
pdf: $\dfrac{1}{x\, B\!\left(\frac{d_1}{2}, \frac{d_2}{2}\right)} \sqrt{\dfrac{(d_1 x)^{d_1}\, d_2^{d_2}}{(d_1 x + d_2)^{d_1 + d_2}}}$
cdf: $I_{\frac{d_1 x}{d_1 x + d_2}}\!\left(\dfrac{d_1}{2}, \dfrac{d_2}{2}\right)$
Mean: $\dfrac{d_2}{d_2 - 2}$ for $d_2 > 2$
Mode: $\dfrac{d_1 - 2}{d_1} \cdot \dfrac{d_2}{d_2 + 2}$ for $d_1 > 2$
Variance: $\dfrac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)}$ for $d_2 > 4$
Skewness: $\dfrac{(2d_1 + d_2 - 2)\sqrt{8(d_2 - 4)}}{(d_2 - 6)\sqrt{d_1(d_1 + d_2 - 2)}}$ for $d_2 > 6$

(http://en.wikipedia.org/wiki/F-distribution)

40 / 61

Page 41: The Fisher-Snedecor (or F-) Distribution

The Fisher-Snedecor (or F-) Distribution

Let $X_1, \ldots, X_n$ be a random sample from a $N(\mu_X, \sigma_X^2)$ distribution, and $Y_1, \ldots, Y_m$ a random sample from a $N(\mu_Y, \sigma_Y^2)$ distribution. We may be interested in comparing the variabilities of the X and Y populations. A quantity of interest in this case is the ratio $\sigma_X^2/\sigma_Y^2$, information about which is contained in the ratio of sample variances $S_X^2/S_Y^2$, where

$$S_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2, \qquad S_Y^2 = \frac{1}{m-1} \sum_{i=1}^{m} (Y_i - \bar{Y})^2.$$

The F-distribution with $n-1$ and $m-1$ degrees of freedom provides the distribution of the ratio

$$\frac{S_X^2/S_Y^2}{\sigma_X^2/\sigma_Y^2} = \frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2},$$

which is a ratio of independent scaled chi-squared variates.
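A toy check (my own sketch) that the variance ratio indeed follows an F-distribution:

```python
import numpy as np
from scipy.stats import f as f_dist, kstest

rng = np.random.default_rng(11)

n, m = 10, 15
sigma_x, sigma_y = 2.0, 3.0
n_toys = 50_000

X = rng.normal(0, sigma_x, size=(n_toys, n))
Y = rng.normal(0, sigma_y, size=(n_toys, m))

ratio = (X.var(axis=1, ddof=1) / sigma_x**2) / (Y.var(axis=1, ddof=1) / sigma_y**2)

# Compare against F with (n-1, m-1) degrees of freedom.
print(kstest(ratio, f_dist(dfn=n - 1, dfd=m - 1).cdf))   # consistent
```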

41 / 61

Page 42: Relationships Among Distributions

Relationships Among Distributions

[Figure: chart of relationships among distributions, from Casella & Berger.]

42 / 61

Page 43: Conditional Probability

It is sometimes desirable to revise probabilities to account for the knowledge that an event has occurred. This can be done using the concept of conditional probability:

Let A and B be events. Provided that $P(A) > 0$, the conditional probability of B given A is

$$P(B \mid A) = \frac{P(B \cap A)}{P(A)}.$$

If $P(A) = 0$, one sets $P(B \mid A) = P(B)$ by convention.

What happens in the conditional probability calculation is that the conditioning event becomes the sample space: $P(B \mid B) = 1$. The original sample space S has been updated to B.

It follows from the definition that $P(B \cap A) = P(B \mid A)\, P(A)$, and by symmetry, $P(A \cap B) = P(A \mid B)\, P(B)$. Equating the right-hand sides of both equations yields:

$$P(A \mid B) = P(B \mid A)\, \frac{P(A)}{P(B)},$$

which is known as Bayes' rule.

43 / 61

Page 44: Conditional Probability

Conditional Probability

A more general version of Bayes' rule is obtained by considering a partition $A_1, A_2, \ldots$ of the sample space S, that is, the $A_i$ are pairwise disjoint subsets of S, and $\bigcup_{i=1}^{\infty} A_i = S$.

Let B be any set. By the Law of Total Probability we have

$$P(B) = \sum_{i=1}^{\infty} P(B \mid A_i)\, P(A_i).$$

Substituting this result into the basic version of Bayes' rule shows that, for each $i = 1, 2, \ldots$:

$$P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_{j=1}^{\infty} P(B \mid A_j)\, P(A_j)}.$$

44 / 61

Page 45: Conditional Probability for Random Variables

Conditional Probability for Random Variables

So far we defined conditional probability for subsets of the sample space ("events"). The extension to random vectors is straightforward.

The simplest case is that of discrete random vectors. Let (X, Y) be a discrete bivariate random vector with joint probability mass function (pmf) $f_{X,Y}(x, y)$ and marginal pmfs $f_X(x)$ and $f_Y(y)$. For any y such that $f_Y(y) > 0$, the conditional pmf of X given that Y = y is the function of x defined by

$$f_{X \mid Y}(x \mid y) = P(X = x \mid Y = y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}.$$

This is a properly normalized pmf.

For an absolutely continuous random (n+1)-vector $(X, Y_1, \ldots, Y_n)$ with density f, the conditional density of X given $Y_1, \ldots, Y_n$ is the function of x defined by

$$f_{X \mid Y_1, \ldots, Y_n}(x \mid y_1, \ldots, y_n) = \frac{f(x, y_1, \ldots, y_n)}{\int_{-\infty}^{+\infty} f(z, y_1, \ldots, y_n)\, dz}.$$

The conditional cumulative distribution function of X given $Y_1, \ldots, Y_n$ is obtained by integrating the conditional density:

$$F_{X \mid Y_1, \ldots, Y_n}(t \mid y_1, \ldots, y_n) = \int_{-\infty}^{t} f_{X \mid Y_1, \ldots, Y_n}(x \mid y_1, \ldots, y_n)\, dx.$$

45 / 61

Page 46: Bayes' Theorem and the Predictive Distribution

Bayes’ Theorem and the Predictive Distribution

The Bayesian paradigm makes extensive use of conditional probabilities, starting with Bayes' theorem, which is a reformulation of Bayes' rule for random variables. Suppose we have some data X with a known distribution $p(X \mid \theta)$ depending on some unknown parameter $\theta$. If we have prior information about $\theta$ in the form of a prior distribution $\pi(\theta)$, Bayes' theorem tells us how to update that information to take into account the new data. The result is a posterior distribution for $\theta$:

$$p(\theta \mid X) = \frac{p(X \mid \theta)\, \pi(\theta)}{\int p(X \mid \theta')\, \pi(\theta')\, d\theta'}.$$

The result of a Bayesian analysis is this posterior distribution. Usually one tries to summarize it by providing a mean, or quantiles, or intervals with specific probability content.

The posterior distribution also provides the basis for predicting the values of future observations $X_{\mathrm{new}}$, via the predictive distribution:

$$p(X_{\mathrm{new}} \mid X) = \int p(X_{\mathrm{new}} \mid \theta)\, p(\theta \mid X)\, d\theta.$$
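A minimal sketch (my own) of a Bayesian update on a grid, for the standard Poisson-counting example with a Gamma-shaped prior on the rate, including a Monte Carlo evaluation of the predictive distribution:

```python
import numpy as np
from scipy.stats import poisson, gamma

# Observed count and a (hypothetical) Gamma prior on the Poisson mean theta.
n_obs = 7
prior = gamma(a=2.0, scale=2.0)          # pi(theta)

# Posterior on a grid: p(theta | n) ∝ p(n | theta) * pi(theta).
theta = np.linspace(0.01, 30, 3000)
unnorm = poisson.pmf(n_obs, theta) * prior.pdf(theta)
posterior = unnorm / np.trapz(unnorm, theta)

# Posterior summaries.
mean = np.trapz(theta * posterior, theta)
print("posterior mean:", mean)

# Predictive distribution for a future count, p(n_new | n) = ∫ p(n_new|θ) p(θ|n) dθ.
n_new = np.arange(0, 25)
predictive = np.array([np.trapz(poisson.pmf(k, theta) * posterior, theta)
                       for k in n_new])
print("predictive probabilities sum to:", predictive.sum())
```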

46 / 61

Page 47: The Borel-Kolmogorov Paradox

The Borel-Kolmogorov Paradox

This paradox illustrates some of the pitfalls attached to conditional probability calculations. Imagine a uniform distribution of points over the surface of the earth. Intuitively it is obvious that the subset of points along the equator will also be uniformly distributed. That is, their conditional distribution, given a latitude of zero, is uniform. However, the equator is only the equator because of our choice of coordinate system. Any other great circle could serve as equator. By invariance one would expect the points to be just as uniformly distributed along meridians, for example.

Unfortunately this cannot be the case. A quarter of a meridian's length lies north of latitude 45°N. Integrating over all meridians would lead to the expectation that the earth's cap above 45°N covers a quarter of the total surface, which is clearly incorrect.

To resolve the paradox, one must realize that conditioning on a set of probability zero is inadmissible. The proper procedure is first to condition on a small but finite set around the value of interest, and then take a limit. However, the limit is not uniquely defined. In the case of the equator, it is a band between two parallels whose width goes to zero, whereas in the case of the meridian it is a lune between two meridians whose opening angle goes to zero...

47 / 61

Page 48: The Borel-Kolmogorov Paradox

The Borel-Kolmogorov Paradox

For a mathematical explanation, we introduce a latitude $\phi$ and a longitude $\lambda$, with $-\pi/2 \leq \phi \leq +\pi/2$ and $-\pi < \lambda \leq +\pi$. The probability density function of a uniform distribution on the unit sphere is given by:

$$f(\phi, \lambda) = \frac{1}{4\pi} \cos\phi.$$

The marginal densities are:

$$f(\phi) = \int_{-\pi}^{+\pi} f(\phi, \lambda)\, d\lambda = \frac{1}{2} \cos\phi, \qquad f(\lambda) = \int_{-\pi/2}^{+\pi/2} f(\phi, \lambda)\, d\phi = \frac{1}{2\pi}.$$

For the conditional densities at zero longitude and zero latitude this gives:

$$f(\phi \mid \lambda = 0) = \frac{f(\phi, \lambda = 0)}{f(\lambda = 0)} = \frac{1}{2} \cos\phi, \qquad f(\lambda \mid \phi = 0) = \frac{f(\phi = 0, \lambda)}{f(\phi = 0)} = \frac{1}{2\pi},$$

showing that the conditional distributions along the Greenwich meridian and along the equator are indeed different.
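A small simulation sketch (my own) that reproduces the two limits by conditioning on thin bands around the equator and around the Greenwich meridian:

```python
import numpy as np

rng = np.random.default_rng(13)

# Uniform points on the unit sphere via longitude ~ U(-pi, pi), sin(latitude) ~ U(-1, 1).
n = 5_000_000
lon = rng.uniform(-np.pi, np.pi, n)
lat = np.arcsin(rng.uniform(-1, 1, n))

eps = 0.01   # half-width of the conditioning band (radians)

# Condition on |latitude| < eps (near the equator): longitudes look uniform.
near_equator = lon[np.abs(lat) < eps]
# Condition on |longitude| < eps (near the meridian): latitudes follow ~cos(phi)/2.
near_meridian = lat[np.abs(lon) < eps]

print(np.histogram(near_equator, bins=6, density=True)[0])   # roughly flat
print(np.histogram(near_meridian, bins=6, density=True)[0])  # peaked at 0, ~cos shape
```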

48 / 61

Page 49: The Borel-Kolmogorov Paradox

The Borel-Kolmogorov Paradox

Let's now change coordinates: keep the latitude $\phi$ but replace the longitude:

$$\lambda \to \mu = \frac{\lambda}{g(\phi)},$$

where $g(\phi)$ is a strictly positive function. In terms of $\phi$ and $\mu$, the density of points on the unit sphere is given by:

$$f(\phi, \mu) = \frac{1}{4\pi} \cos\phi \left| \frac{\partial(\phi, \lambda)}{\partial(\phi, \mu)} \right| = \frac{1}{4\pi} \cos\phi\, g(\phi).$$

For the conditional density of latitudes given $\mu = 0$ we find:

$$f(\phi \mid \mu = 0) \propto \cos\phi\, g(\phi).$$

Observe that $\mu = 0$ is entirely equivalent to $\lambda = 0$; both conditions correspond to the same set of points on the unit sphere. And yet the conditional distributions differ by the factor $g(\phi)$. This is because of the implicit limiting procedure used in conditioning. To condition on $\lambda = 0$ we consider the set $\{\lambda : |\lambda| \leq \varepsilon\}$ and let $\varepsilon$ go to zero. To condition on $\mu = 0$ the set is $\{\lambda : |\lambda| \leq \varepsilon\, g(\phi)\}$.

49 / 61

Page 50: Classical Limit Theorems

Classical Limit Theorems

The limit theorems describe the behavior of certain sample quantities as the sample size approaches infinity.

The notion of an infinite sample size is somewhat fanciful, but it provides an often useful approximation for the finite-sample case, due to the simplification it introduces in various formulae.

First we have to clarify what we mean by a sequence of random variables "converging" as the sample size goes to infinity...

50 / 61

Page 51: Modes of Stochastic Convergence

Modes of Stochastic Convergence

Consider a sequence of random variables $X_1, X_2, \ldots$, not necessarily independent or identically distributed. There are many different ways this sequence can converge to a random variable X. Here we only consider three:

1. Convergence in distribution:
$X_n \xrightarrow{d} X$ if $\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$ at all points x where $F_X(x)$ is continuous.

2. Convergence in probability (or weak convergence):
$X_n \xrightarrow{P} X$ if, for every $\varepsilon > 0$, $\lim_{n \to \infty} P(|X_n - X| < \varepsilon) = 1$.
"For any $\varepsilon > 0$ and $\delta > 0$ there is an integer N such that the individual deviation probabilities $P(|X_n - X| > \varepsilon)$ are less than $\delta$ for all $n > N$."

3. Almost sure convergence (or strong convergence):
$X_n \xrightarrow{a.s.} X$ if, for every $\varepsilon > 0$, $P(\lim_{n \to \infty} |X_n - X| < \varepsilon) = 1$.
"For any $\varepsilon > 0$ and $\delta > 0$ there is an integer N such that the probability for any deviation $|X_n - X|$, $n > N$, to be greater than $\varepsilon$ is less than $\delta$."

Almost sure convergence implies convergence in probability, and convergence in probability implies convergence in distribution.

51 / 61

Page 52: Classical Limit Theorems: The Statements

Classical Limit Theorems: The Statements

Let $X_1, X_2, \ldots$ be i.i.d.; set $\bar{X}_n \equiv \frac{1}{n} \sum_{i=1}^{n} X_i$, $\mu = E(X_1)$, and $\sigma^2 = \mathrm{Var}(X_1)$.

The weak law of large numbers:
If $\mu < \infty$, then $\bar{X}_n \xrightarrow{P} \mu$.

The strong law of large numbers:
If $\mu < \infty$, then $\bar{X}_n \xrightarrow{a.s.} \mu$.

The central limit theorem:
If $\sigma^2 < \infty$, then $\dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1)$.

The law of the iterated logarithm:
If $\sigma^2 < \infty$, then $\limsup_{n \to \infty} \dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \cdot \dfrac{1}{\sqrt{2 \ln\ln n}} = 1$.
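A quick illustration of the central limit theorem for a decidedly non-Gaussian parent distribution (my own sketch):

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(17)

n, n_toys = 200, 100_000
lam = 1.0   # exponential parent: mean 1/lam, variance 1/lam^2

samples = rng.exponential(1.0 / lam, size=(n_toys, n))
xbar = samples.mean(axis=1)

# Standardize the sample means: (Xbar - mu) / (sigma / sqrt(n)).
z = (xbar - 1.0 / lam) / ((1.0 / lam) / np.sqrt(n))

print(kstest(z, norm.cdf))   # approximately standard normal for large n
```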

52 / 61

Page 53: Short Digression on "Lim Sup"

Short Digression on “Lim Sup”

The limit superior and limit inferior of a sequence $X_n$ are defined by

$$\limsup_{n \to \infty} X_n \equiv \lim_{n \to \infty} \left(\sup_{m \geq n} X_m\right) = \inf_{n \geq 0} \sup_{m \geq n} X_m,$$

$$\liminf_{n \to \infty} X_n \equiv \lim_{n \to \infty} \left(\inf_{m \geq n} X_m\right) = \sup_{n \geq 0} \inf_{m \geq n} X_m.$$

The limit superior of $X_n$ is the smallest real number b such that, for any $\varepsilon > 0$, there exists an n such that $X_m < b + \varepsilon$ for all $m > n$. In other words, any number larger than the limit superior is an eventual upper bound for the sequence: only a finite number of elements of the sequence are greater than $b + \varepsilon$.

A sequence converges when its limit inferior and limit superior are equal.

53 / 61

Page 54: Classical Limit Theorems: Applications

Classical Limit Theorems: Applications

1. Consistency
The property described by the Weak Law of Large Numbers, a sequence of random variables converging to a constant, is known as consistency. This is an important property for the study of point estimators in statistics.

2. Frequentist Statistics
As an application of the Weak Law of Large Numbers, let $X_i$ be a Bernoulli random variable, with $X_i = 1$ representing "success" and $X_i = 0$ "failure". Then $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ is the frequency of successes in the first n trials. If p is the probability of success, the Weak Law of Large Numbers states that $\bar{X}_n \xrightarrow{P} p$.

This result is often used by frequentist statisticians to justify their identification of probability with frequency. One must be very careful with this, however. Logically one cannot assume something as a definition and then prove it as a theorem. Furthermore, there is a contradiction between a definition that would assume as certain something that the theorem only states to be very probable.

54 / 61

Page 55: Classical Limit Theorems: Applications

Classical Limit Theorems: Applications

3. Monte Carlo Integration:
Let f be a function on [a, b], with $\int_a^b |f(x)|\, dx < \infty$, and let $U_1, U_2, \ldots$ be independent random variables that are uniformly distributed between a and b. Then:

$$\frac{b-a}{n} \sum_{i=1}^{n} f(U_i) \xrightarrow{a.s.} \int_a^b f(x)\, dx.$$

This follows directly from the Strong Law of Large Numbers. This Monte Carlo technique of integration is one of the simplest, and it extends to multi-dimensional integrals. A minimal sketch follows below.
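As mentioned above, a minimal sketch (my own) of Monte Carlo integration over an interval:

```python
import numpy as np

rng = np.random.default_rng(19)

def mc_integrate(f, a, b, n=1_000_000):
    """Estimate the integral of f over [a, b] by averaging f at uniform points."""
    u = rng.uniform(a, b, size=n)
    return (b - a) * f(u).mean()

# Example: integral of sin(x) over [0, pi] is exactly 2.
estimate = mc_integrate(np.sin, 0.0, np.pi)
print(estimate)   # close to 2, with error shrinking like 1/sqrt(n)
```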

55 / 61

Page 56: Classical Limit Theorems: Applications

Classical Limit Theorems: Applications

4. Maximum Likelihood Estimation:
Suppose that we have observed data $X_1, \ldots, X_n$ from a distribution $f(x \mid \theta)$, where $\theta$ is an unknown parameter. The likelihood function is:

$$L_n(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta).$$

We can estimate $\theta$ by its value at the maximum of the likelihood:

$$\hat{\theta}_n = \arg\max_{\theta} L_n(\theta).$$

The estimator $\hat{\theta}_n$ depends on the sample size n. If the likelihood function satisfies some standard regularity conditions, one can prove two limit theorems for $\hat{\theta}_n$, as $n \to \infty$ (see the sketch after this list):

- Consistency: $\hat{\theta}_n \xrightarrow{P} \theta$;
- Asymptotic normality: $\sqrt{n}\,[\hat{\theta}_n - \theta] \xrightarrow{d} N(0, \sigma^2(\theta))$, where $\sigma^2(\theta) = 1/I(\theta)$ and $I(\theta)$ is the Fisher information:

$$I(\theta) \equiv E\!\left[\left(\frac{d}{d\theta} \ln f(X_1 \mid \theta)\right)^2\right].$$

56 / 61
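A short toy study (my own sketch) of the asymptotic normality of the maximum likelihood estimator, for the exponential distribution where the MLE is simply $\hat\lambda = 1/\bar{X}$ and the Fisher information is $I(\lambda) = 1/\lambda^2$:

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(23)

lam_true, n, n_toys = 2.0, 500, 50_000

x = rng.exponential(1.0 / lam_true, size=(n_toys, n))
lam_hat = 1.0 / x.mean(axis=1)      # MLE of the exponential rate

# Asymptotic variance is 1/(n*I(lambda)) = lambda^2 / n.
z = np.sqrt(n) * (lam_hat - lam_true) / lam_true

print(kstest(z, norm.cdf))   # standardized MLE is approximately N(0, 1)
```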

Page 57: Classical Limit Theorems: Applications

5. Empirical Distribution Functions:
Let $X_1, \ldots, X_n$ be i.i.d. with an entirely unknown distribution function F. We can estimate F by the empirical distribution function:

$$F_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(X_i \leq t).$$

This estimator is justified by the Strong Law of Large Numbers:

$$F_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(X_i \leq t) \xrightarrow{a.s.} E[\mathbf{1}(X_1 \leq t)] = F(t).$$

The convergence is actually uniform in t, a result known as the Glivenko-Cantelli theorem:

$$\sup_{t} |F_n(t) - F(t)| \xrightarrow{a.s.} 0.$$

Accompanying this result is a central limit theorem known as the Kolmogorov-Smirnov theorem:

$$\sqrt{n}\, \sup_{t} |F_n(t) - F(t)| \xrightarrow{d} Z, \quad \text{where } P(Z \leq t) = 1 - 2 \sum_{j=1}^{\infty} (-1)^{j+1} e^{-2 j^2 t^2}.$$

57 / 61

Page 58: Classical Limit Theorems: Applications

Classical Limit Theorems: Applications

6. The Delta Method:
Suppose we make an observation X that we can use to estimate a parameter $\theta$. However, we are not interested in $\theta$ itself, but in some function g of $\theta$. We could then consider using g(X) as an estimator of $g(\theta)$. What are the properties of this estimator? What is its variance, its sampling distribution?

A first-order Taylor expansion of g(X) around $X = \theta$ yields:

$$g(X) = g(\theta) + g'(\theta)(X - \theta) + \text{Remainder}.$$

Assuming that the mean of X is $\theta$, this leads to $E(g(X)) \approx g(\theta)$ (neglecting the remainder in the Taylor expansion), and therefore:

$$\mathrm{Var}\, g(X) \approx E\!\left([g(X) - g(\theta)]^2\right) \approx E\!\left([g'(\theta)(X - \theta)]^2\right) = [g'(\theta)]^2\, \mathrm{Var}\, X.$$

So far all we have done is rederive the rule of error propagation. Where it gets interesting is that this rule is associated with a generalization of the Central Limit Theorem.

58 / 61

Page 59: Classical Limit Theorems: Applications

Classical Limit Theorems: Applications

6. The Delta Method (continued):

- Basic Delta Method:
Let $Y_1, Y_2, \ldots$ be random with $\sqrt{n}\,(Y_n - \theta) \xrightarrow{d} N(0, \sigma^2)$. For given g and $\theta$, suppose that $g'(\theta)$ exists and is not zero. Then

$$\sqrt{n}\,[g(Y_n) - g(\theta)] \xrightarrow{d} N\!\left(0, \sigma^2 [g'(\theta)]^2\right).$$

- Second-Order Delta Method:
Let $Y_1, Y_2, \ldots$ be random with $\sqrt{n}\,(Y_n - \theta) \xrightarrow{d} N(0, \sigma^2)$. For given g and $\theta$, suppose that $g'(\theta) = 0$ and $g''(\theta)$ exists and is not zero. Then

$$n\,[g(Y_n) - g(\theta)] \xrightarrow{d} \frac{1}{2} \sigma^2 g''(\theta)\, \chi^2_1.$$

- Multivariate Delta Method:
Let $\vec{X}_1, \vec{X}_2, \ldots$ be a sequence of random p-vectors such that $E(X_{ik}) = \mu_i$ and $\mathrm{Cov}(X_{ik}, X_{jk}) = \sigma_{ij}$. For a given function g with continuous first partial derivatives and a specific value of $\vec{\mu}$, suppose that $\tau^2 \equiv \sum\sum \sigma_{ij}\, \dfrac{\partial g}{\partial \mu_i}\, \dfrac{\partial g}{\partial \mu_j} > 0$. Then:

$$\sqrt{n}\,[g(\bar{X}_1, \ldots, \bar{X}_p) - g(\mu_1, \ldots, \mu_p)] \xrightarrow{d} N(0, \tau^2).$$
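A toy check of the basic delta method (my own sketch): if $Y_n$ is the mean of n exponential variables with rate $\lambda$, then $g(Y_n) = 1/Y_n$ should be asymptotically normal with variance $[g'(1/\lambda)]^2 \sigma^2 / n = \lambda^2 / n$.

```python
import numpy as np

rng = np.random.default_rng(31)

lam, n, n_toys = 3.0, 2000, 50_000

y = rng.exponential(1.0 / lam, size=(n_toys, n)).mean(axis=1)   # Y_n, mean 1/lam
g = 1.0 / y                                                      # g(Y_n) estimates lam

# Delta method prediction: Var[g(Y_n)] ≈ [g'(theta)]^2 * Var(Y_n)
#   with theta = 1/lam, g'(theta) = -1/theta^2 = -lam^2, Var(Y_n) = 1/(lam^2 n).
var_predicted = lam**2 / n
print(g.var(), var_predicted)   # the empirical variance matches the prediction
```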

59 / 61

Page 60: Classical Limit Theorems: Applications

Classical Limit Theorems: Applications

7. Sampling to a Foregone Conclusion
Suppose we are looking for a new physics signal in a specific channel. We have a sample of n events in that channel, and for each event in the sample we measure property X. We plan to claim discovery if the observed significance $Z_n$ exceeds a preset threshold c:

$$Z_n \equiv \frac{|\bar{X}_n - \mu|}{\sigma/\sqrt{n}} \geq c,$$

where $\mu$ is the expected value of X in the absence of signal and $\sigma$ is the measurement resolution. The discovery threshold c is typically set to 3 or 5 in HEP.

Suppose that after an initial data collection run we observe some indication of a signal, but not enough to claim discovery. We then proceed to take more data and regularly check our discovery criterion. As the sample size increases, is there any guarantee that the probability of making the correct decision regarding the presence of signal goes to 1?

60 / 61

Page 61: Classical Limit Theorems: Applications

Classical Limit Theorems: Applications

7. Sampling to a Foregone Conclusion (continued)
The answer is NO! At least not if we keep c constant with n. This is a consequence of the Law of the Iterated Logarithm (LIL), according to which the event $Z_n \geq (1 + \varepsilon)\sqrt{2 \ln\ln n}$ happens infinitely many times if $\varepsilon \leq 0$. Therefore, regardless of the choice of c, $Z_n$ will eventually exceed c for some n, even if there is no signal.

Furthermore, the LIL tells us exactly how to vary the discovery threshold with sample size in order to avoid "sampling to a foregone conclusion".

"A blind use of [significances] allows the statistician to cheat, by claiming at a suitable point in a sequential experiment that he has a train to catch. This must have been known to Khintchine when he proved in 1924 that, in sequential binomial sampling, a 'sigmage' of nearly $\sqrt{2 \ln\ln n}$ is reached infinitely often, with probability 1. [...] But note that the iterated logarithm increases with fabulous slowness, so that this particular objection to the use of [significances] is theoretical rather than practical. To be reasonably sure of getting 3σ one would need to go sampling for billions of years, by which time there might not be any trains to catch." (I. J. Good, J. R. Statist. Soc. B 27 (1965), p. 197.)

61 / 61