TRANSCRIPT
Stochastic Processes for Physicists: Understanding Noisy Systems
Chapter 1: A review of probability theory
Paul Kirk, Division of Molecular Biosciences, Imperial College London
19/03/2013
1.1 Random variables and mutually exclusive events
Random variables
• Suppose we do not know the precise value of a variable, but may have an idea of the relative likelihood that it will have one of a number of possible values.
• Let us call the unknown quantity X .
• This quantity is referred to as a random variable.
Probability
• 6-sided die. Let X be the value we get when we roll the die.
• Describe the likelihood that X will have each of the values 1, . . . , 6 by a number between 0 and 1: the probability.
• If Prob(X = 3) = 1, then we will always get a 3.
• If Prob(X = 3) = 2/3, then we expect to get a 3 about two-thirds of the time.
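The die probabilities can be checked by simulation. Below is a minimal Python sketch (not part of the slides); the sample size and seed are arbitrary choices.

```python
import random

# Estimate Prob(X = 3) for a fair 6-sided die by simulating many rolls.
random.seed(0)
n_rolls = 100_000
rolls = [random.randint(1, 6) for _ in range(n_rolls)]
prob_3 = rolls.count(3) / n_rolls
print(prob_3)  # close to 1/6 ≈ 0.1667
```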
Paul Kirk 1 of 22
1.1 Random variables and mutually exclusive events
Mutually exclusive events
• The various values of X are an example of mutually exclusive events.
• X can take precisely one of the values between 1 and 6.
• “Mutually exclusive probabilities sum”
◦ Prob(X = 3 or X = 4) = Prob(X = 3) + Prob(X = 4).
Figure 1.1. An illustration of summing the probabilities of mutually exclusive events, both for discrete and continuous random variables.
If we want to know the probability for X to be in the range from 3 to 4, we sum all the probabilities for the values from 3 to 4. This is illustrated in Figure 1.1. Since X always takes a value between 1 and 6, the probability for it to take a value in this range must be unity. Thus, the sum of the probabilities for all the mutually exclusive possible values must always be unity. If the die is fair, then all the possible values are equally likely, and each probability is therefore equal to 1/6.
Note: in mathematics texts it is customary to denote the unknown quantity using a capital letter, say X, and a variable that specifies one of the possible values that X may have as the equivalent lower-case letter, x. We will use this convention in this chapter, but in the following chapters we will use a lower-case letter for both the unknown quantity and the values it can take, since it causes no confusion.
In the above example, X is a discrete random variable, since it takes the discrete set of values 1, . . . , 6. If instead the value of X can be any real number, then we say that X is a continuous random variable. Once again we assign a number to each of these values to describe their relative likelihoods. This number is now a function of x (where x ranges over the values that X can take), called the probability density, and is usually denoted by P_X(x) (or just P(x)). The probability for X to be in the range from x = a to x = b is now the area under P(x) from x = a to x = b. That is
Prob(a < X < b) = ∫_a^b P(x) dx.    (1.1)
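Equation (1.1) can be illustrated numerically. The sketch below (an assumption, not from the text) takes P(x) to be the standard Gaussian density, since the exact answer is then known via the error function, and approximates the integral with the trapezoidal rule.

```python
import math

def density(x):
    # standard Gaussian probability density P(x)
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def prob_between(a, b, n=10_000):
    # trapezoidal approximation of Prob(a < X < b) = integral of P from a to b
    h = (b - a) / n
    total = 0.5 * (density(a) + density(b))
    for i in range(1, n):
        total += density(a + i * h)
    return total * h

exact = math.erf(1 / math.sqrt(2))  # Prob(-1 < X < 1) for a standard Gaussian
print(prob_between(-1.0, 1.0), exact)  # both ≈ 0.6827
```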
1.1 Random variables and mutually exclusive events
“Note: in mathematics texts it is customary to denote the unknown quantity using a capital letter, say X, and a variable that specifies one of the possible values that X may have as the equivalent lower-case letter, x. We will use this convention in this chapter, but in the following chapters we will use a lower-case letter for both the unknown quantity and the values it can take, since it causes no confusion.”
So, rather than writing Prob(X = 3) or Prob(X = x), we will (in later chapters) tend to write Prob(3) or Prob(x).
Warning: may cause confusion.
1.1 Random variables and mutually exclusive events
Continuous random variables
• For continuous random variables, the probability for X to be within a range is found by integrating the probability density function.
• Prob(a < X < b) = ∫_a^b P(x) dx
1.1 Random variables and mutually exclusive events
Expectation
• The expectation of an arbitrary function, f(X), with respect to the probability density function P(X) is

〈f(X)〉_{P(X)} = ∫_{−∞}^{∞} P(x) f(x) dx.

• The mean or expected value of X is 〈X〉.
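For a discrete random variable the integral becomes a sum. As a sketch (not from the slides), the expectation formula applied to the fair die from earlier:

```python
def expectation(f, pmf):
    # discrete analogue of <f(X)>: sum over x of P(x) f(x)
    return sum(p * f(x) for x, p in pmf.items())

die = {x: 1 / 6 for x in range(1, 7)}        # fair 6-sided die
mean = expectation(lambda x: x, die)         # <X>
mean_sq = expectation(lambda x: x * x, die)  # <X^2>
print(mean, mean_sq)  # ≈ 3.5 and ≈ 91/6 ≈ 15.1667
```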
1.1 Random variables and mutually exclusive events
Variance
• The variance of X is the expectation of the squared difference from the mean:

V[X] = ∫_{−∞}^{∞} P(x)(x − 〈X〉)^2 dx
     = ∫_{−∞}^{∞} P(x)(x^2 + 〈X〉^2 − 2x〈X〉) dx
     = ∫_{−∞}^{∞} P(x)x^2 dx + ∫_{−∞}^{∞} P(x)〈X〉^2 dx − ∫_{−∞}^{∞} P(x)2x〈X〉 dx
     = 〈X^2〉 + 〈X〉^2 ∫_{−∞}^{∞} P(x) dx − 2〈X〉 ∫_{−∞}^{∞} P(x)x dx
     = 〈X^2〉 + 〈X〉^2 − 2〈X〉^2
     = 〈X^2〉 − 〈X〉^2
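The identity V[X] = 〈X^2〉 − 〈X〉^2 can be verified for the fair die (a sketch; sums replace the integrals in the discrete case):

```python
die = {x: 1 / 6 for x in range(1, 7)}  # fair 6-sided die
mean = sum(p * x for x, p in die.items())
# definition: expectation of the squared deviation from the mean
var_direct = sum(p * (x - mean) ** 2 for x, p in die.items())
# shortcut: <X^2> - <X>^2
var_shortcut = sum(p * x * x for x, p in die.items()) - mean ** 2
print(var_direct, var_shortcut)  # both ≈ 35/12 ≈ 2.9167
```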
1.2 Independence
Independence
• “Independent probabilities multiply”
• For independent variables, P_{X,Y}(x, y) = P_X(x) P_Y(y)
• For independent variables, 〈XY〉 = 〈X〉〈Y〉
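“Independent probabilities multiply” can be checked directly for two independent fair dice (a sketch, not from the slides): the joint pmf is the product of the marginals, and the double sum gives 〈XY〉 = 〈X〉〈Y〉 = 3.5^2.

```python
p = 1 / 6  # probability of each face of a fair die
# <XY> computed from the product (joint) pmf of two independent dice
exy = sum(p * p * x * y for x in range(1, 7) for y in range(1, 7))
print(exy)  # ≈ 3.5 * 3.5 = 12.25
```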
1.3 Dependent random variables
Dependence
• If X and Y are dependent, then P_{X,Y}(x, y) does not factor as the product of P_X(x) and P_Y(y)
• If we know P_{X,Y}(x, y) and want to know P_X(x), then it is obtained by “integrating out” (or “marginalising”) the other variable:

P_X(x) = ∫_{−∞}^{∞} P_{X,Y}(x, y) dy
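A discrete sketch of marginalisation (the joint table below is made up for illustration; it does not factor, so X and Y are dependent):

```python
# made-up joint pmf P_{X,Y}(x, y) for two binary variables
joint = {
    (0, 0): 0.3, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.5,
}
# marginalise: P_X(x) = sum over y of P_{X,Y}(x, y)
px = {x: sum(p for (xx, y), p in joint.items() if xx == x) for x in (0, 1)}
print(px)  # P_X(0) = 0.4, P_X(1) = 0.6
```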
1.3 Dependent random variables
Conditional probability densities
• The probability density for X given that we know that Y = y is written P(X = x | Y = y) or P(x | y) and is referred to as the conditional probability density for X given Y.

P_{X|Y}(X = x | Y = y) = P_{X,Y}(X = x, Y = y) / P_Y(Y = y).
Explanation: “To see how to calculate this conditional probability, we note first that P(x, y) with y = a gives the relative probability for different values of x given that Y = a. To obtain the conditional probability density for X given that Y = a, all we have to do is divide P(x, a) by its integral over all values of x.”
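The same recipe in code (a sketch with a made-up joint table): divide the slice P(x, a) by its normalisation, which is exactly P(x | y) = P(x, y) / P(y).

```python
# made-up joint pmf for two binary variables
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.5}
# P(Y = 1): sum of the slice P(x, 1) over all x
py1 = joint[(0, 1)] + joint[(1, 1)]
# conditional density P(x | Y = 1) = P(x, 1) / P(Y = 1)
cond = {x: joint[(x, 1)] / py1 for x in (0, 1)}
print(cond)  # P(0 | Y=1) = 1/6, P(1 | Y=1) = 5/6
```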
1.4 Correlations and correlation coefficients
• The covariance of X and Y is:

cov(X, Y) = 〈(X − 〈X〉)(Y − 〈Y〉)〉 = 〈XY〉 − 〈X〉〈Y〉.

Idea:
1. How can we define what it means for a value x to be “bigger than usual”? Well, we can see if x > 〈X〉, i.e. if x − 〈X〉 > 0.
2. Similarly, we can say that a value x is “smaller than usual” if x < 〈X〉, i.e. if x − 〈X〉 < 0.
3. If x tends to be “bigger than usual” when y is “bigger than usual”, then 〈(X − 〈X〉)(Y − 〈Y〉)〉 will be > 0.
• The correlation is just a normalised version of the covariance, which takes values in the range −1 to 1:

C_{XY} = (〈XY〉 − 〈X〉〈Y〉) / √(V[X] V[Y])
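A Monte Carlo sketch (not from the slides): Y = X + noise is positively correlated with X; with X and the noise both standard Gaussians, cov(X, Y) = 1 and C_XY = 1/√2. Seed and sample size are arbitrary.

```python
import random

random.seed(1)
n = 200_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [x + random.gauss(0, 1) for x in xs]  # Y depends on X

mx = sum(xs) / n
my = sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
vx = sum((x - mx) ** 2 for x in xs) / n
vy = sum((y - my) ** 2 for y in ys) / n
corr = cov / (vx * vy) ** 0.5  # normalised covariance, lies in [-1, 1]
print(cov, corr)  # cov ≈ 1, corr ≈ 0.707
```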
1.5 Adding random variables together
When we have two continuous random variables, X and Y, with probability densities P_X and P_Y, it is often useful to be able to calculate the probability density of the random variable whose value is the sum of them: Z = X + Y. It turns out that the probability density for Z is given by

P_Z(z) = ∫_{−∞}^{∞} P_{X,Y}(z − s, s) ds = ∫_{−∞}^{∞} P_{X,Y}(s, z − s) ds

If X and Y are independent, this becomes:

P_Z(z) = ∫_{−∞}^{∞} P_X(z − s) P_Y(s) ds = P_X ∗ P_Y
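The discrete analogue of the convolution formula, for the sum of two independent fair dice (a sketch; sums replace the integrals):

```python
# P_Z(z) = sum over s of P_X(z - s) P_Y(s) for Z = X + Y, two fair dice
px = {x: 1 / 6 for x in range(1, 7)}
pz = {z: sum(px.get(z - s, 0) * px.get(s, 0) for s in range(1, 7))
      for z in range(2, 13)}
print(pz[7], sum(pz.values()))  # ≈ 6/36 and ≈ 1.0
```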
1.5 Adding random variables together
If X1 and X2 are random variables and X = X1 + X2, then

〈X〉 = 〈X1〉 + 〈X2〉,

and if X1 and X2 are independent, then

V[X] = V[X1] + V[X2].

Mysterious (?) assertion
“Averaging the results of a number of independent measurements produces a more accurate result. This is because the variances of the different measurements add together.” — does this make sense?
1.5 Adding random variables together
Explanation
Assume all measurements have expectation, µ, and variance, σ^2. By the independence assumption, the variance of the average is:

V[ (1/N) Σ_{n=1}^{N} X_n ] = Σ_{n=1}^{N} V[X_n / N].

Moreover,

V[X_n / N] = E[X_n^2 / N^2] − (E[X_n / N])^2 = (E[X_n^2] − (E[X_n])^2) / N^2 = V[X_n] / N^2.

So,

V[ (1/N) Σ_{n=1}^{N} X_n ] = Σ_{n=1}^{N} V[X_n] / N^2 = (1/N^2) Σ_{n=1}^{N} σ^2 = σ^2 / N.
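A Monte Carlo check of the σ^2/N result (a sketch; the parameters are arbitrary illustrative choices):

```python
import random

random.seed(2)
sigma2, N, trials = 4.0, 10, 50_000
averages = []
for _ in range(trials):
    # N independent measurements, each with variance sigma2
    xs = [random.gauss(0, sigma2 ** 0.5) for _ in range(N)]
    averages.append(sum(xs) / N)
m = sum(averages) / trials
var_avg = sum((a - m) ** 2 for a in averages) / trials
print(var_avg)  # ≈ sigma2 / N = 0.4
```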
1.6 Transformations of a random variable
Key assertion: If Y = g(X), then:

〈f(Y)〉 = ∫_{x=a}^{x=b} P_X(x) f(g(x)) dx = ∫_{y=g(a)}^{y=g(b)} P_Y(y) f(y) dy.

Given this assumption, everything else falls out automatically:

〈f(Y)〉 = ∫_{x=a}^{x=b} P_X(x) f(g(x)) dx
        = ∫_{y=g(a)}^{y=g(b)} P_X(g^{−1}(y)) f(y) (dx/dy) dy
        = ∫_{y=g(a)}^{y=g(b)} [P_X(g^{−1}(y)) / g′(g^{−1}(y))] f(y) dy.

General result (for invertible g):

P_Y(y) = P_X(g^{−1}(y)) / |g′(g^{−1}(y))|.
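As a concrete instance of the general result (an assumption, chosen because the answer is the familiar log-normal density): take g(x) = exp(x) with X standard Gaussian, so g^{−1}(y) = ln y and g′(g^{−1}(y)) = y.

```python
import math

def px(x):
    # standard Gaussian density P_X(x)
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def py(y):
    # change of variables: P_Y(y) = P_X(g^{-1}(y)) / |g'(g^{-1}(y))| with g = exp
    return px(math.log(y)) / y

# compare against the standard log-normal density at a few points
for y in (0.5, 1.0, 2.0):
    lognormal = math.exp(-math.log(y) ** 2 / 2) / (y * math.sqrt(2 * math.pi))
    print(y, py(y), lognormal)  # the two columns agree
```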
1.7 The distribution function
The probability distribution function, which we will call D(x), of a random variable X is defined as the probability that X is less than or equal to x. Thus

D(x) = Prob(X ≤ x) = ∫_{−∞}^{x} P(z) dz

In addition, the fundamental theorem of calculus tells us that

P(x) = (d/dx) D(x).
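A numerical sketch of the relation P(x) = dD/dx, using the standard Gaussian (an assumption; its D(x) is known exactly via the error function) and a central finite difference:

```python
import math

def density(x):
    # P(x) for a standard Gaussian
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def dist(x):
    # D(x) = Prob(X <= x) for a standard Gaussian, via erf
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

h = 1e-5
x0 = 0.7
deriv = (dist(x0 + h) - dist(x0 - h)) / (2 * h)  # finite-difference dD/dx
print(deriv, density(x0))  # the two values agree closely
```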
1.8 The characteristic function
“The characteristic function is defined as the Fourier transform of the probability density.”

χ(s) = ∫_{−∞}^{∞} P(x) exp(isx) dx.

The inverse transform gives:

P(x) = (1/2π) ∫_{−∞}^{∞} χ(s) exp(−isx) ds.

The Fourier transform of the convolution of two functions, P(x) and Q(x), is the product of their Fourier transforms, χ_P(s) and χ_Q(s).
For discrete random variables, the characteristic function is a sum.
In general (for both discrete and continuous r.v.’s), we have:

χ_P(s) = 〈exp(isX)〉_{P(X)}.
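For a discrete variable the characteristic function is a sum; below is a sketch for the fair die. Note χ(0) = 1 for any density, since the probabilities sum to unity.

```python
import cmath

def chi(s, pmf):
    # discrete characteristic function: sum over x of P(x) exp(i s x)
    return sum(p * cmath.exp(1j * s * x) for x, p in pmf.items())

die = {x: 1 / 6 for x in range(1, 7)}  # fair 6-sided die
print(chi(0.0, die))       # ≈ 1, as it must be for any normalised density
print(abs(chi(1.0, die)))  # the modulus |chi(s)| never exceeds 1
```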
1.9 Moments and cumulants
Moment generating function (departure from the book)
The moment generating function is defined as:

M(t) = 〈exp(tX)〉,

where X is a random variable, and the expectation is with respect to some density P(X), so that

M(t) = ∫_{−∞}^{∞} exp(tx) P(x) dx
     = ∫_{−∞}^{∞} (1 + tx + (1/2!)t^2x^2 + ...) P(x) dx
     = 1 + t m_1 + (1/2!) t^2 m_2 + ... + (1/r!) t^r m_r + ...

where m_r = 〈X^r〉 is the r-th (raw) moment.
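A sketch of the moment generating function for the fair die (not from the slides), with a central-difference check that M′(0) = m_1 = 〈X〉 = 3.5:

```python
import math

def M(t):
    # moment generating function of a fair die: <exp(tX)>
    return sum(math.exp(t * x) / 6 for x in range(1, 7))

h = 1e-5
m1 = (M(h) - M(-h)) / (2 * h)  # finite-difference estimate of M'(0)
print(M(0.0), m1)  # ≈ 1.0 and ≈ 3.5
```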
1.9 Moments and cumulants
Moment generating function (continued)
M(t) = 1 + t m_1 + (1/2!) t^2 m_2 + ... + (1/r!) t^r m_r + ...

It follows from the above expansion that:

M(0) = 1
M′(0) = m_1
M″(0) = m_2
...
M^{(r)}(0) = m_r
1.9 Moments and cumulants
Cumulant generating function (continued departure from book)
The log of the moment generating function is called the cumulant generating function,

R(t) = ln(M(t)).

By the chain rule of differentiation, we can write down the derivatives of R(t) in terms of the derivatives of M(t), e.g.

R′(t) = M′(t) / M(t)
R″(t) = [M(t)M″(t) − (M′(t))^2] / (M(t))^2

Note that R(0) = ln(M(0)) = 0, R′(0) = M′(0) = m_1 = µ, and R″(0) = M″(0) − (M′(0))^2 = m_2 − m_1^2 = σ^2, . . . These are the cumulants.
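The same checks for the cumulant generating function of the fair die (a sketch): finite differences recover R′(0) = µ = 3.5 and R″(0) = σ^2 = 35/12.

```python
import math

def R(t):
    # cumulant generating function: ln M(t), for a fair die
    return math.log(sum(math.exp(t * x) / 6 for x in range(1, 7)))

h = 1e-4
mean = (R(h) - R(-h)) / (2 * h)             # finite-difference R'(0)
var = (R(h) - 2 * R(0.0) + R(-h)) / h ** 2  # finite-difference R''(0)
print(mean, var)  # ≈ 3.5 and ≈ 35/12 ≈ 2.9167
```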
1.9 Moments and cumulants
The moments can be calculated from the derivatives of the characteristic function, evaluated at s = 0. We can see this by expanding the characteristic function as a Taylor series:

χ(s) = Σ_{n=0}^{∞} χ^{(n)}(0) s^n / n!

where χ^{(n)}(s) is the n-th derivative of χ(s). But we also have:

χ(s) = 〈e^{isX}〉 = 〈 Σ_{n=0}^{∞} (isX)^n / n! 〉 = Σ_{n=0}^{∞} i^n 〈X^n〉 s^n / n!

Equating the two expressions, we get: 〈X^n〉 = χ^{(n)}(0) / i^n
1.9 Moments and cumulants
Cumulants
The n-th order cumulant of X is the n-th derivative of the log of the characteristic function, evaluated at s = 0 (divided by i^n, in the same way as for the moments).
For independent random variables, X and Y, if Z = X + Y then the n-th cumulant of Z is the sum of the n-th cumulants of X and Y.
The Gaussian distribution is also the only absolutely continuous distribution all of whose cumulants beyond the first two (i.e. other than the mean and variance) are zero.
1.10 The multivariate Gaussian
Let x = [x_1, . . . , x_N]^T; then the general form of the Gaussian pdf is:

P(x) = 1 / √((2π)^N det(Σ)) · exp( −(1/2)(x − µ)^T Σ^{−1} (x − µ) ),

where µ is the mean vector and Σ is the covariance matrix.

All higher moments of a Gaussian can be written in terms of the means and covariances. Defining ∆X ≡ X − 〈X〉, for a 1-dimensional Gaussian we have:

〈∆X^{2n}〉 = (2n − 1)! (V[X])^n / (2^{n−1} (n − 1)!)
〈∆X^{2n−1}〉 = 0
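A Monte Carlo check of the even-moment formula (a sketch; the seed, σ and sample size are arbitrary): for n = 2 it predicts 〈∆X^4〉 = 3(V[X])^2.

```python
import math
import random

random.seed(3)
sigma = 1.5
n_samples = 400_000
samples = [random.gauss(0, sigma) for _ in range(n_samples)]
m4 = sum(x ** 4 for x in samples) / n_samples  # estimate of <Delta X^4>
# formula with n = 2: (2n-1)! (V[X])^n / (2^(n-1) (n-1)!) = 3 sigma^4
predicted = math.factorial(3) * sigma ** 4 / (2 * math.factorial(1))
print(m4, predicted)  # both ≈ 15.19
```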