probability notes

Section 1

Probability spaces.

$(\Omega, \mathcal{A}, \mathbb{P})$ is a probability space if $(\Omega, \mathcal{A})$ is a measurable space and $\mathbb{P}$ is a measure on $\mathcal{A}$ such that $\mathbb{P}(\Omega) = 1$. Let us recall some definitions from measure theory. A pair $(\Omega, \mathcal{A})$ is a measurable space if $\mathcal{A}$ is a $\sigma$-algebra of subsets of $\Omega$. A collection $\mathcal{A}$ of subsets of $\Omega$ is called an algebra if:

(i) $\Omega \in \mathcal{A}$,
(ii) $C, B \in \mathcal{A} \Longrightarrow C \cap B,\ C \cup B \in \mathcal{A}$,
(iii) $B \in \mathcal{A} \Longrightarrow \Omega \setminus B \in \mathcal{A}$.

A collection $\mathcal{A}$ of subsets of $\Omega$ is called a $\sigma$-algebra if it is an algebra and

(iv) $C_i \in \mathcal{A}$ for all $i \ge 1 \Longrightarrow \bigcup_{i \ge 1} C_i \in \mathcal{A}$.

Elements of the $\sigma$-algebra $\mathcal{A}$ are often called events. $\mathbb{P}$ is a probability measure on the $\sigma$-algebra $\mathcal{A}$ if

(1) $\mathbb{P}(\Omega) = 1$,
(2) $\mathbb{P}(A) \ge 0$ for all events $A \in \mathcal{A}$,
(3) $\mathbb{P}$ is countably additive: given any disjoint $A_i \in \mathcal{A}$ for $i \ge 1$, i.e. $A_i \cap A_j = \emptyset$ for all $i \ne j$,
$$\mathbb{P}\Big( \bigcup_{i=1}^{\infty} A_i \Big) = \sum_{i=1}^{\infty} \mathbb{P}(A_i).$$

It is sometimes convenient to use an equivalent formulation of property (3):

(3') $\mathbb{P}$ is finitely additive and continuous, i.e. for any decreasing sequence of events $B_n \in \mathcal{A}$, $B_n \supseteq B_{n+1}$, $B = \bigcap_{n \ge 1} B_n \Longrightarrow \mathbb{P}(B) = \lim_{n \to \infty} \mathbb{P}(B_n)$.

Lemma 1 Properties (3) and (3') are equivalent.

Proof. First, let us show that (3) implies (3'). If we denote $C_n = B_n \setminus B_{n+1}$ then $B_n$ is the disjoint union $\bigcup_{k \ge n} C_k \cup B$ and, by (3),
$$\mathbb{P}(B_n) = \mathbb{P}(B) + \sum_{k \ge n} \mathbb{P}(C_k).$$
Since the last sum is the tail of a convergent series, $\lim_{n \to \infty} \mathbb{P}(B_n) = \mathbb{P}(B)$. Next, let us show that (3') implies (3). If, given disjoint sets $(A_n)$, we define $B_n = \bigcup_{i \ge n+1} A_i$, then
$$\bigcup_{i \ge 1} A_i = A_1 \cup A_2 \cup \cdots \cup A_n \cup B_n,$$

and, by finite additivity,
$$\mathbb{P}\Big( \bigcup_{i \ge 1} A_i \Big) = \sum_{i=1}^{n} \mathbb{P}(A_i) + \mathbb{P}(B_n).$$
Clearly, $B_n \supseteq B_{n+1}$ and, since the $(A_n)$ are disjoint, $\bigcap_{n \ge 1} B_n = \emptyset$. By (3'), $\lim_{n \to \infty} \mathbb{P}(B_n) = 0$ and (3) follows. $\square$

Let us give several examples of probability spaces. The most basic example of a probability space is $([0,1], \mathcal{B}([0,1]), \lambda)$, where $\mathcal{B}([0,1])$ is the Borel $\sigma$-algebra on $[0,1]$ and $\lambda$ is the Lebesgue measure. Let us quickly recall how this measure is constructed. More generally, let us consider the construction of the Lebesgue-Stieltjes measure on $\mathbb{R}$ corresponding to a non-decreasing right-continuous function $F(x)$. One considers the algebra of finite unions of disjoint intervals
$$\mathcal{A} = \Big\{ \bigcup_{i \le n} (a_i, b_i] \ \Big|\ n \ge 1,\ \text{all } (a_i, b_i] \text{ are disjoint} \Big\}$$
and defines the measure $F$ on the sets in this algebra by (we slightly abuse the notation here)
$$F\Big( \bigcup_{i \le n} (a_i, b_i] \Big) = \sum_{i=1}^{n} \big( F(b_i) - F(a_i) \big).$$
It is not difficult to show that $F$ is countably additive on the algebra $\mathcal{A}$, i.e.
$$F\Big( \bigcup_{i=1}^{\infty} A_i \Big) = \sum_{i=1}^{\infty} F(A_i)$$
whenever all $A_i$ and $\bigcup_{i \ge 1} A_i$ are finite unions of disjoint intervals. The proof is exactly the same as in the case of $F(x) = x$ corresponding to the Lebesgue measure. Once countable additivity is proved on the algebra, it remains to appeal to the following key result. Recall that, given an algebra $\mathcal{A}$, the $\sigma$-algebra $\sigma(\mathcal{A})$ generated by $\mathcal{A}$ is the smallest $\sigma$-algebra that contains $\mathcal{A}$.

Theorem 1 (Caratheodory's extension theorem) If $\mathcal{A}$ is an algebra of sets and $\mu: \mathcal{A} \to \mathbb{R}$ is a non-negative countably additive function on $\mathcal{A}$, then $\mu$ can be extended to a measure on the $\sigma$-algebra $\sigma(\mathcal{A})$. If $\mu$ is $\sigma$-finite, then this extension is unique.

Therefore, $F$ above can be uniquely extended to a measure on the $\sigma$-algebra $\sigma(\mathcal{A})$ generated by the algebra of finite unions of disjoint intervals. This is the $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ of Borel sets on $\mathbb{R}$. Clearly, $(\mathbb{R}, \mathcal{B}(\mathbb{R}), F)$ will be a probability space if

(1) $F(x) = F((-\infty, x])$ is non-decreasing and right-continuous,
(2) $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$.

The reason we required $F$ to be right-continuous corresponds to our choice that the intervals $(a,b]$ in the algebra are closed on the right, so the two conventions agree and the measure $F$ is continuous, as it should be, e.g.
$$F\big( (a,b] \big) = F\Big( \bigcap_{n \ge 1} (a, b + n^{-1}] \Big) = \lim_{n \to \infty} F\big( (a, b + n^{-1}] \big).$$
In Probability Theory, functions satisfying properties (1) and (2) above are called cumulative distribution functions, or c.d.f. for short, and we will give an alternative construction of the probability space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), F)$ in the next section.

Other basic ways to define a probability are through a probability function or a density function. If the measurable space is such that all singletons are measurable, we can simply assign some weights $p_i = \mathbb{P}(\omega_i)$ to a sequence of distinct points $\omega_i \in \Omega$, such that $\sum_{i \ge 1} p_i = 1$, and let
$$\mathbb{P}(A) = \mathbb{P}\big( A \cap \{\omega_i\}_{i \ge 1} \big).$$

The function $i \mapsto p_i = \mathbb{P}(\{\omega_i\})$ is called a probability function. Now, suppose that we already have a $\sigma$-finite measure $Q$ on $(\Omega, \mathcal{A})$, and consider any measurable function $f: \Omega \to \mathbb{R}^+$ such that
$$\int_{\Omega} f(\omega)\, dQ(\omega) = 1.$$
Then we can define a probability measure $\mathbb{P}$ on $(\Omega, \mathcal{A})$ by
$$\mathbb{P}(A) = \int_{A} f(\omega)\, dQ(\omega).$$
The function $f$ is called the density function of $\mathbb{P}$ with respect to $Q$ and, in a typical setting when $\Omega = \mathbb{R}^k$ and $Q$ is the Lebesgue measure $\lambda$, $f$ is simply called the density function of $\mathbb{P}$.

Examples. (1) The probability measure on $\mathbb{R}$ corresponding to the probability function
$$p_i = \mathbb{P}(\{i\}) = \frac{\lambda^i}{i!} e^{-\lambda}$$
for integer $i \ge 0$ is called the Poisson distribution with the parameter $\lambda > 0$. (Notation: given a set $A$, we will denote by $I(x \in A)$ or $I_A(x)$ the indicator that $x$ belongs to $A$.) (2) A probability measure on $\mathbb{R}$ corresponding to the density function $f(x) = \lambda e^{-\lambda x} I(x \ge 0)$ is called the exponential distribution with the parameter $\lambda > 0$. (3) A probability measure on $\mathbb{R}$ corresponding to the density function
$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$
is called the standard normal, or standard Gaussian, distribution on $\mathbb{R}$. $\square$
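As a quick numerical illustration (an added sketch, not part of the original notes), one can check that these densities integrate to one and that probabilities of sets are obtained by integrating the density over the set; the rate $\lambda = 2$ and the grids below are arbitrary choices.

```python
import numpy as np

# Illustrative sketch: densities integrate to ~1; P(A) is the integral of f over A.
lam = 2.0                                   # assumed rate parameter for the example
x, dx = np.linspace(0.0, 50.0, 1_000_001, retstep=True)
f_exp = lam * np.exp(-lam * x)              # f(x) = lambda * exp(-lambda x) on x >= 0
print(np.sum(f_exp) * dx)                   # ~1.0 (total mass, Riemann sum)

y, dy = np.linspace(-10.0, 10.0, 1_000_001, retstep=True)
f_gauss = np.exp(-y ** 2 / 2) / np.sqrt(2 * np.pi)
print(np.sum(f_gauss) * dy)                 # ~1.0 (total mass)

# P([0, 1]) under the exponential distribution: integrate the density over [0, 1].
p01 = np.sum(f_exp[x <= 1.0]) * dx
print(p01, 1 - np.exp(-lam))                # Riemann sum vs. exact value 1 - e^{-lambda}
```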

Recall that a measure $\mathbb{P}$ is called absolutely continuous with respect to another measure $Q$, $\mathbb{P} \ll Q$, if for all $A \in \mathcal{A}$,
$$Q(A) = 0 \Longrightarrow \mathbb{P}(A) = 0,$$
in which case the existence of the density is guaranteed by the following classical result from measure theory.

Theorem 2 (Radon-Nikodym) On a measurable space $(\Omega, \mathcal{A})$ let $\mu$ be a $\sigma$-finite measure and $\nu$ be a finite measure absolutely continuous with respect to $\mu$, $\nu \ll \mu$. Then there exists the Radon-Nikodym derivative $h \in L^1(\Omega, \mathcal{A}, \mu)$ such that
$$\nu(A) = \int_{A} h(\omega)\, d\mu(\omega)$$
for all $A \in \mathcal{A}$. Such $h$ is unique modulo $\mu$-a.e. equivalence.

Of course, the Radon-Nikodym theorem also applies to finite signed measures $\nu$, which can be decomposed into $\nu = \nu^+ - \nu^-$ for some finite measures $\nu^+, \nu^-$ (the so-called Hahn-Jordan decomposition). Let us recall a proof of the Radon-Nikodym theorem for convenience.

Proof. Clearly, we can assume that $\mu$ is a finite measure. Consider the Hilbert space $H = L^2(\Omega, \mathcal{A}, \mu + \nu)$ and the linear functional $T: H \to \mathbb{R}$ given by $T(f) = \int f\, d\nu$. Since
$$\Big| \int f\, d\nu \Big| \le \int |f|\, d(\mu + \nu) \le C \| f \|_H,$$
$T$ is a continuous linear functional and, by the Riesz-Frechet theorem, $\int f\, d\nu = \int f g\, d(\mu + \nu)$ for some $g \in H$. This implies $\int f\, d\mu = \int f (1 - g)\, d(\mu + \nu)$. Now $g(\omega) \ge 0$ for $(\mu + \nu)$-almost all $\omega$, which can be seen by taking $f(\omega) = I(g(\omega) < 0)$, and similarly $g(\omega) \le 1$ for $(\mu + \nu)$-almost all $\omega$. Therefore, we can take $0 \le g \le 1$. Let $E = \{\omega : g(\omega) = 1\}$. Then
$$\mu(E) = \int I(\omega \in E)\, d\mu(\omega) = \int I(\omega \in E)\big( 1 - g(\omega) \big)\, d(\mu + \nu)(\omega) = 0,$$

and since $\nu \ll \mu$, $\nu(E) = 0$. Since $\int f g\, d\mu = \int f (1 - g)\, d\nu$, we can restrict both integrals to $E^c$, and replacing $f$ by $f/(1-g)$ and denoting $h := g/(1-g)$ (defined on $E^c$) we get $\int_{E^c} f\, d\nu = \int_{E^c} f h\, d\mu$ (more carefully, one can truncate $f/(1-g)$ first and then use the monotone convergence theorem). Therefore, $\nu(A \cap E^c) = \int_{A \cap E^c} h\, d\mu$ and this finishes the proof if we set $h = 0$ on $E$. To prove uniqueness, consider two such $h$ and $h'$ and let $A = \{\omega : h(\omega) > h'(\omega)\}$. Then $0 = \int_{A} (h - h')\, d\mu$ and, therefore, $\mu(A) = 0$. $\square$

Let us now write down some important properties of $\sigma$-algebras and probability measures.

Lemma 2 (Approximation property) If $\mathcal{A}$ is an algebra of sets then for any $B \in \sigma(\mathcal{A})$ there exists a sequence $B_n \in \mathcal{A}$ such that $\lim_{n \to \infty} \mathbb{P}(B \triangle B_n) = 0$.

Proof. Here $B \triangle B_n$ denotes the symmetric difference $(B \cup B_n) \setminus (B \cap B_n)$. Let
$$\mathcal{D} = \Big\{ B \in \sigma(\mathcal{A}) \ \Big|\ \lim_{n \to \infty} \mathbb{P}(B \triangle B_n) = 0 \text{ for some } B_n \in \mathcal{A} \Big\}.$$
We will prove that $\mathcal{D}$ is a $\sigma$-algebra and, since $\mathcal{A} \subseteq \mathcal{D}$, this will imply that $\sigma(\mathcal{A}) \subseteq \mathcal{D}$. One can easily check that
$$d(B, C) := \mathbb{P}(B \triangle C) = \int_{\Omega} \big| I_B(\omega) - I_C(\omega) \big|\, d\mathbb{P}(\omega)$$
is a semi-metric, which satisfies

(a) $d(B \cap C, D \cap E) \le d(B, D) + d(C, E)$,
(b) $|\mathbb{P}(B) - \mathbb{P}(C)| \le d(B, C)$,
(c) $d(B^c, C^c) = d(B, C)$.

Now, consider $D_1, \ldots, D_N \in \mathcal{D}$. If a sequence $C_n^i \in \mathcal{A}$ for $n \ge 1$ approximates $D_i$, i.e. $\lim_{n \to \infty} \mathbb{P}(C_n^i \triangle D_i) = 0$, then, by properties (a)-(c), $C_n^N = \bigcup_{i \le N} C_n^i$ approximates $D^N = \bigcup_{i \le N} D_i$, which means that $D^N \in \mathcal{D}$. Let $D = \bigcup_{i \ge 1} D_i$. Since $\mathbb{P}(D \setminus D^N) \to 0$ as $N \to \infty$, it is clear that $D \in \mathcal{D}$, so $\mathcal{D}$ is a $\sigma$-algebra. $\square$

Dynkin's theorem. We will now describe a tool, the so-called Dynkin's theorem, or $\pi$-$\lambda$ theorem, which is often quite useful in checking various properties of probabilities.

$\pi$-systems: A collection of sets $\mathcal{P}$ is called a $\pi$-system if it is closed under taking intersections, i.e.

1. if $A, B \in \mathcal{P}$ then $A \cap B \in \mathcal{P}$.

$\lambda$-systems: A collection of sets $\mathcal{L}$ is called a $\lambda$-system if

1. $\Omega \in \mathcal{L}$,
2. if $A \in \mathcal{L}$ then $A^c \in \mathcal{L}$,
3. if $A_n \in \mathcal{L}$ are disjoint for $n \ge 1$ then $\bigcup_{n \ge 1} A_n \in \mathcal{L}$.

Given any collection of sets $\mathcal{C}$, by analogy with the $\sigma$-algebra $\sigma(\mathcal{C})$ generated by $\mathcal{C}$, we will denote by $\mathcal{L}(\mathcal{C})$ the smallest $\lambda$-system that contains $\mathcal{C}$. It is easy to see that the intersection of all $\lambda$-systems that contain $\mathcal{C}$ is again a $\lambda$-system that contains $\mathcal{C}$, so this intersection is precisely $\mathcal{L}(\mathcal{C})$.

Theorem 3 (Dynkin's theorem) If $\mathcal{P}$ is a $\pi$-system, $\mathcal{L}$ is a $\lambda$-system and $\mathcal{P} \subseteq \mathcal{L}$, then $\sigma(\mathcal{P}) \subseteq \mathcal{L}$.

We will give typical examples of applications of this result below.

Proof. First of all, it should be obvious that a collection of sets which is both a $\pi$-system and a $\lambda$-system is a $\sigma$-algebra. Therefore, if we can show that $\mathcal{L}(\mathcal{P})$ is a $\pi$-system then it is a $\sigma$-algebra and
$$\mathcal{P} \subseteq \sigma(\mathcal{P}) \subseteq \mathcal{L}(\mathcal{P}) \subseteq \mathcal{L},$$
which proves the result. Let us prove that $\mathcal{L}(\mathcal{P})$ is a $\pi$-system. For a fixed set $A \subseteq \Omega$, let us define
$$\mathcal{G}_A = \big\{ B \subseteq \Omega \ \big|\ B \cap A \in \mathcal{L}(\mathcal{P}) \big\}.$$

Step 1. Let us show that if $A \in \mathcal{L}(\mathcal{P})$ then $\mathcal{G}_A$ is a $\lambda$-system. Obviously, $\Omega \in \mathcal{G}_A$. If $B \in \mathcal{G}_A$ then $B \cap A \in \mathcal{L}(\mathcal{P})$ and, since $A^c \in \mathcal{L}(\mathcal{P})$ is disjoint from $B \cap A$,
$$B^c \cap A = \big( (B \cap A) \cup A^c \big)^c \in \mathcal{L}(\mathcal{P}).$$
This means that $B^c \in \mathcal{G}_A$. Finally, if $B_n \in \mathcal{G}_A$ are disjoint then $B_n \cap A \in \mathcal{L}(\mathcal{P})$ are disjoint and
$$\Big( \bigcup_{n \ge 1} B_n \Big) \cap A = \bigcup_{n \ge 1} (B_n \cap A) \in \mathcal{L}(\mathcal{P}),$$
so $\bigcup_{n \ge 1} B_n \in \mathcal{G}_A$. We showed that $\mathcal{G}_A$ is a $\lambda$-system.

Step 2. Next, let us show that if $A \in \mathcal{P}$ then $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{G}_A$. Since $\mathcal{P} \subseteq \mathcal{L}(\mathcal{P})$, by Step 1, $\mathcal{G}_A$ is a $\lambda$-system. Also, since $\mathcal{P}$ is a $\pi$-system, closed under taking intersections, $\mathcal{P} \subseteq \mathcal{G}_A$. This implies that $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{G}_A$. In other words, we showed that if $A \in \mathcal{P}$ and $B \in \mathcal{L}(\mathcal{P})$ then $A \cap B \in \mathcal{L}(\mathcal{P})$.

Step 3. Finally, let us show that if $B \in \mathcal{L}(\mathcal{P})$ then $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{G}_B$. By Step 2, $\mathcal{G}_B$ contains $\mathcal{P}$ and, by Step 1, $\mathcal{G}_B$ is a $\lambda$-system. Therefore, $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{G}_B$. We showed that if $B \in \mathcal{L}(\mathcal{P})$ and $A \in \mathcal{L}(\mathcal{P})$ then $A \cap B \in \mathcal{L}(\mathcal{P})$, so $\mathcal{L}(\mathcal{P})$ is a $\pi$-system. $\square$

Example. Suppose that $\Omega$ is a topological space with the Borel $\sigma$-algebra $\mathcal{B}$ generated by open sets. Given two probability measures $\mathbb{P}_1$ and $\mathbb{P}_2$ on $(\Omega, \mathcal{B})$, the collection of sets
$$\mathcal{L} = \big\{ B \in \mathcal{B} \ \big|\ \mathbb{P}_1(B) = \mathbb{P}_2(B) \big\}$$
is trivially a $\lambda$-system, by the properties of probability measures. On the other hand, the collection $\mathcal{P}$ of all open sets is a $\pi$-system and, therefore, if we know that $\mathbb{P}_1(B) = \mathbb{P}_2(B)$ for all open sets then, by Dynkin's theorem, this holds for all Borel sets $B \in \mathcal{B}$. Similarly, one can see that a probability on the Borel $\sigma$-algebra on the real line is determined by the probabilities of the sets $(-\infty, t]$ for all $t \in \mathbb{R}$. $\square$

Regularity of measures. Let us now consider the case $\Omega = S$ where $(S, d)$ is a metric space, and let $\mathcal{A}$ be the Borel $\sigma$-algebra generated by open (or closed) sets. A probability measure $\mathbb{P}$ on this space is called closed regular if
$$\mathbb{P}(A) = \sup\big\{ \mathbb{P}(F) \ \big|\ F \subseteq A,\ F \text{ closed} \big\} \qquad (1.0.1)$$
for all $A \in \mathcal{A}$. Similarly, a probability measure $\mathbb{P}$ is called regular if
$$\mathbb{P}(A) = \sup\big\{ \mathbb{P}(K) \ \big|\ K \subseteq A,\ K \text{ compact} \big\} \qquad (1.0.2)$$
for all $A \in \mathcal{A}$. It is a standard result in measure theory that every finite measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ is regular. In the setting of complete separable metric spaces, this is known as Ulam's theorem, which we will prove below.

Theorem 4 Every probability measure $\mathbb{P}$ on a metric space $(S, d)$ is closed regular.

Proof. Let us consider the collection of sets
$$\mathcal{L} = \big\{ A \in \mathcal{A} \ \big|\ \text{both } A \text{ and } A^c \text{ satisfy (1.0.1)} \big\}. \qquad (1.0.3)$$

First of all, let us show that each closed set $F \in \mathcal{L}$; we only need to show that the open set $U = F^c$ satisfies (1.0.1). Let us consider the sets
$$F_n = \big\{ s \in S \ \big|\ d(s, F) \ge 1/n \big\}.$$
It is obvious that all $F_n$ are closed, and $F_n \subseteq F_{n+1}$. One can also easily check that, since $F$ is closed, $\bigcup_{n \ge 1} F_n = U$ and, by the continuity of measure,
$$\mathbb{P}(U) = \lim_{n \to \infty} \mathbb{P}(F_n) = \sup_{n \ge 1} \mathbb{P}(F_n).$$
This proves that $U = F^c$ satisfies (1.0.1) and $F \in \mathcal{L}$. Next, one can easily check that $\mathcal{L}$ is a $\lambda$-system, which we will leave as an exercise below. Since the collection of all closed sets is a $\pi$-system, and closed sets generate the Borel $\sigma$-algebra $\mathcal{A}$, by Dynkin's theorem, all measurable sets are in $\mathcal{L}$. This proves that $\mathbb{P}$ is closed regular. $\square$

Theorem 5 (Ulam) If $(S, d)$ is a complete separable metric space then every probability measure $\mathbb{P}$ is regular.

Proof. First, let us show that for any $\varepsilon > 0$ there exists a compact set $K \subseteq S$ such that $\mathbb{P}(S \setminus K) \le \varepsilon$. Consider a sequence $\{s_1, s_2, \ldots\}$ that is dense in $S$. For any $m \ge 1$, $S = \bigcup_{i=1}^{\infty} B(s_i, \frac{1}{m})$, where $B(s_i, \frac{1}{m})$ is the closed ball of radius $1/m$ centered at $s_i$. By the continuity of measure, for large enough $n(m)$,
$$\mathbb{P}\Big( S \setminus \bigcup_{i=1}^{n(m)} B\big( s_i, \tfrac{1}{m} \big) \Big) \le \frac{\varepsilon}{2^m}.$$
If we take
$$K = \bigcap_{m \ge 1} \bigcup_{i=1}^{n(m)} B\big( s_i, \tfrac{1}{m} \big)$$
then
$$\mathbb{P}(S \setminus K) \le \sum_{m \ge 1} \frac{\varepsilon}{2^m} = \varepsilon.$$
Obviously, by construction, $K$ is closed and totally bounded. Since $S$ is complete, $K$ is compact. By the previous theorem, given $A \in \mathcal{A}$, we can find a closed subset $F \subseteq A$ such that $\mathbb{P}(A \setminus F) \le \varepsilon$. Therefore, $\mathbb{P}(A \setminus (F \cap K)) \le 2\varepsilon$, and since $F \cap K$ is compact, this finishes the proof. $\square$

Exercise. Let $\mathcal{F} = \{F \subseteq \mathbb{N} : \mathbb{N} \setminus F \text{ is finite}\}$ be the collection of all sets in $\mathbb{N}$ with finite complements. $\mathcal{F}$ is a filter, which means that (a) $\emptyset \notin \mathcal{F}$, ...

Section 2

Random variables.

Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space and $(S, \mathcal{B})$ be a measurable space where $\mathcal{B}$ is a $\sigma$-algebra of subsets of $S$. Recall that a function $X: \Omega \to S$ is called measurable if for all $B \in \mathcal{B}$,
$$X^{-1}(B) = \big\{ \omega \in \Omega \ \big|\ X(\omega) \in B \big\} \in \mathcal{A}.$$
In Probability Theory, such functions are called random variables, especially when $(S, \mathcal{B}) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Depending on the target space $S$, $X$ may be called a random vector, sequence, or, more generally, a random element in $S$. Recall that measurability can be checked on the sets that generate the $\sigma$-algebra $\mathcal{B}$ and, in particular, the following holds.

Lemma 3 $X: \Omega \to \mathbb{R}$ is a random variable if and only if, for all $t \in \mathbb{R}$,
$$\{X \le t\} := \big\{ \omega \in \Omega \ \big|\ X(\omega) \in (-\infty, t] \big\} \in \mathcal{A}.$$

Proof. Only the "if" direction requires proof. We will prove that
$$\mathcal{D} = \big\{ D \subseteq \mathbb{R} \ \big|\ X^{-1}(D) \in \mathcal{A} \big\}$$
is a $\sigma$-algebra. Since the sets $(-\infty, t] \in \mathcal{D}$, this will imply that $\mathcal{B}(\mathbb{R}) \subseteq \mathcal{D}$. The fact that $\mathcal{D}$ is a $\sigma$-algebra follows simply because taking pre-images preserves set operations. For example, if we consider a sequence $D_i \in \mathcal{D}$ for $i \ge 1$ then
$$X^{-1}\Big( \bigcup_{i \ge 1} D_i \Big) = \bigcup_{i \ge 1} X^{-1}(D_i) \in \mathcal{A},$$
because $X^{-1}(D_i) \in \mathcal{A}$ and $\mathcal{A}$ is a $\sigma$-algebra. Therefore, $\bigcup_{i \ge 1} D_i \in \mathcal{D}$. Other properties can be checked similarly, so $\mathcal{D}$ is a $\sigma$-algebra. $\square$

Given a random element $X$ on $(\Omega, \mathcal{A}, \mathbb{P})$ with values in $(S, \mathcal{B})$, let us denote the image measure on $\mathcal{B}$ by $\mathbb{P}_X = \mathbb{P} \circ X^{-1}$, which means that for $B \in \mathcal{B}$,
$$\mathbb{P}_X(B) = \mathbb{P}(X \in B) = \mathbb{P}\big( X^{-1}(B) \big) = \mathbb{P} \circ X^{-1}(B).$$
$(S, \mathcal{B}, \mathbb{P}_X)$ is called the sample space of the random element $X$ and $\mathbb{P}_X$ is called the law of $X$, or the distribution of $X$. Clearly, on this space the random variable $\pi: S \to S$ defined by the identity $\pi(s) = s$ has the same law as $X$. When $S = \mathbb{R}$, the function $F(t) = \mathbb{P}(X \le t)$ is called the cumulative distribution function (c.d.f.) of $X$. Clearly, this function satisfies the following properties that already appeared in the previous section:

(1) $F(x) = F((-\infty, x])$ is non-decreasing and right-continuous,
(2) $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$.

On the other hand, any such function is a c.d.f. of some random variable, for example, the random variable $X(x) = x$ on the space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), F)$ constructed in the previous section, since
$$\mathbb{P}\big( x : X(x) \le t \big) = F\big( (-\infty, t] \big) = F(t).$$

[Figure 2.1: A random variable defined by the quantile transformation.]

Another construction can be given on the probability space $([0,1], \mathcal{B}([0,1]), \lambda)$ with the Lebesgue measure $\lambda$, using the so-called quantile transformation. Given a c.d.f. $F$, let us define a random variable $X: [0,1] \to \mathbb{R}$ by the quantile transformation (see Figure 2.1):
$$X(x) = \inf\big\{ s \in \mathbb{R} \ \big|\ F(s) \ge x \big\}.$$
What is the c.d.f. of $X$? Notice that, since $F$ is right-continuous,
$$X(x) \le t \iff \inf\{ s \mid F(s) \ge x \} \le t \iff \lim_{s \downarrow t} F(s) \ge x \iff F(t) \ge x.$$
This implies that $F$ is the c.d.f. of $X$, since
$$\lambda\big( x : X(x) \le t \big) = \lambda\big( x : F(t) \ge x \big) = F(t).$$
This means that, to define the probability space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), F)$, we can start with $([0,1], \mathcal{B}([0,1]), \lambda)$ and let $F = \lambda \circ X^{-1}$ be the image of the Lebesgue measure under the quantile transformation, or the law of $X$ on $\mathbb{R}$. A related inverse property is left as an exercise below.
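The quantile transformation is also a practical recipe for simulation: applying it to a uniform variable on $[0,1]$ produces a sample with c.d.f. $F$. The following minimal sketch (an added illustration, not part of the notes; the exponential c.d.f. $F(s) = 1 - e^{-s}$, the seed and the sample size are arbitrary choices) checks this numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantile_exponential(x):
    # For F(s) = 1 - e^{-s}, X(x) = inf{ s : F(s) >= x } has the closed form -log(1 - x).
    return -np.log(1.0 - x)

u = rng.uniform(size=100_000)          # sample from the Lebesgue measure on [0, 1]
sample = quantile_exponential(u)       # its law should be exponential with rate 1

# Compare the empirical c.d.f. of the transformed sample with F at a few points.
for t in [0.5, 1.0, 2.0]:
    print(t, (sample <= t).mean(), 1 - np.exp(-t))
```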

Given a random element $X: (\Omega, \mathcal{A}) \to (S, \mathcal{B})$, the $\sigma$-algebra
$$\sigma(X) = \big\{ X^{-1}(B) \ \big|\ B \in \mathcal{B} \big\}$$
is called the $\sigma$-algebra generated by $X$. It is obvious that this collection of sets is, indeed, a $\sigma$-algebra.

Example. Consider a random variable $X$ on $([0,1], \mathcal{B}([0,1]), \lambda)$ defined by
$$X(x) = \begin{cases} 0, & 0 \le x \le 1/2, \\ 1, & 1/2 < x \le 1. \end{cases}$$
Then the $\sigma$-algebra generated by $X$ consists of the sets
$$\sigma(X) = \Big\{ \emptyset,\ \big[ 0, \tfrac12 \big],\ \big( \tfrac12, 1 \big],\ [0,1] \Big\},$$
and $\mathbb{P}(X = 0) = \mathbb{P}(X = 1) = 1/2$.

Lemma 4 Consider a probability space $(\Omega, \mathcal{A}, \mathbb{P})$, a measurable space $(S, \mathcal{B})$ and random elements $X: \Omega \to S$ and $Y: \Omega \to \mathbb{R}$. Then the following are equivalent:

1. $Y = g(X)$ for some (Borel) measurable function $g: S \to \mathbb{R}$,
2. $Y: \Omega \to \mathbb{R}$ is $\sigma(X)$-measurable.

It should be obvious from the proof that $\mathbb{R}$ can be replaced by any separable metric space.

Proof. The fact that 1 implies 2 is obvious, since for any Borel set $B \subseteq \mathbb{R}$ the set $B' = g^{-1}(B) \in \mathcal{B}$ and, therefore,
$$\big\{ Y = g(X) \in B \big\} = \big\{ X \in g^{-1}(B) = B' \big\} = X^{-1}(B') \in \sigma(X).$$
Let us show that 2 implies 1. For all integers $n$ and $k$, consider the sets
$$A_{n,k} = \Big\{ \omega : Y(\omega) \in \Big[ \frac{k}{2^n}, \frac{k+1}{2^n} \Big) \Big\} = Y^{-1}\Big( \Big[ \frac{k}{2^n}, \frac{k+1}{2^n} \Big) \Big).$$

By 2, $A_{n,k} \in \sigma(X) = \{ X^{-1}(B) \mid B \in \mathcal{B} \}$ and, therefore, $A_{n,k} = X^{-1}(B_{n,k})$ for some $B_{n,k} \in \mathcal{B}$. Let us consider the function
$$g_n(x) = \sum_{k \in \mathbb{Z}} \frac{k}{2^n}\, I(x \in B_{n,k}).$$
By construction, $|Y - g_n(X)| \le 2^{-n}$, since
$$Y(\omega) \in \Big[ \frac{k}{2^n}, \frac{k+1}{2^n} \Big) \iff X(\omega) \in B_{n,k} \Longrightarrow g_n(X(\omega)) = \frac{k}{2^n}.$$
It is easy to see that $g_n(x) \le g_{n+1}(x)$ and, therefore, the limit $g(x) = \lim_{n \to \infty} g_n(x)$ exists and is measurable as a limit of measurable functions. Clearly, $Y = g(X)$. $\square$

Independence. Consider a probability space $(\Omega, \mathcal{A}, \mathbb{P})$. Then $\sigma$-algebras $\mathcal{A}_i \subseteq \mathcal{A}$, $i \le n$, are independent if
$$\mathbb{P}(A_1 \cap \cdots \cap A_n) = \prod_{i \le n} \mathbb{P}(A_i)$$
for all $A_i \in \mathcal{A}_i$. Similarly, $\sigma$-algebras $\mathcal{A}_i \subseteq \mathcal{A}$ for $i \le n$ are pairwise independent if
$$\mathbb{P}(A_i \cap A_j) = \mathbb{P}(A_i)\, \mathbb{P}(A_j)$$
for all $A_i \in \mathcal{A}_i$, $A_j \in \mathcal{A}_j$, $i \ne j$. Random variables $X_i: (\Omega, \mathcal{A}) \to (S, \mathcal{B})$ for $i \le n$ are independent if the $\sigma$-algebras $\sigma(X_i)$ are independent, which is just another convenient way to state that
$$\mathbb{P}(X_1 \in B_1, \ldots, X_n \in B_n) = \mathbb{P}(X_1 \in B_1) \cdots \mathbb{P}(X_n \in B_n)$$
for any events $B_1, \ldots, B_n \in \mathcal{B}$. Pairwise independence is defined similarly.

Example. Consider a regular tetrahedron die (Figure 2.2) with a red side, a green side, a blue side and a red-green-blue base. If we roll this die then the colors provide an example of pairwise independent random variables that are not independent, since
$$\mathbb{P}(r) = \mathbb{P}(b) = \mathbb{P}(g) = \frac{1}{2} \quad \text{and} \quad \mathbb{P}(rb) = \mathbb{P}(rg) = \mathbb{P}(bg) = \frac{1}{4},$$
while
$$\mathbb{P}(rbg) = \frac{1}{4} \ne \mathbb{P}(r)\, \mathbb{P}(b)\, \mathbb{P}(g) = \Big( \frac{1}{2} \Big)^3.$$

[Figure 2.2: Pairwise independent, but not independent, random variables.]
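The tetrahedron example can be verified by direct enumeration of the four equally likely faces; the short sketch below (an added illustration, not part of the notes; the representation of faces as color sets is an ad hoc choice) does exactly that.

```python
# The four faces of the die: three single-color sides and the red-green-blue base.
faces = [{"r"}, {"g"}, {"b"}, {"r", "g", "b"}]

def prob(colors):
    # P(the face that lands down shows all colors in `colors`)
    return sum(1 for f in faces if set(colors) <= f) / len(faces)

print(prob("r"), prob("g"), prob("b"))                   # each 1/2
print(prob("rb"), prob("rg"), prob("bg"))                # each 1/4 = (1/2)(1/2): pairwise independent
print(prob("rgb"), prob("r") * prob("g") * prob("b"))    # 1/4 vs 1/8: not mutually independent
```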

First of all, independence can be checked on generating algebras.

Lemma 5 If algebras $\mathcal{A}_i$, $i \le n$, are independent then the $\sigma$-algebras $\sigma(\mathcal{A}_i)$ are independent.

Proof. Obvious by the Approximation Lemma 2. $\square$

A more flexible criterion follows from Dynkin's theorem.

Lemma 6 If collections of sets $\mathcal{C}_i$ for $i \le n$ are $\pi$-systems (closed under finite intersections) then their independence implies the independence of the $\sigma$-algebras $\sigma(\mathcal{C}_i)$ they generate.

Proof. Let us consider the collection $\mathcal{C}$ of sets $C \in \mathcal{A}$ such that
$$\mathbb{P}(C \cap C_2 \cap \cdots \cap C_n) = \mathbb{P}(C) \prod_{i=2}^{n} \mathbb{P}(C_i)$$
for all $C_i \in \mathcal{C}_i$ for $2 \le i \le n$. It is obvious that $\mathcal{C}$ is a $\lambda$-system, and it contains $\mathcal{C}_1$ by assumption. Since $\mathcal{C}_1$ is a $\pi$-system, by Dynkin's theorem, $\mathcal{C}$ contains $\sigma(\mathcal{C}_1)$. This means that we can replace $\mathcal{C}_1$ by $\sigma(\mathcal{C}_1)$ in the statement of the theorem and, similarly, we can continue to replace each $\mathcal{C}_i$ by $\sigma(\mathcal{C}_i)$. $\square$

Lemma 7 Consider random variables $X_i: \Omega \to \mathbb{R}$ on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$. (a) The random variables $(X_i)$ are independent if and only if, for all $t_i \in \mathbb{R}$,
$$\mathbb{P}(X_1 \le t_1, \ldots, X_n \le t_n) = \prod_{i=1}^{n} \mathbb{P}(X_i \le t_i). \qquad (2.0.1)$$
(b) If the laws of the $X_i$ have densities $f_i$ on $\mathbb{R}$ then these random variables are independent if and only if a joint density $f$ on $\mathbb{R}^n$ of the vector $(X_i)$ exists and
$$f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_i(x_i).$$

Proof. (a) This is obvious by Lemma 6, because the collection of sets $(-\infty, t]$ for $t \in \mathbb{R}$ is a $\pi$-system that generates the Borel $\sigma$-algebra on $\mathbb{R}$.

(b) Let us start with the "if" part. If we denote $X = (X_1, \ldots, X_n)$ then, for any $A_i \in \mathcal{B}(\mathbb{R})$,
$$\mathbb{P}\Big( \bigcap_{i=1}^{n} \{X_i \in A_i\} \Big) = \mathbb{P}(X \in A_1 \times \cdots \times A_n) = \int_{A_1 \times \cdots \times A_n} \prod_{i=1}^{n} f_i(x_i)\, dx_1 \cdots dx_n = \prod_{i=1}^{n} \int_{A_i} f_i(x_i)\, dx_i = \prod_{i=1}^{n} \mathbb{P}(X_i \in A_i).$$
Next, we prove the "only if" part. First of all, by independence,
$$\mathbb{P}(X \in A_1 \times \cdots \times A_n) = \prod_{i=1}^{n} \mathbb{P}(X_i \in A_i) = \int_{A_1 \times \cdots \times A_n} \prod_{i=1}^{n} f_i(x_i)\, dx_1 \cdots dx_n.$$
We would like to show that this implies that
$$\mathbb{P}(X \in A) = \int_{A} \prod_{i=1}^{n} f_i(x_i)\, dx_1 \cdots dx_n$$
for all $A$ in the Borel $\sigma$-algebra on $\mathbb{R}^n$, which would mean that the joint density exists and is equal to the product of the individual densities. One can prove the above equality for all $A \in \mathcal{B}(\mathbb{R}^n)$ by appealing to the Monotone Class Theorem from measure theory, or to the Caratheodory Extension Theorem 1, since the above equality can obviously be extended from the semi-algebra of measurable rectangles $A_1 \times \cdots \times A_n$ to the algebra of disjoint unions of measurable rectangles, which generates the Borel $\sigma$-algebra. However, we can also appeal to Dynkin's theorem, since the family $\mathcal{L}$ of sets $A$ that satisfy the above equality is a $\lambda$-system by the properties of measures and integrals, and it contains the $\pi$-system $\mathcal{P}$ of measurable rectangles $A_1 \times \cdots \times A_n$ that generates the Borel $\sigma$-algebra, $\mathcal{B}(\mathbb{R}^n) = \sigma(\mathcal{P})$. $\square$

More generally, a collection of $\sigma$-algebras $\mathcal{A}_t \subseteq \mathcal{A}$ indexed by $t \in T$ for some set $T$ is called independent if any finite subset of these $\sigma$-algebras is independent. Let $T = T_1 \cup \cdots \cup T_n$ be a partition of $T$ into disjoint sets. In this case, the following holds.

Lemma 8 (Grouping lemma) The $\sigma$-algebras
$$\mathcal{B}_i = \sigma\Big( \bigcup_{t \in T_i} \mathcal{A}_t \Big)$$
generated by the subsets of $\sigma$-algebras $(\mathcal{A}_t)_{t \in T_i}$ are independent.

Proof. For each $i \le n$, consider the collection of sets
$$\mathcal{C}_i = \Big\{ \bigcap_{t \in F} A_t \ \Big|\ \text{for all finite } F \subseteq T_i \text{ and } A_t \in \mathcal{A}_t \Big\}.$$
It is obvious that $\mathcal{B}_i = \sigma(\mathcal{C}_i)$ since $\mathcal{A}_t \subseteq \mathcal{C}_i$ for all $t \in T_i$, each $\mathcal{C}_i$ is a $\pi$-system, and $\mathcal{C}_1, \ldots, \mathcal{C}_n$ are independent by the definition of independence of the $\sigma$-algebras $\mathcal{A}_t$ for $t \in T$. Using Lemma 6 finishes the proof. (Of course, one should recognize from measure theory that $\mathcal{C}_i$ is a semi-algebra that generates $\mathcal{B}_i$.) $\square$

If we would like to construct finitely many independent random variables $(X_i)_{i \le n}$ with arbitrary distributions $(\mathbb{P}_i)_{i \le n}$ on $\mathbb{R}$, we can simply consider the space $\Omega = \mathbb{R}^n$ with the product measure
$$\mathbb{P}_1 \times \cdots \times \mathbb{P}_n$$
and define a random variable $X_i$ by $X_i(x_1, \ldots, x_n) = x_i$. The main result in the next section will imply that one can construct an infinite sequence of independent random variables with arbitrary distributions on the same probability space, and here we will give a sketch of another construction on the space $([0,1], \mathcal{B}([0,1]), \lambda)$. We will write $\mathbb{P} = \lambda$ to emphasize that we think of the Lebesgue measure as a probability.

Step 1. If we write the dyadic decomposition of $x \in [0,1]$,
$$x = \sum_{n \ge 1} 2^{-n} \varepsilon_n(x),$$
then it is easy to see that $(\varepsilon_n)_{n \ge 1}$ are independent random variables with the distribution $\mathbb{P}(\varepsilon_n = 0) = \mathbb{P}(\varepsilon_n = 1) = 1/2$, since for any $n \ge 1$ and any $a_i \in \{0,1\}$,
$$\mathbb{P}\big( x : \varepsilon_1(x) = a_1, \ldots, \varepsilon_n(x) = a_n \big) = 2^{-n},$$
since fixing the first $n$ coefficients in the dyadic expansion places $x$ into an interval of length $2^{-n}$.

Step 2. Let us consider injections $k_m: \mathbb{N} \to \mathbb{N}$ for $m \ge 1$ such that their ranges $k_m(\mathbb{N})$ are all disjoint, and let us define
$$X_m = X_m(x) = \sum_{n \ge 1} 2^{-n} \varepsilon_{k_m(n)}(x).$$
It is an easy exercise to check that each $X_m$ is well defined and has the uniform distribution on $[0,1]$, which can be seen by looking at the dyadic intervals first. Moreover, by the Grouping Lemma above, the random variables $(X_m)_{m \ge 1}$ are all independent since they are defined in terms of disjoint groups of independent random variables.
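Steps 1 and 2 are easy to check numerically. The following sketch (an added illustration, not part of the notes; truncating the expansion at 48 digits and splitting them into three interleaved groups are arbitrary choices) takes one uniform sample, extracts its binary digits, and reassembles disjoint groups of digits into several streams that behave like independent uniform variables.

```python
import numpy as np

rng = np.random.default_rng(1)
n_digits, n_streams, n_samples = 48, 3, 200_000

x = rng.uniform(size=n_samples)
# eps[:, n-1] is the n-th binary digit of x: floor(x * 2^n) mod 2
eps = np.floor(x[:, None] * 2.0 ** np.arange(1, n_digits + 1)) % 2

X = []
for m in range(n_streams):
    digits = eps[:, m::n_streams]                      # disjoint group of digit positions
    weights = 2.0 ** -np.arange(1, digits.shape[1] + 1)
    X.append(digits @ weights)                         # X_m = sum_n 2^{-n} eps_{k_m(n)}
X = np.stack(X, axis=1)

print(X.mean(axis=0))                 # each close to 0.5: approximately uniform on [0, 1]
print(np.corrcoef(X, rowvar=False))   # off-diagonal entries near 0: the streams look independent
```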

Step 3. Given a sequence of probability distributions $(\mathbb{P}_m)_{m \ge 1}$ on $\mathbb{R}$, let $(F_m)_{m \ge 1}$ be the sequence of the corresponding c.d.f.s and let $(Q_m)_{m \ge 1}$ be their quantile transforms. We have seen above that each $Y_m = Q_m(X_m)$ has the distribution $\mathbb{P}_m$ on $\mathbb{R}$, and they are obviously independent of each other. Therefore, we constructed a sequence of independent random variables $Y_m$ on the space $([0,1], \mathcal{B}([0,1]), \lambda)$ with arbitrary distributions $\mathbb{P}_m$. $\square$

Expectation. If $X: \Omega \to \mathbb{R}$ is a random variable on $(\Omega, \mathcal{A}, \mathbb{P})$ then the expectation of $X$ is defined as
$$\mathbb{E}X = \int_{\Omega} X(\omega)\, d\mathbb{P}(\omega).$$
In other words, expectation is just another term for the integral with respect to a probability measure and, as a result, expectation has all the usual properties of integrals in measure theory: convergence theorems, the change of variables formula, Fubini's theorem, etc. Let us write down some special cases of the change of variables formula.

Lemma 9 (1) If $F$ is the c.d.f. (and the law) of $X$ on $\mathbb{R}$ then, for any measurable function $g: \mathbb{R} \to \mathbb{R}$,
$$\mathbb{E}g(X) = \int_{\mathbb{R}} g(x)\, dF(x).$$
(2) If the distribution of $X$ is discrete, i.e. $\mathbb{P}(X \in \{x_i\}_{i \ge 1}) = 1$, then
$$\mathbb{E}g(X) = \sum_{i \ge 1} g(x_i)\, \mathbb{P}(X = x_i).$$
(3) If the distribution of $X: \Omega \to \mathbb{R}^n$ on $\mathbb{R}^n$ has the density function $f(x)$ then, for any measurable function $g: \mathbb{R}^n \to \mathbb{R}$,
$$\mathbb{E}g(X) = \int_{\mathbb{R}^n} g(x)\, f(x)\, dx.$$

Proof. All these properties follow by making the change of variables $x = X(\omega)$,
$$\mathbb{E}g(X) = \int_{\Omega} g(X(\omega))\, d\mathbb{P}(\omega) = \int g(x)\, d\mathbb{P} \circ X^{-1}(x) = \int g(x)\, d\mathbb{P}_X(x),$$
where $\mathbb{P}_X = \mathbb{P} \circ X^{-1}$ is the law of $X$ on $\mathbb{R}$ or $\mathbb{R}^n$. $\square$

Another simple fact is the following.

Lemma 10 If $X, Y: \Omega \to \mathbb{R}$ are independent and $\mathbb{E}|X|, \mathbb{E}|Y| < \infty$ then $\mathbb{E}XY = \mathbb{E}X\, \mathbb{E}Y$.

Proof. Independence implies that the distribution of $(X, Y)$ on $\mathbb{R}^2$ is the product measure $\mathbb{P} \times Q$, where $\mathbb{P}$ and $Q$ are the distributions of $X$ and $Y$ on $\mathbb{R}$, and, therefore,
$$\mathbb{E}XY = \int_{\mathbb{R}^2} xy\, d(\mathbb{P} \times Q)(x, y) = \int_{\mathbb{R}} x\, d\mathbb{P}(x) \int_{\mathbb{R}} y\, dQ(y) = \mathbb{E}X\, \mathbb{E}Y,$$
by the change of variables and Fubini theorems. $\square$

Exercise. If a random variable $X$ has continuous c.d.f. $F(t)$, show that $F(X)$ is uniform on $[0,1]$, i.e. the law of $F(X)$ is the Lebesgue measure on $[0,1]$.

Exercise. If $F$ is a continuous distribution function, show that $\int F(x)\, dF(x) = 1/2$.

Exercise. $\mathrm{ch}(\lambda)$ is the moment generating function of a random variable $X$ with distribution $\mathbb{P}(X = \pm 1) = 1/2$, since
$$\mathbb{E}e^{\lambda X} = \frac{e^{\lambda} + e^{-\lambda}}{2} = \mathrm{ch}(\lambda).$$
Does there exist a (bounded) random variable $X$ such that $\mathbb{E}e^{\lambda X} = \mathrm{ch}^m(\lambda)$ for $0 < m < 1$? (Hint: compute several derivatives at zero.)

Exercise. Consider a measurable function $f: X \times Y \to \mathbb{R}$ and a product probability measure $\mathbb{P} \times Q$ on $X \times Y$. For $0 < p \le q$, prove that
$$\big\| \| f \|_{L^p(\mathbb{P})} \big\|_{L^q(Q)} \le \big\| \| f \|_{L^q(Q)} \big\|_{L^p(\mathbb{P})},$$
i.e.
$$\bigg( \int \Big( \int |f(x,y)|^p\, d\mathbb{P}(x) \Big)^{q/p} dQ(y) \bigg)^{1/q} \le \bigg( \int \Big( \int |f(x,y)|^q\, dQ(y) \Big)^{p/q} d\mathbb{P}(x) \bigg)^{1/p}.$$
Assume that both sides are well-defined, for example, that $f$ is bounded.

Exercise. If the event $A \in \sigma(\mathcal{P})$ is independent of the $\pi$-system $\mathcal{P}$ then $\mathbb{P}(A) = 0$ or $1$.

Exercise. Suppose $X$ is a random variable and $g: \mathbb{R} \to \mathbb{R}$ is measurable. Prove that if $X$ and $g(X)$ are independent then $\mathbb{P}(g(X) = c) = 1$ for some constant $c$.

Exercise. Suppose that $(e_m)_{1 \le m \le n}$ are i.i.d. exponential random variables with the parameter $\lambda > 0$, and let
$$e_{(1)} \le e_{(2)} \le \ldots \le e_{(n)}$$
be the order statistics (the random variables arranged in increasing order). Prove that the spacings
$$e_{(1)},\ e_{(2)} - e_{(1)},\ \ldots,\ e_{(n)} - e_{(n-1)}$$
are independent exponential random variables, and $e_{(k+1)} - e_{(k)}$ has the parameter $(n-k)\lambda$.

Exercise. Suppose that $(e_n)_{n \ge 1}$ are i.i.d. exponential random variables with the parameter $\lambda = 1$. Let $S_n = e_1 + \ldots + e_n$ and $R_n = S_{n+1}/S_n$ for $n \ge 1$. Prove that $(R_n)_{n \ge 1}$ are independent and $R_n$ has density $n x^{-(n+1)} I(x \ge 1)$. Hint: let $R_0 = e_1$ and compute the joint density of $(R_0, R_1, \ldots, R_n)$ first.

Exercise. Let $N$ be a Poisson random variable with the mean $\lambda$, i.e. $\mathbb{P}(N = j) = \lambda^j e^{-\lambda}/j!$ for integer $j \ge 0$. Then, consider $N$ i.i.d. random variables, independent of $N$, taking values $1, \ldots, k$ with probabilities $p_1, \ldots, p_k$. Let $N_j$ be the number of these random variables taking value $j$, so that $N_1 + \ldots + N_k = N$. Prove that $N_1, \ldots, N_k$ are independent Poisson random variables with means $\lambda p_1, \ldots, \lambda p_k$.

Additional exercise. Suppose that a measurable subset $P \subseteq [0,1]$ and the interval $I = [a,b] \subseteq [0,1]$ are such that $\lambda(P) = \lambda(I)$, where $\lambda$ is the Lebesgue measure on $[0,1]$. Show that there exists a measure-preserving transformation $T: [0,1] \to [0,1]$, i.e. $\lambda \circ T^{-1} = \lambda$, such that $T(I) \subseteq P$ and $T$ is one-to-one (injective) outside a set of measure zero. (Additional exercises never need to be turned in.)

Section 3

Kolmogorov's consistency theorem.

In this section we will describe a typical way to construct an infinite family of random variables $(X_t)_{t \in T}$ on the same probability space, namely, when we are given all their finite dimensional marginals. This means that for any finite subset $N \subseteq T$, we are given
$$\mathbb{P}_N(B) = \mathbb{P}\big( (X_t)_{t \in N} \in B \big)$$
for all $B$ in the Borel $\sigma$-algebra $\mathcal{B}_N = \mathcal{B}(\mathbb{R}^N)$. Clearly, these laws must satisfy a natural consistency condition,
$$\mathbb{P}_N(B) = \mathbb{P}_M\big( B \times \mathbb{R}^{M \setminus N} \big), \qquad (3.0.1)$$
for any finite subsets $N \subseteq M$ and any Borel set $B \in \mathcal{B}_N$. (Of course, to be careful, we should define these probabilities for ordered subsets and also make sure they are consistent under rearrangements, but the notation for unordered sets is clear and should not cause any confusion.)

Our goal is to construct a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ and random variables $X_t: \Omega \to \mathbb{R}$ that have $(\mathbb{P}_N)$ as their finite dimensional distributions. We take
$$\Omega = \mathbb{R}^T = \{ \omega \mid \omega: T \to \mathbb{R} \}$$
to be the space of all real-valued functions on $T$, and let $X_t$ be the coordinate projection
$$X_t = X_t(\omega) = \omega(t).$$
For the coordinate projections to be measurable, the following collection of events,
$$\mathcal{A}_0 = \big\{ B \times \mathbb{R}^{T \setminus N} \ \big|\ B \in \mathcal{B}_N \big\},$$
must be contained in the $\sigma$-algebra $\mathcal{A}$. It is easy to see that $\mathcal{A}_0$ is, in fact, an algebra of sets, and it is called the cylindrical algebra on $\mathbb{R}^T$. We will then take $\mathcal{A} = \sigma(\mathcal{A}_0)$ to be the smallest $\sigma$-algebra on which all coordinate projections are measurable. This is the so-called cylindrical $\sigma$-algebra on $\mathbb{R}^T$. A set $B \times \mathbb{R}^{T \setminus N}$ is called a cylinder. As we already agreed, the probability $\mathbb{P}$ on the sets in the algebra $\mathcal{A}_0$ is given by
$$\mathbb{P}\big( B \times \mathbb{R}^{T \setminus N} \big) = \mathbb{P}_N(B).$$
Given two finite subsets $N \subseteq M \subseteq T$ and $B \in \mathcal{B}_N$, the same set can be represented as two different cylinders,
$$B \times \mathbb{R}^{T \setminus N} = B \times \mathbb{R}^{M \setminus N} \times \mathbb{R}^{T \setminus M}.$$
However, by the consistency condition, the definition of $\mathbb{P}$ will not depend on the choice of the representation. To finish the construction, we need to show that $\mathbb{P}$ can be extended from the algebra $\mathcal{A}_0$ to a probability measure on the $\sigma$-algebra $\mathcal{A}$. By the Caratheodory Extension Theorem 1, we only need to show that the following holds.

Theorem 6 $\mathbb{P}$ is countably additive on the cylindrical algebra $\mathcal{A}_0$.

Proof. Equivalently, it is enough to show that $\mathbb{P}$ satisfies the continuity of measure property on $\mathcal{A}_0$, namely, given a sequence $B_n \in \mathcal{A}_0$,
$$B_n \supseteq B_{n+1},\quad \bigcap_{n \ge 1} B_n = \emptyset \Longrightarrow \lim_{n \to \infty} \mathbb{P}(B_n) = 0.$$
We will prove that if there exists $\delta > 0$ such that $\mathbb{P}(B_n) > \delta$ for all $n$ then $\bigcap_{n \ge 1} B_n \ne \emptyset$. We can represent the cylinders $B_n$ as
$$B_n = C_n \times \mathbb{R}^{T \setminus N_n}$$
for some finite subset $N_n \subseteq T$ and $C_n \in \mathcal{B}_{N_n}$. Since $B_n \supseteq B_{n+1}$, we can assume that $N_n \subseteq N_{n+1}$. First of all, by the regularity of probability on a finite dimensional space, there exists a compact set $K_n \subseteq C_n$ such that
$$\mathbb{P}_{N_n}\big( C_n \setminus K_n \big) \le \frac{\delta}{2^{n+1}}.$$
It is easy to see that
$$\bigcap_{i \le n} C_i \times \mathbb{R}^{T \setminus N_i} \ \setminus\ \bigcap_{i \le n} K_i \times \mathbb{R}^{T \setminus N_i} \subseteq \bigcup_{i \le n} (C_i \setminus K_i) \times \mathbb{R}^{T \setminus N_i}$$
and, therefore,
$$\mathbb{P}\Big( \bigcap_{i \le n} C_i \times \mathbb{R}^{T \setminus N_i} \ \setminus\ \bigcap_{i \le n} K_i \times \mathbb{R}^{T \setminus N_i} \Big) \le \mathbb{P}\Big( \bigcup_{i \le n} (C_i \setminus K_i) \times \mathbb{R}^{T \setminus N_i} \Big) \le \sum_{i \le n} \mathbb{P}\big( (C_i \setminus K_i) \times \mathbb{R}^{T \setminus N_i} \big) \le \sum_{i \le n} \frac{\delta}{2^{i+1}} \le \frac{\delta}{2}.$$
Since we assumed that
$$\mathbb{P}(B_n) = \mathbb{P}\Big( \bigcap_{i \le n} C_i \times \mathbb{R}^{T \setminus N_i} \Big) > \delta,$$
this implies that
$$\mathbb{P}\Big( \bigcap_{i \le n} K_i \times \mathbb{R}^{T \setminus N_i} \Big) \ge \frac{\delta}{2} > 0.$$
Let us rewrite this intersection as
$$\bigcap_{i \le n} K_i \times \mathbb{R}^{T \setminus N_i} = \bigcap_{i \le n} \big( K_i \times \mathbb{R}^{N_n \setminus N_i} \big) \times \mathbb{R}^{T \setminus N_n} = \widetilde{K}_n \times \mathbb{R}^{T \setminus N_n},$$
where
$$\widetilde{K}_n = \bigcap_{i \le n} \big( K_i \times \mathbb{R}^{N_n \setminus N_i} \big)$$
is compact in $\mathbb{R}^{N_n}$, since each factor is closed and $K_n$ is compact in $\mathbb{R}^{N_n}$. We proved that
$$\mathbb{P}_{N_n}(\widetilde{K}_n) = \mathbb{P}\big( \widetilde{K}_n \times \mathbb{R}^{T \setminus N_n} \big) = \mathbb{P}\Big( \bigcap_{i \le n} K_i \times \mathbb{R}^{T \setminus N_i} \Big) > 0$$
and, therefore, there exists a point $\omega^n = (\omega^n(t))_{t \in N_n} \in \widetilde{K}_n$. By construction, we also have the following inclusion property. For $m > n$,
$$\omega^m \in \widetilde{K}_m \subseteq \widetilde{K}_n \times \mathbb{R}^{N_m \setminus N_n}$$
and, therefore, $(\omega^m(t))_{t \in N_n} \in \widetilde{K}_n$. Any sequence on a compact has a converging subsequence. Let $(n^1_k)_{k \ge 1}$ be such that
$$\big( \omega^{n^1_k}(t) \big)_{t \in N_1} \to \big( \omega(t) \big)_{t \in N_1} \in \widetilde{K}_1$$
as $k \to \infty$. Then we can take a subsequence $(n^2_k)_{k \ge 1}$ of the sequence $(n^1_k)_{k \ge 1}$ such that
$$\big( \omega^{n^2_k}(t) \big)_{t \in N_2} \to \big( \omega(t) \big)_{t \in N_2} \in \widetilde{K}_2.$$

Notice that the values of $\omega(t)$ must agree on $N_1$. Iteratively, we can find a subsequence $(n^m_k)_{k \ge 1}$ of $(n^{m-1}_k)_{k \ge 1}$ such that
$$\big( \omega^{n^m_k}(t) \big)_{t \in N_m} \to \big( \omega(t) \big)_{t \in N_m} \in \widetilde{K}_m.$$
This proves the existence of a point
$$\omega \in \bigcap_{n \ge 1} \widetilde{K}_n \times \mathbb{R}^{T \setminus N_n} \subseteq \bigcap_{n \ge 1} B_n,$$
so this last set is not empty. This finishes the proof. $\square$

This result gives us another way to construct a sequence $(X_n)_{n \ge 1}$ of independent random variables with given distributions $(\mathbb{P}_n)_{n \ge 1}$, as the coordinate projections on the infinite product space $\mathbb{R}^{\mathbb{N}}$ with the cylindrical (product) $\sigma$-algebra $\mathcal{A} = \mathcal{B}^{\mathbb{N}}$ and the infinite product measure $\mathbb{P} = \prod_{n \ge 1} \mathbb{P}_n$.

Exercise. Does the set $C([0,1], \mathbb{R})$ of continuous functions on $[0,1]$ belong to the cylindrical $\sigma$-algebra $\mathcal{A}$ on $\mathbb{R}^{[0,1]}$? Hint: an exercise in Section 1 might be helpful.

Section 4

Inequalities for sums of independent random variables.

When we toss an unbiased coin many times, we expect the proportion of heads or tails to be close to a half, a phenomenon called the law of large numbers. More generally, if $(X_n)_{n \ge 1}$ is a sequence of independent identically distributed (i.i.d.) random variables, we expect that their average
$$\bar{X}_n = \frac{S_n}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
is, in some sense, close to the expectation $\mu = \mathbb{E}X_1$, assuming that it exists. In the next section, we will prove a general qualitative result of this type, but we begin with more quantitative statements in special cases. Let us begin with the case of i.i.d. Rademacher random variables $(\varepsilon_n)_{n \ge 1}$ such that
$$\mathbb{P}(\varepsilon_n = 1) = \mathbb{P}(\varepsilon_n = -1) = \frac{1}{2},$$
which is, basically, equivalent to tossing a coin with heads and tails replaced by $\pm 1$. We will need the following.

Lemma 11 (Chebyshev's inequality) If a random variable $X \ge 0$ then, for $t > 0$,
$$\mathbb{P}(X \ge t) \le \frac{\mathbb{E}X}{t}.$$

Proof. This follows from a simple sequence of inequalities,
$$\mathbb{E}X = \mathbb{E}X I(X < t) + \mathbb{E}X I(X \ge t) \ge \mathbb{E}X I(X \ge t) \ge t\, \mathbb{E}I(X \ge t) = t\, \mathbb{P}(X \ge t),$$
where we used that $X \ge 0$. $\square$

This inequality is often used in the exponential form,
$$\mathbb{P}(X \ge t) \le e^{-\lambda t}\, \mathbb{E}e^{\lambda X} \quad \text{for } \lambda \ge 0,$$
which is obvious if we rewrite $\{X \ge t\} = \{e^{\lambda X} \ge e^{\lambda t}\}$. We will use it to prove the following.

Theorem 7 (Hoeffding's inequality) For any $a_1, \ldots, a_n \in \mathbb{R}$ and $t \ge 0$,
$$\mathbb{P}\Big( \sum_{i=1}^{n} \varepsilon_i a_i \ge t \Big) \le \exp\Big( - \frac{t^2}{2 \sum_{i=1}^{n} a_i^2} \Big).$$

Proof. We begin by writing, for $\lambda \ge 0$,
$$\mathbb{P}\Big( \sum_{i=1}^{n} \varepsilon_i a_i \ge t \Big) \le e^{-\lambda t}\, \mathbb{E}\exp\Big( \lambda \sum_{i=1}^{n} \varepsilon_i a_i \Big) = e^{-\lambda t} \prod_{i=1}^{n} \mathbb{E}\exp(\lambda \varepsilon_i a_i),$$
where in the last step we used Lemma 10. One can easily check the inequality
$$\frac{e^{x} + e^{-x}}{2} \le e^{x^2/2},$$
for example, using the Taylor expansions, and, therefore,
$$\mathbb{E}\exp(\lambda \varepsilon_i a_i) = \frac{1}{2} e^{\lambda a_i} + \frac{1}{2} e^{-\lambda a_i} \le e^{\lambda^2 a_i^2 / 2}$$
and
$$\mathbb{P}\Big( \sum_{i=1}^{n} \varepsilon_i a_i \ge t \Big) \le \exp\Big( - \lambda t + \frac{\lambda^2}{2} \sum_{i=1}^{n} a_i^2 \Big).$$
Optimizing over $\lambda \ge 0$ finishes the proof. $\square$

We can also apply the same inequality to $(-\varepsilon_i)$ and, combining both cases,
$$\mathbb{P}\Big( \Big| \sum_{i=1}^{n} \varepsilon_i a_i \Big| \ge t \Big) \le 2 \exp\Big( - \frac{t^2}{2 \sum_{i=1}^{n} a_i^2} \Big).$$
This implies, for example, that
$$\mathbb{P}\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \Big| \ge t \Big) \le 2 \exp\Big( - \frac{n t^2}{2} \Big).$$

This shows that, no matter how small $t > 0$ is, the probability that the average deviates from the expectation $\mathbb{E}\varepsilon_1 = 0$ by more than $t$ decreases exponentially fast with $n$. Let us now consider a more general case. For $p, q \in [0,1]$, we consider the function
$$D(p, q) = p \log\frac{p}{q} + (1 - p) \log\frac{1 - p}{1 - q},$$
which is called the Kullback-Leibler divergence. To see that $D(p,q) \ge 0$, with equality only if $p = q$, just use that $\log x \le x - 1$ with equality only if $x = 1$.

Theorem 8 (Hoeffding-Chernoff inequality) Suppose that $0 \le X_i \le 1$ and $\mu = \mathbb{E}X$. Then
$$\mathbb{P}\Big( \frac{1}{n} \sum_{i=1}^{n} X_i \ge \mu + t \Big) \le e^{-n D(\mu + t,\, \mu)}$$
for any $t \ge 0$ such that $\mu + t \le 1$.

Proof. Notice that the probability is zero when $\mu + t > 1$, since the average cannot exceed $1$, so $\mu + t \le 1$ is not really a constraint here. Using the convexity of the exponential function, we can write for $x \in [0,1]$ that
$$e^{\lambda x} = e^{\lambda (x \cdot 1 + (1 - x) \cdot 0)} \le x e^{\lambda} + (1 - x) e^{\lambda \cdot 0} = 1 - x + x e^{\lambda},$$
which implies that
$$\mathbb{E}e^{\lambda X} \le 1 - \mathbb{E}X + \mathbb{E}X\, e^{\lambda} = 1 - \mu + \mu e^{\lambda}.$$
Using this, we get the following bound for any $\lambda \ge 0$,
$$\mathbb{P}\Big( \sum_{i=1}^{n} X_i \ge n(\mu + t) \Big) \le e^{-\lambda n (\mu + t)}\, \mathbb{E}e^{\lambda \sum X_i} = e^{-\lambda n (\mu + t)} \big( \mathbb{E}e^{\lambda X} \big)^n \le e^{-\lambda n (\mu + t)} \big( 1 - \mu + \mu e^{\lambda} \big)^n.$$
The derivative of the right hand side in $\lambda$ is equal to zero at:
$$-n(\mu + t)\, e^{-\lambda n (\mu + t)} \big( 1 - \mu + \mu e^{\lambda} \big)^n + n \big( 1 - \mu + \mu e^{\lambda} \big)^{n-1} \mu e^{\lambda}\, e^{-\lambda n (\mu + t)} = 0,$$
$$-(\mu + t)\big( 1 - \mu + \mu e^{\lambda} \big) + \mu e^{\lambda} = 0,$$
$$e^{\lambda} = \frac{(1 - \mu)(\mu + t)}{\mu (1 - \mu - t)} \ge 1,$$
so the critical point $\lambda \ge 0$, as required. Substituting this back into the bound,
$$\mathbb{P}\Big( \sum_{i=1}^{n} X_i \ge n(\mu + t) \Big) \le \Bigg( \Big( \frac{\mu (1 - \mu - t)}{(1 - \mu)(\mu + t)} \Big)^{\mu + t} \Big( 1 - \mu + \frac{(1 - \mu)(\mu + t)}{1 - \mu - t} \Big) \Bigg)^n = \Bigg( \Big( \frac{\mu}{\mu + t} \Big)^{\mu + t} \Big( \frac{1 - \mu}{1 - \mu - t} \Big)^{1 - \mu - t} \Bigg)^n = \exp\bigg( -n \Big( (\mu + t) \log\frac{\mu + t}{\mu} + (1 - \mu - t) \log\frac{1 - \mu - t}{1 - \mu} \Big) \bigg),$$
which finishes the proof. $\square$

We can also apply this bound to $Z_i = 1 - X_i$, with the mean $\mu_Z = 1 - \mu$, to get
$$\mathbb{P}\Big( \frac{1}{n} \sum_{i=1}^{n} X_i \le \mu - t \Big) = \mathbb{P}\Big( \frac{1}{n} \sum_{i=1}^{n} Z_i \ge \mu_Z + t \Big) \le e^{-n D(\mu_Z + t,\, \mu_Z)} = e^{-n D(1 - \mu + t,\, 1 - \mu)}.$$
These inequalities show that, no matter how small $t > 0$ is, the probability that the average $\bar{X}_n$ deviates from the expectation by more than $t$ in either direction decreases exponentially fast with $n$. Of course, the same conclusion applies to any bounded random variables, $|X_i| \le M$, by shifting and rescaling the interval $[-M, M]$ into the interval $[0,1]$.

Even though the Hoeffding-Chernoff inequality applies to all bounded random variables, in real-world applications in engineering, computer science, etc., one would like to improve the control of the probability by incorporating other measures of closeness of the random variable $X$ to the mean $\mu$, for example, the variance
$$\sigma^2 = \mathrm{Var}(X) = \mathbb{E}(X - \mu)^2 = \mathbb{E}X^2 - (\mathbb{E}X)^2.$$
The following inequality is classical. Let us denote
$$\varphi(x) = (1 + x)\log(1 + x) - x.$$
We will center the random variables $(X_n)$ and instead work with $Z_n = X_n - \mu$.

Theorem 9 (Bennett's inequality) Let us consider i.i.d. $(Z_n)_{n \ge 1}$ such that $\mathbb{E}Z = 0$, $\mathbb{E}Z^2 = \sigma^2$ and $|Z| \le M$. Then, for $t \ge 0$,
$$\mathbb{P}\Big( \sum_{i=1}^{n} Z_i \ge nt \Big) \le \exp\Big( -n \frac{\sigma^2}{M^2}\, \varphi\Big( \frac{tM}{\sigma^2} \Big) \Big).$$

Proof.

Using the Taylor series expansion and the fact that $\mathbb{E}Z = 0$, $\mathbb{E}Z^2 = \sigma^2$ and $|Z| \le M$, we can write
$$\mathbb{E}e^{\lambda Z} = \mathbb{E}\sum_{k=0}^{\infty} \frac{(\lambda Z)^k}{k!} = \sum_{k=0}^{\infty} \frac{\lambda^k\, \mathbb{E}Z^k}{k!} = 1 + \sum_{k=2}^{\infty} \frac{\lambda^k}{k!} \mathbb{E}Z^2 Z^{k-2} \le 1 + \sum_{k=2}^{\infty} \frac{\lambda^k}{k!} M^{k-2} \sigma^2 = 1 + \frac{\sigma^2}{M^2} \sum_{k=2}^{\infty} \frac{\lambda^k M^k}{k!} = 1 + \frac{\sigma^2}{M^2} \big( e^{\lambda M} - 1 - \lambda M \big) \le \exp\Big( \frac{\sigma^2}{M^2} \big( e^{\lambda M} - 1 - \lambda M \big) \Big),$$
where in the last inequality we used $1 + x \le e^x$. Therefore,
$$\mathbb{P}\Big( \sum_{i=1}^{n} Z_i \ge nt \Big) \le \exp\Big( n \Big( -\lambda t + \frac{\sigma^2}{M^2} \big( e^{\lambda M} - 1 - \lambda M \big) \Big) \Big).$$
Now, we optimize this bound over $\lambda \ge 0$. We find the critical point:
$$-t + \frac{\sigma^2}{M^2} \big( M e^{\lambda M} - M \big) = 0, \qquad e^{\lambda M} = \frac{tM}{\sigma^2} + 1, \qquad \lambda = \frac{1}{M} \log\Big( 1 + \frac{tM}{\sigma^2} \Big).$$
Since this $\lambda \ge 0$, plugging it into the above bound,
$$\mathbb{P}\Big( \sum_{i=1}^{n} Z_i \ge nt \Big) \le \exp\bigg( -n \Big( \frac{t}{M} \log\Big( 1 + \frac{tM}{\sigma^2} \Big) - \frac{\sigma^2}{M^2} \Big( \frac{tM}{\sigma^2} - \log\Big( 1 + \frac{tM}{\sigma^2} \Big) \Big) \Big) \bigg) = \exp\bigg( -n \frac{\sigma^2}{M^2} \Big( \Big( 1 + \frac{tM}{\sigma^2} \Big) \log\Big( 1 + \frac{tM}{\sigma^2} \Big) - \frac{tM}{\sigma^2} \Big) \bigg) = \exp\Big( -n \frac{\sigma^2}{M^2}\, \varphi\Big( \frac{tM}{\sigma^2} \Big) \Big),$$
which finishes the proof. $\square$

To simplify the bound in Bennett's inequality, one can notice that (we leave it as an exercise)
$$\varphi(x) \ge \frac{x^2}{2(1 + x/3)},$$
which implies that
$$\mathbb{P}\Big( \frac{1}{n} \sum_{i=1}^{n} Z_i \ge t \Big) \le \exp\Big( - \frac{n t^2}{2(\sigma^2 + tM/3)} \Big).$$
Combining with the same inequality for $(-Z_i)$, and recalling that $Z_i = X_i - \mu$, we obtain another classical inequality, Bernstein's inequality:
$$\mathbb{P}\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \Big| \ge t \Big) \le 2 \exp\Big( - \frac{n t^2}{2(\sigma^2 + tM/3)} \Big).$$
For small $t$, the denominator is of order $2\sigma^2$, and we get better control of the probability when the variance is small.
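To see how much the variance term can help, here is a small numerical comparison (an added sketch, not part of the notes). It evaluates the Bernstein bound above against the bound $2\exp(-nt^2/(2M^2))$, which is what a Hoeffding-type argument gives for variables bounded by $M$ after centering (that bound is not derived in these notes and is used here only as an assumed reference point); the values of $M$, $\sigma^2$, $n$ and $t$ are arbitrary choices.

```python
import numpy as np

M, sigma2, n = 1.0, 0.01, 1000          # assumed range bound, variance and sample size
t = np.array([0.01, 0.02, 0.05, 0.1])

hoeffding_type = 2 * np.exp(-n * t ** 2 / (2 * M ** 2))              # uses only the range
bernstein = 2 * np.exp(-n * t ** 2 / (2 * (sigma2 + t * M / 3)))     # also uses the variance
for ti, h, b in zip(t, hoeffding_type, bernstein):
    print(f"t={ti:4.2f}  range-only bound={h:.3e}  Bernstein bound={b:.3e}")
```

For small variance the Bernstein bound is dramatically smaller at the same deviation level, which is exactly the point made in the paragraph above.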

Azuma's inequality. So far we have considered sums of independent random variables. As a digression, we will now give one example of a concentration inequality for general functions $f = f(X_1, \ldots, X_n)$ of $n$ independent random variables. We do not assume that these random variables are identically distributed, but we will assume the following stability condition on the function $f$:
$$\big| f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n) \big| \le a_i$$
for some constants $a_1, \ldots, a_n$. This means that modifying the $i$th coordinate of $f$ can change its value by not more than $a_i$. Let us begin with the following observation.

Lemma 12 If a random variable $X$ satisfies $|X| \le 1$ and $\mathbb{E}X = 0$ then, for any $\lambda \ge 0$,
$$\mathbb{E}e^{\lambda X} \le e^{\lambda^2 / 2}.$$

Proof. Let us write $\lambda X$ as a convex combination
$$\lambda X = \frac{1 + X}{2}\, \lambda + \frac{1 - X}{2}\, (-\lambda),$$
where $(1+X)/2, (1-X)/2 \in [0,1]$ and their sum is equal to one. By the convexity of the exponential function,
$$e^{\lambda X} \le \frac{1 + X}{2}\, e^{\lambda} + \frac{1 - X}{2}\, e^{-\lambda},$$
and taking expectations we get $\mathbb{E}e^{\lambda X} \le \mathrm{ch}(\lambda) \le e^{\lambda^2 / 2}$. $\square$

Using this, we will now prove the following analogue of Hoeffding's inequality.

Theorem 10 (Azuma's inequality) Under the above stability condition, for any $t \ge 0$,
$$\mathbb{P}\big( f - \mathbb{E}f \ge t \big) \le \exp\Big( - \frac{t^2}{2 \sum_{i=1}^{n} a_i^2} \Big).$$

Proof. For $i = 1, \ldots, n$, let $\mathbb{E}_i$ denote the expectation in $X_{i+1}, \ldots, X_n$ with the random variables $X_1, \ldots, X_i$ fixed. One can think of $(X_1, \ldots, X_n)$ as defined on a product space with the product measure, and $\mathbb{E}_i$ denotes the integration over the last $n - i$ coordinates. Let us denote $Y_i = \mathbb{E}_i f - \mathbb{E}_{i-1} f$ and note that $\mathbb{E}_n f = f$ and $\mathbb{E}_0 f = \mathbb{E}f$. Then we can write $f - \mathbb{E}f = \sum_{i=1}^{n} Y_i$ (this is called the martingale-difference representation) and, as before, for $\lambda \ge 0$,
$$\mathbb{P}\big( f - \mathbb{E}f \ge t \big) = \mathbb{P}\Big( \sum_{i=1}^{n} Y_i \ge t \Big) \le e^{-\lambda t}\, \mathbb{E}e^{\lambda Y_1 + \ldots + \lambda Y_n}.$$
Notice that $\mathbb{E}_{i-1} Y_i = \mathbb{E}_{i-1} f - \mathbb{E}_{i-1} f = 0$. Also, the stability condition implies that $|Y_i| \le a_i$. Since $Y_1, \ldots, Y_{n-1}$ do not depend on $X_n$ (only $Y_n$ does), if we average in $X_n$ first, we can write
$$\mathbb{E}e^{\lambda Y_1 + \ldots + \lambda Y_n} = \mathbb{E}\Big( e^{\lambda Y_1 + \ldots + \lambda Y_{n-1}}\, \mathbb{E}_{n-1} e^{\lambda Y_n} \Big).$$
If we apply the previous lemma to $Y_n / a_n$ viewed as a function of $X_n$, we get
$$\mathbb{E}_{n-1} e^{\lambda Y_n} = \mathbb{E}_{n-1} e^{\lambda a_n (Y_n / a_n)} \le e^{\lambda^2 a_n^2 / 2}$$
and, therefore, $\mathbb{E}e^{\lambda Y_1 + \ldots + \lambda Y_n} \le e^{\lambda^2 a_n^2 / 2}\, \mathbb{E}e^{\lambda Y_1 + \ldots + \lambda Y_{n-1}}$. Proceeding by induction on $n$, we get $\mathbb{E}e^{\lambda Y_1 + \ldots + \lambda Y_n} \le e^{\lambda^2 \sum_{i=1}^{n} a_i^2 / 2}$ and
$$\mathbb{P}\Big( \sum_{i=1}^{n} Y_i \ge t \Big) \le \exp\Big( - \lambda t + \frac{\lambda^2}{2} \sum_{i=1}^{n} a_i^2 \Big).$$
Optimizing over $\lambda \ge 0$ finishes the proof. $\square$

Notice that in the above proof, we did not use the fact that the $X_i$ are real-valued random variables. They could be random vectors or arbitrary random elements taking values in some measurable spaces. We only used the assumption that they are independent. Keeping this in mind, let us give one example of application of Azuma's inequality.

Example. Consider an Erdos-Renyi random graph $G(n,p)$ on $n$ vertices, where each edge is present with probability $p$ independently of the other edges. Let $f = c(G(n,p))$ be the chromatic number of this graph, which is the smallest number of colors needed to color the vertices so that no two adjacent vertices share the same color. Let us denote the vertices by $v_1, \ldots, v_n$ and let $X_i$ denote the randomness in the set of possible edges between the vertex $v_i$ and $v_1, \ldots, v_{i-1}$. In other words, $X_i = (X_{i1}, \ldots, X_{i,i-1})$ where $X_{ik}$ is $1$ if the edge between $v_k$ and $v_i$ is present and $0$ otherwise. Notice that the vectors $X_1, \ldots, X_n$ are independent and the chromatic number is clearly a function $f = f(X_1, \ldots, X_n)$. To apply Azuma's inequality, we need to determine the stability constants $a_1, \ldots, a_n$. Observe that changing the set of edges connected to one vertex $v_i$ can only affect the chromatic number by at most $1$ because, in the worst case, we can assign a new color to this vertex. This means that $a_i = 1$ and Azuma's inequality implies that
$$\mathbb{P}\big( |c(G(n,p)) - \mathbb{E}c(G(n,p))| \ge t \big) \le 2 e^{-\frac{t^2}{2n}}$$
(we simply apply the inequality to $f$ and $-f$ here). For example, if we take $t = \sqrt{2 n \log n}$, we get the bound $2/n$, which means that with high probability the chromatic number will be within $\sqrt{2 n \log n}$ of its expected value $\mathbb{E}c(G(n,p))$. When $p$ is fixed, it is known (but non-trivial) that this expected value is close to $c_p\, n / \log n$ with $c_p = \frac{1}{2}\log\frac{1}{1-p}$, so the deviation $\sqrt{2 n \log n}$ is of much smaller order.

Exercise. (Hoeffding-Chernoff inequality) Prove that for $0 < \mu \le 1/2$ and $0 \le t < \mu$,
$$D(1 - \mu + t,\, 1 - \mu) \ge \frac{t^2}{2 \mu (1 - \mu)}.$$
Hint: compare second derivatives.

Exercise. Let $X_1, \ldots, X_n$ be independent flips of a fair coin, i.e. $\mathbb{P}(X_i = 0) = \mathbb{P}(X_i = 1) = 1/2$. If $\bar{X}_n$ is their average, show that for $t \ge 0$,
$$\mathbb{P}\big( |\bar{X}_n - 1/2| > t \big) \le 2 e^{-2 n t^2}.$$
Hint: use the previous problem twice.

Exercise. Suppose that the random variables $X_1, \ldots, X_n, X_1', \ldots, X_n'$ are independent and, for all $i \le n$, $X_i$ and $X_i'$ have the same distribution. Prove that
$$\mathbb{P}\bigg( \sum_{i=1}^{n} (X_i - X_i') \ge t \Big( \sum_{i=1}^{n} (X_i - X_i')^2 \Big)^{1/2} \bigg) \le e^{-t^2/2}.$$
Hint: think about a way to introduce Rademacher random variables $\varepsilon_i$ into the problem and then use Hoeffding's inequality.

Exercise. (Bernstein's inequality) Let $X_1, \ldots, X_n$ be i.i.d. random variables such that $|X_i - \mu| \le M$, $\mathbb{E}X = \mu$ and $\mathrm{Var}(X) = \sigma^2$. If $\bar{X}_n$ is their average, make a change of variables in the Bernstein inequality to show that for $t > 0$,
$$\mathbb{P}\bigg( \bar{X}_n \ge \mu + \sqrt{\frac{2 \sigma^2 t}{n}} + \frac{2 M t}{3 n} \bigg) \le e^{-t}.$$

Section 5

Laws of Large Numbers.

In this section, we will study two types of convergence of the average to the mean, in probability and almost surely. Consider a sequence of random variables $(Y_n)_{n \ge 1}$ on some probability space $(\Omega, \mathcal{A}, \mathbb{P})$. We say that $Y_n$ converges in probability to a random variable $Y$ (and write $Y_n \stackrel{p}{\to} Y$) if, for all $\varepsilon > 0$,
$$\lim_{n \to \infty} \mathbb{P}\big( |Y_n - Y| \ge \varepsilon \big) = 0.$$
We say that $Y_n$ converges to $Y$ almost surely, or with probability $1$, if
$$\mathbb{P}\big( \omega : \lim_{n \to \infty} Y_n(\omega) = Y(\omega) \big) = 1.$$
Let us begin with an easier case.

Theorem 10 (Weak law of large numbers) Consider a sequence of random variables $(X_n)_{n \ge 1}$ that are centered, $\mathbb{E}X_n = 0$, have uniformly bounded second moments, $\mathbb{E}X_n^2 \le K < \infty$, and are uncorrelated, $\mathbb{E}X_i X_j = 0$ for $i \ne j$. Then
$$\bar{X}_n = \frac{1}{n} \sum_{i \le n} X_i \to 0$$
in probability.

Proof. By Chebyshev's inequality, also using that $\mathbb{E}X_i X_j = 0$,
$$\mathbb{P}\big( |\bar{X}_n - 0| \ge \varepsilon \big) = \mathbb{P}\big( \bar{X}_n^2 \ge \varepsilon^2 \big) \le \frac{\mathbb{E}\bar{X}_n^2}{\varepsilon^2} = \frac{1}{n^2 \varepsilon^2}\, \mathbb{E}(X_1 + \cdots + X_n)^2 = \frac{1}{n^2 \varepsilon^2} \sum_{i=1}^{n} \mathbb{E}X_i^2 \le \frac{nK}{n^2 \varepsilon^2} = \frac{K}{n \varepsilon^2} \to 0,$$
as $n \to \infty$, which finishes the proof. $\square$

Of course, if $(X_n)_{n \ge 1}$ are independent then they are automatically uncorrelated, since
$$\mathbb{E}X_i X_j = \mathbb{E}X_i\, \mathbb{E}X_j = 0.$$
Before we move on to the almost sure convergence, let us give one more application of the above argument to the problem of approximation of continuous functions. Consider an i.i.d. sequence $(X_n)_{n \ge 1}$ with the distribution $\mathbb{P}_\theta$ on $\mathbb{R}$ that depends on some parameter $\theta \in \Theta \subseteq \mathbb{R}$, and suppose that
$$\mathbb{E}X_i = \theta, \qquad \mathrm{Var}(X_i) = \sigma^2(\theta).$$
Then the following holds.

Theorem 11 If $u: \mathbb{R} \to \mathbb{R}$ is continuous and bounded and $\sigma^2(\theta)$ is bounded on compacts then
$$\mathbb{E}u(\bar{X}_n) \to u(\theta)$$
uniformly on compacts.

Proof. For any $\varepsilon > 0$, we can write
$$|\mathbb{E}u(\bar{X}_n) - u(\theta)| \le \mathbb{E}|u(\bar{X}_n) - u(\theta)| = \mathbb{E}|u(\bar{X}_n) - u(\theta)| \big( I(|\bar{X}_n - \theta| \le \varepsilon) + I(|\bar{X}_n - \theta| > \varepsilon) \big) \le \max_{|x - \theta| \le \varepsilon} |u(x) - u(\theta)| + 2\|u\|_\infty\, \mathbb{P}\big( |\bar{X}_n - \theta| > \varepsilon \big).$$
The last probability can be bounded as in the previous theorem,
$$\mathbb{P}\big( |\bar{X}_n - \theta| > \varepsilon \big) \le \frac{\sigma^2(\theta)}{n \varepsilon^2},$$
and, therefore,
$$|\mathbb{E}u(\bar{X}_n) - u(\theta)| \le \max_{|x - \theta| \le \varepsilon} |u(x) - u(\theta)| + 2\|u\|_\infty\, \frac{\sigma^2(\theta)}{n \varepsilon^2}.$$
The statement of the theorem should now be obvious. $\square$

Example. Let $(X_i)$ be i.i.d. with the Bernoulli distribution $B(\theta)$ with probability of success $\theta \in [0,1]$,
$$\mathbb{P}(X_i = 1) = \theta, \qquad \mathbb{P}(X_i = 0) = 1 - \theta,$$
and let $u: [0,1] \to \mathbb{R}$ be continuous. Then, by the above theorem, the following linear combination of the Bernstein polynomials,
$$B_n(\theta) := \mathbb{E}u(\bar{X}_n) = \sum_{k=0}^{n} u\Big( \frac{k}{n} \Big)\, \mathbb{P}\Big( \sum_{i=1}^{n} X_i = k \Big) = \sum_{k=0}^{n} u\Big( \frac{k}{n} \Big) \binom{n}{k} \theta^k (1 - \theta)^{n-k},$$
approximates $u(\theta)$ uniformly on $[0,1]$. This gives an explicit example of polynomials in the Weierstrass theorem that approximate a continuous function on $[0,1]$. $\square$
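The Bernstein polynomials above are easy to compute directly. The following sketch (an added illustration, not part of the notes; the test function $u(x) = |x - 1/2|$ and the evaluation grid are arbitrary choices) shows the uniform error shrinking as $n$ grows.

```python
import numpy as np
from math import comb

def bernstein(u, n, theta):
    # B_n(theta) = sum_k u(k/n) * C(n, k) * theta^k * (1 - theta)^(n - k)
    k = np.arange(n + 1)
    binom = np.array([comb(n, int(j)) for j in k], dtype=float)
    weights = binom * theta ** k * (1 - theta) ** (n - k)
    return np.sum(u(k / n) * weights)

u = lambda x: np.abs(x - 0.5)
grid = np.linspace(0.0, 1.0, 101)
for n in [10, 100, 1000]:
    err = max(abs(bernstein(u, n, th) - u(th)) for th in grid)
    print(n, err)   # maximal error over the grid decreases with n
```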

Example. Suppose $(X_i)$ have the Poisson distribution $\Pi(\theta)$ with the parameter $\theta > 0$,
$$\mathbb{P}_\theta(X_i = k) = \frac{\theta^k}{k!}\, e^{-\theta} \quad \text{for integer } k \ge 0.$$
Then, it is well known (and easy to check) that $\mathbb{E}X_i = \theta$, $\sigma^2(\theta) = \theta$, and the sum $X_1 + \ldots + X_n$ has the Poisson distribution $\Pi(n\theta)$. Therefore, if $u$ is bounded and continuous on $[0, +\infty)$ then
$$\mathbb{E}u(\bar{X}_n) = \sum_{k=0}^{\infty} u\Big( \frac{k}{n} \Big)\, \mathbb{P}\Big( \sum_{i=1}^{n} X_i = k \Big) = \sum_{k=0}^{\infty} u\Big( \frac{k}{n} \Big) \frac{(n\theta)^k}{k!}\, e^{-n\theta} \to u(\theta)$$
uniformly on compact sets. $\square$

Before we turn to the almost sure convergence results, let us note that convergence in probability is weaker than a.s. convergence. For example, consider a probability space which is a circle of circumference $1$ with the uniform measure on it. Consider a sequence of random variables on this probability space defined by
$$X_k(x) = I\Big( x \in \Big[ 1 + \frac12 + \cdots + \frac1k,\ 1 + \cdots + \frac1k + \frac{1}{k+1} \Big] \ \mathrm{mod}\ 1 \Big).$$
Then $X_k \to 0$ in probability, since for $0 < \varepsilon < 1$,
$$\mathbb{P}\big( |X_k - 0| \ge \varepsilon \big) = \frac{1}{k+1} \to 0.$$
However, $X_k$ does not have an almost sure limit, because the series $\sum_{k \ge 1} 1/k$ diverges and, as a result, each point $x$ on the circle will fall into the above intervals infinitely many times, i.e. it will satisfy $X_k(x) = 1$ for infinitely many $k$. Let us begin with the following lemma.

Lemma 12 Consider a sequence $(p_i)_{i \ge 1}$ such that $p_i \in [0,1)$. Then
$$\prod_{i \ge 1} (1 - p_i) = 0 \iff \sum_{i \ge 1} p_i = +\infty.$$

Proof. $\Longleftarrow$. Using that $1 - p \le e^{-p}$ we get
$$\prod_{i \le n} (1 - p_i) \le \exp\Big( - \sum_{i \le n} p_i \Big) \to 0 \ \text{ as } n \to \infty.$$
$\Longrightarrow$. We can assume that $p_i \le 1/2$ for $i \ge m$ for large enough $m$, because, otherwise, the series obviously diverges. Since $1 - p \ge e^{-2p}$ for $p \le 1/2$ we have
$$\prod_{m \le i \le n} (1 - p_i) \ge \exp\Big( -2 \sum_{m \le i \le n} p_i \Big)$$
and the result follows. $\square$

The following result plays a key role in Probability Theory. Consider a sequence $(A_n)_{n \ge 1}$ of events $A_n \in \mathcal{A}$ on the probability space $(\Omega, \mathcal{A}, \mathbb{P})$. We will denote by
$$\{A_n \text{ i.o.}\} := \bigcap_{n \ge 1} \bigcup_{m \ge n} A_m$$
the event that the $A_n$ occur infinitely often, which consists of all $\omega \in \Omega$ that belong to infinitely many events in the sequence $(A_n)_{n \ge 1}$. Then the following holds.

Lemma 13 (Borel-Cantelli lemma)

(1) If $\sum_{n \ge 1} \mathbb{P}(A_n) < \infty$ then $\mathbb{P}(A_n \text{ i.o.}) = 0$.
(2) If the $A_n$ are independent and $\sum_{n \ge 1} \mathbb{P}(A_n) = +\infty$ then $\mathbb{P}(A_n \text{ i.o.}) = 1$.

Proof. (1) If $B_n = \bigcup_{m \ge n} A_m$ then $B_n \supseteq B_{n+1}$ and, by the continuity of measure,
$$\mathbb{P}(A_n \text{ i.o.}) = \mathbb{P}\Big( \bigcap_{n \ge 1} B_n \Big) = \lim_{n \to \infty} \mathbb{P}(B_n).$$
On the other hand,
$$\mathbb{P}(B_n) = \mathbb{P}\Big( \bigcup_{m \ge n} A_m \Big) \le \sum_{m \ge n} \mathbb{P}(A_m),$$
which goes to zero as $n \to +\infty$, because the series $\sum_{m \ge 1} \mathbb{P}(A_m) < \infty$.
(2) We can write
$$\mathbb{P}(\Omega \setminus B_n) = \mathbb{P}\Big( \Omega \setminus \bigcup_{m \ge n} A_m \Big) = \mathbb{P}\Big( \bigcap_{m \ge n} (\Omega \setminus A_m) \Big) \ \{\text{by independence}\} = \prod_{m \ge n} \mathbb{P}(\Omega \setminus A_m) = \prod_{m \ge n} \big( 1 - \mathbb{P}(A_m) \big) = 0,$$
by Lemma 12, since $\sum_{m \ge n} \mathbb{P}(A_m) = +\infty$. Therefore, $\mathbb{P}(B_n) = 1$ and $\mathbb{P}(A_n \text{ i.o.}) = \mathbb{P}\big( \bigcap_{n \ge 1} B_n \big) = 1$. $\square$
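The dichotomy in the Borel-Cantelli lemma is easy to see in a simulation (an added sketch, not part of the notes; the horizon $N$ and the two probability sequences are arbitrary choices): with $\mathbb{P}(A_n) = 1/n^2$ only finitely many independent events occur, while with $\mathbb{P}(A_n) = 1/n$ they keep occurring.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000
n = np.arange(1, N + 1)

for label, p in [("1/n^2", 1.0 / n ** 2), ("1/n  ", 1.0 / n)]:
    occurred = rng.uniform(size=N) < p          # independent events A_n with P(A_n) = p_n
    last = n[occurred].max() if occurred.any() else 0
    print(label, occurred.sum(), "occurrences, last one at n =", last)
# Summable case: a handful of early occurrences, none after a finite index (a.s. finitely many).
# Non-summable case: occurrences appear all the way up to N (a.s. infinitely many).
```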

Let us show how this implies the strong law of large numbers for bounded random variables. Recall that in the case when
$$\mu = \mathbb{E}X_1, \qquad \sigma^2 = \mathrm{Var}(X_1) \quad \text{and} \quad |X_1| \le M,$$
the average satisfies Bernstein's inequality proved in the last section,
$$\mathbb{P}\big( |\bar{X}_n - \mu| \ge t \big) \le 2 \exp\Big( - \frac{n t^2}{2(\sigma^2 + tM/3)} \Big).$$
If we take
$$t = t_n = \Big( \frac{8 \sigma^2 \log n}{n} \Big)^{1/2}$$
then, for $n$ large enough such that $t_n M / 3 \le \sigma^2$, we have
$$\mathbb{P}\big( |\bar{X}_n - \mu| \ge t_n \big) \le 2 \exp\Big( - \frac{n t_n^2}{4 \sigma^2} \Big) = \frac{2}{n^2}.$$
Since the series $\sum_{n \ge 1} 2 n^{-2}$ converges, the Borel-Cantelli lemma implies that
$$\mathbb{P}\big( |\bar{X}_n - \mu| \ge t_n \ \text{i.o.} \big) = 0.$$
This means that for $\mathbb{P}$-almost all $\omega \in \Omega$, the difference $|\bar{X}_n(\omega) - \mu|$ will become smaller than $t_n$ for large enough $n \ge n_0(\omega)$. If we recall the definition of almost sure convergence, this means that $\bar{X}_n$ converges to $\mu$ almost surely, the so-called strong law of large numbers. Next, we will show that this holds even for unbounded random variables under the minimal assumption that the expectation $\mu = \mathbb{E}X_1$ exists.
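A single simulated trajectory already shows the almost sure convergence that the theorem below establishes in general (an added sketch, not part of the notes; the exponential(1) distribution, the seed and the sample sizes are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.exponential(scale=1.0, size=1_000_000)           # i.i.d. with EX_1 = 1
running_avg = np.cumsum(X) / np.arange(1, X.size + 1)     # the averages along one realization
for n in [10, 1_000, 100_000, 1_000_000]:
    print(n, running_avg[n - 1])                          # approaches the mean 1 as n grows
```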

Strong law of large numbers. The following simple observation will be useful: if a random variable $X \ge 0$ then
$$\mathbb{E}X = \int_0^{\infty} \mathbb{P}(X \ge x)\, dx.$$
Indeed, if $F$ is the law of $X$ on $\mathbb{R}$ then
$$\mathbb{E}X = \int_0^{\infty} x\, dF(x) = \int_0^{\infty} \int_0^{x} 1\, ds\, dF(x) = \int_0^{\infty} \int_s^{\infty} dF(x)\, ds = \int_0^{\infty} \mathbb{P}(X \ge s)\, ds.$$
For $X \ge 0$ such that $\mathbb{E}X < \infty$ this implies
$$\sum_{i \ge 1} \mathbb{P}(X \ge i) \le \int_0^{\infty} \mathbb{P}(X \ge s)\, ds = \mathbb{E}X < \infty. \qquad (5.0.1)$$
As before, let $(X_n)_{n \ge 1}$ be i.i.d. random variables on the same probability space.

Theorem 12 (Strong law of large numbers) If $\mathbb{E}|X_1| < \infty$ then
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \to \mu = \mathbb{E}X_1 \ \text{ almost surely.}$$

Proof. The proof will proceed in several steps.

Step 1. First, without loss of generality we can assume that $X_i \ge 0$. Indeed, for signed random variables we can decompose $X_i = X_i^+ - X_i^-$, where
$$X_i^+ = X_i\, I(X_i \ge 0) \quad \text{and} \quad X_i^- = |X_i|\, I(X_i < 0),$$
and the general result would follow, since
$$\frac{1}{n} \sum_{i=1}^{n} X_i = \frac{1}{n} \sum_{i=1}^{n} X_i^+ - \frac{1}{n} \sum_{i=1}^{n} X_i^- \to \mathbb{E}X_1^+ - \mathbb{E}X_1^- = \mathbb{E}X_1.$$
Thus, from now on we assume that $X_i \ge 0$.

Step 2. (Truncation) Next, we can replace $X_i$ by $Y_i = X_i\, I(X_i \le i)$ using the Borel-Cantelli lemma. Since
$$\sum_{i \ge 1} \mathbb{P}(X_i \ne Y_i) = \sum_{i \ge 1} \mathbb{P}(X_i > i) \le \mathbb{E}X_1 < \infty,$$

the Borel-Cantelli lemma implies that $\mathbb{P}(\{X_i \ne Y_i\} \text{ i.o.}) = 0$. This means that for some (random) $i_0$ and for $i \ge i_0$ we have $X_i = Y_i$ and, therefore,
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} Y_i.$$
It remains to show that if $T_n = \sum_{i=1}^{n} Y_i$ then $T_n / n \to \mathbb{E}X$ almost surely.

Step 3. (Limits over subsequences) We will first prove almost sure convergence along the subsequences of the type $n(k) = \lfloor a^k \rfloor$ for $a > 1$. For any $\varepsilon > 0$,
$$\sum_{k \ge 1} \mathbb{P}\big( |T_{n(k)} - \mathbb{E}T_{n(k)}| \ge \varepsilon n(k) \big) \le \sum_{k \ge 1} \frac{1}{\varepsilon^2 n(k)^2}\, \mathrm{Var}(T_{n(k)}) = \sum_{k \ge 1} \frac{1}{\varepsilon^2 n(k)^2} \sum_{i \le n(k)} \mathrm{Var}(Y_i) \le \sum_{k \ge 1} \frac{1}{\varepsilon^2 n(k)^2} \sum_{i \le n(k)} \mathbb{E}Y_i^2 = \frac{1}{\varepsilon^2} \sum_{i \ge 1} \mathbb{E}Y_i^2 \sum_{k :\, n(k) \ge i} \frac{1}{n(k)^2}.$$
To bound the last sum, let us note that
$$\frac{a^k}{2} \le n(k) = \lfloor a^k \rfloor \le a^k$$
and if $k_0 = \min\{k : a^k \ge i\}$ then
$$\sum_{n(k) \ge i} \frac{1}{n(k)^2} \le \sum_{a^k \ge i} \frac{4}{a^{2k}} = \frac{4}{a^{2 k_0} \big( 1 - \frac{1}{a^2} \big)} \le \frac{K}{i^2},$$
where $K = K(a)$ depends on $a$ only. Therefore, we showed that
$$\sum_{k \ge 1} \mathbb{P}\big( |T_{n(k)} - \mathbb{E}T_{n(k)}| \ge \varepsilon n(k) \big) \le \frac{K}{\varepsilon^2} \sum_{i \ge 1} \mathbb{E}Y_i^2\, \frac{1}{i^2} = \frac{K}{\varepsilon^2} \sum_{i \ge 1} \frac{1}{i^2} \int_0^{i} x^2\, dF(x),$$
where $F$ is the law of $X$. We can bound the last sum as follows:
$$\sum_{i \ge 1} \frac{1}{i^2} \int_0^{i} x^2\, dF(x) = \sum_{i \ge 1} \frac{1}{i^2} \sum_{m < i} \int_m^{m+1} x^2\, dF(x) = \sum_{m \ge 0} \Big( \sum_{i > m} \frac{1}{i^2} \Big) \int_m^{m+1} x^2\, dF(x) \le \sum_{m \ge 0} \frac{2}{m+1} \int_m^{m+1} x^2\, dF(x) \le 2 \sum_{m \ge 0} \int_m^{m+1} x\, dF(x) = 2\, \mathbb{E}X < \infty.$$
We proved that
$$\sum_{k \ge 1} \mathbb{P}\big( |T_{n(k)} - \mathbb{E}T_{n(k)}| \ge \varepsilon n(k) \big) < \infty$$
and the Borel-Cantelli lemma implies that
$$\mathbb{P}\big( |T_{n(k)} - \mathbb{E}T_{n(k)}| \ge \varepsilon n(k) \ \text{i.o.} \big) = 0.$$
If we take a sequence $\varepsilon_m = m^{-1}$, this implies that
$$\mathbb{P}\big( \exists m \ge 1,\ |T_{n(k)} - \mathbb{E}T_{n(k)}| \ge m^{-1} n(k) \ \text{i.o.} \big) = 0,$$
and this proves that
$$\frac{T_{n(k)}}{n(k)} - \frac{\mathbb{E}T_{n(k)}}{n(k)} \to 0$$
with probability one. On the other hand, it is obvious that
$$\frac{\mathbb{E}T_{n(k)}}{n(k)} = \frac{1}{n(k)} \sum_{i \le n(k)} \mathbb{E}X_1 I(X_1 \le i) \to \mathbb{E}X$$

as $k \to \infty$, so we proved that
$$\frac{T_{n(k)}}{n(k)} \to \mathbb{E}X \ \text{ almost surely.}$$

Step 4. Finally, for $j$ such that
$$n(k) \le j < n(k+1) = n(k)\, \frac{n(k+1)}{n(k)} \le n(k)\, a^2$$
we can write
$$\frac{1}{a^2}\, \frac{T_{n(k)}}{n(k)} \le \frac{T_j}{j} \le a^2\, \frac{T_{n(k+1)}}{n(k+1)}$$
and, therefore, with probability one,
$$\frac{1}{a^2}\, \mathbb{E}X \le \liminf_{j \to \infty} \frac{T_j}{j} \le \limsup_{j \to \infty} \frac{T_j}{j} \le a^2\, \mathbb{E}X.$$
Taking $a = 1 + m^{-1}$ and letting $m \to \infty$ proves that $\lim_{j \to \infty} T_j / j = \mathbb{E}X$ almost surely. $\square$

Exercise. Suppose that the random variables $(X_n)_{n \ge 1}$ are i.i.d. such that $\mathbb{E}|X_1|^p < \infty$ for some $p > 0$. Show that $\max_{i \le n} |X_i| / n^{1/p}$ goes to zero in probability.

Exercise. (Weak LLN for U-statistics) If $(X_n)_{n \ge 1}$ are i.i.d. such that $\mathbb{E}X_1 = \mu$ and $\sigma^2 = \mathrm{Var}(X_1) < \infty$, show that
$$\binom{n}{2}^{-1} \sum_{1 \le i < j \le n} X_i X_j \to \mu^2$$
in probability as $n \to \infty$.

Exercise. If $u: [0,1]^k \to \mathbb{R}$ is continuous then show that
$$\sum_{0 \le j_1, \ldots, j_k \le n} u\Big( \frac{j_1}{n}, \ldots, \frac{j_k}{n} \Big) \prod_{i \le k} \binom{n}{j_i} x_i^{j_i} (1 - x_i)^{n - j_i} \to u(x_1, \ldots, x_k)$$
as $n \to \infty$, uniformly on $[0,1]^k$.

Exercise. If $\mathbb{E}|X| < \infty$ and $\lim_{n \to \infty} \mathbb{P}(A_n) = 0$, show that $\lim_{n \to \infty} \mathbb{E}X I_{A_n} = 0$. (Hint: use the Borel-Cantelli lemma over some subsequence.)

Exercise. Suppose that $A_n$ is a sequence of events such that
$$\lim_{n \to \infty} \mathbb{P}(A_n) = 0 \quad \text{and} \quad \sum_{n \ge 1} \mathbb{P}(A_n \setminus A_{n+1}) < \infty.$$
Prove that $\mathbb{P}(A_n \text{ i.o.}) = 0$.

Exercise. Let $\{X_n\}_{n \ge 1}$ be i.i.d. and $S_n = X_1 + \ldots + X_n$. If $S_n / n \to 0$ a.s., show that $\mathbb{E}|X_1| < \infty$. (Hint: use the idea in (5.0.1) and the Borel-Cantelli lemma.)

Exercise. Suppose that $(X_n)_{n \ge 1}$ are independent random variables. Show that $\mathbb{P}(\sup_{n \ge 1} X_n < \infty) = 1$ if and only if $\sum_{n \ge 1} \mathbb{P}(X_n > M) < \infty$ for some $M > 0$.

Exercise. If $(X_n)_{n \ge 1}$ are i.i.d., not constant with probability one, then $\mathbb{P}(X_n \text{ converges}) = 0$.

Exercise. Let $\{X_n\}_{n \ge 1}$ be independent and exponentially distributed, i.e. with c.d.f. $F(x) = 1 - e^{-x}$ for $x \ge 0$. Show that
$$\mathbb{P}\Big( \limsup_{n \to \infty} \frac{X_n}{\log n} = 1 \Big) = 1.$$

Exercise. Suppose that $(X_n)_{n \ge 1}$ are i.i.d. with $\mathbb{E}X_1^+ < \infty$ and $\mathbb{E}X_1^- = \infty$. Show that $S_n / n \to -\infty$ almost surely.

Exercise. Suppose that $(X_n)_{n \ge 1}$ are i.i.d. such that $\mathbb{E}|X_1| < \infty$ and $\mathbb{E}X_1 = 0$. If $(c_n)$ is a bounded sequence of real numbers, prove that
$$\frac{1}{n} \sum_{i=1}^{n} c_i X_i \to 0 \ \text{ almost surely.}$$
Hint: either (a) group the close values of $c_i$ or (b) examine the proof of the strong law of large numbers.

Additional exercise. Suppose $(X_n)_{n \ge 1}$ are i.i.d. standard normal. Prove that
$$\mathbb{P}\Big( \limsup_{n \to \infty} \frac{|X_n|}{\sqrt{2 \log n}} = 1 \Big) = 1.$$

Section 6

0–1 Laws. Random series.

Consider a sequence $(X_i)_{i\ge1}$ of random variables on the same probability space $(\Omega,\mathcal{F},P)$ and let $\sigma((X_i)_{i\ge1})$ be the $\sigma$-algebra generated by this sequence. An event $A\in\sigma((X_i)_{i\ge1})$ is called a tail event if $A\in\sigma((X_i)_{i\ge n})$ for all $n\ge1$. In other words,
\[
A\in\mathcal{T}=\bigcap_{n\ge1}\sigma\bigl((X_i)_{i\ge n}\bigr),
\]
where $\mathcal{T}$ is the so-called tail $\sigma$-algebra. For example, if $A_i\in\sigma(X_i)$ then
\[
\{A_i\ \text{i.o.}\}=\bigcap_{n\ge1}\bigcup_{i\ge n}A_i
\]
is a tail event. It turns out that when $(X_i)_{i\ge1}$ are independent then all tail events have probability 0 or 1.

Theorem 14 (Kolmogorov's 0–1 law) If $(X_i)_{i\ge1}$ are independent then $P(A)=0$ or $1$ for all $A\in\mathcal{T}$.

Proof. For a finite subset $F=\{i_1,\ldots,i_n\}\subseteq\mathbb{N}$, let us denote $X_F=(X_{i_1},\ldots,X_{i_n})$. The $\sigma$-algebra $\sigma((X_i)_{i\ge1})$ is generated by the algebra of sets
\[
\mathcal{A}=\bigl\{\{X_F\in B\} : \text{finite } F\subseteq\mathbb{N},\ B\in\mathcal{B}(\mathbb{R}^{|F|})\bigr\}.
\]
This is an algebra, because any set operations on a finite number of such sets can again be expressed in terms of finitely many random variables $X_i$. By the Approximation Lemma, we can approximate any set $A\in\sigma((X_i)_{i\ge1})$ by sets in $\mathcal{A}$. Therefore, for any $\varepsilon>0$, there exists a set $A_\varepsilon\in\mathcal{A}$ such that $P(A\,\triangle\,A_\varepsilon)\le\varepsilon$ and, therefore,
\[
|P(A)-P(A_\varepsilon)|\le\varepsilon,\qquad |P(A)-P(A\cap A_\varepsilon)|\le\varepsilon.
\]
By definition, $A_\varepsilon\in\sigma(X_1,\ldots,X_n)$ for large enough $n$ and, since $A$ is a tail event, $A\in\sigma((X_i)_{i\ge n+1})$. The Grouping Lemma implies that $A$ and $A_\varepsilon$ are independent and $P(A\cap A_\varepsilon)=P(A)P(A_\varepsilon)$. Finally, we get
\[
P(A)\approx_\varepsilon P(A\cap A_\varepsilon)=P(A)P(A_\varepsilon)\approx_\varepsilon P(A)P(A),
\]
where $\approx_\varepsilon$ denotes equality up to an error of order $\varepsilon$, and letting $\varepsilon\to0$ proves that $P(A)=P(A)^2$, so $P(A)\in\{0,1\}$. ⊓⊔

Example. The event $A=\{\text{the series }\sum_{i\ge1}X_i\text{ converges}\}$ is a tail event, so it has probability 0 or 1 when the $X_i$'s are independent.

Example. Consider the series $\sum_{i\ge1}X_iz^i$ on the complex plane, for $z\in\mathbb{C}$. Its radius of convergence is
\[
r=\liminf_{i\to\infty}|X_i|^{-\frac{1}{i}}.
\]
For any $x\ge0$, the event $\{r\le x\}$ is, obviously, a tail event. This implies that $r=\mathrm{const}$ with probability 1.
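A small simulation (purely illustrative, not from the notes) of the last example: since $r=\liminf_i|X_i|^{-1/i}$ is a tail function of the coefficients, independent runs should all produce essentially the same value. The choice of Exp(1) coefficients, the sample size and the tail cutoff are arbitrary assumptions of the sketch; NumPy is assumed to be available.

```python
import numpy as np

def tail_values(seed, n=5000):
    # i.i.d. Exp(1) coefficients; return the sequence |X_i|^(-1/i), whose liminf is r
    x = np.random.default_rng(seed).exponential(1.0, size=n)
    i = np.arange(1, n + 1)
    return x ** (-1.0 / i)

for seed in range(3):
    v = tail_values(seed)
    # crude liminf estimate: minimum over the tail i >= 1000
    print(f"run {seed}: min of |X_i|^(-1/i) over i >= 1000 is {v[999:].min():.4f}")
```

Each run reports a value very close to 1, consistent with the radius of convergence being an almost sure constant (here $r=1$).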


Next we will prove a stronger result under the more restrictive assumption that the random variables $(X_i)_{i\ge1}$ are not only independent, but also identically distributed. A set $B\subseteq\mathbb{R}^{\mathbb{N}}$ is called symmetric if, for all $n\ge1$,
\[
(x_1,x_2,\ldots,x_n,x_{n+1},\ldots)\in B\ \Longrightarrow\ (x_n,x_2,\ldots,x_{n-1},x_1,x_{n+1},\ldots)\in B.
\]
In other words, the set $B$ is invariant under permutations of finitely many coordinates. We will call an event $A\in\sigma((X_i)_{i\ge1})$ symmetric if it is of the form $\{(X_i)_{i\ge1}\in B\}$ for some symmetric set $B\subseteq\mathbb{R}^{\mathbb{N}}$ in the cylindrical $\sigma$-algebra $\mathcal{B}$ on $\mathbb{R}^{\mathbb{N}}$. For example, any event in the tail $\sigma$-algebra $\mathcal{T}$ is symmetric.

Theorem 15 (Hewitt–Savage 0–1 law) If $(X_i)_{i\ge1}$ are i.i.d. and $A\in\sigma((X_i)_{i\ge1})$ is symmetric then $P(A)=0$ or $1$.

Proof. By the Approximation Lemma, for any $\varepsilon>0$, there exist $n\ge1$ and $A_n\in\sigma(X_1,\ldots,X_n)$ such that $P(A_n\,\triangle\,A)\le\varepsilon$. Of course, we can write this set as
\[
A_n=\bigl\{(X_1,\ldots,X_n)\in B_n\bigr\}\in\sigma(X_1,\ldots,X_n)
\]
for some Borel set $B_n$ on $\mathbb{R}^n$. Let us denote
\[
A_n'=\bigl\{(X_{n+1},\ldots,X_{2n})\in B_n\bigr\}\in\sigma(X_{n+1},\ldots,X_{2n}).
\]
This set is independent of $A_n$, so $P(A_n\cap A_n')=P(A_n)P(A_n')$. Given $x=(x_1,x_2,\ldots)\in\mathbb{R}^{\mathbb{N}}$, let us define an operator
\[
\pi x=(x_{n+1},\ldots,x_{2n},x_1,\ldots,x_n,x_{2n+1},\ldots)
\]
that switches the first $n$ coordinates with the second $n$ coordinates. Denote $X=(X_i)_{i\ge1}$ and recall that $A=\{X\in B\}$ for some symmetric set $B\subseteq\mathbb{R}^{\mathbb{N}}$ that, by definition, satisfies $\pi B=\{\pi x : x\in B\}=B$. Now we can write
\[
P(A_n'\cap A)=P\bigl(\{(X_{n+1},\ldots,X_{2n})\in B_n\}\cap\{X\in B\}\bigr)
=P\bigl(\{(X_{n+1},\ldots,X_{2n})\in B_n\}\cap\{\pi X\in B\}\bigr)
\]
and, using that the $X_i$'s are i.i.d. (so that $\pi X$ has the same distribution as $X$),
\[
=P\bigl(\{(X_1,\ldots,X_n)\in B_n\}\cap\{X\in B\}\bigr)=P(A_n\cap A).
\]
This implies that $P(A_n'\,\triangle\,A)=P(A_n\,\triangle\,A)\le\varepsilon$, hence $P\bigl((A_n\cap A_n')\,\triangle\,A\bigr)\le2\varepsilon$, and we can conclude that
\[
P(A)\approx_\varepsilon P(A_n),\qquad P(A)\approx_\varepsilon P(A_n\cap A_n')=P(A_n)P(A_n')\approx_\varepsilon P(A)^2.
\]
Letting $\varepsilon\to0$ again implies that $P(A)=P(A)^2$. ⊓⊔

Example. Let $S_n=X_1+\cdots+X_n$ and let

\[
r=\limsup_{n\to\infty}\frac{S_n-a_n}{b_n}.
\]
The event $\{r\le x\}$ is symmetric, since changing the order of a finite set of coordinates does not affect $S_n$ for large enough $n$. As a result, $P(r\le x)=0$ or $1$, which implies that $r=\mathrm{const}$ with probability 1.

Random series. We already saw above that, by Kolmogorov's 0–1 law, the series $\sum_{i\ge1}X_i$ for independent $(X_i)_{i\ge1}$ converges with probability 0 or 1. This means that either $S_n=X_1+\cdots+X_n$ converges to its limit $S$ with probability one, or with probability one it does not converge. We know that almost sure convergence is stronger than convergence in probability, so in the case when with probability one $S_n$ does not converge, is it still possible that it converges to some random variable in probability? The answer is no, because we will now prove that for random series convergence in probability implies almost sure convergence. We will need the following.

Theorem 16 (Kolmogorov's inequality) Suppose that $(X_i)_{i\ge1}$ are independent and $S_n=X_1+\cdots+X_n$. If for all $j\le n$,
\[
P(|S_n-S_j|\ge a)\le p<1, \tag{6.0.1}
\]
then, for $x>a$,
\[
P\Bigl(\max_{1\le j\le n}|S_j|\ge x\Bigr)\le\frac{1}{1-p}\,P(|S_n|>x-a).
\]

Proof. First of all, let us notice that this inequality is obvious without the maximum, because (6.0.1) is equivalent to $1-p\le P(|S_n-S_j|<a)$ and we can write
\[
(1-p)P(|S_j|\ge x)\le P(|S_n-S_j|<a)\,P(|S_j|\ge x)=P\bigl(|S_n-S_j|<a,\,|S_j|\ge x\bigr)\le P(|S_n|>x-a).
\]
The equality in the middle holds because the events $\{|S_j|\ge x\}$ and $\{|S_n-S_j|<a\}$ are independent, since the first depends only on $X_1,\ldots,X_j$ and the second only on $X_{j+1},\ldots,X_n$. The last inequality holds because if $|S_n-S_j|<a$ and $|S_j|\ge x$ then, by the triangle inequality, $|S_n|>x-a$.

To deal with the maximum, instead of looking at an arbitrary partial sum $S_j$, we will look at the first partial sum that crosses the level $x$. We define this first time by
\[
\tau=\min\bigl\{j\le n : |S_j|\ge x\bigr\}
\]
and let $\tau=n+1$ if all $|S_j|<x$. Notice that the event $\{\tau=j\}$ also depends only on $X_1,\ldots,X_j$, so we can again write
\[
(1-p)P(\tau=j)\le P(|S_n-S_j|<a)\,P(\tau=j)=P\bigl(|S_n-S_j|<a,\,\tau=j\bigr)\le P(|S_n|>x-a,\,\tau=j).
\]
The last inequality is true because when $\tau=j$ we have $|S_j|\ge x$ and
\[
\bigl\{|S_n-S_j|<a,\,\tau=j\bigr\}\subseteq\bigl\{|S_n|>x-a,\,\tau=j\bigr\}.
\]
It remains to add up over $j\le n$ to get
\[
(1-p)P(\tau\le n)\le P(|S_n|>x-a,\,\tau\le n)\le P(|S_n|>x-a)
\]
and notice that $\{\tau\le n\}=\{\max_{j\le n}|S_j|\ge x\}$. ⊓⊔

We will need one more simple lemma.

Lemma 15 A sequence $Y_n\to Y$ almost surely if and only if $M_n=\sup_{i\ge n}|Y_i-Y|\to0$ in probability.

Proof. The 'only if' direction is obvious, so we only need to prove the 'if' part. Since the sequence $M_n$ is decreasing, it converges to some limit $M_n\downarrow M\ge0$ everywhere. Since for all $\varepsilon>0$,
\[
P(M\ge\varepsilon)\le P(M_n\ge\varepsilon)\to0 \ \text{ as } n\to\infty,
\]
this means that $P(M=0)=1$ and $M_n\to0$ almost surely. Of course, this implies that $Y_n\to Y$ almost surely. ⊓⊔

We are now ready to prove the result mentioned above.

Theorem 17 If a series $\sum_{i\ge1}X_i$ of independent random variables converges in probability then it converges almost surely.

Proof. Suppose that the partial sums $S_n$ converge to some random variable $S$ in probability, i.e., for any $\varepsilon>0$ and large enough $n\ge n_0(\varepsilon)$ we have $P(|S_n-S|\ge\varepsilon)\le\varepsilon$. If $k\ge j\ge n\ge n_0(\varepsilon)$ then
\[
P(|S_k-S_j|\ge2\varepsilon)\le P(|S_k-S|\ge\varepsilon)+P(|S_j-S|\ge\varepsilon)\le2\varepsilon.
\]
Next, we apply Kolmogorov's inequality with $x=4\varepsilon$, $a=2\varepsilon$ and $p=2\varepsilon$ to the partial sums $X_{n+1}+\cdots+X_j$ to get
\[
P\Bigl(\max_{n\le j\le k}|S_j-S_n|\ge4\varepsilon\Bigr)\le\frac{1}{1-2\varepsilon}\,P(|S_k-S_n|\ge2\varepsilon)\le\frac{2\varepsilon}{1-2\varepsilon}\le3\varepsilon,
\]
for small $\varepsilon$. The events $\{\max_{n\le j\le k}|S_j-S_n|\ge4\varepsilon\}$ are increasing in $k$ and, by the continuity of measure,
\[
P\Bigl(\sup_{j\ge n}|S_j-S_n|\ge4\varepsilon\Bigr)\le3\varepsilon.
\]

Finally, since $P(|S_n-S|\ge\varepsilon)\le\varepsilon$, we get
\[
P\Bigl(\sup_{j\ge n}|S_j-S|\ge5\varepsilon\Bigr)\le4\varepsilon.
\]
This means that $\sup_{j\ge n}|S_j-S|\to0$ in probability and, by the previous lemma, $S_n\to S$ almost surely. ⊓⊔

Let us give one easy-to-check criterion for convergence of random series. Again, we will need one auxiliary result.

Lemma 16 A random sequence $(Y_n)_{n\ge1}$ converges in probability to some limit $Y$ if and only if it is Cauchy in probability, which means that
\[
\lim_{n,m\to\infty}P(|Y_n-Y_m|\ge\varepsilon)=0
\]
for all $\varepsilon>0$.

Proof. Again, the 'only if' direction is obvious and we only need to prove the 'if' part. Given $\varepsilon=l^{-2}$, we can find $m(l)$ large enough such that, for $n,m\ge m(l)$,
\[
P\Bigl(|Y_n-Y_m|\ge\frac{1}{l^2}\Bigr)\le\frac{1}{l^2}. \tag{6.0.2}
\]
Without loss of generality, we can assume that $m(l+1)\ge m(l)$ so that
\[
P\Bigl(|Y_{m(l+1)}-Y_{m(l)}|\ge\frac{1}{l^2}\Bigr)\le\frac{1}{l^2}.
\]
Then
\[
\sum_{l\ge1}P\Bigl(|Y_{m(l+1)}-Y_{m(l)}|\ge\frac{1}{l^2}\Bigr)\le\sum_{l\ge1}\frac{1}{l^2}<\infty
\]
and, by the Borel–Cantelli lemma,
\[
P\Bigl(|Y_{m(l+1)}-Y_{m(l)}|\ge\frac{1}{l^2}\ \text{i.o.}\Bigr)=0.
\]

As a result, for large enough (random) $l$ and for $k>l$,
\[
|Y_{m(k)}-Y_{m(l)}|\le\sum_{i\ge l}\frac{1}{i^2}\to0 \ \text{ as } l\to\infty,
\]
so the subsequence $(Y_{m(l)})_{l\ge1}$ is almost surely Cauchy and, therefore, converges almost surely to some random variable $Y$. Combining this with (6.0.2), it is easy to check that $Y_n\to Y$ in probability. ⊓⊔

We can now state the criterion mentioned above.

Theorem 18 If $(X_i)_{i\ge1}$ are independent, $EX_i=0$ and $\sum_{i\ge1}EX_i^2<\infty$, then the series $\sum_{i\ge1}X_i$ converges almost surely.

Indeed, for $j<k$, Chebyshev's inequality gives $P(|S_k-S_j|\ge\varepsilon)\le\varepsilon^{-2}\sum_{j<i\le k}EX_i^2\to0$ as $j,k\to\infty$, so the partial sums are Cauchy in probability; by Lemma 16 they converge in probability and, by Theorem 17, almost surely.

One famous application of Theorem 18 is the following form of the strong law of large numbers.

Theorem 19 (Kolmogorov's strong law of large numbers) Let $(X_i)_{i\ge1}$ be independent random variables such that $EX_i=0$ and $EX_i^2<\infty$. Suppose that $b_i\le b_{i+1}$ and $b_i\uparrow\infty$. If $\sum_{i\ge1}EX_i^2/b_i^2<\infty$ then $\lim_{n\to\infty}b_n^{-1}S_n=0$ almost surely.

This follows immediately from Theorem 18 and the following well-known calculus lemma.

Lemma 17 (Kronecker's lemma) Suppose that a sequence $(b_i)_{i\ge1}$ is such that all $b_i>0$ and $b_i\uparrow\infty$. Given another sequence $(x_i)_{i\ge1}$, if the series $\sum_{i\ge1}x_i/b_i$ converges then $\lim_{n\to\infty}b_n^{-1}\sum_{i=1}^n x_i=0$.

Proof. Because the series converges, $r_n:=\sum_{i\ge n+1}x_i/b_i\to0$ as $n\to\infty$. Notice that we can write $x_n=b_n(r_{n-1}-r_n)$ and, therefore,
\[
\sum_{i=1}^n x_i=\sum_{i=1}^n b_i(r_{i-1}-r_i)=\sum_{i=1}^{n-1}(b_{i+1}-b_i)r_i+b_1r_0-b_nr_n.
\]
Since $r_i\to0$, given $\varepsilon>0$, we can find $n_0$ such that for $i\ge n_0$ we have $|r_i|\le\varepsilon$ and, therefore, $\bigl|\sum_{i=n_0+1}^{n-1}(b_{i+1}-b_i)r_i\bigr|\le\varepsilon b_n$. Therefore,
\[
\Bigl|b_n^{-1}\sum_{i=1}^n x_i\Bigr|\le b_n^{-1}\Bigl|\sum_{i=1}^{n_0}(b_{i+1}-b_i)r_i\Bigr|+\varepsilon+b_n^{-1}b_1|r_0|+|r_n|.
\]
Letting $n\to\infty$ and then $\varepsilon\to0$ finishes the proof. ⊓⊔

A couple of examples of applications of Kolmogorov's strong law of large numbers will be given in the exercises below.
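Before the exercises, here is a hedged numerical sketch (not part of the notes) of the convergence criterion above: for independent random signs $\varepsilon_i=\pm1$, the series $\sum_i\varepsilon_i/i$ has $\sum_i\mathrm{Var}(\varepsilon_i/i)=\sum_i1/i^2<\infty$, so by Theorem 18 it converges almost surely, to a limit that varies from run to run. NumPy and the specific sample sizes are assumptions of the sketch.

```python
import numpy as np

def random_harmonic(seed, n=10**6):
    # partial sums of sum_i eps_i / i for i.i.d. random signs eps_i = +-1
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=n)
    return np.cumsum(signs / np.arange(1, n + 1))

for seed in range(3):
    s = random_harmonic(seed)
    print(f"run {seed}: S at 10^4 terms = {s[10**4 - 1]:+.5f}, at 10^6 terms = {s[-1]:+.5f}, "
          f"fluctuation after 10^4 terms = {np.ptp(s[10**4 - 1:]):.1e}")
```

The partial sums settle quickly (the fluctuations after $10^4$ terms are tiny), while the limiting value differs across runs, as the 0–1 law and Theorem 18 predict.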

Exercise. Let $\{S_n : n\ge0\}$ be a simple random walk which starts at zero, $S_0=0$, and at each step moves to the right with probability $p$ and to the left with probability $1-p$. Show that the event $\{S_n=0\ \text{i.o.}\}$ has probability 0 or 1. (Hint: Hewitt–Savage 0–1 law.)

Exercise. In the setting of the previous problem, show: (a) if $p\ne1/2$ then $P(S_n=0\ \text{i.o.})=0$; (b) if $p=1/2$ then $P(S_n=0\ \text{i.o.})=1$. Hint: use the fact that, for any $c>0$, the events
\[
\Bigl\{\liminf_{n\to\infty}\frac{S_n}{n^{1/2}}\le-c\Bigr\},\qquad \Bigl\{\limsup_{n\to\infty}\frac{S_n}{n^{1/2}}\ge c\Bigr\}
\]
are symmetric.

Exercise. Suppose that $(X_n)_{n\ge1}$ are i.i.d. with $EX_1=0$ and $EX_1^2=1$. Prove that for $\delta>0$,
\[
\frac{1}{n^{1/2}(\log n)^{1/2+\delta}}\sum_{i=1}^n X_i\to0
\]
almost surely. Hint: use Kolmogorov's strong law of large numbers.

Exercise. Let $(X_n)$ be i.i.d. random variables with continuous distribution $F$. We say that $X_n$ is a record value if $X_n>X_i$ for all $i<n$. Let $I_n$ be the indicator of the event that $X_n$ is a record value.

(a) Show that the random variables $(I_n)_{n\ge1}$ are independent and $P(I_n=1)=1/n$. Hint: if $R_n\in\{1,\ldots,n\}$ is the rank of $X_n$ among the first $n$ random variables $(X_i)_{i\le n}$, prove that the $(R_n)$ are independent.

(b) If $S_n=I_1+\cdots+I_n$ is the number of records up to time $n$, prove that $S_n/\log n\to1$ almost surely. Hint: use Kolmogorov's strong law of large numbers.
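A brief simulation (illustrative only, not a solution to the exercise) of part (b): with Uniform(0,1) samples the number of records $S_n$ should grow like $\log n$. NumPy and the sample size are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10**6
x = rng.random(n)                                      # i.i.d. Uniform(0,1): continuous c.d.f.
prev_max = np.maximum.accumulate(np.concatenate(([-np.inf], x[:-1])))
records = x > prev_max                                 # I_n = 1 iff X_n is a record value
s = np.cumsum(records)

for k in (10**2, 10**4, 10**6):
    print(f"n = {k:>7}: S_n = {int(s[k - 1]):3d}, S_n / log n = {s[k - 1] / np.log(k):.3f}")
```

The ratio $S_n/\log n$ hovers near 1, in line with the almost sure limit claimed in part (b).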

Exercise. Suppose that two sequences of random variables $X_n:\Omega_1\to\mathbb{R}$ and $Y_n:\Omega_2\to\mathbb{R}$ for $n\ge1$, defined on two different probability spaces, have the same distributions in the sense that all their finite-dimensional distributions are the same, $\mathcal{L}((X_i)_{i\le n})=\mathcal{L}((Y_i)_{i\le n})$ for all $n\ge1$. If $X_n$ converges almost surely to some random variable $X$ on $\Omega_1$ as $n\to\infty$, prove that $Y_n$ also converges almost surely on $\Omega_2$.

Section 7

Stopping times, Wald's identity, Markov property. Another proof of the SLLN.

In this section, we will have our first encounter with two concepts, stopping times and the Markov property, in the setting of sums of independent random variables. Later, stopping times will play an important role in the study of martingales, and the Markov property will appear again in the setting of Brownian motion.

Consider a sequence $(X_i)_{i\ge1}$ of independent random variables and an integer-valued random variable $\tau\in\{1,2,\ldots\}$. We say that $\tau$ is independent of the future if $\{\tau\le n\}$ is independent of $\sigma((X_i)_{i\ge n+1})$. Suppose that $\tau$ is independent of the future and $E|X_i|<\infty$ for all $i\ge1$. We can formally write

\[
ES_\tau=\sum_{k\ge1}ES_\tau I(\tau=k)=\sum_{k\ge1}ES_kI(\tau=k)
=\sum_{k\ge1}\sum_{n\le k}EX_nI(\tau=k)\overset{(*)}{=}\sum_{n\ge1}\sum_{k\ge n}EX_nI(\tau=k)=\sum_{n\ge1}EX_nI(\tau\ge n).
\]
In $(*)$ we can interchange the order of summation if, for example, the double sequence is absolutely summable, by the Fubini–Tonelli theorem. Since $\tau$ is independent of the future, the event $\{\tau\ge n\}=\{\tau\le n-1\}^c$ is independent of $\sigma(X_n)$ and we get
\[
ES_\tau=\sum_{n\ge1}EX_n\,P(\tau\ge n). \tag{7.0.1}
\]

    This implies the following.

Theorem 18 (Wald's identity) If $(X_i)_{i\ge1}$ are i.i.d., $E|X_1|<\infty$ and $E\tau<\infty$, then $ES_\tau=EX_1\,E\tau$.

Proof. By (7.0.1) we have
\[
ES_\tau=\sum_{n\ge1}EX_n\,P(\tau\ge n)=EX_1\sum_{n\ge1}P(\tau\ge n)=EX_1\,E\tau.
\]
The reason we can interchange the order of summation in $(*)$ is that, under our assumptions, the double sequence is absolutely summable, since
\[
\sum_{n\ge1}\sum_{k\ge n}E|X_n|I(\tau=k)=\sum_{n\ge1}E|X_n|I(\tau\ge n)=E|X_1|\,E\tau<\infty,
\]
so we can apply the Fubini–Tonelli theorem. ⊓⊔

We say that $\tau$ is a stopping time if $\{\tau\le n\}\in\sigma(X_1,\ldots,X_n)$ for all $n$. Clearly, a stopping time is independent of the future. One example of a stopping time is $\tau=\min\{k\ge1 : S_k\ge1\}$, since

    so we can apply the Fubini-Tonelli theorem. We say that is a stopping time if { n} (X1, . . . ,Xn) for all n. Clearly, a stopping time is independent of thefuture. One example of a stopping time is =min{k 1 : Sk 1}, since

    { n}= kn

    {Sk 1} (X1, . . . ,Xn).
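As a quick sanity check of Wald's identity (illustration only, not from the notes), the following Monte Carlo sketch estimates $ES_\tau$ and $EX_1E\tau$ for exactly this stopping time $\tau=\min\{k\ge1:S_k\ge1\}$. The choice $X_i\sim N(0.5,1)$, the random seed and the number of runs are arbitrary assumptions; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(2)
MEAN = 0.5                                   # EX_1; any positive mean works here

def one_run():
    s, k = 0.0, 0
    while True:                              # tau = min{k >= 1 : S_k >= 1}
        k += 1
        s += rng.normal(loc=MEAN, scale=1.0)
        if s >= 1.0:
            return s, k                      # (S_tau, tau)

runs = [one_run() for _ in range(20000)]
mean_s_tau = np.mean([r[0] for r in runs])
mean_tau = np.mean([r[1] for r in runs])
print(f"estimate of E S_tau      : {mean_s_tau:.3f}")
print(f"estimate of EX_1 * E tau : {MEAN * mean_tau:.3f}")
```

The two estimates agree up to Monte Carlo error, as Wald's identity requires.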


Given a stopping time $\tau$, we would like to describe all events that depend on $\tau$ and the sequence $X_1,\ldots,X_\tau$ up to this stopping time. A formal definition of the $\sigma$-algebra $\mathcal{F}_\tau$ generated by the sequence up to a stopping time $\tau$ is the following:
\[
\mathcal{F}_\tau=\bigl\{A : A\cap\{\tau\le n\}\in\sigma(X_1,\ldots,X_n)\ \text{for all }n\ge1\bigr\}.
\]
When $\tau$ is a stopping time, one can easily check that this is a $\sigma$-algebra, and the meaning of the definition is that, if we know that $\tau\le n$, then the corresponding part of the event $A$ is expressed only in terms of $X_1,\ldots,X_n$.

Theorem 19 (Markov property) Suppose that $(X_i)_{i\ge1}$ are i.i.d. and $\tau$ is a stopping time. Then the sequence $T_\tau=(X_{\tau+1},X_{\tau+2},\ldots)$ is independent of the $\sigma$-algebra $\mathcal{F}_\tau$ and
\[
T_\tau\overset{d}{=}(X_1,X_2,\ldots),
\]
where $\overset{d}{=}$ means equality in distribution.

In words, this means that the sequence $T_\tau=(X_{\tau+1},X_{\tau+2},\ldots)$ after the stopping time is an independent copy of the entire sequence, also independent of everything that happens before the stopping time.

Proof. Consider an event $A\in\mathcal{F}_\tau$ and $B\in\mathcal{B}$, the cylindrical $\sigma$-algebra on $\mathbb{R}^{\mathbb{N}}$. We can write
\[
P\bigl(A\cap\{T_\tau\in B\}\bigr)=\sum_{n\ge1}P\bigl(A\cap\{\tau=n\}\cap\{T_\tau\in B\}\bigr)=\sum_{n\ge1}P\bigl(A\cap\{\tau=n\}\cap\{T_n\in B\}\bigr),
\]
where $T_n=(X_{n+1},X_{n+2},\ldots)$. By the definition of the $\sigma$-algebra $\mathcal{F}_\tau$,
\[
A\cap\{\tau=n\}=A\cap\{\tau\le n\}\setminus A\cap\{\tau\le n-1\}\in\sigma(X_1,\ldots,X_n).
\]
On the other hand, $\{T_n\in B\}\in\sigma(X_{n+1},\ldots)$ and, therefore, is independent of $A\cap\{\tau=n\}$. Using this and the fact that $(X_i)_{i\ge1}$ are i.i.d., so that $T_n\overset{d}{=}X=(X_1,X_2,\ldots)$,
\[
P\bigl(A\cap\{T_\tau\in B\}\bigr)=\sum_{n\ge1}P\bigl(A\cap\{\tau=n\}\bigr)P(T_n\in B)
=\sum_{n\ge1}P\bigl(A\cap\{\tau=n\}\bigr)P(X\in B)=P(A)\,P(X\in B),
\]
and this finishes the proof. ⊓⊔

Let us give one interesting application of the Markov property and Wald's identity that will yield another proof of the Strong Law of Large Numbers.

Theorem 20 Suppose that $(X_i)_{i\ge1}$ are i.i.d. such that $EX_1>0$. If $Z=\inf_{n\ge1}S_n$ then $P(Z>-\infty)=1$.

This means that the partial sums cannot drift down to $-\infty$ if the mean $EX_1>0$. Of course, this is obvious by the strong law of large numbers, but we want to prove it independently, since this will give another proof of the SLLN.

Proof. Let us define (see Fig. 7.1 below)
\[
\tau_1=\min\bigl\{k\ge1 : S_k\ge1\bigr\},\quad Z_1=\min_{k\le\tau_1}S_k,\quad S^{(2)}_k=S_{\tau_1+k}-S_{\tau_1},
\]
\[
\tau_2=\min\bigl\{k\ge1 : S^{(2)}_k\ge1\bigr\},\quad Z_2=\min_{k\le\tau_2}S^{(2)}_k,\quad S^{(3)}_k=S^{(2)}_{\tau_2+k}-S^{(2)}_{\tau_2},
\]
and recursively,
\[
\tau_n=\min\bigl\{k\ge1 : S^{(n)}_k\ge1\bigr\},\quad Z_n=\min_{k\le\tau_n}S^{(n)}_k,\quad S^{(n+1)}_k=S^{(n)}_{\tau_n+k}-S^{(n)}_{\tau_n}.
\]
We mentioned above that $\tau_1$ is a stopping time. It is easy to check that $Z_1$ is $\mathcal{F}_{\tau_1}$-measurable and, by the Markov property, it is independent of the sequence $T_{\tau_1}=(X_{\tau_1+1},X_{\tau_1+2},\ldots)$, which has the same distribution as the original

Figure 7.1: A sequence of stopping times.

sequence. Since $\tau_2$ and $Z_2$ are defined exactly the same way as $\tau_1$ and $Z_1$, only in terms of this new sequence $T_{\tau_1}$, $Z_2$ is an independent copy of $Z_1$. Now, it should be obvious that $(Z_n)_{n\ge1}$ are i.i.d. random variables. Clearly,
\[
Z=\inf_{k\ge1}S_k=\inf\bigl\{Z_1,\ S_{\tau_1}+Z_2,\ S_{\tau_1+\tau_2}+Z_3,\ \ldots\bigr\},
\]

and, since, by construction, $S_{\tau_1+\cdots+\tau_{k-1}}\ge k-1$,
\[
Z\ge\inf_{k\ge1}\bigl((k-1)+Z_k\bigr).
\]
The $(Z_k)$ are i.i.d. with $E|Z_1|\le E\sum_{i\le\tau_1}|X_i|=E|X_1|\,E\tau_1<\infty$ (Wald's identity applied to the $|X_i|$, together with $E\tau_1<\infty$, which is the first exercise below), so $Z_k/k\to0$ almost surely and $(k-1)+Z_k\to+\infty$. Therefore, the infimum above is finite and $P(Z>-\infty)=1$. ⊓⊔

This yields another proof of the strong law of large numbers: if $(X_i)_{i\ge1}$ are i.i.d. with $E|X_1|<\infty$, then $S_n/n\to EX_1$ almost surely. Without loss of generality, assume that $EX_1=0$. Given $\varepsilon>0$, we define $X_i^\varepsilon=X_i+\varepsilon$ so that $EX_1^\varepsilon=\varepsilon>0$. By the above result, $\inf_{n\ge1}(S_n+\varepsilon n)>-\infty$ with probability one. This means that for all $n\ge1$, $S_n+\varepsilon n\ge M>-\infty$ for some random variable $M$. Dividing both sides by $n$ and letting $n\to\infty$ we get
\[
\liminf_{n\to\infty}\frac{S_n}{n}\ge-\varepsilon
\]
with probability one. We can then let $\varepsilon\to0$ over some sequence. Similarly, we prove that
\[
\limsup_{n\to\infty}\frac{S_n}{n}\le0
\]
with probability one, which finishes the proof.

Exercise. Let $(X_i)_{i\ge1}$ be i.i.d. and $EX_1>0$. Given $a>0$, show that $E\tau<\infty$ for $\tau=\min\{n\ge1 : S_n>a\}$. (Hint: truncate the $X_i$'s and use Wald's identity.)

Exercise. Let $S_0=0$, $S_n=\sum_{i=1}^n X_i$ be a random walk with i.i.d. $(X_i)$, $P(X_i=+1)=p$ and $P(X_i=-1)=1-p=q$ for $p>1/2$. Consider an integer $b\ge1$ and let $\tau=\min\{n\ge1 : S_n=b\}$. Show that, for $0<s\le1$,
\[
Es^\tau=\Bigl(\frac{1-(1-4pqs^2)^{1/2}}{2qs}\Bigr)^{b}
\]
and compute $E\tau$.

Exercise. Suppose that we play a game with i.i.d. outcomes $(X_n)_{n\ge1}$ such that $E|X_1|<\infty$. Each round we pay an entrance fee $c>0$, so after $n$ rounds our total profit (or loss) is
\[
Y_n=\max_{1\le m\le n}X_m-cn.
\]
In this problem we will find the best strategy to play the game, in some sense.

1. Given $a\in\mathbb{R}$, let $p=P(X_1>a)>0$ and consider the stopping time $T=\inf\{n : X_n>a\}$. Compute $EY_T$. (Hint: sum over the sets $\{T=n\}$.)

2. Consider $\gamma$ such that $E(X_1-\gamma)^+=c$. For $a=\gamma$ show that $EY_T=\gamma$.

3. Show that $Y_n\le\gamma+\sum_{m=1}^n\bigl((X_m-\gamma)^+-c\bigr)$ (for any $\gamma$, actually).

4. Use Wald's identity to conclude that for any stopping time $\tau$ such that $E\tau<\infty$ we have $EY_\tau\le\gamma$.

This means that stopping at time $T$ results in the best expected profit $\gamma$.

Section 8

Convergence of Laws.

In this section we begin the discussion of weak convergence of distributions on metric spaces. Let $(S,d)$ be a metric space with the metric $d$. Consider a measurable space $(S,\mathcal{B})$ with the Borel $\sigma$-algebra $\mathcal{B}$ generated by open sets, and let $(P_n)_{n\ge1}$ and $P$ be some probability distributions on $\mathcal{B}$. We define
\[
C_b(S)=\bigl\{f:S\to\mathbb{R}\ \text{continuous and bounded}\bigr\}.
\]
We say that $P_n\to P$ weakly if
\[
\lim_{n\to\infty}\int_S f\,dP_n=\int_S f\,dP \quad\text{for all } f\in C_b(S). \tag{8.0.1}
\]
This is often denoted by $P_n\overset{d}{\to}P$ or $P_n\Rightarrow P$. Of course, the idea here is that the measure $P$ on the Borel $\sigma$-algebra is determined uniquely by the integrals of $f\in C_b(S)$, so we compare closeness of the measures by closeness of all these integrals. To see this, suppose that $\int f\,dP=\int f\,dQ$ for all $f\in C_b(S)$. Consider any open set $U$ in $S$ and let $F=U^c$. Using that $d(x,F)=0$ if and only if $x\in F$ (because $F$ is closed), it is easy to see that
\[
f_m(x)=\min\bigl(1,\,m\,d(x,F)\bigr)\uparrow I(x\in U) \ \text{ as } m\to\infty.
\]
Since $f_m\in C_b(S)$, by the monotone convergence theorem we get that $P(U)=Q(U)$ and, by Dynkin's theorem, $P=Q$. We will also say that random variables $X_n\to X$ in distribution if their laws converge weakly. Notice that in this definition the random variables need not be defined on the same probability space, as long as they take values in the same metric space $S$. We will come back to the study of convergence on general metric spaces later in the course, and in this section we will prove only one general result, the Selection Theorem. Other results will be proved only on $\mathbb{R}$ or $\mathbb{R}^n$ to prepare us for the most famous example of convergence of laws, the Central Limit Theorem (CLT). First of all, let us notice that on the real line the convergence of probability measures can be expressed in terms of their c.d.f.s, as follows.

Theorem 24 If $S=\mathbb{R}$ then $P_n\to P$ weakly if and only if $F_n(t)=P_n((-\infty,t])\to F(t)=P((-\infty,t])$ for any point of continuity $t$ of the c.d.f. $F$.

Proof. ($\Longrightarrow$) Suppose that (8.0.1) holds. Let us approximate the indicator $I(x\le t)$ by continuous functions so that
\[
I(x\le t-\delta)\le\varphi_1(x)\le I(x\le t)\le\varphi_2(x)\le I(x\le t+\delta),
\]
as in Fig. 8.1 below. Obviously, $\varphi_1,\varphi_2\in C_b(\mathbb{R})$. Then, using (8.0.1) for $\varphi_1$ and $\varphi_2$,

Figure 8.1: Approximating an indicator.

\[
F(t-\delta)\le\int\varphi_1\,dF=\lim_{n\to\infty}\int\varphi_1\,dF_n\le\lim_{n\to\infty}F_n(t)\le\lim_{n\to\infty}\int\varphi_2\,dF_n=\int\varphi_2\,dF\le F(t+\delta).
\]
Therefore, for any $\delta>0$,
\[
F(t-\delta)\le\lim_{n\to\infty}F_n(t)\le F(t+\delta).
\]
More carefully, we should write $\liminf$ and $\limsup$ but, since $t$ is a point of continuity of $F$, letting $\delta\to0$ proves that the limit $\lim_{n\to\infty}F_n(t)$ exists and is equal to $F(t)$.

($\Longleftarrow$) Let $PC(F)$ be the set of points of continuity of $F$. Since $F$ is monotone, it has at most countably many discontinuities, so the set $PC(F)$ is dense in $\mathbb{R}$. Take $M$ large enough such that both $-M,M\in PC(F)$ and $P((-M,M]^c)\le\varepsilon$. Clearly, for large enough $n\ge1$ we have $P_n((-M,M]^c)\le2\varepsilon$. For any $k>1$, consider a sequence of points $-M=x_{k1}\le x_{k2}\le\cdots\le x_{kk}=M$ such that all $x_{ki}\in PC(F)$ and $\max_i|x_{k,i+1}-x_{ki}|\to0$ as $k\to\infty$. Given a function $f\in C_b(\mathbb{R})$, consider the approximating function

\[
f_k(x)=\sum_{1\le i<k}f(x_{ki})\,I\bigl(x_{ki}<x\le x_{k,i+1}\bigr).
\]
Since $f$ is uniformly continuous on $[-M,M]$, $\sup_{-M<x\le M}|f(x)-f_k(x)|\to0$ as $k\to\infty$, and, because all $x_{ki}$ are points of continuity of $F$,
\[
\int f_k\,dP_n=\sum_{1\le i<k}f(x_{ki})\bigl(F_n(x_{k,i+1})-F_n(x_{ki})\bigr)\to\sum_{1\le i<k}f(x_{ki})\bigl(F(x_{k,i+1})-F(x_{ki})\bigr)=\int f_k\,dP
\]
as $n\to\infty$. Since, moreover, the contribution of $(-M,M]^c$ to $\int f\,dP_n$ and $\int f\,dP$ is at most $2\varepsilon\|f\|_\infty$ and $\varepsilon\|f\|_\infty$ respectively, combining these estimates and letting first $n\to\infty$, then $k\to\infty$ and finally $\varepsilon\to0$, proves that $\int f\,dP_n\to\int f\,dP$. ⊓⊔
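To make the role of continuity points concrete, here is a tiny illustration (not from the notes, plain Python, no external libraries): for $P_n=\delta_{1/n}$ we have $P_n\Rightarrow\delta_0$, yet $F_n(0)=0$ does not converge to $F(0)=1$ at the discontinuity point $t=0$, which is exactly why Theorem 24 excludes such points.

```python
def F_n(t, n):
    # c.d.f. of the point mass at 1/n
    return 1.0 if t >= 1.0 / n else 0.0

def F(t):
    # c.d.f. of the point mass at 0 (the weak limit)
    return 1.0 if t >= 0.0 else 0.0

for t in (-0.5, 0.0, 0.3):
    vals = [F_n(t, n) for n in (1, 10, 100, 1000)]
    kind = "discontinuity point of F" if t == 0.0 else "continuity point"
    print(f"t = {t:+.1f} ({kind}): F_n(t) -> {vals}, while F(t) = {F(t)}")
```

At every continuity point the values $F_n(t)$ agree with $F(t)$ for large $n$, while at $t=0$ they stay at 0 even though the laws converge weakly.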

We say that a sequence of distributions $(P_n)_{n\ge1}$ on a metric space $(S,d)$ is uniformly tight if, for any $\varepsilon>0$, there exists a compact $K\subseteq S$ such that $P_n(K)\ge1-\varepsilon$ for all $n$. Checking this property can be difficult and, of course, one needs to understand what the compact sets look like in a particular metric space. In the case of the CLT on $\mathbb{R}^n$, this will be a trivial task. The following fact is quite fundamental.

Theorem 25 (Selection Theorem) If $(P_n)_{n\ge1}$ is a uniformly tight sequence of laws on the metric space $(S,d)$ then there exists a subsequence $(n(k))$ such that $P_{n(k)}$ converges weakly to some probability law $P$.

    Let us recall the following well-known result.

Lemma 19 (Cantor's diagonalization) Let $A$ be a countable set and $f_n:A\to\mathbb{R}$, $n\ge1$. Then there exists a subsequence $(n(k))$ such that $f_{n(k)}(a)$ converges for all $a\in A$, possibly to $\pm\infty$.

Proof. Let $A=\{a_1,a_2,\ldots\}$. Take $(n_1(k))$ such that $f_{n_1(k)}(a_1)$ converges. Take $(n_2(k))\subseteq(n_1(k))$ such that $f_{n_2(k)}(a_2)$ converges. Recursively, take $(n_l(k))\subseteq(n_{l-1}(k))$ such that $f_{n_l(k)}(a_l)$ converges. Now consider the diagonal sequence $(n_k(k))$. Clearly, $f_{n_k(k)}(a_l)$ converges for any $l$ because, for $k\ge l$, $n_k(k)\in\{n_l(k)\}$ by construction. ⊓⊔

Proof of Theorem 25. We will prove the Selection Theorem for arbitrary metric spaces, since this result will be useful to us later when we study the convergence of laws on general metric spaces. However, when $S=\mathbb{R}$ one can see this in a much more intuitive way, as follows.

(The case $S=\mathbb{R}$.) Let $A$ be a countable dense set of points in $\mathbb{R}$. Given a sequence of probability measures $P_n$ on $\mathbb{R}$ and their c.d.f.s $F_n$, by Cantor's diagonalization, there exists a subsequence $(n(k))$ such that $F_{n(k)}(a)\to F(a)$ for all $a\in A$. For $x\in\mathbb{R}\setminus A$, we can extend $F$ by
\[
F(x)=\inf\bigl\{F(a) : x<a,\ a\in A\bigr\}.
\]
Obviously, $F(x)$ is non-decreasing but not necessarily right-continuous. However, at the points of discontinuity, we can redefine it to be right-continuous. Then $F(x)$ will be a cumulative distribution function, since the fact that the $P_n$ are uniformly tight ensures that $F(x)\to0$ or $1$ as $x\to-\infty$ or $+\infty$. In order to prove weak convergence of $P_{n(k)}$ to the measure $P$ with the c.d.f. $F$, let $x$ be a point of continuity of $F(x)$ and let $a,b\in A$ be such that $a<x<b$. We have
\[
F(a)=\lim_{k\to\infty}F_{n(k)}(a)\le\liminf_{k\to\infty}F_{n(k)}(x)\le\limsup_{k\to\infty}F_{n(k)}(x)\le\lim_{k\to\infty}F_{n(k)}(b)=F(b).
\]
Since $x$ is a point of continuity and $A$ is dense,
\[
\lim_{A\ni a\uparrow x}F(a)=F(x),\qquad \lim_{A\ni b\downarrow x}F(b)=F(x),
\]
and this proves that $F_{n(k)}(x)\to F(x)$ for all such $x$. By Theorem 24, this means that the laws $P_{n(k)}$ converge to $P$. Let us now prove the case of general metric spaces.

(The general case.) If $K$ is compact then, obviously, $C_b(K)=C(K)$. Later in these lectures, when we deal in more detail with convergence on general metric spaces, we will prove the following fact, which is well known and is a consequence of the Stone–Weierstrass theorem:

$C(K)$ is separable with respect to the norm $\|f\|_\infty=\sup_{x\in K}|f(x)|$.

Since the $P_n$ are uniformly tight, for any $r\ge1$ we can find a compact $K_r$ such that $P_n(K_r)>1-1/r$ for all $n\ge1$. Let $C_r\subseteq C(K_r)$ be a countable dense subset of $C(K_r)$. By Cantor's diagonalization, there exists a subsequence $(n(k))$ such that $P_{n(k)}(f)$ converges for all $f\in C_r$, for all $r\ge1$. Since $C_r$ is dense in $C(K_r)$, this implies that $P_{n(k)}(f)$ converges for all $f\in C(K_r)$, for all $r\ge1$. Next, for any $f\in C_b(S)$,
\[
\Bigl|\int f\,dP_{n(k)}-\int_{K_r}f\,dP_{n(k)}\Bigr|\le\int_{K_r^c}|f|\,dP_{n(k)}\le\|f\|_\infty P_{n(k)}(K_r^c)\le\frac{\|f\|_\infty}{r}.
\]
This implies that the limit
\[
I(f):=\lim_{k\to\infty}\int f\,dP_{n(k)} \tag{8.0.2}
\]

exists. The question is: why is this limit an integral $I(f)=\int f\,dP$ for some probability measure $P$? Basically, this is a consequence of the Riesz representation theorem for positive linear functionals on locally compact Hausdorff spaces. One can define the functional $I_r(f)$ on each compact $K_r$, find a measure on it using the Riesz representation theorem, check that these measures agree on the intersections, and extend them to a probability measure on their union. However, instead, we will use a more general version of the Riesz representation theorem, the Stone–Daniell theorem from measure theory, which states the following.

Given a set $S$, a family of functions $\mathcal{L}=\{f:S\to\mathbb{R}\}$ is called a vector lattice if
\[
f,g\in\mathcal{L}\ \Longrightarrow\ cf+g\in\mathcal{L}\ \text{for }c\in\mathbb{R},\ \text{and}\ f\vee g,\ f\wedge g\in\mathcal{L}.
\]
A vector lattice $\mathcal{L}$ is called a Stone vector lattice if $f\wedge1\in\mathcal{L}$ for any $f\in\mathcal{L}$. For example, any vector lattice that contains the constants is automatically a Stone vector lattice.

A functional $I:\mathcal{L}\to\mathbb{R}$ is called a pre-integral if

1. $I(cf+g)=cI(f)+I(g)$,

2. $f\ge0\ \Longrightarrow\ I(f)\ge0$,

3. $f_n\downarrow0$, $\|f_1\|_\infty<\infty\ \Longrightarrow\ I(f_n)\downarrow0$.

See R. M. Dudley, "Real Analysis and Probability", for a proof of the following:

(Stone–Daniell theorem) If $\mathcal{L}$ is a Stone vector lattice and $I$ is a pre-integral on $\mathcal{L}$ then $I(f)=\int f\,d\mu$ for some unique measure $\mu$ on the minimal $\sigma$-algebra on which all functions in $\mathcal{L}$ are measurable.

We will use this theorem with $\mathcal{L}=C_b(S)$ and $I$ defined in (8.0.2). The first two properties are obvious. To prove the third one, let us consider a sequence such that
\[
f_n\downarrow0,\qquad 0\le f_n(x)\le f_1(x)\le\|f_1\|_\infty.
\]
On any compact $K_r$, $f_n\downarrow0$ uniformly (by Dini's theorem), i.e.
\[
\delta_{n,r}:=\sup_{x\in K_r}f_n(x)\to0\ \text{ as } n\to\infty.
\]
Since
\[
\int f_n\,dP_{n(k)}=\int_{K_r}f_n\,dP_{n(k)}+\int_{K_r^c}f_n\,dP_{n(k)}\le\delta_{n,r}+\frac{1}{r}\|f_1\|_\infty,
\]
we get
\[
I(f_n)=\lim_{k\to\infty}\int f_n\,dP_{n(k)}\le\delta_{n,r}+\frac{1}{r}\|f_1\|_\infty.
\]
Letting $n\to\infty$ and then $r\to\infty$, we get that $I(f_n)\downarrow0$. By the Stone–Daniell theorem,
\[
I(f)=\int f\,dP
\]
for some unique measure $P$ on $\sigma(C_b(S))$. The choice of $f=1$ shows that $I(f)=1=P(S)$, which means that $P$ is a probability measure. Finally, let us show that $\sigma(C_b(S))$ is the Borel $\sigma$-algebra $\mathcal{B}$ generated by open sets. Since any $f\in C_b(S)$ is measurable on $\mathcal{B}$, we get $\sigma(C_b(S))\subseteq\mathcal{B}$. On the other hand, let $F\subseteq S$ be any closed set and take the function $f(x)=\min(1,d(x,F))$. We have $|f(x)-f(y)|\le d(x,y)$, so $f\in C_b(S)$ and
\[
f^{-1}(\{0\})\in\sigma(C_b(S)).
\]
However, since $F$ is closed, $f^{-1}(\{0\})=\{x : d(x,F)=0\}=F$, and this proves that $\mathcal{B}\subseteq\sigma(C_b(S))$. ⊓⊔

Conversely, the following holds (we will prove this result later for any complete separable metric space).

Theorem 26 If $P_n$ converges weakly to $P$ on $\mathbb{R}^k$ then $(P_n)_{n\ge1}$ is uniformly tight.

Proof. For any $\varepsilon>0$, there exists a large enough $M>0$ such that $P(|x|>M)<\varepsilon$. Consider the function
\[
\varphi(s)=\begin{cases}0, & s\le M,\\ 1, & s\ge2M,\\ (s-M)/M, & M\le s\le2M,\end{cases}
\]
and let $\psi(x):=\varphi(|x|)$ for $x\in\mathbb{R}^k$. Since $P_n\to P$ weakly,
\[
\limsup_{n\to\infty}P_n\bigl(|x|>2M\bigr)\le\limsup_{n\to\infty}\int\psi(x)\,dP_n(x)=\int\psi(x)\,dP(x)\le P\bigl(|x|>M\bigr)\le\varepsilon.
\]
For $n$ large enough, $n\ge n_0$, we get $P_n(|x|>2M)\le2\varepsilon$. For $n<n_0$ choose $M_n$ so that $P_n(|x|>M_n)\le2\varepsilon$. Take $M'=\max\{M_1,\ldots,M_{n_0-1},2M\}$. As a result, $P_n(|x|>M')\le2\varepsilon$ for all $n\ge1$. ⊓⊔

Finally, let us relate convergence in distribution to other forms of convergence. Consider random variables $X$ and $X_n$ on some probability space $(\Omega,\mathcal{A},P)$ with values in a metric space $(S,d)$. Let $P$ and $P_n$ be their corresponding laws on the Borel sets $\mathcal{B}$ in $S$. Convergence of $X_n$ to $X$ in probability and almost surely is defined exactly the same way as for $S=\mathbb{R}$, by replacing $|X_n-X|$ with $d(X_n,X)$.

Lemma 20 $X_n\to X$ in probability if and only if for any subsequence $(n(k))$ there exists a further subsequence $(n(k(r)))$ such that $X_{n(k(r))}\to X$ almost surely.

Proof. ($\Longleftarrow$) Suppose $X_n$ does not converge to $X$ in probability. Then, for some small $\varepsilon>0$, there exists a subsequence $(n(k))_{k\ge1}$ such that $P(d(X,X_{n(k)})\ge\varepsilon)\ge\varepsilon$. This contradicts the existence of a further subsequence $X_{n(k(r))}$ that converges to $X$ almost surely.

($\Longrightarrow$) Given a subsequence $(n(k))$, let us choose $(k(r))$ so that
\[
P\Bigl(d(X_{n(k(r))},X)\ge\frac{1}{r}\Bigr)\le\frac{1}{r^2}.
\]
By the Borel–Cantelli lemma, these events can occur i.o. only with probability 0, which means that with probability one, for large enough $r$,
\[
d(X_{n(k(r))},X)\le\frac{1}{r},
\]
i.e. $X_{n(k(r))}\to X$ almost surely. ⊓⊔
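The following classical example (not in the notes, added only as an illustration of Lemma 20) shows why, in general, only subsequences converge almost surely: the "typewriter" indicators of dyadic intervals converge to 0 in probability but not almost surely, while the subsequence taking one interval per dyadic level does converge. NumPy is used only for sampling; the enumeration is an assumption of the sketch.

```python
import numpy as np

def x_n(n, u):
    # "typewriter" sequence: write n = 2^k + j with 0 <= j < 2^k, and let
    # X_n = I(u in [j/2^k, (j+1)/2^k)); then P(X_n = 1) = 2^(-k) -> 0.
    k = int(np.floor(np.log2(n)))
    j = n - 2**k
    return 1 if j / 2**k <= u < (j + 1) / 2**k else 0

u = np.random.default_rng(3).random()          # one fixed sample point omega
ones = [n for n in range(1, 2**12) if x_n(n, u) == 1]
print("X_n(omega) = 1 for n in", ones, "-- one index per dyadic level, so i.o.")
print("along the subsequence n(k) = 2^k:", [x_n(2**k, u) for k in range(1, 12)])
```

For every fixed $\omega$ the sequence hits 1 infinitely often (no almost sure convergence), yet along $n(k)=2^k$ it is eventually 0, exactly the subsequence behaviour described by the lemma.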

Lemma 21 If $X_n\to X$ in probability then $X_n\to X$ in distribution.

Proof. By Lemma 20, for any subsequence $(n(k))$ there exists a further subsequence $(n(k(r)))$ such that $X_{n(k(r))}\to X$ almost surely. Given $f\in C_b(S)$, by the dominated convergence theorem,
\[
Ef(X_{n(k(r))})\to Ef(X),
\]
i.e. $X_{n(k(r))}\to X$ weakly. By Lemma 18, $X_n\to X$ in distribution. ⊓⊔

Exercise. Let $X_n$ be random variables on the same probability space with values in a metric space $S$. If for some point $s\in S$, $X_n\to s$ in distribution, show that $X_n\to s$ in probability.

Exercise. For the following sequences of laws $P_n$ on $\mathbb{R}$ having densities $f_n$, which are uniformly tight? (a) $f_n=I(0\le x\le n)/n$, (b) $f_n=ne^{-nx}I(x\ge0)$, (c) $f_n=e^{-x/n}/n\cdot I(x\ge0)$.

Exercise. Suppose that $X_n\to X$ in distribution on $\mathbb{R}$ and $Y_n\to c\in\mathbb{R}$ in probability. Show that $X_nY_n\to cX$ in distribution, assuming that $X_n,Y_n$ are defined on the same probability space.

Exercise. Suppose that random variables $(X_n)$ are independent and $X_n\to X$ in probability. Show that $X$ is almost surely constant.

  • Section 9

    Characteristic functions.

    In the next section, we will prove one of the most classical results in Probability Theory the central limit theorem and one of the main tools will be the so-called characteristic functions. Let X = (X1, . . . ,Xk) be a random vector on Rkwith the distribution P and let t = (t1, . . . , tk) Rk. The characteristic function of X is defined by

    f (t) = Eei(t,X) =ei(t,x) dP(x).

    In this section, we will collect various important and useful facts about the characteristic functions. The standardnormal distribution N(0,1) on R with the density

    p(x) =12

    ex2/2

    will play a central role, so let us start by computing its characteristic function. First of all, notice that this is indeed adensity, since 1

    2

    Rex

    2/2 dx2

    =12

    R2e(x

    2+y2)/2 dxdy=12

    20

    0

    er2/2r drd = 1.

    If X has the standard normal distribution N(0,1) then, obviously, EX = 0 and

    Var(X) = EX2 =12

    Rx2ex

    2/2 dx=12

    Rxd(ex2/2) = 1,

    by integration by parts. To motivate the computation of the characteristic function, let us first notice that for R,

    EeX =12

    ex

    x22 dx= e

    22

    12

    e(x )2

    2 dx= e22

    12

    ex22 dx= e

    22 .

For complex $\lambda=it$, we begin similarly by completing the square,
\[
Ee^{itX}=e^{-\frac{t^2}{2}}\frac{1}{\sqrt{2\pi}}\int e^{-\frac{(x-it)^2}{2}}\,dx=e^{-\frac{t^2}{2}}\int_{-it+\mathbb{R}}\varphi(z)\,dz,
\]
where we denoted
\[
\varphi(z)=\frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}}\ \text{ for } z\in\mathbb{C}.
\]
Since $\varphi$ is analytic, by Cauchy's theorem, the integral over a closed path is equal to 0. Let us take the closed path: $-it+x$ for $x$ from $-M$ to $+M$, then $M+iy$ for $y$ from $-t$ to $0$, then $x$ from $M$ to $-M$ and, finally, $-M+iy$ for $y$ from $0$ to $-t$. For large $M$, the function $\varphi(z)$ is very small on the vertical segments $\pm M+iy$, so letting $M\to\infty$ we get
\[
\int_{-it+\mathbb{R}}\varphi(z)\,dz=\int_{\mathbb{R}}\varphi(z)\,dz=1,
\]
and we proved that
\[
f(t)=Ee^{itX}=e^{-\frac{t^2}{2}}. \tag{9.0.1}
\]

It is also easy to check that $Y=\sigma X+\mu$ has mean $\mu=EY$, variance $\sigma^2=\mathrm{Var}(Y)$ and the density
\[
\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Bigl(-\frac{(x-\mu)^2}{2\sigma^2}\Bigr).
\]
This is the so-called normal distribution $N(\mu,\sigma^2)$ and its characteristic function is given by
\[
Ee^{itY}=Ee^{it(\mu+\sigma X)}=e^{it\mu-t^2\sigma^2/2}. \tag{9.0.2}
\]
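A numerical sanity check (illustration only, not part of the notes) of (9.0.1)–(9.0.2): the empirical characteristic function of $N(\mu,\sigma^2)$ samples should be close to $e^{it\mu-t^2\sigma^2/2}$. NumPy, the parameter values and the sample size are arbitrary assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n = 1.0, 2.0, 10**6
y = mu + sigma * rng.standard_normal(n)        # samples of Y ~ N(mu, sigma^2)

for t in (0.3, 1.0, 2.0):
    empirical = np.mean(np.exp(1j * t * y))    # empirical characteristic function at t
    exact = np.exp(1j * t * mu - t**2 * sigma**2 / 2.0)
    print(f"t = {t}: |empirical - exact| = {abs(empirical - exact):.2e}")
```

The discrepancies are of order $n^{-1/2}$, consistent with the Monte Carlo error of the empirical characteristic function.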

    Our next very important observation is that integrability of X is related to smoothness of its characteristic function.

Lemma 22 If $X$ is a real-valued random variable such that $E|X|^r<\infty$ for an integer $r\ge1$ then $f(t)\in C^r(\mathbb{R})$ and $f^{(j)}(t)=E(iX)^je^{itX}$ for $j\le r$.

Proof. If $r=0$, we use the fact that $|e^{itX}|\le1$ to conclude that
\[
f(t)=Ee^{itX}\to Ee^{isX}=f(s)\ \text{ as } t\to s,
\]
by the dominated convergence theorem. This means that $f\in C(\mathbb{R})$, i.e. characteristic functions are always continuous. If $r=1$, $E|X|<\infty$, we can use
\[
\Bigl|\frac{e^{itX}-e^{isX}}{t-s}\Bigr|\le|X|
\]
and, therefore, by the dominated convergence theorem,
\[
f'(t)=\lim_{s\to t}E\,\frac{e^{itX}-e^{isX}}{t-s}=EiXe^{itX}.
\]
Also, by the dominated convergence theorem, $EiXe^{itX}\in C(\mathbb{R})$, which means that $f\in C^1(\mathbb{R})$. We proceed by induction. Suppose that we have proved that
\[
f^{(j)}(t)=E(iX)^je^{itX}
\]
and that $r=j+1$, $E|X|^{j+1}<\infty$. Then we can use that
\[
\Bigl|\frac{(iX)^je^{itX}-(iX)^je^{isX}}{t-s}\Bigr|\le|X|^{j+1},
\]
so that, by the dominated convergence theorem, $f^{(j+1)}(t)=E(iX)^{j+1}e^{itX}\in C(\mathbb{R})$. ⊓⊔

Next, we want to show that the characteristic function uniquely determines the distribution. This is usually proved using convolutions. Let $X$ and $Y$ be two independent random vectors on $\mathbb{R}^k$ with the distributions $P$ and $Q$. We denote by $P*Q$ the convolution of $P$ and $Q$, which is the distribution $\mathcal{L}(X+Y)$ of the sum $X+Y$. We have,