Conditional expectation and the Central Limit Theorem



Extra material for MI 2014/15
Ernst Hansen, Niels Richard Hansen, University of Copenhagen

    Contents

1 Conditional expectations
1.1 Problems
2 The Central Limit Theorem
2.1 Introduction
2.2 The Central Limit Theorem
2.3 Applying the Central Limit Theorem
2.4 Lindeberg's proof of the CLT
2.5 Problems

    1 Conditional expectations

If X is a real-valued random variable defined on the probability space (Ω, F, P), and if D ⊆ F is a sub-σ-algebra, we can ask the following question: What is the best D-measurable approximation of the random variable X? If X is already D-measurable, then X itself may be the most obvious choice, but if X is not D-measurable, can we define a best D-measurable approximation? It may not be clear why this is an interesting question, but think about the following example. With X_1, ..., X_n being n real-valued random variables we let

    S = X_1 + ... + X_n

denote their sum, and we let D = σ(S) denote the σ-algebra generated by S. Then any D-measurable random variable is, in fact, a function, g(S), of S. If we observe the sum S but not the individual random variables X_1, ..., X_n, we might be interested in computing the best prediction of each X_i in terms of the observed sum S. This is equivalent to computing the best D-measurable approximation of X_i. One could suggest that the average S/n is the best prediction given the sum, but how can this be justified? This chapter provides the framework for making it mathematically precise what we mean by best approximation, and some computational rules are also given. Problem 1.8 deals with the prediction of X_i given the sum S under specific distributional assumptions on X_1, ..., X_n.
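As a first, purely numerical illustration (an added sketch; the exponential distribution, the binning scheme and all names below are arbitrary choices made only for this example), one can simulate the variables, group the outcomes according to the value of S, and compare the average of X_1 within each group to S/n.

    import numpy as np

    # Monte Carlo sanity check: the average of X_1 over outcomes with S in a
    # narrow bin should be close to S/n (here with i.i.d. exponential X_i).
    rng = np.random.default_rng(0)
    n, reps = 5, 200_000
    X = rng.exponential(scale=1.0, size=(reps, n))
    S = X.sum(axis=1)

    bins = np.quantile(S, np.linspace(0, 1, 21))       # 20 bins with equal counts
    idx = np.digitize(S, bins[1:-1])
    for k in range(20):
        sel = idx == k
        print(f"bin {k:2d}: mean X_1 = {X[sel, 0].mean():6.3f}, "
              f"mean S/n = {(S[sel] / n).mean():6.3f}")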



If X has second moment then X ∈ L²(Ω, F, P), and since L²(Ω, D, P) ⊆ L²(Ω, F, P), we might suggest that the best approximation of X in L²(Ω, D, P) is the orthogonal projection of X onto L²(Ω, D, P). This is indeed a sensible idea, but there are a few annoying technical complications. First, L²(Ω, F, P) is not a Hilbert space since the norm is only a pseudo-norm, and we have to collect the random variables into equivalence classes to get a Hilbert space. This is a nuisance, and is really a digression. Projections can be defined, as we will see in the proof below, in L²(Ω, F, P) without reference to equivalence classes. Second, it is convenient to extend the projections to work if X has just first moment.

Theorem 1 (Orthogonal projections in L²). Assume that D ⊆ F is a sub-σ-algebra. There exists a map

    pr : L²(Ω, F, P) → L²(Ω, D, P),

which satisfies

    ∫_D X dP = ∫_D pr(X) dP for every D ∈ D    (1)

and

    ∫ |pr(X) − pr(X')| dP ≤ ∫ |X − X'| dP.    (2)

Proof: Let

    α = inf_{Y ∈ L²(Ω,D,P)} ∫ (X − Y)² dP,

and choose a sequence Y_n ∈ L²(Ω, D, P) such that ∫ (X − Y_n)² dP → α. The parallelogram identity gives

    ∫ (X − ½(Y_n + Y_m))² dP + ∫ (½(Y_n − Y_m))² dP = ½ ( ∫ (X − Y_n)² dP + ∫ (X − Y_m)² dP ).

Since ½(Y_n + Y_m) ∈ L²(Ω, D, P) we have

    α ≤ ∫ (X − ½(Y_n + Y_m))² dP,

and this implies that

    α + ¼ ∫ (Y_n − Y_m)² dP ≤ ½ ( ∫ (X − Y_n)² dP + ∫ (X − Y_m)² dP ),

with the right hand side converging to α for n, m → ∞. This shows that Y_1, Y_2, ... is a Cauchy sequence in L²(Ω, D, P). By the Riesz-Fischer theorem, L²(Ω, D, P) is complete and the Cauchy sequence has a limit, that is, Y_n → Y in L²(Ω, D, P) for n → ∞. We define the map pr by pr(X) = Y.

For D ∈ D define c = sgn( ∫_D X − pr(X) dP ), where sgn is the sign function defined by

    sgn(x) = 1 if x > 0, 0 if x = 0, −1 if x < 0,

and consider the variable Y + cε1_D ∈ L²(Ω, D, P) for ε > 0. Since ∫ (X − Y)² dP = α by the L² convergence Y_n → Y, the definition of α gives

    α ≤ ∫ (X − Y − cε1_D)² dP ≤ α − 2ε | ∫_D X − pr(X) dP | + ε² P(D),

so that 2 | ∫_D X − pr(X) dP | ≤ ε P(D). Since ε > 0 was arbitrary, the left hand side must be 0, and this proves (1).

To prove the final inequality define

    D⁺ = (pr(X) − pr(X') ≥ 0)

and D⁻ likewise, such that

    |pr(X) − pr(X')| = 1_{D⁺}(pr(X) − pr(X')) − 1_{D⁻}(pr(X) − pr(X')).

Observing that D± ∈ D we get, using (1), that

    ∫ |pr(X) − pr(X')| dP = ∫_{D⁺} pr(X) dP − ∫_{D⁺} pr(X') dP − ∫_{D⁻} pr(X) dP + ∫_{D⁻} pr(X') dP
                          = ∫_{D⁺} X dP − ∫_{D⁺} X' dP − ∫_{D⁻} X dP + ∫_{D⁻} X' dP
                          = ∫_{D⁺} X − X' dP − ∫_{D⁻} X − X' dP
                          ≤ ∫_{D⁺} |X − X'| dP + ∫_{D⁻} |X − X'| dP
                          = ∫ |X − X'| dP.


The previous theorem establishes just enough knowledge about the orthogonal projection map pr for random variables with second moment to be able to extend the map to random variables with first moment. Observe that (2) means that pr is continuous in the L¹-norm, and that this is really the essential property we use below for extending the map from L² to L¹.

Theorem 2. Let X be a real-valued random variable, defined on a background space (Ω, F, P). Let D ⊆ F be a sub-σ-algebra. If E|X| < ∞ there exists an integrable, D-measurable random variable Y such that

    ∫_D X dP = ∫_D Y dP for every D ∈ D.    (3)

The variable Y is unique up to equality almost everywhere.

Proof: We prove uniqueness first. Suppose that Y and Y' are both integrable, D-measurable and satisfy (3). The event (Y > Y') is in D. Hence we see that

    ∫_{(Y > Y')} Y − Y' dP = 0.

But 1_{(Y > Y')}(Y − Y') is non-negative and strictly positive on (Y > Y'). As the integral is zero, we conclude that P(Y > Y') = 0. Similarly, P(Y < Y') = 0. So it follows that P(Y ≠ Y') = 0.

To prove existence, observe that L²(Ω, F, P) is an L¹-dense subset of L¹(Ω, F, P), which is a direct consequence of Corollary 12.11 in Schilling's book. So if E|X| < ∞ we can choose a sequence X_n ∈ L²(Ω, F, P) such that ∫ |X − X_n| dP → 0. In particular X_1, X_2, ... is a Cauchy sequence in L¹, and by (2) the sequence pr(X_1), pr(X_2), ... is a Cauchy sequence in L¹(Ω, D, P). It therefore has a D-measurable, integrable limit Y, that is, ∫ |pr(X_n) − Y| dP → 0


for n → ∞. We claim that this D-measurable limit Y satisfies (3). Indeed, for any D ∈ D we have that

    | ∫_D X dP − ∫_D Y dP | ≤ | ∫_D X dP − ∫_D X_n dP | + | ∫_D X_n dP − ∫_D pr(X_n) dP | + | ∫_D pr(X_n) dP − ∫_D Y dP |
                            ≤ ∫ |X − X_n| dP + ∫ |pr(X_n) − Y| dP,

since the middle integral is zero according to (1). As X_n → X and pr(X_n) → Y in L¹, this upper bound can be made arbitrarily small, and hence (3) holds.

The extension established above of the map pr from L²(Ω, F, P) to L¹(Ω, F, P) could be denoted pr as well. We prefer the notation commonly used in probability theory, where pr(X) is written E(X | D) and is called the conditional expectation of X given D. The content of Theorem 2 is reformulated as the following definition:

Definition 1. Let X be a real-valued random variable, defined on a background space (Ω, F, P), and assume that E|X| < ∞. Let D ⊆ F be a sub-σ-algebra. The conditional expectation of X given D is the (almost everywhere unique) integrable, D-measurable random variable E(X | D) satisfying

    ∫_D X dP = ∫_D E(X | D) dP for every D ∈ D.    (4)

If X has second moment, the conditional variance of X given D is defined as

    V(X | D) = E( (X − E(X | D))² | D ).    (5)


It follows from Lemma 4 below that V(X | D) ≥ 0 almost everywhere. The usefulness of conditional variances is intimately linked to the relation

    V X = E( V(X | D) ) + V( E(X | D) ),    (6)

which follows from Corollary 8 below. The derivation is left to the reader as Problem 1.2. There are many cases where hard-to-find unconditional variances can be computed using this conditional approach with a clever choice of D.
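As an added numerical illustration of (6) (the distributional choices below are arbitrary and only serve as an example), take N Poisson distributed and, given N, let X be binomially distributed with N trials; with D = σ(N) both E(X | D) = Np and V(X | D) = Np(1 − p) are explicit, and the two sides of (6) can be compared by simulation.

    import numpy as np

    # Check V X = E[ V(X | D) ] + V[ E(X | D) ] with D = sigma(N),
    # N ~ Poisson(lam) and X | N ~ Binomial(N, p).
    rng = np.random.default_rng(1)
    lam, p, reps = 7.0, 0.3, 1_000_000

    N = rng.poisson(lam, size=reps)
    X = rng.binomial(N, p)

    cond_mean = N * p             # E(X | N) = N p
    cond_var = N * p * (1 - p)    # V(X | N) = N p (1 - p)

    lhs = X.var()
    rhs = cond_var.mean() + cond_mean.var()
    print(lhs, rhs)               # both are close to lam * p = 2.1 here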

In the typical application, the conditioning σ-algebra is not at all arbitrary: it is generated by a relevant random variable Y (which can be real-valued, vector-valued or have values in an arbitrary abstract measurable space). In this case, we will typically write E(X | Y), suppressing the σ-algebra in the notation. Thus in the initial example we would write E(X_i | S) for the conditional expectation of X_i w.r.t. the sum S. It is a general fact (but not proved in this note) that E(X | Y) can be written as g(Y) for a suitable real-valued function g, defined on the space where Y has values. This means that E(X | Y) should be thought of as a function of Y, something that can be computed once Y is known. With this in mind, E(X | Y) can be understood as the natural predictor of X after observation of Y. If X has second moment, this interpretation is most clearly justified by seeing E(X | Y) as the orthogonal projection onto L²(Ω, σ(Y), P).

It may seem overly complicated to develop a machinery for conditioning on an abstract σ-algebra if all this machinery is ever used for is to condition on a random variable. But there are considerable notational advantages in the σ-algebra formulation, in particular associated to the fact that several random variables can generate the same σ-algebra. In the σ-algebra formulation we can switch between the conditioning variables without affecting the conditional expectation, and this turns out to be extremely convenient.
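The projection interpretation can also be illustrated numerically. In the added sketch below (the model Y ∼ N(0, 1), X = Y² + noise is an arbitrary choice), the conditional expectation E(X | Y) = Y² has a smaller mean squared prediction error than other functions of Y, in line with E(X | Y) being the orthogonal projection onto L²(Ω, σ(Y), P).

    import numpy as np

    # E(X | Y) should have the smallest mean squared error among predictors g(Y).
    rng = np.random.default_rng(2)
    reps = 500_000
    Y = rng.standard_normal(reps)
    X = Y**2 + rng.standard_normal(reps)     # here E(X | Y) = Y^2

    predictors = {
        "E(X|Y) = Y^2": Y**2,
        "constant E X": np.full(reps, 1.0),  # best linear predictor is the constant 1
        "|Y|": np.abs(Y),
    }
    for name, g in predictors.items():
        print(name, np.mean((X - g) ** 2))   # smallest for Y^2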

If X and X' are two real-valued random variables satisfying that X = X' a.e., it obviously holds that

    ∫_D X dP = ∫_D X' dP for every D ∈ D,

if the two variables are integrable. Hence the integral condition defining E(X | D) is the same as the integral condition defining E(X' | D), and from the uniqueness of the conditional expectation it follows that

    E(X | D) = E(X' | D) a.e.

The almost everywhere closing of this formula could be neglected without any problems, but it is customary in probability theory to formulate all results on conditional expectations as holding true almost everywhere, as an admission to the fact that conditional expectations are only determined up to modifications on null sets.

Example 1. If X itself is D-measurable (as well as integrable), it clearly satisfies the integral condition defining E(X | D). Hence we have that

    E(X | D) = X a.e.


There is a handful of cases where the conditional expectation can be computed effortlessly, Example 1 being the most obvious. In the exercises there are a few other such cases. But in general some amount of work will be needed, relying on a number of results to be established now.

Lemma 3. Let X_1 and X_2 be real-valued random variables, defined on a common background space (Ω, F, P). Assume that E|X_1| < ∞ and E|X_2| < ∞, and let D ⊆ F be a sub-σ-algebra. Then for all a, b ∈ R it holds that

    E(aX_1 + bX_2 | D) = a E(X_1 | D) + b E(X_2 | D) a.e.    (7)

Lemma 4. In the setting of Lemma 3, if X_1 ≤ X_2 a.e. then E(X_1 | D) ≤ E(X_2 | D) a.e.


Theorem 5 (Tower property). Let X be a real-valued random variable, defined on a background space (Ω, F, P). Assume that E|X| < ∞. Let D ⊆ E ⊆ F be two nested sub-σ-algebras. Then it holds that

    E( E(X | E) | D ) = E(X | D) a.e.    (8)

Proof: Let E(X | E) and E(X | D) be specific versions of the conditional expectations. By definition E(X | E) is an integrable real-valued random variable, so the left hand side of (8) makes sense.

The claim in (8) is that E(X | D) is a D-measurable random variable (this is obviously satisfied) which has the same integrals over D-sets as E(X | E). To check this integration property, take an event D ∈ D. Since D and E are nested, we see that D ∈ E automatically. Hence

    ∫_D E(X | E) dP = ∫_D X dP = ∫_D E(X | D) dP,

exactly as desired.

The tower property is surprisingly useful for computing conditional expectations via a two-step procedure. It is often possible to find an intermediate σ-algebra E such that E(X | E) is easy to compute. And with a bit of luck the result is itself D-measurable, trivializing the outer conditional expectation in (8). Even when E(X | E) is not D-measurable, calculation of the outer conditional expectation is frequently an easier task than a direct attempt at computing E(X | D).
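The two-step procedure can be checked numerically on a discrete example. The sketch below is an added illustration (variables and sample size are arbitrary choices): with D = σ(Y) and E = σ(Y, Z), conditional expectations with respect to the empirical distribution of a simulated sample are exact group averages, and averaging the inner conditional expectation E(X | Y, Z) over each atom of Y reproduces E(X | Y) exactly.

    import numpy as np

    # Tower property on an empirical (discrete) probability space:
    # averaging the (Y, Z)-group means over a Y-group, weighted by group sizes,
    # must reproduce the plain Y-group mean.
    rng = np.random.default_rng(3)
    n = 10_000
    Y = rng.integers(0, 3, size=n)          # D = sigma(Y)
    Z = rng.integers(0, 4, size=n)          # E = sigma(Y, Z)
    X = Y + Z + rng.standard_normal(n)

    E_X_given_YZ = np.zeros(n)
    for y in range(3):
        for z in range(4):
            cell = (Y == y) & (Z == z)
            if cell.any():
                E_X_given_YZ[cell] = X[cell].mean()    # a version of E(X | Y, Z)

    for y in range(3):
        grp = Y == y
        # outer conditioning on D = sigma(Y)
        print(y, E_X_given_YZ[grp].mean(), X[grp].mean())    # the two numbers agree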

Theorem 6. Let X, X_1, X_2, ... be real-valued random variables, defined on a common background space (Ω, F, P), and let D ⊆ F be a sub-σ-algebra. Suppose that X_n ≥ 0 almost everywhere for every n, and that X_n ↑ X almost everywhere. If E|X| < ∞ it holds that

    E(X_n | D) ↑ E(X | D) a.e.

Proof: Since 0 ≤ X_n ≤ X almost everywhere, the assumption that X has first moment implies that each X_n also has first moment, and so the various conditional expectations exist. Choose specific versions.

We have previously noted that the conditional expectation has a monotonicity property, which implies that E(X_1 | D) ≤ E(X_2 | D) ≤ ... almost everywhere. Hence there is a random variable Y (potentially with values in [0, ∞]) such that

    E(X_n | D) ↑ Y a.e.

By standard arguments we can assume Y to be D-measurable. The challenge is to prove that Y has finite integral (enabling us to disregard the possibility of Y having the value


∞) and that Y satisfies the integral condition characterizing E(X | D). Both properties can be obtained from the same computation. Let D ∈ D be arbitrary. Then we have that

    1_D X_n ↑ 1_D X and 1_D E(X_n | D) ↑ 1_D Y a.e.

Relying on the monotone convergence theorem we obtain that

    ∫_D Y dP = ∫_D lim_n E(X_n | D) dP = ∫ lim_n 1_D E(X_n | D) dP
             = lim_n ∫ 1_D E(X_n | D) dP = lim_n ∫ 1_D X_n dP
             = ∫_D X dP.

Using D = Ω we obtain the desired integrability of Y. And obviously Y satisfies the integral condition characterizing E(X | D).

Corollary 7. Let X and Y be real-valued random variables, defined on a common background space (Ω, F, P), and let D ⊆ F be a sub-σ-algebra. Suppose that X is D-measurable. If E|Y| < ∞ and E|XY| < ∞ then it holds that

    E(XY | D) = X E(Y | D) a.e.

Proof: The result is easily obtained if X is an indicator, say X = 1_{D_0} for a suitable event D_0 ∈ D. Clearly, X E(Y | D) is D-measurable and integrable. And for any D ∈ D it holds that

    ∫_D X E(Y | D) dP = ∫_D 1_{D_0} E(Y | D) dP = ∫_{D ∩ D_0} E(Y | D) dP
                      = ∫_{D ∩ D_0} Y dP = ∫_D 1_{D_0} Y dP = ∫_D XY dP,

using that D ∩ D_0 ∈ D. So the variable X E(Y | D) satisfies the integral condition characterizing E(XY | D).

If X is a simple function, say X = Σ_{i=1}^n c_i 1_{D_i} with D_i ∈ D, the result follows from the above calculation by linearity,

    E(XY | D) = E( Σ_{i=1}^n c_i 1_{D_i} Y | D ) = Σ_{i=1}^n c_i E(1_{D_i} Y | D)
              = Σ_{i=1}^n c_i 1_{D_i} E(Y | D) = X E(Y | D).

Before we attack the general case, consider one more special example: Assume that X ≥ 0 and that Y ≥ 0. By applying Corollary 5.17 of [MT] we can find simple non-negative


D-measurable variables S_n such that S_n ↑ X. Obviously, this implies that S_n Y ↑ XY, so from Theorem 6 we have that

    E(S_n Y | D) ↑ E(XY | D) a.e.

But as S_n is simple, we also have that

    E(S_n Y | D) = S_n E(Y | D) ↑ X E(Y | D) a.e.

As the two limits must be equal almost everywhere, we have established the result in the case where both variables are non-negative.

What remains is to reduce the general situation to the non-negative case just treated. This can be done by a bit of bookkeeping. Let X and Y be arbitrary, only subject to the conditions of the corollary. We can decompose each of them into positive and negative parts, X = X⁺ − X⁻ and Y = Y⁺ − Y⁻. Clearly X⁺ and X⁻ are non-negative and D-measurable, and Y⁺ and Y⁻ are non-negative and integrable. Furthermore, the cross products X⁺Y⁺, X⁺Y⁻, X⁻Y⁺ and X⁻Y⁻ are all integrable, for instance

    E|X⁺Y⁺| ≤ E|XY| < ∞.

Hence

    E(XY | D) = E( (X⁺ − X⁻)(Y⁺ − Y⁻) | D )
              = E(X⁺Y⁺ | D) − E(X⁺Y⁻ | D) − E(X⁻Y⁺ | D) + E(X⁻Y⁻ | D)
              = X⁺ E(Y⁺ | D) − X⁺ E(Y⁻ | D) − X⁻ E(Y⁺ | D) + X⁻ E(Y⁻ | D)
              = (X⁺ − X⁻) E(Y⁺ − Y⁻ | D)
              = X E(Y | D) a.e.

and the result is established.

To illustrate the use of the computational rules for the conditional expectation, as established above, we show a different representation of the conditional variance akin to the well known formula for the unconditional variance.

Corollary 8. For X a real-valued random variable with second moment and D ⊆ F a sub-σ-algebra it holds that

    V(X | D) = E(X² | D) − E(X | D)².

Proof: Observe that

    (X − E(X | D))² = X² − 2X E(X | D) + E(X | D)².


By Corollary 7,

    E( X E(X | D) | D ) = E(X | D)²,

and by linearity of the conditional expectation, Lemma 3, we get that

    V(X | D) = E(X² | D) − 2 E( X E(X | D) | D ) + E( E(X | D)² | D )
             = E(X² | D) − 2 E(X | D)² + E(X | D)²
             = E(X² | D) − E(X | D)² a.e.

    1.1 Problems

1.1. Let X be a real-valued random variable defined on a background space (Ω, F, P) and let D ⊆ F be a sub-σ-algebra. Show that if X = c a.e. then it holds that E(X | D) = c a.e.

1.2. Show the variance formula (6) when X has second moment.

1.3. Let (Ω, F, P) be a probability space. Let D be the σ-algebra generated by a finite partition H = (H_1, ..., H_n) of Ω (where every H-atom is F-measurable).

(a) Show that any D-measurable real-valued random variable Y can be written in the form

    Y = Σ_{i=1}^n c_i 1_{H_i}    (9)

for suitable constants c_1, ..., c_n ∈ R.

(b) Let X be a real-valued random variable with E|X| < ∞. Show that E(X | D) has the form (9) with c_i = ∫_{H_i} X dP / P(H_i) whenever P(H_i) > 0. Can anything be said about c_i if P(H_i) = 0?

(c) Can the above construction be carried over to the case where H is a countable partition?

(d) Let Y be a Z-valued random variable on (Ω, F, P). Find E(X | Y) by computing the value on each atom (Y = k).

1.4. Let X be a positive real-valued random variable and let ⌊X⌋ denote the integer part of X. That is, if X = N + δ with N ∈ N_0 and δ ∈ [0, 1) then ⌊X⌋ = N.


(a) Assume that the distribution of X has density f w.r.t. the Lebesgue measure on (0, ∞). Define for n ∈ N_0

    p(n) = ∫_n^{n+1} f(x) dx and ξ(n) = ∫_n^{n+1} x f(x) dx.

Show that

    E(X | ⌊X⌋) = ξ(⌊X⌋) / p(⌊X⌋) a.e.

(b) Assume that X is uniformly distributed on (0, 10). Show that

    E(X | ⌊X⌋) = ⌊X⌋ + 1/2 a.e.

(c) Assume that X is exponentially distributed on (0, ∞). Show that

    E(X | ⌊X⌋) = ⌊X⌋ + (e − 2)/(e − 1) a.e.

Compare with the result for the uniform distribution and explain the difference.
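As an added numerical sanity check of the exponential case (not a substitute for the derivation asked for in the exercise), the constant (e − 2)/(e − 1) ≈ 0.418 can be recovered by simulation.

    import numpy as np

    # Empirical E(X | floor(X) = n) for standard exponential X, compared with
    # the claimed value n + (e - 2)/(e - 1).
    rng = np.random.default_rng(4)
    X = rng.exponential(scale=1.0, size=2_000_000)
    N = np.floor(X).astype(int)

    offset = (np.e - 2) / (np.e - 1)        # approximately 0.418
    for n in range(4):
        print(n, X[N == n].mean(), n + offset)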

1.5. Let X be a real-valued random variable on a probability space (Ω, F, P). Assume that E|X| < ∞. For x ∈ R let D_x denote the collection of events of the form (X ∈ B) and (X ∈ B) ∪ (X > x), where B ⊆ (−∞, x] is a Borel set.

(a) Show that D_x is a σ-algebra.

(b) Assume that the distribution of X has density f w.r.t. the Lebesgue measure, and assume that f(x) > 0 for all x > 0. Show that for x > 0

    E(X | D_x) = X 1_{(X ≤ x)} + ξ(x) 1_{(X > x)} a.e.

where

    ξ(x) = (1 / P(X > x)) ∫_x^∞ y f(y) dy.

(c) Show that if X is Weibull distributed with shape parameter c > 0 then

    ξ(x) = x + e^{x^c} ∫_x^∞ e^{−y^c} dy.

1.6. Let X and Y be random variables defined on a common background space (Ω, F, P). Suppose that X is real-valued and that E|X| < ∞, but Y can have values in an arbitrary measurable space. Show that if X and Y are independent, then

    E(X | Y) = E X a.e.

1.7. Let X be a real-valued, integrable random variable defined on a background space (Ω, F, P). Let Y and Z be two random variables on the same background space, with values in arbitrary measurable spaces. Assume that (X, Y) is independent of Z. Show that

    E( X | σ(Y, Z) ) = E(X | Y) a.e.

Hint: the events of the form (Y ∈ B, Z ∈ C) form a generator, stable under intersection, for the conditioning σ-algebra on the left.

1.8. Let X_1, ..., X_n be independent and identically distributed real-valued random variables. Assume that E|X_i| < ∞. Let S = Σ_{i=1}^n X_i.

(a) Show by symmetry that

    ∫_{(S ∈ B)} X_1 dP = ∫_{(S ∈ B)} X_2 dP

for any set B ∈ B.

(b) Show that E(X_1 | S) = E(X_2 | S) almost everywhere.

(c) Conclude that E(X_1 | S) = (1/n) S.

1.9. Let X and Y be two real-valued random variables defined on a common background space (Ω, F, P). Assume that E|X| < ∞ …


    2 The Central Limit Theorem

    2.1 Introduction

The mathematical theorem that has become known as the Central Limit Theorem, or just CLT for short, is unquestionably the most important of all results in probability theory. It says that the distribution of the sum, or average, of independent and identically distributed real-valued random variables with second moment is well approximated by a normal distribution. We will refer to it as Laplace's CLT, since Laplace was the first to prove general results in this form. It is not really fair to refer to the CLT as one theorem, because there are many versions that differ by the assumptions made and the form in which the conclusions are given. The assumption that the variables are identically distributed can be relaxed without much more work. The independence assumption can also be relaxed, but this is considerably more difficult.

The proof presented in this note is one of the most direct, based on little more than the various introductory concepts and results in probability theory, most notably the concept of independence, and then some analytic approximation techniques like the second order Taylor expansion with remainder.

    2.2 The Central Limit Theorem

Recall that the standard normal distribution on (R, B) has distribution function

    Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−y²/2} dy.

Theorem 9 (Laplace's CLT). If X_1, X_2, ... are independent identically distributed real-valued random variables with mean μ and variance σ², then

    P( (1/(σ√n)) Σ_{i=1}^n (X_i − μ) ≤ x ) → Φ(x)    (10)

for n → ∞.

Observe that with

    X̄_n = (1/n) Σ_{i=1}^n X_i,

then E X̄_n = μ, and by independence

    V X̄_n = (1/n²) Σ_{i=1}^n V X_i = nσ²/n² = σ²/n.


Moreover,

    (1/(σ√n)) Σ_{i=1}^n (X_i − μ) = (√n/σ)(X̄_n − μ),

thus the Central Limit Theorem, as formulated in Theorem 9, is also a statement about the asymptotic distribution of the standardized average, and could have been formulated as

    P( X̄_n ≤ μ + xσ/√n ) → Φ(x)    (11)

for n → ∞. We usually say that X̄_n is asymptotically normally distributed with mean μ and variance σ²/n, and write

    X̄_n ∼as N(μ, σ²/n).

2.3 Applying the Central Limit Theorem

The practical importance of the CLT is due to the fact that it gives computable approximations to probabilities of interest concerning X̄_n. It follows from (11) that

    P( |X̄_n − μ| ≤ xσ/√n ) = P( X̄_n ≤ μ + xσ/√n ) − P( X̄_n < μ − xσ/√n ) → Φ(x) − Φ(−x).

A bound on |X̄_n − μ| that holds with high probability can be computed by setting

    Φ(x) − Φ(−x) = 1 − 2Φ(−x)

equal to the desired probability and then solving for x. A popular choice is 1 − 2Φ(−x) = 0.95, or equivalently,

    x = −Φ⁻¹(0.025) = 1.96.

If we want the bound to hold with probability 0.99 instead, the solution is

    x = −Φ⁻¹(0.005) = 2.58.

Taking x = 1.96 the conclusion from the CLT is that the absolute deviation of X̄_n from μ is bounded by

    1.96 σ/√n

with approximate probability 0.95. The CLT implies that the approximation of the probability by 0.95 will become increasingly accurate as n → ∞. Compare this with the bound from Chebychev's inequality, which can be restated as

    P( |X̄_n − μ| ≤ xσ/√n ) ≥ 1 − 1/x².


Equating the lower probability bound to 0.95 gives x = √20 = 4.47. The conclusion using Chebychev's inequality is, in principle, a little stronger: it assures that the absolute deviation is bounded by

    4.47 σ/√n

with probability greater than or equal to 0.95. This is a guaranteed bound on the probability. The probability can be larger, but not smaller. The CLT provides an approximation, which can be smaller or larger than 0.95. Since Φ(−4.47) = 3.91 × 10⁻⁶, the CLT suggests that the probability statement made by Chebychev's inequality is actually very conservative.

An obstacle to the practical use of the bounds provided by the CLT is that the standard deviation σ enters. It is rarely known. Just as X̄_n is an estimate of the (unknown) μ from X_1, ..., X_n, we can also compute an estimate of σ. The usual estimate is

    σ̂_n = √( (1/(n−1)) Σ_{i=1}^n (X_i − X̄_n)² ).

Division by n − 1 (and not n) assures that E σ̂²_n = σ². A computable bound on |X̄_n − μ| is then obtained by plugging in σ̂_n, that is, the estimated bound becomes

    1.96 σ̂_n/√n.

It is possible to show that σ̂_n/σ → 1 in probability for n → ∞, and with a little more work it follows that also with the plug-in estimate of σ it holds that

    P( |X̄_n − μ| ≤ x σ̂_n/√n ) → 1 − 2Φ(−x)    (12)

for n → ∞. See Problem 2.1 for an argument based on a finite fourth moment assumption. The common way to report the bound is in terms of the interval

    [ X̄_n − 1.96 σ̂_n/√n , X̄_n + 1.96 σ̂_n/√n ]

or equivalently

    X̄_n ± 1.96 σ̂_n/√n.

This is a random interval, and the probability statement about this interval is that it will cover μ with approximate probability 0.95. In the statistical terminology, it is a confidence interval for μ, which has approximate 95% coverage.
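As an added illustration of the coverage statement (the exponential distribution and the value of n below are arbitrary choices), one can simulate many samples, form the interval X̄_n ± 1.96 σ̂_n/√n in each, and record how often it covers μ.

    import numpy as np

    # Empirical coverage of the approximate 95% confidence interval
    # for the mean of i.i.d. exponential(1) variables (true mean mu = 1).
    rng = np.random.default_rng(5)
    n, reps, mu = 50, 20_000, 1.0

    X = rng.exponential(scale=mu, size=(reps, n))
    xbar = X.mean(axis=1)
    sd = X.std(axis=1, ddof=1)              # divisor n - 1
    half = 1.96 * sd / np.sqrt(n)

    covered = np.abs(xbar - mu) <= half
    print(covered.mean())                    # close to, but typically a bit below, 0.95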


2.4 Lindeberg's proof of the CLT

In everything that follows, X_1, X_2, ... are independent and identically distributed real-valued random variables with mean μ and variance σ². We introduce the variables

    Y_i = (X_i − μ)/(σ√n), i = 1, ..., n,

which are then also independent and identically distributed, but with mean 0 and variance n⁻¹. We let

    S_n = Σ_{i=1}^n Y_i = (1/(σ√n)) Σ_{i=1}^n (X_i − μ),

such that E S_n = 0 and V S_n = 1.

The main idea in the proof is to replace the variables Y_i by normally distributed variables with the same mean and variance. To this end, let Z_1, Z_2, ... be independent identically distributed with Z_i ∼ N(0, n⁻¹). Let

    W_n = Σ_{i=1}^n Z_i ∼ N(0, 1).

The variables Z_1, Z_2, ... are, moreover, assumed independent of Y_1, Y_2, ....

The proof of Theorem 9 is broken into two parts. First we give an intermediate result that states that

    E h(S_n) − E h(W_n) → 0

for all C²-functions h : R → R with bounded and uniformly continuous second derivative. Theorem 9 would follow directly from this result with h(y) = 1_{(−∞,x]}(y) if this function were regular enough. Clearly, it is not, but it can easily be approximated by C²-functions with a piecewise linear second derivative in such a way that we can transfer the convergence to indicator functions, and in this way get the desired convergence of distribution functions as stated in Theorem 9.

Lemma 10. If h is C² with h'' uniformly continuous and bounded, then

    E h(S_n) − E h(W_n) → 0    (13)

for n → ∞.

Proof: By assumption, |h''(x)| ≤ M for all x and some constant M, and the modulus of continuity

    δ(ε) = sup_{|x−y| ≤ ε} |h''(x) − h''(y)|


fulfills that δ(ε) → 0 for ε → 0, and by the bound on h'' it holds that δ(ε) ≤ 2M. Introduce the variables

    S_i = Σ_{j=1}^i Y_j + Σ_{j=i+1}^n Z_j and U_i = Σ_{j=1}^{i−1} Y_j + Σ_{j=i+1}^n Z_j,

such that S_i = U_i + Y_i and S_{i−1} = U_i + Z_i.

Then S_0 = W_n and

    h(S_n) − h(W_n) = Σ_{i=1}^n ( h(S_i) − h(S_{i−1}) ).

The idea in the proof is to bound the individual terms in the sum above. By a second order Taylor expansion of h with remainder it holds for all u and y that

    h(u + y) = h(u) + h'(u) y + ½ h''(u + vy) y²

for some v ∈ [0, 1]. This gives that

    h(S_i) − h(S_{i−1}) = h(U_i + Y_i) − h(U_i + Z_i)
                        = h'(U_i)(Y_i − Z_i) + ½ Y_i² h''(U_i + vY_i) − ½ Z_i² h''(U_i + v'Z_i)

for v, v' ∈ [0, 1]. Since h' is bounded by an affine function, h'(U_i) has first moment, and since U_i and Y_i − Z_i are independent, we get that h'(U_i) and Y_i − Z_i are independent. Hence

    E h'(U_i)(Y_i − Z_i) = E h'(U_i) E(Y_i − Z_i) = 0.

From this it follows that

    2 |E h(S_i) − E h(S_{i−1})| = |E Y_i² h''(U_i + vY_i) − E Z_i² h''(U_i + v'Z_i)|
                                ≤ E Y_i² |h''(U_i + vY_i) − h''(U_i)| + E Z_i² |h''(U_i + v'Z_i) − h''(U_i)|
                                  + |E Y_i² h''(U_i) − E Z_i² h''(U_i)|
                                ≤ E Y_i² δ(|Y_i|) + E Z_i² δ(|Z_i|) + |E Y_i² h''(U_i) − E Z_i² h''(U_i)|.

Since h'' is bounded, h''(U_i) has first moment, and h''(U_i) is independent of Y_i² as well as Z_i². Hence

    E Y_i² h''(U_i) = E Y_i² E h''(U_i) = E Z_i² E h''(U_i) = E Z_i² h''(U_i),


and the last term above is 0. That is, we have

    2 |E h(S_i) − E h(S_{i−1})| ≤ E Y_i² δ(|Y_i|) + E Z_i² δ(|Z_i|) = E Y_1² δ(|Y_1|) + E Z_1² δ(|Z_1|).

This finally gives the bound

    2 |E h(S_n) − E h(W_n)| ≤ Σ_{i=1}^n 2 |E h(S_i) − E h(S_{i−1})| ≤ n E Y_1² δ(|Y_1|) + n E Z_1² δ(|Z_1|),

and the proof is completed by showing that the two terms on the right hand side converge to 0 for n → ∞. They are dealt with in the same way, so let's consider the first. We have that

    n Y_1² δ(|Y_1|) = ((X_1 − μ)²/σ²) δ( |X_1 − μ|/(σ√n) ) → 0

for n → ∞, since |X_1 − μ|/(σ√n) → 0 for n → ∞. Moreover, this convergence is dominated by 2M(X_1 − μ)²/σ², which is integrable. It follows by dominated convergence that

    n E Y_1² δ(|Y_1|) → 0

for n → ∞, and this completes the proof.

Note how the proof above uses independence in two crucial places to get expectation 0 of products. We could not control the error bounds if these terms did not completely disappear, and for this purpose the independence assumption plays an important role.

Note also that the distribution of W_n does not depend upon n. Thus E h(W_n) is a constant. The conclusion of Lemma 10 could thus just as well have been phrased as

    E h(S_n) → E h(W)    (14)

for W ∼ N(0, 1).

To prove Theorem 9 from Lemma 10 we need to construct approximations of indicator functions by C²-functions with bounded and uniformly continuous second derivative. To this end we start with the continuous piecewise linear function

    g(y) = 0          if y ≤ 0
           −y         if 0 < y ≤ ε/4
           −ε/2 + y   if ε/4 < y ≤ 3ε/4
           ε − y      if 3ε/4 < y ≤ ε
           0          if ε < y


Then we let G(y) = ∫_{−∞}^y g(z) dz. We note that G(y) ≤ 0, and G(y) < 0 if and only if y ∈ (0, ε). We introduce b(ε) = ∫ |G(y)| dy and

    h(y) = ∫_{−∞}^y G(z)/b(ε) dz + 1.

Then h(y) = 1 for y ≤ 0, h decreases from 1 to 0 on the interval (0, ε), and h(y) = 0 for y ≥ ε. Moreover, it is constructed so that it is C² with the second derivative, g/b(ε), being bounded and uniformly continuous. The function h is a regular approximation to the indicator function 1_{(−∞,0]}. Using this function we can now give a proof of Theorem 9.
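As an added aid (grid resolution and ε below are arbitrary choices), the function h can be tabulated numerically by integrating g twice, which makes the qualitative description above easy to verify.

    import numpy as np

    # Tabulate the smooth approximation h of the indicator 1_(-inf, 0]:
    # g is piecewise linear, G its antiderivative, and h integrates G/b(eps) plus 1.
    eps = 1.0
    y = np.linspace(-1.0, 2.0, 30_001)
    dy = y[1] - y[0]

    g = np.zeros_like(y)
    m1 = (y > 0) & (y <= eps / 4)
    g[m1] = -y[m1]
    m2 = (y > eps / 4) & (y <= 3 * eps / 4)
    g[m2] = -eps / 2 + y[m2]
    m3 = (y > 3 * eps / 4) & (y <= eps)
    g[m3] = eps - y[m3]

    G = np.cumsum(g) * dy                   # G(y) = integral of g up to y
    b = np.sum(np.abs(G)) * dy              # b(eps) = integral of |G|
    h = np.cumsum(G / b) * dy + 1.0         # h(y) = integral of G/b up to y, plus 1

    for val in (-0.5, 0.0, 0.5, 1.0, 1.5):
        print(val, h[np.searchsorted(y, val)])   # roughly 1, 1, 0.5, 0, 0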

Proof of Theorem 9: We let F_n denote the distribution function for the distribution of S_n. That is,

    F_n(x) = P(S_n ≤ x) = E 1_{(−∞,x]}(S_n).

Fixing x ∈ R we have that

    h(y − x + ε) ≤ 1_{(−∞,x]}(y) ≤ h(y − x).

Plugging S_n in for y, taking expectations and limits, it follows from Lemma 10 that

    E h(W − x + ε) ≤ liminf_n F_n(x) ≤ limsup_n F_n(x) ≤ E h(W − x),

where W ∼ N(0, 1). Now, as

    1_{(−∞,x−ε]}(y) ≤ h(y − x + ε) and h(y − x) ≤ 1_{(−∞,x+ε]}(y),

it follows that

    Φ(x − ε) ≤ liminf_n F_n(x) ≤ limsup_n F_n(x) ≤ Φ(x + ε).

Letting ε → 0, and using that Φ is continuous in x, we conclude that F_n(x) converges and that

    lim_n F_n(x) = Φ(x).
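As an added numerical check of this conclusion (the exponential distribution and the values of n and x below are arbitrary choices), the empirical distribution function of simulated standardized sums can be compared with Φ.

    import numpy as np
    from math import erf, sqrt

    def Phi(x):
        # standard normal distribution function
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    rng = np.random.default_rng(6)
    mu, sigma = 1.0, 1.0                    # exponential(1) has mean 1 and sd 1
    reps = 20_000

    for n in (5, 50, 500):
        X = rng.exponential(scale=1.0, size=(reps, n))
        S = (X - mu).sum(axis=1) / (sigma * np.sqrt(n))    # the standardized sum S_n
        for x in (-1.0, 0.0, 1.0):
            print(n, x, np.mean(S <= x), Phi(x))           # F_n(x) versus Phi(x)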

We conclude this note with a few comments on how the CLT can be proved by other methods, and how it fits into a more general framework of convergence in distribution. We generally say that S_n converges in distribution to W ∼ N(0, 1) if (14) holds for all bounded continuous functions h, and we write

    (1/(σ√n)) Σ_{i=1}^n (X_i − μ) →^D W

or

    (1/(σ√n)) Σ_{i=1}^n (X_i − μ) →^D N(0, 1).

The proof of Theorem 9 shows that convergence in distribution to the normally distributed random variable W implies pointwise convergence of the corresponding distribution functions. We should note that the argument relies in an essential way on the fact that the limit distribution function is continuous everywhere.

We haven't actually established convergence in distribution completely. Lemma 10 is almost a proof of convergence in distribution of S_n towards W, except that h is assumed a little more regular than just continuous. In the general theory of convergence in distribution it is usually shown explicitly that proving a result like Lemma 10 is enough to ensure that the convergence holds for all bounded continuous h, but we will not pursue this here. The conclusion that we have established, on the convergence of the distribution functions, is what matters for practical applications.

An alternative to the substitution argument given above is based on a more analytic method. It can first of all be shown that the convergence of the expectations (14) for all bounded continuous h follows if we can just check the convergence for some very special C^∞ functions h, namely h(x) = cos(θx) and h(x) = sin(θx) for θ ∈ R. The characteristic function of S_n is

    φ_n(θ) = E cos(θS_n) + i E sin(θS_n) = E e^{iθS_n}.

The CLT is then proved by showing that

    φ_n(θ) → e^{−θ²/2} = E e^{iθW}

for all θ ∈ R, and this is enough to establish convergence in distribution of S_n toward W ∼ N(0, 1). This approach binds the theory of convergence in distribution together with characteristic functions (or Fourier transforms) of probability measures.
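As an added sketch along these lines (with an arbitrary choice of underlying distribution and of n), the empirical average of e^{iθS_n} over simulated samples can be compared with e^{−θ²/2}.

    import numpy as np

    # Empirical characteristic function of S_n versus exp(-theta^2 / 2).
    rng = np.random.default_rng(7)
    reps, n = 50_000, 200
    X = rng.exponential(scale=1.0, size=(reps, n))      # mean 1, sd 1
    S = (X - 1.0).sum(axis=1) / np.sqrt(n)              # standardized sum

    for theta in (0.5, 1.0, 2.0):
        phi_hat = np.mean(np.exp(1j * theta * S))
        print(theta, phi_hat, np.exp(-theta**2 / 2))    # phi_n(theta) approaches the limit as n grows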

A third, and very different, approach was first given by the Stanford statistician Charles Stein in his seminal 1972 paper "A bound for the error in the normal approximation to the distribution of a sum of dependent random variables". Stein's method is centered around Stein's differential equation, which is the first order linear inhomogeneous differential equation

    f'(s) − s f(s) = h(s) − E h(W)

for W ∼ N(0, 1) and h bounded and continuous. From this equation we get that

    |E h(S_n) − E h(W)| = |E f'(S_n) − E S_n f(S_n)|,

and a detailed analysis of the regularity of the solution f, combined with clever arguments involving E f'(S_n) − E S_n f(S_n), gives Laplace's CLT and much more. As the title of Stein's original paper indicates, this approach is useful also when handling sums of dependent random variables.


    2.5 Problems

2.1. Assume that X_1, X_2, ... have finite fourth moment. Use the Law of Large Numbers to show that

    σ̂²_n → σ² in probability

for n → ∞. Introduce the interval I(ε) = [1 − ε, 1 + ε] for ε ∈ (0, 1), and conclude that for all ε ∈ (0, 1)

    P( σ̂_n/σ ∉ I(ε) ) → 0

for n → ∞. Argue that

    P( |X̄_n − μ| ≤ x σ̂_n/√n ) = P( |X̄_n − μ| ≤ x σ̂_n/√n , σ̂_n/σ ∉ I(ε) ) + P( |X̄_n − μ| ≤ x σ̂_n/√n , σ̂_n/σ ∈ I(ε) ),

and show that the first term tends to 0 for n → ∞, while the second term implies that

    1 − 2Φ(−x(1 − ε)) ≤ liminf_n P( |X̄_n − μ| ≤ x σ̂_n/√n ) ≤ limsup_n P( |X̄_n − μ| ≤ x σ̂_n/√n ) ≤ 1 − 2Φ(−x(1 + ε)).

Conclude that (12) holds.
