
EE603 Class Notes 12/05/13 John Stensby

Updates at http://www.ece.uah.edu/courses/ee385/ 11-1

Chapter 11: Sequences of Finite-Second-Moment Random Variables

The theory of sequences of finite-second-moment random variables is the topic of this chapter. We study their application to system theory, where they serve as the system's input and output. It is natural to ask if a given random sequence has a limit, and in what sense that limit is approached. Convergence of random variable sequences is discussed in this chapter. This chapter deals with discrete phenomena and mathematics.

Random sequences occur in applications where analog signals are sampled. They have

applications in the fields of signal and image processing, digital control and digital

communications. They have many applications outside of electrical engineering (for example, in

the world of games, stocks, money and finance).

Sequence of Random Variables – A Basic Definition

Let (S, F, P) be a probability space (see Chapter 1 of these notes). A random variable X(ζ) maps S into the extended real line R̄. (See Chapter 2 for the definition of a random variable.)

A mapping from a sample space into a set of discrete-time sample functions is called a random, or stochastic, sequence X(n;ζ), also known as a discrete-time random process. Often, we suppress the ζ argument and write X(n). For each fixed ζ in some sample space S, the function of n denoted by X(n;ζ) is an "ordinary" deterministic sequence of numbers known as a sample function. Alternatively, X(n;ζ) can be viewed as a sequence of random variables indexed on n. That is, for a fixed index n0, X(n0;ζ) is a random variable.

Example 11-1: X(n;ζ) = X(ζ)f(n), where X(ζ) is a random variable and f(n) is a deterministic sequence of real numbers, is a simple random sequence.

Example 11-2: X(n;ζ) = A(ζ)sin(πn/10 + Θ(ζ)), where A(ζ) and Θ(ζ) are random variables, is a random sequence.

These two elementary examples have the feature that their future values are predictable from

their present and past values.
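As a concrete illustration of Example 11-2, the following minimal Python sketch draws a few sample functions X(n;ζ) = A(ζ)sin(πn/10 + Θ(ζ)); the particular distributions chosen for A(ζ) and Θ(ζ) are illustrative assumptions, not part of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = np.arange(50)                            # index values n = 0, 1, ..., 49

# Draw three sample functions X(n; zeta) = A(zeta) sin(pi n / 10 + Theta(zeta)).
# Assumed (illustrative) distributions: A ~ N(1, 0.25), Theta ~ Uniform(0, 2*pi).
for _ in range(3):
    A = rng.normal(loc=1.0, scale=0.5)
    Theta = rng.uniform(0.0, 2.0 * np.pi)
    X = A * np.sin(np.pi * n / 10 + Theta)   # one deterministic sample function
    print(X[:5])                             # first few values of this realization
```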


Repeated Bernoulli Trials - the Quintessential Example of an Infinite Random Sequence

Consider the tossing of a fair coin. Here, we have the sample space S = [H, T], the set of events (i.e., σ-algebra) F = {[H], [T], [H,T], Ø}, and the probability measure P that is usually associated with the tossing of a fair coin (i.e., P[H] = P[T] = 1/2, etc.). (S, F, P) is the probability space for the coin tossing experiment. We define a random variable X: S → R as

X(H) = 1
X(T) = 0 .    (11-1)

We know that {X < 1/2} = [T], {X > 1/2} = [H], etc. In what follows, probability space (S, F, P)

and random variable X will be used “to build” the Bernoulli trials random sequence.

This random sequence X(n) is defined easily. On the nth toss, assign X(n) = 1 (alternatively, X(n) = 0) if heads (alternatively, tails) is obtained. We call X(n) the Bernoulli trials random sequence. This simple sequence must be described by using the methodology outlined above, a task that introduces some complexity. We need a probability space (S, F, P) so that X(n;ζ) can be defined as a mapping from S into a set of binary functions. The space (S, F, P), the development of which is outlined below, is a product space.

Our product space (S, F, P) is developed by using ideas from Chapter 1 of these notes (also see Chapter 8 of Stark and Woods, Probability and Random Processes, 4th ed.). Instead of considering individual heads and tails as elementary outcomes of separate experiments, our product space has elementary outcomes that are infinite head/tail sequences. We define the Bernoulli trials random sequence X(n;ζ) as a mapping from sample space S into a set of binary functions.

Our product space will be built as an infinite Cartesian product of (S, F, P) with itself (recall that (S, F, P) describes the coin tossing experiment). But first, by (Sn, Fn, Pn), we denote the "nth repetition" of (S, F, P); that is, Sn = {[Hn], [Tn]} and Fn = {[Hn], [Tn], [Hn, Tn], Ø}, where Hn and Tn denote "heads on the nth toss" and "tails on the nth toss", respectively. Pn is the usual probability measure that is used for the tossing of a fair coin (i.e., Pn[Hn] = Pn[Tn] = 1/2, etc.). Now, denoted as (S, F, P), our infinite-dimensional product space is determined from the (Sk, Fk, Pk), k ≥ 1, as outlined in what follows.

The sample space S is the infinite Cartesian product

S = ×_{k=1}^{∞} Sk = S1 × S2 × S3 × ··· × Sn × ··· .    (11-2)

Elements of S consist of infinite sequences of heads and tails. Element ζ ∈ S has the form

ζ = (ζ1, ζ2, ζ3, ... ) ,    (11-3)

where ζk ∈ Sk is either heads or tails, 1 ≤ k < ∞ (ζ is a sequence of heads/tails outcomes, not the outcome of a specific trial).

F denotes the set of events (i.e., the σ-algebra) for the product space. F includes all sets of the form

A = ×_{k=1}^{∞} Ak = A1 × A2 × ··· × An × ··· ,    (11-4)

where Ak ∈ Fk, 1 ≤ k < ∞ (set (11-4) is called a generalized rectangle). Also, all countable intersections and unions of such sets are included in F. For example, consider the event [the first two tosses produce different outcomes] ∈ F. This event in F is represented as

( {H1} × {T2} × S3 × S4 × ··· ) ∪ ( {T1} × {H2} × S3 × S4 × ··· ) ,    (11-5)

the union of two generalized rectangles each of the form (11-4). Also, the intersection of events [{H1} × {S2} × {S3} × ··· ] ∩ [{S1} × {T2} × {S3} × ··· ] must mean the event [{H1} × {T2} × {S3} × ··· ]. As it turns out, F is the σ-algebra generated by the collection (i.e., set) of all generalized rectangles of the form (11-4) (see Chapter 1 for details on how a σ-algebra can be generated by a collection of sets).

To finish our product space, we must define P, a probability measure on the product space. To accomplish this, we use the fact that the successive trials are independent, and probabilities can be multiplied (without this assumption, it would not be possible to define P without knowing the interdependence of each trial on the other trials). We start with events of the form given by (11-4), and we define

P[ A1 × A2 × A3 × ··· × An × ··· ] = P1(A1) P2(A2) P3(A3) ··· Pn(An) ···    (11-6)

(note that the left-hand side involves a Cartesian product of events while the right-hand side is an algebraic product of probabilities). We realize that every event in F can

be represented as countable unions and/or intersections of events of the form (11-4). And, we use the Axioms of Probability (specifically, the Countable Additivity property - possessed by all valid probability measures) to extend definition (11-6) to all of F. For example, consider the event [the first two tosses are different] ∈ F given by (11-5). The probability of this event is

P[the first two tosses are different]
  = P[ ( {H1} × {T2} × S3 × S4 × ··· ) ∪ ( {T1} × {H2} × S3 × S4 × ··· ) ]
  = P[ {H1} × {T2} × S3 × S4 × ··· ] + P[ {T1} × {H2} × S3 × S4 × ··· ]
  = P1[H1] P2[T2] P3[S3] P4[S4] ··· + P1[T1] P2[H2] P3[S3] P4[S4] ···
  = 1/4 + 1/4
  = 1/2 ,    (11-7)


where we have used the fact that the event [the first two tosses are different] can be represented

as the union of two events of the form (11-4). This finishes the definitions of P and our product

space (S, F, P). Note that we have developed the same product space that is discussed in

Chapter 8 of Stark and Woods, 4th edition (also in the 3rd edition, Ch. 6). Finally, using our

infinite-dimensional product space (S, F, P), we are in a position to define the Bernoulli trials

random sequence. Denote an elementary outcome in S as ζ. That is, ζ = (ζ1, ζ2, ... ) ∈ S, where each ζk ∈ Sk, k ≥ 1, is either a head or tail (so that ζ is an infinite indexed sequence of heads and tails). We define the Bernoulli trials random sequence as

X(n;ζ) = 1,  ζn = Hn
       = 0,  ζn = Tn ,    (11-8)

a mapping from S into the set of binary functions (remember that ζk is the kth component of ζ).
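A minimal simulation sketch of the Bernoulli trials sequence (11-8) is given below; it also provides a Monte Carlo check of the probability computed in (11-7). The truncation of each head/tail sequence to a finite length and the number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# A minimal sketch of the Bernoulli trials sequence (11-8): each elementary
# outcome zeta is a (truncated) head/tail sequence, and X(n; zeta) = 1 for
# heads, 0 for tails.  The truncation length and trial count are arbitrary.
num_outcomes, length = 100_000, 16
X = rng.integers(0, 2, size=(num_outcomes, length))   # rows ~ outcomes zeta, columns ~ tosses

# Empirical check of (11-7): P[first two tosses are different] = 1/2.
p_hat = np.mean(X[:, 0] != X[:, 1])
print(p_hat)   # close to 0.5
```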

The Limit of Nested Event Sequences

When dealing with an infinite sequence of random variables, we need to be able to define

the notion of a limit of an event sequence. In general, the limit of an event sequence is

somewhat complicated and abstract. Before considering the general case, we first consider the

important simple special case of nested sequences.

A nested decreasing sequence of events is a simple concept. The event sequence Ak, k ≥ 1, is nested and decreasing if for each integer n ≥ 1 we have

A1 ⊃ A2 ⊃ … ⊃ An ⊃ ··· .    (11-9)

A convenient feature of such a sequence is that

AN = ∩_{k=1}^{N} Ak .    (11-10)

Like a bounded and monotone sequence of real numbers, all of which have a real-number limit, a nested decreasing sequence of events has a well-defined limit event. As N → ∞ in (11-10), we obtain A, a countable intersection of events. And, a countable intersection of events is an event (recall that the set of events, a σ-algebra, is closed under countable unions and intersections). So, the limit of (11-10) is well defined. Often, we write AN ↓ A, where A is the limit.

In some applications, an event can be expressed as the limit of a nested decreasing sequence of events, a sometimes-valuable representation. For example, let X(n), n ≥ 1, be a sequence of random variables and consider

A ≡ {X(n) < 5, n ≥ 1} = limit_{N→∞} [ {X(1) < 5} ∩ {X(2) < 5} ∩ {X(3) < 5} ∩ ··· ∩ {X(N) < 5} ] = limit_{N→∞} AN ,    (11-11)

where

AN = ∩_{n=1}^{N} {X(n) < 5} .    (11-12)

Note that A1 ⊃ A2 ⊃ … ⊃ AN so that AN is a nested decreasing sequence that has the limit A ≡ {X(n) < 5, n ≥ 1}.

Similar results and statements can be made for a nested increasing sequence of events. The sequence BN is a nested increasing event sequence if B1 ⊂ B2 ⊂ … ⊂ BN for all N. Furthermore, we can write

BN = ∪_{n=1}^{N} Bn .    (11-13)

A nested increasing event sequence always has a limit

B = limit_{n→∞} Bn ,    (11-14)

since B can be written as a countable union of events. Often, we write BN ↑ B.

Nested sequences of events are special cases of general event sequences. In Appendix

11B, we define the limit, when it exists, of an arbitrary sequence of events (unlike the case of

nested sequences, the limit of an arbitrary event sequence may not exist!).

Concerning infinite intersections and unions, some standard notation needs to be reviewed. For Bn, n ≥ 1, an arbitrary sequence of events, we utilize the standard notation

∩_{n=1}^{∞} Bn = limit_{N→∞} ∩_{n=1}^{N} Bn    (11-15)

∪_{n=1}^{∞} Bn = limit_{N→∞} ∪_{n=1}^{N} Bn .    (11-16)

Of course, the limits (11-15) and (11-16) may, or may not, exist when the Bn are non-nested.

Computing P[A], where A is the Limit of a Nested Event Sequence

We need to be able to compute probabilities like P[X(n) < 5, n ≥ 1]. This probability can, we will argue, be computed as the limit of P[AN], where AN is represented by (11-12). That is, we need to show the second equality in (the first equality is a definition)

P[ ∩_{n=1}^{∞} {X(n) < 5} ] = P[ limit_{N→∞} ∩_{n=1}^{N} {X(n) < 5} ] = limit_{N→∞} P[ ∩_{n=1}^{N} {X(n) < 5} ] .    (11-17)


To any specified accuracy, this limit can be approximated by using sufficiently large N.

The second equality in Equation (11-17) follows from the continuity of the probability

measure P, a fact that we will argue in what follows. The events

AN = ∩_{n=1}^{N} {X(n) < 5}    (11-18)

form an indexed set of nested, decreasing events. The limit of the nested sequence is

A = limit_{N→∞} AN = limit_{N→∞} ∩_{n=1}^{N} {X(n) < 5} = ∩_{n=1}^{∞} {X(n) < 5} .    (11-19)

As will be shown in a section that follows, for the nested sequence of decreasing events, we have

limit_{N→∞} P[AN] = P[ limit_{N→∞} AN ] = P[A] .    (11-20)

That is, we can interchange P and the limit operations. A similar statement will be made for a nested sequence of increasing events, BN ↑ B.

Nested sequences are just special cases. In Appendix 11-B, we define what is meant by

the limit of an event sequence where the events are not generally nested. Also, we argue that

(11-20) is true for arbitrary convergent sequences of events.

Continuity of a Probability Measure

On a general probability space (S, F, P), the probability measure P has a continuity

property. This is satisfying from an intuitive sense; it allows us to use P as a metric, or “gauge”,

to “measure” the “size” of an event. Also, the continuity of P is used when we approximate the

probability of an event that is represented as the limit of an infinite sequence of nested events.

There is an analog here to the theory of continuous functions. Let f(x) be any function

with domain that includes x0. Then f(x) is continuous at x0 if and only if

limit_{n→∞} f(xn) = f( limit_{n→∞} xn ) = f(x0)    (11-21)

for all sequences {xn} that converge to x0. In words, Equation (11-21) states that one can

interchange limit and function computation. In the sense described by Theorem 11-1 (and the

more inclusive results given in Appendix 11B), this basic idea carries over to probability

measures.

Theorem 11-1: Consider an increasing sequence of events as shown by Figure 11-1. That is, the events are such that Bn ⊂ Bn+1 for all n ≥ 1. Define the infinite union of these events as

B ≡ ∪_{n=1}^{∞} Bn = limit_{N→∞} ∪_{n=1}^{N} Bn = limit_{N→∞} BN ,    (11-22)

a well-defined event (since a σ-algebra is closed under countable unions). Then, to any degree of accuracy that is required, P[B] can be approximated by P[Bn] for sufficiently large n. That is, we have

limit_{n→∞} P[Bn] = P[ limit_{n→∞} Bn ] = P[B] .    (11-23)

Figure 11-1: An increasing sequence of events B1 ⊂ B2 ⊂ B3 ⊂ B4 ⊂ ··· .


In words, (11-23) says that we can move the limit operation from “outside” to “inside” the

probability measure (interchange the limit and probability operations).

Proof: We define the sequence of events

A1 = B1
A2 = B2 ∩ B̄1
⋮
An = Bn ∩ B̄n−1
⋮    (11-24)

where the over-bar denotes set complement. The Bn are nested, and the disjoint An, 1 ≤ n ≤ N, "union up" to BN so we can write

BN = ∪_{n=1}^{N} Bn = ∪_{n=1}^{N} An , 1 ≤ N ≤ ∞ (i.e., including N = ∞) .    (11-25)

B B A i.e. . (11-25)

As a result of this, we have

N N N

N n n nn 1n 1 n 1

[ ]

P P P P B B A A (11-26)

for all finite N. Now take the limit of (11-26) to obtain

limit_{N→∞} P[BN] = limit_{N→∞} Σ_{n=1}^{N} P[An] = Σ_{n=1}^{∞} P[An] .    (11-27)

Now, the most crucial step in the proof answers the question: does the sum on the right-hand side of (11-27) converge? If yes, what does it converge to? Since the An are disjoint, we can use the Countable Additivity Property of P (see Chapter 1) to write

limit_{N→∞} Σ_{n=1}^{N} P[An] = Σ_{n=1}^{∞} P[An] = P[ ∪_{n=1}^{∞} An ] ≤ 1 .    (11-28)

In (11-27), the middle Nth partial sum is an increasing sequence of real numbers that is bounded

above by unity, as can be seen by (11-28). Hence, the limits in (11-27) and (11-28) converge.

To find out what they converge to, simply use

∪_{n=1}^{∞} An = ∪_{n=1}^{∞} Bn = B ,    (11-29)

in (11-28) to obtain the desired result

limit_{n→∞} P[Bn] = P[ limit_{n→∞} Bn ] = P[B] .    (11-30)

Corollary 11-1: A version of Theorem 11-1 holds for a decreasing nested sequence of events. That is, suppose Bn ⊃ Bn+1 for n ≥ 1. Then we can write

limit_{n→∞} P[Bn] = P[ limit_{n→∞} Bn ] = P[B] ,    (11-31)

where

BN = ∩_{n=1}^{N} Bn ,   B = ∩_{n=1}^{∞} Bn .    (11-32)


Proof: Similar to the proof given for Theorem 11-1.

Appendix 11B extends Theorem 11-1 to more general, non-nested sequences of events.

In the appendix, we define the limit, if it exists, of an event sequence, not necessarily nested. If

event A is the limit of an infinite event sequence An, we show that P[A] is the limit of P[An] as

index n approaches infinity. So, the probability measure P is continuous!! The analogy, drawn

in the paragraph preceding Theorem 11-1, to continuous functions is valid!

Example 11-3: Theorem 11-1 and its corollary are used to approximate the probability of an event that is represented as the limit of an infinite sequence of events. For example, for each n ≥ 0, let Bn = {X[k] < 2 for 0 ≤ k ≤ n}. This is a decreasing and nested sequence of events: Bn+1 ⊂ Bn, n ≥ 0. Suppose we wanted to calculate P[B], where B = {X[k] < 2 for 0 ≤ k}. We know that

B = ∩_{n=0}^{∞} Bn .    (11-33)

We use the corollary to approximate (as closely as desired) P[B] as the probability of a finite intersection. That is, based on our accuracy requirements, we select N and approximate

P[B] = P[ ∩_{n=0}^{∞} Bn ] ≈ P[ ∩_{n=0}^{N} Bn ] = P[ X(0) < 2, X(1) < 2, … , X(N) < 2 ] .    (11-34)
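The following sketch illustrates the approximation (11-34) numerically. The particular model, independent X[k] with P[X[k] < 2] = 1 − 2^(-(k+1)), is an assumption made only for illustration; it produces a nested decreasing P[BN] that settles to a nonzero limit.

```python
import numpy as np

# Illustrative assumption (not from the notes): independent X[k] with
# P[X[k] < 2] = 1 - 2**-(k+1), so P[B_N] = prod_{k=0}^{N} (1 - 2**-(k+1)).
# Corollary 11-1 says P[B_N] decreases to P[B] as N grows.
def P_BN(N):
    k = np.arange(N + 1)
    return np.prod(1.0 - 2.0 ** -(k + 1))

for N in (1, 5, 10, 20, 40):
    print(N, P_BN(N))   # values settle near 0.2888, approximating P[B]
```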

Example 11-4: Back in Chapter 2 of these class notes, we were told that probability distribution functions are right continuous. We were told that

F(x) = limit_{n→∞} F(x + 1/n)    (11-35)

for any distribution function F(x) and all x. However, Equation (11-35) follows directly from

Theorem 11-1 since

limit_{n→∞} F(x + 1/n) = limit_{n→∞} P[ X ≤ x + 1/n ]
                       = P[ limit_{n→∞} {X ≤ x + 1/n} ]
                       = P[ X ≤ x ]
                       = F(x) .    (11-36)

Statistical Specification of a Random Sequence

In this chapter, we assume that all random variables are real-valued. This assumption

greatly simplifies the notation, definitions and theory. From a conceptual standpoint, little is lost

by assuming that everything is real valued (however, complex-valued random sequences are

important - and often used - in many applications where band-pass signals are represented by

their complex-valued, low-pass equivalents).

A random sequence is statistically specified by its distribution functions; all orders are required in general. That is, for each positive integer n, and for all positive integer sequences k1 < k2 < … < kn, we need knowledge of the nth-order distribution function

F(x_k1, x_k2, … , x_kn; k1, k2, ..., kn) = P[ X[k1] ≤ x_k1, X[k2] ≤ x_k2, … , X[kn] ≤ x_kn ] .    (11-37)

Note that a complete statistical specification requires an infinite set of distribution functions. In (11-37), the algebraic variables x_k1, x_k2, … , x_kn are called realization variables. The subscripts on these variables serve only to distinguish one variable from another; F(α, β, γ; k1, k2, k3) is just as meaningful as F(x_k1, x_k2, x_k3; k1, k2, k3).


The probability density functions are obtained by differentiating distribution functions. That is, the nth-order probability density function is defined as

f(x_k1, x_k2, … , x_kn; k1, k2, ..., kn) = ∂ⁿ F(x_k1, x_k2, … , x_kn; k1, k2, ..., kn) / ( ∂x_k1 ∂x_k2 ··· ∂x_kn ) .    (11-38)

The moments of a random sequence are important in applications. The mean (sometimes called the first-order average) is defined as

η(n) = E[X(n)] = ∫_{−∞}^{∞} x f(x; n) dx    (11-39)

for a sequence of continuous random variables.

Second-order statistical averages appear often in practice. For example, the autocorrelation function is defined as

RX(n,m) = E[X(n)X(m)] = ∫∫ x_n x_m f(x_n, x_m; n, m) dx_n dx_m .    (11-40)

In a similar manner, the autocovariance function is defined as

CX(n,m) = E[ {X(n) − η(n)}{X(m) − η(m)} ] = ∫∫ {x_n − η(n)}{x_m − η(m)} f(x_n, x_m; n, m) dx_n dx_m .    (11-41)

Note that both RX and CX are symmetric

RX(n,m) = E[X(n)X(m)] = E[X(m)X(n)] = RX(m,n)    (11-42)

CX(n,m) = E[ {X(n) − η(n)}{X(m) − η(m)} ] = E[ {X(m) − η(m)}{X(n) − η(n)} ] = CX(m,n) .    (11-43)

Also, we can write

CX(n,m) = RX(n,m) − η(n)η(m) .    (11-44)

The sequence X(n) is said to have uncorrelated elements (or to be uncorrelated) if

RX(n,m) = E[X(n)X(m)] = E[X(n)]E[X(m)] = η(n)η(m),  n ≠ m .

For such a sequence, (11-44) leads to the conclusion that

CX(n,m) = RX(n,m) − η(n)η(m) = σ²(n),  n = m
        = 0,  n ≠ m ,    (11-45)

where σ²(n) denotes the sequence variance.

Example 11-5: Many applications involve the arrival of objects. For example, we may be

interested in the arrival of cars at an intersection, the arrival of electrons at the plate of a vacuum

tube, etc. A commonly-used simplifying assumption is that the objects arrive independently of

one another. Let τ(n) denote the interval of time (in seconds) between the arrival of the (n-1)th and nth objects (relative to a given initial time 0, τ(1) is the arrival time for the first object). The

time line is depicted by Fig. 11-2 below. For n ≥ 1, we assume that τ(n) is a sequence of identical, independent random variables, each with the exponential density

f(t; n) = λ exp[−λt] U(t) .    (11-46)

The mean of τ(n) is

η(n) = E[τ(n)] = ∫_0^∞ x λ e^{−λx} dx = 1/λ ,    (11-47)

and its variance is

σ²(n) = E[τ(n)²] − (1/λ)² = ∫_0^∞ x² λ e^{−λx} dx − (1/λ)² = 2/λ² − 1/λ² = 1/λ² .    (11-48)

Relative to a given initial time 0, the running sums of these intervals are the arrival times of the objects. That is, the arrival time of the nth object is

T(n) = Σ_{k=1}^{n} τ(k) ,    (11-49)

Fig. 11-2: Random arrival times. τ(n) is the time between arrivals, and T(n) is the actual arrival time (relative to origin 0).


a sequence of random variables indexed on n. Since the time intervals are independent, the density function fT(t;n) for T(n) is an (n−1)-fold convolution of (11-46) with itself. We claim that this first-order density function is

fT(t; n) = [ λ(λt)^{n−1} / (n−1)! ] exp(−λt) U(t) .    (11-50)

This result can be established by induction (by using a different approach, this same result was derived in Appendix 9B). Clearly, the result is correct for n = 1; assume it is true for n−1. Now, we convolve again to obtain

we convolve again to obtain

T T

n 2t

0

n-2tn0

n 1n

f (t n) f (t n -1) exp( t)U(t)

( {t })exp( ) exp( {t })d U(t)

(n 2)!

exp( t) d U(t)(n 2)!

texp( t) U(t)

(n 1)!

; ;

(11-51)

as claimed. Equation (11-51) is the Erlang density, and T(n) is an Erlang distributed random

variable (this same result was obtained in Appendix 9B). The expected value of random variable

T(n) is

T (n) n (n) n / . (11-52)

Since the interval random variables are independent, the variance of T(n) is

σT²(n) = n σ²(n) = n/λ² .    (11-53)
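A short Monte Carlo sketch of Example 11-5 follows; it checks the mean (11-52) and variance (11-53) of T(n) by summing independent exponential interarrival times. The rate λ and the sample size are arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative check of (11-52) and (11-53): T(n) is a sum of n independent
# exponential interarrival times tau(k).  lam and the sample size are arbitrary.
lam, n, trials = 2.0, 5, 200_000
tau = rng.exponential(scale=1.0 / lam, size=(trials, n))   # tau(1), ..., tau(n)
T = tau.sum(axis=1)                                        # arrival time T(n), Erlang distributed

print(T.mean(), n / lam)          # sample mean  vs  n/lam     (11-52)
print(T.var(),  n / lam**2)       # sample var   vs  n/lam^2   (11-53)
```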

Gaussian Random Sequences

A random sequence X(n) is called a Gaussian random sequence if all its nth-order

probability density functions are Gaussian. Such sequences are very popular. Because of the

Central Limit Theorem, Gaussian sequences occur in many applications. Also, they are

completely described by only first- and second-order statistical averages (i.e., means and

covariances). Finally, use of Gaussian statistics simplifies many technical developments and

makes mathematically tractable many problems in the areas of filtering, estimation, detection and

control.

Example 11-6: Let X(n) be a zero-mean Gaussian sequence; that is, E[X(n)] = 0 for all n. Also, let X be delta correlated; that is, RX(n,m) = E[X(n)X(m)] = σ²δ(n−m), where σ² is the variance and

δ(k) = 1,  k = 0
     = 0,  k ≠ 0 .    (11-54)

Often, delta-correlated sequences are said to be white; in many applications, delta-correlated

Gaussian sequences are called white Gaussian noise. For n ≠ m, X(n) and X(m) are uncorrelated

and, since they are Gaussian, independent. As a result, an Nth-order density function factors into

a product of N first-order density functions.

Most computer-based math packages (such as MATLAB, Mathcad, etc.) generate periodic

sequences that, for many purposes, can be used to approximate white Gaussian noise. For these

sequences, the correlation between elements can be very low, and the sequence period is very

long relative to the number of sequence values that are needed.
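In the same spirit, a minimal numpy sketch of Example 11-6 generates a pseudorandom approximation to a white Gaussian sequence and estimates its autocorrelation; the value of σ and the sequence length are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Minimal sketch of Example 11-6: a pseudorandom approximation to a white
# Gaussian sequence with variance sigma^2; sigma and N are arbitrary choices.
sigma, N = 1.5, 100_000
X = rng.normal(0.0, sigma, size=N)

# Sample autocorrelation R_X(k) = E[X(n+k)X(n)] for a few lags k.
for k in range(4):
    Rk = np.mean(X[k:] * X[:N - k])
    print(k, Rk)   # near sigma**2 for k = 0 and near 0 for k != 0
```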

Independent Increments

Random sequence X(n) is said to have independent increments if for all N > 1 and n1 < n2 < ... < nN the process increments X(n1), X(n2) - X(n1), X(n3) - X(n2), ... , X(nN) - X(nN-1) are

jointly independent. Such processes have the nice feature that Nth-order density and distribution

functions can be “built up” as products of the densities of the individual increments. For

example, the second order distribution, for the case n2 > n1, can be written as

F(x1, x2; n1, n2) = P[ X(n1) ≤ x1, X(n2) ≤ x2 ]
                  = P[ X(n1) ≤ x1, X(n2) − X(n1) ≤ x2 − x1 ]
                  = P[ X(n1) ≤ x1 ] P[ X(n2) − X(n1) ≤ x2 − x1 ] .    (11-55)

We have seen independent increment processes in previous chapters. For example, the

Random Walk process, introduced in Chapter 6, has independent increments.

Stationarity

Often, random sequences are generated by a mechanism that is not changing with time.

In these cases, the sequence moments are constant. More precisely, a random sequence is said to

be stationary if, for all positive integers N, the Nth-order density function of the sequence is

invariant to any shift of the index. That is, stationarity requires

f(x_n1, x_n2, … , x_nN; n1, n2, ..., nN) = f(x_n1, x_n2, … , x_nN; n1 + n0, n2 + n0, ..., nN + n0)    (11-56)

for all orders N and index shift values n0.

Example 11-5 introduces a random sequence τ(n) of interval times. Since the interval times are independent, Nth-order densities can be built up as products of first-order densities, each of a form given by (11-46). Clearly, the random sequence τ(n) of interval times is described by an Nth-order density that satisfies (11-56); the sequence τ(n) is stationary. On the other hand, the total waiting time to the nth arrival, the sum T(n) given by (11-49), is not stationary, as is obvious by inspection of (11-52) and (11-53).

Wide-Sense Stationarity (WSS)

A weaker form of stationarity is adequate in some applications. Sometimes, all that is

required is “stationarity in all second-order statistics”. We say that a random sequence is wide

sense stationary (WSS) if its mean function is constant and its covariance depends only on the

time difference. That is, the sequence is WSS if

η(n) = η(0)    (11-57)

RX(n,m) = RX(n − m) = RX(k) ,    (11-58)

where k ≡ n − m is the time difference between the two sequence values. Clearly, all stationary

sequences are WSS. However, the converse is not true. Gaussian sequences provide an

interesting example for which there is no difference between the two forms of stationarity.

Two distinct sequences can have “mutual stationarity” properties. Wide sense stationary

sequences X(n) and Y(m) are said to be jointly wide sense stationary if

Rxy[n,m] = E[X(n)Y(m)] = Rxy[n − m] .    (11-59)

That is, the cross correlation depends only on the time difference n-m.

Suppose X(n) and Y(m) are jointly WSS so that Rxy[n − m] = E[X(n)Y(m)]. We define k ≡ n − m and write

Rxy[k] = E[X(m + k)Y(m)] .    (11-60)

That is, for Rxy[k], k denotes the shift applied to the first indexed sequence (i.e., X(m)). Note that Rxy[k] ≠ Ryx[k], in general. However, note that Rxy[k] = E[X(m+k)Y(m)] = E[X(m)Y(m−k)] = Ryx[−k].

Power Spectral Density

Let X(n) be a real-valued, wide-sense-stationary sequence with finite average power E[X(n)²] < ∞. Denote the autocorrelation of X as Rx(k) = E[X(n+k)X(n)]. The power spectrum (or power spectral density) is denoted as Sx(ω). The celebrated Wiener-Khinchine theorem states that the power spectrum and autocorrelation comprise a discrete-time Fourier transform (DTFT) pair. That is, we write

Sx(ω) = F[Rx(k)] = Σ_{k=−∞}^{∞} Rx(k) e^{−jωk} ,

Rx(k) = (1/2π) ∫_{−π}^{π} Sx(ω) e^{jωk} dω .    (11-61)

Actually, Sx is 2π-periodic in ω and only need be specified on −π < ω ≤ π. The average power in X(n) can be expressed as

Avg. Pwr = E[X²(n)] = (1/2π) ∫_{−π}^{π} Sx(ω) dω  watts.    (11-62)
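A numerical sketch of (11-61) and (11-62) is given below for the illustrative autocorrelation Rx(k) = a^|k| (the same model used later in Example 11-8); the truncation of the DTFT sum and the value of a are assumptions made for the example.

```python
import numpy as np

# Numerical sketch of (11-61)/(11-62) for the illustrative autocorrelation
# R_x(k) = a**|k|; a and the truncation point are arbitrary choices.
a = 0.6
k = np.arange(-200, 201)            # truncate the infinite sum at |k| <= 200
Rx = a ** np.abs(k)

w = np.linspace(-np.pi, np.pi, 4001)
Sx = (Rx[None, :] * np.exp(-1j * w[:, None] * k[None, :])).sum(axis=1).real  # DTFT (11-61)

avg_power = np.trapz(Sx, w) / (2 * np.pi)   # (1/2pi) * integral of S_x  (11-62)
print(avg_power, Rx[k == 0][0])             # both close to R_x(0) = 1
```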

White Noise Sequence

A zero-mean X(n) is said to be a white noise sequence if

Rx(k) = E[X(n+k)X(n)] = σ²δ(k) .    (11-63)

Note that σ² is the finite variance of X(n). The power spectral density of X is

Sx(ω) = F[Rx(k)] = Σ_{k=−∞}^{∞} σ²δ(k) e^{−jωk} = σ² .    (11-64)

The average power in X is

Avg. Pwr = E[X²(n)] = (1/2π) ∫_{−π}^{π} σ² dω = σ²  watts.    (11-65)

Note that a discrete-time white sequence has a finite average power (contrast this with the continuous-time case discussed in Chapter 8).

Systems

We are interested in systems with random sequence inputs. First, we review some basic

definitions involving systems. Then, we focus on determining the mean and autocorrelation of

the output of a linear system given descriptions of the input process and system impulse

response.

Given input sequence X(n,ζ), we denote the system output as

Y(n,ζ) = L[X(n,ζ)] ,    (11-66)

where operator L[·] maps input X into output Y. Often, we do not explicitly write the variable ζ in the notation; we write Y(n) = L[X(n)] with the ζ implied.

The system is said to be linear if

L[ αX1(n) + βX2(n) ] = αL[X1(n)] + βL[X2(n)]    (11-67)

for all inputs X1, X2 and all constants α, β. A linear system can be described by a unit sample response, denoted as h(n,m), assumed to be real-valued in what follows. This function is the response at time n to a unit sample function applied at time m. In general, impulse response h(n,m) may, or may not, depend on the absolute values of indices n and m, and h(n,m) may, or may not, be nonzero for values of n less than m. Given input X(n) and impulse response

h(n,m), we can express the output as

Y(n) = L[X(n)] = Σ_{ℓ=−∞}^{∞} h(n, ℓ) X(ℓ) .    (11-68)

A linear system is said to be shift invariant (or time invariant) if a simple delay in the input sequence produces a corresponding delay in the output sequence. More formally, we say that linear system L[·] is shift invariant if

Y(n) = L[X(n)]  implies  Y(n − n0) = L[X(n − n0)]    (11-69)

for all input/output pairs (X, Y) and all index shifts n0. Shift invariant systems depend only on the difference of n and ℓ, not their absolute values. In this case, we can write h(n,ℓ) = h(n − ℓ). Also, for shift invariant systems, Equation (11-68) becomes

Y(n) = L[X(n)] = Σ_{ℓ=−∞}^{∞} h(n − ℓ) X(ℓ) = h ∗ X ,    (11-70)

the convolution of input X with impulse response h.
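The following minimal sketch evaluates the convolution sum (11-70) for a short, arbitrarily chosen unit sample response and input record.

```python
import numpy as np

# Minimal sketch of (11-70): output of a shift-invariant linear system is the
# convolution of the input with the unit sample response h.  The particular
# h and X below are arbitrary illustrative choices.
h = np.array([1.0, 0.5, 0.25])              # finite-length unit sample response
X = np.array([1.0, 2.0, 0.0, -1.0, 3.0])    # a finite input record

Y = np.convolve(X, h)                       # Y(n) = sum_l h(n - l) X(l)
print(Y)
```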

A system is said to be bounded input - bounded output (BIBO) stable if bounded input sequences produce bounded output sequences. A linear, shift-invariant system is BIBO stable if, and only if, its impulse response is absolutely summable; that is, BIBO stability is equivalent to

Σ_{n=−∞}^{∞} |h(n)| < ∞ .    (11-71)

A linear, shift-invariant system is said to be causal if it does not respond before it is


excited. More explicitly, for a causal system, if two inputs X1(n) and X2(n) are equal up to some

index n0, then the corresponding outputs Y1(n) = L[X1(n)] and Y2(n) = L[X2(n)] must be equal up

to index n0; what happens to the inputs after index n0 in no way influences the outputs before

index n0. One can show that a linear, shift-invariant system is causal if, and only if, h(n) = 0 for

n < 0. For a linear, shift-invariant and causal system, the input-output relationship becomes

Y(n) = L[X(n)] = Σ_{ℓ=−∞}^{n} h(n − ℓ) X(ℓ) .    (11-72)

One should consider the differences between (11-68), the most general I/O formula, (11-70) for

the shift-invariant case and (11-72) which describes the most restrictive case.

Linear, shift-invariant systems can be analyzed in the frequency domain. For this purpose, we describe the Fourier transform of signal X(k) as

XF(e^{jω}) = Σ_{k=−∞}^{∞} X(k) e^{−jωk}    (11-73)

(we will use a subscript of F to denote a Fourier transform). If (11-73) converges, XF is a continuous, 2π-periodic function of frequency variable ω. The inverse Fourier transform is

X(n) = (1/2π) ∫_{−π}^{π} XF(e^{jω}) e^{jωn} dω .    (11-74)

The Fourier transform of a linear, shift-invariant system's output can be found easily. Simply use the convolution theorem with (11-70) to obtain

YF(e^{jω}) = F[ h(n) ∗ X(n) ] = HF(e^{jω}) XF(e^{jω}) .    (11-75)


Systems With Random Inputs

Given a system with a random input, we determine below the mean and autocorrelation

of the output. A more general, difficult problem is to find the Nth-order density function that

describes the system output. A linear system with a Gaussian input will have a Gaussian output.

Unfortunately, a general statement of this scope cannot be made for nonlinear systems or

systems driven by non-Gaussian inputs.

Theorem 11-2: Consider the linear system with input X(n) and output Y(n) = L[X(n)] (we do

not require shift-invariance or causality). Suppose that both ηX(n) = E[X(n)] and ηY(n) = E[Y(n)] exist. For this case, we can write

ηY(n) = E[Y(n)] = E[ L[X(k)] ] = L[ E[X(k)] ] = L[ ηX(n) ] .    (11-76)

That is, it is possible to interchange the operations of L[·] and E[·]. We write

E[Y(n)] = E[ Σ_{m=−∞}^{∞} h(n,m) X(m) ] .    (11-77)

Then, we formally interchange the summation and expectation to obtain

ηY(n) = E[Y(n)] = E[ Σ_{m=−∞}^{∞} h(n,m) X(m) ] = Σ_{m=−∞}^{∞} h(n,m) E[X(m)] = Σ_{m=−∞}^{∞} h(n,m) ηX(m) ,    (11-78)

and (11-76) is established.

Note that our derivation of (11-76) is not rigorous. A potential problem with (11-78) is

the formal interchange of expectation and summation. In cases where the mean of Y(n) does not


exist, this interchange is not valid (can you construct a simple example where the mean of Y(n)

does not exist, i.e., the interchange in (11-78) is not valid?). We will consider this “interchange

problem” again once we have studied some convergence concepts.

Let's consider a special case of (11-78); suppose that input X(n) is wide-sense stationary and the system is shift invariant. Then, h(n,m) = h(n − m) and ηX(n) = ηX is a constant so that

ηY(n) = ηX Σ_{m=−∞}^{∞} h(n − m) = ηX [ Σ_{m=−∞}^{∞} h(m) ] = ηX H(e^{j0}) ,    (11-79)

so ηY(n) = ηY is a constant as well. The bracketed quantity on the right-hand side of (11-79) is the DC gain of the system (which we assume to be finite in the development of (11-79)).

Example 11-7: Consider a low-pass filter with impulse response h(n) = αⁿU(n), where 0 < α < 1 to insure stability. The Fourier transform of h is

H(e^{jω}) = 1 / (1 − αe^{−jω}) .    (11-80)

According to (11-79), the mean of the filter output is ηX/(1 − α).
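As a check on Example 11-7, the sketch below realizes h(n) = αⁿU(n) through the equivalent recursion y[n] = αy[n−1] + x[n] and compares the output sample mean with ηX/(1 − α) from (11-79); the input distribution and the value of α are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Sketch of Example 11-7: h(n) = alpha**n U(n) is realized by the recursion
# y[n] = alpha*y[n-1] + x[n].  The input mean and distribution are arbitrary.
alpha, mu_x, N = 0.8, 2.0, 200_000
x = mu_x + rng.normal(0.0, 1.0, size=N)      # WSS input with mean mu_x

y = np.empty(N)
acc = 0.0
for n in range(N):
    acc = alpha * acc + x[n]                 # convolution with h(n) = alpha**n U(n)
    y[n] = acc

print(y[1000:].mean(), mu_x / (1 - alpha))   # output mean vs eta_X/(1 - alpha), per (11-79)
```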

Next, we determine the cross correlation between a system input X(n) and its output

Y(n), both input and output assumed to be real valued. This quantity is defined as

RXY(n,m) = E[X(n)Y(m)] .    (11-81)

Then, we use this result to find the autocorrelation RY of the system output.

Theorem 11-3: Let X(n) and Y(n) denote the input and output, respectively, of a linear operator

L[·]; that is, Y(n) = L[X(n)]. The cross-correlation between the input X and output Y can be

calculated by the formula

RXY(n,m) = L2[ RX(n,m) ] ,    (11-82)

where L2 signifies that L operates with respect to the second variable (i.e., "m" is the independent variable in the operation), treating the first variable (i.e., "n") as a constant. In a similar manner, the autocorrelation of the output can be calculated by the formula

RY(n,m) = L1[ RXY(n,m) ] ,    (11-83)

where L1 signifies that L operates with respect to the first variable only (i.e., "n" is the independent variable in the operation).

Proof (see Theorems 7-1 and 7-2 for continuous-time version of this result): First, we write

X(n)Y(m) = X(n)L[X(m)] = L2[ X(n)X(m) ] ,    (11-84)

where L2 operates on X(m). Now, take the expected value of this result to obtain

E[X(n)Y(m)] = E[ L2[X(n)X(m)] ] = L2[ E[X(n)X(m)] ] = L2[ RX(n,m) ] ,    (11-85)

and this establishes (11-82). The formula for the autocorrelation of the output can be developed

by taking the expectation of the product Y(n)Y(m) to obtain

RY(n,m) = E[Y(n)Y(m)] = E[ L[X(n)]Y(m) ] = E[ L1[X(n)Y(m)] ]
        = L1[ E[X(n)Y(m)] ]
        = L1[ RXY(n,m) ] ,    (11-86)

(L1 operates on functions of n) and this establishes (11-83) so that the theorem is established.


Note that Theorem 11-3 does not require that operator L (i.e., linear system) be time invariant or

that the input be wide-sense stationary.

Let us consider Theorem 11-3 specialized to the case of a WSS input sequence X(n) and

a shift-invariant, linear system described by unit sample function h(n). For this case, formula

(11-82) yields

RXY(n, m) = Σ_{ℓ=−∞}^{∞} RX(n, m − ℓ) h(ℓ)
          = Σ_{ℓ=−∞}^{∞} RX([n − m] + ℓ) h(ℓ) = Σ_{ℓ=−∞}^{∞} RX([n − m] − ℓ) h(−ℓ) .    (11-87)

Observe that the right-hand side of (11-87) depends on n, m only through the difference k ≡ n − m. Hence, X and Y are jointly wide sense stationary, and we can write

RXY(k) = RX(k) ∗ h(−k) .    (11-88)

For the WSS case, the output correlation formula (11-86) becomes

RY(n,m) = Σ_{ℓ=−∞}^{∞} RXY({n − ℓ} − m) h(ℓ) = Σ_{ℓ=−∞}^{∞} RXY({n − m} − ℓ) h(ℓ) ,    (11-89)

a formula depending on k ≡ n − m. Hence, we write

RY(k) = RXY(k) ∗ h(k) .    (11-90)

Finally, combining (11-88) and (11-90) yields

RY(k) = RX(k) ∗ h(−k) ∗ h(k) = RX(k) ∗ { h(−k) ∗ h(k) } ,    (11-91)

and we see that a WSS input produces a WSS output.

Example 11-8: Suppose output Y is related to input X by the simple relationship

Y(n) = L[X(n)] = X(n) - X(n -1) , (11-92)

the first-order, backwards difference. For example, sequence Y(n) might be subjected to a

threshold to implement a “pulse detector” function. The mean of the output is

E[Y(n)] = E[X(n)] - E[X(n -1)] = ηX(n) − ηX(n − 1) .    (11-93)

The cross-correlation between input and output is

RXY(n,m) = L2[ RX(n,m) ] = RX(n,m) − RX(n,m − 1) .    (11-94)

Finally, the autocorrelation of the output is

RY(n, m) = L1[ RXY(n, m) ] = RXY(n, m) − RXY(n − 1, m)
         = RX(n, m) − RX(n, m − 1) − { RX(n − 1, m) − RX(n − 1, m − 1) }
         = RX(n, m) − RX(n − 1, m) − RX(n, m − 1) + RX(n − 1, m − 1) .    (11-95)

Suppose the input is WSS with autocorrelation

RX(n,m) = a^|n−m| ,  0 < a < 1.    (11-96)


Then Equations (11-93) and (11-94) yield

ηY = 0

RXY(n,m) = a^|n−m| − a^|n−m+1| ,    (11-97)

and (11-95) yields

RY(n,m) = 2a^|n−m| − a^|n−1−m| − a^|n−m+1| .    (11-98)

The output sequence Y(n) is WSS; if k ≡ n − m, then (11-96) and (11-98) become

RX(k) = a^|k|    (11-99)

RY(k) = 2a^|k| − a^|k−1| − a^|k+1| ,    (11-100)

respectively. A comparison of Fig. 11-3 and Fig. 11-4 (both correlations were computed and plotted for a = .6) reveals that the "pulse detector" (11-92) "decorrelates" the input data X(n), at least to some extent.

Fig. 11-3: Rx(k), Eqn. (11-99) with a = .6.

Fig. 11-4: Ry(k), Eqn. (11-100) with a = .6.
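A short sketch comparing (11-99) and (11-100) numerically, with a = 0.6 as in the figures, is given below.

```python
import numpy as np

# Sketch of Example 11-8: compare R_X(k) = a**|k| with the first-difference
# output correlation R_Y(k) = 2a**|k| - a**|k-1| - a**|k+1|; a = 0.6 as in the figures.
a = 0.6
k = np.arange(-10, 11)
RX = a ** np.abs(k)
RY = 2 * a ** np.abs(k) - a ** np.abs(k - 1) - a ** np.abs(k + 1)

for kk, rx, ry in zip(k, RX, RY):
    print(kk, round(rx, 4), round(ry, 4))   # R_Y is much more concentrated near k = 0
```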

Vector Space of Random Variables

All real-valued random variables with finite second moments (i.e., finite average power) comprise a vector space over the field of real numbers. We define vector space L2 as

L2 ≡ { X : E[|X|²] < ∞ } ,    (11-101)

all real-valued, finite-second-moment random variables. We take the real number field, denoted here by R, as our scalar field. To show that L2 is a valid vector space, we must show, among other things, that L2 is closed under vector addition (i.e., X ∈ L2 and Y ∈ L2 implies that X + Y ∈ L2) and scalar multiplication (i.e., X ∈ L2 and c ∈ R implies that cX ∈ L2).

The fact that L2 is closed under scalar multiplication follows easily. Clearly, if X ∈ L2 and c ∈ R, then E[(cX)²] = c²E[X²] < ∞, so cX ∈ L2.

The fact that L2 is closed under vector addition follows from use of the Schwarz

inequality (sometimes called the Cauchy-Schwarz inequality).

Theorem 11-4 (Schwarz): Let X ∈ L2 and Y ∈ L2. Then

( E[XY] )² ≤ E[X²] E[Y²] .    (11-102)

Proof: Let λ be any real-valued number and consider

E[ {X + λY}² ] = λ²E[Y²] + 2λE[XY] + E[X²] ≥ 0 .    (11-103)

Now, Equation (11-103) is a quadratic equation in λ, and the roots are either complex-valued or real and equal (see Fig. 5). Hence, in the quadratic equation, the discriminant must be non-positive, or

4( E[XY] )² − 4E[X²]E[Y²] ≤ 0 .    (11-104)

The Schwarz inequality follows directly from (11-104). In (11-102), equality results when Y is a

scalar multiple of X.
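A quick numerical sanity check of the Schwarz inequality (11-102) is sketched below; the joint distribution of X and Y is an arbitrary assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Numerical sanity check of the Schwarz inequality (11-102) for an arbitrary
# pair of finite-second-moment random variables; the joint law is an assumption.
N = 200_000
X = rng.normal(0.0, 1.0, size=N)
Y = 0.3 * X + rng.uniform(-1.0, 1.0, size=N)    # correlated with X, still in L2

lhs = np.mean(X * Y) ** 2                       # (E[XY])^2
rhs = np.mean(X ** 2) * np.mean(Y ** 2)         # E[X^2] E[Y^2]
print(lhs, rhs, lhs <= rhs)                     # inequality holds (True)
```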

Now, we show that L2 is closed under vector addition. Let X ∈ L2 and Y ∈ L2 and consider the sum X + Y. The second moment of the sum satisfies

E[ {X + Y}² ] = E[X²] + 2E[XY] + E[Y²]
             ≤ E[X²] + 2 √(E[X²]) √(E[Y²]) + E[Y²] .    (11-105)

However, all quantities on the right-hand side of (11-105) are finite since X ∈ L2 and Y ∈ L2. Hence, the sum X + Y ∈ L2, and L2 is closed under vector addition. Closure under vector addition

and scalar multiplication is necessary for L2 to be a valid vector space. The remaining

requirements (found in any elementary text on linear algebra) that L2 must satisfy are shown

easily. Hence, we can consider the set of all real-valued random variables with finite second

moments to be a valid vector space.

Fig. 5: Graph of the quadratic λ²E[Y²] + 2λE[XY] + E[X²] versus λ.


Equality of Random Variables

Let X and Y be random variables. The statement X = Y can be interpreted in different

ways. Everything said about statement X = Y can be said about the equivalent statement X - Y =

0, and vice-versa. Hence, without loss of generality, we discuss the meaning of the statement

random variable X = 0.

X ≡ 0 Identically

The statement X ≡ 0 identically means that the numerical value of X(ζ) = 0 for all ζ ∈ S.

This is a very restrictive form of equality, one that is rarely needed in applications. Hence, we

seek a “looser” interpretation of statement X = 0.

X = 0 Almost Surely (a.s.) Means P[X = 0] = 1

The statement X = 0 almost surely (a.s.) means P[X = 0] = P[{ζ ∈ S : X(ζ) = 0}] = 1.

Often, this condition is stated as

1) X = 0 almost everywhere (a.e.)

2) X = 0 with probability one,

both equivalent phrases (used by different authors). It should be noted that X = 0 (a.s.) is NOT equivalent to X ≡ 0 (i.e., X = 0 for all ζ ∈ S, or everywhere). If X = 0 (a.s.), the event B = {ζ ∈ S : X(ζ) ≠ 0} has probability zero; however, it can be nonempty.

E[X²] = 0 is Equivalent to P[X = 0] = 1 (Same as X = 0 (a.s.))

In words, E[X²] = 0 is stated as X = 0 in mean square, or more simply, X = 0 (m.s.). E[X²] = 0 is equivalent to P[X ≠ 0] = 0. To prove this, we show E[X²] = 0 if, and only if, P[X ≠ 0] = 0. First, we show the "if" part: assume P[X ≠ 0] = 0. Then, X is a discrete random variable with all probability concentrated at the origin; that is, its distribution function is F(x) = U(x), a unit step. Observe that

E[X²] = ∫ x² (dF(x)/dx) dx = ∫ x² δ(x) dx = 0 .    (11-106)


Second, we show the "only if" part: assume E[X²] = 0. With this, use the Generalized Tchebycheff Inequality (see Chapter 2 of these class notes) to write

P[ X² ≥ 1/N ] ≤ E[X²] / (1/N) = N E[X²] = 0    (11-107)

for each integer N > 0. Now, note that

P[X ≠ 0] = P[X² > 0] = P[ ∪_{n=1}^{∞} {X² ≥ 1/n} ] = P[ limit_{N→∞} ∪_{n=1}^{N} {X² ≥ 1/n} ] .    (11-108)

But, as indexed on n, the sequence of events {X² ≥ 1/n} is nested increasing. Use continuity of probability and (11-107) to write

P[X ≠ 0] = P[ limit_{N→∞} ∪_{n=1}^{N} {X² ≥ 1/n} ] = limit_{N→∞} P[ ∪_{n=1}^{N} {X² ≥ 1/n} ] = limit_{N→∞} P[ X² ≥ 1/N ] = 0 .    (11-109)

Equations (11-106) and (11-109) lead to the conclusion that

E[X²] = 0 if, and only if, P[X ≠ 0] = 0 (same as X = 0 (a.s.)).    (11-110)

It is worth repeating that P[X = 0] = 1 is not equivalent to the statement X(ζ) = 0 for all ζ ∈ S.

Subspace of L2

M is said to be a subspace of L2 if it is a valid vector space (i.e., closed under scalar

multiplication and vector addition in addition to the other requirements given in any linear

algebra text) and M ⊂ L2. Subspaces play a crucial role in many applications that involve


optimization problems.

Inner Product and Norm

It is natural to define an inner product on L2 as the expected value of a product. That is, for any X ∈ L2 and Y ∈ L2, we denote the inner product (dot product or scalar product) as ⟨X,Y⟩, and we define

⟨X, Y⟩ ≡ E[XY] .    (11-111)

The Cauchy-Schwarz inequality (11-102) implies that

|⟨X, Y⟩| ≤ √⟨X,X⟩ √⟨Y,Y⟩ .    (11-112)

That is, the inner product exists as a real number for every vector X and Y in L2. It can be shown that inner product ⟨X,Y⟩ = E[XY] satisfies the properties

1. ⟨X,X⟩ ≥ 0, and ⟨X,X⟩ = 0 if and only if X = 0 almost surely (i.e., P[X = 0] = 1),
2. ⟨X,Y⟩ = ⟨Y,X⟩, and
3. ⟨cX,Y⟩ = c⟨X,Y⟩, where c ∈ R.    (11-113)

If E[X] = 0, then the second moment ⟨X,X⟩ = E[X²] is the variance of random variable X. Random variables X and Y are said to be orthogonal if ⟨X, Y⟩ = E[XY] = 0.

In Part 1) of (11-113), the statement "X = 0 almost surely" is not equivalent to X ≡ 0 (i.e., X identically zero); so, ⟨X,X⟩ = 0 is not equivalent to X ≡ 0. However, the equivalence of ⟨X,X⟩ = 0 and X ≡ 0 is a general requirement of an inner product, as defined in almost all linear algebra books. However, in the applications literature, this subtle "issue" is overlooked, and (11-111) is declared a valid inner product.


Some authors change how random variables are defined/interpreted in an attempt to

remove the phrase “almost surely” from Part 1) of (11-113) and “fix” the above-mentioned

“issue”. They interpret a given X as a class of all random variables that are equal to X almost

surely. Two members of the same class can differ on some set B as long as P[B] = 0. All class

members will have the same expected value; sets of probability zero do not influence

expectations. When computing ⟨X, Y⟩ = E[XY], X represents any member from its class, as

does Y; the expectation will be the same regardless of which class members are used.

Interpreting a random variable as a class of equivalent random variables allows us to “fix” Part 1

of (11-113), removing the phrase “almost surely”. In terms of equivalent classes, the statement

“X = 0” refers to a class of random variables, all of which are zero almost surely.

On a vector space, a vector norm maps vectors into real numbers in a manner that adapts the concept of length to vectors. Almost universally, the norm of vector X is denoted as ‖X‖. On vector space L2, we define the norm of X as

‖X‖ ≡ √⟨X,X⟩ = √(E[X²])    (11-114)

(we say that the inner product induces the norm). From (11-113) it follows directly that (11-114) satisfies

1. ‖X‖ ≥ 0. ‖X‖ = 0 if, and only if, X = 0 almost surely (i.e., P[X = 0] = 1),
2. ‖cX‖ = |c| ‖X‖, for any c ∈ R, and
3. ‖X + Y‖ ≤ ‖X‖ + ‖Y‖ (the triangle inequality).    (11-115)

If E[X] = 0, then ‖X‖ is the standard deviation of X.

In Part 1) of (11-115), P[X = 0] = 1 is not equivalent to X ≡ 0 (i.e., X identically zero); so, ‖X‖ = 0 is not equivalent to X ≡ 0. However, the equivalence of ‖X‖ = 0 and X ≡ 0 is a

general requirement of a vector norm (see any text on linear analysis). However, in the

applications literature, this subtle “issue” is overlooked, and (11-114) is declared to be a valid

vector norm. This “problem” can be “fixed” by interpreting each random variable as a class, as

discussed above.

Often, norm (11-114) is called the mean-square norm since it involves the mean of the

square of a random variable. In terms of (11-114), we can restate the Schwarz inequality as

|⟨X, Y⟩| ≤ ‖X‖ ‖Y‖ .    (11-116)

Equation (11-116) is how the Schwarz inequality is usually stated in the analysis literature where

the notions of inner product and norm play central roles.

The triangle inequality (Part 3 of (11-115)) has a form similar to the well-known triangle inequality for real numbers (which states that |r1 + r2| ≤ |r1| + |r2| for any real numbers r1 and r2). This inequality follows from the observation

‖X + Y‖² = ⟨X + Y, X + Y⟩ = ⟨X,X⟩ + 2⟨X,Y⟩ + ⟨Y,Y⟩ ≤ ‖X‖² + 2‖X‖‖Y‖ + ‖Y‖² = ( ‖X‖ + ‖Y‖ )² ,    (11-117)

which leads to the triangular inequality ‖X + Y‖ ≤ ‖X‖ + ‖Y‖.

The norm (11-114) gives us a way to define the equality of two vectors (random variables). If X and Y are random variables for which

‖X − Y‖ = 0 ,    (11-118)

we say that X = Y in the mean-square sense, or we say X = Y (m.s.). From (11-110), we see that (11-118) is equivalent to P[X = Y] = 1 and P[X ≠ Y] = 0.


Convergence of Random Sequences

Often, one has to deal with sequences of random variables that converge to a random

variable. We say that the random sequence X(n;ζ) converges to random variable X0(ζ) if for every fixed ζ0 ∈ S the sequence of numbers X(n;ζ0) converges to the number X0(ζ0). This is

"ordinary", sometimes called point-wise, sequence convergence (a topic that is usually covered

in a Calculus course) that has nothing to do with the fact that we are dealing with random

variables. Also, it is very restrictive. In applications, we can get by with much "weaker" modes

of convergence; we discuss three alternative convergence modes. In what follows, we discuss

almost sure (a.s.) convergence, convergence in probability (i.p.) and mean-square (m.s.)

convergence. Mean square convergence is convergence in the mean-square norm (11-114). We

discuss m.s. convergence first.

Mean-Square Convergence (m.s. Convergence)

As n goes to infinity, a sequence of random variables X(n) ∈ L2 converges in m.s. to a random variable X0 ∈ L2 if

limit_{n→∞} ‖X0 − X(n)‖ = 0  ( same as limit_{n→∞} E[{X0 − X(n)}²]^{1/2} = 0 ) .    (11-119)

The norm used in (11-119) is the mean-square norm given by (11-114). Often, this type of convergence is denoted as

X(n) → X0 (m.s.) ,    (11-120)

or

l.i.m_{n→∞} X(n) = X0 ,    (11-121)


where l.i.m denotes limit in the mean.

Example 11-9: Let Z be a random variable with E[Z²] < ∞ (i.e., Z ∈ L2). Let cn, n ≥ 0, be a sequence of deterministic real numbers converging to real number c. Then, cnZ, n ≥ 0, is a sequence of random variables. We show that

l.i.m_{n→∞} cnZ = cZ .    (11-122)

To see this, consider

E[ {cnZ − cZ}² ] = E[ {cn − c}² Z² ] = (cn − c)² E[Z²] .

Now, cn → c and E[Z²] < ∞ implies E[{cnZ − cZ}²] → 0, and this proves (11-122).

Example 11-10: Consider the probability space (S, B, P), where S = [0, 1], B the Borel sets (B is the σ-algebra generated by the open intervals on S; see Chapter 1 of class notes), and

P[B] = ∫_B dζ ,  B ∈ B    (11-123)

(if B is an interval, then P[B] is the interval length. P can be thought of as a "generalized length" of event B). Consider the sequence of random variables defined by

Fig. 6: Sequence of random variables


X(n;ζ) = 1,  0 ≤ ζ ≤ 1/n
       = 0,  1/n < ζ ≤ 1 ,    (11-124)

as illustrated by Fig. 6. This sequence has a point-wise limit given by

limit_{n→∞} X(n;ζ) = 1,  ζ = 0
                   = 0,  ζ > 0 .    (11-125)

This sequence has zero as its mean-square limit since

limit_{n→∞} ‖X(n;ζ) − 0‖² = limit_{n→∞} E[X²(n;ζ)] = limit_{n→∞} [1/n] = 0 .    (11-126)
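A small numerical sketch of Example 11-10 follows: with ζ drawn uniformly from [0, 1], the sample estimate of E[X²(n;ζ)] tracks 1/n, consistent with (11-126).

```python
import numpy as np

rng = np.random.default_rng(5)

# Sketch of Example 11-10: zeta ~ Uniform[0,1], X(n; zeta) = 1 on [0, 1/n].
# The mean-square "distance" E[X(n)^2] equals 1/n, so X(n) -> 0 (m.s.).
zeta = rng.uniform(0.0, 1.0, size=500_000)
for n in (1, 2, 5, 10, 100):
    Xn = (zeta <= 1.0 / n).astype(float)
    print(n, np.mean(Xn ** 2), 1.0 / n)   # sample E[X(n)^2] vs exact 1/n
```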

Theorem 11-5: Mean-square convergence is additive. That is, if

X0 = l.i.m_{n→∞} X(n)
Y0 = l.i.m_{n→∞} Y(n) ,    (11-127)

then for any real-valued constants a and b we have

aX0 + bY0 = l.i.m_{n→∞} [ aX(n) + bY(n) ] .    (11-128)

Proof: Note that

‖{aX(n) + bY(n)} − {aX0 + bY0}‖ = ‖a{X(n) − X0} + b{Y(n) − Y0}‖
                                ≤ ‖a{X(n) − X0}‖ + ‖b{Y(n) − Y0}‖
                                = |a| ‖X(n) − X0‖ + |b| ‖Y(n) − Y0‖ .    (11-129)

However, Equation (11-127) ensures that the right-hand-side of (11-129) approaches zero as n

approaches infinity, and this proves (11-128).

Not every sequence of random variables has a mean square limit. We need tools and

techniques for determining if a sequence has a mean-square limit. Fortunately, our intuition is

helpful in this regard. Also helpful is some knowledge of real number sequences. Recall that

real number sequences have the Cauchy property. This property states that a real number sequence {rn} converges if, and only if, |rn − rm| → 0 as both n and m approach infinity. When equipped with the Euclidean norm, the set of real numbers is complete, we say. Similarly, sequences in L2 have the Cauchy property. This property states that a sequence X(n) ∈ L2 converges (in the mean square norm) if, and only if, ‖X(n) − X(m)‖ → 0 as both n and m approach infinity. When equipped with the mean square norm, the set of L2 random variables is complete, we say. Stated again, a random sequence X(n) ∈ L2 has a mean-square limit X0 if, and only if, it is Cauchy (that is, ‖X(n) − X(m)‖ → 0 as both n and m approach infinity).

Mean-Square Cauchy Sequences and Completeness

Let X(n), n ≥ 0, be a sequence in L2. The sequence is said to be a mean-square Cauchy sequence if

limit_{n,m→∞} ‖X(n) − X(m)‖ = 0 .    (11-130)

More tersely, we say that the sequence is m.s. Cauchy if (11-130) is true. For a m.s. Cauchy

sequence, the quantity ‖X(n) − X(m)‖ approaches zero as n and m approach infinity, in any

manner whatever. Basically, the further you “go out” in a mean-square Cauchy sequence the


"closer" (in the mean-square sense) the elements become.

It is easy to show that mean-square convergence implies the mean-square Cauchy

property (i.e., (11-120) implies (11-130)). Actually, this is true for arbitrary normed vector

spaces (i.e., all convergent sequences are Cauchy sequences, regardless of the normed vector

space under consideration). However, for the general normed vector space, Cauchy sequences

are not necessarily convergent. But, for L2 space equipped with the mean-square norm, the

mean-square Cauchy property implies mean square convergence. This is stated by the following

theorem.

Theorem 11-6 (Special Case of Riesz-Fischer Theorem)

Vector space L2 is complete in the sense that a mean-square Cauchy sequence has a

unique limit in L2. That is, for sequence X(n) in L2, there exists a unique element X0 ∈ L2 such that

limit_{n→∞} ‖X0 − X(n)‖ = 0  ( denoted symbolically as l.i.m_{n→∞} X(n) = X0 )    (11-131)

if

limit_{n,m→∞} ‖X(n) − X(m)‖ = 0  ( denoted symbolically as l.i.m_{n,m→∞} [X(n) − X(m)] = 0 ) .    (11-132)

Since the converse is true (see paragraph before the theorem statement), (11-131) and (11-132)

are equivalent for vector space L2. In (11-132), one must remember that the double limit is zero

regardless of how n and m approach infinity.

The value of Theorem 11-6 is this: we do not have to know/find the m.s. limit of a

sequence to know that the sequence is m.s. convergent. To show that L2 sequence X(n)

converges to some m.s. limit X0, we need not know/find X0. Instead, to show convergence, it is


sufficient to show that X(n) has elements that come arbitrarily close to one another as you “go

out” in the sequence. In some cases, establishing (11-132) is much easier than finding X0

described by (11-131).

With the introduction of Theorem 11-6, we have established L2 as a complete vector space with norm (11-114) that is induced by inner product (11-111). In the literature, such vector spaces are referred to as Hilbert spaces. They are the natural setting for many significant

problems in Fourier series, communication theory, optimal filtering, etc.

Mean-square convergence has a number of useful properties. We discuss the ability to

interchange l.i.m and expectation. Also, we show that a mean-square limit is unique (with

equality in the mean-square sense). To develop these results, we must mention some (almost)

obvious facts. Note that

\mathrm{l.i.m.}_{n\to\infty}\, X(n)     (11-133)

is a random variable, but

\lim_{n\to\infty} E[X(n)]     (11-134)

is an "ordinary" limit of an "ordinary" sequence. Also, for any random variable X in L2, we have

| E[X] | \;\le\; E[\,|X|\,] \;=\; E[\,|X| \cdot 1\,] \;\le\; \| X \| \, \| 1 \| \;=\; \| X \| .     (11-135)

The first inequality results from the fact that the absolute value of an integral is less than, or

equal to, the integral of the absolute value. The second inequality comes from the Cauchy-

Schwarz inequality (11-102) with Y = 1. Now, we show that we can interchange expectation and

l.i.m.


Theorem 11-7: Let X(n) be a sequence in L2. Suppose X(n) has a m.s. limit $X_0 \in L_2$; that is,

X(n) \xrightarrow{\,m.s.\,} X_0 \;\Longleftrightarrow\; \lim_{n\to\infty} \| X(n) - X_0 \| = 0 .     (11-136)

Then it follows that

E[X_0] \;=\; E\Big[\, \mathrm{l.i.m.}_{n\to\infty}\, X(n) \,\Big] \;=\; \lim_{n\to\infty} E[X(n)] .     (11-137)

That is, expectation and l.i.m are interchangeable.

Proof: Since L2 is complete, mean-square limit X0 is in L2 (X0 has a finite second moment), so

E[X0] exists (i.e., the mean is finite). Now, from (11-135), we have

| E[X(n)] - E[X_0] | \;=\; | E[X(n) - X_0] | \;\le\; E[\,| X(n) - X_0 |\,] \;\le\; \| X(n) - X_0 \| .     (11-138)

However, from (11-136) we know that the norm on the right-hand side of (11-138) goes to zero

as n approaches infinity. Hence, we have the desired result (11-137).

An important use of Theorem 11-7 deals with interchanging expectations and

summations. For k = 1, 2, …, let $X_k \in L_2$ be a sequence of random variables with finite second

moments. Define the nth partial sum

Y_n \;=\; \sum_{k=1}^{n} X_k .     (11-139)

Suppose that

Y \;=\; \mathrm{l.i.m.}_{n\to\infty}\, Y_n \;=\; \mathrm{l.i.m.}_{n\to\infty} \sum_{k=1}^{n} X_k .     (11-140)

We say that partial sum (11-139) converges in mean square to Y. By Theorem 11-7, we can

write

E[Y] \;=\; E\Big[\, \mathrm{l.i.m.}_{n\to\infty}\, Y_n \,\Big] \;=\; E\Big[\, \mathrm{l.i.m.}_{n\to\infty} \sum_{k=1}^{n} X_k \,\Big] \;=\; \lim_{n\to\infty} \sum_{k=1}^{n} E[X_k] \;=\; \sum_{k=1}^{\infty} E[X_k] .     (11-141)
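The interchange in (11-141) is easy to check numerically in a hypothetical concrete case (an assumption introduced only for illustration): take X_k = B_k/2^k with B_k i.i.d. Bernoulli(1/2), so the partial sums are mean-square Cauchy and (11-141) predicts E[Y] = Σ_k (1/2)2^{-k} = 1/2. A minimal Python sketch:

import numpy as np

rng = np.random.default_rng(1)
num_trials, n_terms = 100_000, 60

# Hypothetical concrete case: X_k = B_k / 2^k with B_k i.i.d. Bernoulli(1/2).
# The partial sums Y_n are mean-square Cauchy, so Y = l.i.m. Y_n exists, and (11-141)
# predicts E[Y] = sum_k E[X_k] = sum_k (1/2) * 2^{-k} = 1/2.
b = rng.integers(0, 2, size=(num_trials, n_terms)).astype(float)   # B_k samples
weights = 0.5 ** np.arange(1, n_terms + 1)                         # 2^{-k}, k = 1..n_terms
y = b @ weights                                                     # truncated partial sums Y_n
print("Monte-Carlo estimate of E[Y]:", float(y.mean()))             # ~ 0.5
print("sum of E[X_k]              :", float(0.5 * weights.sum()))   # -> 0.5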

Theorem 11-8: The mean-square limit of a sequence is unique. That is, if

X_0 \;=\; \mathrm{l.i.m.}_{n\to\infty}\, X(n) \;\Longleftrightarrow\; \lim_{n\to\infty} \| X_0 - X(n) \| = 0
Y_0 \;=\; \mathrm{l.i.m.}_{m\to\infty}\, X(m) \;\Longleftrightarrow\; \lim_{m\to\infty} \| Y_0 - X(m) \| = 0 ,     (11-142)

then $\| X_0 - Y_0 \| = 0$ and P[X0 = Y0] = 1.

Proof: Observe that

\| X_0 - Y_0 \| \;=\; \| \{X_0 - X(n)\} + \{X(n) - Y_0\} \| \;\le\; \| X_0 - X(n) \| + \| X(n) - Y_0 \|     (11-143)

from the triangle inequality. Now, on the right-hand side of (11-143), both norms go to zero as a

consequence of (11-142). Hence, we have $\| X_0 - Y_0 \| = 0$ as claimed. The fact that P[X0 = Y0] =

1 follows immediately from (11-110).

Example 11-11: We are trying to sample a DC voltage (for example, the output of a strain

gauge, water tank level detector, etc.). However, our samples contain additive noise; the kth

sample is $Y(k) = m_{dc} + \eta(k)$, where $m_{dc}$ is the DC voltage we are trying to measure, and $\eta(k)$ is a real-valued sample of stationary, zero-mean noise with variance $\sigma^2$. We assume that $\eta(k)$ is


uncorrelated from sample to sample (any two different-indexed samples are uncorrelated). We

try the "time-honored" technique of averaging out the noise. That is, we form the average

X(n) \;=\; \frac{1}{n} \sum_{k=1}^{n} Y(k) .     (11-144)

Note that X(n) has $m_{dc}$ as its mean and $\sigma^2/n$ as its variance (indeed, with increasing n, we are "averaging out" the noise). However, the question remains: as $n \to \infty$, does the random sequence $X(n) \in L_2$ converge in mean square to a random variable? Let's see if the sequence is mean-square Cauchy; consider

\| X(m) - X(n) \|^2 \;=\; E\big[ \big( \{X(m) - m_{dc}\} - \{X(n) - m_{dc}\} \big)^2 \big]
\;=\; E\big[ \{X(m) - m_{dc}\}^2 \big] - 2\,E\big[ \{X(m) - m_{dc}\}\{X(n) - m_{dc}\} \big] + E\big[ \{X(n) - m_{dc}\}^2 \big]
\;=\; \frac{\sigma^2}{m} + \frac{\sigma^2}{n} - 2\,E\big[ \{X(m) - m_{dc}\}\{X(n) - m_{dc}\} \big] .     (11-145)

Consider the case n > m and use the fact that the noise is uncorrelated from sample to sample to

evaluate the middle term

E\big[ \{X(m) - m_{dc}\}\{X(n) - m_{dc}\} \big] \;=\; E\Big[ \{X(m) - m_{dc}\}\Big( \tfrac{m}{n}\{X(m) - m_{dc}\} + \tfrac{1}{n}\textstyle\sum_{k=m+1}^{n} \eta(k) \Big) \Big]
\;=\; \tfrac{m}{n}\,E\big[ \{X(m) - m_{dc}\}^2 \big] + 0 \;=\; \tfrac{m}{n}\cdot\tfrac{\sigma^2}{m} \;=\; \frac{\sigma^2}{n} ,     (11-146)

since the noise samples with indices larger than m are uncorrelated with X(m) - m_{dc}.

Similarly, $E[\{X(m) - m_{dc}\}\{X(n) - m_{dc}\}] = \sigma^2/m$ for the case m > n; in either case, the cross term equals $\sigma^2/\max\{m,n\}$. Therefore, we can write (11-145) as


\| X(m) - X(n) \|^2 \;=\; \frac{\sigma^2}{m} + \frac{\sigma^2}{n} - \frac{2\,\sigma^2}{\max\{m,n\}} \;=\; \sigma^2 \left| \frac{1}{m} - \frac{1}{n} \right| .     (11-147)

As m and n approach infinity (in any order), (11-147) approaches zero, so the sequence is mean-

square Cauchy. By Theorem 11-6, the sequence is mean-square convergent. But what is its

limit? The obvious "candidate" is $m_{dc}$. To see that this is the limit, consider

\lim_{n\to\infty} \| X(n) - m_{dc} \| \;=\; \lim_{n\to\infty} \Big\| \frac{1}{n} \sum_{k=1}^{n} \eta(k) \Big\| \;=\; \lim_{n\to\infty} \frac{\sigma}{\sqrt{n}} \;=\; 0 .     (11-148)

So, we see that X(n) converges in mean square to mdc (and we can expect to get “better” results

the more samples are included in the average).
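A short Monte-Carlo sketch of Example 11-11 illustrates the σ²/n decay of the mean-square error. The numerical values m_dc = 1, σ = 0.5, the Gaussian noise model, the seed, and the trial counts are assumptions made only for this demonstration (the example itself leaves them unspecified):

import numpy as np

rng = np.random.default_rng(2)
m_dc, sigma = 1.0, 0.5          # illustrative values, not specified in the notes

def ms_error_of_average(n, num_trials=10_000):
    # Monte-Carlo estimate of E[(X(n) - m_dc)^2] for the sample average X(n) of (11-144),
    # with Gaussian samples standing in for the uncorrelated noise eta(k).
    noise = rng.normal(0.0, sigma, size=(num_trials, n))
    x_n = m_dc + noise.mean(axis=1)
    return float(np.mean((x_n - m_dc) ** 2))

for n in (1, 10, 100, 1000):
    print(f"n={n:5d}  estimate={ms_error_of_average(n):.6f}  sigma^2/n={sigma**2/n:.6f}")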

With Example 11-11, we have established a Mean-Square Law of Large Numbers for sequences of uncorrelated random variables. More generally, let $Y_k$, k = 1, 2, …, be a sequence of uncorrelated random variables with common mean $E[Y_k] = m$ and common variance $\mathrm{VAR}[Y_k] = \sigma^2$. Then the sample mean

X(n) \;=\; \frac{1}{n} \sum_{k=1}^{n} Y_k     (11-149)

converges in mean square to m.

In a subsequent section, we will show that (11-149) converges to m in probability, a yet-

to-be-defined mode of convergence that is weaker than mean-square convergence. That sample

mean (11-149) converges in probability to m is just the well-known and popular Law of Large

Numbers (weak version) that is cited often in the popular press.

Example 11-12: Let X(k), k ≥ 1, be a sequence of independent random variables each of which

is either 1 or 0. Furthermore, suppose that


P[X(k) = 1] \;=\; 1/k
P[X(k) = 0] \;=\; 1 - 1/k .     (11-150)

As $k \to \infty$, does X(k) converge in mean square? Let's check the obvious candidate X = 0;

consider

\lim_{n\to\infty} \| X(n) - 0 \| \;=\; \lim_{n\to\infty} \sqrt{1/n} \;=\; 0 .     (11-151)

So, we see that X(n) converges in mean square to the random variable X = 0. However, in

Example 11-16, we will see that X(n) does not converge (to zero) in a point-wise manner.

Example 11-13: Let X(k), k ≥ 1, be a sequence of independent random variables similar to the

previous example. However, suppose that X(k) is either k or 0 with

P[X(k) = k] \;=\; 1/k^2
P[X(k) = 0] \;=\; 1 - 1/k^2 .     (11-152)

So, as k becomes large, we see that X(k) takes on a larger and larger value with a smaller and smaller probability. Is X(k) mean

square convergent? To find out, consider

\| X(m) - X(n) \|^2 \;=\; E\big[ X^2(m) - 2X(m)X(n) + X^2(n) \big] \;=\; 2\,[\, 1 - 1/nm \,] ,     (11-153)

a result that converges to 2 as m and n approach infinity. Hence, X(n) is not mean-square Cauchy, so it is not mean-square convergent. The last two examples illustrate the fact that mean-

square convergence depends on both the numerical values a sequence takes on and the

probabilities of taking on those values.
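Because E[X(k)] and E[X²(k)] are available in closed form for both examples, the Cauchy gap E[(X(m) − X(n))²] can be computed exactly. The small Python sketch below (the function name and the particular index pairs are arbitrary choices) shows the gap shrinking for Example 11-12 and approaching 2 for Example 11-13:

def ms_gap(value_of_k, prob_of_k, m, n):
    # Exact E[(X(m) - X(n))^2] for independent X(m), X(n), where X(k) equals value_of_k(k)
    # with probability prob_of_k(k) and 0 otherwise:
    #   E[X(k)] = value*prob,  E[X(k)^2] = value^2*prob,
    #   E[(X(m)-X(n))^2] = E[X(m)^2] + E[X(n)^2] - 2*E[X(m)]*E[X(n)].
    second = lambda k: value_of_k(k) ** 2 * prob_of_k(k)
    first = lambda k: value_of_k(k) * prob_of_k(k)
    return second(m) + second(n) - 2.0 * first(m) * first(n)

for m, n in [(10, 20), (100, 200), (1000, 2000)]:
    gap_12 = ms_gap(lambda k: 1.0, lambda k: 1.0 / k, m, n)             # Example 11-12: -> 0
    gap_13 = ms_gap(lambda k: float(k), lambda k: 1.0 / k ** 2, m, n)   # Example 11-13: -> 2
    print(f"m={m:5d} n={n:5d}  gap (Ex. 11-12)={gap_12:.5f}  gap (Ex. 11-13)={gap_13:.5f}")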


Theorem 11-7 tells us that expectation and l.i.m. are interchangeable for m.s. convergent

sequences. A similar result holds for the inner product operation defined by (11-111).

Theorem 11-9 (Continuity of the Inner Product): Let X(n) and Y(m) be m.s. convergent

sequences with m.s. limits X0 and Y0, respectively, so that

X_0 \;=\; \mathrm{l.i.m.}_{n\to\infty}\, X(n) \;\Longleftrightarrow\; \lim_{n\to\infty} \| X_0 - X(n) \| = 0
Y_0 \;=\; \mathrm{l.i.m.}_{m\to\infty}\, Y(m) \;\Longleftrightarrow\; \lim_{m\to\infty} \| Y_0 - Y(m) \| = 0 .     (11-154)

Under these conditions, we claim that

\langle X_0 , Y_0 \rangle \;=\; \Big\langle\, \mathrm{l.i.m.}_{n\to\infty}\, X(n) ,\; \mathrm{l.i.m.}_{m\to\infty}\, Y(m) \,\Big\rangle \;=\; \lim_{n,m\to\infty} \langle X(n) , Y(m) \rangle .     (11-155)

Proof: First, consider the simple algebra

| \langle X(n), Y(m) \rangle - \langle X_0, Y_0 \rangle | \;=\; | \langle X(n), Y(m) \rangle - \langle X(n), Y_0 \rangle + \langle X(n), Y_0 \rangle - \langle X_0, Y_0 \rangle |
\;=\; | \langle X(n),\, Y(m) - Y_0 \rangle + \langle X(n) - X_0,\, Y_0 \rangle |
\;\le\; | \langle X(n),\, Y(m) - Y_0 \rangle | + | \langle X(n) - X_0,\, Y_0 \rangle |
\;\le\; \| X(n) \| \, \| Y(m) - Y_0 \| + \| X(n) - X_0 \| \, \| Y_0 \| .     (11-156)

Now, since $X(n) \xrightarrow{\,m.s.\,} X_0$ as $n \to \infty$, the sequence of norms $\|X(n)\|$ is bounded (can you show this?), say $\|X(n)\| \le M$. Use this fact, (11-154) and (11-156) to conclude

0 0 0 0 0n,m n,mlimit X(n),Y(m) - X ,Y limit Y(m) - Y + X(n) - X Y 0 ,

M (11-157)


a result that proves (11-155) and the continuity of the inner product.

Theorem 11-9 establishes continuity of the inner product $\langle X, Y \rangle \equiv E[XY]$. What we mean by this is simple. Suppose we are given sequences X(n) and Y(m) with m.s. limits X0 and Y0, respectively, as described by (11-154). For "large" n and m, X(n) and Y(m) "get close" to X0 and Y0, respectively, and $\langle X(n), Y(m) \rangle \equiv E[X(n)Y(m)]$ "gets close" to $\langle X_0, Y_0 \rangle \equiv E[X_0 Y_0]$. This intuitive idea is known as continuity of the inner product.

Convergence in Probability (i.p. Convergence)

Some results that involve mean square convergence of random sequences can be

generalized to a "weaker" convergence mode. This new mode is called convergence in

probability. It is "weaker" (i.e., more general) than m.s. convergence; m.s. convergent sequences

also converge in probability, but the converse is not true.

As $n \to \infty$, a random sequence X(n) converges in probability (i.p.) to a random variable X0 if, for every $\varepsilon > 0$, we have

\lim_{n\to\infty} P\big[\, | X(n) - X_0 | > \varepsilon \,\big] \;=\; 0 .     (11-158)

Often, this type of convergence is denoted by either of

X(n) \xrightarrow{\,i.p.\,} X_0     (11-159)

\mathrm{l.i.p.}_{n\to\infty}\, X(n) \;=\; X_0 .     (11-160)

For convergence in probability, many of the results parallel those given above for m.s. convergence. First, as we "go out" in a sequence (i.e., as the index becomes large), it may become more likely that the terms are close together (this does not mean that the terms must be closer together in the m.s. sense). We say that a random sequence X(n) is Cauchy in probability if, for every $\varepsilon$ >


0, we have

\lim_{n,m\to\infty} P\big[\, | X(m) - X(n) | > \varepsilon \,\big] \;=\; 0 .     (11-161)

Cauchy in probability is a "weaker" condition than Cauchy in the mean-square sense. A sequence that is mean-square Cauchy is also Cauchy in probability (condition (11-130) implies condition (11-161)), but the converse is not true. Next, we

provide a theorem that does for convergence in probability what Theorem 11-6 did for

convergence in mean square.

Theorem 11-10: As $n \to \infty$, a sequence X(n) converges in probability to a random variable X0

if, and only if, the sequence is Cauchy in probability.

Proof: First, we show that if X(n) converges in probability to X0 then it is Cauchy in

probability. Suppose that the sequence converges in probability. Then note the event (i.e., set)

relationship

\big\{\, | X(m) - X(n) | > \varepsilon \,\big\} \;\subset\; \big\{\, | X(m) - X_0 | > \varepsilon/2 \,\big\} \cup \big\{\, | X(n) - X_0 | > \varepsilon/2 \,\big\} ,     (11-162)

as depicted by Figure 11-7. From (11-162), we see that

P\big[\, | X(m) - X(n) | > \varepsilon \,\big] \;\le\; P\big[\, | X(m) - X_0 | > \varepsilon/2 \,\big] + P\big[\, | X(n) - X_0 | > \varepsilon/2 \,\big] .     (11-163)

Figure 11-7: If $|X(n) - X(m)| \ge \varepsilon$, then either $|X(n) - X_0| \ge \varepsilon/2$ or $|X(m) - X_0| \ge \varepsilon/2$.


Now, since X(n) converges to X0 in probability, both terms on the right-hand side of (11-163)

approach zero as n and m approach infinity. Hence, the sequence is Cauchy in probability as

claimed. The converse (if X(n) is Cauchy in probability then it converges in probability) is

harder to prove and is not given here (see M. Loève, Probability Theory I, 4th Edition, pp. 117-

118).

Theorem 11-11: If a sequence converges in probability, then the limit is unique. That is,

suppose X(n) converges in probability to both X0 and Y0. Then it necessarily follows that $P[X_0 \ne Y_0] = 0$.

Proof: Using the same reasoning that led to (11-163), we can write

\big\{\, | X_0 - Y_0 | > \varepsilon \,\big\} \;\subset\; \big\{\, | X_0 - X(n) | > \varepsilon/2 \,\big\} \cup \big\{\, | Y_0 - X(n) | > \varepsilon/2 \,\big\}     (11-164)

P\big[\, | X_0 - Y_0 | > \varepsilon \,\big] \;\le\; P\big[\, | X_0 - X(n) | > \varepsilon/2 \,\big] + P\big[\, | Y_0 - X(n) | > \varepsilon/2 \,\big] .     (11-165)

However, both terms on the right-hand side of (11-165) approach zero as n approaches infinity.

Hence, for every > 0 we have

P\big[\, | X_0 - Y_0 | > \varepsilon \,\big] \;=\; 0 ,     (11-166)

so that

\lim_{\varepsilon \to 0^{+}} P\big[\, | X_0 - Y_0 | > \varepsilon \,\big] \;=\; 0 .     (11-167)

Continuity of the probability measure (see Appendix 11B) and (11-167) lead to the conclusion

P\big[\, | X_0 - Y_0 | > 0 \,\big] \;=\; 0 ,     (11-168)


and this establishes the claim that $P[X_0 \ne Y_0] = 0$.

As claimed previously, convergence in mean square implies convergence in probability.

This claim is substantiated by the following theorem (which is a nice application of the

Tchebycheff inequality).

Theorem 11-12: Convergence in mean square implies convergence in probability.

Proof: Let X(n) be a sequence that converges in mean square to the random variable X0. For

each n, apply the generalized Tchebycheff inequality (see Chapter 2 of these notes) to X(n) - X0

and obtain

P\big[\, | X(n) - X_0 | \ge \varepsilon \,\big] \;\le\; \frac{E\big[\, | X(n) - X_0 |^2 \,\big]}{\varepsilon^2} \;=\; \frac{\| X(n) - X_0 \|^2}{\varepsilon^2}     (11-169)

for every $\varepsilon > 0$. However, we know that $X(n) \xrightarrow{\,m.s.\,} X_0$, so that $\|X(n) - X_0\| \to 0$ as $n \to \infty$.

Hence, with (11-169), we have

\lim_{n\to\infty} P\big[\, | X(n) - X_0 | \ge \varepsilon \,\big] \;=\; 0 ,     (11-170)

so that $X(n) \xrightarrow{\,i.p.\,} X_0$ as claimed.
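A numerical illustration of Theorem 11-12, with the Tchebycheff bound of (11-169), can be built on the sample-average sequence of Example 11-11. The parameter values, Gaussian noise model, seed, and trial counts below are assumptions chosen only for the demonstration:

import numpy as np

rng = np.random.default_rng(3)
m_dc, sigma, eps = 1.0, 0.5, 0.1     # illustrative values, not taken from the notes

def deviation_probability(n, num_trials=10_000):
    # Empirical P[|X(n) - m_dc| >= eps] for the sample average X(n) of Example 11-11,
    # using Gaussian samples as a stand-in for the uncorrelated noise.
    noise = rng.normal(0.0, sigma, size=(num_trials, n))
    x_n = m_dc + noise.mean(axis=1)
    return float(np.mean(np.abs(x_n - m_dc) >= eps))

for n in (10, 100, 1000):
    bound = sigma ** 2 / (n * eps ** 2)       # the Tchebycheff bound of (11-169)
    print(f"n={n:5d}  empirical P={deviation_probability(n):.4f}  Tchebycheff bound={bound:.4f}")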

Let’s reconsider Examples 11-11 and 11-12, both of which provided sequences that

converged in the mean square sense. Now, we know that these sequences converge in

probability, as implied by Theorem 11-12. Actually, that the sequence in Example 11-11

converges in probability is just a statement of the Law of Large Numbers (weak version).

Theorem 11-13 (The Weak Law of Large Numbers): Let X(n) be a sequence of independent,

identically distributed (i.i.d.) random variables with mean $\eta_X$ and variance $\sigma_X^2$. Then, the sample

mean

\hat{\eta}(n) \;=\; \frac{1}{n} \sum_{k=1}^{n} X(k)     (11-171)

converges in probability to the "real" mean $\eta_X$ as n approaches infinity.

Proof: The proof of this theorem follows from Example 11-11 and Theorem 11-12.

The Law of Large Numbers is the basis for estimating $\eta_X$ from measurements. In applications, it is common to take the sample mean (11-171) as an estimate of the "real" mean $\eta_X$; the Law of Large Numbers is what justifies doing so.

Example 11-14: In Example 11-13, we considered a sequence of independent random variables

X(k), k ≥ 1, with

P[X(k) = k] \;=\; 1/k^2
P[X(k) = 0] \;=\; 1 - 1/k^2 .

We found out that this sequence does not converge in the mean square sense (the rare, but very large, values X(k) = k contribute $k^2 \cdot (1/k^2) = 1$ to $E[X^2(k)]$, so m.s. convergence is not achieved). Now, we show that it does converge in probability to X0 = 0. For every $\varepsilon > 0$, we have

\lim_{k\to\infty} P\big[\, | X(k) - X_0 | > \varepsilon \,\big] \;=\; \lim_{k\to\infty} P\big[\, X(k) > \varepsilon \,\big] \;=\; \lim_{k\to\infty} P\big[\, X(k) = k \,\big] \;=\; \lim_{k\to\infty} 1/k^2 \;=\; 0 ,     (11-172)

and we see that the sequence converges in probability to zero (in (11-172), only probabilities

that X(k) = k are involved; for k ≥ 1, the actual numerical values of X(k) do not enter into the

computation).

The converse of Theorem 11-12 is not true (convergence in probability does not imply

convergence in mean square), and Example 11-14 is a counter example that establishes this fact.


Basically, convergence in mean square is dependent upon both the numerical values of the

sequence elements and the probabilities associated with the values. On the other hand,

convergence in probability is only concerned with the probabilities.
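A tiny numeric table makes the point for the sequence of Examples 11-13/11-14 (the particular k values are arbitrary): the probability of a nonzero sample vanishes, yet the second moment never shrinks.

# P[X(k) != 0] = 1/k^2 vanishes, but E[X(k)^2] = k^2 * (1/k^2) = 1 for every k,
# which is why the sequence converges in probability but not in mean square.
for k in (10, 100, 1000, 10_000):
    prob_nonzero = 1.0 / k ** 2
    second_moment = (k ** 2) * prob_nonzero
    print(f"k={k:6d}  P[X(k)=k]={prob_nonzero:.2e}  E[X(k)^2]={second_moment:.1f}")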

Example 11-15: For convergence in probability, this example shows that one cannot interchange

the limit and expectation operations. For n ≥ 1, consider the sequence X(n), where X(n) is either -1 or n. Also, suppose that

P[X(n) = n] \;=\; 1/n
P[X(n) = -1] \;=\; 1 - 1/n .     (11-173)

The sequence converges in probability to X0 = -1 since

\lim_{k\to\infty} P\big[\, | X(k) - X_0 | > \varepsilon \,\big] \;=\; \lim_{k\to\infty} P\big[\, | X(k) - (-1) | > \varepsilon \,\big] \;=\; \lim_{k\to\infty} P\big[\, X(k) = k \,\big] \;=\; \lim_{k\to\infty} 1/k \;=\; 0 .     (11-174)

Now, we look at mean values. Clearly, E[X0] = -1, and

E[X(n)] \;=\; n\left(\frac{1}{n}\right) + (-1)\left(1 - \frac{1}{n}\right) \;=\; \frac{1}{n} ,     (11-175)

which has a zero limit as n approaches infinity. Hence, we have shown that

\lim_{n\to\infty} E[X(n)] \;=\; 0 \;\ne\; E\Big[\, \mathrm{l.i.p.}_{n\to\infty}\, X(n) \,\Big] \;=\; E[X_0] \;=\; -1 .     (11-176)

Example 11-15 serves as a counter example that shows that you cannot, in general,

interchange the operations of limit in probability and expectation. That is, it is not generally true


that $E\big[\, \mathrm{l.i.p.}_{n\to\infty}\, X(n) \,\big]$ and $\lim_{n\to\infty} E[X(n)]$ produce the same value (this differs from mean-square convergence; recall that Theorem 11-7 proved that expectation and l.i.m. are interchangeable). So, while convergence in probability is very general (and weak), there are limitations on what you can do with it.
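Example 11-15 is easy to visualize numerically: almost every draw of X(n) equals -1 (so the in-probability limit is -1), while the exact mean 1/n is pulled toward zero by the rare large value. The seed and the number of draws in this Python sketch are arbitrary choices:

import numpy as np

rng = np.random.default_rng(4)

for n in (10, 100, 1000):
    # Exact mean from (11-175) and twenty sample draws of X(n) per (11-173).
    exact_mean = n * (1.0 / n) + (-1.0) * (1.0 - 1.0 / n)
    draws = np.where(rng.random(20) < 1.0 / n, float(n), -1.0)
    print(f"n={n:5d}  E[X(n)]={exact_mean:+.4f}  twenty draws: {draws.astype(int).tolist()}")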

Convergence Almost Surely (a.s. Convergence)

The last form of convergence we will study is called almost surely (a.s.) convergence.

The random sequence X(n) converges almost surely to the random variable X0 if the sequence of

functions $X(n;\zeta)$ converges to $X_0(\zeta)$ for all $\zeta \in S$ except possibly on a set of probability zero (recall that S denotes the sample space). Almost surely convergence requires that

P\Big[ \lim_{n\to\infty} X(n) = X_0 \Big] \;=\; P\Big[ \big\{ \zeta \in S : \lim_{n\to\infty} X(n;\zeta) = X_0(\zeta) \big\} \Big] \;=\; 1 .     (11-177)

In other words, X(n) converges almost surely to random variable X0 if there exists an event A, with P(A) = 1 (and $P(\bar{A}) = 0$), for which $X(n;\zeta) \to X_0(\zeta)$ for all $\zeta \in A$. Often we write

X(n) \xrightarrow{\,a.s.\,} X_0 .     (11-178)

Obviously, this type of convergence is “weaker” than pointwise (p.w.) convergence (p.w.


convergence requires that $X(n;\zeta) \to X_0(\zeta)$ for all $\zeta \in S$). However, as shown below, almost

surely (a.s.) convergence implies convergence in probability (i.p.). And, it doesn’t imply, nor is

it implied by, convergence in mean square (m.s.). In the literature, a.s. convergence goes by the

names convergence almost everywhere and convergence with probability one (other names are

used as well).

Like convergence in mean square and probability, in the context of almost sure

convergence, it is possible to examine the separation, or distance, between sequence elements as

we “go farther out” in a sequence. We say that X(n) is an almost surely Cauchy sequence if

\lim_{n,m\to\infty} | X(n) - X(m) | \;=\; 0 \;\;\text{a.s.} \;\;\Longleftrightarrow\;\; P\Big[ \big\{ \zeta \in S : \lim_{n,m\to\infty} | X(n;\zeta) - X(m;\zeta) | = 0 \big\} \Big] \;=\; 1 .     (11-179)

In other words, there exists an event A, P(A) = 1, for which

\lim_{n,m\to\infty} | X(n;\zeta) - X(m;\zeta) | \;=\; 0     (11-180)

for all $\zeta \in A$. To establish that X(n) is an almost surely Cauchy sequence, we do not require

knowledge of a sequence limit.

With regard to necessary and sufficient conditions for the Cauchy criterion, almost surely convergence parallels m.s. and i.p. convergence. To show almost surely convergence of a sequence, it is not necessary to come up with a limit (in the almost surely convergent sense) for the sequence. Instead, as shown by the following theorem, we can use the Cauchy criterion.

Theorem 11-14: A sequence X(n) is almost surely convergent if, and only if, it is an almost

surely Cauchy sequence.

Proof: This theorem follows from the fact that, in the real number system, sequences of real numbers converge if, and only if, they are Cauchy sequences; apply this fact to the sample sequence X(n;ζ) for each ζ in the probability-one event described by (11-179), and define X0(ζ) as the resulting pointwise limit.

A practical and useful test for almost surely convergence is given by the following


theorem.

Theorem 11-15: Let X(n) denote a sequence of random variables. Suppose that X(n) converges

to random variable X0 almost surely; that is, we suppose that

X(n) \xrightarrow{\,a.s.\,} X_0 .     (11-181)

Then, for every $\varepsilon > 0$ we have

\lim_{m\to\infty} P\big[\, | X(n) - X_0 | \le \varepsilon \;\text{ for all } n \ge m \,\big] \;=\; \lim_{m\to\infty} P\Big[ \bigcap_{n=m}^{\infty} \big\{ | X(n) - X_0 | \le \varepsilon \big\} \Big] \;=\; 1 ,     (11-182)

which we write as

\lim_{m\to\infty} P[A_m] \;=\; 1 ,     (11-183)

where Am is defined as

A_m \;\equiv\; \big\{ \zeta \in S : | X(n;\zeta) - X_0(\zeta) | \le \varepsilon \;\text{ for all } n \ge m \big\} \;=\; \bigcap_{n=m}^{\infty} \big\{ \zeta \in S : | X(n;\zeta) - X_0(\zeta) | \le \varepsilon \big\} ,     (11-184)

an event that depends on m and $\varepsilon$. The converse is true as well; hence, (11-182) and (11-181) are

equivalent (i.e., one implies the other).

Note: The sequence $A_m$, m ≥ 0, is nested increasing with m; that is, $A_m \subset A_{m+1}$ for all m and all $\varepsilon > 0$. Also, the complement of (11-184) is (DeMorgan's laws come in handy here)


\bar{A}_m \;\equiv\; \big\{ \zeta \in S : | X(n;\zeta) - X_0(\zeta) | > \varepsilon \;\text{ for some } n \ge m \big\} \;=\; \bigcup_{n=m}^{\infty} \big\{ \zeta \in S : | X(n;\zeta) - X_0(\zeta) | > \varepsilon \big\} .     (11-185)

So, Theorem 11-15 is sometimes stated as: $X(n) \xrightarrow{\,a.s.\,} X_0$ if, and only if, for all $\varepsilon > 0$ we have

\lim_{m\to\infty} P\big[\, | X(n) - X_0 | > \varepsilon \;\text{ for some } n \ge m \,\big] \;=\; \lim_{m\to\infty} P[\bar{A}_m] \;=\; 0 .     (11-186)

Proof: First, suppose that $X(n) \xrightarrow{\,a.s.\,} X_0$. Then, there exists an event $\Lambda_1$ for which

P[\Lambda_1] = 1 \;\;(\text{equivalently, } P[\bar{\Lambda}_1] = 0) \qquad\text{and}\qquad \lim_{n\to\infty} X(n;\zeta) = X_0(\zeta) \;\text{ for each } \zeta \in \Lambda_1 .     (11-187)

Now, we show that $\Lambda_1 \subset \bigcup_{k=1}^{\infty} A_k$. Take any $\zeta_0 \in \Lambda_1$. As shown by (11-187), $X(n;\zeta_0)$ converges in an "ordinary" sense to $X_0(\zeta_0)$; this means that, given any $\varepsilon > 0$, there exists an integer $m(\zeta_0,\varepsilon)$ (integer m depends on $\zeta_0$ and $\varepsilon$) with the property

| X(n;\zeta_0) - X_0(\zeta_0) | \;\le\; \varepsilon     (11-188)

for $n \ge m(\zeta_0,\varepsilon)$. Hence, we see that $\zeta_0 \in A_k$ for all $k \ge m(\zeta_0,\varepsilon)$; that is, we can write

\zeta_0 \in \Lambda_1 \;\Longrightarrow\; \zeta_0 \in A_k = \bigcap_{n=k}^{\infty} \big\{ \zeta \in S : | X(n;\zeta) - X_0(\zeta) | \le \varepsilon \big\} , \qquad k \ge m(\zeta_0,\varepsilon) .     (11-189)

Since the Ak are nested increasing, we have


\Lambda_1 \;\subset\; \bigcup_{k=1}^{\infty} A_k .     (11-190)

Since $P[\Lambda_1] = 1$, Equation (11-190) yields

P\Big[ \bigcup_{k=1}^{\infty} A_k \Big] \;=\; 1 .     (11-191)

This leads to the conclusion

1 \;=\; P\Big[ \bigcup_{k=1}^{\infty} A_k \Big] \;=\; P\Big[ \lim_{n\to\infty} \bigcup_{k=1}^{n} A_k \Big] \;=\; \lim_{n\to\infty} P\Big[ \bigcup_{k=1}^{n} A_k \Big] \;=\; \lim_{n\to\infty} P[A_n] ,     (11-192)

and we have proven that (11-181), which states $X(n) \xrightarrow{\,a.s.\,} X_0$, implies (11-182), which states $\lim_{n\to\infty} P[A_n] = 1$. Now, we show the converse; we show that (11-182) implies (11-181). We

do this by showing that a false (11-181) implies a false (11-182) (this is the contrapositive of the

statement “(11-182) implies (11-181)”). Hence, assume that (11-181) is false and show that

$\lim_{m\to\infty} P[A_m] \ne 1$ (i.e., (11-182) is false). If (11-181) is false, there exists an event $\Lambda$, $P[\Lambda] > 0$, such that $X(n;\zeta)$ does not converge to $X_0(\zeta)$ for $\zeta \in \Lambda$ (i.e., convergence does not occur for $\zeta \in \Lambda$). Now,

consider the random variable

Z(\zeta) \;\equiv\; \limsup_{n\to\infty} \,| X(n;\zeta) - X_0(\zeta) | , \qquad \zeta \in S .     (11-193)

The event $\{ \zeta \in S : Z(\zeta) > 0 \}$ can be expressed as

\{ \zeta \in S : Z(\zeta) > 0 \} \;=\; \bigcup_{n=1}^{\infty} \{ \zeta \in S : Z(\zeta) > 1/n \} .     (11-194)


For each $\zeta_0 \in \Lambda$, we have $Z(\zeta_0) > 0$, so $\zeta_0 \in \{ \zeta \in S : Z(\zeta) > 0 \}$; this fact implies that

\Lambda \;\subset\; \{ \zeta \in S : Z(\zeta) > 0 \} .     (11-195)

Now, $P[\Lambda] > 0$ implies $P[\{ \zeta \in S : Z(\zeta) > 0 \}] > 0$ and the existence of some integer $n_0$ for which the event $\{ \zeta \in S : Z(\zeta) > 1/n_0 \}$ has a strictly positive probability (to see this, equate the probability of both sides of (11-194) and use the continuity of P). That is, we have

P\big[ \{ \zeta \in S : Z(\zeta) > 1/n_0 \} \big] \;>\; 0 .     (11-196)

But, this positive-probability event is contained in the complement of $A_m$, m ≥ 1, defined using $\varepsilon = 1/n_0$. This observation is written as

\{ \zeta \in S : Z(\zeta) > 1/n_0 \} \;\subset\; \bar{A}_m \;=\; \bigcup_{n=m}^{\infty} \big\{ \zeta \in S : | X(n;\zeta) - X_0(\zeta) | > 1/n_0 \big\} ,     (11-197)

for every integer m (apply DeMorgan’s Law to (11-184) to get this complement). Hence, for

every integer m, we have

P[\bar{A}_m] \;\ge\; P\big[ \{ \zeta \in S : Z(\zeta) > 1/n_0 \} \big] \;>\; 0 ,     (11-198)

so that $P[\bar{A}_m]$ is bounded away from zero, and $P[A_m]$ is bounded away from unity, as $m \to \infty$. Hence, Equation (11-183) (equivalently, Equation (11-182)) cannot be true; we have shown that

a false (11-181) implies a false (11-182) (equivalently, we have shown that (11-182) implies

(11-181)).


Theorem 11-16: Almost surely (a.s.) convergence implies convergence in probability (i.p.).

Proof: This is easy to show. Suppose that $X(n) \to X_0$ almost surely (a.s.), so that $P[\bar{A}_m] \to 0$ as $m \to \infty$ for any fixed (but arbitrary) $\varepsilon > 0$ used in the definition of $A_m$. Note that

\big\{ \zeta \in S : | X(m;\zeta) - X_0(\zeta) | > \varepsilon \big\} \;\subset\; \bar{A}_m \;=\; \bigcup_{n=m}^{\infty} \big\{ \zeta \in S : | X(n;\zeta) - X_0(\zeta) | > \varepsilon \big\} .     (11-199)

Hence, $P[\bar{A}_m] \to 0$ as $m \to \infty$ implies that $P[\, |X(m) - X_0| > \varepsilon \,] \to 0$ as $m \to \infty$, and we have $X(m) \to X_0$ in probability (i.p.).

Theorem 11-16 shows that a.s. convergence implies convergence in probability; however, the

converse is not true, as shown by the next example.

Example 11-16: This example shows that convergence in mean square (m.s.) does not imply

convergence almost surely (a.s.). Recall that Example 11-12 discussed a binary random

sequence X(k) of independent random variables with

P[X(k) = 1] \;=\; 1/k
P[X(k) = 0] \;=\; 1 - 1/k .     (11-200)

In Example 11-12, we saw that X(k) converges in mean square (m.s.) to X0 = 0 (hence, it also converges in probability (i.p.) to X0 = 0). Now, we show that this sequence does not converge

almost surely (a.s.). In terms of Am given by (11-184), observe that


\lim_{n\to\infty} P[A_n] \;=\; \lim_{n\to\infty} P\Big[ \bigcap_{m=n}^{\infty} \{ | X(m) - X_0 | \le \varepsilon \} \Big] \;=\; \lim_{n\to\infty} P\Big[ \bigcap_{m=n}^{\infty} \{ X(m) = 0 \} \Big]
\;=\; \lim_{n\to\infty} \Big( 1 - \frac{1}{n} \Big)\Big( 1 - \frac{1}{n+1} \Big)\Big( 1 - \frac{1}{n+2} \Big) \cdots
\;=\; \lim_{n\to\infty} \prod_{m=0}^{\infty} \Big( 1 - \frac{1}{n+m} \Big)
\;\le\; \lim_{n\to\infty} \exp\Big[ - \sum_{m=0}^{\infty} \frac{1}{n+m} \Big]
\;=\; 0     (11-201)

(the inequality uses $1 - x \le e^{-x}$, and the final limit is zero because the harmonic series diverges).

Since this limit is not unity, X(m) cannot converge almost surely to X0 = 0 (study again Equation

(11-182)). What we have provided here is a counter example that shows that mean square (m.s.)

convergence does not imply almost surely (a.s.) convergence. Also, the example shows that

convergence in probability (i.p.) does not imply convergence almost surely (a.s.). Also, see

Stark and Woods (3rd Edition), Example 6.7-3, p. 381 for a similar example.
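A finite-horizon probe of (11-201) can be computed exactly (this is only an illustration; the starting indices and horizons below are arbitrary, and a finite horizon can only approximate the infinite intersection that defines A_n). The probability that X(m) = 0 for every m between n and the horizon telescopes to (n-1)/horizon, which collapses toward zero as the horizon grows, consistent with lim P[A_n] = 0:

import math

def prob_no_ones(n, horizon):
    # Exact P[X(m) = 0 for every m with n <= m <= horizon] for the sequence of Example 11-12/11-16
    # (independent samples with P[X(m) = 1] = 1/m); the product telescopes to (n - 1)/horizon.
    return math.prod(1.0 - 1.0 / m for m in range(n, horizon + 1))

for n in (10, 100, 1000):
    for horizon in (10 * n, 100 * n, 1000 * n):
        p = prob_no_ones(n, horizon)
        print(f"n={n:5d}  horizon={horizon:7d}  P[X(m)=0 for n<=m<=horizon]={p:.5f}")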

Example 11-17: This example shows that convergence almost surely (a.s.) does not imply

convergence in mean square (m.s.). Recall that Example 11-13 presented a binary random

sequence X(k) of independent random variables with

P[X(k) = k] \;=\; 1/k^2
P[X(k) = 0] \;=\; 1 - 1/k^2 .     (11-202)

As shown by Example 11-13, this sequence is not mean square (m.s.) convergent. We show that

X(k) converges almost surely (a.s.) to X0 = 0. In terms of $\bar{A}_m$ defined by (11-199), observe that


\lim_{n\to\infty} P[\bar{A}_n] \;=\; \lim_{n\to\infty} P\Big[ \bigcup_{m=n}^{\infty} \{ | X(m) - X_0 | > \varepsilon \} \Big] \;=\; \lim_{n\to\infty} P\Big[ \bigcup_{m=n}^{\infty} \{ X(m) = m \} \Big] \;\le\; \lim_{n\to\infty} \sum_{m=n}^{\infty} \frac{1}{m^2} \;=\; 0 .     (11-203)

Equivalently, in terms of An given by (11-184), this last result implies that

\lim_{n\to\infty} P[A_n] \;=\; \lim_{n\to\infty} P\Big[ \bigcap_{m=n}^{\infty} \{ | X(m) - X_0 | \le \varepsilon \} \Big] \;=\; 1 .     (11-204)

From Theorem 11-15 (see Equation (11-182)), we can conclude that X(n) converges almost

surely (a.s.) to X0 = 0. Together with Example 11-13, this example shows that convergence

almost surely (a.s.) does not imply convergence in mean square (m.s.). Also, this example shows

that convergence in probability (i.p.) does not imply convergence in mean square (m.s.).
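The union bound behind (11-203) is also easy to evaluate numerically. The truncated tail sums below (the truncation length is an arbitrary choice) are roughly 1/(n-1) and bound the probability that any nonzero value occurs at or beyond index n:

def tail_sum(n, terms=500_000):
    # Truncated tail sum  sum_{m=n}^{n+terms-1} 1/m^2 ; by the union bound it dominates
    # P[X(m) != 0 for some m >= n], up to the negligible truncation error ~ 1/(n + terms).
    return sum(1.0 / m ** 2 for m in range(n, n + terms))

for n in (10, 100, 1000):
    print(f"n={n:5d}  union bound on P[complement of A_n] <= {tail_sum(n):.5f}   (roughly 1/(n-1))")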

The next example is somewhat counterintuitive. It demonstrates that pointwise convergence does not, in general, imply convergence in mean square. Even though $X(n;\zeta) \to X_0(\zeta)$ for all $\zeta \in S$ (i.e., the random variable converges pointwise), the integral in the computation of $E[\, | X(n;\zeta) - X_0 |^2 \,]$ may diverge, so that X(n) does not converge to X0 in mean square.

Example 11-18: Consider the probability space (S, B, P), where S = [0, 1], B is the collection of Borel sets (B is the σ-algebra generated by the open intervals on S; see Chapter 1 of these class notes), and

P[B] \;=\; \int_{B} d\zeta , \qquad B \in \mathbf{B}     (11-205)

(if B is an interval, then P[B] is the interval length; P can be thought of as a "generalized length" of event B). For $\zeta \in S$, define the random variable sequence


X(n;\zeta) \;=\; n \, I_{[1/n,\,2/n]}(\zeta) \;=\; \begin{cases} n, & \zeta \in [1/n,\, 2/n] \\ 0, & \text{otherwise} \end{cases}     (11-206)

(note that $I_B(\zeta)$ is called the indicator function of the set B). On S, X(n) converges to zero in a pointwise manner. We say that $X(n) \xrightarrow{\,p.w.\,} 0$. Sometimes, we say that X(n) converges everywhere or

surely. However, sequence X(n) does not converge to zero in the mean square sense since

\| X(n) - 0 \|^2 \;=\; E[X^2(n)] \;=\; n^2 \left( \frac{1}{n} \right) \;=\; n \;\longrightarrow\; \infty .     (11-207)
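A direct computation (no simulation needed; the particular n values are arbitrary) shows how the second moment in (11-207) grows even though every sample sequence dies out:

# X(n;zeta) equals n on an interval of probability (length) 1/n, so E[X(n)^2] = n^2 * (1/n) = n,
# which grows without bound even though X(n;zeta) -> 0 for every zeta in S.
for n in (2, 10, 100, 1000):
    interval_probability = 2.0 / n - 1.0 / n
    print(f"n={n:5d}  E[X(n)^2] = n^2 * {interval_probability:.4f} = {n ** 2 * interval_probability:.1f}")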

Venn Diagram Describing Convergence Modes

Figure 11-8 shows a Venn diagram that depicts the interrelationships between i.p., m.s.,

a.s., and p.w. convergence. The diagram follows directly from the definitions, theorems and

counter examples given in this chapter. Mean square convergence neither implies, nor is it

implied by, a.s. convergence; see Examples 11-16 and 11-17 for relevant counterexamples. The

fact that p.w. convergence does not imply m.s. convergence is established by Example 11-18.

Theorem 11-12 (alternatively, Theorem 11-16) establishes that m.s. (alternatively, a.s.)

convergence implies i.p. convergence.

Figure 11-8: Relationship between modes of convergence.