
Introduction to Stochastic Processes

Pavel Chigansky

E-mail address: [email protected]


Preamble

These are lecture notes for the course I have taught at the School of Electrical Engineering of Tel Aviv University during 2003/04. It was intended as a “pre-introduction” to measure-theoretic probability for graduate students with background in signal processing, information theory, etc. Consequently, most of the theory is told without proofs and much time is allocated to motivating examples. In particular, the classical filtering problem for partially observed processes is revisited on several occasions as the story unfolds. The lecture notes are available at http://pluto.huji.ac.il/~pchiga/ along with the exercises/solutions files and a number of sample exams.

Please do not hesitate to send any questions, comments, etc. regarding this course to the author's current e-mail [email protected].

P.Ch.

30, September, 2007


Contents

Preamble

Chapter 1. The Basics of Mathematical Probability
  1. Introduction
  2. Axioms of probability
  3. Probability spaces
  4. Random variables
  5. Random processes
  6. Limit theorems

Chapter 2. Orthogonal projection and linear estimation
  1. Orthogonal projection
  2. Linear estimation of stationary processes: Kolmogorov-Wiener approach
  3. Linear estimation: Kalman's state space approach

Chapter 3. Conditional expectation and nonlinear estimation
  1. Conditional expectation
  2. Nonlinear estimation

Chapter 4. Stochastic processes in continuous time
  1. Poisson process
  2. Wiener process
  3. Ito stochastic integral with respect to Wiener process
  4. Stochastic differential equations
  5. Applications

General References on Probability Theory
References on Nonlinear Filtering


CHAPTER 1

The Basics of Mathematical Probability

1. Introduction

What is probability? (A much more extended treatment of the subject can be found, e.g., in (8).) Toss a coin $n$ (e.g. 1000) times and let $p_n$ be the ratio of heads outcomes. Intuitively we feel that $p_n$ will be close to $1/2$ if the coin is fair, so it is tempting to say that $p_n$ is the probability of heads. However, $p_n$ will rarely be equal to $1/2$ exactly; moreover, it depends on $n$ and will be different when the experiment is repeated (for the same $n$). Also, what is probability then in just one trial? Clearly the above experiment cannot be used to define probability...

Introduce $\Omega = \{h, t\}$ ($h$ - heads, $t$ - tails), the set of possible outcomes of one coin tossing (i.e. $\Omega$ consists of 2 points, $\omega_1 = h$ and $\omega_2 = t$). Define probability $P(\cdot)$ to be an $\Omega \mapsto [0,1]$ function, such that $P(\Omega) = P(h) + P(t) = 1$. Let for brevity $p = P(h)$. Now return to $n$-time tossing of the coin. Let $\Omega = \{(x_1, \ldots, x_n) : x_i \in \{h, t\}\}$. This set consists of $2^n$ points $\omega_i$ (e.g. $\omega_1 = (hh\ldots h)$, etc.) to which we assign, say, equal probabilities, i.e. $P(\omega_i) = 2^{-n}$. Now let us check that this confirms the intuition of the first paragraph (even $n$ is taken for simplicity); here $\omega_i^\ell$ is the value of the $\ell$-th entry of $\omega_i$:

$$P\Big(\omega_i : \frac{1}{n}\sum_{\ell=1}^n I(\omega_i^\ell = h) = 1/2\Big) = 2^{-n}\binom{n}{n/2} = 2^{-n}\frac{n!}{((n/2)!)^2} \approx 0 \quad \text{for large } n,$$

i.e. exactly $1/2$ is rarely obtained. Fix some small $\varepsilon > 0$ and let us verify that

$$P\Big(\omega_i : \Big|\frac{1}{n}\sum_{\ell=1}^n I(\omega_i^\ell = h) - 1/2\Big| \le \varepsilon\Big)$$

is close to 1 for large $n$. This can be calculated directly using combinatorics, but a rough answer can be obtained in a simpler way by means of the Chebyshev inequality (to be elaborated later on in the course):

$$P\Big(\omega_i : \Big|\frac{1}{n}\sum_{\ell=1}^n I(\omega_i^\ell = h) - 1/2\Big| > \varepsilon\Big) \le \frac{1}{4n\varepsilon^2}.$$
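A quick simulation makes the two displays above concrete. The following sketch (plain Python with numpy; the function name and parameter values are mine, not from the notes) estimates the probability that the empirical frequency of heads deviates from 1/2 by more than $\varepsilon$, and compares it with the Chebyshev bound $1/(4n\varepsilon^2)$.

```python
import numpy as np

def deviation_probability(n, eps, trials=20000, rng=None):
    """Monte Carlo estimate of P(|#heads/n - 1/2| > eps) for a fair coin."""
    if rng is None:
        rng = np.random.default_rng(0)
    heads = rng.integers(0, 2, size=(trials, n))   # 1 = heads, 0 = tails
    freq = heads.mean(axis=1)                      # empirical frequency in each experiment
    return np.mean(np.abs(freq - 0.5) > eps)

for n in [100, 1000, 10000]:
    eps = 0.05
    p_emp = deviation_probability(n, eps)
    chebyshev = 1.0 / (4 * n * eps**2)
    print(f"n={n:6d}  empirical={p_emp:.4f}  Chebyshev bound={min(chebyshev, 1.0):.4f}")
```

The empirical probability decays much faster than the Chebyshev bound, which is only a rough (but universal) estimate.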

The above construction works well for any finite $n$. What about an infinite sequence of coin tosses, i.e. when $\Omega = \{\omega = (x_1, x_2, \ldots) : x_i \in \{h, t\}\}$? Now $\Omega$ consists of an infinite number of points. If there were countably many points, one could assign to each $\omega_i$ a probability $p_i$ such that $\sum_i p_i = 1$. But this is not the case: indeed, recall that any infinite sequence of 0's and 1's can be considered as a binary expansion of a number in $[0,1]$, and thus $\Omega$ contains as many points as the interval $[0,1]$, i.e. uncountably many. This implies that the probability of each point $\omega \in \Omega$ should be zero! At the same time, intuitively we feel that the probability of a point to be in $[0, 1/2]$ is $1/2$. Clearly some sets, other than points, should be used to define probability in this case.

2. Axioms of probability

The main object of probability theory is the probability space $(\Omega, \mathcal{F}, P)$, where

(1) $\Omega$ is a set of points $\omega$, which is called the sampling space;
(2) $\mathcal{F}$ is a $\sigma$-algebra of sets (events), satisfying the properties
   - $\Omega \in \mathcal{F}$
   - $A \in \mathcal{F} \implies \bar A \in \mathcal{F}$ (closedness under complement)
   - if $A_n$ is a sequence of events from $\mathcal{F}$, then $\bigcap_{n=1}^\infty A_n \in \mathcal{F}$ and $\bigcup_{n=1}^\infty A_n \in \mathcal{F}$;
(3) $P$ is the probability measure, i.e. a positive function $\mathcal{F} \mapsto [0,1]$, such that $P(\Omega) = 1$ and for any sequence of pairwise nonintersecting sets $A_n$ in $\mathcal{F}$ the $\sigma$-additivity property is satisfied, i.e.
$$P\Big(\sum_n A_n\Big) = \sum_n P(A_n).$$

Example 2.1. In a single coin tossing experiment
- $\Omega = \{h, t\}$
- $\mathcal{F} = \{\emptyset, \{h\}, \{t\}, \Omega\}$ (check that this is an algebra)
- $P(\emptyset) = 0$, $P(\Omega) = 1$ and $P(h) = P(t) = 1/2$

Example 2.2. In $n$ coin tosses
- $\Omega = \{\omega = (x_1, \ldots, x_n) : x_i \in \{h, t\}\}$
- $\mathcal{F}$ consists of $\emptyset$, the singletons $\{\omega_i\}$, $\omega_i \in \Omega$, and all possible unions of the $\omega_i$
- for any set $A \in \mathcal{F}$
$$P(A) = \sum_{\omega_i \in A} 2^{-n}.$$

The $\sigma$-algebra (which is just a finite algebra in this case) $\mathcal{F}$ is the richest (or finest), i.e. any physical outcome of the experiment belongs to $\mathcal{F}$. In fact many other (more coarse) $\sigma$-algebras can be defined on the same $\Omega$, e.g. $\mathcal{F}' = \{\emptyset, E, O, \Omega\}$, where $E = \{\omega_i : \sum_{\ell=1}^n \omega_i^\ell \text{ is even}\}$ and $O = \{\omega_i : \sum_{\ell=1}^n \omega_i^\ell \text{ is odd}\}$ (check that this is an algebra!). Another probability measure $P'$ is assigned on $\mathcal{F}'$ via $P'(O) = P'(E) = 1/2$.

Exercise 2.3. Show that the union of two σ-algebras is not necessarily a σ-algebra. Show that an intersection of two σ-algebras is always a σ-algebra.

Note that in the above examples $\Omega$ was always finite and thus the definition of $\mathcal{F}$ was immediate. What happens, e.g., in the infinite coin tossing experiment?

As before, $\Omega = [0,1]$ can be chosen. Let $\mathcal{I}$ be the collection of sets obtained by finite unions of nonintersecting intervals of the form $(a, b]$, i.e.
$$A = \sum_i (a_i, b_i], \qquad b_i > a_i.$$
Clearly $\mathcal{I}$ is an algebra (why?). But not a $\sigma$-algebra! E.g. $A_n = (0, 1 - 1/n]$ are in $\mathcal{I}$, but $(0,1) = \bigcup_n A_n$ is not. The minimal (coarsest) $\sigma$-algebra which contains $\mathcal{I}$ is called the Borel $\sigma$-algebra and is denoted by $\mathcal{B}$. Clearly such a $\sigma$-algebra exists and is obtained by intersection of all $\sigma$-algebras containing $\mathcal{I}$ (there is at least one - the finest $\sigma$-algebra on $\Omega$).

The Borel $\sigma$-algebra is rich enough to contain many events of interest, e.g.
$$\{a\} = \bigcap_n (a - 1/n, a] \quad \text{(points)}, \qquad (a,b) = \bigcup_n (a, b - 1/n] \quad \text{(open intervals)},$$
etc.

There are of course sets that do not belong to $\mathcal{B}$. We can assign the probability $P$ on $\mathcal{I}$ by $P(A) = \sum_i(b_i - a_i)$. Fortunately (and by no means obviously!) this defines a unique probability $\lambda$ on $\mathcal{B}$, such that for any set $A \in \mathcal{I}$
$$\lambda(A) = P(A),$$
i.e. the restriction of $\lambda$ on $\mathcal{I}$ coincides with $P$. In other words, $\lambda$ is an extension of $P$ on the $\sigma$-algebra $\mathcal{B}$ generated by $\mathcal{I}$. Moreover, e.g. for any interval (semi-closed, closed, open, etc.)
$$\lambda\big((a,b)\big) = b - a.$$
$\lambda$ is called the Lebesgue probability measure. Now the infinite coin tossing experiment can be studied on the probability space $(\Omega, \mathcal{B}, \lambda)$. E.g. the event (we associate $t$ with 0 and $h$ with 1)
$$A = \{\text{first 17 trials give } t\} = [0, 2^{-17}) \in \mathcal{B}$$
and $\lambda(A) = 2^{-17}$. More fancy events are measurable with respect to $\mathcal{B}$, e.g.
$$\Big\{\omega : \lim_{n\to\infty}\frac{1}{n}\sum_{\ell=1}^n I(\omega^\ell = h) = 1/2\Big\}$$
is $\mathcal{B}$-measurable, since it may be constructed by a countable number of unions and intersections of simpler $\mathcal{B}$-measurable sets (similar to Section 5.2).

3. Probability spaces

As we have seen, the probability space $([0,1], \mathcal{B}, \lambda)$ is sufficient for the coin tossing. More complicated experiments require dealing with other probability spaces.

3.1. $\mathbb{R}$. Consider $\Omega = \mathbb{R}$. The Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ can be introduced via semi-open intervals as in the previous section. It turns out that $\mathcal{B}(\mathbb{R})$ coincides with the $\sigma$-algebra generated by the open sets (with the usual distance metric).

A probability measure on $\mathcal{I}(\mathbb{R})$ can be defined by means of a distribution function $F(x)$, satisfying the properties

(1) $F(x)$ is a positive nondecreasing function;
(2) $F(\infty) = 1$, $F(-\infty) = 0$;
(3) $F(x)$ is continuous from the right and has a limit from the left at any $x \in \mathbb{R}$.

Then for any set $A = \sum_i(a_i, b_i]$ in $\mathcal{I}$ let
$$P(A) = \sum_i\big[F(b_i) - F(a_i)\big].$$

Similarly to the Lebesgue measure, this probability measure can be extended to $\mathcal{B}(\mathbb{R})$. All distribution functions can be one of three kinds (or their combination):


(1) lattice (or purely atomic), if all the increase points of $F(x)$ are isolated (discrete);
(2) absolutely continuous (with respect to the extended Lebesgue measure), if there is a nonnegative function $f(x)$ such that
$$F(x) = \int_{-\infty}^x f(s)\,ds,$$
where the integral can be understood in the usual Riemann sense;
(3) singular, if the increase points of $F(x)$ have Lebesgue measure zero (e.g. the Cantor distribution).

Example 3.1.
1. Normal distribution:
$$F(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-s^2/2}\,ds.$$
2. Cauchy distribution:
$$F(x) = \int_{-\infty}^x \frac{1/\pi}{s^2 + 1}\,ds.$$

3.2. $\mathbb{R}^n$. When $\Omega = \mathbb{R}^n$, $\mathcal{B}(\mathbb{R}^n)$ is generated by sets of the form
$$A = (-\infty, x_1] \times \ldots \times (-\infty, x_n].$$
Any probability measure can be constructed by means of an $n$-dimensional distribution function $F(x)$, $x \in \mathbb{R}^n$, satisfying the properties
(1) $\Delta_{a_1,b_1}\ldots\Delta_{a_n,b_n}F(x) \ge 0$, where $\Delta_{a_i,b_i}$ is the difference operator applied to the $i$-th coordinate;
(2) $F$ is continuous from the right (jointly in all the arguments);
(3) $F(\infty, \ldots, \infty) = 1$;
(4) $\lim_{x_i\to-\infty} F(x_1, \ldots, x_n) = 0$.

As in the case of R, F can be atomic, absolutely continuous and singular.

3.3. $\mathbb{R}^\infty$. Often we deal with infinite sequences with entries in $\mathbb{R}$ (rather than in $\{0,1\}$, as in the coin tossing case), i.e. with the sampling space $\mathbb{R}^\infty = \{(x_1, x_2, \ldots) : x_i \in \mathbb{R}\}$. The Borel $\sigma$-algebra is generated in this case by cylindrical sets of the form
$$\mathcal{I}_n = \{x \in \mathbb{R}^\infty : x_1 \in (a_1, b_1], \ldots, x_n \in (a_n, b_n]\}$$
for $n = 1, 2, \ldots$ The probability measure $P$ can be defined on any cylindrical set as in the case of $\mathbb{R}^n$. When does this uniquely define a probability measure on $\mathcal{B}(\mathbb{R}^\infty)$? The answer was given by A. Kolmogorov.

Theorem 3.2. Let $P_1, P_2, \ldots$ be a sequence of probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, $(\mathbb{R}^2, \mathcal{B}(\mathbb{R}^2))$, ..., which satisfy the consistency property, i.e.
$$P_{n+1}(B \times \mathbb{R}) = P_n(B), \qquad B \in \mathcal{B}(\mathbb{R}^n).$$
Then there is a unique probability measure $P$ on $(\mathbb{R}^\infty, \mathcal{B}(\mathbb{R}^\infty))$ such that for any $n = 1, 2, \ldots$
$$P(B) = P_n(B), \qquad B \in \mathcal{B}(\mathbb{R}^n). \qquad\Box$$

Roughly speaking, the latter means that the measure on $\mathcal{B}(\mathbb{R}^\infty)$ is determined by all the finite dimensional measures.


Example 3.3. Let $F(x)$ be a distribution function on $\mathbb{R}$. Define an $n$-dimensional distribution function $F_n(x) = \prod_{i=1}^n F(x_i)$ on $\mathbb{R}^n$. Clearly the family of measures $P_n$ corresponding to $F_n(x)$ is consistent, and so there is a probability measure $P$ on $(\mathbb{R}^\infty, \mathcal{B}(\mathbb{R}^\infty))$ such that all its $n$-marginal distributions coincide with $F_n(x)$.

4. Random variables

Let’s fix some probability space (Ω,F , P ).

Definition 4.1. A real function $\xi : \Omega \mapsto \mathbb{R}$ is called a random variable if
$$\{\omega : \xi(\omega) \in B\} \in \mathcal{F}, \qquad \forall B \in \mathcal{B}(\mathbb{R}).$$

Random variables are often used to describe outcomes of experiments without specifying the underlying probability space. The probabilistic description of $\xi$ is then given by its distribution function
$$F_\xi(x) = P(\omega : \xi(\omega) \le x),$$
which corresponds to the probability measure induced by $\xi$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

Example 4.2. On the space $([0,1], \mathcal{B}, \lambda)$ consider the random variable
$$\xi(\omega) = \mathrm{sign}(\omega - 1/2).$$
Verify that
$$F_\xi(x) = \begin{cases} 0, & x < -1 \\ 1/2, & -1 \le x < 1 \\ 1, & 1 \le x. \end{cases}$$

Example 4.3. On $([a,b]^\infty, \mathcal{B}([a,b]^\infty), P)$, $a < b$, let $\xi(\omega) = \lim_{n\to\infty}\omega_n$. $\xi(\omega)$ is a random variable (why?). Calculation of $F_\xi(x)$ can be very subtle in this case.

4.1. Expectation and its properties. A random variable is simple if it has the form
$$X(\omega) = \sum_{i=1}^n x_i I(\omega \in A_i),$$
where the $A_i$ are disjoint sets in $\mathcal{F}$. The expectation of a simple r.v. is by definition
$$EX = \sum_{i=1}^n x_i P(A_i).$$

An arbitrary random variable (not necessarily simple) is said to be Lebesgue integrable (or to have expectation $EX$) if there is a sequence of simple random variables $X_n$ converging to $X$ uniformly, in which case $EX := \lim_{n\to\infty}EX_n$ is defined. Let us verify the correctness of this definition, i.e. (1) that for any uniformly convergent sequence $\lim_{n\to\infty}EX_n$ exists and (2) that it does not depend on the choice of the sequence. The first claim holds since $EX_n$ is a Cauchy sequence of numbers:
$$\big|EX_n - EX_m\big| \le E|X_n - X_m| \le \sup_\omega|X_n(\omega) - X_m(\omega)| \xrightarrow{n,m\to\infty} 0,$$


while the second claim is true by the following argument. Let $X'_n(\omega)$ be another approximating sequence of simple random variables in the sense $\lim_{n\to\infty}X'_n(\omega) = X(\omega)$, and assume that $\lim_{n\to\infty}EX'_n \ne \lim_{n\to\infty}EX_n$. Let
$$X''_n(\omega) = \begin{cases} X'_n(\omega), & n \text{ is even} \\ X_n(\omega), & n \text{ is odd.} \end{cases}$$
Clearly $X''_n(\omega)$ is a sequence of simple r.v. and $\lim_{n\to\infty}X''_n = X$ as well. However $EX''_n$ does not converge, which is a contradiction (to existence!), and hence the Lebesgue integral does not depend on the approximating sequence.

An approximating sequence can be constructed explicitly: for a nonnegative r.v. $X$ let
$$X_n(\omega) = \sum_{k=1}^{n2^n}\frac{k-1}{2^n}\,I\Big(\frac{k-1}{2^n} \le X(\omega) < \frac{k}{2^n}\Big) + n\,I\big(X(\omega) \ge n\big).$$
Note that $X_n(\omega) \nearrow X(\omega)$. In this case the sequence $EX_n$ does not decrease and hence has a limit (possibly infinite). The limit $EX = \lim_{n\to\infty}EX_n$ is the Lebesgue integral (expectation) of $X$. For a general r.v. $X$ let $X^+ = \max(X, 0)$ and $X^- = -\min(X, 0)$. If $EX^+ < \infty$ and $EX^- < \infty$, then the expectation is defined as
$$EX = EX^+ - EX^-.$$
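As a sanity check on the dyadic approximation above, here is a minimal sketch (Python with numpy; the choice of X as an exponential r.v. with mean 1 and all names are my own assumptions) that evaluates $EX_n$ on a large sample for increasing $n$ and watches it increase towards $EX = 1$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=200_000)   # samples of a nonnegative r.v. X with EX = 1

def simple_approximation(x, n):
    """Dyadic simple-function approximation X_n: floor X to the grid k/2^n, capped at n."""
    xn = np.floor(x * 2**n) / 2**n              # equals (k-1)/2^n on [(k-1)/2^n, k/2^n)
    return np.where(x >= n, n, xn)

for n in [1, 2, 4, 8]:
    print(f"n={n}:  E X_n ≈ {simple_approximation(x, n).mean():.4f}")
print(f"sample mean of X ≈ {x.mean():.4f}")
```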

The expectation satisfies the following properties:
(1) Let $c$ be a constant and assume that $EX$ exists; then $EcX = cEX$.
(2) $X \le Y$ $P$-a.s.³ $\implies EX \le EY$.
(3) If $E|X| < \infty$ and $E|Y| < \infty$, then $E(X + Y) = EX + EY$.
(4) If $X = Y$ $P$-a.s., then $EX = EY$.
(5) If $X \ge 0$ and $EX = 0$, then $X = 0$ $P$-a.s.
(6) (Chebyshev inequality) If $X \ge 0$ $P$-a.s., then
$$P(X \ge a) \le \frac{EX}{a}.$$
(7) (Cauchy-Schwarz inequality) Assume that $EX^2 < \infty$ and $EY^2 < \infty$; then
$$E|XY| \le \sqrt{EX^2\,EY^2}.$$
(8) (Jensen inequality) Assume that $E|X| < \infty$. Then for any convex function $g(x)$
$$g(EX) \le Eg(X).$$
(9) (Lyapunov inequality) Assume that $E|X|^p < \infty$. Then for any $0 < q \le p$
$$\big(E|X|^q\big)^{1/q} \le \big(E|X|^p\big)^{1/p}.$$
Proof. Set $r = p/q$, so that the function $x^r$ is convex. Then by the Jensen inequality
$$(E|X|^q)^r \le E|X|^{qr} = E|X|^p. \qquad\Box$$

³$X \le Y$ $P$-a.s. ($P$-almost surely) means that $P(\omega : X(\omega) \le Y(\omega)) = 1$, which is clearly weaker than $X(\omega) \le Y(\omega)$ for every $\omega \in \Omega$. It is customary to neglect sets of zero $P$-measure in probability theory.


The expectation of $X$ can be calculated with respect to the induced measure (read: distribution function) rather than the original measure $P$:
$$EX = \int_\Omega X(\omega)\,dP(\omega) = \int_{\mathbb{R}} x\,dF(x).$$
Similarly, if $Y = f(X)$ for some measurable function $f$, then
$$EY = \int_\Omega f(X(\omega))\,dP(\omega) = \int_{\mathbb{R}} f(x)\,dF(x) = \int_{\mathbb{R}} y\,dG(y),$$
where $G(y)$ is the distribution of $Y$.

4.2. Characteristic function. The Fourier transform of the distribution of $X$ is called the characteristic function of $X$:
$$\varphi(\lambda) = Ee^{iX\lambda} = \int_{\mathbb{R}} e^{ix\lambda}\,dF(x).$$
Due to the properties of the Fourier transform, $\varphi(\lambda)$ uniquely defines $F(x)$.

4.3. Independence. The sets (events) $A$ and $B$ are independent if
$$P(A \cap B) = P(A)P(B),$$
so that the conditional probability of $A$ given $B$ ($0/0 = 0$ by convention),
$$P(A|B) = \frac{P(A \cap B)}{P(B)},$$
equals $P(A)$.

Similarly, two random variables are said to be independent if their joint distribution function can be factored into
$$F_{XY}(x, y) = F_X(x)F_Y(y).$$
This is equivalent to
(1) $Ef(X)g(Y) = Ef(X)Eg(Y)$ for all bounded functions $f$ and $g$;
(2) $E\exp\{i\lambda X + i\mu Y\} = E\exp\{i\lambda X\}E\exp\{i\mu Y\}$ for any real $\lambda$ and $\mu$.

5. Random processes

The collection of real valued random variables $X = (X_n)_{n\ge1}$, $n \in \mathbb{N}$, defined on some probability space $(\Omega, \mathcal{F}, P)$, is called a random process (sequence) with discrete time.

For any fixed $\omega' \in \Omega$, the sequence $X_n(\omega')$ is called a trajectory or realization of the random process $X$. For any fixed $n'$, $X_{n'}(\omega)$ is a random variable.

Any random process induces a probability measure $P^X$ on the measurable space $(\mathbb{R}^\infty, \mathcal{B}(\mathbb{R}^\infty))$, which is called the distribution of the process. The projection of this distribution on any fixed finite set of indices $\{n_1, \ldots, n_d\}$ is called the $d$-dimensional distribution of $X$, i.e.
$$F_{n_1\ldots n_d}(x_1, \ldots, x_d) = P\big(X_{n_1} \le x_1; \ldots; X_{n_d} \le x_d\big).$$

By Kolmogorov’s Theorem 3.2 for any family of consistent finite dimensionaldistributions there is a probability space on which X can be defined.

5.1. Examples of random processes.
5.1.1. i.i.d. process. This is a sequence of independent identically distributed r.v., e.g. $\varepsilon = (\varepsilon_n)_{n\ge1}$ with $\varepsilon_1$ a standard Gaussian r.v.


5.1.2. Autoregressive process. Let $\varepsilon_n$ be an i.i.d. sequence. The AR process is generated recursively by
$$X_n = \sum_{k=1}^q \alpha_k X_{n-k} + \varepsilon_n$$
with a fixed array of numbers $\alpha_1, \ldots, \alpha_q$.

5.1.3. Moving average process. Let $\varepsilon_n$ be an i.i.d. sequence. The MA process is obtained by
$$X_n = \sum_{k=0}^p \beta_k\varepsilon_{n-k},$$
where $\beta_0, \ldots, \beta_p$ are constants.

5.1.4. Markov process. Let $p(x, A)$ be an $\mathbb{R} \times \mathcal{B}(\mathbb{R}) \mapsto [0,1]$ function, such that for any fixed $x$ it is a probability measure on $\mathcal{B}(\mathbb{R})$ and for any fixed $A \in \mathcal{B}(\mathbb{R})$ it is a measurable function on $\mathbb{R}$. Let $\mu$ be a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

Define a probability measure on $\mathcal{B}(\mathbb{R}^n)$ by
$$P_{m_1\ldots m_n}(A_{m_1} \times \ldots \times A_{m_n}) = \int_{\mathbb{R}}\cdots\int_{A_{m_1}}\cdots\int_{A_{m_n}}\cdots\ \mu(dx_1)\prod_{\ell=2}^{m_n} p(x_{\ell-1}, dx_\ell),$$
where the integration with respect to $x_\ell$, $\ell \notin \{m_1, \ldots, m_n\}$, is over the whole $\mathbb{R}$. Verify that this family of probability measures is consistent. The corresponding random process is called Markov.
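To make the AR and MA constructions above tangible, here is a minimal simulation sketch (Python with numpy; the coefficient values and all names are my own choices for illustration) that generates an AR(1) path and an MA(2) path driven by Gaussian white noise and compares their sample variances with the values they should have in the stationary regime.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5000
eps = rng.standard_normal(N)            # i.i.d. standard Gaussian driving noise

# AR(1): X_n = a * X_{n-1} + eps_n, started from X_0 = 0
a = 0.8
ar = np.zeros(N)
for n in range(1, N):
    ar[n] = a * ar[n - 1] + eps[n]

# MA(2): X_n = b0*eps_n + b1*eps_{n-1} + b2*eps_{n-2}
b = np.array([1.0, 0.5, 0.25])
ma = np.convolve(eps, b)[:N]            # truncate the convolution to the sample length

print("AR(1) sample variance:", ar[500:].var(), "  theory 1/(1-a^2) =", 1 / (1 - a**2))
print("MA(2) sample variance:", ma[500:].var(), "  theory sum b_k^2  =", (b**2).sum())
```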

5.2. Convergence of random processes. Recall that an infinite sequence of numbers $z = (z_n)_{n\ge1}$ converges to a limit $z$ ($\lim_{n\to\infty}z_n = z$) if for each $\varepsilon > 0$ there is an index $n'_\varepsilon$ such that $|z_n - z| \le \varepsilon$ for all $n \ge n'_\varepsilon$.

Unlike a numeric sequence, the sequence of functions which any random process is may converge to a limit (function!) in many different senses.

Definition 5.1. The r.p. $X_n$ converges to a r.v. $X$ pointwise if
$$\lim_{n\to\infty}X_n(\omega) = X(\omega)$$
for all $\omega \in \Omega$.

Definition 5.2. The r.p. $X_n$ converges to a r.v. $X$ in $L^p$, $p \ge 1$, if $E|X_n|^p < \infty$, $n \ge 1$, and
$$\lim_{n\to\infty}E\big|X - X_n\big|^p = 0.$$
The convergence in $L^1$ and in $L^2$ is often called convergence in the mean and in the mean square respectively.

Definition 5.3. The r.p. $X_n$ converges to a r.v. $X$ in probability if for any $\varepsilon > 0$
$$\lim_{n\to\infty}P(|X_n - X| \ge \varepsilon) = 0.$$

Definition 5.4. The r.p. $X_n$ converges to a r.v. $X$ $P$-a.s. (or with probability one) if
$$P\big(\omega : \lim_{n\to\infty}X_n(\omega) = X(\omega)\big) = 1.$$

Definition 5.5. The r.p. $X_n$ converges weakly (or in law) to a r.v. $X$ if
$$\lim_{n\to\infty}Ef(X_n) = Ef(X)$$
for any bounded and continuous function $f$.

The latter is equivalent to convergence of the sequence of corresponding distribution functions $F_n(x)$ to the distribution function $F(x)$ of $X$ at all $x$ at which $F(x)$ is continuous. Note that weak convergence makes sense even if the random variables are not defined on the same probability space!

Theorem 5.6.
$$\xrightarrow{\;L^2\;} \implies \xrightarrow{\;L^1\;} \implies \xrightarrow{\;P\;} \implies \xrightarrow{\;\text{law}\;}, \qquad\qquad \xrightarrow{\;P\text{-a.s.}\;} \implies \xrightarrow{\;P\;}.$$

Proof.
1. Convergence in $L^2$ implies convergence in $L^1$. By the Cauchy-Schwarz inequality
$$E|X_n - X| \le \sqrt{E|X_n - X|^2} \xrightarrow{n\to\infty} 0. \qquad (5.1)$$
More generally, convergence in $L^p$ implies convergence in $L^q$ for $q \le p$, which, similarly to (5.1), follows from the Lyapunov inequality:
$$E|X_n - X|^q \le \big(E|X_n - X|^p\big)^{q/p} \xrightarrow{n\to\infty} 0.$$

2. Convergence in $L^1$ implies convergence in probability. By the Chebyshev inequality, for any fixed $\varepsilon > 0$
$$P\big(|X_n - X| \ge \varepsilon\big) \le \frac{E|X_n - X|}{\varepsilon} \xrightarrow{n\to\infty} 0.$$

3. $P$-a.s. convergence implies convergence in probability. For any $\varepsilon > 0$
$$P\big(|X_n - X| \ge \varepsilon\big) \le P\Big(\sup_{m\ge n}|X_m - X| \ge \varepsilon\Big) = P\big(\cup_{m\ge n}\{|X_m - X| \ge \varepsilon\}\big).$$
By continuity of the probability measure $P$,
$$\lim_{n\to\infty}P\big(|X_n - X| \ge \varepsilon\big) \le P\Big(\lim_{n\to\infty}\cup_{m\ge n}\{|X_m - X| \ge \varepsilon\}\Big) = P\big(\cap_{n\ge0}\cup_{m\ge n}\{|X_m - X| \ge \varepsilon\}\big) \le P(X_n \not\to X) = 0.$$

4. Convergence in probability implies weak convergence. Fix an arbitrary continuous function $f(x)$. By the definition of continuity, for any given $\varepsilon > 0$ there is a $\delta > 0$ such that
$$|x - y| \le \delta \implies |f(x) - f(y)| \le \varepsilon.$$
Then with $\varepsilon > 0$ fixed,
$$|Ef(X_n) - Ef(X)| \le E|f(X_n) - f(X)| = E|f(X_n) - f(X)|I(|X_n - X| > \delta) + E|f(X_n) - f(X)|I(|X_n - X| \le \delta) \le 2\|f\|_\infty P(|X_n - X| > \delta) + \varepsilon \xrightarrow{n\to\infty} \varepsilon.$$
Since $\varepsilon$ is arbitrary, $\lim_{n\to\infty}|Ef(X_n) - Ef(X)| = 0$. $\Box$

Theorem 5.7. The limits (in different senses) of Xn coincide P -a.s.


Proof. For example, let us show that if $X_n \xrightarrow{P} X$ and $X_n \xrightarrow{P\text{-a.s.}} X'$, then $P(X \ne X') = 0$, i.e. $P(|X - X'| \ge \varepsilon) = 0$ for any $\varepsilon > 0$. Indeed,
$$P(|X - X'| \ge \varepsilon) \le P(|X - X_n| \ge \varepsilon/2) + P(|X_n - X'| \ge \varepsilon/2) \xrightarrow{n\to\infty} 0,$$
since $P$-a.s. convergence implies convergence in probability. $\Box$

The unmentioned implications are false in general.

Example 5.8. ($L^1 \not\Rightarrow L^2$) On $([0,1], \mathcal{B}, \lambda)$ define the sequence
$$X_n(\omega) = \sqrt{n}\,I(\omega \in [0, 1/n]).$$
$X_n$ converges to $X \equiv 0$ in $L^1$:
$$EX_n = \sqrt{n}/n = n^{-1/2} \xrightarrow{n\to\infty} 0,$$
but not in $L^2$:
$$EX_n^2 \equiv 1 \not\to 0.$$

Example 5.9. (Weak convergence does not imply any of the stronger kinds of convergence.) E.g. an i.i.d. sequence.

Example 5.10. ($P$-a.s. convergence is not implied by convergence of any other aforementioned type.) On $([0,1], \mathcal{B}, \lambda)$ define for $m \ge 1$
$$\xi^k_m(\omega) = I\Big(\omega \in \Big(\frac{k-1}{m}, \frac{k}{m}\Big]\Big), \qquad k = 1, \ldots, m,$$
and consider the sequence
$$X = \big(\xi^1_1,\ \xi^1_2, \xi^2_2,\ \xi^1_3, \xi^2_3, \xi^3_3,\ \ldots\big).$$
Clearly $X$ converges to 0 in probability (why?), but not $P$-a.s., since for any fixed $\omega' \in (0,1]$ and any index $n$ there are indices $n_1, n_2, \ldots \ge n$ such that $1 = X_{n_1}(\omega') = X_{n_2}(\omega') = \ldots$, i.e.
$$P\Big(\omega : 1 = \overline{\lim_{n\to\infty}}\,X_n(\omega) > \underline{\lim_{n\to\infty}}\,X_n(\omega) = 0\Big) = 1.$$

Example 5.11. Let $\xi_n$ be a sequence of i.i.d. r.v. with $P(\xi_1 = 0) = P(\xi_1 = 1) = 1/2$ and define
$$U_n = \sum_{m=1}^n 2^{-m}\xi_m.$$
$U_n$ is a nondecreasing sequence bounded by 1 and thus converges for all $\omega$. A fortiori it converges $P$-a.s., and thus also in probability and in law. With $\varepsilon > 0$ and $\tilde U_n := |U_n - U|$, where $U$ denotes the limit,
$$E\tilde U_n = E\tilde U_nI(\tilde U_n > \varepsilon) + E\tilde U_nI(\tilde U_n \le \varepsilon) \le 2P(\tilde U_n > \varepsilon) + \varepsilon \xrightarrow{n\to\infty} \varepsilon,$$
and thus $U_n$ converges in $L^1$ as well, by the arbitrariness of $\varepsilon$.

Note that the limit is a nondegenerate r.v. in this case and its distribution is uniform (why?).


5.2.1. Borel-Cantelli lemmas. It may seem that for studying almost sure convergence one always needs a description of the probability space (as e.g. in Examples 5.10, 5.11 above). Fortunately this is not always the case, and often $P$-a.s. convergence can be verified by means of the Borel-Cantelli lemmas given below.

Let $A_n$ be an infinite sequence of sets from $\mathcal{F}$. Assume that $A_n$ decreases, i.e. $A_{n+1} \subseteq A_n$. Then the limit set
$$A = \lim_{n\to\infty}A_n \equiv \bigcap_{n\ge1}A_n$$
is well defined. Analogously, if $A_n \subseteq A_{n+1}$ the limit set is
$$A = \lim_{n\to\infty}A_n \equiv \bigcup_{n\ge1}A_n.$$
Can the limit set be defined for a non-monotonous sequence $A_n$? The answer is negative in general, just as it is negative for sequences of numbers: e.g. $x_n = (-1)^n$ has no limit. However, analogously to $\overline{\lim}_{n\to\infty}x_n = \lim_{n\to\infty}\sup_{m\ge n}x_m$ and $\underline{\lim}_{n\to\infty}x_n = \lim_{n\to\infty}\inf_{m\ge n}x_m$ (which are always well defined, though they may take infinite values!), we may define
$$\overline{\lim_{n\to\infty}}\,A_n = \bigcap_{n\ge1}\bigcup_{m\ge n}A_m$$
and
$$\underline{\lim_{n\to\infty}}\,A_n = \bigcup_{n\ge1}\bigcap_{m\ge n}A_m.$$
Now if $\overline{\lim}_{n\to\infty}A_n$ and $\underline{\lim}_{n\to\infty}A_n$ coincide, we say that $A_n$ has a limit and it equals, e.g., $\overline{\lim}_{n\to\infty}A_n$.

Both the upper and the lower limit have interesting probabilistic interpretations: the set $A_{i.o.} = \overline{\lim}_{n\to\infty}A_n$ consists of all the points $\omega \in \Omega$ appearing in the sequence $A_n$ infinitely often (why?); the set $A_e = \underline{\lim}_{n\to\infty}A_n$ contains the points which appear in all the sets $A_k$ starting from $k \ge n'$ for some $n'$, i.e. which eventually appear in all the sets.

Now we are ready for the first Borel-Cantelli lemma.

Lemma 5.12. Let $(A_n)_{n\ge1}$ be a sequence of sets; then
$$\sum_{n=1}^\infty P(A_n) < \infty \implies P(A_{i.o.}) = 0.$$

Proof.
$$P(A_{i.o.}) = P\big(\cap_{n\ge1}\cup_{m\ge n}A_m\big) \overset{\dagger}{=} \lim_{n\to\infty}P\big(\cup_{m\ge n}A_m\big) \le \lim_{n\to\infty}\sum_{m=n}^\infty P(A_m) = 0,$$
where the equality $\dagger$ is due to the continuity property⁴ of the probability measure $P$. $\Box$

Corollary 5.13. Let $(\xi_n)_{n\ge1}$ be a sequence of r.v. If for any $\varepsilon > 0$
$$\sum_{n=1}^\infty P(|\xi_n| \ge \varepsilon) < \infty,$$
then $\xi_n$ converges to zero $P$-a.s.

⁴which follows from the axiomatic $\sigma$-additivity property


Proof. Note that
$$\{\omega : \xi_n \not\to 0\} = \bigcup_{k\ge1}\bigcap_{n\ge1}\bigcup_{m\ge n}\{|\xi_m| \ge 1/k\} \equiv \bigcup_{k\ge1}A^{1/k}_{i.o.}.$$
So
$$P(\xi_n \not\to 0) = P\big(\cup_{k\ge1}A^{1/k}_{i.o.}\big) \le \sum_{k=1}^\infty P(A^{1/k}_{i.o.}),$$
and the desired statement holds, since $P(A^{1/k}_{i.o.}) = 0$ for any $k$. $\Box$

Corollary 5.14. The sequence $(\xi_n)_{n\ge1}$ converges $P$-a.s. to zero if there are constants $C > 0$ and $\rho \in [0,1)$ such that
$$E|\xi_n|^p \le C\rho^n, \qquad n \ge 1,$$
for some $p \ge 1$, i.e. if $(\xi_n)_{n\ge1}$ converges to zero exponentially fast in $L^p$.

Proof. By the Chebyshev inequality. $\Box$

Is the opposite to Lemma 5.12 true? I.e., does $\sum_{n=1}^\infty P(A_n) = \infty$ imply that $P(A_{i.o.}) = 1$? The answer is generally negative: e.g. take $A_n = \{\omega \le n^{-1}\}$ on the probability space $([0,1], \mathcal{B}, \lambda)$. But:

Lemma 5.15. If $(A_n)_{n\ge1}$ is a sequence of independent sets, then
$$\sum_{n=1}^\infty P(A_n) = \infty \implies P(A_{i.o.}) = 1.$$

Proof. By the well known set operation rules
$$\overline{A_{i.o.}} = \overline{\bigcap_{n\ge1}\bigcup_{m\ge n}A_m} = \bigcup_{n\ge1}\bigcap_{m\ge n}\overline{A}_m,$$
and so it suffices to verify that for any fixed $n$
$$P\big(\cap_{m\ge n}\overline{A}_m\big) = 0.$$
By independence
$$P\big(\cap_{m\ge n}\overline{A}_m\big) = \prod_{m=n}^\infty P(\overline{A}_m) = \prod_{m=n}^\infty\big(1 - P(A_m)\big) = \exp\Big\{\sum_{m=n}^\infty\log\big(1 - P(A_m)\big)\Big\} \le \exp\Big\{-\sum_{m=n}^\infty P(A_m)\Big\} = 0,$$
where the inequality $\log(1 - x) \le -x$, $x \in [0,1)$, has been used. $\Box$

Example 5.16. (Convergence in probability does not imply $P$-a.s. convergence.) Let $(\xi_n)_{n\ge1}$ be a sequence of independent r.v. such that $P(\xi_n = 1) = p_n$ and $P(\xi_n = 0) = 1 - p_n$.

Let $A_n = \{\xi_n = 1\}$; then, by the second Borel-Cantelli lemma, $\sum_{n=1}^\infty p_n = \infty$ (e.g. $p_n = 1/n$) implies $P(A_{i.o.}) = 1$, i.e. $\xi_n$ does not converge to zero $P$-a.s. However, it converges to zero in probability whenever $p_n \to 0$ (e.g. for $p_n = 1/n$).


5.2.2. Cauchy criteria for convergence. A sequence of numbers $x_n$ is said to be fundamental (or Cauchy) if
$$\sup_{\ell, m\ge n}|x_\ell - x_m| \xrightarrow{n\to\infty} 0.$$
A numerical sequence $x_n$ converges if and only if it is fundamental. The same holds when the $x_n$ are vectors in $\mathbb{R}^d$. In an infinite dimensional space this is not necessarily true: while any convergent sequence is fundamental, the other implication does not hold in general. If it does, the space is called complete. Completeness is an important property, since it can be used to construct limit objects (as e.g. in Theorem 1.6).

In the next chapter we will extensively use the space of square integrable random variables (i.e. with finite second moment, $EX^2 < \infty$), denoted by $L^2(\Omega, \mathcal{F}, P)$ (or shortly $L^2$). Clearly $L^2$ is a linear space ($X \in L^2$, $Y \in L^2 \implies \alpha X + \beta Y \in L^2$ for $\alpha, \beta \in \mathbb{R}$). Endowed with the scalar product⁵ $\langle X, Y\rangle = EXY$ it becomes a Euclidean space with the induced norm $\|X\| = \sqrt{\langle X, X\rangle} = \sqrt{EX^2}$. Usually $L^2$ is infinite dimensional, i.e. there is no finite number of elements of $L^2$ such that any other element is their linear combination; in other words, it has no finite basis. Let us verify the completeness of $L^2$, which turns it into a Hilbert space with all its well developed machinery.

Theorem 5.17. A sequence of random variables $\xi_n$ from $L^2$ converges to a random variable $\xi$ in $L^2$ if and only if it is fundamental (Cauchy) in $L^2$, i.e.
$$\lim_{n\to\infty}\sup_{\ell, m\ge n}E(\xi_\ell - \xi_m)^2 = 0.$$

Proof. Clearly if $\xi_n \xrightarrow{L^2} \xi$, then it is fundamental:
$$E(\xi_n - \xi_m)^2 \le 2E(\xi_n - \xi)^2 + 2E(\xi_m - \xi)^2 \xrightarrow{n,m\to\infty} 0.$$
The proof of sufficiency is more involved. Suppose that $\xi_n$ is fundamental. Then there is an $n_1$ such that
$$\|\xi_n - \xi_m\| = \sqrt{E(\xi_n - \xi_m)^2} \le \frac{1}{2^2}, \qquad \forall m, n \ge n_1.$$
Similarly there are indices $n_k$, $k \ge 1$, such that $n_k \ge n_{k-1}$ and
$$\sqrt{E(\xi_n - \xi_m)^2} \le 2^{-2k}, \qquad \forall m, n \ge n_k.$$
Let $A_k = \{\omega : |\xi_{n_{k+1}} - \xi_{n_k}| \ge 2^{-k}\}$; then
$$P(A_k) \le 2^kE|\xi_{n_{k+1}} - \xi_{n_k}| \le 2^k\sqrt{E(\xi_{n_{k+1}} - \xi_{n_k})^2} \le 2^{-k}.$$
So $\sum_{k=1}^\infty P(A_k) < \infty$ and by the first Borel-Cantelli lemma $P(A_k,\ \text{i.o.}) = 0$, i.e.
$$P\big(|\xi_{n_{k+1}} - \xi_{n_k}| \ge 2^{-k},\ \text{i.o.}\big) = 0,$$
which implies $\sum_{k=1}^\infty|\xi_{n_{k+1}} - \xi_{n_k}| < \infty$ $P$-a.s., and thus $\xi = \lim_{k\to\infty}\xi_{n_k} = \xi_{n_1} + \sum_{k=1}^\infty\big(\xi_{n_{k+1}} - \xi_{n_k}\big)$ exists. Now it is left to show that this $\xi$ is the required limit of $\xi_n$, i.e. that $E(\xi_n - \xi)^2 \to 0$, and that $E\xi^2 < \infty$, i.e. $\xi \in L^2$. Let us fix $\varepsilon > 0$ and choose $N_\varepsilon$ such that $E(\xi_m - \xi_n)^2 \le \varepsilon$ for all $n, m \ge N_\varepsilon$. Then with $n \ge N_\varepsilon$
$$E(\xi_n - \xi)^2 = E\big(\xi_n - \lim_{k\to\infty}\xi_{n_k}\big)^2 = E\,\underline{\lim_{k\to\infty}}\,(\xi_n - \xi_{n_k})^2 \le \underline{\lim_{k\to\infty}}\,E(\xi_n - \xi_{n_k})^2 \le \varepsilon,$$
and hence⁶, by the arbitrariness of $\varepsilon$, $\lim_{n\to\infty}E(\xi_n - \xi)^2 = 0$. Since $E\xi^2 \le 2E(\xi - \xi_n)^2 + 2E\xi_n^2$, also $E\xi^2 < \infty$. $\Box$

⁵Recall that a scalar product $\langle x, y\rangle$ is a function of pairs of elements of the linear space under consideration (i.e. $L^2$ here) to $\mathbb{R}$, such that (1) $\langle\alpha X + \beta Y, Z\rangle = \alpha\langle X, Z\rangle + \beta\langle Y, Z\rangle$ for $\alpha, \beta \in \mathbb{R}$; (2) $\langle X, X\rangle \ge 0$; and (3) $\langle X, X\rangle = 0 \implies X = 0$ (in the case of $L^2$ the latter is understood $P$-a.s.).
⁶The inequality $E\,\underline{\lim}_{n\to\infty}X_n \le \underline{\lim}_{n\to\infty}EX_n$ is called the Fatou lemma. The proof can be found e.g. in (8).

6. Limit theorems

Limit theorems deal with convergence of sums of r.v. and are one of the central topics of probability theory and stochastic processes. Below we give the simplest versions of the classic limit theorems, which originated several centuries ago.

6.1. The Weak Law of Large Numbers.

Theorem 6.1. Let $\xi_n$ be a sequence of orthogonal random variables with $m = E\xi_n$ and $V = \mathrm{var}(\xi_n) = E(\xi_n - m)^2 < \infty$. Let $S_n = \sum_{k=1}^n\xi_k$; then
$$\frac{1}{n}S_n \xrightarrow[n\to\infty]{L^2} m.$$

Proof.
$$E\Big(\frac{1}{n}S_n - m\Big)^2 = \frac{1}{n^2}E\Big(\sum_{k=1}^n(\xi_k - m)\Big)^2 = \frac{1}{n^2}\sum_{k=1}^n V = V/n \to 0. \qquad\Box$$

As was already mentioned before the WLLN holds under much more generalconditions.

6.2. The Strong Law of Large Numbers. By the strong LLN, $P$-a.s. convergence of the empirical mean is usually meant.

Theorem 6.2. (Cantelli) Let $\xi_n$ be a sequence of independent r.v. with finite fourth moments, such that
$$E|\xi_n - E\xi_n|^4 \le C, \qquad n \ge 1,$$
for some $C > 0$. Then
$$\lim_{n\to\infty}\frac{S_n - ES_n}{n} = 0, \qquad P\text{-a.s.}$$

Proof. Without loss of generality we may assume $E\xi_n = 0$. By the first Borel-Cantelli lemma, $S_n/n \to 0$ $P$-a.s. is implied by
$$\sum_n P\big(\big|S_n/n\big| \ge \varepsilon\big) < \infty$$
for any $\varepsilon > 0$. By the Chebyshev inequality it is sufficient that $\sum_n E|S_n/n|^4 < \infty$.

Let us verify the latter condition. First note that
$$S_n^4 = \sum_{i=1}^n\xi_i^4 + \sum_{i<j}\frac{4!}{2!2!}\xi_i^2\xi_j^2 + \sum_{i\ne j,\,i\ne k,\,j<k}\frac{4!}{2!1!1!}\xi_i^2\xi_j\xi_k + \sum_{i<j<k<\ell}4!\,\xi_i\xi_j\xi_k\xi_\ell + \sum_{i\ne j}\frac{4!}{3!1!}\xi_i^3\xi_j.$$
Since $E\xi_k = 0$ and the $\xi_k$ are independent,
$$ES_n^4 = \sum_{i=1}^n E\xi_i^4 + 6\sum_{i<j}E\xi_i^2E\xi_j^2 \le nC + 6\sum_{i<j}\sqrt{E\xi_i^4E\xi_j^4} \le nC + 6\frac{n(n-1)}{2}C = (3n^2 - 2n)C \le 3n^2C,$$
and thus
$$\sum_n E(S_n^4/n^4) \le 3C\sum_n n^{-2} < \infty. \qquad\Box$$
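Before moving to the CLT, here is a short illustration of the strong LLN (Python with numpy; the bounded distribution and all names are my own choices): along a single simulated path, $S_n/n$ settles down to the mean.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200_000
xi = rng.uniform(-1.0, 1.0, size=N)       # i.i.d., mean 0, bounded (so fourth moments are bounded)
running_mean = np.cumsum(xi) / np.arange(1, N + 1)

for n in [100, 1_000, 10_000, 100_000, N]:
    print(f"n={n:7d}   S_n/n = {running_mean[n - 1]: .5f}")
```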

6.3. Central Limit Theorem. The laws of large numbers state that $S_n = \sum_{i=1}^n\xi_i$ converges "strongly" (e.g. to the expectation of $\xi_1$) if it is scaled by $n^{-1}$. What happens if $\sum_{i=1}^n\xi_i$ is examined on a finer scale, say $1/\sqrt{n}$? The Central Limit Theorem states that it converges weakly to a Gaussian r.v. regardless (!) of the precise form of the distribution of $\xi_n$.

A useful tool in the proof of weak convergence is

Theorem 6.3. Let $F_n$ be a sequence of distribution functions on $\mathbb{R}$ and $\varphi_n(t)$ the corresponding sequence of characteristic functions. Then
$$F_n(x) \xrightarrow{w} F(x) \iff \varphi_n(t) \to \varphi(t), \quad \forall t \in \mathbb{R},$$
where $\varphi(t)$ is the characteristic function of $F(x)$.

Theorem 6.4. Let $\xi_n$ be an i.i.d. sequence with $0 < \mathrm{var}(\xi_1) < \infty$. Then
$$\frac{S_n - ES_n}{\sqrt{\mathrm{var}(S_n)}} \xrightarrow[n\to\infty]{w} \xi,$$
where $\xi$ is a standard Gaussian r.v.

Proof. (Sketch) Denote $m = E\xi_1$, $\sigma^2 = \mathrm{var}(\xi_1)$ and
$$\varphi(t) = Ee^{it(\xi_1 - m)}.$$
Then
$$\varphi_n(t) \equiv E\exp\Big\{it\frac{S_n - ES_n}{\sqrt{\mathrm{var}(S_n)}}\Big\} = E\exp\Big\{it\frac{\sum_{k=1}^n(\xi_k - m)}{\sqrt{n}\,\sigma}\Big\} = \prod_{k=1}^n E\exp\big\{it(\xi_1 - m)/(\sqrt{n}\,\sigma)\big\} = \Big[\varphi\Big(\frac{t}{\sqrt{n}\,\sigma}\Big)\Big]^n.$$
It can be shown (see II.12.14 in (8)) that $\mathrm{var}(\xi_1) < \infty$ implies
$$\varphi(t) = 1 - \sigma^2t^2/2 + o(t^2), \qquad t \to 0.$$
Then
$$\varphi_n(t) = \Big[1 - \frac{\sigma^2t^2}{2n\sigma^2} + o(1/n)\Big]^n \xrightarrow{n\to\infty} e^{-t^2/2}.$$
Theorem 6.3 then implies the statement. $\Box$
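The following sketch (Python with numpy; the exponential distribution and all names are my own choices) illustrates Theorem 6.4 numerically: standardized sums of a decidedly non-Gaussian i.i.d. sequence produce tail probabilities close to the standard Gaussian ones.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(5)
n, trials = 400, 10_000
xi = rng.exponential(scale=1.0, size=(trials, n))        # mean 1, variance 1
z = (xi.sum(axis=1) - n) / np.sqrt(n)                    # (S_n - E S_n) / sqrt(var S_n)

def std_normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

for x in [-1.0, 0.0, 1.0, 2.0]:
    print(f"P(Z <= {x:+.1f}):  empirical {np.mean(z <= x):.4f}   Gaussian {std_normal_cdf(x):.4f}")
```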


6.4. The Law of Small Numbers.

Theorem 6.5. (Poisson) Let for each $n \ge 1$ the i.i.d. r.v. $\xi^1_n, \ldots, \xi^n_n$ be such that
$$P(\xi^k_n = 1) = p_n, \quad P(\xi^k_n = 0) = 1 - p_n, \qquad 1 \le k \le n,$$
where $p_n \to 0$ as $n \to \infty$ so that $\lim_{n\to\infty}np_n = \lambda > 0$. Then, with $S_n = \sum_{k=1}^n\xi^k_n$,
$$P(S_n = m) \to e^{-\lambda}\frac{\lambda^m}{m!}$$
for any $m \ge 0$.

Proof. Since $Ee^{it\xi^k_n} = p_ne^{it} + 1 - p_n$,
$$\varphi_{S_n}(t) = Ee^{itS_n} = \big(1 + p_n(e^{it} - 1)\big)^n = \Big(1 + \frac{\lambda}{n}(e^{it} - 1) + o(1/n)\Big)^n \to \exp\big\{\lambda(e^{it} - 1)\big\}.$$
The claim holds by Theorem 6.3, since $\exp\{\lambda(e^{it} - 1)\}$ is the characteristic function of the Poisson distribution. $\Box$
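A quick numerical check of the Poisson limit (pure-Python sketch; $\lambda = 3$ and the names are my own choices) compares the binomial probabilities $P(S_n = m)$ with their Poisson limits for a moderately large $n$.

```python
from math import comb, exp, factorial

lam, n = 3.0, 200
p = lam / n                                    # p_n = lambda / n

def binom_pmf(m):
    return comb(n, m) * p**m * (1 - p)**(n - m)

def poisson_pmf(m):
    return exp(-lam) * lam**m / factorial(m)

for m in range(6):
    print(f"m={m}:  binomial {binom_pmf(m):.5f}   Poisson {poisson_pmf(m):.5f}")
```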


CHAPTER 2

Orthogonal projection and linear estimation

1. Orthogonal projection

1.1. Simple scalar case. Consider a pair of random variables $(X, Y)$ from the Hilbert space $L^2(\Omega, \mathcal{F}, P)$ (see Section 5.2.2 for a discussion). Suppose we would like to find the best approximation of $X$ by means of $Y$ and a constant, i.e. to find $\hat X = a_0 + a_1Y$ minimizing the mean square error $E(X - \hat X)^2$. Simple calculations give the desired answer:
$$E(X - a_0 - a_1Y)^2 = E\big(X - EX - a_1(Y - EY) + (EX - a_0 - a_1EY)\big)^2 = E\big(X - EX - a_1(Y - EY)\big)^2 + \big(EX - a_0 - a_1EY\big)^2$$
$$\ge E\big(X - EX - a_1(Y - EY)\big)^2 = \mathrm{cov}(X) - 2a_1\mathrm{cov}(X, Y) + a_1^2\mathrm{cov}(Y) \ge \mathrm{cov}(X) - \mathrm{cov}^2(X, Y)/\mathrm{cov}(Y),$$
where the equalities are attained at $a_1 = \mathrm{cov}(X, Y)/\mathrm{cov}(Y)$ and $a_0 = EX - a_1EY$. In other words,
$$\hat X = EX + \mathrm{cov}(X, Y)/\mathrm{cov}(Y)\,\big(Y - EY\big)$$
gives the best estimate of $X$ from the vector $(1, Y)$, with the estimation error
$$E(X - \hat X)^2 = \mathrm{cov}(X) - \mathrm{cov}^2(X, Y)/\mathrm{cov}(Y).$$

What happens when $\mathrm{cov}(Y) = 0$? In this case $Y = EY$ $P$-a.s. and so
$$E(X - a_0 - a_1Y)^2 = E(X - a_0 - a_1EY)^2 = E(X - EX + EX - a_0 - a_1EY)^2 \ge \mathrm{cov}(X),$$
where the equality is attained e.g. at $a_0 = EX$ and $a_1 = 0$. So the general formulae
$$\hat X = EX + \mathrm{cov}(X, Y)\,\mathrm{cov}^\oplus(Y)\,\big(Y - EY\big), \qquad E(X - \hat X)^2 = \mathrm{cov}(X) - \mathrm{cov}^2(X, Y)\,\mathrm{cov}^\oplus(Y)$$
hold, where
$$\mathrm{cov}^\oplus(Y) = \begin{cases}\mathrm{cov}^{-1}(Y), & \mathrm{cov}(Y) > 0 \\ 0, & \mathrm{cov}(Y) = 0.\end{cases}$$

The random variable $\hat X$ is called the orthogonal projection of $X$ on the linear subspace $\mathcal{M} \subseteq L^2$ spanned by the r.v. 1 and $Y$, denoted also by $\hat E(X|\mathcal{M})$ and sometimes referred to as the conditional expectation in the wide sense of $X$ with respect to $\mathcal{M}$.

Indeed $X = \hat X + (X - \hat X)$, where $\hat X$ belongs to $\mathcal{M}$ and $X - \hat X$ is orthogonal to $\mathcal{M}$, i.e. for any $Z \in \mathcal{M}$
$$E(X - \hat X)Z = 0. \qquad (1.1)$$


To see this, consider $E(X - \hat X - tZ)^2$, where $t$ is a constant and $Z \in \mathcal{M}$. By optimality
$$E(X - \hat X - tZ)^2 \ge E(X - \hat X)^2,$$
i.e.
$$-2tE(X - \hat X)Z + t^2EZ^2 \ge 0.$$
Choose $t = \alpha E(X - \hat X)Z$; then
$$\big[E(X - \hat X)Z\big]^2\big(-2\alpha + \alpha^2EZ^2\big) \ge 0.$$
The latter can hold for small enough $\alpha > 0$ only if (1.1) holds.

The other direction is also true: if $\tilde X \in \mathcal{M}$ and
$$E(X - \tilde X)Z = 0$$
for all $Z \in \mathcal{M}$, then $\tilde X = \hat X$, $P$-a.s. Indeed,
$$E(X - Z)^2 = E(X - \tilde X + \tilde X - Z)^2 = E(X - \tilde X)^2 + E(\tilde X - Z)^2 \ge E(X - \tilde X)^2$$
for any $Z \in \mathcal{M}$. In particular, with $Z := \hat X$ we have $E(\tilde X - \hat X)^2 = 0$, i.e. $\tilde X = \hat X$ $P$-a.s.
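A small numerical sanity check of the scalar formula (Python with numpy; the joint distribution of (X, Y) is an arbitrary choice of mine): the projection coefficients computed from sample covariances give a mean square error that nearby linear estimates do not improve on.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 100_000
Y = rng.standard_normal(N)
X = 2.0 + 1.5 * Y + rng.standard_normal(N)        # X correlated with Y plus independent noise

c = np.cov(X, Y)
a1 = c[0, 1] / c[1, 1]                            # cov(X, Y) / cov(Y)
a0 = X.mean() - a1 * Y.mean()
X_hat = a0 + a1 * Y
print("projection MSE          :", np.mean((X - X_hat) ** 2))

# perturbing the coefficients can only increase the empirical MSE
for da0, da1 in [(0.1, 0.0), (0.0, 0.1), (-0.1, 0.1)]:
    Z = (a0 + da0) + (a1 + da1) * Y
    print(f"perturbed ({da0:+.1f},{da1:+.1f}) MSE:", np.mean((X - Z) ** 2))
```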

1.2. Vector case. Now let us extend this result to the vector case. Let $X$ be an $L^2$ r.v. and $Y$ a random vector in $\mathbb{R}^n$ with entries in $L^2$. Let $\mathcal{M}$ be the linear subspace generated by $1, Y_1, \ldots, Y_n$. We are interested in the optimal estimate $\hat X = a_0 + \sum_{i=1}^n a_iY_i$, such that
$$E(X - \hat X)^2 \le E(X - Z)^2$$
for all $Z \in \mathcal{M}$. By the very same arguments as in the scalar case, we get

Lemma 1.1. The estimate $\hat X$ is optimal if and only if $E(X - \hat X)Z = 0$ for all $Z \in \mathcal{M}$.

Now let $a$ be the column vector with entries $a_1, \ldots, a_n$. By the above lemma the optimal estimate is found from
$$E(X - a_0 - a^*Y)\cdot 1 = 0, \qquad E(X - a_0 - a^*Y)Y^* = 0.$$
The first constraint implies $a_0 = EX - a^*EY$, so that the second one is rewritten as
$$E\big((X - EX) - a^*(Y - EY)\big)(Y - EY)^* = 0,$$
or in other words
$$\mathrm{cov}(X, Y) - a^*\mathrm{cov}(Y) = 0. \qquad (1.2)$$
If $\mathrm{cov}(Y) > 0$ (a positive definite matrix), then
$$a^* = \mathrm{cov}(X, Y)\,\mathrm{cov}^{-1}(Y)$$
and thus
$$\hat X = EX + \mathrm{cov}(X, Y)\,\mathrm{cov}^{-1}(Y)(Y - EY).$$
But what if $\mathrm{cov}(Y)$ is singular?


1.2.1. Some facts from linear algebra.

Lemma 1.2. Any real symmetric matrix $S$ ($S = S^*$) is decomposable as $S = U\Lambda U^*$, where $U$ is a real orthogonal matrix ($U^*U = I$) and $\Lambda$ is a real diagonal matrix.

Proof. (partial) First show that any eigenvalue of $S$ is real. Let $\lambda$ be an eigenvalue of $S$ and $\varphi$ a corresponding right eigenvector, i.e. $S\varphi = \lambda\varphi$. Multiply this equation by the conjugate-transposed $\varphi'$, so that $\varphi'S\varphi = \lambda\|\varphi\|^2$. $\lambda$ is real since both $\|\varphi\|^2$ and $\varphi'S\varphi$ are real. Indeed, let $\alpha = \varphi'S\varphi$; then $\bar\alpha = (\varphi'S\varphi)^* = \varphi'S^*\varphi = \varphi'S\varphi = \alpha$. If $\lambda$ is real, then $\varphi$ can be taken real as well, being a solution of $(S - \lambda I)\varphi = 0$.

Let us now show that eigenvectors corresponding to distinct eigenvalues are orthogonal:
$$S\varphi_1 = \lambda_1\varphi_1 \implies \varphi_2^*S\varphi_1 = \lambda_1\varphi_2^*\varphi_1 \implies \lambda_2\varphi_2^*\varphi_1 = \lambda_1\varphi_2^*\varphi_1 \implies \varphi_2^*\varphi_1 = 0. \qquad\Box$$

Exercise 1.3. Give a proof without assuming that the eigenvalues are distinct.

If the diagonal matrix $\Lambda$ has positive (nonnegative) entries, $S$ is said to be a positive (nonnegative) definite matrix, since for any vector $v \ne 0$
$$v^*Sv = v^*U\Lambda U^*v = \big\|U^*v\big\|_\Lambda^2 > 0 \ (\ge 0).$$

Definition 1.4. Let $S$ be a symmetric matrix with $S = U\Lambda U^*$. The pseudo-inverse of $S$ in the sense of Moore-Penrose is $S^\oplus = U\Lambda^\oplus U^*$, where $\Lambda^\oplus$ is the diagonal matrix with entries $\lambda_i^{-1}I(\lambda_i \ne 0)$.

1.2.2. Back to the estimation problem. Analogously to the scalar case, set
$$\hat X = EX + \mathrm{cov}(X, Y)\,\mathrm{cov}^\oplus(Y)(Y - EY).$$
Let us verify the orthogonality, i.e. (1.2). First note that if $\varphi$ is an eigenvector of $\mathrm{cov}(Y)$ corresponding to the zero eigenvalue, then
$$0 = \varphi^*\mathrm{cov}(Y)\varphi = E\big((Y - EY)^*\varphi\big)^2 \implies (Y - EY)^*\varphi = 0, \ P\text{-a.s.},$$
and thus
$$\mathrm{cov}(X, Y)\varphi = E(X - EX)(Y - EY)^*\varphi = 0. \qquad (1.3)$$
So
$$\mathrm{cov}(X, Y) - \mathrm{cov}(X, Y)\,\mathrm{cov}^\oplus(Y)\,\mathrm{cov}(Y) = \mathrm{cov}(X, Y)\,U\big(I - \Lambda^\oplus\Lambda\big)U^* = 0,$$
where the latter equality is due to (1.3) and Definition 1.4. Note that orthogonality would be preserved if in Definition 1.4 the diagonal entries of $\Lambda^\oplus$ were defined as $\lambda_i^{-1}I(\lambda_i > 0) + \alpha_iI(\lambda_i = 0)$ with arbitrary constants $\alpha_i$; $\alpha_i = 0$ is simply a convenient choice.

The following theorem summarizes the results obtained above:

Theorem 1.5. Let $X$ and $Y$ be random vectors in $\mathbb{R}^m$ and $\mathbb{R}^n$ with square integrable entries. The orthogonal projection of $X$ on the linear subspace $\mathcal{M} = \mathrm{span}\{1, Y_1, \ldots, Y_n\}$ is given by
$$\hat E(X|Y) = EX + \mathrm{cov}(X, Y)\,\mathrm{cov}^\oplus(Y)(Y - EY).$$
Moreover, $E\big[X - \hat E(X|Y)\big]Z = 0$ for any $Z \in \mathcal{M}$, and
$$E\big(X - \hat E(X|Y)\big)\big(X - \hat E(X|Y)\big)^* = \mathrm{cov}(X) - \mathrm{cov}(X, Y)\,\mathrm{cov}^\oplus(Y)\,\mathrm{cov}^*(X, Y).$$
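The formula of Theorem 1.5 is easy to exercise numerically, including the degenerate case. The sketch below (Python with numpy; the toy model, in which one observation coordinate duplicates another so that cov(Y) is singular, is my own construction) uses the Moore-Penrose pseudo-inverse exactly in the role of $\mathrm{cov}^\oplus(Y)$.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 50_000
Y1 = rng.standard_normal(N)
Y2 = Y1.copy()                                  # duplicated coordinate -> singular cov(Y)
X = 1.0 + Y1 + 0.5 * rng.standard_normal(N)
Y = np.vstack([Y1, Y2])                         # shape (2, N)

EY = Y.mean(axis=1, keepdims=True)
covY = np.cov(Y)                                # 2x2 matrix of rank 1
covXY = np.cov(np.vstack([X, Y]))[0, 1:]        # row vector cov(X, Y)

X_hat = X.mean() + covXY @ np.linalg.pinv(covY) @ (Y - EY)
residual = X - X_hat
print("MSE of projection                :", residual.var())
print("orthogonality E[(X - X_hat) Y_i] :", (residual * (Y - EY)).mean(axis=1))
```

The near-zero orthogonality values illustrate the defining property of the projection even though cov(Y) cannot be inverted.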


1.3. Infinite dimensional case. Suppose we are given a random variable $X$ and a sequence $Y_1, Y_2, \ldots$ with $X, Y_i \in L^2$. Let $\mathcal{M}$ be the closed linear subspace generated by $1, Y_1, Y_2, \ldots$, i.e. all linear combinations of $1, Y_1, Y_2, \ldots$ and all their mean square limits:
$$\mathcal{M} = \mathrm{span}\{1, Y_1, Y_2, \ldots\}.$$
The formulae of the previous section do not make much sense in this infinite dimensional case. However, the orthogonal projection is still well defined:

Theorem 1.6. There is a unique ($P$-a.s.) random variable $\hat X := \hat E(X|Y)$ such that
$$E(X - \hat X)^2 = \inf_{Z\in\mathcal{M}}E(X - Z)^2$$
and $E\big(X - \hat X\big)Z = 0$ for any $Z \in \mathcal{M}$.

Proof. Denote $d^2 = \inf_{Z\in\mathcal{M}}E(X - Z)^2$ and let $Z_1, Z_2, \ldots$ be a sequence such that $\lim_{n\to\infty}E(X - Z_n)^2 = d^2$. Note that
$$E(Z_n - Z_m)^2 = 2E(Z_n - X)^2 + 2E(Z_m - X)^2 - 4E\Big(\frac{Z_n + Z_m}{2} - X\Big)^2.$$
Since $(Z_n + Z_m)/2 \in \mathcal{M}$, $E\big((Z_n + Z_m)/2 - X\big)^2 \ge d^2$ and so
$$E(Z_n - Z_m)^2 \le 2E(Z_n - X)^2 + 2E(Z_m - X)^2 - 4d^2 \xrightarrow{n,m\to\infty} 0,$$
i.e. $Z_n$ is a fundamental sequence in $L^2$. The space $L^2$ is complete, i.e. any fundamental sequence converges to an element of $L^2$, which is nothing but $\hat X$.

Let us verify the $P$-a.s. uniqueness: suppose that there is a r.v. $\tilde X \in \mathcal{M}$ such that
$$E(X - \tilde X)^2 = E(X - \hat X)^2 = d^2.$$
Then
$$E(\hat X + \tilde X - 2X)^2 + E(\hat X - \tilde X)^2 = 2E(\hat X - X)^2 + 2E(\tilde X - X)^2 = 4d^2.$$
But $E(\hat X + \tilde X - 2X)^2 = 4E\big((\hat X + \tilde X)/2 - X\big)^2 \ge 4d^2$ and so $E(\hat X - \tilde X)^2 = 0$, i.e. $P(\hat X = \tilde X) = 1$.

Now let us verify orthogonality: since for any $t \in \mathbb{R}$ and $Z \in \mathcal{M}$
$$E(X - \hat X - tZ)^2 \ge E(X - \hat X)^2,$$
we have
$$t^2EZ^2 - 2tE(X - \hat X)Z \ge 0.$$
Take $t = \alpha E(X - \hat X)Z$, $\alpha \in \mathbb{R}$; then
$$\big(E(X - \hat X)Z\big)^2\big[\alpha^2EZ^2 - 2\alpha\big] \ge 0.$$
For sufficiently small $\alpha > 0$, $\big[\alpha^2EZ^2 - 2\alpha\big] < 0$ and thus $E(X - \hat X)Z = 0$. $\Box$


1.4. Properties of orthogonal projection. The orthogonal projection satisfies the following properties. Below $\mathcal{M}$ denotes¹ some linear subspace of $L^2$.

1. $E\hat E(X|\mathcal{M}) = EX$.

2. $\hat E(X|\mathcal{M}) = \begin{cases} X, & X \in \mathcal{M} \\ 0, & X \perp \mathcal{M}. \end{cases}$

3. $\hat E(c_1X_1 + c_2X_2|\mathcal{M}) = c_1\hat E(X_1|\mathcal{M}) + c_2\hat E(X_2|\mathcal{M})$.

4. If $\mathcal{M}_1 \subseteq \mathcal{M}_2$, then $\hat E\big(\hat E(X|\mathcal{M}_2)\big|\mathcal{M}_1\big) = \hat E(X|\mathcal{M}_1)$.

5. If $\mathcal{M}_1$ and $\mathcal{M}_2$ are orthogonal subspaces, then
$$\hat E(X|\mathcal{M}_1 \oplus \mathcal{M}_2) = \hat E(X|\mathcal{M}_1) + \hat E(X|\mathcal{M}_2).$$

Proof.
(1) By orthogonality, $E\big(X - \hat E(X|\mathcal{M})\big)\cdot 1 = 0$.
(2) If $X \in \mathcal{M}$, it is the orthogonal projection by definition. If $X \perp \mathcal{M}$, i.e. $EXZ = 0$ for any r.v. $Z \in \mathcal{M}$, then again by definition $\hat E(X|\mathcal{M}) = 0$.
(3) For any r.v. $Z \in \mathcal{M}$
$$E\big(c_1X_1 + c_2X_2 - c_1\hat E(X_1|\mathcal{M}) - c_2\hat E(X_2|\mathcal{M})\big)Z = c_1E\big(X_1 - \hat E(X_1|\mathcal{M})\big)Z + c_2E\big(X_2 - \hat E(X_2|\mathcal{M})\big)Z = 0.$$
(4) For any $Z \in \mathcal{M}_1$
$$0 = E\big[\hat E(X|\mathcal{M}_2) - \hat E\big(\hat E(X|\mathcal{M}_2)|\mathcal{M}_1\big)\big]Z = E\big[\hat E(X|\mathcal{M}_2) - X\big]Z + E\big[X - \hat E\big(\hat E(X|\mathcal{M}_2)|\mathcal{M}_1\big)\big]Z = E\big[X - \hat E\big(\hat E(X|\mathcal{M}_2)|\mathcal{M}_1\big)\big]Z,$$
where the latter equality holds since $Z \in \mathcal{M}_1 \implies Z \in \mathcal{M}_2$.
(5) Any $Z \in \mathcal{M}_1 \oplus \mathcal{M}_2$ can be decomposed into $Z = Z_1 + Z_2$ with $Z_1 \in \mathcal{M}_1$ and $Z_2 \in \mathcal{M}_2$. Then
$$E\big(X - \hat E(X|\mathcal{M}_1) - \hat E(X|\mathcal{M}_2)\big)(Z_1 + Z_2) = E\big(X - \hat E(X|\mathcal{M}_1)\big)Z_1 + E\big(X - \hat E(X|\mathcal{M}_2)\big)Z_2 - E\hat E(X|\mathcal{M}_2)Z_1 - E\hat E(X|\mathcal{M}_1)Z_2 = 0,$$
where the latter two terms vanish due to $\mathcal{M}_1 \perp \mathcal{M}_2$. $\Box$

¹$1 \in \mathcal{M}$ is always assumed.

2. Linear estimation of stationary processes: Kolmogorov-Wiener approach

In many engineering applications it is required to estimate signals from noisy observations. Within the probabilistic framework both the signal and the observation are assumed to be random processes. Typically the signal $X_n$ is to be estimated from some segment of the trajectory of $Y$, i.e. a functional $\psi_n(Y)$ is to be found to minimize the mean square error criterion. Specifically, the following estimation problems are frequently encountered in applications:

- Filtering: estimate $X_n$ from $Y^n_0 := \{Y_0, Y_1, \ldots, Y_n\}$ for each $n \ge 1$;
- Prediction: estimate $X_{n+m}$ with $m > 0$ from $Y^n_0 := \{Y_0, Y_1, \ldots, Y_n\}$ for each $n \ge 1$;
- Smoothing: estimate $X_{n-m}$ with $m > 0$ from $Y^n_0 := \{Y_0, Y_1, \ldots, Y_n\}$ for each $n \ge 1$.

In this chapter the linear estimation problems are considered, i.e. the optimal estimator $\psi_n(\cdot)$ is constrained to be a linear functional.

2.1. Stationary processes. The Kolmogorov-Wiener theory deals with the estimation of stationary processes.

2.1. Stationary processes. The Kolmogorov-Wiener theory deals with esti-mation of stationary processes.

Definition 2.1. The random process $X = (X_n)_{n\in\mathbb{Z}}$ is stationary if all its finite dimensional distributions are shift invariant, i.e.
$$P(X_{n_1} \le x_1, \ldots, X_{n_d} \le x_d) = P(X_{n_1+h} \le x_1, \ldots, X_{n_d+h} \le x_d)$$
for any $h \in \mathbb{Z}$, any set of indices $n_1, \ldots, n_d$ and any numbers $x_1, \ldots, x_d$.

For example, a sequence of i.i.d. r.v.'s is a stationary process.

Definition 2.2. The random process $X = (X_n)_{n\in\mathbb{Z}}$ is stationary in the wide sense if its mean sequence is constant and its correlation sequence depends only on the shift, i.e. $EX_n \equiv EX_m$ and $\mathrm{cov}(X_n, X_m) \equiv R(n - m)$ for all $n, m$.

Clearly, if $X$ is a stationary process and $E|X_n|^2 < \infty$, it is also stationary in the wide sense. Hereafter we will abuse the notation, referring to wide sense stationary processes as stationary and assuming w.l.o.g. that $EX_n \equiv 0$.

The correlation sequence of a stationary process satisfies a number of important properties.

Lemma 2.3. Let $R(k) = EX_nX_{n+k}$. Then
(1) $R(k)$ is a nonnegative definite function, i.e. for any² complex sequence $z_n \in \mathbb{C}$
$$\sum_{k,\ell}z_kR(k - \ell)\bar z_\ell \ge 0;$$
(2) $R(k)$ is an even function with maximum at $k = 0$.

Proof.
1.
$$\sum_{k,\ell}z_kR(k - \ell)\bar z_\ell = E\Big|\sum_n z_nX_n\Big|^2 \ge 0.$$
2. $R(k) = EX_nX_{n+k} = R(-k)$, and $R(k) = EX_nX_{n+k} \le \sqrt{EX_n^2\,EX_{n+k}^2} = R(0)$. $\Box$

If $\sum_k|R(k)| < \infty$, the Fourier transform of $R(k)$,
$$f(\lambda) = \sum_k R(k)e^{-i\lambda k},$$
is called the spectral density of $X$. The inverse formula is
$$R(k) = \frac{1}{2\pi}\int_{-\pi}^\pi e^{i\lambda k}f(\lambda)\,d\lambda.$$
For example, the spectral density of an i.i.d. sequence (of square integrable r.v.) is constant, which is the reason such a sequence is called discrete time white noise.

²such that the summation makes sense


If $R(k)$ is not summable the spectral density may not exist, but nevertheless the spectral distribution (measure) $F(\lambda)$ is well defined, so that
$$R(k) = \int_{-\pi}^\pi e^{i\lambda k}\,dF(\lambda).$$

2.2. Smoothing. Consider a pair of (jointly) stationary random processes $(X, Y) = (X_n, Y_n)_{n\in\mathbb{Z}}$, where the signal at time $n$ is to be estimated from the whole trajectory of $Y_n$, $-\infty < n < \infty$.

Suppose that a linear estimate of the form
$$\hat X_n = \sum_{k=-\infty}^\infty a_kY_{n-k}$$
is to be constructed to minimize the mean square error, i.e.
$$E(X_n - \hat X_n)^2 \le E(X_n - Z)^2$$
for any $Z \in \mathcal{M} = \mathrm{span}\{\ldots, Y_{-1}, Y_0, Y_1, \ldots\}$. Such an estimate exists (see Theorem 1.6) and is nothing but the orthogonal projection of $X_n$ on $\mathcal{M}$.

The coefficients $a_k$ of the optimal linear smoother can be found from the orthogonality relation
$$E\Big(X_n - \sum_k a_kY_{n-k}\Big)Y_\ell = 0, \qquad \forall\ell,$$
or, for stationary processes,
$$R_{XY}(m) = \sum_{k=-\infty}^\infty a_kR_Y(m - k), \qquad \forall m := n - \ell. \qquad (2.1)$$
The latter are known as the Wiener-Hopf equations and can be solved explicitly in terms of spectral densities, which are assumed to exist. The Fourier transform of the left hand side is $S_{XY}(\lambda)$, while taking the Fourier transform of the right hand side gives³
$$\sum_{k=-\infty}^\infty a_k\sum_{m=-\infty}^\infty R_Y(m - k)e^{-im\lambda} = \sum_{k=-\infty}^\infty a_ke^{-ik\lambda}\sum_{m'=-\infty}^\infty R_Y(m')e^{-im'\lambda} = A(\lambda)S_Y(\lambda).$$
So the coefficients of the optimal smoother are obtained by taking the inverse Fourier transform of
$$A(\lambda) = \frac{S_{XY}(\lambda)}{S_Y(\lambda)},$$
where $S_Y(\lambda) > 0$ is assumed.

³and assuming that all the correlations decay fast enough so that the summations can be interchanged


The corresponding minimal mean square error is given by
$$E(X_n - \hat X_n)^2 = E(X_n - \hat X_n)X_n = R_X(0) - \sum_{k=-\infty}^\infty a_kR_{XY}(k) = \frac{1}{2\pi}\int_{-\pi}^\pi S_X(\lambda)\,d\lambda - \frac{1}{2\pi}\int_{-\pi}^\pi S_{XY}(\lambda)\sum_{k=-\infty}^\infty a_ke^{ik\lambda}\,d\lambda$$
$$= \frac{1}{2\pi}\int_{-\pi}^\pi\Big[S_X(\lambda) - \frac{\big|S_{XY}(\lambda)\big|^2}{S_Y(\lambda)}\Big]\,d\lambda. \qquad (2.2)$$

Example 2.4. Suppose that $X$ is a stationary process satisfying the recursion
$$X_n = aX_{n-1} + \varepsilon_n, \qquad \forall n,$$
where $a \in [0, 1)$ is a constant and $\varepsilon$ is a standard i.i.d. sequence. The observation process is $Y_n = X_n + \sigma\xi_n$, with $\sigma > 0$ and $\xi_n$ another standard i.i.d. sequence, independent of $X$.

The variance $V = EX_n^2$ satisfies the equation $V = a^2V + 1$, so that $V = 1/(1 - a^2)$, while the correlation satisfies
$$R_X(k) = aR_X(k - 1), \qquad k \ge 1.$$
So $R_X(k) = a^{|k|}/(1 - a^2)$ for all $k$. The spectral density is then given by
$$S_{XY}(\lambda) = S_X(\lambda) = \frac{1}{1 - a^2}\sum_{k=-\infty}^\infty a^{|k|}e^{-ik\lambda} = \ldots = \frac{1}{1 - 2a\cos(\lambda) + a^2}.$$
The spectral density of $Y$ is given by $S_Y(\lambda) = S_X(\lambda) + \sigma^2$ and so
$$A(\lambda) = \frac{S_X(\lambda)}{S_X(\lambda) + \sigma^2} = \frac{1/\sigma^2}{1 - 2a\cos(\lambda) + a^2 + 1/\sigma^2}.$$
The minimal mean square error can be calculated by (2.2).
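The error formula (2.2) for Example 2.4 is easy to evaluate numerically. The sketch below (Python with numpy; the values a = 0.8 and σ = 1 are my own choices) integrates (2.2) on a frequency grid and, as an independent check, measures the smoothing error of the frequency response A(λ) applied to a long simulated path (a circular, FFT-based approximation of the doubly-infinite smoother).

```python
import numpy as np

a, sigma = 0.8, 1.0
lam = np.linspace(-np.pi, np.pi, 20001)
Sx = 1.0 / (1 - 2 * a * np.cos(lam) + a**2)
Sy = Sx + sigma**2
mse_formula = np.trapz(Sx - Sx**2 / Sy, lam) / (2 * np.pi)     # formula (2.2), with S_XY = S_X

# Monte Carlo check: smooth a simulated path with A(lambda) = S_X / S_Y via FFT
rng = np.random.default_rng(8)
N = 2**16
eps = rng.standard_normal(N)
X = np.zeros(N)
for n in range(1, N):
    X[n] = a * X[n - 1] + eps[n]
Y = X + sigma * rng.standard_normal(N)

w = 2 * np.pi * np.fft.fftfreq(N)                              # radian frequencies
Sx_w = 1.0 / (1 - 2 * a * np.cos(w) + a**2)
A = Sx_w / (Sx_w + sigma**2)
X_hat = np.real(np.fft.ifft(A * np.fft.fft(Y)))                # circular approximation of the smoother
print("MSE from (2.2):      ", mse_formula)
print("MSE from simulation: ", np.mean((X - X_hat)**2))
```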

2.3. Prediction. Let $X = (X_n)_{n\in\mathbb{Z}}$ be a stationary process with positive spectral density $S(\lambda) > 0$, $\forall\lambda$. The linear prediction problem is to estimate $X_m$, for some fixed $m \ge 1$, from the observations $X^0_{-\infty} = \{\ldots, X_{-1}, X_0\}$, i.e. to find the orthogonal projection $\hat E(X_m|X^0_{-\infty}) = \sum_{n=-\infty}^0 a_nX_n$ such that
$$E\Big(X_m - \sum_{n=-\infty}^0 a_nX_n\Big)X_\ell = 0, \qquad \forall\ell \le 0.$$
With $A(\lambda) := \sum_{n=-\infty}^0 a_ne^{in\lambda}$, this leads to
$$R(m - \ell) - \sum_{n=-\infty}^0 a_nR(n - \ell) = \frac{1}{2\pi}\int_{-\pi}^\pi e^{i(m-\ell)\lambda}S(\lambda)\,d\lambda - \frac{1}{2\pi}\int_{-\pi}^\pi\sum_{n=-\infty}^0 a_ne^{in\lambda}e^{-i\ell\lambda}S(\lambda)\,d\lambda$$
$$= \frac{1}{2\pi}\int_{-\pi}^\pi\big[e^{im\lambda} - A(\lambda)\big]S(\lambda)e^{-i\ell\lambda}\,d\lambda = 0, \qquad \forall\ell \le 0. \qquad (2.3)$$
The latter holds if the Fourier transform of the function $\big[e^{im\lambda} - A(\lambda)\big]S(\lambda)$ contains only terms with positive exponentials.


Assume that $S(\lambda)$, being a positive real function, can be factored as $S(\lambda) = \sigma(\lambda)\overline{\sigma(\lambda)}$, where $\sigma(\lambda)$ is a function whose Fourier expansion $\sigma(\lambda) = \sum_{k=0}^\infty s_ke^{-ik\lambda}$ contains only nonpositive harmonics. Then it suffices that the Fourier transform of $h(\lambda) := \big[e^{im\lambda} - A(\lambda)\big]\sigma(\lambda)$ contains only positive exponentials. In other words,
$$e^{im\lambda}\sigma(\lambda) = A(\lambda)\sigma(\lambda) + h(\lambda),$$
where $h(\lambda)$ has only positive harmonics. Note that $A(\lambda)\sigma(\lambda)$ has only nonpositive harmonics, so
$$h(\lambda) = s_0e^{im\lambda} + s_1e^{i(m-1)\lambda} + \ldots + s_{m-1}e^{i\lambda}$$
and
$$A(\lambda) = \frac{1}{\sigma(\lambda)}\sum_{k=m}^\infty s_ke^{-i(k-m)\lambda}.$$

The prediction error is given by
$$V(m) := E\Big(X_m - \sum_{n=-\infty}^0 a_nX_n\Big)^2 = E\Big(X_m - \sum_{n=-\infty}^0 a_nX_n\Big)X_m = R(0) - \sum_{n=-\infty}^0 a_nR(n - m)$$
$$= \frac{1}{2\pi}\int_{-\pi}^\pi S(\lambda)\,d\lambda - \frac{1}{2\pi}\int_{-\pi}^\pi\sum_{n=-\infty}^0 a_ne^{i(n-m)\lambda}S(\lambda)\,d\lambda = \frac{1}{2\pi}\int_{-\pi}^\pi\big[1 - A(\lambda)e^{-im\lambda}\big]S(\lambda)\,d\lambda$$
$$\overset{\dagger}{=} \frac{1}{2\pi}\int_{-\pi}^\pi\big[e^{im\lambda} - A(\lambda)\big]\overline{\big[e^{im\lambda} - A(\lambda)\big]}\,S(\lambda)\,d\lambda = \frac{1}{2\pi}\int_{-\pi}^\pi\big|e^{im\lambda} - A(\lambda)\big|^2S(\lambda)\,d\lambda$$
$$= \frac{1}{2\pi}\int_{-\pi}^\pi\big|\big[e^{im\lambda} - A(\lambda)\big]\sigma(\lambda)\big|^2\,d\lambda = \frac{1}{2\pi}\int_{-\pi}^\pi\big|h(\lambda)\big|^2\,d\lambda = |s_0|^2 + \ldots + |s_{m-1}|^2,$$
where the equality $\dagger$ is due to (2.3). It is worth noting that
$$\lim_{m\to\infty}V(m) = \sum_{\ell=0}^\infty|s_\ell|^2 = \frac{1}{2\pi}\int_{-\pi}^\pi S(\lambda)\,d\lambda = R(0) = EX_1^2,$$
i.e. the prediction trivializes as $m \to \infty$.

Consider now the case $m = 1$. Suppose the function $\log S(\lambda)$ can be expanded into a convergent Fourier series
$$\log S(\lambda) = \sum_{n=-\infty}^\infty b_ne^{-in\lambda}.$$
Since $\log S(\lambda)$ is a real function, $b_{-n} = \bar b_n$ and so
$$\sigma(\lambda) = \exp\big\{b_0/2 + b_1e^{-i\lambda} + b_2e^{-i2\lambda} + \ldots\big\}.$$
Using the expansion formula $e^z = \sum_{n=0}^\infty z^n/n!$, the first Fourier coefficient of $\sigma(\lambda)$ is found to be
$$s_0 = 1 + b_0/2 + \frac{(b_0/2)^2}{2!} + \ldots = \exp\{b_0/2\},$$


which leads to
$$V(1) = \exp\{b_0\} = \exp\Big\{\frac{1}{2\pi}\int_{-\pi}^\pi\log S(\lambda)\,d\lambda\Big\}. \qquad (2.4)$$
The latter is known as the Szegő-Kolmogorov formula.
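Formula (2.4) lends itself to a quick numerical check (Python with numpy; the AR(1) spectral density from Example 2.4 with a = 0.8 is my own choice of test case): the right hand side is evaluated on a frequency grid.

```python
import numpy as np

a = 0.8
lam = np.linspace(-np.pi, np.pi, 200001)
S = 1.0 / (1 - 2 * a * np.cos(lam) + a**2)      # spectral density of the AR(1) process
V1 = np.exp(np.trapz(np.log(S), lam) / (2 * np.pi))
print("one-step prediction error V(1) by (2.4):", V1)
```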

Exercise 2.5. Apply the above formulae to Xn from the Example 2.4.

2.4. More about stationary processes. Let $\xi = (\xi_n)_{n\in\mathbb{Z}}$ be a stationary (in the wide sense) random sequence and denote $H_n(\xi) = \mathrm{span}\{\ldots, \xi_{n-1}, \xi_n\}$ and $H(\xi) = \mathrm{span}\{\ldots, \xi_{-1}, \xi_0, \xi_1, \ldots\}$. Introduce⁴
$$S(\xi) = \bigcap_{n\ge0}H_{-n}(\xi).$$
If $H(\xi) = S(\xi)$, the process is called singular, and if $S(\xi) = \{0\}$, i.e. it contains only random variables equivalent to zero, then the process is called regular or purely nondeterministic.

For example, the random process $X_n \equiv X$, where $X$ is a square integrable random variable, is singular (why?), and an i.i.d. sequence of $L^2$ random variables is regular. In general a stationary process can be decomposed into singular and regular components, $X_n = X^s_n + X^r_n$ (the Wold decomposition). In turn, the regular part $X^r_n$ admits the following expansion:
$$X^r_n = \sum_{k=0}^\infty\alpha_k\varepsilon_{n-k}, \qquad \forall n,$$
where the $\alpha_k$'s are real numbers and $\varepsilon = (\varepsilon_n)_{n\in\mathbb{Z}}$ is a sequence of orthonormal random variables. The sequence $\varepsilon$ is called the innovation process. It can be shown that a sequence is regular if and only if $\int_{-\pi}^\pi\log S(\lambda)\,d\lambda > -\infty$ (compare to (2.4)). The classic reference on stationary processes is (6).

⁴recall that an intersection of linear subspaces is a linear subspace

3. Linear estimation: Kalman's state space approach

The estimation problems that can be considered within the Kolmogorov-Wiener framework are essentially limited to the case of stationary processes. The state space approach proposed by R. Kalman makes it possible to solve many estimation problems in the non-stationary case.

3.1. Recursive orthogonal projection. Suppose that, given the signal/observation pair of processes $(X, Y) = (X_n, Y_n)_{n\ge1}$, the optimal linear estimate of $X_n$ from $Y^n_1 = \{Y_1, \ldots, Y_n\}$ is to be calculated for each $n \ge 1$, i.e. the orthogonal projection $\hat E(X_n|Y^n_1)$ is to be found. The formulae of Theorem 1.5 are not efficient in this case, since each time $n$ is increased the estimate has to be recalculated entirely. The key to a more efficient way of calculating $\hat E(X_n|Y^n_1)$ is given in the next theorem, where the following notations are used:
$$\hat X_n = \hat E(X_n|Y^n_1), \qquad \hat X_{n|n-1} = \hat E(X_n|Y^{n-1}_1), \qquad \hat Y_{n|n-1} = \hat E(Y_n|Y^{n-1}_1),$$
$$P^X_n = E\big(X_n - \hat X_n\big)\big(X_n - \hat X_n\big)^*, \qquad P^X_{n|n-1} = E\big(X_n - \hat X_{n|n-1}\big)\big(X_n - \hat X_{n|n-1}\big)^*,$$
$$P^Y_{n|n-1} = E\big(Y_n - \hat Y_{n|n-1}\big)\big(Y_n - \hat Y_{n|n-1}\big)^*,$$
$$P^{XY}_{n|n-1} = \big[P^{YX}_{n|n-1}\big]^* = E\big(X_n - \hat X_{n|n-1}\big)\big(Y_n - \hat Y_{n|n-1}\big)^*.$$

Theorem 3.1. Let $(X, Y) = (X_n, Y_n)_{n\ge1}$ be a pair of $L^2$ random processes with values in $\mathbb{R}^k$ and $\mathbb{R}^m$. Denote by $Y^n_1$ the linear subspace spanned by $1, Y_1, \ldots, Y_n$. Then for $n \ge 1$
$$\hat X_n = \hat X_{n|n-1} + P^{XY}_{n|n-1}\big[P^Y_{n|n-1}\big]^\oplus\big(Y_n - \hat Y_{n|n-1}\big),$$
$$P^X_n = P^X_{n|n-1} - P^{XY}_{n|n-1}\big[P^Y_{n|n-1}\big]^\oplus P^{YX}_{n|n-1},$$
subject to
$$\hat X_{1|0} = EX_1, \qquad \hat Y_{1|0} = EY_1,$$
$$P^X_{1|0} = \mathrm{cov}(X_1), \qquad P^{XY}_{1|0} = \mathrm{cov}(X_1, Y_1), \qquad P^Y_{1|0} = \mathrm{cov}(Y_1).$$

Proof. The random vector
$$\eta := X_n - \hat X_{n|n-1} - P^{XY}_{n|n-1}\big[P^Y_{n|n-1}\big]^\oplus\big(Y_n - \hat Y_{n|n-1}\big)$$
is orthogonal to $Y^{n-1}_1$, and thus it suffices to check that it is orthogonal to $Y_n - \hat Y_{n|n-1}$:
$$E\eta\big(Y_n - \hat Y_{n|n-1}\big)^* = P^{XY}_{n|n-1}\Big(I - \big[P^Y_{n|n-1}\big]^\oplus P^Y_{n|n-1}\Big) = P^{XY}_{n|n-1}U\big(I - \Lambda^\oplus\Lambda\big)U^*,$$
where $P^Y_{n|n-1} = U\Lambda U^*$. If $P^Y_{n|n-1} > 0$, then the desired property clearly holds. If $P^Y_{n|n-1}$ is only nonnegative definite, then orthogonality follows from the fact that $P^{XY}_{n|n-1}\bar U = 0$, where $\bar U$ is the submatrix of the eigenvectors corresponding to zero eigenvalues.

By the same arguments,
$$E\eta\eta^* = P^X_{n|n-1} - P^{XY}_{n|n-1}\big[P^Y_{n|n-1}\big]^\oplus P^{YX}_{n|n-1} - P^{XY}_{n|n-1}\big[P^Y_{n|n-1}\big]^\oplus P^{YX}_{n|n-1} + P^{XY}_{n|n-1}\big[P^Y_{n|n-1}\big]^\oplus P^Y_{n|n-1}\big[P^Y_{n|n-1}\big]^\oplus P^{YX}_{n|n-1}$$
$$= P^X_{n|n-1} - P^{XY}_{n|n-1}\big[P^Y_{n|n-1}\big]^\oplus P^{YX}_{n|n-1}. \qquad\Box$$

3.2. The Kalman filter. R. Kalman suggested to treat the estimation problem via the state space approach and derived a particularly simple and efficient filtering algorithm, assuming that the signal/observation pair is the solution of the linear recursion ($n \ge 1$)
$$X_n = a_0(n) + a_1(n)X_{n-1} + a_2(n)Y_{n-1} + b_1(n)\varepsilon'_n + b_2(n)\varepsilon''_n$$
$$Y_n = A_0(n) + A_1(n)X_{n-1} + A_2(n)Y_{n-1} + B_1(n)\varepsilon'_n + B_2(n)\varepsilon''_n \qquad (3.1)$$
where
- $\varepsilon' = (\varepsilon'_n)_{n\ge1}$ and $\varepsilon'' = (\varepsilon''_n)_{n\ge1}$ are orthogonal zero mean sequences with $E\varepsilon'\varepsilon'^* = I$ and $E\varepsilon''\varepsilon''^* = I$;
- $a_i(n)$, $A_i(n)$, $b_i(n)$ and $B_i(n)$, $i = 0, 1, 2$, are deterministic matrix (vector) sequences of appropriate dimensions;
- the initial conditions $X_0$ and $Y_0$ are square integrable random vectors, independent of $\varepsilon'$ and $\varepsilon''$.
Hereafter the time variable in $a_1(n)$, etc., is omitted for brevity.


Theorem 3.2 (R. Kalman, 1960). The orthogonal projection $\widehat X_n = \widehat E(X_n|Y^n_0)$ and the corresponding error covariance $P_n = E(X_n - \widehat X_n)(X_n - \widehat X_n)^*$ satisfy the equations
$$\widehat X_n = a_0 + a_1\widehat X_{n-1} + a_2Y_{n-1} + \big(a_1P_{n-1}A_1^* + b\circ B\big)\big(A_1P_{n-1}A_1^* + B\circ B\big)^{\oplus}\big(Y_n - A_0 - A_1\widehat X_{n-1} - A_2Y_{n-1}\big) \qquad (3.2)$$
and
$$P_n = a_1P_{n-1}a_1^* + b\circ b - \big(a_1P_{n-1}A_1^* + b\circ B\big)\big(A_1P_{n-1}A_1^* + B\circ B\big)^{\oplus}\big(a_1P_{n-1}A_1^* + b\circ B\big)^*, \qquad (3.3)$$
where $B\circ B := B_1B_1^* + B_2B_2^*$, $b\circ b := b_1b_1^* + b_2b_2^*$ and $b\circ B := b_1B_1^* + b_2B_2^*$. The equations (3.2) and (3.3) are solved subject to
$$\widehat X_0 = EX_0 + \mathrm{cov}(X_0, Y_0)\,\mathrm{cov}^{\oplus}(Y_0)\big(Y_0 - EY_0\big)$$
$$P_0 = \mathrm{cov}(X_0) - \mathrm{cov}(X_0, Y_0)\,\mathrm{cov}^{\oplus}(Y_0)\,\mathrm{cov}(Y_0, X_0).$$

Proof. Apply Theorem 3.1, finding the expressions for all the building blocks, e.g.
$$\widehat X_{n|n-1} = \widehat E(X_n|Y^{n-1}_0) = a_0 + a_1\widehat X_{n-1} + a_2Y_{n-1},$$
etc. $\square$

3.2.1. Some properties of the Kalman filter.
1. The Kalman filter is the linear recursive equation (3.2), coupled with the Riccati equation (3.3). Note that (3.3) does not depend on the observations and thus can be calculated off-line (a minimal numerical sketch of one step of (3.2)-(3.3) is given after this list).
2. Even if the coefficients in (3.1) are constant, the solution of (3.3) is still a nonconstant sequence of matrices. Under certain conditions (observability and controllability) $P_n$ converges to a positive definite limit. In general the latter is not guaranteed.
3. The Kalman filter recursions can be seen as a two-stage algorithm: at the propagation stage the estimate $\widehat X_{n-1}$ and the error covariance $P_{n-1}$ are extrapolated according to the dynamic equations to time $n$, and at the update stage the new estimate and error covariance are calculated on the basis of the extrapolated estimate and the new observation. The sequence $\xi_n = Y_n - A_0 - A_1\widehat X_{n-1} - A_2Y_{n-1}$ consists of orthogonal random vectors and is called the innovations, to emphasize the fact that it carries all the sufficient information contained in the observations.
4. Clearly the Kalman filter is applicable to estimation problems with nonstationary signals. In the case of stationary processes with rational spectral densities, a model of type (3.1) can be found. The Kalman filter then leads to estimate equations which asymptotically reduce to those obtained by the Kolmogorov–Wiener theory.
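For illustration only, here is a minimal sketch (Python/numpy, not part of the original notes) of one recursion step of (3.2)-(3.3) for the model (3.1); `np.linalg.pinv` plays the role of the pseudoinverse $\oplus$, and the arguments `bb`, `bB`, `BB` stand for $b\circ b$, $b\circ B$, $B\circ B$ of Theorem 3.2.

```python
import numpy as np


def kalman_step(x_hat, P, y_prev, y_new, a0, a1, a2, A0, A1, A2, bb, bB, BB):
    """One step of (3.2)-(3.3): a sketch only, with pinv as the pseudoinverse."""
    x_pred = a0 + a1 @ x_hat + a2 @ y_prev          # projection of X_n on Y_0^{n-1}
    y_pred = A0 + A1 @ x_hat + A2 @ y_prev          # projection of Y_n on Y_0^{n-1}
    gain = (a1 @ P @ A1.T + bB) @ np.linalg.pinv(A1 @ P @ A1.T + BB)
    x_new = x_pred + gain @ (y_new - y_pred)        # update with the innovation
    P_new = a1 @ P @ a1.T + bb - gain @ (a1 @ P @ A1.T + bB).T
    return x_new, P_new
```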

3.3. Examples and applications.


3.3.1. Simple scalar model. Consider the simple model ($n \ge 1$)
$$X_n = aX_{n-1} + b\varepsilon_n$$
$$Y_n = AX_{n-1} + B\xi_n$$
subject to a r.v. $X_0$ with zero mean and unit variance; $(\varepsilon_n)_{n\ge1}$ and $(\xi_n)_{n\ge1}$ are independent standard i.i.d. sequences.

The Kalman filter equations read
$$\widehat X_n = a\widehat X_{n-1} + \frac{aAP_{n-1}}{A^2P_{n-1} + B^2}\big(Y_n - A\widehat X_{n-1}\big)$$
$$P_n = a^2P_{n-1} + b^2 - \frac{\big[aAP_{n-1}\big]^2}{A^2P_{n-1} + B^2}$$
subject to $\widehat X_0 = 0$ and $P_0 = 1$. It can be shown that in this case the limit $P_\infty = \lim_{n\to\infty}P_n$ exists and is the unique positive solution of
$$P_\infty = a^2P_\infty + b^2 - \frac{\big[aAP_\infty\big]^2}{A^2P_\infty + B^2}.$$
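A tiny numerical sketch (with assumed parameter values, not taken from the notes): iterating the scalar Riccati recursion quickly settles at $P_\infty$, which can then be compared with the fixed-point equation above.

```python
a, b, A, B = 1.2, 1.0, 1.0, 0.5      # hypothetical values; note |a| > 1 (unstable signal)
P = 1.0                              # P_0
for _ in range(100):
    P = a * a * P + b * b - (a * A * P) ** 2 / (A * A * P + B * B)
residual = P - (a * a * P + b * b - (a * A * P) ** 2 / (A * A * P + B * B))
print(P, residual)                   # P ~ P_inf and residual ~ 0: the fixed point is reached
```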

Note that $P_\infty < \infty$ even if $|a| > 1$, i.e. when the system is unstable.

3.3.2. Phase Locking Loop. The signal $A(t) > 0$ modulates a sinusoidal carrier of known frequency $\omega = 1$, so that the transmitted wave is given by
$$x(t) = A(t)\cos(t + \varphi),$$
where $\varphi$ is the initial phase, chosen by the transmitter. The receiver's antenna picks up this signal and passes the transmission to the A/D converter, so that the sequence
$$Y_{n+1} = x(\Delta n) + \xi_{n+1} = A(\Delta n)\cos(\Delta n + \varphi) + \xi_{n+1}, \quad n = 0, 1, \ldots$$
is available to the receiver's processor, where $\Delta > 0$ is the sampling step and $(\xi_n)_{n\ge1}$ is an i.i.d. sequence of standard Gaussian r.v.

Suppose that $A(t)$ is constant (practically, a slowly varying signal), i.e. $A(t) \equiv A$. Both $A$ and $\varphi$ are necessary for decoding the transmission and are unknown to the receiver. Assume that both are r.v. with uniform distribution on $[a, b]$ ($b > a > 0$) and $[0, 2\pi]$ respectively. The following questions are of interest:
(1) Find a recursion for the calculation of $\widehat x(t) = \widehat E\big(x(t)\,|\,Y_k,\ k\Delta \le t\big)$.
(2) Does the limit $\lim_{t\to\infty}E\big(x(t) - \widehat x(t)\big)^2$ exist? If yes, can perfect synchronization be achieved as $t \to \infty$?

1. Introduce
$$Z(t) := \begin{bmatrix}\zeta_1(t)\\ \zeta_2(t)\end{bmatrix} = \begin{bmatrix}x(t)\\ \dot x(t)\end{bmatrix}$$
and define the sequence $Z_n = Z(\Delta n)$, $n = 0, 1, 2, \ldots$ (compare to the continuous time dynamics $\dot Z_t = \begin{bmatrix}0 & 1\\ -1 & 0\end{bmatrix}Z_t$, subject to $Z_0 = \begin{bmatrix}A\cos\varphi\\ -A\sin\varphi\end{bmatrix}$). Indeed,
$$\zeta_1(\Delta(n+1)) = A\cos(\Delta n + \varphi + \Delta) = A\cos(\Delta n + \varphi)\cos\Delta - A\sin(\Delta n + \varphi)\sin\Delta = \zeta_1(\Delta n)\cos\Delta + \zeta_2(\Delta n)\sin\Delta,$$
and similarly
$$\zeta_2(\Delta(n+1)) = -A\sin(\Delta n + \varphi + \Delta) = -A\sin(\Delta n + \varphi)\cos\Delta - A\cos(\Delta n + \varphi)\sin\Delta = \zeta_2(\Delta n)\cos\Delta - \zeta_1(\Delta n)\sin\Delta,$$
so the sequence satisfies the recursion
$$Z_{n+1} = \begin{bmatrix}\cos\Delta & \sin\Delta\\ -\sin\Delta & \cos\Delta\end{bmatrix}Z_n := \theta(\Delta)Z_n, \quad \text{s.t. } Z_0 = \begin{bmatrix}A\cos\varphi\\ -A\sin\varphi\end{bmatrix}. \qquad (3.4)$$

The matrix $\theta(\Delta)$ is known as the direct cosine matrix or rotation matrix. Recall that the observation sequence is
$$Y_n = [1\ 0]Z_{n-1} + \xi_n := u^*Z_{n-1} + \xi_n, \quad n \ge 1. \qquad (3.5)$$
Note that for any $t \in [\Delta n, \Delta n + \Delta)$, $x(t) = u^*Z(t)$ and
$$\widehat Z(t) = \widehat E\big(Z(t)\,|\,Y_k,\ k\Delta \le t\big) = \theta(t - \Delta n)\widehat E\big(Z(\Delta n)\,|\,Y_k,\ k\Delta \le t\big) = \theta(t - \Delta n)\widehat Z(\Delta n), \qquad (3.6)$$
so it suffices to determine only the sequence $\widehat Z_n := \widehat Z(\Delta n)$ and thus also only $\widehat x(\Delta n)$. The equations (3.4) and (3.5) are in the form of the Kalman filter model, so $\widehat x_n = u^*\widehat Z_n$, $n = 1, 2, \ldots$, where
$$\widehat Z_n = \theta(\Delta)\widehat Z_{n-1} + \frac{\theta(\Delta)P_{n-1}u}{u^*P_{n-1}u + 1}\big(Y_n - u^*\widehat Z_{n-1}\big)$$
$$P_n = \theta(\Delta)P_{n-1}\theta^*(\Delta) - \frac{\theta(\Delta)P_{n-1}uu^*P_{n-1}\theta^*(\Delta)}{u^*P_{n-1}u + 1} \qquad (3.7)$$
How should the initial conditions be chosen? Define $Y_0 \equiv 0$. Clearly $\widehat E(Z_n|Y^n_1) \equiv \widehat E(Z_n|Y^n_0)$, so one can solve (3.7) starting with index $n = 1$, with
$$\widehat Z_0 = \widehat E(Z_0|Y_0) = EZ_0, \qquad P_0 = EZ_0Z_0^*.$$
Now
$$EZ_0 = E\begin{bmatrix}A\cos\varphi\\ -A\sin\varphi\end{bmatrix} = 0$$
and
$$P_0 = \begin{bmatrix}EA^2\,E\cos^2\varphi & -EA^2\,E\cos\varphi\sin\varphi\\ -EA^2\,E\cos\varphi\sin\varphi & EA^2\,E\sin^2\varphi\end{bmatrix} = \frac{b^3 - a^3}{3(b - a)}\cdot\frac12\,I \equiv C\cdot I.$$

2. The existence of the limit $\lim_{n\to\infty}P_n$ can be established as a special case of a more general theory of Riccati equations. In this case, however, $P_n$ can be found explicitly and some additional insight can be gained. We make use of the matrix inversion lemma.

Lemma 3.3. $A = B^{-1} + CD^{-1}C^*\ \Longleftrightarrow\ A^{-1} = B - BC\big(D + C^*BC\big)^{-1}C^*B$.


Figure 1. The phase locking process, $\Delta = 2$ msec ("Synchronization on a circle": $x$ vs. $y$).

Figure 2. Actual trajectory, measured data, estimate: estimated and actual trajectories vs. noisy measurements (amplitude vs. time in seconds).

Put $Q_n = P_n^{-1}$ and apply the lemma with $B^{-1} := P_{n-1}$, $C = P_{n-1}u$, $D = -\big(u^*P_{n-1}u + 1\big)$ to the Riccati equation from (3.7) (the property $\theta\theta^* = I$ is used here):
$$Q_n = \theta(\Delta)\Big(P_{n-1}^{-1} - \frac{P_{n-1}^{-1}P_{n-1}uu^*P_{n-1}P_{n-1}^{-1}}{-u^*P_{n-1}u - 1 + u^*P_{n-1}P_{n-1}^{-1}P_{n-1}u}\Big)\theta^*(\Delta) = \theta(\Delta)\big(Q_{n-1} + uu^*\big)\theta^*(\Delta).$$

Figure 3. Mean square error of the $x$-coordinate, multiplied by $n$, i.e. $nP^{11}_n$, vs. time in seconds: the estimation error converges to zero with asymptotic rate 2.

This is a linear recursion and it can be solved explicitly, $n \ge 1$:
$$Q_n = \theta^n(\Delta)Q_0\theta^{n*}(\Delta) + \sum_{k=1}^{n}\theta^{n-k}(\Delta)\theta(\Delta)uu^*\theta^*(\Delta)\theta^{(n-k)*}(\Delta) = C^{-1}I + \sum_{k=0}^{n-1}\theta^{k+1}(\Delta)uu^*\theta^{(k+1)*}(\Delta) = C^{-1}I + \sum_{k=1}^{n}\theta(\Delta k)uu^*\theta^*(\Delta k),$$
where the property $\theta^k(\Delta) = \theta(\Delta k)$ has been used. Now
$$P^{11}_n = E\big(x(\Delta n) - \widehat x(\Delta n)\big)^2 = \frac{Q^{22}_n}{Q^{11}_nQ^{22}_n - Q^{12}_nQ^{21}_n}.$$
The elements of $Q_n$ are readily found, e.g.
$$Q^{11}_n = 1/C + \sum_{k=1}^{n}\cos^2(\Delta k),$$
so
$$P^{11}_n = \frac{1/C + \sum_{k=1}^{n}\sin^2(\Delta k)}{\big[1/C + \sum_{k=1}^{n}\cos^2(\Delta k)\big]\big[1/C + \sum_{k=1}^{n}\sin^2(\Delta k)\big] - \big[\sum_{k=1}^{n}\sin(\Delta k)\cos(\Delta k)\big]^2}.$$

Let us investigate the asymptotics of $P^{11}_n$ as $n \to \infty$. Recall that
$$\frac1n\sum_{k=1}^{n}\cos(\Delta k)\xrightarrow{n\to\infty}0, \qquad \frac1n\sum_{k=1}^{n}\sin(\Delta k)\xrightarrow{n\to\infty}0$$

for $\Delta \ne 2\pi$ (this can be verified directly by writing $\cos(\Delta k) = \frac12\big(e^{j\Delta k} + e^{-j\Delta k}\big)$ and splitting the sum into a pair of geometric sequences). So, e.g.,
$$\lim_{n\to\infty}\frac1n\Big(1/C + \sum_{k=1}^{n}\sin^2(\Delta k)\Big) = \lim_{n\to\infty}\frac1n\Big(1/C + \sum_{k=1}^{n}\frac12 - \sum_{k=1}^{n}\frac12\cos(2\Delta k)\Big) = \frac12,$$
$$\lim_{n\to\infty}\frac1n\Big(1/C + \sum_{k=1}^{n}\cos^2(\Delta k)\Big) = \lim_{n\to\infty}\frac1n\Big(1/C + \sum_{k=1}^{n}\frac12 + \sum_{k=1}^{n}\frac12\cos(2\Delta k)\Big) = \frac12,$$
$$\lim_{n\to\infty}\frac1n\sum_{k=1}^{n}\cos(\Delta k)\sin(\Delta k) = \lim_{n\to\infty}\frac1n\sum_{k=1}^{n}\frac12\sin(2\Delta k) = 0,$$
from which it follows that
$$\lim_{n\to\infty}nP^{11}_n = \frac{1/2}{1/2\cdot1/2 - 0} = 2,$$

which suggests that the estimation error decays to zero as $1/n$ with rate 2 (irrespective of $\Delta$), i.e. the finer $\Delta$ is chosen on a fixed interval, the better convergence is obtained. The sampling rate can then be chosen to achieve essential convergence during the interval on which $A(t)$ remains effectively constant.

Note also that the estimate of $y(t) = A\sin(t + \varphi) = -\zeta_2(t)$ is obtained as a byproduct and can be used by the receiver.

3.3.3. Tracking a particle motion. This problem is taken from the original paper by R. Kalman (12). A number of particles leave the origin at time $n = 0$ with random velocities; after $n = 0$, each particle moves with a constant (unknown) velocity. Suppose that the position of one of these particles is measured, the data being contaminated by stationary, additive, correlated noise. What is the optimal estimate of the position and velocity of the particle at the time of the last measurement?

Let $x_1(n)$ be the position and $x_2(n)$ the velocity of the particle; $x_3(n)$ is the noise. The problem is then represented by the model
$$x_1(n+1) = x_1(n) + x_2(n)$$
$$x_2(n+1) = x_2(n) \qquad (3.8)$$
$$x_3(n+1) = \varphi x_3(n) + u_3(n)$$
$$y(n) = x_1(n) + x_3(n)$$
and the additional conditions
(1) $Ex_1^2(0) = Ex_2(0) = 0$, $Ex_2^2(0) = a^2 > 0$;
(2) $Eu_3(n) = 0$, $Eu_3^2(n) = b^2$.
The objective is to find $\widehat x_i(n) = \widehat E\big[x_i(n)\,|\,y^{n-1}_0\big]$, $i = 1, 2, 3$. Several solution versions are possible in this problem.


1. The equivalent problem is formulated in vector form:
$$X_n = \begin{bmatrix}x_1(n)\\ x_2(n)\\ x_3(n)\end{bmatrix}.$$
Then for $n \ge 1$
$$X_{n+1} = \underbrace{\begin{bmatrix}1 & 1 & 0\\ 0 & 1 & 0\\ 0 & 0 & \varphi\end{bmatrix}}_{:=A}X_n + \underbrace{\begin{bmatrix}0\\ 0\\ b\end{bmatrix}}_{:=B}u_n, \qquad (3.9)$$
where $u_n = u_3(n)/b$ is an i.i.d. Gaussian white noise, and
$$Z_{n+1} := y(n) = \underbrace{\begin{pmatrix}1 & 0 & 1\end{pmatrix}}_{:=C^*}X_n. \qquad (3.10)$$
Let $\widehat X_n := \widehat E(X_n|y^{n-1}_0) = \widehat E(X_n|Z^n_1)$. Then for $n \ge 0$:
$$\widehat X_{n+1} = A\widehat X_n + AP_nC(C^*P_nC)^{\oplus}\big[Z_{n+1} - C^*A\widehat X_n\big] \qquad (3.11)$$
$$P_{n+1} = AP_nA^* + BB^* - AP_nC(C^*P_nC)^{\oplus}C^*P_nA^* \qquad (3.12)$$
with
$$\widehat X_0 = \begin{bmatrix}0\\ 0\\ 0\end{bmatrix}, \qquad P_0 = \begin{bmatrix}0 & 0 & 0\\ 0 & a^2 & 0\\ 0 & 0 & b^2\end{bmatrix}.$$
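A minimal numerical sketch (Python/numpy, not from the notes; the values of $\varphi$ and $b$ have to be supplied) of one step of (3.11)-(3.12) for this three-dimensional formulation:

```python
import numpy as np


def tracking_step(x_hat, P, z_new, phi, b):
    """One step of (3.11)-(3.12) for the model (3.9)-(3.10); a sketch only."""
    A = np.array([[1.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, phi]])
    Bv = np.array([0.0, 0.0, b])
    c = np.array([1.0, 0.0, 1.0])                     # C^*
    s = c @ P @ c                                     # C^* P_n C (a scalar here)
    K = A @ P @ c / s if s > 1e-12 else np.zeros(3)   # A P_n C (C^* P_n C)^+
    x_new = A @ x_hat + K * (z_new - c @ (A @ x_hat))
    P_new = A @ P @ A.T + np.outer(Bv, Bv) - np.outer(K, c @ P @ A.T)
    return x_new, P_new
```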

2. Note that $x_1(n) = nx_2(n)$. Redefine
$$X_n = \begin{pmatrix}x_2(n)\\ x_3(n)\end{pmatrix},$$
and consequently
$$X_{n+1} = \underbrace{\begin{pmatrix}1 & 0\\ 0 & \varphi\end{pmatrix}}_{:=A}X_n + \underbrace{\begin{pmatrix}0\\ b\end{pmatrix}}_{:=B}u_n$$
and
$$Z_{n+1} := y(n) = \underbrace{\begin{pmatrix}n & 1\end{pmatrix}}_{:=C^*_n}X_n.$$
The Kalman filter is again given by (3.11) and (3.12) with the newly defined terms ($A$, $B$, $C_n$ and $Z_n$). Note that it is of lower dimension this time, and
$$\widehat X_0 = \begin{pmatrix}0\\ 0\end{pmatrix}, \qquad P_0 = \begin{pmatrix}a^2 & 0\\ 0 & b^2\end{pmatrix}.$$
Let us solve the Riccati equation (3.12). Set
$$P_n := \begin{pmatrix}\alpha_n & \gamma_n\\ \gamma_n & \beta_n\end{pmatrix}.$$
Then
$$C^*_nP_nC_n = n^2\alpha_n + 2n\gamma_n + \beta_n$$
and
$$P_nC_nC^*_nP_n = \begin{pmatrix}n\alpha_n + \gamma_n\\ n\gamma_n + \beta_n\end{pmatrix}\begin{pmatrix}n\alpha_n + \gamma_n, & n\gamma_n + \beta_n\end{pmatrix} = \begin{pmatrix}n^2\alpha_n^2 + 2n\alpha_n\gamma_n + \gamma_n^2 & n^2\alpha_n\gamma_n + n(\alpha_n\beta_n + \gamma_n^2) + \gamma_n\beta_n\\ n^2\alpha_n\gamma_n + n(\alpha_n\beta_n + \gamma_n^2) + \gamma_n\beta_n & n^2\gamma_n^2 + 2n\gamma_n\beta_n + \beta_n^2\end{pmatrix},$$
and finally:
$$\alpha_{n+1} = \alpha_n - \frac{n^2\alpha_n^2 + 2n\alpha_n\gamma_n + \gamma_n^2}{n^2\alpha_n + 2n\gamma_n + \beta_n} = \frac{\alpha_n\beta_n - \gamma_n^2}{n^2\alpha_n + 2n\gamma_n + \beta_n} \qquad (3.13)$$
$$\gamma_{n+1} = \varphi\gamma_n - \varphi\,\frac{n^2\alpha_n\gamma_n + n(\alpha_n\beta_n + \gamma_n^2) + \gamma_n\beta_n}{n^2\alpha_n + 2n\gamma_n + \beta_n} = \varphi\,\frac{n\big(\gamma_n^2 - \alpha_n\beta_n\big)}{n^2\alpha_n + 2n\gamma_n + \beta_n} \qquad (3.14)$$
$$\beta_{n+1} = \varphi^2\beta_n - \varphi^2\,\frac{n^2\gamma_n^2 + 2n\gamma_n\beta_n + \beta_n^2}{n^2\alpha_n + 2n\gamma_n + \beta_n} + b^2 = \varphi^2n^2\,\frac{\beta_n\alpha_n - \gamma_n^2}{n^2\alpha_n + 2n\gamma_n + \beta_n} + b^2, \qquad (3.15)$$
from which it follows ($n \ge 0$):
$$\gamma_{n+1} = -\varphi n\,\alpha_{n+1} \qquad (3.16)$$
$$\beta_{n+1} = b^2 + \varphi^2n^2\,\alpha_{n+1}, \qquad (3.17)$$
so that ($n \ge 1$)
$$\alpha_{n+1} = \frac{\alpha_n\big(b^2 + \varphi^2(n-1)^2\alpha_n\big) - \varphi^2(n-1)^2\alpha_n^2}{n^2\alpha_n - 2n\varphi(n-1)\alpha_n + (n-1)^2\varphi^2\alpha_n + b^2} = \frac{\alpha_nb^2}{n^2\alpha_n - 2n\varphi(n-1)\alpha_n + (n-1)^2\varphi^2\alpha_n + b^2},$$
and, from (3.13), $\alpha_1 = a^2$. Set $\rho_n := \alpha_n^{-1}$. Then ($n \ge 1$):
$$\rho_{n+1} = b^{-2}\big(n^2 - 2n\varphi(n-1) + (n-1)^2\varphi^2\big) + \rho_n = \rho_n + b^{-2}\big(n - (n-1)\varphi\big)^2,$$
and ($n \ge 2$)
$$\rho_n = a^{-2} + b^{-2}\sum_{k=1}^{n-1}\big(k - (k-1)\varphi\big)^2.$$
So that ($n \ge 2$)
$$\alpha_n = \rho_n^{-1} = \frac{a^2b^2}{b^2 + a^2\sum_{k=1}^{n-1}\big(k - (k-1)\varphi\big)^2}$$
and
$$\gamma_n = \frac{-\varphi(n-1)b^2a^2}{b^2 + a^2\sum_{k=1}^{n-1}\big(k - (k-1)\varphi\big)^2},$$
$$\beta_n = b^2 + \frac{\varphi^2(n-1)^2b^2a^2}{b^2 + a^2\sum_{k=1}^{n-1}\big(k - (k-1)\varphi\big)^2}.$$

Several interesting observations are worth noting.

(a) Consider the case $\varphi \ne 1$. Note that $\alpha_n \to 0$, $\gamma_n \to 0$ and $\beta_n \to b^2$ as $n \to \infty$, since
$$|\beta_n - b^2| \le b^2\,\frac{\varphi^2(n-1)^2}{\sum_{k=1}^{n-1}\big(k(1-\varphi) + \varphi\big)^2} \le \frac{b^2}{(1-\varphi)^2}\,\frac{\varphi^2(n-1)^2}{\sum_{k=1}^{n-1}k^2}\to 0, \quad n \to \infty$$
(note that $\sum_{k=1}^{n}k^2$ is of order $n^3$, since $n^3 = \sum_{k=1}^{n}\big[k^3 - (k-1)^3\big] = \sum_{k=1}^{n}\big[3k^2 - 3k + 1\big] = 3\sum_{k=1}^{n}k^2 + O(n^2)$). This means that the estimate of the velocity converges (in the mean square) to the real velocity and ceases to update ($\widehat x_2(n+1) = \widehat x_2(n)$). However, the estimate of the current sample of the noise remains uncertain with error $b^2$ and is generated by
$$\widehat x_3(n+1) = \varphi\widehat x_3(n) + \varphi\big(y(n) - n\widehat x_2(n) - \widehat x_3(n)\big) = \varphi\big(y(n) - n\widehat x_2(n)\big).$$

(b) Surprisingly, when $\varphi = 1$, it follows from the formulae above that as $n \to \infty$
$$\alpha_n = \frac{a^2b^2}{b^2 + a^2(n-1)} \propto \frac{b^2}{n}, \qquad \gamma_n \propto b^2, \qquad \beta_n \propto b^2n,$$
i.e. the noise sequence $x_3(n)$ cannot be estimated efficiently, though the velocity of the particle $x_2(n)$ is inferred perfectly as $n \to \infty$! Moreover, when $|\varphi| > 1$, the noise is "exponentially unstable", while with $\varphi = 1$ its trajectories diverge to infinity at a linear rate. At first glance it may seem that if the position estimate converges in the former case, it should converge in the latter case a fortiori!

Let us try to understand this phenomenon. Set $x_2(n) \equiv \xi$ for brevity and assume first that $\varphi \ne 1$. Define $\delta y(0) = y(0)$ and for $n \ge 1$
$$\delta y(n) := y(n) - \varphi y(n-1) = (1-\varphi)\sum_{k=1}^{n-1}\xi + \xi + x_3(n) - \varphi x_3(n-1) = \big[n(1-\varphi) + \varphi\big]\xi + u_3(n-1) := \varrho_n\xi + bu_{n-1}.$$
Note that the "signal-to-noise" ratio increases with $n$. Set $\widehat\xi_n := \widehat E(\xi|\delta y^n_0) = \widehat E(\xi|y^n_0) = \widehat E(x_2(n)|y^n_0)$ and $P_n = E(\xi - \widehat\xi_n)^2$. Then:
$$P_n = P_{n-1} - \frac{P_{n-1}^2\varrho_n^2}{\varrho_n^2P_{n-1} + b^2} = \frac{P_{n-1}b^2}{\varrho_n^2P_{n-1} + b^2}.$$


Set $Q_n = P_n^{-1}$:
$$Q_n = Q_{n-1} + \varrho_n^2/b^2 \propto \frac{1}{b^2}\sum_{k=1}^{n}\varrho_k^2 \propto n^3,$$
i.e. $P_n$ is $O(1/n^3)$. Now form the estimate of $x_3(n)$:
$$\widehat x_3(n+1) = \varphi\big(y(n) - n\widehat\xi_n\big).$$
Introduce $\Delta_3(n) = x_3(n) - \widehat x_3(n)$ and $\Delta_2(n) = \xi - \widehat\xi_n$; then
$$\Delta_3(n+1) = \varphi x_3(n) + u_3(n) - \varphi\big(y(n) - n\widehat\xi_n\big) = u_3(n) - n\varphi\Delta_2(n),$$
so that
$$E\Delta_3^2(n+1) = b^2 + \varphi^2n^2E\Delta_2^2(n) \to b^2, \quad n \to \infty.$$
Now let us consider the case $\varphi = 1$. Verify that $x_3(n)$ cannot be estimated with a mean square error growth rate better than $O(n)$. Note that $\delta y(n) := y(n) - y(n-1) = \xi + u_3(n-1)$, so that $\varrho_n \equiv 1$ and hence $P_n \propto 1/n$ for large $n$. Recall that $x_1(n) = n\xi$, so that $\widehat x_1(n) = n\widehat\xi_n$ and $E\big(x_1(n) - \widehat x_1(n)\big)^2 = n^2P_n \propto n$. Since $y(n) = x_1(n) + x_3(n)$, we conclude that $x_3(n)$ cannot be estimated with an error growth rate better than $O(n)$ (here we use the following fact: if $Z = X + Y$, $\widehat X = \widehat E(X|Z)$ and $P = E(X - \widehat X)^2$, then $\widehat Y = \widehat E(Y|Z) = \widehat E(Z - X|Z) = Z - \widehat X$ and $Q := E(Y - \widehat Y)^2 = P$).

3.3.4. Solution of linear equations by means of the Kalman filter.
Consider the system of linear equations $Ax = b$, where $A$ is an $n \times m$ real matrix and $b$ is an $n \times 1$ real vector. Let $r = \mathrm{rank}(A) \le \min(n, m)$. Depending on $r$, these equations may have a unique solution, an infinite number of solutions, or no solutions at all. If $n = m$ and $A$ is nonsingular, then $x = A^{-1}b$ is the unique solution. As shown below, in the case of infinitely many solutions there is one with the minimal Euclidean norm, while in the case of no solutions there is always a unique vector $x$ which minimizes the norm $\|Ax - b\|_2$ among $x \in \mathbb R^m$. There is one formula to calculate any of the aforementioned vectors: $x = A^{\oplus}b$, where $A^{\oplus}$ is the Moore–Penrose pseudoinverse of $A$, which is briefly described below.

Theorem 3.4 (Singular Value Decomposition). If $A$ is a real $m \times n$ matrix, then there exist orthogonal matrices
$$U = \big[u_1, \ldots, u_m\big] \in \mathbb R^{m\times m}, \qquad V = \big[v_1, \ldots, v_n\big] \in \mathbb R^{n\times n}$$
such that
$$U^*AV = \mathrm{diag}(\sigma_1, \ldots, \sigma_p) \in \mathbb R^{m\times n}, \quad p = \min\{m, n\}, \qquad (3.18)$$
where $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_p \ge 0$.

Proof (Version I). Assume, e.g., that $m \le n$ and let $r := \mathrm{rank}(A) \le m$. Set $Q := AA^* \in \mathbb R^{m\times m}$. Since $Q$ is symmetric and non-negative definite, there exists an orthogonal matrix $U \in \mathbb R^{m\times m}$ such that $U^*QU = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_r^2, 0, \ldots, 0)$ with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r > 0$. Let $U'$ and $U''$ be the columnwise partition of $U$, such that $U' = \big[u_1, \ldots, u_r\big] \in \mathbb R^{m\times r}$ and $U'' = \big[u_{r+1}, \ldots, u_m\big] \in \mathbb R^{m\times(m-r)}$, where $u_i$ are the columns of $U$. Clearly $U'^*QU' = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_r^2) := \Gamma \in \mathbb R^{r\times r}$,

$U''^*QU'' = 0 \in \mathbb R^{(m-r)\times(m-r)}$, etc. Introduce the matrix $V' := A^*U'\Gamma^{-1/2} \in \mathbb R^{n\times r}$. Note that
$$V'^*V' = \Gamma^{-1/2}U'^*AA^*U'\Gamma^{-1/2} = I_r \in \mathbb R^{r\times r},$$
where $I_r$ stands for the identity matrix of size $r$. The latter implies that the columns of $V'$ are orthogonal. Choose $V'' \in \mathbb R^{n\times(n-r)}$ with orthogonal columns which span the orthogonal complement of the subspace spanned by the columns of $V'$ (this is always possible!). Form $V = \big[V'\ V''\big] \in \mathbb R^{n\times n}$, which is an orthogonal matrix by construction. Then the statement of the theorem follows from
$$U^*AV = \begin{pmatrix}U'^*\\ U''^*\end{pmatrix}A\begin{pmatrix}V' & V''\end{pmatrix} = \begin{pmatrix}U'^*AV' & U'^*AV''\\ U''^*AV' & U''^*AV''\end{pmatrix} = \begin{pmatrix}U'^*AA^*U'\Gamma^{-1/2} & U'^*AV''\\ U''^*AV' & U''^*AV''\end{pmatrix} = \begin{pmatrix}\Gamma^{1/2} & 0\\ 0 & 0\end{pmatrix},$$
where the last equality follows from the facts: (i) $U''^*AA^*U'' = 0 \Longrightarrow U''^*A = 0$ and (ii) $U'^*AV'' = \Gamma^{1/2}\Gamma^{-1/2}U'^*AV'' = \Gamma^{1/2}V'^*V'' = 0$. $\square$

Proof (Version II). Let $x \in \mathbb R^n$ and $y \in \mathbb R^m$ be a pair of unit vectors (i.e. $\|x\| = \sqrt{\sum_{i=1}^{n}x_i^2} = 1$) such that
$$Ax = \sigma y,$$
where $\sigma = \|A\|_2$ (recall that $\|A\|_2 = \max_{x\ne0}\|Ax\|_2/\|x\|_2$). It is always possible to choose such vectors: e.g. $x := \arg\max_{\xi\in\mathbb R^n:\|\xi\|=1}\|A\xi\|$ and $y := Ax/\sigma$.
Since any orthonormal set of vectors can be completed to an orthonormal basis, there exist $U' \in \mathbb R^{m\times(m-1)}$ and $V' \in \mathbb R^{n\times(n-1)}$ such that $U = [y\ U'] \in \mathbb R^{m\times m}$ and $V = [x\ V'] \in \mathbb R^{n\times n}$ are orthogonal. Note that $U^*AV$ has the following structure:
$$U^*AV = \begin{pmatrix}\sigma & w^*\\ 0 & B\end{pmatrix} := A',$$
where $w \in \mathbb R^{n-1}$ and $B \in \mathbb R^{(m-1)\times(n-1)}$. Since
$$\Big\|A'\begin{pmatrix}\sigma\\ w\end{pmatrix}\Big\|_2^2 \ge (\sigma^2 + w^*w)^2,$$
we have $\|A'\|_2^2 \ge \sigma^2 + w^*w$. But $\sigma^2 = \|A\|_2^2 = \|A'\|_2^2$ (since the $\|\cdot\|_2$-norm is invariant under orthogonal transformations), which implies $w^*w = 0$, i.e. $w = 0$. Now the statement of the theorem is verified by induction. $\square$

Theorem 3.5. Let $U^*AV = \Sigma$ be the SVD of a matrix $A \in \mathbb R^{m\times n}$ with $r = \mathrm{rank}(A)$. If $U = [u_1, \ldots, u_m]$, $V = [v_1, \ldots, v_n]$ and $b \in \mathbb R^m$, then
$$\widehat x = \sum_{i=1}^{r}\frac{u_i^*b}{\sigma_i}v_i \qquad (3.19)$$
has the least $\|\cdot\|_2$ norm among all vectors $x$ that bring $\|Ax - b\|$ to a minimum. Moreover,
$$\|A\widehat x - b\|_2^2 = \sum_{i=r+1}^{m}\big(u_i^*b\big)^2.$$


Proof. First let us show that there exists a unique vector of least norm that brings $\|Ax - b\|_2$ to a minimum. (Note that if $r < n$, there are many vectors which minimize this norm: if $x$ is such a vector, then so is $x + z$ for any $z \in \mathrm{null}(A)$.) Let
$$\chi = \{x \in \mathbb R^n : \|Ax - b\|_2 \le \|Ay - b\|_2,\ \forall y \ne x\}.$$
$\chi$ is convex. Indeed, fix $x_1, x_2 \in \chi$ and $\lambda \in [0, 1]$; then
$$\|A\big(\lambda x_1 + (1-\lambda)x_2\big) - b\|_2 \le \lambda\|Ax_1 - b\|_2 + (1-\lambda)\|Ax_2 - b\|_2 = \min_{x\in\mathbb R^n}\|Ax - b\|_2,$$
so that $\lambda x_1 + (1-\lambda)x_2 \in \chi$. Since $\chi$ is convex, there is a unique vector in $\chi$ with the least norm.
For any $x \in \mathbb R^n$ (recall that an orthogonal transformation does not change the $\|\cdot\|_2$ norm of a vector)
$$\|Ax - b\|_2^2 = \|(U^*AV)(V^*x) - U^*b\|_2^2 = \|\Sigma\alpha - U^*b\|_2^2 = \sum_{i=1}^{r}(\sigma_i\alpha_i - u_i^*b)^2 + \sum_{i=r+1}^{m}(u_i^*b)^2 \ge \sum_{i=r+1}^{m}(u_i^*b)^2, \qquad (3.20)$$
where $\alpha := V^*x$. The lower bound in (3.20) is attained if $\alpha_i = u_i^*b/\sigma_i$, $i = 1, \ldots, r$, are chosen. Since $x = V\alpha$, the least norm of $x$ is obtained if $\alpha_i$, $i = r+1, \ldots, n$, are set to zero. This completes the proof. $\square$

Definition 3.6. The Moore–Penrose inverse of $A$ is the matrix in $\mathbb R^{n\times m}$ defined by
$$A^{\oplus} = V\Sigma^{\oplus}U^*,$$
where
$$\Sigma^{\oplus} = \mathrm{diag}\Big(\frac{1}{\sigma_1}, \ldots, \frac{1}{\sigma_r}, 0, \ldots, 0\Big) \in \mathbb R^{n\times m}.$$

Remark 3.7. Note that $\widehat x = A^{\oplus}b$.

Corollary 3.8. (i) If $n = m = \mathrm{rank}(A)$, then $A^{\oplus} = A^{-1}$; (ii) if $\mathrm{rank}(A) = n$, then $A^{\oplus} = (A^*A)^{-1}A^*$.

Proof. It is easy to verify that in both cases these matrices give the unique minimizer of $\|Ax - b\|$, so the statement of the Corollary is implied by the uniqueness of $A^{\oplus}$. The latter can also be verified directly, taking into account that $\Sigma$ is of full rank. $\square$
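A small numerical illustration (not part of the notes): building $A^{\oplus} = V\Sigma^{\oplus}U^*$ directly from the SVD of Definition 3.6 and checking it against numpy's built-in pseudoinverse, on a hypothetical rank-deficient matrix.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])                     # rank 1, chosen for illustration
U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-12))
Sigma_pinv = np.zeros((A.shape[1], A.shape[0]))
Sigma_pinv[:r, :r] = np.diag(1.0 / s[:r])
A_pinv = Vt.T @ Sigma_pinv @ U.T                    # A^+ = V Sigma^+ U^*
print(np.allclose(A_pinv, np.linalg.pinv(A)))       # True
b = np.array([1.0, 2.0])
print(A_pinv @ b)                                   # least-norm minimizer of ||Ax - b|| (Theorem 3.5)
```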

Theorem 3.9. $A^{\oplus}$ is the unique matrix satisfying
$$AA^{\oplus}A = A \qquad (3.21)$$
for which there exists a pair of matrices $Q$ and $P$ such that
$$A^{\oplus} = PA^* = A^*Q, \qquad (3.22)$$
i.e. the rows and columns of $A^{\oplus}$ are linear combinations of the rows and columns of $A^*$.

Proof. First we show that there exists a unique matrix satisfying (3.21) and (3.22). Assume that there are two matrices satisfying these equations, $A^{\oplus}_1$ and $A^{\oplus}_2$. Then
$$AA^{\oplus}_1A = A, \quad A^{\oplus}_1 = P_1A^* = A^*Q_1$$
and
$$AA^{\oplus}_2A = A, \quad A^{\oplus}_2 = P_2A^* = A^*Q_2$$


for some matrices $P_1, P_2, Q_1, Q_2$. Let $D = A^{\oplus}_1 - A^{\oplus}_2$, $P = P_1 - P_2$, $Q = Q_1 - Q_2$; then
$$ADA = 0 \in \mathbb R^{m\times n}, \qquad D = PA^* = A^*Q,$$
and (by $D^* = Q^*A$)
$$(DA)^*(DA) = A^*D^*DA = A^*Q^*ADA = 0,$$
that is, $DA = 0$. Making use of $D^* = AP^*$ we find that
$$DD^* = DAP^* = 0,$$
so $D = A^{\oplus}_1 - A^{\oplus}_2 = 0$. Now it is left to check that $A^{\oplus}$ satisfies (3.21) and (3.22):
$$AA^{\oplus}A = (U\Sigma V^*)(V\Sigma^{\oplus}U^*)(U\Sigma V^*) = U\Sigma\Sigma^{\oplus}\Sigma V^* = U\Sigma V^* = A.$$
To check (3.22) we use the relation
$$\Sigma^{\oplus} = (\Sigma^*\Sigma)^{\oplus}\Sigma^* = \Sigma^*(\Sigma\Sigma^*)^{\oplus},$$
which is verified by simple manipulations with diagonal matrices:
$$A^{\oplus} = V\Sigma^{\oplus}U^* = V(\Sigma^*\Sigma)^{\oplus}\Sigma^*U^* = V(\Sigma^*\Sigma)^{\oplus}V^*V\Sigma^*U^* := PA^*,$$
$$A^{\oplus} = V\Sigma^{\oplus}U^* = V\Sigma^*(\Sigma\Sigma^*)^{\oplus}U^* = V\Sigma^*U^*U(\Sigma\Sigma^*)^{\oplus}U^* := A^*Q. \qquad\square$$

Remark 3.10. Since $A^{\oplus}$ is a unique matrix, Theorem 3.9 can be used as its definition.

Lemma 3.11. $A^{\oplus}$ obeys the following properties:
(1) $AA^{\oplus}A = A$, $A^{\oplus}AA^{\oplus} = A^{\oplus}$
(2) $(A^*)^{\oplus} = (A^{\oplus})^*$
(3) $(A^{\oplus})^{\oplus} = A$
(4) $(A^{\oplus}A)^2 = A^{\oplus}A$, $(A^{\oplus}A)^* = A^{\oplus}A$, $(AA^{\oplus})^2 = AA^{\oplus}$, $(AA^{\oplus})^* = AA^{\oplus}$
(5) $(A^*A)^{\oplus} = A^{\oplus}(A^*)^{\oplus} = A^{\oplus}(A^{\oplus})^*$
(6) $A^{\oplus} = (A^*A)^{\oplus}A^* = A^*(AA^*)^{\oplus}$
(7) $A^{\oplus}AA^* = A^*AA^{\oplus} = A^*$
(8) if $S$ is an orthogonal matrix, then $(SAS^*)^{\oplus} = SA^{\oplus}S^*$
(9) if $A$ is a symmetric nonnegative definite matrix of order $n\times n$ and rank $r < n$, then
$$A^{\oplus} = T^*(TT^*)^{-2}T,$$
where $T \in \mathbb R^{r\times n}$ of rank $r$ is defined by the decomposition $A = T^*T$.

Proof. Properties (1)-(7) can be verified directly for the diagonal matrix $\Sigma$ and then extended to $A$ due to the orthogonality of $U$ and $V$:
(1) $A^{\oplus}AA^{\oplus} = V\Sigma^{\oplus}U^*U\Sigma V^*V\Sigma^{\oplus}U^* = V\Sigma^{\oplus}U^* = A^{\oplus}$
(2) $A^* = V\Sigma^*U^*$, so that by definition $(A^*)^{\oplus} = U(\Sigma^{\oplus})^*V^* = (A^{\oplus})^*$
(3) $A^{\oplus} = V\Sigma^{\oplus}U^*$; by definition $(A^{\oplus})^{\oplus} = U(\Sigma^{\oplus})^{\oplus}V^* = U\Sigma V^* = A$
(4)
$$(A^{\oplus}A)^2 = A^{\oplus}AA^{\oplus}A = A^{\oplus}A,$$
$$(A^{\oplus}A)^* = A^*(A^*)^{\oplus} = V\Sigma^*U^*U(\Sigma^*)^{\oplus}V^* = V(\Sigma^{\oplus}\Sigma)^*V^* = V\Sigma^{\oplus}\Sigma V^* = V\Sigma^{\oplus}U^*U\Sigma V^* = A^{\oplus}A, \quad \text{etc.}$$
(5)
$$(A^*A)^{\oplus} = (V\Sigma^*U^*U\Sigma V^*)^{\oplus} = (V\Sigma^*\Sigma V^*)^{\oplus} = V(\Sigma^*\Sigma)^{\oplus}V^* = V\Sigma^{\oplus}(\Sigma^{\oplus})^*V^* = V\Sigma^{\oplus}U^*U(\Sigma^{\oplus})^*V^* = A^{\oplus}(A^*)^{\oplus}, \quad \text{etc.}$$
(6) follows from (5), (4), (1);
(7) follows from (1) and (4);
(8) if $S$ is orthogonal, then $SUU^*S^* = I$, i.e. $SU$ is orthogonal as well, and so is $SV$. It follows that
$$(SAS^*)^{\oplus} = (SU\Sigma V^*S^*)^{\oplus} = SV\Sigma^{\oplus}U^*S^* = SA^{\oplus}S^*;$$
(9) any matrix $A \in \mathbb R^{m\times n}$ of rank $r$ can be decomposed as
$$A = BC,$$
where $B \in \mathbb R^{m\times r}$ and $C \in \mathbb R^{r\times n}$. Indeed, form $B$ from $r$ independent columns of $A$, then choose the rows of $C$ so that $A = BC$. Let us show that $A^{\oplus} = C^{\oplus}B^{\oplus}$. Since $C$ and $B$ are of full rank, we have
$$C^{\oplus}B^{\oplus} = C^*(CC^*)^{-1}(B^*B)^{-1}B^*,$$
and so
$$BC(C^{\oplus}B^{\oplus})BC = BCC^*(CC^*)^{-1}(B^*B)^{-1}B^*BC = BC,$$
i.e. (3.21) holds. Similarly it can be shown that (3.22) holds. Since in the case of symmetric $A$, $B = T^*$ and $C = T$, the desired formula follows. $\square$

Consider a set of linear algebraic equations $Ax = b$, where $A$ is an $m \times n$ matrix, $b$ is a vector in $\mathbb R^m$ and $x \in \mathbb R^n$ is the vector to be found so as to minimize the Euclidean norm $\|Ax - b\|$. If $A$ is square and nonsingular, then the solution is unique, $x = A^{-1}b$, while in general $x = A^{\oplus}b$, as discussed above. The calculation of $A^{\oplus}$ is essentially equivalent to the problem of finding the eigenvalues of the nonnegative definite matrix $AA^*$ (see Theorem 3.4) and for large matrices may be cumbersome. Moreover, if only $x$ is required, the calculation of $A^{\oplus}$ is unnecessary.

Consider the following auxiliary estimation problem: let $X$ be a zero mean Gaussian random vector with unit covariance matrix and $Y = AX$. The optimal linear estimate of $X$ from $Y$ is
$$\widehat X = \widehat E(X|Y) = \mathrm{cov}(X, Y)\,\mathrm{cov}^{\oplus}(Y)Y = \mathrm{cov}(X)A^*\big[A\,\mathrm{cov}(X)A^*\big]^{\oplus}Y = A^*\big[AA^*\big]^{\oplus}Y = A^{\oplus}Y, \qquad (3.23)$$
where the last equality is nothing but property (6) of the pseudoinverse.

Notice that (3.23) holds only for observation vectors $Y$ compatible with the model $Y = AX$, and hence it coincides with the linear function $x = A^{\oplus}y$ only on the $n$-dimensional subspace $\{y \in \mathbb R^m : y = Ax,\ x \in \mathbb R^n\}$. Hence, e.g., when $m \le n$ and $A$ is of full rank, the orthogonal projection formula obtained in the auxiliary stochastic problem coincides with the solution of the linear system $Ax = y$. If $y$ is


not in the range of $A$, it can still be substituted into the formula of the orthogonal projection, but the result will typically differ from $A^{\oplus}y$.

The auxiliary problem (and thus the system of linear equations $Ax = b$ with $b \in \{y \in \mathbb R^m : y = Ax,\ x \in \mathbb R^n\}$) can be solved recursively by the Kalman filter. Denote the rows of $A$ by $a^i$, $i = 1, \ldots, m$, and consider the signal satisfying the recursion
$$X_i = X_{i-1}, \qquad X_0 = X.$$
The observation is generated by
$$Y_i = a^iX_{i-1}.$$
The Kalman filter equations read
$$\widehat X_i = \widehat X_{i-1} + P_{i-1}a^{i*}\big[a^iP_{i-1}a^{i*}\big]^{\oplus}\big(Y_i - a^i\widehat X_{i-1}\big)$$
$$P_i = P_{i-1} - P_{i-1}a^{i*}\big[a^iP_{i-1}a^{i*}\big]^{\oplus}a^iP_{i-1}.$$
Clearly $\widehat X_i = A_i^{\oplus}Y^i$, where $A_i$ is the matrix of the first $i$ rows of $A$ and $Y^i$ is the sub-vector of the first $i$ entries of $Y$, and $\widehat X_m = A^{\oplus}Y$. In other words, the Kalman filter can be applied to $y$ (in the range of $A$) to generate $x_i = A_i^{\oplus}y^i$ for each $i = 1, \ldots, m$, and $x_m = A^{\oplus}y$.

Notice, however, that if $m > n$ and the first $n$ rows of $A$ are linearly independent, then the algorithm ignores the rest of the rows and hence will not yield $x = A^{\oplus}y$ in this case (as we have already stressed before).

Since
$$\big[a^iP_{i-1}a^{i*}\big]^{\oplus} = \begin{cases}\big[a^iP_{i-1}a^{i*}\big]^{-1}, & a^iP_{i-1}a^{i*} > 0\\ 0, & \text{otherwise},\end{cases}$$
the natural question is what happens when $a^iP_{i-1}a^{i*}$ vanishes. It turns out that
$$a^iP_{i-1}a^{i*} = \min_{c_1,\ldots,c_{i-1}}\Big\|a^i - \sum_{\ell}c_\ell a^\ell\Big\|^2, \qquad (3.24)$$
which means that when $a^iP_{i-1}a^{i*} = 0$ is encountered, the row $a^i$ is a linear combination of the previous rows and thus the matrix is not of full rank. To verify (3.24), use the properties of the pseudoinverse:
$$\min_{c_1,\ldots,c_{i-1}}\Big\|a^i - \sum_{\ell}c_\ell a^\ell\Big\|^2 = \min_{c\in\mathbb R^{i-1}}\|a^i - c^*A_{i-1}\|^2 = \|a^i - a^iA_{i-1}^{\oplus}A_{i-1}\|^2 =$$
$$a^i\big(I - A_{i-1}^{\oplus}A_{i-1}\big)\big(I - A_{i-1}^{\oplus}A_{i-1}\big)^*a^{i*} = a^i\big(I - A_{i-1}^{\oplus}A_{i-1}\big)^2a^{i*} = a^i\big(I - A_{i-1}^{\oplus}A_{i-1}\big)a^{i*} = a^iP_{i-1}a^{i*}.$$
It is also worth noting that in practice small values of $a^iP_{i-1}a^{i*}$ may indicate that the matrix is close to singular and numerical problems are possible.


CHAPTER 3

Conditional expectation and nonlinear estimation

1. Conditional expectation

Let $(\Omega, \mathcal F, P)$ be a fixed probability space. The conditional probability of an event $A$, given an event $B$, is defined as
$$P(A|B) = \frac{P(A\cap B)}{P(B)}, \qquad (1.1)$$
where $0/0 = 0$ is understood for definiteness. The probabilistic interpretation of this definition is that once $B$ happens (i.e. $\omega' \in B$ is obtained), the probability of $A$ changes: it becomes the probability of being in the part of $A$ which is also in $B$, normalized by the probability of $B$. For example, intuitively one expects that if $A$ and $B$ are mutually exclusive, then the probability of $A$ should be zero, given that $B$ happened. Or, if $B$ is a subset of $A$, i.e. if $B$ happens then $A$ happens too, the probability of $A$ given $B$ should be one.

Consider now the events $\mathcal D = \{D_1, \ldots, D_n\}$, such that $D_i\cap D_j = \emptyset$ and $\Omega = \sum_{i=1}^{n}D_i$. The conditional probability of $A$ given the partition $\mathcal D$ is a random variable (!):
$$P(A|\mathcal D)(\omega) = \sum_{i=1}^{n}\frac{P(A\cap D_i)}{P(D_i)}I_{D_i}(\omega). \qquad (1.2)$$
Finally, the conditional expectation of a r.v. $X$ with values in $\{x_1, \ldots, x_m\}$, given the partition $\mathcal D$, is the r.v.
$$E(X|\mathcal D) = \sum_{j=1}^{m}x_j\sum_{i=1}^{n}\frac{P(\{X(\omega) = x_j\}\cap D_i)}{P(D_i)}I_{D_i}(\omega). \qquad (1.3)$$

However, in many cases conditioning with respect to events of zero probability is required. E.g., suppose one chooses at random a point $\omega$ from $([0,1], \mathcal B, \lambda)$ and then tosses a coin with probability of heads $p = \omega$. Intuitively one would expect that $P(h|\omega = 1/3) = 1/3$, though the definitions above cannot provide this answer.

The following definition of the conditional expectation is much more general.

Definition 1.1. Let $X(\omega)$ be a real random variable with $E|X| < \infty$, defined on a probability space $(\Omega, \mathcal F, P)$. Let $\mathcal F'$ be a sub-$\sigma$-algebra of $\mathcal F$. The conditional expectation $E(X|\mathcal F')(\omega)$ is a random variable such that
(1) $\{E(X|\mathcal F')(\omega) \le x\} \in \mathcal F'$ for any $x \in \mathbb R$ (in other words, $E(X|\mathcal F')(\omega)$ is $\mathcal F'$-measurable);
(2) $E\big(X - E(X|\mathcal F')\big)I(A) = 0$ for any $A \in \mathcal F'$.

The proof of the correctness of this definition relies on certain auxiliary facts from measure theory and can be found in (8).


Let us see how the aforementioned "elementary" definitions agree with Definition 1.1. Let, e.g., $\mathcal F' = \{B, \bar B, \Omega, \emptyset\}$ and verify that
$$P(A|\mathcal F')(\omega) := E(I_A(\omega)|\mathcal F')(\omega) = \frac{P(A\cap B)}{P(B)}I_B(\omega) + \frac{P(A\cap\bar B)}{P(\bar B)}I_{\bar B}(\omega).$$
First let us check that this random variable is $\mathcal F'$-measurable. In fact, any r.v. of the form $\xi(\omega) = \beta_1I_B(\omega) + \beta_2I_{\bar B}(\omega)$ is measurable w.r.t. $\mathcal F'$. Indeed (if, e.g., $\beta_2 > \beta_1$),
$$\{\omega : \xi(\omega) \le x\} = \begin{cases}\Omega, & \max(\beta_1, \beta_2) \le x\\ B, & \min(\beta_1, \beta_2) \le x < \max(\beta_1, \beta_2)\\ \emptyset, & x < \min(\beta_1, \beta_2)\end{cases} \ \in \mathcal F'.$$
Further,
$$E\Big(I_A(\omega) - \frac{P(A\cap B)}{P(B)}I_B(\omega) - \frac{P(A\cap\bar B)}{P(\bar B)}I_{\bar B}(\omega)\Big)I_B(\omega) = E\Big(I_A(\omega)I_B(\omega) - \frac{P(A\cap B)}{P(B)}I_B(\omega)\Big) = P(A\cap B) - \frac{P(A\cap B)}{P(B)}P(B) = 0,$$
and similarly for the other atoms of $\mathcal F'$. Repeating these calculations for $n$ atoms, one obtains (1.2).

The formula (1.3) is checked in the same way: as before, the right hand side is measurable w.r.t. $\mathcal F'$ and
$$E\big(X - E(X|\mathcal D)\big)I_{D_\ell}(\omega) = E\Big(X - \sum_{j=1}^{m}x_j\sum_{i=1}^{n}\frac{P(\{X(\omega) = x_j\}\cap D_i)}{P(D_i)}I_{D_i}(\omega)\Big)I_{D_\ell}(\omega) =$$
$$E\Big(XI_{D_\ell}(\omega) - \sum_{j=1}^{m}x_j\frac{P(\{X(\omega) = x_j\}\cap D_\ell)}{P(D_\ell)}I_{D_\ell}(\omega)\Big) = E\Big(\sum_{k=1}^{m}x_kI(X = x_k)I_{D_\ell}(\omega) - \sum_{j=1}^{m}x_j\frac{P(\{X(\omega) = x_j\}\cap D_\ell)}{P(D_\ell)}I_{D_\ell}(\omega)\Big) = 0.$$

An especially important case is conditioning with respect to the σ-algebra generated by a r.v., i.e. the minimal σ-algebra with respect to which this r.v. is measurable. For example, if $X = I(\omega \le 1/2)$ is defined on $([0,1], \mathcal B, \lambda)$, then the σ-algebra generated by $X$ is atomic ($P$-a.s.): $\mathcal F^X = \{\Omega, \emptyset, A, \bar A\}$, where $A = (0, 1/2]$. The following definition is a special case of Definition 1.1.

Definition 1.2. Let $X$ and $Y$ be a pair of r.v. and $E|X| < \infty$. The conditional expectation of $X$ with respect to $Y$ (or the σ-algebra $\mathcal F^Y$ generated by $Y$), denoted by $E(X|Y)$, is an $\mathcal F^Y$-measurable r.v. such that
$$E\big(X - E(X|Y)\big)Z = 0$$
for any bounded $\mathcal F^Y$-measurable r.v. $Z$.

It turns out (see (8)) that any r.v. $Z$ measurable w.r.t. $\mathcal F^Y$ has the form $Z = f(Y)$ for some function $f(x)$. So Definition 1.2 is equivalent to

Proposition 1.3. Let $X$ and $Y$ be a pair of r.v. and $E|X| < \infty$. The conditional expectation $E(X|Y)$ is given by $E(X|Y) = g(Y)$, where $g(x)$ is such that
$$E\big(X - g(Y)\big)f(Y) = 0 \qquad (1.4)$$


for any bounded function f .

In the view of the latter proposition, we will write E(X|Y = y) to denote anyfunction g(y), satisfying (1.4)

Now let us check that Definition 1.2 (or 1.1) gives the the right answer in thecoin tossing experiment with random probability for heads. On some probabilityspace (Ω,F , P ) let π (the ”random” heads probability) and ξ be a pair of i.i.d. r.v.with uniform distributions U(0, 1) For any p ∈ [0, 1] define Zp = I(ξ ≤ p). ClearlyZp is a binary random variable with P (Z = 1) = p. Consider ζ = Zπ and findP (ζ = 1|π) = E

[I(ζ = 1)|π]. For an arbitrary bounded function f : R 7→ R

E[I(ζ = 1)− π

]f(π) = E

[I(ξ ≤ π)− π

]f(π) =

∫ 1

0

∫ 1

0

[I(x ≤ y)− y

]f(y)dydx =

∫ 1

0

[ ∫ 1

0

I(x ≤ y)dx− y]f(y)dy = 0

and thus E[I(ζ = 1)|π] = π.
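The answer can also be seen empirically; a throwaway Monte Carlo sketch (not in the notes): the fraction of heads among the trials whose $\pi$ fell near $1/3$ is close to $1/3$.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = rng.uniform(size=1_000_000)                  # the random heads probability
zeta = rng.uniform(size=pi.size) <= pi            # zeta = Z_pi
near = np.abs(pi - 1 / 3) < 0.01                  # condition on pi being close to 1/3
print(zeta[near].mean())                          # ~ 0.333, i.e. E(I(zeta = 1) | pi) at pi = 1/3
```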

As mentioned before, Definition 1.1 is general and leads to familiar formulae under special setups. For example:

Corollary 1.4. Assume that two real r.v. $X$ and $Y$ have the joint probability density
$$f(x, y) = \frac{\partial^2}{\partial x\partial y}P\big(X \le x, Y \le y\big).$$
Then for any bounded function $h : \mathbb R\mapsto\mathbb R$
$$E\big(h(X)|Y\big) = \frac{\int_{\mathbb R}h(x)f\big(x, Y(\omega)\big)dx}{\int_{\mathbb R}f\big(x, Y(\omega)\big)dx}, \qquad (1.5)$$
or, equivalently, the conditional probability distribution is
$$dF(x; Y) := dP\big(X \le x|Y\big) = \frac{f\big(x, Y(\omega)\big)dx}{\int_{\mathbb R}f\big(x, Y(\omega)\big)dx}.$$

Proof. Clearly the right hand side of (1.5) is a function of $Y$, so only orthogonality should be checked: fix an arbitrary bounded function $\psi : \mathbb R\mapsto\mathbb R$, then
$$E\Big(h(X) - \frac{\int_{\mathbb R}h(x)f\big(x, Y(\omega)\big)dx}{\int_{\mathbb R}f\big(x, Y(\omega)\big)dx}\Big)\psi(Y) = \iint_{\mathbb R^2}\Big(h(s) - \frac{\int_{\mathbb R}h(x)f(x, t)dx}{\int_{\mathbb R}f(x, t)dx}\Big)\psi(t)f(s, t)\,ds\,dt =$$
$$\iint_{\mathbb R^2}h(s)\psi(t)f(s, t)\,ds\,dt - \int_{\mathbb R}\frac{\int_{\mathbb R}h(x)f(x, t)dx}{\int_{\mathbb R}f(x, t)dx}\,\psi(t)\Big[\int_{\mathbb R}f(s, t)ds\Big]dt = 0. \qquad\square$$

The formula (1.5) is a form of the Bayes rule, which is the central tool in the calculation of conditional expectations. The Bayes rule is valid in setups much more general than that of Corollary 1.4.


Example 1.5. Suppose that a system malfunctions $\tau$ seconds after it is powered, where $\tau$ is an exponential r.v., i.e.
$$P(\tau \le t) = \begin{cases}1 - e^{-\lambda t}, & t \ge 0\\ 0, & t < 0\end{cases}$$
(1) What is the expected period of proper operation?
(2) What is the probability of an additional $t$ seconds of operation, if the system has already operated properly for $s$ seconds?
(3) What is the expected period of proper operation if the system has already operated properly for $s$ seconds?

The expected period till failure is (the same answer is obtained by integrating against the probability density $\lambda e^{-\lambda t}$)
$$E\tau = E\int_0^{\tau}ds = E\int_0^{\infty}I(s \le \tau)ds = \int_0^{\infty}P(\tau \ge s)ds = \int_0^{\infty}\big[1 - P(\tau \le s)\big]ds = \int_0^{\infty}e^{-\lambda s}ds = \frac{1}{\lambda}.$$
Further, for $t \ge 0$,
$$P(\tau \ge s + t|\tau \ge s) = \frac{e^{-\lambda(t+s)}}{e^{-\lambda s}} = e^{-\lambda t} = P(\tau \ge t).$$
The latter is referred to as the memoryless property of the r.v. $\tau$: the fact that the system has already worked properly for $s$ seconds does not change the probability of proper operation during the next $t$ seconds! It turns out that this property characterizes the exponential distribution, i.e. only the latter is memoryless among all distributions with density.

Also,
$$E(\tau|\tau \ge s) = E\Big(\int_0^{\infty}I(\tau \ge u)du\,\Big|\,\tau \ge s\Big) = \int_0^s1\,du + \int_s^{\infty}P(\tau \ge u|\tau \ge s)du = s + \int_0^{\infty}P(\tau \ge u')du' = s + 1/\lambda,$$
i.e. the additional expected time till the first failure is $1/\lambda$.

1.1. Properties of conditional expectation. The conditional expectation satisfies the following properties (the conditioning w.r.t. random variables is considered here; conditioning in the general case, i.e. w.r.t. σ-algebras, is treated similarly). Below $X$, $Y$ and $Z$ are random variables for which the expectations are assumed to exist when needed; generic bounded functions are denoted by $g$, $f$ and $h$.

a. $E(C|Y) = C$ for a constant $C$.

Proof. A constant r.v. $C$ is certainly a function of $Y$, and thus the claim is true by the definition of the conditional expectation. $\square$

b. If $X \le Y$ ($P$-a.s.), then $E(X|Z) \le E(Y|Z)$ $P$-a.s.


Proof. For a positive function $f$,
$$Y \ge X \ \Longrightarrow\ E\big(Y - X\big)f(Z) \ge 0.$$
By definition,
$$0 \le E\big(Y - X\big)f(Z) = EE(Y|Z)f(Z) - EE(X|Z)f(Z) = E\big(E(Y|Z) - E(X|Z)\big)f(Z).$$
In particular this holds with $f(Z) = I\big(E(Y|Z) - E(X|Z) \le 0\big)$, i.e.
$$E\big(E(Y|Z) - E(X|Z)\big)I\big(E(Y|Z) - E(X|Z) \le 0\big) \ge 0,$$
which implies $P\big(E(Y|Z) - E(X|Z) < 0\big) = 0$. $\square$

c. $\big|E(X|Y)\big| \le E\big(|X|\,\big|Y\big)$.

Proof. Since $-|X| \le X \le |X|$, by (b)
$$-E\big(|X|\,\big|Y\big) \le E\big(X\big|Y\big) \le E\big(|X|\,\big|Y\big),$$
that is, $\big|E(X|Y)\big| \le E\big(|X|\,\big|Y\big)$. $\square$

d. For constants $a$ and $b$, $E(aX + bY|Z) = aE(X|Z) + bE(Y|Z)$.

Proof.
$$E\big(aX + bY - aE(X|Z) - bE(Y|Z)\big)f(Z) = aE\big(X - E(X|Z)\big)f(Z) + bE\big(Y - E(Y|Z)\big)f(Z) = 0. \qquad\square$$

e. $E(g(X)|X) = g(X)$ $P$-a.s. for any $g$.

Proof. Holds by definition, since for any $f$
$$E\big(g(X) - g(X)\big)f(X) = 0. \qquad\square$$

f. $E\big(E(X|Y)\big) = EX$.

Proof. By definition,
$$E\big(X - E(X|Y)\big)\cdot 1 = 0. \qquad\square$$

g. $E\big(E(X|Y,Z)\,\big|\,Z\big) = E(X|Z)$ ("smoothing" property).


Proof. For any $f$,
$$0 = E\big[E(X|Y,Z) - E\big(E(X|Y,Z)\big|Z\big)\big]f(Z) = E\big[E(X|Y,Z) - X\big]f(Z) + E\big[X - E\big(E(X|Y,Z)\big|Z\big)\big]f(Z) = E\big[X - E\big(E(X|Y,Z)\big|Z\big)\big]f(Z),$$
where the latter equality holds since $f(Z)$ is in particular a function of $(Y, Z)$. $\square$

Remark 1.6. In terms of σ-algebras, the above property reads: if $\mathcal G_1 \subseteq \mathcal G_2$ (the inclusion means that any event from $\mathcal G_1$ is also an event from $\mathcal G_2$), then $E(X|\mathcal G_1) = E\big(E(X|\mathcal G_2)\big|\mathcal G_1\big)$ $P$-a.s. In particular this is valid with $\mathcal G_1$ generated by $Y_1, \ldots, Y_m$ and $\mathcal G_2$ by $Y_1, \ldots, Y_m, Y_{m+1}, \ldots, Y_n$.

h. $E\big(E(X|Y)\,\big|\,Y, Z\big) = E(X|Y)$ $P$-a.s.

Proof. A particular case of (e). $\square$

i. If $X$ and $Y$ are independent, then $E(X|Y) = EX$.

Proof. Being a constant, $EX$ is a trivial function of $Y$. Moreover,
$$E\big(X - EX\big)f(Y) = EXEf(Y) - EXEf(Y) = 0. \qquad\square$$

j. $E(Xg(Y)|Y) = g(Y)E(X|Y)$ for bounded $g$.

Proof.
$$E\big(Xg(Y) - g(Y)E(X|Y)\big)f(Y) = E\big(X - E(X|Y)\big)g(Y)f(Y) = 0. \qquad\square$$

Remark 1.7. The latter property also holds when only $E|g(Y)| < \infty$ and $E|Xg(Y)| < \infty$.

k. If $g(x)$ is convex, then $E(g(X)|Y) \ge g\big(E(X|Y)\big)$ $P$-a.s. (Jensen's inequality).

Proof. For any $x_0 \in \mathbb R$, $g(x) \ge g(x_0) + \alpha(x - x_0)$ for some $\alpha \in \mathbb R$. Then
$$g(X) \ge g\big(E(X|Y)\big) + \alpha\big(X - E(X|Y)\big),$$
and by properties (b) and (e),
$$E(g(X)|Y) \ge g\big(E(X|Y)\big). \qquad\square$$

l. Assume that $EX^2 < \infty$; then $E(X|Y)$ is the optimal estimate of $X$ given $Y$, in the sense that
$$E\big(X - E(X|Y)\big)^2 = \inf_{\psi}E\big(X - \psi(Y)\big)^2,$$
where the infimum is taken over all functions $\psi$ such that $E\psi^2(Y) < \infty$.


Proof. By virtue of property (j) (see Remark 1.7) and the definition of the conditional expectation,
$$E\big(X - \psi(Y)\big)^2 = E\big(X - E(X|Y) + E(X|Y) - \psi(Y)\big)^2 = E\big(X - E(X|Y)\big)^2 + E\big(E(X|Y) - \psi(Y)\big)^2 \ge E\big(X - E(X|Y)\big)^2. \qquad\square$$

1.2. Additional results, examples and applications.
1.2.1. Gaussian processes. A random vector $X$ in $\mathbb R^n$ is Gaussian if its characteristic function has the form
$$\varphi(\lambda) = Ee^{i\lambda^*X} = \exp\Big\{i\lambda^*m - \frac12\lambda^*\Gamma\lambda\Big\}, \quad \forall\lambda\in\mathbb R^n, \qquad (1.6)$$
where $m$ is a vector and $\Gamma$ a nonnegative definite matrix. It is not hard to check that $m = EX$ and $\Gamma = \mathrm{cov}(X)$. Moreover, if $\Gamma$ is nonsingular, the distribution of $X$ has the density
$$f(x) = \frac{1}{\sqrt{(2\pi)^n\det\Gamma}}\exp\big\{-1/2\,(x - m)^*\Gamma^{-1}(x - m)\big\}.$$
If $\Gamma$ is singular, the density does not exist: e.g., according to the above definition a constant r.v. is Gaussian as well.

The next theorem gives a number of important properties of Gaussian r.v.

Theorem 1.8. Let $(X, Y) \in \mathbb R^{m+n}$ be a Gaussian random vector. Then
(1) $X$ and $Y$ are independent if and only if they are orthogonal (uncorrelated);
(2) any linear combination $Z = AX + b$ is a Gaussian random vector with mean $EZ = AEX + b$ and covariance $\mathrm{cov}(Z) = A\,\mathrm{cov}(X)A^*$;
(3) $E(X|Y) = \widehat E(X|Y)$ and, moreover, the conditional distribution $P(X\in B|Y)$, $B \in \mathcal B(\mathbb R^m)$, is $P$-a.s. Gaussian with mean
$$E(X|Y) = EX + \mathrm{cov}(X, Y)\,\mathrm{cov}^{\oplus}(Y)\big(Y - EY\big) \qquad (1.7)$$
and (deterministic!) covariance
$$\mathrm{cov}(X|Y) = \mathrm{cov}(X) - \mathrm{cov}(X, Y)\,\mathrm{cov}^{\oplus}(Y)\,\mathrm{cov}(Y, X). \qquad (1.8)$$

Proof. (1) For Gaussian vectors $E\|X\|^p < \infty$ and $E\|Y\|^p < \infty$ for any $p$, so independence implies orthogonality. Conversely, if $X$ and $Y$ are orthogonal, i.e. $\mathrm{cov}(X, Y) = 0$, then $\Gamma$ is block diagonal with blocks $\Gamma_{11} = \mathrm{cov}(X)$ and $\Gamma_{22} = \mathrm{cov}(Y)$. Then
$$\varphi_{n+m}(\lambda) = \exp\{i\lambda^*\mu - 1/2\,\lambda^*\Gamma\lambda\} = \exp\{i\lambda_1^*\mu_1 - 1/2\,\lambda_1^*\Gamma_{11}\lambda_1\}\exp\{i\lambda_2^*\mu_2 - 1/2\,\lambda_2^*\Gamma_{22}\lambda_2\} = \varphi_m(\lambda_1)\varphi_n(\lambda_2),$$
where $\mu_1 = EX$, $\mu_2 = EY$, $\varphi_n(\cdot)$ denotes the Gaussian characteristic function of an $n\times1$ vector, and $\lambda_1$ and $\lambda_2$ are $m\times1$ and $n\times1$ vectors.
(2)
$$\psi(\lambda) = E\big(e^{i\lambda^*Z}\big) = E\exp\{i\lambda^*b + i\lambda^*AX\} = \exp\{i\lambda^*b\}E\exp\{i\lambda^*AX\} = \exp\{i\lambda^*\mu - 1/2\,\lambda^*G\lambda\},$$


where $\mu = b + AEX$ and $G = A\,\mathrm{cov}(X)A^*$. The latter is a nonnegative definite matrix, and thus the characteristic function of $Z$ corresponds to a Gaussian distribution with the appropriate parameters.
(3) It is to be shown that
$$\varphi(\lambda; Y) = E\big(e^{i\lambda^*X}\big|Y\big) = \exp\{i\lambda^*\mu(Y) - 1/2\,\lambda^*\Gamma(Y)\lambda\},$$
where $\mu(Y)$ is given by (1.7) and $\Gamma(Y)$ by (1.8). Recall that $X$ can be decomposed as
$$X = \widehat E(X|Y) + \big(X - \widehat E(X|Y)\big),$$
where $\big(X - \widehat E(X|Y)\big)$ is orthogonal to $Y$ (and to $\widehat E(X|Y)$). Since $\widehat E(X|Y)$ is a linear map of $Y$, the vector $(X, Y, \widehat E(X|Y))$ is Gaussian as well, by virtue of (2). Hence $\big((X - \widehat E(X|Y)), Y\big)$ is Gaussian, and since $\big(X - \widehat E(X|Y)\big)$ and $Y$ are orthogonal, they are also independent. So for an arbitrary bounded function $h$,
$$E\big(X - \widehat E(X|Y)\big)h(Y) = E\big(X - \widehat E(X|Y)\big)Eh(Y) = 0,$$
i.e. $E(X|Y) = \widehat E(X|Y)$. The equation (1.7) follows immediately. Since $X - \widehat E(X|Y)$ and $Y$ are independent, we have
$$E\big(\exp\{i\lambda^*X\}|Y\big) = \exp\{i\lambda^*\widehat E(X|Y)\}E\big(\exp\{i\lambda^*(X - \widehat E(X|Y))\}|Y\big) = \exp\{i\lambda^*\widehat E(X|Y)\}E\big(\exp\{i\lambda^*(X - \widehat E(X|Y))\}\big) =$$
$$\exp\{i\lambda^*\widehat E(X|Y)\}\exp\{-1/2\,\lambda^*\mathrm{cov}(X - \widehat E(X|Y))\lambda\},$$
and the desired result follows. $\square$
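A minimal sketch of the formulae (1.7)-(1.8) (Python/numpy, not from the notes; the first and second moments are assumed to be given), using the pseudoinverse for a possibly singular $\mathrm{cov}(Y)$:

```python
import numpy as np


def gaussian_conditional(m_x, m_y, S_xx, S_xy, S_yy, y):
    """Conditional mean (1.7) and covariance (1.8) of X given Y = y for a joint Gaussian vector."""
    S_yy_pinv = np.linalg.pinv(S_yy)
    cond_mean = m_x + S_xy @ S_yy_pinv @ (y - m_y)
    cond_cov = S_xx - S_xy @ S_yy_pinv @ S_xy.T
    return cond_mean, cond_cov
```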

A r.p. $X = (X_n)_{n\ge0}$ is Gaussian if all its finite dimensional distributions are Gaussian and consistent. For example, the sequence generated by the recursion ($n \ge 1$)
$$X_n = a_nX_{n-1} + b_n\varepsilon_n \qquad (1.9)$$
with constants $a_n$ and $b_n$, a standard i.i.d. Gaussian sequence $\varepsilon = (\varepsilon_n)_{n\ge1}$ and (say) a Gaussian $X_0$ independent of $\varepsilon$, is a Gaussian Markov process.

It turns out that any zero mean Gaussian Markov process has the structure of (1.9):

Lemma 1.9. Let $X = (X_n)_{n\ge0}$ be a zero mean Gaussian Markov process. Then there exist deterministic sequences $(a_n)_{n\ge1}$ and $(b_n)_{n\ge1}$ such that for $n \ge 1$
$$X_n = a_nX_{n-1} + b_n\varepsilon_n,$$
where $\varepsilon$ is a standard Gaussian i.i.d. sequence.

Proof. Let $\mathcal X^n_0 = \sigma\{X_0, X_1, \ldots, X_n\}$ and consider the decomposition
$$X_n = E(X_n|\mathcal X^{n-1}_0) + \big(X_n - E(X_n|\mathcal X^{n-1}_0)\big).$$
Since the process is Markov, $E(X_n|\mathcal X^{n-1}_0) = E(X_n|X_{n-1})$, and since it is Gaussian, $E(X_n|X_{n-1}) = a_nX_{n-1}$ for some constant $a_n$. Now set
$$b_n = \sqrt{E\big(X_n - E(X_n|X_{n-1})\big)^2}$$
and $\varepsilon_n = \big(X_n - E(X_n|\mathcal X^{n-1}_0)\big)/b_n$, so that
$$X_n = a_nX_{n-1} + b_n\varepsilon_n.$$


It is left to check that $\varepsilon_n$ is an i.i.d. Gaussian sequence. First note that, being a linear transformation of Gaussian r.v., $\varepsilon = (\varepsilon_n)_{n\ge1}$ is a Gaussian process. Also, e.g., for $n > m$
$$E\varepsilon_n\varepsilon_m = b_n^{-1}b_m^{-1}E\big(X_n - E(X_n|\mathcal X^{n-1}_0)\big)\big(X_m - E(X_m|\mathcal X^{m-1}_0)\big) = 0,$$
and thus $\varepsilon_n$ and $\varepsilon_m$ are independent. Similarly, independence can be verified for any number of entries of $\varepsilon$. Since $\mathrm{cov}(\varepsilon_n) = 1$ and the process is Gaussian, the sequence $\varepsilon$ is a standard i.i.d. Gaussian sequence. $\square$

Corollary 1.10. Any Gaussian Markov stationary process has a correlation function of the form
$$R(k) \propto a^{|k|}$$
with some $|a| < 1$.

Indeed, due to the representation of Lemma 1.9, for any $n \ge 1$
$$R(n-1, n) = EX_nX_{n-1} = a_nEX_{n-1}^2 = a_nR(n-1, n-1).$$
Since the process is stationary, $R(1) = R(n-1, n) = a_nR(n-1, n-1) = a_nR(0)$. Assuming $R(0) > 0$ (otherwise the claim is trivial!), we obtain $a_n = R(1)/R(0) = \mathrm{const}$ for all $n \ge 1$. This implies that $X_n$ satisfies
$$X_n = aX_{n-1} + b_n\varepsilon_n,$$
which implies that $\mathrm{cov}(X_n) = a^2\mathrm{cov}(X_{n-1}) + b_n^2$ for any $n \ge 1$, and thus $b_n \equiv b$, i.e. $b_n$ is constant as well. The correlation function of the stationary process satisfying the recursion with constant coefficients is $\frac{b^2}{1 - a^2}a^{|k|}$ if $|a| < 1$. If $|a| \ge 1$, the process cannot be stationary (e.g., its variance grows).

2. Nonlinear estimation

In Chapter 2 the linear estimation problem was addressed: the optimal estimate was constrained to be a linear transformation of the observation process. This constraint may be quite restrictive in many problems, i.e. the accuracy can be significantly improved if nonlinear estimates are considered. In the next section we address the simplest nonlinear filtering problem.

2.1. Filtering of Markov chains.
2.1.1. Finite state Markov chains. The notion of a Markov process was introduced in Section 5.1.4. Markov processes in discrete time are often called Markov chains. The simplest Markov chain can be constructed in the following way: let $S = \{a_1, \ldots, a_d\}$ be a finite set of real numbers (symbols), $p_0$ some probability distribution on $S$ (which can be identified with a vector from the simplex $\mathcal S = \{x\in\mathbb R^d : x_i \ge 0,\ \sum_{i=1}^{d}x_i = 1\}$), and $\lambda_{ij}\in[0,1]$, $1 \le i, j \le d$, $\sum_{j=1}^{d}\lambda_{ij} = 1$, $i = 1, \ldots, d$, transition probabilities.
Now let $X_0$ be a random variable with values in $S$ and distribution $p_0$. Define $X_n$ recursively as the random variable
$$X_n = \sum_{i=1}^{d}\varepsilon_n(i)I(X_{n-1} = a_i),$$
where $(\varepsilon_n)_{n\ge1}$ is an i.i.d. vector sequence with $P(\varepsilon_1(i) = a_j) = \lambda_{ij}$, $j = 1, \ldots, d$.


By construction the process $X = (X_n)_{n\ge0}$ is Markov. Denote by $p_n$ its marginal distribution at time $n$, i.e. $p_n(i) = P(X_n = a_i)$, and let $\Lambda$ be the matrix with entries $\lambda_{ij}$. Then (why?)
$$p_n = \Lambda^*p_{n-1} = \Lambda^{*n}p_0. \qquad (2.1)$$
The chain $X$ is said to be ergodic if it converges weakly (in law) to a random variable with a positive distribution on $S$, i.e. if $p_n \xrightarrow{n\to\infty}\mu$ with $\mu(i) > 0$. It can be shown that $X$ is ergodic if and only if its transition matrix $\Lambda$ is primitive, i.e. $\Lambda^q$ has positive entries for some $q \ge 1$. Finite state Markov chains have been a subject of research since the 1930's. For further reading see (8) and the references therein.

2.1.2. Filtering in Hidden Markov Models (HMM). Consider the signal/observation pair $(X, Y)$, where $X = (X_n)_{n\ge0}$ is a finite state Markov chain with $(S, \Lambda, p_0)$ and $Y = (Y_n)_{n\ge1}$ is generated by
$$Y_n = h(X_n) + \xi_n,$$
where $h$ is an $S\mapsto\mathbb R$ function and $\xi = (\xi_n)_{n\ge1}$ is an i.i.d. sequence with $\xi_1$ having density $f(x)$. This setting is often referred to as a Hidden Markov Model (HMM). One of the basic problems for HMM is to estimate the state $X_n$ from the observations $Y^n_1 = \sigma\{Y_1, \ldots, Y_n\}$. Similarly to the Kalman filter, the solution can be given in an efficient recursive form.

Theorem 2.1. For any $\varphi : S\mapsto\mathbb R$, $E(\varphi(X_n)|Y^n_1) = \sum_{i=1}^{d}\varphi(a_i)\pi_n(i)$, where $\pi_n(i) = P(X_n = a_i|Y^n_1)$. The vector process $\pi = (\pi_n)_{n\ge1}$ is generated by
$$\pi_n = \frac{D(Y_n)\Lambda^*\pi_{n-1}}{\langle\mathbf 1, D(Y_n)\Lambda^*\pi_{n-1}\rangle}, \quad n \ge 1, \qquad (2.2)$$
subject to $\pi_0 = p_0$, where $\langle\mathbf 1, x\rangle = \sum_{i=1}^{d}x_i$, $x\in\mathbb R^d$, and $D(y)$, $y\in\mathbb R$, is the diagonal matrix with entries $f(y - h(a_i))$, $i = 1, \ldots, d$.

Proof. First note that
$$E(\varphi(X_n)|Y^n_1) = E\Big(\sum_{i=1}^{d}\varphi(a_i)I(X_n = a_i)\,\Big|\,Y^n_1\Big) = \sum_{i=1}^{d}\varphi(a_i)P(X_n = a_i|Y^n_1).$$
Let $G_i : \mathbb R\times\mathbb R^{n-1}\mapsto\mathbb R$ be a function such that $P(X_n = a_i|Y^n_1) = G_i(Y_n; Y^{n-1}_1)$. Then $G_i$ should satisfy
$$E\big(I(X_n = a_i) - G_i(Y_n; Y^{n-1}_1)\big)\psi(Y_n)\Psi(Y^{n-1}_1) = 0 \qquad (2.3)$$
for any bounded $\psi : \mathbb R\mapsto\mathbb R$ and $\Psi : \mathbb R^{n-1}\mapsto\mathbb R$. Note that no generality is lost if only functions of the form $\psi(y)\Psi(y^{n-1}_1)$ are considered, since any bounded function of $n$ variables can be approximated by series of products of functions of a single variable (e.g. indicators). Note also that (2.3) holds if
$$E\Big(\big(I(X_n = a_i) - G_i(Y_n; Y^{n-1}_1)\big)\psi(Y_n)\,\Big|\,Y^{n-1}_1\Big) = 0, \quad P\text{-a.s.}$$


is satisfied. Further,
$$E\big(I(X_n = a_i)\psi(Y_n)|Y^{n-1}_1\big) = E\big(I(X_n = a_i)\psi(h(a_i) + \xi_n)|Y^{n-1}_1\big) = \int_{\mathbb R}\psi(h(a_i) + s)f(s)ds\,E\big(I(X_n = a_i)|Y^{n-1}_1\big) =$$
$$\int_{\mathbb R}\psi(s)f(s - h(a_i))ds\,E\Big(E\big(I(X_n = a_i)|X_{n-1}, Y^{n-1}_1\big)\,\Big|\,Y^{n-1}_1\Big) = \int_{\mathbb R}\psi(s)f(s - h(a_i))ds\,E\Big(E\big(I(X_n = a_i)|X_{n-1}\big)\,\Big|\,Y^{n-1}_1\Big) =$$
$$\int_{\mathbb R}\psi(s)f(s - h(a_i))ds\,E\Big(\sum_{j=1}^{d}\lambda_{ji}I(X_{n-1} = a_j)\,\Big|\,Y^{n-1}_1\Big) = \int_{\mathbb R}\psi(s)f(s - h(a_i))ds\,\sum_{j=1}^{d}\lambda_{ji}\pi_{n-1}(j),$$
and similarly
$$E\big(G_i(Y_n; Y^{n-1}_1)\psi(Y_n)|Y^{n-1}_1\big) = E\Big(\sum_{\ell=1}^{d}I(X_n = a_\ell)G_i(h(a_\ell) + \xi_n; Y^{n-1}_1)\psi(h(a_\ell) + \xi_n)\,\Big|\,Y^{n-1}_1\Big) =$$
$$\sum_{\ell=1}^{d}\int_{\mathbb R}G_i(h(a_\ell) + s; Y^{n-1}_1)\psi(h(a_\ell) + s)f(s)ds\,E\big(I(X_n = a_\ell)|Y^{n-1}_1\big) = \sum_{\ell=1}^{d}\int_{\mathbb R}G_i(s; Y^{n-1}_1)\psi(s)f(s - h(a_\ell))ds\,E\big(I(X_n = a_\ell)|Y^{n-1}_1\big) =$$
$$\sum_{\ell=1}^{d}\int_{\mathbb R}G_i(s; Y^{n-1}_1)\psi(s)f(s - h(a_\ell))ds\,\sum_{j=1}^{d}\lambda_{j\ell}\pi_{n-1}(j).$$
By the arbitrariness of $\psi$,
$$G_i(s; Y^{n-1}_1) = \frac{f(s - h(a_i))\sum_{j=1}^{d}\lambda_{ji}\pi_{n-1}(j)}{\sum_{\ell=1}^{d}f(s - h(a_\ell))\sum_{j=1}^{d}\lambda_{j\ell}\pi_{n-1}(j)},$$
which proves the recursion (2.2). $\square$
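A minimal sketch of one step of the recursion (2.2) (Python/numpy, not part of the notes; the state values `a`, the observation function `h` and the noise density `f` are assumed to be supplied as an array and callables):

```python
import numpy as np


def hmm_filter_step(pi_prev, y, Lam, a, h, f):
    """pi_n = D(Y_n) Lam^* pi_{n-1} / <1, D(Y_n) Lam^* pi_{n-1}>, as in (2.2)."""
    d = f(y - h(a))                    # diagonal of D(Y_n): f(y - h(a_i)), i = 1..d
    unnorm = d * (Lam.T @ pi_prev)
    return unnorm / unnorm.sum()


# example usage with a Gaussian noise density (assumed sigma)
# sigma = 1.0
# f = lambda x: np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
# h = lambda a: a                      # e.g. the identity observation function
```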

Example 2.2 (Binary Symmetric Channel). The message $X_n$ is a binary Markov chain with
$$P\{X_n = 1|X_{n-1} = 0\} = P\{X_n = 0|X_{n-1} = 1\} = \lambda$$
and initial distribution $P\{X_0 = 1\} = p$. It is transmitted via a symmetric channel, so that the observed sequence is ($n \ge 1$)
$$Y_n = X_n\oplus\xi_n = (X_n + \xi_n)\bmod 2,$$
where $\xi_n$ is an i.i.d. binary noise sequence with $P\{\xi_n = 1\} = \varepsilon$. Find the recursive filtering estimate of $X_n$ from $Y^n_1$.
First note that the transition matrix of $X_n$ is
$$\Lambda = \begin{pmatrix}1-\lambda & \lambda\\ \lambda & 1-\lambda\end{pmatrix}$$

Page 58: Introduction to Stochastic Processes Pavel Chiganskypluto.huji.ac.il/~pchiga/teaching/StochProc/lecture_notes/isp.pdf · Introduction to Stochastic Processes Pavel Chigansky E-mail

Summer, 2004 Stochastic Processes

and the conditional distribution of $Y_n$ given $X_n$ is
$$f(i, j) := P\{Y_n = i|X_n = j\} = \begin{cases}1-\varepsilon, & i = j\\ \varepsilon, & i \ne j\end{cases}, \quad i, j\in\{0, 1\},$$
so that $f(Y_n, 1) = (1 - 2\varepsilon)Y_n + \varepsilon$ and $f(Y_n, 0) = (1 - 2\varepsilon)(1 - Y_n) + \varepsilon$. Let $\pi_n = P\{X_n = 1|Y^n_1\}$. Slightly modifying the proof of Theorem 2.1, the recursion is obtained:
$$\pi_n = \frac{\big[(1 - 2\varepsilon)Y_n + \varepsilon\big]\pi_{n|n-1}}{\big[(1 - 2\varepsilon)Y_n + \varepsilon\big]\pi_{n|n-1} + \big[(1 - 2\varepsilon)(1 - Y_n) + \varepsilon\big]\big(1 - \pi_{n|n-1}\big)}, \qquad (2.4)$$
where
$$\pi_{n|n-1} = \lambda(1 - \pi_{n-1}) + (1 - \lambda)\pi_{n-1}.$$
The recursion (2.4) can be rewritten as
$$\pi_n = \frac{(1 - \varepsilon)\pi_{n|n-1}}{(1 - \varepsilon)\pi_{n|n-1} + \varepsilon\big(1 - \pi_{n|n-1}\big)}\,Y_n + \frac{\varepsilon\pi_{n|n-1}}{\varepsilon\pi_{n|n-1} + (1 - \varepsilon)\big(1 - \pi_{n|n-1}\big)}\,(1 - Y_n). \qquad (2.5)$$
$\square$
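A sketch of the recursion (2.4) over a whole observation record (Python; `Y` is a 0/1 array, and `lam`, `eps`, `p` are the λ, ε and p of the example):

```python
import numpy as np


def bsc_filter(Y, lam, eps, p):
    """Recursion (2.4): pi_n = P(X_n = 1 | Y_1^n) for the binary symmetric channel."""
    pi, out = p, []
    for y in Y:
        pred = lam * (1 - pi) + (1 - lam) * pi                    # pi_{n|n-1}
        num = ((1 - 2 * eps) * y + eps) * pred
        den = num + ((1 - 2 * eps) * (1 - y) + eps) * (1 - pred)
        pi = num / den
        out.append(pi)
    return np.array(out)
```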

The optimal linear estimate $\widehat E(\varphi(X_n)|Y^n_1)$ can also be efficiently calculated via the Kalman filter.

Theorem 2.3. Assume that $\xi$ is an i.i.d. sequence with $\sigma^2 = E\xi_1^2 < \infty$ and $E\xi_1 = 0$. Then for any $\varphi : S\mapsto\mathbb R$, $\widehat E(\varphi(X_n)|Y^n_1) = \sum_{i=1}^{d}\varphi(a_i)\eta_n(i)$, where $\eta_n(i) = \widehat E(I(X_n = a_i)|Y^n_1)$. The vector process $\eta = (\eta_n)_{n\ge1}$ is generated by
$$\eta_n = \Lambda^*\eta_{n-1} + \frac{\big(\Lambda^*P_{n-1}\Lambda + V_n\big)h}{h^*\Lambda^*P_{n-1}\Lambda h + h^*V_nh + \sigma^2}\big(Y_n - h^*\Lambda^*\eta_{n-1}\big) \qquad (2.6)$$
$$P_n = \Lambda^*P_{n-1}\Lambda + V_n - \frac{\big(\Lambda^*P_{n-1}\Lambda + V_n\big)hh^*\big(\Lambda^*P_{n-1}\Lambda + V_n\big)}{h^*\Lambda^*P_{n-1}\Lambda h + h^*V_nh + \sigma^2} \qquad (2.7)$$
$$V_n = \mathrm{diag}(p_n) - \Lambda^*\mathrm{diag}(p_{n-1})\Lambda \qquad (2.8)$$
subject to $\eta_0 = p_0$ and $V_0 = P_0 = \mathrm{diag}(p_0) - p_0p_0^*$, where $h$ is the vector with entries $h(a_i)$ and $p_n$ satisfies (2.1).

Proof. Introduce the random vector process
$$I_n = \begin{pmatrix}I(X_n = a_1)\\ \vdots\\ I(X_n = a_d)\end{pmatrix}$$
and let $p_n = EI_n$, i.e. $p_n(i) = P(X_n = a_i)$. Note that $\widehat E(\varphi(X_n)|Y^n_1) = \widehat E(\langle\varphi, I_n\rangle|Y^n_1) = \langle\varphi, \eta_n\rangle$, where $\langle x, y\rangle = \sum_{i=1}^{d}x_iy_i$, $x, y\in\mathbb R^d$, and $\varphi$ is the vector in $\mathbb R^d$ with entries $\varphi(a_i)$.
Let $\varepsilon = (\varepsilon_n)_{n\ge1}$ be the vector random process satisfying
$$\varepsilon_n = I_n - \Lambda^*I_{n-1}.$$


Then $E\varepsilon_n = p_n - \Lambda^*p_{n-1} = 0$ (see (2.1)) and
$$\mathrm{cov}(\varepsilon_n) = E\big(I_n - \Lambda^*I_{n-1}\big)\big(I_n - \Lambda^*I_{n-1}\big)^* = E\big(I_n - \Lambda^*I_{n-1}\big)I_n^* - E\big(I_n - \Lambda^*I_{n-1}\big)I_{n-1}^*\Lambda =$$
$$E\,\mathrm{diag}(I_n) - \Lambda^*E\big[I_{n-1}E(I_n^*|I_{n-1})\big] - E\big(E(I_n|I_{n-1}) - \Lambda^*I_{n-1}\big)I_{n-1}^*\Lambda = \mathrm{diag}(p_n) - \Lambda^*E\big[I_{n-1}I_{n-1}^*\big]\Lambda = \mathrm{diag}(p_n) - \Lambda^*\mathrm{diag}(p_{n-1})\Lambda := V_n.$$
Moreover, for $m \ne n$ (say $n > m$),
$$E\varepsilon_n\varepsilon_m^* = E(I_n - \Lambda^*I_{n-1})(I_m - \Lambda^*I_{m-1})^* = E\big(E(I_n|\mathcal X^{n-1}_0) - \Lambda^*I_{n-1}\big)(I_m - \Lambda^*I_{m-1})^* = 0.$$
The equations (2.6)-(2.7) are nothing but the Kalman filter corresponding to the system
$$I_n = \Lambda^*I_{n-1} + \varepsilon_n$$
$$Y_n = h^*I_n + \xi_n = h^*\Lambda^*I_{n-1} + h^*\varepsilon_n + \xi_n. \qquad\square$$

Exercise 2.4. Apply the equations of Theorem 2.3 to the model in Example 2.2.

In general the mean square error of the nonlinear filter (2.2) is strictly less than that of the filter (2.6). However, the equations (2.6)-(2.7) have their pros: the corresponding filtering error is obtained via $P_n$, whereas the performance of (2.2) cannot be assessed in a simple way. Also, their stability properties are better understood.

2.1.3. Filtering of occupation times and transition counters. In many HMM applications it is required to calculate estimates of certain processes related to the signal $X$. For example, the efficient calculation of transition probability estimates is based (see (10) for details) on the filtering estimates of the occupation times
$$\theta^i_n = \sum_{k=0}^{n}I(X_k = a_i), \quad i = 1, \ldots, d,$$
and of the number of $i$-to-$j$ transitions
$$N^{ij}_n = \sum_{k=1}^{n}I(X_k = a_j)I(X_{k-1} = a_i).$$
Particularly efficient algorithms for the calculation of $\widehat\theta^i_n = E\big(\theta^i_n|Y^n_1\big)$ and $\widehat N^{ij}_n = E\big(N^{ij}_n|Y^n_1\big)$ were derived by the authors of (10).

Occupation time. First note that $\theta^r_n$ obeys the recursion
$$\theta^r_n = \theta^r_{n-1} + I(X_n = a_r).$$
Introduce the vector $Z_n := \theta^r_nI_n$, where $I_n$ is the vector of indicators as in the proof of Theorem 2.3. Clearly
$$Z_n := \theta^r_nI_n = \big[\theta^r_{n-1} + I_n(r)\big]I_n = \theta^r_{n-1}I_n + e_rI_n(r) = \theta^r_{n-1}\big[\Lambda^*I_{n-1} + I_n - \Lambda^*I_{n-1}\big] + e_rI_n(r) = \Lambda^*Z_{n-1} + \theta^r_{n-1}\varepsilon_n + e_rI_n(r), \qquad (2.9)$$


where $e_r$ is the $r$-th vector of the standard basis of $\mathbb R^d$ and $\varepsilon_n := I_n - \Lambda^*I_{n-1}$. Conditioning both sides of (2.9) with respect to $Y^{n-1}_1$, we arrive at
$$\widehat Z_{n|n-1} := E(Z_n|Y^{n-1}_1) = \Lambda^*\widehat Z_{n-1} + E\big(\theta^r_{n-1}\varepsilon_n|Y^{n-1}_1\big) + e_r\pi_{n|n-1}(r). \qquad (2.10)$$
The middle term in (2.10) vanishes:
$$E\big(\theta^r_{n-1}\varepsilon_n|Y^{n-1}_1\big) = E\Big\{E\big(\theta^r_{n-1}\varepsilon_n|I^{n-1}_0, Y^{n-1}_1\big)\,\Big|\,Y^{n-1}_1\Big\} = E\Big\{\theta^r_{n-1}E\big(I_n - \Lambda^*I_{n-1}|I_{n-1}\big)\,\Big|\,Y^{n-1}_1\Big\} \equiv 0,$$
so that
$$\widehat Z_{n|n-1} = \Lambda^*\widehat Z_{n-1} + e_r\langle e_r, \Lambda^*\pi_{n-1}\rangle. \qquad (2.11)$$

Let us find $\widehat Z_n = E(Z_n|Y^n_1)$ in terms of $Y_n$ and $\widehat Z_{n|n-1}$. Set $\widehat Z_n = G(Y_n; Y^{n-1}_1)$. For any bounded $\psi : \mathbb R\mapsto\mathbb R$,
$$E\Big\{\big(Z_n - G(Y_n; Y^{n-1}_1)\big)\psi(Y_n)\,\Big|\,Y^{n-1}_1\Big\} = 0.$$
Calculating the first term componentwise:
$$E(Z_n(i)\psi(Y_n)|Y^{n-1}_1) = E\big(\theta^r_nI(X_n = a_i)\psi(h(a_i) + \xi_n)|Y^{n-1}_1\big) = E\big(\theta^r_nI(X_n = a_i)|Y^{n-1}_1\big)\int\psi(h(a_i) + x)f(x)dx = \widehat Z_{n|n-1}(i)\int\psi(x)f(x - h(a_i))dx. \qquad (2.12)$$
Analogously:
$$E\big(G(Y_n; Y^{n-1}_1)\psi(Y_n)|Y^{n-1}_1\big) = \sum_i\pi_{n|n-1}(i)\int G(x; Y^{n-1}_1)\psi(x)f(x - h(a_i))dx. \qquad (2.13)$$
Equations (2.12), (2.13) and $\pi_{n|n-1} = \Lambda^*\pi_{n-1}$ lead to
$$\widehat Z_n = \frac{D_n\widehat Z_{n|n-1}}{\langle\mathbf 1, D_n\Lambda^*\pi_{n-1}\rangle}, \qquad (2.14)$$
where $D_n$ is the diagonal matrix with entries $f(Y_n - h(a_i))$. Finally, combining (2.14) and (2.11) we arrive at the filtering equations:
$$\widehat Z_n = \frac{D_n\Lambda^*\widehat Z_{n-1}}{\langle\mathbf 1, D_n\Lambda^*\pi_{n-1}\rangle} + \frac{D_ne_r\langle e_r, \Lambda^*\pi_{n-1}\rangle}{\langle\mathbf 1, D_n\Lambda^*\pi_{n-1}\rangle}, \quad n \ge 1, \qquad (2.15)$$
$$\widehat Z_0 = E\theta^r_0I_0 = EI(X_0 = a_r)I_0 = e_rp_0(r),$$
where $\pi_n$ is generated by (2.2). To recover the optimal estimate of $\theta^r_n$, recall that $\langle\mathbf 1, I_n\rangle \equiv 1$, so that
$$\widehat\theta^r_n = E(\theta^r_n\langle\mathbf 1, I_n\rangle|Y^n_1) = \langle\mathbf 1, E(\theta^r_nI_n|Y^n_1)\rangle = \langle\mathbf 1, \widehat Z_n\rangle.$$


Figure 1 (plot omitted; panels: the noisy observations and "Filtering of the occupation time of state 0", both versus the time index). Filtering of the occupation time of the 0 state of a Markov chain: the noisy path versus the signal realization in the upper plot; the optimal and suboptimal estimates versus the exact occupation time in the lower plot.

2.1.4. Simulation example. Consider the following example: a slowly switching Markov chain with the state space $\{-1, 0, 1\}$ is observed in zero mean Gaussian noise. The optimal estimate of the occupation time of the state 0 has been calculated by the recursion (2.15). In Figure 1 it is compared to the exact occupation time and to the following suboptimal estimate:
$$\vartheta^0_n = \vartheta^0_{n-1} + I(\widehat X_n = 0),$$
where $\widehat X_n = \operatorname{argmax}_{1\le i\le d}\pi_n(i)$ is the MAP (maximum a posteriori probability) estimate of $X_n$. A simulation sketch of the optimal filter is given below.
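A minimal sketch of the recursion (2.15), assuming the Gaussian observation model used above (so that $D_n$ has diagonal entries $f(Y_n - h(a_i))$ with Gaussian density $f$); all names are illustrative.

```python
import numpy as np

def occupation_time_filter(Y, Lam, h, sigma2, p0, r):
    """Finite-dimensional filter (2.15) for the occupation time of state a_r,
    run jointly with the conditional distribution recursion (2.2)."""
    d = len(p0)
    f = lambda x: np.exp(-x**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    e_r = np.eye(d)[r]
    pi = p0.copy()
    Z = e_r * p0[r]                     # hat Z_0 = e_r p_0(r)
    theta_hat = []
    for y in Y:
        D = f(y - h)                    # diagonal of D_n
        pred = Lam.T @ pi               # Lambda^* pi_{n-1}
        denom = D @ pred                # <1, D_n Lambda^* pi_{n-1}>
        Z = (D * (Lam.T @ Z) + D * e_r * pred[r]) / denom      # eq. (2.15)
        pi = D * pred / denom                                  # eq. (2.2)
        theta_hat.append(Z.sum())       # hat theta^r_n = <1, hat Z_n>
    return np.array(theta_hat), pi
```

The suboptimal estimate of this section is obtained from the same $\pi_n$ by accumulating $I(\operatorname{argmax}_i\pi_n(i) = r)$ along the run.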

Number of r-to-s transitions

The number of $r$-to-$s$ transitions obeys the recursion
$$N^{rs}_n = \sum_{j=1}^n I(X_{j-1} = a_r)I(X_j = a_s) = N^{rs}_{n-1} + \langle e_r, I_{n-1}\rangle\langle e_s, I_n\rangle. \tag{2.16}$$
Define the vectors $Q_n := N^{rs}_n I_n$ and $\varepsilon_n := I_n - \Lambda^* I_{n-1}$. Then
$$Q_n := N^{rs}_n I_n = \bigl[N^{rs}_{n-1} + \langle e_r, I_{n-1}\rangle\langle e_s, I_n\rangle\bigr]\bigl[\Lambda^* I_{n-1} + \varepsilon_n\bigr] = N^{rs}_{n-1}\Lambda^* I_{n-1} + N^{rs}_{n-1}\varepsilon_n + \langle e_r, I_{n-1}\rangle\langle e_s, \Lambda^* I_{n-1} + \varepsilon_n\rangle\bigl[\Lambda^* I_{n-1} + \varepsilon_n\bigr]. \tag{2.17}$$
Denote $\widehat Q_{n|n-1} = E\bigl[Q_n|Y^{n-1}_1\bigr]$ and $\widehat Q_n = E\bigl[Q_n|Y^n_1\bigr]$. Conditioning (2.17) on $Y^{n-1}_1$ gives
$$\widehat Q_{n|n-1} = \Lambda^*\widehat Q_{n-1} + \alpha_{n-1} + \beta_{n-1}, \tag{2.18}$$
where
$$\alpha_{n-1} = E\bigl[\langle e_r, I_{n-1}\rangle\langle e_s, \Lambda^* I_{n-1}\rangle\Lambda^* I_{n-1}\big|Y^{n-1}_1\bigr]$$
and
$$\beta_{n-1} = E\bigl[\langle e_r, I_{n-1}\rangle\langle e_s, \varepsilon_n\rangle\varepsilon_n\big|Y^{n-1}_1\bigr].$$
These objects can be simplified:
$$\alpha_{n-1} = E\bigl[\langle e_r, I_{n-1}\rangle\langle e_s, \Lambda^* e_r\rangle\Lambda^* e_r\big|Y^{n-1}_1\bigr] = \langle e_r, \pi_{n-1}\rangle\langle e_s, \Lambda^* e_r\rangle\Lambda^* e_r.$$
Also:
$$\beta_{n-1} = E\bigl[\langle e_r, I_{n-1}\rangle\langle e_s, \varepsilon_n\rangle\varepsilon_n\big|Y^{n-1}_1\bigr] = E\bigl[\langle e_r, I_{n-1}\rangle\langle e_s, I_n - \Lambda^* I_{n-1}\rangle\varepsilon_n\big|Y^{n-1}_1\bigr] = E\bigl[\langle e_r, I_{n-1}\rangle\langle e_s, I_n\rangle\varepsilon_n\big|Y^{n-1}_1\bigr] =$$
$$E\bigl[\langle e_r, I_{n-1}\rangle\langle e_s, I_n\rangle(I_n - \Lambda^* I_{n-1})\big|Y^{n-1}_1\bigr] = E\bigl[\langle e_r, I_{n-1}\rangle\langle e_s, I_n\rangle(e_s - \Lambda^* e_r)\big|Y^{n-1}_1\bigr] =$$
$$E\bigl[\langle e_r, I_{n-1}\rangle\langle e_s, \Lambda^* I_{n-1}\rangle(e_s - \Lambda^* e_r)\big|Y^{n-1}_1\bigr] = E\bigl[\langle e_r, I_{n-1}\rangle\langle e_s, \Lambda^* e_r\rangle(e_s - \Lambda^* e_r)\big|Y^{n-1}_1\bigr] = \langle e_r, \pi_{n-1}\rangle\langle e_s, \Lambda^* e_r\rangle(e_s - \Lambda^* e_r).$$
So the equation (2.18) reads
$$\widehat Q_{n|n-1} = \Lambda^*\widehat Q_{n-1} + \langle e_r, \pi_{n-1}\rangle\langle e_s, \Lambda^* e_r\rangle\Lambda^* e_r + \langle e_r, \pi_{n-1}\rangle\langle e_s, \Lambda^* e_r\rangle(e_s - \Lambda^* e_r) = \Lambda^*\widehat Q_{n-1} + \langle e_r, \pi_{n-1}\rangle\langle e_s, \Lambda^* e_r\rangle e_s. \tag{2.19}$$
Set $\widehat Q_n := G(Y_n; Y^{n-1}_1)$. Then for any bounded $\psi(x)$
$$E\bigl[\bigl(Q_n - G(Y_n; Y^{n-1}_1)\bigr)\psi(Y_n)\big|Y^{n-1}_1\bigr] = 0. \tag{2.20}$$
The first term is calculated explicitly ($a_i\in S$):
$$E\bigl[Q_n(i)\psi(Y_n)|Y^{n-1}_1\bigr] = E\bigl[N^{rs}_n I_n(i)\psi(h(a_i) + \xi_n)|Y^{n-1}_1\bigr] = \widehat Q_{n|n-1}(i)\int\psi(x)f\bigl(x - h(a_i)\bigr)dx.$$
Expanding similarly the second term, we arrive at
$$\widehat Q_n = \frac{D_n\widehat Q_{n|n-1}}{\langle\mathbf{1}, D_n\Lambda^*\pi_{n-1}\rangle}. \tag{2.21}$$
The equations (2.19) and (2.21) lead to a finite dimensional filter for $\widehat Q_n$:
$$\widehat Q_n = \frac{D_n\Lambda^*\widehat Q_{n-1} + D_n\langle e_r, \pi_{n-1}\rangle\langle e_s, \Lambda^* e_r\rangle e_s}{\langle\mathbf{1}, D_n\Lambda^*\pi_{n-1}\rangle}, \qquad n\ge 1. \tag{2.22}$$
This recursion should be solved subject to $\widehat Q_0 = 0$. The estimate of $N^{rs}_n$ is recovered from $\widehat Q_n$ by
$$\widehat N^{rs}_n = \langle\mathbf{1}, \widehat Q_n\rangle.$$


Figure 2 (plot omitted; panels: the noisy observations and "Filtering of 0 to −1 transitions", both versus the time index). Filtering of the number of transitions from state 0 to state −1: the noisy path versus the signal realization in the upper plot; the optimal and suboptimal estimates versus the exact transition count in the lower plot.

2.1.5. Simulation example. A slowly switching Markov chain with states in $\{-1, 0, 1\}$ is observed in Gaussian noise. The optimal filtering estimate of the number of transitions from state 0 to state −1 is drawn in Figure 2, along with the suboptimal estimate
$$\widehat N^{0,-1}_n = \widehat N^{0,-1}_{n-1} + I(\widehat X_{n-1} = 0)I(\widehat X_n = -1),$$
where $\widehat X_n$ is the MAP estimate of $X_n$.

2.2. The general filtering problem. Let $(X_n)_{n\ge0}$ be a Markov process with the state space $S\subseteq\mathbb{R}$, the transition kernel $\lambda(x,dy)$, i.e. for any $A\in\mathcal{B}(S)$
$$P(X_n\in A|X^{n-1}_0) = \int_A\lambda(X_{n-1},dy),$$
and initial distribution $p(dx)$, i.e.
$$P(X_0\in A) = \int_A p(dx).$$
A typical example is the process X generated by a nonlinear recursion
$$X_n = a(n, X_{n-1}) + \varepsilon_n, \qquad n\ge 1,$$
subject to a random variable $X_0$, where $a(n,x)$ is a $\mathbb{Z}_+\times\mathbb{R}\mapsto\mathbb{R}$ function and $\varepsilon = (\varepsilon_n)_{n\ge0}$ is an i.i.d. sequence with $\varepsilon_1$ having probability density $\varphi(x)$. In this case
$$\lambda(x,dy) = \varphi\bigl(y - a(n,x)\bigr)dy. \tag{2.23}$$
Suppose that the observation process is given by ($n\ge1$)
$$Y_n = A(n,X_n) + \xi_n,$$
where $A(n,x)$ is a $\mathbb{Z}_+\times\mathbb{R}\mapsto\mathbb{R}$ function and $\xi = (\xi_n)_{n\ge0}$ is an i.i.d. sequence, independent of $\varepsilon$, with $\xi_1$ having probability density $\psi(x)$.

Theorem 2.5. Assume that $\lambda(x,dy)$ satisfies (2.23) and $X_0$ has probability density $p(x)$. Then the conditional distribution $P(X_n\le x|Y^n_1)$ has a density $\pi_n(x)$, satisfying the recursion
$$\pi_n(x) = \frac{\psi\bigl(Y_n - A(n,x)\bigr)\int_S\varphi\bigl(x - a(n,s)\bigr)\pi_{n-1}(s)ds}{\int_{\mathbb{R}}\psi\bigl(Y_n - A(n,x)\bigr)\int_S\varphi\bigl(x - a(n,s)\bigr)\pi_{n-1}(s)ds\,dx} \tag{2.24}$$
subject to $\pi_0(x) = p(x)$.

Proof. The proof is similar to that of Theorem 2.1. $\square$

The equation (2.24) is a recursion that propagates densities, and its exact implementation generally requires two integrations at each time step, which makes the optimal filtering equations impractical; approximations are usually used in applications. One simple grid-based approximation is sketched below.
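As an illustration only (not one of the practical schemes alluded to above), here is a sketch of the crudest such approximation: evaluating (2.24) on a fixed grid with rectangle-rule quadrature. It assumes a, A, phi, psi and p0 are vectorized NumPy callables; all names are illustrative.

```python
import numpy as np

def grid_filter(Y, a, A, phi, psi, p0, grid):
    """Approximate the density recursion (2.24) on a fixed grid of points.
    a(n, x), A(n, x): signal and observation functions;
    phi, psi: noise densities of eps_1 and xi_1; p0: density of X_0."""
    dx = grid[1] - grid[0]
    pi = p0(grid)
    pi = pi / (pi.sum() * dx)
    out = []
    for n, y in enumerate(Y, start=1):
        # prediction: integral of phi(x - a(n, s)) pi_{n-1}(s) ds over the grid
        K = phi(grid[:, None] - a(n, grid[None, :]))   # K[i, j] = phi(x_i - a(n, s_j))
        pred = K @ pi * dx
        # correction by the observation likelihood, then normalization
        pi = psi(y - A(n, grid)) * pred
        pi = pi / (pi.sum() * dx)
        out.append(pi.copy())
    return np.array(out)
```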

In certain special cases the conditional density $\pi_n(x)$ can be parameterized by a finite dimensional sufficient statistic, i.e. there is a vector process $\theta = (\theta_n(Y))_{n\ge1}$, such that $\pi_n(x) = f(x,\theta_n(Y))$, and $\theta$ can be generated by a recursion not involving integration. For example, for linear Gaussian systems, i.e. if $a(n,x) = a_nx$ and $A(n,x) = A_nx$, $\varphi(x)$ and $\psi(x)$ are Gaussian densities and $X_0$ is a Gaussian r.v., the conditional density $\pi_n(x)$ is Gaussian⁴ as well. Its sufficient statistic is two dimensional: its mean $\widehat X_n$ and covariance $P_n$ are generated by the Kalman filter, so that
$$E\bigl(f(X_n)|Y^n_1\bigr) = \frac{1}{\sqrt{2\pi P_n}}\int_{\mathbb{R}}\exp\Bigl\{-\frac{(x-\widehat X_n)^2}{2P_n}\Bigr\}f(x)dx.$$
Though the latter calculation involves integration as well, for many functions $f$ it can be tabulated (or computed exactly, e.g. for polynomials) off-line.

When such a parametrization is possible, the filtering equation (2.24) is said to be finite dimensional, since a finite dimensional realization of it exists.

Another example of a finite dimensional filter is the equation (2.2), where the conditional distribution is parameterized by the $d$-dimensional vector $\pi_n$. Finite dimensional filters also exist in the filtering problems for the occupation times and transition counters, eqs. (2.15) and (2.22). In general, no constructive way to verify the existence of a finite dimensional filter in a specific problem is currently known.

⁴ recall that conditional expectation and orthogonal projection coincide in the Gaussian case


CHAPTER 4

Stochastic processes in continuous time

A continuous time stochastic process can be viewed as a parametric family of random variables $X = (X_t)_{t\ge0}$ with $t\in\mathbb{R}_+$. The theory of these processes is significantly more delicate than in the discrete time case. This chapter gives a non-rigorous introductory treatment of several important topics in this field. For further reading the reader is referred to (1), (2), (13), (14), (7), (5).

1. Poisson process

Let $(\tau_n)_{n\ge1}$ be a sequence of i.i.d. exponential random variables with parameter λ (see Example 1.5 in chapter 3):
$$P(\tau_1\le t) = \begin{cases}1 - e^{-\lambda t}, & t\ge 0,\\ 0, & t < 0.\end{cases}$$
The process $\Pi = (\Pi_t)_{t\ge0}$,
$$\Pi_t = \max\{k: \tau_1 + \dots + \tau_k\le t\}$$
(with the convention $\max\emptyset = 0$), is called the Poisson process with intensity λ. By definition the trajectories of $\Pi_t$ are piecewise constant functions (since $\Pi_t$ takes integer values), continuous from the right and having limits from the left; $\Pi_0 = 0$ with probability one.

Theorem 1.1.
(1) The process Π is Markov.
(2) $\Pi_t$ has the Poisson distribution with parameter $\lambda t$, i.e.
$$P(\Pi_t = k|\Pi^s_0) = \frac{\bigl(\lambda(t-s)\bigr)^{k-\Pi_s}e^{-\lambda(t-s)}}{(k-\Pi_s)!}\,I(k\ge\Pi_s), \qquad \forall\, t\ge s\ge 0,$$
and so in particular
$$E\Pi_t = \lambda t, \qquad E(\Pi_t - \lambda t)^2 = \lambda t.$$
(3) Π has stationary independent increments, i.e. for any $v > u > t > s$ and $k\ge\ell\ge0$
$$P(\Pi_v - \Pi_u = k,\ \Pi_t - \Pi_s = \ell) = P(\Pi_v - \Pi_u = k)P(\Pi_t - \Pi_s = \ell) = \frac{\bigl(\lambda(v-u)\bigr)^k e^{-\lambda(v-u)}}{k!}\cdot\frac{\bigl(\lambda(t-s)\bigr)^\ell e^{-\lambda(t-s)}}{\ell!}. \tag{1.1}$$


Proof. Introduce $\sigma_k = \sum_{i=1}^k\tau_i$. Then
$$P(\Pi_t = k|\Pi^s_0) = \sum_{\ell=1}^k P(\Pi_t = k|\tau_1,\dots,\tau_\ell, \tau_{\ell+1} > s-\sigma_\ell)\,I(\Pi_s = \ell),$$
and thus
$$P(\Pi_t = k|\tau_1,\dots,\tau_\ell, \tau_{\ell+1} > s-\sigma_\ell) = \frac{\bigl(\lambda(t-s)\bigr)^{k-\ell}e^{-\lambda(t-s)}}{(k-\ell)!}$$
is to be verified. Further,
$$P(\Pi_t = k|\tau_1,\dots,\tau_\ell, \tau_{\ell+1} > s-\sigma_\ell) = P(\sigma_k\le t < \sigma_{k+1}|\tau_1,\dots,\tau_\ell, \tau_{\ell+1} > s-\sigma_\ell) =$$
$$E\bigl(P(\sigma_k\le t < \sigma_{k+1}|\tau_1,\dots,\tau_{\ell+1})\,\big|\,\tau_1,\dots,\tau_\ell, \tau_{\ell+1} > s-\sigma_\ell\bigr) =$$
$$E\bigl(P(\tau_{\ell+2}+\dots+\tau_k\le t-\sigma_\ell-\tau_{\ell+1} < \tau_{\ell+2}+\dots+\tau_{k+1}|\sigma_\ell,\tau_{\ell+1})\,\big|\,\sigma_\ell, \tau_{\ell+1} > s-\sigma_\ell\bigr) =$$
$$P\bigl(\tau_{\ell+2}+\dots+\tau_k\le t-\sigma_\ell-\tau_{\ell+1} < \tau_{\ell+2}+\dots+\tau_{k+1}\,\big|\,\sigma_\ell, \tau_{\ell+1} > s-\sigma_\ell\bigr) =$$
$$e^{\lambda(s-\sigma_\ell)}\int_{s-\sigma_\ell}^\infty P\bigl(\tau_{\ell+2}+\dots+\tau_k\le t-\sigma_\ell-u < \tau_{\ell+2}+\dots+\tau_{k+1}\,\big|\,\sigma_\ell\bigr)\lambda e^{-\lambda u}du =$$
$$\int_0^\infty P\bigl(\tau_{\ell+2}+\dots+\tau_k\le t-s-u' < \tau_{\ell+2}+\dots+\tau_{k+1}\bigr)\lambda e^{-\lambda u'}du' =$$
$$P\bigl(\tau_{\ell+1}+\tau_{\ell+2}+\dots+\tau_k\le t-s < \tau_{\ell+1}+\tau_{\ell+2}+\dots+\tau_{k+1}\bigr) = P\bigl(\tau_1+\dots+\tau_{k-\ell}\le t-s < \tau_1+\dots+\tau_{k-\ell+1}\bigr) = P\bigl(\Pi_{t-s} = k-\ell\bigr).$$
Now it is left to show that
$$P(\Pi_t = k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!}, \qquad k\ge 0. \tag{1.2}$$
Note that
$$P(\Pi_t = k) = P(\sigma_k\le t < \sigma_k+\tau_{k+1}) = EI(\sigma_k\le t)I(\tau_{k+1} > t-\sigma_k) = EI(\sigma_k\le t)e^{-\lambda(t-\sigma_k)} = \int_0^t e^{-\lambda(t-s)}dP(\sigma_k\le s), \tag{1.3}$$
and
$$P(\sigma_k\le s) = P(\tau_k\le s-\sigma_{k-1}) = EP(\tau_k\le s-\sigma_{k-1}|\sigma_{k-1}) = EI(s-\sigma_{k-1}\ge 0)\bigl(1-e^{-\lambda(s-\sigma_{k-1})}\bigr) = \int_0^s\bigl(1-e^{-\lambda(s-u)}\bigr)dP(\sigma_{k-1}\le u). \tag{1.4}$$
Clearly
$$P(\sigma_1\le s) = P(\tau_1\le s) = 1-e^{-\lambda s},$$
and so by induction $P(\sigma_k\le s)$ has a density, which by (1.4) satisfies
$$\frac{dP(\sigma_k\le s)}{ds} = \lambda\int_0^s e^{-\lambda(s-u)}\frac{dP(\sigma_{k-1}\le u)}{du}du.$$
By induction
$$\frac{dP(\sigma_k\le s)}{ds} = \lambda\frac{(\lambda s)^{k-1}e^{-\lambda s}}{(k-1)!}.$$
The latter is known as the Erlang distribution. The equation (1.2), and thus also the statement of (2), now follows from (1.3).

The claim (3) follows directly from (2):
$$P(\Pi_v - \Pi_u = k,\ \Pi_t - \Pi_s = \ell) = EI(\Pi_t - \Pi_s = \ell)P(\Pi_v = k+\Pi_u|\Pi^u_0) = EI(\Pi_t - \Pi_s = \ell)\frac{\bigl(\lambda(v-u)\bigr)^k e^{-\lambda(v-u)}}{k!} = \frac{\bigl(\lambda(t-s)\bigr)^\ell e^{-\lambda(t-s)}}{\ell!}\cdot\frac{\bigl(\lambda(v-u)\bigr)^k e^{-\lambda(v-u)}}{k!}. \qquad\square$$

It can be shown that any process with piecewise constant trajectories with unit jumps, stationary independent increments and covariance process $\lambda t$ is a Poisson process, i.e. the times between the jumps are necessarily exponential random variables.
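The construction from exponential inter-arrival times translates directly into a simulation. A minimal sketch checking $E\Pi_T = \lambda T$ and $E(\Pi_T - \lambda T)^2 = \lambda T$ empirically (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, T, n_paths = 2.0, 10.0, 20000

# Pi_T = max{k : tau_1 + ... + tau_k <= T}, tau_i i.i.d. exponential(lam).
counts = np.empty(n_paths, dtype=int)
for i in range(n_paths):
    # 3*lam*T + 50 inter-arrival times are essentially always enough on [0, T]
    arrivals = np.cumsum(rng.exponential(1.0 / lam, size=int(3 * lam * T) + 50))
    counts[i] = np.searchsorted(arrivals, T, side='right')

print("empirical mean:", counts.mean(), " vs lambda*T =", lam * T)
print("empirical var :", counts.var(),  " vs lambda*T =", lam * T)
```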

The Poisson process is one of the fundamental building blocks in the theory of jump processes, including continuous time Markov chains, etc. It is also popular in many applications such as queueing theory. The typical setup consists of a number of servers and a queue. The arriving clients join the queue and receive service from one of the servers according to a certain regime (e.g. FIFO). The service times are usually assumed to be independent random variables (either exponential or not). The queueing system types have conventional names: e.g. M(λ)/G/1/∞/FIFO stands for a one server system with an infinite FIFO queue, where the customer arrival stream is Markov (i.e. Poissonian with intensity λ) and the service times are of general distribution. Usually FIFO and an infinite queue are the defaults, and then the same system is called M/G/1. The typical questions are: what is the expected queue length, or what is the distribution of the idle times, etc. For example, the expected waiting time in the queue of the stationary M(λ)/G/1 system is given by the Khinchin-Pollaczek formula
$$W_q = \frac{\lambda ES^2}{2(1-\rho)}, \qquad \rho = \lambda ES,$$
where S is the random service time.
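A short numerical illustration of the formula (the service distribution and all numbers are illustrative):

```python
# Expected waiting time in a stationary M(lambda)/G/1 queue (Khinchin-Pollaczek).
lam = 0.8            # arrival intensity
ES, ES2 = 1.0, 2.0   # first two moments of the service time S (e.g. exponential, mean 1)
rho = lam * ES
Wq = lam * ES2 / (2 * (1 - rho))
print(Wq)            # 4.0 for these illustrative values
```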

2. Wiener process

Let $\xi = (\xi_n)_{n\ge1}$ be a sequence of i.i.d. random variables with zero mean and unit variance. For $0\le t\le 1$ define the scaled random walk process
$$W^n_t = \frac{1}{\sqrt n}\sum_{j=1}^{\lfloor tn\rfloor}\xi_j.$$
For any fixed t, by the Central Limit Theorem $W^n_t$ converges weakly (in distribution) to a Gaussian random variable with zero mean and variance t. Moreover, it is not hard to check that for any set of points $0\le t_1 < t_2 < \dots < t_m\le 1$, the random vector $[W^n_{t_1},\dots,W^n_{t_m}]$ converges weakly to a Gaussian vector with zero mean and covariance matrix Γ with entries $\gamma_{ij} = t_i\wedge t_j$ (here $a\wedge b = \min(a,b)$ and $a\vee b = \max(a,b)$).

It turns out that $W^n = (W^n_t)_{0\le t\le 1}$ also has a weak limit as a continuous time random process, i.e. for any bounded and continuous functional ψ on the space (equipped with an appropriate metric) of functions continuous from the right and with limits from the left
$$\lim_{n\to\infty}E\psi(W^n) = E\psi(W),$$
where W is a Gaussian process with continuous trajectories, independent and stationary increments, zero mean and covariance function $EW_tW_s = s\wedge t$. The limit process is called the Wiener process. It is the mathematical model of Brownian motion, i.e. the motion of a particle driven by interactions with other particles in a solution, first described by the botanist Robert Brown.

It turns out that any process X satisfying the properties
(1) X has continuous trajectories;
(2) for any $t > s$, $E(X_t|\mathcal{F}^X_s) = X_s$ and $E\bigl[(X_t - X_s)^2|\mathcal{F}^X_s\bigr] = t - s$, where $\mathcal{F}^X_s = \sigma\{X_u, u\le s\}$,
is necessarily the Wiener process, i.e. it is Gaussian!

The trajectories of the Wiener process are extremely irregular. For example, they are nowhere differentiable and even have unbounded variation, i.e.
$$\bigvee_a^b W_t(\omega) = \sup\sum_j|W_{t_{j+1}} - W_{t_j}| = \infty, \qquad P\text{-a.s.},$$
where the supremum is taken over all partitions $a\le t_1 < t_2 < \dots < t_n\le b$.

The Wiener process is the main ingredient of many probabilistic and statistical models in physics and engineering. Due to the unusual properties of its trajectories it is also a fascinating object for mathematical research. A quick simulation of the scaled random walk is sketched below.
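A minimal sketch of the scaled random walk $W^n$ (the step distribution is an illustrative choice; any zero mean, unit variance law will do):

```python
import numpy as np

rng = np.random.default_rng(1)

def scaled_random_walk(n, xi_sampler):
    """One path of W^n_t = n^{-1/2} * sum_{j <= tn} xi_j on the grid t = k/n, 0 <= t <= 1."""
    xi = xi_sampler(n)                  # i.i.d., zero mean, unit variance
    return np.concatenate(([0.0], np.cumsum(xi) / np.sqrt(n)))

# e.g. Rademacher steps; the path resembles a Wiener path more and more as n grows
path = scaled_random_walk(10_000, lambda n: rng.choice([-1.0, 1.0], size=n))
print(path[5_000], path[-1])            # roughly N(0, 0.5) and N(0, 1) samples
```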

3. Ito stochastic integral with respect to Wiener process

In many engineering applications systems are subject to random perturbations (noise). In particular, it is customary to deal with differential equations
$$\dot X_t = a(t,X_t) + b(t,X_t)\nu_t, \qquad X_0 = x, \tag{3.1}$$
where ν is the so called "white" noise process with intensity $N_0$, i.e. a process with zero mean and Dirac δ correlation function: $E\nu_t\nu_s = N_0\delta(t-s)$. The spectral density of $\nu_t$, being the Fourier transform of δ, is constant, which is the origin of the term "white".

One can try to construct such a process by formal differentiation of the Wiener process:
$$E\dot W_t\dot W_s = \frac{\partial^2}{\partial t\partial s}EW_sW_t = \frac{\partial^2}{\partial t\partial s}(t\wedge s) = \delta(t-s). \tag{3.2}$$

Exercise 3.1. Show that for e.g. $t > s > 0$ and continuous $f$
$$\int_0^\infty\frac{\partial^2}{\partial t\partial s}(s\wedge t)f(s)ds = f(t),$$
i.e. the latter equality in (3.2) formally holds.

However, as was mentioned above, the trajectories of the Wiener process are nowhere differentiable and thus such a definition is incorrect. Note that ν does not make sense physically either: it has infinite variance, which contradicts engineering intuition and practice.

The next step is to interpret (3.1) as an integral equation, i.e.
$$X_t = X_0 + \int_0^t a(s,X_s)ds + \int_0^t b(s,X_s)dW_s. \tag{3.3}$$
This construction also fails, because the last term cannot be defined as a Lebesgue-Stieltjes integral, due to the unbounded variation of the trajectories of W.

Example 3.2. Let $0 = t_0 < t_1 < \dots < t_n = T$ be a partition of $[0,T]$ and set
$$I_n = \sum_{i=0}^{n-1}W_{t_i}\bigl(W_{t_{i+1}} - W_{t_i}\bigr), \qquad S_n = \sum_{i=0}^{n-1}W_{t_{i+1}}\bigl(W_{t_{i+1}} - W_{t_i}\bigr).$$
The only difference between these random variables is the point at which the integrand $W_t$ is sampled in each interval $(t_i, t_{i+1}]$. While for Stieltjes integrable functions this would not influence the limit as $n\to\infty$, for the highly irregular trajectories of W it turns out to be a crucial matter. Clearly for any $n\ge1$
$$EI_n = 0, \qquad ES_n = T,$$
and thus the limits (if they exist in some sense) would not coincide! A numerical illustration is given below.
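A minimal Monte Carlo sketch of this effect on a uniform partition (all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
T, n, n_mc = 1.0, 1000, 5000

I_vals, S_vals = [], []
for _ in range(n_mc):
    dW = rng.normal(0.0, np.sqrt(T / n), size=n)       # Wiener increments
    W = np.concatenate(([0.0], np.cumsum(dW)))
    I_vals.append(np.sum(W[:-1] * dW))                  # left-point (Ito) sums
    S_vals.append(np.sum(W[1:]  * dW))                  # right-point sums
print(np.mean(I_vals), "~ 0")
print(np.mean(S_vals), "~ T =", T)
```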

Nevertheless (3.3) makes sense if the integral with respect to the Wiener process is well defined for a certain class of random processes. The ultimate construction of such an integral was given by K. Ito.

All the random objects below are defined on some probability space $(\Omega,\mathcal{F},P)$. Let $\mathcal{F}^W_t$ be the sub-σ-algebras of $\mathcal{F}$ generated by the process W, i.e. $\mathcal{F}^W_t = \sigma\{W_s, s\le t\}$. Clearly $\mathcal{F}^W_t$ is an increasing family of σ-algebras, which is called a filtration. The random process $X = (X_t)_{0\le t\le T}$ is said to be adapted with respect to the filtration $\mathcal{F}^W_t$ if for any fixed t, $X_t$ is $\mathcal{F}^W_t$-measurable.

The idea of the construction is to define the integral for a class of simpler random processes, which is sufficiently rich to approximate more complex processes of interest. It turns out that any adapted random process X, satisfying⁴
$$E\int_0^T X^2_s ds < \infty, \tag{3.4}$$
can be approximated by piecewise constant (or simple) processes of the form⁵
$$X^n_t = \sum_{j=1}^n\alpha_j 1_{(t_j,t_{j+1}]}(t), \tag{3.5}$$
where $(t_j)_{j\le n}$ is some partition (depending on n) and $(\alpha_j)_{j\le n}$ is a sequence of random variables, such that $\alpha_j$ is $\mathcal{F}^W_{t_j}$-measurable. The approximation holds in the sense that
$$\lim_{n\to\infty}E\int_0^T(X_s - X^n_s)^2ds = 0.$$
Since in $L^2$ any converging sequence is also fundamental, it follows that
$$\lim_{n,m\to\infty}E\int_0^T(X^m_s - X^n_s)^2ds = 0. \tag{3.6}$$

⁴ the Ito integral can also be defined under the much weaker assumption $P\bigl(\int_0^T X^2_s ds < \infty\bigr) = 1$.
⁵ For example, if $X_t$ has continuous paths, then $X^n_t = X_{t_j}$ for $t\in(t_j,t_{j+1}]$ can be taken - see e.g. (5) for a more detailed account.


For a simple process the Ito integral is defined as
$$I_T(X^n) := \sum_{j=1}^n\alpha_j\bigl[W_{t_{j+1}} - W_{t_j}\bigr]$$
and is denoted by $\int_0^T X^n_s dW_s$. Note that
$$E\bigl(I_T(X^n)\bigr)^2 = E\sum_{i=1}^n\sum_{j=1}^n\alpha_j\alpha_i\bigl[W_{t_{j+1}}-W_{t_j}\bigr]\bigl[W_{t_{i+1}}-W_{t_i}\bigr] = E\sum_{j=1}^n\alpha_j^2\,E\bigl(\bigl[W_{t_{j+1}}-W_{t_j}\bigr]^2\big|\mathcal{F}^W_{t_j}\bigr) + E\sum_{i\ne j}\alpha_i\alpha_j\,E\bigl(\bigl[W_{t_{j+1}}-W_{t_j}\bigr]\bigl[W_{t_{i+1}}-W_{t_i}\bigr]\big|\mathcal{F}^W_{t_i\vee t_j}\bigr) = \sum_{j=1}^nE\alpha_j^2[t_{j+1}-t_j] = \int_0^T E\bigl(X^n_s\bigr)^2ds.$$
This property and (3.6) give (note that $X^n_s - X^m_s$ is again a simple process of the form (3.5))
$$E\bigl(I_T(X^n) - I_T(X^m)\bigr)^2 = E\bigl(I_T(X^n - X^m)\bigr)^2 = \int_0^T E(X^n_s - X^m_s)^2ds \xrightarrow{n,m\to\infty} 0,$$
which means that the sequence $I_T(X^n)$ is fundamental in $L^2$ and thus converges in $L^2$ to a limit $I_T(X)$ (the $P$-a.s. convergence can be verified as well). This limit is defined to be the Ito integral of X and is denoted by
$$I_T(X) = \int_0^T X_s dW_s.$$
The Ito integral can be considered as a stochastic process with time parameter t:
$$I_t(X) = \int_0^t X_s dW_s := \int_0^T 1_{[0,t]}(s)X_s dW_s.$$

Remark 3.3. This construction assumed that $X_t$ is $\mathcal{F}^W_t$-adapted, i.e. $X_t$ is $\mathcal{F}^W_t$-measurable for any t. Clearly all the arguments can be replicated if $X_t$ is $\mathcal{F}_t$-adapted for some larger filtration $\mathcal{F}_t\supseteq\mathcal{F}^W_t$. This extends the definition of the stochastic integral to more general integrands, e.g. $\int_0^t W_s dV_s$ where V and W are independent Wiener processes.

3.1. Properties of the Ito integral. Consider the following basic properties of the Ito integral:
(1) $I_t(aX + bY) = aI_t(X) + bI_t(Y)$;
(2) $E(I_t(X)|\mathcal{F}^W_s) = I_s(X)$ for $t > s$, and in particular $EI_t(X) = 0$;
(3) $EI_s(X)I_t(Y) = \int_0^{t\wedge s}EX_uY_u\,du$, and in particular $EI^2_t(X) = \int_0^t EX^2_s\,ds$;
(4) $I_t(X)$ has continuous trajectories for $0\le t\le T$.
The first three properties are verified for simple processes, and then their validity for more complex processes is established by passing to the limit. For example, let $X^n$ and $Y^n$ be the approximations of X and Y. The process $aX^n + bY^n$ is clearly simple (and adapted) and thus $I_t(aX^n + bY^n) = aI_t(X^n) + bI_t(Y^n)$. So (1) holds, since $aI_t(X^n) + bI_t(Y^n)\xrightarrow[n\to\infty]{L^2}aI_t(X) + bI_t(Y)$. The fourth property will not be proved here (note that it also holds for simple processes!).

4. Stochastic differential equations

Consider the integral equation
$$X_t = X_0 + \int_0^t a(s,X_s)ds + \int_0^t b(s,X_s)dW_s, \qquad 0\le t\le T, \tag{4.1}$$
where $X_0$ is a random variable, $a(s,x)$ and $b(s,x)$ are $\mathbb{R}_+\times\mathbb{R}\mapsto\mathbb{R}$ functions and W is the Wiener process. Assume that $X_0$ and W are independent. The integration w.r.t. "ds" is understood in the usual sense (i.e. Riemann or Lebesgue).

Definition 4.1. A non-anticipating (adapted) process X is a (strong) solution of (4.1) if
$$P\Bigl(\int_0^T|a(s,X_s)|ds < \infty\Bigr) = 1, \qquad P\Bigl(\int_0^T[b(s,X_s)]^2ds < \infty\Bigr) = 1,$$
and (4.1) holds P-a.s.

It can be shown, e.g., that if $a(t,x)$ and $b(t,x)$ satisfy the Lipschitz condition
$$[a(t,y) - a(t,y')]^2 + [b(t,y) - b(t,y')]^2\le L[y - y']^2, \qquad t\in[0,T],$$
with some constant L, and grow not faster than linearly,
$$a^2(t,y) + b^2(t,y)\le L(1 + y^2),$$
then (4.1) has a unique solution. Usually the equation (4.1) is written in the differential form, in the spirit of regular calculus,
$$dX_t = a(t,X_t)dt + b(t,X_t)dW_t$$
subject to $X_0 = x$. The process generated by this stochastic differential equation is called a diffusion with drift $a(t,x)$ and diffusion coefficient $b(t,x)$. A simple numerical scheme for simulating such equations is sketched below.
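A minimal sketch of the standard Euler-Maruyama discretization of such an equation (the scheme is an illustration and is not discussed in the text above; the example coefficients are arbitrary):

```python
import numpy as np

def euler_maruyama(a, b, x0, T, n, rng):
    """One approximate path of dX_t = a(t, X_t) dt + b(t, X_t) dW_t on [0, T],
    obtained with n Euler-Maruyama steps."""
    dt = T / n
    t = np.linspace(0.0, T, n + 1)
    X = np.empty(n + 1)
    X[0] = x0
    for k in range(n):
        dW = rng.normal(0.0, np.sqrt(dt))           # Wiener increment over [t_k, t_{k+1}]
        X[k + 1] = X[k] + a(t[k], X[k]) * dt + b(t[k], X[k]) * dW
    return t, X

rng = np.random.default_rng(3)
# e.g. the mean-reverting equation dX_t = -X_t dt + dW_t
t, X = euler_maruyama(lambda t, x: -x, lambda t, x: 1.0, 1.0, 10.0, 10_000, rng)
```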

4.1. The Ito formula.

4.1.1. Scalar case. Consider the random process $\xi = (\xi_t)_{0\le t\le T}$ which has the Ito differential
$$d\xi_t = a(t,\xi_t)dt + b(t,\xi_t)dW_t, \tag{4.2}$$
where a and b are functions satisfying the appropriate properties. It turns out that the process $\zeta_t = f(t,\xi_t)$ also has an Ito differential, if f is sufficiently smooth.

Theorem 4.2. (Ito formula) Let the function $f(t,x)$ be continuous and have continuous partial derivatives $f'_t(t,x)$, $f'_x(t,x)$ and $f''_{xx}(t,x)$. Assume that the random process ξ has the stochastic differential (4.2). Then the process $f(t,\xi_t)$ also has a stochastic differential and
$$df(t,\xi_t) = \Bigl[f'_t(t,\xi_t) + f'_x(t,\xi_t)a(t,\xi_t) + \tfrac12 f''_{xx}(t,\xi_t)b^2(t,\xi_t)\Bigr]dt + f'_x(t,\xi_t)b(t,\xi_t)dW_t. \tag{4.3}$$


Proof. (heuristic sketch) Since f is twice differentiable, for any partition $\{t_j\}$ of $[0,T]$
$$f(T,\xi_T) - f(0,\xi_0) = \sum_{j=1}^n\bigl[f(t_{j+1},\xi_{t_{j+1}}) - f(t_j,\xi_{t_j})\bigr] = \sum_{j=1}^n f'_t(t_j,\xi_{t_j})[t_{j+1}-t_j] + \sum_{j=1}^n f'_x(t_j,\xi_{t_j})\bigl[\xi_{t_{j+1}}-\xi_{t_j}\bigr] + \frac12\sum_{j=1}^n f''_{xx}(t_j,\xi_{t_j})\bigl[\xi_{t_{j+1}}-\xi_{t_j}\bigr]^2 + r_n, \tag{4.4}$$
where the residual term $r_n$ contains sums of higher powers of $[t_{j+1}-t_j]$ and $[\xi_{t_{j+1}}-\xi_{t_j}]$. Denote $f_j = f(t_j,\xi_{t_j})$, etc. for brevity. The first term in (4.4) converges to $\int_0^T f'_t(s,\xi_s)ds$, while the second term gives
$$\sum_{j=1}^n f'_{xj}a_j[t_{j+1}-t_j] + \sum_{j=1}^n f'_{xj}b_j\bigl[W_{t_{j+1}}-W_{t_j}\bigr] \xrightarrow{n\to\infty} \int_0^T a(s,\xi_s)f'_x(s,\xi_s)ds + \int_0^T b(s,\xi_s)f'_x(s,\xi_s)dW_s,$$
where the latter integral is in the sense of Ito. The last term in (4.4) gives
$$\frac12\sum_{j=1}^n f''_{xxj}\bigl[\xi_{t_{j+1}}-\xi_{t_j}\bigr]^2 = \frac12\sum_{j=1}^n f''_{xxj}a_j^2[t_{j+1}-t_j]^2 + \sum_{j=1}^n f''_{xxj}a_jb_j\bigl[W_{t_{j+1}}-W_{t_j}\bigr][t_{j+1}-t_j] + \frac12\sum_{j=1}^n f''_{xxj}b_j^2\bigl[W_{t_{j+1}}-W_{t_j}\bigr]^2 := J_1 + J_2 + J_3.$$
The term $J_1$ converges to zero as $\max_j|t_{j+1}-t_j|\to 0$:
$$\Bigl|\sum_{j=1}^n f''_{xxj}a_j^2[t_{j+1}-t_j]^2\Bigr| \le \max_j\bigl|t_{j+1}-t_j\bigr|\sum_{j=1}^n\bigl|f''_{xxj}\bigr|a_j^2[t_{j+1}-t_j],$$
since $\sum_{j=1}^n f''_{xxj}a_j^2[t_{j+1}-t_j]$ converges to the integral $\int_0^T f''_{xx}(s,\xi_s)a^2(s,\xi_s)ds$. The term $J_2$ also converges to zero in $L^2$:
$$E\Bigl(\sum_{j=1}^n f''_{xxj}a_jb_j\bigl[W_{t_{j+1}}-W_{t_j}\bigr][t_{j+1}-t_j]\Bigr)^2 = E\sum_{\ell=1}^n\sum_{j=1}^n f''_{xx\ell}f''_{xxj}a_jb_ja_\ell b_\ell[t_{\ell+1}-t_\ell][t_{j+1}-t_j]\,E\bigl(\bigl[W_{t_{\ell+1}}-W_{t_\ell}\bigr]\bigl[W_{t_{j+1}}-W_{t_j}\bigr]\big|\mathcal{F}^W_{t_j\vee t_\ell}\bigr) = E\sum_{j=1}^n\bigl(f''_{xxj}a_jb_j\bigr)^2[t_{j+1}-t_j]^3 \xrightarrow{n\to\infty} 0.$$
The third term $J_3$ converges to a nonzero limit. Indeed,
$$E\Bigl(\sum_{j=1}^n f''_{xxj}b_j^2\bigl[W_{t_{j+1}}-W_{t_j}\bigr]^2 - \sum_{j=1}^n f''_{xxj}b_j^2[t_{j+1}-t_j]\Bigr)^2 = E\Bigl(\sum_{j=1}^n f''_{xxj}b_j^2\Bigl\{\bigl[W_{t_{j+1}}-W_{t_j}\bigr]^2 - [t_{j+1}-t_j]\Bigr\}\Bigr)^2 = E\sum_{j=1}^n\bigl(f''_{xxj}\bigr)^2b_j^4\,2[t_{j+1}-t_j]^2 \xrightarrow{n\to\infty} 0,$$
since the cross terms vanish by conditioning, whereas
$$\sum_{j=1}^n f''_{xxj}b_j^2[t_{j+1}-t_j] \xrightarrow{n\to\infty} \int_0^T f''_{xx}(s,\xi_s)b^2(s,\xi_s)ds.$$
Similarly the residual term $r_n$ can be shown to vanish as $n\to\infty$, and thus the right hand side of (4.4) takes the integral form of the required formula as $n\to\infty$. $\square$

Remark 4.3. Note that if $W_t$ had finite variation, then by the chain rule of classical calculus one would obtain
$$df(t,X_t) = f'_t(t,X_t)dt + f'_x(t,X_t)dX_t = \bigl[f'_t(t,X_t) + f'_x(t,X_t)a(t,X_t)\bigr]dt + f'_x(t,X_t)b(t,X_t)dW_t,$$
i.e. the Ito formula has the "non classical" extra term $\tfrac12 f''_{xx}b^2\,dt$!

Example 4.4. Let $f(t,x) = x^2$ and $X_t\equiv W_t$, i.e. $a\equiv0$ and $b\equiv1$. Then by the Ito formula
$$d(W^2_t) = 2W_tdW_t + \tfrac12\cdot2\,dt,$$
which actually means that
$$W^2_t = 2\int_0^t W_sdW_s + t.$$

Example 4.5. Consider the Ornstein-Uhlenbeck process X satisfying the SDE
$$dX_t = a_tX_tdt + b_tdW_t,$$
subject to the random initial condition $X_0 = \eta$, where $a_t$ and $b_t$ are deterministic functions. Let $m_t = EX_t$ and $V_t = E(X_t - m_t)^2$. Recall that the differential equation is nothing but notation for
$$X_t = \eta + \int_0^t a_sX_sds + \int_0^t b_sdW_s.$$
Applying the expectation to the latter equality, we obtain
$$m_t = E\eta + \int_0^t a_sm_sds,$$
or, in other words, $\dot m_t = a_tm_t$, $m_0 = E\eta$.
Let $\Delta_t = X_t - m_t$, then
$$\Delta_t = \eta - E\eta + \int_0^t a_s\Delta_sds + \int_0^t b_sdW_s.$$
Now apply the Ito formula to $\Delta^2_t$:
$$d(\Delta^2_t) = 2\Delta^2_ta_tdt + 2\Delta_tb_tdW_t + b^2_tdt,$$
which stands for
$$\Delta^2_t = \Delta^2_0 + \int_0^t(2a_s\Delta^2_s + b^2_s)ds + \int_0^t 2\Delta_sb_sdW_s.$$
Taking the expectation of both sides, we obtain
$$V_t = E(\eta - E\eta)^2 + \int_0^t(2a_sV_s + b^2_s)ds,$$
or $\dot V_t = 2a_tV_t + b^2_t$, $V_0 = \operatorname{var}(\eta)$.
Note that if $a_t\equiv -a < 0$ and $b_t\equiv b > 0$, then $m_t\to 0$ and $V_t\to b^2/(2a)$ as $t\to\infty$. In engineering terms this means that the output of a low pass filter, driven by "white" noise, has power $b^2/(2a)$ in the steady state.

Since this system is linear, the probability distribution of $X_t$ has a Gaussian density, if η is a Gaussian random variable:
$$p_t(x) = \frac{d}{dx}P(X_t\le x) = \frac{1}{\sqrt{2\pi V_t}}\exp\Bigl\{-\frac{(x-m_t)^2}{2V_t}\Bigr\}.$$
The solution of this equation can be found explicitly:
$$X_t = e^{\int_0^t a_sds}\Bigl(X_0 + \int_0^t b_se^{-\int_0^s a_udu}dW_s\Bigr),$$
which is verified by the (vector, see the following section) Ito formula applied to $X_t$.
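A minimal sketch comparing the moment equations $\dot m_t = am_t$, $\dot V_t = 2aV_t + b^2$ with a Monte Carlo simulation of the SDE, for constant coefficients (all numbers, and the Euler schemes used, are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
a_, b_, T, n, n_paths = -1.0, 0.5, 5.0, 1000, 20000   # constant a_t = a_, b_t = b_
dt = T / n

# Monte Carlo: Euler-Maruyama paths of dX = a X dt + b dW, X_0 = eta ~ N(1, 0.25)
X = rng.normal(1.0, 0.5, size=n_paths)
for _ in range(n):
    X = X + a_ * X * dt + b_ * rng.normal(0.0, np.sqrt(dt), size=n_paths)

# Moment ODEs: m' = a m, V' = 2 a V + b^2 (forward Euler)
m, V = 1.0, 0.25
for _ in range(n):
    m, V = m + a_ * m * dt, V + (2 * a_ * V + b_**2) * dt

print(X.mean(), "~", m, ";", X.var(), "~", V,
      "; steady state b^2/(2|a|) =", b_**2 / (2 * abs(a_)))
```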

4.1.2. Vector case. Consider now a vector diffusion
$$d\xi_t = a(t,\xi_t)dt + b(t,\xi_t)dW_t,$$
where $a(t,x):\mathbb{R}_+\times\mathbb{R}^d\mapsto\mathbb{R}^d$, $b(t,x):\mathbb{R}_+\times\mathbb{R}^d\mapsto\mathbb{R}^{d\times m}$, and W is a vector of m independent Wiener processes.

Theorem 4.6. (vector Ito formula) Let $f(t,x)$ be an $\mathbb{R}_+\times\mathbb{R}^d\mapsto\mathbb{R}$ continuous function with continuous derivatives $f'_t$, $f'_{x_j}$ and $f''_{x_ix_j}$. Then the process $f(t,\xi_t)$ has the stochastic Ito differential
$$df(t,\xi_t) = \Bigl[f'_t(t,\xi_t) + \sum_{i=1}^d f'_{x_i}(t,\xi_t)a_i(t,\xi_t) + \frac12\sum_{i,j=1}^d f''_{x_ix_j}(t,\xi_t)\sum_{k=1}^m b_{ik}(t,\xi_t)b_{jk}(t,\xi_t)\Bigr]dt + \sum_{i=1}^d\sum_{j=1}^m f'_{x_i}(t,\xi_t)b_{ij}(t,\xi_t)dW_t(j). \tag{4.5}$$

Remark 4.7. The equation (4.5) can be written compactly as
$$df(t,\xi_t) = \Bigl[f'_t(t,\xi_t) + \nabla^*f(t,\xi_t)a(t,\xi_t) + \frac12\bigl(\nabla^*b(t,\xi_t)b^*(t,\xi_t)\nabla\bigr)f(t,\xi_t)\Bigr]dt + \nabla^*f(t,\xi_t)b(t,\xi_t)dW_t, \tag{4.6}$$
where $\nabla f$ is the column gradient vector and the differential operator $\nabla^*bb^*\nabla$ is determined by the formal rules of multiplication.

Example 4.8. Consider the linear system
$$dX_t = aX_tdt + bdW_t, \qquad X_0 = \eta,$$
where η is a square integrable random vector, a and b are $d\times d$ and $d\times m$ matrices and $W_t$ is a vector Wiener process of dimension m. Let us find the equations for $m_t = EX_t$ and $P_t = \operatorname{cov}(X_t)$. Taking the expectation of both sides one immediately obtains the differential equation for $m_t = EX_t$:
$$\dot m_t = am_t, \qquad m_0 = E\eta.$$
The process $D_t = X_t - m_t$ satisfies
$$dD_t = aD_tdt + bdW_t, \qquad D_0 = \eta - E\eta.$$
Let $f_{pq}(x) = x_px_q$, $x\in\mathbb{R}^d$, and apply the vector Ito formula to $\Gamma^{pq}_t = f_{pq}(D_t)$:
$$d\Gamma^{pq}_t = D^p_tdD^q_t + D^q_tdD^p_t + \frac12\sum_{i,j=1}^d\bigl[\delta(i=p,j=q) + \delta(i=q,j=p)\bigr]\sum_{k=1}^m b_{ik}b_{jk}dt =$$
$$D^p_t\sum_{i=1}^d a_{qi}D^i_tdt + D^p_t\sum_{j=1}^m b_{qj}dW^j_t + D^q_t\sum_{i=1}^d a_{pi}D^i_tdt + D^q_t\sum_{j=1}^m b_{pj}dW^j_t + \frac12\Bigl[\sum_{k=1}^m b_{pk}b_{qk}dt + \sum_{k=1}^m b_{qk}b_{pk}dt\Bigr] =$$
$$\sum_{i=1}^d a_{qi}\Gamma^{pi}_tdt + D^p_t\sum_{j=1}^m b_{qj}dW^j_t + \sum_{i=1}^d a_{pi}\Gamma^{qi}_tdt + D^q_t\sum_{j=1}^m b_{pj}dW^j_t + \sum_{k=1}^m b_{pk}b_{qk}dt.$$
Taking the expectation of the latter equation one gets
$$dP^{pq}_t = \sum_{i=1}^d a_{qi}P^{pi}_tdt + \sum_{i=1}^d a_{pi}P^{qi}_tdt + \sum_{k=1}^m b_{pk}b_{qk}dt$$
for $P^{pq}_t = E\Gamma^{pq}_t$, or in matrix form
$$\dot P_t = aP^*_t + P_ta^* + bb^* = aP_t + P_ta^* + bb^*, \qquad P_0 = \operatorname{cov}(\eta), \tag{4.7}$$
where $P_t = ED_tD^*_t$. The Lyapunov equation (4.7) is linear and it can be further analyzed to verify whether $X_t$ has a non degenerate Gaussian density, whether this density stabilizes as $t\to\infty$, etc.
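A minimal sketch integrating the Lyapunov equation (4.7) by forward Euler (the example matrices are illustrative; for a stable a the solution approaches the stationary covariance solving $aP + Pa^* + bb^* = 0$):

```python
import numpy as np

def lyapunov_ode(a, b, P0, T, n):
    """Forward-Euler integration of dP/dt = a P + P a^* + b b^*, eq. (4.7)."""
    P, dt = P0.copy(), T / n
    for _ in range(n):
        P = P + (a @ P + P @ a.T + b @ b.T) * dt
    return P

# e.g. a stable 2x2 system (eigenvalues -1 and -2)
a = np.array([[0.0, 1.0], [-2.0, -3.0]])
b = np.array([[0.0], [1.0]])
P_inf = lyapunov_ode(a, b, np.zeros((2, 2)), T=20.0, n=20000)
print(P_inf)   # close to the stationary covariance for large T
```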

The vector Ito formula can be conveniently remembered as follows. Let $X_t$ and $Y_t$ be a pair of Ito processes with differentials
$$dX_t = a_1(X_t,Y_t)dt + b_{11}(X_t,Y_t)dW_t + b_{12}(X_t,Y_t)dW'_t$$
$$dY_t = a_2(X_t,Y_t)dt + b_{21}(X_t,Y_t)dW_t + b_{22}(X_t,Y_t)dW'_t,$$
where $W'$ and W are independent Wiener processes. Let $f(t,x,y)$ be a real function of three arguments, sufficiently differentiable so that the Ito formula is applicable.

Use the regular calculus rules and a formal Taylor expansion up to order two to write
$$df(t,X_t,Y_t) = f_tdt + f_xdX_t + f_ydY_t + \tfrac12 f_{xx}(dX_t)^2 + f_{xy}(dX_tdY_t) + \tfrac12 f_{yy}(dY_t)^2,$$
where, e.g.,
$$f_{xx} = \frac{\partial^2}{\partial x^2}f(t,x,y)\Big|_{x:=X_t,\,y:=Y_t}, \quad\text{etc.}$$
To proceed use the following "multiplication" table:

    .       dt    dW_t    dW'_t
    dt      0     0       0
    dW_t    0     dt      0
    dW'_t   0     0       dt

For example,
$$(dX_t)^2 = \bigl(a_1dt + b_{11}dW_t + b_{12}dW'_t\bigr)^2 = a_1^2(dt)^2 + b_{11}^2(dW_t)^2 + b_{12}^2(dW'_t)^2 + 2a_1b_{11}(dW_tdt) + 2a_1b_{12}(dW'_tdt) + 2b_{11}b_{12}(dW_tdW'_t) = b_{11}^2dt + b_{12}^2dt.$$
Similar calculations lead to
$$df(t,X_t,Y_t) = f_tdt + f_xdX_t + f_ydY_t + \tfrac12 f_{xx}(b_{11}^2 + b_{12}^2)dt + f_{xy}(b_{11}b_{21} + b_{12}b_{22})dt + \tfrac12 f_{yy}(b_{21}^2 + b_{22}^2)dt.$$
Verify that the proposed procedure leads to the correct answer (suggested by the formal Ito formula).

Example 4.9. Consider the two dimensional Ito system
$$du'_t = -\sin(\xi_t)dW_t + \cos(\xi_t)dV_t$$
$$du''_t = \cos(\xi_t)dW_t + \sin(\xi_t)dV_t,$$
subject to $u'_0 = u''_0 = 0$, where W and V are independent Wiener processes and $\xi_t$ is some $\mathcal{F}^{V,W}_t$-adapted process. Let us show that $u'_t$ and $u''_t$ are independent Gaussian random variables for any fixed $t > 0$, i.e.
$$E\bigl(\exp\{i\lambda u'_t + i\mu u''_t\}\bigr) = e^{-\lambda^2t/2}e^{-\mu^2t/2}. \tag{4.8}$$
Apply the Ito formula to the function $\varphi_t = \exp\{i\lambda u'_t + i\mu u''_t\}$:
$$d\varphi_t = \varphi_t(i\lambda du'_t + i\mu du''_t) - \tfrac12\lambda^2\varphi_t[\sin^2(\xi_t) + \cos^2(\xi_t)]dt - \tfrac12\mu^2\varphi_t[\cos^2(\xi_t) + \sin^2(\xi_t)]dt - \varphi_t\lambda\mu[-\sin(\xi_t)\cos(\xi_t) + \cos(\xi_t)\sin(\xi_t)]dt = \varphi_t(i\lambda du'_t + i\mu du''_t) - \tfrac12[\lambda^2 + \mu^2]\varphi_tdt.$$
The characteristic function $\psi_t = E\varphi_t$ then satisfies
$$\dot\psi_t = -\tfrac12[\lambda^2 + \mu^2]\psi_t$$
subject to $\psi_0 = 1$, and thus is given by $\psi_t = e^{-t\lambda^2/2 - t\mu^2/2}$. In fact, similarly it can be verified that
$$E\bigl(\exp\{i\lambda(u'_t - u'_s) + i\mu(u''_t - u''_s)\}\big|\mathcal{F}^W_s\vee\mathcal{F}^V_s\bigr) = e^{-\lambda^2(t-s)/2}e^{-\mu^2(t-s)/2},$$
which implies that $u'$ and $u''$ are independent Wiener processes, since both have continuous trajectories.

Example 4.10. What SDE does the process
$$\xi_t = \frac{e^{W_t}}{1 + \int_0^t e^{W_s}ds}$$
satisfy ($W_t$ being the Wiener process)? Let $X_t = e^{W_t}$ and $Y_t = \int_0^t e^{W_s}ds$, then
$$dX_t = e^{W_t}dW_t + \tfrac12 e^{W_t}dt = \tfrac12 X_tdt + X_tdW_t$$
and
$$dY_t = e^{W_t}dt = X_tdt.$$
Denote $\xi_t = f(X_t,Y_t)$, where $f = x/(1+y)$. Clearly
$$f_x = \frac{1}{1+y}, \qquad f_y = \frac{-x}{(1+y)^2}.$$
By the vector Ito formula (the second order terms vanish here, since $f_{xx} = 0$ and $dY_t$ contains no $dW_t$ term)
$$d\xi_t = \frac{1}{1+Y_t}dX_t - \frac{X_t}{(1+Y_t)^2}dY_t = \frac{1}{1+Y_t}\Bigl(\tfrac12 X_tdt + X_tdW_t\Bigr) - \frac{X_t}{(1+Y_t)^2}X_tdt = \tfrac12\xi_tdt + \xi_tdW_t - \xi^2_tdt = \xi_t(1/2 - \xi_t)dt + \xi_tdW_t.$$

5. Applications

5.1. PDE for the marginal density of diffusions. Consider the scalar SDE
$$dX_t = a(X_t)dt + b(X_t)dW_t, \tag{5.1}$$
where a and b are twice continuously differentiable functions and W is a Wiener process. Suppose that this equation is solved subject to $X_0 = \eta$, which has a smooth probability density $p_0(x)$. Below we give a heuristic derivation of an equation for the probability density
$$p_t(x) = \frac{\partial}{\partial x}P(X_t\le x).$$
Let us assume that this density exists and is twice continuously differentiable as well. Fix a smooth compactly supported function f: by the Ito formula
$$df(X_t) = f'(X_t)a(X_t)dt + f'(X_t)b(X_t)dW_t + \tfrac12 f''(X_t)b^2(X_t)dt.$$
Thus
$$Ef(X_t) = Ef(X_0) + \int_0^t E\Bigl\{f'(X_s)a(X_s) + \tfrac12 f''(X_s)b^2(X_s)\Bigr\}ds = \int_{\mathbb{R}}f(u)p_0(u)du + \int_0^t\int_{\mathbb{R}}\Bigl\{f'(u)a(u) + \tfrac12 f''(u)b^2(u)\Bigr\}p_s(u)du\,ds.$$
Integration by parts gives
$$\int_{\mathbb{R}}\Bigl\{f'(u)a(u) + \tfrac12 f''(u)b^2(u)\Bigr\}p_s(u)du = -\int_{\mathbb{R}}f(u)\frac{\partial}{\partial u}\bigl\{a(u)p_s(u)\bigr\}du - \int_{\mathbb{R}}\tfrac12 f'(u)\frac{\partial}{\partial u}\bigl\{b^2(u)p_s(u)\bigr\}du = -\int_{\mathbb{R}}f(u)\frac{\partial}{\partial u}\bigl\{a(u)p_s(u)\bigr\}du + \int_{\mathbb{R}}\tfrac12 f(u)\frac{\partial^2}{\partial u^2}\bigl\{b^2(u)p_s(u)\bigr\}du.$$
So
$$\int_{\mathbb{R}}f(u)p_t(u)du = \int_{\mathbb{R}}f(u)p_0(u)du + \int_0^t\int_{\mathbb{R}}f(u)\Bigl[-\frac{\partial}{\partial u}\bigl\{a(u)p_s(u)\bigr\} + \frac12\frac{\partial^2}{\partial u^2}\bigl\{b^2(u)p_s(u)\bigr\}\Bigr]du\,ds,$$
which by the arbitrariness of f implies
$$p_t(x) = p_0(x) + \int_0^t\Bigl[-\frac{\partial}{\partial x}\bigl\{a(x)p_s(x)\bigr\} + \frac12\frac{\partial^2}{\partial x^2}\bigl\{b^2(x)p_s(x)\bigr\}\Bigr]ds,$$
or, in differential form,
$$\frac{\partial}{\partial t}p_t(x) = -\frac{\partial}{\partial x}\bigl\{a(x)p_t(x)\bigr\} + \frac12\frac{\partial^2}{\partial x^2}\bigl\{b^2(x)p_t(x)\bigr\}.$$
The latter is known as the Fokker-Planck or the forward Kolmogorov equation.

Exercise 5.1. Consider the linear equation from Example 4.5, subject to a Gaussian random variable $X_0 = \eta$. The FPK PDE reads
$$\frac{\partial}{\partial t}p_t(x) = -\frac{\partial}{\partial x}\bigl\{axp_t(x)\bigr\} + \frac{b^2}{2}\frac{\partial^2}{\partial x^2}p_t(x) = -ap_t(x) - ax\frac{\partial}{\partial x}p_t(x) + \frac{b^2}{2}\frac{\partial^2}{\partial x^2}p_t(x).$$
Verify that the solution is the Gaussian density with mean $m_t$ and variance $V_t$ satisfying the equations derived in Example 4.5.
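One possible sketch of this verification uses symbolic differentiation, substituting the moment equations $\dot m_t = am_t$, $\dot V_t = 2aV_t + b^2$ from Example 4.5 (the use of SymPy here is only one way to carry out the check):

```python
import sympy as sp

t, x, a, b = sp.symbols('t x a b')
m = sp.Function('m')(t)                      # mean m_t
V = sp.Function('V')(t)                      # variance V_t
p = sp.exp(-(x - m)**2 / (2 * V)) / sp.sqrt(2 * sp.pi * V)   # Gaussian density

lhs = sp.diff(p, t)                                           # d p_t / dt
rhs = -sp.diff(a * x * p, x) + sp.Rational(1, 2) * b**2 * sp.diff(p, x, 2)

# substitute the moment ODEs m' = a m, V' = 2 a V + b^2
subs = {sp.Derivative(m, t): a * m, sp.Derivative(V, t): 2 * a * V + b**2}
residual = sp.simplify((lhs - rhs).subs(subs))
print(residual)   # expected to print 0
```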

5.2. Filtering of linear diffusions. Consider the signal/observation pair of processes $(X_t,Y_t)_{t\ge0}$ generated by the linear system
$$dX_t = aX_tdt + bdW_t$$
$$dY_t = AX_tdt + BdV_t,$$
subject to $Y_0 = X_0 = 0$, where a, A and $b, B > 0$ are known parameters and W and V are independent Wiener processes. Suppose that $X_t$ is estimated by a linear filter of the form
$$d\widehat X_t = a\widehat X_tdt + \gamma\bigl(dY_t - A\widehat X_tdt\bigr), \tag{5.2}$$
subject to $\widehat X_0 = 0$. Choose γ so that the filter is stable, in the sense that $Q_\infty(\gamma) = \lim_{t\to\infty}E\bigl(X_t - \widehat X_t\bigr)^2$ exists and is finite, and so that the steady state filtering error $Q_\infty(\gamma)$ is minimal.

Define $\Delta_t = X_t - \widehat X_t$, then
$$d\Delta_t = a\Delta_tdt + bdW_t - \gamma\bigl(A\Delta_tdt + BdV_t\bigr) = \bigl(a - \gamma A\bigr)\Delta_tdt + bdW_t - \gamma BdV_t$$

Figure 1. f(x) versus g(x): the linear function f and the concave parabola g (plot omitted).

and
$$d(\Delta^2_t) = 2\Delta_td\Delta_t + (b^2 + \gamma^2B^2)dt = 2\bigl(a - \gamma A\bigr)\Delta^2_tdt + 2\Delta_t(bdW_t - \gamma BdV_t) + (b^2 + \gamma^2B^2)dt.$$
Thus $Q_t = E\Delta^2_t$ satisfies
$$\dot Q_t = 2\bigl(a - \gamma A\bigr)Q_t + b^2 + \gamma^2B^2.$$
The latter is a linear equation and it is stable (i.e. $Q_\infty = \lim_{t\to\infty}Q_t < \infty$ exists) if $a - \gamma A < 0$. In this case $Q_\infty$ solves the linear equation
$$2\bigl(a - \gamma A\bigr)Q_\infty + b^2 + \gamma^2B^2 = 0.$$
Now γ is to be chosen to minimize $Q_\infty(\gamma)$ under the constraint $a - \gamma A < 0$. Consider the function $f(x) = 2(a - \gamma A)x + b^2 + \gamma^2B^2$ and note that (examine this claim geometrically)
$$f(x) = 2ax + b^2 - 2\gamma Ax + \gamma^2B^2 \ge 2ax + b^2 - \frac{A^2x^2}{B^2} := g(x).$$
It can be verified directly that the equation $g(x) = 0$ always has a positive root P (even when $a > 0$, i.e. when the diffusion X escapes to infinity as $t\to\infty$). Since $f(x)\ge g(x)$ and the parabola $g(x)$ is concave, it follows that $Q_\infty\ge P$ (again, refer to Figure 1 for the geometric intuition). The lower bound P is attained, i.e. $Q_\infty = P$, when both equations
$$2(a - \gamma A)Q_\infty + b^2 + \gamma^2B^2 = 0$$
$$2aQ_\infty + b^2 - \frac{A^2Q^2_\infty}{B^2} = 0$$
are simultaneously satisfied for some γ. The appropriate solution is
$$\gamma = \frac{AQ_\infty}{B^2}.$$
In this case $a - \gamma A = -(b^2 + \gamma^2B^2)/(2Q_\infty) < 0$ as required (recall that $Q_\infty > 0$). The minimal steady state error for this gain is found from
$$2aQ_\infty + b^2 - \frac{A^2Q^2_\infty}{B^2} = 0.$$

The equations derived above are a special case of the Kalman-Bucy filter:

Theorem 5.2. Let $(X,Y)$ satisfy the equations
$$dX_t = a_tX_tdt + b_tdW_t$$
$$dY_t = A_tX_tdt + B_tdV_t,$$
subject to the Gaussian vector $(X_0,Y_0)$, where $a_t$, $b_t$, $A_t$ and $B_t\ge C > 0$ are square integrable functions and W and V are independent Wiener processes, independent of $(X_0,Y_0)$. The conditional mean $\widehat X_t = E(X_t|Y^t_0)$ and the conditional covariance $Q_t = E\bigl((X_t - \widehat X_t)^2|Y^t_0\bigr)$ satisfy
$$d\widehat X_t = a_t\widehat X_tdt + \frac{A_tQ_t}{B^2_t}\bigl(dY_t - A_t\widehat X_tdt\bigr) \tag{5.3}$$
$$\dot Q_t = 2a_tQ_t + b^2_t - \frac{A^2_tQ^2_t}{B^2_t} \tag{5.4}$$
subject to $\widehat X_0 = E(X_0|Y_0)$ and $Q_0 = E(X_0 - \widehat X_0)^2$.
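A minimal Euler discretization of the system and of the equations (5.3)-(5.4), for constant coefficients (all numbers are illustrative; the time-averaged squared error along the path should be roughly comparable to the steady state value of the Riccati equation):

```python
import numpy as np

rng = np.random.default_rng(5)
a, b, A, B = -1.0, 1.0, 1.0, 0.5          # constant coefficients
T, n = 20.0, 20000
dt = T / n

X, Xhat, Q = 0.0, 0.0, 0.0                # X_0 = 0 is known, so hat X_0 = 0, Q_0 = 0
err2 = []
for _ in range(n):
    dW, dV = rng.normal(0.0, np.sqrt(dt), size=2)
    dY = A * X * dt + B * dV                                          # observation increment
    X = X + a * X * dt + b * dW                                       # signal step
    Xhat = Xhat + a * Xhat * dt + (A * Q / B**2) * (dY - A * Xhat * dt)   # eq. (5.3)
    Q = Q + (2 * a * Q + b**2 - (A * Q / B)**2) * dt                      # eq. (5.4)
    err2.append((X - Xhat)**2)

print("Riccati value Q_T:", Q)
print("empirical mean square error (second half of the path):", np.mean(err2[n // 2:]))
```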

General References on Probability Theory

[1] J.L. Doob. Stochastic Processes. Wiley, 1953.
[2] R. Durrett. Stochastic Calculus: A Practical Introduction. CRC Press, 1996.
[3] W. Feller. An Introduction to Probability Theory and Its Applications. Wiley, 1971.
[4] I.I. Gikhman and A.V. Skorokhod. Introduction to the Theory of Random Processes. Dover Publications, 1996.
[5] B. Oksendal. Stochastic Differential Equations: An Introduction with Applications. Springer, 2007.
[6] Yu.A. Rozanov. Stationary Random Processes. Holden-Day, Inc., 1967.
[7] Z. Schuss. Theory and Applications of Stochastic Differential Equations. Wiley, 1980.
[8] A. Shiryaev. Probability. Springer, 1995.

References on Nonlinear Filtering

[9] B.D.O. Anderson and J.B. Moore. Optimal Filtering. Dover Publications, 2005.
[10] R.J. Elliott, L. Aggoun, and J.B. Moore. Hidden Markov Models: Estimation and Control. Springer-Verlag, 1995.
[11] A.H. Jazwinski. Stochastic Processes and Filtering Theory. Dover Publications, 2007.
[12] R. Kalman. A new approach to linear filtering and prediction problems. Trans. ASME Ser. D. J. Basic Engrg., 82:35-45, 1960.
[13] R. Liptser and A. Shiryaev. Statistics of Random Processes: I. General Theory, II. Applications. Springer, 2000.
[14] E. Wong and B. Hajek. Stochastic Processes in Engineering Systems. Springer, 1985.