Continuous-Time Finance: Lecture Notes

Last Updated: Spring 2005

Contents

1 Brownian Motion
  1.1 Axioms of Probability
  1.2 Random Variables
  1.3 Random Vectors
      1.3.1 Random Vectors Generalia
      1.3.2 Marginal Distributions
      1.3.3 Expectations, etc.
      1.3.4 The Variance-Covariance Matrix
      1.3.5 Independence
      1.3.6 Functions of Random Vectors
  1.4 The Multivariate Normal Distribution
      1.4.1 The Definition
      1.4.2 The Bivariate Case
  1.5 Brownian Motion Defined
      1.5.1 Stochastic Processes
      1.5.2 The Distribution of a Stochastic Process
      1.5.3 Brownian Motion
      1.5.4 A Little History
      1.5.5 Gaussian Processes
      1.5.6 Examples of Gaussian Processes
      1.5.7 Brownian Motion as a Limit of Random Walks
  1.6 Simulation
      1.6.1 Random Number Generators
      1.6.2 Simulation of Random Variables
      1.6.3 Simulation of Normal Random Vectors
      1.6.4 Simulation of the Brownian Motion and Gaussian Processes
      1.6.5 Monte Carlo Integration
  1.7 Conditioning
      1.7.1 Conditional Densities
      1.7.2 Other Conditional Quantities
      1.7.3 σ-algebra = Amount of Information
      1.7.4 Filtrations
      1.7.5 Conditional Expectation in Discrete Time
      1.7.6 A Formula for Computing Conditional Expectations
      1.7.7 Further properties of Conditional Expectation
      1.7.8 What happens in continuous time?
      1.7.9 Martingales
      1.7.10 Martingales as Unwinnable Gambles

2 Stochastic Calculus
  2.1 Stochastic Integration
      2.1.1 What do we really want?
      2.1.2 Stochastic Integration with respect to Brownian Motion
      2.1.3 The Ito Formula
      2.1.4 Some properties of stochastic integrals
      2.1.5 Ito processes
  2.2 Applications to Finance
      2.2.1 Modelling Trading and Wealth
      2.2.2 Derivatives
      2.2.3 Arbitrage Pricing
      2.2.4 Black and Scholes via PDEs
      2.2.5 The "Martingale method" for pricing options

A Odds and Ends
  A.1 No P for all Subsets of [0, 1]
  A.2 Marginals: Normal, Joint: Not
  A.3 Mersenne Twister
  A.4 Conditional Expectation Cheat-Sheet


Chapter 1: Brownian Motion

The probable is what usually happens.

—Aristotle

It is a truth very certain that when it is not in our power to determine what is true we ought to follow what is most probable.

—Descartes - “Discourse on Method”

It is remarkable that a science which began with the consideration of games of chance should have become the most important object of human knowledge.

—Pierre Simon Laplace - "Théorie Analytique des Probabilités", 1812

Anyone who considers arithmetic methods of producing random digits is, of course, in a state of sin.

—John von Neumann - quote in “Conic Sections” by D. MacHale

I say unto you: a man must have chaos yet within him to be able to give birth to a dancing star. I say unto you: ye have chaos yet within you . . .

—Friedrich Nietzsche - “Thus Spake Zarathustra”

1.1 Axioms of Probability

All of probability starts with a set, the state space, which we usually denote by Ω. Every element ω ∈ Ω stands for a state of the world, i.e. represents a possible development of the events of interest.

Example 1.1.1. The universe of Derek the Daisy is very simple. All that really matters to him is whether it rains or shines. Therefore, his Ω has two elements: Ω = {R, S}, where R stands for rain and S for sun. But even daisies live more than one day, so he might be interested in what happens tomorrow. In that case, we should have Ω = {(R,R), (R,S), (S,R), (S,S)} - there are four possible states of the world now, and each ω ∈ Ω is an ordered pair.

Figure 1.1: Derek the Daisy

From the previous example, it is clear how to picture Derek as an even wiser daisy, looking 3, 4, or even 100 days in advance. It is also clear how to add snow, the presence of other daisies or even solar eclipses into the model; you would just list all the variables of interest and account for all the possibilities. There will of course exist an ω ∈ Ω describing the situation (the world, the future, the parallel universe, . . . ) in which there is a solar eclipse every day from today until the 100th day. One might object that such a display of events does not seem too likely to happen. The same person should immediately realize that he or she has just discovered the notion of probability: while the clerk Ω blindly keeps track of every conceivable contingency, the probability brings reason (and inequality) to the game. Its job is to assign a number to each ω ∈ Ω - higher numbers to likelier states of the world, and smaller numbers to unusual coincidences. It can also take groups of ω's (subsets of Ω) and assign numbers to those by simply adding the probabilities. The subsets of Ω are usually referred to as events. For an event A ⊆ Ω we write P[A] for the probability assigned to A. With these new concepts in mind, we venture into another example.

Example 1.1.2. Derek is getting sophisticated. He is not interested in meteorology anymore. He wants to know how his stocks are doing (this is a course in mathematical finance after all!). He bought $100 worth of shares of the Meadow Mutual Fund a year ago, and is wondering how much his portfolio is worth today. He reads his Continuous-Time Finance Lecture Notes and decides that the state space Ω will consist of all positive numbers (nobody said state spaces should be finite!). So far so good - all the contingencies are accounted for and all we need to do is define the probability. You have to agree that it is very silly to expect his portfolio to be worth $1,000,000,000,000. It is much more likely it will be around $100. With that in mind he starts:

"Hm . . . let's see . . . what is the probability that my portfolio will be worth exactly $101.22377 (in Derek's world there are coins worth 0.1 cents, 0.01 cents, etc.)? Well, . . . I have to say that the probability must be 0. Same for $95.232123526763, and $88.23868726345345345384762, . . . and $144.34322423345345343423 . . . I must be doing something wrong."


But then he realizes:

"My broker Mr. Mole has told me that it is extremely unlikely that I will lose more than 10%. He also said that these days no fund can hope to deliver a return larger than 15%. That means that I should be quite confident that the value of my portfolio lies in the interval [$90, $115] . . . say, . . . 90% confident. That must be it!"

And Derek is right on the money here. When Ω is infinite, it usually makes no sense to ask questions like What is the probability that ω ∈ Ω will turn out to be the true state of the world? Not that you cannot answer it. You can. The answer is usually trivial and completely uninformative - zero. The only meaningful questions are the ones concerning the probabilities of events: What is the probability that the true state of the world ω will turn out to be an element of the set A? In Derek's case, it made much more sense to assign probabilities to subsets of Ω, and it cannot be done by summing over all elements as in the finite case. If you try to do that you will end up adding 0 to itself an uncountable number of times. Strange . . .

Where did this whole discussion lead us? We are almost tempted to define the probability as a function assigning a number to each subset of Ω, and complying with a few other well-known rules (additive, between 0 and 1, etc.), namely

Definition 1.1.3 (Tentative). Let Ω be a set. A probability is a function P from the family P(Ω) of all subsets of Ω to the set of reals between 0 and 1 such that

• P[Ω] = 1, and

• P[A ∪ B] = P[A] + P[B], whenever A, B are subsets of Ω and A ∩ B = ∅.

. . . but the story is not so simple. In 1924, two Polish mathematicians, Stefan Banach and Alfred Tarski, proved the following statement:

Banach-Tarski Paradox: It is possible to take a solid ball in 3-dimensional space, cut it up into finitely many pieces and, moving them using only rotation and translation, reassemble the pieces into two balls the same size as the original.


Figure 1.2: Stefan Banach and Alfred Tarski

. . . and showed that the concept of measurement cannot be applied to all subsets of the space. The only way out of the paradox is to forget about the universal notion of volume and restrict the class of sets under consideration. And then, if you think about it, probability is a weird kind of a volume. A volume on events, not bodies in space, but still a volume. If you are unhappy with an argument based on the premise that probability is a freaky kind of volume, wait until you read about σ-additivity and then go to the Appendix to realize that it is impossible to define a σ-additive probability P on all subsets of [0, 1] where any subset A has the same probability as its translated copy x + A (as long as x + A ⊆ [0, 1]).

Figure 1.3: Andrei Kolmogorov

The rescue from the apparently doomed situation came from the Russian mathematician Andrei Nikolaevich Kolmogorov who, by realizing two important things, set the foundations for modern probability theory. So, what are the two important ideas? First of all, Kolmogorov supported the idea that the probability function can be useful even if it is not defined on all subsets of Ω. If we restrict the class of events which allow a probability to be assigned to them, we will still end up with a useful and applicable theory. The second idea of Kolmogorov was to require that P be countably-additive (or σ-additive), instead of only finitely-additive. What does that mean? Instead of postulating that P[A] + P[B] = P[A ∪ B] for disjoint A, B ⊆ Ω, he required more: for any sequence of pairwise disjoint sets (An)n∈N, Kolmogorov kindly asked P to satisfy

P[∪_{n=1}^∞ An] = ∑_{n=1}^∞ P[An]   (countable additivity).

If you set A1 = A, A2 = B, A3 = A4 = . . . = ∅, you immediately get the weaker statement of finite additivity.

So, in order to define probability, Kolmogorov used the notion of σ-algebra as the third and the least intuitive ingredient in the axioms of probability theory:

Definition 1.1.4. A collection F of subsets of Ω is called a σ-algebra (or a σ-field) if the following three conditions hold:

• Ω ∈ F.

• For each A ∈ F, we must also have A^c ∈ F, where A^c := {ω ∈ Ω : ω ∉ A}.

• If (An)n∈N is a sequence of elements of F, then their union ∪_{n=1}^∞ An must belong to F as well.

Typically F will contain all the events that allow probability to be assigned to them, and we call the elements of F the measurable sets. As we have seen, these will not always be all the subsets of Ω. There is no need to worry, though: in practice, you will never encounter a set so ugly that it is impossible to assign a probability to it.

Remark 1.1.1. The notion of σ-algebra will, however, prove to be very useful very soon, in a role quite different from the present one. Different σ-algebras on the same state space will be used to model different amounts of information. After all, randomness is just the lack of information.

We finish the story about the foundations of probability with the axioms of probability theory:

Definition 1.1.5. A triple (Ω, F, P) is called a probability space if Ω is a non-empty set, F is a σ-algebra of subsets of Ω and P is a probability on F, i.e.

• P : F → [0, 1]

• P[Ω] = 1

• P[∪_{n=1}^∞ An] = ∑_{n=1}^∞ P[An], for any sequence (An)n∈N of sets in F such that An ∩ Am = ∅ for n ≠ m.


Exercises

Exercise 1.1.6. Let F and G be σ-algebras on Ω. Prove the following

1. F ∩ G is a σ-algebra.

2. F ∪ G is not necessarily a σ-algebra.

3. There exists the smallest σ-algebra σ(F ∪ G) containing both F and G. (Hint: prove and use the fact that P(Ω) = {A : A ⊆ Ω} is a σ-algebra containing both F and G.)

Exercise 1.1.7. Which of the following are σ-algebras?

1. F = {A ⊆ R : 0 ∈ A}.

2. F = {A ⊆ R : A is finite}.

3. F = {A ⊆ R : A is finite, or A^c is finite}.

4. F = {A ⊆ R : A is open}. (A subset of R is called open if it can be written as a union of open intervals (a, b).)

5. F = {A ⊆ R : A is open or A is closed}. (Note: A is closed if and only if A^c is open.)

Exercise 1.1.8. Let F be a σ-algebra and {An}n∈N a sequence of elements of F (not necessarily disjoint). Prove that

1. There exists a pairwise disjoint sequence {Bn}n∈N of elements of F such that ∪_{n=1}^∞ An = ∪_{n=1}^∞ Bn. (A sequence {Bn}n∈N is said to be pairwise disjoint if Bn ∩ Bm = ∅ whenever n ≠ m.)

2. ∩_{n=1}^∞ An ∈ F.


Solutions to Exercises in Section 1.1

Solution to Exercise 1.1.6:

1. We need to prove that the collection H := F ∩ G of subsets of Ω satisfies the three axioms of a σ-algebra:

• It is obvious that Ω ∈ H since Ω ∈ F and Ω ∈ G by assumption.

• Secondly, suppose that A is an element of H. Then A ∈ F and A ∈ G by assumption, and thus A^c ∈ F and A^c ∈ G since both are σ-algebras. We conclude that A^c ∈ H.

• Then, let {An}n∈N be a sequence of subsets of Ω with An ∈ H for each n. Then, by assumption, An ∈ F and An ∈ G for any n, and so ∪n An ∈ F and ∪n An ∈ G because F and G are σ-algebras. Hence, ∪n An ∈ H.

2. Take Ω = {1, 2, 3, 4} and take

F = { ∅, {1, 2}, {3, 4}, {1, 2, 3, 4} }  and  G = { ∅, {1, 3}, {2, 4}, {1, 2, 3, 4} }.

It is easy to show that both F and G are σ-algebras. The union F ∪ G is not a σ-algebra since {1, 2} ∈ F ∪ G and {1, 3} ∈ F ∪ G, but {1, 2} ∪ {1, 3} = {1, 2, 3} ∉ F ∪ G.

3. Just take the intersection of all σ-algebras containing F and G. The intersection is nonempty because, at least, P(Ω) (the set containing all subsets of Ω) is there.

Solution to Exercise 1.1.7:

1. By definition A = {0} ∈ F, but A^c = (−∞, 0) ∪ (0, ∞) ∉ F since 0 ∉ A^c. Therefore, F is not a σ-algebra.

2. Each element of the sequence An := {n} is in F. However, ∪n An = N is not an element of F because it is obviously not finite. We conclude that F is not a σ-algebra.

3. The same counterexample as above proves that F is not a σ-algebra, since neither N nor R \ N is finite.

4. Remember that a set A is said to be open if for each x ∈ A there exists ε > 0 such that (x − ε, x + ε) ⊆ A. It is quite evident that A = (−∞, 0) is an open set (prove it rigorously if you feel like it: just take ε = |x|). However, A^c = [0, ∞) is not an open set since for x = 0 no ε > 0 will do the trick; no matter how small ε > 0 I take, the interval (−ε, ε) will always contain negative numbers, and so it will never be a subset of A^c.


5. Suppose that F is a σ-algebra. Define A = [1, 2], B = (0, 1). The set A is closed and B is open, and so they both belong to F. Therefore, the union A ∪ B = (0, 2] is an element of F. This is a contradiction with the definition of F, since neither (0, 2] nor (0, 2]^c = (−∞, 0] ∪ (2, ∞) is open (so (0, 2] is neither open nor closed; the argument is very similar to that from 4). By contraposition, F is not a σ-algebra.

Solution to Exercise 1.1.8:

1. Define B1 = A1, B2 = A2 − B1, B3 = A3 − (B1 ∪ B2), . . . In general, Bn = An − (B1 ∪ B2 ∪ . . . ∪ Bn−1). It is easy to verify that the Bn's are disjoint and that, for each n, ∪_{k=1}^n Ak = ∪_{k=1}^n Bk. Inductively, it is also quite obvious that Bn ∈ F.

2. By De Morgan's rules, we have ∩_{n∈N} An = ( ∪_{n∈N} An^c )^c.


1.2 Random Variables

Look at the following example

Example 1.2.1. Derek's widened interest in the world around him makes the structure of the state space Ω more and more complicated. Now he has to take the value of his portfolio in US$, the weather, the occurrence of the solar eclipse and the exchange rate between the American Dollar and the MCU (Meadow Currency Unit) into account. In symbols, we have

Ω = R+ × {R, S} × {Eclipse, No eclipse} × R+,

and the typical element of Ω is a quadruplet very much like ($114.223, R, No eclipse, 2.34 MCU/$), or ($91.12, R, Eclipse, 1.99 MCU/$).

If Derek were interested in the value of his portfolio expressed in MCUs, he would just multiply the first and the fourth component of ω, namely the value of the portfolio in $ multiplied by the exchange rate. Of course, this can be done for any ω ∈ Ω and the result could be different in different states of the world. In this way, Derek is extracting information from what can be read off ω: the value of the portfolio in MCUs is a function of ω. It is variable and it is random, so we can give the following definition:

Definition 1.2.2 (tentative). A function X : Ω → R is called a random variable.

Why tentative? It has to do with those guys from Poland - Banach and Tarski: for the very same reason that we have introduced a σ-algebra as a constituent of the probability space. The true (non-tentative) definition of a random variable will respect the existence of the σ-algebra F:

Definition 1.2.3. We say that the function X : Ω → R is F-measurable if for any real numbers a < b the set

X^{−1}((a, b)) := {ω ∈ Ω : X(ω) ∈ (a, b)}

is an element of F. A random variable is any F-measurable function X : Ω → R.

The notion of F-measurability makes sense once you realize that in practice you are going to be interested in the probability of events of the form X^{−1}((a, b)) for some real numbers a, b. For example, "what is the probability that my portfolio return will be between 10% and 20% at the end of the next year?" If you want to be able to assign probabilities to such events - and you do - your random variables had better be F-measurable. However, you do not need to worry about measurability at all: all random variables encountered in practice are measurable, and in this course we will never make an issue of it.

So, in practice any function going from Ω to R is a random variable, and tells us something about the state of the world. In general, Ω is going to be a huge set in which we can put everything relevant (and irrelevant), but random variables are going to be the manageable pieces of it. Look at the following example:


Example 1.2.4. Consider the world in which Derek wants to keep track of the value of his portfolio not only a year from now, but also at every instant from now until a year from now. In that case each ω would have to describe the evolution of the portfolio value as time t (in years) goes from 0 (today) to 1 (a year from today). In other words, each ω is a real-valued function ω : [0, 1] → R. By a leap of imagination you could write ω as an uncountable product of R by itself, each copy of R corresponding to the value of the portfolio at a different time-instant. So, Ω is going to be the collection of all functions from [0, 1] to R. A huge set, indeed.

In the previous example, the value of the portfolio at a particular time point (say t0 = 0.5, i.e. 6 months from now) is a random variable. For any state of the world ω, it returns the number ω(t0). For the ω from the picture, that will be approx. $75. You can take - (a) the net gain of the portfolio over the year, (b) the log-return in the second month, or (c) the maximum value of the portfolio - and express them as functions of ω. They will all be random variables (under reasonable assumptions, always verified in practice, that is):

(a) ω(1) − ω(0),   (b) log(ω(2/12)/ω(1/12)),   (c) max_{0≤t≤1} ω(t).

Our discussion of random variables so far never mentioned the probability. Suppose that there is a probability P defined on Ω, and that we are given a random variable X. We might be interested in probabilities of various events related to X. For example, we might want to know the probability P[{ω ∈ Ω : a < X(ω) < b}] (P[a < X < b] in the standard shorthand). This is where the notion of the distribution of the random variable comes in handy. Even though probability spaces differ hugely from one application to another, it will turn out that there is a handful of distributions of random variables that come up over and over again. So, let us give a definition first.

Definition 1.2.5. Let X : Ω → R be a random variable on a probability space (Ω, F, P). The function F : R → [0, 1] defined by F(x) := P[X ≤ x] is called the distribution function (or just the distribution) of the random variable X.

In a sense, the distribution function tells us all about X when we take it out of the context, and sometimes that is enough. For the purposes of applications, it is useful to single out two important classes of random variables: discrete and continuous.¹

Definition 1.2.6. Let X be a random variable on a probability space (Ω,F ,P).

(a) we say X is a discrete random variable if there exists a sequence (xn)n∈N such that

P[X ∈ {x1, x2, . . .}] = ∑_{n=1}^∞ P[X = xn] = 1.

In other words, X always has some xn as its value.

¹ Caution: there are random variables which are neither discrete nor continuous, but we will have no use for them in this course.

(b) we say X is a continuous random variable if there exists an integrable function fX : R → R+ such that

P[a < X < b] = ∫_a^b fX(x) dx, for all a < b.   (1.2.1)

When X is a discrete random variable, the probabilities of all events related to X can be expressed in terms of the probabilities pn = P[X = xn]. For example, P[a ≤ X ≤ b] = ∑_{n : a ≤ xn ≤ b} pn. In particular, F(x) = ∑_{n : xn ≤ x} pn. In the case when the sequence (xn)n∈N contains only finitely many terms (or, equivalently, when there exists k ∈ N such that pn = 0 for n > k), the distribution of X can be represented in the form of the distribution table

X ∼ ( x1  x2  . . .  xk )
    ( p1  p2  . . .  pk )

For a continuous random variable X, the function fX from Definition 1.2.6.(b) is called the density function of the random variable X, or simply the density of X. For a random variable X with the density fX, the relation (1.2.1) holds true even when a = −∞ or b = ∞. It follows that

F(x) = ∫_{−∞}^x fX(y) dy,  and that  ∫_{−∞}^∞ fX(x) dx = 1,

since P[X ∈ R] = 1. Given all these definitions, it would be fair to give an example or two before we proceed.
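Before the catalogue of examples, here is a minimal Maple sanity check of these two relations for a concrete density - the exponential density with rate λ = 2, which reappears in Example 1.2.7 below. The numbers are purely illustrative, and this is only a sketch.

> lambda := 2:                                       # illustrative rate
> fX := t -> piecewise(t > 0, lambda*exp(-lambda*t), 0):
> int(fX(t), t = -infinity..infinity);               # the density integrates to 1
> F := x -> int(fX(t), t = -infinity..x):
> F(1);                                              # P[X <= 1] = 1 - exp(-2)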

Example 1.2.7. This is the list of Derek's favorite distributions. He is sort of a fanatic when it comes to collecting different famous distributions. He also likes the trivia that comes with them.

1. Discrete distributions

(a) (Bernoulli: B(p)) The Bernoulli distribution is the simplest discrete distribution. It takes only 2 different values: 1 and 0, with probabilities p and 1 − p, respectively. Sometimes 1 is called success and 0 failure. If X is the outcome of a toss of an unfair coin, i.e. a coin with unequal probabilities for heads and tails, then X will have the Bernoulli distribution. Even though it might not seem like a very interesting distribution, it is ubiquitous, and it is very useful as a building block for more complicated models. Just think about binomial trees in discrete-time finance. Or winning a lottery, or anything that has exactly two outcomes.


(b) (Binomial: B(n, p)) Adding the outcomes of n unrelated (I should say independent here) but identical Bernoulli random variables gives you a Binomial random variable. If you adopt the success-failure interpretation of a Bernoulli random variable, the Binomial says how many successes you have had in n independent trials. The values a Binomial random variable can take are x0 = 0, x1 = 1, . . . , xn = n, with probabilities

pm = P[X = m] = (n choose m) p^m (1 − p)^(n−m), for m = 0, 1, . . . , n.

(A Maple sanity check of these probabilities follows the discrete distributions below.)

Figure 1.4: Binomial distribution, n = 10, p = 0.3.

(c) (Poisson, P(λ)) The Poisson distribution is a limiting case of a Binomial distribution when n is very large, p is very small and np ∼ λ. The probability distribution is given by

pn = P[X = n] = e^(−λ) λ^n / n!, for n = 0, 1, . . . .

Figure 1.5: Poisson distribution, λ = 2

(d) (Geometric, G(p)) If you are tossing an unfair coin (P[Tails] = p), the Geometric distribution will be the distribution of the number of tosses it takes to get your first Tails. The values it can take are n = 1, 2, . . . and the corresponding probabilities are pn = p(1 − p)^(n−1).

Figure 1.6: Geometric distribution, p = 0.4
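As promised in the Binomial entry, here is a small Maple sanity check that the three families of probabilities above sum to one. It is only a sketch; the parameter values n = 10, p = 3/10, λ = 2 and q = 2/5 are chosen purely for illustration.

> n := 10: p := 3/10:
> sum(binomial(n, m)*p^m*(1-p)^(n-m), m = 0..n);              # Binomial probabilities add up to 1
> lambda := 2:
> simplify(sum(exp(-lambda)*lambda^k/k!, k = 0..infinity));   # Poisson probabilities add up to 1
> q := 2/5:
> sum(q*(1-q)^(k-1), k = 1..infinity);                        # Geometric probabilities add up to 1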


2. Continuous distributions

(a) (Uniform, U(a, b)) The Uniform distribution models complete ignorance of an unknown quantity as long as it is constrained to take values in a bounded interval. The parameters a and b denote the end-points of that interval, and the density function is given by

f(x) = 1/(b − a) for x ∈ [a, b], and f(x) = 0 for x < a or x > b.

Figure 1.7: Uniform Distribution on [a, b], a = 2, b = 4

(b) (Normal, N(µ, σ)) The normal distribution was originally studied by DeMoivre (1667-1754), who was curious about its use in predicting the probabilities in gambling! The first person to apply the normal distribution to social data was Adolph Quetelet (1796-1874). He collected data on the chest measurements of Scottish soldiers, and the heights of French soldiers, and found that they were normally distributed. His conclusion was that the mean was nature's ideal, and data on either side of the mean were a deviation from nature's ideal. Although his conclusion is arguable, he nonetheless represented the normal distribution in a real-life setting. The density function of the normal distribution with parameters µ (mean) and σ (standard deviation) is

f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)), x ∈ R.

(A Maple check that this density, and the others in this list, integrate to one follows the continuous distributions below.)

Figure 1.8: Normal Distribution, µ = 4, σ = 2


(c) (Exponential, Exp(λ)) The Exponential distribution is the continuous-time analogue of the Geometric distribution. It usually models waiting times and has a nice property of no memory. Its density is

f(x) = λ exp(−λx) for x > 0, and f(x) = 0 for x ≤ 0,

where the parameter λ stands for the rate - i.e. the reciprocal of the expected waiting time.

Figure 1.9: Exponential Distribution, λ = 0.5

(d) (Double Exponential, DExp(λ)) This distribution arises as a modification of the Exponential distribution when the direction of the uncertainty is unknown - like an upward or a downward movement of the (log-) stock price. The density function is given by

f(x) = (1/2) λ exp(−λ|x|), x ∈ R,

where λ has the same meaning as in the previous example.

Figure 1.10: Double Exponential Distribution, λ = 1.5

(e) (Arcsin, Arcsin) Consider a baseball team that has a 0.5 probability of winning each game. What percentage of the season would we expect it to have a losing record (more games lost than won so far)? A winning record? The rather unexpected answer is supplied by the Arcsin distribution. Its density function is

f(x) = 1/(π√(x(1 − x))) for x ∈ (0, 1), and f(x) = 0 for x ≤ 0 or x ≥ 1.

Figure 1.11: Arcsin Distribution
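As flagged in the Normal entry, the following Maple commands (a sketch with symbolic parameters) verify that the continuous densities above integrate to one; note in particular the factor 1/π in the Arcsin density.

> int(1/(b-a), x = a..b) assuming b > a;                                             # Uniform
> int(exp(-(x-mu)^2/(2*sigma^2))/sqrt(2*Pi*sigma^2), x = -infinity..infinity) assuming sigma > 0;   # Normal
> int(lambda*exp(-lambda*x), x = 0..infinity) assuming lambda > 0;                   # Exponential
> int(lambda/2*exp(-lambda*abs(x)), x = -infinity..infinity) assuming lambda > 0;    # Double Exponential
> int(1/(Pi*sqrt(x*(1-x))), x = 0..1);                                               # Arcsin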


Exercises

Exercise 1.2.8. Let X and Y be two continuous random variables on the same probability space. Is it true that their sum is a continuous random variable? What about the sum of two discrete random variables - is their sum always discrete?

Exercise 1.2.9. Let X be a continuous random variable with density function f(x) such that f(x) > 0 for all x. Furthermore, let F(x) = ∫_{−∞}^x f(y) dy be the distribution function of X: F(x) = P[X ≤ x]. We define the random variable Y by Y(ω) = F(X(ω)). Prove that Y is a uniformly distributed continuous random variable with parameters a = 0 and b = 1. In other words, Y ∼ U(0, 1).

Exercise 1.2.10. Let X be a standard normal random variable, i.e. X ∼ N(0, 1). Prove that for n ∈ N ∪ {0} we have

E[X^n] = 1 · 3 · 5 · · · (n − 1) if n is even, and E[X^n] = 0 if n is odd.

Exercise 1.2.11. Let X be an exponentially distributed random variable with parameter λ > 0, i.e. X ∼ Exp(λ). Compute the density fY(y) of the random variable Y = log(X). (Hint: compute the distribution function FY(y) of Y first.)


Solutions to Exercises in Section 1.2

Solution to Exercise 1.2.8: The first statement is not true. Take X to be any continuous random variable and take Y = −X. Then Z = X + Y is the constant random variable 0, i.e. P[Z = 0] = 1. Suppose there exists a density fZ for the random variable Z. The function fZ should have the properties that

• fZ(z) ≥ 0, for all z,

• ∫_{−∞}^∞ fZ(z) dz = 1,

• ∫_a^b fZ(z) dz = 0 for every interval (a, b) which does not contain 0.

It is quite obvious that such a function cannot exist. For those of you who have heard about "Dirac's delta function", I have to mention the fact that it is not a function even though physicists like to call it that way.

The second statement is true. Let {xn}n∈N be the (countable) sequence of values that X can take, and let {yn}n∈N be the sequence of values the random variable Y can take. The sum Z = X + Y will always be equal to some number of the form xn + ym, and it is thus enough to prove that there are countably many numbers which can be written in the form xn + ym. In order to prove that statement, let us first note that the number of numbers of the form xn + ym is "smaller" than the number of ordered pairs (n, m). This follows from the fact that the pair (n, m) determines uniquely the sum xn + ym, while the same sum might correspond to many different pairs. But we know that the sets N × N and N both have countably many elements, which concludes the proof.

Solution to Exercise 1.2.9: Let us compute the distribution function of the random variable Y (note that FX^{−1} exists on (0, 1) since FX is a strictly increasing function due to the strict positivity of fX):

FY(y) = P[Y ≤ y] = P[FX(X) ≤ y] = 1 for y ≥ 1,
FY(y) = P[X ≤ FX^{−1}(y)] = FX(FX^{−1}(y)) = y for y ∈ (0, 1),
FY(y) = 0 for y ≤ 0.

It is clear now that FY(y) = ∫_{−∞}^y fU(0,1)(z) dz, where

fU(0,1)(y) = 1 for 0 < y < 1, and fU(0,1)(y) = 0 for y < 0 or y > 1,

is the density of the uniform random variable with parameters a = 0 and b = 1.


Solution to Exercise 1.2.10: Let f(x) denote the density of the standard normal: f(x) = exp(−x²/2)/√(2π). Then

E[X^n] = ∫_{−∞}^∞ x^n f(x) dx.

Letting an = E[X^n] we immediately note that an = 0 whenever n is odd, because of the symmetry of the function f. For n = 0, we have a0 = 1 since f is a density function of a random variable. Finally, we use partial integration, L'Hospital's rule and the fact that f′(x) = −x f(x) (where exactly are they used?) to obtain

a_{2n} = ∫_{−∞}^∞ x^{2n} f(x) dx = [x^{2n+1} f(x)/(2n+1)]_{−∞}^∞ − ∫_{−∞}^∞ (x^{2n+1}/(2n+1)) (−x f(x)) dx = a_{2n+2}/(2n+1).

Having obtained the recursive relation a_{2n+2} = (2n+1) a_{2n}, the claim now follows by induction.
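For what it is worth, the first few moments can be confirmed directly in Maple (a sketch; the integrals below are just E[X^2], E[X^4], E[X^6] and one odd moment for the standard normal):

> f := x -> exp(-x^2/2)/sqrt(2*Pi):
> int(x^2*f(x), x = -infinity..infinity);   # 1
> int(x^4*f(x), x = -infinity..infinity);   # 3  = 1*3
> int(x^6*f(x), x = -infinity..infinity);   # 15 = 1*3*5
> int(x^5*f(x), x = -infinity..infinity);   # 0  (odd moments vanish)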

Solution to Exercise 1.2.11: The distribution function FX of the exponentially distributed random variable X with parameter λ is given by

FX(x) = ∫_0^x λ e^{−λy} dy = 1 − e^{−λx} for x ≥ 0, and FX(x) = 0 for x < 0.

Therefore, for y ∈ R we have

FY(y) = P[Y ≤ y] = P[log(X) ≤ y] = P[X ≤ e^y] = FX(e^y) = 1 − e^{−λ e^y}.

Differentiation with respect to y gives

fY(y) = λ e^{y − λ e^y}.
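The computation can be double-checked in Maple (a sketch): differentiate FY and verify that the result is a genuine density.

> FY := y -> 1 - exp(-lambda*exp(y)):
> fY := unapply(diff(FY(y), y), y);                          # lambda*exp(y)*exp(-lambda*exp(y))
> int(fY(y), y = -infinity..infinity) assuming lambda > 0;   # 1, so fY is indeed a density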


1.3 Random Vectors

1.3.1 Random Vectors Generalia

Random variables describe numerical aspects of the phenomenon under observation that take values in the set of real numbers R. Sometimes we might be interested in a piece of information consisting of more than one number - say the height and the weight of a person, or the prices of several stocks in our portfolio. Therefore, we are forced to introduce random vectors - (measurable) functions from Ω to R^n.

There are a number of concepts related to random vectors that are direct analogues of the corresponding notions about random variables. Let us make a list. In what follows, let X = (X1, X2, . . . , Xn) be an n-dimensional random vector.

1. The function FX : R^n → R defined by FX(x1, x2, . . . , xn) := P[X1 ≤ x1, X2 ≤ x2, . . . , Xn ≤ xn] is called the distribution function of the random vector X.

2. X is said to be a continuous random vector if there exists a nonnegative integrable function fX : R^n → R such that for a1 < b1, a2 < b2, . . . , an < bn we have

P[a1 < X1 < b1, a2 < X2 < b2, . . . , an < Xn < bn] = ∫_{an}^{bn} . . . ∫_{a2}^{b2} ∫_{a1}^{b1} fX(y1, y2, . . . , yn) dy1 dy2 . . . dyn.

Subject to some regularity conditions (you should not worry about them) we have

fX(x1, x2, . . . , xn) = ∂^n FX(x1, x2, . . . , xn) / (∂x1 ∂x2 . . . ∂xn).

Example 1.3.1. Let X = (X1, X2, X3) be a random vector whose density function fX is given by

fX(x1, x2, x3) := (2/3)(x1 + x2 + x3) for 0 ≤ x1, x2, x3 ≤ 1, and fX(x1, x2, x3) := 0 otherwise.

It is easy to show that fX indeed possesses all the properties of a density function (positive, integrates to one). Note that the random vector X takes values in the unit cube [0, 1]³ with probability one. As for the distribution function FX, it is not impossible to compute it explicitly, even though the final result will be quite messy (there are quite a few cases to consider). Let us just mention that FX(x1, x2, x3) = 0 for x1 < 0, x2 < 0 or x3 < 0, and FX(x1, x2, x3) = 1 when x1, x2 and x3 are all greater than 1.

Even though the definition of a density teaches us only how to compute probabilities of the form P[a1 < X1 < b1, a2 < X2 < b2, a3 < X3 < b3], we can compute more complicated probabilities. For a nice enough set A ⊆ R³ we have

P[X ∈ A] = ∫∫∫_A fX(y1, y2, y3) dy1 dy2 dy3.   (1.3.1)

In words, the probability that X lies in the region A can be obtained by integrating fX over that region. Of course, there is nothing special about n = 3 here. The same procedure will work in any dimension.

Example 1.3.2. Let X = (X1, X2, X3) be the random vector from Example 1.3.1. Suppose we are interested in the probability P[X1 − 2X2 ≥ X3]. The first step is to cast the inequality X1 − 2X2 ≥ X3 into the form X ∈ A for a suitably chosen set A ⊆ R³. In our case A = {(x1, x2, x3) ∈ R³ : x1 − 2x2 − x3 ≥ 0}, and so the required probability can be obtained by using formula (1.3.1):

P[X1 − 2X2 ≥ X3] = ∫∫∫_{(x1,x2,x3) ∈ R³ : x1−2x2−x3 ≥ 0} fX(x1, x2, x3) dx1 dx2 dx3.

Furthermore, since fX is equal to 0 outside the cube 0 < x1, x2, x3 < 1, we have reduced the problem of finding the probability P[X1 − 2X2 ≥ X3] to the computation of the integral

∫∫∫_{0 < x1,x2,x3 < 1, x1−2x2−x3 ≥ 0} (2/3)(x1 + x2 + x3) dx1 dx2 dx3.

The Maple command

> int(int(int(2/3*(x+y+z)*piecewise(x-2*y-z>0, 1, 0), x=0..1), y=0..1), z=0..1);

does the trick, and we get P[X1 − 2X2 ≥ X3] = 1/16.

1.3.2 Marginal Distributions

For a component random variable Xk of the random vector X there is a simple way of obtaining its distribution function FXk from the distribution of the random vector (why is that so?):

FXk(x) = lim_{x1→∞, . . . , x_{k−1}→∞, x_{k+1}→∞, . . . , xn→∞} FX(x1, . . . , x_{k−1}, x, x_{k+1}, . . . , xn).

You would get the distribution function FXk,Xl(x, x′) in a similar way by holding the kth and the lth coordinate fixed and taking the limit of FX when all the other coordinates tend to ∞, etc.

For the densities of continuous random vectors, similar conclusions are true. Suppose we are interested in the density function of the random vector X′ := (X2, X3), given that the density of X = (X1, X2, . . . , Xn) is fX(x1, x2, . . . , xn). The way to go is to integrate out the variables x1, x4, . . . , xn:

fX′(x2, x3) ( = f(X2,X3)(x2, x3) ) = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ fX(y1, x2, x3, y4, . . . , yn) dy1 dy4 . . . dyn   (note: no dx2 or dx3!).

You would, of course, do the analogous thing for any k-dimensional sub-vector X′ of X. When k = 1, you get the distributions (densities) of the random variables X1, X2, . . . , Xn. The distribution (density) of a sub-vector X′ of X is called the marginal distribution (density).

Example 1.3.3. Continuing with the random vector from Example 1.3.1, the random variable X2 has a density fX2 given by

fX2(x2) = ∫_{−∞}^∞ ∫_{−∞}^∞ fX(y1, x2, y3) dy1 dy3.

Using the fact that fX is 0 when any of its arguments is outside [0, 1], we get

fX2(x2) = ∫_0^1 ∫_0^1 (2/3)(y1 + x2 + y3) dy1 dy3.

By simple integration (or Maple) we have

fX2(x2) = (2/3)(1 + x2) for 0 < x2 < 1, and fX2(x2) = 0 otherwise.

By symmetry fX1 = fX2 = fX3. Note that the obtained function fX2 is positive and integrates to 1 - as it should. Similarly,

f(X2,X3)(x2, x3) = ∫_{−∞}^∞ fX(y1, x2, x3) dy1 = (2/3)(1/2 + x2 + x3) for 0 < x2, x3 < 1, and f(X2,X3)(x2, x3) = 0 otherwise.
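The "simple integration (or Maple)" above amounts to the following commands (a sketch, using the density from Example 1.3.1):

> fX := (x1, x2, x3) -> 2/3*(x1 + x2 + x3):
> int(int(fX(y1, x2, y3), y1 = 0..1), y3 = 0..1);   # fX2(x2) = 2/3*(1 + x2)
> int(fX(y1, x2, x3), y1 = 0..1);                   # f(X2,X3)(x2, x3) = 2/3*(1/2 + x2 + x3)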

1.3.3 Expectations, etc.

Given a random vector X we define its expectation vector (or mean vector) µ = (µ1, µ2, . . . , µn) by the formula

µk = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ xk fX(x1, x2, . . . , xn) dx1 dx2 . . . dxn,

provided that all the integrals converge absolutely, i.e. ∫_{−∞}^∞ · · · ∫_{−∞}^∞ |xk| fX(x1, x2, . . . , xn) dx1 dx2 . . . dxn < ∞ for all k. Otherwise, we say that the expectation does not exist. In order not to repeat this discussion every time when dealing with integrals, we shall adopt the following convention and use it tacitly throughout the book:


Any quantity defined by an integral will be said not to exist if its defining integral does not converge absolutely.

It is customary in probability theory to use the notation E[X] for the expectation µ of X. It can be shown that E acts linearly on random vectors, i.e. if X and X′ are two random vectors defined on the same probability space Ω and of the same dimension n, then for α, β ∈ R,

E[αX + βX′] = αE[X] + βE[X′].

See Example 1.3.9 for a proof of this statement. In general, for any function g : R^n → R we define

E[g(X)] := ∫_{−∞}^∞ · · · ∫_{−∞}^∞ g(x1, x2, . . . , xn) fX(x1, x2, . . . , xn) dx1 dx2 . . . dxn.

When g takes values in R^m, E[g(X)] is defined componentwise, i.e.

E[g(X)] = (E[g1(X)], E[g2(X)], . . . , E[gm(X)]),

where g = (g1, g2, . . . , gm). The definition of the expectation is just the special case with g(x1, x2, . . . , xn) = (x1, x2, . . . , xn).

1.3.4 The Variance-Covariance Matrix

Just as the expectation is used as a measure of the center of a random vector, a measure of spread is provided by the variance-covariance matrix Σ, i.e. the matrix whose (i, j)th entry is given by

Σij = E[(Xi − µi)(Xj − µj)] = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ (xi − µi)(xj − µj) fX(x1, x2, . . . , xn) dx1 dx2 . . . dxn,

where µi = E[Xi] and µj = E[Xj]. When n = 1, we retrieve the familiar concept of variance (µ = E[X1]):

Σ = (Σ11) = Var[X1] = ∫_{−∞}^∞ (x − µ)² fX(x) dx.

One easily realizes (try to prove it!) that for general n we have

Σij = E[(Xi − E[Xi])(Xj − E[Xj])] = E[XiXj] − E[Xi]E[Xj].

This quantity is called the covariance between the random variables Xi and Xj, and is usually denoted by Cov(Xi, Xj). We will give examples of the meaning and the computation of the variance-covariance matrix in the section devoted to the multivariate normal distribution.


Using the agreement that the expectation E[A] of a matrix A (whose entries are random variables) is just the matrix of expectations of the entries of A, we can define the variance-covariance matrix of the random vector X simply by putting (T denotes transposition of the vector)

Σ := E[(X − µ)^T (X − µ)].   (1.3.2)

(Note: by convention X is a row-vector, so that X^T is a column vector and (X − µ)^T (X − µ) is an n × n matrix.) Using this representation one can easily show the following statement.

Proposition 1.3.4. Let X be an n-dimensional random vector and let B be an n × m matrix of real numbers. Then the variance-covariance matrix Σ′ of the random vector X′ = XB is given by

Σ′ = B^T Σ B,   (1.3.3)

where Σ is the variance-covariance matrix of the random vector X.
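As a small illustration of (1.3.3), here is a Maple sketch with a made-up 2 × 2 variance-covariance matrix Σ and a made-up matrix B (both purely hypothetical); the matrix B below maps X = (X1, X2) to X′ = XB = (X1, X1 + X2).

> with(LinearAlgebra):
> Sigma := Matrix([[4, 1], [1, 9]]):    # hypothetical: Var[X1]=4, Var[X2]=9, Cov(X1,X2)=1
> B := Matrix([[1, 1], [0, 1]]):        # X' = X.B = (X1, X1+X2)
> Transpose(B) . Sigma . B;             # Matrix([[4, 5], [5, 15]]):
>                                       # Var[X1]=4, Cov(X1,X1+X2)=5, Var[X1+X2]=15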

1.3.5 Independence

It is important to notice that the distribution of a random vector carries more information than the collection of the distribution functions of the component random variables. This extra information is sometimes referred to as the dependence structure of the random vector. It is quite obvious that height and weight may both be normally distributed, but without the knowledge of their joint distribution - i.e. the distribution of the random vector (height, weight) - we cannot tell whether taller people tend to be heavier than shorter people, and if so, whether this tendency is strong or not. There is, however, one case when the distributions of the component random variables determine exactly the distribution of the vector:

Definition 1.3.5. Random variables X1, X2, . . . , Xn are said to be independent if

FX1(x1) FX2(x2) . . . FXn(xn) = FX(x1, x2, . . . , xn),

where X := (X1, X2, . . . , Xn), and FX1, FX2, . . . , FXn are the distribution functions of the component random variables X1, X2, . . . , Xn.

The definition implies (well, it needs a proof, of course) that for independent continuous random variables, the density function fX factorizes in a nice way:

fX(x1, x2, . . . , xn) = fX1(x1) fX2(x2) . . . fXn(xn),   (1.3.4)

and, furthermore, the converse is also true: if the densities of the random variables X1, X2, . . . , Xn satisfy (1.3.4), then they are independent.


Example 1.3.6. For the random vector X from Example 1.3.1, the component random variables X1, X2 and X3 are not independent. If they were, we would have (look at the previous example)

(2/3)(x1 + x2 + x3) = fX(x1, x2, x3) = fX1(x1) fX2(x2) fX3(x3) = (8/27)(1 + x1)(1 + x2)(1 + x3),

for all 0 ≤ x1, x2, x3 ≤ 1. This is obviously a contradiction.

Finally, let me mention that (1.3.4) easily implies the validity of the following theorem, and the corresponding rule: independence means multiply!

Theorem 1.3.7. Let the random variables X1, X2, . . . , Xn be independent. For all (sufficiently nice) real functions h1 : R → R, h2 : R → R, . . . , hn : R → R we have

E[h1(X1) h2(X2) . . . hn(Xn)] = E[h1(X1)] E[h2(X2)] . . . E[hn(Xn)],

subject to the existence of all expectations involved. In particular, if X and Y are independent,

E[XY ] = E[X]E[Y ] and Cov(X,Y ) = 0.

How does the last formula in the theorem follow from the rest?
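Tying this back to Example 1.3.6: if X1 and X2 from Example 1.3.1 were independent, the theorem would force Cov(X1, X2) = 0. A short Maple computation (a sketch) shows that the covariance is in fact nonzero:

> fX := (x1, x2, x3) -> 2/3*(x1 + x2 + x3):
> EX1X2 := int(int(int(x1*x2*fX(x1, x2, x3), x1 = 0..1), x2 = 0..1), x3 = 0..1);   # 11/36
> EX1 := int(int(int(x1*fX(x1, x2, x3), x1 = 0..1), x2 = 0..1), x3 = 0..1);        # 5/9
> EX1X2 - EX1^2;    # Cov(X1, X2) = -1/324 <> 0  (note E[X2] = E[X1] by symmetry)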

1.3.6 Functions of Random Vectors

Let X be a random vector with density fX and let h : R^n → R^n be a function. The composition of the function h and the random vector X (viewed as a function Ω → R^n) is a random vector again. The natural question to ask is whether X′ := h(X) admits a density, and if it does, how can we compute it? The answer is positive, and we will state it in a theorem which we give without proof. Hopefully, everything will be much clearer after an example.

Theorem 1.3.8. Let h : R^n → R^n, h = (h1, h2, . . . , hn), be a one-to-one mapping such that h and its inverse h^{−1} are continuously differentiable (the inverse being defined on a subset B ⊆ R^n). Then X′ := h(X) admits a density fX′, and it is given by the formula

fX′(x′1, x′2, . . . , x′n) = fX(h^{−1}(x′1, x′2, . . . , x′n)) |det J_{h^{−1}}(x′1, x′2, . . . , x′n)|,

for (x′1, x′2, . . . , x′n) ∈ B, and fX′(x′1, x′2, . . . , x′n) = 0 outside B. Here, for a differentiable function g, |det Jg(x′1, x′2, . . . , x′n)| denotes the absolute value of the determinant of the Jacobian matrix Jg (the matrix whose (i, j)th entry is the partial derivative of the ith component function gi with respect to the jth variable x′j).

Example 1.3.9.


1. When n = 1, the formula in Theorem 1.3.8 is particularly simple. Let us illustrate it by an example. Suppose that X has a standard normal distribution, and define h(x) := exp(x). The density of the resulting random variable X′ = exp(X) is then given by

fX′(x′) = (1/(x′ √(2π))) exp(−(1/2) log²(x′))   ( = fX(log(x′)) |d/dx′ log(x′)| ),

for x′ > 0, and fX′(x′) = 0 for negative x′. The random variable X′ is usually called log-normal and is used very often to model the distribution of stock returns.

2. Let X be a 2-dimensional random vector X = (X1, X2) with density fX(x1, x2). We are interested in the density function of the sum X1 + X2. In order to apply Theorem 1.3.8, we need an invertible, differentiable function h : R² → R² whose first component h1 is given by h1(x1, x2) = x1 + x2. A natural candidate is the function h : R² → R² given by

h(x1, x2) = (x1 + x2, x1 − x2).

Moreover, both h and h^{−1} are linear functions and we have

h^{−1}(x′1, x′2) = ((x′1 + x′2)/2, (x′1 − x′2)/2).

The Jacobian matrix of the function h^{−1} is given by

J_{h^{−1}}(x′1, x′2) = ( 1/2   1/2 )
                       ( 1/2  −1/2 ),

so |det J_{h^{−1}}(x′1, x′2)| = 1/2. Therefore

fX′(x′1, x′2) = (1/2) fX((x′1 + x′2)/2, (x′1 − x′2)/2).

We are only interested in the density of the random variable X1 + X2, which is the first component of the random vector X′ = h(X). Remembering from subsection 1.3.2 how to obtain the marginal from the joint density, we have

fX1+X2(x) = (1/2) ∫_{−∞}^∞ fX((x + x′2)/2, (x − x′2)/2) dx′2.   (1.3.5)

By introducing a simple change of variables y = (x + x′2)/2, the expression above simplifies to

fX1+X2(x) = ∫_{−∞}^∞ fX(y, x − y) dy.   (1.3.6)


In the special case when X1 and X2 are independent, we have fX(x1, x2) = fX1(x1) fX2(x2) and the formula for fX1+X2(x) becomes

fX1+X2(x) = (fX1 ∗ fX2)(x) := ∫_{−∞}^∞ fX1(y) fX2(x − y) dy.

The operation fX1 ∗ fX2 is called the convolution of the functions fX1 and fX2. Note that we have just proved that the convolution of two positive integrable functions is a positive integrable function, and that

∫_{−∞}^∞ (f ∗ g)(y) dy = ( ∫_{−∞}^∞ f(x) dx ) ( ∫_{−∞}^∞ g(x′) dx′ ).

How exactly does the last statement follow from the discussion before? (A concrete Maple illustration of the convolution, with two standard normal densities, follows at the end of this example.)

3. Let us use the result we have just obtained to prove the linearity of mathematical expectation. Let X1 and X2 be random variables on the same probability space. We can stack one on top of the other to obtain the random vector X = (X1, X2). The density of the sum X1 + X2 is then given by (1.3.6), so (assuming everything is defined and, for simplicity, that X1 and X2 are independent, so that fX(y, x − y) = fX1(y) fX2(x − y))

E[X1 + X2] = ∫_{−∞}^∞ x fX1+X2(x) dx = ∫_{−∞}^∞ x ∫_{−∞}^∞ fX1(y) fX2(x − y) dy dx = ∫_{−∞}^∞ ∫_{−∞}^∞ x fX1(y) fX2(x − y) dy dx.

After the change of variables ξ = x − y, we get

∫_{−∞}^∞ ∫_{−∞}^∞ x fX1(y) fX2(x − y) dy dx = ∫_{−∞}^∞ ∫_{−∞}^∞ (ξ + y) fX1(y) fX2(ξ) dy dξ

= ∫_{−∞}^∞ y fX1(y) ( ∫_{−∞}^∞ fX2(ξ) dξ ) dy + ∫_{−∞}^∞ ξ fX2(ξ) ( ∫_{−∞}^∞ fX1(y) dy ) dξ

= ∫_{−∞}^∞ y fX1(y) dy + ∫_{−∞}^∞ ξ fX2(ξ) dξ = E[X1] + E[X2].

Now you prove that E[αX] = αE[X] (you do not need the strength of Theorem 1.3.8 to do that; just compute the density of αX from first principles).
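As mentioned in part 2, here is a Maple illustration (a sketch) of the convolution formula for two independent standard normal densities; the output is the density of N(0, √2), which foreshadows Example 1.4.6 below.

> f := x -> exp(-x^2/2)/sqrt(2*Pi):                              # standard normal density
> g := simplify(int(f(y)*f(x - y), y = -infinity..infinity));    # the convolution (f*f)(x)
> simplify(g - exp(-x^2/4)/sqrt(4*Pi));                          # 0, i.e. f*f is the N(0, sqrt(2)) density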


Exercises


Solutions to Exercises in Section 1.3


1.4 The Multivariate Normal Distribution

1.4.1 The Definition

The multivariate normal distribution (MND) is one of the most important examples of multivariate² distributions. It is a direct generalization of the univariate (standard) normal and shares many of its properties. Besides being analytically tractable as well as very applicable in modelling a host of every-day phenomena, the MND often arises as the limiting distribution in many multidimensional Central Limit Theorems.

Definition 1.4.1. A random vector X = (X1, X2, . . . , Xn) is said to have a multivariate normal distribution if for any vector a = (a1, a2, . . . , an) ∈ R^n the distribution of the random variable Xa^T = a1X1 + a2X2 + · · · + anXn has a univariate normal distribution (with some mean and variance depending on a).

Putting ai = 1 and ak = 0 for k ≠ i, the definition above states that each component random variable Xi has a normal distribution, but the converse is not true: there are random vectors whose component distributions are normal, but the random vector itself is not multivariate normal (see Appendix, Section A.2). The following theorem reveals completely the structure of a multivariate normal random vector:

Theorem 1.4.2. Let the random vector X = (X1, X2, . . . , Xn) have a multivariate normal distribution, let µ = (µ1, µ2, . . . , µn) be its expectation (mean) vector, and let Σ be its variance-covariance matrix. Then the distribution of X is completely determined by µ and Σ. Also, when Σ is non-singular, its density fX is given by

fX(x1, x2, . . . , xn) = (2π)^{−n/2} (det Σ)^{−1/2} exp( −(1/2) Qµ,Σ(x1, x2, . . . , xn) ),   (1.4.1)

where the function Qµ,Σ is a quadratic polynomial in x = (x1, x2, . . . , xn) given by

Qµ,Σ(x) = (x − µ) Σ^{−1} (x − µ)^T = ∑_{i=1}^n ∑_{j=1}^n τij (xi − µi)(xj − µj),

and τij is the (i, j)th element of the inverse matrix Σ^{−1}.

The converse is also true: if a random vector X admits a density of the form (1.4.1), then X is multivariate normal.

Note that when n = 1, the mean vector becomes a real number µ, and the variance-covariance matrix Σ becomes σ² ∈ R. It is easy to check (do it!) that the formula above gives the familiar density of the univariate normal distribution.

² A distribution is said to be multivariate if it pertains to a (multidimensional) random vector. The various distributions of random variables are often referred to as univariate.


1.4.2 The Bivariate Case

The case when n = 2 is often referred to as the bivariate normal distribution. Before wegive our next example, we need to remember the notion of correlation corr(X,Y ) betweentwo random variables X and Y :

corr(X,Y ) =Cov(X,Y )√

Var(X) Var(Y ).

The number corr(X,Y ) takes values in the interval [−1, 1] and provides a numerical measureof the relation between the random variables X and Y . When X and Y are independent, thencorr(X,Y ) = 0 (because Cov(X,Y ) = 0, as we have seen in Section 1.3), but the converse isnot true (T. Mikosch provides a nice example (Example 1.1.12) on the page 21).

Example 1.4.3. Let X = (X,Y ) have a bivariate normal distribution. If we let µX = E[X],µY = E[Y ], σX =

√Var(X), σY =

√Var(Y ), and ρ = corr(X,Y ), then the mean vector µ

and the variance-covariance matrix Σ are given by

µ = (µX , µy),Σ =

(σ2

X ρσXσY

ρσXσY σ2Y

)The inverse Σ−1 is not hard to compute when ρ2 6= 1 , and so the quadratic Qµ,Σ simplifiesto

Qµ,Σ(x, y) =1

(1− ρ2)

((x− µX)2

σ2X

+(y − µY )2

σ2Y

− 2ρ(x− µX)(y − µY )

σXσY

).

Therefore, the density of the bivariate normal is given by

fX(x, y) =1

2πσXσY

√1− ρ2

exp(− 1

2(1− ρ2)

((x− µX)2

σ2X

+(y − µY )2

σ2Y

− 2ρ(x− µX)(y − µY )

σXσY

))In order to visualize the bivariate normal densities, try the following Maple commands (herewe have set σX = σy = 1, µX = µy = 0): > f:=(x,y,rho)->(1/(2*Pi*sqrt(1-rho^2)))*exp((-1/(2*(1-rho^2)))*(x^2+y^2-2*rho*x*y));¿plot3d(f(x,y,0.8),x=-4..4,y=-4..4,grid=[40,40]); ¿ with(plots): ¿ contourplot(f(x,y,0.8),x=-4..4,y=-4..4,grid=[100,100]); ¿ animate3d(f(x,y,rho),x=-4..4,y=-4..4,rho=-0.8..0.80,frames=20,grid=[30,30]);

One notices right away in the preceding example that when X and Y are uncorrelated(i.e. ρ = 0), then the density fX can be written as

fX(x, y) =

1√2πσ2

X

exp(−(x− µX)2

2σ2X

)

1√2πσ2

Y

exp(−(y − µY )2

2σ2Y

)

,

so that fX factorizes into a product of two univariate normal density functions with para-meters µX , σX and µY ,σY . Remember the fact that if the joint density can be written asthe product of the densities of its component random variables, then the component randomvariables are independent. We have just discovered the essence of the following proposition:

Last Updated: Spring 2005 31 Continuous-Time Finance: Lecture Notes

Page 32: Continuous-Time Finance: Lecture Notes

1.4. THE MULTIVARIATE NORMAL DISTRIBUTIONCHAPTER 1. BROWNIAN MOTION

Proposition 1.4.4. Let X = (X,Y ) be a bivariate normal random vector. Random variablesX and Y are independent if and only if they are uncorrelated i.e. corr(X,Y ) = 0.

For general multivariate normal random vectors we have the following nice result:

Proposition 1.4.5. Let X = (X1, X2, . . . , Xn) be a random vector whose marginal distrib-utions (distributions of component vectors) are normal. If the collection (X1, X2, . . . , Xn) isindependent, then X has a multivariate normal distribution.

This result will enable us to construct arbitrary multivariate normal vectors from n in-dependent standard normals ( but I will have to wait until the section about simulation tosee give you more details). Also, we can prove a well-known robustness property of normaldistributions in the following example:

Example 1.4.6. Statement: Let X and Y be two independent normally distributed randomvariables. Then, their sum X + Y is again a normally distributed random variable.

How do we prove that? The idea is to put X and Y into a random vector X = (X,Y )and conclude that X has a bivariate normal distribution by the Proposition 1.4.5. But then,by the definition of multivariate normality, any linear combination αX+βY must be normal.Take α = β = 1, and you are done.

To recapitulate, here are the most important facts about multivariate normals:

• linear combinations of the components of a multivariate normal are normal

• the multivariate normal distribution is completely determined by its mean and thevariance-covariance matrix: once you know µ and Σ, you know the density (if Σ isinvertible) and everything else there is to know about X

• bivariate normals have (relatively) simple densities and are completely determined byµX , µY , σX , σY and ρ

• for bivariate normals independence is equivalent to ρ = 0.

• random vectors with independent and normally distributed components are multivariatenormal

Last Updated: Spring 2005 32 Continuous-Time Finance: Lecture Notes

Page 33: Continuous-Time Finance: Lecture Notes

CHAPTER 1. BROWNIAN MOTION1.4. THE MULTIVARIATE NORMAL DISTRIBUTION

Exercises

Exercise 1.4.7.

1. Let X = (X1, X2, . . . , Xn) have a multivariate normal distribution with mean µ andthe variance-covariance matrix Σ, and let B be a n × m matrix. Show that the m-dimensional random vector XB has a multivariate normal distribution (Hint: use thedefinition) and compute its mean and the variance-covariance matrix in terms of µ,Σand B.

2. Let X = (X1, X2, X3) be multivariate normal with mean µ = (1, 2, 3) and the variance-covariance matrix

Σ =

5 4 −9

4 6 −7

−9 −7 22

.

Further, let B be a 3× 2 matrix

B =

1 3

4 3

2 7

.

Find the distribution of the bivariate random vector XB. Use software!

Exercise 1.4.8. Derek the Daisy owns a portfolio containing two stocks MHC (MeadowHoney Company) and BII (Bee Industries Incorporated). The statistical analyst working forMr. Mole says that the values of the two stocks in a month from today can be modelled usinga bivariate normal random vector with means µMHC = 101, µBII = 87, variances σMHC = 7,σBII = 10 and corr(MHC,BII) = 0.8 (well, bees do make honey!). Help Derek find theprobability that both stocks will perform better that their means, i.e.

P[MHC ≥ 101,BII ≥ 87].

(Hint: of course you can always find the appropriate region A in R2 and integrate the jointdensity over A, but there is a better way. Try to transform the variables so as to achieveindependence, and then just use symmetry.)

Exercise 1.4.9. Let X = (X1, X2) be a bivariate normal vector with mean µ = (2, 3) and

variance-covariance matrix Σ =

(1 −2−2 4

).

Last Updated: Spring 2005 33 Continuous-Time Finance: Lecture Notes

Page 34: Continuous-Time Finance: Lecture Notes

1.4. THE MULTIVARIATE NORMAL DISTRIBUTIONCHAPTER 1. BROWNIAN MOTION

1. Find the distributions of the following random variables:

(a) X1

(b) X1 +X2

(c) aX1 + bX2

2. What is the correlation (coefficient) between X1 and X2?

Exercise 1.4.10. Let X have a multivariate normal distribution with the mean (vector)µ ∈ Rn and the positive-definite symmetric variance-covariance matrix Σ ∈ Rn×n. Let a bean arbitrary vector in Rn. Prove that the random variable

Y =(X − µ)aT

√aΣaT

has the standard normal distribution N(0, 1).

Exercise 1.4.11. Let X = (X1, X2, X3) be a multivariate-normal vector such that E[X1] =E[X2] = E[X3] = 0 and Var[X1] = 1, Var[X2] = 2 and Var[X3] = 3. Suppose further that X1

is independent of X2−X1, X1 is independent of X3−X2, and X2 is independent of X3−X2.

1. Find the variance-covariance matrix of X.

2. Find P[X1 +X2 +X3 ≤ 1] in terms of the distribution function Φ(x) =∫ x−∞

1√2πe−

y2

2 dy

of the standard unit normal.

Exercise 1.4.12. The prices (S1, S2) of two stocks at a certain date are modelled in thefollowing way: let X = (X1, X2) be a normal random vector

with mean µ = (µ1, µ2) and the variance-covariance matrix Σ =

(Σ11 Σ12

Σ21 Σ22

).

The stock-prices are then given by S1 = exp(X1) and S2 = exp(X2).

(a) Find Cov(S1, S2) in terms of µ and Σ.

(b) Find the probability that at the price of at least one of the stocks is in the interval[s1, s2], for some constants 0 < s1 < s2. You can leave your answer in terms of thefollowing:i) the cdf of the unit normal

Φ(x) =∫ x

0

1√2πe−

ξ2

2 dξ, and

ii) the joint cdf of a bivariate normal with zero mean, unit marginal variances, andcorrelation ρ:

Ψ(x1, x2, ρ) =∫ x1

−∞

∫ x2

−∞

1

2π√

1− ρ2exp

(− 1

2(1− ρ2)(ξ21 + ξ22 − 2ρξ1ξ2)

)dξ1 dξ2

Last Updated: Spring 2005 34 Continuous-Time Finance: Lecture Notes

Page 35: Continuous-Time Finance: Lecture Notes

CHAPTER 1. BROWNIAN MOTION1.4. THE MULTIVARIATE NORMAL DISTRIBUTION

Solutions to Exercises in Section 1.4

Solution to Exercise 1.4.7:

1. First we prove that the random vector X ′ has a bivariate normal distribution. It willbe enough (by Definition 1.4.1) to prove that X ′aT is a normal random variable forany vector a ∈ Rm:

X ′aT = (XB)aT = XcT , where c = BaT .

Using the multivariate normality of X, the random varaible XcT is normal for anyc ∈ Rn, and therefore so is X ′aT .

It remains to identify the mean and the variance of X ′:

µ′ = E[X ′] = E[XB] = (by componentwise linearity of expectation) = E[X]B = µB.

As for the variance-covariance matrix, we use the linearity of expectation again andrepresentation (1.3.2) to obtain

Σ′ = E[(X ′−µ′)T (X ′−µ′)] = E[BT (X−µ)T (X−µ)B] = BT E[(X−µ)T (X−µ)]B = BTΣB.

We conclude that X ′ ∼ N(µB,BTΣB).

2. Note that here we have a special case of part 1., so we know that we are dealing witha normal random vector. The mean and the variance can be obtained using the resultsjust obtained:

Σ′ = BTΣB =

(7 3 32 4 1

) 5 4 −94 6 −7−9 −7 22

1 3

4 32 7

=

(73 100100 577

),

and

µ′ = µB =(1, 2, 3

)1 34 32 7

=(15, 30

).

Therefore the distribution of X ′ is multivariate normal:

N

((15, 30),

(73 100100 577

))

Here are the Maple commands I used to get the new mean and the new variance-covariance matrix:

Last Updated: Spring 2005 35 Continuous-Time Finance: Lecture Notes

Page 36: Continuous-Time Finance: Lecture Notes

1.4. THE MULTIVARIATE NORMAL DISTRIBUTIONCHAPTER 1. BROWNIAN MOTION

> with(linalg):

> B:=matrix(3,2,[1,3,4,3,2,7]);

> Sigma:=matrix(3,3,[5,4,-9,4,6,-7,-9,-7,22]);

> mu:=matrix(1,3,[1,2,3]);

> mu1:=evalm(mu &* B);

> Sigma1:=evalm(transpose(B) &* Sigma &* B);

. . . and some more that will produce the picture of the density of X ′:

> Q:=evalm(transpose([x1,x2]-mu1) &* inverse(Sigma1) &* ([x1,x2]-mu1));

> f:=(x1,x2)->1/(2*Pi*sqrt(det(Sigma1)))*exp(-1/2*Q);

> plot3d(f(x1,x2),x1=-30..40,x2=-70..70);

Solution to Exercise 1.5.13: Let X = (X1, X2) be the random vector, where X1 is theprice of MHC, and X2 of BII;

µ = (110, 87), Σ =

(49 5656 100

).

We are interested in the probability p = P[X1 ≥ 101, X2 ≥ 87], and we can rewrite this asp = P[X ∈ A], where A =

(x1, x2) ∈ R2 : x1 ≥ 101, x2 ≥ 87

, and compute

p =∫∫A

fX(x1, x2) dx1dx2 =∫ ∞

110

∫ ∞

87fX(x1, x2) dx1dx2,

where fX is the density of the multivariate normal random vector X:

fX(x1, x2) =exp

(− 1

2(1−0.82)

((x1−110)2

72 + (x2−87)2

102 − 2 · 0.8 (x1−110)(x2−87)7·10

))2π7 · 10

√1− 0.82

.

And while this integral can be evaluated explicitly, it is quite cumbersome and long.A different approach would be to try and modify (transform) the vector X = (X1, X2)

in order to simplify the computations. First of all, to get rid of the mean, we note that

p = P[X1 ≥ 110, X2 ≥ 87] = P[X1 − 110 ≥ 0, X2 − 87 ≥ 0] = P[X1 − µ1 ≥ 0, X2 − µ2 ≥ 0],

and realize that X ′ , (X1 − µ1, X2 − µ2) is still a normal random vector with the samevariance-covariance matrix and mean µ′ = (0, 0), so that

p = P[X ′1 ≥ 0, X ′

2 ≥ 0].

Last Updated: Spring 2005 36 Continuous-Time Finance: Lecture Notes

Page 37: Continuous-Time Finance: Lecture Notes

CHAPTER 1. BROWNIAN MOTION1.4. THE MULTIVARIATE NORMAL DISTRIBUTION

If X ′1 and X ′

2 were independent, the probability p would be very easy to compute: P[X ′1 ≥

0]P[X ′2 ≥ 0] = 1/4 by symmetry. However, ρ 6= 0, so we cannot do quite that. We can,

however, try to form linear combinations Y1 and Y2 (or ’mutual funds’ in the language offinance) so as to make Y1 and Y2 into independent standard normal variables (N(0, 1)), andsee if the computations simplify. So, let us put

Y = (Y1, Y2) = XA = (X1, X2)

(a11 a12

a21 a22

),

and try to determine the coefficients aij so that the vector Y has the identity matrix as itsvariance-covariance matrix. From previous exercise we know how to compute the variance-covariance matrix of Y , so that we get the following matrix equation; we are looking for thematrix A so that

AT

(49 5656 100

)A =

(1 00 1

)(1.4.2)

However, we do not need to rush and solve the equation, before we see what informationabout A we actually need. Let us suppose that we have the sought-for matrix A, and thatit is invertible (it will be, do not worry). Note that

p = P[X ∈ (0,∞)× (0,∞)] = P[AX ∈ A([0,∞)× [0,∞)

)] = P[Y ∈ A

([0,∞)× [0,∞)

)],

so all we need from A is the shape of the region D = A([0,∞) × [0,∞)

). Since A is a

linear transformation, the region D will be the infinite wedge between (a11, a12) = (1, 0)Aand (a21, a22) = (0, 1)A. Actually, we need even less information form A: we only need theangle ]D of the wedge D. Why? Because the probability we are looking for is obtained byintegrating the density of Y over D, and the density of Y is a function of x2

1 + x22 (why?), so

everything is rotationally symmetric around (0, 0). It follows that p = ]D/2π.To calculate ]D, we remember our analytic geometry:

cos(]D) = cos(]((1, 0)A, (0, 1)A

))=

a11a21 + a12a22√a2

11 + a212

√a2

21 + a222

,

and we also compute

B , AAT =

(a2

11 + a212 a11a21 + a12a22

a11a21 + a12a22 a221 + a2

22

)

and realize that cos(]D) = B12√B11B22

, so that the only thing we need to know about A is the

product B = AT A. To compute B, we go back to equation (1.4.2) and multiply it by A onthe left and AT on the right to obtain

(AAT )Σ(AAT ) = (AAT ),

Last Updated: Spring 2005 37 Continuous-Time Finance: Lecture Notes

Page 38: Continuous-Time Finance: Lecture Notes

1.4. THE MULTIVARIATE NORMAL DISTRIBUTIONCHAPTER 1. BROWNIAN MOTION

and since A and AAT will be invertible, we can multiply both sides by by (AAT )−1Σ−1

from the right to obtain

B = AAT = Σ−1, and so cos(]D) = −ρ,

by simply plugging the elements of Σ−1 into the expression for cos(]D).We can now calculate p:

p =arccos(−ρ)

2π.= 0.398.

Solution to Exercise 1.4.9:

1. All three random variables are linear combinations of elements of a multivariate normaland therefore have normal distributions themselves. Thus, it is enough to identify theirmeans and variances:

(a) E[X1] = µ1 = 2, Var[X1] = Σ11 = 1, and so X1 ∼ N(2, 1).

(b) E[X1 +X2] = µ1 +µ2 = 5, Var[X1 +X2] = Var[X1]+Var[X2]+2 Cov[X1, X2] = 1,and so X1 +X2 ∼ N(5, 1).

(c) E[aX1 + bX2] = 2a + 3b, Var[aX1 + bX2] = a2 + 4b2 − 4ab, and so aX1 + bX2 ∼N(2a+ 3b, a2 + 4b2 − 4ab).

2. The formula says

ρ(X1, X2) =Cov[X1X2]√

Var[X1] Var[X2]=

Σ12√Σ11Σ22

= −1.

Solution to Exercise 1.4.10: Realize first that Y can be written as a linear combinationof the components of X and is therefore normally distributed. It is thus enough to provethat E[Y ] = 0 and Var[Y ] = 1:

E[Y ] =1√aΣaT

E[(X − µ)aT ] =1√aΣaT

n∑k=1

akE[Xk − µk] = 0.

Var[Y ] =1

aΣaTVar[(X − µ)aT ] =

1aΣaT

E[((X − µ)aT

)2] =

1aΣaT

E[((X − µ)aT

)T((X − µ)aT

)]

=aE[(X − µ)T (X − µ)]aT

aΣaT= 1,

since Σ = E[(X − µ)T (X − µ)] by definition.

Solution to Exercise 1.4.11:Solution to Exercise 1.4.12:

Last Updated: Spring 2005 38 Continuous-Time Finance: Lecture Notes

Page 39: Continuous-Time Finance: Lecture Notes

CHAPTER 1. BROWNIAN MOTION1.4. THE MULTIVARIATE NORMAL DISTRIBUTION

(a) First of all, let us derive an expression for E[eZ ] for a normal random variable Z withmean µ and variance σ2. To do that, we write Z = µ + σZ0, where Z0 is a standardunit normal. Then

E[eZ ] = eµE[eσZ0 ] = eµ∫ ∞

−∞eσx 1√

2πe−

x2

2 dx = eµ+σ2

2

∫ ∞

−∞

1√2πe−

(x−σ)2

2 dx = eµ+σ2

2 .

We know that X1 ∼ N(µ1,Σ11) and X2 ∼ N(µ2,Σ22), and so

E[eX1 ] = eµ1+Σ2

112 , and E[eX2 ] = eµ2+

Σ2222 .

Further, X1 +X2 ∼ N(µ1 + µ2,Σ11 + 2Σ12 + Σ22) because Var[X1 +X2] = Var[X1] +Var[X2] + 2 Cov[X1, X2], so that

E[eX1eX2 ] = E[eX1+X2] = eµ1+µ2+

Σ211+Σ2

22+2Σ122 .

Therefore

Cov(eX1 , eX2) = E[eX1eX2 ]− E[eX1 ]E[eX2 ] = eµ1+µ2+Σ2

11+Σ222+2Σ122 − eµ1+

Σ2112 eµ2+

Σ2222

= eµ1+µ2+Σ2

11+Σ222

2

(eΣ12 − 1

)(b) At least one of the prices S1 = eX1 , S2 = eX2 will take value in [s1, s2] if and only if

at least one of the components X1, X2 of the bivariate normal vector X takes valuesin the interval [log(s1), log(s2)]. This probability is equal to the probability that therandom vector X ′ = (X ′

1, X′2) = (X1−µ1√

Σ11, X2−µ2√

Σ22) falls into the set

A = (y1, y2) ∈ R : y1 ∈ [l1, r1] or y2 ∈ [l2, r2]

where

l1 =log(s1)− µ1√

Σ11, r1 =

log(s2)− µ1√Σ11

, l2 =log(s1)− µ2√

Σ22and r2 =

log(s2)− µ2√Σ22

.

We have transformed X into X ′ because X ′ is now a bivariate normal vector withcorrelation coefficient ρ′ = Σ12√

Σ11Σ22, whose marginals X1 and X2 are standard normal

(mean 0, and variance 1). To calculate the probability P[X ′ ∈ A], we write

P[X ′ ∈ A] = 1− P[X ′ ∈ Ac]

= 1− P[X ′1 < l1, X

′2 < l2]− P[X ′

1 < l1, X′2 > r2]− P[X ′

1 > r1, X′2 < l2]− P[X ′

1 > r1, X′2 > r2]

= 1−Ψ(l1, l2, ρ′)−(Φ(l1)−Ψ(l1, r2, ρ′)

)−(Φ(l2)−Ψ(l2, r1, ρ′)

)−Ψ(−r1,−r2, ρ′).

We have used the fact that the distribution of X ′ is symmetric (equal to that of −X ′).

Last Updated: Spring 2005 39 Continuous-Time Finance: Lecture Notes

Page 40: Continuous-Time Finance: Lecture Notes

1.5. BROWNIAN MOTION DEFINED CHAPTER 1. BROWNIAN MOTION

1.5 Brownian Motion Defined

1.5.1 Stochastic Processes

After random variables and random vectors come stochastic processes. You can think of themas of very large random vectors (where n = ∞), or as random elements having functions(instead of numbers or vectors) as values. Formally, we have the following definition

Definition 1.5.1. A stochastic process is a collection

Xt : t ∈ T ,

of random variables Xt, defined on the same probability space. T denotes the index set ofthe stochastic process and is usually taken to be

• T = N (discrete-time processes),

• T = [0, T ], for some T ∈ R (finite-horizon continuous-time processes), or

• T = [0,∞) (infinite-horizon continuous-time processes).

The notion of a stochastic processes is one of the most important in the modern probabilitytheory and mathematical finance. It used to model a myriad of various phenomena wherea quantity of interest varies continuously through time in a non-predictable fashion. In thiscourse we will mainly be interested in continuous-time stochastic processes on finite andinfinite time-horizons.

0

5

10

15

20

25

30

35

40

Jan-18-2002 Dec-18-2002

Figure 1.12: Dow Jones Chemical

Every stochastic process can be viewedas a function of two variables - t andω. For each fixed t, ω 7→ Xt(ω) isa random variable, as postulated inthe definition. However, if we changeour point of view and keep ω fixed,we see that the stochastic process isa function mapping ω to the real-valued function t 7→ Xt(ω). Thesefunctions are called the trajectoriesof the stochastic process X. Figureon the left shows the graph of the

evolution of the Dow Jones Chemical index from January 2002 to December 2002. It is nothard to imagine the stock index pictured here as a realization (a trajectory) of a stochasticprocess.

Last Updated: Spring 2005 40 Continuous-Time Finance: Lecture Notes

Page 41: Continuous-Time Finance: Lecture Notes

CHAPTER 1. BROWNIAN MOTION 1.5. BROWNIAN MOTION DEFINED

1.5.2 The Distribution of a Stochastic Process

In contrast to the case of random vectors or random variables, it will not be very easy todefine a notion of a distribution for a stochastic process. Without going into details whyexactly this is a problem, let me just mention that the main culprit is infinity again. Thereis a way out, however, and it is provided by the notion of finite-dimensional distributions:

Definition 1.5.2. The finite-dimensional distributions of a random process (Xt)t∈Tare all distribution functions of the form

F(Xt1 ,Xt2 ,...,Xtn ) , P[Xt1 ≤ x1, Xt2 ≤ x2, . . . , Xtn ≤ xn],

for all n ∈ N and all n-tuples (t1, t2, . . . tn) of indices in T .

For a huge majority of stochastic processes encountered in practice, the finite-dimensionaldistributions (together with the requirement of regularity of its paths) will be sufficient todescribe their full probabilistic structure.

1.5.3 Brownian Motion

The central object of study of this course is Brownian motion. It is one of the fundamentalobjects of applied mathematics and one of the most symmetric and beautiful things in wholewide world (to quote a mathematician who wanted to remain anonymous). To define theBrownian motion, it will be enough to specify its finite-dimensional distributions and ask forits trajectories to be continuous:

Definition 1.5.3. Brownian motion is a continuous-time, infinite-horizon stochastic process(Bt)t∈[0,∞) such that

1. B0 = 0 (Brownian motion starts at 0),

2. for any t > s ≥ 0,

(a) the increment Bt − Bs is normally distributed with mean µ = 0 and varianceσ2 = t− s (the increments of the Brownian motion are normally distributed)

(b) the random variables Bsm −Bsm−1 , Bsm−1 −Bsm−2 , . . . , Bs1 −B0, are independentfor any m ∈ N and any s1 < s2 < · · · < sm. (the increments of the Brownianmotion are independent)

3. the trajectories of a Brownian motion are continuous function (Brownian paths arecontinuous).

Did I really specify the finite-dimensional distributions of the Brownian motion in thepreceeding definition? The answer is yes:

Last Updated: Spring 2005 41 Continuous-Time Finance: Lecture Notes

Page 42: Continuous-Time Finance: Lecture Notes

1.5. BROWNIAN MOTION DEFINED CHAPTER 1. BROWNIAN MOTION

Proposition 1.5.4. Let (Bt)t∈[0,∞) be a Brownian motion. For any n-tuple of indices0 ≤ t1 < t2 < · · · < tn, the random vector (Bt1 , Bt2 , . . . , Btn) has the multivariate normaldistribution with mean µ = (0, 0, . . . , 0) and the variance-covariance matrix

Σ =

t1 t1 . . . t1t1 t2 . . . t2...

.... . .

...t1 t2 . . . tn

, i.e. Σij = min(ti, tj) = tmin(i,j).

Proof. First of all, let us prove that the random vectors (Bt1 , Bt2 , . . . , Btn) have the multi-variate normal distribution. By definition (see Definition 1.4.1) it will be enough to provethat X , a1Bt1 + a2Bt2 + · · · + anBtn) is a normally distributed random variable for eachvector a = (a1, a2, . . . , an). If we rewrite X as

X = an(Btn −Btn−1) + (an + an−1)(Btn−1 −Btn−2) + (an + an−1 + an−2)(Btn−2 −Btn−3) + . . .

we see that X is a linear combination of the increments Xk , Btk −Btk−1. I claim that these

increments are independent. To see why, note that the last increment Xn is independent ofBt1 , Bt2 , . . . Btn−1 by definition, and so it is also independent of Bt2 − Bt1 , Bt3 − Bt2 , . . . .Similarly the incrementBtn−1−Btn−2 is independent of ’everybody’ before it by the same argu-ment. We have therefore proven that X is a sum of independent normally distributed randomvariables, and therefore normally distributed itself (see Example 1.4.6). This is true for eachvector a = (a1, a2, . . . , an), and therefore the random vector (Bt1 , Bt2 , . . . , Btn) is multivariatenormal. Let us find the mean vector and variance-covariance matrix of (Bt1 , Bt2 , . . . , Btn).First, for m = 1, . . . , n, we write the telescoping sum

E[Btm ] = E[Xm] + E[Xm−1] + . . .E[X1], where Xk , Btk −Btk−1.

By Definition 1.5.3-2.(a), the random variables Xk, k = 1 . . . ,m have mean 0, so E[Btm ] = 0and, thus, µ = (0, 0, . . . , 0). To find the variance-covariance matrix we suppose that i < j

and writeΣij = E[BtiBtj ] = E[B2

ti ]− E[Bti(Btj −Bti)] = E[B2ti ],

since Bti and Btj−Bti are independent by Definition 1.5.3-2.(b). Finally, the random variableBti is normally distributed with mean 0, and variance ti (just take t = ti and s = 0 inDefinition 1.5.3, part 2.(a)), and hence E[B2

ti ] = Var[Bti ] = ti. The cases i = j and i ≥ j canbe dealt with analogously.

Last Updated: Spring 2005 42 Continuous-Time Finance: Lecture Notes

Page 43: Continuous-Time Finance: Lecture Notes

CHAPTER 1. BROWNIAN MOTION 1.5. BROWNIAN MOTION DEFINED

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1−2

−1.5

−1

−0.5

0

0.5

1

1.5

2

Figure 1.13: A Trajectory of Brownian motion

The graph on the right shows a typ-ical trajectory (realization) of a Brown-ian motion. It is important to note thatfor a random variable a realization isjust one point on the real line, and an-tuple of points for a random vector.For Brownian motion it is a whole func-tion. Note, also, the jaggedness of thedepicted trajectory. Later on we willsee that no trajectory of the Brownianmotion is differentiable at any of its points. . .

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1−2

−1.5

−1

−0.5

0

0.5

1

1.5

2

Figure 1.14: A Trajectory of an “Independent”Process

. . . nevertheless, by the definition, everytrajectory of Brownian motion is contin-uous. That is not the case with the (ap-proximation to) a trajectory of an “in-dependent process” shown on the right.An “independent process“ is the process(Xt)t∈[0,1] for which Xt and Xs are inde-pendent for t 6= s, and normally distrib-uted. This is what you might call - “ acompletely random function [0, 1] → R”.Blah blah blah blah blah blah blah blahblah Blah blah blah blah blah blah blahblah blah Blah blah blah blah blah blahblah blah blah Blah blah blah blah blahblah blah blah blah Blah blah blah blahblah blah blah blah blah

Last Updated: Spring 2005 43 Continuous-Time Finance: Lecture Notes

Page 44: Continuous-Time Finance: Lecture Notes

1.5. BROWNIAN MOTION DEFINED CHAPTER 1. BROWNIAN MOTION

1.5.4 A Little History

Figure 1.15: Founding Fathers of BrownianMotion

Brownian Motion, the physical phenomenon,was named after the English naturalist and botanistRobert Brown (upper-right picture) who dis-covered it in 1827, is the zig-zagging motionexhibited by a small particle, such as a grainof pollen, immersed in a liquid or a gas. Thefirst explanation of this phenomenon was givenby Albert Einstein (lower-left caricature) in1905. He showed that Brownian motion couldbe explained by assuming the immersed particlewas constantly bombarded by the molecules ofthe surrounding medium. Since then, the ab-stracted process has been used beneficially insuch areas as analyzing price levels in the stockmarket and in quantum mechanics. The math-ematical definition and abstraction of the phys-ical process as a stochastic process, given by theAmerican mathematician Norbert Wiener (lower-right photo) in a series of papers starting in1918. Generally, the terms Brownian motion

and Wiener process are the same thing, although Brownian motion emphasizes the physicalaspects, and Wiener process emphasizes the mathematical aspects. Bachelier process is anuncommonly applied term meaning the same thing as Brownian motion and Wiener process.In 1900, Louis Bachelier (upper-left photo) introduced the limit of random walk as a modelfor the prices on the Paris stock exchange, and so is the originator of the idea of what is nowcalled Brownian motion. This term is occasionally found in financial literature and Europeanusage.

1.5.5 Gaussian Processes

Back to mathematics. We have seen that both the Brownian motion and the Independentprocess share the characteristic that their finite-dimensional distributions are multivariatenormal. In case of the Independent process, it is because the distributions of random variablesXt are normal and independent of each other. In case of the Brownian motion, it is the contentof Proposition 1.5.4. The class of processes which share this property is important enough tomerit a name:

Definition 1.5.5. A continuous-time stochastic process (Xt)t∈[0,∞) (or (Xt)t∈[0,T ] for some

Last Updated: Spring 2005 44 Continuous-Time Finance: Lecture Notes

Page 45: Continuous-Time Finance: Lecture Notes

CHAPTER 1. BROWNIAN MOTION 1.5. BROWNIAN MOTION DEFINED

T ∈ R) is called a Gaussian processes if its finite-dimensional distributions are multivariatenormal, i.e. for each t1 < t2 < · · · < tn the random vector (Xt1 , Xt2 , . . . , Xtn) is multivariatenormal.

Figure 1.16: C. F. Gauss on a 10DM bill

The name - Gaussian process- derives from the fact that thenormal distribution is sometimesalso called the Gaussian distri-bution, after Carl Friedrich Gauss,who discovered many of its prop-erties. Gauss, commonly viewedas one of the greatest mathe-maticians of all time (if not thegreatest ), is (was) properly hon-ored by Germany on their 10

Deutschmark bill shown in the figure left. With each Gaussian process X we associatetwo functions: µX : [0,∞) → R and cX : [0,∞) × [0,∞) → R. The function µX , de-fined by, µX(t) , E[Xt] is called the expectation function of X and can be thought of(loosely) as the trend around which the process X fluctuates. The function cX , defined bycX(t, s) = Cov(Xt, Xs), is called the covariance function of the process X and we can in-terpret it as the description of dependence structure of the process X. It is worth noting thatthe function c is positive semi-definite (see Exercise 1.4.9 for the precise definition andproof). Conversely, it can be proved that given (almost any) function µ and any positive-semidefinite function c, we can construct a stochastic process which has exactly µ as itsexpectation function, and c as its covariance function.

1.5.6 Examples of Gaussian Processes

Finally, here are several examples of Gaussian processes used extensively in finance andelsewhere.

Example 1.5.6. (Independent Process) The simplest example of a Gaussian process isthe Independent process - the process (X)t∈[0,∞) such that each Xt is normally distributedwith mean µ and variance σ2, and such that the collection random variables (X)t∈[0,∞) areindependent. The expectation and covariance functions are given by

µX(t) = µ, cX(t, s) =

σ2, t = s,

0, t 6= s.

Last Updated: Spring 2005 45 Continuous-Time Finance: Lecture Notes

Page 46: Continuous-Time Finance: Lecture Notes

1.5. BROWNIAN MOTION DEFINED CHAPTER 1. BROWNIAN MOTION

Example 1.5.7. (Brownian Motion) The Brownian motion is the most important exampleof a Gaussian process. The expectation and covariance functions are

µX(t) = 0, cX(t, s) = min(t, s).

Example 1.5.8. (Brownian Motion with Drift) Let (B)t∈[0,∞) be a Brownian motionand let b be a constant in R. We define the process (X)t∈[0,∞) by

Xt = Bt + bt.

The process (X)t∈[0,∞) is still a Gaussian process with the expectation and covariance func-tions

µX(t) = bt, cX(t, s) = min(t, s).

Example 1.5.9. (Brownian Bridge) The Brownian Bridge (or Tied-Down BrownianMotion) is what you get from Brownian Motion (Bt)t∈[0,1] on the finite interval [0, 1], whenyou require that B1 = 0. Formally, Xt = Bt − tB1, where B is some Brownian motion.In the exercises I am asking you to prove that X is a Gaussian process and to compute itsexpectation and covariance function.

1.5.7 Brownian Motion as a Limit of Random Walks

1 2 3 4 5 6

n

Sn

Figure 1.17: A Trajectory of a Random Walk

Blah blah blah blah blah blah blah blahblah Blah blah blah blah blah blah blah blahblah Brownian motion describes an idealiza-tion of a motion of a particle subjected to in-dependent ‘shoves’. It is therefore not unrea-sonable to imagine our particle as a closelyapproximated by a random walker taking eachstep, independently of its position or previ-ous steps, either to the left or to the rightwith equal probabilities. Mathematically, wehave the following definition: Blah blah blahblah blah blah blah blah blah Blah blah blahblah blah blah blah blah blah Blah blah blah

blah blah blah blah blah blah Blah blah blah blah blah blah blah blah blah Blah blah blahblah blah blah blah blah blah Blah blah blah blah blah blah blah blah blahDefinition 1.5.10. Let X1, X2, . . . be a sequence of independent Bernoulli random variables,i.e.,

Xi ∼

(−1 11/2 1/2

)

Last Updated: Spring 2005 46 Continuous-Time Finance: Lecture Notes

Page 47: Continuous-Time Finance: Lecture Notes

CHAPTER 1. BROWNIAN MOTION 1.5. BROWNIAN MOTION DEFINED

The stochastic process (Sn)n∈N0 defined by

S0 = 0, Sn = X1 +X2 + · · ·+Xn, n ≥ 1,

is called the random walk.

Suppose now that we rescale the random walk in the following way: we decrease the timeinterval between steps from 1 to some small number ∆t, and we accordingly decrease thestep size to another small number ∆x. We would like to choose ∆t and ∆x so that after 1unit of time, i.e. n = 1/∆t steps, the standard deviation of the resulting random variableSn is normalized to 1. By the independence of steps of the random walk - and the fact thatthe for the random variable which takes values ∆x and −∆x with equal probabilities, thevariance is (∆x)2 - we have

Var[Sn] = Var[X1] + Var[X1] + · · ·+ Var[Xn] = n(∆x)2 =(∆x)2

∆t,

so we should choose ∆x =√

∆t. Formally, for each ∆t, we can define an infinite-horizoncontinuous-time process (B∆t)t∈[0,∞) by the following procedure:

• take a random walk (Sn)n∈N0 and multiply all its increments by ∆x =√

∆t to obtainthe new process (S∆x

n )n∈N0 -

S∆x0 = 0, S∆x

n =√

∆tX1 +√

∆tX2 + · · ·+√

∆tXn =√

∆tSn, n ≥ 1.

• define the process (B∆t)t∈[0,∞) by

B∆tt =

S∆x

n , t is of the form t = n∆t,

interpolate linearly, otherwise.

It can be shown that the processes (B∆t)t∈[0,∞) converge to a Brownian motion in a mathe-matically precise sense (the sense of weak convergence), but we will neither elaborate onthe exact definition of this form of convergence, nor prove any rigorous statements. Heuris-tically, weak convergence will allow us to use the numerical characteristics of the paths ofthe process B∆t, which we can simulate on a computer, as approximations of the analogouscharacteristics of the Brownian Motion. We will cover the details later in the section onsimulation.

Last Updated: Spring 2005 47 Continuous-Time Finance: Lecture Notes

Page 48: Continuous-Time Finance: Lecture Notes

1.5. BROWNIAN MOTION DEFINED CHAPTER 1. BROWNIAN MOTION

Exercises

Exercise 1.5.11. Let (B)t∈[0,T ] be a Brownian motion. Which of the following processes areBrownian motions? Which are Gaussian processes?

• Xt , −Bt, t ∈ [0,∞).

• Xt =√tB1, t ∈ [0,∞)

• Xt = |Bt| t ∈ [0,∞)

• Xt = BT+t −BT , t ∈ [0,∞), where T is a fixed constant - T ∈ [0,∞)

• Xt = Bet , t ∈ [0,∞)

• Xt = 1√uBut, t ∈ [0,∞), where u is a fixed positive constant u ∈ (0,∞)

Exercise 1.5.12. Prove that the Brownian motion with drift is a Gaussian process with theexpectation and covariance functions given in (1.5.8).

Exercise 1.5.13. Prove that the Brownian Bridge is a Gaussian process and compute itsexpectation and covariance functions. Does it have independent increments?

Exercise 1.5.14. A function γ : [0,∞)× [0,∞) → R is called positive semi-definite if forany n ∈ N and any t1 < t2 < · · · < tn the matrix Γ defined by

Γ ,

γ(t1, t1) γ(t1, t2) . . . γ(t1, tn)γ(t2, t1) γ(t2, t2) . . . γ(t2, tn)

......

. . ....

γ(tn, t1) γ(tn, t2) . . . γ(tn, tn)

,

is symmetric positive semi-definite. Let (Xt)t∈[0,∞) be a Gaussian process. Show that itscovariance function cX is positive semi-definite.

Exercise 1.5.15. Let (B)t∈[0,T ] be a Brownian motion , and let X =∫ 10 Bs ds be the area

under the trajectory of B. Compute the mean and variance of X by following the followingprocedure

1. Compute the mean and variance of the random variable X∆t =∫ 10 B

∆ts ds.

2. Let ∆t→ 0 to obtain the mean and variance of X.

Exercise 1.5.16. Let (Bt)t∈[0,1] be a Brownian motion on [0, 1], and let (Xt)t∈[0,1], Xt = Bt−tB1 be the corresponding Brownian Bridge. Define the process (Yt)t∈[0,∞) by Yt = (1+t)X t

1+t.

Show that Y is a Brownian motion, by showing that

Last Updated: Spring 2005 48 Continuous-Time Finance: Lecture Notes

Page 49: Continuous-Time Finance: Lecture Notes

CHAPTER 1. BROWNIAN MOTION 1.5. BROWNIAN MOTION DEFINED

• (Yt)t∈[0,∞) has continuous paths,

• (Yt)t∈[0,∞) is a Gaussian process, and

• the mean and the covariance functions of (Yt)t∈[0,∞) coincide with the mean and thecovariance functions of the Brownian motion.

Use the result above to show that for any Brownian motion (Wt)t∈[0,∞) we have

limt→∞

Wt

t= 0.

Exercise 1.5.17. (Brownian Motion in the Plane)

Let (B1)t∈[0,T ] and (B2)t∈[0,T ] be two independent Brownian motions (independent meansthat for any t, B1

t is independent of the whole process (B2)t∈[0,T ], and vice versa). You canthink of the trajectory t 7→ (B1

t (ω), B2t (ω)) as a motion of a particle in the plane (see picture).

Let the random variable Rt denote the distance from the Brownian particle (B1t , B

2t ) to

the origin, i.e.

Rt =√

(B1t )2 + (B2

t )2.

Find the density fRt of Rt, for fixed t > 0. Calculate the mean µ(t) and standard deviation√σ2(t) of Rt as functions of t, and sketch their graphs for t = [0, 5]. What do they say about

the expected displacement of the Brownian particle from the origin as a function of timet? (Hint: compute first the distribution function FRt by integrating a multivariate normaldensity over the appropriate region, and then get fRt from FRt by differentiation. To get µ(t)and

√σ2(t) you might want to use Maple, or Mathematica, or some other software package

capable of symbolic integration, because the integrals involved might get ugly. Once you getthe results, you might as well use the same package to draw the graphs for you.)

Exercise 1.5.18. (Exponential Brownian Motion) The process (S)t∈[0,T ] defined by

St = s0 exp(αBt + βt),

for some constants s0, α, β ∈ R, with α 6= 0 and s0 > 0, is called exponential Brown-ian motion. The exponential Brownian motion is one of the most wide-spread stock-pricemodels today, and the celebrated Black-Scholes formula rests on the assumption that stocksfollow exponential Brownian motion. It is the purpose of this exercise to give meanings toparameters α and β.

1. Calculate the expectation E[St] as a function of t. (Do not just copy it from the book.Do the integral yourself.) What must be the relationship between α and β, so that Xt

models a stock with 0 rate of return, i.e. E[St] = s0 for all t ≥ 0?

Last Updated: Spring 2005 49 Continuous-Time Finance: Lecture Notes

Page 50: Continuous-Time Finance: Lecture Notes

1.5. BROWNIAN MOTION DEFINED CHAPTER 1. BROWNIAN MOTION

2. If you had to single out one parameter (can be a function of both α and β) and callit the rate of return of the stock (S)t∈[0,T ], what would it be? Why? Why would somepeople call your conclusion paradoxical?

Exercise 1.5.19. (a) Let X1 and X2 be two independent normal random variables withmean 0 and variance 1. Find the density function of the random variable Y = X1

X2. Do

not worry about the fact that X2 might be equal to 0. It happens with probability 0.(Hint: You might want to use polar coordinates here.)

(b) Let Bt, t ∈ [0,∞) be a Brownian motion. Define the process Yt =Bt−B 2

3 t

B 23 t−B 1

3 t, t ∈ (0,∞),

Y0 = 0. Is Yt a Gaussian process? If it is, find its mean and covariance functions. If itis not, prove rigorously that it is not.

Last Updated: Spring 2005 50 Continuous-Time Finance: Lecture Notes

Page 51: Continuous-Time Finance: Lecture Notes

CHAPTER 1. BROWNIAN MOTION 1.5. BROWNIAN MOTION DEFINED

Solutions to Exercises in Section 1.5

Solution to Exercise 1.5.11 (Note: in what follows we always assume t > s).

1. Xt = −Bt is a continuous process, X0 = −B0 = 0, and its increments Xt − Xs areindependent of the past because they are just the negatives of the increments of theBrownian motion. Finally, the distribution of Xt − Xs = −(Bt − Bs) is normal withmean 0 and variance t− s, by symmetry of the normal distribution. Therefore, Xt is aBrownian motion.

2. The process Xt =√tB1 is not a Brownian motion because its increments Xt − Xs =

B1(√t−

√s) are not independent of the past: Xt −Xs =

√t−√

s√rXr, for 0 < r < t− s,

so Xt −Xs cannot be independent of Xr.

On the other hand, Xt is a Gaussian process. To show that, we take indices t1, . . . , tn,and constants a1, a2, . . . an and form the linear combination Y = a1Xt1 + a2Xt2 + · · ·+anXtn . By the definition of Xt, we have Y = γB1, with γ = (a1

√t1 + a2

√t2 + · · · +

an√tn). Therefore Y is a normally distributed random variable, and we conclude that

X is a Gaussian process.

3. Xt = |Bt| is not a Gaussian process because X1 = |B1| is not a normally distributedrandom variable (it is nontrivial, and yet never takes negative values), as it should bein the case of a Gaussian process. Since Brownian motion is a Gaussian process, weconclude that Xt is not a Brownian motion, either.

4. The process Xt = BT+t − BT has continuous paths (since the Brownian motion Bt

does), and X0 = BT+0−BT = 0. The increments Xt−Xs have the normal N(0, t− s)-distribution since Xt − Xs = BT+t − BT+s, and the distribution of BT+t − BT+s isN(0, T + t− (T + s)) ∼ N(0, t− s), by the definition of the Brownian motion. Finally,the increment Xt − Xs = BT+t − BT+s is independent of the history up to the timeT + s, i.e. Xt −Xs is independent of the random vector (BT+s0 , BT+s1 , . . . , BT+sn) forany collection s0, s1, . . . , sn in [0, s]. We can always take s0 = 0, and conclude that, inparticular, Xt − Xs is independent of BT . Therefore, Xt − Xs is independent of therandom vector (BT+s1 −BT , BT+s2 −BT , . . . , BT+sn −BT ) = (Xs1 , Xs2 , . . . , Xsn), andit follows that Xt is a Brownian motion (and a Gaussian process).

5. Since Bt is a Gaussian process, the random vector (Bs1 , Bs2 , . . . , Bsn) is multivari-ate normal for any choice of s1, s2, . . . , sn. In particular, if we take s1 = et1 , s2 =et2 , . . . , sn = etn , we have that (Bet1 , Bet2 , . . . , Betn ) is a multivariate normal. There-fore, Xt is a Gaussian process.

Xt is not a Brownian motion since Var[X1] = Var[Be] = e 6= 1.

Last Updated: Spring 2005 51 Continuous-Time Finance: Lecture Notes

Page 52: Continuous-Time Finance: Lecture Notes

1.5. BROWNIAN MOTION DEFINED CHAPTER 1. BROWNIAN MOTION

6. For u > 0, the process Xt = 1√uBut has continuous paths and X0 = 0. The increment

Xt−Xs = 1√u(But−Bus) is independent of the history of B before (and including) us,

which is exactly the history of X before (and including) s. Finally, the distribution ofthe increment Xt −X − s is normal with mean 0 and variance Var[ 1√

u(Btu − Bsu)] =

1u(tu− ts) = t− s. Therefore, X is a Brownian motion (and a Gaussian process).

Solution to Exercise 1.5.12 Let (t1, t2, . . . , tn) be an arbitrary n-tuple of indices. We haveto prove that the random vector (Xt1 , Xt2 , . . . , Xtn) is multivariate normal, whereXt = Bt+btis a Brownian motion with drift. To do that, we take constants a1, a2, . . . , an and compute

Y = a1Xt1 + a2Xt2 + · · ·+ anXtn = Z + bt(a1 + a2 + · · ·+ an),

where Z = a1B1 +a2B2 + . . . anBn is a normal random variable since the Brownian motion isa Gaussian process. The random variable Y is also normal since it is constructed by addinga constant to a normal random variable. Therefore, X is a Gaussian process. To get µX

and cX , we write (noting that adding a constant to a random variable does not affect itscovariance with another random variable):

µX(t) = E[Xt] = E[Bt + bt] = bt

cX(t, s) = Cov[Xt, Xs] = Cov[Xt − bt,Xs − bs] = Cov[Bt, Bs] = min(t, s)

Solution to Exercise 1.5.13

Just like in the previous solution, for an arbitrary n-tuple of indices (t1, t2, . . . , tn) in [0, 1],we are trying to prove that the random vector (Xt1 , Xt2 , . . . , Xtn) is multivariate normal. hereXt = Bt −B1t is a Brownian bridge. Again, we take constants a1, a2, . . . , an and compute

Y = a1Xt1 + a2Xt2 + · · ·+ anXtn = −(a1 + a2 + · · ·+ an)tB1 + a1Bt1 + a2Bt2 + · · ·+ anBtn ,

which is normal because (B1, Bt1 , Bt2 , . . . , Btn) is a multivariate normal random vector.Therefore, X is a Gaussian process. To get µX and cX , we use the fact that E[BtBs] = st

and E[BtB1] = t, E[BsB1] = s, for t, s ≤ 1, and

µX(t) = E[Xt] = E[Bt −B1t] = 0

cX(t, s) = Cov[Xt, Xs] = E[(Bt − tB1)(Bs − sB1)] = E[BtBs]− sE[BtB1]− tE[BsB1] + tsE[B1B1]

= min(t, s)− ts (= s(1− t), when s < t).

Solution to Exercise 1.5.14 Let Xt be a Gaussian process, and let cX be its covarianceprocess. We will assume that E[Xt] = 0 for each t, since adding a constant to each Xt willnot affect the covariance function.

Last Updated: Spring 2005 52 Continuous-Time Finance: Lecture Notes

Page 53: Continuous-Time Finance: Lecture Notes

CHAPTER 1. BROWNIAN MOTION 1.5. BROWNIAN MOTION DEFINED

Let Γ be defined as in the exercise with γ = cX , and some t1, t2, . . . , tn. To prove that Γis non-negative definite, we have to show that Γ is symmetric, and that for any (row) vectora we have aΓaT ≥ 0.

The first statement is easy since Γij = Cov[Xti , Xtj ] = Cov[Xtj , Xti ] = Γji. To show thenon-negative definiteness, we take the vector a = (a1, a2, . . . , an) and construct the randomvariable Y = a1Xt1 + a2Xt2 + . . . anXtn . Since variances are always non-negative, we have(with X = (Xt1 , Xt2 , . . . , Xtn))

0 ≤ Var[Y ] = Var[XaT ] = E[aXT XaT ] = aΓaT .

Solution to Exercise 1.5.15 Let us fix ∆t = 1/n and compute the integral

A(n) =∫ 1

0B∆t

s ds,

which can be interpreted as the area under the graph of the linearly-interpolated randomwalk B∆t. From geometry of the trajectories of B∆t we realize that the area A(n) is the sumof areas of n + 1 trapezia. The kth trapezium’s altitude is ∆t (on the x-axes) and its sidesare Sk−1 and Sk, the process (Sn) being the random walk used in the construction of B∆t.Therefore, the sum of the areas of the trapezia is

A(n) =12∆t((S0+S1)+(S1+S2)+· · ·+(Sn−1+Sn)

)=

12∆t((2n−1)X1+(2n−3)X2+· · ·+3Xn−1+Xn

),

where Xk is the increment Sk−Sk−1. Since Xk takes values ±∆x = ±√

∆t with probabilities1/2, and since Xk and Xl are independent for k 6= l, so E[Xk] = 0, Var[Xk] = (∆x)2 = ∆t,and Cov[Xk, Xl] = 0 for k 6= l. Therefore

E[A(n)] = 0, Var[A(n)] =(∆t)2

4

((2n−1)2(∆x)2+(2n−3)2(∆x)2+· · ·+(∆x)2

)=

(∆t∆x)2

4α(n),

where α(n) = 12 + 32 + · · ·+ (2n− 1)2. It can be shown that α(n) = 4/3n3− 1/3n, so, usingthe fact that ∆t∆x = n−3/2 we get

Var[A(n)] = n−3(1

3n3 − 1

12n)

=13− 1

12n−2.

Letting n→∞ and using the approximating property of B∆t we get

E[∫ 1

0Bt dt] = lim

nE[A(n)] = 0,Var[

∫ 1

0Bt dt] = lim

n→∞Var[A(n)] = lim

n→∞

(13− 1

12n−2

)=

13.

Solution to Exercise 1.5.16

Last Updated: Spring 2005 53 Continuous-Time Finance: Lecture Notes

Page 54: Continuous-Time Finance: Lecture Notes

1.5. BROWNIAN MOTION DEFINED CHAPTER 1. BROWNIAN MOTION

• The trajectories of (Yt)t∈[0,∞) are continuous because they are obtained from the trajec-tories of the Brownian Bridge by composition with the continuous function t 7→ t

1+t , andmultiplication by the continuous function t 7→ (1+ t). The trajectories of the BrownianBridge are continuous because they are obtained from the (continuous) trajectories ofthe Brownian motion by adding the continuous function −t 7→ tB1.

• It can easily be shown that the process Y is Gaussian by considering linear combinationsα1Yt1 +α2Yt2 +· · ·+αnYtn , and rewriting them in terms of the original Brownian motionB (just like we did in class and in HW 2).

• The expectation function µY coincides with the one for Brownian motion (µB(t) = 0)since

E[Yt] = E[(1 + t)Xt/(1+t)] = (1 + t)E[B t1+t

− t

1 + tB1] = (1 + t)E[B t

1+t]− tE[B1] = 0.

The same is true for the covariance function, because for s ≤ t we have

cY (s, t) = E[YsYt] = (1 + t)(1 + s)E[(B t1+t

− t

1 + tB1)(B s

1+s− s

1 + sB1)]

= (1 + t)(1 + s)(

s

1 + s− s

1 + s

t

1 + t− t

1 + t

s

1 + s+

t

1 + t

s

1 + s

)= s = min(s, t) = cB(s, t).

For the Brownian Bridge Xt we have limt→1−Xt = 0, so

limt→∞

Yt

t= lim

t→∞

(1 + t)t

X t1+t

=[s =

t

1 + t

]= lim

s→1−

1sXs = 0.

We know that Yt is a Brownian motion, so we are done.Solution to Exercise 1.5.17

To compute the distribution function FRt(x) = P[Rt ≤ x] for t > 0 and x > 0, we writeR2

t = (B1t )2 + (B2

t )2 so that

FRt(x) = P[R2t ≤ x2] = P[(B1

t )2 + (B2t )2 ≤ x2] = P[(B1

t , B2t ) ∈ D(x)],

where D(x) is the disk of radius x centered at the origin. The sought-for probability canbe found by integrating the joint density of the random vector (B1

t , B2t ) over the region

D(x). Since B1t and B2

t are independent, the joint density f(B1t ,B2

t )(x1, x2) is given byf(B1

t ,B2t )(x1, x2) = ϕt(x1)ϕt(x2), where ϕt is the density of the normal random variable B1

t

(and of B2t as well). We are dealing with Brownian motions, so B1

t ∼ N(0, t) and

f(B1t ,B2

t )(x1, x2) =1√2πt

e−x21

2t1√2πt

e−x22

2t =1

2πte−

x21+x2

22t .

Last Updated: Spring 2005 54 Continuous-Time Finance: Lecture Notes

Page 55: Continuous-Time Finance: Lecture Notes

CHAPTER 1. BROWNIAN MOTION 1.5. BROWNIAN MOTION DEFINED

You can, of course, get the same expression for the joint density by calculating the mean andthe variance-covariance matrix of the multivariate-normal random vector (B1

t , B2t ) and using

the formula given in the notes.We proceed by writing

P[Rt ≤ x] =∫∫

x21+x2

2≤x2

12πt

e−x21+x2

22t dx1 dx2,

and passing to polar coordinates (r, θ) (and remembering that dx1 dx2 = r dr dθ simplifiesthe integral to µ(t), t ∈ [0, 5]4in0.1in4inscale=1

FRt(x) = P[Rt ≤ x] =∫ 2π

0

∫ x

0

r

2πte−

r2

2t dr dθ =∫ x

0

r

te

r2

2t dr.

Now it is very easy to differentiate FRt(x) with respect to x to obtain

fRt(x) =x

te−

x2

2t , x > 0.

(The distribution with such density is called the Gamma-distribution).In order to compute µ(t) = E[Rt] and σ(t) =

√Var[Rt] we simply need to integrate (using

Maple, e.g.)

µ(t) =∫ ∞

0xfRt(x) dx =

∫ ∞

0

x2

te−

x2

2t dx =√π

2

√t.

Similarly,

E[R2t ] =

∫ ∞

0x2fRt(x) dx =

∫ ∞

0

x3

te−

x2

2t dx = 2t,

so σ(t) =√

Var[Rt] =√

E[R2t ]− E[Rt]2 =

√(2− π

2 )√t. The graphs of µ(t) and σ(t) are

given in Figures 2. and 3.Solution to Exercise 1.5.18

1. To compute the expectation E[St] observe that St = g(Bt) where g(x) = eαx+βt, sothat E[St] = s0E[g(Bt)] and therefore

E[St] = s0

∫ ∞

−∞g(x)

1√2πt

e−x2

2t dx = s01√2πt

∫ ∞

−∞e−

12t

x2+αx+βt dx,

since Bt has the normal distribution with mean 0, and variance t. The last integral canbe evaluated by completing the square in the exponent, or using a software package, sowe have

E[St] = s0et(β+ 1

2α2),

and E[St] = s0 if β = −α2/2.

Last Updated: Spring 2005 55 Continuous-Time Finance: Lecture Notes

Page 56: Continuous-Time Finance: Lecture Notes

1.5. BROWNIAN MOTION DEFINED CHAPTER 1. BROWNIAN MOTION

2. The rate b(t) of (expected) return over the interval [0, t] for the stock S can be definedby

b(t) =log(E[St])log(s0)

= t(β + α2/2),

and the instantaneous rate of return is then ddtb(t) = β + α2/2. The apparent paradox

comes from the fact that even for negative β, the rate of return can be positive. So ifwe take a Brownian motion with a negative drift (a process that decreases on average)and apply to it a deterministic function (exp) which does not depend on t, we get aprocess which increases on average. Weird . . .

Solution to Exercise 1.5.19

(a)We start by computing the value of the distribution function

F (x) = P[Y ≤ x] = P[X1

X2≤ x] = P[(X1, X2) ∈ A],

where A =

(x1, x2) R2 : x1x2≤ x

The joint density function of the random vector

(X1, X2) is given by fX1,X2(x1, x2) = 12π exp(−x2

1+x22

2 ). Therefore,

F (x) =∫∫

Af(X1,X2)(x1, x2) dx1dx2

Without any loss of generality we can assume x > 0 because of symmetry of Y , sothat the region A looks like given in Figure 1., where the slanted line has slope 1/x.The density function fX1,X2 , and the integration region A are rotationally symmetric,so passing to the polar coordinates makes sense. In polar coordinates the region A isgiven by

A = (r, φ) ∈ [0,∞)× [0, 2π) : φ ∈ [arctan(1/x), π] ∪ [π + arctan(1/x), 2π] ,

and the integral simplifies to F (x) = 12π

( ∫ πarctan(1/x) dφ +

∫ 2ππ+arctan(1/x) dφ

)= 1 −

1π arctan(1/x), and thus, by differentiation,

fY (x) =1

π(1 + x2),

and we conclude that Y has a Cauchy distribution.

(b) For an arbitrary, but fixed t > 0, the random variables X1 = 1q13t

(Bt − B 2

3t

)and

X2 = 1q13t

(B 2

3t −B 1

3t

)are independent unit normals (we just have to use the defining

properties of Brownian motion). Therefore, part (a) implies that the distribution of Yt

is Cauchy. Cauchy is obviously not normal, so Yt is not a Gaussian process.

Last Updated: Spring 2005 56 Continuous-Time Finance: Lecture Notes

Page 57: Continuous-Time Finance: Lecture Notes

CHAPTER 1. BROWNIAN MOTION 1.6. SIMULATION

1.6 Simulation

In this section give a brief introduction into the topic of simulation of random variables,random vectors and random processes, and the related Monte Carlo method. There is a vastliterature on the topic and numerous web pages dedicated to different facets of the problem.

Simulation of random quantities is one of the most versatile numerical methods in ap-plied probability and finance because of its (apparent) conceptual simplicity and ease ofimplementation.

1.6.1 Random Number Generators

We start off by introducing the fundamental ingredient of any simulation experiment - therandom number generator or RNG. This will usually be a computer function which, whencalled, produces a number in the range [0, 1]. Sometimes (depending of the implementation)the random number generator will produce an integer between 0 and some large numberRAND MAX, but a simple division (multiplication) by RAND MAX will take care of the transitionbetween the two. We shall therefore talk exclusively about RNGs with the range [0, 1], andthe generic RNG function will be dented by rand, after its MATLAB implementation.

So, far there is nothing that prevents rand from always returning the same number 0.4,or the sequence 0.5, 0.25, 0.125, . . . . Such a function will, however, hardly qualify for anRNG since the values it spits out come in a predictable order. We should, therefore, requireany candidate for a random number generator to produce a sequence of numbers which isas unpredictable as possible. This is, admittedly, a hard task for a computer having onlydeterministic functions in its arsenal, and that is why the random generator design is such adifficult field. The state of the affairs is that we speak of good or less good random numbergenerators, based on some statistical properties of the produced sequences of numbers.

One of the most important requirements is that our RNG produce uniformly distrib-uted numbers in [0, 1] - namely - the sequence of numbers produced by rand will haveto cover the interval [0, 1] evenly, and, in the long run, the number of random numbers ineach subinterval [a, b] if [0, 1] should be proportional to the length of the interval b− a. Thisrequirement if hardly enough, because the sequence

0, 0.1, 0.2, . . . , 0.8, 0.9, 1, 0.05, 0.15, 0.25, . . . , 0.85, 0.95, 0.025, 0.075, 0.125, 0.175, . . .

will do the trick while being perfectly predictable.To remedy the inadequacy of the RNGs satisfying only the requirement of uniform distri-

bution, we might require rand to have the property that the pairs of produced numbers coverthe square [0, 1] × [0, 1] uniformly. That means that, in the long run, every patch A of thesquare [0, 1]× [0, 1] will contain the proportion of pairs corresponding to its area. Of course,one could continue with such requirements and ask for triples, quadruples, . . . of random

Last Updated: Spring 2005 57 Continuous-Time Finance: Lecture Notes

Page 58: Continuous-Time Finance: Lecture Notes

1.6. SIMULATION CHAPTER 1. BROWNIAN MOTION

numbers to be uniform in the [0, 1]3, [0, 1]4 . . . . The highest dimension n such that the RNGproduces uniformly distributed in [0, 1]n is called the order of the RNG. In the appendix Iam including the C source code of Mersenne Twister, an excellent RNG with the order of623.

Another problem with RNGs is that the numbers produced will start to repeat after awhile (this is a fact of life and finiteness of your computer’s memory). The number of calls ittakes for a RNG to start repeating its output is called the period of a RNG. You might havewondered how is it that an RNG produces a different number each time it is called, since,after all, it is only a function written in some programming language. Most often, RNGs usea hidden variable called the random seed which stores the last output of rand and is usedas an (invisible) input to the function rand the next time it is called. If we use the same seedtwice, the RNG will produce the same number, and so the period of the RNG is limited bythe number of different possible seeds. Some operating systems (UNIX, for example) havea system variable called random seed which stores the last used seed and writes it to thehard-disk, so that the next time we use rand we do not get the same output. For Windows,there is no such system variable (up to my knowledge) so in order to avoid getting the samerandom numbers each time you call rand, you should provide your own seed. To do this youcan either have your code do the job of the system and write the seed to a specific file everytime you stop using the RNG, or use the system clock to provide you with a seed. The secondmethod is of dubitable quality and you shouldn’t use it for serious applications.

In order to assess the properties of a RNG, a lot of statistical tests have been developedin the last 50 years. We will list only three here (without going much into the details or thetheory) to show you the flavor of the procedures used in practice.

They all require a sequence of numbers x_1, x_2, . . . , x_N to be tested.

• uniform distribution Plot the histogram of x_1, x_2, . . . by dividing the interval [0, 1] into bins (a good number of bins is proportional to √N). All the bins should contain an approximately equal number of x_n's.

• 2, 3-dim uniform distribution Divide your sequence into pairs (x_1, x_2), (x_3, x_4), (x_5, x_6), etc., and plot the obtained points in the unit square [0, 1] × [0, 1]. There should be no patterns. You can do the same for 3 dimensions, but not really for 4 or more.

• Maximum of Brownian motion Use your numbers to produce a sequence (y_n) with a (pseudo-) normal distribution (we shall see later how to do that). Divide the obtained sequence into M subsequences of equal length, and use each to simulate a Brownian motion on the unit interval. For each of the M runs, record the maximum of the simulated trajectory and draw the histogram of the M maxima. There is


a formula for the density f_{max B} of the maximum of the trajectory of the Brownian motion, so it is easy to compare the histogram you've obtained with f_{max B}.

The simplest RNGs are the so-called linear congruential random number generators, and a large proportion of the generators implemented in practice belong to this class. They produce random integers in some range [0, M − 1], and the idea behind their implementation is simple. Pick a large number M, another large number m, and yet another number c. Start with the seed x_1 ∈ [0, M − 1]. You get x_n from x_{n−1} using the formula

x_n = m x_{n−1} + c (mod M),

i.e. multiply x_{n−1} by m, add c, and take the remainder of the division of the result by M. The art is, of course, in the choice of the numbers M, m, c and x_1, and it is easy to come up with examples of linear congruential generators with terrible properties (m = 1, M = 10, for example).
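To make this concrete, here is a minimal sketch of a linear congruential generator in Python; the language and the particular values of m, c and M below are my own illustrative choices, not ones prescribed by these notes.

    # A minimal linear congruential RNG: x_n = (m*x_{n-1} + c) mod M,
    # rescaled to [0, 1) by dividing by M.
    class LCG:
        def __init__(self, seed, m=1103515245, c=12345, M=2**31):
            self.state = seed % M   # the hidden "random seed" discussed above
            self.m, self.c, self.M = m, c, M

        def rand(self):
            self.state = (self.m * self.state + self.c) % self.M
            return self.state / self.M

    rng = LCG(seed=42)
    print([round(rng.rand(), 4) for _ in range(5)])

Feeding the last output back in as the next state is exactly the hidden-seed mechanism described above, and re-creating the object with the same seed reproduces the same sequence.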

Finally, we give an example of a (quite bad) random number generator that has been widely used in the sixties and seventies.

Example 1.6.1.

RANDU is a linear congruential random number generator with parameters m = 65539, c = 0 and M = 2^31. Here is a histogram of the distribution of RANDU. It looks OK.

Fig 1. Histogram of RANDU output

Fig 2. 2-d plot of RANDU output

Here is a 2-d plot of the pairs of RANDU’s pseudo-random numbers. Still no apparent pattern.


Finally, the 3-d plot of the triplets of RANDU's pseudorandom numbers reveals a big problem. See how all the points lie in a dozen-or-so planes in 3D. That wouldn't happen for truly random numbers. This is called the Marsaglia effect, after M. Marsaglia who published the paper Random Numbers Fall Mainly in the Planes in 1968.

Fig 3. 3-d plot of RANDU output

1.6.2 Simulation of Random Variables

Having found a random number generator good enough for our purposes, we might want to use it to simulate random variables with distributions different from the uniform on [0, 1]. This is almost always achieved through transformations of the output of an RNG, and we will present several methods for dealing with this problem.

1. Discrete Random Variables Let X have a discrete distribution given by

X ∼ ( x_1  x_2  . . .  x_n
      p_1  p_2  . . .  p_n ).

For discrete distributions taking an infinite number of values we can always truncate at a very large n and approximate it with a distribution similar to the one of X.

We know that the probabilities p_1, p_2, . . . , p_n add up to 1, so we define the numbers 0 = q_0 < q_1 < · · · < q_n = 1 by

q_0 = 0, q_1 = p_1, q_2 = p_1 + p_2, . . . , q_n = p_1 + p_2 + · · · + p_n = 1.

To simulate our discrete random variable X, we call rand and then return x_1 if 0 ≤ rand < q_1, return x_2 if q_1 ≤ rand < q_2, and so on. It is quite obvious that this procedure indeed simulates the random variable X.
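Here is a small Python sketch of this procedure (Python, and the particular distribution used in the test at the bottom, are my own choices for illustration).

    import random

    def simulate_discrete(values, probs):
        # q_k = p_1 + ... + p_k; return x_k when q_{k-1} <= rand < q_k
        u, q = random.random(), 0.0
        for x, p in zip(values, probs):
            q += p
            if u < q:
                return x
        return values[-1]   # guards against round-off when u is very close to 1

    # example: P[X = 1] = 0.2, P[X = 2] = 0.5, P[X = 3] = 0.3
    draws = [simulate_discrete([1, 2, 3], [0.2, 0.5, 0.3]) for _ in range(10000)]
    print({x: draws.count(x) / len(draws) for x in (1, 2, 3)})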


2. The Method of Inverse Functions The basic observation in this method is that, for any continuous random variable X with the distribution function F_X, the random variable Y = F_X(X) is uniformly distributed on [0, 1]. By inverting the distribution function F_X and applying it to Y, we recover X. Therefore, if we wish to simulate a random variable with an invertible distribution function F, we first simulate a uniform random variable on [0, 1] (using rand) and then apply the function F^{-1} to the result. Of course, this method fails if we cannot write F^{-1} in closed form.

Example 1.6.2. (Exponential Distribution) Let us apply the method of inverse functions to the simulation of an exponentially distributed random variable X with parameter λ. Remember that

f_X(x) = λ exp(−λx), x > 0, and so F_X(x) = 1 − exp(−λx), x > 0,

and so F_X^{-1}(y) = −(1/λ) log(1 − y). Since 1 − rand has the same U[0, 1]-distribution as rand, we conclude that

−log(rand)/λ

has the required Exp(λ)-distribution.

Example 1.6.3. (Cauchy Distribution) The Cauchy distribution is defined through its density function

f_X(x) = (1/π) · 1/(1 + x^2).

The distribution function F_X can be determined explicitly in this example:

F_X(x) = (1/π) ∫_{−∞}^{x} 1/(1 + y^2) dy = (1/π)(π/2 + arctan(x)), and so F_X^{-1}(y) = tan(π(y − 1/2)),

yielding that tan(π(rand − 0.5)) will simulate a Cauchy random variable for you.
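A short Python sketch of the inverse-function method for these two examples follows; the language and the value λ = 1 in the test are my own illustrative choices.

    import math, random

    def exponential(lam):
        # F^{-1}(y) = -log(1 - y)/lam; 1 - rand lies in (0, 1], so the log is safe
        return -math.log(1.0 - random.random()) / lam

    def cauchy():
        # F^{-1}(y) = tan(pi*(y - 1/2))
        return math.tan(math.pi * (random.random() - 0.5))

    sample = [exponential(1.0) for _ in range(100000)]
    print(sum(sample) / len(sample))   # should be close to 1/lambda = 1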

3. The Box-Muller method This method is useful for simulating normal random variables, since for them the method of inverse functions fails (there is no closed-form expression for the distribution function of a standard normal). It is based on a clever trick:

Proposition 1.6.4. Let Y1 and Y2 be independent U[0, 1]-distributed random variables. Then the random variables

X1 = √(−2 log(1 − Y1)) cos(2πY2), X2 = √(−2 log(1 − Y1)) sin(2πY2)

are independent and standard normal (N(0,1)).


The proof of this proposition is quite technical, but not hard, so I will omit it.

Therefore, in order to simulate a normal random variable with mean µ = 0 and variance σ^2 = 1, we call the function rand twice to produce two random numbers rand1 and rand2. The numbers

X1 = √(−2 log(rand1)) cos(2π rand2), X2 = √(−2 log(rand1)) sin(2π rand2)

will be two independent normals. Note that it is necessary to call the function rand twice, but we also get two normal random numbers out of it. It is not hard to write a procedure which produces 2 normal random numbers in this way on every second call, returns one of them and stores the other for the next call.
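Here is a minimal Python sketch of the Box-Muller method with exactly that caching trick (the code is my own illustration; using 1 − rand inside the logarithm is a small safeguard against taking the log of zero).

    import math, random

    _cache = []

    def normal_bm():
        # Box-Muller: two rands give two independent N(0,1) numbers; one is cached
        if _cache:
            return _cache.pop()
        r1, r2 = random.random(), random.random()
        radius = math.sqrt(-2.0 * math.log(1.0 - r1))
        _cache.append(radius * math.sin(2.0 * math.pi * r2))
        return radius * math.cos(2.0 * math.pi * r2)

    draws = [normal_bm() for _ in range(100000)]
    print(sum(draws) / len(draws))                  # close to 0
    print(sum(d * d for d in draws) / len(draws))   # close to 1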

4. Method of the Central Limit Theorem The following algorithm is often used to simulate a normal random variable:

(a) Simulate 12 independent uniform random variables (rands) - X1, X2, . . . , X12.

(b) Set Y = X1 +X2 + · · ·+X12 − 6.

The distribution of Y is very close to the distribution of a unit normal, although not exactly equal (e.g. P[Y > 6] = 0 and P[Z > 6] ≠ 0, for a true normal Z). The reason why Y approximates the normal distribution well comes from the following theorem:

Theorem 1.6.5. Let X_1, X_2, . . . be a sequence of independent random variables, all having the same distribution. Let µ = E[X_1] (= E[X_2] = . . . ) and σ^2 = Var[X_1] (= Var[X_2] = . . . ). The sequence of normalized random variables

((X_1 + X_2 + · · · + X_n) − nµ) / (σ√n)

converges to a standard normal random variable (in a mathematically precise sense).

The choice of exactly 12 rands (as opposed to 11 or 35) comes from practice: it seems to achieve satisfactory performance with relatively low computational cost. Also, the standard deviation of a U[0, 1] random variable is 1/√12, so the denominator σ√n conveniently becomes 1 for n = 12.
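A sketch of this method in Python (my own illustration; the parameters µ = 2, σ = 3 in the test echo one of the exercises below):

    import random

    def normal_clt():
        # the sum of 12 independent U[0,1] variables has mean 6 and variance 1
        return sum(random.random() for _ in range(12)) - 6.0

    def normal_clt_mu_sigma(mu, sigma):
        # shift and scale to get an approximate N(mu, sigma^2) draw
        return mu + sigma * normal_clt()

    draws = [normal_clt_mu_sigma(2.0, 3.0) for _ in range(100000)]
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print(mean, var)   # roughly 2 and 9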

The figures below show the densities of random variables obtained as (normalized) sums of n independent uniform random variables for n = 1, 2, 3, 5, 6, 9, 12. The plots of their densities are in red, while the density of the standard unit normal is added in 'dot-dash-dot' blue.


[Figure: densities of (normalized) sums of 1, 2, 3, 5, 9 and 12 rands, each plotted against the standard normal density.]

Fig 4. Approximating normal with sums of rands.

When choosing between this method and the previously described Box-Muller method, many factors have to be taken into consideration. The Box-Muller method uses only 1 rand per call (on average) as opposed to 12 used by the method based on the central limit theorem. On the other hand, the latter method uses only addition and subtraction, whereas the Box-Muller method utilizes the more expensive operations cos, sin, log, and √·.

Fig 5. Sum of 12 rands

The conclusion of the comparison of the two methods will inevitably rely heavily on the architecture you are running the code on, and the quality of the implementation of the functions cos, sin, . . . .

5. Other methods There are a number of other methods for transforming the output of rand into random numbers with a prescribed density (the rejection method, the Poisson trick, . . . ). You can read about them in the free online copy of Numerical Recipes in C at http://www.library.cornell.edu/nr/bookcpdf.html

1.6.3 Simulation of Normal Random Vectors

It is our next task to show how to simulate an n-dimensional normal random vector with mean µ and variance-covariance matrix Σ. If the matrix Σ is the identity matrix, and µ = (0, 0, . . . , 0), then our problem is simple: we simulate n independent unit normals and organize them in a vector γ (I am assuming that we know how to simulate normal random variables - we can use the Box-Muller method, for example).

When Σ is still the identity, but µ is a general vector, we are still in luck - just take µ + γ. It is the case of a general symmetric positive-semidefinite matrix Σ that is interesting. The idea is to apply a linear transformation A to the vector of independent unit normals γ: Y = γA.


Therefore, we are looking for a matrix A such that

Σ = E[Y^T Y] = E[A^T γ^T γ A] = A^T E[γ^T γ] A = A^T I A = A^T A.

In numerical linear algebra the decomposition Σ = A^T A is called the Cholesky decomposition and, luckily, there are fast and reliable numerical methods for obtaining A from Σ. For example, the command chol in MATLAB will produce A from Σ.

To recapitulate, to simulate an n-dimensional normal random vector with mean µ and variance-covariance matrix Σ, do the following:

1. simulate n independent unit normal random variables and put them in a vector γ.

2. compute the Cholesky decomposition A^T A = Σ of the variance-covariance matrix Σ.

3. the required vector is Y = γA + µ.
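Here is a brief NumPy sketch of this recipe (NumPy, and the particular µ and Σ below, are my own illustrative choices; note that numpy.linalg.cholesky returns a lower-triangular factor L with L L^T = Σ, so L plays the role of A^T above).

    import numpy as np

    def simulate_mvn(mu, sigma, n_draws, rng=np.random.default_rng(0)):
        L = np.linalg.cholesky(sigma)                     # L @ L.T == sigma
        gamma = rng.standard_normal((n_draws, len(mu)))   # independent unit normals
        return mu + gamma @ L.T                           # each row ~ N(mu, sigma)

    mu = np.array([120.0, 130.0])
    sigma = np.array([[100.0, 40.0],
                      [40.0,  25.0]])                     # sd's 10 and 5, rho = 0.8
    draws = simulate_mvn(mu, sigma, 100000)
    print(draws.mean(axis=0))                             # close to mu
    print(np.cov(draws, rowvar=False))                    # close to sigma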

1.6.4 Simulation of the Brownian Motion and Gaussian Processes

We have already seen that the Brownian motion is a limit of random walks, and that will be the key insight for building algorithms for simulating the paths of Brownian motion. As is inevitably the case with stochastic processes, we will only be able to simulate their paths on a finite interval [0, T], sampled at a finite number of points 0 = t_0 ≤ t_1 < t_2 < · · · < t_n = T. When dealing with continuous processes, we will usually interpolate linearly between these points if necessary.

The independence of increments of Brownian motion will come in very handy because we will be able to construct the whole trajectory from a number of independent steps. Thus, the procedure is the following (I will describe only the case where the points t_0, t_1, t_2, . . . , t_n are equidistant, i.e. t_k − t_{k−1} = ∆t = T/n.)

1. Choose the horizon T and the number n of time-steps. Take ∆t = T/n.

2. Simulate n independent normal random variables with mean 0 and variance ∆t, and organize them in a vector ∆B.

3. the value of the simulated Brownian motion at t_i will be the sum of the first i terms of ∆B.

The increments do not need to be normal. By the Central Limit Theorem, you can use Bernoulli random variables with space-step ∆x = √∆t.
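A minimal NumPy sketch of this procedure (my own illustration; the values of T and n are arbitrary):

    import numpy as np

    def simulate_bm_path(T=1.0, n=1000, rng=np.random.default_rng(0)):
        dt = T / n
        dB = rng.standard_normal(n) * np.sqrt(dt)      # N(0, dt) increments
        B = np.concatenate(([0.0], np.cumsum(dB)))     # partial sums, B_0 = 0
        t = np.linspace(0.0, T, n + 1)
        return t, B

    t, B = simulate_bm_path()
    print(B[-1])   # a single draw from N(0, T)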

The case of a general Gaussian process is computationally more intensive, but conceptually very simple. The idea is to discretize time, i.e. take the points 0 = t_0 < t_1 < t_2 < · · · < t_n = T and view the process (X_t)_{t∈[0,T]} you are simulating as being closely approximated by a huge multivariate normal vector X = (X_{t_0}, X_{t_1}, . . . , X_{t_n}). So, simulate the normal vector X


- its variance-covariance matrix and mean-vector can be read off the functions c_X and µ_X specified for the Gaussian process X.

Of course, the simulation of a normal vector involves the Cholesky decomposition of the variance-covariance matrix - a relatively expensive operation from the computational point of view. It is, therefore, much more efficient to use the special structure of the Gaussian process (when possible) to speed up the procedure - e.g. in the case of the Brownian motion, Brownian motion with drift, or the Brownian Bridge.
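For concreteness, here is a NumPy sketch of the generic covariance-matrix approach applied to the Brownian bridge on [0, 1], whose mean function is 0 and whose covariance function is c_X(s, t) = min(s, t) − st. The code itself is my own illustration; it simulates only the interior grid points, since including the pinned endpoints would make the covariance matrix singular.

    import numpy as np

    def simulate_brownian_bridge(n=500, rng=np.random.default_rng(0)):
        t = np.linspace(0.0, 1.0, n + 1)[1:-1]           # interior points only
        cov = np.minimum.outer(t, t) - np.outer(t, t)    # c_X(s, u) = min(s, u) - s*u
        L = np.linalg.cholesky(cov)
        X = L @ rng.standard_normal(len(t))              # one draw of the whole vector
        return np.concatenate(([0.0], t, [1.0])), np.concatenate(([0.0], X, [0.0]))

    t, X = simulate_brownian_bridge()
    print(X[:5])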

1.6.5 Monte Carlo Integration

Having described some of the procedures and methods used for the simulation of various random objects (variables, vectors, processes), we turn to an application in numerical mathematics. We start off with the following version of the Law of Large Numbers, which constitutes the theory behind most of the Monte Carlo applications:

Theorem 1.6.6. (Law of Large Numbers) Let X_1, X_2, . . . be a sequence of independent, identically distributed random variables (works for vectors, too), and let g : R → R be a function such that µ = E[g(X_1)] (= E[g(X_2)] = . . . ) exists. Then

(g(X_1) + g(X_2) + · · · + g(X_n)) / n → µ = ∫_{−∞}^{∞} g(x) f_{X_1}(x) dx, as n → ∞.

The key idea of Monte Carlo integration is the following:

Suppose that the quantity y we are interested in can be written as y = ∫_{−∞}^{∞} g(x) f_X(x) dx for some random variable X with density f_X and some function g, and that x_1, x_2, . . . are random numbers distributed according to the distribution with density f_X. Then the average

(1/n)(g(x_1) + g(x_2) + · · · + g(x_n))

will approximate y.

It can be shown that the accuracy of the approximation behaves like 1/√n, so that you have to quadruple the number of simulations if you want to double the precision of your approximation.
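As a quick illustration of the key idea, here is a tiny Python sketch (mine, not from the notes; the integrand exp(x) is an arbitrary choice):

    import math, random

    def mc_expectation(g, sampler, n=100000):
        # Monte Carlo estimate of E[g(X)] = integral of g(x) f_X(x) dx
        return sum(g(sampler()) for _ in range(n)) / n

    # estimate the integral of exp(x) over [0, 1]; the exact value is e - 1
    print(mc_expectation(math.exp, random.random))
    print(math.e - 1)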

Example 1.6.7.

1. (numerical integration) Let g be a function on [0, 1]. To approximate the integral ∫_0^1 g(x) dx we can take a sequence of n U[0,1] random numbers x_1, x_2, . . . , x_n and use

∫_0^1 g(x) dx ≈ (g(x_1) + g(x_2) + · · · + g(x_n)) / n,


because the density of X ∼ U[0, 1] is given by

f_X(x) = 1 for 0 ≤ x ≤ 1, and 0 otherwise.

2. (estimating probabilities) Let Y be a random variable with the density function f_Y. If we are interested in the probability P[Y ∈ [a, b]] for some a < b, we simulate n draws y_1, y_2, . . . , y_n from the distribution F_Y and the required approximation is

P[Y ∈ [a, b]] ≈ (number of y_i's falling in the interval [a, b]) / n.

One of the nicest things about the Monte-Carlo method is that, even if the density of the random variable is not available but you can simulate draws from it, you can still perform the calculation above and get the desired approximation. Of course, everything works in the same way for probabilities involving random vectors in any number of dimensions.

3. (approximating π) We can devise a simple procedure for approximating π ≈ 3.141592 by using the Monte-Carlo method. All we have to do is remember that π is the area of the unit disk. Therefore, π/4 equals the portion of the area of the unit disk lying in the positive quadrant (see Figure), and we can write

π/4 = ∫_0^1 ∫_0^1 g(x, y) dx dy,

where

g(x, y) = 1 if x^2 + y^2 ≤ 1, and 0 otherwise.

So, simulate n pairs (x_i, y_i), i = 1, . . . , n, of uniformly distributed random numbers and count how many of them fall in the quarter of the unit disk lying in the positive quadrant, i.e. how many satisfy x_i^2 + y_i^2 ≤ 1, and divide by n. Multiply your result by 4, and you should be close to π. How close? Well, that is another story . . . Experiment! (A short sketch of this procedure appears right after this example.)

Fig 6. Count the proportion of (x_i, y_i)'s in the quarter disk to get π/4

4. (pricing options) We can price many kinds of European- and Asian-type options using Monte Carlo, but all in good time . . .
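Following up on the π example above, here is a minimal Python sketch of that procedure (the code and the choices of n are mine):

    import random

    def estimate_pi(n):
        inside = sum(1 for _ in range(n)
                     if random.random() ** 2 + random.random() ** 2 <= 1.0)
        # the proportion of points in the quarter disk approximates pi/4
        return 4.0 * inside / n

    for n in (10, 100, 10000, 1000000):
        print(n, estimate_pi(n))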


Exercises

Exercise 1.6.8. (Testing RANDU) Implement the linear congruential random number generator RANDU (from the notes) and test its performance by reproducing the 1-d, 2-d and 3-d uniformity-test graphs (just like the ones in the notes). (I suggest using Maple for the 3-d plot since it allows you to rotate 3d graphs interactively.)

Exercise 1.6.9. (Invent your own Random Number Generator) Use your imagination and produce five different functions that deserve the name random number generator. One of them can be a linear congruential RNG with some parameters of your choice, but the others must be original. Test them and discuss their performance briefly.

Exercise 1.6.10. (Invent your own Random Number Generator Tests) Use more of your imagination to devise 3 new RNG tests. The idea is to simulate some quantity many times and then draw the histogram of your results. If you can compute explicitly the density of the distribution of that quantity, then you can compare your results and draw conclusions. Test your tests on the RNGs you invented in the previous problem. (Hint: one idea would be to generate m batches of n random numbers, and then transform each random number into 0 or 1 depending on whether it is ≥ 0.5 or < 0.5. In this way you obtain m sequences of n zeroes and ones. In each of the m sequences count the longest streak of zeroes. You will get m natural numbers. Draw their histogram and compare it to the theoretical distribution. How do we get the theoretical distribution? Well, you can either look it up in one of the many probability books in the library, search the web, or use a random number generator you know is good to get the same histogram.

For the purposes of this homework assignment, you can assume that the default RNG of your software package is good.)

Exercise 1.6.11. (Trying out Various Methods) Using the default RNG of your package and the transformation methods we mentioned in class, simulate n draws from each of the following univariate distributions and draw the histograms of your results:

1. Exp(λ) with λ = 1.

2. Cauchy

3. Binomial with n = 50 and p = 0.3

4. Poisson with parameter λ = 1 (truncate at a large value of your choice)

5. Normal with µ = 2, σ = 3 using the Central Limit Theorem Method


6. Normal with µ = 2, σ = 3 using the Box-Muller Method

Exercise 1.6.12. (Brownian Bridge) Simulate and plot 5 trajectories of the Brownian Bridge by

1. simulating trajectories of a Brownian motion first, and then transforming them into trajectories of the Brownian Bridge;

2. using the fact that the Brownian Bridge is a Gaussian process and simulating it as a (large) multivariate normal random vector.

Exercise 1.6.13. (Computing the value of π) Use Monte-Carlo integration to approximate the value of π (just like we described it in class). Vary the number of simulations (n = 10, n = 100, n = 500, . . . ) and draw the graph of n vs. the accuracy of your approximation (the absolute value of the difference between the value you obtained and the true value of π).

Exercise 1.6.14. (Probability of 2 stocks going up) The joint distribution of two stocks A and B is bivariate normal with parameters µA = 100, µB = 120, σA = 5, σB = 15 and ρ = 0.65. Use the Monte-Carlo method to calculate the probability of the event in which both stocks outperform their expectations by at least 10, i.e. P[A ≥ 110 and B ≥ 120]. (The trick in the notes we used to solve a similar problem will not work here. You really need a numerical method (such as Monte Carlo) to solve this problem.)


Solutions to Exercises in Section 1.6


1.7 Conditioning

Derek the Daisy’s portfolio consists of two securities - ashare of DMF (Daisy’s Mutual Fund) and a share of GI(Grasshopper’s Jumping Industries) whose prices on May,1st are believed to follow a bivariate normal distributionwith coefficients µX1 = 120, µX2 = 130, σX1 = 10, σX2 =5 and ρ = 0.8. Derek’s friend Dennis the Grasshopperserves as the member of the board of GI and has recentlyheard that GI will enter a merger with CHC (Cricket’sHopping Corporation) which will drive the price of a shareof GJI to the level of 150 on May 1st. This is, of course,insider information, but it is not illegal in the Meadow-world so we might as well exploit it. The question Derekis facing is : what will happen to his other security - one

Fig 7. Dennis the Grasshopper

share of DMF? How can he update his beliefs about its distribution, in the light of this newinformation. For example, what is the probability that DMF will be worth less than 140?

1.7.1 Conditional Densities

Let the random vector (X1, X2) stand for the prices of DMF and GJI on May 1st. Elementary probability would try to solve Derek's problem by computing the conditional probability

P[X1 ≤ 140 | X2 = 150] = P[{X1 ≤ 140} ∩ {X2 = 150}] / P[X2 = 150] = 0/0,

because P[X2 = 150] = 0 - the naive approach fails miserably. The rescue comes by replacing the 0-probability event {X2 = 150} by an enlargement {X2 ∈ (150 − ε, 150 + ε)} and letting ε → 0. Here are the details . . .

Let f_{(X1,X2)}(x1, x2) be the joint density of (X1, X2) and let f_{X2}(x2) be the marginal density of X2, i.e.

f_{X2}(x2) = ∫_{−∞}^{∞} f_{(X1,X2)}(y1, x2) dy1.

Then P[X2 ∈ (150 − ε, 150 + ε)] = ∫_{150−ε}^{150+ε} f_{X2}(y2) dy2 and

P[{X1 ≤ 140} ∩ {X2 ∈ (150 − ε, 150 + ε)}] = ∫_{150−ε}^{150+ε} ∫_{−∞}^{140} f_{(X1,X2)}(y1, y2) dy1 dy2.

So if we define

P[X1 ≤ 140 | X2 = 150] := lim_{ε→0} P[{X1 ≤ 140} ∩ {X2 ∈ (150 − ε, 150 + ε)}] / P[X2 ∈ (150 − ε, 150 + ε)],


we have

P[X1 ≤ 140 | X2 = 150] = lim_{ε→0} ( ∫_{150−ε}^{150+ε} ∫_{−∞}^{140} f_{(X1,X2)}(y1, y2) dy1 dy2 ) / ( ∫_{150−ε}^{150+ε} f_{X2}(y2) dy2 ).

To proceed with the argument, let us prove a simple lemma:

Lemma 1.7.1. For a sufficiently regular (say continuous) function h : R → R we have

lim_{ε→0} (1/(2ε)) ∫_{x−ε}^{x+ε} h(y) dy = h(x).

Proof. Let’s pick a constant a ∈ R and define the indeterminate integral H(x) =∫ xa h(y) dy.

Now

limε→0

12ε

∫ x+ε

x−εh(y) dy = lim

ε→0

H(x+ ε)−H(x− ε)2ε

=12

limε→0

(H(x+ ε)−H(x)

ε+H(x)−H(x− ε)

ε

)=

12(H ′(x) +H ′(x))

= H ′(x) = h(x),

by the Fundamental Theorem of Calculus.

Coming back to our discussion of the conditional probability, we can use the lemma we have just proved, and write

P[X1 ≤ 140 | X2 = 150] = lim_{ε→0} [ (1/(2ε)) ∫_{150−ε}^{150+ε} ∫_{−∞}^{140} f_{(X1,X2)}(y1, y2) dy1 dy2 ] / [ (1/(2ε)) ∫_{150−ε}^{150+ε} f_{X2}(y2) dy2 ]
= ( ∫_{−∞}^{140} f_{(X1,X2)}(y1, 150) dy1 ) / f_{X2}(150)
= ∫_{−∞}^{140} ( f_{(X1,X2)}(y1, 150) / f_{X2}(150) ) dy1.

In the case of the bivariate normal, we know the forms of the densities involved:

f_{(X1,X2)}(x1, x2) = exp( −(1/(2(1−ρ^2))) [ (x1−µ_{X1})^2/σ_{X1}^2 + (x2−µ_{X2})^2/σ_{X2}^2 − 2ρ(x1−µ_{X1})(x2−µ_{X2})/(σ_{X1}σ_{X2}) ] ) / ( 2π σ_{X1} σ_{X2} √(1−ρ^2) )

and, since X2 is normal with mean µ_{X2} and variance σ_{X2}^2,

f_{X2}(x2) = (1/√(2π σ_{X2}^2)) exp( −(x2 − µ_{X2})^2 / (2σ_{X2}^2) ).

Thus, when we do the algebra and complete the square we get

f_{(X1,X2)}(x1, x2) / f_{X2}(x2) = exp( −( x1 − (µ_{X1} + ρ(σ_{X1}/σ_{X2})(x2 − µ_{X2})) )^2 / (2σ_{X1}^2(1−ρ^2)) ) / √(2π σ_{X1}^2 (1−ρ^2)).   (1.7.1)


Plugging x2 = 150 into (1.7.1) and integrating from −∞ to 140 we get

P[X1 ≤ 140|X2 = 150] = 0.022, while P[X1 ≤ 140] = 0.976,

and we can see how the information that X2 = 150 has decreased this probability from 0.98 to 0.02.
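To double-check these numbers, one can evaluate both probabilities numerically; the following small sketch uses SciPy's normal distribution (SciPy and the code are my additions, not part of the notes) together with the conditional mean and standard deviation that appear just below.

    from scipy.stats import norm

    mu1, mu2, s1, s2, rho = 120.0, 130.0, 10.0, 5.0, 0.8

    # unconditional: X1 ~ N(120, 10^2)
    print(norm.cdf(140.0, loc=mu1, scale=s1))              # close to the 0.976 above

    # conditional on X2 = 150: N(mu1 + rho*(s1/s2)*(150 - mu2), s1^2*(1 - rho^2))
    cond_mean = mu1 + rho * (s1 / s2) * (150.0 - mu2)      # = 152
    cond_sd = s1 * (1.0 - rho ** 2) ** 0.5                 # = 6
    print(norm.cdf(140.0, loc=cond_mean, scale=cond_sd))   # close to the 0.022 above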

Fig 8. A cross-section of a bivariate normal density

We could have repeated the above discussion with any constant ξ other than 150 and computed the conditional probability of any set A = [a, b] other than (−∞, 140]. We would have obtained that

P[X1 ∈ A | X2 = ξ] = ∫_a^b f_{X1|X2}(x1 | X2 = ξ) dx1,

where

f_{X1|X2}(x1 | X2 = ξ) = f_{(X1,X2)}(x1, ξ) / f_{X2}(ξ),

and observed that conditional probabilities can be computed similarly to ordinary probabilities - we only need to use a different density function. This new density function f_{X1|X2}(· | X2 = ξ) is called the conditional density of X1 given X2 = ξ. Observe that for different values of ξ we get different conditional densities, as we should - the updated probabilities depend on the information received.

There is another interesting phenomenon involving multivariate normal distributions - as we have witnessed in (1.7.1), the conditional density of X1 given X2 = ξ is univariate normal, and the new parameters are

µ_{X1}^{new} = µ_{X1} + ρ (σ_{X1}/σ_{X2})(ξ − µ_{X2}),   σ_{X1}^{new} = σ_{X1} √(1 − ρ^2).

This is not an isolated incident - we shall see later that conditional distributions arising from conditioning components of a multivariate normal given (some other) components of that multivariate normal are . . . multivariate normal.

1.7.2 Other Conditional Quantities

Now that we know how to answer Derek's question, we can use the new concept of the conditional density to introduce several new (and important) quantities. In what follows we let (X1, X2) be a random vector with the joint density function f_{X1,X2}(x1, x2).

Definition 1.7.2.


1. the conditional probability of X1 ∈ A given X2 = ξ is defined by

   P[X1 ∈ A | X2 = ξ] = ∫_A f_{X1|X2}(y1 | X2 = ξ) dy1.

2. the conditional expectation of X1 given X2 = ξ is defined by

   E[X1 | X2 = ξ] = ∫_{−∞}^{∞} y1 f_{X1|X2}(y1 | X2 = ξ) dy1.

3. for a function g : R → R, the conditional expectation of g(X1) given X2 = ξ is defined by

   E[g(X1) | X2 = ξ] = ∫_{−∞}^{∞} g(y1) f_{X1|X2}(y1 | X2 = ξ) dy1.

4. the conditional variance of X1 given X2 = ξ is defined by

   Var[X1 | X2 = ξ] = ∫_{−∞}^{∞} (y1 − E[X1 | X2 = ξ])^2 f_{X1|X2}(y1 | X2 = ξ) dy1.

Example 1.7.3. Let (X1, X2) have the bivariate normal distribution, just like in the beginning of this section, with parameters µ_{X1}, µ_{X2}, σ_{X1}, σ_{X2}, and ρ. Then, using the expression for the conditional density from formula (1.7.1), we have

E[X1 | X2 = ξ] = µ_{X1} + ρ (σ_{X1}/σ_{X2})(ξ − µ_{X2}),   Var[X1 | X2 = ξ] = (1 − ρ^2) σ_{X1}^2.

Note how, when you receive the information that X2 = ξ, the change in your mean depends on ξ, but the decrease in the variance is ξ-independent.

Of course, conditioning is not reserved only for 2-dimensional random vectors. Let X = (X1, X2, . . . , Xn) be a random vector with density f_X(x1, x2, . . . , xn). Let us split X into 2 sub-vectors X^1 = (X1, X2, . . . , Xk) and X^2 = (X_{k+1}, X_{k+2}, . . . , Xn), so that X = (X^1, X^2). For ξ = (ξ_{k+1}, ξ_{k+2}, . . . , ξn), we can mimic the procedure from the beginning of the section, and define the conditional density f_{X^1|X^2}(x^1 | X^2 = ξ) of the random vector X^1 given X^2 = ξ by

f_{X^1|X^2}(x^1 | X^2 = ξ) = f_X(x1, . . . , xk, ξ_{k+1}, . . . , ξn) / f_{X^2}(ξ_{k+1}, . . . , ξn)
= f_X(x1, . . . , xk, ξ_{k+1}, . . . , ξn) / ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f_X(y1, . . . , yk, ξ_{k+1}, . . . , ξn) dy1 · · · dyk.


1.7.3 σ-algebra = Amount of Information

Remember how I promised to show you another use for a σ-algebra (apart from being a technical nuisance and a topic of a hard homework question)? Well, this is where I'll do it. The last subsection, devoted to conditional densities, has shown us how information can substantially change our view of the likelihoods of certain events. We have learned how to calculate these new probabilities in the case when the information supplied reveals the exact value of a random variable, e.g. X2 = ξ. In this subsection, we would like to extend this notion, and pave the way for the introduction of the concept of conditional expectation with respect to a σ-algebra.

First of all, we need to establish the relation between σ-algebras and information. You can picture information as the ability to answer questions (more information gives you a better score on the test . . . ), and the lack of information as ignorance or uncertainty. Clearly, this simplistic metaphor does no justice to the complexity of the concept of information, but it will serve our purposes. In our setting, all the questions can be phrased in terms of the elements of the state-space Ω. Remember - Ω contains all the possible evolutions of our world (and some impossible ones) and the knowledge of the exact ω ∈ Ω (the "true state of the world") amounts to the knowledge of everything. So, a typical question would be "What is the true ω?", and the ability to answer that would promote you immediately to the level of a Supreme Being. So, in order for our theory to be of any use to mortals, we have to allow for some ignorance, and consider questions like "Is the true ω an element of the event A?", where A could be the event in which the price of the DMF Mutual Fund is in the interval (120, 150) on May 1st. And now comes the core of our discussion . . . the collection of all events A such that you know how to answer the question "Is the true ω an element of the event A?" is the mathematical description of your current state of information. And, guess what . . . this collection (let us call it G) is a σ-algebra. Why?

• First, I always know that the true ω is an element of Ω, so Ω ∈ G.

• If I know how to answer the question "Is the true ω in A?" (A ∈ G), I will also know how to answer the question "Is the true ω in A^c?". The second answer is just the opposite of the first, so A^c ∈ G.

• Let (A_n)_{n∈N} be a sequence of events such that I know how to answer the questions "Is the true ω in A_n?", i.e. A_n ∈ G for all n ∈ N. Then I know that the answer to the question "Is the true ω in ∪_{n∈N} A_n?" is "No" if I answered "No" to each question "Is the true ω in A_n?", and it is "Yes" if I answered "Yes" to at least one of them. Consequently ∪_{n∈N} A_n ∈ G.


Fig 9. Information in a Tree

When your state space consists of a finite number of elements, σ-algebras are equivalent to partitions of the state space (how exactly?). The finer the partition, the more you know about the true ω. The figure above depicts the evolution of a fictitious stock on a two-day time horizon. The price of the stock is 100 on day 0, moves to one of the three possible values on day 1, and branches further on day 2.

On day 0, the information available to us is minimal; the only questions we can answer are the trivial ones "Is the true ω in Ω?" and "Is the true ω in ∅?", and this is encoded in the σ-algebra {Ω, ∅}.

On day 1, we already know a little bit more, having observed the value of the stock - let's denote this value by S1. We can distinguish between ω1 and ω5, for example. We still do not know what will happen the day after, so that we cannot tell between ω1, ω2 and ω3, or ω4 and ω5. Therefore, our information partition is {ω1, ω2, ω3}, {ω4, ω5}, {ω6} and the corresponding σ-algebra F1 is (I am doing this only once!)

F1 = { ∅, {ω1, ω2, ω3}, {ω4, ω5}, {ω6}, {ω1, ω2, ω3, ω4, ω5}, {ω1, ω2, ω3, ω6}, {ω4, ω5, ω6}, Ω }.

The σ-algebra F1 has something to say about the price on day two, but only in the special case when S2 = 0. If that special case occurs - let us call it bankruptcy - then we do not need to wait until day 2 to learn what the stock price at day two is going to be. It is going to remain 0. Finally, when day 2 dawns, we know exactly which ω occurred, and the σ-algebra F2 consists of all subsets of Ω.

Let us think for a while how we acquired the extra information on each new day. We


have learned it through the random variables S0, S1 and S2 as their values gradually revealed themselves to us. We can therefore say that the σ-algebra F1 is generated by the random variable S1, in the notation F1 = σ(S1). In other words, F1 consists of exactly those subsets of Ω that we can describe in terms of S1 only. Mathematically, F1 is composed of events {ω ∈ Ω : S1(ω) ∈ A} where A can be {130}, {90}, {0}, {90, 130}, {0, 90}, {0, 130}, {0, 90, 130} and, of course, ∅. The σ-algebra³ F2 is generated by S1 and S2. In the same way we can describe F2 as the set of all subsets of Ω which can be described only in terms of the random variables S1 and S2. In this case the notation is F2 = σ(S1, S2).

Imagine now a trader who slept through day 1 and woke up on day two to observe S2 = 110. If asked about the price of the stock at day 1, she could not be sure whether it was 130 or 90. In other words, the σ-algebra σ(S2) is strictly smaller (coarser) than σ(S1, S2), even though S2 is revealed after S1.

1.7.4 Filtrations

We have described the σ-algebras F0, F1 and F2 in the previous subsection and interpreted them as the amounts of information available to our agent at days 0, 1 and 2 respectively. The sequence (F_n)_{n∈{0,1,2}} is an instance of a filtration. In mathematical finance and probability theory, a filtration is any family of σ-algebras which gets finer and finer (an increasing family of σ-algebras in the parlance of probability theory).

On a state-space on which a stochastic process S_n is defined, it is natural to define the filtration generated by the process S, denoted by (F^S_n), by F^S_0 = σ(S0), F^S_1 = σ(S0, S1), . . . , F^S_n = σ(S0, S1, . . . , Sn), . . . . The filtration F^S will describe the flow of information of an observer who has access to the value of S_n at time n - no more and no less. Of course, there can be more than one filtration on a given asset-price model. Consider the example from the previous subsection and an insider who knows at time 0 whether or not the company will go bankrupt on day 1, in addition to the information contained in the stock price. His filtration, let us call it (G_n)_{n∈{0,1,2}}, will contain more information than the filtration (F_n)_{n∈{0,1,2}}. The difference will concern only G0 = {∅, {ω6}, {ω1, ω2, ω3, ω4, ω5}, Ω} ⊋ {∅, Ω} = F0, since the extra information our insider has will be revealed to everyone at time 1, so that G1 = F1

and G2 = F2. Can we still view the insider's σ-algebra G0 as being generated by a random variable? The answer is "yes" if we introduce the random variable B (B is for bankruptcy, not Brownian motion) and define

B(ω) = 1 for ω = ω6, and B(ω) = 0 for ω ∈ {ω1, ω2, ω3, ω4, ω5}.

³ F2 in our case contains all the information there is, but you can easily imagine time going on beyond day two, so that there is more to the world than just S1 and S2.


Then G0 = σ(B) = σ(B, S0). Obviously, G1 = σ(B, S1), but the knowledge of S1 implies the knowledge of B, so G1 = σ(S1).

The random variable B has a special name in probability theory. It is called the indicator of the set {ω6} ⊆ Ω. In general, for an event A ⊆ Ω we define the indicator of A as the random variable 1_A defined by

1_A(ω) = 1 for ω ∈ A, and 0 for ω ∉ A.

Indicators are useful because they can reduce probabilities to expectations: P[A] = E[1_A] (prove this in the finite-Ω case!). For that reason we will develop the theory of conditional expectations only, because we can always understand probabilities as expectations of indicators.

Finally, let us mention the concept of measurability. As we have already concluded, the insider has an advantage over the public only at time 0. At time 1, everybody observes S1 and knows whether the company went bankrupt or not, so that the information in S1 contains all the information in B. You can also rephrase this as σ(B, S1) = σ(S1), or σ(B) ⊆ σ(S1) - the partition generated by the random variable S1 is finer (has more elements) than the partition generated by the random variable B. In that case we say that B is measurable with respect to σ(S1). In general we will say that the random variable X is measurable with respect to the σ-algebra F if σ(X) ⊆ F.

1.7.5 Conditional Expectation in Discrete Time

Let us consider another stock-price model, described in the figure below. The state space Ω is discrete and Ω = {ω1, ω2, ω3, ω4, ω5, ω6}. There are 3 time periods, and we always assume that the probabilities are assigned so that at any branch-point all the branches have equal probabilities. Stock prices at times t = 0, 1, 2, 3 are given as follows: S0(ω) = 100, for all ω ∈ Ω,

[Figure: the price tree, showing S0, S1, S2, S3 and the six terminal states ω1, . . . , ω6.]

Fig 10. Another Tree

S1(ω) = 80 for ω ∈ {ω1, ω2}, 120 for ω ∈ {ω3, ω4}, 140 for ω ∈ {ω5, ω6},

S2(ω) = 100 for ω ∈ {ω1, ω2, ω3, ω4}, 130 for ω ∈ {ω5, ω6},

S3(ω) = 80 for ω = ω1, 100 for ω ∈ {ω2, ω3}, 120 for ω ∈ {ω4, ω5}, 140 for ω = ω6.


The question we are facing now is how to use the notion of information to give a meaning to the concept of conditional expectation (and then, conditional probability, . . . ). We will use the notation

E[X|F]

for the conditional expectation of the random variable X with respect to (given) the σ-algebra F. Let us look at some examples, and try to figure out what properties the conditional expectation should have.

Example 1.7.4. Suppose you are sitting at time t = 0, and trying to predict the future. What is your best guess of the price at time t = 1? Since there is nothing to guide you in your guessing and tell you whether the stock price will go up to 140, 120, or down to 80, and the probability of each of those movements is 1/3, the answer should be

(1/3)·140 + (1/3)·120 + (1/3)·80 = 113 1/3, and we can, thus, say that E[S1 | F^S_0] = E[S1] = 113 1/3,

because the expected value of S1 given no information (remember, F^S_0 = {∅, Ω}) is just the (ordinary) expectation of S1, and that turns out to be 113 1/3.

Example 1.7.5. After 24 hours, day t = 1 arrives and your information contains the value of the stock-price at time t = 1 (and all stock prices in the past). In other words, you have F^S_1 at your disposal. The task of predicting the future stock prices at t = 2 is trivial (since there is a deterministic relationship between S1 and S2 according to the model depicted in the figure above). So, we have

E[S2 | F^S_1](ω) = 130 if S1(ω) = 140, and 100 if S1(ω) = 120 or 80, which is . . . surprise, surprise! . . . = S2(ω).

Note that the conditional expectation depends on ω, but only through the available information.

Example 1.7.6. What about day t = 2? Suppose first that we have all the information about the past, i.e. we know F^S_2.

If S2 is equal to 130 then we know that S3 will be equal to either 140 or 120, each with probability 1/2, so that E[S3 | F^S_2](ω) = (1/2)·140 + (1/2)·120 = 130 for ω such that S2(ω) = 130.

On the other hand, when S2 = 100 and S1 = 120, the value of S3 is equal to either 120 or 100, each with probability 1/2. Similarly, when S2 = 100 and S1 = 80, S3 will choose between 100 and 80 with equal probabilities. To summarize,

E[S3 | F^S_2](ω) = 130 if S2(ω) = 130; 110 = (1/2)·120 + (1/2)·100 if S2(ω) = 100 and S1(ω) = 120; 90 = (1/2)·80 + (1/2)·100 if S2(ω) = 100 and S1(ω) = 80.


Let’s now turn to the case in which we know the value of S2, but are completely ignorantof the value of S1. Then, our σ-algebra is σ(S2) and not FS

2 = σ(S1, S2) anymore. Tocompute the conditional expectation of S3 we reason as follows:

If S2 = 130, S3 could be either 120 or 140 with equal probabilities, so that E[S3|σ(S2)](ω) =12120 + 1

2140 = 130 for ω such that S2(ω) = 120.When S2 = 100, we do not have the knowledge of the value of S1 at our disposal to tell us

whether the price will branch between 100 and 120 (upper 100-dot in the figure), or between80 and 100 (lower 100-dot in the figure). Since it is equally likely that S1 = 120 and S1 = 80,we conclude that, given our knowledge, S3 will take values 80, 100 and 120, with probabilities1/4, 1/2 and 1/4 respectively. Therefore,

E[S3|σ(S2)](ω) =

130, S2(ω) = 120

100 = 1480 + 1

2100 + 14120, S2(ω) = 100.

What have we learned from the previous examples? Here are some properties of the conditional expectation that should be intuitively clear now. Let X be a random variable, and let F be a σ-algebra. Then the following hold true:

(CE1) E[X|F] is a random variable, and it is measurable with respect to F, i.e. E[X|F] depends on the state of the world, but only through the information contained in F.

(CE2) E[X|F](ω) = E[X], for all ω ∈ Ω, if F = {∅, Ω}, i.e. the conditional expectation reduces to the ordinary expectation when you have no information.

(CE3) E[X|F](ω) = X(ω), if X is measurable with respect to F, i.e. there is no need for expectation when you already know the answer (the value of X is known when you know F).


Let us try to picture the conditional expectations from the previous examples from a slightly different point of view. Imagine Ω = {ω1, ω2, ω3, ω4, ω5, ω6} as a collection of 6 points on the real line, and random variables as real-valued functions there. If we superimpose S3 and E[S3 | F^S_2] on the same plot, we notice that E[S3 | F^S_2] is constant on every atom⁴ of F^S_2, as it should be, since there is no information to differentiate between the ω's in the same atom. This situation is depicted in the figure on the right. Moreover, the value of E[S3 | F^S_2] on each ω in a given atom is the average of the values of S3 on that atom. In our case, each ω ∈ Ω has the same probability, so that the averaging is performed with the same weights. In the general case, one would use weights proportional to the probabilities, of course.

⁴ An atom of a σ-algebra F is an element of the partition that generates it. It is the smallest set of ω's such that F cannot distinguish between them.

Fig 11. Conditional Expectation of S3 w.r.t. F^S_2

Fig 12. Conditional Expectation of S3 w.r.t. σ(S2)

The figure on the left depicts the conditional expectation with respect to the smaller σ-algebra σ(S2). Again, the value of E[S3 | σ(S2)] is constant on {ω1, . . . , ω4}, as well as on {ω5, ω6}, and is equal to the averages of S3 on these two atoms. By looking at the last two figures, you can easily convince yourself that we would get the identical answer if we computed the conditional expectation of Y = E[S3 | F^S_2] with respect to σ(S2) instead of the conditional expectation of S3 with respect to σ(S2). There is more at play here than mere coincidence. The following is always true for the conditional expectation, and is usually called the tower property of conditional expectation:

(CE4) When X is a random variable and F ,G two σ-algebras such that F ⊆ G, then

E[E[X|G]|F ] = E[X|F ], i.e.

my expectation (given information F) of the value of X is equal to the expectation (given information F) of the expectation I would have about the value of X if I had information G on top of F.


The discussion above points towards a general algorithm for computing conditional expectations in the discrete-time case⁵.

1.7.6 A Formula for Computing Conditional Expectations

Let Ω = {ω1, ω2, . . . , ωn} be a state space and let the probability P be assigned so that P[ω] > 0 for each ω ∈ Ω. Further, let X : Ω → R be a random variable, and let F be a σ-algebra on Ω which partitions Ω into m atoms A1 = {ω^1_1, ω^1_2, . . . , ω^1_{n1}}, A2 = {ω^2_1, ω^2_2, . . . , ω^2_{n2}}, . . . , and Am = {ω^m_1, ω^m_2, . . . , ω^m_{nm}} (in which case A1 ∪ A2 ∪ · · · ∪ Am = Ω, Ai ∩ Aj = ∅ for i ≠ j, and n1 + n2 + · · · + nm = n). Then

E[X|F](ω) = ( Σ_{i=1}^{n1} X(ω^1_i) P[ω^1_i] ) / ( Σ_{i=1}^{n1} P[ω^1_i] ) = ( Σ_{i=1}^{n1} X(ω^1_i) P[ω^1_i] ) / P[A1], for ω ∈ A1,

E[X|F](ω) = ( Σ_{i=1}^{n2} X(ω^2_i) P[ω^2_i] ) / ( Σ_{i=1}^{n2} P[ω^2_i] ) = ( Σ_{i=1}^{n2} X(ω^2_i) P[ω^2_i] ) / P[A2], for ω ∈ A2,

. . .

E[X|F](ω) = ( Σ_{i=1}^{nm} X(ω^m_i) P[ω^m_i] ) / ( Σ_{i=1}^{nm} P[ω^m_i] ) = ( Σ_{i=1}^{nm} X(ω^m_i) P[ω^m_i] ) / P[Am], for ω ∈ Am.

This big expression is nothing but a rigorous mathematical statement of the facts that the conditional expectation is constant on the atoms of F and that its value on each atom is calculated by taking the average of X there using the relative probabilities as weights. Test your understanding of the notation and the formula above by convincing yourself that

E[X|F](ω) = Σ_{i=1}^{m} E[X|A_i] 1_{A_i}(ω), where E[X|A] = E[X 1_A] / P[A].   (1.7.2)
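For a finite Ω this formula is straightforward to turn into code; here is a small Python sketch (the code is my own illustration, reproducing the computation of E[S3 | σ(S2)] from Example 1.7.6 in the stock-price model above).

    def cond_expectation(X, P, partition):
        # E[X|F](omega) = E[X 1_A] / P[A] on the atom A that contains omega
        result = {}
        for atom in partition:
            avg = sum(X[w] * P[w] for w in atom) / sum(P[w] for w in atom)
            for w in atom:
                result[w] = avg
        return result

    # the model above: each omega has probability 1/6, X = S3,
    # and the atoms of sigma(S2) are {w1, w2, w3, w4} and {w5, w6}
    P = {w: 1.0 / 6.0 for w in ["w1", "w2", "w3", "w4", "w5", "w6"]}
    S3 = {"w1": 80, "w2": 100, "w3": 100, "w4": 120, "w5": 120, "w6": 140}
    atoms = [["w1", "w2", "w3", "w4"], ["w5", "w6"]]
    print(cond_expectation(S3, P, atoms))   # 100 on the first atom, 130 on the second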

The first simple application of the formula above is to prove the linearity of conditional expectation:

(CE5) The conditional expectation is linear, i.e. for random variables X1, X2, real constants α1, α2, and a σ-algebra F, we have

E[α1 X1 + α2 X2 | F] = α1 E[X1|F] + α2 E[X2|F].

⁵ Unfortunately, this will not work in the continuous case, because the σ-algebras do not have an atomic structure there.


To establish the truth of the above equality, let A1, . . . , Am be the atoms of the partition generating F. We can use the formula (1.7.2) to get

E[α1 X1 + α2 X2 | F](ω) = Σ_{i=1}^{m} ( E[(α1 X1 + α2 X2) 1_{A_i}] / P[A_i] ) 1_{A_i}(ω) = Σ_{i=1}^{m} ( (α1 E[X1 1_{A_i}] + α2 E[X2 1_{A_i}]) / P[A_i] ) 1_{A_i}(ω) = α1 E[X1|F](ω) + α2 E[X2|F](ω).

Let us give another illustration of the usefulness of the formula (1.7.2) by computing a slightly more complicated conditional expectation in our stock-price model.

Example 1.7.7. Suppose, for example, that all you know is the difference S3 − S1, so that your information is described by the σ-algebra generated by S3 − S1, and given by the partition {A1, A2, A3} where A1 = {ω1, ω4, ω6}, A2 = {ω2} and A3 = {ω3, ω5}, since

S3(ω) − S1(ω) = 0 for ω ∈ {ω1, ω4, ω6}, −20 for ω ∈ {ω3, ω5}, and 20 for ω = ω2.

In this case m = 3, n1 = 3, n2 = 1 and n3 = 2, and

S3(ω) − S0(ω) = −20 for ω = ω1, 0 for ω ∈ {ω2, ω3}, 20 for ω ∈ {ω4, ω5}, and 40 for ω = ω6,

so that

E[(S3 − S0) | σ(S3 − S1)](ω) = ((1/6)(−20) + (1/6)·20 + (1/6)·40)/(1/2) = 13 1/3 for ω ∈ A1,
E[(S3 − S0) | σ(S3 − S1)](ω) = ((1/6)·0)/(1/6) = 0 for ω ∈ A2,
E[(S3 − S0) | σ(S3 − S1)](ω) = ((1/6)·0 + (1/6)·20)/(1/3) = 10 for ω ∈ A3.

Fig 13. Conditional Expectation of S3 − S0 w.r.t. σ(S3 − S1)

1.7.7 Further properties of Conditional Expectation

Having established that the conditional expectation is constant on the atoms of the σ-algebra F, we have used this fact in the derivation of the formula (1.7.2). We can easily generalize this observation (the proof is left as an exercise), and state it in the form of the following very useful property:

(CE6) Let F be a σ-algebra, and let X and Y be random variables such that Y is measurable with respect to F. Then

E[XY |F ](ω) = Y (ω)E[X|F ](ω),


i.e. a random variable measurable with respect to the available information can be treated as a constant in conditional expectations.

In our final example, we will see how independence affects conditioning, i.e. how conditioning on irrelevant information does not differ from conditioning on no information.

Example 1.7.8. In this example we will leave our stock-price model and consider a gamble where you draw a card from a deck of 52. The rules are such that you win X = $13 if the card drawn is a Jack and nothing otherwise. Since there are 4 Jacks in the deck of 52, your expected winnings are E[X] = (4/52)·$13 + (48/52)·$0 = $1. Suppose that you happen to know (by ESP, let's assume) the suit of the next card you draw. How will that affect your expected winnings? Intuitively, the expectation should stay the same since the proportion of Jacks among spades (say) is the same as the proportion of Jacks among all cards. To show that this intuition is true, we construct a model where

Ω = {ω♣_1, ω♣_2, . . . , ω♣_13, ω♦_1, ω♦_2, . . . , ω♦_13, ω♥_1, ω♥_2, . . . , ω♥_13, ω♠_1, ω♠_2, . . . , ω♠_13},

and, for example, the card drawn in the state of the world ω♠_12 is the Jack of spades. Every ω ∈ Ω has the probability 1/52. The σ-algebra F corresponding to your extra information (the suit of the next card) is generated by the partition whose atoms are A♣ = {ω♣_1, ω♣_2, . . . , ω♣_13}, A♦ = {ω♦_1, ω♦_2, . . . , ω♦_13}, A♥ = {ω♥_1, ω♥_2, . . . , ω♥_13}, and A♠ = {ω♠_1, ω♠_2, . . . , ω♠_13}. By the

formula for conditional expectation, we have

E[X|F](ω) = ((1/52)·$0 + (1/52)·$0 + · · · + (1/52)·$13 + · · · + (1/52)·$0)/(1/4) for ω ∈ A♣, and similarly for ω ∈ A♦, ω ∈ A♥ and ω ∈ A♠ (the sum in each case running over the 13 cards of the corresponding suit), which gives

E[X|F] = $1 = E[X],

just as we expected.

The identity E[X|F] = E[X] is not a coincidence, either. The information contained in the σ-algebra F is, in a sense, independent of the value of X. Mathematically, we have the following definition:

Definition 1.7.9.

• We say that the σ-algebras F and G are independent if for each A ∈ F and B ∈ G we have P[A ∩ B] = P[A]P[B].

• We say that the random variable X and the σ-algebra F are independent if the σ-algebras σ(X) and F are independent.


In general, we have the following property of conditional expectation

(CE7) If the random variable X and the σ-algebra F are independent, then

E[X|F ](ω) = E[X], for all ω,

i.e. conditioning with respect to independent information is like conditioning on no information at all.

It is quite easy to prove (CE7) by using the formula (1.7.2) and the fact that, because of the independence of X and F, we have E[X 1_A] = E[X]E[1_A] = E[X]P[A] for A ∈ F (supply all the details yourself!), so

E[X|F] = Σ_{i=1}^{m} ( E[X 1_{A_i}] / P[A_i] ) 1_{A_i} = Σ_{i=1}^{m} E[X] 1_{A_i} = E[X] Σ_{i=1}^{m} 1_{A_i} = E[X],

because (Ai)i=1,...,m is a partition of Ω.

1.7.8 What happens in continuous time?

All of the concepts introduced above can be defined in continuous time as well. Although the technicalities can be overwhelming, the ideas are pretty much the same. The filtration will be defined in exactly the same way - as an increasing family of σ-algebras. The index set might be [0,∞) (or [0, T] for some time-horizon T), so that the filtration will look like (F_t)_{t∈[0,∞)} (or (F_t)_{t∈[0,T]}). Naturally, we can also define discrete filtrations (F_n)_{n∈N} on infinite Ω's.

The concept of a σ-algebra generated by a random variable X is considerably harder to define in continuous time, i.e. in the case of an infinite Ω, but carries the same interpretation. Following the same motivational logic as in the discrete case, the σ-algebra σ(X) should contain all the subsets of Ω of the form {ω ∈ Ω : X(ω) ∈ (a, b)}, for a < b, such that a, b ∈ R. These will, unfortunately, not form a σ-algebra, so that (and this is a technicality you can skip if you wish) σ(X) is defined as the smallest σ-algebra containing all the sets of the form {ω ∈ Ω : X(ω) ∈ (a, b)}, for a < b, such that a, b ∈ R. In the same way we can define the σ-algebra generated by 2 random variables, σ(X, Y), or by infinitely many, σ((X_s)_{s∈[0,t]}). Think of them as the σ-algebras containing all the sets whose occurrence (or non-occurrence) can be phrased in terms of the random variables X and Y, or X_s, s ∈ [0, t]. It is important to note that σ-algebras on infinite Ω's do not necessarily carry an atomic structure, i.e. they are not generated by partitions. That is why we do not have a simple way of describing σ-algebras anymore - listing the atoms will just not work.

As an example, consider an asset price modeled by a stochastic process (X_t)_{t∈[0,∞)} - e.g. a Brownian motion. By the time t we have observed the values of the random variables X_s for s ∈ [0, t], so that our information dynamics can be described by the filtration (F_t)_{t∈[0,∞)}


given by F_t = σ((X_s)_{s∈[0,t]}). This filtration is called the filtration generated by the process (X_t)_{t∈[0,∞)} and is usually denoted by (F^X_t)_{t∈[0,∞)}. In our discrete time example in the previous subsection, the (non-insider's) filtration is the one generated by the stochastic process (S_n)_{n∈{0,1,2}}.

Example 1.7.10. Let Bt be a Brownian motion, and let F^B be the filtration generated by it. The σ-algebra F^B_7 will contain information about B1, B3, B3 + B4, −sin(B4), 1_{B2≥−7}, but not about B8, or B9 − B5, because we do not know their values if our information is F^B_7. We can, thus, say that the random variables B1, B3, B3 + B4, −sin(B4), 1_{B2≥−7} are measurable with respect to F^B_7, but B8 and B9 − B5 are not.

The notion of conditional expectation can be defined on general Ω's with respect to general σ-algebras:

Proposition 1.7.11. Let Ω be a state-space and F a σ-algebra on it. For any X such that E[X] exists, we can define the random variable E[X|F], such that the properties (CE1)-(CE7) carry over from the discrete case.

Remark 1.7.1. The proviso that E[X] exists is of a technical nature, and we will always pretend that it is fulfilled. When (and if) it happens that a random variable admits no expectation, you will be warned explicitly.

Calculating the conditional expectation in continuous time is, in general, considerably harder than in discrete time because there is no formula or algorithm analogous to the one we had in discrete time. There are, however, two techniques that will cover all the examples in this course:

• Use properties (CE1)-(CE7) You will often be told about certain relationships (independence, measurability, . . . ) between random variables and a σ-algebra in the statement of the problem. Then you can often reach the answer by a clever manipulation of the expressions using properties (CE1)-(CE7).

• Use conditional densities In the special case when the σ-algebra F you are conditioning on is generated by a random vector (X1, X2, . . . , Xn) and the joint density of the random vector (X, X1, X2, . . . , Xn) is known, you can calculate the conditional expectation

E[g(X)|F ] = E[g(X)|σ(X1, X2, . . . , Xn)]

by following this recipe:

– compute the conditional density f_{X|X1,X2,...,Xn}(x | X1 = ξ1, X2 = ξ2, . . . , Xn = ξn), and use it to arrive at the function

h(ξ1, ξ2, . . . , ξn) = E[g(X) | X1 = ξ1, X2 = ξ2, . . . , Xn = ξn] = ∫_{−∞}^{∞} g(x) f_{X|X1,X2,...,Xn}(x | X1 = ξ1, X2 = ξ2, . . . , Xn = ξn) dx.


– plug in X1, X2, . . . , Xn for ξ1, ξ2, . . . , ξn in h(ξ1, ξ2, . . . , ξn), and you are done. In other words,

E[g(X)|F ] = E[g(X)|σ(X1, X2, . . . , Xn)] = h(X1, X2, . . . , Xn).

Example 1.7.12. Let (Bt)_{t∈[0,∞)} be a Brownian motion, and let (F^B_t)_{t∈[0,∞)} be the filtration it generates. For s < t, let us try to compute E[Bt | F^B_s] using the rules (CE1)-(CE7). First of all, by the definition of the Brownian motion, the increment Bt − Bs is independent of all that happened before (or at) time s, so the random variable Bt − Bs must be independent of the σ-algebra F^B_s which contains exactly the past before and up to s. Using (CE7) we have E[Bt − Bs | F^B_s] = E[Bt − Bs] = 0. The linearity (CE5) of conditional expectation now implies that E[Bt | F^B_s] = E[Bs | F^B_s]. We know Bs when we know F^B_s, so E[Bs | F^B_s] = Bs, by (CE3). Therefore, E[Bt | F^B_s] = Bs. Can you compute this conditional expectation using the other method with conditional densities? How about E[Bt | σ(Bs)]?

Example 1.7.13. Let's try to compute a slightly more complicated conditional quantity. How much is P[Bt ≥ a | σ(Bs)]? Here, we define P[A | σ(Bs)] := E[1_A | σ(Bs)]. We have all the necessary prerequisites for using the conditional-density method:

• we are conditioning with respect to a σ-algebra generated by a random variable (a random vector of length n = 1)

• the joint density of (Bt, Bs) is known

• the required conditional expectation can be expressed as

P[Bt ≥ a | σ(Bs)] = E[1_{Bt≥a} | σ(Bs)] = E[g(Bt) | σ(Bs)], where g(x) = 1 for x ≥ a, and 0 for x < a.

We have learned at the beginning of this section that the conditional density of Bt givenBs = ξ is the density of a univariate normal distribution with mean ξ and variance t− s, i.e.

fBt|Bs(x|Bs = ξ) =

1√2π(t− s)

exp(−(x− ξ)2

2(t− s)),

and so P[Bt ≥ a|σ(Bs)] = h(Bs), where

h(ξ) =∫ ∞

−∞g(x)

1√2π(t− s)

exp(−(x− ξ)2

2(t− s)) dx =

∫ ∞

a

1√2π(t− s)

exp(−(x− ξ)2

2(t− s)) dx = 1−Φ

(a− ξ√t− s

),

i.e.

P[Bt ≥ a|σ(Bs)] = 1− Φ(a−Bs√t− s

),

where Φ(·) is the cumulative distribution function of a standard unit normal - Φ(x) =∫ x−∞

1√2π

exp(− ξ2

2 ) dξ.
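This last formula is easy to test numerically. The following Python sketch (not part of the original notes; the values of $s$, $t$, $a$ and the sample sizes are arbitrary illustrative choices) simulates pairs $(B_s, B_t)$, conditions crudely by keeping the paths whose $B_s$ lies near a few fixed values, and compares the empirical frequency of $\{B_t \ge a\}$ with $1 - \Phi\big((a - B_s)/\sqrt{t-s}\big)$.

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    # cumulative distribution function of the standard normal
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(0)
s, t, a = 1.0, 2.0, 0.5                              # illustrative values only
n_paths = 500_000

Bs = rng.normal(0.0, np.sqrt(s), n_paths)            # B_s ~ N(0, s)
Bt = Bs + rng.normal(0.0, np.sqrt(t - s), n_paths)   # independent increment B_t - B_s

for center in (-1.0, 0.0, 1.0):
    in_bin = np.abs(Bs - center) < 0.05               # crude conditioning on B_s ~ center
    empirical = np.mean(Bt[in_bin] >= a)
    formula = 1.0 - Phi((a - center) / sqrt(t - s))
    print(f"B_s near {center:+.1f}: empirical {empirical:.3f} vs formula {formula:.3f}")
```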


1.7.9 Martingales

Now that we have the conditional expectation at our disposal, we can define the central concept of these lecture notes:

Definition 1.7.14. Let $(X_t)_{t\in[0,\infty)}$ be a stochastic process, and let $(\mathcal{F}_t)_{t\in[0,\infty)}$ be a filtration. $X$ is called a (continuous-time) $\mathcal{F}_t$-martingale if

• $X_t$ is $\mathcal{F}_t$-measurable for each $t$, and

• $E[X_t|\mathcal{F}_s] = X_s$, for all $s, t \in [0,\infty)$ such that $s < t$.

Before we give the intuition behind the concept of a martingale, here are some remarks:

1. Note that the concept of a martingale makes sense only if you specify the filtration $\mathcal{F}$. When the filtration is not specified explicitly, we will always assume that it is generated by the process itself, i.e. $\mathcal{F}_t = \mathcal{F}^X_t = \sigma(X_s,\ 0 \le s \le t)$.

2. In general, if for a process $(X_t)_{t\in[0,\infty)}$ the random variable $X_t$ is measurable with respect to $\mathcal{F}_t$ for each $t \ge 0$, we say that $X$ is adapted to the filtration $(\mathcal{F}_t)_{t\in[0,\infty)}$. Of course, analogous definitions apply to the finite-horizon case $(X_t)_{t\in[0,T]}$, as well as to the discrete case $(X_n)_{n\in\mathbb{N}}$.

3. You can define a discrete-time martingale $(X_n)_{n\in\mathbb{N}}$ with respect to a filtration $(\mathcal{F}_n)_{n\in\mathbb{N}}$ by making the (formally) identical requirements as above, replacing $s$ and $t$ by $n, m \in \mathbb{N}$. It is, however, a consequence of the tower property that it is enough to require

$$E[X_{n+1}|\mathcal{F}_n] = X_n, \quad \text{for all } n \in \mathbb{N},$$

with each $X_n$ being $\mathcal{F}_n$-measurable.

4. A process that satisfies all the requirements of Definition 1.7.14, except that we have

$$E[X_t|\mathcal{F}_s] \ge X_s \quad \text{instead of} \quad E[X_t|\mathcal{F}_s] = X_s, \quad \text{for } s < t,$$

is called a submartingale. Similarly, if we require $E[X_t|\mathcal{F}_s] \le X_s$ for all $s < t$, the process $X$ is called a supermartingale. Note the apparent lack of logic in these definitions: submartingales increase on average, and supermartingales decrease on average.

5. Martingales have the nice property that $E[X_t] = E[X_0]$ for each $t$. In other words, the expectation of a martingale is a constant function of time. It is a direct consequence of the tower property:

$$E[X_t] = E[E[X_t|\mathcal{F}_0]] = E[X_0].$$


The converse is not true. Take, for example, the process $X_n = n\chi$, where $\chi$ is a unit normal random variable. Then $E[X_n] = nE[\chi] = 0$, but

$$E[X_2|\mathcal{F}^X_1] = E[X_2|\sigma(X_1)] = E[2\chi|\sigma(X_1)] = E[2X_1|\sigma(X_1)] = 2X_1 \ne X_1.$$

Martingales are often thought of as fair games. If the values of the process $X$ up to (and including) $s$ are known, then the best guess of $X_t$ is $X_s$: we do not expect $X_t$ to have either an upward or a downward trend on any interval $(s, t]$. Here are some examples:

Example 1.7.15. The fundamental example of a martingale in continuous time is Brownian motion. In the previous subsection we have proved that $E[B_t|\mathcal{F}^B_s] = B_s$ for all $s < t$, so that $(B_t)_{t\in[0,\infty)}$ is a martingale (we should say an $\mathcal{F}^B_t$-martingale, but we will often omit the explicit mention of the filtration). The Brownian motion with drift $X^\mu_t = B_t + \mu t$ will be a martingale if $\mu = 0$, a supermartingale if $\mu < 0$ and a submartingale if $\mu > 0$. To see that, we compute

$$E[X^\mu_t|\mathcal{F}^B_s] = E[B_t + \mu t|\mathcal{F}^B_s] = B_s + \mu t = X^\mu_s + \mu(t-s),$$

and obviously $X^\mu_s + \mu(t-s) = X^\mu_s$ if $\mu = 0$, $X^\mu_s + \mu(t-s) > X^\mu_s$ if $\mu > 0$, and $X^\mu_s + \mu(t-s) < X^\mu_s$ if $\mu < 0$.

What we actually have proved here is that $X^\mu$ is a (sub-, super-) martingale with respect to the filtration $\mathcal{F}^B$ generated by the Brownian motion $B$. It is, however, easy to see that the filtration generated by $B$ and the one generated by $X^\mu$ are the same, because one can recover the trajectory of the Brownian motion up to time $t$ from the trajectory of $X^\mu$ by simply subtracting $\mu s$ from each $X^\mu_s$. Similarly, if one knows the trajectory of $B$, one can get the trajectory of $X^\mu$ by adding $\mu s$. Therefore, $\mathcal{F}^B_t = \mathcal{F}^{X^\mu}_t$ for each $t$, and we can safely say that $X^\mu$ is an $\mathcal{F}^{X^\mu}$ (super-, sub-) martingale.

Example 1.7.16. To give an example of a martingale in discrete time, let $Y_1, Y_2, \ldots$ be a sequence of independent and identically distributed random variables such that the expectation $\mu = E[Y_1]$ exists. Define

$$X_0 = 0,\ X_1 = Y_1,\ X_2 = Y_1 + Y_2,\ \ldots,\ X_n = Y_1 + Y_2 + \cdots + Y_n,$$

so that $(X_n)_{n\in\mathbb{N}}$ is a general random walk whose steps are distributed as $Y_1$. Let $\mathcal{F}^X$ be the filtration generated by the process $(X_n)_{n\in\mathbb{N}}$. Then

$$E[X_{n+1}|\mathcal{F}^X_n] = E[Y_1 + Y_2 + \cdots + Y_{n+1}|\mathcal{F}^X_n] = E[X_n + Y_{n+1}|\mathcal{F}^X_n] \overset{(CE3)}{=} X_n + E[Y_{n+1}|\mathcal{F}^X_n] \overset{(CE7)}{=} X_n + E[Y_{n+1}] = X_n + \mu,$$

and we conclude that $(X_n)_{n\in\mathbb{N}}$ is a supermartingale if $\mu < 0$, a martingale if $\mu = 0$, and a submartingale if $\mu > 0$. In the computation above we could use (CE3) because $X_n$ is $\mathcal{F}^X_n$-measurable, (CE7) because $Y_{n+1}$ is independent of $\mathcal{F}^X_n$, and $E[Y_{n+1}] = E[Y_1] = \mu$ because all $Y_k$ are identically distributed. What can you say about the case where $X_0 = 1$ and $X_n = \prod_{k=1}^n Y_k = Y_1Y_2\cdots Y_n$?
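The trichotomy supermartingale/martingale/submartingale according to the sign of $\mu$ is visible in a quick simulation. The sketch below is not from the original notes; the normal step distribution, the sample sizes and the checkpoints are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_steps, n_paths = 50, 100_000

for mu in (-0.2, 0.0, 0.2):                        # mean of each step Y_k (illustrative)
    Y = rng.normal(mu, 1.0, size=(n_paths, n_steps))
    X = np.cumsum(Y, axis=1)                       # X_n = Y_1 + ... + Y_n
    print(f"mu = {mu:+.1f}: E[X_10] ~ {X[:, 9].mean():+.3f}, E[X_50] ~ {X[:, -1].mean():+.3f}")
# mu < 0: averages drift down (supermartingale); mu = 0: stay near 0 (martingale);
# mu > 0: drift up (submartingale).
```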

Here is a slightly more convoluted example in continuous time:

Example 1.7.17. Let $(B_t)_{t\in[0,\infty)}$ be a Brownian motion, and let $(\mathcal{F}^B_t)_{t\in[0,\infty)}$ be the filtration generated by $B$. Let us prove first that $X_t = e^{t/2}\sin(B_t)$ is a martingale with respect to $(\mathcal{F}^B_t)_{t\in[0,\infty)}$. As always, we take $s < t$, and use the facts that $\sin(B_s)$ and $\cos(B_s)$ are measurable with respect to $\mathcal{F}^B_s$, and $\sin(B_t - B_s)$ and $\cos(B_t - B_s)$ are independent of $\mathcal{F}^B_s$:

$$E[X_t|\mathcal{F}^B_s] = E[e^{t/2}\sin(B_t)|\mathcal{F}^B_s] = e^{t/2}E[\sin(B_s + (B_t - B_s))|\mathcal{F}^B_s]$$
$$= e^{t/2}E[\sin(B_s)\cos(B_t - B_s) + \cos(B_s)\sin(B_t - B_s)|\mathcal{F}^B_s]$$
$$= e^{t/2}\big(\sin(B_s)E[\cos(B_t - B_s)] + \cos(B_s)E[\sin(B_t - B_s)]\big)$$
$$= \sin(B_s)\big(e^{t/2}E[\cos(B_t - B_s)]\big) = X_s\big(e^{(t-s)/2}E[\cos(B_t - B_s)]\big).$$

We know that $\sin(B_t - B_s)$ is a symmetric random variable ($\sin$ is an odd function, and $B_t - B_s \sim N(0, t-s)$), so $E[\sin(B_t - B_s)] = 0$. We are, therefore, left with the task of proving that $E[\cos(B_t - B_s)] = e^{-(t-s)/2}$. We set $u = t - s$, and remember that $B_t - B_s \sim N(0, u)$. Therefore,

$$E[\cos(B_t - B_s)] = \int_{-\infty}^{\infty}\cos(\xi)\,\frac{1}{\sqrt{2\pi u}}\exp\Big(-\frac{\xi^2}{2u}\Big)\, d\xi = \text{(Maple)} = e^{-u/2} = e^{-(t-s)/2},$$

just as required. We can therefore conclude that $E[X_t|\mathcal{F}^B_s] = X_s$ for all $s < t$, proving that $(X_t)_{t\in[0,\infty)}$ is an $(\mathcal{F}^B_t)_{t\in[0,\infty)}$-martingale. In order to prove that $(X_t)_{t\in[0,\infty)}$ is an $(\mathcal{F}^X_t)_{t\in[0,\infty)}$-martingale, we use the tower property (because $\mathcal{F}^X_t \subseteq \mathcal{F}^B_t$; why?) to obtain

$$E[X_t|\mathcal{F}^X_s] = E\big[E[X_t|\mathcal{F}^B_s]\,\big|\,\mathcal{F}^X_s\big] = E[X_s|\mathcal{F}^X_s] = X_s.$$
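The only step above that required an outside computation is $E[\cos(B_t - B_s)] = e^{-(t-s)/2}$ (done with Maple in the text). A Monte Carlo check is easy; the sketch below is not part of the original notes, and the values of $u = t - s$ and the sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
for u in (0.5, 1.0, 2.0):                          # u = t - s, illustrative values
    Z = rng.normal(0.0, np.sqrt(u), n)             # Z has the law of B_t - B_s ~ N(0, u)
    print(f"u = {u}: E[cos Z] ~ {np.cos(Z).mean():.4f}  vs  exp(-u/2) = {np.exp(-u / 2):.4f}"
          f"   (E[sin Z] ~ {np.sin(Z).mean():+.4f}, should be ~ 0)")
```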

Example 1.7.18. Finally, here is an example of a process that is neither a martingale, nor a super- or a submartingale, with respect to its natural filtration. Let $B_t$ be a Brownian motion, and let $X_t = B_t - tB_1$ be a Brownian bridge. We pick $t = 1$, $s = 1/2$ and write

$$E[X_1|\mathcal{F}^X_{1/2}] = E[0|\mathcal{F}^X_{1/2}] = 0.$$

If $X$ were a martingale we would need $X_{1/2} = 0$; a submartingale would require $X_{1/2} \le 0$, and a supermartingale $X_{1/2} \ge 0$. It is now obvious that for the (normally distributed!) random variable $X_{1/2}$ none of these holds.

1.7.10 Martingales as Unwinnable Gambles

Let $(X_n)_{n\in\mathbb{N}}$ be a discrete-time stochastic process, and let $(\mathcal{F}_n)_{n\in\mathbb{N}}$ be a filtration. Think of $X_n$ as the price of a stock on day $n$, and think of $\mathcal{F}_n$ as the publicly available information on day $n$. Of course, the price of the stock on day $n$ is a part of that information, so that $X_n$ is $\mathcal{F}_n$-adapted.

Suppose that we have another process $(H_n)_{n\in\mathbb{N}}$ which stands for the number of shares of stock in our portfolio in the evening of day $n$ (after we have observed $X_n$ and made the trade for the day). $H$ will be called the trading strategy, for obvious reasons. We start with $x$ dollars, credit is available at a zero interest rate, and the stock can be freely shorted, so that our total wealth at time $n$ is given by $Y^H_n$, where

$$Y^H_1 = x, \quad Y^H_2 = x + H_1(X_2 - X_1), \quad Y^H_3 = Y^H_2 + H_2(X_3 - X_2),\ \ldots,\quad Y^H_n = x + \sum_{k=2}^{n} H_{k-1}\,\Delta X_k,$$

with $\Delta X_k = X_k - X_{k-1}$. The process $Y^H$ is sometimes denoted by $Y^H = H\cdot X$. Here is a theorem which says that you cannot win by betting on a martingale. A more surprising statement is also true: if $X$ is a process such that, no matter how you try, you cannot win by betting on it, then $X$ must be a martingale.
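Before the proof, here is a small numerical illustration of the wealth formula. The Python sketch below is not part of the original notes: it uses a symmetric $\pm 1$ random walk for $X$ and one arbitrary (but adapted) strategy $H$, and checks that $E[Y^H_n]$ stays at the initial wealth $x$, as the theorem below asserts.

```python
import numpy as np

rng = np.random.default_rng(3)
x, n_steps, n_paths = 100.0, 30, 200_000

# X is a symmetric +/-1 random walk, hence a martingale.
steps = rng.choice([-1.0, 1.0], size=(n_paths, n_steps))   # steps[:, k-1] = X_k - X_{k-1}
X = np.cumsum(steps, axis=1)

# An arbitrary adapted rule: hold one share after a down day, none after an up day.
# H[:, k-1] is decided after observing X_k, so it may only multiply later increments.
H = (steps < 0).astype(float)

# Y^H_n = x + sum_{k=2}^{n} H_{k-1} (X_k - X_{k-1})
Y = x + (H[:, :-1] * steps[:, 1:]).sum(axis=1)
print(f"E[Y^H_n] ~ {Y.mean():.3f}   (should be close to x = {x})")
```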

Theorem 1.7.19. $X_n$ is a martingale if and only if $E[Y^H_n] = x$ for each adapted $H$ and each $n$.[6] In that case $Y^H$ is an $\mathcal{F}_n$-martingale for each $H$.

[6] I am ducking some issues here (existence of expectations, for example). We will therefore cheat a little and assume that all expectations exist. This will not be entirely correct from the mathematical point of view, but it will save us from ugly technicalities without skipping the main ideas.

Proof.

• $X$ martingale $\Longrightarrow$ $E[Y^H_n] = x$ for each $H$, $n$.

Let us compute the conditional expectation $E[Y^H_{n+1}|\mathcal{F}_n]$. Note first that $Y^H_{n+1} = Y^H_n + H_n(X_{n+1} - X_n)$, and that $Y^H_n$ and $H_n$ are $\mathcal{F}_n$-measurable, so that

$$E[Y^H_{n+1}|\mathcal{F}_n] = E[Y^H_n + H_n(X_{n+1} - X_n)|\mathcal{F}_n] = Y^H_n + E[H_n(X_{n+1} - X_n)|\mathcal{F}_n] = Y^H_n + H_n\big(E[X_{n+1}|\mathcal{F}_n] - X_n\big) = Y^H_n,$$

so that $Y^H$ is a martingale, and in particular $E[Y^H_n] = E[Y^H_1] = x$.

• $E[Y^H_n] = x$ for each $H$, $n$ $\Longrightarrow$ $X$ martingale. (You can skip this part of the proof if you find it too technical and hard!)

Since we know that $E[Y^H_n] = x$ for all $n$ and all $H$, we will pick a special trading strategy and use it to prove our claim. So, pick $n \in \mathbb{N}$, pick any set $A \in \mathcal{F}_n$, and let $H$ be the following gambling strategy:

– do nothing until day $n$,


– buy 1 share of stock in the evening of day $n$ if the event $A$ happened (you know whether $A$ happened or not on day $n$, because $A \in \mathcal{F}_n$); if $A$ did not happen, do nothing,

– wait until tomorrow, and liquidate your position,

– retire from the stock market and take up fishing.

Mathematically, $H$ can be expressed as

$$H_k(\omega) = \begin{cases} 1, & \omega \in A \text{ and } k = n \\ 0, & \text{otherwise,} \end{cases}$$

so that for $k \ge n+1$ we have $Y^H_k = x + (X_{n+1} - X_n)\mathbf{1}_A$. By the assumption of the theorem, we conclude that

$$E[X_{n+1}\mathbf{1}_A] = E[X_n\mathbf{1}_A], \quad \text{for all } n \text{ and all } A \in \mathcal{F}_n. \tag{1.7.3}$$

If we succeed in proving that (1.7.3) implies that $(X_n)_{n\in\mathbb{N}}$ is a martingale, we are done. We argue by contradiction: suppose (1.7.3) holds, but $X$ is not a martingale. That means that there exists some $n \in \mathbb{N}$ such that $E[X_{n+1}|\mathcal{F}_n] \ne X_n$. In other words, the random variables $X_n$ and $Y = E[X_{n+1}|\mathcal{F}_n]$ are not identical. It follows that at least one of the sets

$$B = \{\omega \in \Omega : X_n(\omega) > Y(\omega)\}, \quad C = \{\omega \in \Omega : X_n(\omega) < Y(\omega)\}$$

is nonempty (has positive probability, to be more precise). Without loss of generality, we assume that $P[B] > 0$. Since both $X_n$ and $Y$ are $\mathcal{F}_n$-measurable, so is the set $B$, and thus $\mathbf{1}_B$ is $\mathcal{F}_n$-measurable. Since $Y$ is strictly smaller than $X_n$ on the set $B$ and $P[B] > 0$, we have

$$E[X_n\mathbf{1}_B] > E[Y\mathbf{1}_B] = E[E[X_{n+1}|\mathcal{F}_n]\mathbf{1}_B] = E[E[X_{n+1}\mathbf{1}_B|\mathcal{F}_n]] = E[X_{n+1}\mathbf{1}_B],$$

which is in contradiction with (1.7.3) (just take $A = B$). We are done.


Exercises

Exercise 1.7.28. Let $(B_t)_{t\in[0,\infty)}$ be a Brownian motion. For $0 < s < t$, what is the conditional density of $B_s$, given $B_t = \xi$? How about the other way around: the conditional density of $B_t$ given $B_s = \xi$?

Exercise 1.7.29. Let $Y_1$ and $Y_2$ be two independent random variables with exponential distributions and parameters $\lambda_1 = \lambda_2 = \lambda$. Find the conditional density of $X_1 = Y_1$ given $X_2 = \xi$, for $X_2 = Y_1 + Y_2$.

Exercise 1.7.30. In the Cox-Ross-Rubinstein model of stock prices there are $n$ time periods. The stock price at time $t = 0$ is given by a fixed constant $S_0 = s$. The price at time $t = k+1$ is obtained from the price at time $t = k$ by multiplying it by a random variable $X_k$ with the distribution

$$X_k \sim \begin{pmatrix} 1+a & 1-b \\ p & 1-p \end{pmatrix}, \quad \text{where } a > 0,\ 0 < b < 1, \text{ and } 0 < p < 1.$$

The returns $X_k$ are assumed to be independent of the past stock prices $S_0, S_1, \ldots, S_k$. In this problem we will assume that $n = 3$, so that the relevant random variables are $S_0, S_1, S_2$ and $S_3$.

(a) Sketch a tree representation of the above stock-price model, and describe the state space and the probability measure on it by computing $P[\omega]$ for each $\omega \in \Omega$. Also, write down the values of $S_0(\omega), S_1(\omega), S_2(\omega), S_3(\omega), X_0(\omega), X_1(\omega)$ and $X_2(\omega)$ for each $\omega \in \Omega$. (Note: you should be able to do it on an $\Omega$ consisting of only 8 elements.)

(b) Find the atoms of each of the $\sigma$-algebras $\mathcal{F}^S_k$, $k = 0, 1, 2, 3$, where $\mathcal{F}^S_k = \sigma(S_0, S_1, \ldots, S_k)$.

(c) Compute (1) $E[S_2|\mathcal{F}^S_3]$, (2) $E[S_2|\sigma(S_3)]$, (3) $E\big[\frac{S_2}{S_1}\big|\mathcal{F}^S_1\big]$, (4) $E[S_1 + S_2|\sigma(S_3)]$.

(d) Describe (in terms of the atoms of its $\sigma$-algebras) the filtration of an insider who knows whether $S_3 > S_0$ or not, on top of the public-information filtration $\mathcal{F}^S$.


Exercise 1.7.31. Let $\Omega$ be a finite probability space, $\mathcal{F}$ a $\sigma$-algebra on $\Omega$, and $X$ a random variable. Show that for all $A \in \mathcal{F}$,

$$E\big[E[X|\mathcal{F}]\mathbf{1}_A\big] = E[X\mathbf{1}_A].$$

Note: this property is often used as the definition of the conditional expectation in the case of an infinite $\Omega$.

(Hint: use the fact that each $A \in \mathcal{F}$ can be written as a finite union of atoms of $\mathcal{F}$, and exploit the measurability properties of the conditional expectation $E[X|\mathcal{F}]$.)

Exercise 1.7.32. Let $(S_n)_{n\in\mathbb{N}}$ be a random walk, i.e.

$$\text{(1) } S_0 = 0, \quad \text{(2) } S_n = X_1 + X_2 + \cdots + X_n,$$

where the steps $X_i$ are independent random variables with distribution

$$X_i \sim \begin{pmatrix} -1 & 1 \\ 1/2 & 1/2 \end{pmatrix}.$$

Compute $E[S_n|\sigma(X_1)]$ and $E[X_1|\sigma(S_n)]$. (Hint: think and use symmetry!)

Exercise 1.7.33. Let $(X, Y)$ be a random vector with the density function given by

$$f_{(X,Y)}(x, y) = \begin{cases} kxy, & 1 \le x \le 2 \text{ and } 0 \le y \le x, \\ 0, & \text{otherwise.} \end{cases}$$

(a) Find the constant $k > 0$, so that $f$ above defines a probability density function.

(b) Compute the (marginal) densities of $X$ and $Y$.

(c) Find the conditional expectation $E[U|\sigma(X)]$, where $U = X + Y$.

Exercise 1.7.34. Let $(B_t)_{t\in[0,\infty)}$ be a Brownian motion, and let $\mathcal{F}^B_t = \sigma((B_s)_{s\in[0,t]})$ be the filtration generated by it.

(a) (4pts) Which of the following are measurable with respect to $\mathcal{F}^B_3$:

$$\text{(1) } B_3, \quad \text{(2) } -B_3, \quad \text{(3) } B_2 + B_1, \quad \text{(4) } B_2 + B_{1.5}, \quad \text{(5) } B_4 - B_2, \quad \text{(6) } \int_0^3 B_s\, ds\ ?$$

(b) (4pts) Which of the following are $\mathcal{F}^B$-martingales:

$$\text{(1) } -B_t, \quad \text{(2) } B_t - t, \quad \text{(3) } B_t^2, \quad \text{(4) } B_t^2 - t, \quad \text{(5) } \exp\big(B_t - \tfrac{t}{2}\big)\ ?$$


Exercise 1.7.35. The conditional moment-generating function $M_{X|\mathcal{F}}(\lambda)$ of an $n$-dimensional random vector $X = (X_1, X_2, \ldots, X_n)$ given the $\sigma$-algebra $\mathcal{F}$ is defined by

$$M_{X|\mathcal{F}}(\lambda) = E[\exp(-\lambda_1X_1 - \lambda_2X_2 - \cdots - \lambda_nX_n)|\mathcal{F}], \quad \text{for } \lambda = (\lambda_1, \lambda_2, \ldots, \lambda_n) \in \mathbb{R}^n.$$

(a) Let $B_t$, $t \in [0,\infty)$, be a Brownian motion. Calculate $M_{B_t|\mathcal{F}^B_s}(\lambda)$ for $t > s \ge 0$ and $\lambda \in \mathbb{R}$. Here $(\mathcal{F}^B_t)_{t\in[0,\infty)}$ denotes the filtration generated by the Brownian motion $B$.

(b) Two random variables $X_1$ and $X_2$ are said to be conditionally independent given the $\sigma$-algebra $\mathcal{F}$ if the random vector $X = (X_1, X_2)$ satisfies

$$M_{X|\mathcal{F}}(\lambda) = M_{X_1|\mathcal{F}}(\lambda_1)\,M_{X_2|\mathcal{F}}(\lambda_2), \quad \text{for all } \lambda = (\lambda_1, \lambda_2) \in \mathbb{R}^2.$$

Show that the random variables $B_{t_1}$ and $B_{t_2}$ are conditionally independent given the $\sigma$-algebra $\sigma(B_{t_3})$ for $0 \le t_1 < t_3 < t_2$ (note the ordering of indices!). In other words, for Brownian motion, the past is independent of the future, given the present.


Solutions to Exercises in Section 1.7

Solution to Exercise 1.7.28: The joint density of $(B_s, B_t)$ is bivariate normal with zero mean vector, variances $\mathrm{Var}[B_s] = s$, $\mathrm{Var}[B_t] = t$, and correlation

$$\rho = \frac{\mathrm{Cov}[B_s, B_t]}{\sqrt{\mathrm{Var}[B_s]}\sqrt{\mathrm{Var}[B_t]}} = \frac{s}{\sqrt{st}} = \sqrt{\frac{s}{t}}.$$

Using (1.7.1) we conclude that the conditional density of $B_s$, given $B_t = \xi$, is the density of a normal random variable with

$$\mu_{B_s|B_t=\xi} = 0 + \sqrt{\frac{s}{t}}\,\frac{\sqrt{s}}{\sqrt{t}}(\xi - 0) = \frac{s}{t}\xi \quad \text{and} \quad \sigma_{B_s|B_t=\xi} = \sqrt{s}\,\sqrt{1 - \frac{s}{t}} = \sqrt{\frac{s(t-s)}{t}}.$$

[Fig 14. Mean (solid) and $\pm$ standard deviation (dash-dot) of the conditional distribution of $B_s$, given $B_t = 1.5$, for $t = 2$ and $s \in [0, 2]$.]

As for "the other way around" case, we can use the same reasoning to conclude that the conditional density of $B_t$ given $B_s = \xi$ is normal with mean $\xi$ and standard deviation $\sqrt{t-s}$.
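A crude Monte Carlo check of the first formula: simulate many pairs $(B_s, B_t)$, keep only the paths whose $B_t$ ends near $\xi$, and compare the sample mean and standard deviation of $B_s$ with $\frac{s}{t}\xi$ and $\sqrt{s(t-s)/t}$. The sketch is not part of the original notes and its parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
s, t, xi = 0.7, 2.0, 1.5                       # illustrative values
n = 2_000_000

Bs = rng.normal(0.0, np.sqrt(s), n)
Bt = Bs + rng.normal(0.0, np.sqrt(t - s), n)

keep = np.abs(Bt - xi) < 0.01                  # crude conditioning on B_t ~ xi
print(f"paths kept: {keep.sum()}")
print(f"mean of B_s | B_t=xi : {Bs[keep].mean():.3f}  vs  s*xi/t         = {s * xi / t:.3f}")
print(f"std  of B_s | B_t=xi : {Bs[keep].std():.3f}  vs  sqrt(s(t-s)/t) = {np.sqrt(s * (t - s) / t):.3f}")
```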

Solution to Exercise 1.7.29: Let us first determine the joint density of the random vector $(X_1, X_2) = (Y_1, Y_1 + Y_2)$, by computing its distribution function and then differentiating. Using the fact that an exponentially distributed random variable with parameter $\lambda$ admits a density of the form $\lambda\exp(-\lambda x)$, $x \ge 0$, we have (for $0 \le x_1 \le x_2$)

$$F_{X_1,X_2}(x_1, x_2) = P[Y_1 \le x_1,\ Y_1 + Y_2 \le x_2] = \int_0^{x_1}\int_0^{x_2-y_1}\lambda\exp(-\lambda y_1)\,\lambda\exp(-\lambda y_2)\, dy_2\, dy_1 = \int_0^{x_1}\lambda\exp(-\lambda y_1)\big(1 - \exp(-\lambda(x_2 - y_1))\big)\, dy_1,$$

and we can differentiate with respect to $x_1$ and $x_2$ to obtain

$$f_{X_1,X_2}(x_1, x_2) = \begin{cases} \lambda^2\exp(-\lambda x_2), & 0 \le x_1 \le x_2, \\ 0, & \text{otherwise.} \end{cases}$$

We now use the formula for conditional densities:

$$f_{X_1|X_2}(x_1|X_2 = \xi) = \frac{f_{X_1,X_2}(x_1, \xi)}{f_{X_2}(\xi)} = \begin{cases} \dfrac{\lambda^2\exp(-\lambda\xi)}{f_{X_2}(\xi)}, & 0 \le x_1 \le \xi, \\ 0, & \text{otherwise.} \end{cases}$$

Knowing that, as a function of $x_1$, $f_{X_1|X_2}(x_1|X_2 = \xi)$ is a genuine density function, we do not need to compute $f_{X_2}$ explicitly; we just need to ensure that (note that the integration is with respect to $x_1$!)

$$1 = \int_0^{\xi}\frac{\lambda^2\exp(-\lambda\xi)}{f_{X_2}(\xi)}\, dx_1,$$

and so $f_{X_2}(\xi) = \lambda^2\xi\exp(-\lambda\xi)$ and

$$f_{X_1|X_2}(x_1|X_2 = \xi) = \begin{cases} \frac{1}{\xi}, & 0 \le x_1 \le \xi, \\ 0, & \text{otherwise,} \end{cases}$$

so we conclude that, conditionally on $Y_1 + Y_2 = \xi$, $Y_1$ is distributed uniformly on $[0, \xi]$.

As a by-product of this calculation we obtained that the density of $X_2 = Y_1 + Y_2$ is $f_{X_2}(x) = \lambda^2x\exp(-\lambda x)$, $x \ge 0$. The distribution of $X_2$ is called the Gamma distribution with parameters $(\lambda, 2)$.

Solution to Exercise 1.7.30:

(a) Fig 15 shows the tree representation of the CRR model, drawn to scale with $a = 0.2$ and $b = 0.5$. We can always take $\Omega$ to be the set of all paths the process can take. In this case we will encode each path by a triplet of letters; e.g. $UDD$ will denote the path whose first movement was up, and the next two down. With this convention we have

$$\Omega = \{UUU, UUD, UDU, UDD, DUU, DUD, DDU, DDD\},$$

and $P[UUU] = p^3$, $P[UUD] = P[UDU] = P[DUU] = p^2(1-p)$, $P[DDU] = P[DUD] = P[UDD] = p(1-p)^2$, and $P[DDD] = (1-p)^3$, since the probability of an upward movement is $p$, independently of the other increments.

[Fig 15. The Cox-Ross-Rubinstein model.]

The following table shows the required values of the random variables (the rows represent the elements of $\Omega$, and the columns the random variables):


ω      X0      X1      X2      S0    S1        S2              S3
UUU    (1+a)   (1+a)   (1+a)   s     s(1+a)    s(1+a)^2        s(1+a)^3
UUD    (1+a)   (1+a)   (1-b)   s     s(1+a)    s(1+a)^2        s(1+a)^2(1-b)
UDU    (1+a)   (1-b)   (1+a)   s     s(1+a)    s(1+a)(1-b)     s(1+a)^2(1-b)
UDD    (1+a)   (1-b)   (1-b)   s     s(1+a)    s(1+a)(1-b)     s(1+a)(1-b)^2
DUU    (1-b)   (1+a)   (1+a)   s     s(1-b)    s(1+a)(1-b)     s(1+a)^2(1-b)
DUD    (1-b)   (1+a)   (1-b)   s     s(1-b)    s(1+a)(1-b)     s(1+a)(1-b)^2
DDU    (1-b)   (1-b)   (1+a)   s     s(1-b)    s(1-b)^2        s(1+a)(1-b)^2
DDD    (1-b)   (1-b)   (1-b)   s     s(1-b)    s(1-b)^2        s(1-b)^3

(b) The $\sigma$-algebra $\mathcal{F}^S_0$ is trivial, so its only atom is $\Omega$. The $\sigma$-algebra $\mathcal{F}^S_1$ can distinguish between those $\omega \in \Omega$ for which $S_1(\omega) = s(1+a)$ and those for which $S_1(\omega) = s(1-b)$, so its atoms are $\{UUU, UUD, UDU, UDD\}$ and $\{DUU, DUD, DDU, DDD\}$. Analogous reasoning leads to the atoms $\{UUU, UUD\}$, $\{UDU, UDD\}$, $\{DUU, DUD\}$ and $\{DDU, DDD\}$ of $\mathcal{F}^S_2$. Finally, the atoms of $\mathcal{F}^S_3$ are all one-element subsets of $\Omega$.

(c)

(1) Since $S_2$ is measurable with respect to $\mathcal{F}^S_3$ (i.e. we know the exact value of $S_2$ when we know $\mathcal{F}^S_3$), the property (CE3) implies that $E[S_2|\mathcal{F}^S_3] = S_2$.

(2) There are four possible values for the random variable $S_3$, and so the atoms of the $\sigma$-algebra generated by it are $A_0 = \{UUU\}$, $A_1 = \{UUD, UDU, DUU\}$, $A_2 = \{UDD, DUD, DDU\}$, and $A_3 = \{DDD\}$. By the algorithm for computing discrete conditional expectations from the notes, and using the table above, we have

$$E[S_2|\sigma(S_3)](\omega) = \begin{cases} \dfrac{S_2(UUU)P[UUU]}{P[UUU]} = s(1+a)^2, & \omega \in A_0 \\[2mm] \dfrac{S_2(UUD)P[UUD]+S_2(UDU)P[UDU]+S_2(DUU)P[DUU]}{P[UUD]+P[UDU]+P[DUU]} = s\,\dfrac{(1+a)^2 + 2(1+a)(1-b)}{3}, & \omega \in A_1 \\[2mm] \dfrac{S_2(UDD)P[UDD]+S_2(DUD)P[DUD]+S_2(DDU)P[DDU]}{P[UDD]+P[DUD]+P[DDU]} = s\,\dfrac{2(1+a)(1-b) + (1-b)^2}{3}, & \omega \in A_2 \\[2mm] \dfrac{S_2(DDD)P[DDD]}{P[DDD]} = s(1-b)^2, & \omega \in A_3 \end{cases}$$

Note how the obtained expression does not depend on $p$. Why is that?

(3) $\frac{S_2}{S_1} = X_1$, and this random variable is independent of the past, i.e. independent of the $\sigma$-algebra $\mathcal{F}^S_1$. Therefore, by (CE7), we have

$$E\Big[\frac{S_2}{S_1}\Big|\mathcal{F}^S_1\Big](\omega) = E\Big[\frac{S_2}{S_1}\Big] = E[X_1] = p(1+a) + (1-p)(1-b), \quad \text{for all } \omega \in \Omega.$$


(4) Just as in part (2), we get

$$E[S_1|\sigma(S_3)](\omega) = \begin{cases} \dfrac{S_1(UUU)P[UUU]}{P[UUU]} = s(1+a), & \omega \in A_0 \\[2mm] \dfrac{S_1(UUD)P[UUD]+S_1(UDU)P[UDU]+S_1(DUU)P[DUU]}{P[UUD]+P[UDU]+P[DUU]} = s\,\dfrac{(1-b) + 2(1+a)}{3}, & \omega \in A_1 \\[2mm] \dfrac{S_1(UDD)P[UDD]+S_1(DUD)P[DUD]+S_1(DDU)P[DDU]}{P[UDD]+P[DUD]+P[DDU]} = s\,\dfrac{2(1-b) + (1+a)}{3}, & \omega \in A_2 \\[2mm] \dfrac{S_1(DDD)P[DDD]}{P[DDD]} = s(1-b), & \omega \in A_3, \end{cases}$$

and the desired conditional expectation $E[S_1 + S_2|\sigma(S_3)]$ is obtained by adding the expression above to the expression from (2), by (CE5).

(d) The filtration $\mathcal{G}$ of the insider will depend on the values of $a$ and $b$. I will solve only the case in which $(1+a)^2(1-b) > 1$ and $(1+a)(1-b)^2 < 1$. The other cases are dealt with analogously. By looking at the table above we have

$$\mathcal{G}_0 = \sigma\big(\{UUU, UUD, UDU, DUU\}, \{UDD, DUD, DDU, DDD\}\big)$$
$$\mathcal{G}_1 = \sigma\big(\{UUU, UUD, UDU\}, \{DUU\}, \{UDD\}, \{DUD, DDU, DDD\}\big)$$
$$\mathcal{G}_2 = \sigma\big(\{UUU, UUD\}, \{UDU\}, \{DUU\}, \{UDD\}, \{DUD\}, \{DDU, DDD\}\big)$$
$$\mathcal{G}_3 = \sigma\big(\{UUU\}, \{UUD\}, \{UDU\}, \{DUU\}, \{UDD\}, \{DUD\}, \{DDU\}, \{DDD\}\big)$$
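The 8-path state space is small enough to check these computations by brute force. The Python sketch below is not part of the original notes; the numerical values of $a$, $b$, $p$ and $s$ are arbitrary, and the conditional expectation $E[S_2|\sigma(S_3)]$ is computed directly from its discrete definition (group the paths by the value of $S_3$ and take probability-weighted averages of $S_2$).

```python
from itertools import product
from collections import defaultdict

a, b, p, s = 0.2, 0.5, 0.4, 100.0          # illustrative parameters only
up, down = 1.0 + a, 1.0 - b

omegas = list(product("UD", repeat=3))      # the 8 paths UUU, UUD, ..., DDD

def prob(w):
    return p ** w.count("U") * (1 - p) ** w.count("D")

def S(w, k):
    # S_k = s * X_0 * ... * X_{k-1}
    val = s
    for move in w[:k]:
        val *= up if move == "U" else down
    return val

num, den = defaultdict(float), defaultdict(float)
for w in omegas:
    key = round(S(w, 3), 10)                # atoms of sigma(S_3): paths sharing the value of S_3
    num[key] += S(w, 2) * prob(w)
    den[key] += prob(w)

for key in sorted(num):
    print(f"on {{S_3 = {key:9.3f}}}:  E[S_2 | sigma(S_3)] = {num[key] / den[key]:9.3f}")
# Changing p leaves the printed conditional expectations unchanged, as noted in (2).
```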

Solution to Exercise 1.7.31: Assume first that $A$ is an atom of the $\sigma$-algebra $\mathcal{F}$. Then $E[X|\mathcal{F}]$ is constant on $A$, and its value there is the weighted average of $X$ on $A$; let us call this number $\alpha$. Therefore $E[E[X|\mathcal{F}]\mathbf{1}_A] = E[\alpha\mathbf{1}_A] = \alpha P[A]$. On the other hand, $\alpha = \frac{E[X\mathbf{1}_A]}{P[A]}$, so, after multiplication by $P[A]$, we get

$$E[E[X|\mathcal{F}]\mathbf{1}_A] = \alpha P[A] = E[X\mathbf{1}_A].$$

When $A$ is a general set in $\mathcal{F}$, we can always find atoms $A_1, A_2, \ldots, A_n$ such that $A = A_1 \cup A_2 \cup \cdots \cup A_n$ and $A_i \cap A_j = \emptyset$ for $i \ne j$. Thus $\mathbf{1}_A = \mathbf{1}_{A_1} + \mathbf{1}_{A_2} + \cdots + \mathbf{1}_{A_n}$, and we can use linearity of conditional (and ordinary) expectation, together with the result we have just obtained, to finish the proof.

Solution to Exercise 1.7.32:

The first conditional expectation is easier: note that $X_1$ is measurable with respect to $\sigma(X_1)$, so by (CE3), $E[X_1|\sigma(X_1)] = X_1$. On the other hand, $X_2, X_3, \ldots, X_n$ are independent of $X_1$, and therefore so is $X_2 + X_3 + \cdots + X_n$. Using (CE7) we get $E[X_2 + X_3 + \cdots + X_n|\sigma(X_1)] = E[X_2 + X_3 + \cdots + X_n] = 0$, since the expectation of each increment is 0. Finally,

$$E[S_n|\sigma(X_1)] = E[X_1 + X_2 + \cdots + X_n|\sigma(X_1)] = E[X_1|\sigma(X_1)] + E[X_2 + \cdots + X_n|\sigma(X_1)] = X_1 + 0 = X_1.$$

For the second conditional expectation we will need a little more work. The increments $X_1, X_2, \ldots, X_n$ are independent and identically distributed, so the knowledge of $X_1$ should have the same effect on the expectation of $S_n$ as the knowledge of $X_2$ or $X_{17}$. More formally, we have $S_n = X_k + X_1 + X_2 + \cdots + X_{k-1} + X_{k+1} + \cdots + X_n$, so $E[X_1|\sigma(S_n)] = E[X_k|\sigma(S_n)]$ for any $k$. We know that $E[S_n|\sigma(S_n)] = S_n$, so

$$S_n = E[S_n|\sigma(S_n)] = E[X_1 + X_2 + \cdots + X_n|\sigma(S_n)] = E[X_1|\sigma(S_n)] + E[X_2|\sigma(S_n)] + \cdots + E[X_n|\sigma(S_n)] = n\,E[X_1|\sigma(S_n)],$$

and we conclude that $E[X_1|\sigma(S_n)] = \frac{S_n}{n}$.
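The identity $E[X_1|\sigma(S_n)] = S_n/n$ is easy to check by simulation: fix a value of $S_n$, collect the paths that end there, and average $X_1$ over them. The sketch is not from the original notes; $n$, the sample size and the checked values of $S_n$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_paths = 10, 2_000_000
X = rng.choice([-1, 1], size=(n_paths, n))
Sn = X.sum(axis=1)

for value in (-4, 0, 4):                    # attainable values of S_n for n = 10
    sel = Sn == value
    print(f"S_n = {value:+d}: E[X_1 | S_n] ~ {X[sel, 0].mean():+.3f}   vs   S_n/n = {value / n:+.3f}")
```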

Solution to Exercise 1.7.33:

(a) To find the value of $k$, we write

$$1 = \int_1^2\int_0^x kxy\, dy\, dx = k\,\frac{15}{8},$$

and so $k = \frac{8}{15}$.

(b)

$$f_X(x) = \int_{-\infty}^{\infty} f_{(X,Y)}(x, y)\, dy = \int_0^x\frac{8}{15}xy\, dy = \frac{4}{15}x^3, \quad \text{for } 1 \le x \le 2, \text{ and } f_X(x) = 0 \text{ otherwise.}$$

For $y < 0$ or $y > 2$, obviously $f_Y(y) = 0$. For $0 \le y \le 1$,

$$f_Y(y) = \int_1^2\frac{8}{15}xy\, dx = \frac{4}{5}y.$$

For $1 < y \le 2$, we have

$$f_Y(y) = \int_y^2\frac{8}{15}xy\, dx = \frac{4}{15}y(4 - y^2).$$

[Fig 16. Region where $f_{(X,Y)}(x, y) > 0$.]

(c) The conditional density $f_{Y|X}(y|X = \xi)$, for $1 \le \xi \le 2$, is given by

$$f_{Y|X}(y|X=\xi) = \frac{f_{(X,Y)}(\xi, y)}{f_X(\xi)} = \begin{cases} \dfrac{\frac{8}{15}\xi y}{f_X(\xi)}, & 0 \le y \le \xi, \\ 0, & \text{otherwise,} \end{cases} = \begin{cases} 2y\xi^{-2}, & 0 \le y \le \xi, \\ 0, & \text{otherwise.} \end{cases}$$

Therefore

$$E[Y|X = \xi] = \int_0^{\xi} y\cdot 2y\xi^{-2}\, dy = \frac{2}{3}\xi,$$

and consequently $E[Y|\sigma(X)] = \frac{2}{3}X$. By (CE3) and (CE5) we then have

$$E[U|\sigma(X)] = E[X + Y|\sigma(X)] = X + E[Y|\sigma(X)] = \frac{5}{3}X.$$

Solution to Exercise 1.7.34:

(a) The $\sigma$-algebra $\mathcal{F}^B_3$ "knows" the values of all $B_t$ for $0 \le t \le 3$, so that $B_3$, $-B_3$, $B_2 + B_1$, $B_2 + B_{1.5}$, and $\int_0^3 B_s\, ds$ are measurable there. The random variable $B_4 - B_2$ is not measurable with respect to $\mathcal{F}^B_3$, because we would have to peek into the future from $t = 3$ in order to know its value.

(b)

(2),(3) We know that martingales have constant expectation, i.e. the function $t \mapsto E[X_t]$ is constant for any martingale $X$. This criterion immediately rules out (2) and (3), because $E[B_t - t] = -t$ and $E[B_t^2] = t$, and these are not constant functions.

(1) The process $-B_t$ is a martingale because

$$E[-B_t|\mathcal{F}_s] = -E[B_t|\mathcal{F}_s] = -B_s,$$

and we know that $B_t$ is a martingale.

(4) To show that $B_t^2 - t$ is a martingale, we write $B_t^2 = \big(B_s + (B_t - B_s)\big)^2 = B_s^2 + (B_t - B_s)^2 + 2B_s(B_t - B_s)$. The random variable $(B_t - B_s)^2$ is independent of $\mathcal{F}^B_s$, so by (CE7) we have

$$E[(B_t - B_s)^2|\mathcal{F}^B_s] = E[(B_t - B_s)^2] = t - s. \quad \text{(one)}$$

Further, $B_s$ is measurable with respect to $\mathcal{F}^B_s$, so

$$E[2B_s(B_t - B_s)|\mathcal{F}^B_s] = 2B_sE[B_t - B_s|\mathcal{F}^B_s] = 2B_sE[B_t - B_s] = 0, \quad \text{(two)}$$

by (CE6) and (CE7). Finally, $B_s^2$ is $\mathcal{F}^B_s$-measurable, so

$$E[B_s^2|\mathcal{F}^B_s] = B_s^2. \quad \text{(three)}$$

Adding (one), (two) and (three) we get

$$E[B_t^2|\mathcal{F}^B_s] = (t-s) + B_s^2, \quad \text{and so} \quad E[B_t^2 - t|\mathcal{F}^B_s] = B_s^2 - s,$$

proving that $B_t^2 - t$ is a martingale.


(5) Finally, let us prove that $X_t = \exp(B_t - \frac{1}{2}t)$ is a martingale. Write

$$X_t = X_s\exp\big(-\tfrac{1}{2}(t-s)\big)\exp(B_t - B_s),$$

so that $X_s = \exp(B_s - \frac{1}{2}s)$ is $\mathcal{F}^B_s$-measurable, and $\exp(B_t - B_s)$ is independent of $\mathcal{F}^B_s$ (because the increment $B_t - B_s$ is). Therefore, using (CE6) and (CE7) we have

$$E[X_t|\mathcal{F}^B_s] = X_s\exp\big(-\tfrac{1}{2}(t-s)\big)E[\exp(B_t - B_s)|\mathcal{F}^B_s] = X_s\exp\big(-\tfrac{1}{2}(t-s)\big)E[\exp(B_t - B_s)],$$

and it will be enough to prove that $E[\exp(B_t - B_s)] = \exp(\frac{1}{2}(t-s))$. This follows from the fact that $B_t - B_s$ has a $N(0, t-s)$ distribution, and so (using Maple, for example)

$$E[\exp(B_t - B_s)] = \int_{-\infty}^{\infty}\exp(\xi)\,\frac{1}{\sqrt{2\pi(t-s)}}\exp\Big(-\frac{\xi^2}{2(t-s)}\Big)\, d\xi = \exp\big(\tfrac{1}{2}(t-s)\big).$$

Solution to Exercise 1.7.35:

(a) Using the properties of conditional expectation and the fact that $B_t - B_s$ (and therefore also $e^{-\lambda(B_t - B_s)}$) is independent of $\mathcal{F}^B_s$, we get

$$M_{B_t|\mathcal{F}^B_s}(\lambda) = E[\exp(-\lambda B_t)|\mathcal{F}^B_s] = \exp(-\lambda B_s)\,E[\exp(-\lambda(B_t - B_s))] = \exp\Big(-\lambda B_s + \frac{\lambda^2(t-s)}{2}\Big).$$

(b) Let us first calculate $Y = E[\exp(-\lambda_1B_{t_1} - \lambda_2B_{t_2})|\mathcal{F}^B_{t_3}]$. Using the properties of conditional expectation and Brownian motion we get

$$Y = \exp(-\lambda_1B_{t_1})\exp(-\lambda_2B_{t_3})\,E[\exp(-\lambda_2(B_{t_2} - B_{t_3}))],$$

and thus, by the tower property of conditional expectation, we have

$$M_{(B_{t_1},B_{t_2})|\sigma(B_{t_3})}(\lambda_1, \lambda_2) = E[Y|\sigma(B_{t_3})] = \exp(-\lambda_2B_{t_3})\,E[\exp(-\lambda_2(B_{t_2} - B_{t_3}))]\,E[\exp(-\lambda_1B_{t_1})|\sigma(B_{t_3})].$$

On the other hand,

$$M_{B_{t_1}|\sigma(B_{t_3})}(\lambda_1) = E[\exp(-\lambda_1B_{t_1})|\sigma(B_{t_3})],$$

and

$$M_{B_{t_2}|\sigma(B_{t_3})}(\lambda_2) = E[\exp(-\lambda_2(B_{t_2} - B_{t_3}) - \lambda_2B_{t_3})|\sigma(B_{t_3})] = \exp(-\lambda_2B_{t_3})\,E[\exp(-\lambda_2(B_{t_2} - B_{t_3}))],$$

and the claim of the problem follows.


Chapter 2  Stochastic Calculus


2.1 Stochastic Integration

2.1.1 What do we really want?

In the last section of Chapter 1, we have seen how discrete-time martingales are exactly those processes for which you cannot make money (on average) by betting on them. It is the task of this section to define what it means to bet (or trade) on a continuous process, and to study the properties of the resulting "wealth" processes. We start by reviewing what it means to bet on a discrete-time process. Suppose that a security price follows a stochastic process[1] $(X_n)_{n\in\mathbb{N}_0}$, and that the public information is given by the filtration $(\mathcal{F}_n)_{n\in\mathbb{N}_0}$. The investor (say Derek the Daisy, for a change) starts with $x$ dollars at time 0, and an inexhaustible credit line. After observing the value $X_0$, but before knowing $X_1$ (i.e. using the information $\mathcal{F}_0$ only), Derek decides how much to invest in the asset $X$ by buying $H_0$ shares. If $x$ dollars is not enough, Derek can always borrow additional funds from the bank at a zero interest rate.[2] In the morning of day 1, and after observing the price $X_1$, Derek decides to rebalance the portfolio. Since there are no transaction costs, we might as well imagine Derek first selling his $H_0$ shares of the asset $X$, after breakfast and at the price $X_1$ (making his wealth equal to $Y_1 = x + H_0(X_1 - X_0)$ after paying his debts), and then buying $H_1$ shares after lunch. Derek is no clairvoyant, so the random variable $H_1$ will have to be measurable with respect to the $\sigma$-algebra $\mathcal{F}_1$. On day 2 the new price $X_2$ is formed, Derek rebalances his portfolio, pays the debts, and so on. Therefore, Derek's wealth on day $n$ (after liquidation of his position) will be

$$Y^H_n = x + H_0(X_1 - X_0) + \cdots + H_{n-1}(X_n - X_{n-1}). \tag{2.1.1}$$

[1] For the sake of compliance with the existing standards, we will always start our discrete and continuous processes from $n = 0$.
[2] Later on we shall add non-zero interest rates into our models, but for now they would only obscure the analysis without adding anything interesting.

The main question we are facing is: how do we define the wealth process $Y^H$ when $X$ is a continuous process and the portfolio gets rebalanced at every instant? Let us start by rebalancing the portfolio not exactly at every instant, but every $\Delta t$ units of time, and then letting $\Delta t$ get smaller and smaller. We pick a time horizon $T$, and subdivide the interval $[0, T]$ into $n$ equal segments $0 = t_0 < t_1 < t_2 < \cdots < t_{n-1} < t_n = T$, where $t_k = k\Delta t$ and $\Delta t = T/n$. By (2.1.1) the total wealth after $T = n\Delta t$ units of time will be given by

$$Y^H_T = x + \sum_{k=1}^{n} H_{t_{k-1}}(X_{t_k} - X_{t_{k-1}}) = x + \sum_{k=1}^{n} H_{t_{k-1}}\,\Delta X_{t_k}, \tag{2.1.2}$$

where $\Delta X_{t_k} = X_{t_k} - X_{t_{k-1}}$. To make conceptual headway with the expression (2.1.2), let us consider a few simple (and completely unrealistic) cases of the structure of the process $X$. We will also assume that $X$ is a deterministic (non-random) process, since the randomness will not play any role in what follows.

Example 2.1.1. Let $X$ be non-random and let $X_t = t$. In that case

$$Y^H_T = x + \sum_{k=1}^{n} H_{t_{k-1}}(X_{t_k} - X_{t_{k-1}}) = x + \sum_{k=1}^{n} H_{t_{k-1}}\Delta t \to x + \int_0^T H_t\, dt, \tag{2.1.3}$$

as $n \to \infty$, since the Riemann sums (and that is exactly what we have in the expression for $Y^H_T$) converge towards the integral. Of course, we have to assume that $H$ is not too wild, so that the integral $\int_0^T H_t\, dt$ exists.

Example 2.1.2. Let $X$ be non-random again, suppose $X_t$ is differentiable, and let $\frac{dX_t}{dt} = \mu(t)$. Then the Fundamental Theorem of Calculus gives

$$Y^H_T = x + \sum_{k=1}^{n} H_{t_{k-1}}(X_{t_k} - X_{t_{k-1}}) = x + \sum_{k=1}^{n} H_{t_{k-1}}\int_{t_{k-1}}^{t_k}\mu(t)\, dt \approx x + \sum_{k=1}^{n} H_{t_{k-1}}\mu(t_{k-1})\Delta t \overset{(n\to\infty)}{\longrightarrow} x + \int_0^T H_t\,\mu(t)\, dt = x + \int_0^T H_t\frac{dX_t}{dt}\, dt. \tag{2.1.4}$$

From the past two examples it is tempting to define the wealth process $Y^H_t$ by $Y^H_t = \int_0^t H_u\frac{dX_u}{du}\, du$, but there is a big problem. Take $X_t = B_t$ (a Brownian motion) and remember that the paths $t \mapsto B_t(\omega)$ are not differentiable functions. The trajectories of the Brownian motion are too irregular (think of the simulations) to admit a derivative. A different approach is needed, and we will try to go through some of its ideas in the following subsection.

2.1.2 Stochastic Integration with respect to Brownian Motion

Let us take the price process $X$ to be a Brownian motion, $\mathcal{F}^B_t$ the filtration generated by it (no extra information here), and $H_t$ a process adapted to the filtration $\mathcal{F}^B_t$. If we trade every $\Delta t = T/n$ time units, the wealth equation should look something like this:

$$Y^{(n)}_T = x + \sum_{k=1}^{n} H_{t_{k-1}}(B_{t_k} - B_{t_{k-1}}),$$

and we would very much like to define

$$x + \int_0^T H_t\, dB_t = \text{``}\lim_{n\to\infty}\text{''}\ Y^{(n)}_T,$$

and call it the wealth at time $T$. The quotes around the limit point to the fact that we still do not know what kind of limit we are talking about; in fact, the exact description of it would require mathematics that is far outside the scope of these notes. We can, however, try to work out an example.


Example 2.1.3. Take[3] $H_t = B_t$, so that the expression in question becomes

$$\lim_{n\to\infty}\sum_{k=1}^{n} B_{t_{k-1}}(B_{t_k} - B_{t_{k-1}}).$$

We would like to rearrange this limit a little in order to make the analysis easier. We start from the identity

$$B_T^2 = B_T^2 - B_0^2 = \sum_{k=1}^{n}(B_{t_k}^2 - B_{t_{k-1}}^2) = \sum_{k=1}^{n}(B_{t_k} - B_{t_{k-1}})(B_{t_k} + B_{t_{k-1}}) = \sum_{k=1}^{n}(B_{t_k} - B_{t_{k-1}})(B_{t_k} - B_{t_{k-1}} + 2B_{t_{k-1}}) = \sum_{k=1}^{n}(B_{t_k} - B_{t_{k-1}})^2 + 2\sum_{k=1}^{n}B_{t_{k-1}}(B_{t_k} - B_{t_{k-1}}),$$

so that

$$\lim_{n\to\infty}\sum_{k=1}^{n}B_{t_{k-1}}(B_{t_k} - B_{t_{k-1}}) = \frac{1}{2}B_T^2 - \frac{1}{2}\lim_{n\to\infty}\sum_{k=1}^{n}(B_{t_k} - B_{t_{k-1}})^2.$$

We have reduced the problem to finding the limit of the sum $\sum_{k=1}^{n}(B_{t_k} - B_{t_{k-1}})^2$ as $n \to \infty$. Taking the expectation we get (using the fact that $B_{t_k} - B_{t_{k-1}}$ is a normal random variable with variance $t_k - t_{k-1}$)

$$E\Big[\sum_{k=1}^{n}(B_{t_k} - B_{t_{k-1}})^2\Big] = \sum_{k=1}^{n}E[(B_{t_k} - B_{t_{k-1}})^2] = \sum_{k=1}^{n}(t_k - t_{k-1}) = T.$$

This suggests that

$$\lim_{n\to\infty}\sum_{k=1}^{n}(B_{t_k} - B_{t_{k-1}})^2 = T,$$

a fact which can be proved rigorously, but we shall not attempt to do it here. Therefore

$$\lim_{n\to\infty}\sum_{k=1}^{n}B_{t_{k-1}}(B_{t_k} - B_{t_{k-1}}) = \frac{1}{2}(B_T^2 - T).$$

[3] The trading strategy where the number of shares of an asset ($H_t$) in your portfolio is equal to the price of the asset ($B_t$) is highly unrealistic, but it is the simplest case where we can illustrate the underlying ideas.
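Both the limit and the key identity behind it can be seen on a single simulated path. The sketch below is not part of the original notes; the horizon $T$, the number of subdivisions and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
T, n = 1.0, 100_000
dt = T / n
dB = rng.normal(0.0, np.sqrt(dt), n)             # increments B_{t_k} - B_{t_{k-1}}
B = np.concatenate(([0.0], np.cumsum(dB)))       # B_{t_0}, ..., B_{t_n}

left_point_sum = np.sum(B[:-1] * dB)             # sum of B_{t_{k-1}} (B_{t_k} - B_{t_{k-1}})
print(f"sum B_(k-1) dB    = {left_point_sum:.5f}")
print(f"(B_T^2 - T) / 2   = {(B[-1] ** 2 - T) / 2:.5f}")
print(f"sum of (dB)^2     = {np.sum(dB ** 2):.5f}   (should be close to T = {T})")
```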

The manipulations in this example can be modified so that they work in the general case, and the integral $\int_0^T H_u\, dB_u$ can be defined for a large class of (but not all) "portfolio" processes. Everything will be OK if $H$ is adapted to the filtration $\mathcal{F}^B_t$ and conforms to some other (and less important) regularity conditions. There is, however, more to observe in the example


above. Suppose for a second that $B_t$ did have a derivative $\frac{dB_t}{dt} = \dot{B}_t$. Suppose also that all the rules of classical calculus applied to stochastic integration, so that

$$\int_0^T B_t\, dB_t = \int_0^T B_t\dot{B}_t\, dt = \int_0^T\frac{d\big(\tfrac{1}{2}B_t^2\big)}{dt}\, dt = \frac{1}{2}B_T^2.$$

So what happened to $-\frac{1}{2}T$? The explanation lies in the fact that $\lim_{n\to\infty}\sum_{k=1}^{n}(B_{t_k} - B_{t_{k-1}})^2 = T$, a phenomenon which does not occur in classical calculus. In fact, for any differentiable function $f : [0,T] \to \mathbb{R}$ with continuous derivative we always have

$$\lim_{n\to\infty}\sum_{k=1}^{n}\big(f(t_k) - f(t_{k-1})\big)^2 = \lim_{n\to\infty}\sum_{k=1}^{n}\big(f'(t_k^*)\Delta t\big)^2 \le \lim_{n\to\infty}M^2\sum_{k=1}^{n}\frac{T^2}{n^2} = \lim_{n\to\infty}M^2\frac{T^2}{n} = 0,$$

where $M = \sup_{t\in[0,T]}|f'(t)|$, because the Mean Value Theorem implies that $f(t_k) - f(t_{k-1}) = f'(t_k^*)(t_k - t_{k-1}) = f'(t_k^*)\Delta t$ for some $t_k^* \in (t_{k-1}, t_k)$, and $M$ is finite by the continuity of $f'(t)$. In stochastic calculus the quantity

$$\lim_{n\to\infty}\sum_{k=1}^{n}(X_{t_k} - X_{t_{k-1}})^2$$

is called the quadratic variation of the process $X$ and is denoted by $\langle X, X\rangle_T$. If the paths of $X$ are differentiable, then the previous calculation shows that $\langle X, X\rangle_T = 0$, while for the Brownian motion $B$ we have $\langle B, B\rangle_T = T$ for each $T$.

2.1.3 The Ito Formula

Trying to convince the reader that it is possible to define the stochastic integral $\int_0^T H_u\, dB_u$ with respect to Brownian motion is one thing. Teaching him or her how to compute those integrals is quite a different task, especially in the light of the mysterious $-\frac{1}{2}T$ term that appears in the examples given above. We would like to provide the reader with a simple tool (formula?) that will do the trick. Let us start by examining what happens in classical calculus.

Example 2.1.4. Suppose that we are asked to compute the integral

$$\int_{-\pi/2}^{\pi/2}\cos(x)\, dx.$$

We could, of course, cut the interval $[-\pi/2, \pi/2]$ into $n$ pieces, form the Riemann sums and try to find the limit of those as $n \to \infty$, but very early in the calculus sequence we are told that there is a much faster way: the Newton-Leibniz formula (or the Fundamental Theorem of Calculus). The recipe is the following: to integrate the function $f(x)$ over the interval $[a, b]$,

• find a function $F$ such that $F'(x) = f(x)$ for all $x$;

• the result is $F(b) - F(a)$.

Our problem above can then be solved by noting that $\sin'(x) = \cos(x)$, and so $\int_{-\pi/2}^{\pi/2}\cos(x)\, dx = \sin(\pi/2) - \sin(-\pi/2) = 2$.

Can we do something like that for stochastic integration? The answer is yes, but the rules will be slightly different (remember $-\frac{1}{2}T$). Suppose now that we are given a function $f$, and we are interested in the integral $\int_0^T f(B_t)\, dB_t$. Will its value be $F(B_T) - F(B_0)$, for a function $F$ such that $F'(x) = f(x)$? Taking $f(x) = x$, and remembering Example 2.1.3, we realize that it cannot be the case. We need an extra term "?" to rectify the situation:

$$F(B_T) - F(B_0) = \int_0^T F'(B_t)\, dB_t\ +\ ?.$$

In the late 1940s, K. Ito showed that this extra term is given by $\frac{1}{2}\int_0^T F''(B_t)\, dt$, introducing the famous Ito formula:

$$F(B_T) = F(B_0) + \int_0^T F'(B_t)\, dB_t + \frac{1}{2}\int_0^T F''(B_t)\, dt,$$

for any function $F$ whose second derivative $F''$ is continuous. Heuristically, the derivation of the Ito formula goes as follows: use the second-order Taylor expansion for the function $F$,

$$F(x) \approx F(x_0) + F'(x_0)(x - x_0) + \frac{1}{2}F''(x_0)(x - x_0)^2.$$

Substitute $x = B_{t_k}$ and $x_0 = B_{t_{k-1}}$ to get

$$F(B_{t_k}) - F(B_{t_{k-1}}) \approx F'(B_{t_{k-1}})(B_{t_k} - B_{t_{k-1}}) + \frac{1}{2}F''(B_{t_{k-1}})(B_{t_k} - B_{t_{k-1}})^2.$$

Sum the above expression over $k = 1$ to $n$, so that the telescoping sum on the left-hand side becomes $F(B_T) - F(B_0)$:

$$F(B_T) - F(B_0) \approx \sum_{k=1}^{n}F'(B_{t_{k-1}})(B_{t_k} - B_{t_{k-1}}) + \frac{1}{2}\sum_{k=1}^{n}F''(B_{t_{k-1}})(B_{t_k} - B_{t_{k-1}})^2,$$

and upon letting $n \to \infty$ we get the Ito formula. Of course, this derivation is non-rigorous, and a number of criticisms can be made. The rigorous proof is, nevertheless, based on the ideas presented above. Before we move on, let us give a simple example illustrating the use of Ito's formula.

and upon letting n → ∞ we get the Ito-formula. Of course, this derivation is non-rigorousand a number of criticisms can be made. The rigorous proof is, nevertheless, based on theideas presented above. Before we move on, let us give a simple example illustrating the useof Ito’s formula.


Example 2.1.5. Let us compute $\int_0^T B_t^2\, dB_t$. Applying the Ito formula with $F(x) = \frac{1}{3}x^3$ we get

$$\frac{1}{3}B_T^3 = \frac{1}{3}B_T^3 - \frac{1}{3}B_0^3 = \int_0^T B_t^2\, dB_t + \frac{1}{2}\int_0^T 2B_t\, dt,$$

so that $\int_0^T B_t^2\, dB_t = \frac{1}{3}B_T^3 - \int_0^T B_t\, dt$.
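Discretizing one Brownian path lets us check this identity numerically: the left-point sums approximating $\int_0^T B_t^2\, dB_t$ should agree with $\frac{1}{3}B_T^3 - \int_0^T B_t\, dt$ up to discretization error. The sketch is not part of the original notes; the horizon and grid size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(8)
T, n = 1.0, 200_000
dt = T / n
dB = rng.normal(0.0, np.sqrt(dt), n)
B = np.concatenate(([0.0], np.cumsum(dB)))

ito_integral = np.sum(B[:-1] ** 2 * dB)          # left-point sums defining int B_t^2 dB_t
time_integral = np.sum(B[:-1] * dt)              # ordinary integral int B_t dt
print(f"sum B^2 dB          = {ito_integral:.5f}")
print(f"B_T^3/3 - int B dt  = {B[-1] ** 3 / 3 - time_integral:.5f}")
```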

Having introduced the Ito formula, it will not be hard to extend it to deal with integrals of the form $\int_0^t f(s, B_s)\, dB_s$. We follow an approximation procedure analogous to the one above:

$$F(t+\Delta t, B_t+\Delta B_t) - F(t, B_t) \approx \Delta t\,\frac{\partial}{\partial t}F(t, B_t) + \Delta B_t\,\frac{\partial}{\partial x}F(t, B_t) + \frac{1}{2}\Big((\Delta t)^2\frac{\partial^2}{\partial t^2}F(t, B_t) + 2\,\Delta t\,\Delta B_t\,\frac{\partial^2}{\partial x\,\partial t}F(t, B_t) + (\Delta B_t)^2\frac{\partial^2}{\partial x^2}F(t, B_t)\Big), \tag{2.1.5}$$

only using the multidimensional version of the Taylor formula, truncated after the second-order terms. By comparison with the terms $(\Delta B_t)^2 = (B_{t_k} - B_{t_{k-1}})^2$ appearing in the quadratic variation, we conclude that $(\Delta t)^2$ and $\Delta t\,\Delta B_t$ are of smaller order, and can thus be safely excluded. Telescoping now gives the so-called inhomogeneous form of the Ito formula:

$$F(t, B_t) = F(0, B_0) + \int_0^t\Big(\frac{\partial}{\partial t}F(s, B_s) + \frac{1}{2}\frac{\partial^2}{\partial x^2}F(s, B_s)\Big)\, ds + \int_0^t\frac{\partial}{\partial x}F(s, B_s)\, dB_s,$$

for a function $F$ of two arguments $t$ and $x$, continuously differentiable in the first and twice continuously differentiable in the second.

2.1.4 Some properties of stochastic integrals

The Martingale Property: We have seen at the very end of Chapter 1 that betting on martingales produces no expected gains (or losses), and we even characterized discrete-time martingales as the class of processes possessing that property. It does not require an enormous leap of imagination to transfer the same property to the continuous-time case. Before we formally state this property, it is worth emphasizing that the true strength of stochastic analysis comes from viewing stochastic integrals as processes indexed by the upper limit of integration. From now on, when we talk about the stochastic integral of the process $H$ with respect to the Brownian motion $B$, we will have the process $X_t = \int_0^t H_s\, dB_s$ in mind (and then $X_0 = 0$). The first property of the process $X_t$ is given below:

(SI1) For a process $H$, adapted with respect to the filtration $\mathcal{F}^B_t$ (generated by the Brownian motion $B$), the stochastic integral process $X_t = \int_0^t H_s\, dB_s$ is a martingale, and its paths are continuous functions.


We have to remark right away that the statement above is plain wrong unless we require some regularity of the process $H$. If it is too wild, various things can go wrong. Fortunately, we will have no real problems with such phenomena, so we will tacitly assume that all integrands we mention behave well. For example, $X$ is a martingale if the function $h(s) = E[H_s^2]$ satisfies $\int_0^t h(s)\, ds < \infty$ for all $t > 0$. As for the continuity of the paths, let us just remark that this property is quite deep and follows indirectly from the continuity of the paths of the Brownian motion. Here is an example that illustrates the coupling between (SI1) and the Ito formula.

Example 2.1.6. By the Ito formula applied to the function $F(x) = x^2$ we get

$$B_t^2 = \int_0^t 2B_s\, dB_s + \frac{1}{2}\int_0^t 2\, ds = \int_0^t 2B_s\, dB_s + t.$$

The property (SI1) implies that $B_t^2 - t = \int_0^t 2B_s\, dB_s$ is a martingale, a fact that we have derived before using the properties (CE1)-(CE7) of conditional expectation.

The Ito Isometry: Having seen that stochastic integrals are martingales, we immediately realize that $E[X_t] = E[X_0] = 0$, so we know how to calculate expectations of stochastic integrals. The analogous question can be posed about the variance $\mathrm{Var}[X_t] = E[X_t^2] - E[X_t]^2 = E[X_t^2]$ (because $X_0 = 0$). The answer comes from the following important property of the stochastic integral, sometimes known as the Ito isometry.

(SI2) For a process $H$, adapted with respect to the filtration $\mathcal{F}^B_t$ (generated by the Brownian motion $B$), such that $\int_0^t E[H_s^2]\, ds < \infty$, we have

$$E\Big[\Big(\int_0^t H_s\, dB_s\Big)^2\Big] = \int_0^t E[H_s^2]\, ds.$$

Let us try to understand this property by deriving it for the discrete-time analogue of the stochastic integral, $\sum_{k=1}^{n}H_{t_{k-1}}(B_{t_k} - B_{t_{k-1}})$. The continuous-time version (SI2) will follow by taking the appropriate limit. We start by exhibiting two simple facts we shall use later. The proofs follow from the properties (CE1)-(CE7) of conditional expectation.

• For $0 < k < l \le n$, we have

$$E[H_{t_{k-1}}(B_{t_k} - B_{t_{k-1}})\,H_{t_{l-1}}(B_{t_l} - B_{t_{l-1}})] = E\big[E[H_{t_{k-1}}(B_{t_k} - B_{t_{k-1}})H_{t_{l-1}}(B_{t_l} - B_{t_{l-1}})|\mathcal{F}^B_{t_{l-1}}]\big]$$
$$= E\big[H_{t_{k-1}}(B_{t_k} - B_{t_{k-1}})H_{t_{l-1}}\,E[B_{t_l} - B_{t_{l-1}}|\mathcal{F}^B_{t_{l-1}}]\big] = E[H_{t_{k-1}}(B_{t_k} - B_{t_{k-1}})H_{t_{l-1}}\cdot 0] = 0.$$


• For $0 < k \le n$ we have

$$E[H_{t_{k-1}}^2(B_{t_k} - B_{t_{k-1}})^2] = E\big[E[H_{t_{k-1}}^2(B_{t_k} - B_{t_{k-1}})^2|\mathcal{F}^B_{t_{k-1}}]\big] = E\big[H_{t_{k-1}}^2\,E[(B_{t_k} - B_{t_{k-1}})^2|\mathcal{F}^B_{t_{k-1}}]\big] = E[H_{t_{k-1}}^2](t_k - t_{k-1}).$$

By expanding the square and using the properties above we get

$$E\Big[\Big(\sum_{k=1}^{n}H_{t_{k-1}}(B_{t_k} - B_{t_{k-1}})\Big)^2\Big] = \sum_{0<k,l\le n}E[H_{t_{k-1}}(B_{t_k} - B_{t_{k-1}})H_{t_{l-1}}(B_{t_l} - B_{t_{l-1}})]$$
$$= \sum_{k=1}^{n}E[H_{t_{k-1}}^2(B_{t_k} - B_{t_{k-1}})^2] + 2\sum_{0<k<l\le n}E[H_{t_{k-1}}(B_{t_k} - B_{t_{k-1}})H_{t_{l-1}}(B_{t_l} - B_{t_{l-1}})]$$
$$= \sum_{k=1}^{n}E[H_{t_{k-1}}^2](t_k - t_{k-1}) \overset{(n\to\infty)}{\longrightarrow} \int_0^t E[H_s^2]\, ds.$$

The Ito isometry has an immediate and useful corollary, the proof of which is delightfully simple. Let $H_s$ and $K_s$ be two processes adapted to the filtration $\mathcal{F}^B_t$ generated by the Brownian motion. Then

$$E\Big[\Big(\int_0^t H_s\, dB_s\Big)\cdot\Big(\int_0^t K_s\, dB_s\Big)\Big] = \int_0^t E[H_sK_s]\, ds.$$

To prove this, let $L_s = H_s + K_s$ and apply the Ito isometry to the process $X_t = \int_0^t L_s\, dB_s$, as well as to the processes $\int_0^t H_s\, dB_s$ and $\int_0^t K_s\, dB_s$, to obtain

$$2E\Big[\int_0^t H_s\, dB_s\cdot\int_0^t K_s\, dB_s\Big] = E\Big[\Big(\int_0^t(K_s + H_s)\, dB_s\Big)^2\Big] - E\Big[\Big(\int_0^t K_s\, dB_s\Big)^2\Big] - E\Big[\Big(\int_0^t H_s\, dB_s\Big)^2\Big]$$
$$= \int_0^t E[(K_s + H_s)^2]\, ds - \int_0^t E[K_s^2]\, ds - \int_0^t E[H_s^2]\, ds = 2\int_0^t E[K_sH_s]\, ds.$$

Example 2.1.7. The purpose of this example is to calculate the moments (expectations ofkth powers) of the normal distribution using stochastic calculus. Remember that we did thisby classical integration in Section 1 of Chapter 1. Define Xt =

∫ t0 B

ks dBs, for some k ∈ N.

By Ito’s formula applied to function F (x) = xk+1

k+1 we have

Xt =1

k + 1Bk+1

t − 12

∫ t

0kBk−1

s ds.

Last Updated: Spring 2005 113 Continuous-Time Finance: Lecture Notes

Page 114: Continuous-Time Finance: Lecture Notes

2.1. STOCHASTIC INTEGRATION CHAPTER 2. STOCHASTIC CALCULUS

We know that E[Xt] = 0 by (SI1), so putting t = 1,

E[Bk+11 ] =

k(k + 1)2

E[∫ 1

0Bk−1

s ds]. (2.1.6)

Define M(k) = E[Bk1 ] - which is nothing but the expectation of the unit normal raised to the

kth power, and note that sk/2M(k) = E[Bks ], by the normality of Bs, and the fact that its

variance is s.Fubini Theorem4 applied to equation (2.1.6) gives

M(k + 1) =k(k + 1)

2

∫ 1

0E[Bk−1

s ] ds =k(k + 1)

2

∫ 1

0s(k−1)/2M(k − 1) ds = kM(k − 1).

Trivially, M(0) = 1, and M(1) = 0, so we get M(2k − 1) = 0, for k ∈ N and M(2k) =(2k − 1)M(2k − 2) = (2k − 1) · (2k − 3)M(2k − 4) = · · · = (2k − 1) · (2k − 3) . . . 3 · 1.
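The recursion $M(k+1) = kM(k-1)$ is trivial to implement and compare with a Monte Carlo estimate of the raw moments of a standard normal. The sketch is not part of the original notes; the sample size and the range of $k$ are arbitrary.

```python
import numpy as np

def moment(k):
    # M(0) = 1, M(1) = 0, and M(k) = (k - 1) * M(k - 2) from the recursion above
    if k == 0:
        return 1.0
    if k == 1:
        return 0.0
    return (k - 1) * moment(k - 2)

rng = np.random.default_rng(9)
Z = rng.normal(size=2_000_000)
for k in range(2, 9):
    print(f"M({k}) = {moment(k):6.1f}   Monte Carlo ~ {np.mean(Z ** k):8.2f}")
```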

Example 2.1.8. In Chapter 1 we used approximation by random walks to calculate $E[(\int_0^1 B_s\, ds)^2]$. In this example we will do it using stochastic calculus. Let us start with the function $F(t, B_t) = tB_t$ and apply the inhomogeneous Ito formula to it:

$$tB_t = 0 + \int_0^t B_s\, ds + \int_0^t s\, dB_s,$$

and therefore

$$E\Big[\Big(\int_0^t B_s\, ds\Big)^2\Big] = E\Big[\Big(tB_t - \int_0^t s\, dB_s\Big)^2\Big] = t^2E[B_t^2] + E\Big[\Big(\int_0^t s\, dB_s\Big)^2\Big] - 2t\,E\Big[B_t\int_0^t s\, dB_s\Big]. \tag{2.1.7}$$

The first term $t^2E[B_t^2]$ equals $t^3$ by the normality of $B_t$. Applying the Ito isometry to the second term we get

$$E\Big[\Big(\int_0^t s\, dB_s\Big)^2\Big] = \int_0^t E[s^2]\, ds = \int_0^t s^2\, ds = \frac{t^3}{3}.$$

Finally, we apply the corollary of the Ito isometry to the third term in (2.1.7). The trick here is to write $B_t = \int_0^t 1\, dB_s$:

$$2t\,E\Big[B_t\int_0^t s\, dB_s\Big] = 2t\,E\Big[\int_0^t 1\, dB_s\cdot\int_0^t s\, dB_s\Big] = 2t\int_0^t 1\cdot s\, ds = 2t\,\frac{t^2}{2} = t^3.$$

Putting it all together we get

$$E\Big[\Big(\int_0^t B_s\, ds\Big)^2\Big] = \frac{t^3}{3}.$$
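A direct Monte Carlo estimate of $E[(\int_0^t B_s\, ds)^2]$ agrees with $t^3/3$. The sketch below is not part of the original notes; the time integral is approximated by a simple Riemann sum and all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(10)
t, n, n_paths = 1.0, 1_000, 20_000
dt = t / n
dB = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
B = np.cumsum(dB, axis=1)                         # one Brownian path per row

time_integrals = B.sum(axis=1) * dt               # Riemann-sum approximation of int_0^t B_s ds
print(f"E[(int B ds)^2] ~ {np.mean(time_integrals ** 2):.4f}   vs   t^3/3 = {t ** 3 / 3:.4f}")
```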

2.1.5 Ito processes

In this subsection we will try to view the Ito formula in a slightly different light. Just like the Fundamental Theorem of Calculus allows us to write any differentiable function $F(t)$ as an indefinite integral (plus a constant)

$$F(t) = F(0) + \int_0^t F'(s)\, ds$$

of the function $F'$ (which happens to be the derivative of $F$), the Ito formula states that if we apply a function $F$ to the Brownian motion $B_t$, the result will be expressible as the sum of a constant and a pair of integrals, one classical (with respect to $ds$) and the other stochastic (with respect to $dB_s$):

$$F(B_t) = F(B_0) + \frac{1}{2}\int_0^t F''(B_s)\, ds + \int_0^t F'(B_s)\, dB_s,$$

or, in the inhomogeneous case, when $F(\cdot,\cdot)$ is a function of two arguments $t$ and $x$,

$$F(t, B_t) = F(0, B_0) + \int_0^t\Big(\frac{\partial}{\partial t}F(s, B_s) + \frac{1}{2}\frac{\partial^2}{\partial x^2}F(s, B_s)\Big)\, ds + \int_0^t\frac{\partial}{\partial x}F(s, B_s)\, dB_s.$$

The processes $X_t$ which can be written as

$$X_t = X_0 + \int_0^t\mu_s\, ds + \int_0^t\sigma_s\, dB_s, \tag{2.1.8}$$

for some adapted processes $\mu_s$ and $\sigma_s$, are called Ito processes and constitute a class of stochastic processes structured enough so that their properties can be studied using stochastic calculus, yet large enough to cover most of the financial applications. The process $\mu_s$ is called the drift process and the process $\sigma_s$ the volatility process[5] of the Ito process $X_t$. The two parts $\int_0^t\mu_s\, ds$ and $\int_0^t\sigma_s\, dB_s$ can be thought of as signal and noise, respectively. The signal component $\int_0^t\mu_s\, ds$ describes some global trend around which the process $X_s$ fluctuates, and the volatility can be interpreted as the intensity of that fluctuation. The Ito formula states that any process $X_t$ obtained by applying a function $F(t, x)$ to a Brownian motion is an Ito process with

$$\mu_s = \frac{\partial}{\partial t}F(s, B_s) + \frac{1}{2}\frac{\partial^2}{\partial x^2}F(s, B_s) \quad \text{and} \quad \sigma_s = \frac{\partial}{\partial x}F(s, B_s).$$

[5] An important subclass of the class of Ito processes are the ones where $\mu_s$ and $\sigma_s$ are functions of the time $s$ and the value of the process $X_s$ only. Such processes are called diffusions.

The question whether an Ito process $X_t$ is a martingale (sub-, super-) is very easy to answer:


• Xt is a martingale if µt = 0, for all t,

• Xt is a supermartingale if µt ≤ 0 for all t, and

• Xt is a submartingale if µt ≥ 0 for all t.

Of course, if $\mu_t$ is sometimes positive and sometimes negative, $X_t$ is neither a supermartingale nor a submartingale.

There is a useful, if not completely correct, notation for Ito processes, sometimes called the differential notation. For an Ito process given by (2.1.8), we write

$$dX_t = \mu_t\, dt + \sigma_t\, dB_t, \quad X_0 = x_0. \tag{2.1.9}$$

The initial-value specification ($X_0 = x_0$) is usually omitted, because it is often not important where the process starts. For a real function $F$, the representations $F(t) = F(0) + \int_0^t F'(s)\, ds$ and $dF(t) = F'(t)\, dt$ are not only formally equivalent, because $\frac{dF(t)}{dt}$ equals $F'(t)$ in the very precise way described in the Fundamental Theorem of Calculus. On the other hand, it is not correct to say that

$$\frac{dX_t}{dt} = \mu_t + \sigma_t\frac{dB_t}{dt},$$

since the paths of the Brownian motion are nowhere differentiable, and the expression $\frac{dB_t}{dt}$ has no meaning. Therefore, any time you see (2.1.9), it really means (2.1.8). The differential notation has a number of desirable properties. First, it suggests that Ito processes are built of increments (a view very popular in finance) and that the increment $\Delta X_t = X_{t+\Delta t} - X_t$ can be approximated by the sum $\mu_t\Delta t + \sigma_t\Delta B_t$, making it thus "normally distributed" with mean $\mu_t\Delta t$ and variance $\sigma_t^2\Delta t$. These approximations have to be understood in the conditional sense, since $\mu_t$ and $\sigma_t$ are random variables measurable with respect to $\mathcal{F}_t$. Another useful feature of the differential notation is the ease with which the Ito formula can be applied to Ito processes. One only has to keep in mind the following simple multiplication table:

  ×      dBt     dt
  dBt    dt      0
  dt     0       0

Now, the Ito formula for Ito processes can be stated as

$$dF(t, X_t) = \frac{\partial}{\partial x}F(t, X_t)\, dX_t + \frac{\partial}{\partial t}F(t, X_t)\, dt + \frac{1}{2}\frac{\partial^2}{\partial x^2}F(t, X_t)\,(dX_t)^2, \tag{2.1.10}$$

where, using the multiplication table,

$$(dX_t)^2 = (\mu_t\, dt + \sigma_t\, dB_t)(\mu_t\, dt + \sigma_t\, dB_t) = \mu_t^2(dt)^2 + 2\mu_t\sigma_t\, dt\, dB_t + \sigma_t^2(dB_t)^2 = \sigma_t^2\, dt.$$


Therefore,

dF(t, X_t) = \Big( \frac{\partial}{\partial x} F(t, X_t) \mu_t + \frac{\partial}{\partial t} F(t, X_t) + \frac{1}{2} \sigma_t^2 \frac{\partial^2}{\partial x^2} F(t, X_t) \Big) \, dt + \frac{\partial}{\partial x} F(t, X_t) \sigma_t \, dB_t. \qquad (2.1.11)

Of course, the formula (2.1.10) is much easier to remember and more intuitive than (2.1.11). Formula (2.1.11), however, reveals the fact that F(t, X_t) is an Ito process if X_t is. We have already mentioned a special case of this result when X_t = B_t. A question that can be asked is whether a product of two Ito processes is an Ito process again and, if it is, what are its drift and volatility? To answer the question, let X_t and Y_t be two Ito processes

dX_t = \mu_t^X \, dt + \sigma_t^X \, dB_t, \qquad dY_t = \mu_t^Y \, dt + \sigma_t^Y \, dB_t.

In a manner entirely analogous to the one in which we heuristically derived the Ito formula, by expanding à la Taylor and using telescoping sums, we can get the following formal expression (in differential notation)

d(X_t Y_t) = X_t \, dY_t + Y_t \, dX_t + dX_t \, dY_t.

This formula looks very much like the classical integration-by-parts formula, except for the extra term dX_t \, dY_t. To figure out what this term should be, we resort to the differential notation:

dX_t \, dY_t = (\mu_t^X \, dt + \sigma_t^X \, dB_t)(\mu_t^Y \, dt + \sigma_t^Y \, dB_t) = \sigma_t^X \sigma_t^Y \, dt,

so that we can write

d(X_t Y_t) = (X_t \mu_t^Y + Y_t \mu_t^X + \sigma_t^X \sigma_t^Y) \, dt + (X_t \sigma_t^Y + Y_t \sigma_t^X) \, dB_t,

showing that the product X_t Y_t is an Ito process and exhibiting its drift and volatility. Another strength of the differential notation (and the last one in this list) is that it formally reduces stochastic integration with respect to an Ito process to stochastic integration with respect to a Brownian motion. For example, if we wish to define the integral

\int_0^t H_s \, dX_s, \quad \text{with respect to the process } dX_t = \mu_t \, dt + \sigma_t \, dB_t,

we formally substitute the expression for dX_t into the integral, obtaining

\int_0^t H_s \, dX_s = \int_0^t H_s \mu_s \, ds + \int_0^t H_s \sigma_s \, dB_s.

A closer look at this formula reveals another stability property of Ito processes: they are closed under stochastic integration. In other words, if X_t is an Ito process and H_s is an adapted process, then the process Y_t = \int_0^t H_s \, dX_s is an Ito process.
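The formal substitution above also suggests how such integrals can be approximated numerically: partition [0, t] and sum the integrand at the left endpoints against the increments of the integrator. The following C sketch is only an illustration, not part of the notes; it approximates \int_0^1 B_s \, dB_s on a grid and compares it with (B_1^2 - 1)/2, the value the Ito formula gives for F(x) = x^2/2. The rand()-based normal generator is used only to keep the sketch self-contained - the Mersenne Twister of Appendix A.3 would be a better source of uniform numbers.

/* Sketch (not from the notes): left-endpoint Riemann sums for a stochastic
   integral, here int_0^1 B_s dB_s, compared with (B_1^2 - 1)/2, the value
   obtained from the Ito formula with F(x) = x^2/2.                          */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define PI 3.14159265358979323846

static double randn(void) {                       /* standard normal (Box-Muller) */
    double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
}

int main(void) {
    int n = 100000;                               /* number of grid points on [0,1] */
    double dt = 1.0 / n, B = 0.0, integral = 0.0;
    srand(4357);
    for (int i = 0; i < n; i++) {
        double dB = sqrt(dt) * randn();           /* Brownian increment             */
        integral += B * dB;                       /* H_{t_i} (X_{t_{i+1}} - X_{t_i}) */
        B += dB;
    }
    printf("Riemann-sum approximation:      %f\n", integral);
    printf("Ito-formula value (B_1^2-1)/2:  %f\n", (B * B - 1.0) / 2.0);
    return 0;
}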


Examples. After all the theory, we give a number of examples. Almost anything you can think of is an Ito process.

Example 2.1.9. The fundamental example of an Ito process is the Brownian motion B_t. Its drift and volatility are \mu_t = 0, \sigma_t = 1, as can be seen from the representation

B_t = 0 + B_t = \int_0^t 0 \, ds + \int_0^t 1 \, dB_s.

An adapted process whose paths are differentiable functions is an Ito process:

X_t = \int_0^t \mu_s \, ds, \quad \text{where } \mu_t = \frac{dX_t}{dt}.

In particular, the deterministic process X_t = t is an Ito process with \mu_t = 1, \sigma_t = 0. Further, the Brownian motion with drift X_t = B_t + \mu t is an Ito process with \mu_t = \mu and \sigma_t = 1. Processes with jumps, or processes that look into the future, are not Ito processes.

Fig 17. Paul Samuelson

Example 2.1.10. [Samuelson's model of stock prices] Paul Samuelson proposed the Ito process

S_t = s_0 \exp\big( \sigma B_t + (\mu - \tfrac{1}{2}\sigma^2) t \big), \quad S_0 = s_0,

as the model for the time-evolution of the price of a common stock. This process is sometimes referred to as the geometric Brownian motion with drift. We shall talk more about this model in the context of derivative pricing in the following sections. Let us now just describe its structure by finding its drift and volatility processes. The Ito formula yields the following expression for the process S in the differential notation:

dS_t = s_0 \exp\big( \sigma B_t + (\mu - \tfrac{1}{2}\sigma^2) t \big) \big( \mu - \tfrac{1}{2}\sigma^2 + \tfrac{1}{2}\sigma^2 \big) \, dt + s_0 \exp\big( \sigma B_t + (\mu - \tfrac{1}{2}\sigma^2) t \big) \sigma \, dB_t
     = S_t \mu \, dt + S_t \sigma \, dB_t.
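Since S_t is an explicit function of B_t, its paths can be simulated exactly on a time grid by sampling independent normal increments of the Brownian motion. A minimal C sketch (not from the notes; the parameter values are illustrative assumptions) producing one path, in the spirit of Fig 19:

/* Sketch (illustrative parameters): exact simulation of a geometric Brownian
   motion path S_t = s0 * exp(sigma*B_t + (mu - sigma^2/2)*t) on a time grid.  */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define PI 3.14159265358979323846

static double randn(void) {                      /* standard normal (Box-Muller) */
    double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
}

int main(void) {
    double s0 = 100.0, mu = 0.10, sigma = 0.25;  /* assumed model parameters     */
    double T = 1.0; int n = 250;                 /* horizon and number of steps  */
    double dt = T / n, B = 0.0;
    srand(4357);
    for (int i = 1; i <= n; i++) {
        B += sqrt(dt) * randn();                 /* Brownian increment           */
        double t = i * dt;
        double S = s0 * exp(sigma * B + (mu - 0.5 * sigma * sigma) * t);
        printf("%f %f\n", t, S);                 /* one point of the path        */
    }
    return 0;
}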

Apart from being important in finance, this example was a good introduction to the concept of a stochastic differential equation. We say that an Ito process X_t is a solution to the stochastic differential equation (SDE)

dX_t = F(t, X_t) \, dt + G(t, X_t) \, dB_t, \quad X_0 = x_0,

if, of course,

X_t = x_0 + \int_0^t F(s, X_s) \, ds + \int_0^t G(s, X_s) \, dB_s, \quad \text{for all } t \ge 0.


The process S_t in Samuelson's stock-price model is thus a solution to the stochastic differential equation

dS_t = S_t \mu \, dt + S_t \sigma \, dB_t, \quad S_0 = s_0, \qquad (2.1.12)

with F(t, x) = \mu x and G(t, x) = \sigma x. This particular SDE has a closed-form solution in terms of t and B_t,

S_t = S_0 \exp\big( \sigma B_t + t(\mu - \tfrac{1}{2}\sigma^2) \big),

as we have checked above using the Ito formula. It is important to notice that the concept of an SDE is a generalization of the concept of an ordinary differential equation (ODE). Just set G(t, x) = 0, and you will end up with the familiar expression

dX_t = F(t, X_t) \, dt, \quad \text{i.e.} \quad \frac{dX_t}{dt} = F(t, X_t),

for a generic ODE. With this in mind, an SDE can be thought of as a perturbation, or a noisy version, of an ODE. Let us try to apply these ideas to the simple case of Samuelson's model. The noise-free version of the SDE (2.1.12) will be (let us call the solution process \bar{S}):

d\bar{S}_t = \mu \bar{S}_t \, dt, \quad \text{i.e.} \quad \frac{d\bar{S}_t}{dt} = \mu \bar{S}_t. \qquad (2.1.13)

This is an example of a linear homogeneous ODE with constant coefficients, and its solution describes exponential growth:

\bar{S}_t = s_0 \exp(\mu t).

If we view \mu as an interest rate, \bar{S} models the amount of money we will have in the bank after t units of time, starting from s_0 and with continuously compounded interest at the constant rate \mu. The full SDE (2.1.12) models stock-price evolution, so that we might say that stock prices, according to Samuelson, are bank accounts + noise.

Fig 18. Historical stock-price of a stock traded on NYSE
Fig 19. Simulated path of a geometric Brownian motion (Samuelson's model)


Of course, there is much more than that to finance, and Paul Samuelson received a Nobel Prize in economics in 1970 for a number of contributions, including his work on financial markets. In terms of the "drift and volatility" interpretation of Ito processes, we may put forward the following non-rigorous formula

\frac{\Delta S_{t_i}}{S_{t_i}} = \frac{S_{t_{i+1}} - S_{t_i}}{S_{t_i}} \approx \mu \, \Delta t + \sigma \, \Delta B_{t_i},

stating that in a small time-interval \Delta t, the change in the stock price, relative to the stock price at the beginning of the interval, will move deterministically by \mu \Delta t, coupled with the noise movement \sigma \Delta B_{t_i}. In particular, the Samuelson model predicts the relative stock returns over small intervals to be approximately normally distributed (over longer intervals it is the log-returns \log(S_{t+\Delta t}/S_t) that are exactly normal). We will have very little to say about the rich theory of stochastic differential equations in this course, mainly due to the lack of time and tools, as well as the fact that we shall work almost exclusively with Samuelson's model (2.1.12), whose solution can be obtained explicitly. However, here are some examples of famous SDE's and their solutions in closed form (whenever they exist).

Example 2.1.11 (Ornstein-Uhlenbeck process). The Ornstein-Uhlenbeck process (often abbreviated to OU-process) is an Ito process with a long history and a number of applications. The SDE satisfied by the OU-process is given by

dX_t = \alpha (m - X_t) \, dt + \sigma \, dB_t, \quad X_0 = x_0,

for some constants m, \alpha, \sigma \in \mathbb{R}, \alpha, \sigma > 0. Intuitively, the OU-process models the motion of a particle which drifts toward the level m, perturbed by Brownian noise of intensity \sigma. Notice how the drift coefficient is negative when X_t > m and positive when X_t < m. Because of that feature, the OU-process is sometimes also called a mean-reverting process, and a variant of it is widely used as a model of interest-rate dynamics. Another of its nice properties is that the OU-process is a Gaussian process, making it amenable to empirical and theoretical study.

Fig 20. Ornstein-Uhlenbeck process (solid) and its noise-free (dashed) version, with parameters m = 1, \alpha = 8, \sigma = 1, started at X_0 = 2.

In the remainder of this example, we will show how this SDE can be solved using methods similar to the ones you might have seen in your ODE class. The solution we get will not be completely explicit, since it will contain a stochastic integral, but as SDE's go, even that is


usually more than we can hope for. First of all, let us simplify the notation (without any real loss of generality) by setting m = 0, \sigma = 1. Let us start by defining the process Y_t = R_t X_t, with R_t = \exp(\alpha t), inspired by the technique one would use to solve the corresponding noiseless equation

dX_t = -\alpha X_t \, dt.

The process R_t is a deterministic Ito process with dR_t = \alpha R_t \, dt, so the formula for the product (the generalized integration-by-parts formula) gives

dY_t = R_t \, dX_t + X_t \, dR_t + dX_t \, dR_t = -\alpha X_t R_t \, dt + R_t \, dB_t + X_t \alpha R_t \, dt = R_t \, dB_t = e^{\alpha t} \, dB_t.

We have obtained that dR_t \, dX_t = 0 by using the multiplication table, since there is no dB_t term in the expression for R_t. It follows now that

X_t e^{\alpha t} - x_0 = Y_t - Y_0 = \int_0^t e^{\alpha s} \, dB_s,

so that

X_t = x_0 e^{-\alpha t} + e^{-\alpha t} \int_0^t e^{\alpha s} \, dB_s.

This expression admits no further significant simplifications.
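Even though the stochastic integral above cannot be simplified, paths of the OU-process are easy to generate directly from its SDE via the Euler approximation X_{t+\Delta t} \approx X_t + \alpha(m - X_t)\Delta t + \sigma \Delta B_t. A C sketch (illustrative only, not part of the notes; it borrows the parameters of Fig 20 as assumptions):

/* Sketch: Euler discretization of the OU-process dX = alpha*(m - X) dt + sigma dB,
   using the illustrative parameters of Fig 20 (m = 1, alpha = 8, sigma = 1, X_0 = 2). */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define PI 3.14159265358979323846

static double randn(void) {                      /* standard normal (Box-Muller) */
    double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
}

int main(void) {
    double m = 1.0, alpha = 8.0, sigma = 1.0, X = 2.0;
    double T = 1.0; int n = 1000;
    double dt = T / n;
    srand(4357);
    for (int i = 1; i <= n; i++) {
        /* one Euler step of dX = alpha*(m - X) dt + sigma dB */
        X += alpha * (m - X) * dt + sigma * sqrt(dt) * randn();
        printf("%f %f\n", i * dt, X);
    }
    return 0;
}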


2.2 Applications to Finance

2.2.1 Modelling Trading and Wealth

In the Meadow Stock Exchange there are only 2 financial instruments: a riskless money-market account R_t and the risky GII (Grasshopper's Industries Incorporated) common stock S_t. As the only person even remotely interested in investments, Derek the Daisy is allowed to assume any position (long or short) in the stock and in the money-market account (borrowing and lending interest rates are equal). The only requirement is the one of self-financing, i.e. starting with $x, Derek is not allowed to inject or withdraw any funds from the two accounts. He can, however, sell stocks and invest the money in the money-market account, or vice versa. Suppose Derek chooses the following strategy: at time t his holdings will be \pi_t shares^6 of stock and $b_t in the money-market account. Of course, \pi_t and b_t will not be deterministic in general. The exposure to the stock may depend on Derek's current wealth at time t, the stock price, or any number of other factors. Also, not every combination (\pi_t, b_t) will be allowable - remember, money cannot be created out of thin air. The self-financing condition requires that the total wealth at time t be composed of the initial wealth $x, the gains from trading, and the deterministic return from the money market. Equivalently, we may say that if Derek's wealth at time t is W_t, and if he decides to hold exactly \pi_t shares of stock, the holdings in the money-market account will have to be

b_t = W_t - \pi_t S_t. \qquad (2.2.1)

Naturally, if W_t is small and \pi_t is large, b_t might become negative, amounting to borrowing from the money-market account to finance the stock purchase. Assume now that the dynamics of the money-market account and the stock price are given by the following Ito processes:

dS_t = S_t \mu \, dt + S_t \sigma \, dB_t, \quad S_0 = s_0,
dR_t = R_t r \, dt, \quad R_0 = 1,

where r is the continuously compounded interest rate, and \mu and \sigma are the rate of return and the volatility of the stock S_t, respectively. We will assume that these coefficients are known^7 and constant^8 during the planning horizon [0, T].

^6 Some authors use \pi_t to denote the total amount of dollars invested in the stock. In our case, this quantity would correspond to S_t \pi_t.
^7 There are efficient statistical methods for the estimation of these parameters. The problems and techniques involved in the procedure would require a separate course, so we will simply assume that \mu and \sigma are reliable numbers obtained from a third party.
^8 For relatively short planning horizons this assumption is not too bad from the practical point of view. For longer horizons we would have to model explicitly the dynamics of the market coefficients \mu and \sigma, and of the interest rate r, as well.

Even though S_t is defined in terms of a stochastic differential equation, we know from the previous section that it admits a closed-form


solution S_t = S_0 \exp(\sigma B_t + t(\mu - \tfrac{1}{2}\sigma^2)). However, the SDE representation is more intuitive and will provide us with more information in the sequel. Having specified the models involved, let us look into the dynamics of our wealth and see how it depends on the portfolio processes (\pi_t, b_t), subject to the self-financing condition (2.2.1). We start by discretizing time and looking at what happens in a small time-period \Delta t. Using the fact that the change in wealth comes either from the gains on the stock market or from the interest paid by the money market, and substituting from equation (2.2.1), we get:

W_{t+\Delta t} \approx W_t + \frac{b_t}{R_t} \Delta R_t + \pi_t \Delta S_t
            = W_t + (W_t - \pi_t S_t) r \Delta t + \pi_t \Delta S_t
            \approx W_t + r W_t \Delta t + \pi_t S_t \big( (\mu - r) \Delta t + \sigma \Delta B_t \big). \qquad (2.2.2)

Passing to the limit as \Delta t \to 0, we get the following SDE, which we call the wealth equation, for the evolution of the wealth W_t = W^{x,\pi}_t depending on the portfolio process \pi_t:

dW^{x,\pi}_t = r W^{x,\pi}_t \, dt + \pi_t S_t \big( (\mu - r) \, dt + \sigma \, dB_t \big), \quad W^{x,\pi}_0 = x. \qquad (2.2.3)

A quick look at the equation above leads us to another way of thinking about the growth of wealth. We can think of the wealth W_t as growing continuously at the riskless rate r, coupled with risky growth at the excess growth rate \mu - r and volatility \sigma, multiplied by the factor \pi_t S_t - our exposure to the stock fluctuations. Moral of the story: any profit in excess of the interest rate r comes with a certain amount of risk. The more excess profit you want, the more risk you will have to face.
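The discretization (2.2.2) also gives a direct recipe for simulating the wealth process for a given trading strategy. The C sketch below is not part of the notes; the constant-shares strategy \pi_t = 10 and all parameter values are illustrative assumptions. It updates the wealth by (2.2.2) and the stock by its own SDE on a small time grid.

/* Sketch: simulating the wealth equation (2.2.2)/(2.2.3) for an illustrative
   constant-shares strategy pi_t = 10. All parameter values are assumptions.   */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define PI 3.14159265358979323846

static double randn(void) {                     /* standard normal (Box-Muller) */
    double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
}

int main(void) {
    double r = 0.05, mu = 0.10, sigma = 0.25;   /* market coefficients           */
    double S = 100.0, W = 1000.0, pi = 10.0;    /* stock, initial wealth, shares */
    double T = 1.0; int n = 250;
    double dt = T / n;
    srand(4357);
    for (int i = 1; i <= n; i++) {
        double dB = sqrt(dt) * randn();
        /* wealth update from (2.2.2), using the stock price at the start of the step */
        W += r * W * dt + pi * S * ((mu - r) * dt + sigma * dB);
        /* stock update from dS = S(mu dt + sigma dB) */
        S += S * (mu * dt + sigma * dB);
        printf("%f %f %f\n", i * dt, S, W);
    }
    return 0;
}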

2.2.2 Derivatives

One of the greatest achievements of modern finance theory is a satisfactory and empirically confirmed derivative-pricing theory. To refresh your memory, here is a "simple" definition of a derivative security from the General Rules and Regulations promulgated under the Securities Exchange Act of 1934:

    The term derivative securities shall mean any option, warrant, convertible security, stock appreciation right, or similar right with an exercise or conversion privilege at a price related to an equity security, or similar securities with a value derived from the value of an equity security. . .

Seriously, a derivative security is a financial instrument whose price (value, payoff, . . . ) depends (sometimes in very complicated ways) on the price (value, payoff, price history, . . . ) of an underlying security. Here are a number of examples. In all these examples we shall assume that S_t is a stock traded on a securities exchange. That will be our underlying security.


Example 2.2.1 (Call Options). The call option is basically a colored piece of paper that gives the holder the right (but not the obligation, hence the name option) to buy a share of the stock S_t at time T for the price K. The underlying security S_t, the maturity T and the strike price K are a part of the contract. For different K and/or T you would get different call options. In the case of the call option, it doesn't take much to figure out when to exercise the option, i.e. when to use your right to purchase a share of the stock S_t for the price K. You will do it only when S_T > K. Otherwise you could buy the same share of stock on the market for a price less than K, and in that case the option is worthless. In financial slang, we say that the call option is in the money when S_t > K, at the money when S_t = K, and out of the money when S_t < K, even for t < T.

Fig 21. Payoff of a call option (solid line) with strike K = 90, and the price of the underlying stock (dashed line)

Given that it is easy to figure out when the option should be exercised, and supposing that the owner of the option will immediately sell the stock at the prevailing market price if she decides to exercise it, we see that the net payoff of the call option is given by

C = (S_T - K)^+,

where the function (\cdot)^+ is defined by (x)^+ = x for x \ge 0, and (x)^+ = 0 for x < 0. When the option's payoff depends on the underlying's price solely through its value S_T at maturity, i.e. payoff = f(S_T), it is customary to depict the payoff by plotting the function f(\cdot), so that the x-axis stands for the underlying's price and the y-axis for the payoff. Very often, the option's price at time t < T will be a function of t and S_t only, so that we can plot it on the same graph with f.


Example 2.2.2 (Put Options). The put option is a relative of the call option, with the notable difference that here you are entitled to sell one share of the stock S_t at the price K. Just like in the case of the call option, the owner of a put option will choose to exercise it if the stock price is below the level K. The notions of in-, at- and out of the money are defined analogously (with the inequalities reversed). The payoff of a put option is given by

P = (K - S_T)^+,

where T is the maturity and K is the strike price of the put option.

Fig 22. Payoff of a put option (solid line) with strike K = 90, and the price of the underlying stock (dashed line)

Fig 23. Payoff of a forward contract (solid line) with strike K = 90, and the price of the underlying stock (dashed line)

Example 2.2.3 (Futures and forward contracts). A forward contract is like a call option, but now the owner has the obligation to buy a share of the stock S_t at the specified price K. A futures contract is an institutionalized (and liquid) version of a forward contract traded in the market. The payoff F of a forward contract is given by F = S_T - K. Unlike calls and puts, the forward contract can be costly to its holder in case S_T < K.
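All three payoffs above are simple functions of the terminal price S_T, so they are easy to tabulate. A tiny C sketch (illustrative only, not from the notes, using the strike K = 90 of Figs 21-23):

/* Sketch: the vanilla payoffs as functions of the terminal price S_T,
   tabulated for the strike K = 90 used in Figs 21-23.                  */
#include <stdio.h>

static double pos(double x)               { return x > 0.0 ? x : 0.0; }  /* (x)^+ */
static double call(double s, double k)    { return pos(s - k); }
static double put(double s, double k)     { return pos(k - s); }
static double forward(double s, double k) { return s - k; }

int main(void) {
    double K = 90.0;
    for (double s = 0.0; s <= 180.0; s += 10.0)
        printf("S_T=%6.1f  call=%7.2f  put=%7.2f  forward=%8.2f\n",
               s, call(s, K), put(s, K), forward(s, K));
    return 0;
}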

So far we have talked about the most basic and most widespread options, the so-called vanilla options. As described in the examples above, all of these are European options, meaning that they can be exercised only at the maturity T, which is fixed and contractually specified in advance. In practice, the majority of traded options are of American type, meaning that the holder (owner) can not only choose whether to exercise the option, but also when to do it. The pricing theory for those options is technically much more involved than the theory of European options, so we shall stick to Europeans in this course. The next few examples are about less traded, but still important, exotic options.


Example 2.2.4 (Asian Options). An Asian option with maturity T and strike price K gives its holder the payoff

A = \Big( \frac{1}{T} \int_0^T S_t \, dt - K \Big)^+.

In words, Asian options work just like calls, with the difference that the payoff depends on the average stock price during the entire planning horizon, and not only on the stock price at maturity. Stock exchanges like Asian options because they are less amenable to price manipulation by large investors (how?).

Example 2.2.5 (Knock-in options). A knock-in option is an option whose payoff equals that of a call option, but only if the price of the stock has crossed a predetermined barrier sometime in [0, T]. If that doesn't happen, the option is worthless, even if S_T > K. Usually we take a constant barrier at the level b, so that the value of a knock-in option is given by

\mathcal{K} = (S_T - K)^+ \cdot \mathbf{1}_{\{ \max_{0 \le t \le T} S_t \ge b \}}.

There are all kinds of variants of knock-in options. Sometimes the option will lose its value if the stock price crosses a barrier. Such options are called knock-out options.

Example 2.2.6 (Binary (digital) options). A binary option (also called a digital option) gives a payoff of $1 if the stock price at time T is above some level b, and nothing if S_T is below b. In symbols,

B = \mathbf{1}_{\{ S_T > b \}}.

Binary options are nothing but gambling instruments because of their extreme sensitivity to the price of the underlying. They are rarely allowed because of the incentive they give to price manipulation. Apparently, George Soros made a large part of his fortune by speculating with foreign-exchange binary options.

2.2.3 Arbitrage Pricing

Derivative securities are either traded on stock exchanges or offered over-the-counter. In either case, it is important to know what their "fair" price should be. In general, it is difficult to speak of and define the notion of a fair price, since prices are the result of the equilibrium between supply and demand. In financial markets^9, there is a very precise notion of a fair price which leaves very little room for discussion. This price is called the arbitrage price, and is defined by the requirement that it be impossible to make a riskless profit in the market. The economic justification of this requirement is that, if such an opportunity existed, people would exploit it until the prices of the securities were pushed to non-arbitrage levels. Let us look at a simple application of this reasoning to the pricing of call options.

^9 Complete financial markets, to be precise.

Suppose that at time


t = 0, the price of the underlying security is S_0. Given that the payoff of the call option is (S_T - K)^+, we claim that if such a call option traded at a price p_C higher than S_0, there would exist an arbitrage opportunity. To exploit it, at time t = 0, an arbitrageur would write (sell) one call option with strike K and maturity T, receiving p_C dollars, and then buy 1 share of the stock for S_0 dollars, which leaves p_C - S_0 > 0 dollars in her pocket. At the maturity, she would sell the share of the stock she bought at t = 0, and use the money S_T to pay her obligation (S_T - K)^+ \le S_T to the holder of the option, locking in a risk-free profit of (p_C - S_0) + (S_T - (S_T - K)^+) > 0 dollars. We conclude that the price of a call option should be less than the spot price of the underlying stock. Otherwise, the clever arbitrageurs would push it down.

Suppose now that there exists a self-financing portfolio \pi such that, starting with $x and investing according to \pi, our total wealth W_T at time T equals C = (S_T - K)^+. In that case, there would be no alternative but to call $x the fair price of the call option C. Any other price would produce an arbitrage opportunity similar to the one in the paragraph above. Therefore, the problem of option pricing transforms into the problem of finding the capital x from which the payoff of the option can be attained using an appropriate portfolio process. Formally, the arbitrage price of the option C is the (unique) amount of capital x such that there exists a portfolio process \pi for which the wealth process W^{x,\pi}_t from (2.2.3) satisfies

W^{x,\pi}_T = C.

In that case, \pi is called a replicating portfolio for the option C. Of course, we can speak of the price of our option at any time t \le T, and this price at time 0 < t < T will be the value of the wealth W^{x,\pi}_t under the replicating portfolio \pi at time t. In the following two subsections, we will give two derivations of the celebrated Black-and-Scholes formula for the price of a European call. The first one will be based on an explicit solution of a partial differential equation, and the second on a trick which could eventually lead to the powerful concept of change of measure.

2.2.4 Black and Scholes via PDEs

The PDE approach is based on two assumptions which can actually be proved, but we shall take them for granted with no proof, since it is not hard to grasp their validity intuitively and an a priori proof would require more mathematics than we have developed so far. Also, once we finish the derivation below, we can check the assumptions we put forth a posteriori. So, the assumptions are

• there exists a replicating portfolio \pi and an initial wealth x such that W^{x,\pi}_T = (S_T - K)^+, and

• the value of the wealth process W^{x,\pi}_t is a smooth function of S_t and t only, i.e. W^{x,\pi}_t = u(t, S_t) for some nice (differentiable enough) function u(\cdot, \cdot).


Without the first assumption there would be no arbitrage price of the call option C, and the second one states that the price of C at time t depends only on t and S_t. This second assumption is also quite plausible, since the history of the stock price is irrelevant to its future evolution, given the stock price today. Having made these two assumptions, we immediately see that all we need now is to determine the value of the function u(\cdot, \cdot) : [0, T] \times \mathbb{R}_+ \to \mathbb{R} at the point (t, x) = (0, S_0), since u(0, S_0) = W^{x,\pi}_0 = x. We shall see, however, that in order to do that we shall have to determine the values of u for all t and x. Let us start figuring out how to compute the values of the function u. First of all, since we know that W^{x,\pi}_t = u(t, S_t), we apply the Ito formula to u(t, S_t):

du(t, S_t) = \frac{\partial}{\partial t} u(t, S_t) \, dt + \frac{1}{2} \frac{\partial^2}{\partial x^2} u(t, S_t) \, (dS_t)^2 + \frac{\partial}{\partial x} u(t, S_t) \, dS_t
           = \Big( \frac{\partial}{\partial t} u(t, S_t) + \frac{1}{2} \sigma^2 S_t^2 \frac{\partial^2}{\partial x^2} u(t, S_t) + \mu S_t \frac{\partial}{\partial x} u(t, S_t) \Big) \, dt + \sigma S_t \frac{\partial}{\partial x} u(t, S_t) \, dB_t.

Recall the expression for the process W^{x,\pi}_t,

dW^{x,\pi}_t = r W^{x,\pi}_t \, dt + \pi_t S_t \big( (\mu - r) \, dt + \sigma \, dB_t \big),

and equate the terms corresponding to dt and those corresponding to dB_t to get

\frac{\partial}{\partial t} u(t, S_t) + \frac{1}{2} \sigma^2 S_t^2 \frac{\partial^2}{\partial x^2} u(t, S_t) + \mu S_t \frac{\partial}{\partial x} u(t, S_t) = r u(t, S_t) + \pi_t S_t (\mu - r), \quad \text{and} \quad \sigma S_t \frac{\partial}{\partial x} u(t, S_t) = \sigma \pi_t S_t.

Substituting the expression for \pi_t S_t into the first equation, we conclude that u(\cdot, \cdot) should satisfy the following partial differential equation, usually called the Black and Scholes PDE:

\frac{\partial}{\partial t} u(t, x) = -\frac{1}{2} \sigma^2 x^2 \frac{\partial^2}{\partial x^2} u(t, x) - r x \frac{\partial}{\partial x} u(t, x) + r u(t, x). \qquad (2.2.4)

There are infinitely many solutions to this equation, so we need to impose an extra condition to get a unique solution. Fortunately, we know the exact value of u(T, x), from the requirement that

u(T, S_T) = W^{x,\pi}_T = (S_T - K)^+, \quad \text{and so} \quad u(T, x) = (x - K)^+.
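Before attacking this problem analytically, let us note that the terminal-value problem (2.2.4), u(T, x) = (x - K)^+, can also be treated numerically. The following C sketch is not part of the notes: it marches (2.2.4) backward in time with a simple explicit finite-difference step, on an x-grid truncated at an assumed level 300, with illustrative parameter values chosen so that the explicit scheme remains stable.

/* Sketch: explicit finite-difference scheme for the Black and Scholes PDE (2.2.4)
   with terminal condition u(T,x) = (x-K)^+. Grid and parameters are illustrative;
   the time step is kept small enough for the explicit scheme to be stable.        */
#include <stdio.h>
#include <math.h>

#define NX 150            /* number of space steps; grid covers [0, 300] */
#define NT 4000           /* number of (backward) time steps             */

int main(void) {
    double sigma = 0.25, r = 0.05, K = 100.0, T = 1.0, S0 = 100.0;
    double xmax = 300.0, dx = xmax / NX, dt = T / NT;
    double u[NX + 1], unew[NX + 1];

    for (int i = 0; i <= NX; i++) {               /* terminal condition u(T,x)   */
        double x = i * dx;
        u[i] = x > K ? x - K : 0.0;
    }
    for (int n = 1; n <= NT; n++) {               /* march from T back to 0      */
        double tau = n * dt;                      /* time to maturity            */
        for (int i = 1; i < NX; i++) {
            double x = i * dx;
            double uxx = (u[i + 1] - 2.0 * u[i] + u[i - 1]) / (dx * dx);
            double ux  = (u[i + 1] - u[i - 1]) / (2.0 * dx);
            /* u(t-dt,x) = u(t,x) + dt*(0.5*sigma^2*x^2*u_xx + r*x*u_x - r*u)    */
            unew[i] = u[i] + dt * (0.5 * sigma * sigma * x * x * uxx
                                   + r * x * ux - r * u[i]);
        }
        unew[0] = 0.0;                            /* boundary: worthless at x=0  */
        unew[NX] = xmax - K * exp(-r * tau);      /* deep in-the-money boundary  */
        for (int i = 0; i <= NX; i++) u[i] = unew[i];
    }
    printf("u(0, %.1f) ~ %.4f\n", S0, u[(int)(S0 / dx)]);
    return 0;
}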

We have transformed the financial problem of finding a fair price of the call option C into a purely analytic problem: solving the partial differential equation (2.2.4), subject to the terminal condition u(T, x) = (x - K)^+. To plan an attack on the equation (2.2.4), we start with a simplifying substitution

u(t, x) = e^{-r(T-t)} v\Big( T - t, \frac{\log(x)}{\sigma} \Big) = e^{-rs} v(s, y),


with y = \frac{\log(x)}{\sigma}, s = T - t, so that the equation for v becomes

\frac{\partial}{\partial s} v(s, y) = \frac{1}{2} \frac{\partial^2}{\partial y^2} v(s, y) + \Big( \frac{r}{\sigma} - \frac{\sigma}{2} \Big) \frac{\partial}{\partial y} v(s, y), \qquad (2.2.5)

together with the (now initial!) condition v(0, y) = (\exp(\sigma y) - K)^+. The reason for this specific substitution comes from the desire to get rid of x and x^2 in front of \frac{\partial}{\partial x} u(t, x) and \frac{\partial^2}{\partial x^2} u(t, x), as well as of the term r u(t, x). By doing this we transform our partial differential equation (2.2.4) into a standard form of the heat equation. We need another change of variables, though, to accomplish that - we have to get rid of the term involving \frac{\partial}{\partial y} v(s, y). For that, we set

v(s, y) = w\Big( s, y + \Big( \frac{r}{\sigma} - \frac{\sigma}{2} \Big) s \Big) = w(s, z),

with z = y + (\frac{r}{\sigma} - \frac{\sigma}{2}) s, which finally reduces our PDE to the canonical form of the heat equation

\frac{\partial}{\partial s} w(s, z) = \frac{1}{2} \frac{\partial^2}{\partial z^2} w(s, z), \qquad (2.2.6)

together with the initial condition

w(0, z) = v(0, z) = \big( \exp(\sigma z) - K \big)^+. \qquad (2.2.7)

The mathematical theory of partial differential equations has a very precise answer about the solutions of the heat equation with an initial condition assigned. We cite a general theorem without proof:

Theorem 2.2.7 (Solution of the heat equation). Given a sufficiently nice function^10 \psi : \mathbb{R} \to \mathbb{R}, there exists a unique function w : [0, \infty) \times \mathbb{R} \to \mathbb{R} such that

\frac{\partial}{\partial s} w(s, z) = \frac{1}{2} \frac{\partial^2}{\partial z^2} w(s, z), \quad \text{for } s > 0 \text{ and } z \in \mathbb{R},

and

w(0, z) = \psi(z), \quad \text{for } z \in \mathbb{R}.

Furthermore, the function w is given by the following integral:

w(s, z) = \int_{-\infty}^{\infty} \frac{e^{-\frac{(z-\xi)^2}{2s}}}{\sqrt{2\pi s}} \, \psi(\xi) \, d\xi.

^10 All the functions appearing in finance are nice enough.


Using this result we can immediately write down the solution to our equation (2.2.6) with initial condition (2.2.7):

w(s, z) = \int_{-\infty}^{\infty} \frac{e^{-\frac{(z-\xi)^2}{2s}}}{\sqrt{2\pi s}} \big( \exp(\sigma \xi) - K \big)^+ \, d\xi
        = \int_{\log(K)/\sigma}^{\infty} \frac{e^{-\frac{(z-\xi)^2}{2s}}}{\sqrt{2\pi s}} \big( \exp(\sigma \xi) - K \big) \, d\xi
        = -K \, \Phi\Big( \frac{z\sigma - \log(K)}{\sqrt{s}\,\sigma} \Big) + e^{z\sigma + \frac{s\sigma^2}{2}} \, \Phi\Big( \frac{\sigma(z + s\sigma) - \log(K)}{\sqrt{s}\,\sigma} \Big),

where \Phi(x) = \int_{-\infty}^{x} \frac{e^{-\xi^2/2}}{\sqrt{2\pi}} \, d\xi. Reversing the substitutions we get the following expressions:

v(s, y) = -K \, \Phi\Big( \frac{2rs + 2y\sigma - s\sigma^2 - 2\log(K)}{2\sqrt{s}\,\sigma} \Big) + e^{rs + y\sigma} \, \Phi\Big( \frac{2rs + 2y\sigma + s\sigma^2 - 2\log(K)}{2\sqrt{s}\,\sigma} \Big),

and finally

u(t, x) = -e^{-r(T-t)} K \, \Phi\Big( \frac{(T-t)(2r - \sigma^2) - 2\log(K/x)}{2\sigma\sqrt{T-t}} \Big) + x \, \Phi\Big( \frac{(T-t)(2r + \sigma^2) - 2\log(K/x)}{2\sigma\sqrt{T-t}} \Big). \qquad (2.2.8)

Setting t = 0 and x = S_0 - the stock price at time t = 0 - we get the famous Black and Scholes formula for the price of a European call option:

u(0, S_0) = -e^{-rT} K \, \Phi\Big( \frac{T(2r - \sigma^2) - 2\log(K/S_0)}{2\sigma\sqrt{T}} \Big) + S_0 \, \Phi\Big( \frac{T(2r + \sigma^2) - 2\log(K/S_0)}{2\sigma\sqrt{T}} \Big).
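In this form the formula is easy to evaluate on a computer, since \Phi can be expressed through the error function as \Phi(x) = (1 + \mathrm{erf}(x/\sqrt{2}))/2. A short C sketch (not from the notes; the parameter values in main are illustrative):

/* Sketch: evaluating the Black and Scholes formula (2.2.8) at t = 0,
   using Phi(x) = (1 + erf(x/sqrt(2)))/2. Parameter values are illustrative. */
#include <stdio.h>
#include <math.h>

static double Phi(double x) {                    /* standard normal cdf */
    return 0.5 * (1.0 + erf(x / sqrt(2.0)));
}

/* price of a European call with spot S0, strike K, maturity T */
static double bs_call(double S0, double K, double r, double sigma, double T) {
    double d1 = (T * (2.0 * r + sigma * sigma) - 2.0 * log(K / S0))
                / (2.0 * sigma * sqrt(T));
    double d2 = (T * (2.0 * r - sigma * sigma) - 2.0 * log(K / S0))
                / (2.0 * sigma * sqrt(T));
    return S0 * Phi(d1) - K * exp(-r * T) * Phi(d2);
}

int main(void) {
    printf("BS call price: %.4f\n", bs_call(100.0, 100.0, 0.05, 0.25, 1.0));
    return 0;
}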

2.2.5 The “Martingale method” for pricing options

In this subsection we would like to present another approach to option pricing in Samuelson's model. First of all, here is a crucial observation about what we did in the previous subsection:

• The moment we wrote down the Black and Scholes PDE, the symbol \mu disappeared without a trace. The function u and the option price do not depend on \mu at all!

Since \mu does not matter, we might as well set it to 0, or to any other value we please, and still obtain the same price. We will use that to derive the Black and Scholes formula in a very elegant and quick manner.^11

^11 We couldn't have arrived at the conclusion that \mu does not matter directly, without using the PDE approach, so do not dismiss it entirely. There are some other very good reasons to keep the PDE approach handy. You will understand which reasons I am talking about when you read through this section.

Before we do that, let us examine the wealth equation (2.2.3) - in particular, let us see what happens to it when we discount it and denominate the wealth


in real instead of nominal terms. Define the process V_t to be the discounted version of W^{x,\pi}_t, i.e. V_t = e^{-rt} W^{x,\pi}_t. By Ito's formula, V_t is an Ito process and we have

dV^{x,\pi}_t = -r e^{-rt} W^{x,\pi}_t \, dt + e^{-rt} \, dW^{x,\pi}_t
            = -r e^{-rt} W^{x,\pi}_t \, dt + e^{-rt} \Big( r W^{x,\pi}_t \, dt + \pi_t S_t \big( (\mu - r) \, dt + \sigma \, dB_t \big) \Big)
            = e^{-rt} \pi_t S_t \big( (\mu - r) \, dt + \sigma \, dB_t \big).

Let us now move to a different market - the one in which \mu = r - and denote the stock price there by S^r_t. The stochastic process S^r_t is now the geometric Brownian motion given by the following SDE:

dS^r_t = S^r_t r \, dt + S^r_t \sigma \, dB_t, \quad S^r_0 = S_0.

In this market we have

dV^{x,\pi}_t = e^{-rt} \pi_t S^r_t \sigma \, dB_t,

and so V^{x,\pi}_t is a martingale, since its dt-term disappears (\mu - r = 0). Suppose now that x is the arbitrage price of a contingent claim X (not necessarily a call option), and let \pi be the replicating portfolio, so that W^{x,\pi}_T = X and W^{x,\pi}_0 = x. The process V^{x,\pi}_t (the discounted wealth W^{x,\pi}_t) now satisfies the following:

• V^{x,\pi}_t is a martingale,

• V^{x,\pi}_T = e^{-rT} X, and

• V^{x,\pi}_0 = x,

so we must have

x = V^{x,\pi}_0 = E[V^{x,\pi}_T] = E[e^{-rT} X].

Thus, the arbitrage price of X is given by x = E[e^{-rT} X] in the market in which \mu = r. We know, however, that the arbitrage price does not depend on the parameter \mu, so we have the following important conclusion:

• the arbitrage price of a European contingent claim X is the expectation of e^{-rT} X in the market in which \mu = r.

Let us apply this principle to the valuation of the European call option. In this case X = (S_T - K)^+, so the Black and Scholes price of X is given by

x = E[e^{-rT} (S^r_T - K)^+].


We know that S^r_T = S_0 \exp(\sigma B_T + (r - \tfrac{1}{2}\sigma^2) T), and that B_T has a normal distribution with mean 0 and variance T, so we have

x = E[e^{-rT} (S^r_T - K)^+] = E\Big[ e^{-rT} \Big( S_0 \exp\big( \sigma B_T + (r - \tfrac{1}{2}\sigma^2) T \big) - K \Big)^+ \Big]
  = \int_{-\infty}^{\infty} \frac{e^{-\frac{\xi^2}{2T}}}{\sqrt{2\pi T}} \, e^{-rT} \Big( S_0 \exp\big( \sigma \xi + (r - \tfrac{1}{2}\sigma^2) T \big) - K \Big)^+ \, d\xi
  = \int_{\frac{1}{\sigma}\left( \log(K/S_0) - T (r - \frac{1}{2}\sigma^2) \right)}^{\infty} \frac{e^{-\frac{\xi^2}{2T}}}{\sqrt{2\pi T}} \, e^{-rT} \Big( S_0 \exp\big( \sigma \xi + (r - \tfrac{1}{2}\sigma^2) T \big) - K \Big) \, d\xi,

and the (standard) evaluation of this integral gives the Black and Scholes formula (2.2.8). Why is the PDE approach sometimes better? Well, the martingale approach requires you to calculate expectations, which is often faster than solving a PDE. However, not all expectations can be evaluated explicitly (try Asian options!), and then we have to resort to numerical techniques and use PDEs.
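When the expectation cannot be computed in closed form, the principle above still yields a practical recipe: simulate the market with \mu = r many times, average the discounted payoffs, and take the sample mean as the price. The C sketch below (not from the notes; the sample size and parameters are illustrative assumptions) does this for the European call, where the answer can be checked against the Black and Scholes formula; replacing the payoff by one computed from the average of a simulated path would treat the Asian option in the same way.

/* Sketch: Monte Carlo valuation of a European call as E[e^{-rT}(S^r_T - K)^+]
   in the market with mu = r. Sample size and parameters are illustrative.     */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define PI 3.14159265358979323846

static double randn(void) {                      /* standard normal (Box-Muller) */
    double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
}

int main(void) {
    double S0 = 100.0, K = 100.0, r = 0.05, sigma = 0.25, T = 1.0;
    long n = 1000000;                            /* number of simulated B_T's    */
    double sum = 0.0;
    srand(4357);
    for (long i = 0; i < n; i++) {
        double BT = sqrt(T) * randn();           /* B_T ~ N(0, T)                */
        double ST = S0 * exp(sigma * BT + (r - 0.5 * sigma * sigma) * T);
        double payoff = ST > K ? ST - K : 0.0;   /* (S_T - K)^+                  */
        sum += exp(-r * T) * payoff;
    }
    printf("Monte Carlo call price: %.4f\n", sum / n);
    return 0;
}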


Appendix A

Odds and Ends


A.1 No P for all Subsets of [0, 1]

The following example shows a mind-bending phenomenon - and elucidates the necessity of the nuisance of having probabilities defined on \sigma-algebras. In order to proceed we have to state an axiom of set theory.

Axiom of Choice. Given any collection \mathcal{C} of nonempty and pairwise disjoint sets, there exists a set containing precisely one element from each set in \mathcal{C}.

Example A.1.1. Let us define an equivalence relation \sim on \mathbb{R} by stating that x \sim y if and only if y - x is a rational number. This equivalence relation will break the set of real numbers \mathbb{R} into a collection of disjoint nonempty equivalence classes, and by the Axiom of Choice, there exists a set A \subseteq \mathbb{R} such that A contains exactly one representative from each class. For each rational number q \in \mathbb{Q} we can define the translation A_q of A by A_q \triangleq \{ (q + a) \ (\mathrm{mod}\ 1) : a \in A \}. Here (q + a) (mod 1) denotes the fractional part of q + a (e.g. 7.14 (mod 1) = 0.14, -0.92 (mod 1) = 0.08). Convince yourself that the sets A_q constitute a countable partition of [0, 1]:

\bigcup_{q \in \mathbb{Q}} A_q = [0, 1], \qquad A_q \cap A_p = \emptyset, \ \text{for } p, q \in \mathbb{Q}, \ p \ne q.

Suppose there is a probability P defined on all the subsets of [0, 1] such that P[A] = P[A + x] for any A \subseteq [0, 1] such that A + x \subseteq [0, 1]. We first note that P[A_q] = P[A_p] for any p, q \in \mathbb{Q}. Can we have the common value of P[A_p], p \in \mathbb{Q}, strictly positive, say \varepsilon > 0? Surely not! The countable additivity of P dictates

1 = P[0, 1] = P\Big[ \bigcup_{q \in \mathbb{Q}} A_q \Big] = \sum_{q \in \mathbb{Q}} P[A_q] = \sum_{q \in \mathbb{Q}} \varepsilon = \infty.

So, we must have P[A_p] = 0 for all p \in \mathbb{Q}, but we are no better off than before:

1 = P[0, 1] = P\Big[ \bigcup_{q \in \mathbb{Q}} A_q \Big] = \sum_{q \in \mathbb{Q}} P[A_q] = \sum_{q \in \mathbb{Q}} 0 = 0.

A contradiction again. Sherlock Holmes said: "When you have eliminated all which is impossible, then whatever remains, however improbable, must be the truth", so we must agree that there is simply no nice (translationally invariant) probability defined on all subsets of [0, 1].


A.2 Marginals: Normal, Joint: Not

In this section we give two examples of bivariate random vectors X = (X_1, X_2) such that X_1 and X_2 are normally distributed, but the random vector X is not. In the sequel we use the following shorthand for the density of the standard normal:

\varphi(x) = \frac{1}{\sqrt{2\pi}} \exp\big( -\tfrac{1}{2} x^2 \big).

Example A.2.1. Let h : \mathbb{R} \to \mathbb{R} be any nontrivial function such that

1. |h(x)| \le (2\pi e)^{-1/2}, for all x \in \mathbb{R},

2. h(x) = 0, for x \notin [-1, 1],

3. h(-x) = -h(x), for all x \in \mathbb{R}.

Define the random vector X = (X, Y) to have the density

f_X(x, y) = \varphi(x)\varphi(y) + h(x)h(y).

Conditions 1. and 2. ensure that f_X \ge 0, and the integral of f_X over \mathbb{R} \times \mathbb{R} is equal to 1 due to condition 3. Thus, f_X is a true density function. Also, f_X is not the density function of a bivariate normal distribution for any nontrivial h (why?). Let us show that the marginals of f_X are standard normal. We compute

f_X(x) = \int_{-\infty}^{\infty} f_X(x, y) \, dy = \int_{-\infty}^{\infty} \varphi(x)\varphi(y) \, dy + \int_{-\infty}^{\infty} h(x)h(y) \, dy
       = \varphi(x) \int_{-\infty}^{\infty} \varphi(y) \, dy + h(x) \int_{-\infty}^{\infty} h(y) \, dy = \varphi(x) \cdot 1 + h(x) \cdot 0 = \varphi(x),

where the last two integrals are equal to 1 and 0, respectively, due to the facts that \varphi is a density function and that h is odd (h(-y) = -h(y)). By symmetry we conclude that f_Y(y) = \varphi(y).

Example A.2.2. In the following example I will only define the candidate density function and leave it for you as an exercise to prove that it in fact defines a random vector whose marginals are normal, but whose joint distribution is not:

f_X(x, y) = \begin{cases} 2\varphi(x)\varphi(y), & xy \ge 0, \\ 0, & xy < 0. \end{cases}


A.3 Mersenne Twister

In this appendix I am including the C code for the Mersenne Twister written by M. Matsumoto and T. Nishimura, one of the best free random number generators (its order is 624 and its period is 2^19937 - 1). The interesting thing is that its good properties stem from some deep results in number theory (hence the name). For more info, check the article referenced below. More info (including the source code) is available from http://www.math.keio.ac.jp/matumoto/emt.html

/* A C-program for MT19937: Real number version([0,1)-interval) */

/* (1999/10/28) */

/* genrand() generates one pseudorandom real number (double) */

/* which is uniformly distributed on [0,1)-interval, for each */

/* call. sgenrand(seed) sets initial values to the working area */

/* of 624 words. Before genrand(), sgenrand(seed) must be */

/* called once. (seed is any 32-bit integer.) */

/* Integer generator is obtained by modifying two lines. */

/* Coded by Takuji Nishimura, considering the suggestions by */

/* Topher Cooper and Marc Rieffel in July-Aug. 1997. */

/* This library is free software; you can redistribute it and/or */

/* modify it under the terms of the GNU Library General Public */

/* License as published by the Free Software Foundation; either */

/* version 2 of the License, or (at your option) any later */

/* version. */

/* This library is distributed in the hope that it will be useful, */

/* but WITHOUT ANY WARRANTY; without even the implied warranty of */

/* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. */

/* See the GNU Library General Public License for more details. */

/* You should have received a copy of the GNU Library General */

/* Public License along with this library; if not, write to the */

/* Free Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA */

/* 02111-1307 USA */

/* Copyright (C) 1997, 1999 Makoto Matsumoto and Takuji Nishimura. */

/* Any feedback is very welcome. For any question, comments, */

/* see http://www.math.keio.ac.jp/matumoto/emt.html or email */

/* [email protected] */

Last Updated: Spring 2005 136 Continuous-Time Finance: Lecture Notes

Page 137: Continuous-Time Finance: Lecture Notes

APPENDIX A. ODDS AND ENDS A.3. MERSENNE TWISTER

/* REFERENCE */

/* M. Matsumoto and T. Nishimura, */

/* "Mersenne Twister: A 623-Dimensionally Equidistributed Uniform */

/* Pseudo-Random Number Generator", */

/* ACM Transactions on Modeling and Computer Simulation, */

/* Vol. 8, No. 1, January 1998, pp 3--30. */

#include<stdio.h>

/* Period parameters */

#define N 624

#define M 397

#define MATRIX_A 0x9908b0df /* constant vector a */

#define UPPER_MASK 0x80000000 /* most significant w-r bits */

#define LOWER_MASK 0x7fffffff /* least significant r bits */

/* Tempering parameters */

#define TEMPERING_MASK_B 0x9d2c5680

#define TEMPERING_MASK_C 0xefc60000

#define TEMPERING_SHIFT_U(y) (y >> 11)

#define TEMPERING_SHIFT_S(y) (y << 7)

#define TEMPERING_SHIFT_T(y) (y << 15)

#define TEMPERING_SHIFT_L(y) (y >> 18)

static unsigned long mt[N]; /* the array for the state vector */

static int mti=N+1; /* mti==N+1 means mt[N] is not initialized */

/* Initializing the array with a seed */

void
sgenrand(seed)
    unsigned long seed;
{
    int i;

    for (i=0;i<N;i++) {
        mt[i] = seed & 0xffff0000;
        seed = 69069 * seed + 1;
        mt[i] |= (seed & 0xffff0000) >> 16;
        seed = 69069 * seed + 1;
    }
    mti = N;
}

/* Initialization by "sgenrand()" is an example. Theoretically, */

/* there are 2^19937-1 possible states as an intial state. */

/* This function allows to choose any of 2^19937-1 ones. */

/* Essential bits in "seed_array[]" is following 19937 bits: */

/* (seed_array[0]&UPPER_MASK), seed_array[1], ..., seed_array[N-1]. */

/* (seed_array[0]&LOWER_MASK) is discarded. */

/* Theoretically, */

/* (seed_array[0]&UPPER_MASK), seed_array[1], ..., seed_array[N-1] */

/* can take any values except all zeros. */

void
lsgenrand(seed_array)
    unsigned long seed_array[]; /* the length of seed_array[] must be at least N */
{
    int i;

    for (i=0;i<N;i++)
        mt[i] = seed_array[i];
    mti=N;
}

double /* generating reals */
/* unsigned long */ /* for integer generation */
genrand()
{
    unsigned long y;
    static unsigned long mag01[2]={0x0, MATRIX_A};
    /* mag01[x] = x * MATRIX_A for x=0,1 */

    if (mti >= N) { /* generate N words at one time */
        int kk;

        if (mti == N+1)     /* if sgenrand() has not been called, */
            sgenrand(4357); /* a default initial seed is used */

        for (kk=0;kk<N-M;kk++) {
            y = (mt[kk]&UPPER_MASK)|(mt[kk+1]&LOWER_MASK);
            mt[kk] = mt[kk+M] ^ (y >> 1) ^ mag01[y & 0x1];
        }
        for (;kk<N-1;kk++) {
            y = (mt[kk]&UPPER_MASK)|(mt[kk+1]&LOWER_MASK);
            mt[kk] = mt[kk+(M-N)] ^ (y >> 1) ^ mag01[y & 0x1];
        }
        y = (mt[N-1]&UPPER_MASK)|(mt[0]&LOWER_MASK);
        mt[N-1] = mt[M-1] ^ (y >> 1) ^ mag01[y & 0x1];

        mti = 0;
    }

    y = mt[mti++];
    y ^= TEMPERING_SHIFT_U(y);
    y ^= TEMPERING_SHIFT_S(y) & TEMPERING_MASK_B;
    y ^= TEMPERING_SHIFT_T(y) & TEMPERING_MASK_C;
    y ^= TEMPERING_SHIFT_L(y);

    return ( (double)y * 2.3283064365386963e-10 ); /* reals: [0,1)-interval */
    /* return y; */ /* for integer generation */
}

/* This main() outputs first 1000 generated numbers. */
main()
{
    int i;

    sgenrand(4357);
    for (i=0; i<1000; i++) {
        printf("%10.8f ", genrand());
        if (i%5==4) printf("\n");
    }
}


A.4 Conditional Expectation Cheat-Sheet

Here are the 7 properties of the conditional expectation:

(CE1) E[X|F] is a random variable, and it is measurable with respect to F, i.e. E[X|F] depends on the state of the world, but only through the information contained in F.

(CE2) E[X|F](ω) = E[X], for all ω ∈ Ω, if F = {∅, Ω}, i.e. the conditional expectation reduces to the ordinary expectation when you have no information.

(CE3) E[X|F](ω) = X(ω), if X is measurable with respect to F, i.e. there is no need for expectation when you already know the answer (the value of X is known when you know F).

(CE4) When X is a random variable and F, G are two σ-algebras such that F ⊆ G, then

E[E[X|G]|F] = E[X|F], i.e.

my expectation (given information F) of the value of X is equal to the expectation (given information F) of the expectation I would have about the value of X if I had information G on top of F.

(CE5) The conditional expectation is linear, i.e. for random variables X_1, X_2, real constants α_1, α_2, and a σ-algebra F, we have

E[α_1 X_1 + α_2 X_2 | F] = α_1 E[X_1|F] + α_2 E[X_2|F].

(CE6) Let F be a σ-algebra, and let X and Y be random variables such that Y is measurable with respect to F. Then

E[XY|F](ω) = Y(ω) E[X|F](ω),

i.e. a random variable measurable with respect to the available information can be treated as a constant in conditional expectations.

(CE7) If the random variable X and the σ-algebra F are independent, then

E[X|F](ω) = E[X], for all ω,

i.e. conditioning with respect to independent information is like conditioning on no information at all.
