
ST2334 Probability and Statistics

A/P Ajay Jasra
Office 06-18, S16
Department of Statistics and Applied Probability
National University of Singapore
6 Science Drive 2, Singapore 117546
E-Mail: [email protected]


Contents

1 Introduction to Probability
  1.1 Introduction
  1.2 Probability Triples
    1.2.1 Sample Space
    1.2.2 σ-Fields
    1.2.3 Probability
  1.3 Conditional Probability
    1.3.1 Bayes Theorem
    1.3.2 Theorem of Total Probability
  1.4 Independence

2 Random Variables and their Distributions
  2.1 Introduction
  2.2 Random Variables and Distribution Functions
    2.2.1 Random Variables
    2.2.2 Distribution Functions
  2.3 Discrete Random Variables
    2.3.1 Probability Mass Functions
    2.3.2 Independence
    2.3.3 Expectation
    2.3.4 Dependence
    2.3.5 Conditional Distributions and Expectations
  2.4 Continuous Random Variables
    2.4.1 Probability Density Functions
    2.4.2 Independence
    2.4.3 Expectation
    2.4.4 Dependence
    2.4.5 Conditional Distributions and Expectations
    2.4.6 Functions of Random Variables
  2.5 Convergence of Random Variables
    2.5.1 Convergence in Distribution
    2.5.2 Convergence in Probability
    2.5.3 Central Limit Theorem

3 Introduction to Statistics
  3.1 Introduction
  3.2 Maximum Likelihood Estimation
    3.2.1 Introduction
    3.2.2 The Method
    3.2.3 Examples of Computing the MLE
  3.3 Hypothesis Testing
    3.3.1 Introduction
    3.3.2 Constructing Test Statistics
  3.4 Bayesian Inference
    3.4.1 Introduction
    3.4.2 Bayesian Estimation
    3.4.3 Examples

4 Miscellaneous Results
  4.1 Set Theory
  4.2 Summation
  4.3 Exponential Function
  4.4 Taylor Series
  4.5 Integration Methods
  4.6 Distributions


Course Information

Lecture times: This course is held from January 2013 to May 2013. Lectures are at 0800-1000 on Wednesday and 1200-1400 on Thursday in LT 27. The notes are available on my website: http://www.stat.nus.edu.sg/~staja/. I do not use IVLE.

Office Hour: My office hour is at 1600 on Thursday in 06-18, Department of Statistics and Applied Probability. I am not available at other times, unless there is a time-table clash.

Assessed Coursework: During the course there will be two assignments which make up 40% of the final grade (equal weighting). The dates on which these assessments will be handed out will be provided at least two weeks beforehand, and you are given two weeks to complete each assessment. The assessments are to be handed to me in person or at the statistics office, S16, level 7. Due to the number of students on this course I do not accept assessments via e-mail (unless there are extreme circumstances, which would need to be verified by the department beforehand). There is NO mid-term examination.

Exam: There is a 2 hour final exam of 4 questions on 4th May, 1-3pm. No calculators are allowed and the examination is closed-book. You will be given a formula sheet to assist you in the examination (Tables 4.1 and 4.2).

Problem Sheets: There are also 10 non-assessed problem sheets available on my website. Typed solutions will be available on my website. In some weeks, we will discuss the solutions of the assessments after the deadline.

Course Details: These notes are not sufficient to replace lectures. In particular, many examples and clarifications are given during the class. This course investigates the following concepts: basic concepts of probability, conditional probability, independence, random variables, joint and marginal distributions, mean and variance, some common probability distributions, sampling distributions, estimation and hypothesis testing based on a normal population. There are three chapters of course content, the first covering the foundations of probability. We then move on to random variables and their distributions, which forms the main content of the course and is a necessary prerequisite for further work in statistical modeling and randomized algorithms. The third chapter is a basic introduction to statistics and gives some basic notions which would be used in statistical modeling. The final chapter of the notes provides some mathematical background for the course, which you are strongly advised to read in the first week. You are expected to know everything that is in this chapter. In particular, the notion of double summation and integration is used routinely in this course and you should spend some time recalling these ideas.

References: The recommended reference for this course is [1] (Chapters 1-5). For a non-technical introduction, the book [2] provides an entertaining and intuitive look at probability.


Chapter 1

Introduction to Probability

1.1 Introduction

This Chapter provides a basic introduction to the notions that underlie the basics of probability. The level of the notes is below that required for complete mathematical rigor, but does provide a technical development of these ideas. Essentially, the basic ideas of probability theory start with a probability triple: an 'event space' (e.g. in flipping a coin, a head or a tail), sets of events (e.g. 'tail' or 'tail and head') and a way to compare the 'likelihood' of events via a probability distribution (the chance of obtaining a tail). Moving on from these notions, we consider the probability of events given that other events are known to have occurred (conditional probability): for example, the probability a coin lands tails, given we know it has two 'heads'. Some events have probabilities which decouple in a special way and are called independent (for example, flipping a fair coin twice, the outcome of the first flip may not influence that of the second).

The structure of this Chapter is as follows: in Section 1.2, we discuss the idea of a probability triple; in Section 1.3 conditional probability is introduced; and in Section 1.4 we discuss independence.

1.2 Probability Triples

As mentioned in the introduction, probability begins with a triple:

• A sample space (possible outcomes)

• A collection of sets of outcomes (a σ-field)

• A way to compare the likelihood of events (probability measure)

We will slowly introduce and analyze these concepts.

1.2.1 Sample Space

The basic notion we begin with is that of an experiment with a collection of possible outcomes, for which we cannot (usually) determine exactly what will happen. For example, if we flip a coin, or if we watch a football game and so on, in general we do not know for certain what will happen. The idea of a sample space is as follows.

Definition 1.2.1. The set of all possible outcomes of an experiment is called the sample space and is denoted by Ω.

Example 1.2.2. Consider rolling a six sided fair die, once. There are 6 possible outcomes, and thus Ω = {1, 2, . . . , 6}. We may be interested (for example, for betting purposes) in the following events:

1. we roll a 1

2. we roll an even number

3. we roll an even number, which is less than 3


Notation    Set Terminology          Probability Terminology
Ω           Collection of objects    Sample space
ω           Member of Ω              Elementary event
A           Subset of Ω              Event that an outcome in A occurs
A^c         Complement of A          Event that no outcome in A occurs
A ∩ B       Intersection             Both A and B
A ∪ B       Union                    A or B
A \ B       Difference               A but not B
A Δ B       Symmetric difference     A or B but not both
A ⊆ B       Inclusion                If A then B
∅           Empty set                Impossible event

Table 1.1: Terminology of Set and Probability Theory

4. we roll an odd number

One thing that we immediately realize is that each of the events in the above example is a subset of Ω; that is, the events (1)-(4) correspond to:

1. {1}

2. {2, 4, 6}

3. {2}

4. {1, 3, 5}

As a result, we think of events as subsets of the sample space Ω. Such events can be constructed by unions, intersections and complements of other events (we will formally explain why, below). For example, letting A = {2, 4, 6} in case (2) above, we immediately obtain that the event in (4) is A^c. Similarly, letting A be the event in (2) and B be the event of rolling a number less than 4, we obtain that the event in (3) is A ∩ B. For those of you who have forgotten basic set notation and terminology, see Table 1.1 (Section 4.1 also has a refresher on set theory). In particular, we will think of Ω as the certain event (we will roll one of 1, 2, . . . , 6) and its complement Ω^c = ∅ as the impossible event (we must roll something).

1.2.2 σ-Fields

Now we have a notion of an event; in particular, events are subsets of Ω. A particular question of interest is then: are all subsets of Ω events? The answer actually turns out to be no, but the technical reasons for this lie far outside the scope of this course. We will content ourselves to use a particular collection of sets F (a 'set of sets') of Ω which contains all the events that we can make probability statements about. This collection of sets is a σ-field.

Definition 1.2.3. A collection of sets F of Ω is called a σ-field if it satisfies the following conditions:

1. ∅ ∈ F

2. If A1, A2, . . . ∈ F then ⋃_{i=1}^∞ Ai ∈ F

3. If A ∈ F then A^c ∈ F

It can be shown that σ-fields are closed under countable intersections (i.e. if A1, A2, . . . ∈ F then ⋂_{i=1}^∞ Ai ∈ F). Whilst it may seem quite abstract, it will turn out that, for this course, all the sets we consider will lie in the σ-field F.

Example 1.2.4. Consider flipping a single coin; letting H denote a head and T a tail, then Ω = {H, T} and F = {Ω, ∅, {H}, {T}}.

Thus, to summarize, so far (Ω, F) is a sample space and a σ-field. The former is the collection of all possible outcomes and the latter is a collection of sets on Ω (the events) that follows Definition 1.2.3. Our next objective is to define a way of assigning a likelihood to each event.



Exercise 1.2.5. You are given De Morgan's Laws:

(⋃_i Ai)^c = ⋂_i Ai^c,    (⋂_i Ai)^c = ⋃_i Ai^c.

Show that if A, B ∈ F, then A ∩ B and A \ B are in F (just use the rules of σ-fields).

1.2.3 Probability

We now introduce a way to assign a likelihood to an event, via probability. One possible interpretation is the following: suppose my experiment is repeated many times; then the probability of any event is the limit of the ratio of the number of times the event occurs to the number of repetitions. We note that this is not the only interpretation of probability, but we do not diverge into a discussion of the philosophy of probability. Formally, we introduce the notion of a probability measure:

Definition 1.2.6. A probability measure P on (Ω,F ) is a function P : F → [0, 1] which satisfies:

1. P(Ω) = 1 and P(∅) = 0

2. For A1, A2, . . . disjoint (∀i ≠ j, Ai ∩ Aj = ∅) members of F,

P(⋃_{i=1}^∞ Ai) = ∑_{i=1}^∞ P(Ai).

The triple (Ω, F, P), comprising a set Ω, a σ-field F of subsets of Ω and a probability measure P, is called a probability space (or probability triple).

The idea is to associate the probability space with an experiment:

Example 1.2.7. Consider flipping a coin as in Example 1.2.4, with Ω = {H, T} and F = {Ω, ∅, {H}, {T}}. Then we can set:

P({H}) = p,  P({T}) = 1 − p,  p ∈ [0, 1].

If p = 1/2 we say that the coin is fair.

Example 1.2.8. Consider rolling a 6-sided die; then Ω = {1, . . . , 6} and F = {0, 1}^Ω (the power set of Ω, i.e. the set of all subsets of Ω). Define a probability measure on (Ω, F) as:

P(A) = ∑_{i∈A} pi,  ∀A ∈ F,

where ∀i ∈ {1, . . . , 6}, 0 ≤ pi ≤ 1 and ∑_{i=1}^6 pi = 1. If pi = 1/6, ∀i ∈ {1, . . . , 6}, then:

P(A) = Card(A)/6,

where Card(A) is the cardinality of A (the number of elements in the set A).

We now consider a sequence of results on probability spaces.

Lemma 1.2.9. We have the following properties on probability space (Ω,F ,P):

1. For any A ∈ F, P(A^c) = 1 − P(A).

2. For any A,B ∈ F , if A ⊆ B then, P(B) = P(A) + P(B \A) ≥ P(A).

3. For any A,B ∈ F , P(A ∪B) = P(A) + P(B)− P(A ∩B).

4. (inclusion-exclusion formula) For any A1, . . . , An ∈ F:

P(⋃_{i=1}^n Ai) = ∑_i P(Ai) − ∑_{i<j} P(Ai ∩ Aj) + ∑_{i<j<k} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n+1} P(⋂_{i=1}^n Ai),

where, for example, ∑_{i<j} = ∑_{j=1}^n ∑_{i=1}^{j−1}, etc.


Proof. For (1), A ∪ A^c = Ω, so 1 = P(Ω) = P(A ∪ A^c) = P(A) + P(A^c) (as A ∩ A^c = ∅). Thus, one concludes by rearranging.

For (2), as B = A ∪ (B \ A) and A ∩ (B \ A) = ∅ (recall Exercise 1.2.5), P(B) = P(A) + P(B \ A). The inequality is clear from the definition of a probability measure.

For (3), A ∪ B = A ∪ (B \ A), which is a disjoint union, and B \ A = B \ (A ∩ B); hence

P(A ∪ B) = P(A) + P(B \ A) = P(A) + P(B \ (A ∩ B)).

Now (A ∩ B) ⊆ B, so by (2)

P(B \ (A ∩ B)) = P(B) − P(A ∩ B)

and thus

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

For (4), this can be proved by induction and is an exercise.

To conclude the Section, we introduce two terms:

• An event A ∈ F is null if P(A) = 0.

• An event A ∈ F occurs almost surely if P(A) = 1.

Null events are not impossible. We will see this when considering random variables which take values in the real line.

1.3 Conditional Probability

We now move on to the notion of conditional probability. It captures the probability of an event, given that another event is known to have occurred. For example, what is the probability that it rains today, given that it rained yesterday? We formalize the notion of conditional probability:

Definition 1.3.1. Consider a probability space (Ω, F, P) and let A, B ∈ F with P(B) > 0. Then the conditional probability that A occurs given that B occurs is defined to be:

P(A|B) := P(A ∩ B) / P(B).

Example 1.3.2. A family has two children of different ages. What is the probability that both children are boys, given that at least one is a boy? The older and younger child may each be either a boy or a girl, so the sample space is:

Ω = {GG, BB, GB, BG}

and we assume that all outcomes are equally likely, P({GG}) = P({BB}) = P({GB}) = P({BG}) = 1/4 (the uniform distribution). We are interested in:

P(BB | GB ∪ BB ∪ BG) = P(BB ∩ (GB ∪ BB ∪ BG)) / P(GB ∪ BB ∪ BG)
                     = P(BB) / P(GB ∪ BB ∪ BG)
                     = (1/4) / (3/4) = 1/3.

One can also ask the question: what is the probability that in a family of two children, where the younger of the two is a boy, both are boys? We want:

P(BB | BG ∪ BB) = P(BB ∩ (BG ∪ BB)) / P(BG ∪ BB)
                = P(BB) / P(BG ∪ BB)
                = (1/4) / (1/2) = 1/2.
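The two answers above can be checked by a small Monte Carlo sketch in Python (illustrative only, not part of the printed notes): we generate many two-child families uniformly at random and count, within each conditioning event, how often both children are boys.

```python
# Simulation check of Example 1.3.2: estimate P(both boys | at least one boy)
# and P(both boys | younger is a boy) by counting.
import random

random.seed(1)
n = 100_000
at_least_one = [0, 0]    # [size of conditioning event, number of BB inside it]
younger_boy = [0, 0]

for _ in range(n):
    older, younger = random.choice("BG"), random.choice("BG")
    both = (older == "B" and younger == "B")
    if older == "B" or younger == "B":     # at least one boy
        at_least_one[0] += 1
        at_least_one[1] += both
    if younger == "B":                     # the younger child is a boy
        younger_boy[0] += 1
        younger_boy[1] += both

print(at_least_one[1] / at_least_one[0])   # close to 1/3
print(younger_boy[1] / younger_boy[0])     # close to 1/2
```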


1.3.1 Bayes Theorem

An important result in conditional probability is Bayes theorem.

Theorem 1.3.3. Consider a probability space (Ω, F, P) and let A, B ∈ F with P(A), P(B) > 0. Then we have:

P(B|A) = P(A|B)P(B) / P(A).

Proof. By definition

P(B|A) = P(A ∩ B) / P(A).

As P(B) > 0, P(A ∩ B) = P(A|B)P(B) and hence:

P(B|A) = P(A|B)P(B) / P(A).

1.3.2 Theorem of Total Probability

We will use Bayes Theorem after introducing the following result (the Theorem of Total Probability). We begin with a preliminary definition:

Definition 1.3.4. A family of sets B1, . . . , Bn is called a partition of Ω if:

∀i ≠ j, Bi ∩ Bj = ∅  and  ⋃_{i=1}^n Bi = Ω.

Lemma 1.3.5. Consider a probability space (Ω, F, P) and, for any fixed n ≥ 2, let B1, . . . , Bn ∈ F be a partition of Ω with P(Bi) > 0, ∀i ∈ {1, . . . , n}. Then, for any A ∈ F:

P(A) = ∑_{i=1}^n P(A|Bi)P(Bi).

Proof. We give the proof for n = 2, the other cases being very similar. We have A = (A ∩ B1) ∪ (A ∩ B2) (recall B1 ∪ B2 = Ω and B2 = B1^c), so

P(A) = P(A ∩ B1) + P(A ∩ B2) = P(A|B1)P(B1) + P(A|B2)P(B2).

Example 1.3.6. Consider two factories which manufacture a product. If the product comes from factory I, it is defective with probability 1/5, and if it is from factory II, it is defective with probability 1/20. It is twice as likely that a product comes from factory I. What is the probability that a given product operates properly (i.e. is not defective)? Let A denote the event 'operates properly' and B denote the event 'the product is made in factory I'. Then:

P(A) = P(A|B)P(B) + P(A|B^c)P(B^c) = (4/5)(2/3) + (19/20)(1/3) = 51/60.

If a product is defective, what is the probability that it came from factory I? This is P(B|A^c); by Bayes Theorem

P(B|A^c) = P(A^c|B)P(B) / P(A^c) = (1/5)(2/3) / (9/60) = 8/9.
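The calculation can be reproduced with exact arithmetic; the following short Python sketch (illustrative, using only the standard library) applies the Theorem of Total Probability and then Bayes Theorem.

```python
# Example 1.3.6 with exact fractions: total probability and Bayes Theorem.
from fractions import Fraction

p_B = Fraction(2, 3)                 # product comes from factory I
p_Bc = 1 - p_B                       # product comes from factory II
p_A_given_B = 1 - Fraction(1, 5)     # operates properly | factory I
p_A_given_Bc = 1 - Fraction(1, 20)   # operates properly | factory II

p_A = p_A_given_B * p_B + p_A_given_Bc * p_Bc        # theorem of total probability
p_B_given_Ac = (1 - p_A_given_B) * p_B / (1 - p_A)   # Bayes Theorem

print(p_A)            # 17/20, i.e. 51/60
print(p_B_given_Ac)   # 8/9
```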


1.4 Independence

The idea of independence is roughly that the probability of an event A is 'unaffected' by the occurrence of a (non-null) event B; that is, P(A) = P(A|B) when P(B) > 0. More formally:

Definition 1.4.1. Consider a probability space (Ω, F, P) and let A, B ∈ F. A and B are independent if

P(A ∩ B) = P(A)P(B).

More generally, a family of F-sets A1, . . . , An (∞ > n ≥ 2) are independent if

P(⋂_{i=1}^n Ai) = ∏_{i=1}^n P(Ai).

Two important concepts include pairwise independence:

P(Ai ∩ Aj) = P(Ai)P(Aj)  ∀i ≠ j.

This does not necessarily mean that A1, . . . , An are independent events. Another important concept is conditional independence: given an event C ∈ F with P(C) > 0, the events A, B ∈ F are conditionally independent if

P(A ∩B|C) = P(A|C)P(B|C).

This can be extended to families of sets Ai given C.

Example 1.4.2. Consider a pack of playing cards, and suppose one chooses a card completely at random (i.e. no card counting etc). One can assume that the probability of choosing a suit (e.g. spade) is independent of its rank. So, for example:

P(spade king) = P(spade)P(king) = (13/52)(4/52) = 1/52.


Chapter 2

Random Variables and their Distributions

2.1 Introduction

Throughout the Chapter we assume that there is a probability space (Ω, F, P), but its presence is minimized in the discussion. This is the main Chapter of the course and we focus upon random variables and their distributions (Section 2.2). In particular, a random variable is (essentially) just a map from Ω to some subset of the real line. Once we are there, we introduce notions of probability through distribution functions. Sometimes the random variables take values in a countable space (possibly countably infinite) and we call such random variables discrete (Section 2.3). The probabilities of such random variables are linked to probability mass functions and we use this concept to revisit independence and conditional probability. We also consider expectation (the 'theoretical average') and conditional expectation for discrete random variables. These ideas are revisited when the random variables are continuous (Section 2.4). The Chapter is concluded when we discuss the convergence of random variables (Section 2.5) and, in particular, the famous central limit theorem.

2.2 Random Variables and Distribution Functions

2.2.1 Random Variables

In general, we are often concerned not with an experiment itself, but with an associated consequence of the outcome. For example, a gambler is often interested in his or her profit or loss, rather than in the result that occurs. To deal with this issue, we introduce the notion of a random variable:

Definition 2.2.1. A random variable is a function X : Ω → R such that for each x ∈ R, {ω ∈ Ω : X(ω) ≤ x} ∈ F. Such a function is said to be F-measurable.

For the purpose of this course, the technical constraint of F-measurability can be ignored; you can content yourself to think of X(ω) as a mapping from the sample space to the real line and continue onwards. In general we omit the argument ω and just write X, with possible realized values (numbers) written in lower-case x. The distinction between a random variable X and its realized value is very important and you should maintain this convention. To move from the random variable back to our probability measure P, we write the event {ω ∈ Ω : X(ω) ≤ x} as {X ≤ x}, and hence we write P(X ≤ x) to mean P({ω ∈ Ω : X(ω) ≤ x}) (recall {ω ∈ Ω : X(ω) ≤ x} ∈ F and so this makes sense). This leads us to the notion of a distribution function:

2.2.2 Distribution Functions

Definition 2.2.2. The distribution function of a random variable X is the function F : R → [0, 1] given by F(x) = P(X ≤ x), x ∈ R.

Example 2.2.3. Consider flipping a fair coin twice; then Ω = {HH, TT, HT, TH}, with F = {0, 1}^Ω and P({HH}) = P({TT}) = P({HT}) = P({TH}) = 1/4. Define a random variable X as the number of heads; so

X(HH) = 2,  X(HT) = X(TH) = 1,  X(TT) = 0.


The associated distribution function is:

F(x) = 0 if x < 0;  1/4 if 0 ≤ x < 1;  3/4 if 1 ≤ x < 2;  1 if x ≥ 2.
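As an illustrative aside (not in the original notes), this distribution function can be recovered by brute-force enumeration of the four equally likely outcomes:

```python
# Enumerate the outcomes of two fair coin flips and evaluate F(x) = P(X <= x),
# where X is the number of heads.
from itertools import product

outcomes = list(product("HT", repeat=2))        # HH, HT, TH, TT
X = [flips.count("H") for flips in outcomes]    # value of X on each outcome

def F(x):
    # each outcome has probability 1/4
    return sum(1 for v in X if v <= x) / len(outcomes)

for x in (-0.5, 0, 0.5, 1, 1.5, 2, 3):
    print(x, F(x))   # 0, 1/4, 1/4, 3/4, 3/4, 1, 1
```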

A distribution function has the following properties, which we state without proof; see [1] for the proof. The lemma characterizes a distribution function: a function F is a distribution function for some random variable if and only if it satisfies the three properties in the following lemma.

Lemma 2.2.4. A distribution function F has the following properties:

1. limx→−∞ F (x) = 0, limx→∞ F (x) = 1,

2. If x < y then F (x) ≤ F (y),

3. F is right continuous; limδ↓0 F (x+ δ) = F (x) for each x ∈ R.

Example 2.2.5. Consider flipping a coin as in Example 1.2.4, which lands heads with probability p ∈ (0, 1). Let X : Ω → R be given by

X(H) = 1 X(T ) = 0.

The associated distribution function is

F(x) = 0 if x < 0;  1 − p if 0 ≤ x < 1;  1 if x ≥ 1.

X is said to have a Bernoulli distribution.

We end the Section with another lemma, whose proof can be found in [1].

Lemma 2.2.6. A distribution function F of random variable X, has the following properties:

1. P(X > x) = 1− F (x) for any fixed x ∈ R,

2. P(x < X ≤ y) = F (y)− F (x) for any fixed x, y ∈ R with x < y,

3. P(X = x) = F (x)− limy↑x F (y) for any fixed x ∈ R.

2.3 Discrete Random Variables

2.3.1 Probability Mass Functions

We now move onto an important class of random variable called discrete random variables.

Definition 2.3.1. A random variable is said to be discrete if it takes values in some countable subset X = {x1, x2, . . .} of R.

A discrete random variable takes values only at countably many points, and hence its distribution function is a jump (step) function, constant except for jumps at those points. An important concept is the probability mass function (PMF):

Definition 2.3.2. The probability mass function of a discrete random variable X is the function f : X → [0, 1] defined by f(x) = P(X = x).

Remark 2.3.3. We generally use sans-serif notation X, Z to denote the supports of random variables, that is, the range of values for which the PMF (or PDF, as will be defined much later on) is (potentially) non-zero. Note, however, that the PMF may be defined on, say, Z or R but is taken as (potentially) non-zero only at those points in X; it is always zero outside X. We call X the support of the random variable.


We remark that

F(x) = ∑_{i : xi ≤ x} f(xi),   f(x) = F(x) − lim_{y↑x} F(y),

which provides an association between the distribution function and the PMF. We have the following result, whose proof follows easily from the above definitions and results.

Lemma 2.3.4. A PMF satisfies:

1. the set of x such that f(x) ≠ 0 is countable,

2. ∑_{x∈X} f(x) = 1.

Example 2.3.5. A coin is flipped n times and the probability one obtains a head is p ∈ (0, 1); Ω = {H, T}^n. Let X denote the number of heads, which takes values in the set X = {0, 1, . . . , n} and is thus a discrete random variable. Consider x ∈ X: exactly (n choose x) points in Ω give us x heads and each such point occurs with probability p^x(1 − p)^{n−x}; hence

f(x) = (n choose x) p^x (1 − p)^{n−x},  x ∈ X.

The random variable X is said to have a Binomial distribution and this is denoted X ∼ B(n, p). Note that

(n choose x) = n! / ((n − x)! x!)

with x! = x × (x − 1) × · · · × 1 and 0! = 1.
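The following sketch (illustrative only; it assumes Python 3.8+ for math.comb) evaluates the B(n, p) PMF and confirms numerically that it sums to 1:

```python
# Binomial PMF of Example 2.3.5, built from the binomial coefficient.
from math import comb

def binomial_pmf(x, n, p):
    # f(x) = C(n, x) p^x (1 - p)^(n - x) for x in {0, 1, ..., n}
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
print(sum(pmf))   # 1.0 (up to rounding error)
print(pmf[3])     # P(X = 3), approximately 0.2668
```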

Example 2.3.6. Let λ > 0 be given. A random variable X that takes values on X = {0, 1, 2, . . .} is said to have a Poisson distribution with parameter λ (denoted X ∼ P(λ)) if its PMF is:

f(x) = λ^x e^{−λ} / x!,  x ∈ X.

2.3.2 Independence

We now extend the notion of independence of events to the domain of random variables. Recall that events A and B are independent if the probability of their intersection is equal to the product of their probabilities (A does not affect B).

Definition 2.3.7. Discrete random variables X and Y are independent if the events {X = x} and {Y = y} are independent for each (x, y) ∈ X × Y.

To understand this idea, let X = {x1, x2, . . .}, Y = {y1, y2, . . .}; then X and Y are independent if and only if the events Ai = {X = xi}, Bj = {Y = yj} are independent for every possible pair i, j.

Example 2.3.8. Consider flipping a coin once, which lands tail with probability p ∈ (0, 1). Let X be the number of heads seen and Y the number of tails. Then:

P(X = Y = 1) = 0

and

P(X = 1)P(Y = 1) = (1 − p)p ≠ 0,

so X and Y cannot be independent.

A useful result (which we do not prove) that is worth noting is the following.

Theorem 2.3.9. If X and Y are independent random variables and g : X → R, h : Y → R, then the random variables g(X) and h(Y) are also independent.

In full generality (this point will make more sense once we reach Section 2.3.4), consider a sequence of discrete random variables X1, X2, . . . , Xn, Xi ∈ Xi; they are said to be independent if the events {X1 = x1}, . . . , {Xn = xn} are independent for every possible (x1, . . . , xn) ∈ X1 × · · · × Xn. That is:

P(X1 = x1, . . . , Xn = xn) = ∏_{i=1}^n P(Xi = xi)  ∀(x1, . . . , xn) ∈ X1 × · · · × Xn.


2.3.3 Expectation

Throughout your statistical training, you may have encountered the notion of an average or mean value of data. In this Section we consider the idea of the average value of a random variable, which is called the expected value.

Definition 2.3.10. The expected value of a discrete random variable X on X, with PMF f, denoted E[X], is defined as

E[X] := ∑_{x∈X} x f(x)

whenever the sum on the R.H.S. is absolutely convergent.

Example 2.3.11. Recall the Poisson random variable from Example 2.3.6, X ∼ P(λ). The expected value is:

E[X] = ∑_{x=0}^∞ x λ^x e^{−λ} / x!
     = λ e^{−λ} ∑_{x=1}^∞ λ^{x−1} / (x − 1)!
     = λ e^{−λ} e^{λ}
     = λ,

where we have used the Taylor series expansion for the exponential function on the third line.
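As a quick numerical sanity check (not part of the notes; the assumption is that truncating the infinite sum at a large value is accurate enough), the defining sum can be evaluated directly:

```python
# Truncated-sum check that E[X] = lambda for the Poisson PMF.
from math import exp, factorial

lam = 3.7
mean = sum(x * lam ** x * exp(-lam) / factorial(x) for x in range(200))
print(mean)   # approximately 3.7
```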

We now look at the idea of an expectation of a function of a random variable:

Lemma 2.3.12. Given a discrete random variable X on X, with PMF f, and g : X → R:

E[g(X)] = ∑_{x∈X} g(x) f(x)

whenever the sum on the R.H.S. is absolutely convergent.

Example 2.3.13. Returning to Example 2.3.11, we have

E[X²] = ∑_{x=0}^∞ x² λ^x e^{−λ} / x!
      = λ e^{−λ} ∑_{x=1}^∞ x λ^{x−1} / (x − 1)!
      = λ e^{−λ} ∑_{x=1}^∞ (d/dλ)(λ^x) · 1/(x − 1)!
      = λ e^{−λ} (d/dλ)[λ ∑_{x=1}^∞ λ^{x−1} / (x − 1)!]
      = λ e^{−λ} (d/dλ)(λ e^{λ})
      = λ² + λ,

where we have assumed that it is legitimate to swap differentiation and summation (it turns out that this is true here).

An important concept is the moment generating function:

Definition 2.3.14. For a discrete random variable X the moment generating function (MGF) is

M(t) = E[e^{Xt}] = ∑_{x∈X} e^{xt} f(x),  t ∈ T,

where T is the set of t for which ∑_{x∈X} e^{xt} f(x) < ∞.


Exercise 2.3.15. Show that

E[X] = M′(0),  E[X²] = M″(0)

when the right-hand derivatives exist.

The moment generating function is thus a simple way to obtain moments, if it is simple to differentiate M(t). Note also that it can be proven that the MGF uniquely characterizes a distribution.

Example 2.3.16. Let X ∼ P(λ); then:

E[e^{Xt}] = ∑_{x=0}^∞ e^{xt} (λ^x / x!) e^{−λ}
          = ∑_{x=0}^∞ ((λe^t)^x / x!) e^{−λ}
          = exp{λ(e^t − 1)}.

Then M′(0) = λ.
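The differentiation in Exercise 2.3.15 can also be carried out symbolically; the sketch below assumes the SymPy library is available and recovers E[X] = λ and E[X²] = λ² + λ from the Poisson MGF above.

```python
# Symbolic check of the Poisson MGF: M(t) = exp(lambda*(e^t - 1)).
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)
M = sp.exp(lam * (sp.exp(t) - 1))

first = sp.diff(M, t).subs(t, 0)        # E[X]
second = sp.diff(M, t, 2).subs(t, 0)    # E[X^2]
print(sp.simplify(first))               # lambda
print(sp.simplify(second))              # lambda**2 + lambda
```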

An important special case of functions of random variables is the following:

Definition 2.3.17. Given a discrete random variable X on X, with PMF f, and k ∈ Z+, the kth moment of X is

E[X^k]

and the kth central moment of X is

E[(X − E[X])^k].

Of particular importance is the idea of the variance:

Var[X] := E[(X − E[X])²].

Now, we have the following calculation:

E[(X − E[X])²] = ∑_{x∈X} (x − E[X])² f(x)
               = ∑_{x∈X} (x² − 2xE[X] + E[X]²) f(x)
               = ∑_{x∈X} x² f(x) − 2E[X] ∑_{x∈X} x f(x) + E[X]² ∑_{x∈X} f(x)
               = E[X²] − 2E[X]² + E[X]² = E[X²] − E[X]².

Hence we have shown that

Var[X] = E[X²] − E[X]².   (2.3.1)

Note a very important point: for any absolutely convergent sum ∑ an, if an ≥ 0 for each n, then ∑ an ≥ 0. Now, as Var[X] := E[(X − E[X])²] = ∑_{x∈X} (x − E[X])² f(x) and clearly

(x − E[X])² f(x) ≥ 0  ∀x ∈ X,

it follows that variances cannot be negative (this is very important and you will not be told again).

Example 2.3.18. Returning to Examples 2.3.11 and 2.3.13, when X ∼ P(λ) we have shown that

E[X²] = λ² + λ,  E[X] = λ.

Hence, on using (2.3.1):

Var[X] = λ.

That is, for a Poisson random variable, its mean and variance are equal.


Exercise 2.3.19. Compute the mean and variance for a Binomial B(n, p) random variable. [Hint: consider the differentiation, w.r.t. x, of the equality

∑_{k=0}^n (n choose k) x^k = (1 + x)^n.]

We now state a Theorem, whose proof we do not give. In general, it is simple to establish once the concept of a joint distribution has been introduced; we have not done this, so we simply state the result.

Theorem 2.3.20. The expectation operator has the following properties:

1. if X ≥ 0, E[X] ≥ 0

2. if a, b ∈ R then E[aX + bY ] = aE[X] + bE[Y ]

3. if X = c ∈ R always, then E[X] = c.

An important result that we will use later on and whose proof is omitted is as follows.

Lemma 2.3.21. If X and Y are independent then E[XY ] = E[X]E[Y ].

A notion of dependence (linear dependence) is correlation. This will be discussed in detail later on.

Definition 2.3.22. X and Y are uncorrelated if E[XY ] = E[X]E[Y ].

It is important to remark that random variables that are independent are uncorrelated. However, uncorrelated variables are not necessarily independent; we will explore this idea later on.

We end the section with some useful properties of the variance operator.

Theorem 2.3.23. For random variables X and Y

1. For a ∈ R, Var[aX] = a2Var[X],

2. For X,Y uncorrelated Var[X + Y ] = Var[X] + Var[Y ].

Proof. For 1. we have:

Var[aX] = E[(aX)²] − E[aX]²
        = a²E[X²] − a²E[X]² = a²Var[X],

where we have used Theorem 2.3.20 2. in the second line.

Now for 2. we have:

Var[X + Y] = E[(X + Y − E[X + Y])²]
           = E[X² + Y² + 2XY + E[X + Y]² − 2(X + Y)E[X + Y]]
           = E[X²] + E[Y²] + 2E[XY] + E[X + Y]² − 2E[X + Y]²
           = E[X²] + E[Y²] + 2E[X]E[Y] − (E[X]² + E[Y]² + 2E[X]E[Y])
           = E[X²] − E[X]² + E[Y²] − E[Y]² = Var[X] + Var[Y],

where we have repeatedly used Theorem 2.3.20 2. and, in the fourth line, the fact that X and Y are uncorrelated (so E[XY] = E[X]E[Y]).

2.3.4 Dependence

As we saw in the previous Section, there is a need to define distributions on more than one random variable. We will do that in this Section. We start with the following definition (which one can easily generalize to a collection of n ≥ 1 discrete random variables):

Definition 2.3.24. The joint distribution function F : R² → [0, 1] of X, Y, where X and Y are discrete random variables, is given by

F(x, y) = P({X ≤ x} ∩ {Y ≤ y}).

Their joint mass function f : R² → [0, 1] is given by

f(x, y) = P({X = x} ∩ {Y = y}).


        y = −1   y = 0   y = 2   f(x)
x = 1    1/18     3/18    2/18    6/18
x = 2    2/18     0       3/18    5/18
x = 3    0        4/18    3/18    7/18
f(y)     3/18     7/18    8/18

Table 2.1: The joint mass function in Example 2.3.26. The row totals give the marginal mass function of X (and sum to 1) and the column totals give the marginal mass function of Y.

In general, it may be that the random variables are defined on a space Z which may not be decomposable into a cartesian product X × Y. In such scenarios we write the joint support as simply Z and omit X and Y; in this scenario, we will use the notation ∑_x or ∑_y to denote sums over the supports induced by Z. This concept will be clarified during a reading of the subsequent text. Of particular importance are the marginal PMFs of X and Y:

f(x) = ∑_y f(x, y),  f(y) = ∑_x f(x, y).

Note that, clearly,

∑_x f(x) = 1 = ∑_y f(y).

Example 2.3.25. Consider Theorem 2.3.20 2. and for simplicity suppose Z = X × Y. Now, we have

E[aX + bY] = ∑_{x∈X} ∑_{y∈Y} (ax + by) f(x, y)
           = ∑_{x∈X} ax ∑_{y∈Y} f(x, y) + ∑_{y∈Y} by ∑_{x∈X} f(x, y)
           = a ∑_{x∈X} x f(x) + b ∑_{y∈Y} y f(y)
           = aE[X] + bE[Y].

Example 2.3.26. Suppose X = {1, 2, 3} and Y = {−1, 0, 2}; then an example of a joint PMF can be found in Table 2.1. From the table, we have:

E[XY] = ∑_{x∈X} ∑_{y∈Y} x y f(x, y) = 29/18

(just sum the 9 values in the table, multiplying each time by the product of the associated x and y). Similarly

E[X] = 1(6/18) + 2(5/18) + 3(7/18) = 37/18

and

E[Y] = −1(3/18) + 0(7/18) + 2(8/18) = 13/18.

We now formalize independence in a result, which provides us a more direct way to ascertain if two random variables X and Y are independent.

Lemma 2.3.27. The discrete random variables X and Y are independent if and only if

f(x, y) = f(x)f(y)  ∀x, y ∈ R.

More generally, X and Y are independent if and only if f(x, y) factorizes into a product g(x)h(y), with g a function only of x and h a function only of y.

Example 2.3.28. Consider the joint PMF:

f(x, y) = (λ^x e^{−λ} / x!) (λ^y e^{−λ} / y!),  X = Y = {0, 1, . . .}.


Now clearly

f(x) = λ^x e^{−λ} / x!, x ∈ X,   f(y) = λ^y e^{−λ} / y!, y ∈ Y.

It is also clear via Lemma 2.3.27 that the random variables X and Y are independent and identically distributed (and Poisson distributed).

As in the case of a single variable, we are interested in the expectation of a function of two random variables (strictly, in the proof of Theorem 2.3.23 we have already used the following result):

Lemma 2.3.29. E[g(X, Y)] = ∑_{(x,y)∈X×Y} g(x, y) f(x, y).

Recall that in the previous Section, we mentioned a notion of dependence called correlation. In order to formally define this concept, we introduce first the covariance and then the correlation.

Definition 2.3.30. The covariance between X and Y is

Cov[X, Y] := E[(X − E[X])(Y − E[Y])].

The correlation between X and Y is

ρ(X, Y) = Cov[X, Y] / √(Var[X]Var[Y]).

Exercise 2.3.31. Show that

E[(X − E[X])(Y − E[Y ])] = E[XY ]− E[X]E[Y ].

Thus, independent random variables have zero correlation.

Example 2.3.32. Returning to Example 2.3.26, we have:

Var[X] = 233/324,  Var[Y] = 461/324

and

Cov[X, Y] = 29/18 − (37/18)(13/18) = 41/324.

Thus

ρ(X, Y) = 41/√107413.
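All of the quantities in Examples 2.3.26 and 2.3.32 can be recomputed directly from Table 2.1; the following Python sketch (not part of the notes) does so with exact fractions.

```python
# Moments, covariance and correlation from the joint PMF of Table 2.1.
from fractions import Fraction as Fr
from math import sqrt

table = {(1, -1): Fr(1, 18), (1, 0): Fr(3, 18), (1, 2): Fr(2, 18),
         (2, -1): Fr(2, 18), (2, 0): Fr(0),     (2, 2): Fr(3, 18),
         (3, -1): Fr(0),     (3, 0): Fr(4, 18), (3, 2): Fr(3, 18)}

EX   = sum(x * p for (x, y), p in table.items())
EY   = sum(y * p for (x, y), p in table.items())
EXY  = sum(x * y * p for (x, y), p in table.items())
VarX = sum(x * x * p for (x, y), p in table.items()) - EX ** 2
VarY = sum(y * y * p for (x, y), p in table.items()) - EY ** 2
Cov  = EXY - EX * EY

print(EX, EY, EXY)      # 37/18 13/18 29/18
print(VarX, VarY, Cov)  # 233/324 461/324 41/324
print(Cov / sqrt(VarX * VarY), 41 / sqrt(107413))   # both approximately 0.125
```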

Some useful facts about correlations that we do not prove:

• |ρ(X, Y)| ≤ 1. Random variables with correlation 1 or −1 are said to be perfectly positively or negatively correlated.

• The correlation coefficient is 1 iff Y increases linearly with X, and −1 iff Y decreases linearly as X increases.

When we consider continuous random variables, an example of uncorrelated random variables that are not independent will be given.

2.3.5 Conditional Distributions and Expectations

In Section 1.3 we discussed the idea of conditional probabilities associated to events. This idea can then be extended to discrete-valued random variables:

Definition 2.3.33. The conditional distribution function of Y given X = x, written F_{Y|X}(·|x), is defined by

F_{Y|X}(y|x) = P(Y ≤ y | X = x)

for any x with P(X = x) > 0. The conditional PMF of Y given X = x is defined by

f(y|x) = P(Y = y | X = x)

when x is such that P(X = x) > 0.


We remark that, in particular, the conditional mass function of Y given X = x is:

f(y|x) = f(x, y) / f(x) = f(x, y) / ∑_y f(x, y).

Example 2.3.34. Suppose that Y ∼ P(λ) and X|Y = y ∼ B(y, p). Find the conditional probability mass function of Y|X = x. Note that the random variables lie in the space Z = {(x, y) : y ∈ {0, 1, . . .}, x ∈ {0, 1, . . . , y}}. We have

f(y|x) = f(x, y) / f(x) = f(x|y)f(y) / f(x),  (x, y) ∈ Z,

which is a version of Bayes Theorem for discrete random variables. Note that for (x, y) ∈ Z all the PMFs above are positive. Now for x ∈ {0, 1, . . .}

f(x) = ∑_{y=x}^∞ (y choose x) p^x (1 − p)^{y−x} λ^y e^{−λ} / y!
     = ((λp)^x e^{−λ} / x!) ∑_{y=x}^∞ ((1 − p)λ)^{y−x} / (y − x)!
     = ((λp)^x / x!) e^{−λp}.

Thus, we have for y ∈ {x, x + 1, . . .}:

f(y|x) = [(y choose x) p^x (1 − p)^{y−x} λ^y e^{−λ} / y!] / [((λp)^x / x!) e^{−λp}],

which after some algebra becomes:

f(y|x) = (λ(1 − p))^{y−x} e^{−λ(1−p)} / (y − x)!,  y ∈ {x, x + 1, . . .}.

Given the idea of a conditional distribution, we move onto the idea of a conditional expectation:

Definition 2.3.35. The conditional expectation of a random variable Y, given X = x, is

E[Y|X = x] = ∑_y y f(y|x),

given that the conditional PMF is well-defined. We generally write E[Y|X] or E[Y|x].

An important result associated to conditional expectations is as follows:

Theorem 2.3.36. The conditional expectation satisfies:

E[E[Y|X]] = E[Y],

assuming the expectations all exist.

Proof. We have

E[Y] = ∑_y y f(y)
     = ∑_{(x,y)∈Z} y f(x, y)
     = ∑_{(x,y)∈Z} y f(y|x) f(x)
     = ∑_x [∑_y y f(y|x)] f(x)
     = E[E[Y|X]].


Example 2.3.37. Let us return to Example 2.3.34. Find E[X|Y], E[Y|X] and E[X]. From Exercise 2.3.19, you should have derived that if Z ∼ B(n, p) then E[Z] = np. Then

E[X|Y = y] = yp.

Thus

E[X] = E[E[X|Y]] = pE[Y].

As Y ∼ P(λ),

E[X] = pλ.

From Example 2.3.34,

f(y|x) = (λ(1 − p))^{y−x} e^{−λ(1−p)} / (y − x)!,  y ∈ {x, x + 1, . . .}.

Thus

E[Y|X = x] = ∑_{y=x}^∞ y (λ(1 − p))^{y−x} e^{−λ(1−p)} / (y − x)!.

Setting u = y − x in the summation, one obtains

E[Y|X = x] = ∑_{u=0}^∞ (u + x) (λ(1 − p))^u e^{−λ(1−p)} / u!
           = ∑_{u=0}^∞ u (λ(1 − p))^u e^{−λ(1−p)} / u! + x
           = λ(1 − p) + x.

Here, we have used the fact that U ∼ P(λ(1 − p)).
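A short simulation of the hierarchical model Y ∼ P(λ), X|Y = y ∼ B(y, p) agrees with the two conditional-expectation results above; the sketch assumes NumPy is available and is purely illustrative.

```python
# Check E[X] = p*lambda and E[Y | X = x] = x + lambda*(1 - p) by simulation.
import numpy as np

rng = np.random.default_rng(0)
lam, p, n = 4.0, 0.3, 200_000

Y = rng.poisson(lam, size=n)
X = rng.binomial(Y, p)                  # given Y = y, X is B(y, p)

print(X.mean(), p * lam)                # both approximately 1.2

x0 = 2
sel = Y[X == x0]                        # samples of Y given X = x0
print(sel.mean(), x0 + lam * (1 - p))   # both approximately 4.8
```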

We end the Section with a result which can be very useful in practice.

Theorem 2.3.38. We have, for any g : R → R,

E[E[Y|X]g(X)] = E[Y g(X)],

assuming the expectations all exist.

Proof. We have

E[Y g(X)] = ∑_{(x,y)∈Z} y g(x) f(x, y)
          = ∑_{(x,y)∈Z} y g(x) f(y|x) f(x)
          = ∑_x g(x) [∑_y y f(y|x)] f(x)
          = E[E[Y|X]g(X)].

2.4 Continuous Random Variables

In the previous Section, we considered random variables which take values in a finite or countably infinite set of real numbers. However, there are of course a wide variety of applications where one sees numerical outcomes of an experiment that can lie potentially anywhere on the real line. Thus we extend each of the concepts that we saw for discrete random variables to continuous ones. In a rather informal (and incorrect) manner, one can simply think of the idea of replacing summation with integration; of course things will be more challenging than this, but one should keep this idea in the back of your mind.


2.4.1 Probability Density Functions

A random variable is said to be continuous if its distribution function F(x) = P(X ≤ x) can be written as

F(x) = ∫_{−∞}^x f(u) du,

where f : R → [0, ∞), the R.H.S. is the usual Riemann integral and we will assume that the R.H.S. is differentiable.

Definition 2.4.1. The function f is called the probability density function (PDF) of the continuous random variable X.

We note that, under our assumptions, f(x) = F ′(x).

Example 2.4.2. One of the most commonly used PDFs is the Gaussian or normal distribution:

f(x) = (1/(σ√(2π))) exp{−(x − µ)²/(2σ²)},  x ∈ X = R,

where µ ∈ R, σ² > 0. We use the notation X ∼ N(µ, σ²).

One of the key points associated to PDFs is as follows. The numerical value f(x) does not represent the probability that X takes the value x. The technical explanation goes far beyond the mathematical level of this course, but perhaps an intuitive reason is simply that there are uncountably many points in X (so assigning a probability to each point is seemingly impossible). In general, one assigns probability to sets of 'non-zero width'. For example, let A = [a, b], −∞ < a < b < ∞; then one might expect:

P(X ∈ A) = ∫_A f(x) dx.

Indeed this holds true, but we are deliberately vague about this. We give the following result, which is not proved and should be taken as true.

Lemma 2.4.3. If X has a PDF f , then

1. ∫_X f(x) dx = 1.

2. P(X = x) = 0 for each x ∈ X.

3. P(X ∈ [a, b]) = ∫_a^b f(x) dx, −∞ ≤ a < b ≤ ∞.

Example 2.4.4. Returning to Example 2.4.2, we have

P(X ∈ X) = ∫_{−∞}^∞ (1/(σ√(2π))) exp{−(x − µ)²/(2σ²)} dx
         = (1/√(2π)) ∫_{−∞}^∞ e^{−u²/2} du = 1.

Here, we have used the substitution u = (x − µ)/σ to go to the second line and the fact that ∫_{−∞}^∞ e^{−u²/2} du = √(2π).
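A simple numerical check (not in the notes; it assumes NumPy is available) that the N(µ, σ²) density integrates to approximately 1 on a wide grid:

```python
# Riemann-sum check that the normal PDF integrates to one.
import numpy as np

mu, sigma = 1.5, 2.0
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
print(np.sum(pdf) * (x[1] - x[0]))   # approximately 1.0
```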

2.4.2 Independence

To define an idea of independence for continuous random variables, we cannot use the one for discrete random variables (recall Definition 2.3.7); the sets {X = x} and {Y = y} have zero probability and are hence trivially independent. Thus we use a new definition of independence:

Definition 2.4.5. The random variables X and Y are independent if

{X ≤ x}  and  {Y ≤ y}

are independent events for each x, y ∈ R.

As for PMFs, one can consider independence through PDFs; however, as we are yet to define the notion of joint PDFs, we will leave this idea for now. We note that one can show, for any real valued functions h and g (at least within some technical conditions which are assumed in this course), that if X and Y are independent, so are the random variables h(X) and g(Y). We do not prove why.


2.4.3 Expectation

As for discrete random variables, one can consider the idea of the average of a random variable. This is simply brought about by replacing summations with integrations:

Definition 2.4.6. The expectation of a continuous random variable X with PDF f is given by

E[X] = ∫_X x f(x) dx,

whenever the integral exists.

Example 2.4.7. Consider the exponential density:

f(x) = λe^{−λx},  x ∈ X = [0, ∞),  λ > 0.

We use the notation X ∼ E(λ). Then

E[X] = ∫_0^∞ x λe^{−λx} dx = [−xe^{−λx}]_0^∞ + ∫_0^∞ e^{−λx} dx = [−(1/λ)e^{−λx}]_0^∞ = 1/λ,

where we have used integration by parts.

Example 2.4.8. Let us return to Example 2.4.2. Then

E[X] = ∫_{−∞}^∞ x (1/(σ√(2π))) exp{−(x − µ)²/(2σ²)} dx
     = ∫_{−∞}^∞ (uσ + µ) (1/√(2π)) e^{−u²/2} du
     = σ ∫_{−∞}^∞ u (1/√(2π)) e^{−u²/2} du + µ
     = (σ/√(2π)) [−e^{−u²/2}]_{−∞}^∞ + µ
     = µ.

Here, we have used the substitution u = (x − µ)/σ to go to the second line and the fact that

∫_{−∞}^∞ (1/√(2π)) e^{−u²/2} du = 1

to go to the third.

Example 2.4.9. Consider the gamma density:

f(x) = (λ^α / Γ(α)) x^{α−1} e^{−λx},  x ∈ X = [0, ∞),  λ, α > 0,

where Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt. We use the notation X ∼ G(α, λ) and note that if X ∼ G(1, λ) then X ∼ E(λ). Now

E[X] = (λ^α / Γ(α)) ∫_0^∞ x^α e^{−λx} dx
     = (λ^α / Γ(α)) (1/λ^{α+1}) ∫_0^∞ u^α e^{−u} du
     = (1/(λΓ(α))) Γ(α + 1) = α/λ,

where we have used the substitution u = λx to go to the second line and Γ(α + 1) = αΓ(α) on the final line (from here on you may use that identity without proof).
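The mean just derived, and the variance α/λ² derived later in Example 2.4.13, can be checked by sampling; the sketch below assumes NumPy is available, noting that NumPy's gamma generator is parameterized by a scale equal to 1/λ.

```python
# Sampling check of the G(alpha, lambda) mean and variance.
import numpy as np

rng = np.random.default_rng(1)
alpha, lam = 2.5, 1.7
X = rng.gamma(shape=alpha, scale=1.0 / lam, size=500_000)
print(X.mean(), alpha / lam)        # both approximately 1.47
print(X.var(), alpha / lam ** 2)    # both approximately 0.865
```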

We now state a useful technical result.

Theorem 2.4.10. If X and g(X) (g : R → R) are continuous random variables, then

E[g(X)] = ∫_X g(x) f(x) dx.


We remark that all the notions related to expectations discussed in Section 2.3.3 extend to the continuous case. In particular, Definitions 2.3.17 and 2.3.22 can be imported to the continuous case. In addition, the results Theorems 2.3.20 and 2.3.23 and Lemma 2.3.21 all carry over. It is assumed that this is the case from here on.

Example 2.4.11. Let us return to Example 2.4.7:

E[X²] = ∫_0^∞ x² λe^{−λx} dx
      = [−x²e^{−λx}]_0^∞ + (2/λ) ∫_0^∞ x λe^{−λx} dx
      = 2/λ².

Thus, for an exponential random variable:

Var[X] = 1/λ².

Example 2.4.12. Let us return to Example 2.4.2. Then

E[X²] = ∫_{−∞}^∞ x² (1/(σ√(2π))) exp{−(x − µ)²/(2σ²)} dx
      = ∫_{−∞}^∞ (uσ + µ)² (1/√(2π)) e^{−u²/2} du
      = σ² ∫_{−∞}^∞ u² (1/√(2π)) e^{−u²/2} du + µ² + 2σµ ∫_{−∞}^∞ u (1/√(2π)) e^{−u²/2} du
      = (σ²/√(2π)) ([−ue^{−u²/2}]_{−∞}^∞ + ∫_{−∞}^∞ e^{−u²/2} du) + µ²
      = (σ²/√(2π)) √(2π) + µ² = σ² + µ².

Here we have used ∫_{−∞}^∞ u (1/√(2π)) e^{−u²/2} du = 0 and ∫_{−∞}^∞ e^{−u²/2} du = √(2π). Thus, for a normal random variable:

Var[X] = σ².

Example 2.4.13. Let us return to Example 2.4.9. Now

E[X²] = (λ^α / Γ(α)) ∫_0^∞ x^{α+1} e^{−λx} dx
      = (λ^α / Γ(α)) (1/λ^{α+2}) ∫_0^∞ u^{α+1} e^{−u} du
      = (1/(λ²Γ(α))) Γ(α + 2) = α(α + 1)/λ²,

where we have used Γ(α + 2) = (α + 1)αΓ(α). Thus, for a gamma random variable:

Var[X] = α/λ².

To end this section we consider an important concept: the MGF for continuous random variables.

Definition 2.4.14. For a continuous random variable X the moment generating function (MGF) is

M(t) = E[e^{Xt}] = ∫_X e^{xt} f(x) dx,  t ∈ T,

where T is the set of t for which ∫_X e^{xt} f(x) dx < ∞.

Exercise 2.4.15. Show that

E[X] = M′(0),  E[X²] = M″(0)

when the right-hand derivatives exist.


As for discrete random variables, the moment generating function is a simple way to obtain moments, if it is simple to differentiate M(t). Note also that it can be proven that the MGF uniquely characterizes a distribution.

Example 2.4.16. Suppose X ∼ E(λ); then, supposing λ > t,

M(t) = ∫_0^∞ λe^{−x(λ−t)} dx = [−(λ/(λ − t))e^{−x(λ−t)}]_0^∞ = λ/(λ − t).

Clearly

M′(t) = λ/(λ − t)²

and thus E[X] = 1/λ.

Example 2.4.17. Suppose X ∼ N(µ, σ²); then

M(t) = (1/(σ√(2π))) ∫_{−∞}^∞ exp{−(x − µ)²/(2σ²) + xt} dx
     = (1/(σ√(2π))) ∫_{−∞}^∞ exp{−(1/(2σ²))[(x − (µ + tσ²))² − (µ + tσ²)² + µ²]} dx
     = exp{µt + σ²t²/2} (1/(σ√(2π))) ∫_{−∞}^∞ exp{−(x − (µ + tσ²))²/(2σ²)} dx
     = exp{µt + σ²t²/2},

where we have used a change of variables u = (x − (µ + tσ²))/σ to deal with the integral.

2.4.4 Dependence

Just as for discrete random variables, one can consider the idea of joint distributions for continuous random variables.

Definition 2.4.18. The joint distribution function of X and Y is the function F : R² → [0, 1] given by

F(x, y) = P(X ≤ x, Y ≤ y).

Again, as for discrete random variables, one requires a PDF:

Definition 2.4.19. The random variables X and Y are jointly continuous with joint PDF f : R² → [0, ∞) if

F(x, y) = ∫_{−∞}^y ∫_{−∞}^x f(u, v) du dv

for each (x, y) ∈ R².

In this course, we will assume (generally) that

f(x, y) = ∂²F(x, y)/∂x∂y.

Note that P(X = x, Y = y) = 0 and, for sufficiently well defined sets A ⊆ Z ⊆ R²,

P((X, Y) ∈ A) = ∫_A f(x, y) dxdy;

note that the R.H.S. is a double integral, but we will only write one integral sign in such contexts. As for discrete random variables, we can introduce the idea of marginal PDFs. Here, we take a little longer to consider these ideas:


Definition 2.4.20. The marginal distribution functions of X and Y are

F(x) = lim_{y→∞} F(x, y),  F(y) = lim_{x→∞} F(x, y).

As

F(x) = ∫_{−∞}^x ∫_{−∞}^∞ f(u, y) dy du,  F(y) = ∫_{−∞}^y ∫_{−∞}^∞ f(x, u) dx du,

the marginal density functions of X and Y are

f(x) = ∫_{−∞}^∞ f(x, y) dy,  f(y) = ∫_{−∞}^∞ f(x, y) dx.

We remark that expectation is much the same for joint distributions (as in Section 2.4.3) of continuous random variables:

E[g(X, Y)] = ∫_Z g(x, y) f(x, y) dxdy = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) f(x, y) dxdy.

So, as before, if g(x, y) = ah(x) + bg(y) then

E[g(X, Y)] = aE[h(X)] + bE[g(Y)].

We now turn to independence; we state the following result with no proof. If you are unconvinced, simply take the below as a definition.

Theorem 2.4.21. The random variables X and Y are independent if and only if

F(x, y) = F(x)F(y)

or, equivalently,

f(x, y) = f(x)f(y).

Example 2.4.22. Let Z = (R⁺)² = X × Y and

f(x, y) = λ²e^{−λ(x+y)},  (x, y) ∈ Z,  λ > 0.

Then one has

f(x) = λe^{−λx}, x ∈ X,   f(y) = λe^{−λy}, y ∈ Y.

In addition, what is the probability that X > Y?

P(X > Y) = ∫_0^∞ ∫_0^x f(x, y) dy dx
         = ∫_0^∞ (∫_0^x λe^{−λy} dy) λe^{−λx} dx
         = ∫_0^∞ [−e^{−λy}]_0^x λe^{−λx} dx
         = ∫_0^∞ (1 − e^{−λx}) λe^{−λx} dx
         = [−e^{−λx} + (1/2)e^{−2λx}]_0^∞
         = 1 − 1/2 = 1/2.
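A Monte Carlo sketch (not part of the notes; it assumes NumPy, whose exponential generator takes a scale parameter equal to 1/λ) confirming P(X > Y) = 1/2 for independent E(λ) random variables:

```python
# Simulation check of Example 2.4.22.
import numpy as np

rng = np.random.default_rng(2)
lam, n = 0.8, 500_000
X = rng.exponential(scale=1.0 / lam, size=n)
Y = rng.exponential(scale=1.0 / lam, size=n)
print((X > Y).mean())   # approximately 0.5
```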

Again, one has the idea of covariance and correlation for continuous valued random variables:

Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
          = ∫_Z xy f(x, y) dxdy − ∫ x f(x) dx ∫ y f(y) dy,

ρ(X, Y) = Cov[X, Y] / √(Var[X]Var[Y]).


Example 2.4.23. Let Z = R² and define

f(x, y) = (1/(2π√(1 − ρ²))) exp{−(x² + y² − 2xyρ)/(2(1 − ρ²))},  (x, y) ∈ Z,  ρ ∈ (−1, 1).

Let us check that this is a PDF; clearly it is non-negative on Z. So

∫_Z f(x, y) dxdy = ∫_{−∞}^∞ ∫_{−∞}^∞ (1/(2π√(1 − ρ²))) exp{−(x² + y² − 2xyρ)/(2(1 − ρ²))} dxdy
 = (1/(2π√(1 − ρ²))) ∫_{−∞}^∞ exp{−(y² − y²ρ²)/(2(1 − ρ²))} ∫_{−∞}^∞ exp{−(x − yρ)²/(2(1 − ρ²))} dx dy
 = (1/√(2π)) ∫_{−∞}^∞ exp{−(y² − y²ρ²)/(2(1 − ρ²))} dy
 = (1/√(2π)) ∫_{−∞}^∞ exp{−(1 − ρ²)y²/(2(1 − ρ²))} dy = 1.

Here we have completed the square in the integral in x and used the change of variable u = (x − ρy)/(1 − ρ²)^{1/2}. Thus f(x, y) defines a joint PDF. Moreover, it is called the standard bi-variate normal distribution. Now let us consider the marginal PDFs:

f(x) = ∫_{−∞}^∞ (1/(2π√(1 − ρ²))) exp{−(x² + y² − 2xyρ)/(2(1 − ρ²))} dy,  x ∈ R
     = (1/(2π√(1 − ρ²))) exp{−(x² − x²ρ²)/(2(1 − ρ²))} ∫_{−∞}^∞ exp{−(y − xρ)²/(2(1 − ρ²))} dy
     = (1/√(2π)) e^{−x²/2}.

That is, X ∼ N(0, 1). By similar arguments, Y ∼ N(0, 1). Now, let us consider:

E[XY] = ∫_{−∞}^∞ ∫_{−∞}^∞ xy (1/(2π√(1 − ρ²))) exp{−(x² + y² − 2xyρ)/(2(1 − ρ²))} dxdy
      = (1/(2π√(1 − ρ²))) ∫_{−∞}^∞ y exp{−(y² − y²ρ²)/(2(1 − ρ²))} ∫_{−∞}^∞ x exp{−(x − yρ)²/(2(1 − ρ²))} dx dy
      = (1/(2π√(1 − ρ²))) ∫_{−∞}^∞ y exp{−(y² − y²ρ²)/(2(1 − ρ²))} ∫_{−∞}^∞ [u(1 − ρ²)^{1/2} + ρy](1 − ρ²)^{1/2} e^{−u²/2} du dy
      = (1/(2π√(1 − ρ²))) ∫_{−∞}^∞ y exp{−(y² − y²ρ²)/(2(1 − ρ²))} (1 − ρ²)^{1/2} ρy√(2π) dy
      = (ρ/√(2π)) ∫_{−∞}^∞ y² e^{−y²/2} dy = ρ.

In the third line the term involving ∫ u e^{−u²/2} du vanishes, since it is proportional to the mean of a standard normal; on the final line we have used the results in Examples 2.4.8 and 2.4.12 that the first and second (raw) moments of N(µ, σ²) are µ and µ² + σ². Therefore

Cov[X, Y] = ρ,
ρ(X, Y) = ρ/√(1 · 1) = ρ.

Note also that if ρ = 0 then:

f(x, y) = (1/(2π)) exp{−(x² + y²)/2},  (x, y) ∈ Z,

so, as

f(x) = (1/√(2π)) e^{−x²/2}, x ∈ R,   f(y) = (1/√(2π)) e^{−y²/2}, y ∈ R,

we have f(x, y) = f(x)f(y). That is, standard bi-variate normal random variables are independent if and only if they are uncorrelated. Note that this does not always occur for other random variables.
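A standard bi-variate normal pair can be simulated by setting X = Z1 and Y = ρZ1 + √(1 − ρ²)Z2 for independent standard normals Z1, Z2 (a standard construction, not given in the notes); the sample correlation is then close to ρ. The sketch assumes NumPy is available.

```python
# Simulation check of Example 2.4.23: correlation of a standard bivariate normal.
import numpy as np

rng = np.random.default_rng(3)
rho, n = 0.6, 500_000
Z1, Z2 = rng.standard_normal(n), rng.standard_normal(n)
X = Z1
Y = rho * Z1 + np.sqrt(1 - rho ** 2) * Z2

print(np.corrcoef(X, Y)[0, 1])   # approximately 0.6
print(X.std(), Y.std())          # both approximately 1, consistent with N(0,1) marginals
```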


We conclude this section with a mention of joint moment generating functions:

Definition 2.4.24. For continuous random variables X and Y the joint moment generating function is

M(t1, t2) = E[e^{Xt1 + Yt2}] = ∫_Z e^{xt1 + yt2} f(x, y) dxdy,  (t1, t2) ∈ T,

where T is the set of (t1, t2) for which ∫_Z e^{xt1 + yt2} f(x, y) dxdy < ∞.

Exercise 2.4.25. Show that

E[X] = ∂M(t1, t2)/∂t1 |_{t1=t2=0},  E[Y] = ∂M(t1, t2)/∂t2 |_{t1=t2=0},  E[XY] = ∂²M(t1, t2)/∂t1∂t2 |_{t1=t2=0}

when the derivatives exist.

2.4.5 Conditional Distributions and Expectations

The idea of conditional distributions and expectations can be further extended to continuous random variables. However, we recall that the event {X = x} has zero probability, so how can we condition on it? It turns out that we cannot give a rigorous answer to this question (i.e. at the level of mathematics in this course), so some faith must be extended towards the following definitions.

Definition 2.4.26. The conditional distribution function of Y given X = x is the function:

F(y|x) = ∫_{−∞}^y (f(x, v)/f(x)) dv

for any x such that f(x) > 0.

Definition 2.4.27. The conditional density function of F(y|x), written f(y|x), is given by:

f(y|x) = f(x, y)/f(x)

for any x such that f(x) > 0.

We remark that clearly

f(y|x) = f(x, y) / ∫_{−∞}^∞ f(x, y) dy.

Example 2.4.28. Suppose that Y|X = x ∼ N(x, 1) and X ∼ N(0, 1), Z = R². Find the PDFs f(x, y) and f(x|y). We have

f(x, y) = f(y|x)f(x),

so

f(x, y) = (1/(2π)) exp{−((y − x)² + x²)/2},  (x, y) ∈ Z.

Now

f(x|y) = f(x, y)/f(y) = f(y|x)f(x)/f(y),

which is Bayes Theorem for continuous random variables. Now for y ∈ R

f(y) = ∫_{−∞}^∞ (1/(2π)) exp{−((y − x)² + x²)/2} dx
     = (1/(2π)) exp{−y²/2} ∫_{−∞}^∞ exp{−(2x² − 2xy)/2} dx
     = (1/(2π)) exp{−y²/2} ∫_{−∞}^∞ exp{−[(x − y/2)² − y²/4]} dx
     = (1/(2π)) exp{−y²/4} ∫_{−∞}^∞ e^{−u²/2} (1/√2) du
     = (1/(2π)) exp{−y²/4} (√(2π)/√2)
     = (1/(√2 √(2π))) exp{−y²/4}.

So Y ∼ N(0, 2). Then:

f(x|y) = [(1/(2π)) exp{−((y − x)² + x²)/2}] / [(1/(√2 √(2π))) exp{−y²/4}],  x ∈ R.

After some calculations (exercise) one obtains:

f(x|y) = (1/√π) exp{−(x − y/2)²},  x ∈ R,

that is, X|Y = y ∼ N(y/2, 1/2).

Example 2.4.29. Let X and Y have joint PDF:

f(x, y) = 1/x,  (x, y) ∈ Z = {(x, y) : 0 ≤ y ≤ x ≤ 1}.

Then for x ∈ (0, 1]

f(x) = ∫_0^x (1/x) dy = 1.

In addition, for (x, y) ∈ Z:

f(y|x) = (1/x)/1 = 1/x,

that is, conditional upon X = x, Y ∼ U[0, x] (the uniform distribution on [0, x]). Now, suppose we are interested in P(X² + Y² ≤ 1 | X = x). Let x ≥ 0, and define

A(x) := {y ∈ R : 0 ≤ y ≤ x, x² + y² ≤ 1},

that is, A(x) = [0, x ∧ (1 − x²)^{1/2}], where a ∧ b = min{a, b}. Thus (x ∈ (0, 1])

P(X² + Y² ≤ 1 | X = x) = ∫_{A(x)} f(y|x) dy = ∫_0^{x ∧ (1−x²)^{1/2}} (1/x) dy = (x ∧ (1 − x²)^{1/2}) / x.

To compute P(X² + Y² ≤ 1), we note, by setting A = {(x, y) ∈ Z : x² + y² ≤ 1}, that we are interested in P(A), which is:

P(X² + Y² ≤ 1) = ∫_A f(x, y) dydx
               = ∫_0^1 (∫_{A(x)} f(y|x) dy) f(x) dx
               = ∫_0^1 (x ∧ (1 − x²)^{1/2}) / x dx.

Now x ≤ (1 − x²)^{1/2} if x ≤ 1/√2, hence

P(X² + Y² ≤ 1) = ∫_0^{1/√2} dx + ∫_{1/√2}^1 (1 − x²)^{1/2}/x dx.   (2.4.1)

Now we have, setting x = cos(θ),

∫_{1/√2}^1 (1 − x²)^{1/2}/x dx = ∫_0^{π/4} sin(θ)tan(θ) dθ = [log(|sec(θ) + tan(θ)|) − sin(θ)]_0^{π/4} = log(1 + √2) − 1/√2,

where we have used integration tables to obtain the integral (you will not be expected to perform such an integration in the examination). Thus, returning to (2.4.1):

P(X² + Y² ≤ 1) = 1/√2 + log(1 + √2) − 1/√2 = log(1 + √2).

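The value log(1 + √2) ≈ 0.8814 can be checked by simulation, drawing X uniformly on [0, 1] (its marginal density is 1) and then Y|X = x uniformly on [0, x]; the sketch below assumes NumPy is available and is purely illustrative.

```python
# Monte Carlo check of P(X^2 + Y^2 <= 1) in Example 2.4.29.
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
X = rng.uniform(0.0, 1.0, size=n)
Y = rng.uniform(0.0, X)                    # Y | X = x is uniform on [0, x]
print(((X ** 2 + Y ** 2) <= 1).mean())     # approximately 0.8814
print(np.log(1 + np.sqrt(2)))
```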

Given the previous discussion, the notion of a conditional expectation for continuous random variables is:

E[g(Y)|X = x] = ∫ g(y) f(y|x) dy.

As for discrete-valued random variables, we have the following result, which is again not proved.

Theorem 2.4.30. Consider jointly continuous random variables X and Y with g, h : R → R such that h(X), g(Y) are continuous. Then:

E[h(X)g(Y)] = E[E[g(Y)|X]h(X)] = ∫ (∫ g(y) f(y|x) dy) h(x) f(x) dx,

whenever the expectations exist.

Example 2.4.31. Let us return to Example 2.4.28. Then

E[XY] = E[Y E[X|Y]] = E[Y(Y/2)] = (1/2)(2) = 1.

Thus, as Var[X] = 1 and Var[Y] = 2,

ρ(X, Y) = 1/√2.

Example 2.4.32. Suppose Y|X = x ~ N(x, \sigma_1^2) and X ~ N(\xi, \sigma_2^2). To find the marginal distribution of Y, one can use MGFs and conditional expectations. We have

M(t) = E[e^{Yt}]
     = E[E[e^{Yt}|X]]
     = E[e^{Xt + (\sigma_1^2 t^2)/2}]
     = e^{(\sigma_1^2 t^2)/2} E[e^{Xt}]
     = e^{(\sigma_1^2 t^2)/2} e^{\xi t + (\sigma_2^2 t^2)/2}
     = \exp\{\xi t + (\sigma_1^2 + \sigma_2^2)t^2/2\}.

Here we have used Example 2.4.17 for the MGF of a normal distribution. Thus we can conclude that

Y ~ N(\xi,\ \sigma_1^2 + \sigma_2^2).

2.4.6 Functions of Random Variables

Consider a random variable X and g : R \to R a continuous and invertible function (continuity can be weakened, but let us use it for now). Suppose Y = g(X); what is the distribution of Y? We have

F(y) = P(Y \le y) = P(g(X) \le y) = P(X \in g^{-1}((-\infty, y])) = \int_{g^{-1}((-\infty,y])} f(x)\,dx.

Note that for A \subseteq R, g^{-1}(A) = \{x \in R : g(x) \in A\}.

Example 2.4.33. Let X ~ N(0, 1) and set

\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du

to denote the standard normal CDF. Suppose Y = X^2; then for y \ge 0 (note that the support of Y is R^+)

P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}).



Now

P(-\sqrt{y} \le X \le \sqrt{y}) = P(X \le \sqrt{y}) - P(X \le -\sqrt{y})
 = \Phi(\sqrt{y}) - \Phi(-\sqrt{y})
 = \Phi(\sqrt{y}) - [1 - \Phi(\sqrt{y})]
 = 2\Phi(\sqrt{y}) - 1.

Then

f(y) = \frac{d}{dy}F(y) = \frac{1}{\sqrt{y}}\Phi'(\sqrt{y}) = \frac{1}{\sqrt{2\pi y}} e^{-y/2}, \qquad y \in R^+,

that is, Y ~ G(1/2, 1/2). This is also called the chi-squared distribution on one degree of freedom. We remark that Y and X are clearly not independent (if X changes, so does Y, and in a deterministic manner). Now

E[XY] = E[X^3] = 0.

In addition E[X] = 0 and E[Y] = 1. So X and Y are uncorrelated, but they are not independent.

We now move onto a more general change of variable formula. Suppose X_1 and X_2 have a joint density on Z and we set

(Y_1, Y_2) = T(X_1, X_2)

where T : Z \to \mathcal{T} is differentiable and invertible, with \mathcal{T} \subseteq R^2. What is the joint density function of (Y_1, Y_2) on \mathcal{T}? As T is invertible, we set

X_1 = T_1^{-1}(Y_1, Y_2), \qquad X_2 = T_2^{-1}(Y_1, Y_2).

Then we define

J(y_1, y_2) := \begin{vmatrix} \frac{\partial T_1^{-1}}{\partial y_1} & \frac{\partial T_2^{-1}}{\partial y_1} \\ \frac{\partial T_1^{-1}}{\partial y_2} & \frac{\partial T_2^{-1}}{\partial y_2} \end{vmatrix} = \frac{\partial T_1^{-1}}{\partial y_1}\frac{\partial T_2^{-1}}{\partial y_2} - \frac{\partial T_2^{-1}}{\partial y_1}\frac{\partial T_1^{-1}}{\partial y_2}

to be the Jacobian of the transformation. Then we have the following result, which simply follows from the change of variable rule for integration:

Theorem 2.4.34. If (X_1, X_2) have joint density f(x_1, x_2) on Z, then for (Y_1, Y_2) = T(X_1, X_2), with T as described above, the joint density of (Y_1, Y_2), denoted g, is

g(y_1, y_2) = f\big(T_1^{-1}(y_1, y_2),\ T_2^{-1}(y_1, y_2)\big)\,|J(y_1, y_2)|, \qquad (y_1, y_2) \in \mathcal{T}.

Example 2.4.35. Suppose Z = (R^+)^2 and

f(x_1, x_2) = \lambda^2 e^{-\lambda(x_1 + x_2)}, \qquad (x_1, x_2) \in Z.

Let

(Y_1, Y_2) = (X_1 + X_2,\ X_1/X_2).

To find the joint density of (Y_1, Y_2) we first note that \mathcal{T} = Z and that

(X_1, X_2) = \Big(\frac{Y_1 Y_2}{1 + Y_2},\ \frac{Y_1}{1 + Y_2}\Big).

Then the Jacobian is

J(y_1, y_2) = \begin{vmatrix} \frac{y_2}{1+y_2} & \frac{1}{1+y_2} \\ \frac{y_1}{(1+y_2)^2} & -\frac{y_1}{(1+y_2)^2} \end{vmatrix} = -\frac{y_1}{(1+y_2)^2}.

Thus

g(y_1, y_2) = \frac{y_1}{(1+y_2)^2}\,\lambda^2 e^{-\lambda y_1}, \qquad (y_1, y_2) \in \mathcal{T}.

One can check that indeed Y_1 and Y_2 are independent and that the marginal densities are

g(y_1) = \lambda^2 y_1 e^{-\lambda y_1},\ y_1 \in R^+, \qquad g(y_2) = \frac{1}{(1+y_2)^2},\ y_2 \in R^+.
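These claims can be checked by simulation. The sketch below is an illustration added to these notes (numpy assumed; \lambda = 1.5 is an arbitrary choice): it checks the mean of Y_1 under G(2, \lambda), the value P(Y_2 \le 1) = 1/2 implied by g(y_2) = 1/(1+y_2)^2, and an elementary product check consistent with independence.

    import numpy as np

    rng = np.random.default_rng(2)
    lam, n = 1.5, 1_000_000

    x1 = rng.exponential(scale=1.0 / lam, size=n)
    x2 = rng.exponential(scale=1.0 / lam, size=n)
    y1, y2 = x1 + x2, x1 / x2

    print(y1.mean(), 2.0 / lam)          # Y1 ~ G(2, lambda) has mean 2/lambda
    print(np.mean(y2 <= 1.0))            # integral of 1/(1+y)^2 over [0, 1] is 1/2

    # independence check on two events: P(A and B) should factorize
    a = y1 <= np.median(y1)
    b = y2 <= 1.0
    print(np.mean(a & b), np.mean(a) * np.mean(b))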


Example 2.4.36. Suppose we have (X_1, X_2) as in Example 2.4.35, except with the mapping

(Y_1, Y_2) = (X_1,\ X_1 + X_2).

Now, clearly Y_2 \ge Y_1, so this transformation induces the support for (Y_1, Y_2):

\mathcal{T} = \{(y_1, y_2) : 0 \le y_1 \le y_2,\ y_2 \in R^+\}.

Then

(X_1, X_2) = (Y_1,\ Y_2 - Y_1)

and clearly J(y_1, y_2) = 1; thus

g(y_1, y_2) = \lambda^2 e^{-\lambda y_2}, \qquad (y_1, y_2) \in \mathcal{T}.

Example 2.4.37. Let X_1 and X_2 be independent N(0,1) random variables (Z = R^2) and let

(Y_1, Y_2) = (X_1,\ \rho X_1 + \sqrt{1 - \rho^2}\,X_2), \qquad \rho \in (-1, 1).

Then \mathcal{T} = Z and

(X_1, X_2) = \Big(Y_1,\ \frac{Y_2 - \rho Y_1}{\sqrt{1 - \rho^2}}\Big).

Then the Jacobian is

J(y_1, y_2) = \begin{vmatrix} 1 & -\frac{\rho}{\sqrt{1-\rho^2}} \\ 0 & \frac{1}{\sqrt{1-\rho^2}} \end{vmatrix} = \frac{1}{\sqrt{1-\rho^2}}.

Thus

g(y_1, y_2) = \frac{1}{2\pi\sqrt{1-\rho^2}}\exp\Big\{-\frac{1}{2}\big(y_1^2 + (y_2 - \rho y_1)^2/(1-\rho^2)\big)\Big\}
            = \frac{1}{2\pi\sqrt{1-\rho^2}}\exp\Big\{-\frac{1}{2(1-\rho^2)}\big(y_1^2 + y_2^2 - 2\rho y_1 y_2\big)\Big\}, \qquad (y_1, y_2) \in \mathcal{T}.

On inspection of Example 2.4.23, (Y_1, Y_2) follow a standard bivariate normal distribution.
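The construction in Example 2.4.37 is also a practical way to simulate correlated normal pairs. A minimal Python sketch added for illustration (numpy assumed; \rho = 0.7 and the sample size are arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)
    rho, n = 0.7, 500_000

    x1 = rng.standard_normal(n)
    x2 = rng.standard_normal(n)
    y1 = x1
    y2 = rho * x1 + np.sqrt(1.0 - rho**2) * x2

    print(np.corrcoef(y1, y2)[0, 1])     # close to rho
    print(np.var(y1), np.var(y2))        # both close to 1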

Distributions associated to the Normal Distribution

In the following Subsection we investigate some distributions that are related to the normal distribution. They appear frequently in hypothesis testing, which is something that we will investigate in the following Chapter. We note that, in a similar way to the way in which joint distributions are defined (Definition 2.4.19), one can easily extend to joint distributions of n \ge 2 variables (so that there is a joint distribution F(x_1, \ldots, x_n) and density f(x_1, \ldots, x_n)).

We begin with the idea of the chi-square distribution on n > 0 degrees of freedom:

f(x) = \frac{1}{2^{n/2}\Gamma(n/2)}\,x^{n/2 - 1} e^{-x/2}, \qquad x \in \mathcal{X} = R^+.

We write X ~ \chi^2_n; you will notice that also X ~ G(n/2, 1/2), so that E[X] = n. Note that from Table 4.2, we have that

M(t) = \frac{1}{(1-2t)^{n/2}}, \qquad t < \frac{1}{2}.

We have the following result:

Proposition 2.4.38. Let X_1, \ldots, X_n be independent and identically distributed (i.i.d.) standard normal random variables (i.e. X_i \overset{i.i.d.}{\sim} N(0,1)). Let

Z = X_1^2 + \cdots + X_n^2.

Then Z ~ \chi^2_n.


Proof. We will use moment generating functions. We have

M(t) = E[e^{Zt}] = E[e^{t\sum_{i=1}^n X_i^2}] = \prod_{i=1}^n E[e^{tX_i^2}] = E[e^{X_1^2 t}]^n.

Here we have used the independence property in the third equality and the fact that the random variables are identically distributed in the last. Now, for t < 1/2,

E[e^{X_1^2 t}] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp\Big\{-\frac{1}{2}(1-2t)x_1^2\Big\}\,dx_1
 = \frac{1}{\sqrt{2\pi}\,(1-2t)^{1/2}}\int_{-\infty}^{\infty}\exp\Big\{-\frac{1}{2}u^2\Big\}\,du
 = \frac{1}{(1-2t)^{1/2}}.

Hence

M(t) = \frac{1}{(1-2t)^{n/2}}, \qquad t < \frac{1}{2},

and we conclude the result.
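A quick numerical sanity check of Proposition 2.4.38 (an added illustration, not part of the notes; numpy assumed, with n = 5 and the number of replications arbitrary): since \chi^2_n = G(n/2, 1/2), the sum of n squared standard normals should have mean n and variance 2n.

    import numpy as np

    rng = np.random.default_rng(4)
    n_df, n_rep = 5, 400_000

    z = rng.standard_normal((n_rep, n_df)) ** 2
    s = z.sum(axis=1)                    # each row is X_1^2 + ... + X_n^2

    print(s.mean(), n_df)                # chi^2_n has mean n
    print(s.var(), 2 * n_df)             # ... and variance 2n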

The standard t-distribution on n degrees of freedom (n > 0) has density

p(x) = \frac{\Gamma([n+1]/2)}{\sqrt{n\pi}\,\Gamma(n/2)}\Big(1 + \frac{x^2}{n}\Big)^{-(n+1)/2}, \qquad x \in \mathcal{X} = R.

We write X ~ T_n. Then we have the following important result.

Proposition 2.4.39. Let X ~ N(0,1) and, independently, Y ~ \chi^2_n, n > 0. Let

T = \frac{X}{\sqrt{Y/n}}.

Then T ~ T_n.

Proof. We have

f(x, y) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}\,\frac{1}{2^{n/2}\Gamma(n/2)}\,y^{n/2-1} e^{-\frac{1}{2}y}, \qquad (x, y) \in Z = R \times R^+.

We will use Theorem 2.4.34, with the transformation defined by

T = \frac{X}{\sqrt{Y/n}}, \qquad S = Y,

and then marginalize out S. The inverse transformation is X = T\sqrt{S/n}, Y = S, so the Jacobian of the transformation is

J(t, s) = \begin{vmatrix} \sqrt{\frac{s}{n}} & 0 \\ \frac{t}{2\sqrt{sn}} & 1 \end{vmatrix} = \sqrt{\frac{s}{n}}.

Then

f(t, s) = \frac{1}{\sqrt{2n\pi}\,2^{n/2}\Gamma(n/2)}\,s^{n/2-1/2}\exp\Big\{-\frac{s}{2}\big[t^2/n + 1\big]\Big\}, \qquad (t, s) \in R \times R^+.

Then we have, for t \in R (substituting u = \frac{s}{2}[t^2/n + 1]),

f(t) = \frac{1}{\sqrt{2n\pi}\,2^{n/2}\Gamma(n/2)}\int_0^{\infty} s^{n/2-1/2}\exp\Big\{-\frac{s}{2}\big[t^2/n + 1\big]\Big\}\,ds
 = \frac{1}{\sqrt{2n\pi}\,2^{n/2}\Gamma(n/2)}\,\frac{1}{\big(\frac{1}{2} + \frac{t^2}{2n}\big)^{\frac{n+1}{2}}}\int_0^{\infty} u^{n/2-1/2} e^{-u}\,du
 = \frac{1}{\sqrt{2n\pi}\,2^{n/2}\Gamma(n/2)}\,\frac{1}{\big(\frac{1}{2} + \frac{t^2}{2n}\big)^{\frac{n+1}{2}}}\,\Gamma([n+1]/2)
 = \frac{\Gamma([n+1]/2)}{\sqrt{n\pi}\,\Gamma(n/2)}\Big(1 + \frac{t^2}{n}\Big)^{-(n+1)/2}


and we conclude.

The last distribution that we consider is the standard F-distribution on d_1, d_2 > 0 degrees of freedom:

f(x) = \frac{1}{B(\frac{d_1}{2}, \frac{d_2}{2})}\Big(\frac{d_1}{d_2}\Big)^{d_1/2} x^{d_1/2-1}\Big(1 + \frac{d_1}{d_2}x\Big)^{-\frac{d_1+d_2}{2}}, \qquad x \in \mathcal{X} = R^+,

where

B\Big(\frac{d_1}{2}, \frac{d_2}{2}\Big) = \frac{\Gamma(\frac{d_1}{2})\Gamma(\frac{d_2}{2})}{\Gamma(\frac{d_1+d_2}{2})}.

We write X ~ F_{(d_1, d_2)}. We have the following result.

Proposition 2.4.40. Let X ~ \chi^2_{d_1} and, independently, Y ~ \chi^2_{d_2}, d_1, d_2 > 0. Let

F = \frac{X/d_1}{Y/d_2}.

Then F ~ F_{(d_1, d_2)}.

Proof. We have

f(x, y) = \frac{1}{2^{d_1/2}\Gamma(d_1/2)}\,x^{d_1/2-1} e^{-\frac{1}{2}x}\,\frac{1}{2^{d_2/2}\Gamma(d_2/2)}\,y^{d_2/2-1} e^{-\frac{1}{2}y}, \qquad (x, y) \in Z = R^+ \times R^+.

We will use Theorem 2.4.34, with the transformation defined by

T = \frac{X/d_1}{Y/d_2}, \qquad S = Y,

and then marginalize out S. The inverse transformation is X = d_1 S T / d_2, Y = S, so the Jacobian of the transformation is

J(t, s) = \begin{vmatrix} \frac{d_1 s}{d_2} & 0 \\ \frac{d_1 t}{d_2} & 1 \end{vmatrix} = \frac{d_1 s}{d_2}.

Then

f(t, s) = \frac{1}{2^{d_1/2 + d_2/2}\Gamma(d_1/2)\Gamma(d_2/2)}\,\frac{d_1 s}{d_2}\Big(\frac{d_1 s t}{d_2}\Big)^{d_1/2-1} e^{-\frac{d_1 s t}{2d_2}}\,s^{d_2/2-1} e^{-s/2}, \qquad (t, s) \in (R^+)^2.

To shorten the subsequent notation, let

g = \frac{1}{2^{d_1/2 + d_2/2}\Gamma(d_1/2)\Gamma(d_2/2)}\Big(\frac{d_1}{d_2}\Big)^{d_1/2}.

Then, for t \in R^+ (substituting u = s\big(\frac{1}{2}[1 + \frac{d_1 t}{d_2}]\big)),

f(t) = g\,t^{d_1/2-1}\int_0^{\infty} s^{\frac{d_1+d_2}{2}-1}\exp\Big\{-s\Big(\frac{1}{2}\Big[1 + \frac{d_1 t}{d_2}\Big]\Big)\Big\}\,ds
 = g\,t^{d_1/2-1}\Big(\frac{1}{2}\Big[1 + \frac{d_1 t}{d_2}\Big]\Big)^{-(d_1+d_2)/2}\int_0^{\infty} u^{(d_1+d_2)/2-1} e^{-u}\,du
 = g\,t^{d_1/2-1}\Big(\frac{1}{2}\Big[1 + \frac{d_1 t}{d_2}\Big]\Big)^{-(d_1+d_2)/2}\Gamma((d_1+d_2)/2)
 = g\,\Gamma((d_1+d_2)/2)\,2^{\frac{d_1+d_2}{2}}\,t^{d_1/2-1}\Big(1 + \frac{d_1}{d_2}t\Big)^{-\frac{d_1+d_2}{2}}
 = \frac{1}{B(\frac{d_1}{2}, \frac{d_2}{2})}\Big(\frac{d_1}{d_2}\Big)^{d_1/2} t^{d_1/2-1}\Big(1 + \frac{d_1}{d_2}t\Big)^{-\frac{d_1+d_2}{2}}

and we conclude.


2.5 Convergence of Random Variables

In the following Section we provide a brief introduction to the convergence of random variables and, in particular, two modes of convergence:

• Convergence in Distribution

• Convergence in Probability

These properties are rather important in probability and mathematical statistics and provide a way to justify many statistical and numerical procedures. The properties are rather loosely associated to the notion of a sequence of random variables X_1, X_2, \ldots, X_n (there will typically be infinitely many of them) and we will be concerned with the idea of what happens to the distribution, or to some functional, as n \to \infty. The second part of this Section will focus on perhaps the most important result in probability: the central limit theorem. This provides a characterization of the random variable

S_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i

with X_1, \ldots, X_n independent and identically distributed with zero mean and unit variance. This particular result is very useful in hypothesis testing, which we shall see later on. Throughout, we will focus on continuous random variables, but one can extend these notions.

2.5.1 Convergence in Distribution

Consider a sequence of random variables X_1, X_2, \ldots, with associated distribution functions F_1, F_2, \ldots; we then have the following definition:

Definition 2.5.1. We say that the sequence of distribution functions F_1, F_2, \ldots converges to a distribution function F if \lim_{n\to\infty} F_n(x) = F(x) at each point x where F is continuous. If X has distribution function F, we say that X_n converges in distribution to X and write X_n \overset{d}{\to} X.

Example 2.5.2. Consider a sequence of random variables X_n \in \mathcal{X}_n = [0, n], n \ge 1, with

F_n(x) = 1 - \Big(1 - \frac{x}{n}\Big)^n, \qquad x \in \mathcal{X}_n.

Then for any fixed x \in R^+

\lim_{n\to\infty} F_n(x) = 1 - e^{-x}.

Now if X ~ E(1):

F(x) = \int_0^x e^{-u}\,du = 1 - e^{-x}.

So X_n \overset{d}{\to} X, X ~ E(1).

Example 2.5.3. Consider a sequence of random variables X_n \in \mathcal{X} = [0, 1], n \ge 1, with

F_n(x) = x - \frac{\sin(2n\pi x)}{2n\pi}.

Then for any fixed x \in [0, 1]

\lim_{n\to\infty} F_n(x) = x.

Now if X ~ U_{[0,1]}, for x \in [0, 1]:

F(x) = \int_0^x du = x.

So X_n \overset{d}{\to} X, X ~ U_{[0,1]}.


2.5.2 Convergence in Probability

We now consider an alternative notion of convergence for a sequence of random variables X1, X2, . . .

Definition 2.5.4. We say that X_n converges in probability to a constant c \in R (written X_n \overset{P}{\to} c) if, for every \varepsilon > 0,

\lim_{n\to\infty} P(|X_n - c| > \varepsilon) = 0.

We note that convergence in probability can be extended to convergence to a random variable X, but we do not do this, as it is beyond the level of this course. We have the following useful result, which we do not prove.

Theorem 2.5.5. Consider a sequence of random variables X_1, X_2, \ldots, with associated distribution functions F_1, F_2, \ldots. If for every x \in \mathcal{X}_n

\lim_{n\to\infty} F_n(x) = c,

then X_n \overset{P}{\to} c.

Example 2.5.6. Consider a sequence of random variables X_n \in \mathcal{X} = R^+, n \ge 1, with

F_n(x) = \Big(\frac{x}{1+x}\Big)^n.

Then for any fixed x \in \mathcal{X}

\lim_{n\to\infty} F_n(x) = 0.

Thus X_n \overset{P}{\to} 0.

We finish the Section with a rather important result, which we again do not prove. It is called the weak law of large numbers:

Theorem 2.5.7. Let X_1, X_2, \ldots be a sequence of independent and identically distributed random variables with E[|X_1|] < \infty. Then

\frac{1}{n}\sum_{i=1}^n X_i \overset{P}{\to} E[X_1].

Note that this result extends to functions; i.e. if g : R \to R with E[|g(X_1)|] < \infty, then

\frac{1}{n}\sum_{i=1}^n g(X_i) \overset{P}{\to} E[g(X_1)].

Example 2.5.8. Consider an integral

I = \int_R g(x)\,dx,

where we will suppose that g(x) \neq 0. Suppose we cannot calculate I. Consider any PDF f(x) on R which is non-zero everywhere. Then

I = \int_R \frac{g(x)}{f(x)} f(x)\,dx,

and suppose that

\int_R \Big|\frac{g(x)}{f(x)}\Big| f(x)\,dx < \infty.

Then by the weak law of large numbers, if X_1, \ldots, X_n are i.i.d. with PDF f, then

\frac{1}{n}\sum_{i=1}^n \frac{g(X_i)}{f(X_i)} \overset{P}{\to} I.

This provides a justification for a numerical method (called the Monte Carlo method) to approximate integrals.
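For concreteness, here is a minimal Python sketch of the Monte Carlo method just described (this example is an addition to the notes; the integrand e^{-x^4} and the choice f = N(0,1) are arbitrary, and numpy/scipy are assumed). The exact value of the integral is \Gamma(1/4)/2, printed only for comparison.

    import numpy as np
    from scipy import stats, special

    rng = np.random.default_rng(5)
    n = 1_000_000

    # I = integral over R of g(x) = exp(-x^4), with sampling pdf f = N(0, 1)
    x = rng.standard_normal(n)
    w = np.exp(-x**4) / stats.norm.pdf(x)   # g(X_i) / f(X_i)

    print(w.mean())                          # Monte Carlo estimate of I
    print(special.gamma(0.25) / 2.0)         # exact value, Gamma(1/4)/2 ~ 1.813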


2.5.3 Central Limit Theorem

We close the Section and Chapter with the central limit theorem (CLT), which we state without proof.

Theorem 2.5.9. Let X_1, X_2, \ldots be a sequence of independent and identically distributed random variables with E[|X_1|] < \infty and 0 < Var[X_1] < \infty. Then

\sqrt{n}\,\Bigg(\frac{1}{n}\sum_{i=1}^n \frac{X_i - E[X_1]}{\sqrt{Var[X_1]}}\Bigg) \overset{d}{\to} Z,

where Z ~ N(0, 1).

The CLT asserts that the distribution of the summation \sum_{i=1}^n X_i, suitably normalized and centered, will converge to that of a normal distribution. An often used idea, via the CLT, is normal approximation. That is, for n 'large',

\sqrt{n}\,\Bigg(\frac{1}{n}\sum_{i=1}^n \frac{X_i - E[X_1]}{\sqrt{Var[X_1]}}\Bigg) \overset{\cdot}{\sim} N(0, 1),

where \overset{\cdot}{\sim} means 'approximately distributed as', so we have (recall that if Z ~ N(0,1) and X = \sigma Z + \mu, then X ~ N(\mu, \sigma^2))

\sum_{i=1}^n X_i \overset{\cdot}{\sim} N\big(nE[X_1],\ nVar[X_1]\big).

Example 2.5.10. Let X_1, \ldots, X_n be i.i.d. G(a/n, b) random variables. Then the distribution of Z_n = \sum_{i=1}^n X_i can be found via its MGF: for t < b,

M(t) = E[e^{Z_n t}] = \prod_{i=1}^n E[e^{X_i t}] = E[e^{X_1 t}]^n = \Big(\Big(\frac{b}{b-t}\Big)^{a/n}\Big)^n = \Big(\frac{b}{b-t}\Big)^a,

where we have used the i.i.d. property and the MGF of a Gamma random variable (see Table 4.2). Thus Z_n ~ G(a, b) for any n \ge 1. However, from the CLT one can reasonably approximate Z_n, when n is large, by a normal random variable with mean

\frac{na}{bn} = \frac{a}{b}

and variance

\frac{na}{nb^2} = \frac{a}{b^2}.
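A hedged numerical illustration of this approximation (added here, not in the original notes; scipy assumed, with a = 50 and b = 2 chosen arbitrarily): the exact G(a, b) CDF of Z_n and the N(a/b, a/b^2) approximation can be compared at a few points.

    import numpy as np
    from scipy import stats

    a, b = 50.0, 2.0                                        # Z_n ~ G(a, b) exactly
    exact = stats.gamma(a, scale=1.0 / b)                   # shape a, rate b
    approx = stats.norm(loc=a / b, scale=np.sqrt(a) / b)    # N(a/b, a/b^2)

    for q in (20.0, 25.0, 30.0):
        print(q, exact.cdf(q), approx.cdf(q))               # close for large a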



Chapter 3

Introduction to Statistics

3.1 Introduction

In this final Chapter, we give a very brief introduction to statistical ideas. Here the notion is that one has observed data from some sampling distribution, 'the population distribution', and one wishes to infer the properties of the 'population' on the basis of the observed samples. In particular, we will be interested in estimating the parameters of sampling distributions, for instance by the maximum likelihood method (Section 3.2), as well as in testing hypotheses about parameters (Section 3.3). We end the Chapter with an introduction to Bayesian statistics (Section 3.4), which is an alternative way to estimate parameters; it is more complex, but much richer than the MLE method.

3.2 Maximum Likelihood Estimation

3.2.1 Introduction

So far in this course, we have discussed pmfs and pdfs such as P(\lambda) or N(\mu, \sigma^2), but we have not discussed what the parameters \lambda or \mu might be. We discuss in this Section a particularly important way to estimate parameters, called maximum likelihood estimation (MLE).

The basic idea is this. Suppose one observes data x_1, x_2, \ldots, x_n and we hypothesize that they follow some joint distribution F_\theta(x_1, \ldots, x_n), \theta \in \Theta (\Theta is the parameter space, e.g. \Theta = R). Then the idea of maximum likelihood estimation is to find the parameter which maximizes the joint pmf/pdf of the data, that is:

\widehat{\theta}_n = \mathrm{argmax}_{\theta\in\Theta}\,f_\theta(x_1, \ldots, x_n).

3.2.2 The Method

Throughout this Section, we assume that X_1, \ldots, X_n are mutually independent. So we have X_i \overset{i.i.d.}{\sim} F_\theta, and the joint pmf/pdf is

f_\theta(x_1, \ldots, x_n) = f_\theta(x_1) \times f_\theta(x_2) \times \cdots \times f_\theta(x_n) = \prod_{i=1}^n f_\theta(x_i).

We call f_\theta(x_1, \ldots, x_n) the likelihood of the data. As maximizing a function is equivalent to maximizing a monotonic increasing transformation of the function, we often work with the log-likelihood:

l_\theta(x_1, \ldots, x_n) = \log\big(f_\theta(x_1, \ldots, x_n)\big) = \sum_{i=1}^n \log\big(f_\theta(x_i)\big).

If \Theta is some continuous space (as it generally is for our examples) and \theta = (\theta_1, \ldots, \theta_d), then we can compute the gradient vector

\nabla l_\theta(x_1, \ldots, x_n) = \Big(\frac{\partial l_\theta(x_1, \ldots, x_n)}{\partial\theta_1}, \ldots, \frac{\partial l_\theta(x_1, \ldots, x_n)}{\partial\theta_d}\Big)

and we would like to solve, for \theta \in \Theta (below, 0 is the d-dimensional vector of zeros),

\nabla l_\theta(x_1, \ldots, x_n) = 0. \qquad (3.2.1)


The solution of this equation (assuming it exists) is a maximum if the Hessian matrix

H(\theta) := \begin{pmatrix}
\frac{\partial^2 l_\theta(x_1,\ldots,x_n)}{\partial\theta_1^2} & \frac{\partial^2 l_\theta(x_1,\ldots,x_n)}{\partial\theta_1\partial\theta_2} & \cdots & \frac{\partial^2 l_\theta(x_1,\ldots,x_n)}{\partial\theta_1\partial\theta_d} \\
\vdots & \vdots & & \vdots \\
\frac{\partial^2 l_\theta(x_1,\ldots,x_n)}{\partial\theta_d\partial\theta_1} & \frac{\partial^2 l_\theta(x_1,\ldots,x_n)}{\partial\theta_d\partial\theta_2} & \cdots & \frac{\partial^2 l_\theta(x_1,\ldots,x_n)}{\partial\theta_d^2}
\end{pmatrix}

is negative definite. If the d numbers \lambda_1, \ldots, \lambda_d which solve |\lambda I_d - H(\theta)| = 0, with I_d the d \times d identity matrix, are all negative, then \theta is a local maximum of l_\theta(x_1, \ldots, x_n). If d = 1 then this just boils down to checking whether the second derivative of the log-likelihood is negative at the solution of (3.2.1).

Thus in summary, the approach we employ is as follows:

1. Compute the likelihood f_\theta(x_1, \ldots, x_n).

2. Compute the log-likelihood l_\theta(x_1, \ldots, x_n) and its gradient vector \nabla l_\theta(x_1, \ldots, x_n).

3. Solve \nabla l_\theta(x_1, \ldots, x_n) = 0 with respect to \theta \in \Theta; call this solution \bar{\theta}_n (we are assuming there is only one \bar{\theta}_n).

4. If H(\bar{\theta}_n) is negative definite, then \widehat{\theta}_n = \bar{\theta}_n.

In general, point 3 may not be possible analytically (so, for example, one can use Newton's method; a numerical sketch is given below). However, you will not be asked to solve \nabla l_\theta(x_1, \ldots, x_n) = 0 unless there is an analytic solution.
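As an illustration of the numerical route just mentioned (this sketch is an addition to the notes, not the prescribed method of the course), the log-likelihood of i.i.d. G(a, b) data has no closed-form maximizer in a, so one can minimize the negative log-likelihood numerically. Python with numpy/scipy is assumed; the simulated 'data' and starting point are arbitrary.

    import numpy as np
    from scipy import optimize, special

    rng = np.random.default_rng(6)
    x = rng.gamma(shape=3.0, scale=1.0 / 2.0, size=5000)   # data from G(3, 2)

    def neg_loglik(theta):
        a, b = np.exp(theta)              # optimize on the log scale so a, b > 0
        # log-likelihood of G(a, b): sum of a*log(b) - log Gamma(a) + (a-1)log(x) - b*x
        return -np.sum(a * np.log(b) - special.gammaln(a)
                       + (a - 1.0) * np.log(x) - b * x)

    res = optimize.minimize(neg_loglik, x0=np.log([1.0, 1.0]), method="BFGS")
    print(np.exp(res.x))                  # numerical MLE of (a, b), close to (3, 2)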

3.2.3 Examples of Computing the MLE

Example 3.2.1. Let X_1, \ldots, X_n be i.i.d. P(\lambda) random variables. Let us compute the MLE of \lambda = \theta, given observations x_1, \ldots, x_n. First, we have

f_\lambda(x_1, \ldots, x_n) = \prod_{i=1}^n \frac{\lambda^{x_i}}{x_i!} e^{-\lambda} = e^{-n\lambda}\,\lambda^{\sum_{i=1}^n x_i}\,\frac{1}{\prod_{i=1}^n x_i!}.

Second, the log-likelihood is

l_\lambda(x_1, \ldots, x_n) = \log(f_\lambda(x_1, \ldots, x_n)) = -n\lambda + \Big(\sum_{i=1}^n x_i\Big)\log(\lambda) - \log\Big(\prod_{i=1}^n x_i!\Big).

The gradient vector is a single derivative:

\frac{dl_\lambda(x_1, \ldots, x_n)}{d\lambda} = -n + \frac{1}{\lambda}\Big(\sum_{i=1}^n x_i\Big).

Thirdly,

-n + \frac{1}{\lambda}\Big(\sum_{i=1}^n x_i\Big) = 0,

so

\widehat{\lambda}_n = \frac{1}{n}\sum_{i=1}^n x_i.

Fourthly,

\frac{d^2 l_\lambda(x_1, \ldots, x_n)}{d\lambda^2} = -\frac{1}{\lambda^2}\Big(\sum_{i=1}^n x_i\Big) < 0

for any \lambda > 0 (assuming there is at least one i such that x_i > 0). Thus, assuming there is at least one i such that x_i > 0,

\widehat{\lambda}_n = \frac{1}{n}\sum_{i=1}^n x_i.


Remark 3.2.2. As a theoretical justification of \widehat{\lambda}_n in Example 3.2.1, we note that

\frac{1}{n}\sum_{i=1}^n x_i

will converge in probability (see Theorem 2.5.7) to E[X_1] = \lambda if our assumptions hold true. That is, we recover the true parameter value; such a property is called consistency. We do not address this issue further.

Example 3.2.3. Let X_1, \ldots, X_n be i.i.d. E(\lambda) random variables. Let us compute the MLE of \lambda = \theta, given observations x_1, \ldots, x_n. First, we have

f_\lambda(x_1, \ldots, x_n) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n\exp\Big\{-\lambda\sum_{i=1}^n x_i\Big\}.

Second, the log-likelihood is

l_\lambda(x_1, \ldots, x_n) = \log(f_\lambda(x_1, \ldots, x_n)) = n\log(\lambda) - \lambda\sum_{i=1}^n x_i.

The gradient vector is a single derivative:

\frac{dl_\lambda(x_1, \ldots, x_n)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i.

Thirdly,

\frac{n}{\lambda} - \sum_{i=1}^n x_i = 0,

so

\widehat{\lambda}_n = \Big(\frac{1}{n}\sum_{i=1}^n x_i\Big)^{-1}.

Fourthly,

\frac{d^2 l_\lambda(x_1, \ldots, x_n)}{d\lambda^2} = -\frac{n}{\lambda^2} < 0.

Thus

\widehat{\lambda}_n = \Big(\frac{1}{n}\sum_{i=1}^n x_i\Big)^{-1}.

Example 3.2.4. Let X_1, \ldots, X_n be i.i.d. N(\mu, \sigma^2) random variables. Let us compute the MLE of (\mu, \sigma^2) = \theta, given observations x_1, \ldots, x_n. First, we have

f_\theta(x_1, \ldots, x_n) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{1}{2\sigma^2}(x_i - \mu)^2\Big\} = \Big(\frac{1}{\sqrt{2\pi\sigma^2}}\Big)^n\exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\Big\}.

Second, the log-likelihood is

l_\theta(x_1, \ldots, x_n) = \log(f_\theta(x_1, \ldots, x_n)) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.

The gradient vector is

\Big(\frac{\partial l_\theta(x_1, \ldots, x_n)}{\partial\mu},\ \frac{\partial l_\theta(x_1, \ldots, x_n)}{\partial\sigma^2}\Big) = \Big(\frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu),\ -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i - \mu)^2\Big).


Thirdly, we must solve the equations

\frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0 \qquad (3.2.2)

-\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i - \mu)^2 = 0 \qquad (3.2.3)

simultaneously for \mu and \sigma^2. Since (3.2.2) can be solved independently of (3.2.3), we have

\widehat{\mu}_n = \frac{1}{n}\sum_{i=1}^n x_i.

Thus, substituting into (3.2.3), we have

\frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i - \widehat{\mu}_n)^2 = \frac{n}{2\sigma^2},

that is,

\widehat{\sigma}^2_n = \frac{1}{n}\sum_{i=1}^n (x_i - \widehat{\mu}_n)^2.

Fourthly, the Hessian matrix is

H(\theta) = \begin{pmatrix}
-\frac{n}{\sigma^2} & -\frac{1}{(\sigma^2)^2}\sum_{i=1}^n (x_i - \mu) \\
-\frac{1}{(\sigma^2)^2}\sum_{i=1}^n (x_i - \mu) & \frac{n}{2(\sigma^2)^2} - \frac{1}{(\sigma^2)^3}\sum_{i=1}^n (x_i - \mu)^2
\end{pmatrix}.

Note that when \theta = (\widehat{\mu}_n, \widehat{\sigma}^2_n) the off-diagonal elements are exactly 0, so when solving |\lambda I_2 - H(\widehat{\theta}_n)| = 0 we simply need to show that the diagonal elements are negative. Clearly, for the first diagonal element,

-\frac{n}{\widehat{\sigma}^2_n} < 0

if (x_i - \widehat{\mu}_n)^2 > 0 for at least one i (we do not allow the case \widehat{\sigma}^2_n = 0; we assume this does not occur). In addition,

\frac{n}{2(\widehat{\sigma}^2_n)^2} - \frac{1}{(\widehat{\sigma}^2_n)^3}\sum_{i=1}^n (x_i - \widehat{\mu}_n)^2 = \frac{n}{2(\widehat{\sigma}^2_n)^2} - \frac{n}{(\widehat{\sigma}^2_n)^2} = -\frac{n}{2(\widehat{\sigma}^2_n)^2} < 0.

Thus

\widehat{\theta}_n = \Big(\frac{1}{n}\sum_{i=1}^n x_i,\ \frac{1}{n}\sum_{i=1}^n \Big(x_i - \frac{1}{n}\sum_{j=1}^n x_j\Big)^2\Big).
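A small numerical check of Example 3.2.4 (an added sketch, not part of the notes; numpy assumed, with \mu = 1 and \sigma^2 = 4 as arbitrary true values). Note that the MLE of \sigma^2 divides by n, not n - 1.

    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.normal(loc=1.0, scale=2.0, size=10_000)    # N(mu = 1, sigma^2 = 4)

    mu_hat = x.mean()
    sigma2_hat = np.mean((x - mu_hat) ** 2)            # divides by n, not n - 1

    print(mu_hat, sigma2_hat)                          # close to 1 and 4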

3.3 Hypothesis Testing

3.3.1 Introduction

The theory of hypothesis testing is concerned with the problem of determining whether or not a statistical hypothesis, that is, a statement about the probability distribution of the data, is consistent with the available sample evidence. The particular hypothesis to be tested is called the null hypothesis and is denoted by H_0. The ultimate goal is to accept or reject H_0.

In addition to the null hypothesis H_0, one may also be interested in a particular set of deviations from H_0, called the alternative hypothesis and denoted by H_1. Usually, the null and the alternative hypotheses are not on an equal footing: H_0 is clearly specified and of intrinsic interest, whereas H_1 serves only to indicate what types of departure from H_0 are of interest.

A statistical test of a null hypothesis H_0 is typically based on three elements:

1. a statistic T, called a test statistic;

2. a partition of the possible values of T into two distinct regions: the set K of values of T that are regarded as inconsistent with H_0, called the critical or rejection region of the test, and its complement K^c, called the non-rejection region;

3. a decision rule that rejects the null hypothesis H_0 as inconsistent with the data if the observed value of T falls in the rejection region K, and does not reject H_0 if the observed value of T belongs instead to K^c.


3.3.2 Constructing Test Statistics

In order to construct test statistics, we start with an important result, which we do not prove. Note that all of our results are associated to normally distributed data.

Theorem 3.3.1. Let X_1, \ldots, X_n be i.i.d. N(\mu, \sigma^2) random variables. Define

\bar{X}_n := \frac{1}{n}\sum_{i=1}^n X_i, \qquad s_n^2 := \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2.

Then

1. \bar{X}_n - \mu \sim N(0, \sigma^2/n).

2. (n-1)s_n^2/\sigma^2 \sim \chi^2_{n-1}.

3. \bar{X}_n - \mu and (n-1)s_n^2/\sigma^2 are independent random variables.

We note then, by Proposition 2.4.39, that it follows that

T(\mu) := \frac{(\bar{X}_n - \mu)/(\sigma/\sqrt{n})}{\sqrt{s_n^2/\sigma^2}} = \frac{\bar{X}_n - \mu}{\sqrt{s_n^2/n}} \sim T_{n-1}.

Note that, as one might imagine, Proposition 2.4.38 is used to prove Theorem 3.3.1. Note also that one can show that the T_{n-1} distribution has a symmetric pdf, centred around 0.

Now consider testing H_0 : \mu = \mu_0 against the two-sided alternative H_1 : \mu \neq \mu_0 (that is, we are testing whether the population mean, the 'true' mean of the data, is a particular value). Now we know for any \mu \in R that T(\mu) \sim T_{n-1}; thus, if the null hypothesis is true, T(\mu_0) should be a random variable that is consistent with a T_{n-1} random variable. To construct the rejection region of the test, we must choose a confidence level, typically 95%. Then we want T(\mu_0) to lie in 95% of the probability. However, this still does not tell us what the rejection region is; this is informed by the alternative hypothesis H_1 : \mu \neq \mu_0, which indicates that a value inconsistent with the T_{n-1} distribution may lie in either 'tail' of the distribution. Thus the procedure is (a sketch in code is given after the list):

1. Compute T(\mu_0).

2. Decide upon your confidence level (1 - \alpha)%. This defines the rejection region.

3. Compute the t-values (-t, t). These are the numbers in R such that the probability (under a T_{n-1} random variable) of exceeding t is \alpha/2 and the probability of being less than -t is \alpha/2.

4. If -t < T(\mu_0) < t then we do not reject the null hypothesis; otherwise we reject the null hypothesis.
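The following Python sketch carries out the above procedure (an illustration added to these notes; numpy and scipy are assumed, and the simulated data, \mu_0 = 0 and \alpha = 0.05 are arbitrary). scipy's built-in one-sample t-test is printed as a cross-check.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(8)
    x = rng.normal(loc=0.2, scale=1.0, size=30)      # data; test H0: mu = mu0
    mu0, alpha = 0.0, 0.05

    n = x.size
    t_obs = (x.mean() - mu0) / np.sqrt(x.var(ddof=1) / n)   # T(mu0)
    t_val = stats.t.ppf(1.0 - alpha / 2.0, df=n - 1)        # upper alpha/2 t-value

    print(t_obs, t_val, abs(t_obs) > t_val)          # reject H0 iff |T(mu0)| > t
    print(stats.ttest_1samp(x, popmean=mu0))         # cross-check: statistic, p-value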

A number of remarks are in order. First, a test can only disprove a null hypothesis. The fact that we do not reject the null on the basis of the sample evidence does not mean that the hypothesis is true.

Second, it is useful to distinguish between two types of errors that can be made:

• Type I error: Reject H0 when it is true.

• Type II error: Do not reject H0 when it is false.

In statistics, a Type I error is usually regarded as the more serious. The analogy is with the judiciary system, where condemning an innocent person is typically considered a much more serious problem than letting a guilty person go free.

A second test statistic we consider is for testing whether two samples have the same variance. Let X_1, \ldots, X_n and Y_1, \ldots, Y_m be independent sequences of random variables, where X_i \overset{i.i.d.}{\sim} N(\mu_X, \sigma_X^2) and Y_i \overset{i.i.d.}{\sim} N(\mu_Y, \sigma_Y^2). Now we know from Theorem 3.3.1 that

(n-1)s_{X,n}^2/\sigma_X^2 \sim \chi^2_{n-1}


and, independently,

(m-1)s_{Y,m}^2/\sigma_Y^2 \sim \chi^2_{m-1}.

Now suppose that we want to test H_0 : \sigma_X^2 = \sigma_Y^2 against H_1 : \sigma_X^2 \neq \sigma_Y^2. If H_0 is true then, by Theorem 3.3.1 and Proposition 2.4.40,

F(\sigma_X^2) := \frac{\big[(n-1)s_{X,n}^2/\sigma_X^2\big]/(n-1)}{\big[(m-1)s_{Y,m}^2/\sigma_Y^2\big]/(m-1)} = \frac{s_{X,n}^2}{s_{Y,m}^2} \sim F_{(n-1,\,m-1)},

where the final equality uses \sigma_X^2 = \sigma_Y^2 under H_0.

Thus, we perform the following procedure (a sketch in code is given after the list):

1. Compute F(\sigma_X^2).

2. Decide upon your confidence level (1 - \alpha)%. This defines the rejection region.

3. Compute the F-values (\underline{f}, \overline{f}). These are the numbers in R^+ such that the probability (under an F_{(n-1,m-1)} random variable) of exceeding \overline{f} is \alpha/2 and the probability of being less than \underline{f} is \alpha/2.

4. If \underline{f} < F(\sigma_X^2) < \overline{f} then we do not reject the null hypothesis; otherwise we reject the null hypothesis.
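A matching Python sketch for the variance test (again an added illustration; numpy/scipy assumed, with sample sizes and standard deviations chosen arbitrarily):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(9)
    x = rng.normal(0.0, 1.0, size=25)                # sigma_X = 1
    y = rng.normal(0.0, 1.5, size=30)                # sigma_Y = 1.5
    alpha = 0.05

    f_obs = x.var(ddof=1) / y.var(ddof=1)            # F(sigma_X^2) = s_X^2 / s_Y^2
    n, m = x.size, y.size
    f_lo = stats.f.ppf(alpha / 2.0, dfn=n - 1, dfd=m - 1)
    f_hi = stats.f.ppf(1.0 - alpha / 2.0, dfn=n - 1, dfd=m - 1)

    print(f_obs, (f_lo, f_hi))
    print(not (f_lo < f_obs < f_hi))                 # True means: reject H0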

Remark 3.3.2. All of our results concern normal samples. However, one can use the CLT (Theorem 2.5.9) to extend these tests to non-normal data. We do not follow this idea in this course.

3.4 Bayesian Inference

3.4.1 Introduction

So far we have considered point estimation methods for the unknown parameter. In the following Section, we consider the idea of Bayesian inference for unknown parameters. Recall, from Examples 2.3.34 and 2.4.28, that there is a Bayes theorem for discrete and continuous random variables. Indeed, there is a version of Bayes theorem that can mix continuous and discrete random variables. Let X and Y be two jointly defined random variables such that X and Y are each either continuous or discrete. Then Bayes theorem is

f(x|y) = \frac{f(y|x)f(x)}{f(y)},

as long as the associated pmfs/pdfs are well defined. Throughout the Section, we assume that the pmfs/pdfs are always positive on their supports.

3.4.2 Bayesian Estimation

Throughout, we will assume that we have a random sample X_1, X_2, \ldots, X_n which are conditionally independent given a parameter \theta \in \Theta. As we will see, in Bayesian statistics the parameter is a random variable. So, we assume X_i|\theta \overset{i.i.d.}{\sim} F(\cdot|\theta), where F(\cdot|\theta) is assumed to be a distribution function for any \theta \in \Theta. Thus we have that the joint pmf/pdf of X_1, \ldots, X_n|\theta is

f(x_1, \ldots, x_n|\theta) = \prod_{i=1}^n f(x_i|\theta).

The main key behind Bayesian statistics is the choice of a prior probability distribution for the parameter \theta \in \Theta. That is, Bayesian statisticians specify a probability distribution on the parameter \theta before the data are observed. This probability distribution is supposed to reflect the information one might have before seeing the observations. To make this idea concrete, consider the following example:

Example 3.4.1. Suppose that we will observe data X_i|\lambda \overset{i.i.d.}{\sim} E(\lambda). Then one has to construct a probability distribution on \lambda \in R^+. A possible candidate is \lambda \sim G(a, b). If one has some prior beliefs about the mean and variance of \lambda (say E[\lambda] = 10, Var[\lambda] = 10) then one can determine what a and b are.


Throughout, we will write the prior pmf/pdf as \pi(\theta).

Now, the way in which Bayesian inference works is to update the prior beliefs on \theta via the posterior pmf/pdf. That is, 'in the light of the data' the distributional properties of the prior are updated. This is achieved by Bayes theorem; the posterior pmf/pdf is

\pi(\theta|x_1, \ldots, x_n) = \frac{f(x_1, \ldots, x_n|\theta)\,\pi(\theta)}{f(x_1, \ldots, x_n)},

where

f(x_1, \ldots, x_n) = \int_\Theta f(x_1, \ldots, x_n|\theta)\,\pi(\theta)\,d\theta

if \theta is continuous and, if \theta is discrete,

f(x_1, \ldots, x_n) = \sum_{\theta\in\Theta} f(x_1, \ldots, x_n|\theta)\,\pi(\theta).

For a Bayesian statistician, the posterior is the 'final answer', in that all statistical inference should be associated to the posterior. For example, if one is interested in estimating \theta then one can use the posterior mean

E[\Theta|x_1, \ldots, x_n] = \int_\Theta \theta\,\pi(\theta|x_1, \ldots, x_n)\,d\theta.

In addition, consider \theta \in R (or indeed any univariate component of \theta); then we can compute a confidence interval, which is called a credible interval in Bayesian statistics. The highest 95%-posterior-credible (HPC) interval is the shortest region [\underline{\theta}, \overline{\theta}] such that

\int_{\underline{\theta}}^{\overline{\theta}} \pi(\theta|x_1, \ldots, x_n)\,d\theta = 0.95.

By shortest, we mean the [\underline{\theta}, \overline{\theta}] such that |\overline{\theta} - \underline{\theta}| is smallest.

It should be clear by now that the posterior distribution is much 'richer' than the MLE, in the sense that one now has a whole distribution which reflects the parameter, instead of a point estimate. What also might be apparent now is the fact that the posterior is perhaps difficult to calculate. For example, the posterior mean will require you to compute two integrations over \Theta (in general) and this may not be easy to calculate in practice. In addition, it could be difficult, analytically, to calculate a HPC interval. As a result, most Bayesian inference is done via numerical methods, based on Monte Carlo (see Example 2.5.8); we do not mention this further.

3.4.3 Examples

Despite the fact that Bayesian inference is very challenging, there are still many examples where one can do analytical calculations.

Example 3.4.2. Let us consider Example 3.4.1. Here we have that

f(x_1, \ldots, x_n) = \int_0^{\infty} \lambda^n\exp\Big\{-\lambda\sum_{i=1}^n x_i\Big\}\,\frac{b^a}{\Gamma(a)}\lambda^{a-1}e^{-b\lambda}\,d\lambda
 = \frac{b^a}{\Gamma(a)}\int_0^{\infty} \lambda^{n+a-1}\exp\Big\{-\lambda\Big[\sum_{i=1}^n x_i + b\Big]\Big\}\,d\lambda
 = \frac{b^a}{\Gamma(a)}\Big(\frac{1}{\sum_{i=1}^n x_i + b}\Big)^{n+a}\int_0^{\infty} u^{n+a-1}e^{-u}\,du
 = \frac{b^a}{\Gamma(a)}\Big(\frac{1}{\sum_{i=1}^n x_i + b}\Big)^{n+a}\Gamma(n+a).

So, as

f(x_1, \ldots, x_n|\lambda)\,\pi(\lambda) = \frac{b^a}{\Gamma(a)}\lambda^{n+a-1}\exp\Big\{-\lambda\Big[\sum_{i=1}^n x_i + b\Big]\Big\},

we have

\pi(\lambda|x_1, \ldots, x_n) = \frac{\lambda^{n+a-1}\exp\{-\lambda[\sum_{i=1}^n x_i + b]\}}{\big(\frac{1}{\sum_{i=1}^n x_i + b}\big)^{n+a}\,\Gamma(n+a)},

i.e.

\lambda|x_1, \ldots, x_n \sim G\Big(n + a,\ b + \sum_{i=1}^n x_i\Big).

Thus, the posterior distribution on \lambda is in the same family as the prior, except with updated parameters reflecting the data. So, for example,

E[\Lambda|x_1, \ldots, x_n] = \frac{n+a}{b + \sum_{i=1}^n x_i}.

In comparison to the MLE in Example 3.2.3, we see that the posterior mean and the MLE correspond as a, b \to 0.

Example 3.4.3. Let X_i|\lambda \overset{i.i.d.}{\sim} P(\lambda), i \in \{1, \ldots, n\}. Suppose the prior on \lambda is G(a, b). Then

f(x_1, \ldots, x_n) = \int_0^{\infty} \lambda^{\sum_{i=1}^n x_i}\exp\{-n\lambda\}\frac{1}{\prod_{i=1}^n x_i!}\,\frac{b^a}{\Gamma(a)}\lambda^{a-1}e^{-b\lambda}\,d\lambda
 = \frac{b^a}{\Gamma(a)\prod_{i=1}^n x_i!}\int_0^{\infty} \lambda^{\sum_{i=1}^n x_i + a - 1}\exp\{-\lambda[n+b]\}\,d\lambda
 = \frac{b^a}{\Gamma(a)\prod_{i=1}^n x_i!}\Big(\frac{1}{n+b}\Big)^{\sum_{i=1}^n x_i + a}\int_0^{\infty} u^{\sum_{i=1}^n x_i + a - 1}e^{-u}\,du
 = \frac{b^a}{\Gamma(a)\prod_{i=1}^n x_i!}\Big(\frac{1}{n+b}\Big)^{\sum_{i=1}^n x_i + a}\Gamma\Big(\sum_{i=1}^n x_i + a\Big).

So, as

f(x_1, \ldots, x_n|\lambda)\,\pi(\lambda) = \frac{b^a}{\Gamma(a)\prod_{i=1}^n x_i!}\lambda^{\sum_{i=1}^n x_i + a - 1}\exp\{-\lambda[n+b]\},

we have

\pi(\lambda|x_1, \ldots, x_n) = \frac{\lambda^{\sum_{i=1}^n x_i + a - 1}\exp\{-\lambda[n+b]\}}{\big(\frac{1}{n+b}\big)^{\sum_{i=1}^n x_i + a}\,\Gamma(\sum_{i=1}^n x_i + a)},

i.e.

\lambda|x_1, \ldots, x_n \sim G\Big(\sum_{i=1}^n x_i + a,\ n + b\Big).

Thus, the posterior distribution on \lambda is in the same family as the prior, except with updated parameters reflecting the data. So, for example,

E[\Lambda|x_1, \ldots, x_n] = \frac{\sum_{i=1}^n x_i + a}{n + b}.

In comparison to the MLE in Example 3.2.1, we see that the posterior mean and the MLE correspond as a, b \to 0.

Example 3.4.4. Let X_i|\mu \overset{i.i.d.}{\sim} N(\mu, 1), i \in \{1, \ldots, n\}. Suppose the prior on \mu is N(\xi, \kappa). Then

f(x_1, \ldots, x_n) = \int_{-\infty}^{\infty} \Big(\frac{1}{\sqrt{2\pi}}\Big)^n\exp\Big\{-\frac{1}{2}\sum_{i=1}^n (x_i-\mu)^2\Big\}\frac{1}{\sqrt{2\pi\kappa}}\exp\Big\{-\frac{1}{2\kappa}(\mu-\xi)^2\Big\}\,d\mu
 = \Big(\frac{1}{\sqrt{2\pi}}\Big)^n\frac{1}{\sqrt{2\pi\kappa}}\int_{-\infty}^{\infty}\exp\Big\{-\frac{1}{2}\sum_{i=1}^n (x_i-\mu)^2 - \frac{1}{2\kappa}(\mu-\xi)^2\Big\}\,d\mu.

To compute the integral, let us first manipulate the exponent inside the integral:

-\frac{1}{2}\sum_{i=1}^n (x_i-\mu)^2 - \frac{1}{2\kappa}(\mu-\xi)^2 = -\frac{1}{2}\Big[\mu^2\Big(n+\frac{1}{\kappa}\Big) - 2\mu\Big(\frac{\xi}{\kappa}+\sum_{i=1}^n x_i\Big) + \frac{\xi^2}{\kappa} + \sum_{i=1}^n x_i^2\Big].

Let c(\xi, \kappa, x_1, \ldots, x_n) = -\frac{1}{2}\big(\frac{\xi^2}{\kappa} + \sum_{i=1}^n x_i^2\big); then

-\frac{1}{2}\sum_{i=1}^n (x_i-\mu)^2 - \frac{1}{2\kappa}(\mu-\xi)^2 = -\frac{1}{2}\Big[\mu^2\Big(n+\frac{1}{\kappa}\Big) - 2\mu\Big(\frac{\xi}{\kappa}+\sum_{i=1}^n x_i\Big)\Big] + c(\xi, \kappa, x_1, \ldots, x_n)
 = -\frac{1}{2}\Big(n+\frac{1}{\kappa}\Big)\Bigg[\Big(\mu - \frac{\frac{\xi}{\kappa}+\sum_{i=1}^n x_i}{n+\frac{1}{\kappa}}\Big)^2 - \Big(\frac{\frac{\xi}{\kappa}+\sum_{i=1}^n x_i}{n+\frac{1}{\kappa}}\Big)^2\Bigg] + c(\xi, \kappa, x_1, \ldots, x_n).

Let c'(\xi, \kappa, x_1, \ldots, x_n) = c(\xi, \kappa, x_1, \ldots, x_n) + \frac{1}{2}\big(n+\frac{1}{\kappa}\big)\Big(\frac{\frac{\xi}{\kappa}+\sum_{i=1}^n x_i}{n+\frac{1}{\kappa}}\Big)^2; then we have

f(x_1, \ldots, x_n) = \Big(\frac{1}{\sqrt{2\pi}}\Big)^n\frac{1}{\sqrt{2\pi\kappa}}\exp\{c'(\xi, \kappa, x_1, \ldots, x_n)\}\int_{-\infty}^{\infty}\exp\Big\{-\frac{1}{2}\Big(n+\frac{1}{\kappa}\Big)\Big(\mu - \frac{\frac{\xi}{\kappa}+\sum_{i=1}^n x_i}{n+\frac{1}{\kappa}}\Big)^2\Big\}\,d\mu
 = \Big(\frac{1}{\sqrt{2\pi}}\Big)^n\frac{1}{\sqrt{2\pi\kappa}}\exp\{c'(\xi, \kappa, x_1, \ldots, x_n)\}\,\frac{\sqrt{2\pi}}{(n+\frac{1}{\kappa})^{1/2}}.

So, as

f(x_1, \ldots, x_n|\mu)\,\pi(\mu) = \Big(\frac{1}{\sqrt{2\pi}}\Big)^n\frac{1}{\sqrt{2\pi\kappa}}\exp\{c'(\xi, \kappa, x_1, \ldots, x_n)\}\exp\Big\{-\frac{1}{2}\Big(n+\frac{1}{\kappa}\Big)\Big(\mu - \frac{\frac{\xi}{\kappa}+\sum_{i=1}^n x_i}{n+\frac{1}{\kappa}}\Big)^2\Big\},

we have

\pi(\mu|x_1, \ldots, x_n) = \frac{(n+\frac{1}{\kappa})^{1/2}}{\sqrt{2\pi}}\exp\Big\{-\frac{1}{2}\Big(n+\frac{1}{\kappa}\Big)\Big(\mu - \frac{\frac{\xi}{\kappa}+\sum_{i=1}^n x_i}{n+\frac{1}{\kappa}}\Big)^2\Big\},

i.e.

\mu|x_1, \ldots, x_n \sim N\Big(\frac{\frac{\xi}{\kappa}+\sum_{i=1}^n x_i}{n+\frac{1}{\kappa}},\ \Big(n+\frac{1}{\kappa}\Big)^{-1}\Big).

Thus, the posterior distribution on \mu is in the same family as the prior, except with updated parameters reflecting the data. So, for example,

E[\mu|x_1, \ldots, x_n] = \frac{\frac{\xi}{\kappa}+\sum_{i=1}^n x_i}{n+\frac{1}{\kappa}}.

In comparison to the MLE in Example 3.2.4, we see that the posterior mean and the MLE correspond as \kappa \to \infty, for any fixed \xi \in R.
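And a corresponding sketch for the normal model of Example 3.4.4 (an added illustration; numpy assumed, with the true mean, prior parameters and sample size arbitrary):

    import numpy as np

    rng = np.random.default_rng(11)
    mu_true, n = 0.5, 40
    x = rng.normal(loc=mu_true, scale=1.0, size=n)    # X_i | mu ~ N(mu, 1)

    xi, kappa = 0.0, 10.0                             # N(xi, kappa) prior on mu
    post_mean = (xi / kappa + x.sum()) / (n + 1.0 / kappa)
    post_var = 1.0 / (n + 1.0 / kappa)

    print(post_mean, post_var)      # posterior is N(post_mean, post_var)
    print(x.mean())                 # MLE; agrees with the posterior mean as kappa grows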

We note that in all of our examples the posterior is in the same family as the prior. This is not by chance; for every member of the exponential family with parameter \theta it is often possible to find a prior which obeys this property. Such priors are called conjugate priors.


Chapter 4

Miscellaneous Results

In the following Chapter we quote some results of use in the course. We do not cover this material in lectures; it is there for your convenience and revision. The following Sections give general facts which should be taken as true. You can easily find the theory behind each of these results in any undergraduate text in maths or statistics.

4.1 Set Theory

Recall that A \cup B means all the elements in A or in B; e.g. A = \{1, 2\}, B = \{3, 4\}, then A \cup B = \{1, 2, 3, 4\}. In addition, A \cap B means all the elements in both A and B; e.g. A = \{1, 2, 3\}, B = \{3, 4\}, then A \cap B = \{3\}. Finally, recall that A^c is the complement of the set A, that is, everything in \Omega that is not in A; e.g. \Omega = \{1, 2, 3\} and A = \{1, 2\}, then A^c = \{3\}.

Let A, B, C be sets in some state-space \Omega. Then the following hold true:

A \cup B = B \cup A, \qquad A \cap B = B \cap A \qquad (COMMUTATIVITY)

A \cup (B \cup C) = (A \cup B) \cup (A \cup C), \qquad A \cap (B \cap C) = (A \cap B) \cap (A \cap C) \qquad (DISTRIBUTIVITY)

Note that the second rule also holds when mixing intersection and union, e.g.:

A \cup (B \cap C) = (A \cup B) \cap (A \cup C).

We also have De Morgan's laws:

(A \cup B)^c = A^c \cap B^c, \qquad (A \cap B)^c = A^c \cup B^c.

A neat trick for probability:

A = (A \cap B) \cup (A \cap B^c).

4.2 Summation

\frac{1}{1-z} = \sum_{k=0}^{\infty} z^k, \qquad |z| < 1 \qquad (GEOMETRIC)

e^z = \sum_{k=0}^{\infty} \frac{z^k}{k!}, \qquad z \in R \qquad (EXPONENTIAL)

(1+z)^n = \sum_{k=0}^{n} \binom{n}{k} z^k, \qquad n \in Z^+ \qquad (BINOMIAL)

(1+z)^{-n} = \sum_{k=0}^{\infty} \binom{n+k-1}{k} (-z)^k, \qquad n \in Z^+,\ |z| < 1 \qquad (NEGATIVE BINOMIAL)

-\log(1-z) = \sum_{k=1}^{\infty} \frac{z^k}{k}, \qquad |z| < 1 \qquad (LOGARITHMIC)

\log(1+z) = \sum_{k=1}^{\infty} (-1)^{k+1}\frac{z^k}{k}, \qquad |z| < 1 \qquad (LOGARITHMIC)


Note also that, for z \neq 1 (in particular for |z| < 1), we have

\sum_{k=j}^{n} z^k = z^j\,\frac{1 - z^{n-j+1}}{1-z}.

4.3 Exponential Function

For x \in R^+:

\lim_{n\to\infty}\Big(1 + \frac{x}{n}\Big)^n = e^x, \qquad \lim_{n\to\infty}\Big(1 - \frac{x}{n}\Big)^n = e^{-x}.

4.4 Taylor Series

Under assumptions that are in effect in this course, for infinitely differentiable f : R→ R:

f(x) = \sum_{k=0}^{\infty} \frac{(x-c)^k}{k!}\,f^{(k)}(c), \qquad c \in R.

4.5 Integration Methods

Some ideas for integration:

\frac{d}{dx}\big[f(g(x))\big] = g'(x)f'(g(x)), \qquad \int g'(x)f'(g(x))\,dx = f(g(x)) + C.

Recall also the methods of 'integration by parts' and 'integration by substitution'; you should have covered these prior to your university education.

Consider the following integration 'trick'. Suppose you want to compute

\int g(x)\,dx

where g is a positive function. Suppose that there is a PDF f(x) such that

f(x) = c\,g(x),

where you know c. Then

\int g(x)\,dx = \int \frac{1}{c} f(x)\,dx = \frac{1}{c}.
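For instance (a worked example added here for concreteness), take g(x) = x^{a-1}e^{-bx} on R^+ with a, b > 0. The G(a, b) PDF is f(x) = \frac{b^a}{\Gamma(a)}x^{a-1}e^{-bx} = c\,g(x) with c = b^a/\Gamma(a), so

\int_0^{\infty} x^{a-1}e^{-bx}\,dx = \frac{\Gamma(a)}{b^a}.

This is the integral evaluated in Examples 3.4.2 and 3.4.3.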

Recall that

I = \int_{-\infty}^{\infty} e^{-u^2/2}\,du = \sqrt{2\pi}.

This can be established from the fact that

I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-u^2/2 - v^2/2}\,du\,dv

and then changing to polar co-ordinates.

4.6 Distributions

A table of discrete distributions can be found in Table 4.1 and a table of continuous distributions in Table 4.2. Recall that

\Gamma(a) = \int_0^{\infty} t^{a-1}e^{-t}\,dt, \qquad a > 0.


Note that one can show \Gamma(a+1) = a\Gamma(a) and, for a \in Z^+, \Gamma(a) = (a-1)!. In addition, for a, b > 0,

B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.

One can also show that

B(a, b) = \int_0^1 u^{a-1}(1-u)^{b-1}\,du.


Table 4.1: Table of Discrete Distributions. Note that q = 1 - p.

| Distribution | Support | Par. | PMF | CDF | E[X] | Var[X] | MGF |
|---|---|---|---|---|---|---|---|
| B(1, p) | {0, 1} | p \in (0,1) | p^x (1-p)^{1-x} |  | p | p(1-p) | (1-p) + p e^t |
| B(n, p) | {0, 1, ..., n} | p \in (0,1), n \in Z^+ | \binom{n}{x} p^x (1-p)^{n-x} |  | np | np(1-p) | ((1-p) + p e^t)^n |
| P(\lambda) | {0, 1, 2, ...} | \lambda \in R^+ | \lambda^x e^{-\lambda}/x! |  | \lambda | \lambda | \exp\{\lambda(e^t - 1)\} |
| Ge(p) | {1, 2, ...} | p \in (0,1) | (1-p)^{x-1} p | 1 - q^x | 1/p | (1-p)/p^2 | p e^t / (1 - e^t(1-p)) |
| Ne(n, p) | {n, n+1, ...} | p \in (0,1), n \in Z^+ | \binom{x-1}{n-1}(1-p)^{x-n} p^n |  | n/p | n(1-p)/p^2 | (p e^t / (1 - e^t(1-p)))^n |


Table 4.2: Table of Continuous Distributions. Note that B(a, b)^{-1} = \Gamma(a+b)/[\Gamma(a)\Gamma(b)].

| Distribution | Support | Par. | PDF | CDF | E[X] | Var[X] | MGF |
|---|---|---|---|---|---|---|---|
| U[a, b] | [a, b] | -\infty < a < b < \infty | 1/(b-a) | (x-a)/(b-a) | (a+b)/2 | (b-a)^2/12 | (e^{bt} - e^{at})/(t(b-a)) |
| E(\lambda) | R^+ | \lambda \in R^+ | \lambda e^{-\lambda x} | 1 - e^{-\lambda x} | 1/\lambda | 1/\lambda^2 | \lambda/(\lambda - t) |
| G(a, b) | R^+ | a, b \in R^+ | \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx} |  | a/b | a/b^2 | (b/(b-t))^a |
| N(a, b) | R | (a, b) \in R \times R^+ | \frac{1}{\sqrt{2\pi b}} e^{-\frac{1}{2b}(x-a)^2} |  | a | b | e^{at + bt^2/2} |
| Be(a, b) | [0, 1] | (a, b) \in (R^+)^2 | B(a, b)^{-1} x^{a-1}(1-x)^{b-1} |  | a/(a+b) | ab/[(a+b)^2(a+b+1)] |  |

