
MT580 Mathematics for Statistics:

“All the Math You Never Had”

Jenny A. Baglivo
Mathematics Department
Boston College
Chestnut Hill, MA 02467-3806
(617) 552 3772
[email protected]

Third Draft, Summer 2006

MT580 is an introduction to probability, discrete and continuous methods, and linear algebra for graduate students in applied statistics with little or no formal training in these fields. Applications in statistics are emphasized throughout.

The topics covered in MT580 are drawn from several subject areas; thus the need for a specialized set of course notes. I would like to thank GSAS Dean Mick Smyer for supporting the development of these notes and Mathematics Professor Dan Chambers for many useful comments and suggestions on an earlier draft.

© Jenny A. Baglivo 2006. All rights reserved.

Contents

1 Introductory Probability Concepts
  1.1 Set Operations
  1.2 Sample Spaces and Events
  1.3 Probability Axioms and Properties Following From the Axioms
    1.3.1 Kolmogorov Axioms for Finite Sample Spaces
    1.3.2 Equally Likely Outcomes
    1.3.3 Properties Following from the Axioms
  1.4 Counting Methods
  1.5 Permutations and Combinations
    1.5.1 Example: Simple Urn Model
    1.5.2 Binomial Coefficients
  1.6 Partitioning Sets
    1.6.1 Multinomial Coefficients
    1.6.2 Example: Urn Models
  1.7 Applications in Statistics
    1.7.1 Maximum Likelihood Estimation in Simple Urn Models
    1.7.2 Permutation and Nonparametric Methods
  1.8 Conditional Probability
    1.8.1 Multiplication Rules for Probability
    1.8.2 Law of Total Probability
    1.8.3 Bayes Rule
  1.9 Independent Events and Mutually Independent Events

2 Discrete Random Variables
  2.1 Definitions
    2.1.1 PDF and CDF for Discrete Distributions
  2.2 Families of Distributions
    2.2.1 Example: Discrete Uniform Distribution
    2.2.2 Example: Hypergeometric Distribution
    2.2.3 Distributions Related to Bernoulli Experiments
    2.2.4 Simple Random Samples
  2.3 Mathematical Expectation
    2.3.1 Definitions for Finite Discrete Random Variables
    2.3.2 Properties of Expectation
    2.3.3 Mean, Variance and Standard Deviation
    2.3.4 Chebyshev's Inequality

3 Joint Discrete Distributions
  3.1 Bivariate Distributions
    3.1.1 Joint PDF
    3.1.2 Joint CDF
    3.1.3 Marginal Distributions
    3.1.4 Conditional Distributions
    3.1.5 Independent Random Variables
    3.1.6 Mathematical Expectation for Finite Bivariate Distributions
    3.1.7 Covariance and Correlation
    3.1.8 Conditional Expectation and Regression
    3.1.9 Example: Bivariate Hypergeometric Distribution
    3.1.10 Example: Trinomial Distribution
    3.1.11 Simple Random Samples
  3.2 Multivariate Distributions
    3.2.1 Joint PDF and Joint CDF
    3.2.2 Mutually Independent Random Variables and Random Samples
    3.2.3 Mathematical Expectation for Finite Multivariate Distributions
    3.2.4 Example: Multivariate Hypergeometric Distribution
    3.2.5 Example: Multinomial Distribution
    3.2.6 Simple Random Samples
    3.2.7 Sample Summaries

4 Introductory Calculus Concepts
  4.1 Functions
  4.2 Limits and Continuity
    4.2.1 Neighborhoods, Deleted Neighborhoods and Limits
    4.2.2 Continuous Functions
  4.3 Differentiation
    4.3.1 Derivative and Tangent Line
    4.3.2 Approximations and Local Linearity
    4.3.3 First and Second Derivative Functions
    4.3.4 Some Rules for Finding Derivative Functions
    4.3.5 Chain Rule for Composition of Functions
    4.3.6 Rules for Exponential and Logarithmic Functions
    4.3.7 Rules for Trigonometric and Inverse Trigonometric Functions
    4.3.8 Indeterminate Forms and L'Hospital's Rule
  4.4 Optimization
    4.4.1 Local Extrema and Derivatives
    4.4.2 First Derivative Test for Local Extrema
    4.4.3 Second Derivative Test for Local Extrema
    4.4.4 Global Extrema on Closed Intervals
    4.4.5 Newton's Method
  4.5 Applications in Statistics
    4.5.1 Maximum Likelihood Estimation in Binomial Models
    4.5.2 Maximum Likelihood Estimation in Multinomial Models
    4.5.3 Minimum Chi-Square Estimation in Multinomial Models
  4.6 Rolle's Theorem and the Mean Value Theorem
    4.6.1 Generalized Mean Value Theorem and Error Analysis
  4.7 Integration
    4.7.1 Riemann Sums and Definite Integrals
    4.7.2 Properties of Definite Integrals
    4.7.3 Fundamental Theorem of Calculus
    4.7.4 Antiderivatives and Indefinite Integrals
    4.7.5 Some Formulas for Indefinite Integrals
    4.7.6 Approximation Methods
    4.7.7 Improper Integrals
  4.8 Applications in Statistics
    4.8.1 de Moivre-Laplace Limit Theorem
    4.8.2 Computing Quantiles

5 Continuous Random Variables
  5.1 Definitions
    5.1.1 PDF and CDF for Continuous Distributions
    5.1.2 Quantiles and Percentiles
  5.2 Families of Distributions
    5.2.1 Example: Uniform Distribution
    5.2.2 Example: Exponential Distribution
    5.2.3 Euler Gamma Function
    5.2.4 Example: Gamma Distribution
    5.2.5 Example: Normal Distribution
    5.2.6 Transforming Continuous Random Variables
  5.3 Mathematical Expectation
    5.3.1 Definitions for Continuous Random Variables
    5.3.2 Properties of Expectation
    5.3.3 Mean, Variance and Standard Deviation
    5.3.4 Chebyshev's Inequality

6 Infinite Sequences and Series
  6.1 Limits and Convergence of Infinite Sequences
    6.1.1 Convergence Properties
    6.1.2 Geometric Sequences
    6.1.3 Bounded Sequences
    6.1.4 Monotone Sequences
  6.2 Partial Sums and Convergence of Infinite Series
    6.2.1 Convergence Properties
    6.2.2 Geometric Series
  6.3 Convergence Tests for Positive Series
  6.4 Alternating Series and Error Analysis
  6.5 Absolute and Conditional Convergence of Series
    6.5.1 Product of Series
  6.6 Taylor Polynomials and Taylor Series
    6.6.1 Higher Order Derivatives
    6.6.2 Taylor Polynomials
    6.6.3 Taylor Series
    6.6.4 Radius of Convergence
  6.7 Power Series
    6.7.1 Power Series and Taylor Series
    6.7.2 Differentiation and Integration of Power Series
  6.8 Discrete Random Variables, revisited
    6.8.1 Countably Infinite Sample Spaces
    6.8.2 Infinite Discrete Random Variables
    6.8.3 Example: Geometric Distribution
    6.8.4 Example: Poisson Distribution
  6.9 Continuous Random Variables, revisited
    6.9.1 Exponential Distribution
    6.9.2 Gamma Distribution
  6.10 Summary: Poisson Processes

7 Multivariable Calculus Concepts
  7.1 Vector and Matrix Operations
    7.1.1 Representations of Vectors
    7.1.2 Addition and Scalar Multiplication of Vectors
    7.1.3 Length and Distance
    7.1.4 Dot Product of Vectors
    7.1.5 Matrix Notation and Matrix Transpose
    7.1.6 Addition and Scalar Multiplication of Matrices
    7.1.7 Matrix Multiplication
    7.1.8 Determinants of Square Matrices
  7.2 Limits and Continuity
    7.2.1 Neighborhoods, Deleted Neighborhoods and Limits
    7.2.2 Continuous Functions
  7.3 Differentiation in Several Variables
    7.3.1 Partial Derivatives and the Gradient Vector
    7.3.2 Second-Order Partial Derivatives
    7.3.3 Differentiability, Local Linearity and Tangent (Hyper)Planes
    7.3.4 Total Derivative
    7.3.5 Directional Derivatives
  7.4 Optimization
    7.4.1 Linear and Quadratic Approximations
    7.4.2 Second Derivative Test
    7.4.3 Constrained Optimization and Lagrange Multipliers
    7.4.4 Method of Steepest Descent/Ascent
  7.5 Applications in Statistics
    7.5.1 Least Squares Estimation in Quadratic Models
    7.5.2 Maximum Likelihood Estimation in Multinomial Models
  7.6 Multiple Integration
    7.6.1 Riemann Sums and Double Integrals over Rectangles
    7.6.2 Iterated Integrals over Rectangles
    7.6.3 Properties of Double Integrals
    7.6.4 Double and Iterated Integrals over General Domains
    7.6.5 Triple and Iterated Integrals over Boxes
    7.6.6 Triple Integrals over General Domains
    7.6.7 Partial Anti-Differentiation and Partial Differentiation
    7.6.8 Change of Variables and the Jacobian
  7.7 Applications in Statistics
    7.7.1 Computing Probabilities for Continuous Random Pairs
    7.7.2 Contours for Independent Standard Normal Random Variables

8 Joint Continuous Distributions
  8.1 Bivariate Distributions
    8.1.1 Joint PDF and Joint CDF
    8.1.2 Marginal Distributions
    8.1.3 Conditional Distributions
    8.1.4 Independent Random Variables
    8.1.5 Mathematical Expectation for Bivariate Continuous Distributions
    8.1.6 Covariance and Correlation
    8.1.7 Conditional Expectation and Regression
    8.1.8 Example: Bivariate Uniform Distribution
    8.1.9 Example: Bivariate Normal Distribution
    8.1.10 Transforming Continuous Random Variables
  8.2 Multivariate Distributions
    8.2.1 Joint PDF and Joint CDF
    8.2.2 Marginal and Conditional Distributions
    8.2.3 Mutually Independent Random Variables and Random Samples
    8.2.4 Mathematical Expectation for Multivariate Continuous Distributions
    8.2.5 Sample Summaries

9 Linear Algebra Concepts
  9.1 Solving Systems of Equations
    9.1.1 Elementary Row Operations
    9.1.2 Row Reduction and Echelon Forms
    9.1.3 Vector and Matrix Equations
  9.2 Matrix Operations
    9.2.1 Basic Operations
    9.2.2 Inverses of Square Matrices
    9.2.3 Matrix Factorizations
    9.2.4 Partitioned Matrices
  9.3 Determinants of Square Matrices
  9.4 Vector Spaces and Subspaces
    9.4.1 Definitions
    9.4.2 Subspaces of R^m
    9.4.3 Linear Independence, Basis and Dimension
    9.4.4 Rank and Nullity
  9.5 Orthogonality
    9.5.1 Orthogonal Complements
    9.5.2 Orthogonal Sets, Bases and Projections
    9.5.3 Gram-Schmidt Orthogonalization
    9.5.4 QR-Factorization
    9.5.5 Linear Least Squares
  9.6 Eigenvalues and Eigenvectors
    9.6.1 Characteristic Equation
    9.6.2 Diagonalization
    9.6.3 Symmetric Matrices
    9.6.4 Positive Definite Matrices
    9.6.5 Singular Value Decomposition
  9.7 Linear Transformations
    9.7.1 Range and Kernel
    9.7.2 Linear Transformations and Matrices
    9.7.3 Linearly Independent Columns
    9.7.4 Composition of Functions and Matrix Multiplication
    9.7.5 Diagonalization and Change of Bases
    9.7.6 Random Vectors and Linear Transformations
    9.7.7 Example: Multivariate Normal Distribution
  9.8 Applications in Statistics
    9.8.1 Least Squares Estimation
    9.8.2 Principal Components Analysis

10 Additional Reading


1 Introductory Probability Concepts

Probability is the study of random phenomena. Probability theory can be applied, for example, to study games of chance (e.g. roulette games, card games), occurrences of catastrophic events (e.g. tornados, earthquakes), survival of animal species, and changes in stock and commodity markets.

This chapter introduces probability theory.

1.1 Set Operations

Sets and operations on sets are fundamental to any area of mathematics. A set is a well-defined collection of objects called the elements or members of the set. (Well-defined means that there is a definite way of determining whether a given element belongs to the set.) The notation x ∈ A means “x is an element of A,” and the notation x ∉ A means “x is not an element of A.”

The empty set is the set containing no elements, and the universal set is the set containing all elements under consideration in a given study. The notation ∅ is used to denote the empty set, and the notation Ω is used to denote the universal set.

A is a subset of B (denoted by A ⊆ B) if each element of A is an element of B.

Union and intersection. The union of the sets A and B (denoted by A ∪ B) is the set of elements that belong to either A or B, and the intersection of the sets A and B (denoted by A ∩ B) is the set of elements that belong to both A and B. That is,

A ∪ B = {x | x ∈ A or x ∈ B} and A ∩ B = {x | x ∈ A and x ∈ B}.

Note that A ∪ B = B ∪ A and A ∩ B = B ∩ A.

The operations of union and intersection are associative, that is,

(A ∪ B) ∪ C = A ∪ (B ∪ C) and (A ∩ B) ∩ C = A ∩ (B ∩ C) for sets A, B, and C.

Thus, the union A ∪ B ∪ C and the intersection A ∩ B ∩ C are well-defined. More generally, the union of the k sets A1, A2, . . ., Ak (denoted by ∪_{i=1}^k Ai) consists of all elements belonging to at least one Ai, and the intersection of the k sets A1, A2, . . ., Ak (denoted by ∩_{i=1}^k Ai) consists of all elements belonging to all k sets.

The operations of union and intersection satisfy the following distributive laws:

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)

for sets A, B, and C.

Complement. The complement of the set A in the universal set (denoted by A^c = Ω − A) is the set of all elements in the universal set that do not belong to A. Note, in particular, that A ∩ A^c = ∅, A ∪ A^c = Ω and (A^c)^c = A.


Pairwise disjoint sets. The sets A1 and A2 are said to be disjoint if A1 ∩ A2 = ∅. The sets A1, A2, . . ., Ak are said to be pairwise disjoint if Ai ∩ Aj = ∅ whenever i ≠ j.

If A and B are sets, then B can be written as the disjoint union of the part of B contained in A with the part of B contained in the complement of A. That is,

B = (B ∩ A) ∪ (B ∩ A^c) where (B ∩ A) ∩ (B ∩ A^c) = ∅.

Partitions. A partition of Ω is a collection A1, A2, . . ., Ak of pairwise disjoint sets whose union is Ω. If A1, A2, . . ., Ak partitions Ω, then (B ∩ A1), (B ∩ A2), . . ., (B ∩ Ak) partitions the set B. That is,

B = ∪_{i=1}^k (B ∩ Ai) where (B ∩ Ai) ∩ (B ∩ Aj) = ∅ when i ≠ j.

De Morgan’s laws. De Morgan’s laws relate complements to unions and intersections:

(A ∪ B)^c = A^c ∩ B^c and (A ∩ B)^c = A^c ∪ B^c for sets A and B.

More generally,

(∪_{i=1}^k Ai)^c = ∩_{i=1}^k Ai^c and (∩_{i=1}^k Ai)^c = ∪_{i=1}^k Ai^c.
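These identities are easy to check mechanically. A minimal sketch using Python's built-in set type, with an arbitrary universal set of our own choosing:

    Omega = set(range(10))            # universal set for this check
    A, B = {1, 2, 3, 4}, {3, 4, 5, 6}

    # De Morgan's laws: the complement of a union/intersection.
    assert Omega - (A | B) == (Omega - A) & (Omega - B)
    assert Omega - (A & B) == (Omega - A) | (Omega - B)

    # B splits into disjoint pieces inside and outside A.
    assert B == (B & A) | (B - A)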

1.2 Sample Spaces and Events

The term experiment (or random experiment) is used in probability theory to describe a procedure whose outcome is not known in advance with certainty. Further, experiments are assumed to be repeatable (at least in theory) and to have a well-defined set of possible outcomes.

The sample space S is the set of all possible outcomes of an experiment. An event is a subset of the sample space. A simple event is an event with a single outcome. Events are usually denoted by capital letters (A, B, C, . . .) and outcomes by lower case letters (x, y, z, . . .). If x ∈ A is observed, then A is said to have occurred. The favorable outcomes of an experiment form the event of interest.

Each repetition of an experiment is called a trial. Repeated trials are repetitions of the experiment using the specified procedure, with the outcomes of the trials having no influence on one another.

Example 1.1 (Coin Tossing) Suppose you toss a fair coin 5 times and record h (for head) or t (for tail) each time. The sample space for this experiment is the collection of 2^5 = 32 sequences of five h's or t's:

S = {hhhhh, hhhht, hhhth, hhthh, hthhh, thhhh, hhhtt, hhtht, hthht, thhht, hhtth, hthth, thhth, htthh, ththh, tthhh, ttthh, tthth, thtth, httth, tthht, ththt, httht, thhtt, hthtt, hhttt, htttt, thttt, tthtt, tttht, tttth, ttttt}.

If you are interested in getting exactly 5 heads, then the event of interest is the simple event A = {hhhhh}. If you are interested in getting exactly 3 heads, then the event of interest is

A = {hhhtt, hhtht, hthht, thhht, hhtth, hthth, thhth, htthh, ththh, tthhh}.
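The sample space is small enough to enumerate by machine. A short Python check (itertools is in the standard library):

    from itertools import product

    # All 2^5 = 32 sequences of five tosses.
    S = [''.join(seq) for seq in product('ht', repeat=5)]
    assert len(S) == 32

    # The event "exactly 3 heads" from Example 1.1.
    A = [s for s in S if s.count('h') == 3]
    print(len(A))   # 10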


Example 1.2 (Birthdays) Consider the experiment “Ask ten people their birthdays and record the list of dates.” The sample space, S, is the collection of all sequences of 10 birthdays. If you are interested in getting at least one match, then the event of interest, A, is the collection of all sequences of ten birthdays with at least one match.

1.3 Probability Axioms and Properties Following From the Axioms

The basic rules (or axioms) of probability were introduced by A. Kolmogorov in the 1930s. Let A ⊆ S be an event, and let P(A) be the probability that A will occur.

1.3.1 Kolmogorov Axioms for Finite Sample Spaces

A probability distribution, or simply a probability, on a finite sample space S is a specification of numbers P(A) satisfying the following axioms:

1. P(S) = 1.

2. If A is an event, then 0 ≤ P(A) ≤ 1.

3. If A1 and A2 are disjoint, then P(A1 ∪ A2) = P(A1) + P(A2).

3′. More generally, if the events A1, A2, . . ., Ak are pairwise disjoint, then

P(A1 ∪ A2 ∪ · · · ∪ Ak) = P(A1) + P(A2) + · · · + P(Ak).

Since S is the set of all possible outcomes, an outcome in S is certain to occur; the probability of an event that is certain to occur must be 1 (axiom 1). Probabilities must be between 0 and 1 (axiom 2), and probabilities must be additive when events are pairwise disjoint (axiom 3).

Note that the second part of axiom 3 can be proven from the first part using mathematical induction. The inductive step of the proof assumes that the equality holds for k − 1 events. Then

P(A1 ∪ A2 ∪ · · · ∪ Ak)
    = P((A1 ∪ A2 ∪ · · · ∪ Ak−1) ∪ Ak)           (associativity of union)
    = P(A1 ∪ A2 ∪ · · · ∪ Ak−1) + P(Ak)          (axiom 3)
    = P(A1) + P(A2) + · · · + P(Ak−1) + P(Ak).   (induction hypothesis)

An additional extension of axiom 3 is needed for infinite sample spaces. Namely,

3′′. If A1, A2, . . . are pairwise disjoint, then

P(∪_{i=1}^∞ Ai) = P(A1) + P(A2) + P(A3) + · · ·

where the right-hand side is understood to be the sum of a convergent infinite series.

Methods for sequences and series are covered in Chapter 6 of these notes.


Relative frequencies. The probability of an event can be written as the limit of relative frequencies. That is, if A ⊆ S is an event, then

P(A) = lim_{n→∞} #(A)/n,

where #(A) is the number of occurrences of event A in n repeated trials of the experiment.

Expected number. If P(A) is the probability of event A, then nP(A) is the expected number of occurrences of event A in n repeated trials of the experiment.

1.3.2 Equally Likely Outcomes

If S is a finite set with N elements, A is a subset of S with n elements, and each outcome is equally likely, then

P(A) = (number of elements in A)/(number of elements in S) = |A|/|S| = n/N.

Example 1.3 (Coin Tossing, continued) For example, if you toss a fair coin 5 times and record heads or tails each time, then the probability of getting exactly 3 heads is 10/32 = 0.3125. In 2000 repetitions of the experiment, you expect to observe exactly 3 heads

2000 P(exactly 3 heads) = 2000(0.3125) = 625 times.

1.3.3 Properties Following from the Axioms

Properties following from the Kolmogorov axioms include

1. Complement rule: Let A^c = S − A be the complement of A in S. Then

P(A^c) = 1 − P(A).

In particular, P(∅) = 0.

2. Subset rule: If A is a subset of B, then P(A) ≤ P(B).

3. Inclusion-exclusion rule: If A and B are events, then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Example 1.4 (Demonstration of Inclusion-Exclusion) To demonstrate the inclusion-exclusion rule using the axioms, first note that

A ∪ B = A ∪ (B ∩ A^c) where A ∩ (B ∩ A^c) = ∅.


Since the intersection is empty, we know that P(A ∪ B) = P(A) + P(B ∩ A^c) by axiom 3. Similarly, since

B = (B ∩ A) ∪ (B ∩ A^c) where (B ∩ A) ∩ (B ∩ A^c) = ∅,

P(B) = P(B ∩ A) + P(B ∩ A^c), or P(B ∩ A^c) = P(B) − P(B ∩ A). Thus,

P(A ∪ B) = P(A) + P(B ∩ A^c) = P(A) + P(B) − P(B ∩ A) = P(A) + P(B) − P(A ∩ B),

since P(A ∩ B) = P(B ∩ A).

Inclusion-exclusion rules. The inclusion-exclusion rule can be generalized to more than two events. For three events, for example, the rule is

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

1.4 Counting Methods

Methods for counting the number of outcomes in a sample space or event are important in probability. The multiplication rule is the basic counting formula.

Theorem 1.5 (Multiplication Rule) If an operation consists of r steps of which the first can be done in n1 ways, for each of these the second can be done in n2 ways, for each of the first and second steps the third can be done in n3 ways, etc., then the entire operation can be done in n1 × n2 × · · · × nr ways.

Sampling with/without replacement. Two special cases of the multiplication rule are

1. Sampling with replacement. For a set of size n and a sample of size r, there are a total of n^r = n × n × · · · × n ordered samples, if duplication is allowed.

2. Sampling without replacement. For a set of size n and a sample of size r, there are a total of

n!/(n − r)! = n × (n − 1) × · · · × (n − r + 1)

ordered samples, if duplication is not allowed.

If n is a positive integer, the notation n! (“n factorial”) is used for the product

n! = n × (n − 1) × · · · × 1.

For convenience, 0! is defined to equal 1 (0! = 1).

Example 1.6 (Birthdays, continued) For example, suppose there are r unrelated people in a room, none of whom was born on February 29th of a leap year. You would like to determine the probability that at least two people have the same birthday.


1. You ask, and record, each person's birthday. There are

365^r = 365 × 365 × · · · × 365

possible outcomes, where an outcome is a sequence of r responses.

2. Consider the event “everyone has a different birthday.” The number of outcomes in this event is

365!/(365 − r)! = 365 × 364 × · · · × (365 − r + 1).

3. Suppose that each sequence of birthdays is equally likely. The probability that at least two people have a common birthday is 1 minus the probability that everyone has a different birthday, or

1 − (365 × 364 × · · · × (365 − r + 1))/365^r.

Let A be the event “at least two people have a common birthday”. The following table demonstrates how quickly P(A) increases as the number of people in the room increases:

r       5     10    15    20    25    30    35    40    45    50
P(A)    0.03  0.12  0.25  0.41  0.57  0.71  0.81  0.89  0.94  0.97
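The table entries can be reproduced directly from the formula in step 3; prob_match below is our own helper name, not notation from the notes:

    def prob_match(r):
        """P(at least two of r people share a birthday), assuming 365
        equally likely birthdays per person."""
        p_distinct = 1.0
        for i in range(r):
            p_distinct *= (365 - i) / 365   # 365/365 * 364/365 * ...
        return 1 - p_distinct

    for r in range(5, 55, 5):
        print(r, round(prob_match(r), 2))   # 5 -> 0.03, 25 -> 0.57, 50 -> 0.97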

1.5 Permutations and Combinations

A permutation is an ordered subset of r distinct objects out of a set of n objects. A combination is an unordered subset of r distinct objects out of the n objects.

By the multiplication rule (Theorem 1.5), there are a total of

nPr = n!/(n − r)! = n × (n − 1) × · · · × (n − r + 1)

permutations of r objects out of n objects.

Since each unordered subset corresponds to r! ordered subsets (the r chosen elements are permuted in all possible ways), there are a total of

nCr = nPr/r! = n!/((n − r)! r!) = (n × (n − 1) × · · · × (n − r + 1))/(r × (r − 1) × · · · × 1)

combinations of r objects out of n objects.

For example, there are a total of 5040 ordered subsets of size 4 from a set of size 10, and a total of 5040/24 = 210 unordered subsets.

The notation

(n choose r) = nCr = n!/((n − r)! r!)

(read “n choose r”) is used to denote the total number of combinations.


Special cases are as follows:

(n choose 0) = (n choose n) = 1 and (n choose 1) = (n choose n − 1) = n.

Further, since choosing r elements to form a subset is equivalent to choosing the remaining n − r elements to form the complementary subset,

(n choose r) = (n choose n − r) for r = 0, 1, . . . , n.

1.5.1 Example: Simple Urn Model

Suppose there are M special objects in an urn containing a total of N objects. In a subset of size n chosen from the urn, exactly m are special.

Unordered subsets. There are a total of

(M choose m) × (N − M choose n − m)

unordered subsets with exactly m special objects (and exactly n − m other objects). If each choice of subset is equally likely, then for each m

P(m special objects) = (M choose m)(N − M choose n − m) / (N choose n).

Ordered subsets. There are a total of

(n choose m) × MPm × (N−M)P(n−m)

ordered subsets with exactly m special objects. (The positions of the special objects are selected first, followed by the special objects to fill these positions, followed by the nonspecial objects to fill the remaining positions.) If each choice of subset is equally likely, then for each m

P(m special objects) = (n choose m) × MPm × (N−M)P(n−m) / NPn.

Computing probabilities. Interestingly, P(m special objects) is the same in both cases. For example, let N = 25, M = 10, n = 8, and m = 3. Then, using the first formula,

P(3 special objects) = (10 choose 3)(15 choose 5) / (25 choose 8) = (120 × 3003)/1081575 = 728/2185 ≈ 0.333.

Using the second formula, the probability is

P(3 special objects) = (8 choose 3) × 10P3 × 15P5 / 25P8 = (56 × 720 × 360360)/43609104000 = 728/2185 ≈ 0.333.
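Both counting arguments are easy to verify numerically. math.comb and math.perm (Python 3.8+) compute binomial coefficients and falling factorials:

    from math import comb, perm

    N, M, n, m = 25, 10, 8, 3

    # First formula: unordered subsets.
    p1 = comb(M, m) * comb(N - M, n - m) / comb(N, n)

    # Second formula: ordered subsets.
    p2 = comb(n, m) * perm(M, m) * perm(N - M, n - m) / perm(N, n)

    print(p1, p2)   # both equal 728/2185 = 0.3331...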


Example 1.7 (Quality Management) A batch of 20 microwaves contains 4 defective and 16 good machines. You decide to select 3 machines at random and test them. If each choice of subset is equally likely, then the probability that there are k defective machines in the chosen subset is

P(k defectives) = (4 choose k)(16 choose 3 − k) / (20 choose 3), k = 0, 1, 2, 3.

The following table shows the distribution of probabilities:

k                  0                   1                   2                  3
P(k defectives)    560/1140 = 0.4912   480/1140 = 0.4211   96/1140 = 0.0842   4/1140 = 0.0035

If you decide to accept the batch as long as none of the 3 chosen machines is defective, then there is a 49.12% chance that you will accept a batch containing 4 defective machines.

1.5.2 Binomial Coefficients

The quantities (n choose r), r = 0, 1, . . . , n, are often referred to as the binomial coefficients because of the following theorem.

Theorem 1.8 (Binomial Theorem) For all real numbers x and y and each positive integer n,

(x + y)^n = Σ_{r=0}^n (n choose r) x^r y^(n−r).

Idea of proof. The product on the left can be written as a sequence of n factors:

(x + y)^n = (x + y) × (x + y) × · · · × (x + y).

The product expands to 2^n summands, where each summand is a sequence of n letters, where one letter is chosen from each factor. For each r, exactly (n choose r) sequences have r copies of x and n − r copies of y.

Example 1.9 (n = 12) If n = 12, then the numbers of subsets of size r are as follows:

r               0   1    2    3     4     5     6     7     8     9     10   11   12
(12 choose r)   1   12   66   220   495   792   924   792   495   220   66   12   1

and the total number of subsets is Σ_{r=0}^{12} (12 choose r) = 4096 = 2^12.
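The row of coefficients and its total can be generated in a line or two of Python:

    from math import comb

    row = [comb(12, r) for r in range(13)]
    print(row)         # [1, 12, 66, 220, 495, 792, 924, 792, 495, 220, 66, 12, 1]
    print(sum(row))    # 4096 == 2**12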

1.6 Partitioning Sets

The multiplication rule (Theorem 1.5) can be used to find the number of partitions of a set of n elements into k distinguishable subsets of sizes r1, r2, . . ., rk.


Specifically, r1 of the n elements are chosen for the first subset, r2 of the remaining n − r1 elements are chosen for the second subset, and so forth. The result is the product of the numbers of ways to perform each step:

(n choose r1) × (n − r1 choose r2) × · · · × (n − r1 − · · · − rk−1 choose rk).

The product simplifies to n!/(r1! r2! · · · rk!) and is denoted by (n choose r1, r2, . . . , rk) (read “n choose r1, r2, . . . , rk”).

Example 1.10 (Partitioning Students) For example, there are a total of

(15 choose 5, 5, 5) = 15!/(5! 5! 5!) = 756,756

ways to partition the members of a class of 15 students into recitation sections of size 5 each led by Joe, Sally, and Mary, respectively. (The recitation sections are distinguished by their group leaders.)

Permutations of indistinguishable objects. The formula above also represents the number of ways to permute n objects, where the first r1 are indistinguishable, the next r2 are indistinguishable, . . ., the last rk are indistinguishable. The computation is done as follows: r1 of the n positions are chosen for the first type of object, r2 of the remaining n − r1 positions are chosen for the second type of object, etc.

In the example above, imagine that the 15 positions correspond to the 15 students in alphabetical order, and that there are 5 J's (for Joe), 5 S's (for Sally), and 5 M's (for Mary) to permute. Each permutation of the 15 letters corresponds to an assignment of the 15 students to the recitation sections led by Joe, Sally, and Mary.

Footnote on partitioning students. Suppose instead that the 15 students are assigned to 3 teams of 5 students each and that each team will work on the same project. Then the total number of different assignments becomes (15 choose 5, 5, 5)/3! = 126,126.
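A quick numerical check of both counts, multiplying the binomial coefficients from the step-by-step derivation above:

    from math import comb, factorial

    # (15 choose 5, 5, 5) as the product from the derivation.
    labeled = comb(15, 5) * comb(10, 5) * comb(5, 5)
    assert labeled == factorial(15) // factorial(5) ** 3 == 756756

    # Footnote case: the 3 teams are not distinguishable.
    print(labeled, labeled // factorial(3))   # 756756 126126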

1.6.1 Multinomial Coefficients

The quantities (n choose r1, r2, . . . , rk) are often referred to as the multinomial coefficients because of the following theorem.

Theorem 1.11 (Multinomial Theorem) For all real numbers x1, x2, . . . , xk and each positive integer n,

(x1 + x2 + · · · + xk)^n = Σ_{(r1, r2, . . . , rk)} (n choose r1, r2, . . . , rk) x1^r1 x2^r2 · · · xk^rk,

where the sum is over all k-tuples of nonnegative integers with Σ_i ri = n.


Idea of proof: The product on the left can be written as a sequence of n factors

(x1 + x2 + · · · + xk)^n = (x1 + x2 + · · · + xk) × (x1 + x2 + · · · + xk) × · · · × (x1 + x2 + · · · + xk).

The product expands to k^n summands, where each summand is a sequence of n letters (one from each factor). For each r1, r2, . . . , rk, exactly (n choose r1, r2, . . . , rk) sequences have r1 copies of x1, r2 copies of x2, etc.

1.6.2 Example: Urn Models

The simple urn model from Section 1.5.1 can be generalized.

Assume that an urn contains k types of objects, Mi is the number of objects of type i, and N = Σ_i Mi is the total number of objects in the urn. A subset of size n is chosen. If each choice of subset is equally likely, then the probability that there are exactly mi objects of type i for i = 1, 2, . . . , k is

P(exactly mi objects of type i for each i) = (M1 choose m1)(M2 choose m2) · · · (Mk choose mk) / (N choose n).

Example 1.12 (3 Types of Objects) Let M1 = 3, M2 = 4, M3 = 5 and n = 4. The following table gives the probability distribution for all possible values of m1 and m2 (with m3 = 4 − m1 − m2):

          m2 = 0    m2 = 1     m2 = 2     m2 = 3    m2 = 4    row total
m1 = 0     5/495     40/495     60/495     20/495     1/495     126/495
m1 = 1    30/495    120/495     90/495     12/495               252/495
m1 = 2    30/495     60/495     18/495                          108/495
m1 = 3     5/495      4/495                                       9/495
total     70/495    224/495    168/495     32/495     1/495     495/495

Since N = M1 + M2 + M3 = 12, the total number of subsets of size 4 is (12 choose 4) = 495. To get the numerator for m1 = 2 and m2 = 1, for example, we compute (3 choose 2)(4 choose 1)(5 choose 1) = 60.

The rightmost column gives the proportion of time we expect to have 0, 1, 2, or 3 elements of type 1 in the subset of size 4. The bottom row gives the proportion of time we expect to have 0, 1, 2, 3, or 4 of type 2 in the subset of size 4.
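The whole table can be generated by looping over the feasible pairs (m1, m2); a sketch using the probability formula above:

    from math import comb

    M1, M2, M3, n = 3, 4, 5, 4
    total = comb(M1 + M2 + M3, n)            # 495 subsets of size 4

    for m1 in range(min(M1, n) + 1):
        for m2 in range(min(M2, n - m1) + 1):
            m3 = n - m1 - m2                 # remaining draws are type 3
            ways = comb(M1, m1) * comb(M2, m2) * comb(M3, m3)
            print(m1, m2, m3, f"{ways}/{total}")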

Urn models can be used to analyze surveys conducted in small populations. Additional examples will be discussed in the context of joint discrete distributions (Chapter 3).

1.7 Applications in Statistics

This section introduces two important applications of the methods we have seen so far.


Figure 1.1: Likelihood for survey example (left; probability versus M) and capture-recapture example (right; probability versus N).

1.7.1 Maximum Likelihood Estimation in Simple Urn Models

In statistics, information from a sample is used when complete information is impossible to obtain or would take too long to gather. For example, information from a randomly chosen subset of elements from an urn can be used to

1. Estimate the number of special elements in an urn, M, when N is known.

2. Estimate the total number of elements in an urn, N, when M is known.

The method uses proportions: if all four numbers (N, M, n, m) are positive, then the proportion of special elements in the subset (m/n) is set equal to the proportion of special elements in the urn (M/N), and the equation is solved for the unknown quantity.

The method is intuitively appealing. Further, it can be shown that estimates obtained using proportions have maximum (or close to maximum) probabilities.

Example 1.13 (Survey Analysis) Assume that in a well-conducted survey of 150 of the 1000 students in the senior class at a local university, 82 students preferred Susan for class president. Since (82/150) × 1000 ≈ 547, we estimate that 547 seniors support Susan.

The left part of Figure 1.1 is a plot of the function

f(M) = (M choose 82)(1000 − M choose 68) / (1000 choose 150) for M = 450, 451, . . . , 625.

If the subset of students chosen for the survey was one of (1000 choose 150) equally likely choices, then the plot suggests that 547 is a maximum likelihood estimate of M.

Example 1.14 (Capture-Recapture Model) Two independent techniques were used to estimate the total number of potentially fatal errors in a complex computer program. The first method identified M = 380 errors. The second method identified a total of n = 278 errors, m = 250 of which were already identified by the first method.

Since (278/250) × 380 ≈ 423, we estimate that there are a total of 423 potentially fatal errors in the program (408 identified by at least one method; 15 not identified by either method).


The right part of Figure 1.1 is a plot of

f(N) = (380 choose 250)(N − 380 choose 28) / (N choose 278) for N = 411, 412, . . . , 437.

If the subset of errors identified by the second method was one of (N choose 278) equally likely choices, then the plot suggests that 423 is a maximum likelihood estimate of N.

Probability (or likelihood) ratio. An alternative to reporting a single value of M or N with maximum probability is to report a range of likely values. Often, the range is determined using probability ratios. That is, M or N is in the reported range if

(probability for M or N) / (maximum probability) ≥ p for a fixed proportion p.

Using p = 0.25, the range of likely values in the survey example is 485 to 608, and the range of likely values in the capture-recapture example is 415 to 431.
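For the survey example, the likelihood f(M), its maximizer, and the p = 0.25 range can all be computed by brute force; this is a sketch (the names f, likelihoods, and M_hat are ours):

    from math import comb

    def f(M, N=1000, n=150, m=82):
        """Probability of m special elements in the sample when the urn holds M."""
        return comb(M, m) * comb(N - M, n - m) / comb(N, n)

    Ms = range(82, 933)                       # M must allow 82 successes and 68 failures
    likelihoods = {M: f(M) for M in Ms}
    M_hat = max(likelihoods, key=likelihoods.get)
    print(M_hat)                              # 547, as in the text

    peak = likelihoods[M_hat]
    likely = [M for M in Ms if likelihoods[M] / peak >= 0.25]
    print(min(likely), max(likely))           # 485 608, the range quoted above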

Survey analysis when N is large. When N is large, we usually work with the unknown proportion of special objects p = M/N and the estimate p̂ = m/n. Further, as an approximation, we assume that p takes values anywhere in the interval 0 ≤ p ≤ 1, and we use “sampling with replacement” in place of “sampling without replacement.”

Survey analysis with three types of objects. The simple method of proportions can be generalized to urn models with three (or more) types of objects. For example, suppose we wish to estimate the numbers of students who will vote for each of three candidates (Susan plus two others) and that in a well-conducted survey of 150 of the 1000 students in the senior class, 72 prefer Susan (candidate 1), 56 prefer the second candidate, and 22 prefer the third candidate. We estimate that

(72/150) × 1000 ≈ 480, (56/150) × 1000 ≈ 373 and (22/150) × 1000 ≈ 147

will vote for candidates 1, 2, 3, respectively. Further, these estimates represent maximum (or near maximum) probabilities over all triples (M1, M2, M3).

1.7.2 Permutation and Nonparametric Methods

In many statistical applications, the null and alternative hypotheses of interest can be paraphrased in the following simple terms:

Ho: Any patterns appearing in the data are due to chance alone.

Ha: There is a tendency for a certain type of pattern to appear.

Permutation methods allow researchers to determine whether to accept or reject a null hypothesis of randomness and, in some cases, to construct confidence intervals for unknown parameters. The methods are applicable in many settings since they require few mathematical assumptions.

For example, consider the analysis of two samples. Let x1, x2, . . . , xn and y1, y2, . . . , ym be the observed samples, where each is a list of numbers with repetitions.

The data could have been generated under one of two sampling models.

Population model. Data generated under a population model can be treated as the values of independent random samples from continuous distributions. (These concepts are defined formally in Chapter 8 of these notes.)

For example, consider testing the null hypothesis that the X and Y distributions are equal versus the alternative hypothesis that X is stochastically smaller than Y. (That is, that the values of X tend to be smaller than the values of Y.)

If n = 3, m = 5, and the observed lists are

x1, x2, x3 = 1.4, 1.2, 2.8 and y1, y2, y3, y4, y5 = 1.7, 2.9, 3.7, 2.3, 1.3,

then the event

X2 < Y5 < X1 < Y1 < Y4 < X3 < Y2 < Y3

has occurred. Under the null hypothesis, each permutation of the n + m = 8 random variables is equally likely; thus, the observed event is one of (n + m)! = 40,320 equally likely choices. Patterns of interest under the alternative hypothesis are events where the Xi's tend to be smaller than the Yj's.

Randomization model. Data generated under a randomization model cannot be treated as the values of independent random samples from continuous distributions, but can be thought of as one of N equally likely choices under the null hypothesis.

In the example above, the null hypothesis of randomness is that the observed data is one of N = 40,320 equally likely choices, where a choice corresponds to a matching of the 8 numbers to the labels x1, x2, x3, y1, y2, y3, y4, y5.

Permutation tests. To conduct a permutation test using a test statistic T,

1. The sampling distribution of T is obtained by computing the value of the statistic for each reordering of the data, and

2. The observed significance level (or p value) is computed by comparing the observed value of T to the sampling distribution from step 1.

The sampling distribution from the first step is called the permutation distribution of the statistic T, and the p value is called a permutation p value.


Table 1.1: Sampling distribution of the sum of observations in the first sample.

sum  x sample         sum  x sample         sum  x sample
3.9  1.2, 1.3, 1.4    5.9  1.4, 1.7, 2.8    7.1  1.4, 2.8, 2.9
4.2  1.2, 1.3, 1.7    5.9  1.3, 1.7, 2.9    7.2  1.2, 2.3, 3.7
4.3  1.2, 1.4, 1.7    6.0  1.4, 1.7, 2.9    7.3  1.3, 2.3, 3.7
4.4  1.3, 1.4, 1.7    6.2  1.2, 1.3, 3.7    7.4  1.4, 2.3, 3.7
4.8  1.2, 1.3, 2.3    6.3  1.2, 1.4, 3.7    7.4  1.7, 2.8, 2.9
4.9  1.2, 1.4, 2.3    6.3  1.2, 2.3, 2.8    7.7  1.2, 2.8, 3.7
5.0  1.3, 1.4, 2.3    6.4  1.3, 2.3, 2.8    7.7  1.7, 2.3, 3.7
5.2  1.2, 1.7, 2.3    6.4  1.2, 2.3, 2.9    7.8  1.2, 2.9, 3.7
5.3  1.2, 1.3, 2.8    6.4  1.3, 1.4, 3.7    7.8  1.3, 2.8, 3.7
5.3  1.3, 1.7, 2.3    6.5  1.3, 2.3, 2.9    7.9  1.4, 2.8, 3.7
5.4  1.2, 1.4, 2.8    6.5  1.4, 2.3, 2.8    7.9  1.3, 2.9, 3.7
5.4  1.4, 1.7, 2.3    6.6  1.2, 1.7, 3.7    8.0  1.4, 2.9, 3.7
5.4  1.2, 1.3, 2.9    6.6  1.4, 2.3, 2.9    8.0  2.3, 2.8, 2.9
5.5  1.2, 1.4, 2.9    6.7  1.3, 1.7, 3.7    8.2  1.7, 2.8, 3.7
5.5  1.3, 1.4, 2.8    6.8  1.4, 1.7, 3.7    8.3  1.7, 2.9, 3.7
5.6  1.3, 1.4, 2.9    6.8  1.7, 2.3, 2.8    8.8  2.3, 2.8, 3.7
5.7  1.2, 1.7, 2.8    6.9  1.2, 2.8, 2.9    8.9  2.3, 2.9, 3.7
5.8  1.2, 1.7, 2.9    6.9  1.7, 2.3, 2.9    9.4  2.8, 2.9, 3.7
5.8  1.3, 1.7, 2.8    7.0  1.3, 2.8, 2.9

Continuing with the example above, let S be the sum of numbers in the x sample. Since the value of S depends only on which observations are labelled x's, and not on the relative ordering of all 8 observations, the sampling distribution of S under the null hypothesis is obtained by computing its value for each partition of the 8 numbers into subsets of sizes 3 and 5, respectively.

Table 1.1 shows the values of S for each of the (8 choose 3) = 56 choices of observations for the first sample. The observed value of S is 5.4. Since small values of S support the alternative hypothesis that x values tend to be smaller than y values, the observed significance level is P(S ≤ 5.4) = 13/56.
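Table 1.1 and the p value can be reproduced by enumerating the 56 subsets with itertools; a small tolerance guards against floating-point ties:

    from itertools import combinations

    pooled = [1.4, 1.2, 2.8, 1.7, 2.9, 3.7, 2.3, 1.3]   # x values then y values
    sums = [sum(c) for c in combinations(pooled, 3)]
    assert len(sums) == 56

    s_obs = 1.4 + 1.2 + 2.8                             # observed S = 5.4
    p_value = sum(1 for s in sums if s <= s_obs + 1e-9) / 56
    print(p_value)                                      # 13/56 = 0.2321...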

Example 1.15 (Olympic Sprinting Times) As a second permutation test example, consider a test of association between x-values and y-values using data on Olympic sprinting times (Hand et al., Chapman & Hall, 1994, p. 248), and the sample correlation statistic

R = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^n (Xi − X̄)² Σ_{i=1}^n (Yi − Ȳ)² ),

where X̄ and Ȳ are the sample means of the X and Y samples, respectively, and n is the total number of data pairs.

The left part of Figure 1.2 shows the winning times in seconds (vertical axis) for the men's 100 meter finals versus the year since 1900 (horizontal axis) for the years 1912, 1924, . . ., 1972.


Figure 1.2: Winning time in seconds (y) versus year since 1900 (x) for Olympic sprinting times example (left plot). Triglycerides in mg/dL (y) versus cholesterol in mg/dL (x) for heart disease example (right plot).

The data are reproduced in the following table:

x   12     24     36     48     60     72

y   10.8   10.6   10.3   10.3   10.2   10.14

For these data, the observed value of R is −0.941.

To determine if the observed association is due to chance alone, a permutation analysis of R = 0 versus R ≠ 0 was conducted at the 5% significance level.

The sampling distribution of R under the null hypothesis is obtained by matching x-values to permuted y-values in all possible ways, and computing the value of R in each case. Since two y-values are equal, there are a total of 6!/2 = 360 permutations to consider. Since the permutation p value is

p value = #(|R| ≥ |−0.941|) / 360 = 2/360 ≈ 0.006

(obtained using the computer), there is evidence that the observed negative association is not due to chance alone.
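
A sketch of the same enumeration in Python (the corr helper is a direct transcription of the formula for R above; it is not code from the original notes). Looping over all 6! = 720 orderings counts each of the 360 distinct matchings twice because of the tied y-values, which leaves the p value unchanged:

    from itertools import permutations
    from math import sqrt

    x = [12, 24, 36, 48, 60, 72]
    y = [10.8, 10.6, 10.3, 10.3, 10.2, 10.14]

    def corr(u, v):
        """Sample correlation R of the pairs (u[i], v[i])."""
        n = len(u)
        ub, vb = sum(u) / n, sum(v) / n
        sxy = sum((a - ub) * (b - vb) for a, b in zip(u, v))
        sxx = sum((a - ub) ** 2 for a in u)
        syy = sum((b - vb) ** 2 for b in v)
        return sxy / sqrt(sxx * syy)

    r_obs = corr(x, y)                                   # about -0.941
    rs = [corr(x, list(p)) for p in permutations(y)]
    p_value = sum(abs(r) >= abs(r_obs) for r in rs) / len(rs)
    print(round(r_obs, 3), round(p_value, 4))            # -0.941 and about 0.006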

Conditional test, nonparametric test. A permutation test is an example of a conditional test, since the sampling distribution of T is computed conditional on the observations.

Note that the term nonparametric test is often used to describe permutation tests where observations have been replaced by ranks. The Wilcoxon rank sum and Mann-Whitney tests for the analysis of two samples, and the Wilcoxon signed rank test for the analysis of paired samples, are examples of nonparametric tests.

Monte Carlo test. If the number of reorderings of the data is very large, then the computer can be used to approximate the sampling distribution of T by computing its value for a fixed number of random reorderings of the data. The approximate sampling distribution can then be used to estimate the p value.

A test conducted in this way is an example of a Monte Carlo test. (In a Monte Carlo analysis, simulation is used to estimate a quantity of interest. Here the quantity of interest is a p value.)

Example 1.16 (Heart Disease Study) (Hand et al., Chapman & Hall, 1994, p. 221) Cholesterol and triglycerides belong to the class of chemicals known as lipids (fats). As part of a study to determine the relationship between high levels of lipids and coronary artery disease, researchers measured plasma levels of cholesterol and triglycerides in milligrams per deciliter (mg/dL) in 371 men complaining of chest pain.

The right part of Figure 1.2 compares the cholesterol (x) and triglycerides (y) levels for the 51 men with no evidence of heart disease. The sample correlation is 0.325. (See the last example for the definition of the sample correlation statistic.)

To determine if the observed association is due to chance alone, a permutation analysis can be conducted using the sample correlation statistic R as test statistic.

Consider testing R = 0 versus R ≠ 0 at the 5% significance level. Under the null hypothesis of randomness, each matching of observed x values to permuted y values is equally likely. Thus, the observed matching is one of 51! ≈ 1.6 × 10^66 equally likely choices.

In a Monte Carlo analysis using 5000 random matchings (including the observed matching), the estimated permutation p value was

estimated p value = #(|R| ≥ |0.325|) / 5000 = 98/5000 = 0.0196.

Thus, there is evidence that the positive association between levels of cholesterol and triglycerides is not due to chance alone.
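
A minimal Monte Carlo sketch in Python (an illustration, not the original analysis; the data arrays for the 51 men are not reproduced here, and any statistic, such as the corr helper above, can be passed in as stat):

    import random

    def monte_carlo_p(x, y, stat, n_rand=4999, seed=1):
        """Two-sided Monte Carlo permutation p value for the statistic stat,
        using n_rand random matchings plus the observed matching."""
        rng = random.Random(seed)
        t_obs = abs(stat(x, y))
        extreme = 1                      # the observed matching counts itself
        yy = list(y)
        for _ in range(n_rand):
            rng.shuffle(yy)              # a random matching of y values to x values
            if abs(stat(x, yy)) >= t_obs:
                extreme += 1
        return extreme / (n_rand + 1)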

1.8 Conditional Probability

Assume that A and B are events, and that P(B) > 0. Then the conditional probability of A given B is defined as follows:

P(A|B) = P(A ∩ B) / P(B).

Event B is often referred to as the conditional sample space. The conditional probability P(A|B) is the relative “size” of A within B.

Example 1.17 (Respiratory Problems) For example, suppose that 40% of the adults in a certain population smoke cigarettes and that 28% smoke cigarettes and have respiratory problems. Then

P(Respiratory problems | Smoker) = 0.28/0.40 = 0.70

is the probability that an adult has respiratory problems given that the adult is a smoker. (70% of smokers have respiratory problems.)


Equally likely outcomes. Note that if the sample space S is finite and each outcome is equally likely, then the conditional probability of A given B simplifies to the following:

P(A|B) = (|A ∩ B|/|S|) / (|B|/|S|) = |A ∩ B| / |B| = (number of elements in A ∩ B) / (number of elements in B).

1.8.1 Multiplication Rules for Probability

Assume that A and B are events with positive probability. Then the definition of conditional probability implies that the probability of the intersection, P(A ∩ B), can be written as a product of probabilities in two different ways:

P (A ∩ B) = P (B) × P (A|B) = P (A) × P (B|A).

More generally,

Theorem 1.18 (Multiplication Rule for Probability) If A1, A2, . . . , Ak are events and P(A1 ∩ A2 ∩ · · · ∩ Ak−1) > 0, then

P(A1 ∩ A2 ∩ · · · ∩ Ak) = P(A1) × P(A2|A1) × P(A3|A1 ∩ A2) × · · · × P(Ak|A1 ∩ A2 ∩ · · · ∩ Ak−1).

Example 1.19 (Sampling Without Replacement) For example, suppose that four slips of paper are sampled without replacement from a well-mixed urn containing twenty-five slips of paper: fifteen slips with the letter X written on each and ten slips of paper with the letter Y written on each. Then the probability of observing the sequence XYXX is

P(XYXX) = P(X) × P(Y|X) × P(X|XY) × P(X|XYX) = 15/25 × 10/24 × 14/23 × 13/22 ≈ 0.09.

(The probability of choosing an X slip is 15/25; with an X removed from the urn, the probability of drawing a Y slip is 10/24; with an X and a Y removed from the urn, the probability of drawing an X slip is 14/23; with two X slips and one Y slip removed from the urn, the probability of drawing an X slip is 13/22.)
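
This product is easy to check with exact rational arithmetic; a quick Python sketch (not part of the original notes):

    from fractions import Fraction

    # Draws without replacement: 15 X slips and 10 Y slips, sequence X, Y, X, X.
    p = Fraction(15, 25) * Fraction(10, 24) * Fraction(14, 23) * Fraction(13, 22)
    print(p, float(p))    # 91/1012, about 0.0899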

1.8.2 Law of Total Probability

The law of total probability can be used to write an unconditional probability as the weighted average of conditional probabilities. Specifically,

Theorem 1.20 (Law of Total Probability) Let A1, A2, . . ., Ak and B be events with nonzero probability. If A1, A2, . . ., Ak are pairwise disjoint with union S, then

P (B) = P (A1) × P (B|A1) + P (A2) × P (B|A2) + · · · + P (Ak) × P (B|Ak).


To demonstrate the law of total probability, note that if A1, A2, . . ., Ak are pairwise disjoint with union S, then the sets B ∩ A1, B ∩ A2, . . ., B ∩ Ak are pairwise disjoint with union B. Thus, axiom 3 and the definition of conditional probability imply that

P(B) = P(B ∩ A1) + P(B ∩ A2) + · · · + P(B ∩ Ak) = P(A1)P(B|A1) + P(A2)P(B|A2) + · · · + P(Ak)P(B|Ak).

Example 1.21 (Respiratory Problems, continued) For example, suppose that 70% of smokers and 15% of nonsmokers in a certain population of adults have respiratory problems. If 40% of the population smoke cigarettes, then

P(Respiratory problems) = 0.40(0.70) + 0.60(0.15) = 0.37

is the probability of having respiratory problems.

Law of average conditional probabilities. The law of total probability is often called the law of average conditional probabilities. Specifically, P(B) is the weighted average of the collection of conditional probabilities P(B|Aj), using the collection of unconditional probabilities P(Aj) as weights.

In the respiratory problems example above, 0.37 is the weighted average of 0.70 (the probability that a smoker has respiratory problems) and 0.15 (the probability that a nonsmoker has respiratory problems).

1.8.3 Bayes Rule

Bayes rule, proven by the Reverend T. Bayes in the 1760s, can be used to update probabilities given that an event has occurred. Specifically,

Theorem 1.22 (Bayes Rule) Let A1, A2, . . ., Ak and B be events with nonzero probability. If A1, A2, . . ., Ak are pairwise disjoint with union S, then

P(Aj|B) = P(Aj) × P(B|Aj) / ( P(A1) × P(B|A1) + · · · + P(Ak) × P(B|Ak) ), for j = 1, 2, . . . , k.

Bayes rule is a restatement of the definition of conditional probability: the numerator in the formula is P(Aj ∩ B), the denominator is P(B) (by the law of total probability), and the ratio is P(Aj|B).

Example 1.23 (Defective Products) For example, suppose that 2% of the products assembled during the day shift and 6% of the products assembled during the night shift at a small company are defective and need reworking. If the day shift accounts for 55% of the products assembled by the company, then

P(Day shift | Defective) = 0.55(0.02) / (0.55(0.02) + 0.45(0.06)) ≈ 0.289

is the probability that a product was assembled during the day shift given that the product is defective. (About 28.9% of defective products are assembled during the day shift.)
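
A small sketch of Theorem 1.22 in Python (the function name bayes is illustrative), applied to the defective-products numbers:

    def bayes(priors, likelihoods):
        """Posterior probabilities P(Aj|B) from priors P(Aj)
        and likelihoods P(B|Aj)."""
        joint = [p * l for p, l in zip(priors, likelihoods)]
        total = sum(joint)        # P(B), by the law of total probability
        return [j / total for j in joint]

    posterior = bayes([0.55, 0.45], [0.02, 0.06])   # day shift, night shift
    print([round(p, 3) for p in posterior])         # [0.289, 0.711]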


Table 1.2: Distribution of unaided distance scores for right (x) and left (y) eyes.

        y = 1   y = 2   y = 3   y = 4   total
x = 1   0.203   0.036   0.017   0.009   0.265
x = 2   0.031   0.202   0.058   0.010   0.301
x = 3   0.016   0.048   0.237   0.027   0.328
x = 4   0.005   0.011   0.024   0.066   0.106
total   0.255   0.297   0.336   0.112   1.000


Figure 1.3: Distributions of left eye scores given right eye score is 1 (left) or 4 (right).

Prior and posterior distributions. In applications of Bayes rule, the collection of probabilities P(Aj) are often referred to as the prior probabilities (the probabilities before observing an outcome in B), and the collection of probabilities P(Aj|B) are often referred to as the posterior probabilities (the probabilities after event B has occurred).

Example 1.24 (Eyesight Study) (Stuart, Biometrika 42:412-416, 1955) Between 1943 and 1946 the eyesight of more than seven thousand women aged 30-39 employed in the British weapons industry was tested. For each woman the unaided distance vision of the right eye (x) and of the left eye (y) was recorded using the four-point scale 1, 2, 3, 4, where 1 is the highest grade and 4 is the lowest grade. Table 1.2 summarizes the results of the experiment.

For example, for 20.3% of women who participated in the study both eyes were given the highest score (x = y = 1), and for 6.6% of women who participated in the study both eyes were given the lowest score (x = y = 4).

Consider the experiment “Choose a woman from the set of women studied by Stuart and record the scores for both eyes,” assume each choice is equally likely, and let Aj be the event that the left eye score equals j.

The left part of Figure 1.3 shows the posterior distribution P(Aj|B) (in black) superimposed on the prior distribution P(Aj) (in gray) when B is the event that the right eye score is 1. (If the right eye score is 1, then the left eye score is most likely 1.)

The right part of Figure 1.3 shows the posterior distribution P(Aj|B) (in black) superimposed on the prior distribution P(Aj) (in gray) when B is the event that the right eye score is 4. (If the right eye score is 4, then the left eye score is most likely 4.)

Example 1.25 (Religion and Education) As a last example, suppose that 45% of the population of a certain adult community are Catholic, 15% are Jewish, and 40% are Protestant. Further, suppose that

1. Education levels among the Catholic population are as follows:

Level:     8th grade   Some          HS         Some      College    Post-
           or less     High School   Graduate   College   Graduate   Graduate
Percent:   21%         28%           32%        8%        7%         4%

2. Education levels among the Jewish population are as follows:

Level:     8th grade   Some          HS         Some      College    Post-
           or less     High School   Graduate   College   Graduate   Graduate
Percent:   7%          23%           24%        15%       18%        13%

3. Education levels among the Protestant population are as follows:

Level:     8th grade   Some          HS         Some      College    Post-
           or less     High School   Graduate   College   Graduate   Graduate
Percent:   13%         22%           27%        11%       19%        8%

The information above can be used to construct a contingency table of proportions in each religion-education group in this community:

Level:        8th grade   Some          HS         Some      College    Post-
              or less     High School   Graduate   College   Graduate   Graduate

Catholic:     0.0945      0.126         0.144      0.036     0.0315     0.018     0.45
Jewish:       0.0105      0.0345        0.036      0.0225    0.027      0.0195    0.15
Protestant:   0.052       0.088         0.108      0.044     0.076      0.032     0.40

              0.157       0.2485        0.288      0.1025    0.1345     0.0695    1

Each number in the body of the table is obtained using the multiplication rule. For example,

P (Catholic and HS Graduate) = 0.45 × 0.32 = 0.144.

Each number in the bottom row is obtained using the law of total probability. For example,

P (HS Graduate) = 0.45 × 0.32 + 0.15 × 0.24 + 0.40 × 0.27 = 0.288.

The prior distribution of Catholic, Jewish, and Protestant is 0.45, 0.15, and 0.40, respectively. Given that a person is a high school graduate (and no more), for example, the posterior distribution of Catholic, Jewish, and Protestant is

0.144/0.288 = 0.5, 0.036/0.288 = 0.125, and 0.108/0.288 = 0.375, respectively.

This posterior distribution was computed using Bayes rule.


1.9 Independent Events and Mutually Independent Events

Events A and B are said to be independent if

P (A ∩ B) = P (A) × P (B).

Otherwise, A and B are said to be dependent.

If A and B are independent and have positive probabilities, then the multiplication rule for probability (Section 1.8.1) implies that

P (A|B) = P (A) and P (B|A) = P (B).

(The relative size of A within B is the same as its relative size within S; the relative size of B within A is the same as its relative size within S.)

If A and B are independent, 0 < P(A) < 1, and 0 < P(B) < 1, then A and Bᶜ are independent, Aᶜ and B are independent, and Aᶜ and Bᶜ are independent.

More generally, events A1, A2, . . . , Ak are said to be mutually independent if

• For each pair of distinct indices (i1, i2): P (Ai1 ∩ Ai2) = P (Ai1) × P (Ai2),

• For each triple of distinct indices (i1, i2, i3):

P (Ai1 ∩ Ai2 ∩ Ai3) = P (Ai1) × P (Ai2) × P (Ai3),

• and so forth.

Example 1.26 (Sampling With Replacement) For example, suppose that four slips of paper are sampled with replacement from a well-mixed urn containing twenty-five slips of paper: fifteen slips with the letter X written on each and ten slips of paper with the letter Y written on each. Then the probability of observing the sequence XYXX is

P(XYXX) = P(X) × P(Y) × P(X) × P(X) = (15/25)³ (10/25) = 54/625 = 0.0864.

Further, since C(4, 3) = 4 sequences have exactly 3 X's, the probability of observing a sequence with exactly 3 X's is 4(0.0864) = 0.3456.
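
A quick check of both computations in Python (math.comb supplies the binomial coefficient C(4, 3)):

    from math import comb

    p_x = 15 / 25                          # P(X slip) on any single draw
    p_seq = p_x ** 3 * (1 - p_x)           # P(XYXX), order-specific
    p_three = comb(4, 3) * p_seq           # any sequence with exactly 3 X's
    print(round(p_seq, 4), round(p_three, 4))    # 0.0864 0.3456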

Example 1.27 (Three Coins) You toss a fair coin three times and record an h (for heads) or a t (for tails) each time. The sample space is

S = {hhh, hht, hth, thh, htt, tht, tth, ttt}.

Let Hi be the event that the ith toss results in a head:

H1 = {hhh, hht, hth, htt}, H2 = {hhh, hht, thh, tht}, H3 = {hhh, hth, thh, tth}.


Since P(H1) = P(H2) = P(H3) = 1/2,

P(Hi ∩ Hj) = 1/4 = P(Hi) × P(Hj) for (i, j) = (1, 2), (1, 3), (2, 3),

and P(H1 ∩ H2 ∩ H3) = 1/8 = P(H1)P(H2)P(H3), the events are mutually independent.

Example 1.28 (Not Mutually Independent) Suppose that the sample space of an experiment consists of the 3! = 6 permutations of the letters a, b, and c, along with the triples of each letter:

S = {aaa, bbb, ccc, abc, acb, bac, bca, cab, cba}.

Assume each outcome is equally likely, and let Ai be the event that the ith letter is an a:

A1 = {aaa, abc, acb}, A2 = {aaa, bac, cab}, A3 = {aaa, bca, cba}.

Then P(A1) = P(A2) = P(A3) = 1/3 and

P(A1 ∩ A2) = P(A1 ∩ A3) = P(A2 ∩ A3) = P(A1 ∩ A2 ∩ A3) = 1/9

since each intersection set equals the simple event {aaa}.

Since

P(Ai ∩ Aj) = 1/9 = P(Ai) × P(Aj) for (i, j) = (1, 2), (1, 3), (2, 3),

the events are independent in pairs. However, since P(A1 ∩ A2 ∩ A3) ≠ P(A1)P(A2)P(A3), the three events are not mutually independent.

Repeated trials and mutual independence. As stated at the beginning of the chapter, the term experiment is used in probability theory to describe a procedure whose outcome is not known in advance with certainty. Experiments are assumed to be repeatable, and to have a well-defined set of outcomes. Repeated trials are repetitions of an experiment using the specified procedure, with the outcomes of the trials having no influence on one another. The results of repeated trials of an experiment are mutually independent.


2 Discrete Random Variables

Researchers use random variables to describe the numerical results of experiments. For example, if a fair coin is tossed 5 times and the total number of heads is recorded, then a random variable whose values are 0, 1, 2, 3, 4, 5 is used to give a numerical description of the results.

This chapter focuses on discrete random variables and their probability distributions.

2.1 Definitions

A random variable is a function from the sample space of an experiment to the real numbers. The range of a random variable is the set of values the random variable assumes. Random variables are usually denoted by capital letters (X, Y, Z, . . .) and their values by lower case letters (x, y, z, . . .).

If the range of a random variable is a finite or countably infinite set, then the random variable is said to be discrete; if the range is an interval or a union of intervals, the random variable is said to be continuous; otherwise, the random variable is said to be mixed.

If X is a discrete random variable, then P(X = x) is the probability that an outcome has value x. Similarly, P(X ≤ x) is the probability that an outcome has value x or less, P(a < X < b) is the probability that an outcome has value strictly between a and b, and so forth.

Example 2.1 (Coin Tossing) Let H be the number of heads in 8 tosses of a fair coin and let X be the difference between the number of heads and the number of tails. That is, let X = H − (8 − H) = 2H − 8. Since

P(H = h) = C(8, h)/256 when h = 0, 1, . . . , 8 and 0 otherwise,

we know that X is a discrete random variable with range {−8, −6, −4, −2, 0, 2, 4, 6, 8}.

The following table displays P(X = x) for each x in the range of X, using a common denominator throughout:

h            0      1      2      3      4      5      6      7      8
x           −8     −6     −4     −2      0      2      4      6      8
P(X = x)  1/256  8/256  28/256  56/256  70/256  56/256  28/256  8/256  1/256

From this table, we can compute all probabilities of interest. For example,

P(X ≥ 3) = P(X = 4, 6, 8) = 37/256 ≈ 0.1445.

(There are 256 sequences of heads and tails. Of these, 37 have either 6, 7, or 8 heads.)
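
A Python sketch reproducing the table and P(X ≥ 3) with exact fractions:

    from math import comb
    from fractions import Fraction

    # X = 2H - 8, where H is the number of heads in 8 tosses.
    pdf = {2 * h - 8: Fraction(comb(8, h), 2 ** 8) for h in range(9)}

    p = sum(prob for x, prob in pdf.items() if x >= 3)
    print(p)    # 37/256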


2.1.1 PDF and CDF for Discrete Distributions

If X is a discrete random variable, then the frequency function (FF) or probability density function (PDF) of X is defined as follows:

f(x) = P (X = x) for all real numbers x.

PDFs satisfy the following properties:

1. f(x) ≥ 0 for all real numbers x.

2. Σ_{x ∈ R} f(x) = 1, where R is the range of X.

Since f(x) is the probability of an event and events have nonnegative probabilities, the values of f(x) must be nonnegative (property 1). Since the events X = x for x ∈ R are mutually disjoint with union S (the sample space), the sum of the probabilities of these events must be 1 (property 2).

The cumulative distribution function (CDF) of the discrete random variable X is defined as follows:

F (x) = P (X ≤ x) for all real numbers x.

Note that F (x) is a sum of probabilities:

F(x) = Σ_{y ∈ Rx} f(y), where Rx = {y ∈ R : y ≤ x}.

CDFs satisfy the following properties:

1. limx→−∞ F (x) = 0 and limx→+∞ F (x) = 1.

2. If x1 ≤ x2, then F (x1) ≤ F (x2).

3. F (x) is right continuous. That is, for each a, limx→a+ F (x) = F (a).

F(x) represents cumulative probability, with limits 0 and 1 (property 1). Cumulative probability increases with increasing x (property 2), and has discrete jumps at values of x in the range of the random variable (property 3). (The concept of limit is central to calculus, and will be defined in Chapter 4 of these notes.)

Plotting PDF and CDF functions. The PDF of a discrete random variable is represented graphically

1. By using a scatter plot of pairs (x, f(x)) for x ∈ R, or

2. By using a probability histogram, where area is used to represent probability.


Figure 2.1: Probability histogram (left) and cumulative distribution function (right) for the difference between the number of heads and the number of tails in eight tosses of a fair coin.

The CDF of a discrete random variable is represented graphically as a step function, with steps of height f(x) at each x ∈ R.

Example 2.2 (Coin Tossing, continued) For example, the left part of Figure 2.1 is the probability histogram for the difference between the number of heads and the number of tails in eight tosses of a fair coin. For each x in the range of the random variable, a rectangle with base equal to the interval [x − 0.50, x + 0.50] and with height equal to f(x) is drawn. The total area is 1.0. The right part is a representation of the cumulative distribution function. Note that F(x) is nondecreasing, F(x) = 0 when x < −8, and F(x) = 1 when x > 8. Steps occur at x = −8, −6, . . . , 6, 8.

2.2 Families of Distributions

This section briefly describes four families of discrete probability distributions, and states properties of these distributions.

2.2.1 Example: Discrete Uniform Distribution

Let n be a positive integer. The random variable X is said to be a discrete uniform random variable, or to have a discrete uniform distribution, with parameter n when its PDF is as follows:

f(x) = 1/n when x = 1, 2, . . . , n and 0 otherwise.

For example, if you roll a fair six-sided die and let X equal the number of dots on the top face, then X has a discrete uniform distribution with n = 6.

2.2.2 Example: Hypergeometric Distribution

Let N, M, and n be integers with 0 < M < N and 0 < n < N. The random variable X is said to be a hypergeometric random variable, or to have a hypergeometric distribution, with parameters n, M, and N, when its PDF is as follows:

f(x) = C(M, x) C(N − M, n − x) / C(N, n),

for integers x between max(0, n + M − N) and min(n, M) (and zero otherwise). Note that the restrictions on x ensure that all combinations are defined and positive.

Figure 2.2: Probability histograms for hypergeometric (left) and binomial (right) examples.

Example 2.3 (n = 8, M = 14, N = 35) Assume that X has a hypergeometric distribution with parameters n = 8, M = 14 and N = 35. The following table gives the values of the PDF of X for all x in its range, using 4 decimal place accuracy.

x 0 1 2 3 4 5 6 7 8

f(x) 0.0086 0.0692 0.2098 0.3147 0.2545 0.1131 0.0268 0.0031 0.0001

The left part of Figure 2.2 is a probability histogram for X.

Hypergeometric distributions are used to model urn experiments (see Section 1.5.1). Suppose there are M special objects in an urn containing a total of N objects. Let X be the number of special objects in a subset of size n chosen from the urn. If each choice of subset is equally likely, then X has a hypergeometric distribution. In the example above, 40% (14/35) of the elements in the urn are special.
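
A Python sketch of the hypergeometric PDF (the helper name hyper_pdf is illustrative), reproducing the table in Example 2.3:

    from math import comb

    def hyper_pdf(x, n, M, N):
        """P(X = x) when X is hypergeometric with parameters n, M, N."""
        if max(0, n + M - N) <= x <= min(n, M):
            return comb(M, x) * comb(N - M, n - x) / comb(N, n)
        return 0.0

    print([round(hyper_pdf(x, 8, 14, 35), 4) for x in range(9)])
    # [0.0086, 0.0692, 0.2098, 0.3147, 0.2545, 0.1131, 0.0268, 0.0031, 0.0001]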

2.2.3 Distributions Related to Bernoulli Experiments

A Bernoulli experiment is an experiment with two possible outcomes. The outcome of chief interest is often called “success” and the other outcome “failure.” Let p equal the probability of success.

Imagine repeating a Bernoulli experiment n times. The expected number of successes in n independent trials of a Bernoulli experiment with success probability p is np.

For example, suppose that you roll a fair six-sided die and observe the number on the top face each time. Let success be a 1 or 4 on the top face, and failure be a 2, 3, 5, or 6 on the top face. Then p = 1/3 is the probability of success. In 600 trials of the experiment, you expect 200 successes.


Example: Bernoulli distribution. Suppose that a Bernoulli experiment is run once. Let X equal 1 if a success occurs and 0 if a failure occurs. Then X is said to be a Bernoulli random variable, or to have a Bernoulli distribution, with parameter p. The PDF of X is as follows:

f(1) = p, f(0) = 1 − p, and f(x) = 0 otherwise.

Example: binomial distribution. Let X be the number of successes in n independent trials of a Bernoulli experiment with success probability p. Then X is said to be a binomial random variable, or to have a binomial distribution, with parameters n and p. The PDF of X is as follows:

f(x) = C(n, x) pˣ(1 − p)ⁿ⁻ˣ when x = 0, 1, 2, . . . , n, and 0 otherwise.

For each x, f(x) is the probability of the event “exactly x successes in n independent trials.” (There are a total of C(n, x) sequences with exactly x successes and n − x failures; each such sequence has probability pˣ(1 − p)ⁿ⁻ˣ.)

Example 2.4 (n = 8, p = 0.4) Assume that X has a binomial distribution with parameters n = 8 and p = 0.4. The following table gives the values of the PDF of X for all x in its range:

x 0 1 2 3 4 5 6 7 8

f(x) 0.0168 0.0896 0.209 0.2787 0.2322 0.1239 0.0413 0.0079 0.0007

The right part of Figure 2.2 is a probability histogram for X.

Notes. The binomial theorem (Section 1.5.2) can be used to demonstrate that

f(0) + f(1) + · · · + f(n) = 1.

Note also that a Bernoulli random variable is a binomial random variable with n = 1.

2.2.4 Simple Random Samples

Suppose that an urn contains N objects. A simple random sample of size n is a sequence of n objects chosen without replacement from the urn, where the choice of each sequence is equally likely.

Let M be the number of special objects in the urn and X be the number of special objects in a simple random sample of size n. Then X has a hypergeometric distribution with parameters n, M, N. Further, if N is very large, then binomial probabilities can often be used to approximate hypergeometric probabilities.

Theorem 2.5 (Binomial Approximation) If N is large, then the binomial distribution with parameters n and p = M/N can be used to approximate the hypergeometric distribution with parameters n, M, N. Specifically,

P(x special objects) = C(M, x) C(N − M, n − x) / C(N, n) ≈ C(n, x) pˣ(1 − p)ⁿ⁻ˣ for each x.


Note that if X is the number of special objects in a sequence of n objects chosen with replacement from the urn and if the choice of each sequence is equally likely, then X has a binomial distribution with parameters n and p = M/N. The theorem says that if N is large, then the model where sampling is done with replacement can be used to approximate the model where sampling is done without replacement.

Adequacy of the approximation. One rule of thumb for judging the adequacy of the approximation is the following: the binomial approximation is adequate when

20n < N and 0.05 < p < 0.95.

Survey analysis. Simple random samples are used in surveys. If the survey population is small, then hypergeometric distributions are used to analyze the results. If the survey population is large, then binomial distributions are used to analyze the results, even though each person's opinion is solicited at most once.

Example 2.6 (Survey Analysis) For example, suppose that a surveyor is interested in determining the level of support for a proposal to change the local tax structure, and decides to choose a simple random sample of size 10 from the registered voter list. If there are a total of 120 registered voters, one-third of whom support the proposal, then the probability that exactly 3 of the 10 chosen voters support the proposal is

P(X = 3) = C(40, 3) C(80, 7) / C(120, 10) ≈ 0.27.

If there are thousands of registered voters, the probability is

P(X = 3) ≈ C(10, 3) (1/3)³ (2/3)⁷ ≈ 0.26.

Note that the exact number of registered voters is not needed in the approximation.
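
A sketch comparing the exact and approximate answers in Python:

    from math import comb

    exact = comb(40, 3) * comb(80, 7) / comb(120, 10)     # hypergeometric
    approx = comb(10, 3) * (1 / 3) ** 3 * (2 / 3) ** 7    # binomial, p = 1/3
    print(round(exact, 2), round(approx, 2))              # 0.27 0.26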

2.3 Mathematical Expectation

Mathematical expectation generalizes the idea of a weighted average, where probability distributions are used as the weights.

2.3.1 Definitions for Finite Discrete Random Variables

Let X be a discrete random variable with finite range R and PDF f(x). The mean of X (or the expected value of X or the expectation of X) is defined as follows:

E(X) = Σ_{x ∈ R} x f(x).


Similarly, if g(X) is a real-valued function of X, then the mean of g(X) (or the expected value of g(X) or the expectation of g(X)) is

E(g(X)) = Σ_{x ∈ R} g(x) f(x).

Example 2.7 (Nickels and Dimes) Assume you have 3 dimes and 5 nickels in your pocket. You choose a subset of 4 coins, let X equal the number of dimes in the subset and

g(X) = 10X + 5(4 − X) = 20 + 5X

be the total value (in cents) of the chosen coins. If each choice of subset is equally likely, then X has a hypergeometric distribution with parameters n = 4, M = 3, and N = 8. The expected number of dimes in the subset is

E(X) = Σ x f(x) = 0(5/70) + 1(30/70) + 2(30/70) + 3(5/70) = 1.5

and the expected total value of the chosen coins is

E(g(X)) = Σ (20 + 5x) f(x) = 20(5/70) + 25(30/70) + 30(30/70) + 35(5/70) = 27.5 cents.

Note that E(g(X)) = g(E(X)).
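
A Python sketch of these two expectations, with the PDF values written out as fractions from the example:

    from fractions import Fraction

    f = {0: Fraction(5, 70), 1: Fraction(30, 70),
         2: Fraction(30, 70), 3: Fraction(5, 70)}

    E_X = sum(x * p for x, p in f.items())
    E_g = sum((20 + 5 * x) * p for x, p in f.items())
    print(E_X, E_g)    # 3/2 55/2, i.e., 1.5 dimes and 27.5 cents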

2.3.2 Properties of Expectation

Properties of sums imply the following properties of expectation:

1. If a is a constant, then E(a) = a.

2. If a and b are constants, then E(a + bX) = a + bE(X).

3. If ci is a constant and gi(X) is a real-valued function for i = 1, 2, . . . , k, then

E(c1g1(X) + c2g2(X) + · · · + ckgk(X)) = c1E(g1(X)) + c2E(g2(X)) + · · · + ckE(gk(X)).

The first property says that the mean of a constant function is the constant itself. The second property says that if g(X) = a + bX, then E(g(X)) = g(E(X)). The third property generalizes the first two.

Note that if g(X) ≠ a + bX, then E(g(X)) and g(E(X)) may be different. For example, let g(X) = X² in the nickels and dimes example above. Then

E(g(X)) = 0²(5/70) + 1²(30/70) + 2²(30/70) + 3²(5/70) = 39/14.

Since 39/14 ≠ (3/2)², E(g(X)) ≠ g(E(X)) in this case.


Table 2.1: Means and variances for four families of discrete distributions.

Distribution               E(X)       Var(X)
Discrete Uniform n         (n + 1)/2  (n² − 1)/12
Hypergeometric n, M, N     nM/N       n(M/N)(1 − M/N)((N − n)/(N − 1))
Bernoulli p                p          p(1 − p)
Binomial n, p              np         np(1 − p)

2.3.3 Mean, Variance and Standard Deviation

Let X be a random variable and let µ = E(X) be its mean. The variance of X, Var(X), is defined as follows:

Var(X) = E((X − µ)²).

The notation σ² = Var(X) is used to denote the variance. The standard deviation of X, σ = SD(X), is the positive square root of the variance.

The mean is a measure of the center (or location) of a distribution. The variance and standard deviation are measures of the scale (or spread) of a distribution. If X is the height of an individual in inches, say, then the values of E(X) and SD(X) are in inches, while the value of Var(X) is in square-inches.

Table 2.1 gives general formulas for the mean and variance of the four families of discrete probability distributions discussed earlier.

Example 2.8 (Drilling for Oil) Assume that the probability of finding oil in a given drill hole is 0.3 and that the results (finding oil or not) are independent from drill hole to drill hole.

An oil company drills one hole at a time. If they find oil, then they stop; otherwise, they continue. However, the company only has enough money to drill five holes. Let X be the number of holes drilled.

The range of X is {1, 2, 3, 4, 5} and its PDF is given in Table 2.2. (The values of f(x) are zero otherwise.) The mean number of holes drilled is

E(X) = Σ x f(x) = 1(0.3) + 2(0.21) + 3(0.147) + 4(0.1029) + 5(0.2401) = 2.7731.

Similarly,

Var(X) = Σ (x − E(X))² f(x) ≈ 2.42182 and SD(X) = √Var(X) ≈ 1.55622.


Table 2.2: Probability distribution for oil drilling example.

f(x) = P(X = x)                           Comments:

f(1) = 0.3                                Success on first try
f(2) = (0.7)(0.3) = 0.21                  Failure followed by success
f(3) = (0.7)²(0.3) = 0.147                Two failures followed by success
f(4) = (0.7)³(0.3) = 0.1029               Three failures followed by success
f(5) = (0.7)⁴(0.3) + (0.7)⁵ = 0.2401      Four failures followed by success,
                                          or failure on all 5 tries
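
A Python sketch reproducing the computations in Example 2.8 from the PDF in Table 2.2:

    pdf = {1: 0.3,
           2: 0.7 * 0.3,
           3: 0.7 ** 2 * 0.3,
           4: 0.7 ** 3 * 0.3,
           5: 0.7 ** 4 * 0.3 + 0.7 ** 5}

    mean = sum(x * p for x, p in pdf.items())
    var = sum((x - mean) ** 2 * p for x, p in pdf.items())
    print(round(mean, 4), round(var, 5), round(var ** 0.5, 5))
    # 2.7731 2.42182 1.55622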

Properties of variance and standard deviation. Properties of sums can be used to prove the following properties of the variance and standard deviation:

1. Var(X) = E(X²) − (E(X))².

2. If Y = a + bX, then Var(Y) = b²Var(X) and SD(Y) = |b|SD(X).

The first property provides a quick by-hand method for computing the variance. For example, the variance of X in the nickels and dimes example is 39/14 − (3/2)² = 15/28 (confirming the result for hypergeometric random variables given in Table 2.1).

2.3.4 Chebyshev’s Inequality

The Chebyshev inequality illustrates the importance of the concepts of mean, variance, and standard deviation.

Theorem 2.9 (Chebyshev's Inequality) Let X be a random variable with finite mean µ and standard deviation σ, and let ε be a positive constant. Then

P(µ − ε < X < µ + ε) ≥ 1 − σ²/ε².

In particular, if ε = kσ for some k, then

P(µ − kσ < X < µ + kσ) ≥ 1 − 1/k².

For example, if ε = 2.5σ, then Chebyshev's inequality states that

• At least 84% of the distribution of X is in the interval |x − µ| < 2.5σ and

• At most 16% of the distribution is in the complementary interval |x − µ| ≥ 2.5σ.


Proof of Chebyshev's inequality: Using complements, it is sufficient to demonstrate that

P(|X − µ| ≥ ε) ≤ σ²/ε².

Since σ² = E((X − µ)²) = Σ_x (x − µ)² f(x), where f(x) is the PDF of X, and the sum of two nonnegative terms is greater than or equal to either term,

σ² = Σ_{|x−µ| ≥ ε} (x − µ)² f(x) + Σ_{|x−µ| < ε} (x − µ)² f(x) ≥ Σ_{|x−µ| ≥ ε} (x − µ)² f(x).

Further, since (x − µ)² ≥ ε² in the sum on the right, the sum on the right is greater than or equal to the sum whose terms are ε² f(x). Thus,

σ² ≥ Σ_{|x−µ| ≥ ε} ε² f(x) = ε² Σ_{|x−µ| ≥ ε} f(x) = ε² P(|X − µ| ≥ ε).

Dividing both sides by ε² gives the result.

Example 2.10 (Discrete Uniform RV) Let X be a discrete uniform random variable with n = 5000. Then

µ = E(X) = (n + 1)/2 = 2500.5 and σ = SD(X) = √((n² − 1)/12) ≈ 1443.38.

The following table compares exact probabilities to lower bounds when ε = kσ:

k     (µ − kσ, µ + kσ)       P(µ − kσ < X < µ + kσ)    1 − 1/k²
1     (1057.12, 3943.88)     2886/5000 = 0.5772        0
1.2   (768.449, 4232.55)     3464/5000 = 0.6928        0.305556
1.5   (335.437, 4665.56)     4330/5000 = 0.8660        0.555556
1.8   (−97.5762, 5098.58)    5000/5000 = 1             0.691358

In each case, the probability is much larger than the Chebyshev lower bound.
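
A brute-force check of each row in Python (exact counting over the 5000 equally likely values):

    n = 5000
    mu = (n + 1) / 2
    sigma = ((n ** 2 - 1) / 12) ** 0.5

    for k in (1, 1.2, 1.5, 1.8):
        inside = sum(1 for x in range(1, n + 1) if abs(x - mu) < k * sigma)
        print(k, inside / n, 1 - 1 / k ** 2)
    # e.g. k = 1 gives 0.5772 versus the bound 0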


3 Joint Discrete Distributions

A probability distribution describing the joint variability of two or more random variables is called a joint distribution.

For example, if X is the height (in feet), Y is the weight (in pounds), and Z is the serum cholesterol level (in mg/dL) of a person chosen from a given population, then we may be interested in describing the joint distribution of the triple (X, Y, Z).

3.1 Bivariate Distributions

A bivariate distribution is the joint distribution of a pair of random variables.

3.1.1 Joint PDF

Assume that X and Y are discrete random variables. The joint frequency function (joint FF) or joint probability density function (joint PDF) of (X, Y) is defined as follows:

f(x, y) = P (X = x, Y = y) for all real pairs (x, y),

where the comma is understood to mean the intersection of the events. The notation fXY(x, y) is sometimes used to emphasize the two random variables.

3.1.2 Joint CDF

Assume that X and Y are discrete random variables. The joint cumulative distribution function (joint CDF) of (X, Y) is defined as follows:

F (x, y) = P (X ≤ x, Y ≤ y) for all real pairs (x, y),

where the comma is understood to mean the intersection of the events. The notation FXY(x, y) is sometimes used to emphasize the two random variables.

3.1.3 Marginal Distributions

Let X and Y be discrete random variables with joint PDF f(x, y). The marginal frequency function (marginal FF) or marginal probability density function (marginal PDF) of X is

fX(x) = Σ_y f(x, y) for all real numbers x,

where the sum is taken over all y in the range of Y. Similarly, the marginal frequency function or marginal PDF of Y is

fY(y) = Σ_x f(x, y) for all real numbers y,

where the sum is taken over all x in the range of X.


3.1.4 Conditional Distributions

Let X and Y be discrete random variables with joint PDF f(x, y).

If fY(y) ≠ 0, then the conditional frequency function (conditional FF) or conditional probability density function (conditional PDF) of X given Y = y is defined as follows:

fX|Y=y(x|y) = f(x, y) / fY(y) for all real numbers x.

Similarly, if fX(x) ≠ 0, then the conditional PDF of Y given X = x is

fY|X=x(y|x) = f(x, y) / fX(x) for all real numbers y.

Note that in the first case, the conditional sample space is the collection of outcomes with Y = y; in the second case, it is the collection of outcomes with X = x.

Example 3.1 (Eyesight Study, continued) Consider again the eyesight study from Example 1.24. Let X be the score for the right eye and Y be the score for the left eye. Then Table 1.2 displays the joint (X, Y) distribution. The probability that both eyes have the same score, for example, is

P (X = Y ) = f(1, 1) + f(2, 2) + f(3, 3) + f(4, 4) = 0.708.

The marginal distribution of X is given in the right column of Table 1.2 and the marginal distribution of Y is given in the bottom row of the table.

Figure 1.3 displays the marginal distribution of Y (in gray), the conditional distribution of Y given X = 1 (left plot, in black), and the conditional distribution of Y given X = 4 (right plot, in black).

Example 3.2 (Urn Experiment) Assume that an urn contains 4 red, 3 white and 1 blue chip. Consider the following 2-step experiment:

Step 1. Thoroughly mix the contents of the urn. Choose a chip and record its color. Return the chip plus two more chips of the same color to the urn.

Step 2. Thoroughly mix the contents of the urn. Choose a chip and record its color.

Let X equal the number of red chips chosen and Y equal the number of white chips chosen. Then Table 3.1 displays the joint (X, Y) distribution. The marginal distribution of X is given in the right column and the marginal distribution of Y is given in the bottom row.

The conditional distribution of Y given X = 1, for example, is

fY|X=1(0|1) = 0.1/0.4 = 1/4, fY|X=1(1|1) = 0.3/0.4 = 3/4,

and fY|X=1(y|1) = 0 otherwise.


Table 3.1: Joint distribution for the urn experiment example.

        y = 0    y = 1    y = 2    total
x = 0   0.0375   0.0750   0.1875   0.3
x = 1   0.1000   0.3000            0.4
x = 2   0.3000                     0.3
total   0.4375   0.3750   0.1875   1.0
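
A Python sketch deriving the marginal of X and the conditional distribution of Y given X = 1 from the joint table (the dictionary literal transcribes Table 3.1):

    joint = {(0, 0): 0.0375, (0, 1): 0.075, (0, 2): 0.1875,
             (1, 0): 0.1,    (1, 1): 0.3,
             (2, 0): 0.3}

    fX = {}
    for (x, y), p in joint.items():
        fX[x] = fX.get(x, 0) + p          # marginal PDF of X

    cond = {y: p / fX[1] for (x, y), p in joint.items() if x == 1}
    print(fX)      # {0: 0.3, 1: 0.4, 2: 0.3} (up to float rounding)
    print(cond)    # {0: 0.25, 1: 0.75}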

3.1.5 Independent Random Variables

The discrete random variables X and Y are said to be independent if

f(x, y) = fX(x)fY (y) for all real pairs (x, y).

Otherwise, X and Y are said to be dependent.

The discrete random variables X and Y are independent if the probability of the intersection is equal to the product of the probabilities

P (X = x, Y = y) = P (X = x)P (Y = y)

for all events of interest (for all x, y).

Example 3.3 (Independent Bernoulli Random Variables) Assume that (X, Y) has the joint distribution given in the following table:

y = 0 y = 1 total

x = 0 0.28 0.42 0.70

x = 1 0.12 0.18 0.30

total 0.40 0.60 1.00

X is a Bernoulli random variable with parameter 0.30 and Y is a Bernoulli random variable with parameter 0.60. Further, since each probability in the body of the 2-by-2 table is the product of the appropriate marginal probabilities, X and Y are independent.

Example 3.4 (Eyesight Study, continued) The random variables X and Y in the eyesight study are dependent. To demonstrate this, we need to show that f(x, y) ≠ fX(x)fY(y) for some pair (x, y). For example,

f(1, 1) = 0.203 ≠ 0.265 × 0.255 = fX(1)fY(1).

Example 3.5 (Urn Experiment, continued) Similarly, the random variables X and Y in the urn experiment are dependent. For example,

f(1, 1) = 0.3 ≠ 0.4 × 0.375 = fX(1)fY(1).


3.1.6 Mathematical Expectation for Finite Bivariate Distributions

Let X and Y be discrete random variables with joint PDF f(x, y) = P(X = x, Y = y), and let g(X, Y) be a real-valued function. The mean of g(X, Y) (or the expected value of g(X, Y) or the expectation of g(X, Y)) is

E(g(X, Y)) = Σ_x Σ_y g(x, y) f(x, y),

where the double sum includes all pairs with nonzero joint PDF.

Example 3.6 (Eyesight Study, continued) Continuing with the eyesight study, let g(X, Y) = |X − Y| be the absolute value of the difference in scores for the two eyes. Then

E(g(X, Y)) = Σ_x Σ_y |x − y| f(x, y) = 0.374, where the sums run over x, y = 1, 2, 3, 4.

(The average absolute difference in scores for the two eyes is small.)

Properties. Properties of sums and the fact that the joint PDF of independent random variables equals the product of the marginal PDFs can be used to show the following:

1. If a and b1, b2, . . . , bn are constants and gi(X, Y) are real-valued functions for each i = 1, 2, . . . , n, then

E(a + b1g1(X, Y) + · · · + bngn(X, Y)) = a + b1E(g1(X, Y)) + · · · + bnE(gn(X, Y)).

2. If X and Y are independent and g(X) and h(Y ) are real-valued functions, then

E(g(X)h(Y )) = E(g(X))E(h(Y )).

The first property generalizes properties from Section 2.3.2. The second property is useful when studying associations between variables.

3.1.7 Covariance and Correlation

Let X and Y be random variables with finite means (µx, µy) and finite standard deviations (σx, σy). The covariance of X and Y, Cov(X, Y), is defined as follows:

Cov(X,Y ) = E((X − µx)(Y − µy)).

The notation σxy = Cov(X, Y) is used to denote the covariance. The correlation of X and Y, Corr(X, Y), is defined as follows:

Corr(X, Y) = Cov(X, Y) / (σxσy) = σxy / (σxσy).

The notation ρ = Corr(X, Y) is used to denote the correlation of X and Y; ρ is called the correlation coefficient.


Positive and negative association. Covariance and correlation are measures of the association between two random variables. Specifically,

1. The random variables X and Y are said to be positively associated if as X increases, Y tends to increase. If X and Y are positively associated, then Cov(X, Y) and Corr(X, Y) will be positive.

2. The random variables X and Y are said to be negatively associated if as X increases, Y tends to decrease. If X and Y are negatively associated, then Cov(X, Y) and Corr(X, Y) will be negative.

For example, the height and weight of individuals in a given population are positively associated. Educational level and indices of poor health are often negatively associated.

Properties of covariance. The following properties of covariance are important:

1. Cov(X,X) = V ar(X).

2. Cov(X,Y ) = Cov(Y,X).

3. Cov(a + bX, c + dY ) = bdCov(X,Y ), where a, b, c, and d are constants.

4. Cov(X,Y ) = E(XY ) − E(X)E(Y ).

5. If X and Y are independent, then Cov(X,Y ) = 0.

The first two properties follow immediately from the definition of covariance. The third property relates the covariance of linearly transformed random variables to the covariance of the original random variables. In particular, if b = d = 1 (values are shifted only), then the covariance is unchanged; if a = c = 0 and b, d > 0 (values are re-scaled), then the covariance is multiplied by bd.

The fourth property gives an alternative method for calculating the covariance; the method is particularly well-suited for by-hand computations. Since E(XY) = E(X)E(Y) for independent random variables, the fourth property can be used to prove that the covariance of independent random variables is zero (property 5).

Properties of correlation. The following properties of correlation are important:

1. −1 ≤ Corr(X,Y ) ≤ 1.

2. If a, b, c, and d are constants, then

Corr(a + bX, c + dY) = Corr(X, Y) when bd > 0, and −Corr(X, Y) when bd < 0.

In particular, Corr(X, a + bX) equals 1 if b > 0 and equals −1 if b < 0.

3. If X and Y are independent, then Corr(X,Y ) = 0.


Correlation is often called standardized covariance since, by the first property, its values always lie in the [−1, 1] interval. If the correlation is close to −1, then there is a strong negative association between the variables; if the correlation is close to 1, then there is a strong positive association between the variables.

The second property relates the correlation of linearly transformed variables to the correlation of the original variables. Note, in particular, that the correlation is unchanged if the random variables are shifted (b = d = 1) or if the random variables are re-scaled (a = c = 0 and b, d > 0). For example, the correlation between the height and weight of individuals in a population is the same no matter which measurement scale is used for height (e.g. inches, feet) and which measurement scale is used for weight (e.g. pounds, kilograms).

If Corr(X, Y) = 0, then X and Y are said to be uncorrelated; otherwise, they are said to be correlated. The third property says that independent random variables are uncorrelated. Note that uncorrelated random variables are not necessarily independent.

Example 3.7 (Eyesight Study, continued) For (X, Y) from the eyesight study,

E(X) = 2.275, E(Y) = 2.305, E(X²) = 6.117, E(Y²) = 6.259, E(XY) = 5.905.

Using properties of variance and covariance:

Var(X) ≈ 0.941, Var(Y) ≈ 0.946, Cov(X, Y) ≈ 0.661.

Finally, ρ = Corr(X,Y ) ≈ 0.701.
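
A sketch of these computations in Python, using property 1 of variance and property 4 of covariance:

    EX, EY = 2.275, 2.305
    EX2, EY2, EXY = 6.117, 6.259, 5.905

    var_x = EX2 - EX ** 2            # Var(X) = E(X²) − (E(X))²
    var_y = EY2 - EY ** 2
    cov = EXY - EX * EY              # Cov(X,Y) = E(XY) − E(X)E(Y)
    rho = cov / (var_x ** 0.5 * var_y ** 0.5)
    print(round(var_x, 3), round(var_y, 3), round(cov, 3), round(rho, 3))
    # 0.941 0.946 0.661 0.701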

Example 3.8 (Uncorrelated but Dependent) Assume that the random pair (X, Y) has the joint distribution given in the following table:

         y = −1   y = 0   y = 1   total
x = −1    0.05    0.10    0.05    0.20
x = 0     0.10    0.40    0.10    0.60
x = 1     0.05    0.10    0.05    0.20
total     0.20    0.60    0.20    1.00

Since E(X) = E(Y ) = E(XY ) = 0, the covariance and correlation must equal 0.

But X and Y are dependent since, for example, f(0, 0) ≠ fX(0)fY(0).

3.1.8 Conditional Expectation and Regression

Let X and Y be discrete random variables with finite ranges and joint PDF f(x, y).

If fX(x) ≠ 0, then the conditional expectation (or the conditional mean) of Y given X = x, E(Y|X = x), is defined as follows:

E(Y|X = x) = Σ_y y fY|X=x(y|x),

where the sum is over all y with nonzero conditional PDF (fY|X=x(y|x) ≠ 0).


Figure 3.1: Joint (X, Y) distribution (left) and plot of (x, E(Y|X = x)) pairs (right) for the conditional expectation example. The joint PDF values shown in the left plot are:

        y = 1   y = 2   y = 3   y = 4   y = 5
x = 0   0.02    0.02    0.02    0.01    0.01
x = 1   0.02    0.02    0.01    0.04    0.04
x = 2   0.02    0.01    0.04    0.04    0.10
x = 3   0.01    0.04    0.04    0.10    0.10
x = 4   0.01    0.04    0.10    0.10    0.04

The conditional expectation of X given Y = y, E(X|Y = y), is defined similarly.

Example 3.9 (Conditional Expectation) Let (X, Y) be the random pair whose joint distribution is shown in the left part of Figure 3.1. Then

E(Y|X = 0) = 1(.02/.08) + 2(.02/.08) + 3(.02/.08) + 4(.01/.08) + 5(.01/.08) = 2.625.

Similarly,

E(Y|X = 1) = 3.462, E(Y|X = 2) = 3.905, E(Y|X = 3) = 3.828, and E(Y|X = 4) = 3.414.

The right part of Figure 3.1 is a plot of pairs (x,E(Y |X = x)) for x = 0, 1, 2, 3, 4.
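
A Python sketch computing each conditional mean directly from the grid of joint probabilities in Figure 3.1:

    grid = [[0.02, 0.02, 0.02, 0.01, 0.01],   # x = 0, columns y = 1..5
            [0.02, 0.02, 0.01, 0.04, 0.04],   # x = 1
            [0.02, 0.01, 0.04, 0.04, 0.10],   # x = 2
            [0.01, 0.04, 0.04, 0.10, 0.10],   # x = 3
            [0.01, 0.04, 0.10, 0.10, 0.04]]   # x = 4

    for x, row in enumerate(grid):
        fx = sum(row)                                       # marginal P(X = x)
        ey = sum((j + 1) * p for j, p in enumerate(row)) / fx
        print(x, round(ey, 3))
    # 0 2.625, 1 3.462, 2 3.905, 3 3.828, 4 3.414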

Regression equation. The formula for the conditional expectation E(Y|X = x) as a function of x is called the regression equation of Y on X. Similarly, the formula for the conditional expectation E(X|Y = y) as a function of y is called the regression equation of X on Y.

Linear conditional means. Let X and Y be random variables with finite means (µx, µy), standard deviations (σx, σy), and correlation (ρ).

If E(Y |X = x) is a linear function of x, then the formula is of the form:

E(Y|X = x) = µy + (ρσy/σx)(x − µx).

Similarly, if E(X|Y = y) is a linear function of y, then the formula is of the form:

E(X|Y = y) = µx + (ρσx/σy)(y − µy).


3.1.9 Example: Bivariate Hypergeometric Distribution

Let n, M1, M2, and M3 be positive integers with n ≤ M1 + M2 + M3. The random pair (X, Y) is said to have a bivariate hypergeometric distribution with parameters n and (M1, M2, M3) if its joint PDF has the following form:

f(x, y) = C(M1, x) C(M2, y) C(M3, n − x − y) / C(M1 + M2 + M3, n)

when x and y are nonnegative integers with

x ≤ min(n, M1), y ≤ min(n, M2) and max(0, n − M3) ≤ x + y ≤ n,

and is equal to zero otherwise. Note that the restrictions on x and y ensure that all combinations are defined and positive.

Example 3.10 (n = 4, M1 = 2, M2 = 3, M3 = 5) Assume that (X, Y) has a bivariate hypergeometric distribution with parameters n = 4, M1 = 2, M2 = 3 and M3 = 5. The joint PDF is

f(x, y) = C(2, x) C(3, y) C(5, 4 − x − y) / C(10, 4) for x = 0, 1, 2; y = 0, 1, . . . , min(3, 4 − x)

and 0 otherwise. The joint (X, Y) distribution and the marginal X and Y distributions are displayed in the following table:

        y = 0    y = 1    y = 2    y = 3    total
x = 0   0.0238   0.1429   0.1429   0.0238   0.3334
x = 1   0.0952   0.2857   0.1429   0.0095   0.5333
x = 2   0.0476   0.0714   0.0143            0.1333
total   0.1666   0.5000   0.3001   0.0333   1.0000

Urn experiments. Bivariate hypergeometric distributions are used to model urn experiments. Specifically, suppose that an urn contains N objects, M1 of type 1, M2 of type 2, and M3 of type 3 (N = M1 + M2 + M3). Let X equal the number of objects of type 1 and Y equal the number of objects of type 2 in a subset of size n chosen from the urn. If each choice of subset is equally likely, then (X, Y) has a bivariate hypergeometric distribution with parameters n and (M1, M2, M3). In the example above, 20% (2/10) of the objects in the urn are of type 1, 30% (3/10) are of type 2 and 50% (5/10) are of type 3.

Negative association. If (X, Y) has a bivariate hypergeometric distribution, then X and Y are negatively associated with correlation

ρ = Corr(X, Y) = −√( (M1/(M2 + M3)) × (M2/(M1 + M3)) ).


Marginal and conditional distributions. If (X, Y) has a bivariate hypergeometric distribution, then

1. X has a hypergeometric distribution with parameters n, M1, and N ; and

2. Y has a hypergeometric distribution with parameters n, M2, and N .

In addition, each conditional distribution is hypergeometric. Finally, bivariate hypergeometric distributions have linear conditional means.

3.1.10 Example: Trinomial Distribution

Let n be a positive integer, and p1, p2, and p3 be positive proportions with sum 1. The random pair (X, Y) is said to have a trinomial distribution with parameters n and (p1, p2, p3) when its joint PDF has the following form:

f(x, y) = (n! / (x! y! (n − x − y)!)) p1ˣ p2ʸ p3ⁿ⁻ˣ⁻ʸ

when x = 0, 1, . . . , n; y = 0, 1, . . . , n; x + y ≤ n, and is equal to zero otherwise.

Example 3.11 (n = 4, p1 = 0.2, p2 = 0.3, p3 = 0.5) Assume that (X, Y) has a trinomial distribution with parameters n = 4, p1 = 0.2, p2 = 0.3 and p3 = 0.5. The joint PDF is

f(x, y) = (4! / (x! y! (4 − x − y)!)) 0.2ˣ 0.3ʸ 0.5⁴⁻ˣ⁻ʸ for x, y = 0, 1, 2, 3, 4; x + y ≤ 4,

and 0 otherwise. The joint (X, Y) distribution and the marginal X and Y distributions are displayed in the following table:

        y = 0    y = 1    y = 2    y = 3    y = 4    total
x = 0   0.0625   0.1500   0.1350   0.0540   0.0081   0.4096
x = 1   0.1000   0.1800   0.1080   0.0216            0.4096
x = 2   0.0600   0.0720   0.0216                     0.1536
x = 3   0.0160   0.0096                              0.0256
x = 4   0.0016                                       0.0016
total   0.2401   0.4116   0.2646   0.0756   0.0081   1.0000

Experiments with three outcomes. Trinomial distributions are used to model experiments with exactly three outcomes. Specifically, suppose that an experiment has three outcomes which occur with probabilities p1, p2, and p3, respectively. Let X be the number of occurrences of outcome 1 and Y be the number of occurrences of outcome 2 in n independent trials of the experiment. Then (X, Y) has a trinomial distribution with parameters n and (p1, p2, p3). In the example above, the probabilities of the three outcomes are 0.2, 0.3 and 0.5, respectively.


Negative association. If (X, Y) has a trinomial distribution, then X and Y are negatively associated with correlation

ρ = Corr(X, Y) = −√( (p1/(1 − p1)) × (p2/(1 − p2)) ).

Marginal and conditional distributions. If (X,Y ) has a trinomial distribution, then

1. X has a binomial distribution with parameters n and p1, and

2. Y has a binomial distribution with parameters n and p2.

In addition, each conditional distribution is binomial. Finally, trinomial distributions have linear conditional means.

3.1.11 Simple Random Samples

The results of Section 2.2.4 can be generalized. In particular, trinomial probabilities can often be used to approximate bivariate hypergeometric probabilities when N is large enough, and each family of distributions can be used in survey analysis.

Example 3.12 (Survey Analysis) For example, suppose that a surveyor is interested in determining the level of support for a proposal to change the local tax structure, and decides to choose a simple random sample of size 10 from the registered voter list. If there are a total of 120 registered voters, where one-third support the proposal, one-half oppose the proposal, and one-sixth have no opinion, then the probability that exactly 3 support, 5 oppose, and 2 have no opinion is

P(X = 3, Y = 5) = C(40, 3) C(60, 5) C(20, 2) / C(120, 10) ≈ 0.088.

If there are thousands of registered voters, then the probability is

\[ P(X = 3, Y = 5) \approx \binom{10}{3,\,5,\,2}\left(\frac{1}{3}\right)^{\!3}\left(\frac{1}{2}\right)^{\!5}\left(\frac{1}{6}\right)^{\!2} \approx 0.081. \]

As before, you do not need to know the exact number of registered voters when you use the trinomial approximation.
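Both computations can be reproduced with exact binomial coefficients, as in the following sketch (ours; the helper name multinomial_coeff is an assumption for illustration):

```python
from math import comb

def multinomial_coeff(n, ks):
    """n!/(k1! k2! ... km!), built as a product of binomial coefficients."""
    out, rest = 1, n
    for k in ks:
        out *= comb(rest, k)
        rest -= k
    return out

# Exact bivariate hypergeometric probability: 120 voters split 40/60/20
exact = comb(40, 3) * comb(60, 5) * comb(20, 2) / comb(120, 10)

# Trinomial approximation: only the proportions 1/3, 1/2, 1/6 are needed
approx = multinomial_coeff(10, [3, 5, 2]) * (1/3)**3 * (1/2)**5 * (1/6)**2

print(round(exact, 3), round(approx, 3))   # 0.088 0.081
```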

3.2 Multivariate Distributions

A multivariate distribution is the joint distribution of k random variables. Ideas studied in the bivariate case (k = 2) can be generalized to the case where k > 2.


3.2.1 Joint PDF and Joint CDF

Let X1, X2, . . ., Xk be discrete random variables and let

X = (X1,X2, . . . ,Xk).

The joint frequency function (joint FF) or joint probability density function (joint PDF) of X is defined as follows:

f(x) = P(X1 = x1, X2 = x2, . . . , Xk = xk)

for all real k-tuples x = (x1, x2, . . . , xk), where commas are understood to mean the intersection of events.

The joint cumulative distribution function (joint CDF) of X is defined as follows:

F(x) = P(X1 ≤ x1, X2 ≤ x2, . . . , Xk ≤ xk)

for all real k-tuples x = (x1, x2, . . . , xk), where commas are understood to mean the intersection of events.

Example 3.13 (k = 3) Assume that (X,Y,Z) has joint PDF

\[ f(x,y,z) = \frac{5}{28(x+y+z)} \quad\text{when } x, y = 0, 1,\; z = 1, 2, 3, 4, \text{ and } 0 \text{ otherwise.} \]

The following table shows the joint (X,Y,Z) distribution. The marginal (X,Y) distribution is displayed in the right column and the marginal Z distribution is displayed in the bottom row. A common denominator is used throughout.

                z = 1     z = 2     z = 3     z = 4
x = 0  y = 0   60/336    30/336    20/336    15/336   125/336
       y = 1   30/336    20/336    15/336    12/336    77/336
x = 1  y = 0   30/336    20/336    15/336    12/336    77/336
       y = 1   20/336    15/336    12/336    10/336    57/336
              140/336    85/336    62/336    49/336         1

X and Y have the same marginal distribution. Namely,

\[ f_X(0) = f_Y(0) = \frac{101}{168} \quad\text{and}\quad f_X(1) = f_Y(1) = \frac{67}{168} \]
(and 0 otherwise).

The probability that X + Y + Z is at most 3, for example, is

\[ P(X + Y + Z \le 3) = \sum_{x=0}^{1}\sum_{y=0}^{1}\sum_{z=1}^{3-x-y} f(x,y,z) = \frac{115}{168} \approx 0.685. \]


3.2.2 Mutually Independent Random Variables and Random Samples

The discrete random variables X1, X2, . . ., Xk are said to be mutually independent (or independent) if

f(x) = f1(x1)f2(x2) · · · fk(xk) for all real k-tuples x = (x1, x2, . . . , xk),

where fi(xi) = P(Xi = xi) for i = 1, 2, . . . , k. (The probability of the intersection is equal to the product of the probabilities for all events of interest.)

Equivalently, X1, X2, . . ., Xk are said to be mutually independent if

F (x) = F1(x1)F2(x2) · · ·Fk(xk) for all real k-tuples x = (x1, x2, . . . , xk),

where Fi(xi) = P (Xi ≤ xi) for i = 1, 2, . . . , k.

X1, X2, . . ., Xk are said to be dependent if they are not mutually independent.

Example 3.14 (k = 3, continued) The random variables X, Y and Z in the previous example are dependent. To see this, we just need to show that f(x, y, z) ≠ fX(x)fY(y)fZ(z) for some triple (x, y, z). For example, f(1, 1, 1) = 20/336 ≈ 0.0595, while fX(1)fY(1)fZ(1) = (67/168)(67/168)(140/336) ≈ 0.0663, so f(1, 1, 1) ≠ fX(1)fY(1)fZ(1).

Random sample. If the discrete random variables X1, X2, . . ., Xk are mutually independent and have a common distribution (each marginal PDF is the same), then the random variables X1, X2, . . ., Xk are said to be a random sample from that distribution.

3.2.3 Mathematical Expectation for Finite Multivariate Distributions

Let X1, X2, . . ., Xk be discrete random variables with joint PDF f(x), and let g(X) be a real-valued k-variable function. The mean (or expected value or expectation) of g(X) is
\[ E(g(X)) = \sum_{x_k}\sum_{x_{k-1}}\cdots\sum_{x_1} g(x_1, x_2, \ldots, x_k)\, f(x_1, x_2, \ldots, x_k), \]
where the multiple sum includes all k-tuples with nonzero joint PDF.

Example 3.15 (Maximum Roll) You roll a fair die three times. Let Xi be the number on the top face after the ith roll and g(X1,X2,X3) be the maximum of the three numbers.

X1, X2, X3 is a random sample from a uniform distribution with parameter 6. The joint PDF of the random triple is
\[ f(x_1, x_2, x_3) = \frac{1}{216} \quad\text{when } x_1, x_2, x_3 = 1, 2, 3, 4, 5, 6, \text{ and } 0 \text{ otherwise,} \]

and the expected value of the maximum of the three rolls reduces to

\[ E(g(X_1,X_2,X_3)) = 1\left(\frac{1}{216}\right) + 2\left(\frac{7}{216}\right) + 3\left(\frac{19}{216}\right) + 4\left(\frac{37}{216}\right) + 5\left(\frac{61}{216}\right) + 6\left(\frac{91}{216}\right) = \frac{119}{24} \approx 4.96. \]


Properties of expectation. Properties in the multivariate case directly generalize properties in the univariate and bivariate cases (Sections 2.3.2 and 3.1.6).

3.2.4 Example: Multivariate Hypergeometric Distribution

Suppose that an urn contains N objects,

M1 of type 1, M2 of type 2, . . ., and Mk of type k (M1 + M2 + · · · + Mk = N).

Let Xi be the number of objects of type i in a subset of size n chosen from the urn. If each choice of subset is equally likely, then the random k-tuple (X1, X2, . . . , Xk) is said to have a multivariate hypergeometric distribution with parameters n and (M1, M2, . . . , Mk).

The joint PDF for the k-tuple is

\[ f(x_1, x_2, \ldots, x_k) = \frac{\binom{M_1}{x_1}\binom{M_2}{x_2}\cdots\binom{M_k}{x_k}}{\binom{N}{n}} \]

when each xi is a nonnegative integer satisfying xi ≤ min(n, Mi) and \(\sum_i x_i = n\) (and zero otherwise).

Note that

1. If X is a hypergeometric random variable with parameters n, M , N , then

(X,n − X)

has a multivariate hypergeometric distribution with parameters n and (M,N − M).

2. If (X,Y ) is bivariate hypergeometric with parameters n and (M1,M2,M3), then

(X,Y, n − X − Y )

is multivariate hypergeometric with parameters n and (M1,M2,M3).

3. If (X1,X2, . . . ,Xk) has a multivariate hypergeometric distribution, then

(a) For each i, Xi is a hypergeometric random variable with parameters n, Mi, N .

(b) For each i ≠ j, (Xi, Xj) is bivariate hypergeometric with parameters n and (Mi, Mj, N − Mi − Mj). In particular, Xi and Xj are negatively associated.

Example 3.16 (Physical Activity) Suppose there are 600 women and 400 men living in an adult community. Among the women, 180 are physically active and 420 are not. Among the men, 120 are physically active and 280 are not.

Let X1 be the number of women who are physically active, X2 be the number of women who are not physically active, X3 be the number of men who are physically active and X4 be the number of men who are not physically active in a simple random sample of size 15 chosen from those living in the community. Then, for example,

\[ P(X_1 = 3, X_2 = 7, X_3 = 1, X_4 = 4) = \frac{\binom{180}{3}\binom{420}{7}\binom{120}{1}\binom{280}{4}}{\binom{1000}{15}} \approx 0.0182. \]


3.2.5 Example: Multinomial Distribution

Suppose that an experiment has exactly k outcomes, and let

pi be the probability of the ith outcome, i = 1, 2, . . . , k (p1 + p2 + · · · + pk = 1).

Let Xi be the number of occurrences of the ith outcome in n independent trials of the experiment. Then the random k-tuple (X1, X2, . . . , Xk) is said to have a multinomial distribution with parameters n and (p1, p2, . . . , pk).

The joint PDF for the k-tuple is

\[ f(x_1, x_2, \ldots, x_k) = \binom{n}{x_1, x_2, \ldots, x_k}\, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k} \]

when x1, x2, . . . , xk = 0, 1, . . . , n and \(\sum_i x_i = n\) (and zero otherwise).

Note that

1. If X is a binomial random variable with parameters n and p, then

(X,n − X)

has a multinomial distribution with parameters n and (p, 1 − p).

2. If (X,Y ) has a trinomial distribution with parameters n and (p1, p2, p3), then

(X,Y, n − X − Y )

has a multinomial distribution with parameters n and (p1, p2, p3).

3. If (X1,X2, . . . ,Xk) has a multinomial distribution, then

(a) For each i, Xi is a binomial random variable with parameters n and pi.

(b) For each i ≠ j, (Xi, Xj) has a trinomial distribution with parameters n and (pi, pj, 1 − pi − pj). In particular, Xi and Xj are negatively associated.

Example 3.17 (Independence Model) Assume that A and B are independent events of interest to researchers. Let outcomes 1, 2, 3, 4 correspond to

“both A and B occur,” “only A occurs,” “only B occurs,” “neither A nor B occur,”

respectively. If P (A) = α and P (B) = β, then

(p1, p2, p3, p4) = (αβ, α(1 − β), (1 − α)β, (1 − α)(1 − β))

can be used as the list of probabilities in a multinomial model.

If P(A) = 0.6, P(B) = 0.3, and Xi is the number of occurrences of the ith outcome in 15 independent trials (i = 1, 2, 3, 4), then, for example,
\[ P(X_1 = 3, X_2 = 7, X_3 = 1, X_4 = 4) = \binom{15}{3,\,7,\,1,\,4}(0.18)^3(0.42)^7(0.12)^1(0.28)^4 \approx 0.0179. \]


3.2.6 Simple Random Samples

The results of Sections 2.2.4 and 3.1.11 can be directly generalized to the multivariate case. In particular, multinomial probabilities can be used to approximate multivariate hypergeometric probabilities when N is large enough, and each family of distributions can be used in survey analysis.

3.2.7 Sample Summaries

If X1, X2, . . ., Xn is a random sample from a distribution with mean µ and standard deviation σ, then the sample mean, X̄, is the random variable
\[ \bar{X} = \frac{1}{n}\left(X_1 + X_2 + \cdots + X_n\right), \]
the sample variance, S², is the random variable
\[ S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2, \]
and the sample standard deviation, S, is the positive square root of the sample variance.

The following theorem can be proven using properties of expectation:

Theorem 3.18 (Sample Summaries) If X̄ is the sample mean and S² is the sample variance of a random sample of size n from a distribution with mean µ and standard deviation σ, then

1. E(X̄) = µ and Var(X̄) = σ²/n.

2. E(S²) = σ².

Demonstration 1. To demonstrate that E(X̄) = E(X) = µ,
\[ E(\bar{X}) = E\left(\tfrac{1}{n}(X_1 + X_2 + \cdots + X_n)\right) = \tfrac{1}{n}\left(E(X_1) + E(X_2) + \cdots + E(X_n)\right) = \tfrac{1}{n}\left(n\,E(X)\right) = \tfrac{1}{n}\,n\mu = \mu. \]

Demonstration 2. To demonstrate that Var(X̄) = Var(X)/n = σ²/n, first note that
\[ \mathrm{Var}(\bar{X}) = E\left((\bar{X} - \mu)^2\right) = E\left(\left(\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) - \mu\right)^{\!2}\right) = E\left(\left(\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)\right)^{\!2}\right). \]


Now,

\[ \mathrm{Var}(\bar{X}) = \frac{1}{n^2}\,E\left(\left(\sum_{i=1}^{n}(X_i - \mu)\right)^{\!2}\right) = \frac{1}{n^2}\left(\sum_{i=1}^{n} E\left((X_i - \mu)^2\right) + \sum_{i\ne j} E\left((X_i - \mu)(X_j - \mu)\right)\right). \]

Since the Xi’s are independent, the last sum is zero. Finally,

\[ \mathrm{Var}(\bar{X}) = \frac{1}{n^2}\left(n\sigma^2 + 0\right) = \frac{\sigma^2}{n}. \]

Demonstration 3. To demonstrate that E(S²) = Var(X) = σ², first note that
\begin{align*}
S^2 &= \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 \\
&= \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i^2 - 2\bar{X}X_i + \bar{X}^2\right) \\
&= \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - 2\bar{X}\,n\bar{X} + n\bar{X}^2\right) \quad\text{since } \textstyle\sum_{i=1}^{n} X_i = n\bar{X} \\
&= \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right).
\end{align*}

Observe that

\[ E(X_i^2) = \mathrm{Var}(X_i) + (E(X_i))^2 = \sigma^2 + \mu^2 \quad\text{and}\quad E(\bar{X}^2) = \mathrm{Var}(\bar{X}) + (E(\bar{X}))^2 = \frac{\sigma^2}{n} + \mu^2. \]

Finally,

\begin{align*}
E(S^2) &= \frac{1}{n-1}\left(\sum_{i=1}^{n} E(X_i^2) - n\,E(\bar{X}^2)\right) \\
&= \frac{1}{n-1}\left(n(\sigma^2 + \mu^2) - n(\sigma^2/n + \mu^2)\right) \\
&= \frac{1}{n-1}\left(n\sigma^2 + n\mu^2 - \sigma^2 - n\mu^2\right) \\
&= \frac{1}{n-1}\left((n-1)\sigma^2\right) = \sigma^2.
\end{align*}

Sample correlation. A random sample of size n from the joint (X,Y) distribution is a list of n mutually independent random pairs, each with the same distribution as (X,Y).

If (X1, Y1), (X2, Y2), . . . , (Xn, Yn) is a random sample of size n from a bivariate distribution with correlation ρ = Corr(X,Y), then the sample correlation, R, is the random variable
\[ R = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2\, \sum_{i=1}^{n}(Y_i - \bar{Y})^2}}\,, \]
where X̄ and Ȳ are the sample means of the X and Y samples, respectively.


Applications. In statistical applications, observed values of the sample mean, sample variance, and sample correlation are used to estimate unknown values of the parameters µ, σ² and ρ, respectively.

Example 3.19 (n = 600) Assume that the following table summarizes the values of a random sample of size 600 from the joint (X,Y) distribution:

        y=1   y=2   y=3   y=4   y=5   y=6   y=7   y=8   y=9   y=10   total
x = 0           2     1     5    12    15    15     7     2      1      60
x = 1           2     9    25    36    42    31    13                  158
x = 2     3     4    15    39    55    41    21     5                  183
x = 3     2     7    20    27    39    14     1                        110
x = 4     1    13    11    16    18     4                               63
x = 5     1     6     5     6     2                                     20
x = 6     1     2     3                                                  6
total     8    36    64   118   162   116    68    25     2      1     600

Two pairs of the form (0, 2) were observed, one pair of the form (0, 3) was observed, five pairs of the form (0, 4) were observed, and so forth.

The sample mean of the x-coordinates is

\[ \bar{x} = \frac{1}{600}\sum_{i=1}^{600} x_i = \frac{1}{600}\left(0(60) + 1(158) + \cdots + 6(6)\right) \approx 2.07 \]

and the sample variance is

\[ s_x^2 = \frac{1}{599}\sum_{i=1}^{600}(x_i - \bar{x})^2 = \frac{1}{599}\left((0 - \bar{x})^2(60) + (1 - \bar{x})^2(158) + \cdots + (6 - \bar{x})^2(6)\right) \approx 1.72464. \]

Similarly,
\[ \bar{y} \approx 4.92333 \quad\text{and}\quad s_y^2 \approx 2.49161. \]

The observed sample correlation is

\[ r = \frac{-634.78}{\sqrt{(1033.06)(1492.47)}} \approx -0.511219. \]

The results suggest that X and Y are negatively associated.
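All of the summaries above can be reproduced from the count table. The following sketch is ours (the counts dictionary transcribes the table, with 0 for empty cells):

```python
from math import sqrt

# Counts from Example 3.19: rows x = 0..6, columns y = 1..10
counts = {
    0: [0, 2, 1, 5, 12, 15, 15, 7, 2, 1],
    1: [0, 2, 9, 25, 36, 42, 31, 13, 0, 0],
    2: [3, 4, 15, 39, 55, 41, 21, 5, 0, 0],
    3: [2, 7, 20, 27, 39, 14, 1, 0, 0, 0],
    4: [1, 13, 11, 16, 18, 4, 0, 0, 0, 0],
    5: [1, 6, 5, 6, 2, 0, 0, 0, 0, 0],
    6: [1, 2, 3, 0, 0, 0, 0, 0, 0, 0],
}
pairs = [(x, y) for x, row in counts.items()
         for y, c in enumerate(row, start=1) for _ in range(c)]
n = len(pairs)                                        # 600
xbar = sum(x for x, _ in pairs) / n                   # 2.07
ybar = sum(y for _, y in pairs) / n                   # about 4.92333
sxx = sum((x - xbar)**2 for x, _ in pairs)            # about 1033.06
syy = sum((y - ybar)**2 for _, y in pairs)            # about 1492.47
sxy = sum((x - xbar) * (y - ybar) for x, y in pairs)  # about -634.78
print(sxx / (n - 1), syy / (n - 1))                   # sample variances
print(sxy / sqrt(sxx * syy))                          # about -0.5112
```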


4 Introductory Calculus Concepts

In introductory calculus, we study real-valued functions whose domain (set of input values) is a subset of the real numbers. We often use the shorthand

f : D ⊆ R −→ R,

where D is the domain and R is the set of real numbers, to denote such a function.

This chapter covers limits, derivatives, and integrals, together with their applications in statistics.

4.1 Functions

This section reviews functions commonly used in calculus.

Power functions. A power function is a function with rule

f(x) = cx^p, where the coefficient c and power p are constants.

(The values of f are proportional to a power of the input x.) The domain of f depends on the power p. For example, if p = 2, the domain of f is the set of all real numbers; if p = 1/2, the domain of f is the set of nonnegative real numbers.

Polynomial and rational functions. A polynomial function is a function with rule

\[ f(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0 \]

where the coefficients a_i are constants, n is a positive integer and x is a real number. If a_n ≠ 0, then n is the degree of the polynomial.

Special cases of polynomials include: If the degree of f is 0, then f is a constant function; if the degree is 1, then f is a linear function; if the degree is 2, then f is a quadratic function; if the degree is 3, then f is a cubic function.

A rational function is the ratio of two polynomials:

\[ f(x) = \frac{p(x)}{q(x)}, \quad\text{where } p(x) \text{ and } q(x) \text{ are polynomials.} \]

The domain of f is the set of real numbers satisfying q(x) ≠ 0.

Exponential functions. An exponential function is a function with rule

f(x) = a^x, where a > 0 is a positive constant and x is a real number.

In this formula, a is the base and x is the exponent. The exponential function is strictly increasing when a > 1 and strictly decreasing when 0 < a < 1.


The exponential function with base e,

f(x) = e^x, where e ≈ 2.71828 is Euler's constant,

is used most often in calculus. (e is the natural base for calculus.)

Exponential functions satisfy the following laws of exponents:

\[ (1)\; a^{x+y} = a^x a^y, \qquad (2)\; a^{-x} = \frac{1}{a^x}, \qquad (3)\; (a^x)^y = a^{xy} \]

where a > 0 is a positive number, and x and y are any real numbers.

Logarithmic functions. A logarithmic function is a function with rule

f(x) = log_a(x), where a is a positive constant and x is a positive real number.

As above, a is the base of the function. When a = e,

f(x) = log_e(x) = ln(x) is called the natural logarithm.

Exponential and logarithmic functions with the same base are related as follows:

y = log_a(x) ⟺ a^y = x

where a and x are positive numbers and y is any real number. Thus,

log_a(a^x) = x for all x ∈ R and a^{log_a(x)} = x for all x > 0.

Logarithmic functions satisfy the following laws of logarithms:

(1) log_a(xy) = log_a(x) + log_a(y),

(2) log_a(x/y) = log_a(x) − log_a(y), and

(3) log_a(x^r) = r log_a(x),

where a, x and y are positive numbers and r is any real number.

Using the natural base. Since a = e^{ln(a)} for any a > 0, we can write
\[ a^x = \left(e^{\ln(a)}\right)^x = e^{\ln(a)\,x}, \quad\text{where } x \text{ is a real number.} \]
Similarly, suppose that y = log_a(x) (equivalently, x = a^y). Then,
\[ \ln(x) = \ln(a^y) = y\ln(a) = \log_a(x)\ln(a) \;\Longrightarrow\; \log_a(x) = \frac{\ln(x)}{\ln(a)}. \]
Thus, exponential and logarithmic functions in base a can always be written in terms of the corresponding functions in the natural base.
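A two-line numerical check of the base-change identities (our sketch; the sample values a = 2 and x = 5.3 are arbitrary):

```python
import math

# a^x = e^(x ln a)  and  log_a(x) = ln(x) / ln(a)
a, x = 2.0, 5.3
print(a**x, math.exp(x * math.log(a)))             # equal up to rounding
print(math.log(x, a), math.log(x) / math.log(a))   # equal up to rounding
```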



Figure 4.1: A positive angle t (left plot) and a negative angle t (right plot). In each case, the point P(x, y) lies on the unit circle centered at the origin, x = cos(t) and y = sin(t).

Trigonometric functions. The trigonometric functions are defined using a unit circle centered at the origin and the radian measure of an angle, as illustrated in Figure 4.1.

In the left part of the figure, t is the length of the arc of the unit circle measured counterclockwise from (1, 0) to P(x, y); in the right part of the figure, t is the negative of the length of the arc measured clockwise from (1, 0) to P(x, y).

The sine and cosine functions have the following rules

sin(t) = y and cos(t) = x

where t is any real number, and x and y are described above. Since the circumference of a circle of radius one is 2π, we know that

sin(t + k(2π)) = sin(t) and cos(t + k(2π)) = cos(t)

for every integer k and real number t.

The tangent, cotangent, secant and cosecant functions are defined in terms of sine and cosine. Specifically,
\[ \tan(t) = \frac{\sin(t)}{\cos(t)}, \quad \cot(t) = \frac{\cos(t)}{\sin(t)}, \quad \sec(t) = \frac{1}{\cos(t)} \quad\text{and}\quad \csc(t) = \frac{1}{\sin(t)}. \]

In each case, the domain of the function is all t for which the denominator is not 0.

Inverse trigonometric functions. Each trigonometric function has an inverse function:

1. Inverse sine function: For −1 ≤ z ≤ 1,
\[ \sin^{-1}(z) = t \iff \sin(t) = z \text{ and } -\frac{\pi}{2} \le t \le \frac{\pi}{2}. \]

2. Inverse cosine function: For −1 ≤ z ≤ 1,
\[ \cos^{-1}(z) = t \iff \cos(t) = z \text{ and } 0 \le t \le \pi. \]

3. Inverse tangent function: For any real number z,
\[ \tan^{-1}(z) = t \iff \tan(t) = z \text{ and } -\frac{\pi}{2} < t < \frac{\pi}{2}. \]

4. Inverse cotangent function: For any real number z,
\[ \cot^{-1}(z) = t \iff \cot(t) = z \text{ and } 0 < t < \pi. \]

5. Inverse secant function: For |z| ≥ 1,
\[ \sec^{-1}(z) = t \iff \sec(t) = z \text{ and } \left(0 \le t < \frac{\pi}{2} \text{ or } \pi \le t < \frac{3\pi}{2}\right). \]

6. Inverse cosecant function: For |z| ≥ 1,
\[ \csc^{-1}(z) = t \iff \csc(t) = z \text{ and } \left(0 < t \le \frac{\pi}{2} \text{ or } \pi < t \le \frac{3\pi}{2}\right). \]

Radian and degree measures. Angles can be measured in degrees or in radians. Radian measure is preferred in calculus. The following table gives some common conversions:

Degrees   0    30    45    60    90    120    135    150    180    270    360
Radians   0   π/6   π/4   π/3   π/2   2π/3   3π/4   5π/6    π    3π/2    2π

Note that 2π radians (or 360°) corresponds to a complete revolution of the unit circle.

Some values of sine and cosine. The following table gives some often used values of the sine and cosine functions:

t        0    π/6    π/4    π/3   π/2   2π/3    3π/4    5π/6    π
sin(t)   0    1/2   √2/2   √3/2    1    √3/2    √2/2    1/2     0
cos(t)   1   √3/2   √2/2    1/2    0   −1/2   −√2/2   −√3/2   −1

4.2 Limits and Continuity

This section introduces the concepts of limit and continuity, which are central to calculus.

4.2.1 Neighborhoods, Deleted Neighborhoods and Limits

A neighborhood of the real number a is the set of real numbers x satisfying

|x − a| < r for some positive number r.

A deleted neighborhood of a is the neighborhood with a removed; that is, the set of real numbers x satisfying 0 < |x − a| < r for some positive r.



Figure 4.2: Plots of y = (x − 1)/(x² − 1) (left) and y = |x − 1|/(x² − 1) (right).

Suppose that f(x) is defined for all x in a deleted neighborhood of a. We write

\[ \lim_{x\to a} f(x) = L \]

and say “the limit of f(x), as x approaches a, equals L” if for every positive number ε there exists a positive number δ such that

if x satisfies 0 < |x − a| < δ, then f(x) satisfies |f(x) − L| < ε.

(The deleted δ-neighborhood of a is mapped into the ε-neighborhood of L.)

Example 4.1 (Figure 4.2) Let f(x) = (x − 1)/(x² − 1) and a = 1 (left part of Figure 4.2). Although f(1) is undefined, the limit as x approaches 1 exists. Specifically,
\[ \lim_{x\to 1} f(x) = \lim_{x\to 1} \frac{x-1}{x^2-1} = \lim_{x\to 1} \frac{x-1}{(x-1)(x+1)} = \lim_{x\to 1} \frac{1}{x+1} = \frac{1}{2}. \]

By contrast, if f(x) = |x − 1|/(x² − 1) and a = 1 (right part of Figure 4.2), then a limit does not exist, since function values approach either −1/2 or +1/2 depending on whether x < 1 or x > 1.

Example 4.2 (Figure 4.3) Let f(x) = sin(π/x) and a = 0 (left part of Figure 4.3). f(x) is undefined at 0. Further, as x approaches 0, values of f(x) fluctuate between −1 and +1, and no limit exists.

By contrast, if f(x) = x sin(π/x) and a = 0 (right part of Figure 4.3), then a limit does exist. Specifically, function values approach 0 as x approaches 0, as the numerical sketch below illustrates.
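A quick numerical look at the two limits (our code; the sample points are arbitrary values shrinking toward 0):

```python
import math

# sin(pi/x) keeps oscillating as x -> 0, while x*sin(pi/x) is squeezed
# toward 0 because |x sin(pi/x)| <= |x|
for x in [0.3, 0.03, 0.003, 0.0003]:
    print(x, math.sin(math.pi / x), x * math.sin(math.pi / x))
```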

One-sided limits. We write
\[ \lim_{x\to a^+} f(x) = L \]
and say “the limit of f(x), as x approaches a from above, equals L” if for every positive number ε there exists a positive number δ such that

if x satisfies 0 < x − a < δ, then f(x) satisfies |f(x) − L| < ε.



Figure 4.3: Plots of y = sin(π/x) (left) and y = x sin(π/x) (right).

Similarly, we write
\[ \lim_{x\to a^-} f(x) = L \]
and say “the limit of f(x), as x approaches a from below, equals L” if for every positive number ε there exists a positive number δ such that

if x satisfies 0 < a − x < δ, then f(x) satisfies |f(x) − L| < ε.

In the first case, we approach a from the right only; in the second case, from the left only.

For example, if f(x) = |x − 1|/(x² − 1) and a = 1 (right part of Figure 4.2), then
\[ \lim_{x\to 1^-} f(x) = -\frac{1}{2} \quad\text{and}\quad \lim_{x\to 1^+} f(x) = \frac{1}{2}. \]

Infinite limits. We write
\[ \lim_{x\to\infty} f(x) = L \]
and say “the limit of f(x), as x approaches ∞, equals L” if for every positive number ε there exists a number M such that

if x satisfies x > M, then f(x) satisfies |f(x) − L| < ε.

Similarly, we write
\[ \lim_{x\to-\infty} f(x) = L \]
and say “the limit of f(x), as x approaches −∞, equals L” if for every positive number ε there exists a number M such that

if x satisfies x < M, then f(x) satisfies |f(x) − L| < ε.

Example 4.3 (Exponential Function) Let f(x) = e^x. Then
\[ \lim_{x\to-\infty} f(x) = 0, \]
but no finite limit exists as x → ∞. We often write lim_{x→∞} f(x) = ∞ when function values grow large without bound.


Application in statistics. Let F(x) = P(X ≤ x) be the cumulative distribution function (CDF) of a discrete random variable (Section 2.1.1) and let a be an element of the range of the random variable. Then both one-sided limits exist, with
\[ \lim_{x\to a^+} F(x) = F(a) \quad\text{and}\quad \lim_{x\to a^-} F(x) \ne F(a). \]
Further,
\[ \lim_{x\to-\infty} F(x) = 0 \quad\text{and}\quad \lim_{x\to\infty} F(x) = 1. \]

Properties of limits. Suppose that

\[ \lim_{x\to a} f(x) = L, \qquad \lim_{x\to a} g(x) = M, \]

and let c be a constant. Then

1. Constant rule: limx→a c = c.

2. Rule for x : limx→a x = a.

3. Constant multiple rule: limx→a(cf(x)) = cL.

4. Sum rule: limx→a(f(x) + g(x)) = L + M .

5. Difference rule: limx→a(f(x) − g(x)) = L − M .

6. Product rule: limx→a(f(x)g(x)) = LM .

7. Quotient rule: limx→a(f(x)/g(x)) = L/M when M 6= 0.

8. Power rule: limx→a(f(x))n = Ln when n is a positive integer.

9. Power rule for x : limx→a xn = an when n is a positive integer.

10. Rule for roots: \(\lim_{x\to a} \sqrt[n]{f(x)} = \sqrt[n]{L}\) when n is an odd positive integer, or when n is an even positive integer and L is greater than 0.

11. Rule for ordering : If f(x) ≤ g(x) for all x near a (except possibly at a), then L ≤ M .

Squeezing principle. If ℓ(x) ≤ f(x) ≤ u(x) for all x near a (except possibly at a) and

\[ \lim_{x\to a} \ell(x) = \lim_{x\to a} u(x) = L, \]

then f(x) approaches L as well: limx→a f(x) = L.

For example, let f(x) = x sin(π/x) and a = 0 (right part of Figure 4.3). Since

−|x| ≤ x sin(π/x) ≤ |x|

and the lower and upper functions each approach 0 as x approaches 0, we know that f(x) approaches 0 as well.


Big O and little o notation. If

|f(x)| ≤ Kg(x) for some positive K and all x near a (except possibly at a),

we say that f(x) = O(g(x)) for x near a. If

\[ \lim_{x\to a} \frac{f(x)}{g(x)} = 0, \]

we say that f(x) = o(g(x)) for x near a.

4.2.2 Continuous Functions

The function f(x) is said to be continuous at a if

\[ \lim_{x\to a} f(x) = f(a). \]

The function is said to be discontinuous at a otherwise. Note that f(x) is discontinuous at a when f(a) is undefined, or when the limit does not exist, or when the limit and function values exist but are not equal.

Polynomial functions, rational functions (ratios of polynomials), exponential functions, logarithmic functions, and trigonometric functions are continuous throughout their domains. Step functions are discontinuous at each “step”.

Note that the properties of limits given above imply that sums, differences, products, quotients, powers, roots, and constant multiples of continuous functions are continuous whenever the appropriate operation is defined.

Example 4.4 (Split Definition Functions) Let

\[ f(x) = \begin{cases} -4x + 10 & x < 2 \\ -0.25x + 2.5 & x \ge 2 \end{cases} \qquad\text{and}\qquad g(x) = \begin{cases} 0 & x < 0 \\ 0.25 & 0 \le x < 1 \\ 0.75 & 1 \le x < 2 \\ 1 & x \ge 2 \end{cases} \]

The function f(x) (left part of Figure 4.4) is continuous everywhere. In particular,

\[ \lim_{x\to 2^-} f(x) = \lim_{x\to 2^-} (-4x + 10) = 2 \quad\text{and}\quad \lim_{x\to 2^+} f(x) = \lim_{x\to 2^+} (-0.25x + 2.5) = 2. \]

The step function g(x) (right part of Figure 4.4) is discontinuous at 0, 1 and 2. Note that g(x) is the cumulative distribution function of a binomial random variable with parameters n = 2 and p = 1/2.

Compositions. If g(x) is continuous at a and f(x) is continuous at g(a), then the composite function f(g(x)) is continuous at a. In other words,
\[ \lim_{x\to a} f(g(x)) = f\left(\lim_{x\to a} g(x)\right) = f(g(a)). \]

For example, limx→5 ln(6 − x) = ln(1) = 0, where ln() is the natural logarithm function.



Figure 4.4: Plots of split definition functions.

4.3 Differentiation

Suppose that f(x) is defined for all x near a. The average rate of change of the function f over the interval [a, x] if x > a (or [x, a] if x < a) is the ratio
\[ \frac{f(x) - f(a)}{x - a}. \]
The average rate of change corresponds to the slope of the secant line through the points (a, f(a)) and (x, f(x)).

Interest focuses on whether we can define the instantaneous rate of change at a or, equivalently, the slope of the curve y = f(x) at the point (a, f(a)).

4.3.1 Derivative and Tangent Line

The derivative of the function f at the number a, denoted by f ′(a) (“f prime of a”), is

\[ f'(a) = \lim_{x\to a} \frac{f(x) - f(a)}{x - a} \quad\text{if the limit exists.} \]

The function f(x) is said to be differentiable at a if the limit above exists. Otherwise, f(x) is not differentiable at a.

If f(x) represents your position at time x, then (f(x) − f(a))/(x − a) is your average velocity over the interval [a, x] (or [x, a]) and f′(a) is your instantaneous velocity at time a.

Tangent line. If f(x) is differentiable at a, then the line with equation

y = f(a) + f ′(a)(x − a)

is tangent to the curve y = f(x) at the point (a, f(a)), and f′(a) is said to be the slope of the curve at this point.



Figure 4.5: Plots of y = x/(1 + 2x) (left part) and y = 7 − 3|x − 2| (right part).

Example 4.5 (Derivative Exists) Consider the function f(x) = x/(1 + 2x) and let a = 1 (left part of Figure 4.5). The derivative is computed as follows:
\[ f'(1) = \lim_{x\to 1} \frac{\frac{x}{1+2x} - \frac{1}{3}}{x - 1} = \lim_{x\to 1} \frac{1}{3 + 6x} = \frac{1}{9}. \]

The equation of the tangent line is y = (1/3) + (1/9)(x − 1).
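The limit can be seen numerically by evaluating difference quotients at shrinking step sizes (our sketch):

```python
# Difference quotients for f(x) = x / (1 + 2x) at a = 1 approach f'(1) = 1/9
def f(x):
    return x / (1 + 2 * x)

a = 1.0
for h in [0.1, 0.01, 0.001, 0.0001]:
    print(h, (f(a + h) - f(a)) / h)   # tends to 0.1111... = 1/9
```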

Example 4.6 (Derivative Does Not Exist) Consider the function f(x) = 7 − 3|x − 2| and let a = 2 (right part of Figure 4.5). Then

1. If x > 2, then f(x) = 7 − 3(x − 2) and \(\lim_{x\to 2^+} \frac{f(x) - f(2)}{x - 2} = -3\).

2. If x < 2, then f(x) = 7 − 3(2 − x) and \(\lim_{x\to 2^-} \frac{f(x) - f(2)}{x - 2} = 3\).

Since the limits from above and below are not equal, the derivative does not exist.

4.3.2 Approximations and Local Linearity

If f(x) is differentiable at a, then f(x) is locally linear near a. That is, values on the tangent line can be used to approximate values on the curve itself.

Example 4.7 (Derivative Exists, continued) For example, if f(x) = x/(1 + 2x), a = 1, and ℓ(x) = (1/3) + (1/9)(x − 1) (as calculated above), then the following table confirms that values of the linear function ℓ(x) are close to values of f(x) when x is near 1:

x 0.75 0.80 0.85 0.9 0.95 1.00 1.05 1.10 1.15 1.20 1.25

f(x) 0.300 0.308 0.315 0.321 0.328 0.333 0.339 0.344 0.348 0.353 0.357

ℓ(x) 0.306 0.311 0.317 0.322 0.328 0.333 0.339 0.344 0.350 0.356 0.361



Figure 4.6: Plots of y = 2 + 9x − 6x² + x³ with different points highlighted.

4.3.3 First and Second Derivative Functions

Since the derivative can take different values at different points, we can think of it as a function in its own right. For any function f, we define the derivative function, f′, by
\[ f'(x) = \lim_{h\to 0} \frac{f(x + h) - f(x)}{h}. \]
If the limit exists at a given x, we say that f is differentiable at x. If the derivative exists for all x in the domain of the function, we say that f is differentiable everywhere.

Alternative notations. Starting with y = f(x) (where x represents the independent variable and y represents the dependent variable), there are several different ways to write the derivative function:
\[ f'(x) = \frac{dy}{dx} = \frac{d}{dx}(f(x)) = \frac{df}{dx} = Df(x) = y'. \]
The “prime” notation (f′(x), y′) is due to Newton. The d/dx (“d-dx”) form is due to Leibniz.

Second derivative function. If f is a differentiable function, then its derivative f′ is also a function, so f′ may have a derivative of its own. The second derivative of f, denoted by f′′ (“f double prime”), is defined as follows:
\[ f''(x) = \lim_{h\to 0} \frac{f'(x + h) - f'(x)}{h}. \]
Starting with y = f(x), alternative notations are
\[ f''(x) = \frac{d^2y}{dx^2} = \frac{d^2}{dx^2}(f(x)) = \frac{d^2f}{dx^2} = D^2 f(x) = y''. \]
The second derivative is the rate of change of the rate of change. If f′′(a) > 0, then the curve y = f(x) is concave up at (a, f(a)). If f′′(a) < 0, then the curve is concave down at (a, f(a)).


Continuity and differentiability. Differentiable functions are continuous, but continuous functions may not be differentiable. For example, the absolute value function f(x) = 7 − 3|x − 2| (see Figure 4.5) is continuous everywhere and is differentiable for all values of x other than 2.

4.3.4 Some Rules for Finding Derivative Functions

Assume that f and g are differentiable on a common domain D, and let a, b, c, and m be constants. Then

1. Constant functions: \(\frac{d}{dx}(c) = 0\).

2. Linear functions: \(\frac{d}{dx}(mx + b) = m\).

3. Constant multiples: \(\frac{d}{dx}(cf(x)) = cf'(x)\).

4. Linear combinations: \(\frac{d}{dx}(af(x) + bg(x)) = af'(x) + bg'(x)\).

5. Products of functions: \(\frac{d}{dx}(f(x)g(x)) = f'(x)g(x) + f(x)g'(x)\).

6. Quotients of functions: \(\frac{d}{dx}\left(\frac{f(x)}{g(x)}\right) = \frac{g(x)f'(x) - f(x)g'(x)}{(g(x))^2}\) when g(x) ≠ 0.

7. Powers of x: \(\frac{d}{dx}(x^n) = nx^{n-1}\) for each constant n.

Polynomial functions. Recall that a polynomial function, f(x), has the form

\[ f(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_2 x^2 + a_1 x + a_0, \]

where a_0, a_1, . . . , a_n are constants, and n is a nonnegative integer. The rules given above imply that
\[ f'(x) = n a_n x^{n-1} + (n-1) a_{n-1} x^{n-2} + \cdots + 2 a_2 x + a_1. \]

Example 4.8 (Cubic Function) Let f(x) = 2 + 9x − 6x² + x³. Then
\[ f'(x) = 9 - 12x + 3x^2 \quad\text{and}\quad f''(x) = -12 + 6x. \]
Since f′′(x) < 0 when x < 2, f is concave down when x < 2 (left part of Figure 4.6). Since f′′(x) > 0 when x > 2, f is concave up when x > 2 (right part of Figure 4.6).

Rational functions. Recall that a rational function, r(x), is the ratio of polynomials. The derivative of r(x) is computed using the quotient rule.

Example 4.9 (Rational Function) For example, if
\[ r(x) = \frac{p(x)}{q(x)} = \frac{3 + 4x - x^2}{7 + 2x}, \]
then
\[ r'(x) = \frac{q(x)p'(x) - p(x)q'(x)}{(q(x))^2} = \frac{(7+2x)(4-2x) - (3+4x-x^2)(2)}{(7+2x)^2} = \frac{-2\left(-11 + 7x + x^2\right)}{(7+2x)^2}. \]


4.3.5 Chain Rule for Composition of Functions

If f and g are differentiable functions, then

\[ \frac{d}{dx}\left(f(g(x))\right) = f'(g(x))\,g'(x). \]

(The derivative of the outside function evaluated at the inside function times the derivative of the inside function evaluated at x.)

Starting with y = f(u) and u = g(x), the chain rule can be written as follows:

\[ \frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx}. \]

For example, since \(\sqrt{u} = u^{1/2}\) and \(\frac{d}{du}(u^{1/2}) = \frac{1}{2}u^{-1/2} = \frac{1}{2\sqrt{u}}\),
\[ \frac{d}{dx}\left(\sqrt{x^2 + 1}\right) = \frac{1}{2\sqrt{x^2 + 1}}\,(2x) = \frac{x}{\sqrt{x^2 + 1}}. \]

Inverse functions. The functions f and g are said to be inverse functions if f(g(x)) = x. Using the chain rule:
\[ \frac{d}{dx}\left(f(g(x))\right) = f'(g(x))g'(x) \;\Rightarrow\; 1 = f'(g(x))g'(x) \;\Rightarrow\; g'(x) = \frac{1}{f'(g(x))}. \]
(The derivative of g at x is the reciprocal of the derivative of f at g(x).)

For example, f(u) = u² and g(x) = √x are inverse functions when x is positive. Since f′(u) = 2u,
\[ g'(x) = \frac{1}{2\sqrt{x}} = \frac{1}{f'(g(x))}. \]

4.3.6 Rules for Exponential and Logarithmic Functions

The natural base for exponential and logarithmic functions in calculus is the base e, where

\[ e = \lim_{n\to\infty}\left(1 + \frac{1}{n}\right)^{\!n} = 2.71828\ldots \text{ is the Euler constant.} \]

The limit above is illustrated in the following table of approximate values:

n             10       100      1000     10000    100000   1000000
(1 + 1/n)^n   2.59374  2.70481  2.71692  2.71815  2.71827  2.71828

The domain of the exponential function e^x = exp(x) is all real numbers and the domain of the logarithmic function ln(x) = log_e(x) is the positive real numbers. Plots of the exponential and logarithmic functions are given in Figure 4.7.



Figure 4.7: Plots of y = e^x (left) and y = ln(x) (right).

The functions e^x and ln(x) are inverses. That is,

e^{ln(x)} = x when x > 0 and ln(e^x) = x for all real numbers.

Their derivatives are as follows:

\[ \frac{d}{dx}(e^x) = e^x \quad\text{and}\quad \frac{d}{dx}(\ln(x)) = \frac{1}{x}. \]

More generally,
\[ \frac{d}{dx}(e^u) = e^u\,\frac{du}{dx} \quad\text{and}\quad \frac{d}{dx}(\ln(u)) = \frac{1}{u}\,\frac{du}{dx}. \]

For example,
\[ \frac{d}{dx}\left(e^{12 - x^3}\right) = e^{12 - x^3}\left(-3x^2\right) \quad\text{and}\quad \frac{d}{dx}\left(\ln(x^4 + 2x^2 + 7)\right) = \frac{4x^3 + 4x}{x^4 + 2x^2 + 7}. \]

4.3.7 Rules for Trigonometric and Inverse Trigonometric Functions

In calculus, we work with trigonometric functions using radian measure. That is, the input x is the radian measure of an angle (see the conversion table in Section 4.1). Figure 4.8 shows plots of the two most commonly used functions, sin(x) and cos(x).

Trigonometric functions have specialized derivative formulas:

\[ \frac{d}{dx}\sin(x) = \cos(x) \qquad\qquad \frac{d}{dx}\cos(x) = -\sin(x) \]
\[ \frac{d}{dx}\tan(x) = \sec^2(x) \qquad\qquad \frac{d}{dx}\cot(x) = -\csc^2(x) \]
\[ \frac{d}{dx}\sec(x) = \sec(x)\tan(x) \qquad \frac{d}{dx}\csc(x) = -\csc(x)\cot(x) \]



Figure 4.8: Plots of y = sin(x) (left) and y = cos(x) (right).

Similarly, inverse trigonometric functions have specialized derivative formulas:

\[ \frac{d}{dx}\sin^{-1}(x) = \frac{1}{\sqrt{1 - x^2}} \qquad\qquad \frac{d}{dx}\cos^{-1}(x) = \frac{-1}{\sqrt{1 - x^2}} \]
\[ \frac{d}{dx}\tan^{-1}(x) = \frac{1}{1 + x^2} \qquad\qquad \frac{d}{dx}\cot^{-1}(x) = \frac{-1}{1 + x^2} \]
\[ \frac{d}{dx}\sec^{-1}(x) = \frac{1}{x^2\sqrt{1 - x^{-2}}} \qquad \frac{d}{dx}\csc^{-1}(x) = \frac{-1}{x^2\sqrt{1 - x^{-2}}} \]

For example,
\[ \frac{d}{dx}\left(\sin(3x^2 + 8)\right) = \cos(3x^2 + 8)\,(6x) \]
and
\begin{align*}
\frac{d}{dx}\left(x\cos(\sqrt{x} + x)\right) &= \cos(\sqrt{x} + x) - x\sin(\sqrt{x} + x)\left(\frac{1}{2\sqrt{x}} + 1\right) \\
&= \cos(\sqrt{x} + x) - \sin(\sqrt{x} + x)\left(\frac{\sqrt{x}}{2} + x\right).
\end{align*}

4.3.8 Indeterminate Forms and L’Hospital’s Rule

If f(x) → 0 (“f(x) approaches 0”) as x → a and g(x) → 0 as x → a, then

\[ \lim_{x\to a} \frac{f(x)}{g(x)} \]

may or may not exist, and is called an indeterminate form of type 0/0.

For example,
\[ \lim_{x\to 1} \frac{x - 1}{x^2 + 11x - 12} \]
is an indeterminate form of type 0/0 whose limit can be computed after cancelling the common factor (x − 1) from numerator and denominator:

\[ \lim_{x\to 1} \frac{x - 1}{x^2 + 11x - 12} = \lim_{x\to 1} \frac{x - 1}{(x - 1)(x + 12)} = \lim_{x\to 1} \frac{1}{x + 12} = \frac{1}{13}. \]


Similarly, if f(x) → ±∞ and g(x) → ±∞ as x → a, then the limit of the ratio f(x)/g(x) is an indeterminate form of type ∞/∞.

Theorem 4.10 (L’Hospital’s Rule) Suppose that f and g are differentiable and g′(x) ≠ 0 near a (except possibly at a). Further, suppose that

1. f(x) → 0 and g(x) → 0 as x → a, or

2. f(x) → ±∞ and g(x) → ±∞ as x → a.

Then
\[ \lim_{x\to a} \frac{f(x)}{g(x)} = \lim_{x\to a} \frac{f'(x)}{g'(x)} \]
if the limit on the right-hand side exists.

L’Hospital’s rule says that the limit of the ratio of the functions is the same as the limit of the ratio of the rates. For example,
\[ \lim_{x\to 1} \frac{x^{20} - 1}{x^{18} - 1} = \lim_{x\to 1} \frac{20x^{19}}{18x^{17}} = \lim_{x\to 1} \frac{10x^2}{9} = \frac{10}{9}. \]

In some cases, L’Hospital’s rule needs to be used more than once. For example,

\[ \lim_{x\to 0} \frac{1 - \cos(x)}{x^2} = \lim_{x\to 0} \frac{\sin(x)}{2x} = \lim_{x\to 0} \frac{\cos(x)}{2} = \frac{1}{2}. \]

(The first and second limits are indeterminate forms of type 0/0.)

One-sided and infinite limits. Note that L’Hospital’s rule remains true for one-sided limits (x → a⁺ or x → a⁻) and for limits where x → ±∞.

Indeterminate products. If f(x) → 0 and g(x) → ±∞ as x → a, then

\[ \lim_{x\to a} f(x)g(x) \]

is an indeterminate form of type 0·∞. L’Hospital’s rule can sometimes be used to evaluate the limit. The trick is to rewrite the product as a quotient:
\[ \lim_{x\to a} f(x)g(x) = \lim_{x\to a} \frac{f(x)}{1/g(x)} \quad(\text{type } 0/0) \qquad\text{or}\qquad \lim_{x\to a} f(x)g(x) = \lim_{x\to a} \frac{g(x)}{1/f(x)} \quad(\text{type } \infty/\infty). \]

For example,
\[ \lim_{x\to\infty} x e^{-x} = \lim_{x\to\infty} \frac{x}{e^x} = \lim_{x\to\infty} \frac{1}{e^x} = 0 \quad(\text{using type } \infty/\infty). \]


Remarks. There are many other types of indeterminate forms, including

∞ − ∞, 0^0, ∞^0, 1^∞.

In each case, anything can happen (that is, a limit may or may not exist).

Note that the limit used to define the Euler constant (Section 4.3.6) is an example of an indeterminate form of type 1^∞.

4.4 Optimization

The function f : D ⊆ R → R has a local minimum (or relative minimum) at c ∈ D if

f(c) ≤ f(x) for each x in the intersection of a neighborhood of c with D.

Similarly, f has a local maximum (or relative maximum) at c ∈ D if

f(c) ≥ f(x) for each x in the intersection of a neighborhood of c with D.

Local maxima and minima are called local extrema of the function.

f has a global (or absolute) minimum at c if f(c) ≤ f(x) for all x ∈ D. Similarly, f has a global (or absolute) maximum at c if f(c) ≥ f(x) for all x ∈ D. Global maxima and minima are called global extrema of the function.

4.4.1 Local Extrema and Derivatives

The following theorem establishes a relationship between local extrema and derivatives.

Theorem 4.11 (Fermat’s Theorem) If the continuous function f has a local extremum at c, and if f′(c) exists, then f′(c) = 0.

Continuing with the cubic function example (Figure 4.6), f(x) = 2 + 9x − 6x² + x³ has a local maximum at 1 (with maximum value 6) and a local minimum at 3 (with minimum value 2). Since the derivative function is f′(x) = 9 − 12x + 3x², it is easy to check that f′(1) = f′(3) = 0.

Critical numbers. A number c is a critical number of f if either f′(c) = 0 or f′(c) does not exist. If f has a local extremum at c, then Fermat’s theorem implies that c is a critical number of the function f.

The cubic function f(x) = 2 + 9x − 6x² + x³ has critical numbers 1 and 3. The absolute value function f(x) = 7 − 3|x − 2| (Figure 4.5) has critical number 2.

4.4.2 First Derivative Test for Local Extrema

Suppose that f is differentiable for all x near c. If c is a critical number of f and if f′ changes sign at c, then f has a local extremum at c. Specifically,


1. If f′(x) is positive below c and negative above c, then f is increasing to the left of c and decreasing to the right of c. Thus, f has a local maximum at c.

2. If f′(x) is negative below c and positive above c, then f is decreasing to the left of c and increasing to the right of c. Thus, f has a local minimum at c.

Continuing with the cubic function f(x) = 2 + 9x − 6x² + x³, it is easy to check that

f ′(x) > 0 when x < 1, f ′(x) < 0 when 1 < x < 3, and f ′(x) > 0 when x > 3.

These computations confirm that f has a local maximum at 1 and a local minimum at 3.

4.4.3 Second Derivative Test for Local Extrema

Suppose that f is twice differentiable for all x near c, and that c is a critical number for f. (In this case, f′(c) = 0.) Then

1. If f ′′(c) > 0, then f has a local minimum at c.

2. If f ′′(c) < 0, then f has a local maximum at c.

Note that if f′(c) = 0 and f′′(c) = 0, then anything can happen. For example, if f is the quartic function f(x) = x⁴, then f′(0) = f′′(0) = 0 and f has a local (and global) minimum at 0. By contrast, if f is the cubic function f(x) = x³, then f′(0) = f′′(0) = 0, but f has neither a local minimum nor a local maximum at 0.

4.4.4 Global Extrema on Closed Intervals

Consider f : [a, b] ⊆ R → R (the domain of interest is the closed interval [a, b]). The following theorem says that continuous functions have global extrema on closed intervals.

Theorem 4.12 (Extreme Value Theorem) If f is continuous on the closed interval [a, b], then there are numbers c, d ∈ [a, b] such that f has a global minimum at c and a global maximum at d.

For example, if the domain of the cubic function f(x) = 2 + 9x − 6x² + x³ (Figure 4.6) is restricted to the closed interval [−0.5, 3.5], then f has a global minimum value of −4.125 at −0.5, and a global maximum value of 6 at 1.

Method for finding global extrema. To find the global extrema of the continuous function f on the closed interval [a, b]:

1. Find the values of f at the critical numbers in the open interval (a, b).

2. Find the values of f at the endpoints: f(a), f(b).



Figure 4.9: Plots of y = 40 + 12x² + 4x³ − 3x⁴. In the right plot, the domain is restricted to the interval [0, 3].

3. The largest function value from the first two steps is the global maximum, and the smallest function value is the global minimum.

Example 4.13 (Quartic Function) Let f(x) = 40 + 12x² + 4x³ − 3x⁴ for all x.

1. As illustrated in the left part of Figure 4.9, f(x) has a local maximum of 45 = f(−1) when x = −1, a local minimum of 40 = f(0) when x = 0, and both a local and global maximum of 72 = f(2) when x = 2.

2. Since f′(x) = 24x + 12x² − 12x³ = −12(x − 2)x(x + 1), we know that f′(x) = 0 when x = −1, 0, 2. Further,

(a) f′(x) > 0 when x < −1 and when 0 < x < 2 and

(b) f′(x) < 0 when −1 < x < 0 and when x > 2.

The values of f(x) are increasing when x < −1, decreasing when −1 < x < 0, increasing when 0 < x < 2, and decreasing when x > 2.

3. Since f′′(x) = 24 + 24x − 36x², we know that f′′(x) = 0 when

\[ x = \frac{1}{3}\left(1 - \sqrt{7}\right) \approx -0.549 \quad\text{and}\quad x = \frac{1}{3}\left(1 + \sqrt{7}\right) \approx 1.215. \]

Further,

(a) f ′′(x) < 0 when x < −0.549 and when x > 1.215 and

(b) f ′′(x) > 0 when −0.549 < x < 1.215.

Concavity changes at both roots of the equation f′′(x) = 0. Specifically, f(x) is concave down when x < −0.549, concave up when −0.549 < x < 1.215, and concave down when x > 1.215.

Example 4.14 (Restricted Domain) Let f(x) = 40 + 12x² + 4x³ − 3x⁴ for x ∈ [0, 3].


1. As illustrated in the right part of Figure 4.9, f(x) has a local minimum of 40 = f(0) at x = 0, a local and global maximum of 72 = f(2) at x = 2, and a local and global minimum of 13 = f(3) at x = 3.

2. Since f′(x) = 24x + 12x² − 12x³ = −12(x − 2)x(x + 1), we know that f′(x) = 0 when x = −1, 0, 2. For the two roots in the restricted domain, we consider

f(0) = 40 and f(2) = 72 when determining global extrema.

3. The function values at the endpoints are f(0) = 40 and f(3) = 13.

4. The global maximum value is the largest of the numbers computed in items 2 and 3. The global minimum value is the smallest of the numbers computed in items 2 and 3.

4.4.5 Newton’s Method

Many problems in mathematics involve finding roots. For example, to find the critical numbers of the differentiable function f, we need to solve the derivative equation f′(x) = 0.

Newton’s method is an iterative method for approximating the roots of an equation of the form g(x) = 0 when g is a differentiable function. The method uses local linearity.

Specifically, if r is a root of the equation, x0 (an initial estimate of r) is a number near r, and g′(x0) ≠ 0, then the tangent line through (x0, g(x0))

y = g(x0) + g′(x0)(x − x0)

can be used to approximate the curve y = g(x), and the x-coordinate of the point where the tangent line crosses the x-axis (call it x1) is likely to be closer to r than x0. To find x1:
\[ 0 = g(x_0) + g'(x_0)(x_1 - x_0) \;\Rightarrow\; x_1 = x_0 - \frac{g(x_0)}{g'(x_0)}. \]

If g′(x1) ≠ 0, the method can be repeated to yield \(x_2 = x_1 - \frac{g(x_1)}{g'(x_1)}\), and so forth. The general formula is
\[ x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)}. \]

The iteration stops when the error term |g(xk)/g′(xk)| falls below a pre-specified bound.

Example 4.15 (Root of Cubic Function) The cubic function g(x) = x³ − x − 1 has a root between 1 and 2, as illustrated in the left part of Figure 4.10. Newton’s method can be used to find the root. The following table shows the results of the first five iterations starting with initial value x0 = 1.0.

k    0     1     2         3        4         5
xk   1.0   1.5   1.34783   1.3252   1.32472   1.32472

Since x4 and x5 are identical to within five decimal places of accuracy, we estimate that the root occurs when x = 1.32472.
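A minimal implementation of the iteration (our sketch; the function name newton and the tolerance are ours, with the stopping rule based on the |g(x)/g′(x)| bound described above):

```python
def newton(g, gprime, x0, tol=1e-8, max_iter=100):
    """Newton's method: iterate x <- x - g(x)/g'(x) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        step = g(x) / gprime(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("no convergence")

# Example 4.15: root of g(x) = x^3 - x - 1 starting from x0 = 1.0
root = newton(lambda x: x**3 - x - 1, lambda x: 3*x**2 - 1, 1.0)
print(round(root, 5))   # 1.32472
```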



Figure 4.10: Newton’s method examples.

Example 4.16 (Point of Intersection) The functions

f1(x) = x + 4 and f2(x) = e^{x/3}

have a point of intersection between 5 and 10, as illustrated in the right part of Figure 4.10. One way to find the point of intersection is to use Newton’s method to solve the equation g(x) = f1(x) − f2(x) = 0. The following table shows the results of the first six iterations starting with initial value x0 = 5.5.

k    0     1         2         3         4         5         6
xk   5.5   8.49133   7.53204   7.28038   7.26519   7.26514   7.26514

Since x5 and x6 are identical to within five decimal places of accuracy, we estimate that the point of intersection occurs when x = 7.26514.

Warning. Newton’s method is powerful and fast (in general, few iterations are required for convergence), but is sensitive to the initial value x0. If we let x0 = 0.57 in the cubic example above, for example, then 35 iterations are needed to approximate the root to within 5 decimal places of accuracy.

If g(x) has more than one root and the chosen x0 is far from r, then the method could converge to the wrong root. If g′(xk) ≈ 0 for some k, then the sequence of iterates may not converge at all.

4.5 Applications in Statistics

Optimization is an important component of many statistical analyses. This section illustrates applications to maximum likelihood and minimum chi-square estimation.

4.5.1 Maximum Likelihood Estimation in Binomial Models

Consider the problem of estimating the binomial probability of success, p. If x successes are observed in n trials, then a natural estimate of p is the sample proportion, p̂ = x/n. If the



Figure 4.11: Examples of binomial (left) and multinomial (right) likelihoods.

data summarize the results of n independent trials of a Bernoulli experiment with success probability p, then p̂ = x/n is also the maximum likelihood (ML) estimate of p.

The ML estimate of p is the value of p which maximizes the likelihood function, Lik(p), or (equivalently) the log-likelihood function, ℓ(p). Lik(p) is the binomial PDF with n and x fixed and p varying:

\[ Lik(p) = \binom{n}{x} p^x (1 - p)^{n - x} \]

and ℓ(p) is the natural logarithm of the likelihood:
\[ \ell(p) = \ln(Lik(p)) = \ln\binom{n}{x} + x\ln(p) + (n - x)\ln(1 - p). \]

If 0 < x < n, then the ML estimate is obtained by solving ℓ′(p) = 0 for p:

\[ \ell'(p) = \frac{x}{p} - \frac{n - x}{1 - p} = 0 \;\Rightarrow\; \frac{x}{p} = \frac{n - x}{1 - p} \;\Rightarrow\; p = \frac{x}{n}. \]

Further, since ℓ′(p) > 0 when 0 < p < x/n and ℓ′(p) < 0 when x/n < p < 1, we know that the likelihood is maximized when p̂ = x/n.

For example, the left part of Figure 4.11 shows the binomial likelihood when 17 successes are observed in 25 trials. The function is maximized at the ML estimate, p̂ = 0.68.
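A crude grid search over the log-likelihood (our sketch; the grid resolution of 0.001 is arbitrary) confirms that the maximizer is x/n = 0.68:

```python
from math import comb, log

# Binomial log-likelihood for x = 17 successes in n = 25 trials
n, x = 25, 17
def loglik(p):
    return log(comb(n, x)) + x * log(p) + (n - x) * log(1 - p)

grid = [k / 1000 for k in range(1, 1000)]
print(max(grid, key=loglik))   # 0.68
```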

4.5.2 Maximum Likelihood Estimation in Multinomial Models

Maximum likelihood estimation can be generalized to multinomial models. Consider, for example, a multinomial model with three categories and probabilities

(p1, p2, p3) = (θ, (1 − θ)θ, (1 − θ)²), where 0 ≤ θ ≤ 1 is unknown,

and suppose that in 60 independent trials of the experiment, the first outcome was observed 15 times, the second outcome 20 times and the third outcome 25 times. That is, suppose that n = 60, x1 = 15, x2 = 20 and x3 = 25.


The likelihood function, Lik(θ), is the multinomial PDF written as a function of θ,

\[ Lik(\theta) = \binom{60}{15,\,20,\,25}\,\theta^{15}\left((1-\theta)\theta\right)^{20}\left((1-\theta)^2\right)^{25} = \binom{60}{15,\,20,\,25}\,\theta^{35}(1-\theta)^{70}, \]

as shown in the right part of Figure 4.11.

The log-likelihood function is the natural logarithm of the likelihood,

\[ \ell(\theta) = \ln(Lik(\theta)) = \ln\binom{60}{15,\,20,\,25} + 35\ln(\theta) + 70\ln(1 - \theta). \]

ℓ′(θ) = 0 when θ = 35/105 = 1/3. You can use either the first or second derivative test to demonstrate that the log-likelihood is maximized when θ̂ = 1/3.

The estimated model has probabilities (1/3, 2/9, 4/9).

4.5.3 Minimum Chi-Square Estimation in Multinomial Models

In 1900, K. Pearson developed a quantitative method to determine if observed data are consistent with a given multinomial model.

If (X1, X2, . . . , Xk) has a multinomial distribution with parameters n and (p1, p2, . . . , pk), then Pearson’s statistic is the following random variable:

\[ X^2 = \sum_{i=1}^{k} \frac{(X_i - np_i)^2}{np_i}. \]

For each i, the observed frequency, Xi, is compared to the expected frequency, E(Xi) = npi, under the multinomial model. If each observed frequency is close to expected, then the value of X² will be close to zero. If at least one observed frequency is far from expected, then the value of X² will be large and the appropriateness of the given multinomial model will be called into question.

In many practical situations, certain parameters of the multinomial model need to be estimated from the sample data. For example, if

(p1, p2, . . . , pk) = (p1(θ), p2(θ), . . . , pk(θ)) for some parameter θ,

then the observed value of Pearson’s statistic

\[ f(\theta) = \sum_{i=1}^{k} \frac{(x_i - np_i(\theta))^2}{np_i(\theta)} \]

is a function of θ, and a natural estimate of θ is the value of θ which minimizes f. This estimate is known as the minimum chi-square estimate of θ for the given (x1, x2, . . . , xk).

Example 4.17 (n = 60, k = 3, continued) Consider again the model with probabilities

(p1, p2, p3) = (θ, (1 − θ)θ, (1 − θ)²), where 0 ≤ θ ≤ 1 is unknown,



Figure 4.12: Plots of y = f(θ) for the minimum chi-square estimation examples.

and assume that in 60 independent trials we obtained x1 = 15, x2 = 20 and x3 = 25.

The function we would like to minimize is

\[ f(\theta) = \frac{(15 - 60\theta)^2}{60\theta} + \frac{(20 - 60(1-\theta)\theta)^2}{60(1-\theta)\theta} + \frac{\left(25 - 60(1-\theta)^2\right)^2}{60(1-\theta)^2} \]

(left part of Figure 4.12) and the derivative of this function (after simplification) is
\[ f'(\theta) = \frac{-5\left(9\theta^3 - 9\theta^2 + 75\theta - 25\right)}{12(\theta - 1)^3\theta^2}. \]

To find the minimum chi-square estimate we need to find the root of

g(θ) = 9θ³ − 9θ² + 75θ − 25 in the [0, 1] interval.

Using Newton’s method, 4 decimal places of accuracy, and θ0 = 0.5 we get:

θ0 = 0.5, θ1 = 0.343643, θ2 = 0.342592, θ3 = 0.342592.

Let θ̂ = 0.342592. Since

f′(θ) < 0 when 0 < θ < θ̂ and f′(θ) > 0 when θ̂ < θ < 1,

the function is minimized at θ̂.

The estimated model has (approximate) probabilities (0.3426, 0.2252, 0.4322).
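The estimate can be reproduced by applying Newton's method to the numerator of f′ (our sketch, reusing the newton helper from Example 4.15):

```python
def newton(g, gprime, x0, tol=1e-10, max_iter=100):
    """Newton's method: iterate x <- x - g(x)/g'(x) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        step = g(x) / gprime(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("no convergence")

# Root of g(theta) = 9 theta^3 - 9 theta^2 + 75 theta - 25, starting at 0.5
g = lambda t: 9*t**3 - 9*t**2 + 75*t - 25
gp = lambda t: 27*t**2 - 18*t + 75
theta = newton(g, gp, 0.5)
print(round(theta, 6))                              # 0.342592
probs = (theta, (1 - theta)*theta, (1 - theta)**2)
print(tuple(round(p, 4) for p in probs))            # (0.3426, 0.2252, 0.4322)
```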

Example 4.18 (Genetics Study) (Larsen & Marx, Prentice-Hall, 1986, p. 418) Researchers were interested in studying two genetic characteristics of corn:

• Starchy (S) or sugary (s), where starchy is the dominant characteristic, and

• Green base leaf (G) or white base leaf (g), where green base leaf is the dominant characteristic,


Table 4.1: Summary of results of the genetics study.

1 Starchy green (SG) 1997

2 Starchy white (Sg) 906

3 Sugary green (sG) 904

4 Sugary white (sg) 32

in the offspring of self-fertilized hybrid corn plants. The results for 3839 offspring are summarized in Table 4.1.

A model of interest to geneticists hypothesizes that the probabilities of the four outcomes (SG, Sg, sG, sg) are

(p1, p2, p3, p4) = (0.25(2 + θ), 0.25(1 − θ), 0.25(1 − θ), 0.25θ),

where θ is a parameter in the interval (0, 1). θ is a cross-over parameter, indicating a tendency for chromosome pairs Sg and sG in the hybrid parent to cross over to SG and sg at the time of reproduction. (As θ increases the tendency to cross over increases.)

The right part of Figure 4.12 suggests that f(θ) is minimized in the interval 0.02 < θ < 0.05. Using Newton’s method to solve the derivative equation

\[ \frac{d}{d\theta} f(\theta) = 0 \quad\text{with initial value } 0.03, \]

the minimum chi-square estimate of θ is θ̂ = 0.0357852.

4.6 Rolle’s Theorem and the Mean Value Theorem

The Mean Value Theorem (MVT) connects the local picture about a curve (namely, theslope of the curve at a given point) to the global picture (namely, the average rate of changeover an interval). Rolle’s Theorem is the first step.

Theorem 4.19 (Rolle's Theorem) Suppose f is continuous on [a, b], differentiable on (a, b), and that f(a) = f(b). Then there is a c ∈ (a, b) satisfying f′(c) = 0.

Example 4.20 (Cubic Polynomial) For example, let f(x) = 48 − 44x + 12x² − x³ on [a, b] = [2, 4] (left part of Figure 4.13). Since f(2) = f(4), Rolle's theorem applies. On this interval, f′(c) = 0 when c = 4 − 2/√3 ≈ 2.8453. (c is a critical number corresponding to the global minimum on the interval.)

Theorem 4.21 (Mean Value Theorem) Suppose f is continuous on the closed interval [a, b] and differentiable on the open interval (a, b). Then

(f(b) − f(a))/(b − a) = f′(c) for some number c satisfying a < c < b.


Figure 4.13: Plots of y = 48 − 44x + 12x² − x³ (left) and y = 17 + 2x − 3x² + 0.1x⁴ (right).

The MVT can be proven using Rolle’s theorem. Let

s(x) = f(a) + ((f(b) − f(a))/(b − a))(x − a) and h(x) = f(x) − s(x).

Since h(a) = f(a) − f(a) = 0 and h(b) = f(b) − f(b) = 0, the conditions of Rolle's theorem are met. Thus, there is a c ∈ (a, b) satisfying

0 = h′(c) = f′(c) − (f(b) − f(a))/(b − a) ⇒ f′(c) = (f(b) − f(a))/(b − a).

Example 4.22 (Quartic Polynomial) For example, if f(x) = 17 + 2x − 3x² + 0.1x⁴ and [a, b] = [−5, 5] (right part of Figure 4.13), then the average rate of change over the interval is 2. The instantaneous rate of change of the function is 2 when c = 0 and c = ±√15 ≈ ±3.87298.

4.6.1 Generalized Mean Value Theorem and Error Analysis

The MVT can be extended to two functions.

Theorem 4.23 (Generalized MVT) Suppose f and g are continuous on the closed interval [a, b] and differentiable on the open interval (a, b). Then

(f(b) − f(a))g′(c) = (g(b) − g(a))f ′(c) for some number c satisfying a < c < b.

The generalized MVT can be proven using Rolle’s theorem applied to

h(x) = (f(b) − f(a))g(x) − (g(b) − g(a))f(x).

Since h(a) = f(b)g(a) − g(b)f(a) and h(b) = f(b)g(a) − g(b)f(a), so that h(a) = h(b), the conditions of Rolle's theorem are met. Thus, there is a c ∈ (a, b) satisfying

0 = h′(c) = (f(b) − f(a))g′(c) − (g(b) − g(a))f′(c) ⇒ (f(b) − f(a))g′(c) = (g(b) − g(a))f′(c).


Error in linear approximations. If we apply the MVT to the function f on the interval [a, x], then we get

f(x) = f(a) + f ′(c)(x − a) for some c satisfying a < c < x.

That is, the error in using f(a) instead of using f(x) for values of x near a is exactly f′(c)(x − a) for some number c between a and x.

How about the error in using the tangent line instead of f(x)?

Suppose f is twice differentiable. Let ℓ(x) = f(a) + f′(a)(x − a) be the tangent line to y = f(x) at (a, f(a)), e(x) = f(x) − ℓ(x) be the difference between f and ℓ, and g(x) = (x − a)². (The function e(x) is often called the error function.)

If we apply the generalized MVT twice to the functions e and g on the interval [a, x], then we obtain the following equation

f(x) = f(a) + f′(a)(x − a) + (f″(c)/2)(x − a)² for some c ∈ (a, x).

That is, the error in using values on the tangent line instead of using f(x) when x is near a is exactly (f″(c)/2)(x − a)² for some number c between a and x. The form of the result is the same if you apply the generalized MVT twice to e and g on the interval [x, a].

Error in quadratic approximations. In Section 6.6 we will consider quadratic, cubic, quartic and higher order polynomial approximations to a function f, and error analyses. As a preview, suppose that f is differentiable to order 3 and consider the following quadratic approximation to f at a,

q(x) = f(a) + f′(a)(x − a) + (f″(a)/2)(x − a)².

(q(x) is a quadratic polynomial satisfying q(a) = f(a), q′(a) = f′(a) and q″(a) = f″(a).)

Let e(x) = f(x) − q(x) and g(x) = (x − a)³. If we apply the generalized MVT three times to the functions e and g on the interval [a, x], then we obtain the following equation

f(x) = f(a) + f′(a)(x − a) + (f″(a)/2)(x − a)² + (f‴(c)/6)(x − a)³ for some c ∈ (a, x).

That is, the error in using values on the quadratic curve instead of using f(x) when x is near a is exactly (f‴(c)/6)(x − a)³ for some number c between a and x. The form of the result is the same if you apply the generalized MVT three times to e and g on the interval [x, a].
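As a quick numeric illustration of these error formulas, the sketch below (Python, standard library only; all names are illustrative) compares f(x) = e^x near a = 0 with its tangent line and quadratic approximation. Since f′(0) = f″(0) = 1, halving x − a should cut the linear error by about 4 and the quadratic error by about 8, which is what the output shows.

import math

def tangent(x):
    return 1.0 + x                  # f(a) + f'(a)(x - a) at a = 0

def quadratic(x):
    return 1.0 + x + 0.5 * x**2     # adds (f''(a)/2)(x - a)^2

for x in (0.1, 0.05, 0.025):
    e1 = abs(math.exp(x) - tangent(x))     # behaves like (x - a)^2
    e2 = abs(math.exp(x) - quadratic(x))   # behaves like (x - a)^3
    print(x, e1, e2)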

4.7 Integration

If f is the derivative of F, then we can interpret the function f as the instantaneous rate of change of the function F. Starting with a function f representing instantaneous rate of change, the techniques of integration allow us to recover F and to compute the total change in F over an interval.


Figure 4.14: Definite integral as area (left) and difference in areas (right).

4.7.1 Riemann Sums and Definite Integrals

Suppose f is continuous on the closed interval [a, b], and let n be a positive integer. Further, let ∆x = (b − a)/n and xi = a + i∆x for i = 0, 1, 2, . . . , n. Then the n + 1 numbers

a = x0 < x1 < x2 < · · · < xn = b

partition the interval [a, b] into n subintervals of equal length. Lastly, let x∗i be the midpoint of the ith subinterval: x∗i = (xi−1 + xi)/2.

The definite integral of f from a to b is

∫_a^b f(x) dx = lim_{n→∞} Σ_{i=1}^n f(x∗i) ∆x.

In this formula: a is the lower limit of integration, b is the upper limit of integration, f(x) is the integrand, and the sum on the right is called a Riemann sum.

Continuity and Integrability. f is said to be integrable on the interval [a, b] if the limit above exists. Continuous functions are always integrable. Further, it is possible to show that the limit exists even if f has a finite number of “jump discontinuities” on [a, b].

Area and difference in areas. If f is nonnegative on [a, b], then ∫_a^b f(x) dx can be interpreted as the area under the curve y = f(x) and above the x-axis when a ≤ x ≤ b. Otherwise, the integral can be interpreted as the difference between two areas.

Example 4.24 (Computing Area) Let f(x) = √x on the interval [0, 4]. To estimate the area under y = f(x) and above the x-axis we let n = 10 and ∆x = 0.4, as illustrated in the left part of Figure 4.14. The 10 midpoints are

0.2, 0.6, 1.0, 1.4, 1.8, 2.2, 2.6, 3.0, 3.4, 3.8,

and the sum of the areas is 5.347.


Figure 4.15: Definite integral as average (left) and arc length (right).

The following list of approximate values suggests that the limit is 16/3.

n 10 50 90 130 170 210 250 290 330 370 410

Sum 5.347 5.335 5.334 5.334 5.334 5.333 5.333 5.333 5.333 5.333 5.333
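A midpoint-rule sum like the one in this example takes only a few lines of code. A minimal sketch (Python, standard library only; the name midpoint_sum is illustrative):

import math

def midpoint_sum(f, a, b, n):
    # midpoint-rule Riemann sum with n equal subintervals
    dx = (b - a) / n
    return sum(f(a + (i - 0.5) * dx) for i in range(1, n + 1)) * dx

print(round(midpoint_sum(math.sqrt, 0.0, 4.0, 10), 3))    # 5.347
print(round(midpoint_sum(math.sqrt, 0.0, 4.0, 410), 3))   # 5.333, near 16/3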

Example 4.25 (Difference in Areas) Let f(x) = x³ − 4x on the interval [0, 3]. To estimate the definite integral we let n = 9 and ∆x = 1/3, as illustrated in the right part of Figure 4.14. The 9 midpoints are

0.167, 0.5, 0.833, 1.167, 1.5, 1.833, 2.167, 2.5, 2.833,

and the sum is Σ_i f(x∗i) ∆x = 2.125.

The following list of approximate values suggests that the limit is 2.25.

n 9 24 39 54 69 84 99 114 129 144 159

Sum 2.125 2.232 2.243 2.247 2.248 2.249 2.249 2.249 2.249 2.25 2.25

Geometrically, the limit is the difference between areas:

2.25 = ∫_0^3 f(x) dx = ∫_0^2 f(x) dx + ∫_2^3 f(x) dx = −4 + 6.25.

(The area between 0 and 2 is negated because it lies below the x-axis.)

Note that the total area bounded by y = f(x), y = 0, x = 0 and x = 3 is 4 + 6.25 = 10.25.

Average values. The quantity (1/(b − a)) ∫_a^b f(x) dx can be interpreted as the average value of f(x) on the interval [a, b].

To see this, let f̄ = (1/n) Σ_{i=1}^n f(x∗i) be the sample average. Then

f̄ = ((b − a)/(b − a)) (1/n) Σ_{i=1}^n f(x∗i) = (1/(b − a)) Σ_{i=1}^n f(x∗i) ∆x ⇒ lim_{n→∞} f̄ = (1/(b − a)) ∫_a^b f(x) dx.


Example 4.26 (Average Value) Let f(x) = 225e^{0.08x} on the interval [0, 35]. To estimate the average of f(x) we let n = 5 and ∆x = 35/5 = 7, as illustrated in the left part of Figure 4.15. Function values at the 5 midpoints are

f(3.5) ≈ 297.704, f(10.5) ≈ 521.183, f(17.5) ≈ 912.42, f(24.5) ≈ 1597.35, f(31.5) ≈ 2796.43,

and the average of these 5 numbers is 1225.02.

The following list of approximate values suggests that the limit is 1241.08.

n    5       30      55      80      105     130     155     180     205
Avg  1225.02 1240.64 1240.95 1241.02 1241.05 1241.06 1241.07 1241.08 1241.08

Arc lengths. Under mild conditions, the quantity ∫_a^b √(1 + (f′(x))²) dx can be interpreted as the length of the curve y = f(x) for x ∈ [a, b].

To see this, let yi = f(xi), ∆yi = yi − yi−1, and

Σ_{i=1}^n √((∆x)² + (∆yi)²) = Σ_{i=1}^n √(1 + (∆yi/∆x)²) ∆x

be the sum of lengths of line segments joining successive points:

(x0, y0) → (x1, y1) → · · · → (xn, yn).

If f′ is continuous and not zero on [a, b] (except possibly at the endpoints), then

∆yi/∆x ≈ f′(x∗i) for i = 1, 2, . . . , n,

the sum of segment lengths can be approximated as follows,

Σ_{i=1}^n √((∆x)² + (∆yi)²) ≈ Σ_{i=1}^n √(1 + (f′(x∗i))²) ∆x,

and the limit of the Riemann sum on the right is the arc length.

Example 4.27 (Arc Length) Let f(x) = x³ on the interval [0, 4]. To estimate the length of the curve we let n = 4 and ∆x = 4/4 = 1, as illustrated in the right part of Figure 4.15. The sum of the lengths of the segments connecting the points (0,0), (1,1), (2,8), (3,27), (4,64) is 64.525.

The following list of approximate values suggests that the limit is 64.672.

n    4      12     20     28     36     44     52     60     68     76
Sum  64.525 64.659 64.667 64.669 64.67  64.671 64.671 64.671 64.672 64.672


4.7.2 Properties of Definite Integrals

Suppose that f and g are integrable on [a, b], and let c be a constant. Then properties of limits imply the following properties of definite integrals:

1. Constant functions: ∫_a^b c dx = c(b − a).

2. Sums and differences: ∫_a^b (f(x) ± g(x)) dx = ∫_a^b f(x) dx ± ∫_a^b g(x) dx.

3. Constant multiples: ∫_a^b cf(x) dx = c ∫_a^b f(x) dx.

4. Adjacent intervals: ∫_a^b f(x) dx = ∫_a^c f(x) dx + ∫_c^b f(x) dx when a < c < b.

5. Null interval: ∫_a^a f(x) dx = 0.

6. Reverse limits of integration: ∫_a^b f(x) dx = − ∫_b^a f(x) dx.

7. Ordering: If f(x) ≤ g(x) for all x on [a, b], then ∫_a^b f(x) dx ≤ ∫_a^b g(x) dx.

8. Absolute value: If |f| is integrable, then |∫_a^b f(x) dx| ≤ ∫_a^b |f(x)| dx.

4.7.3 Fundamental Theorem of Calculus

The Fundamental Theorem of Calculus (FTC) relates integration and differentiation, and gives us a method for computing definite integrals when f(x) is a derivative function.

Theorem 4.28 (Fundamental Theorem of Calculus) Let f be a continuous function on the closed interval [a, b]. Then

1. If F(x) = ∫_a^x f(t) dt, then F′(x) = f(x) for x ∈ [a, b].

2. If f(x) = F′(x) for all x in [a, b], then ∫_a^b f(x) dx = F(b) − F(a).

The first part of the fundamental theorem says that f(x) is the rate of change of F(x) on the interval [a, b]. The second part of the theorem says that the definite integral of a rate of change is the total change over the interval.

Derivatives of integrals with variable endpoints. A useful generalization of the Fundamental Theorem of Calculus is as follows: If ℓ(x) ≤ u(x) are differentiable functions (for the lower and upper endpoints), f is continuous, and

G(x) = ∫_{ℓ(x)}^{u(x)} f(t) dt,

then G′(x) = f(u(x))u′(x) − f(ℓ(x))ℓ′(x).


Table 4.2: Table of Indefinite Integrals.

∫ c f(x) dx = c ∫ f(x) dx
∫ (f(x) + g(x)) dx = ∫ f(x) dx + ∫ g(x) dx
∫ x^n dx = x^{n+1}/(n + 1) + C (n ≠ −1)
∫ (1/x) dx = ln |x| + C (x ≠ 0)
∫ e^x dx = e^x + C
∫ ln(x) dx = x ln(x) − x + C (x > 0)
∫ sin(x) dx = −cos(x) + C
∫ cos(x) dx = sin(x) + C
∫ sec²(x) dx = tan(x) + C
∫ csc²(x) dx = −cot(x) + C
∫ sec(x) tan(x) dx = sec(x) + C
∫ csc(x) cot(x) dx = −csc(x) + C
∫ 1/(1 + x²) dx = tan⁻¹(x) + C
∫ 1/√(1 − x²) dx = sin⁻¹(x) + C

4.7.4 Antiderivatives and Indefinite Integrals

The function F(x) in the FTC is called an antiderivative of f(x). Since the derivative of a constant function is zero, antiderivative functions are not unique. The family of antiderivatives, known as the indefinite integral of f(x), is often written as follows:

∫ f(x) dx = F(x) + C where C is a constant (the constant of integration).

If F(x) is an antiderivative of f(x), then the following notation is used when evaluating a definite integral using the FTC:

∫_a^b f(x) dx = [F(x)]_a^b = F(b) − F(a).

That is, we compute the difference between the function value at the upper endpoint and the function value at the lower endpoint.

4.7.5 Some Formulas for Indefinite Integrals

A useful first list of indefinite integrals is given in Table 4.2.

Example 4.29 (Computing Area, continued) Since ∫ x^{1/2} dx = (2/3)x^{3/2} + C, the area under the curve y = √x = x^{1/2} and above the x-axis for x ∈ [0, 4] is

∫_0^4 √x dx = [(2/3)x^{3/2}]_0^4 = 16/3.

This answer confirms the earlier result.


Rules for polynomials. If p(x) = anx^n + an−1x^{n−1} + · · · + a1x + a0 is a polynomial function, then information in Table 4.2 and properties of integrals given earlier imply that

∫ p(x) dx = (an/(n + 1))x^{n+1} + (an−1/n)x^n + · · · + (a2/3)x³ + (a1/2)x² + a0x + C.

In particular, ∫ a dx = ax + C and ∫ (ax + b) dx = (a/2)x² + bx + C.

Example 4.30 (Difference in Areas, continued) Since ∫ (x³ − 4x) dx = (1/4)x⁴ − 2x² + C, the definite integral of f(x) = x³ − 4x for x ∈ [0, 3] is

∫_0^3 (x³ − 4x) dx = [(1/4)x⁴ − 2x²]_0^3 = 9/4.

This answer confirms the earlier result.

Substitution rule for composite functions. The chain rule for compositions says that

d/dx (f(g(x))) = f′(g(x))g′(x) where f and g are differentiable functions.

Thus, ∫ f′(g(x))g′(x) dx = f(g(x)) + C

is a general rule for finding antiderivatives.

If we let u = g(x) and du = g′(x) dx, then we can rewrite the formula above as

∫ f′(u) du = f(u) + C.

(We substitute u for g(x) and du for g′(x) dx.)

For example, consider finding

∫ x/√(x² + 1) dx.

To apply substitution, let u = x² + 1. Then du = 2x dx (or x dx = (1/2) du), and the integral becomes

∫ 1/(2√u) du = u^{1/2} + C = √(x² + 1) + C.

Compositions when g(x) is linear. If g(x) = ax + b, where a and b are constants and a ≠ 0, then a general rule for finding the indefinite integral is as follows:

∫ f′(ax + b) dx = (1/a) f(ax + b) + C.

Example 4.31 (Average Value, continued) Since ∫ 225e^{0.08x} dx = 2812.5e^{0.08x} + C, the average value of f(x) = 225e^{0.08x} for x ∈ [0, 35] is

(1/35) ∫_0^35 225e^{0.08x} dx = (1/35) [2812.5e^{0.08x}]_0^35 ≈ 1241.08.

This answer confirms the earlier result.


Integration by parts. The derivative rule for products says that

d/dx (f(x)g(x)) = f′(x)g(x) + f(x)g′(x) where f and g are differentiable functions.

Thus,

∫ (f′(x)g(x) + f(x)g′(x)) dx = f(x)g(x) + C ⇒ ∫ f(x)g′(x) dx = f(x)g(x) − ∫ f′(x)g(x) dx

is a general rule for finding antiderivatives.

If we let u = f(x), du = f′(x) dx, v = g(x), and dv = g′(x) dx, then we can rewrite the formula above as follows:

∫ u dv = uv − ∫ v du.

In integration by parts, the problem of finding the indefinite integral ∫ u dv is replaced by the problem of finding ∫ v du. Care must be taken to choose u and v so that the second integral is easier to compute than the first.

For example, to find ∫ xe^x dx, we let u = x and dv = e^x dx. Then du = dx. Since

∫ dv = ∫ e^x dx = e^x + C

and v can be any antiderivative, we choose v = e^x. Now

∫ u dv = uv − ∫ v du = xe^x − ∫ e^x dx = xe^x − e^x + C.

Similarly, to find ∫ x ln(x) dx, we let u = ln(x) and dv = x dx. Then du = (1/x) dx. Since

∫ dv = ∫ x dx = x²/2 + C

and v can be any antiderivative, we choose v = x²/2. Now,

∫ u dv = uv − ∫ v du = (1/2)x² ln(x) − ∫ (x²/2)(1/x) dx = (1/2)x² ln(x) − x²/4 + C.

4.7.6 Approximation Methods

Finding antiderivatives is not as straightforward as finding derivatives. In fact, in many situations exact antiderivatives cannot be found. (For example, f(x) = e^x/x has no antiderivative expressible in terms of elementary functions.) When an exact antiderivative does not exist, an approximate method can be used to estimate a definite integral.

Midpoint rule. A natural estimate to use is the value of the Riemann sum

Σ_{i=1}^n f(x∗i) ∆x when n is large.

Since x∗i was chosen to be the midpoint of [xi−1, xi] for each i, this method of estimating the definite integral is called the midpoint rule.


Figure 4.16: Plots of y = e^x/x with trapezoidal (left) and Simpson (right) approximations.

Trapezoidal and Simpson's rules. In the trapezoidal rule, the area under the piecewise linear curve defined by the points

(xi, f(xi)) for i = 0, 1, 2, . . . , n

is used to estimate the area under the nonnegative curve y = f(x). (Geometrically, a sum of areas of trapezoids replaces a sum of areas of rectangles.)

In Simpson's rule, the area under a piecewise quadratic curve is used to estimate the area under the nonnegative curve y = f(x), where the quadratic polynomial for the ith subinterval passes through the points (xi−1, f(xi−1)), (x∗i, f(x∗i)), and (xi, f(xi)).

Example 4.32 (Figure 4.16) Let f(x) = e^x/x on the interval [1, 7]. The left part of Figure 4.16 shows the setup for the trapezoidal rule, and the right part shows the setup for Simpson's rule, when n = 3 subintervals are used. The approximation in the right plot is significantly better than the one in the left plot. (Note that the numerical value of the integral is 189.61.)

Newton-Cotes and Gaussian methods. The midpoint rule gives exact answers if f is a constant function, the trapezoidal rule gives exact answers if f is a constant function or a linear function, and Simpson's rule gives exact answers if f is a constant function, a linear function or a quadratic function. They are examples of Newton-Cotes approximations.

An approximation of the Newton-Cotes type will give exact answers when f is a polynomial function of degree at most a fixed constant. Gaussian quadrature methods take Newton-Cotes methods a step further, by choosing partition points that are not necessarily equally spaced; the spacing is chosen to minimize the approximation error.

Monte Carlo methods. Monte Carlo methods are based on the idea of averaging (see the discussion of average values in Section 4.7.1). Let x1, x2, . . ., xn be n numbers chosen uniformly from the interval [a, b]. Then a Monte Carlo estimate of ∫_a^b f(x) dx is

(b − a) (1/n) Σ_{i=1}^n f(xi).


Figure 4.17: Improper integral examples on finite intervals.

Let f(x) = e^x/x and [a, b] = [1, 7] (see Figure 4.16). A Monte Carlo estimate of the integral, based on n = 50000 random numbers, was 190.061.
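The estimates in this section can be reproduced with a short script. A minimal sketch (Python, standard library only; names illustrative). With n = 1000 subintervals the midpoint and trapezoidal sums settle near 189.61, and the Monte Carlo average, which depends on the random numbers drawn, fluctuates around the same value.

import math, random

def f(x):
    return math.exp(x) / x

a, b, n = 1.0, 7.0, 1000
dx = (b - a) / n

midpoint = sum(f(a + (i - 0.5) * dx) for i in range(1, n + 1)) * dx
trapezoid = (f(a) / 2 + sum(f(a + i * dx) for i in range(1, n)) + f(b) / 2) * dx

m = 50000
monte_carlo = (b - a) * sum(f(random.uniform(a, b)) for _ in range(m)) / m

print(midpoint, trapezoid, monte_carlo)   # all near 189.61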

Power series methods. Methods based on power series expansions can also be used to estimate an integral. See Section 6.7.2.

4.7.7 Improper Integrals

The definition of the definite integral can be extended to include situations where f is undefined at either the lower or upper endpoint of integration, and to include situations where a = −∞ or b = ∞.

Specifically,

1. If f is integrable on [c, b] for all a < c < b, then

∫_a^b f(x) dx = lim_{c→a+} ∫_c^b f(x) dx when this limit exists.

2. If f is integrable on [a, c] for all a < c < b, then

∫_a^b f(x) dx = lim_{c→b−} ∫_a^c f(x) dx when this limit exists.

3. If f is integrable on [a, b] for all b greater than a fixed a, then

∫_a^∞ f(x) dx = lim_{b→∞} ∫_a^b f(x) dx when this limit exists.

4. If f is integrable on [a, b] for all a less than a fixed b, then

∫_{−∞}^b f(x) dx = lim_{a→−∞} ∫_a^b f(x) dx when this limit exists.


Figure 4.18: Improper integral examples on infinite intervals.

When the limit exists, the improper integral is said to converge; otherwise, the improper integral is said to diverge.

Example 4.33 (Improper Integrals on Finite Intervals) Let f(x) = 10/x² on the interval (0, 2], as illustrated in the left part of Figure 4.17. Since

∫_0^2 f(x) dx = [−10/x]_{c→0+}^2 = −5 − lim_{c→0+} (−10/c) does not exist,

the improper integral diverges.

Similarly, let f(x) = 10/√(5 − x) on the interval [0, 5), as illustrated in the right part of Figure 4.17. Since

∫_0^5 f(x) dx = [−20√(5 − x)]_0^{c→5−} = lim_{c→5−} (−20√(5 − c)) + 20√5 = 0 + 20√5 = 20√5,

the improper integral converges to 20√5 ≈ 44.7214.

Example 4.34 (Improper Integrals on Infinite Intervals) Let f(x) = xe^{−x} on the interval [0, ∞), as illustrated in the left part of Figure 4.18. Since

∫_0^∞ f(x) dx = [−xe^{−x} − e^{−x}]_0^{c→∞} (using integration-by-parts)
= lim_{c→∞} (−ce^{−c} − e^{−c}) + 1
= lim_{c→∞} (−c/e^c) − lim_{c→∞} (e^{−c}) + 1
= lim_{c→∞} (−1/e^c) − 0 + 1 (using L'Hospital's Rule)
= 0 − 0 + 1 = 1,

the improper integral converges to 1.

Similarly, let f(x) = 10/√x on the interval [1, ∞), as illustrated in the right part of Figure 4.18. Since

∫_1^∞ f(x) dx = [20√x]_1^{c→∞} = lim_{c→∞} (20√c) − 20 does not exist,

the improper integral diverges.


Figure 4.19: Standardized binomial distribution with standard normal approximation.

4.8 Applications in Statistics

Integration methods are important in statistics, in particular when working with density functions. Density functions are used to “smooth over” probability histograms of discrete random variables, and as functions supporting the theory of continuous random variables. Continuous random variables are studied in Chapter 5 of these notes.

4.8.1 deMoivre-Laplace Limit Theorem

The most famous example of “smoothing over” is given in the following theorem.

Theorem 4.35 (deMoivre-Laplace Limit Theorem) Let Xn be a binomial random variable based on n trials of a Bernoulli experiment with success probability p, and let

Zn = (Xn − np)/√(npq) be the standardized form of Xn.

(E(Zn) = 0 and SD(Zn) = 1.) Then, for each real number x,

lim_{n→∞} P(Zn ≤ x) = ∫_{−∞}^x (1/√(2π)) e^{−z²/2} dz.

The deMoivre-Laplace limit theorem implies that the area under the standard normal density function, φ(z) = (1/√(2π)) e^{−z²/2}, can be used to estimate binomial probabilities when n is large enough. (See Section 5.2.5 for more details on the normal distribution.)

Example 4.36 (Binomial Approximation) Let n = 40 and p = 1/2. In Figure 4.19, the function φ(z) is superimposed on the probability histogram for the standardized random variable Z40 = (X40 − 20)/√10.

The probability P(X40 ≤ 22) can be computed using the binomial PDF:

P(X40 ≤ 22) = Σ_{x=0}^{22} C(40, x) (1/2)^{40} = 0.785205.


Using the normal approximation (with continuity correction) we get

P(X40 ≤ 22) ≈ Φ((22.5 − 20)/√10) = 0.785402

where Φ(x) is the area under φ(z) for z ≤ x. The values are close.
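Both numbers are easy to reproduce. A minimal sketch (Python, standard library only; math.erf gives Φ via Φ(x) = (1 + erf(x/√2))/2; the name Phi is illustrative):

import math

def Phi(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, p = 40, 0.5
exact = sum(math.comb(n, x) for x in range(23)) * 0.5**n
approx = Phi((22.5 - n * p) / math.sqrt(n * p * (1 - p)))

print(round(exact, 6))    # 0.785205
print(round(approx, 6))   # 0.785402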

Stirling's formula. An important component of the proof of the deMoivre-Laplace limit theorem is the use of Stirling's formula for factorials. Stirling proved that

lim_{n→∞} n! / (√(2π) n^{n+1/2} e^{−n}) = 1.

Thus, for large values of n, the formula √(2π) n^{n+1/2} e^{−n} can be used to approximate n!.

4.8.2 Computing Quantiles

If the continuous random variable X has density function f(x), then the cumulative distribution function (CDF) of X can be obtained using integration:

F(x) = P(X ≤ x) = ∫_{−∞}^x f(t) dt.

If 0 < p < 1, then the pth quantile of the X distribution (when it exists) is the number, xp, satisfying the equation

F(xp) = ∫_{−∞}^{xp} f(t) dt = p.

Example 4.37 (Pareto Distribution) The Pareto distribution has density function

f(x) = α/x^{α+1} when x > 1 and 0 otherwise,

where α is a positive constant.

Pareto distributions are used to model incomes in a population. The CDF of the Pareto distribution is found using integration. Specifically,

F(x) = ∫_{−∞}^x f(t) dt = ∫_1^x α/t^{α+1} dt = [−t^{−α}]_1^x = 1 − x^{−α}

when x > 1 and 0 otherwise. For each p, the pth quantile is

p = F(xp) ⇒ xp = (1 − p)^{−1/α}.

For example, the quartiles (25th, 50th, and 75th percentiles) of the Pareto distribution with parameter α = 2.5 are 1.122, 1.320, and 1.741, respectively.
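The quantile formula is one line of code. A minimal sketch (Python, standard library only; the name pareto_quantile is illustrative):

def pareto_quantile(p, alpha):
    # invert F(x) = 1 - x**(-alpha): x_p = (1 - p)**(-1/alpha)
    return (1.0 - p) ** (-1.0 / alpha)

print([round(pareto_quantile(p, 2.5), 3) for p in (0.25, 0.50, 0.75)])
# [1.122, 1.32, 1.741]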

Additional examples will be given in the next chapter.


5 Continuous Random Variables

Researchers use random variables to describe the numerical results of experiments. In the continuous setting, the possible numerical values form an interval or a union of intervals. For example, a random variable whose values are the positive real numbers might be used to describe the lifetimes of individuals in a population.

This chapter focuses on continuous random variables and their probability distributions.

5.1 Definitions

Recall that a random variable is a function from the sample space of an experiment to the real numbers and that the random variable X is said to be continuous if its range is an interval or a union of intervals.

5.1.1 PDF and CDF for Continuous Distributions

Let X be a continuous random variable. We assume there exists a nonnegative function f(x) defined over the entire real line such that

P(X ∈ D) = ∫_D f(x) dx,

where D ⊆ R is an interval or union of intervals. The function f(x) is called the probability density function (PDF) (or density function) of the continuous random variable X.

PDFs satisfy the following properties:

1. f(x) ≥ 0 whenever it exists.

2. ∫_R f(x) dx equals 1, where R is the range of X.

f represents the rate of change of probability; the rate must be nonnegative (property 1). The area under f represents probability; the total probability must be 1 (property 2).

The cumulative distribution function (CDF) of X, denoted by F (x), is the function

F(x) = P(X ≤ x) = ∫_{(−∞, x]} f(t) dt for all real numbers x.

CDFs satisfy the following properties:

1. lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.

2. If x1 ≤ x2, then F(x1) ≤ F(x2).

3. F(x) is continuous.


Figure 5.1: Probability density function (left plot) and cumulative distribution function (right plot) for a continuous random variable with range x ≥ 0.

F represents cumulative probability, with limits 0 and 1 (property 1). Cumulative probability increases with increasing x (property 2). For continuous random variables, the cumulative distribution function is continuous (property 3).

Note that if R is the range of the continuous random variable X and I is an interval, then the probability of the event “the value of X is in the interval I” is computed by finding the area under the density curve for x ∈ I ∩ R:

P(X ∈ I) = ∫_{I∩R} f(x) dx.

In particular, if a ∈ R, then P(X = a) = 0, since the area under the curve over an interval of length zero is zero.

Example 5.1 (Distribution on Nonnegative Reals) Let X be a continuous random variable whose range is the nonnegative real numbers and whose PDF is

f(x) = 200/(10 + x)³ when x ≥ 0 and 0 otherwise.

The left part of Figure 5.1 is a plot of the density function of X and the right part is a plot of its cumulative distribution function:

F(x) = 1 − 100/(10 + x)² when x ≥ 0 and 0 otherwise.

Further, for this random variable, the probability that X is greater than 8 is

P(X > 8) = ∫_8^∞ f(x) dx = 1 − F(8) = 100/18² ≈ 0.3086.
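Because the CDF is available in closed form here, such probabilities reduce to a function evaluation. A minimal sketch (Python; the name F is illustrative):

def F(x):
    # CDF of Example 5.1
    return 1.0 - 100.0 / (10.0 + x)**2 if x >= 0 else 0.0

print(round(1.0 - F(8.0), 4))   # 0.3086 = P(X > 8)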

5.1.2 Quantiles and Percentiles

Assume that 0 < p < 1. The pth quantile (or 100pth percentile) of the X distribution (when it exists) is the point, xp, satisfying the equation

P(X ≤ xp) = p.


To find xp, solve the equation F (x) = p for x.

Important special cases of quantiles are as follows:

1. The median of the X distribution: the 50th percentile.

2. The quartiles of the X distribution: the 25th, 50th, and 75th percentiles.

3. The deciles of the X distribution: the 10th, 20th, . . ., 90th percentiles.

Interquartile range. The interquartile range (IQR) is the difference between the 75th and 25th percentiles: IQR = x0.75 − x0.25.

Example 5.2 (Distribution on Nonnegative Reals, continued) Continuing with the example above, a general formula for the pth quantile is

p = 1 − 100/(10 + xp)² ⇒ xp = 10/√(1 − p) − 10.

In particular, the quartiles are

x0.25 = 20/√3 − 10 ≈ 1.547, x0.50 = 10√2 − 10 ≈ 4.142 and x0.75 = 10.

The IQR of this distribution is IQR = 20 − 20/√3 ≈ 8.453.

Measures of center and spread. The median is a measure of the center (or location) of a distribution. The IQR is a measure of the scale (or spread) of a distribution.

Additional measures of center and spread are the mean and standard deviation, respectively. Definitions for continuous random variables are given in Section 5.3.3.

5.2 Families of Distributions

This section briefly describes four families of continuous probability distributions, properties of these distributions, and transformations of continuous random variables.

5.2.1 Example: Uniform Distribution

Let a and b be real numbers with a < b. The continuous random variable X is said to be a uniform random variable, or to have a uniform distribution, on the interval [a, b] when its PDF is as follows:

f(x) = 1/(b − a) when a ≤ x ≤ b and 0 otherwise.

Note that the open interval (a, b), or one of the half-closed intervals [a, b) or (a, b], can be used instead of [a, b] as the range of a uniform random variable.


Figure 5.2: PDF (left) and CDF (right) for the uniform random variable on [10, 50].

Uniform distributions have constant density over an interval. That density is the reciprocal of the length of the interval. If X is a uniform random variable on the interval [a, b], and [c, d] ⊆ [a, b] is a subinterval, then

P(c ≤ X ≤ d) = ∫_c^d 1/(b − a) dx = (d − c)/(b − a) = Length of [c, d] / Length of [a, b].

Also, P(c < X < d) = P(c < X ≤ d) = P(c ≤ X < d) = (d − c)/(b − a).

If 0 < p < 1, then the pth quantile is xp = a + p(b − a).

Example 5.3 (a = 10, b = 50) Let X be a uniform random variable on the interval [10, 50]. X has median 30 and IQR 20. Further,

P(15 ≤ X ≤ 42) = 27/40 = 0.675.

The PDF and CDF of X are shown in Figure 5.2.

Computer-generated random numbers. Note that computer commands that return “random” numbers in the interval [0, 1] are simulating random numbers from the uniform distribution on the interval [0, 1].

5.2.2 Example: Exponential Distribution

Let λ be a positive real number. The continuous random variable X is said to be an exponential random variable, or to have an exponential distribution, with parameter λ when its PDF is as follows:

f(x) = λe^{−λx} when x ≥ 0 and 0 otherwise.

Note that the interval x > 0 can be used instead of the interval x ≥ 0 as the range of an exponential random variable.


Figure 5.3: PDF (left) and CDF (right) for the exponential random variable with λ = 4.

Exponential distributions are often used to represent the time that elapses before the occurrence of an event. For example, the time that a machine component will operate before breaking down.

If X is an exponential random variable with parameter λ and [a, b] ⊆ [0, ∞), then

P(a ≤ X ≤ b) = ∫_a^b λe^{−λx} dx = [−e^{−λx}]_a^b = e^{−λa} − e^{−λb}.

If 0 < p < 1, then the pth quantile is xp = − ln(1 − p)/λ.

Example 5.4 (λ = 4) Let X be an exponential random variable with parameter 4. Using properties of logarithms, X has median ln(2)/4 ≈ 0.173 and IQR ln(3)/4 ≈ 0.275. Further,

P(1/4 ≤ X ≤ 3/4) = e^{−1} − e^{−3} ≈ 0.318.

The PDF and CDF of X are shown in Figure 5.3.
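These quantities follow directly from the CDF F(x) = 1 − e^{−λx} and the quantile formula above. A minimal sketch (Python, standard library only; names illustrative):

import math

lam = 4.0
quantile = lambda p: -math.log(1.0 - p) / lam
cdf = lambda x: 1.0 - math.exp(-lam * x)

print(round(quantile(0.5), 3))                      # median, about 0.173
print(round(quantile(0.75) - quantile(0.25), 3))    # IQR, about 0.275
print(round(cdf(0.75) - cdf(0.25), 3))              # P(1/4 <= X <= 3/4), about 0.318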

The exponential distribution is discussed further in Section 6.9.1.

5.2.3 Euler Gamma Function

Let r be a positive real number. The Euler gamma function is defined as follows:

Γ(r) = ∫_0^∞ x^{r−1} e^{−x} dx.

If r is a positive integer, then Γ(r) = (r − 1)!. Thus, the gamma function is said to interpolate the factorials.

Remark. The property Γ(r) = (r − 1)! for positive integers can be proven using induction. To start the induction, you need to demonstrate that Γ(1) = 1. To prove the induction step, you need to use integration-by-parts to demonstrate that Γ(r + 1) = r × Γ(r).


Figure 5.4: PDF (left) and CDF (right) for the gamma random variable with shape parameter α = 2 and scale parameter β = 5.

5.2.4 Example: Gamma Distribution

Let α and β be positive real numbers. The continuous random variable X is said to be a gamma random variable, or to have a gamma distribution, with parameters α and β when its PDF is as follows:

f(x) = (1/(β^α Γ(α))) x^{α−1} e^{−x/β} when x > 0 and 0 otherwise.

Note that if α ≥ 1, then the interval x ≥ 0 can be used instead of the interval x > 0 as the range of a gamma random variable.

The parameter α is called a shape parameter; gamma distributions with the same value of α have the same shape. The parameter β is called a scale parameter; for fixed α, as β changes, the scales on the vertical and horizontal axes change, but the shape remains the same.

Example 5.5 (α = 2, β = 5) Let X be a gamma random variable with shape parameter 2 and scale parameter 5. Using Newton's method, the median of X is approximately 8.39 and the IQR is approximately 8.65. Further,

P(8 ≤ X ≤ 20) = ∫_8^20 (1/25) x e^{−x/5} dx = [−e^{−x/5} − (1/5) x e^{−x/5}]_8^20 ≈ 0.433.

The PDF and CDF of X are shown in Figure 5.4.

Remarks. The fact that f(x) is a valid PDF can be demonstrated using the method of substitution and the definition of the Euler gamma function. If the shape parameter α = 1, then the gamma distribution is the same as the exponential distribution with λ = 1/β.

The gamma distribution is discussed further in Section 6.9.2.


5.2.5 Example: Normal Distribution

Let µ be a real number and σ be a positive real number. The continuous random variable X is said to be a normal random variable, or to have a normal distribution, with parameters µ and σ when its PDF is as follows:

f(x) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²)) for all real numbers x

where exp() is the exponential function. The normal distribution is also called the Gaussian distribution, in honor of the mathematician Carl Friedrich Gauss.

The graph of the PDF of a normal random variable is the famous bell-shaped curve. The parameter µ is called the mean (or center) of the normal distribution, since the graph of the PDF is symmetric around x = µ. The median of the normal distribution is µ. The parameter σ is called the standard deviation (or spread) of the normal distribution. The interquartile range of the normal distribution is (approximately) 1.35σ.

Normal distributions have many applications. For example, normal random variables can be used to model measurements of manufactured items made under strict controls, and to model physical measurements (e.g. height, weight, blood values) in homogeneous populations.

Note that the PDF of the normal distribution cannot be integrated in closed form. Most computer programs provide functions to compute probabilities and quantiles of the normal distribution.

Standard normal distribution. The continuous random variable Z is said to be a standard normal random variable, or to have a standard normal distribution, when Z is a normal random variable with µ = 0 and σ = 1. The PDF and CDF of Z have special symbols:

φ(z) = (1/√(2π)) e^{−z²/2} and Φ(z) = ∫_{−∞}^z φ(t) dt where z is a real number.

The notation zp is used for the pth quantile of the Z distribution.

Working with tables. Table 5.1 gives values of Φ(z) for z ≥ 0, where

z = Row Value + Column Value.

For z < 0, use Φ(z) = 1 − Φ(−z). For example,

1. Φ(1.25) = 0.894 (row 1.2, column 0.05).

2. Φ(−0.70) = 1 − Φ(0.70) = 1 − 0.758 = 0.242 (row 0.7, column 0.00).

3. z0.25 ≈ −0.67 and z0.75 ≈ 0.67 since Φ(0.67) = 0.749 (row 0.6, column 0.07).

The graph shown in the table illustrates Φ(z) as an area.


Table 5.1: Cumulative probabilities of the standard normal random variable

z     0.00  0.01  0.02  0.03  0.04  0.05  0.06  0.07  0.08  0.09
0.0   0.500 0.504 0.508 0.512 0.516 0.520 0.524 0.528 0.532 0.536
0.1   0.540 0.544 0.548 0.552 0.556 0.560 0.564 0.567 0.571 0.575
0.2   0.579 0.583 0.587 0.591 0.595 0.599 0.603 0.606 0.610 0.614
0.3   0.618 0.622 0.626 0.629 0.633 0.637 0.641 0.644 0.648 0.652
0.4   0.655 0.659 0.663 0.666 0.670 0.674 0.677 0.681 0.684 0.688
0.5   0.691 0.695 0.698 0.702 0.705 0.709 0.712 0.716 0.719 0.722
0.6   0.726 0.729 0.732 0.736 0.739 0.742 0.745 0.749 0.752 0.755
0.7   0.758 0.761 0.764 0.767 0.770 0.773 0.776 0.779 0.782 0.785
0.8   0.788 0.791 0.794 0.797 0.800 0.802 0.805 0.808 0.811 0.813
0.9   0.816 0.819 0.821 0.824 0.826 0.829 0.831 0.834 0.836 0.839
1.0   0.841 0.844 0.846 0.848 0.851 0.853 0.855 0.858 0.860 0.862
1.1   0.864 0.867 0.869 0.871 0.873 0.875 0.877 0.879 0.881 0.883
1.2   0.885 0.887 0.889 0.891 0.893 0.894 0.896 0.898 0.900 0.901
1.3   0.903 0.905 0.907 0.908 0.910 0.911 0.913 0.915 0.916 0.918
1.4   0.919 0.921 0.922 0.924 0.925 0.926 0.928 0.929 0.931 0.932
1.5   0.933 0.934 0.936 0.937 0.938 0.939 0.941 0.942 0.943 0.944
1.6   0.945 0.946 0.947 0.948 0.949 0.951 0.952 0.953 0.954 0.954
1.7   0.955 0.956 0.957 0.958 0.959 0.960 0.961 0.962 0.962 0.963
1.8   0.964 0.965 0.966 0.966 0.967 0.968 0.969 0.969 0.970 0.971
1.9   0.971 0.972 0.973 0.973 0.974 0.974 0.975 0.976 0.976 0.977
2.0   0.977 0.978 0.978 0.979 0.979 0.980 0.980 0.981 0.981 0.982
2.1   0.982 0.983 0.983 0.983 0.984 0.984 0.985 0.985 0.985 0.986
2.2   0.986 0.986 0.987 0.987 0.987 0.988 0.988 0.988 0.989 0.989
2.3   0.989 0.990 0.990 0.990 0.990 0.991 0.991 0.991 0.991 0.992
2.4   0.992 0.992 0.992 0.992 0.993 0.993 0.993 0.993 0.993 0.994
2.5   0.994 0.994 0.994 0.994 0.994 0.995 0.995 0.995 0.995 0.995
2.6   0.995 0.995 0.996 0.996 0.996 0.996 0.996 0.996 0.996 0.996
2.7   0.997 0.997 0.997 0.997 0.997 0.997 0.997 0.997 0.997 0.997
2.8   0.997 0.998 0.998 0.998 0.998 0.998 0.998 0.998 0.998 0.998
2.9   0.998 0.998 0.998 0.998 0.998 0.998 0.998 0.999 0.999 0.999
3.0   0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999
3.1   0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999
3.2   0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999


Figure 5.5: Probability density functions of a continuous random variable with values on 0 < x < 2 (left plot) and of its reciprocal with values on y > 0.5 (right plot).

Nonstandard normal distributions. If X is a normal random variable with mean µ and standard deviation σ, then

P(X ≤ x) = Φ((x − µ)/σ)

where Φ() is the CDF of the standard normal random variable, and

xp = µ + zp σ

where zp is the pth quantile of the standard normal distribution.

Example 5.6 (µ = 100, σ = 20) Let X be a normal random variable with mean 100 and standard deviation 20. The median of X is 100 and the IQR is 20(z0.75 − z0.25) ≈ 26.8. Further,

P(60 ≤ X ≤ 110) = Φ((110 − 100)/20) − Φ((60 − 100)/20) ≈ 0.668.
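In software, Φ can be built from the error function. A minimal sketch reproducing this example (Python, standard library only; the name Phi is illustrative):

import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 100.0, 20.0
prob = Phi((110 - mu) / sigma) - Phi((60 - mu) / sigma)
print(round(prob, 3))   # 0.669; the table-based value above is 0.668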

5.2.6 Transforming Continuous Random Variables

If X and Y = g(X) are continuous random variables, then the CDF of X can be used to determine the CDF and PDF of Y.

Example 5.7 (Reciprocal Transformation) Let X be a continuous random variable whose range is the open interval (0, 2) and whose PDF is

fX(x) = x/2 when 0 < x < 2 and 0 otherwise,

and let Y equal the reciprocal of X, Y = 1/X. Then for y > 1/2,

P(Y ≤ y) = P(1/X ≤ y) = P(X ≥ 1/y) = ∫_{1/y}^2 (x/2) dx = 1 − 1/(4y²)


and (d/dy) P(Y ≤ y) = 1/(2y³). Thus, the PDF of Y is

fY(y) = 1/(2y³) when y > 1/2 and 0 otherwise

and the CDF of Y is

FY(y) = 1 − 1/(4y²) when y > 1/2 and 0 otherwise.

The left part of Figure 5.5 is a plot of the density function of X, and the right part is a plot of the density function of Y.

Example 5.8 (Square Transformation) Let Z be the standard normal random variable and W be the square of Z, W = Z². Then for w > 0,

P(W ≤ w) = P(Z² ≤ w) = P(−√w ≤ Z ≤ √w) = Φ(√w) − Φ(−√w).

Since the graph of the PDF of Z is symmetric around z = 0, Φ(−z) = 1 − Φ(z) for every z. Thus, the CDF of W is

FW(w) = 2Φ(√w) − 1 when w > 0 and 0 otherwise

and the PDF of W is

fW(w) = (d/dw) FW(w) = (1/√(2πw)) e^{−w/2} when w > 0 and 0 otherwise.

(Note that W = Z² is said to have a chi-square distribution with one degree of freedom.)

Monotone transformations. If the differentiable transformation g is strictly monotone (either always increasing or always decreasing) on the range of X, Y = g(X) and y is in the range of Y, then the PDF of Y has the following form:

fY(y) = fX(x) / |dy/dx| where y = g(x) (and x = g⁻¹(y)).

Example 5.9 (Reciprocal Transformation, continued) In the reciprocal transformation example above, y = g(x) = 1/x, x = g⁻¹(y) = 1/y (the inverse of the reciprocal function is the reciprocal function), and

fY(y) = fX(g⁻¹(y)) / |−1/g⁻¹(y)²| = (1/(2y)) / |−y²| = 1/(2y³) when y > 0.5.

Linear transformations. Linear transformations with nonzero slope are examples of differentiable monotone transformations. In particular, if Y = aX + b where a ≠ 0 and y is in the range of Y, then

fY(y) = (1/|a|) fX((y − b)/a).


Example 5.10 (Normal Distribution) If X is a normal random variable with mean µ and standard deviation σ and Z is the standard normal random variable with PDF φ(z), then X = σZ + µ with PDF

fX(x) = (1/σ) φ((x − µ)/σ) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²)) for all x.

This formula for the PDF agrees with the one given earlier.

5.3 Mathematical Expectation

Mathematical expectation generalizes the idea of a weighted average, where probability distributions are used as the weights. (See Section 2.3 also.)

5.3.1 Definitions for Continuous Random Variables

Let X be a continuous random variable with range R and PDF f(x). The mean of X (or expected value of X or expectation of X) is defined as follows:

E(X) = ∫_{x∈R} x f(x) dx

provided that ∫_{x∈R} |x| f(x) dx < ∞ (i.e. the integral converges absolutely).

Similarly, if g(X) is a real-valued function of X, then the mean of g(X) (or expected value of g(X) or expectation of g(X)) is

E(g(X)) = ∫_{x∈R} g(x) f(x) dx

provided that ∫_{x∈R} |g(x)| f(x) dx < ∞.

Note that the absolute convergence of an integral is not guaranteed. In cases where an integral with absolute values diverges, we say that the expectation is indeterminate.

Example 5.11 (Triangular Distribution) Let X be a continuous random variable whose range is the interval [1, 4] and whose PDF is as follows

f(x) = (2/9)(x − 1) when 1 ≤ x ≤ 4 and 0 otherwise,

and let g(X) = X² be the square of X. Then

E(X) = ∫_1^4 x f(x) dx = ∫_1^4 x (2/9)(x − 1) dx = 3

and

E(g(X)) = ∫_1^4 x² f(x) dx = ∫_1^4 x² (2/9)(x − 1) dx = 19/2.

Note that E(g(X)) ≠ g(E(X)).
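Both expectations can be checked numerically with the midpoint rule of Section 4.7.6. A minimal sketch (Python, standard library only; names illustrative):

def midpoint_sum(h, a, b, n=10000):
    dx = (b - a) / n
    return sum(h(a + (i - 0.5) * dx) for i in range(1, n + 1)) * dx

f = lambda x: (2.0 / 9.0) * (x - 1.0)      # PDF on [1, 4]
E_X = midpoint_sum(lambda x: x * f(x), 1.0, 4.0)
E_X2 = midpoint_sum(lambda x: x**2 * f(x), 1.0, 4.0)

print(round(E_X, 4), round(E_X2, 4))   # 3.0 and 9.5; note 9.5 != 3.0**2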


Table 5.2: Means and variances for four families of continuous distributions.

Distribution       E(X)        Var(X)

Uniform a, b       (a + b)/2   (b − a)²/12
Exponential λ      1/λ         1/λ²
Gamma α, β         αβ          αβ²
Normal µ, σ        µ           σ²

Example 5.12 (Indeterminate Mean) Let X be a continuous random variable whose range is x ≥ 1 and whose PDF is as follows:

f(x) = 1/x² when x ≥ 1 and 0 otherwise.

Since ∫_1^∞ x f(x) dx does not converge, the expectation of X is indeterminate.

5.3.2 Properties of Expectation

The properties of expectation stated in Section 2.3.2 for discrete random variables are also true for continuous random variables.

5.3.3 Mean, Variance and Standard Deviation

Let X be a random variable and let µ = E(X) be its mean. The variance of X, Var(X), is defined as follows:

Var(X) = E((X − µ)²).

The notation σ² = Var(X) is used to denote the variance. The standard deviation of X, σ = SD(X), is the positive square root of the variance.

These summary measures (mean, variance, standard deviation) have the same properties as those stated in the discrete case in Section 2.3.3.

Table 5.2 gives general formulas for the mean and variance of the four families of continuous probability distributions discussed earlier.

5.3.4 Chebyshev’s Inequality

Chebyshev's inequality, stated in Section 2.3.4 for discrete random variables, remains true in the continuous case. The proof (with integration replacing summation) is also the same.


6 Infinite Sequences and Series

An infinite sequence (or sequence) is a real-valued function whose domain is the positive integers. We write a sequence as a list of numbers in a definite order

a1, a2, a3, . . . , an, . . . ,

where a1 is the first term, a2 is the second term, etc. The notations

a1, a2, a3, . . . or {an} or {an}_{n=1}^∞ are used to denote sequences.

In addition, sequences may start with n = 0 (or with some other integer).

If {an}_{n=1}^∞ is a sequence, then the formal sum

Σ_{n=1}^∞ an = a1 + a2 + a3 + · · ·

is called an infinite series (or series). The numbers a1, a2, . . . are the terms of the series.

6.1 Limits and Convergence of Infinite Sequences

Let {an} be an infinite sequence. We say that {an} has limit L, and write

lim_{n→∞} an = L or an → L as n → ∞

if for every positive real number ε there exists an integer M satisfying

|an − L| < ε when n > M.

If the sequence {an} has limit L, then the sequence converges (or is convergent). Otherwise, the sequence diverges (or is divergent).

Example 6.1 (Convergent and Divergent Sequences) Since

lim_{n→∞} (n² + 1)/(5n² + 7) = lim_{n→∞} (1 + 1/n²)/(5 + 7/n²) = 1/5,

the sequence {(n² + 1)/(5n² + 7)} converges to 1/5. Similarly, the sequence {(−0.5)^n} converges to 0.

By contrast, the sequence {ln(n)} diverges. (In fact, the values grow large without bound.)

6.1.1 Convergence Properties

Assume that {an} converges to L and {bn} converges to M. Then

1. Constant multiples: {can} converges to cL, where c is a constant.


2. Sums and differences: {an ± bn} converges to L ± M.

3. Products: {anbn} converges to LM.

4. Quotients: If bn ≠ 0 for all n and M ≠ 0, then {an/bn} converges to L/M.

5. Convergence to zero: {an} converges to 0 if and only if {|an|} converges to 0.

6. Squeezing principle: If an ≤ bn ≤ cn for all n ≥ n0, where n0 is a constant, and

lim_{n→∞} an = lim_{n→∞} cn = L,

then {bn} converges to L as well.

Example 6.2 (Squeezing Principle) Consider the sequence {sin²(n)/n}. Since

0 ≤ sin²(n)/n ≤ 1/n and lim_{n→∞} 1/n = 0,

the sequence {sin²(n)/n} converges to 0 as well.

6.1.2 Geometric Sequences

Consider the sequence {r^n} where r is a fixed constant (called the ratio of the geometric sequence). Then

{r^n} converges to 0 when |r| < 1 and converges to 1 when r = 1.

In all other cases, the sequence diverges.

6.1.3 Bounded Sequences

The sequence {an} is said to be bounded above if there is a number M satisfying

an ≤ M for all n ≥ 1.

Similarly, {an} is said to be bounded below if there is a number m satisfying

an ≥ m for all n ≥ 1.

The sequence {an} is said to be bounded if it is bounded above and below. Otherwise, {an} is said to be unbounded.

Theorem 6.3 (Bounded Sequence Theorem) Let {an} be a sequence.

1. If {an} converges, then it is bounded.

2. If {an} is unbounded, then it diverges.

Every convergent sequence is bounded, but not every bounded sequence is convergent. For example, the sequence {(−1)^n} is bounded (since −1 ≤ an ≤ 1 for all n) but not convergent.


6.1.4 Monotone Sequences

The sequence {an} is said to be increasing if

a1 < a2 < · · · < an < · · · .

Similarly, {an} is said to be decreasing if

a1 > a2 > · · · > an > · · · .

{an} is said to be monotone if it is either increasing or decreasing.

Theorem 6.4 (Monotone Convergence Theorem) If {an} is a bounded and monotone sequence, then {an} converges.

Example 6.5 (Convergent Monotone Sequences) Since 0 ≤ 1 − 1/n ≤ 1 for all n ≥ 1 and the terms are increasing, the sequence {1 − 1/n}_{n=1}^∞ converges. In fact, the limit is 1.

Similarly, since 0 ≤ sin(1/n) ≤ sin(1) for all n ≥ 1 and the terms are decreasing, the sequence {sin(1/n)}_{n=1}^∞ converges. In fact, the limit is 0.

Note. The proof of the Monotone Convergence Theorem uses the fact that sets of real numbers that are bounded above have a least upper bound (or supremum), and that sets of real numbers that are bounded below have a greatest lower bound (or infimum).

6.2 Partial Sums and Convergence of Infinite Series

Let Σ_{n=1}^∞ an be an infinite series. The sequence with terms

sn = Σ_{i=1}^n ai, n = 1, 2, 3, . . .

is known as the sequence of partial sums of the series.

If the sequence of partial sums {sn} converges to L, then the series converges to L:

Σ_{n=1}^∞ an = lim_{n→∞} sn = L

and L is said to be the sum of the series. If the sequence of partial sums {sn} diverges, then the series is said to diverge.

Example 6.6 (Convergent Series) The series Σ_{n=1}^∞ 1/(n(n + 1)) converges to 1.

To see this, note that 1/(n(n + 1)) = 1/n − 1/(n + 1) for n ≥ 1. Thus,

sn = (1 − 1/2) + (1/2 − 1/3) + · · · + (1/n − 1/(n + 1)) = 1 − 1/(n + 1)

(all middle terms cancel) and sn → 1 as n → ∞.


6.2.1 Convergence Properties

Assume that Σ_{n=1}^∞ an = L and Σ_{n=1}^∞ bn = M. Then

1. Constant multiples: Σ_{n=1}^∞ can = c Σ_{n=1}^∞ an = cL, where c is a constant.

2. Sums and differences: Σ_{n=1}^∞ (an ± bn) = Σ_{n=1}^∞ an ± Σ_{n=1}^∞ bn = L ± M.

3. Convergence to zero: If the series Σ_{n=1}^∞ an converges, then {an} converges to 0.

Note that, if the sequence of terms of a series does not have limit zero, then the series must diverge. For example,

Σ_{n=1}^∞ (n² + 1)/(5n² + 7) diverges since lim_{n→∞} (n² + 1)/(5n² + 7) = 1/5 ≠ 0.

(The sequence of partial sums is unbounded, since sn+1 ≈ sn + 1/5 when n is large.)

6.2.2 Geometric Series

Consider the series Σ_{n=1}^∞ ar^{n−1} where a and r are fixed constants. Then the nth partial sum of the geometric series is

sn = a + ar + ar² + · · · + ar^{n−1} = a(1 − r^n)/(1 − r) when r ≠ 1

(and sn = na when r = 1). Further, the series converges when |r| < 1 and its sum is

Σ_{n=1}^∞ ar^{n−1} = a/(1 − r) (|r| < 1).

If |r| ≥ 1, the series diverges.

Example 6.7 (a = −10/27, r = −2/3) The series

Σ_{n=1}^∞ 5(−2)^n/3^{n+2} converges to −2/9

since

Σ_{n=1}^∞ 5(−2)^n/3^{n+2} = −10/27 + 20/81 − 40/243 + 80/729 ∓ · · ·
= (−10/27)(1 − 2/3 + 4/9 − 8/27 ± · · ·)
= (−10/27)/(1 − (−2/3))
= −2/9.
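Partial sums converge quickly here because |r| = 2/3 < 1. A minimal sketch (Python, standard library only):

partial = 0.0
for n in range(1, 41):
    partial += 5 * (-2)**n / 3**(n + 2)   # nth term of the series

print(round(partial, 6))   # approaches -2/9 = -0.222222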


6.3 Convergence Tests for Positive Series

Let Σ_{n=1}^∞ an and Σ_{n=1}^∞ bn be series with positive terms. Then

1. Comparison test: Assume that an ≤ bn for all n ≥ n0. Then

(a) If Σ_{n=1}^∞ bn converges, then Σ_{n=1}^∞ an converges.

(b) If Σ_{n=1}^∞ an diverges, then Σ_{n=1}^∞ bn diverges.

2. Integral test: If f(x) is a decreasing function with f(n) = an, then

Σ_{n=1}^∞ an and ∫_1^∞ f(x) dx both converge or both diverge.

3. Ratio test: Assume that (an+1/an) → L as n → ∞.

(a) If 0 ≤ L < 1, then Σ_{n=1}^∞ an converges.

(b) If L > 1, then Σ_{n=1}^∞ an diverges.

4. Root test: Assume that (an)^{1/n} → L as n → ∞.

(a) If 0 ≤ L < 1, then Σ_{n=1}^∞ an converges.

(b) If L > 1, then Σ_{n=1}^∞ an diverges.

5. Limit comparison test: If an/bn converges to L > 0, then

Σ_{n=1}^∞ an and Σ_{n=1}^∞ bn both converge or both diverge.

The comparison test follows from the Monotone Convergence Theorem since the sequences of partial sums are monotone increasing: if the b-series converges, then its sum is an upper bound for the sequence of partial sums of the a-series; if the a-series diverges, then the sequence of partial sums of the b-series must be unbounded.

The ratio and root tests apply to series that are “sufficiently like” geometric series. In each case, if L = 1, then no conclusion can be drawn. The limit comparison test applies to series that are approximately constant multiples of one another.

Example 6.8 (p Series) The integral test can be used to demonstrate that

Σ_{n=1}^∞ 1/n^p converges if and only if p > 1.

To see this, let f(x) = 1/x^p = x^{−p}. If p > 1, then

∫_1^∞ f(x) dx = [x^{−p+1}/(−p + 1)]_1^{x→∞} = lim_{x→∞} (x^{−p+1}/(−p + 1)) − 1/(−p + 1) = 0 − 1/(−p + 1) = 1/(p − 1).


(The limit is 0 since x has a negative exponent.) Since the improper integral converges, sodoes the series.

If p < 1, then the improper integral diverges since x has a positive exponent in the limitabove. If p = 1, then the antiderivative of f(x) is ln(x) and again the improper integraldiverges.

Example 6.9 (Convergent Exponential Series) An important application of the ratio test is to the series
\[
\sum_{n=0}^\infty \frac{x^n}{n!} \quad \text{where } x \text{ is a positive real number.}
\]
Since the ratio
\[
\frac{a_{n+1}}{a_n} = \frac{x^{n+1}/(n+1)!}{x^n/n!} = \frac{x}{n+1} \to 0 \ \text{as}\ n \to \infty,
\]
the series above converges. In addition, since the series converges, the sequence of terms must approach 0. That is, $x^n/n! \to 0$ as $n \to \infty$.
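A short numerical sketch (Python; added here for illustration) shows how quickly the terms $x^n/n!$ decay once $n$ exceeds $x$, in line with the ratio computed above.

```python
from math import factorial

x = 10.0
for n in [5, 10, 20, 40, 60]:
    term = x**n / factorial(n)     # n-th term of the exponential series
    ratio = x / (n + 1)            # a_{n+1}/a_n from the ratio test
    print(f"n={n:2d}  term={term:.3e}  ratio={ratio:.3f}")
# The terms peak near n = x and then decay to 0 faster than any
# geometric sequence.
```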

Example 6.10 (Comparison Test) The comparison test can be used to demonstrate that
\[
\sum_{n=0}^\infty \frac{10\sin^2(n)}{n!} \ \text{converges and that}\ \sum_{n=1}^\infty \frac{4}{\sqrt{5n-1}} \ \text{diverges.}
\]
In the first case, let $a_n = \frac{10\sin^2(n)}{n!}$ and $b_n = \frac{10}{n!}$. The ratio test can be used to demonstrate that the b-series converges (as in the last example). Thus, the a-series converges.

In the second case, let $a_n = \frac{4}{\sqrt{5n}}$ and $b_n = \frac{4}{\sqrt{5n-1}}$. The integral test can be used to demonstrate that the a-series diverges (as in the example above). Thus, the b-series diverges.

Example 6.11 (Limit Comparison Test) The limit comparison test can be used to demonstrate that
\[
\sum_{n=1}^\infty \frac{4n+3}{n^3+7n-4} \ \text{converges and that}\ \sum_{n=1}^\infty \frac{n+5}{3n^2+2n+2} \ \text{diverges.}
\]
In the first case, let $a_n = \frac{4n+3}{n^3+7n-4}$ and $b_n = \frac{1}{n^2}$. Since
\[
\lim_{n\to\infty} \frac{a_n}{b_n} = \lim_{n\to\infty} \frac{4n^3+3n^2}{n^3+7n-4} = \lim_{n\to\infty} \frac{4+3/n}{1+7/n^2-4/n^3} = 4
\]
and the b-series converges (by a previous example), the a-series converges.

In the second case, let $a_n = \frac{n+5}{3n^2+2n+2}$ and $b_n = \frac{1}{n}$. Since
\[
\lim_{n\to\infty} \frac{a_n}{b_n} = \lim_{n\to\infty} \frac{n^2+5n}{3n^2+2n+2} = \lim_{n\to\infty} \frac{1+5/n}{3+2/n+2/n^2} = \frac{1}{3}
\]
and the b-series diverges (by the same example), the a-series diverges.


Example 6.12 (Root Test) The root test can be used to demonstrate that
\[
\sum_{n=1}^\infty \left(\frac{n+5}{2n}\right)^n \ \text{converges.}
\]
Specifically, since
\[
\sqrt[n]{a_n} = \frac{n+5}{2n} = \frac{1+5/n}{2} \to \frac{1}{2} \ \text{as}\ n \to \infty
\]
and the limit is less than 1, the series converges. (For large $n$, the terms of the series are like those of a convergent geometric series with $r = 1/2$.)

Remarks. The convergence tests above can be used to determine whether a positive series converges, but they tell us nothing about the sum of the series. In most cases, finding an exact sum is challenging. In Section 6.6, we demonstrate that
\[
\sum_{n=0}^\infty \frac{x^n}{n!} = e^x \quad \text{for all } x.
\]

Harmonic series. The divergent series $\sum_{n=1}^\infty \frac{1}{n}$ is called the harmonic series. Although the terms of the harmonic series approach 0, the series diverges by the integral test.

6.4 Alternating Series and Error Analysis

An alternating series is a series whose terms alternate in sign. The usual notation is
\[
\sum_{n=1}^\infty (-1)^{n+1} a_n = a_1 - a_2 + a_3 - a_4 \pm \cdots,
\]
where $a_n > 0$ for each $n$, for series starting with positive terms. For example,
\[
\sum_{n=1}^\infty \frac{(-1)^{n+1}}{2^n} = \frac{1}{2} - \frac{1}{4} + \frac{1}{8} - \frac{1}{16} \pm \cdots
\]
is an alternating series starting with a positive term.

A simple convergence test for alternating series is based on the following theorem.

Theorem 6.13 (Alternating Series Theorem) Assume that $a_{n+1} \le a_n$ for each $n$ and that $a_n \to 0$ as $n \to \infty$. Then

1. $\sum_{n=1}^\infty (-1)^{n+1} a_n$ and $\sum_{n=1}^\infty (-1)^n a_n$ are convergent series.

2. If $s_n$ is the $n$th partial sum and $L$ is the sum of one of these series, then
\[
|s_n - L| \le a_{n+1} \quad \text{for each } n.
\]


The partial sums of an alternating series are alternately larger and smaller than the limit, $L$. When the first term is positive, the even partial sums form an increasing sequence and the odd partial sums form a decreasing sequence:
\[
s_2 \le s_4 \le s_6 \le s_8 \le s_{10} \le \cdots \le L \le \cdots \le s_9 \le s_7 \le s_5 \le s_3 \le s_1.
\]
When the first term is negative, the even partial sums are decreasing and the odd partial sums are increasing. Each subsequence of partial sums must converge, since each is bounded and monotone. Further, since
\[
|s_n - s_{n-1}| = a_n \to 0 \ \text{as}\ n \to \infty,
\]
the limits of the subsequences must be equal.

Example 6.14 (Alternating Harmonic Series) The series
\[
\sum_{n=1}^\infty \frac{(-1)^{n+1}}{n} = 1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} \pm \cdots
\]
converges since the sequence $1/n$ is decreasing with limit 0.

To estimate the sum of the alternating harmonic series to within 2 decimal places, we need to use 200 or more terms since
\[
|s_n - L| \le a_{n+1} = \frac{1}{n+1} < 0.005 \implies n > 199.
\]
Our estimate is $s_{200} \approx 0.690653$. (In fact, $L = \ln(2)$. See Section 6.6.)
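A minimal Python sketch (an added illustration; partial_sum is our own helper) reproduces this estimate and confirms that the actual error respects the bound $a_{n+1} = 1/(n+1)$.

```python
from math import log

def partial_sum(n):
    """n-th partial sum of the alternating harmonic series."""
    return sum((-1) ** (k + 1) / k for k in range(1, n + 1))

for n in [10, 100, 200]:
    s = partial_sum(n)
    print(n, s, abs(s - log(2)), 1 / (n + 1))   # actual error <= bound
# partial_sum(200) ≈ 0.690653, within 0.005 of ln(2) ≈ 0.693147.
```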

Example 6.15 (Alternating p Series) More generally, the series
\[
\sum_{n=1}^\infty \frac{(-1)^{n+1}}{n^p} = 1 - \frac{1}{2^p} + \frac{1}{3^p} - \frac{1}{4^p} \pm \cdots
\]
converges for all $p > 0$ since $1/n^p$ is decreasing with limit 0.

To estimate the sum when $p = 3$ to within 2 decimal places, for example, we need to use 5 or more terms since
\[
|s_n - L| \le a_{n+1} = \frac{1}{(n+1)^3} < 0.005 \implies n > 4.84804.
\]
Our estimate is $s_5 \approx 0.904412$.

6.5 Absolute and Conditional Convergence of Series

The following theorem is used to study convergence properties of general series.

Theorem 6.16 (Absolute Convergence Theorem) If the series $\sum_{n=1}^\infty |a_n|$ converges, then the series $\sum_{n=1}^\infty a_n$ converges as well.

Thus, tests for positive series can be used to determine if the series with terms $|a_n|$ converges. If that series converges, then we can conclude that the series whose terms are $a_n$ converges as well. Note, however, that if the series with terms $|a_n|$ diverges, then we cannot draw any conclusion about the series with terms $a_n$.

Let $\sum_{n=1}^\infty a_n$ be a convergent series. If the series $\sum_{n=1}^\infty |a_n|$ converges, then $\sum_{n=1}^\infty a_n$ is said to be an absolutely convergent series. Otherwise, $\sum_{n=1}^\infty a_n$ is said to be a conditionally convergent series.

Example 6.17 (Alternating p Series, continued) The series
\[
\sum_{n=1}^\infty \frac{(-1)^{n+1}}{n^p} = 1 - \frac{1}{2^p} + \frac{1}{3^p} - \frac{1}{4^p} \pm \cdots
\]
is absolutely convergent when $p > 1$ (by the integral test) and conditionally convergent when $0 < p \le 1$ (by the integral and alternating series tests).

Example 6.18 (Exponential Series, continued) The series
\[
\sum_{n=0}^\infty \frac{x^n}{n!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \cdots
\]
converges for all real numbers $x$. Specifically, if $x = 0$, the sum is 1. Otherwise, the series converges absolutely by the ratio test (see Example 6.9).

Example 6.19 (Power Series) The series
\[
\sum_{n=1}^\infty \frac{x^n}{n} = x + \frac{x^2}{2} + \frac{x^3}{3} + \frac{x^4}{4} + \cdots
\]
converges absolutely when $|x| < 1$, converges conditionally when $x = -1$, and diverges otherwise. To see this,

1. If $x = 0$, then the sum is 0. If $0 < |x| < 1$, then
\[
\frac{|a_{n+1}|}{|a_n|} = \left(\frac{n}{n+1}\right)|x| \to |x| < 1 \ \text{as}\ n \to \infty,
\]
implying that the series converges absolutely by the ratio test. Thus, the series converges absolutely when $|x| < 1$.

2. When $x = 1$, the series reduces to the harmonic series, which diverges by the integral test. When $x = -1$, the series reduces to the (negative) alternating harmonic series, which converges by the alternating series test. Thus, the series converges conditionally when $x = -1$.

3. When $|x| > 1$, the terms do not converge to 0, and the series diverges.

6.5.1 Product of Series

Let $\sum_{n=0}^\infty a_n$ and $\sum_{n=0}^\infty b_n$ be series with index starting at 0.

The series $\sum_{n=0}^\infty c_n$ with terms
\[
c_n = \sum_{i=0}^n a_i b_{n-i} = a_0 b_n + a_1 b_{n-1} + \cdots + a_{n-1} b_1 + a_n b_0 \quad \text{for } n = 0, 1, 2, \ldots
\]
is the product (or the convolution) of $\sum_{n=0}^\infty a_n$ and $\sum_{n=0}^\infty b_n$.

Theorem 6.20 (Convolution Theorem) If $\sum_{n=0}^\infty a_n = L$, $\sum_{n=0}^\infty b_n = M$, and both series are absolutely convergent, then $\sum_{n=0}^\infty c_n$ is an absolutely convergent series with sum $LM$.

Example 6.21 (Square of Geometric Series) The geometric series
\[
\sum_{n=0}^\infty x^n = 1 + x + x^2 + x^3 + \cdots
\]
converges to $1/(1-x)$ when $|x| < 1$ and diverges otherwise.

The square of this series is
\[
\left(\sum_{n=0}^\infty x^n\right)^2 = \sum_{n=0}^\infty (n+1)x^n = 1 + 2x + 3x^2 + 4x^3 + \cdots.
\]
By the convolution theorem, the square series converges to $1/(1-x)^2$ when $|x| < 1$.
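The convolution formula is easy to check numerically. The sketch below (Python; convolve is our own helper, operating on truncated coefficient lists) recovers the coefficients $n + 1$ of the squared geometric series.

```python
def convolve(a, b):
    """Cauchy-product coefficients c_n = sum_i a_i * b_(n-i)."""
    m = min(len(a), len(b))
    return [sum(a[i] * b[n - i] for i in range(n + 1)) for n in range(m)]

a = [1] * 8              # coefficients of 1 + x + x^2 + ... (truncated)
print(convolve(a, a))    # [1, 2, 3, 4, 5, 6, 7, 8]
```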

Remarks. The last example demonstrates that we can "represent" the functions $f(x) = \frac{1}{1-x}$ and $g(x) = \frac{1}{(1-x)^2}$ as sums of convergent series for all $x$ satisfying $|x| < 1$. Series representations of functions will be explored further in the next two sections.

6.6 Taylor Polynomials and Taylor Series

We have already seen how to approximate a function using a tangent line. This section considers using polynomials of higher orders to approximate functions with higher precision, and using the limiting forms of these polynomials to give exact representations of the functions on intervals.

6.6.1 Higher Order Derivatives

The $n$th derivative of the function $f$ is obtained from the function by differentiating $n$ times: $f'''$ is the derivative of $f''$, $f''''$ is the derivative of $f'''$, and so forth.

Starting with $y = f(x)$, notations for the $n$th derivative are
\[
f^{(n)}(x) = \frac{d^n y}{dx^n} = \frac{d^n}{dx^n}(f(x)) = \frac{d^n f}{dx^n} = D^n f(x) = y^{(n)}.
\]
For convenience, we let $f^{(0)}(x) = f(x)$.


Figure 6.1: Plots of $y = \cos(x)$ (left) and $y = 1/(1-x)$ (right) with Taylor approximations of orders 2, 8, 14 (left) and 1, 3, 5 (right).

6.6.2 Taylor Polynomials

The Taylor polynomial of order $n$ centered at $a$ is the polynomial
\[
p_n(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2}(x-a)^2 + \cdots + \frac{f^{(n)}(a)}{n!}(x-a)^n.
\]
Note that $p_n(a) = f(a)$ and that $p_n^{(k)}(a) = f^{(k)}(a)$ for $k = 1, \ldots, n$.

Theorem 6.22 (Taylor's Theorem, Part 1) If the $n$th derivative of $f$ is continuous on the closed interval $[a, b]$ and the $(n+1)$st derivative exists on the open interval $(a, b)$, then for each $x$ in $[a, b]$
\[
f(x) = p_n(x) + \frac{f^{(n+1)}(c)}{(n+1)!}(x-a)^{n+1} \quad \text{for some } c \text{ satisfying } a < c < x.
\]
In words, the error in using the Taylor polynomial $p_n(x)$ instead of the function $f(x)$ for values of $x$ near $a$ is exactly the remainder
\[
R_n(x) = \frac{f^{(n+1)}(c)}{(n+1)!}(x-a)^{n+1} \quad \text{for some number } c \text{ between } a \text{ and } x.
\]
Thus, Taylor's theorem generalizes results on errors in linear approximations (Section 4.6). Taylor's theorem can be proven using the Mean Value Theorem.

Note that if $|f^{(n+1)}(x)|$ is bounded in a neighborhood of $a$, then $|R_n(x)|$ is also bounded. Specifically,
\[
|f^{(n+1)}(x)| \le M \ \text{for}\ |x-a| < r \implies |R_n(x)| \le \frac{M}{(n+1)!}|x-a|^{n+1} \ \text{for}\ |x-a| < r.
\]

Example 6.23 (Cosine Function) Consider $f(x) = \cos(x)$ and let $a = 0$. The left part of Figure 6.1 shows graphs of $y = f(x)$, $y = p_2(x)$, $y = p_8(x)$, and $y = p_{14}(x)$. As $n$ increases, the polynomial approximates the function well over wider intervals centered at 0.

Further, since $|f^{(n)}(x)| \le 1$ (each derivative is either $\pm\sin(x)$ or $\pm\cos(x)$) and $a = 0$,
\[
|R_n(x)| \le \frac{|x|^{n+1}}{(n+1)!} \quad \text{for all } x.
\]
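The remainder bound can be checked numerically. The following Python sketch (added for illustration; taylor_cos is our own helper) evaluates $p_n(x)$ at $x = 2$ for the orders plotted in Figure 6.1.

```python
from math import cos, factorial

def taylor_cos(n, x):
    """Taylor polynomial of cos of order n centered at 0."""
    return sum((-1) ** k * x ** (2 * k) / factorial(2 * k)
               for k in range(n // 2 + 1))

x = 2.0
for n in [2, 8, 14]:
    error = abs(cos(x) - taylor_cos(n, x))
    bound = abs(x) ** (n + 1) / factorial(n + 1)
    print(f"n={n:2d}  error={error:.2e}  bound={bound:.2e}")
# In each case the actual error is below the bound |x|^(n+1)/(n+1)!.
```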

Example 6.24 (Geometric Function) Consider $f(x) = 1/(1-x)$ and let $a = 0$. The right part of Figure 6.1 shows graphs of $y = f(x)$, $y = p_1(x)$, $y = p_3(x)$, and $y = p_5(x)$. As $n$ increases, the polynomial approximates the function well over wider subintervals of $(-1, 1)$.

Since $f^{(n)}(x) = n!/(1-x)^{n+1}$ and $f^{(n)}(0) = n!$, $p_n(x) = 1 + x + x^2 + \cdots + x^n$.

Further, since
\[
\frac{1-x^{n+1}}{1-x} = 1 + x + x^2 + \cdots + x^n \implies \frac{1}{1-x} = 1 + x + x^2 + \cdots + x^n + \frac{x^{n+1}}{1-x}
\]
(using properties of geometric series), the remainder is $R_n(x) = x^{n+1}/(1-x)$.

6.6.3 Taylor Series

The Taylor series representation of $f(x)$ centered at $a$ is
\[
f(x) = \sum_{n=0}^\infty \frac{f^{(n)}(a)}{n!}(x-a)^n = f(a) + f'(a)(x-a) + \frac{f''(a)}{2}(x-a)^2 + \cdots
\]
when the series converges. Note that the Taylor series centered at 0 is often called the Maclaurin series.

Theorem 6.25 (Taylor's Theorem, Part 2) If $f(x) = p_n(x) + R_n(x)$ and
\[
\lim_{n\to\infty} R_n(x) = 0 \quad \text{for all } x \text{ satisfying } |x-a| < r,
\]
then $f(x)$ is equal to the sum of its Taylor series on the interval $|x-a| < r$.

Example 6.26 (Exponential Function) Consider $f(x) = e^x$ and let $a = 0$.

Since $f^{(n)}(x) = e^x$ and $f^{(n)}(0) = 1$ for all $n$,
\[
e^x = 1 + x + \frac{x^2}{2} + \cdots + \frac{x^n}{n!} + R_n(x) \quad \text{where } R_n(x) = \frac{e^c}{(n+1)!}x^{n+1}
\]
for some $c$ between 0 and $x$. Since
\[
|R_n(x)| \le \frac{C|x|^{n+1}}{(n+1)!} \quad \text{where } C = \max(e^x, 1)
\]
and $|x|^n/n! \to 0$ as $n \to \infty$ for all $x$ (Example 6.9), $R_n(x) \to 0$ as $n \to \infty$ as well. Thus, $e^x$ is equal to the sum of its Taylor series for all real numbers.


Table 6.1: Some Taylor series expansions centered at 0.

$\frac{1}{1-x} = 1 + x + x^2 + x^3 + \cdots$  for $-1 < x < 1$

$\frac{1}{(1-x)^2} = 1 + 2x + 3x^2 + 4x^3 + \cdots$  for $-1 < x < 1$

$\ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots$  for $-1 < x \le 1$

$e^x = 1 + x + \frac{x^2}{2} + \frac{x^3}{3!} + \cdots$  for all $x$

$\cos(x) = 1 - \frac{x^2}{2} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots$  for all $x$

$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots$  for all $x$

Example 6.27 (Exponential Function, continued) Using the result above, a series representation of $e^{-1}$ is
\[
e^{-1} = \sum_{n=0}^\infty \frac{(-1)^n}{n!} = 1 - 1 + \frac{1}{2} - \frac{1}{6} + \frac{1}{24} \mp \cdots.
\]
To estimate $e^{-1}$ to within 2 decimal places of accuracy, for example, we need to use 6 terms since
\[
|R_n(x)| \le \frac{1}{(n+1)!} < 0.005 \implies n > 5.
\]
Our estimate is $s_6 = 53/144 \approx 0.368056$. (Note that this result could also be obtained using the methods of Section 6.4.)
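The estimate can be reproduced exactly with rational arithmetic; a minimal Python sketch (added for illustration) follows.

```python
from fractions import Fraction
from math import exp, factorial

s = Fraction(0)
for n in range(7):                        # terms n = 0, 1, ..., 6
    s += Fraction((-1) ** n, factorial(n))
print(s, float(s))                        # 53/144 ≈ 0.368056
print(exp(-1))                            # e^(-1) ≈ 0.367879
```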

6.6.4 Radius of Convergence

The largest value of $r$ satisfying Taylor's theorem is called the radius of convergence of the series. The example above demonstrates that the radius of convergence for the exponential series centered at 0 is $\infty$.

Table 6.1 lists several commonly used series expansions. In the first three cases, the radius of convergence is 1. In the last three cases, the radius of convergence is $\infty$.

6.7 Power Series

A power series centered at $a$ is a series of the form
\[
\sum_{n=0}^\infty c_n(x-a)^n = c_0 + c_1(x-a) + c_2(x-a)^2 + c_3(x-a)^3 + \cdots,
\]
where $c_n$ is a sequence of constants (the coefficients of the series). The series converges to $c_0$ when $x = a$. Interest focuses on finding the values of $x$ for which the series converges and finding exact forms for the sums.

Theorem 6.28 (Convergence Theorem) Let $\sum_{n=0}^\infty c_n(x-a)^n$ be a power series centered at $a$. Then exactly one of the following holds:

1. The series converges for $x = a$ only.

2. The series converges for all $x$.

3. There is a number $r > 0$ such that the series converges when $|x-a| < r$ and diverges when $|x-a| > r$.

The number $r$ given in the third part of the theorem is called the radius of convergence of the power series. If $r$ is the radius of convergence, then the theorem tells us nothing about convergence at $x = a \pm r$. Thus, the interval of convergence could be any one of
\[
[a-r, a+r], \quad (a-r, a+r], \quad [a-r, a+r), \quad (a-r, a+r).
\]

Example 6.29 (Logarithm Function) Consider the Taylor series for $f(x) = \ln(x)$ centered at 1. Since
\[
\ln(1) = 0, \quad f^{(n)}(x) = (-1)^{n-1}\frac{(n-1)!}{x^n}, \quad \text{and} \quad f^{(n)}(1) = (-1)^{n-1}(n-1)!,
\]
the form of the Taylor series is
\[
\sum_{n=1}^\infty (-1)^{n-1}\frac{(x-1)^n}{n} = (x-1) - \frac{(x-1)^2}{2} + \frac{(x-1)^3}{3} - \frac{(x-1)^4}{4} \pm \cdots.
\]
The series converges absolutely when $|x-1| < 1$ by the ratio test. If $|x-1| > 1$, then the terms do not converge to 0, and the series diverges. If $x = 0$, then the series diverges since it is the negative of the harmonic series. If $x = 2$, then the series converges since it is the alternating harmonic series.

Thus, the radius of convergence is $r = 1$ and the interval of convergence is $(0, 2]$.

6.7.1 Power Series and Taylor Series

Taylor series are examples of power series. The following theorem states that convergent power series must be Taylor series within their radius of convergence.

Theorem 6.30 (Representation Theorem) Assume that the radius of convergence of the power series $\sum_{n=0}^\infty c_n(x-a)^n$ is $r > 0$. Let
\[
f(x) = \sum_{n=0}^\infty c_n(x-a)^n \quad \text{for } a-r < x < a+r.
\]
Then $f$ has derivatives of all orders on $(a-r, a+r)$ and
\[
f^{(n)}(a) = c_n\, n! \quad \text{for all } n.
\]
Thus, the power series is the Taylor series of $f$ centered at $a$.


6.7.2 Differentiation and Integration of Power Series

Power series can be differentiated and integrated term-by-term within their radius of convergence. Specifically,

Theorem 6.31 (Term-by-Term Analysis) Assume that the radius of convergence of the power series $\sum_{n=0}^\infty c_n(x-a)^n$ is $r > 0$. Let
\[
f(x) = \sum_{n=0}^\infty c_n(x-a)^n \quad \text{for } a-r < x < a+r.
\]
Then $f$ is differentiable (and continuous) on the interval $(a-r, a+r)$. Further,

1. the derivative of $f$ has the following form:
\[
f'(x) = \sum_{n=1}^\infty n c_n (x-a)^{n-1} = c_1 + 2c_2(x-a) + 3c_3(x-a)^2 + \cdots.
\]

2. the family of antiderivatives of $f$ has the following form:
\[
\int f(x)\,dx = C + \sum_{n=0}^\infty \frac{c_n}{n+1}(x-a)^{n+1} = C + c_0(x-a) + \frac{c_1}{2}(x-a)^2 + \cdots.
\]

The radius of convergence of each series is $r$.

Example 6.32 (Geometric Function) Let
\[
f(x) = \frac{1}{1-x} = \sum_{n=0}^\infty x^n = 1 + x + x^2 + x^3 + \cdots \quad \text{for } -1 < x < 1.
\]
The derivative of $f$ can be computed two ways:
\[
f'(x) = \frac{d}{dx}\left(\frac{1}{1-x}\right) = \frac{1}{(1-x)^2}
\]
and
\[
f'(x) = \frac{d}{dx}\left(\sum_{n=0}^\infty x^n\right) = \sum_{n=1}^\infty n x^{n-1} = 1 + 2x + 3x^2 + 4x^3 + \cdots.
\]
Thus,
\[
\frac{1}{(1-x)^2} = \sum_{n=1}^\infty n x^{n-1} = 1 + 2x + 3x^2 + 4x^3 + \cdots \quad \text{for } -1 < x < 1,
\]
confirming results obtained in Example 6.21.

Example 6.33 (Standard Normal Distribution) Let $Z$ be the standard normal random variable. Using the Taylor series expansion of $e^x$ centered at 0, the PDF of $Z$ can be written as follows:
\[
\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} = \frac{1}{\sqrt{2\pi}} \sum_{n=0}^\infty \frac{(-z^2/2)^n}{n!} = \frac{1}{\sqrt{2\pi}} \sum_{n=0}^\infty \frac{(-1)^n z^{2n}}{2^n\, n!} \quad \text{for all } z.
\]
$\phi(z)$ does not have an antiderivative expressible in closed form. However, by using term-by-term analysis, we can write the indefinite integral as a family of power series. Specifically,
\[
\int \phi(z)\,dz = C + \frac{1}{\sqrt{2\pi}} \sum_{n=0}^\infty \frac{(-1)^n z^{2n+1}}{(2n+1)\, 2^n\, n!}.
\]
This power series representation can be used to compute probabilities for standard normal random variables to within any degree of accuracy.
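For instance, a minimal Python sketch (added here; normal_area is our own helper) evaluates the series for $P(0 < Z \le z)$ and compares it with the standard error-function formula $\frac{1}{2}\,\mathrm{erf}(z/\sqrt{2})$:

```python
from math import sqrt, pi, factorial, erf

def normal_area(z, terms=40):
    """P(0 < Z <= z) via the term-by-term antiderivative of phi."""
    c = 1 / sqrt(2 * pi)
    return c * sum((-1) ** n * z ** (2 * n + 1)
                   / ((2 * n + 1) * 2 ** n * factorial(n))
                   for n in range(terms))

for z in [0.5, 1.0, 1.96]:
    print(z, normal_area(z), 0.5 * erf(z / sqrt(2)))
# The two columns agree; e.g. z = 1.96 gives ≈ 0.4750.
```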

6.8 Discrete Random Variables, revisited

There are many important applications of sequences and series in probability and statistics. This section focuses on discrete random variables whose ranges are countably infinite.

6.8.1 Countably Infinite Sample Spaces

A set is countably infinite when it corresponds to an infinite sequence. When a sample space $S$ is countably infinite, we need to include the possibility of infinite unions, and of infinite sums of probabilities. In particular, the third Kolmogorov axiom is extended as follows:

3′′. If $A_1, A_2, \ldots$ are pairwise disjoint, then
\[
P(\cup_{i=1}^\infty A_i) = P(A_1) + P(A_2) + P(A_3) + \cdots,
\]
where the right-hand side is understood to be the sum of a convergent infinite series.

6.8.2 Infinite Discrete Random Variables

Recall that a discrete random variable is one whose range, $R$, is either finite or countably infinite. If the range of $X$ is countably infinite, $f(x)$ is the PDF of $X$, and $g(X)$ is a real-valued function, then the expected value (or mean or expectation) of $g(X)$ is
\[
E(g(X)) = \sum_{x \in R} g(x) f(x) \quad \text{when the series converges absolutely},
\]
that is, when $\sum_{x \in R} |g(x)| f(x)$ converges. If the series with absolute values diverges, we say that the expectation is indeterminate.

Example 6.34 (Coin Tossing) You toss a fair coin until you get a tail and record the sequence of h's and t's (for heads and tails, respectively). The sample space is
\[
S = \{t,\, ht,\, hht,\, hhht,\, hhhht,\, hhhhht,\, \ldots\}.
\]
Let $X$ be the total number of tosses. Then $X$ is a discrete random variable with PDF
\[
f(x) = P(X = x) = \frac{1}{2^x} \quad \text{for } x = 1, 2, 3, \ldots \text{ and } 0 \text{ otherwise}.
\]


Figure 6.2: Probability density function of the uniform random variable on the open interval (0, 1) (left plot) and of the floor of its reciprocal (right plot). The range of the floor of the reciprocal is the positive integers. Vertical lines in the left plot are drawn at $u = 1/2, 1/3, 1/4, \ldots$.

The probability that you get a tail in 3 or fewer tosses, for example, is
\[
P(X \le 3) = P(\{t, ht, hht\}) = f(1) + f(2) + f(3) = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} = \frac{7}{8}.
\]
To find the expected value of $X$, we need to determine the sum of the following series:
\[
E(X) = \sum_{x=1}^\infty x\,\frac{1}{2^x} = 1\left(\frac{1}{2}\right) + 2\left(\frac{1}{4}\right) + 3\left(\frac{1}{8}\right) + 4\left(\frac{1}{16}\right) + \cdots.
\]
By using a simple trick, we can reduce the problem to finding the sum of a convergent geometric series. Specifically,
\[
E(X) - \frac{1}{2}E(X) = \sum_{x=1}^\infty x\,\frac{1}{2^x} - \sum_{x=1}^\infty x\,\frac{1}{2^{x+1}} = \sum_{x=1}^\infty \left(x\,\frac{1}{2^x} - (x-1)\,\frac{1}{2^x}\right) = \sum_{x=1}^\infty \frac{1}{2^x} = 1.
\]
Since $\frac{1}{2}E(X) = 1$, we know that $E(X) = 2$. (The expected number of tosses is 2.) This trick is valid since the series is absolutely convergent.
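A simulation makes the result concrete. The Python sketch below (added for illustration; tosses_until_tail is our own helper) estimates $E(X)$ and $P(X \le 3)$ by repeated sampling.

```python
import random

random.seed(1)

def tosses_until_tail():
    """Number of tosses of a fair coin until the first tail."""
    x = 1
    while random.random() < 0.5:   # heads with probability 1/2
        x += 1
    return x

sample = [tosses_until_tail() for _ in range(100_000)]
print(sum(sample) / len(sample))                        # close to E(X) = 2
print(sum(1 for x in sample if x <= 3) / len(sample))   # close to 7/8
```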

Example 6.35 (Indeterminate Mean) Let $U$ be a continuous uniform random variable on the open interval (0, 1) and let $X$ be the greatest integer less than or equal to the reciprocal of $U$ (the "floor" of the reciprocal of $U$), $X = \lfloor 1/U \rfloor$. For $x$ in the positive integers,
\[
f(x) = P(X = x) = P\left(\frac{1}{x+1} < U \le \frac{1}{x}\right) = \frac{1}{x(x+1)};
\]
$f(x) = 0$ otherwise. The expectation of $X$ is
\[
E(X) = \sum_{x=1}^\infty x\,\frac{1}{x(x+1)} = \sum_{x=1}^\infty \frac{1}{x+1}.
\]


Figure 6.3: Probability histograms for geometric (left) and Poisson (right) examples.

The integral test can be used to demonstrate that the series on the right diverges. Thus, the expectation of $X$ is indeterminate.

The left part of Figure 6.2 is a plot of the PDF of $U$ with the vertical lines $u = 1/2, 1/3, 1/4, 1/5, \ldots$ superimposed. The right part is a probability histogram of the distribution of $X = \lfloor 1/U \rfloor$. Note that the area between $u = 1/2$ and $u = 1$ on the left is $P(X = 1)$, the area between $u = 1/3$ and $u = 1/2$ is $P(X = 2)$, etc.

6.8.3 Example: Geometric Distribution

Let $X$ be the number of failures before the first success in a sequence of independent Bernoulli experiments with success probability $p$. Then $X$ is said to be a geometric random variable, or to have a geometric distribution, with parameter $p$. The PDF of $X$ is as follows:
\[
f(x) = (1-p)^x p \quad \text{when } x = 0, 1, 2, \ldots, \text{ and } 0 \text{ otherwise}.
\]
For each $x$, $f(x)$ is the probability of the sequence of $x$ failures ($F$) followed by a success ($S$). For example, $f(5) = P(FFFFFS) = (1-p)^5 p$.

The probabilities $f(0), f(1), \ldots$ form a geometric sequence whose sum is 1. Further,
\[
E(X) = \frac{1-p}{p} \quad \text{and} \quad Var(X) = E\left((X - E(X))^2\right) = \frac{1-p}{p^2}.
\]

Example 6.36 (p = 1/4) Let $X$ be a geometric random variable with parameter 1/4. Then
\[
E(X) = 3 \quad \text{and} \quad Var(X) = 12.
\]
The left part of Figure 6.3 is a probability histogram for $X$, for values of $X$ at most 20. Note that
\[
P(X > 20) = \sum_{x=21}^\infty \left(\frac{3}{4}\right)^x \left(\frac{1}{4}\right) = \left(\frac{3}{4}\right)^{21} \approx 0.00237841.
\]
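A two-line Python check (added for illustration) confirms the tail computation:

```python
p = 0.25
tail = sum((1 - p) ** x * p for x in range(21, 2000))  # truncated series
print(tail, (1 - p) ** 21)   # both ≈ 0.00237841
```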


Alternative definition. An alternative definition of the geometric random variable is as follows: $X$ is the trial number of the first success in a sequence of independent Bernoulli experiments with success probability $p$. In this case,
\[
f(x) = (1-p)^{x-1} p \quad \text{when } x = 1, 2, 3, \ldots, \text{ and } 0 \text{ otherwise}.
\]
In particular, the range is now the positive integers.

6.8.4 Example: Poisson Distribution

Let $\lambda$ be a positive real number. The random variable $X$ is said to be a Poisson random variable, or to have a Poisson distribution, with parameter $\lambda$ if its PDF is as follows:
\[
f(x) = e^{-\lambda}\,\frac{\lambda^x}{x!} \quad \text{when } x = 0, 1, 2, \ldots, \text{ and } 0 \text{ otherwise}.
\]
The series expansion for $e^x$ centered at 0 (see Table 6.1) can be used to demonstrate that the sequence $f(0), f(1), \ldots$ has sum 1. Further,
\[
E(X) = \lambda \quad \text{and} \quad Var(X) = E\left((X - E(X))^2\right) = \lambda.
\]

Example 6.37 (λ = 3) Let $X$ be the Poisson random variable with mean 3 (and with variance 3). The right part of Figure 6.3 is a probability histogram for $X$, for values of $X$ at most 10. Note that
\[
P(X > 10) = \sum_{x=11}^\infty e^{-3}\,\frac{3^x}{x!} \approx 0.000292337.
\]

Rare events. The idea for the Poisson distribution comes from a limit theorem proven by the mathematician S. Poisson in the 1830s.

Theorem 6.38 (Poisson Limit Theorem) Let $\lambda$ be a positive real number, $n$ a positive integer, and $p = \lambda/n$. Then
\[
\lim_{n\to\infty} \binom{n}{x} p^x (1-p)^{n-x} = e^{-\lambda}\,\frac{\lambda^x}{x!} \quad \text{when } x = 0, 1, 2, \ldots.
\]

This theorem can be used to estimate binomial probabilities when the number of trials is large and the probability of success is small. That is, when success is a rare event.

For example, if $n = 10000$, $p = 2/5000$, and $x = 3$, then the probability of 3 successes in 10000 trials is
\[
\binom{10000}{3}\left(\frac{2}{5000}\right)^3\left(\frac{4998}{5000}\right)^{9997} \approx 0.195386
\]
and Poisson's approximation is
\[
e^{-4}\,\frac{4^3}{3!} \approx 0.195367,
\]
using $\lambda = np = 4$. The values are very close.
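The comparison is easy to reproduce. A minimal Python sketch (added for illustration) tabulates both probabilities for small $x$:

```python
from math import comb, exp, factorial

n, p, lam = 10_000, 2 / 5000, 4.0
for x in range(6):
    binom = comb(n, x) * p ** x * (1 - p) ** (n - x)
    poisson = exp(-lam) * lam ** x / factorial(x)
    print(x, round(binom, 6), round(poisson, 6))
# At x = 3: 0.195386 (binomial) versus 0.195367 (Poisson), as above.
```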

Poisson process. Events occurring in time are said to be generated by an (approximate) Poisson process with rate $\lambda$ when the following conditions are satisfied:

1. The numbers of events occurring in disjoint subintervals of time are independent of one another.

2. The probability of one event occurring in a sufficiently small subinterval of time is proportional to the size of the subinterval. If $h$ is the size, then the probability is $\lambda h$.

3. The probability of two or more events occurring in a sufficiently small subinterval of time is virtually zero.

In this definition, $\lambda$ represents the average number of events per unit time. If events follow an (approximate) Poisson process and $X$ is the number of events observed in one unit of time, then $X$ has a Poisson distribution with parameter $\lambda$.

Typical applications of Poisson distributions include the number of cars passing an intersection in a fixed period of time during a workday, or the number of phone calls received in a fixed period of time during a workday.

The idea of a Poisson process can be generalized to include events occurring over regions of space instead of intervals of time. ("Subregions" take the place of "subintervals" in the conditions above. In this case, $\lambda$ represents the average number of events per unit area or per unit volume.)

Remark. The definition of a Poisson process allows you to think of the PDF of $X$ as the limit of a sequence of binomial PDFs: the observation interval is subdivided into $n$ nonoverlapping subintervals; the $i$th Bernoulli trial results in success if an event occurs in the $i$th subinterval, and failure otherwise. If $n$ is large enough, then the probability that 2 or more events occur in one subinterval can be assumed to be zero.

6.9 Continuous Random Variables, revisited

The exponential and gamma families of distributions are related to Poisson processes.


6.9.1 Exponential Distribution

Recall that the continuous random variable $X$ has an exponential distribution with parameter $\lambda$ when its PDF has the form
\[
f(x) = \lambda e^{-\lambda x} \quad \text{when } x \ge 0 \text{ and } 0 \text{ otherwise},
\]
where $\lambda$ is a positive constant. (See Section 5.2.2.)

If events occurring over time follow an approximate Poisson process with rate $\lambda$, where $\lambda$ is the average number of events per unit time, then the time between successive events has an exponential distribution with parameter $\lambda$. To see this,

1. If you observe the process for $t$ units of time and let $Y$ equal the number of observed events, then $Y$ has a Poisson distribution with parameter $\lambda t$. The PDF of $Y$ is as follows:
\[
P(Y = y) = e^{-\lambda t}\,\frac{(\lambda t)^y}{y!} \quad \text{when } y = 0, 1, 2, \ldots \text{ and } 0 \text{ otherwise}.
\]

2. An event occurs, the clock is reset to time 0, and $X$ is the time until the next event occurs. Then $X$ is a continuous random variable whose range is $x > 0$. Further,
\[
P(X > t) = P(\text{0 events in the interval } [0, t]) = P(Y = 0) = e^{-\lambda t}
\]
and $P(X \le t) = 1 - e^{-\lambda t}$.

3. Since $f(t) = \frac{d}{dt}P(X \le t) = \frac{d}{dt}\left(1 - e^{-\lambda t}\right) = \lambda e^{-\lambda t}$ when $t > 0$ (and 0 otherwise) is the same as the PDF of an exponential random variable with parameter $\lambda$, $X$ has an exponential distribution with parameter $\lambda$.

6.9.2 Gamma Distribution

Recall that the continuous random variable $X$ has a gamma distribution with shape parameter $\alpha$ and scale parameter $\beta$ when its PDF has the following form:
\[
f(x) = \frac{1}{\beta^\alpha \Gamma(\alpha)}\, x^{\alpha-1} e^{-x/\beta} \quad \text{when } x > 0 \text{ and } 0 \text{ otherwise},
\]
where $\alpha$ and $\beta$ are positive constants. (See Section 5.2.4.)

If events occurring over time follow an approximate Poisson process with rate $\lambda$, where $\lambda$ is the average number of events per unit time, and if $r$ is a positive integer, then the time until the $r$th event occurs has a gamma distribution with $\alpha = r$ and $\beta = 1/\lambda$. To see this,

1. If you observe the process for $t$ units of time and let $Y$ equal the number of observed events, then $Y$ has a Poisson distribution with parameter $\lambda t$. The PDF of $Y$ is as follows:
\[
P(Y = y) = e^{-\lambda t}\,\frac{(\lambda t)^y}{y!} \quad \text{when } y = 0, 1, 2, \ldots \text{ and } 0 \text{ otherwise}.
\]

2. Let $X$ be the time at which you observe the $r$th event, starting from time 0. Then $X$ is a continuous random variable whose range is $x > 0$. Further, $P(X > t)$ is the same as the probability that there are fewer than $r$ events in the interval $[0, t]$. Thus,
\[
P(X > t) = P(Y < r) = \sum_{y=0}^{r-1} e^{-\lambda t}\,\frac{(\lambda t)^y}{y!}
\]
and $P(X \le t) = 1 - P(X > t)$.

3. The PDF $f(t) = \frac{d}{dt}P(X \le t)$ is computed using the product rule:
\[
\frac{d}{dt}\left[1 - e^{-\lambda t}\sum_{y=0}^{r-1}\frac{\lambda^y t^y}{y!}\right]
= \lambda e^{-\lambda t}\sum_{y=0}^{r-1}\frac{\lambda^y t^y}{y!} - e^{-\lambda t}\sum_{y=1}^{r-1}\frac{\lambda^y y\, t^{y-1}}{y!}
\]
\[
= \lambda e^{-\lambda t}\left[\sum_{y=0}^{r-1}\frac{\lambda^y t^y}{y!} - \sum_{y=1}^{r-1}\frac{\lambda^{y-1} t^{y-1}}{(y-1)!}\right]
= \lambda e^{-\lambda t}\,\frac{\lambda^{r-1} t^{r-1}}{(r-1)!}
= \frac{\lambda^r}{(r-1)!}\, t^{r-1} e^{-\lambda t}.
\]
Since $f(t)$ is the same as the PDF of a gamma random variable with parameters $\alpha = r$ and $\beta = 1/\lambda$, $X$ has a gamma distribution with parameters $\alpha = r$ and $\beta = 1/\lambda$.

6.10 Summary: Poisson Processes

In summary, there are three distributions related to Poisson processes:

1. If $X$ is the number of events observed in one unit of time, then $X$ is a Poisson random variable with parameter $\lambda$, where $\lambda$ is the average number of events per unit time. The probability that exactly $x$ events occur is
\[
P(X = x) = e^{-\lambda}\,\frac{\lambda^x}{x!} \quad \text{when } x = 0, 1, 2, \ldots, \text{ and } 0 \text{ otherwise}.
\]

2. If $X$ is the time between successive events, then $X$ is an exponential random variable with parameter $\lambda$, where $\lambda$ is the average number of events per unit time. The CDF of $X$ is
\[
F(x) = 1 - e^{-\lambda x} \quad \text{when } x > 0 \text{ and } 0 \text{ otherwise}.
\]

3. If $X$ is the time to the $r$th event, then $X$ is a gamma random variable with parameters $\alpha = r$ and $\beta = 1/\lambda$, where $\lambda$ is the average number of events per unit time. The CDF of $X$ is
\[
F(x) = 1 - \sum_{y=0}^{r-1} e^{-\lambda x}\,\frac{(\lambda x)^y}{y!} \quad \text{when } x > 0 \text{ and } 0 \text{ otherwise}.
\]

Remark. If $\alpha = r$ and $r$ is not too large, then gamma probabilities can be computed by hand using the formula for the CDF above. Otherwise, the computer can be used to find probabilities for gamma distributions.
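As an illustration of the last point, the Python sketch below (gamma_cdf is our own helper; the simulation uses the fact that the time to the $r$th event is a sum of exponential gaps) computes a gamma CDF value from the Poisson sum and checks it by simulation.

```python
import random
from math import exp, factorial

def gamma_cdf(x, r, lam):
    """CDF of the gamma distribution with alpha = r, beta = 1/lam."""
    return 1 - sum(exp(-lam * x) * (lam * x) ** y / factorial(y)
                   for y in range(r))

random.seed(1)
r, lam, x = 3, 2.0, 1.5
times = [sum(random.expovariate(lam) for _ in range(r))
         for _ in range(100_000)]
print(gamma_cdf(x, r, lam))                          # exact CDF value
print(sum(1 for t in times if t <= x) / len(times))  # simulated proportion
```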


7 Multivariable Calculus Concepts

Let $\mathbb{R}^k$ be the set of ordered $k$-tuples of real numbers,
\[
\mathbb{R}^k = \{(x_1, x_2, \ldots, x_k) : x_i \in \mathbb{R}\}.
\]
Each element of $\mathbb{R}^k$ is called a point or a vector. Initially, we use the notation
\[
\mathbf{x} = (x_1, x_2, \ldots, x_k)
\]
to denote vectors in $\mathbb{R}^k$. We say that $x_i$ is the $i$th component of $\mathbf{x}$.

This chapter considers real-valued functions whose domain, $D$, is a subset of $\mathbb{R}^k$,
\[
f : D \subseteq \mathbb{R}^k \longrightarrow \mathbb{R},
\]
and their applications in statistics.

7.1 Vector And Matrix Operations

This section introduces basic vector and matrix operations. Ideas introduced here will be studied in more depth in Chapter 9 of these notes.

7.1.1 Representations of Vectors

The origin (or zero vector) $O$ in $\mathbb{R}^k$ is the vector all of whose components are zero,
\[
O = (0, 0, \ldots, 0).
\]
Let $\mathbf{v} = (v_1, v_2, \ldots, v_k)$ be a vector in $\mathbb{R}^k$. If $\mathbf{v}$ is not the zero vector, then $\mathbf{v}$ can be represented as a directed line segment in two ways:

1. As a position vector drawn from the origin to the point $P(v_1, v_2, \ldots, v_k)$: $\mathbf{v} = \overrightarrow{OP}$.

2. As a displacement vector drawn from point $P(p_1, p_2, \ldots, p_k)$ to point $Q(q_1, q_2, \ldots, q_k)$, where $v_i = q_i - p_i$ for each $i$: $\mathbf{v} = \overrightarrow{PQ}$.

For example, the left part of Figure 7.1 shows the vector $\mathbf{v} = (-4, 3)$ represented as the position vector $\overrightarrow{OP}$ and the displacement vectors $\overrightarrow{AB}$ and $\overrightarrow{VW}$. The right part of the figure shows the position vector $\mathbf{v} = \overrightarrow{OP}$, where $P$ is the point $P(-4, 3, 2)$.

7.1.2 Addition and Scalar Multiplication of Vectors

Figure 7.1: Representations of vectors in 2-space (left) and 3-space (right).

If $\mathbf{v}$ and $\mathbf{w}$ are vectors in $\mathbb{R}^k$ and $c \in \mathbb{R}$ is a scalar, then

1. the vector sum $\mathbf{v} + \mathbf{w}$ is the vector obtained by componentwise addition,
\[
\mathbf{v} + \mathbf{w} = (v_1 + w_1, v_2 + w_2, \ldots, v_k + w_k), \ \text{and}
\]

2. the scalar product $c\mathbf{v}$ is the vector obtained by multiplying each component of $\mathbf{v}$ by $c$,
\[
c\mathbf{v} = (c v_1, c v_2, \ldots, c v_k).
\]

Example 7.1 (Vectors in $\mathbb{R}^3$) If $\mathbf{v} = (8, -3, 2)$ and $\mathbf{w} = (-9, 2, 3)$, then
\[
\mathbf{v} + \mathbf{w} = (-1, -1, 5), \quad 5\mathbf{v} = (40, -15, 10), \quad \text{and} \quad 5\mathbf{v} - \mathbf{w} = (49, -17, 7).
\]

Properties of addition and scalar multiplication. The following properties hold for vectors $\mathbf{u}, \mathbf{v}, \mathbf{w}$ in $\mathbb{R}^k$ and $c, d \in \mathbb{R}$:

1. Commutative: $\mathbf{v} + \mathbf{w} = \mathbf{w} + \mathbf{v}$.

2. Associative: $(\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})$, $c(d\mathbf{v}) = (cd)\mathbf{v}$.

3. Distributive: $(c + d)\mathbf{v} = c\mathbf{v} + d\mathbf{v}$, $c(\mathbf{v} + \mathbf{w}) = c\mathbf{v} + c\mathbf{w}$.

4. Identity: $1\mathbf{v} = \mathbf{v}$, $0\mathbf{v} = O$, $\mathbf{v} + O = \mathbf{v}$, and $\mathbf{v} + (-1)\mathbf{w} = \mathbf{v} - \mathbf{w}$ (that is, we have the operation of subtraction).

Parallelogram or triangle rules. Figure 7.2 gives a graphical way to represent addition (left part) and subtraction (right part) in 2-space and 3-space.

In the left plot, the vector sum $\mathbf{v} + \mathbf{w}$ is obtained by starting at the lower left and going across the bottom and then up the right side, or by starting at the lower left and going up the left side and across the top (since $\mathbf{v} + \mathbf{w} = \mathbf{w} + \mathbf{v}$). Geometrically, the vector sum is the diagonal of the parallelogram.

In the right plot, observe that $\mathbf{w} + \mathbf{u} = \mathbf{v}$ implies that $\mathbf{u} = \mathbf{v} - \mathbf{w}$.


Figure 7.2: Graphical representation of vector addition (left) and subtraction (right).

Figure 7.3: Graphical representation of scalar multiplication of vectors.

Parallel vectors. Vectors $\mathbf{v}$ and $\mathbf{w}$ are said to be parallel when $\mathbf{w} = m\mathbf{v}$ for some constant $m$. If $m > 0$, then the vectors point in the same direction; if $m < 0$, then the vectors point in opposite directions.

Figure 7.3 gives a graphical way to represent scalar multiplication in 2-space and 3-space. Vector $2\mathbf{v}$ points in the same direction as $\mathbf{v}$ but has twice the length; vector $-1.5\mathbf{v}$ has length 1.5 times the length of $\mathbf{v}$, but points in the opposite direction.

7.1.3 Length and Distance

If $\mathbf{v}$ is a vector in $\mathbb{R}^k$, then the length of $\mathbf{v}$ is
\[
\|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \cdots + v_k^2}.
\]
The zero vector has length 0; all other vectors have positive length.

A unit vector is a vector of length 1. If $\mathbf{v}$ is not the zero vector, then the unit vector in the direction of $\mathbf{v}$ is the vector $\mathbf{u} = \frac{1}{\|\mathbf{v}\|}\mathbf{v}$.

The distance between vectors $\mathbf{v}$ and $\mathbf{w}$ is the length of the difference vector $\mathbf{v} - \mathbf{w}$:
\[
\|\mathbf{v} - \mathbf{w}\| = \sqrt{(v_1-w_1)^2 + (v_2-w_2)^2 + \cdots + (v_k-w_k)^2}.
\]

Example 7.2 (Vectors in $\mathbb{R}^3$, continued) If $\mathbf{v} = (8, -3, 2)$ and $\mathbf{w} = (-9, 2, 3)$, then the length of $\mathbf{v}$ is $\|\mathbf{v}\| = \sqrt{77} \approx 8.77$, the length of $\mathbf{w}$ is $\|\mathbf{w}\| = \sqrt{94} \approx 9.70$, and the distance between $\mathbf{v}$ and $\mathbf{w}$ is $\|\mathbf{v} - \mathbf{w}\| = 3\sqrt{35} \approx 17.75$.


The unit vector in the direction of $\mathbf{v}$ is $\mathbf{u} = \left(\frac{8}{\sqrt{77}}, \frac{-3}{\sqrt{77}}, \frac{2}{\sqrt{77}}\right)$.

Theorem 7.3 (Triangle Inequality) If $\mathbf{v}, \mathbf{w} \in \mathbb{R}^k$, then $\|\mathbf{v} + \mathbf{w}\| \le \|\mathbf{v}\| + \|\mathbf{w}\|$.

Note that if $k = 2$ or $k = 3$, then the triangle with corners $O$, $\mathbf{v}$ and $\mathbf{v} + \mathbf{w}$ has sides of lengths $\|\mathbf{v}\|$, $\|\mathbf{w}\|$, and $\|\mathbf{v} + \mathbf{w}\|$.

7.1.4 Dot Product of Vectors

If $\mathbf{v}, \mathbf{w} \in \mathbb{R}^k$, then the dot product (or inner product or scalar product) is the number
\[
\mathbf{v}\cdot\mathbf{w} = v_1 w_1 + v_2 w_2 + \cdots + v_k w_k.
\]
Note that $\mathbf{v}\cdot\mathbf{v} = \|\mathbf{v}\|^2$.

Let $k = 2$ or 3, $\mathbf{v} = \overrightarrow{OP}$ and $\mathbf{w} = \overrightarrow{OQ}$ be representations of $\mathbf{v}$ and $\mathbf{w}$ as position vectors, and $\theta$ be the smaller of the two angles defined by $POQ$ (the connected segments from $P$ to the origin to $Q$). Then, analytic geometry can be used to demonstrate that
\[
\mathbf{v}\cdot\mathbf{w} = \|\mathbf{v}\|\,\|\mathbf{w}\|\cos(\theta).
\]
If $\theta$ is unknown, then it can be found by solving the equation above.

Example 7.4 (Vectors in $\mathbb{R}^3$, continued) If $\mathbf{v} = (8, -3, 2)$ and $\mathbf{w} = (-9, 2, 3)$, then
\[
\mathbf{v}\cdot\mathbf{w} = -72 \quad \text{and} \quad \theta = \cos^{-1}\left(\frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\,\|\mathbf{w}\|}\right) \approx 2.58 \text{ radians } (\approx 148 \text{ degrees}).
\]

Angle between vectors. Analytic geometry can be used to prove the following theorem.

Theorem 7.5 (Cauchy-Schwarz Inequality) If $\mathbf{v}, \mathbf{w} \in \mathbb{R}^k$, then $|\mathbf{v}\cdot\mathbf{w}| \le \|\mathbf{v}\|\,\|\mathbf{w}\|$.

The Cauchy-Schwarz inequality allows us to formally define the angle between two nonzero vectors. Specifically, if $\mathbf{v}, \mathbf{w} \in \mathbb{R}^k$ are nonzero vectors, then $\theta$ is defined to be the angle satisfying the following equation:
\[
\theta = \cos^{-1}\left(\frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\,\|\mathbf{w}\|}\right).
\]
This definition generalizes ideas about angles from 2-space and 3-space.

Orthogonal vectors. The nonzero vectors $\mathbf{v}, \mathbf{w} \in \mathbb{R}^k$ are said to be orthogonal if their dot product is 0: $\mathbf{v}\cdot\mathbf{w} = 0$.

Note that the angle between orthogonal vectors is $\frac{\pi}{2}$ (or 90 degrees), since $\cos\left(\frac{\pi}{2}\right) = 0$ and $\frac{\pi}{2}$ is in the range of the inverse cosine function (see page 53).
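The computations in Examples 7.2 and 7.4 can be verified with a few lines of Python (added for illustration; dot and norm are our own helpers):

```python
from math import sqrt, acos, degrees

def dot(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))

def norm(v):
    return sqrt(dot(v, v))

v, w = (8, -3, 2), (-9, 2, 3)
print(norm(v), norm(w))                          # sqrt(77), sqrt(94)
print(norm([vi - wi for vi, wi in zip(v, w)]))   # 3*sqrt(35) ≈ 17.75
theta = acos(dot(v, w) / (norm(v) * norm(w)))
print(dot(v, w), theta, degrees(theta))          # -72, ≈ 2.58 rad, ≈ 148°
```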


7.1.5 Matrix Notation and Matrix Transpose

An m-by-n matrix $A$ is a rectangular array with $m$ rows and $n$ columns:
\[
A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} = [a_{ij}].
\]
The $(i, j)$-entry of $A$ (denoted by $a_{ij}$) is the element in the row $i$ and column $j$ position. Entries are shown explicitly in the first term above. The second term is used as shorthand when the size of the matrix (the numbers $m$ and $n$) is understood.

A 1-by-n matrix is known as a row vector and an m-by-1 matrix is known as a column vector. The m-by-n matrix $A$ can be thought of as a single column of $m$ row vectors or as a single row of $n$ column vectors.

If $A$ is an m-by-n matrix, then the transpose of $A$, denoted by $A^T$, is the n-by-m matrix whose $(i, j)$-entry is $a_{ji}$. For example, if
\[
A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, \quad \text{then} \quad A^T = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}.
\]

7.1.6 Addition and Scalar Multiplication of Matrices

Let $A = [a_{ij}]$ and $B = [b_{ij}]$ be m-by-n matrices of numbers and let $c \in \mathbb{R}$ be a scalar. Then

1. the matrix sum, $A + B$, is the m-by-n matrix obtained by componentwise addition, $A + B = [a_{ij} + b_{ij}]$, and

2. the scalar product, $cA$, is the m-by-n matrix obtained by multiplying each component of $A$ by $c$, $cA = [c\,a_{ij}]$.

Example 7.6 (2-by-3 Matrices) If
\[
A = \begin{bmatrix} 2 & 3 & 6 \\ -2 & 0 & 1 \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} 5 & -6 & 2 \\ 4 & 5 & 0 \end{bmatrix},
\]
then
\[
A + B = \begin{bmatrix} 7 & -3 & 8 \\ 2 & 5 & 1 \end{bmatrix}, \quad 5A = \begin{bmatrix} 10 & 15 & 30 \\ -10 & 0 & 5 \end{bmatrix}, \quad \text{and} \quad 5A - 2B = \begin{bmatrix} 0 & 27 & 26 \\ -18 & -10 & 5 \end{bmatrix}.
\]

Properties of addition and scalar multiplication. Since operations are performed componentwise, properties of addition and scalar multiplication of matrices are similar to those for vectors.


7.1.7 Matrix Multiplication

Let $A = [a_{ij}]$ be an m-by-n matrix and let $B = [b_{ij}]$ be an n-by-p matrix.

The matrix product, $C = AB$, is the m-by-p matrix whose $(i, j)$-entry is the dot product of the $i$th row of $A$ with the $j$th column of $B$:
\[
c_{ij} = \sum_{k=1}^n a_{ik} b_{kj} \quad \text{for } i = 1, 2, \ldots, m; \ j = 1, 2, \ldots, p.
\]

Example 7.7 (2-by-3 and 3-by-2 Matrices) Let
\[
A = \begin{bmatrix} 2 & 3 & 6 \\ -1 & 0 & 1 \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} 2 & -3 \\ 0 & 1 \\ 4 & -7 \end{bmatrix}.
\]
Then
\[
AB = \begin{bmatrix} 28 & -45 \\ 2 & -4 \end{bmatrix} \quad \text{and} \quad BA = \begin{bmatrix} 7 & 6 & 9 \\ -1 & 0 & 1 \\ 15 & 12 & 17 \end{bmatrix}.
\]
For example, the $(1, 1)$-entry of the product $AB$ is
\[
(2, 3, 6)\cdot(2, 0, 4) = (2)(2) + (3)(0) + (6)(4) = 28.
\]
Similarly, the $(1, 1)$-entry of the product $BA$ is $(2)(2) + (-3)(-1) = 7$.

Properties. Matrix products are associative. Thus, the product $ABC = A(BC) = (AB)C$ is well-defined. Matrix products are not necessarily commutative, as illustrated in the example above. In fact, if the number of rows of $A$ is not equal to the number of columns of $B$, then the product $BA$ is not defined.

Note also that the transpose of the product of two matrices is equal to the product of the transposes in the opposite order: $(AB)^T = B^T A^T$.
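A direct implementation of the dot-product formula (a Python sketch added for illustration; matmul is our own helper) reproduces Example 7.7 and shows that the two products differ:

```python
def matmul(A, B):
    """(i, j)-entry: dot product of row i of A with column j of B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[2, 3, 6], [-1, 0, 1]]
B = [[2, -3], [0, 1], [4, -7]]
print(matmul(A, B))   # [[28, -45], [2, -4]]
print(matmul(B, A))   # [[7, 6, 9], [-1, 0, 1], [15, 12, 17]]
```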

7.1.8 Determinants of Square Matrices

A square matrix is a matrix with $m = n$. If $A$ is a square matrix, then the determinant of $A$ is a number which summarizes certain properties of the matrix. Notations for the determinant of $A$ are $\det(A)$ or $|A|$.

The determinant is often defined recursively, as illustrated in the following special cases:

1. If $A = \begin{bmatrix} a_1 & a_2 \\ b_1 & b_2 \end{bmatrix}$ is a 2-by-2 matrix, then
\[
\det(A) = \begin{vmatrix} a_1 & a_2 \\ b_1 & b_2 \end{vmatrix} = a_1 b_2 - b_1 a_2.
\]

2. If $A = \begin{bmatrix} a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \\ c_1 & c_2 & c_3 \end{bmatrix}$ is a 3-by-3 matrix, then
\[
\det(A) = \begin{vmatrix} a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \\ c_1 & c_2 & c_3 \end{vmatrix} = a_1 \begin{vmatrix} b_2 & b_3 \\ c_2 & c_3 \end{vmatrix} - a_2 \begin{vmatrix} b_1 & b_3 \\ c_1 & c_3 \end{vmatrix} + a_3 \begin{vmatrix} b_1 & b_2 \\ c_1 & c_2 \end{vmatrix}.
\]
(The formula uses an alternating sum of 3 determinants of 2-by-2 matrices.)

3. If $A = \begin{bmatrix} a_1 & a_2 & a_3 & a_4 \\ b_1 & b_2 & b_3 & b_4 \\ c_1 & c_2 & c_3 & c_4 \\ d_1 & d_2 & d_3 & d_4 \end{bmatrix}$ is a 4-by-4 matrix, then
\[
\det(A) = a_1 \begin{vmatrix} b_2 & b_3 & b_4 \\ c_2 & c_3 & c_4 \\ d_2 & d_3 & d_4 \end{vmatrix} - a_2 \begin{vmatrix} b_1 & b_3 & b_4 \\ c_1 & c_3 & c_4 \\ d_1 & d_3 & d_4 \end{vmatrix} + a_3 \begin{vmatrix} b_1 & b_2 & b_4 \\ c_1 & c_2 & c_4 \\ d_1 & d_2 & d_4 \end{vmatrix} - a_4 \begin{vmatrix} b_1 & b_2 & b_3 \\ c_1 & c_2 & c_3 \\ d_1 & d_2 & d_3 \end{vmatrix}.
\]
(The formula uses an alternating sum of 4 determinants of 3-by-3 matrices.)

Note that the notation for the entries in the special cases was chosen to emphasize the recursive nature of the definition.

In general, if $A_{ij}$ is the (n−1)-by-(n−1) submatrix obtained by eliminating the $i$th row and the $j$th column of $A$, then
\[
\det(A) = \sum_{j=1}^n (-1)^{1+j} a_{1j} \det(A_{1j}).
\]
The formula uses an alternating sum of $n$ determinants of (n−1)-by-(n−1) matrices. We say that the determinant is obtained by "expanding in the first row." (Additional methods for computing $\det(A)$ are discussed in Section 9.3.)

Example 7.8 (3-by-3 Matrix) If $A = \begin{bmatrix} 4 & 1 & 2 \\ -1 & 2 & 1 \\ 1 & 3 & -3 \end{bmatrix}$, then
\[
\det(A) = 4 \begin{vmatrix} 2 & 1 \\ 3 & -3 \end{vmatrix} - 1 \begin{vmatrix} -1 & 1 \\ 1 & -3 \end{vmatrix} + 2 \begin{vmatrix} -1 & 2 \\ 1 & 3 \end{vmatrix} = 4(-6-3) - 1(3-1) + 2(-3-2) = 4(-9) - 1(2) + 2(-5) = -48.
\]
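The recursive definition translates directly into code. The Python sketch below (added for illustration; not an efficient method for large matrices) expands in the first row:

```python
def det(A):
    """Determinant by expansion in the first row."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in A[1:]]  # drop row 1, col j
        total += (-1) ** j * A[0][j] * det(minor)
    return total

print(det([[4, 1, 2], [-1, 2, 1], [1, 3, -3]]))   # -48
```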

7.2 Limits and Continuity

This section generalizes ideas from Section 4.2.


7.2.1 Neighborhoods, Deleted Neighborhoods and Limits

A neighborhood of the vector $\mathbf{a} \in \mathbb{R}^k$ is the set of vectors $\mathbf{x} \in \mathbb{R}^k$ satisfying
\[
\|\mathbf{x} - \mathbf{a}\| < r \quad \text{for some positive number } r.
\]
A deleted neighborhood of $\mathbf{a}$ is the neighborhood with $\mathbf{a}$ removed; that is, the set of vectors $\mathbf{x}$ satisfying $0 < \|\mathbf{x} - \mathbf{a}\| < r$ for some positive $r$.

The vector $\mathbf{a} \in D$ is said to be a limit point of the domain $D$ if every deleted neighborhood of $\mathbf{a}$ contains points of $D$.

Consider $f : D \subseteq \mathbb{R}^k \longrightarrow \mathbb{R}$, and suppose that $\mathbf{a}$ is a limit point of $D$. We say that the limit as $\mathbf{x}$ approaches $\mathbf{a}$ of $f(\mathbf{x})$ is $L$,
\[
\lim_{\mathbf{x}\to\mathbf{a}} f(\mathbf{x}) = L,
\]
if for every positive number $\epsilon$ there exists a positive number $\delta$ such that
\[
\text{if } \mathbf{x} \in D \text{ satisfies } 0 < \|\mathbf{x} - \mathbf{a}\| < \delta, \text{ then } f(\mathbf{x}) \text{ satisfies } |f(\mathbf{x}) - L| < \epsilon.
\]
(Vectors in the intersection of the domain of the function with the deleted δ-neighborhood of $\mathbf{a}$ are mapped into the ε-neighborhood of $L$.)

Example 7.9 (Limit Exists) Consider the 2-variable function
\[
f(x, y) = \frac{x^3 - y^3}{x - y} \quad \text{when } x \neq y.
\]
Let $\mathbf{x} = (x, y)$ and $\mathbf{a} = (0, 0)$. Then
\[
\lim_{\mathbf{x}\to\mathbf{a}} f(\mathbf{x}) = \lim_{(x,y)\to(0,0)} \frac{x^3 - y^3}{x - y} = \lim_{(x,y)\to(0,0)} \left(x^2 + xy + y^2\right) = 0.
\]
(As $(x, y)$ approaches the origin from any direction not crossing the line $y = x$, the function values $f(x, y)$ approach zero.)

Example 7.10 (Limit Does Not Exist) Consider the 2-variable function
\[
f(x, y) = \frac{x^2 - y^2}{x^2 + y^2} \quad \text{when } (x, y) \neq (0, 0).
\]
Let $\mathbf{x} = (x, y)$ and $\mathbf{a} = (0, 0)$, and consider approaching $(0, 0)$ along the line $y = mx$. Then
\[
f(x, y) = f(x, mx) = \frac{x^2 - m^2 x^2}{x^2 + m^2 x^2} = \frac{1 - m^2}{1 + m^2} \quad \text{on this line (when } x \neq 0).
\]
Since
\[
\lim_{\mathbf{x}\to\mathbf{a} \text{ along } y = mx} f(\mathbf{x}) = \lim_{(x,y)\to(0,0) \text{ along } y = mx} \frac{1 - m^2}{1 + m^2} = \frac{1 - m^2}{1 + m^2}
\]
and $m$ can be any number, the two-dimensional limit does not exist.


Properties of limits. Properties of limits for 1-variable functions (see page 57) can be generalized to limits for k-variable functions.

Example 7.11 (Squeezing Principle) Consider the 2-variable function
\[
f(x, y) = \frac{3x^2 y}{x^2 + y^2} \quad \text{when } (x, y) \neq (0, 0).
\]
Let $\mathbf{x} = (x, y)$ and $\mathbf{a} = (0, 0)$. The squeezing principle can be used to demonstrate that
\[
\lim_{(x,y)\to(0,0)} f(x, y) = 0.
\]
To see this, note that
\[
0 \le \frac{x^2}{x^2 + y^2} \le 1 \quad \text{when } (x, y) \neq (0, 0)
\]
implies that values of $f$ have the following simple bounds:
\[
-3|y| \le f(x, y) \le 3|y| \quad \text{when } (x, y) \neq (0, 0).
\]
Since $\pm 3|y| \to 0$ as $y \to 0$, $f(x, y)$ must approach 0 as well.

7.2.2 Continuous Functions

The function $f(\mathbf{x})$ is said to be continuous at $\mathbf{a}$ if
\[
\lim_{\mathbf{x}\to\mathbf{a}} f(\mathbf{x}) = f(\mathbf{a}).
\]
The function is said to be discontinuous at $\mathbf{a}$ otherwise. Note that $f(\mathbf{x})$ is discontinuous at $\mathbf{a}$ when $f(\mathbf{a})$ is undefined, or when the limit does not exist, or when the limit and function values exist but are not equal.

Linear functions. A linear function in $k$ variables is a function of the form
\[
f(\mathbf{x}) = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_k x_k,
\]
where each coefficient $c_i \in \mathbb{R}$. Linear functions are continuous for all $\mathbf{x} \in \mathbb{R}^k$.

Polynomial functions. A polynomial function in $k$ variables is a function of the form
\[
f(\mathbf{x}) = \sum_{j_1, j_2, \ldots, j_k = 0}^{d} c_{j_1, j_2, \ldots, j_k}\, x_1^{j_1} x_2^{j_2} \cdots x_k^{j_k}, \quad \text{where } d \text{ is a nonnegative integer}
\]
and the $c$'s are real numbers. For example, $f(x, y) = 3x^2 - 10xy^8 + y^2 + 12$ is a polynomial function in two variables. Polynomial functions are continuous for all $\mathbf{x} \in \mathbb{R}^k$.

Rational functions. A rational function in $k$ variables is a ratio of polynomials, $f(\mathbf{x}) = p(\mathbf{x})/q(\mathbf{x})$. Rational functions are continuous whenever $q(\mathbf{x}) \neq 0$.


More generally. More generally, properties of limits imply that sums, differences, products, quotients, and constant multiples of continuous functions are continuous whenever the appropriate operation is defined.

Composition of continuous functions. If the k-variable function $g(\mathbf{x})$ is continuous at $\mathbf{a}$ and the 1-variable function $f(x)$ is continuous at $g(\mathbf{a})$, then the composite function $f(g(\mathbf{x}))$ is continuous at $\mathbf{a}$. In other words,
\[
\lim_{\mathbf{x}\to\mathbf{a}} f(g(\mathbf{x})) = f\left(\lim_{\mathbf{x}\to\mathbf{a}} g(\mathbf{x})\right) = f(g(\mathbf{a})).
\]
For example, let $g(x, y) = x^2 - 4y$, $f(x) = \sqrt{x}$, and $\mathbf{a} = (5, 1)$. Then the composite function $f(g(x, y))$ is continuous at $(5, 1)$ since
\[
\lim_{(x,y)\to(5,1)} f(g(x, y)) = \lim_{(x,y)\to(5,1)} \sqrt{x^2 - 4y} = \sqrt{25 - 4} = \sqrt{21} = f(g(5, 1)).
\]

7.3 Differentiation in Several Variables

This section generalizes differentiability from Section 4.3.

7.3.1 Partial Derivatives and the Gradient Vector

The partial derivative of $f(\mathbf{x})$ with respect to $x_i$ is
\[
f_{x_i}(\mathbf{x}) = \lim_{h\to 0} \frac{f(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_k) - f(\mathbf{x})}{h}.
\]
(The $i$th partial derivative is the ordinary derivative of the function where all variables except the $i$th variable are held fixed.) Alternative notations are $D_{x_i} f(\mathbf{x})$ and $\frac{\partial f}{\partial x_i}(\mathbf{x})$.

Example 7.12 (k = 3) Let $\mathbf{x} = (x, y, z)$ and $f(x, y, z) = -2xy^2 + z^3 + 3x^2 y z^4$. Then
\[
f_x(x, y, z) = -2y^2 + 6xyz^4, \quad f_y(x, y, z) = -4xy + 3x^2 z^4, \quad f_z(x, y, z) = 3z^2 + 12x^2 y z^3.
\]
The gradient vector, $\nabla f(\mathbf{x})$ ("grad $f$ of $\mathbf{x}$"), is the vector of partial derivatives:
\[
\nabla f(\mathbf{x}) = \left(f_{x_1}(\mathbf{x}), f_{x_2}(\mathbf{x}), \ldots, f_{x_k}(\mathbf{x})\right).
\]
For example, for the 3-variable function above,
\[
\nabla f(x, y, z) = \left(-2y^2 + 6xyz^4, \ -4xy + 3x^2 z^4, \ 3z^2 + 12x^2 y z^3\right).
\]


Matrix notation. We sometimes write the list of partial derivatives as a 1-by-k matrix,
\[
Df(\mathbf{x}) = \left[\, f_{x_1}(\mathbf{x}) \ \ f_{x_2}(\mathbf{x}) \ \ \ldots \ \ f_{x_k}(\mathbf{x}) \,\right].
\]
Note that matrix notation is especially useful when considering vector-valued functions. For example, if the domain of $f$ is a subset of $\mathbb{R}^n$ and the range of $f$ is a subset of $\mathbb{R}^m$, then $Df(\mathbf{x})$ is an m-by-n matrix of partial derivatives.

7.3.2 Second-Order Partial Derivatives

The second-order partial derivative of $f(\mathbf{x})$ with respect to $x_i$ and $x_j$ is the partial derivative of $f_{x_i}(\mathbf{x})$ with respect to $x_j$:
\[
f_{x_i, x_j}(\mathbf{x}) = \frac{\partial}{\partial x_j}\left(\frac{\partial f}{\partial x_i}(\mathbf{x})\right) = \frac{\partial^2 f}{\partial x_j\, \partial x_i}(\mathbf{x}) = D_{x_i, x_j} f(\mathbf{x}).
\]
When $i = j$, the second-order partial derivative can be interpreted as the concavity of a partial function. When $i \neq j$, the second-order partial derivative is often called a mixed partial derivative. Additional notations when $i = j$ are:
\[
f_{x_i, x_i}(\mathbf{x}) = \frac{\partial^2 f}{\partial x_i^2}(\mathbf{x}) = D^2_{x_i} f(\mathbf{x}).
\]

Theorem 7.13 (Clairaut's Theorem) Suppose that the k-variable function $f(\mathbf{x})$ has continuous first- and second-order partial derivatives in a neighborhood of $\mathbf{a}$. Then for each $i$ and $j$, the mixed partial derivatives at $\mathbf{a}$ are equal:
\[
f_{x_i, x_j}(\mathbf{a}) = f_{x_j, x_i}(\mathbf{a}) \quad \text{for all } i, j = 1, 2, \ldots, k.
\]

Example 7.14 (k = 3, continued) The second-order partial derivatives (with arguments omitted) of the 3-variable function above are as follows:
\[
f_{x,x} = 6yz^4, \quad f_{y,y} = -4x, \quad f_{z,z} = 6z + 36x^2 y z^2,
\]
\[
f_{x,y} = f_{y,x} = -4y + 6xz^4, \quad f_{x,z} = f_{z,x} = 24xyz^3, \quad f_{y,z} = f_{z,y} = 12x^2 z^3.
\]
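Clairaut's theorem can be verified symbolically for this function; the sketch below assumes the SymPy library is available.

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
f = -2*x*y**2 + z**3 + 3*x**2*y*z**4

# All three pairs of mixed partials agree, as Clairaut's theorem predicts.
for u, v in [(x, y), (x, z), (y, z)]:
    assert sp.simplify(sp.diff(f, u, v) - sp.diff(f, v, u)) == 0
print(sp.diff(f, x, y))   # -4*y + 6*x*z**4
```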

Remark. Higher-order partial derivatives can also be defined.

7.3.3 Differentiability, Local Linearity and Tangent (Hyper)Planes

Suppose that $f(\mathbf{x})$ is defined for all $\mathbf{x}$ in a neighborhood of $\mathbf{a}$ and that $f_{x_i}(\mathbf{a})$ exists for $i = 1, 2, \ldots, k$.


Figure 7.4: Plots of $z = 4 - x^2 - y^2$ (left part) and $z = x^3 - 3xy + y^3$ (right part). In each case, partial functions are superimposed on the surface. (See Figure 7.5 also.)

Let
\[
h(\mathbf{x}) = f(\mathbf{a}) + \nabla f(\mathbf{a})\cdot(\mathbf{x} - \mathbf{a}) = f(\mathbf{a}) + f_{x_1}(\mathbf{a})(x_1 - a_1) + \cdots + f_{x_k}(\mathbf{a})(x_k - a_k).
\]
We say that $f$ is differentiable at $\mathbf{a}$ if the following limit holds:
\[
\lim_{\mathbf{x}\to\mathbf{a}} \frac{f(\mathbf{x}) - h(\mathbf{x})}{\|\mathbf{x} - \mathbf{a}\|} = 0.
\]
That is, if the error in using $h$ as an approximation to $f$ approaches 0 at a rate faster than the rate at which the distance between $\mathbf{x}$ and $\mathbf{a}$ approaches 0.

Local linearity. The definition of differentiability says that $f$ is locally linear; the error in using the linear function $h$ as an approximation to $f$ can be made as small as we like by taking $\mathbf{x}$ to be close enough to $\mathbf{a}$.

Tangent (hyper)planes. If $f$ is differentiable at $\mathbf{a}$, then $y = h(\mathbf{x})$ is called the tangent plane at $(\mathbf{a}, f(\mathbf{a}))$ when $k = 2$ and is called the tangent hyperplane at $(\mathbf{a}, f(\mathbf{a}))$ when $k > 2$.

Example 7.15 (Quadratic Function) Consider the 2-variable function
\[
f(x, y) = 4 - x^2 - y^2 \quad \text{and let } \mathbf{a} = (0.5, -1).
\]
The left part of Figure 7.4 shows a graph of the surface $z = f(x, y)$, with graphs of the partial functions $z = f(0.5, y)$ and $z = f(x, -1)$ superimposed.

1. The partial derivatives of $f$ are $f_x(x, y) = -2x$ and $f_y(x, y) = -2y$.

2. The slope of $z = f(x, -1)$ when $x = 0.5$ is $f_x(0.5, -1) = -1$.

3. The slope of $z = f(0.5, y)$ when $y = -1$ is $f_y(0.5, -1) = 2$.

4. Since $f(0.5, -1) = 2.75$, the linear approximation at $(0.5, -1)$ is
\[
h(x, y) = 2.75 - (x - 0.5) + 2(y + 1).
\]

In addition, $f$ is differentiable at this point (and everywhere).

Example 7.16 (Cubic Function) Consider the 2-variable function
\[
f(x, y) = x^3 - 3xy + y^3 \quad \text{and let } \mathbf{a} = (1.5, 1).
\]
The right part of Figure 7.4 shows a graph of the surface $z = f(x, y)$, with graphs of the partial functions $z = f(1.5, y)$ and $z = f(x, 1)$ superimposed.

1. The partial derivatives of $f$ are $f_x(x, y) = 3x^2 - 3y$ and $f_y(x, y) = 3y^2 - 3x$.

2. The slope of $z = f(x, 1)$ when $x = 1.5$ is $f_x(1.5, 1) = 3.75$.

3. The slope of $z = f(1.5, y)$ when $y = 1$ is $f_y(1.5, 1) = -1.5$.

4. Since $f(1.5, 1) = -0.125$, the linear approximation at $(1.5, 1)$ is
\[
h(x, y) = -0.125 + 3.75(x - 1.5) - 1.5(y - 1).
\]

In addition, $f$ is differentiable at this point (and everywhere).

Example 7.17 (Derivative Does Not Exist) Consider the 2-variable function
\[
f(x, y) = \bigl|\,|x| - |y|\,\bigr| - |x| - |y| \quad \text{and let } \mathbf{a} = (0, 0).
\]
Since $f(x, 0) = 0$ and $f(0, y) = 0$, $f_x(x, 0) = 0$ and $f_y(0, y) = 0$ for each $x$ and $y$, and the linear approximation is $h(x, y) = 0$.

But $f$ is not differentiable at the origin since the limit given in the definition of differentiability does not exist. If you evaluate the limit by approaching along $y = 0$ and along $y = x$, for example, you get two different answers:
\[
\lim_{(x,0)\to(0,0)} \frac{f(x, 0) - 0}{\|(x, 0)\|} = 0 \quad \text{and} \quad \lim_{(x,x)\to(0,0)} \frac{f(x, x) - 0}{\|(x, x)\|} = -\sqrt{2}.
\]

Theorem 7.18 (Differentiability and Continuity) Suppose that the k-variable function $f(\mathbf{x})$ is defined for all $\mathbf{x}$ in a neighborhood of $\mathbf{a}$. Then

1. If $f$ is differentiable at $\mathbf{a}$, then $f$ is continuous at $\mathbf{a}$.

2. If each partial derivative function, $f_{x_i}(\mathbf{x})$ for $i = 1, 2, \ldots, k$, is continuous in a neighborhood of $\mathbf{a}$, then $f$ is differentiable at $\mathbf{a}$.

Note that in the quadratic and cubic function examples, the partial derivative functions ($f_x$ and $f_y$) are defined and continuous for all pairs. In the absolute value function example, the partial derivative functions are not defined in a neighborhood of the origin.


7.3.4 Total Derivative

If each coordinate $x_i$ is a function of $t$,
\[
x_i = x_i(t) \quad \text{for } i = 1, 2, \ldots, k
\]
(or simply $\mathbf{x} = \mathbf{x}(t)$), then $f$ is also a function of $t$. The derivative of $f$ with respect to $t$ (when it exists) is called the total derivative of $f$.

Theorem 7.19 (Total Derivatives) If $f$ is differentiable in a neighborhood of $\mathbf{x}_0 = \mathbf{x}(t_0)$ and each coordinate function is differentiable at $t_0$, then the total derivative exists and its value can be computed as follows:
\[
\frac{df}{dt}(t_0) = f_{x_1}(\mathbf{x}_0)\,x_1'(t_0) + f_{x_2}(\mathbf{x}_0)\,x_2'(t_0) + \cdots + f_{x_k}(\mathbf{x}_0)\,x_k'(t_0).
\]
Note that if we let $\mathbf{x}'(t) = (x_1'(t), x_2'(t), \ldots, x_k'(t))$ be the vector of derivatives of the coordinate functions, then the total derivative can be written as the dot product of the gradient vector of $f$ with the derivative vector of $\mathbf{x}$:
\[
\frac{df}{dt}(t_0) = \nabla f(\mathbf{x}_0)\cdot\mathbf{x}'(t_0).
\]
Thus, the formula for the total derivative is a generalization of the chain rule for compositions of 1-variable functions.

Example 7.20 (k = 3, continued) Let f(x, y, z) = −2x y2 + z3 + 3x2 y z4,

x = x(t) =1

t, y = y(t) =

1

t2, z = z(t) = −2t

and t0 = 1. Since

x(t) =

(1

t,

1

t2,−2t

)⇒ x′(t) =

(− 1

t2,− 2

t3,−2

)⇒ x′(1) = (−1,−2,−2),

and ∇f(x(1)) = ∇f(1, 1,−2) = (94, 44,−84) (using the formula for ∇f derived earlier),the total derivative of f when t = 1 is

∇f(x(1))·x′(1) = (94, 44,−84)·(−1,−2,−2) = −14.
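As an illustration, the chain-rule value can be cross-checked numerically. The following Python sketch (assuming only numpy is available) compares ∇f(x(t0))·x′(t0) against a central finite-difference approximation of df/dt at t0 = 1; both come out to approximately −14.

import numpy as np

def f(p):
    x, y, z = p
    return -2*x*y**2 + z**3 + 3*x**2*y*z**4

def grad_f(p):
    x, y, z = p
    return np.array([
        -2*y**2 + 6*x*y*z**4,      # f_x
        -4*x*y + 3*x**2*z**4,      # f_y
        3*z**2 + 12*x**2*y*z**3,   # f_z
    ])

def x_of_t(t):
    return np.array([1/t, 1/t**2, -2*t])

def x_prime(t):
    return np.array([-1/t**2, -2/t**3, -2.0])

t0 = 1.0
chain_rule = grad_f(x_of_t(t0)) @ x_prime(t0)    # gradient dot derivative vector
h = 1e-6                                          # central finite difference in t
numeric = (f(x_of_t(t0 + h)) - f(x_of_t(t0 - h))) / (2*h)
print(chain_rule, numeric)                        # both approximately -14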

7.3.5 Directional Derivatives

If u is a unit vector, and f is a k-variable function defined in a neighborhood of a, then the directional derivative of f at a is

Duf(a) = lim_{h→0} (f(a + hu) − f(a))/h,

provided the limit exists.


In general, the function Duf(x) represents the rate of change of the function f with respect to a unit change in the direction given by u.

Theorem 7.21 (Directional Derivatives) If the k-variable function f is differentiable in a neighborhood of a and u is a unit vector, then

Duf(a) = ∇f(a)·u .

That is, the rate of change is the dot product of the gradient vector of f at a with u.

Bounds on directional derivatives. Let θ be the angle between vectors ∇f(a) and u (see page 128). Then, since ‖u‖ = 1,

Duf(a) = ∇f(a)·u = ‖∇f(a)‖ cos(θ).

Further, since the cosine function takes values in the interval [−1, 1], directional derivatives take values between −‖∇f(a)‖ and ‖∇f(a)‖. If ∇f(a) ≠ O, then

1. The minimum value is assumed when u = −(1/‖∇f(a)‖) ∇f(a), and

2. The maximum value is assumed when u = (1/‖∇f(a)‖) ∇f(a).

In the first case, the direction of u is called the direction of steepest descent; in the second case, it is called the direction of steepest ascent.

Example 7.22 (Quadratic Function, continued) Consider again the function

f(x, y) = 4 − x^2 − y^2 and let a = (0.5, −1).

The left part of Figure 7.5 shows contours corresponding to z = 0, 1, 2, 3, 4. The unit vector in the direction of steepest ascent at (0.5, −1) is highlighted.

Since ∇f(a) = (−1, 2), the direction of steepest ascent is u = (1/√5)(−1, 2), and the slope of the partial function in this direction is √5 ≈ 2.24.

Note that the contour corresponding to z = k is the set of all pairs (x, y) satisfying f(x, y) = k.

Example 7.23 (Cubic Function, continued) Consider again the function

f(x, y) = x^3 − 3xy + y^3 and let a = (1.5, 1).

The right part of Figure 7.5 shows contours corresponding to z = −1, 0, 1, 2, 3, 4. The unit vector in the direction of steepest ascent at (1.5, 1) is highlighted.

Since ∇f(a) = (3.75, −1.5), the direction of steepest ascent is u = (1/√16.3125)(3.75, −1.5), and the slope of the partial function in this direction is √16.3125 ≈ 4.04.
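For a numeric illustration, the following Python sketch (numpy only) recomputes the steepest-ascent direction for the cubic function and confirms that no unit direction produces a directional derivative larger than ‖∇f(a)‖ ≈ 4.04.

import numpy as np

def grad_f(x, y):
    return np.array([3*x**2 - 3*y, 3*y**2 - 3*x])

g = grad_f(1.5, 1.0)                 # (3.75, -1.5)
u = g / np.linalg.norm(g)            # direction of steepest ascent
print(u, np.linalg.norm(g))          # slope in that direction, about 4.04

# Compare against directional derivatives over a sweep of unit directions:
thetas = np.linspace(0, 2*np.pi, 361)
slopes = [g @ np.array([np.cos(t), np.sin(t)]) for t in thetas]
print(max(slopes))                   # approximately ||grad f(a)|| = 4.04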


Figure 7.5: Contour plots of z = 4 − x^2 − y^2 (left part) and z = x^3 − 3xy + y^3 (right part), with directions of steepest ascent highlighted. (See Figure 7.4 also.)

7.4 Optimization

Definitions of local and global extrema for functions of k variables directly generalize the definitions given in Section 4.4 for 1-variable functions. In addition, Fermat's theorem can be generalized as follows:

Theorem 7.24 (Fermat's Theorem) If the k-variable function f has a local extremum at c and if f is differentiable in a neighborhood of c, then ∇f(c) = O.

Example 7.25 (Quadratic Function, continued) Consider again

f(x, y) = 4 − x^2 − y^2 (left parts of Figures 7.4 and 7.5).

f has a local (and global) maximum at the origin and ∇f(0, 0) = (0, 0).

Example 7.26 (Cubic Function, continued) Consider again

f(x, y) = x^3 − 3xy + y^3 (right parts of Figures 7.4 and 7.5).

f has a local minimum at (1, 1) and ∇f(1, 1) = (0, 0).

Note that ∇f(0, 0) = (0, 0), but (0, 0) is neither a local max nor a local min.

Critical points. A point c is a critical point (or stationary point) of the k-variable function f if either ∇f(c) = O or if one of the partial derivatives does not exist at c.

For example, the quadratic function f(x, y) = 4 − x^2 − y^2 has a single critical point at the origin; the cubic function f(x, y) = x^3 − 3xy + y^3 has critical points (1, 1) and (0, 0).

7.4.1 Linear and Quadratic Approximations

The definition of Taylor polynomials given in Section 6.6 can be generalized to k-variable functions. This section considers first-order and second-order approximations only.


Linear approximation. Suppose that the k-variable function f is differentiable in a neighborhood of a. The first-order Taylor polynomial centered at a is defined as follows:

p1(x) = f(a) + ∇f(a)·(x − a).

Note that, since f is differentiable, the error in using p1 as an approximation to f can be made as small as we like by taking x to be close enough to a.

If we write the list of partial derivatives as the 1-by-k matrix Df(a), the difference vector (x − a) as a k-by-1 matrix, and consider the matrix product

Df(a)(x − a) = [ fx1(a) fx2(a) · · · fxk(a) ] [ x1 − a1, x2 − a2, . . . , xk − ak ]^T

= fx1(a)(x1 − a1) + fx2(a)(x2 − a2) + · · · + fxk(a)(xk − ak)

(a 1-by-1 matrix is equivalent to a scalar), then the first-order Taylor polynomial can be written as follows:

p1(x) = f(a) + Df(a)(x − a).

Quadratic approximation. The Hessian of a k-variable function f is the k-by-k matrix whose (i, j)-entry is fxi,xj:

Hf(x) =
[ fx1,x1(x)  fx1,x2(x)  · · ·  fx1,xk(x) ]
[ fx2,x1(x)  fx2,x2(x)  · · ·  fx2,xk(x) ]
[    ...        ...     · · ·     ...    ]
[ fxk,x1(x)  fxk,x2(x)  · · ·  fxk,xk(x) ]

Suppose that the k-variable function f has continuous first- and second-order partial derivatives in a neighborhood of a. Then the second-order Taylor polynomial centered at a is defined as follows:

p2(x) = f(a) + Df(a)(x − a) + (1/2)(x − a)^T Hf(a)(x − a)

where Df(a) is the list of partial derivatives written as a 1-by-k matrix, (x − a) is the difference vector written as a k-by-1 matrix, and Hf(a) is the k-by-k Hessian matrix at a.

Example 7.27 (Ratio Function) Let f(x, y) = y/x and a = (1, 1).

1. The first- and second-order partial derivatives of f are:

fx(x, y) = −y/x^2, fy(x, y) = 1/x,

fxx(x, y) = 2y/x^3, fxy(x, y) = fyx(x, y) = −1/x^2 and fyy(x, y) = 0.


2. Since f(a) = 1, fx(a) = −1 and fy(a) = 1, the first-order Taylor polynomial centered at a is

p1(x, y) = 1 + [ −1 1 ] [ x − 1, y − 1 ]^T = 1 − (x − 1) + (y − 1).

3. Further, since fxx(a) = 2, fxy(a) = fyx(a) = −1 and fyy(a) = 0, the second-order Taylor polynomial centered at a is

p2(x, y) = 1 + [ −1 1 ] [ x − 1, y − 1 ]^T + (1/2) [ x − 1 y − 1 ] [ 2 −1; −1 0 ] [ x − 1, y − 1 ]^T

= 1 − (x − 1) + (y − 1) + (x − 1)^2 − (x − 1)(y − 1).

Theorem 7.28 (Second-Order Taylor Polynomials) Suppose that the k-variable function f has continuous first- and second-order partial derivatives in a neighborhood of a, and let p2(x) be the second-order Taylor polynomial centered at a. Then

lim_{x→a} (f(x) − p2(x))/‖x − a‖^2 = 0.

That is, the error function e(x) = f(x) − p2(x) approaches zero at a rate faster than the rate at which the square of the distance between x and a approaches zero.

7.4.2 Second Derivative Test

Suppose that the k-variable function f has continuous first- and second-order partial derivatives in a neighborhood of the critical point a. Since Df(a) is the zero vector,

p2(x) = f(a) + (1/2)(x − a)^T Hf(a)(x − a).

If the term (x − a)^T Hf(a)(x − a) is positive for all x sufficiently close (but not equal) to a, then f has a local minimum at a. Similarly, if the term (x − a)^T Hf(a)(x − a) is negative for all x sufficiently close (but not equal) to a, then f has a local maximum at a.

The second derivative test gives criteria for determining when the quadratic term in the second-order approximation is always positive or always negative for x near a. Matrix notation and operations are useful in developing the test.

Symmetric matrices and quadratic forms. The n-by-n matrix A is symmetric if aij = aji for all i and j. Symmetric matrices have the property that A^T = A.

If A is a symmetric n-by-n matrix and h is an n-by-1 matrix (a column vector), then

Q(h) = h^T A h = Σ_{i=1}^n Σ_{j=1}^n aij hi hj

is called a quadratic form in h.


Positive and negative definite quadratic forms. The quadratic form Q (respectively, the associated symmetric matrix A) is said to be positive definite if Q(h) > 0 when h ≠ O and is said to be negative definite if Q(h) < 0 when h ≠ O.

If the k-variable function f has continuous first- and second-partial derivatives in a neighborhood of a, then by Clairaut's theorem (page 135) the Hessian matrix Hf(a) is a symmetric k-by-k matrix. Our concern is whether Hf(a) is positive definite or negative definite.

Principal minors and determinants. If A is a k-by-k matrix, then the j-by-j principal minor of A is the matrix

Aj =
[ a11 a12 · · · a1j ]
[ a21 a22 · · · a2j ]
[ ...  ... · · · ... ]
[ aj1 aj2 · · · ajj ]

for j = 1, 2, . . . , k.

Let dj = det(Aj) be the determinant of the j-by-j principal minor (j > 1) and let d1 = a11.

For example, if A =
[  4  1  2 ]
[ −1  2  1 ]
[  1  3 −3 ]
then d1 = 4, d2 = 9 and d3 = −48.

Second derivative test for local extrema. Suppose that the k-variable function f has continuous first- and second-order partial derivatives in a neighborhood of the critical point a, let Hf(a) be the Hessian matrix at a, and let d1, d2, . . ., dk be the sequence of determinants of the principal minors of Hf(a).

Assume that dk = det(Hf(a)) ≠ 0. Then

1. If di > 0 for i = 1, 2, . . . , k, then f has a local minimum at a.

2. If d1 < 0, d2 > 0, d3 < 0, . . ., then f has a local maximum at a.

3. If the patterns given in the first two steps do not hold, then f has a saddle point at the point a. That is, there is neither a maximum nor a minimum at a.

If dk = det(Hf(a)) = 0, then anything can happen. (The critical point may correspond to a local maximum, a local minimum, or a saddle point.)

Example 7.29 (Quadratic Function, continued) Consider again

f(x, y) = 4 − x^2 − y^2 (left parts of Figures 7.4 and 7.5).

Since ∇f(x, y) = (−2x, −2y) = (0, 0) ⇒ x = y = 0,

the only critical point is the origin. Since

Hf(0, 0) = [ −2 0; 0 −2 ] ⇒ d1 = −2 and d2 = 4,

the second derivative test implies that f has a local maximum at the origin.


Example 7.30 (Cubic Function, continued) Consider again

f(x, y) = x^3 − 3xy + y^3 (right parts of Figures 7.4 and 7.5).

Since ∇f(x, y) = (3x^2 − 3y, −3x + 3y^2) = (0, 0) ⇒ (x, y) = (1, 1) or (0, 0),

there are two critical points to consider.

1. Let (x, y) = (1, 1). Since

Hf(1, 1) = [ 6 −3; −3 6 ] ⇒ d1 = 6 and d2 = 27,

the second derivative test implies that f has a local minimum at (1, 1).

2. Let (x, y) = (0, 0). Since

Hf(0, 0) = [ 0 −3; −3 0 ] ⇒ d1 = 0 and d2 = −9,

the second derivative test implies that f has a saddle point at the origin.

Note that if we let y = −x, then the 1-variable function z = f(x, −x) = 3x^2 has a local minimum at the origin; if we let y = x, then the 1-variable function z = f(x, x) = 2x^3 − 3x^2 has a local maximum at the origin.
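The principal-minor computations in these examples are easy to automate. The following Python sketch (numpy only; the helper name classify_critical_point is ours, chosen for illustration) applies the second derivative test to the two Hessians above.

import numpy as np

def classify_critical_point(H):
    """Classify via leading principal minors; assumes det(H) != 0."""
    k = H.shape[0]
    d = [np.linalg.det(H[:j, :j]) for j in range(1, k + 1)]
    if all(dj > 0 for dj in d):
        return d, "local minimum"
    if all((-1)**j * dj > 0 for j, dj in enumerate(d, start=1)):
        return d, "local maximum"
    return d, "saddle point"

H1 = np.array([[6.0, -3.0], [-3.0, 6.0]])   # Hf(1,1) for x^3 - 3xy + y^3
H0 = np.array([[0.0, -3.0], [-3.0, 0.0]])   # Hf(0,0)
print(classify_critical_point(H1))          # d = [6, 27]  -> local minimum
print(classify_critical_point(H0))          # d = [0, -9]  -> saddle point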

7.4.3 Constrained Optimization and Lagrange Multipliers

In many applications of calculus the goal is to optimize (either maximize or minimize) a real-valued function of k variables, f(x), subject to one or more constraints:

g1(x) = c1, g2(x) = c2, . . . , gp(x) = cp, where the ci are constants, and p < k.

Solve and substitute method. It may be possible to use the constraints to reduce the number of variables in the optimization problem, as illustrated in the following example.

Example 7.31 (Constrained Minimization) Let x represent length, y represent width, and z represent height of an aquarium. Assume that the cost of the slate needed for the bottom of the aquarium is 4 times the cost of the glass needed for the sides. Then 4xy + 2xz + 2yz is proportional to the total cost for these materials.

The problem of interest is to find positive numbers x, y, z to

Minimize 4xy + 2xz + 2yz when the volume V = xyz = 16 cubic units.

The volume constraint can be solved for z, and the result substituted into the 3-variable function we would like to minimize, yielding the following 2-variable problem:

Minimize f(x, y) = 4xy + 32/y + 32/x over the domain where x > 0 and y > 0.


Figure 7.6: Surface (left) and contour (right) plots of f(x, y) = 4xy + 32/y + 32/x. Contours correspond to f(x, y) = 49, 51, 53, 55, 57.

(See Figure 7.6.) Working with this function,

∇f(x, y) = (4y − 32/x^2, 4x − 32/y^2) and Hf(x, y) = [ 64/x^3 4; 4 64/y^3 ].

Since ∇f(x, y) = (0, 0) when x = y = 2, Hf(2, 2) = [ 8 4; 4 8 ], d1 = 8 and d2 = 48, we know that f has a local (and global) minimum at (2, 2). The minimum value is 48.

The dimensions of the aquarium should be x = 2, y = 2, z = 4.

Method of Lagrange Multipliers. A general method, which does not require the "solve and substitute" approach outlined in the example above, depends on the following theorem.

Theorem 7.32 (Lagrange Multipliers) Assume that the k-variable functions f and g1, g2, . . ., gp have continuous first partial derivatives on the domain

D = {x : g1(x) = c1, g2(x) = c2, . . . , gp(x) = cp},

f has an extremum on this domain at a, and that

∇g1(a), ∇g2(a), . . . , ∇gp(a) is a linearly independent set (see below).

Then there exist constants λ1, . . ., λp (called Lagrange multipliers) satisfying

∇f(a) = λ1∇g1(a) + λ2∇g2(a) + · · · + λp∇gp(a).

Note that when p = 1, the linear independence requirement in Lagrange's theorem reduces to ∇g1(a) ≠ O. When p > 1, the requirement means that

The linear combination Σ_{i=1}^p ai ∇gi(a) = O only when each coefficient ai = 0.


The theorem can be used to find critical points. When p = 1, we drop the subscript on the single constraint and solve

∇f(x) = λ∇g(x) and g(x) = c for x1, x2, . . . , xk and λ.

Example 7.33 (Constrained Minimization, continued) Consider again the aquarium problem and let f(x, y, z) = 4xy + 2xz + 2yz and g(x, y, z) = xyz. Then

∇f(x, y, z) = (4y + 2z, 4x + 2z, 2x + 2y) and ∇g(x, y, z) = (yz, xz, xy).

Critical points need to satisfy the following system of 4 equations in 4 unknowns:

4y + 2z = λyz, 4x + 2z = λxz, 2x + 2y = λxy and xyz = 16.

Since x, y and z must be positive, we can solve the first 3 equations for λ:

λ = 4/z + 2/y, λ = 4/z + 2/x, λ = 2/y + 2/x.

From equations 1 and 2 we get y = x, and from equations 1 and 3 we get z = 2x. Now,

16 = xyz = x(x)(2x) ⇒ x = 2, y = 2, z = 4.

Finally, λ = 2 as well. (Note that λ can be interpreted as the rate of change of cost with respect to a unit change in the volume constraint.)
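The Lagrange system can also be solved numerically. A minimal Python sketch using scipy.optimize.fsolve (any root finder would do; we assume a positive starting guess close enough for convergence) recovers (λ, x, y, z) ≈ (2, 2, 2, 4):

import numpy as np
from scipy.optimize import fsolve

def lagrange_system(vars):
    lam, x, y, z = vars
    return [
        4*y + 2*z - lam*y*z,   # dL/dx = 0
        4*x + 2*z - lam*x*z,   # dL/dy = 0
        2*x + 2*y - lam*x*y,   # dL/dz = 0
        x*y*z - 16,            # volume constraint
    ]

print(fsolve(lagrange_system, [1.0, 1.0, 1.0, 1.0]))  # approx [2, 2, 2, 4]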

Lagrangian function. The method outlined in Lagrange's theorem can be coded using the following (p + k)-variable Lagrangian function:

L(λ1, . . . , λp, x1, . . . , xk) = f(x1, . . . , xk) − Σ_{i=1}^p λi (gi(x1, . . . , xk) − ci).

For simplicity, we use the notation L(λ,x).

Critical points of L satisfy the result of Lagrange's theorem. A special form of the second derivative test allows us to determine in some cases whether critical points are local extrema.

Second derivative test for constrained local extrema. Let a be a constrained critical point of f subject to the constraints gi(x) = ci (for all i), and assume that all first- and second-partial derivatives are continuous on the constrained domain.

Let d1, d2, . . ., dp+k be the determinants of the principal minors of HL(λ, a) and consider the subsequence of (k − p) values:

(−1)^p d2p+1, (−1)^p d2p+2, . . . , (−1)^p dp+k.

Assume that dp+k = det(HL(λ, a)) ≠ 0. Then

1. If all terms in the subsequence are positive, then f has a constrained local minimum value at a.


2. If the first term in the subsequence is negative, and the signs alternate, then f has a constrained local maximum value at a.

3. If the patterns given in the first two steps do not hold, then f has a constrained saddle point at a.

Note that if dp+k = det(HL(λ,a)) = 0, then anything can happen.

Example 7.34 (Constrained Minimization, continued) Returning to the problem above,

L(λ, x, y, z) = 4xy + 2xz + 2yz − λ(xyz − 16),

∇L(λ, x, y, z) = (16 − xyz, 4y + 2z − yzλ, 4x + 2z − xzλ, 2x + 2y − xyλ), and

HL(λ, x, y, z) =
[   0     −yz      −xz      −xy   ]
[ −yz      0     4 − zλ   2 − yλ ]
[ −xz   4 − zλ     0      2 − xλ ]
[ −xy   2 − yλ   2 − xλ     0    ]

If λ = 2, x = 2, y = 2, z = 4, then d1 = 0, d2 = −64, d3 = −512, d4 = −768.

We need to consider the subsequence −d3 = 512 and −d4 = 768. Since both terms are positive, f has a constrained local minimum when x = 2, y = 2, z = 4.

Example 7.35 (k = 3, p = 2) Consider the following minimization problem:

Minimize x^2 + y^2 + z^2 subject to x + y + z = 12 and x + 2y + 3z = 18

and let

L(λ, µ, x, y, z) = x^2 + y^2 + z^2 − λ(x + y + z − 12) − µ(x + 2y + 3z − 18).

Then ∇L(λ, µ, x, y, z) =

(12 − x − y − z, 18 − x − 2y − 3z, 2x − λ − µ, 2y − λ − 2µ, 2z − λ − 3µ),

HL(λ, µ, x, y, z) =
[  0  0 −1 −1 −1 ]
[  0  0 −1 −2 −3 ]
[ −1 −1  2  0  0 ]
[ −1 −2  0  2  0 ]
[ −1 −3  0  0  2 ]

and d1 = d2 = d3 = 0, d4 = 1 and d5 = 12 (for all x, y, z).

Since ∇L(λ, µ, x, y, z) = O ⇒ λ = 20, µ = −6, x = 7, y = 4, z = 1,

and (−1)^2 d5 = 12 > 0, there is a constrained minimum at (x, y, z) = (7, 4, 1).

Footnote: The minimization problem above corresponds to finding the point on the intersection of the planes x + y + z = 12 and x + 2y + 3z = 18 which is closest to the origin.


7.4.4 Method of Steepest Descent/Ascent

The first step in solving an optimization problem is to find the critical points of f by solving the derivative equation ∇f(x) = O. Since this equation may be difficult to solve, a variety of approximate methods have been developed. This section focuses on methods based on properties of directional derivatives and gradients. (See Section 7.3.5.)

Method of steepest descent. Suppose that f has continuous partial derivatives in a neighborhood of a and has a local minimum at a. If x0 (an initial point) is close enough to the point a and ∇f(x0) ≠ O, then

The minimum value of Duf(x0) occurs when u = −∇f(x0)/‖∇f(x0)‖.

That is, the directional derivative is minimized in the direction of the negative of the gradient at the point x0.

The idea of the method of steepest descent is to move in the direction of steepest descent as far as possible with the hope of getting closer to a. Specifically, let g(t) be the following 1-variable function:

g(t) = f( x0 − t ∇f(x0)/‖∇f(x0)‖ ).

Note that g(0) = f(x0). In addition, for values of t near zero, g(t) is decreasing. Let t0 > 0 be the critical number of g closest to 0 and let x1 = x0 − t0(∇f(x0)/‖∇f(x0)‖).

In general, the process is repeated, defining

xi+1 = xi − ti(∇f(xi)/‖∇f(xi)‖),

until ti is close enough to zero.

Example 7.36 (Constrained Minimization, continued) The method is illustrated using a familiar situation:

Minimize f(x, y) = 4xy + 32/y + 32/x over the domain where x > 0 and y > 0.

(See Figure 7.6.)

The following table shows the results of the first eight steps of the algorithm, using (3.2, 4.1) as the initial point, and using four decimal places of accuracy:

i 0 1 2 3 4 5 6 7

ti 2.0973 0.9084 0.1588 0.0458 0.0038 0.0007 0.0000 0.0000

xi 3.2 1.5789 2.1553 2.0325 2.0034 2.0005 2.0000 2.0000

yi 4.1 2.7694 2.0672 1.9664 2.0019 1.9995 2.0000 2.0000

Both t6 and t7 are virtually zero (and x6 and x7 are virtually equal), indicating that the algorithm has identified (2, 2) as a local minimum point.

Note that Newton’s method (Section 4.4.5) was used to find each ti.
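A minimal implementation of the method is sketched below in Python (numpy and scipy assumed). It replaces the Newton step used in the notes with a bounded 1-variable line search to locate each ti; starting from (3.2, 4.1) the iterates approach (2, 2) as in the table.

import numpy as np
from scipy.optimize import minimize_scalar

def f(p):
    x, y = p
    return 4*x*y + 32/y + 32/x

def grad(p):
    x, y = p
    return np.array([4*y - 32/x**2, 4*x - 32/y**2])

p = np.array([3.2, 4.1])                       # initial point from the example
for i in range(8):
    u = -grad(p) / np.linalg.norm(grad(p))     # direction of steepest descent
    g = lambda t: f(p + t*u)                   # 1-variable slice along u
    t = minimize_scalar(g, bounds=(0, 5), method="bounded").x
    p = p + t*u
    print(i, t, p)                             # t_i shrinks toward 0, p -> (2, 2)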


Figure 7.7: Weight change in grams (vertical axis) versus dosage level in 100 mg/kg/day (horizontal axis) for ten animals. The gray curve is y = −0.49x^2 + 1.98x + 6.70.

Method of steepest ascent. To approximate a local maximum, the function

g(t) = f (xi + t (∇f(xi)/‖∇f(xi)‖))

is used to find ti, and the formula

xi+1 = xi + ti(∇f(xi)/‖∇f(xi)‖)

is used to find the next approximate point.

7.5 Applications in Statistics

Optimization is an essential part of many problems in statistics. In this section, optimization techniques are used to solve problems in least squares and maximum likelihood estimation.

7.5.1 Least Squares Estimation in Quadratic Models

Given n data pairs (x1, y1), (x2, y2), . . ., (xn, yn), whose values lie close to a quadratic curve of the form

y = ax^2 + bx + c,

the method of least squares, developed by Legendre in the early 1800's, allows us to estimate the unknown parameters: Find the values of a, b, and c which minimize the sum of squared deviations of observed y-coordinates from the values expected if the data points fit the curve exactly. That is, minimize the following 3-variable function:

f(a, b, c) = Σ_{i=1}^n (yi − (a xi^2 + b xi + c))^2.

Example 7.37 (Toxicology Study) (Fine & Bosch, JASA 94:375-382, 2000) As part of a study to determine the adverse effects of a proposed drug for the treatment of tuberculosis, female rats were given the drug for a period of 14 days at each of five dosage levels (in 100 milligrams per kilogram per day).


The vertical axis in Figure 7.7 is the weight change in grams (defined as the weight at the end of the period minus the weight at the beginning of the period) for ten animals in the study, and the horizontal axis shows the dose in 100 mg/kg/day.

For these data,

f(a, b, c) = 562.63 + 703.45a + 7612.13a^2 − 15.1b + 2223.5ab + 172.5b^2 − 88.6c + 345ac + 62bc + 10c^2,

∇f(a, b, c) = (703.45 + 15224.3a + 2223.5b + 345c, −15.1 + 2223.5a + 345b + 62c, −88.6 + 345a + 62b + 20c)

and ∇f(a, b, c) = O when a = −0.49, b = 1.98, and c = 6.70.

Since

Hf =
[ 15224.3  2223.5  345 ]
[  2223.5   345     62 ]
[   345      62     20 ]

d1 = 15224.3, d2 ≈ 3.1 × 10^5, d3 ≈ 1.7 × 10^6,

and f is a quadratic function, f is minimized when a = −0.49, b = 1.98, and c = 6.70. The least squares quadratic fit formula is shown in Figure 7.7.
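Because f is quadratic, ∇f(a, b, c) = O is a linear system whose coefficient matrix is the (constant) Hessian. The following Python sketch (numpy only) solves that system using the coefficients printed above and recovers a ≈ −0.49, b ≈ 1.98, c ≈ 6.70.

import numpy as np

H = np.array([[15224.3, 2223.5, 345.0],
              [ 2223.5,  345.0,  62.0],
              [  345.0,   62.0,  20.0]])       # Hessian of f (constant)
rhs = np.array([-703.45, 15.1, 88.6])          # constants moved to the right side
a, b, c = np.linalg.solve(H, rhs)
print(a, b, c)                                 # approx -0.49, 1.98, 6.70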

Linear algebra and least squares. A linear algebra approach to least squares analysis is discussed in Sections 9.5.5 and 9.8.1 of these notes.

7.5.2 Maximum Likelihood Estimation in Multinomial Models

This section generalizes ideas from Section 4.5.1.

Consider the problem of estimating the proportions p1, . . . , pk in a multinomial experiment. Let p = (p1, p2, . . . , pk), and let xi be the observed number of occurrences of outcome i in n trials (for i = 1, 2, . . . , k). We will demonstrate that the sample proportions, pi = xi/n, are maximum likelihood estimates of the proportions in the multinomial experiment using constrained optimization techniques.

The likelihood function, Lik(p), is the joint PDF with n and the xi fixed and p varying:

Lik(p) = ( n; x1, x2, . . . , xk ) p1^x1 p2^x2 · · · pk^xk,

where ( n; x1, x2, . . . , xk ) denotes the multinomial coefficient.

The log-likelihood function, ℓ(p), is the natural logarithm of the likelihood:

ℓ(p) = ln(Lik(p)) = ln ( n; x1, x2, . . . , xk ) + x1 ln(p1) + x2 ln(p2) + · · · + xk ln(pk).

We wish to maximize ℓ(p) subject to the constraint that the sum of the proportions is 1. The Lagrangian function is

L(λ,p) = ℓ(p) − λ(p1 + p2 + · · · + pk − 1),


and the gradient of the Lagrangian function is

∇L(λ,p) = (1 − p1 − p2 − · · · − pk, x1/p1 − λ, x2/p2 − λ, . . . , xk/pk − λ).

If all quantities are positive, then ∇L(λ,p) = O implies

1 = Σ_i pi and pi = xi/λ for each i.

These equations imply that λ = n, and that (λ, p1, . . . , pk) = (n, x1/n, x2/n, . . . , xk/n) is a critical point.

Example 7.38 (k = 3) To check that the result is a maximum, we consider the special case when

k = 3, n = 40, x1 = 18, x2 = 12 and x3 = 10.

Then

HL =
[  0     −1      −1     −1   ]
[ −1   −800/9     0      0   ]
[ −1      0    −400/3    0   ]
[ −1      0       0    −160  ]

d1 = 0, d2 = −1, d3 = 2000/9, d4 = −1280000/27.

Since −d3 < 0 and −d4 > 0, the second derivative test for constrained local extrema implies that there is a constrained local maximum when p1 = 0.45, p2 = 0.30, and p3 = 0.25 (and λ = 40). There is both a local and global maximum at (p1, p2, p3) = (0.45, 0.30, 0.25).
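As a numeric sanity check (a sketch, not part of the derivation), the following Python code compares the log-likelihood at the sample proportions against the log-likelihood at random points of the simplex; the multinomial coefficient is omitted since it does not depend on p.

import numpy as np

x = np.array([18, 12, 10])
n = x.sum()                          # 40

def loglik(p):
    return float(x @ np.log(p))     # constant term dropped

p_hat = x / n                        # (0.45, 0.30, 0.25)
rng = np.random.default_rng(0)
trials = rng.dirichlet(np.ones(3), size=10000)   # random points on the simplex
assert all(loglik(p) <= loglik(p_hat) for p in trials)
print(p_hat, loglik(p_hat))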

7.6 Multiple Integration

This section generalizes ideas from Section 4.7.

7.6.1 Riemann Sums and Double Integrals over Rectangles

Suppose that the 2-variable function f(x, y) is continuous on the rectangle

R = [a, b] × [c, d] = {(x, y) : a ≤ x ≤ b, c ≤ y ≤ d} ⊂ R^2

and let m and n be positive integers. Further, let ∆x = (b − a)/m, ∆y = (d − c)/n,

1. xi = a + i∆x for i = 0, 1, . . . , m, x∗i = (xi−1 + xi)/2 for i = 1, 2, . . . , m,

2. yj = c + j∆y for j = 0, 1, . . . , n, and y∗j = (yj−1 + yj)/2 for j = 1, 2, . . . , n.

Then the (m + 1) numbers

a = x0 < x1 < x2 < · · · < xm = b

partition [a, b] into m subintervals of equal length with midpoints x∗i , the (n + 1) numbers

c = y0 < y1 < y2 < · · · < yn = d


Figure 7.8: Plot of z = 3x^2 + y + 2 on [0, 4] × [0, 6] (left) and approximating boxes (right).

partition [c, d] into n subintervals of equal length with midpoints y∗j, and the mn subrectangles

Ri,j = [xi−1, xi] × [yj−1, yj]

form an m-by-n partition of the rectangle R. Each subrectangle has area ∆A = (∆x)(∆y).

The double integral of f over R is

∫∫_R f(x, y) dA = lim_{m,n→∞} Σ_{i=1}^m Σ_{j=1}^n f(x∗i, y∗j) ∆A.

In this formula, f(x, y) is the integrand, and the sum on the right is called a Riemann sum.

Continuity and Integrability. f(x, y) is said to be integrable on the rectangle R if the limit above exists. Continuous functions are always integrable. Further, it is possible to show that the limit exists when the values of f are bounded on R and the set of discontinuities has area zero.

Volumes. If f is nonnegative on R, then the double integral over R can be interpreted as the volume under the surface z = f(x, y) and above the xy-plane for (x, y) ∈ R.

Example 7.39 (Computing Volume) Let f(x, y) = 3x^2 + y + 2 on R = [0, 4] × [0, 6]. To estimate the volume under the surface z = f(x, y) and above the xy-plane over R (left part of Figure 7.8), we let m = 4, n = 6 and ∆x = ∆y = 1. The 24 midpoints are

(x∗, y∗), where x∗ = 0.5, 1.5, 2.5, 3.5 and y∗ = 0.5, 1.5, . . . , 5.5,

and the value of the Riemann sum is 498.0. This value is the sum of the volumes of the 24 rectangular boxes shown in the right part of Figure 7.8.

The following list of approximate values suggests that the limit is 504.

m 4 8 16 32 64 128 256

n 6 12 24 48 96 192 384

Sum 498.0 502.5 503.625 503.906 503.977 503.994 503.999
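The midpoint Riemann sums in the table can be reproduced directly. A short Python sketch (numpy only; the helper name midpoint_sum is ours):

import numpy as np

def midpoint_sum(m, n, a=0.0, b=4.0, c=0.0, d=6.0):
    dx, dy = (b - a)/m, (d - c)/n
    xs = a + dx*(np.arange(m) + 0.5)          # midpoints x_i*
    ys = c + dy*(np.arange(n) + 0.5)          # midpoints y_j*
    X, Y = np.meshgrid(xs, ys)
    return float(np.sum((3*X**2 + Y + 2) * dx * dy))

for m, n in [(4, 6), (8, 12), (64, 96), (256, 384)]:
    print(m, n, midpoint_sum(m, n))           # 498.0, 502.5, ..., approaching 504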


Figure 7.9: Area slices of z = 3x^2 + y + 2 on [0, 4] × [0, 6] for fixed x = 0, 0.5, . . . , 4 (left) and for fixed y = 0, 0.5, . . . , 6 (right).

Average values. The quantity

(1/((b − a)(d − c))) ∫∫_R f(x, y) dA

can be interpreted as the average value of f(x, y) on R.

To see this, let f̄ = (1/(mn)) Σ_{i,j} f(x∗i, y∗j) be the sample average, and let A = (b − a)(d − c) be the area of the rectangle. Then

f̄ = (A/A) (1/(mn)) Σ_{i,j} f(x∗i, y∗j) = (1/A) Σ_{i,j} f(x∗i, y∗j) ∆A ⇒ lim_{m,n→∞} f̄ = (1/A) ∫∫_R f(x, y) dA.

For example, the list of approximations in the previous example suggests that the average value of f(x, y) = 3x^2 + y + 2 on [0, 4] × [0, 6] is 504/24 = 21.

7.6.2 Iterated Integrals over Rectangles

Suppose that f is a continuous, 2-variable function on the rectangle R = [a, b] × [c, d].

We use the notation

∫_c^d f(x, y) dy

to mean that x is held fixed and f is integrated as y varies from c to d. Similarly, we use the notation

∫_a^b f(x, y) dx

to mean that y is held fixed and f is integrated as x varies from a to b.

If f is nonnegative on R, then the results can be interpreted as areas.

Example 7.40 (Area Slices) Consider again f(x, y) = 3x^2 + y + 2 on [0, 4] × [0, 6].


The left part of Figure 7.9 shows area slices related to the partial functions of f when x = 0, 0.5, 1, . . . , 4. The area of each slice as a function of x is

Ax(x) = ∫_0^6 (3x^2 + y + 2) dy = [ 3x^2 y + (1/2)y^2 + 2y ]_0^6 = 18x^2 + 30,

and the areas of the displayed slices are 30, 34.5, 48, . . ., 318.

The right part of Figure 7.9 shows area slices related to the partial functions of f when y = 0, 0.5, 1, . . . , 6. The area of each slice as a function of y is

Ay(y) = ∫_0^4 (3x^2 + y + 2) dx = [ x^3 + yx + 2x ]_0^4 = 4y + 72,

and the areas of the displayed slices are 72, 74, 76, . . ., 96.

Cavalieri's principle. Continuing with the area application, Cavalieri's principle (also called the method of slicing) says that the volume under the surface can be obtained by integrating the area function:

Volume = ∫_c^d Ay(y) dy = ∫_c^d ∫_a^b f(x, y) dx dy

= ∫_a^b Ax(x) dx = ∫_a^b ∫_c^d f(x, y) dy dx

The rightmost integrals on each line above are examples of iterated integrals. In each case, the inside integral is evaluated first, followed by the outside integral.

Example 7.41 (Cavalieri's Principle) Continuing with the example above, since

∫_0^4 (18x^2 + 30) dx = [ 6x^3 + 30x ]_0^4 = 504 and ∫_0^6 (4y + 72) dy = [ 2y^2 + 72y ]_0^6 = 504,

we know that the volume of the solid shown in the left part of Figure 7.8 is 504 cubic units.

Double versus iterated integrals. Fubini's theorem says that we can compute double integrals by computing iterated integrals, where the iteration can be done in either order.

Theorem 7.42 (Fubini's Theorem) Suppose that f is a continuous, 2-variable function on the rectangle R = [a, b] × [c, d]. Then

∫∫_R f(x, y) dA = ∫_c^d ∫_a^b f(x, y) dx dy = ∫_a^b ∫_c^d f(x, y) dy dx.

Note that Fubini's theorem remains true if f is a bounded function on R satisfying (1) the set of discontinuities of f has area 0, and (2) each slice (for fixed x or for fixed y) intersects the set of discontinuities in at most a finite number of points.


Figure 7.10: Plot of z = 10/x^2 − 90/(x + 2y)^2 on [1, 2] × [0, 2] (left plot) and function domain with y = x superimposed (right plot).

Example 7.43 (Difference in Volumes) Let

f(x, y) = 10/x^2 − 90/(x + 2y)^2 on R = [1, 2] × [0, 2],

as illustrated in Figure 7.10.

By Fubini's theorem,

∫∫_R f(x, y) dA = ∫_1^2 ∫_0^2 f(x, y) dy dx

= ∫_1^2 ( 20/x^2 − 45/x + 45/(4 + x) ) dx

= [ −20/x − 45 ln(x) + 45 ln(4 + x) ]_1^2 ≈ −12.9871.

Note that, given (x, y) ∈ R, f(x, y) ≥ 0 when y ≥ x and f(x, y) ≤ 0 when y ≤ x. Thus, this answer can be interpreted as the difference between (1) the volume of the solid bounded by z = f(x, y) and the xy-plane for (x, y) ∈ R with y ≥ x and (2) the volume of the solid bounded by z = f(x, y) and the xy-plane for (x, y) ∈ R with y ≤ x.
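The value ≈ −12.9871 can be confirmed with an adaptive quadrature routine; a minimal Python sketch using scipy.integrate.dblquad (which integrates the inner variable first):

from scipy.integrate import dblquad

f = lambda y, x: 10/x**2 - 90/(x + 2*y)**2     # dblquad expects func(y, x)
val, err = dblquad(f, 1, 2, 0, 2)              # x in [1, 2], y in [0, 2]
print(val)                                     # approx -12.9871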

7.6.3 Properties of Double Integrals

Properties of double integrals generalize those for single integrals from Section 4.7.2.

Suppose that f(x, y) and g(x, y) are integrable functions on R, and let c be a constant. Then properties of limits imply:

1. Constant functions: ∫∫_R c dA = c × (Area of R).

2. Sums and differences: ∫∫_R (f(x, y) ± g(x, y)) dA = ∫∫_R f(x, y) dA ± ∫∫_R g(x, y) dA.

3. Constant multiples: ∫∫_R c f(x, y) dA = c ∫∫_R f(x, y) dA.

4. Null domain: If the area of R is 0, then ∫∫_R f(x, y) dA = 0.


Figure 7.11: Domains of type 1 (left) and of both types (right).

5. Ordering: If f(x, y) ≤ g(x, y) for all (x, y) ∈ R, then ∫∫_R f(x, y) dA ≤ ∫∫_R g(x, y) dA.

6. Absolute value: If |f| is integrable, then |∫∫_R f(x, y) dA| ≤ ∫∫_R |f(x, y)| dA.

7.6.4 Double and Iterated Integrals over General Domains

Double and iterated integrals can be evaluated over more general domains. Specifically,

1. Type 1: Let D = {(x, y) : a ≤ x ≤ b, ℓ(x) ≤ y ≤ u(x)}, where ℓ(x) and u(x) are continuous on [a, b], and let f be a continuous, 2-variable function on D. Then

∫∫_D f(x, y) dA = ∫_a^b ∫_{ℓ(x)}^{u(x)} f(x, y) dy dx.

2. Type 2: Let D = {(x, y) : c ≤ y ≤ d, ℓ(y) ≤ x ≤ u(y)}, where ℓ(y) and u(y) are continuous on [c, d], and let f be a continuous, 2-variable function on D. Then

∫∫_D f(x, y) dA = ∫_c^d ∫_{ℓ(y)}^{u(y)} f(x, y) dx dy.

The left part of Figure 7.11 illustrates a domain of type 1:

Dleft = {(x, y) : 0 ≤ x ≤ 1, 1 − x ≤ y ≤ 1 + x}.

The right part of the figure illustrates a domain of both types:

Dright = {(x, y) : 0 ≤ x ≤ 1, x^2 ≤ y ≤ x} = {(x, y) : 0 ≤ y ≤ 1, y ≤ x ≤ √y}.

Example 7.44 (Figure 7.11, Left Part) To illustrate computations, we compute the double integral of f(x, y) = 70xy^2 over the domain shown in the left part of Figure 7.11:

∫∫_{Dleft} 70xy^2 dA = ∫_0^1 ∫_{1−x}^{1+x} 70xy^2 dy dx = ∫_0^1 (140x^2 + 140x^4/3) dx = 56.

156

Page 167: MT580 Mathematics for Statistics - Boston College · MT580 Mathematics for Statistics: “All the Math You Never Had” Jenny A. Baglivo Mathematics Department Boston College Chestnut

Note that Dleft can be written as the union of two type 2 domains whose intersection is a set with area 0, and that the double integral can be computed as the sum of the values of two iterated integrals. Specifically,

Dleft = {(x, y) : 0 ≤ y ≤ 1, 1 − y ≤ x ≤ 1} ∪ {(x, y) : 1 ≤ y ≤ 2, y − 1 ≤ x ≤ 1}

(union of triangular regions with intersection the segment from (0, 1) to (1, 1)), and

∫∫_{Dleft} 70xy^2 dA = ∫_0^1 ∫_{1−y}^1 70xy^2 dx dy + ∫_1^2 ∫_{y−1}^1 70xy^2 dx dy = 21/2 + 91/2 = 56.
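Iterated integrals over type 1 domains can be evaluated numerically by passing the limit functions ℓ(x) and u(x) as callables. A minimal Python sketch with scipy.integrate.dblquad:

from scipy.integrate import dblquad

val, err = dblquad(lambda y, x: 70*x*y**2,     # integrand f(x, y)
                   0, 1,                       # x from 0 to 1
                   lambda x: 1 - x,            # lower limit l(x)
                   lambda x: 1 + x)            # upper limit u(x)
print(val)                                     # 56.0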

Example 7.45 (Difference in Volumes, continued) Consider again

f(x, y) = 10/x^2 − 90/(x + 2y)^2 on R = [1, 2] × [0, 2] (Figure 7.10).

The rectangle R can be written as the union of two type 1 domains whose intersection is the segment from (1, 1) to (2, 2):

D1 = R ∩ {(x, y) : y ≥ x} and D2 = R ∩ {(x, y) : y ≤ x}.

Since

∫∫_{D1} f(x, y) dA = ∫_1^2 ∫_x^2 f(x, y) dy dx = ∫_1^2 ( 20/x^2 − 25/x + 45/(4 + x) ) dx ≈ 0.8758

and

∫∫_{D2} f(x, y) dA = ∫_1^2 ∫_0^x f(x, y) dy dx = ∫_1^2 ( −20/x ) dx = [ −20 ln(x) ]_1^2 ≈ −13.8629,

the double integral ∫∫_R f(x, y) dA ≈ −12.9871, confirming the result obtained earlier.

Further, the total volume enclosed by

z = f(x, y), the xy-plane, x = 1, x = 2, y = 0 and y = 2

is approximately 0.8758 + 13.8629 = 14.7387.

Average values over general domains. If f is integrable over the general domain D with positive area A(D), then the average value of f over D is

Average of f over D = (1/A(D)) ∫∫_D f(x, y) dA.

Example 7.46 (Figure 7.11, Right Part) To illustrate computations, we find the average value of f(x, y) = 70xy^2 over the domain shown in the right part of Figure 7.11, and view the domain as a domain of type 1.


The double integral of f over Dright is

∫∫_{Dright} 70xy^2 dA = ∫_0^1 ∫_{x^2}^x 70xy^2 dy dx = ∫_0^1 (70x^4/3 − 70x^7/3) dx = 7/4.

The area of Dright is the double integral of the function 1 (or, equivalently, the single integral over [0, 1] of the difference between the value of y on the upper curve and the value of y on the lower curve):

∫∫_{Dright} 1 dA = ∫_0^1 ∫_{x^2}^x 1 dy dx = ∫_0^1 (x − x^2) dx = 1/6.

Finally, the average value is (7/4)/(1/6) = 21/2.

7.6.5 Triple and Iterated Integrals over Boxes

Suppose that the 3-variable function f(x, y, z) is continuous on the box

B = [a, b] × [c, d] × [p, q] = {(x, y, z) : a ≤ x ≤ b, c ≤ y ≤ d, p ≤ z ≤ q} ⊂ R^3

and let ℓ, m and n be positive integers. Further, let ∆x = (b − a)/ℓ, ∆y = (d − c)/m and ∆z = (q − p)/n,

1. xi = a + i∆x for i = 0, 1, . . . , ℓ, x∗i = (xi−1 + xi)/2 for i = 1, 2, . . . , ℓ,

2. yj = c + j∆y for j = 0, 1, . . . , m, y∗j = (yj−1 + yj)/2 for j = 1, 2, . . . , m,

3. zk = p + k∆z for k = 0, 1, . . . , n, and z∗k = (zk−1 + zk)/2 for k = 1, 2, . . . , n.

Then the (ℓ + 1) numbers

a = x0 < x1 < · · · < xℓ = b

partition [a, b] into ℓ subintervals of equal length with midpoints x∗i, the (m + 1) numbers

c = y0 < y1 < · · · < ym = d

partition [c, d] into m subintervals of equal length with midpoints y∗j, the (n + 1) numbers

p = z0 < z1 < · · · < zn = q

partition [p, q] into n subintervals of equal length with midpoints z∗k, and the ℓmn subboxes

Bi,j,k = [xi−1, xi] × [yj−1, yj] × [zk−1, zk] (i = 1, . . . , ℓ; j = 1, . . . , m; k = 1, . . . , n)

form an ℓ-by-m-by-n partition of B. Each subbox has volume ∆V = (∆x)(∆y)(∆z).

The triple integral of f over B is

∫∫∫_B f(x, y, z) dV = lim_{ℓ,m,n→∞} Σ_{i=1}^ℓ Σ_{j=1}^m Σ_{k=1}^n f(x∗i, y∗j, z∗k) ∆V.

In this formula, f(x, y, z) is the integrand and the sum on the right is a Riemann sum.


Continuity and integrability. f(x, y, z) is said to be integrable on the box B if the limit above exists. Continuous functions are always integrable. Further, it is possible to show that the limit above exists if the values of f are bounded on B and the set of discontinuities of f on B has volume zero.

Iterated integrals and Fubini's theorem. A generalization of Fubini's theorem (p. 154) states that if f is continuous on B, then the triple integral can be computed using iterated integrals, where the iteration can be done in any order. In particular,

∫∫∫_B f(x, y, z) dV = ∫_a^b ∫_c^d ∫_p^q f(x, y, z) dz dy dx.

(There are a total of six orders.)

More generally, if f is a bounded function on B satisfying (1) the set of discontinuities of f has volume 0, and (2) each line parallel to one of the coordinate axes intersects the set of discontinuities of f in at most a finite number of points, then the triple integral can be computed using iterated integrals in any order.

Average values. The quantity

(1/((b − a)(d − c)(q − p))) ∫∫∫_B f(x, y, z) dV

can be interpreted as the average value of f on B.

To see this, let

f̄ = (1/(ℓmn)) Σ_{i,j,k} f(x∗i, y∗j, z∗k)

be the sample average, and let V = (b − a)(d − c)(q − p) be the volume of the box. Then

f̄ = (V/V) (1/(ℓmn)) Σ_{i,j,k} f(x∗i, y∗j, z∗k) = (1/V) Σ_{i,j,k} f(x∗i, y∗j, z∗k) ∆V ⇒ lim_{ℓ,m,n→∞} f̄ = (1/V) ∫∫∫_B f(x, y, z) dV.

Example 7.47 (Average Value) To illustrate computations, we find the average of

f(x, y, z) = 12x^2 (y − z) on B = [−1, 1] × [−1, 2] × [0, 3].

Using the generalization of Fubini's theorem,

∫∫∫_B f(x, y, z) dV = ∫_{−1}^1 ∫_{−1}^2 ∫_0^3 12x^2 (y − z) dz dy dx

= ∫_{−1}^1 ∫_{−1}^2 ( −54x^2 + 36x^2 y ) dy dx

= ∫_{−1}^1 ( −108x^2 ) dx = −72.

Since the volume of B is 18, the average of f on B is −72/18 = −4.
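A numeric cross-check with scipy.integrate.tplquad (which integrates z, then y, then x) reproduces both the integral and the average:

from scipy.integrate import tplquad

f = lambda z, y, x: 12*x**2*(y - z)            # tplquad expects func(z, y, x)
val, err = tplquad(f, -1, 1, -1, 2, 0, 3)      # x, then y, then z limits
print(val, val / 18)                           # -72.0 and average -4.0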


Figure 7.12: Domains of integration for triple integral examples.

7.6.6 Triple Integrals over General Domains

Triple integrals can also be computed over more general domains. There are six basic types, which depend on the order of integration.

One of the six types can be defined as follows:

W = {(x, y, z) : a ≤ x ≤ b, ℓ1(x) ≤ y ≤ u1(x), ℓ2(x, y) ≤ z ≤ u2(x, y)},

where

1. ℓ1(x) and u1(x) are continuous on [a, b] and

2. ℓ2(x, y) and u2(x, y) are continuous on the 2-dimensional domain

{(x, y) : a ≤ x ≤ b, ℓ1(x) ≤ y ≤ u1(x)}.

If f(x, y, z) is continuous on W, then

∫∫∫_W f(x, y, z) dV = ∫_a^b ∫_{ℓ1(x)}^{u1(x)} ∫_{ℓ2(x,y)}^{u2(x,y)} f(x, y, z) dz dy dx.

Example 7.48 (Figure 7.12, Left Part) Let f(x, y, z) = xy on

W = {(x, y, z) : 0 ≤ x ≤ 2, 0 ≤ y ≤ 1, 0 ≤ z ≤ 4 + x + y}

(left part of Figure 7.12).

Then, the triple integral of f over W is

∫∫∫_W f dV = ∫_0^2 ∫_0^1 ∫_0^{4+x+y} xy dz dy dx = ∫_0^2 ∫_0^1 (4xy + x^2 y + xy^2) dy dx = ∫_0^2 (7x/3 + x^2/2) dx = 6.

160

Page 171: MT580 Mathematics for Statistics - Boston College · MT580 Mathematics for Statistics: “All the Math You Never Had” Jenny A. Baglivo Mathematics Department Boston College Chestnut

Average values over general domains. If f is integrable over the general domain W with positive volume V(W), then the average value of f over W is

Average value of f over W = (1/V(W)) ∫∫∫_W f(x, y, z) dV.

Example 7.49 (Figure 7.12, Right Part) Let f(x, y, z) = 3x on

W = {(x, y, z) : 0 ≤ y ≤ 10, y ≤ x ≤ 20 − y, 0 ≤ z ≤ 25 − y}

(right part of Figure 7.12).

The triple integral of f over W is

∫∫∫_W f dV = ∫_0^{10} ∫_y^{20−y} ∫_0^{25−y} 3x dz dx dy = ∫_0^{10} ∫_y^{20−y} (75x − 3xy) dx dy = ∫_0^{10} (15000 − 2100y + 60y^2) dy = 65000.

The volume of W is the triple integral of the function 1 (equivalently, the volume is the double integral of 25 − y over {(x, y) : 0 ≤ y ≤ 10, y ≤ x ≤ 20 − y}):

V(W) = ∫_0^{10} ∫_y^{20−y} ∫_0^{25−y} 1 dz dx dy = ∫_0^{10} ∫_y^{20−y} (25 − y) dx dy = ∫_0^{10} (500 − 70y + 2y^2) dy = 6500/3.

Finally, the average is (65000)/(6500/3) = 30.

Extensions to k variables. Definitions and techniques for double and triple integrals can be extended to define integration methods for k-variable functions. The extensions are straightforward, but the computations can be quite tedious.

7.6.7 Partial Anti-Differentiation and Partial Differentiation

Computing iterated integrals involves computing partial anti-derivatives. This section explores the processes of partial differentiation and partial anti-differentiation further in the 2-variable setting.

Theorem 7.50 (Fixed Endpoints) If the 2-variable function f has continuous first partial derivatives on the rectangle [a, b] × [c, d], then

d/dy ( ∫_a^b f(x, y) dx ) = ∫_a^b fy(x, y) dx for y ∈ [c, d]

and

d/dx ( ∫_c^d f(x, y) dy ) = ∫_c^d fx(x, y) dy for x ∈ [a, b].

161

Page 172: MT580 Mathematics for Statistics - Boston College · MT580 Mathematics for Statistics: “All the Math You Never Had” Jenny A. Baglivo Mathematics Department Boston College Chestnut

To illustrate the first equality, let f(x, y) = 4x^3 y^2 and [a, b] = [1, 2]. Then

d/dy ( ∫_1^2 4x^3 y^2 dx ) = d/dy ( 15y^2 ) = 30y and ∫_1^2 ∂/∂y ( 4x^3 y^2 ) dx = ∫_1^2 8x^3 y dx = 30y.

The theorem can be extended to integrals with variable endpoints, using the rule for the total derivative from Section 7.3.4.

7.6.8 Change of Variables and the Jacobian

The change of variables technique for multiple integrals generalizes the method of substitution for single integrals (see page 83). In this section, we introduce the technique in the 2-variable setting.

Consider a vector-valued function T : D∗ ⊆ R2 → R2, with formula

T (u, v) = (x(u, v), y(u, v)) for all (u, v) ∈ D∗,

and let D = T(D∗) be the image of D∗. We say that the transformation T maps the domain D∗ in the uv-plane to the domain D in the xy-plane.

The functions x(u, v) and y(u, v) are called the coordinate functions of T .

The Jacobian. Assume that each coordinate function has continuous first partial derivatives. The Jacobian of the transformation T, denoted by ∂(x, y)/∂(u, v), is defined as follows:

∂(x, y)/∂(u, v) = | xu xv ; yu yv | = xu yv − xv yu.

(The Jacobian is the determinant of the matrix of partial derivatives).

Example 7.51 (Linear Transformation) Let

(x, y) = T (u, v) = (au + bv, cu + dv),

where a, b, c and d are constants.

The formula for T can be written using matrix notation. Specifically,

x = T(u) = Au,

where

A = [ a b; c d ], x = [ x; y ] and u = [ u; v ].

A is the matrix of partial derivatives of T ; the Jacobian of T is det(A) = (ad − bc).


Figure 7.13: Unit square in uv-plane (left) and parallelogram in xy-plane (right).

Example 7.52 (Nonlinear Transformation) Let

(x, y) = T(u, v) = ( √(u/v), √(uv) ),

where u and v are positive.

The (simplified) matrix of partial derivatives is

[ xu xv ]   [ 1/(2√(uv))   −√u/(2v√v) ]
[ yu yv ] = [ √v/(2√u)      √u/(2√v)  ]

and the (simplified) Jacobian is 1/(2v).

Stretching/shrinking factor. The absolute value of the Jacobian is the "stretching" (or "shrinking") factor which tells us how areas in xy-space are related to areas in uv-space.

For linear transformations, the relationship between areas is easy to state.

Theorem 7.53 (Linear Transformations) Let x = T(u) = Au, where

A = [ a b; c d ] and det(A) = (ad − bc) ≠ 0.

Then

1. T is a one-to-one and onto transformation.

2. T maps triangles to triangles and parallelograms to parallelograms in such a way that vertices are mapped to vertices.

3. If D∗ is a parallelogram in the uv-plane and D = T(D∗) in the xy-plane, then

(Area of D) = |∂(x, y)/∂(u, v)| (Area of D∗) = |det(A)| (Area of D∗).


Example 7.54 (Area of Parallelogram) The parallelogram D in the xy-plane whose corners are (0, 0), (1, −1), (4, 0) and (3, 1) is the image of D∗ = [0, 1] × [0, 1] in the uv-plane under the linear transformation

(x, y) = T (u, v) = (3u + v, u − v) for (u, v) ∈ D∗,

as illustrated in Figure 7.13.

The Jacobian of T is

| 3 1 ; 1 −1 | = −4.

Since the area of D∗ is 1, the area of D is

(Area of D) = |−4| (Area of D∗) = 4.

Change of variables. In 2-variable substitution, we replace

dA = dx dy with dA = |∂(x, y)/∂(u, v)| du dv.

The complete formula is given in the following theorem.

Theorem 7.55 (Change of Variables) Let D and D∗ be basic regions (of types 1 or 2) in the xy- and uv-planes, respectively. Suppose that

T : D∗ ⊆ R^2 → R^2

maps D∗ onto D in a one-to-one fashion, and that each coordinate function of T has continuous first partial derivatives. Further, suppose that f is a 2-variable, continuous function on D. Then

∫∫_D f(x, y) dx dy = ∫∫_{D∗} f(x(u, v), y(u, v)) |∂(x, y)/∂(u, v)| du dv.

The change of variables technique is used to simplify the boundaries of integration, to simplify the integrand, or both.

Example 7.56 (Simplifying Integrand) Consider evaluating the double integral

∫∫_D cos( (x − y)/(x + y) ) dA

where D is the triangular region in the xy-plane bounded by x = 0, y = 0, x + y = 1.

The integrand can be simplified by using

u = x − y, v = x + y ⟺ x = (u + v)/2, y = (−u + v)/2.

Since

(1) x = 0 ⇒ u = −v, (2) y = 0 ⇒ u = v and (3) x + y = 1 ⇒ v = 1,


the domain D∗ = {(u, v) : 0 ≤ v ≤ 1, −v ≤ u ≤ v}. The Jacobian is 1/2.

The integral becomes

∫∫_D cos( (x − y)/(x + y) ) dA = ∫_0^1 ∫_{−v}^v cos(u/v) (1/2) du dv = ∫_0^1 v sin(1) dv = (1/2) sin(1).

(Note that, by substitution, ∫ cos(u/v) du = v sin(u/v) + C.)
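The substitution can also be verified symbolically. A short sketch using sympy (declaring v positive so the inner antiderivative evaluates cleanly):

import sympy as sp

u = sp.symbols('u')
v = sp.symbols('v', positive=True)
x = (u + v)/2
y = (-u + v)/2
J = sp.Matrix([x, y]).jacobian([u, v]).det()    # Jacobian = 1/2 (so |J| = 1/2)
val = sp.integrate(sp.cos(u/v)*J, (u, -v, v), (v, 0, 1))
print(J, sp.simplify(val))                      # 1/2 and sin(1)/2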

Example 7.57 (Simplifying Boundary) Consider evaluating the double integral

∫∫_D xy dA

where D is the region bounded by xy = 1, xy = 4, xy^2 = 1, xy^2 = 4.

The boundary can be simplified by using

u = xy, v = xy^2 ⟺ x = u^2/v, y = v/u.

The region D∗ = [1, 4] × [1, 4]. After simplification, the Jacobian is 1/v.

The integral becomes

∫∫_D xy dA = ∫_1^4 ∫_1^4 u (1/v) du dv = ∫_1^4 ( 15/(2v) ) dv = 15 ln(4)/2.

Remarks. As stated above, the absolute value of the Jacobian of a linear transformation is related to a change in area. If T is a nonlinear transformation with continuous first partial derivatives, then it is approximately linear in small enough regions. Thus, for small regions, the absolute value of the Jacobian approximates the change in area.

Linear transformations will be studied further in Chapter 9 of these notes.

Polar coordinate transformations. Consider (x, y) in polar coordinates

x = r cos(θ), y = r sin(θ)

where r = √(x^2 + y^2) is the distance between (x, y) and (0, 0), and θ is the angle measured counterclockwise from the positive x-axis to the ray from (0, 0) to (x, y).

Transformations from domains in the rθ-plane to domains in the xy-plane have Jacobian

∂(x, y)/∂(r, θ) = | xr xθ ; yr yθ | = | cos(θ) −r sin(θ) ; sin(θ) r cos(θ) | = r cos^2(θ) + r sin^2(θ) = r,

since sin^2(θ) + cos^2(θ) = 1 for all θ.

Polar coordinate transformations are particularly useful when the domain of integration is a disk (the boundary plus the interior of a circle), or a wedge of a disk.


Example 7.58 (Quarter Disk) Consider evaluating

∫∫_D (x^2 + y^2) dA

where D is the subset of the disk x^2 + y^2 ≤ 1 in the first quadrant. Then D can be described in polar coordinates as D∗ = {(r, θ) : 0 ≤ r ≤ 1, 0 ≤ θ ≤ π/2}, and

∫∫_D (x^2 + y^2) dA = ∫_0^1 ∫_0^{π/2} r^2 r dθ dr = ∫_0^1 (π/2) r^3 dr = π/8.

Application of polar transformation. An important application of the polar coordinate transformation in probability and statistics is the evaluation of the double integral

∫∫_D e^{−(x^2+y^2)} dA, where D = R^2.

D can be described in polar coordinates as D∗ = {(r, θ) : 0 ≤ r < ∞, 0 ≤ θ ≤ 2π} and

∫∫_D e^{−(x^2+y^2)} dA = ∫_0^∞ ∫_0^{2π} e^{−r^2} r dθ dr

= π ∫_0^∞ e^{−r^2} (2r) dr

= π [ −e^{−r^2} ]_0^{r→∞} (by substitution)

= π ( lim_{r→∞}(−e^{−r^2}) + 1 )

= π(0 + 1) = π.
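The polar computation is easy to confirm numerically; the following Python sketch (scipy assumed) evaluates the radial integral and multiplies by 2π:

import numpy as np
from scipy.integrate import quad

val, err = quad(lambda r: np.exp(-r**2) * r, 0, np.inf)
print(2*np.pi*val, np.pi)                      # both approximately 3.14159...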

7.7 Applications in Statistics

Multiple integration methods are important in statistics, and, in particular, when working with joint density functions. Joint density functions are used to "smooth over" joint probability histograms, and as functions supporting the theory of multivariate continuous random variables (Chapter 8). This section previews some ideas.

7.7.1 Computing Probabilities for Continuous Random Pairs

If the continuous random pair (X, Y) has joint density function f(x, y), then the probability that (X, Y) takes values in the domain D is the double integral of f over D:

P((X, Y) ∈ D) = ∫∫_D f(x, y) dA.

Example 7.59 (Joint Distribution on Rectangle) Let

f(x, y) = (1/8)(x^2 + y^2) for (x, y) ∈ [−1, 2] × [−1, 1]


Figure 7.14: Volume (left) over D (right) is P(X ≥ Y) for a random pair on [−1, 2] × [−1, 1].

(and 0 otherwise) be the joint density function of a continuous random pair (X,Y ).

To find the probability that X is greater than or equal to Y, for example, we need to compute the double integral of f over the domain D, where D is the intersection of the rectangle [−1, 2] × [−1, 1] with the half-plane x ≥ y:

P(X ≥ Y) = ∫_{−1}^1 ∫_y^2 (1/8)(x^2 + y^2) dx dy = ∫_{−1}^1 ( 1/3 + y^2/4 − y^3/6 ) dy = 5/6.

The domain D is pictured in the right part of Figure 7.14; the probability P(X ≥ Y) is the volume of the solid shown in the left part of the figure.
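P(X ≥ Y) can be checked numerically by integrating x from y to 2 with y as the outer variable. A minimal Python sketch using scipy.integrate.dblquad:

from scipy.integrate import dblquad

f = lambda x, y: (x**2 + y**2)/8               # inner variable x, outer variable y
val, err = dblquad(f, -1, 1, lambda y: y, 2)   # y in [-1, 1], x from y to 2
print(val)                                     # approx 0.8333 = 5/6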

7.7.2 Contours for Independent Standard Normal Random Variables

Assume that the continuous random pair (X, Y) has joint density function

f(x, y) = (1/(2π)) exp( −(x^2 + y^2)/2 ) for (x, y) ∈ R^2,

where exp() is the exponential function. The left part of Figure 7.15 is a graph of the surface z = f(x, y) when (x, y) ∈ [−3, 3] × [−3, 3].

Contours for the joint density are circles in the xy-plane centered at the origin. The right part of Figure 7.15 shows contours enclosing 5%, 20%, . . ., 95% of the joint probability.

Given a > 0, let

Da = {(x, y) : x^2 + y^2 ≤ a^2}

be the disk of radius a centered at the origin. The polar transformation can be used to compute the probability that values of (X, Y) lie in Da. Specifically,

P(X^2 + Y^2 ≤ a^2) = ∫∫_{Da} f(x, y) dA = ∫_0^{2π} ∫_0^a (1/(2π)) e^{−r^2/2} r dr dθ = 1 − e^{−a^2/2}.

(See the end of the last section for a similar computation.)


Figure 7.15: Joint density (left) and contours enclosing 5%, 20%, . . ., 95% of probability when X and Y are independent standard normal random variables.

To find the value of a so that Da encloses a given proportion p of probability, we solve

p = P(X^2 + Y^2 ≤ a^2) = 1 − e^{−a^2/2} ⇒ a = √(−2 ln(1 − p)).

Further, for (x, y) satisfying x^2 + y^2 = a^2 = −2 ln(1 − p), f(x, y) = (1 − p)/(2π).
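For example, the radius a = √(−2 ln(1 − p)) can be tabulated for the contour levels in Figure 7.15 with a few lines of Python (numpy only):

import numpy as np

for p in [0.05, 0.20, 0.50, 0.95]:
    a = np.sqrt(-2*np.log(1 - p))
    print(p, round(a, 4))        # e.g. p = 0.95 gives a of about 2.4477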

Remarks. The joint density given in this section corresponds to the density of independent standard normal random variables. See Section 8.1.9 for details about bivariate normal distributions.

Probability contours for bivariate continuous distributions like the bivariate normal distribution take the place of quantiles for univariate continuous distributions.


8 Joint Continuous Distributions

Recall, from Chapter 3, that a probability distribution describing the joint variability of two or more random variables is called a joint distribution, and that a bivariate distribution is the joint distribution of a pair of random variables.

8.1 Bivariate Distributions

This section considers joint distributions of pairs of continuous random variables.

8.1.1 Joint PDF and Joint CDF

In the bivariate continuous setting, we assume that X and Y are continuous random variables and that there exists a nonnegative function f(x, y) defined over the entire xy-plane such that

P((X, Y) ∈ D) = ∫∫_D f dA for every D ⊆ R^2.

The function f(x, y) is called the joint probability density function (joint PDF) (or joint density function) of the random pair (X, Y). The notation fXY(x, y) is sometimes used to emphasize the two random variables.

Joint PDFs satisfy the following two properties:

1. f(x, y) ≥ 0 for all pairs (x, y).

2. ∫∫_{R^2} f(x, y) dA = 1.

The joint cumulative distribution function (joint CDF) of the random pair (X, Y), denoted by F(x, y), is the function

F(x, y) = P(X ≤ x, Y ≤ y) = ∫∫_{Dxy} f dA, for all real pairs (x, y),

where Dxy = {(u, v) : u ≤ x, v ≤ y}.

Joint CDFs satisfy the following properties:

1. lim(x,y)→(−∞,−∞) F (x, y) = 0 and lim(x,y)→(∞,∞) F (x, y) = 1.

2. If x1 ≤ x2 and y1 ≤ y2, then F (x1, y1) ≤ F (x2, y2).

3. F (x, y) is continuous.

Example 8.1 (Joint Distribution in 1st Quadrant) Assume that (X, Y) has the following joint density function:

f(x, y) = 2e^{−x−2y} when x, y ≥ 0 and 0 otherwise.


Given (x, y) in the first quadrant (both coordinates nonnegative),

F(x, y) = ∫_0^y ∫_0^x 2e^{−u−2v} du dv = ∫_0^y 2e^{−2v}(1 − e^{−x}) dv = (1 − e^{−x})(1 − e^{−2y}).

F(x, y) is equal to 0 otherwise.

Further, to compute the probability that X is greater than or equal to Y, for example, we compute the double integral of the joint PDF over the intersection of the half-plane x ≥ y with the first quadrant:

P(X ≥ Y) = ∫_0^∞ ∫_y^∞ 2e^{−x−2y} dx dy = ∫_0^∞ 2e^{−3y} dy = 2/3.

8.1.2 Marginal Distributions

The marginal probability density function (marginal PDF) (or marginal density function) of X is

f_X(x) = ∫_y f(x, y) dy for x in the range of X

and 0 otherwise, where the integral is taken over all y in the range of Y .

Similarly, the marginal PDF (or marginal density function) of Y is

f_Y(y) = ∫_x f(x, y) dx for y in the range of Y

and 0 otherwise, where the integral is taken over all x in the range of X.

Example 8.2 (Joint Distribution in 1st Quadrant, continued) For the joint distribution described above:

1. When x ≥ 0, ∫_0^∞ f(x, y) dy = e^{−x}. Thus,

f_X(x) = e^{−x} when x ≥ 0 and 0 otherwise.

(X is an exponential random variable with parameter 1.)

2. When y ≥ 0, ∫_0^∞ f(x, y) dx = 2e^{−2y}. Thus,

f_Y(y) = 2e^{−2y} when y ≥ 0 and 0 otherwise.

(Y is an exponential random variable with parameter 2.)

Example 8.3 (Joint Distribution on Eighth Plane) Assume that (X,Y) has the following joint density function:

f(x, y) = (1/4)e^{−y/2} when 0 ≤ x ≤ y and 0 otherwise.


Figure 8.1: Plot of z = (1/4)e^{−y/2} on 0 ≤ x ≤ y (left) and region of nonzero density (right).

The left plot in Figure 8.1 shows part of the surface z = f(x, y) (with vertical planes added) and the right plot shows part of the region of nonzero density.

The volume of the solid shown in the left part of Figure 8.1 is equal to the probability that Y is at most 11:

P(Y ≤ 11) = ∫_0^{11} ∫_0^y (1/4)e^{−y/2} dx dy = ∫_0^{11} (1/4) y e^{−y/2} dy = 1 − (13/2)e^{−11/2} ≈ 0.9734.

Further, for this random pair,

1. When x ≥ 0, ∫_x^∞ f(x, y) dy = (1/2)e^{−x/2}. Thus,

f_X(x) = (1/2)e^{−x/2} when x ≥ 0 and 0 otherwise.

(X is an exponential random variable with parameter 1/2.)

2. When y ≥ 0, ∫_0^y f(x, y) dx = (1/4) y e^{−y/2}. Thus,

f_Y(y) = (1/4) y e^{−y/2} when y ≥ 0 and 0 otherwise.

(Y is a gamma random variable with shape parameter 2 and scale parameter 2.)

8.1.3 Conditional Distributions

Let X and Y be continuous random variables with joint PDF f(x, y), and marginal PDFs f_X(x) and f_Y(y), respectively.

If f_Y(y) ≠ 0, then the conditional probability density function (conditional PDF) (or conditional density function) of X given Y = y is defined as follows:

f_{X|Y=y}(x|y) = f(x, y)/f_Y(y) for all real numbers x.


Similarly, if f_X(x) ≠ 0, then the conditional PDF (or conditional density function) of Y given X = x is

f_{Y|X=x}(y|x) = f(x, y)/f_X(x) for all real numbers y.

Example 8.4 (Joint Distribution in 1st Quadrant, continued) Consider again the random pair with joint density function

f(x, y) = 2e^{−x−2y} when x, y ≥ 0 and 0 otherwise.

1. Given x ≥ 0, the conditional distribution of Y given X = x has PDF

f_{Y|X=x}(y|x) = 2e^{−2y} when y ≥ 0 and 0 otherwise.

(The conditional distribution is the same as the marginal distribution.)

2. Given y ≥ 0, the conditional distribution of X given Y = y has PDF

f_{X|Y=y}(x|y) = e^{−x} when x ≥ 0 and 0 otherwise.

(The conditional distribution is the same as the marginal distribution.)

Example 8.5 (Joint Distribution on Eighth Plane, continued) Consider again the random pair with joint density function

f(x, y) = (1/4)e^{−y/2} when 0 ≤ x ≤ y and 0 otherwise. (Figure 8.1)

1. Given x ≥ 0, the conditional distribution of Y given X = x has PDF

f_{Y|X=x}(y|x) = (1/2)e^{−(y−x)/2} when y ≥ x and 0 otherwise.

(The conditional distribution is a shifted exponential distribution.)

2. Given y > 0, the conditional distribution of X given Y = y has PDF

f_{X|Y=y}(x|y) = 1/y when 0 ≤ x ≤ y and 0 otherwise.

(The conditional distribution is uniform on the interval [0, y].)

Remarks. The function f_{X|Y=y}(x|y) is a valid PDF since f_{X|Y=y}(x|y) ≥ 0 and

∫_x f_{X|Y=y}(x|y) dx = (1/f_Y(y)) ∫_x f(x, y) dx = (1/f_Y(y)) f_Y(y) = 1.

Similarly, the function f_{Y|X=x}(y|x) is a valid PDF. Thus, conditional distributions are distributions in their own right.

Although, for example, f_Y(y) does not represent probability, we can think of the conditional density function f_{X|Y=y}(x|y) as a limit of probabilities. Specifically,

f_{X|Y=y}(x|y) = lim_{ε→0} ∂/∂x P(X ≤ x | y − ε ≤ Y ≤ y + ε).

Thus, the conditional PDF of X given Y = y can be thought of as the conditional PDF of X given that Y is very close to y.


8.1.4 Independent Random Variables

The continuous random variables X and Y are said to be independent if their joint density function equals the product of their marginal density functions for all real pairs,

f(x, y) = f_X(x) f_Y(y) for all (x, y).

Equivalently, X and Y are independent if their joint CDF equals the product of their marginal CDFs for all real pairs,

F(x, y) = F_X(x) F_Y(y) for all (x, y).

Random variables that are not independent are said to be dependent.

Example 8.6 (Joint Distribution in 1st Quadrant, continued) The random variables with joint density function

f(x, y) = 2e^{−x−2y} = (e^{−x})(2e^{−2y}) when x, y ≥ 0 and 0 otherwise

are independent.

Example 8.7 (Joint Distribution on Eighth Plane, continued) The random variables with joint density function

f(x, y) = (1/4)e^{−y/2} when 0 ≤ x ≤ y and 0 otherwise (Figure 8.1)

are dependent. To demonstrate the dependence of X and Y, we just need to show that f(x, y) ≠ f_X(x)f_Y(y) for some pair (x, y). For example, f(2, 1) ≠ f_X(2)f_Y(1).

8.1.5 Mathematical Expectation for Bivariate Continuous Distributions

Let X and Y be continuous random variables with joint density function f(x, y) and let g(X,Y) be a real-valued function.

The mean of g(X,Y) (or the expected value of g(X,Y) or the expectation of g(X,Y)) is

E(g(X,Y)) = ∫_x ∫_y g(x, y) f(x, y) dy dx

provided the integral converges absolutely. The double integral is assumed to include all pairs with nonzero joint density.

Example 8.8 (Joint Distribution on Triangle) Assume that (X,Y) has the following joint density function

f(x, y) = 2/15 when 0 ≤ x ≤ 3, 0 ≤ y ≤ 5x/3 and 0 otherwise.

(The random pair has constant density on the triangle with corners (0, 0), (3, 0), (3, 5).)

Let g(X,Y ) = XY be the product of the variables. Then

E(g(X,Y)) = ∫_0^3 ∫_0^{5x/3} xy (2/15) dy dx = ∫_0^3 (5/27) x³ dx = 15/4.
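As a quick numerical check of this expectation (an illustration only, assuming SciPy is available):

    from scipy.integrate import dblquad

    # E(XY) for f(x, y) = 2/15 on the triangle 0 <= x <= 3, 0 <= y <= 5x/3.
    # Inner variable y runs from 0 to 5x/3 for each x in [0, 3].
    val, _ = dblquad(lambda y, x: x * y * (2.0 / 15.0),
                     0.0, 3.0,
                     lambda x: 0.0,
                     lambda x: 5.0 * x / 3.0)
    print(val)   # approximately 3.75 = 15/4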


Properties of expectation. The properties of expectation stated in Section 3.1.6 in the bivariate discrete case are also true in the bivariate continuous case.

8.1.6 Covariance and Correlation

Let X and Y be random variables with finite means (µ_x, µ_y) and finite standard deviations (σ_x, σ_y). The covariance of X and Y, Cov(X,Y), is defined as follows:

Cov(X,Y) = E((X − µ_x)(Y − µ_y)).

The notation σ_{xy} = Cov(X,Y) is used to denote the covariance. The correlation of X and Y, Corr(X,Y), is defined as follows:

Corr(X,Y) = Cov(X,Y)/(σ_x σ_y) = σ_{xy}/(σ_x σ_y).

The notation ρ = Corr(X,Y) is used to denote the correlation of X and Y; ρ is called the correlation coefficient.

The properties of covariance and correlation stated in Section 3.1.7 in the bivariate discrete case are also true in the bivariate continuous case.

Example 8.9 (Joint Distribution on Triangle, continued) Continuing with the joint distribution above,

E(X) = 2, E(Y) = 5/3, E(X²) = 9/2, E(Y²) = 25/6, E(XY) = 15/4.

Using properties of variance and covariance:

Var(X) = E(X²) − (E(X))² = 1/2,  Var(Y) = E(Y²) − (E(Y))² = 25/18,

Cov(X,Y) = E(XY) − E(X)E(Y) = 5/12 and ρ = Corr(X,Y) = 1/2.

Example 8.10 (Joint Distribution on Diamond) Assume that the continuous pair (X,Y) has joint density function

f(x, y) = 1/2 when |x| + |y| ≤ 1 and 0 otherwise.

(Joint density is constant on the “diamond” with corners (−1, 0), (0, 1), (1, 0), (0,−1).)

For this joint distribution,

1. E(X) = E(Y) = E(XY) = 0.

2. Cov(X,Y) = 0 and Corr(X,Y) = 0.

3. The marginal PDF of X is f_X(x) = 1 − |x| when −1 ≤ x ≤ 1 and 0 otherwise.

4. The marginal PDF of Y is f_Y(y) = 1 − |y| when −1 ≤ y ≤ 1 and 0 otherwise.

Since Corr(X,Y) = 0, X and Y are uncorrelated. But, X and Y are dependent since, for example, f(0, 0) ≠ f_X(0)f_Y(0).
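A small simulation can make the distinction between "uncorrelated" and "independent" visible. The sketch below (an illustration assuming NumPy; the sample size is arbitrary) draws uniform points on the diamond by rejection sampling and shows that X and Y are uncorrelated while |X| and |Y| are clearly correlated, so X and Y cannot be independent:

    import numpy as np

    rng = np.random.default_rng(0)

    # Rejection sampling: points uniform on the square [-1, 1] x [-1, 1],
    # kept only when inside the diamond |x| + |y| <= 1, are uniform on the diamond.
    x = rng.uniform(-1.0, 1.0, 500_000)
    y = rng.uniform(-1.0, 1.0, 500_000)
    keep = np.abs(x) + np.abs(y) <= 1.0
    x, y = x[keep], y[keep]

    print(np.corrcoef(x, y)[0, 1])                    # near 0 (uncorrelated)
    print(np.corrcoef(np.abs(x), np.abs(y))[0, 1])    # clearly negative (dependent)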



Figure 8.2: Graph of z = f(x, y) for a continuous bivariate distribution (left plot) and contour plot with z = 1/2, 1/4, 1/6, . . . , 1/20 and conditional expectation of Y given X = x superimposed (right plot).

8.1.7 Conditional Expectation and Regression

Let X and Y be continuous random variables with joint density function f(x, y).

If f_X(x) ≠ 0, then the conditional expectation (or the conditional mean) of Y given X = x, E(Y|X = x), is defined as follows:

E(Y|X = x) = ∫_y y f_{Y|X=x}(y|x) dy,

where the integral is over all y with nonzero conditional density (f_{Y|X=x}(y|x) ≠ 0), provided the integral converges absolutely.

The conditional expectation of X given Y = y, E(X|Y = y), is defined similarly.

Example 8.11 (Conditional Expectation) Assume that the continuous pair (X,Y) has joint density function

f(x, y) = e^{−yx}/(2√x) when x ≥ 1, y ≥ 0 and 0 otherwise.

1. The marginal PDF of X is

f_X(x) = 1/(2x^{3/2}) when x ≥ 1 and 0 otherwise.

2. The conditional PDF of Y given X = x is

f_{Y|X=x}(y|x) = x e^{−xy} when y ≥ 0 and 0 otherwise.

(The conditional distribution is exponential with parameter x.)

3. Since the expected value of an exponential random variable is the reciprocal of its parameter (see Table 5.2), the conditional mean formula is

E(Y|X = x) = 1/x for x ≥ 1.


The left part of Figure 8.2 is a graph of z = f(x, y) (with vertical planes added) and the right part is a contour plot with z = 1/2, 1/4, 1/6, . . . , 1/20 (in gray). The formula for the conditional mean, y = 1/x, is superimposed on the contour plot (in black).

Regression equation. Note that the formula for the conditional expectation

E(Y|X = x) as a function of x

is called the regression equation of Y on X. Similarly, the formula for the conditional expectation E(X|Y = y) as a function of y is called the regression equation of X on Y.

Linear conditional means. Let X and Y be random variables with finite means (µ_x, µ_y), standard deviations (σ_x, σ_y), and correlation ρ.

If E(Y|X = x) is a linear function of x, then the formula is of the form:

E(Y|X = x) = µ_y + ρ(σ_y/σ_x)(x − µ_x).

Similarly, if E(X|Y = y) is a linear function of y, then the formula is of the form:

E(X|Y = y) = µ_x + ρ(σ_x/σ_y)(y − µ_y).

Example 8.12 (Two-Step Experiment) Consider the following two-step experiment:

Step 1: Choose X uniformly from the interval [10, 20].

Step 2: Given that X = x, choose Y uniformly from the interval [0, x].

Then the continuous pair (X,Y) has joint density function

f(x, y) = f_X(x) f_{Y|X=x}(y|x) = (1/10)(1/x) = 1/(10x)

when 10 ≤ x ≤ 20, 0 ≤ y ≤ x and 0 otherwise. (The region of nonzero density is the trapezoidal region with corners (10, 0), (10, 10), (20, 20), (20, 0).)

Since the conditional distribution of Y given X = x is uniform on the interval [0, x], and the expected value of a uniform random variable is the midpoint of its interval (see Table 5.2), the conditional mean formula is

E(Y|X = x) = x/2 for 10 ≤ x ≤ 20

(a linear function of x). Further, since

µ_x = 15, µ_y = 15/2, σ_x = 5/√3, σ_y = 5√31/6 and ρ = √(3/31)

for this random pair, µ_y + ρ(σ_y/σ_x)(x − µ_x) = x/2, verifying the formula given above.
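A short simulation of the two-step experiment (a sketch assuming NumPy; the sample size is arbitrary) confirms these moments:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    x = rng.uniform(10.0, 20.0, n)   # Step 1: X uniform on [10, 20]
    y = rng.uniform(0.0, x)          # Step 2: given X = x, Y uniform on [0, x]

    print(x.mean(), y.mean())        # approximately 15 and 15/2
    print(np.corrcoef(x, y)[0, 1])   # approximately sqrt(3/31) = 0.311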


8.1.8 Example: Bivariate Uniform Distribution

Let R be a region of the plane with finite positive area. The continuous random pair (X,Y) is said to have a bivariate uniform distribution on the region R if its joint PDF is as follows:

f(x, y) = 1/(Area of R) when (x, y) ∈ R and 0 otherwise.

Bivariate uniform distributions have constant density over the region of nonzero density. That constant density is the reciprocal of the area. If A is a subregion of R (A ⊆ R), then

P((X,Y) ∈ A) = (Area of A)/(Area of R).

That is, the probability of the event “the point (x, y) is in A” is the ratio of the area of the subregion to the area of the full region.

The joint distributions on the triangle and on the diamond considered earlier are examples of bivariate uniform distributions on regions with areas 15/2 and 2, respectively.

Expectations. If (X,Y) has a bivariate uniform distribution on R and g(X,Y) is a real-valued function, then E(g(X,Y)) is the same as the average value of the 2-variable function g over the region R, as discussed on page 157.

Marginal distributions. If (X,Y) has a bivariate uniform distribution, then X and Y may not be uniform random variables.

Consider again the joint distribution on the diamond. X is not a uniform random variable since its PDF is not constant on the interval [−1, 1]. Similarly, Y is not a uniform random variable since its PDF is not constant on [−1, 1].

8.1.9 Example: Bivariate Normal Distribution

Let µ_x and µ_y be real numbers, σ_x and σ_y be positive real numbers, and ρ be a number in the interval −1 < ρ < 1. The random pair (X,Y) is said to have a bivariate normal distribution with parameters µ_x, µ_y, σ_x, σ_y and ρ when its joint PDF is as follows

f(x, y) = 1/(2π σ_x σ_y √(1 − ρ²)) exp( −term/(2(1 − ρ²)) ) for all real pairs (x, y)

where

term = ((x − µ_x)/σ_x)² − 2ρ((x − µ_x)/σ_x)((y − µ_y)/σ_y) + ((y − µ_y)/σ_y)²

and exp() is the exponential function.

For this random pair,



Figure 8.3: Joint density (left) and contours enclosing 5%, 35%, 65% and 95% of the probability for the standard bivariate normal distribution with correlation ρ = −0.80.

1. µ_x and σ_x are the mean and standard deviation of X,

2. µ_y and σ_y are the mean and standard deviation of Y and

3. ρ (the correlation coefficient) is the correlation between X and Y.

Further, if ρ = 0, then X and Y are independent.

Standard bivariate normal distribution. The random pair (X,Y) is said to have a standard bivariate normal distribution with parameter ρ when (X,Y) has a bivariate normal distribution with µ_x = µ_y = 0 and σ_x = σ_y = 1.

The joint PDF of (X,Y) is as follows:

f(x, y) = 1/(2π√(1 − ρ²)) exp( −(x² − 2ρxy + y²)/(2(1 − ρ²)) ) for all real pairs (x, y).

Example 8.13 (ρ = −0.80) The left part of Figure 8.3 shows the joint density function of the bivariate standard normal distribution with correlation ρ = −0.80.

The right part of the figure shows contours enclosing 5%, 35%, 65% and 95% of the probability of this joint distribution. Each contour is an ellipse whose major axis lies along the line y = −x and whose minor axis lies along the line y = x.

Marginal and conditional distributions. If (X,Y) has a bivariate normal distribution, then each marginal and conditional distribution is normal. Specifically,

1. X is a normal random variable with mean µ_x and standard deviation σ_x,

2. Y is a normal random variable with mean µ_y and standard deviation σ_y,



Figure 8.4: The region of nonzero density for a bivariate uniform distribution on the open rectangle (0, 2) × (0, 1) (left plot) and the probability density function of the product of the coordinates with values in (0, 2) (right plot). Contours for products equal to 0.2, 0.6, 1.0, and 1.4 are shown in the left plot.

3. the conditional distribution of Y given X = x is normal with mean

E(Y|X = x) = µ_y + ρ(σ_y/σ_x)(x − µ_x)

and standard deviation σ_y√(1 − ρ²), and

4. the conditional distribution of X given Y = y is normal with mean

E(X|Y = y) = µ_x + ρ(σ_x/σ_y)(y − µ_y)

and standard deviation σ_x√(1 − ρ²).

Remarks. The computations needed to find probability contours for bivariate normal distributions generalize ideas from Section 7.7.2 and are discussed further in the next section.

8.1.10 Transforming Continuous Random Variables

Let (U, V) be a continuous random pair with joint density function f(u, v). This section considers scalar-valued and vector-valued continuous transformations of (U, V).

Scalar-valued functions. We first consider scalar-valued functions.

Example 8.14 (Product Transformation) Let U be the length, V be the width, and X = UV be the area of a random rectangle. Specifically, assume that U is a uniform random variable on the interval (0, 2), V is a uniform random variable on the interval (0, 1), and that U and V are independent.

Since the joint PDF of (U, V) is

f(u, v) = f_U(u) f_V(v) = 1/2 when (u, v) ∈ (0, 2) × (0, 1) and 0 otherwise,


(U, V) has a bivariate uniform distribution on the open rectangle.

For 0 < x < 2,

P(X ≤ x) = P(UV ≤ x) = x/2 + (1/2) ∫_x^2 (x/u) du = (1/2)(x + x ln(2) − x ln(x))

and d/dx P(X ≤ x) = (1/2)(ln(2) − ln(x)) = (1/2) ln(2/x). Thus,

f_X(x) = (1/2) ln(2/x) when 0 < x < 2 and 0 otherwise.

The left part of Figure 8.4 shows the region of nonzero joint density, with contours corresponding to x = 0.2, 0.6, 1.0, 1.4 superimposed. The right part is a plot of the density function of X = UV.

Example 8.15 (Sum Transformation) Let X = U + V, where U and V are independent exponential random variables with parameter 1. The range of X is the nonnegative real numbers.

Since U and V are independent, the joint density function of (U, V) is

f(u, v) = f_U(u) f_V(v) = e^{−u−v} when u, v ≥ 0 and 0 otherwise.

Given x ≥ 0,

P(X ≤ x) = ∫_0^x ∫_0^{x−u} e^{−u−v} dv du = ∫_0^x (−e^{−x} + e^{−u}) du = 1 − e^{−x} − xe^{−x},

and d/dx P(X ≤ x) = xe^{−x} (using the product rule for derivatives). Thus,

f_X(x) = xe^{−x} when x ≥ 0 and 0 otherwise.

(X is a gamma random variable with shape parameter 2 and scale parameter 1.)

Working with sums. If X = U + V and f(u, v) has continuous partial derivatives, then the PDF of X can be computed as follows:

f_X(x) = ∫_{u∈R} f(u, x − u) du where R is the range of U.

This result can be proven using the relationship between partial differentiation and partial anti-differentiation given in Section 7.6.7.

For the example above, since the joint density function is nonzero when both arguments are nonnegative, the integral reduces to

f_X(x) = ∫_0^x e^{−x} du = xe^{−x} when x ≥ 0 and 0 otherwise,

confirming the result obtained using double integration followed by differentiation.
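The convolution formula is also easy to verify numerically. The following sketch (an illustration only, assuming SciPy is available) compares the numeric convolution with the gamma density xe^{−x} at a few points:

    import numpy as np
    from scipy.integrate import quad

    # Convolution formula for X = U + V with independent exponential(1) parts:
    # f_X(x) = integral over u in [0, x] of e^{-u} e^{-(x-u)} du = x e^{-x}.
    def f_X(x):
        val, _ = quad(lambda u: np.exp(-u) * np.exp(-(x - u)), 0.0, x)
        return val

    for x in (0.5, 1.0, 2.0):
        print(f_X(x), x * np.exp(-x))   # the two columns agree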


Vector-valued functions. We next consider vector-valued functions.

Let (x, y) = T(u, v) be a vector-valued function of two variables. Assume that the coordinate functions of the transformation, x = x(u, v) and y = y(u, v), have continuous partial derivatives and that the Jacobian of the transformation

∂(x, y)/∂(u, v) = det [ x_u  x_v ; y_u  y_v ] = x_u y_v − x_v y_u ≠ 0 for all (u, v).

(See Section 7.6.8 for a discussion of the Jacobian.)

If (X,Y) = T(U, V) and f(u, v) has continuous partial derivatives, then, for all (x, y) with nonzero joint density,

f_{XY}(x, y) = f(u, v) / |∂(x, y)/∂(u, v)|, where (x, y) = T(u, v).

(This result generalizes the result for monotone transformations from page 100.)

Example 8.16 (Standard Bivariate Normal Distribution) Let U and V be independent standard normal random variables and let

(X,Y) = T(U, V) = (U, ρU + √(1 − ρ²) V),

where −1 < ρ < 1. Then (X,Y) has a standard bivariate normal distribution with correlation ρ. To see this,

1. The Jacobian of the transformation is

∂(x, y)/∂(u, v) = det [ x_u  x_v ; y_u  y_v ] = det [ 1  0 ; ρ  √(1 − ρ²) ] = √(1 − ρ²).

2. Since x = u, y = ρu + √(1 − ρ²) v implies u = x, v = (−ρx + y)/√(1 − ρ²),

u² + v² = x² + (−ρx + y)²/(1 − ρ²) = (x² − 2ρxy + y²)/(1 − ρ²).

3. Since f(u, v) = exp(−(u² + v²)/2)/(2π), the formula above implies that

f_{XY}(x, y) = 1/(2π√(1 − ρ²)) exp( −(x² − 2ρxy + y²)/(2(1 − ρ²)) ) for all (x, y).

The result is the same as the joint density function of the standard bivariate normal distribution with correlation ρ given earlier.
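A simulation of the transformation (a sketch assuming NumPy; the sample size is arbitrary) shows that the transformed pair has standard normal coordinates with sample correlation near ρ:

    import numpy as np

    rng = np.random.default_rng(0)
    rho = -0.80
    u = rng.standard_normal(1_000_000)
    v = rng.standard_normal(1_000_000)

    x = u
    y = rho * u + np.sqrt(1.0 - rho**2) * v

    # Both coordinates are standard normal; sample correlation is near rho.
    print(x.std(), y.std(), np.corrcoef(x, y)[0, 1])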

Example 8.17 (Bivariate Normal Distribution) More generally, if U and V are independent standard normal random variables and we let

(X,Y) = T(U, V) = (µ_x + σ_x U, µ_y + σ_y(ρU + √(1 − ρ²) V)),

then (X,Y) has a bivariate normal distribution with parameters µ_x, µ_y, σ_x, σ_y and ρ.

The steps in the derivation follow those in the previous example.


Remarks. The linear transformations given in the last two examples can be used to find probability contours for bivariate normal distributions. That is, the transformations can be used to find contours enclosing probability p, for 0 < p < 1.

We consider the first transformation.

1. Let U and V be independent standard normal random variables, C_a be the circle of radius a centered at the origin and

D_a = {(u, v) : u² + v² ≤ a²}

be the disk of radius a centered at the origin. If a = √(−2 ln(1 − p)), then

P((U, V) ∈ D_a) = p

(see Section 7.7.2) and C_a is the contour enclosing probability p.

Further, if (u, v) ∈ C_a, then f(u, v) = (1 − p)/(2π).

2. Let (X,Y) = T(U, V) = (U, ρU + √(1 − ρ²) V), where −1 < ρ < 1.

The image of D_a under T,

T(D_a) = {(x, y) : x² − 2ρxy + y² ≤ a²(1 − ρ²)},

is an elliptical region centered at the origin, whose boundary, T(C_a), is the contour enclosing probability p.

Further, if (x, y) ∈ T(C_a), then f_{XY}(x, y) = (1 − p)/(2π√(1 − ρ²)).

Note that the right part of Figure 8.3 shows contours corresponding to p = 0.05, 0.35, 0.65 and 0.95 when ρ = −0.80.
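As an illustrative check (assuming NumPy; the values of ρ and p are arbitrary), the fraction of simulated standard bivariate normal points falling inside the elliptical region should be close to p:

    import numpy as np

    rng = np.random.default_rng(0)
    rho, p = -0.80, 0.65
    a2 = -2.0 * np.log(1.0 - p)    # a^2 for the contour enclosing probability p

    u = rng.standard_normal(1_000_000)
    v = rng.standard_normal(1_000_000)
    x, y = u, rho * u + np.sqrt(1.0 - rho**2) * v

    inside = x**2 - 2.0 * rho * x * y + y**2 <= a2 * (1.0 - rho**2)
    print(inside.mean())           # approximately 0.65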

Bivariate normal distributions (and multivariate normal distributions) will be studied further in Chapter 9 of these notes.

8.2 Multivariate Distributions

Recall (from Section 3.2) that a multivariate distribution is the joint distribution of k random variables. Ideas studied in the bivariate case (k = 2) can be generalized to the case where k > 2.

8.2.1 Joint PDF and Joint CDF

Let X1, X2, . . ., Xk be continuous random variables and let

X = (X1,X2, . . . ,Xk).

We assume there exists a nonnegative k-variable function f with domain R^k such that

P(X ∈ W) = ∫ · · · ∫_W f(x1, x2, . . . , xk) dx1 · · · dxk


for every W ⊆ R^k. The function f(x1, x2, . . . , xk) is called the joint probability density function (joint PDF) (or joint density function) of the random k-tuple X.

The joint cumulative distribution function (joint CDF) of X is defined as follows:

F(x) = P(X1 ≤ x1, X2 ≤ x2, . . . , Xk ≤ xk) for all x = (x1, x2, . . . , xk) ∈ R^k,

where commas are understood to mean the intersection of events.

Example 8.18 (Joint Distribution on 1st Octant) Assume that X = (X,Y,Z) has joint density function

f(x, y, z) = 6/(1 + x + y + z)^4 when x, y, z ≥ 0 and 0 otherwise.

Given x, y, z ≥ 0,

F(x, y, z) = ∫_0^x ∫_0^y ∫_0^z f(u, v, w) dw dv du =

yz(2 + y + z)/((1 + y)(1 + z)(1 + y + z)) − yz(2 + 2x + y + z)/((1 + x)(1 + x + y)(1 + x + z)(1 + x + y + z)).

The joint CDF is equal to 0 otherwise.

Let c be a positive constant. To find the probability that the sum of the three variables is at most c, for example, we compute the triple integral of f over the region

W = {(x, y, z) : 0 ≤ x ≤ c, 0 ≤ y ≤ c − x, 0 ≤ z ≤ c − x − y} ⊂ R³.

The probability is

P(X + Y + Z ≤ c) = ∫∫∫_W f(x, y, z) dV = (c/(c + 1))³.

8.2.2 Marginal and Conditional Distributions

The concepts of marginal and conditional distributions can also be generalized.

Example 8.19 (Joint Distribution on 1st Octant, continued) To illustrate computations, we continue with the joint distribution defined above.

1. The marginal distribution of the random pair (X,Y) is

f_{XY}(x, y) = ∫_0^∞ f(x, y, z) dz = 2/(1 + x + y)³ when x, y ≥ 0 and 0 otherwise.

2. The marginal distribution of Z is

f_Z(z) = ∫_0^∞ ∫_0^∞ f(x, y, z) dx dy = 1/(1 + z)² when z ≥ 0 and 0 otherwise.


3. Let x and y be fixed nonnegative constants. The conditional distribution of Z given (X,Y) = (x, y) is

f_{Z|(X,Y)=(x,y)}(z|(x, y)) = f(x, y, z)/f_{XY}(x, y) = 3(1 + x + y)³/(1 + x + y + z)^4

when z ≥ 0 and 0 otherwise.

8.2.3 Mutually Independent Random Variables and Random Samples

The continuous random variables X1, X2, . . ., Xk are said to be mutually independent (or independent) if

f(x) = f1(x1) f2(x2) · · · fk(xk) for all x = (x1, x2, . . . , xk) ∈ R^k,

where fi(xi) is the PDF of Xi for each i.

Equivalently, X1, X2, . . ., Xk are said to be mutually independent if

F(x) = F1(x1) F2(x2) · · · Fk(xk) for all x = (x1, x2, . . . , xk) ∈ R^k,

where Fi(xi) is the CDF of Xi for each i.

X1, X2, . . ., Xk are said to be dependent if they are not independent.

Example 8.20 (Joint Distribution on 1st Octant, continued) The random variables X, Y and Z in the example above are dependent. To see this, first note that each random variable has the same marginal density function:

f_X(u) = f_Y(u) = f_Z(u) = 1/(1 + u)² when u ≥ 0 and 0 otherwise.

To demonstrate dependence, we just need to show that f(x, y, z) ≠ f_X(x)f_Y(y)f_Z(z) for some triple (x, y, z). For example, f(1, 1, 1) ≠ f_X(1)f_Y(1)f_Z(1).

Random sample. If the continuous random variables X1, X2, . . ., Xk are mutually independent and have a common distribution (each marginal distribution is the same), then X1, X2, . . ., Xk are said to be a random sample from that distribution.

8.2.4 Mathematical Expectation For Multivariate Continuous Distributions

Let X1, X2, . . ., Xk be continuous random variables with joint PDF f(x) and let g(X) be a real-valued function. The mean or expected value or expectation of g(X) is

E(g(X)) = ∫_{xk} ∫_{xk−1} · · · ∫_{x1} g(x1, x2, . . . , xk) f(x1, x2, . . . , xk) dx1 dx2 · · · dxk

provided that the integral converges absolutely. The multiple integral is assumed to include all k-tuples with nonzero joint PDF.


Example 8.21 (Product Function) You pick three numbers independently and uniformly from the interval [0, 5]. Let X1, X2, X3 be the three numbers and let g(X1, X2, X3) be the function with rule g(X1, X2, X3) = X1 X2² X3.

By independence,

f(x1, x2, x3) = 1/125 on the [0, 5] × [0, 5] × [0, 5] box and 0 otherwise.

The expected product is

E(g(X1, X2, X3)) = ∫_0^5 ∫_0^5 ∫_0^5 (x1 x2² x3)(1/125) dx3 dx2 dx1 = ∫_0^5 ∫_0^5 (1/10) x1 x2² dx2 dx1 = ∫_0^5 (25/6) x1 dx1 = 625/12 ≈ 52.08.

Properties of expectation. Properties of expectation in the multivariate case directly generalize properties in the univariate and bivariate cases. In particular,

1. If a and b1, b2, . . . , bn are constants and gi(X) are real-valued k-variable functions for each i = 1, 2, . . . , n, then

E( a + Σ_{i=1}^n bi gi(X) ) = a + Σ_{i=1}^n bi E(gi(X)).

2. If X1, X2, . . ., Xk are mutually independent and gi(Xi) are real-valued 1-variable functions for i = 1, 2, . . . , k, then

E(g1(X1) g2(X2) · · · gk(Xk)) = Π_{i=1}^k E(gi(Xi)).

Example 8.22 (Product Function, continued) Continuing with the example above, and using properties of expectation, the expected product is

E(X1 X2² X3) = E(X1) E(X2²) E(X3) = (5/2)(25/3)(5/2) = 625/12,

confirming the result obtained above.
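Both computations can be checked by simulation (a sketch assuming NumPy; the sample size is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    x1, x2, x3 = rng.uniform(0.0, 5.0, (3, 1_000_000))

    print((x1 * x2**2 * x3).mean())                 # approximately 625/12 = 52.08
    print(x1.mean() * (x2**2).mean() * x3.mean())   # product of expectations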

8.2.5 Sample Summaries

If X1, X2, . . ., Xn is a random sample from a distribution with mean µ and standard deviation σ, then the sample mean, X̄, is the random variable

X̄ = (1/n)(X1 + X2 + · · · + Xn),

the sample variance, S², is the random variable

S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²


and the sample standard deviation, S, is the positive square root of the sample variance.

The following theorem can be proven using properties of expectation (see Section 3.2.7):

Theorem 8.23 (Sample Summaries) If X̄ is the sample mean and S² is the sample variance of a random sample of size n from a distribution with mean µ and standard deviation σ, then

1. E(X̄) = µ and Var(X̄) = σ²/n.

2. E(S²) = σ².

Sample correlation. A random sample of size n from the joint (X,Y) distribution is a list of n mutually independent random pairs, each with the same distribution as (X,Y).

If (X1, Y1), (X2, Y2), . . . , (Xn, Yn) is a random sample of size n from a bivariate distribution with correlation ρ = Corr(X,Y), then the sample correlation, R, is the random variable

R = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^n (Xi − X̄)² Σ_{i=1}^n (Yi − Ȳ)² )

where X̄ and Ȳ are the sample means of the X and Y samples, respectively.

Applications. In statistical applications, observed values of the sample mean, sample variance, and sample correlation are used to estimate unknown values of the parameters µ, σ² and ρ, respectively.
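In NumPy, for example, these sample summaries are one-liners. The data below are hypothetical and for illustration only:

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical bivariate data, for illustration only.
    x = rng.normal(10.0, 2.0, 500)
    y = 3.0 + 0.5 * x + rng.normal(0.0, 1.0, 500)

    xbar = x.mean()                 # observed sample mean
    s2 = x.var(ddof=1)              # sample variance (divisor n - 1)
    r = np.corrcoef(x, y)[0, 1]     # sample correlation
    print(xbar, s2, r)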

Example 8.24 (Brain-Body Weights) (Allison & Cicchetti, Science, 194:732-734, 1976; lib.stat.cmu.edu/DASL.) As part of a study on sleep in mammals, researchers collected information on the average brain weight and average body weight for 43 different species.

(Body-weight, brain-weight) combinations ranged from (0.05 kg, 0.14 g) for the short-tail shrew to (2547.0 kg, 4603.0 g) for the Asian elephant. The data for “man” are

(62 kg, 1320 g) = (136.4 lb, 2.9 lb).

Let X be the common logarithm of body weight in kilograms, and Y be the common logarithm of brain weight in grams. Sample summaries are as follows:

1. Mean log-body weight is x̄ = 0.311738, with a SD of s_x = 1.33113.

2. Mean log-brain weight is ȳ = 1.17421, with a SD of s_y = 1.09415.

3. Correlation between log-body weight and log-brain weight is r = 0.951693.



Figure 8.5: Log-brain weight (vertical axis) versus log-body weight (horizontal axis) for the brain-body study. Probability contours of the estimated bivariate normal model and the estimated conditional expectation are superimposed.

The researchers determined that a bivariate normal model was reasonable for these data. Figure 8.5 compares the common logarithms of average brain weight in grams (vertical axis) and average body weight in kilograms (horizontal axis) for the 43 species. Probability contours for the model with parameters equal to the sample summaries are shown on the plot; contours enclose 5%, 35%, 65% and 95% of probability. The pair for “man” is the only point outside the outermost contour.

The estimated conditional expectation line,

y = 1.17421 + 0.782263 (x − 0.311738),

is displayed in the plot. Note, in particular, that the slope of the line is r(s_y/s_x) = 0.782263.


9 Linear Algebra Concepts

The central equations of interest in linear algebra are of the form Ax = b, where A is an m-by-n coefficient matrix, x is an n-by-1 vector of unknowns, and b is an m-by-1 vector of values. Solutions come at three levels of sophistication: (1) direct solution methods, (2) matrix algebra methods, and (3) vector space methods.

This chapter introduces linear algebra, and applications in statistics.

9.1 Solving Systems of Equations

A linear equation in the variables x1, x2, . . . , xn is an equation of the form

a1x1 + a2x2 + · · · + anxn = b

where a1, a2, . . . , an and b are constants. A system of linear equations in the variables x1, x2, . . . , xn is a collection of one or more linear equations in these variables. For example,

  x1 − 2x2 +  x3 =  0
        2x2 − 8x3 =  8
−4x1 + 5x2 + 9x3 = −9

is a system of 3 linear equations in the three unknowns x1, x2, x3 (a “3-by-3 system”).

A solution of a linear system is a list (s1, s2, . . . , sn) of numbers that makes each equation in the system true when si is substituted for xi for i = 1, 2, . . . , n. The list (29, 16, 3) is a solution to the system above. To check this, note that

   (29) − 2(16) + (3) = 0
         2(16) − 8(3) = 8
−4(29) + 5(16) + 9(3) = −9

The solution set of a linear system is the collection of all possible solutions of the system. For the example above, (29, 16, 3) is the unique solution.

Two linear systems are equivalent if each has the same solution set.

9.1.1 Elementary Row Operations

The strategy for solving a linear system is to replace the system with an equivalent one that is easier to solve. The equivalent system is obtained using elementary row operations:

1. (Replacement) Add a multiple of one row to another row,

2. (Interchange) Interchange two rows,

3. (Scaling) Multiply all entries in a row by a nonzero constant,

where each “row” corresponds to an equation in the system.


Example 9.1 (Using Row Operations) We work out the strategy using the 3-by-3 system given above, where a 3-by-4 augmented matrix of the coefficients and right hand side values follows our progress toward a solution.

(1) The original system:

    R1:   x1 − 2x2 +  x3 =  0          [  1  −2   1 |  0 ]
    R2:         2x2 − 8x3 =  8          [  0   2  −8 |  8 ]
    R3: −4x1 + 5x2 + 9x3 = −9          [ −4   5   9 | −9 ]

(2) R3 + 4R1 replaces R3:

    x1 − 2x2 +   x3 =  0               [ 1  −2   1 |  0 ]
          2x2 −  8x3 =  8               [ 0   2  −8 |  8 ]
         −3x2 + 13x3 = −9               [ 0  −3  13 | −9 ]

(3) (1/2)R2 replaces R2:

    x1 − 2x2 +   x3 =  0               [ 1  −2   1 |  0 ]
           x2 −  4x3 =  4               [ 0   1  −4 |  4 ]
         −3x2 + 13x3 = −9               [ 0  −3  13 | −9 ]

(4) R3 + 3R2 replaces R3:

    x1 − 2x2 + x3 = 0                  [ 1  −2   1 | 0 ]
           x2 − 4x3 = 4                 [ 0   1  −4 | 4 ]
                 x3 = 3                 [ 0   0   1 | 3 ]

(5) R1 − R3 replaces R1 and R2 + 4R3 replaces R2:

    x1 − 2x2 = −3                      [ 1  −2  0 | −3 ]
           x2 = 16                      [ 0   1  0 | 16 ]
           x3 =  3                      [ 0   0  1 |  3 ]

(6) R1 + 2R2 replaces R1:

    x1 = 29                            [ 1  0  0 | 29 ]
    x2 = 16                            [ 0  1  0 | 16 ]
    x3 =  3                            [ 0  0  1 |  3 ]

Thus, the solution is x1 = 29, x2 = 16, and x3 = 3.
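In practice, a square consistent system like this one is solved directly; for example, with NumPy (an illustration, not part of the hand computation above):

    import numpy as np

    A = np.array([[ 1.0, -2.0,  1.0],
                  [ 0.0,  2.0, -8.0],
                  [-4.0,  5.0,  9.0]])
    b = np.array([0.0, 8.0, -9.0])

    print(np.linalg.solve(A, b))   # [29. 16.  3.]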

Forward and backward phases. In the forward phase of the row reduction process of our example (corresponding to systems (1) through (4)), R1 is used to eliminate x1 from R2 and R3; then R2 is used to eliminate x2 from R3. In the backward phase of the process (corresponding to systems (5) and (6)), R3 is used to eliminate x3 from R1 and R2; then R2 is used to eliminate x2 from R1.

Existence and uniqueness. Two fundamental questions about linear systems are:

1. Does at least one solution exist?

2. If a solution exists, is the solution unique?

If a linear system has one or more solutions, then it is said to be consistent; otherwise, it is said to be inconsistent. A linear system can be inconsistent, or have a unique solution, or have infinitely many solutions.

Note that in the example above, we knew that the system was consistent once we reached equivalent system (4); the remaining steps allowed us to find the unique solution.


Example 9.2 (Infinitely Many Solutions) Consider the following 3-by-4 system:

  x1 +  x2 −  x3        =   9
  x1 + 2x2 − 3x3 − 4x4 =   9
−3x1 − 2x2 + 2x3 − 3x4 = −23

and the following sequence of equivalent augmented matrices:

[  1   1  −1   0 |   9 ]   [ 1  1  −1   0 | 9 ]   [ 1  1  −1   0 | 9 ]
[  1   2  −3  −4 |   9 ] ~ [ 0  1  −2  −4 | 0 ] ~ [ 0  1  −2  −4 | 0 ]
[ −3  −2   2  −3 | −23 ]   [ 0  1  −1  −3 | 4 ]   [ 0  0   1   1 | 4 ]

  [ 1  1  0   1 | 13 ]   [ 1  0  0   3 | 5 ]
~ [ 0  1  0  −2 |  8 ] ~ [ 0  1  0  −2 | 8 ]
  [ 0  0  1   1 |  4 ]   [ 0  0  1   1 | 4 ]

The last matrix corresponds to x1 + 3x4 = 5, x2 − 2x4 = 8, and x3 + x4 = 4. Thus,

(s1, s2, s3, s4) = (5 − 3x4, 8 + 2x4, 4 − x4, x4)

is a solution for any value of x4.

Example 9.3 (Inconsistent System) Consider the following 3-by-3 system:

        3x2 − 6x3 = 8
  x1 − 2x2 + 3x3 = −1
5x1 − 7x2 + 9x3 = 0

and the following sequence of equivalent augmented matrices:

[ 0   3  −6 |  8 ]   [ 1  −2   3 | −1 ]   [ 1  −2   3 | −1 ]   [ 1  −2   3 | −1 ]
[ 1  −2   3 | −1 ] ~ [ 0   3  −6 |  8 ] ~ [ 0   3  −6 |  8 ] ~ [ 0   3  −6 |  8 ]
[ 5  −7   9 |  0 ]   [ 5  −7   9 |  0 ]   [ 0   3  −6 |  5 ]   [ 0   0   0 | −3 ]

The last matrix corresponds to x1 − 2x2 + 3x3 = −1, 3x2 − 6x3 = 8, and 0 = −3.

Since 0 ≠ −3, the system has no solution.

9.1.2 Row Reduction and Echelon Forms

A matrix is in (row) echelon form when

1. All nonzero rows are above any row of zeros.

2. Each leading entry (that is, leftmost nonzero entry) in a row is in a column to the right of the leading entries of the rows above it.

3. All entries in a column below a leading entry are zero.


Table 9.1: 6-by-11 matrices in echelon (left) and reduced echelon (right) forms.

[ 0 • ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ]      [ 0 1 ∗ 0 0 ∗ ∗ 0 0 ∗ ∗ ]
[ 0 0 0 • ∗ ∗ ∗ ∗ ∗ ∗ ∗ ]      [ 0 0 0 1 0 ∗ ∗ 0 0 ∗ ∗ ]
[ 0 0 0 0 • ∗ ∗ ∗ ∗ ∗ ∗ ]      [ 0 0 0 0 1 ∗ ∗ 0 0 ∗ ∗ ]
[ 0 0 0 0 0 0 0 • ∗ ∗ ∗ ]      [ 0 0 0 0 0 0 0 1 0 ∗ ∗ ]
[ 0 0 0 0 0 0 0 0 • ∗ ∗ ]      [ 0 0 0 0 0 0 0 0 1 ∗ ∗ ]
[ 0 0 0 0 0 0 0 0 0 0 0 ]      [ 0 0 0 0 0 0 0 0 0 0 0 ]

An echelon form matrix is in reduced echelon form when (in addition)

4. The leading entry in each nonzero row is 1.

5. Each leading 1 is the only nonzero entry in its column.

The left part of Table 9.1 represents a 6-by-11 echelon form matrix, where • represents the leading entry, and ∗ represents any number (either zero or nonzero). The right part of Table 9.1 represents a 6-by-11 reduced echelon form matrix.

Starting with the augmented matrix of a system of linear equations, the forward phase of the row reduction process will produce a matrix in echelon form. Continuing the process through the backward phase, we get a reduced echelon form matrix. An important theorem is the following:

Theorem 9.4 (Uniqueness Theorem) Each matrix is row-equivalent to one and only one reduced echelon form matrix.

Further, we can “read” the solutions to the original system from the reduced echelon form of the augmented matrix.

Pivot positions. A pivot position is a position of a leading entry in an echelon form matrix. A column that contains a pivot position is called a pivot column. For example, in the left part of Table 9.1, each • is located at a pivot position, and columns 2, 4, 5, 8, and 9 are pivot columns.

A pivot is a nonzero number that either is used in a pivot position to create zeros or is changed into a leading 1, which in turn is used to create zeros. In general there is no more than one pivot in any row and no more than one pivot in any column.

Example 9.5 (Echelon Forms) Consider the following 3-by-6 matrix:

[ 0   3  −6   6  4  −5 ]
[ 3  −7   8  −5  8   9 ]
[ 3  −9  12  −9  6  15 ]

We first interchange R1 and R3:

[ 3  −9  12  −9  6  15 ]
[ 3  −7   8  −5  8   9 ]
[ 0   3  −6   6  4  −5 ]


Column 1 is a pivot column and we use the 3 in the upper left corner to convert the 3 in row 2 to 0 (R2 − R1 replaces R2). In addition, R1 is replaced by (1/3)R1 for simplicity:

[ 1  −3   4  −3  2   5 ]
[ 0   2  −4   4  2  −6 ]
[ 0   3  −6   6  4  −5 ]

Column 2 is a pivot column. We could use the current row 2, or interchange the second and third rows and use the new row two for the pivot. For simplicity, we use the 2 in the second row to convert the 3 in the third row to a zero (R3 − (3/2)R2 replaces R3). In addition, R2 is replaced by (1/2)R2 for simplicity:

[ 1  −3   4  −3  2   5 ]
[ 0   1  −2   2  1  −3 ]
[ 0   0   0   0  1   4 ]

This matrix is in echelon form; columns 1, 2, and 5 are pivot columns.

Working with rows 3 and 2 (in that order) to “clear” columns 5 and 2, we get the following reduced echelon form matrix:

[ 1  −3   4  −3  2   5 ]   [ 1  −3   4  −3  0  −3 ]   [ 1   0  −2   3  0  −24 ]
[ 0   1  −2   2  1  −3 ] ~ [ 0   1  −2   2  0  −7 ] ~ [ 0   1  −2   2  0   −7 ]
[ 0   0   0   0  1   4 ]   [ 0   0   0   0  1   4 ]   [ 0   0   0   0  1    4 ]

Note that if the original matrix was the augmented matrix of a 3-by-5 system of linear equations in x1, x2, x3, x4, x5, then the last matrix corresponds to the equivalent linear system x1 − 2x3 + 3x4 = −24, x2 − 2x3 + 2x4 = −7, x5 = 4. Thus,

(s1, s2, s3, s4, s5) = (−24 + 2x3 − 3x4, −7 + 2x3 − 2x4, x3, x4, 4)

is a solution for any values of x3, x4.

Solving linear systems. When solving linear systems,

1. A basic variable is any variable that corresponds to a pivot column in the augmented matrix of the system, and a free variable is any nonbasic variable.

2. From the equivalent system produced using the reduced echelon form, we solve each equation for the basic variable in terms of the free variables (if any) in the equation.

3. If there is at least one free variable, then the original linear system has an infinite number of solutions.

4. If an echelon form of the augmented matrix has a row of the form

[ 0 0 · · · 0 b ] where b ≠ 0,

then the system is inconsistent. (Once we know the system is inconsistent, there is no need to continue to find the reduced echelon form.)


Example 9.6 (Echelon Forms, continued) Continuing with the last example, the basic variables are x1, x2, and x5; the free variables are x3 and x4. The free variables are used to characterize the infinite number of solutions to the 3-by-5 system.
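Computer algebra systems automate the row reduction. For example, SymPy's rref method (a sketch assuming SymPy is installed) returns both the reduced echelon form and the pivot columns of the 3-by-6 matrix above:

    from sympy import Matrix

    M = Matrix([[0,  3, -6,  6, 4, -5],
                [3, -7,  8, -5, 8,  9],
                [3, -9, 12, -9, 6, 15]])

    R, pivots = M.rref()
    print(R)        # the reduced echelon form computed above
    print(pivots)   # (0, 1, 4): columns 1, 2, and 5 (0-based indexing)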

9.1.3 Vector and Matrix Equations

In linear algebra, we let R^m be the set of m-by-1 matrices of real numbers:

R^m = { [x1, x2, . . . , xm]^T : xi ∈ R }.

Each element of R^m is called a point or a vector.

If x = [x1, x2, . . . , xm]^T is a vector in R^m, we say that xi is the ith component of x.

The zero vector, O, is the vector all of whose components equal zero.

Addition and scalar multiplication. If x, y ∈ R^m and c ∈ R, then

1. the vector sum x + y is the vector obtained by componentwise addition. That is, the vector whose ith component is xi + yi for each i. And,

2. the scalar product cx is the vector obtained by multiplying each component by c. That is, the vector whose ith component is cxi for each i.

Linear combinations and spans. If v1, v2, . . . , vn ∈ R^m and c1, c2, . . . , cn ∈ R, then

w = c1v1 + c2v2 + · · · + cnvn

is a linear combination of the vectors v1, v2, . . ., vn. The span of the set {v1, v2, . . . , vn} is the collection of all linear combinations of the vectors v1, v2, . . ., vn:

Span{v1, v2, . . . , vn} = {c1v1 + c2v2 + · · · + cnvn : ci ∈ R} ⊆ R^m.

The span of a set of vectors always contains the zero vector O (let each ci = 0).

Vector equations. Consider an m-by-n system of equations in the variables x1, x2, . . ., xn, where each equation is of the form

ai1x1 + ai2x2 + · · · + ainxn = bi for i = 1, 2, . . . ,m.

Let aj be the vector of coefficients of xj , and b be the vector of right hand side values.


The m-by-n system can be rewritten as a vector equation:

x1a1 + x2a2 + · · · + xnan = b.

The system is consistent if and only if b ∈ Span{a1, a2, . . . , an}.

Matrix equations. The coefficient matrix A of the m-by-n system above is the m-by-n matrix whose columns are the ai's: A = [ a1 a2 · · · an ].

If x ∈ R^n is the vector of unknowns, then the matrix product of A with x is

Ax = [ a1 a2 · · · an ] [x1, x2, . . . , xn]^T = x1a1 + x2a2 + · · · + xnan,

and the matrix equation for the system is Ax = b.

Note that the matrix product of a linear combination of vectors is equal to that linear combination of matrix products:

A(c1v1 + c2v2 + · · · + cnvn) = c1Av1 + c2Av2 + · · · + cnAvn

for all ci ∈ R and vi ∈ R^n. (Multiplication by A distributes over addition, and we can factor out constants.)

Consistency for all b. The following theorem gives criteria for determining when a system of equations is consistent for all b ∈ R^m:

Theorem 9.7 (Consistency Theorem) Let A be an m-by-n coefficient matrix, and x be an n-by-1 vector of unknowns. The following statements are equivalent:

1. The equation Ax = b has a solution for all b ∈ R^m.

2. Each b ∈ R^m is a linear combination of the coefficient vectors aj.

3. Span{a1, a2, . . . , an} = R^m.

4. A has a pivot position in each row.

Note that the number of pivot positions is at most the minimum of m and n. Thus, by the consistency theorem, if m > n (there are more equations than unknowns), a system with coefficient matrix A cannot be consistent for all b ∈ R^m.

Homogeneous systems and solution sets. A homogeneous system is a linear system with matrix equation Ax = O.

Homogeneous systems are consistent since x = O is a solution to the system (called the trivial solution). In addition, each linear combination of solutions is a solution. Thus, the solution set is a span of some set of vectors.


Example 9.8 (Trivial Solution Only) Consider the matrix equation

[ 2  −2 ]          [ 0 ]
[ 1   2 ] [ x1 ] = [ 0 ]
[ 1   8 ] [ x2 ]   [ 0 ]

Since

[ 2  −2 | 0 ]           [ 1  0 | 0 ]
[ 1   2 | 0 ] ~ · · · ~ [ 0  1 | 0 ]   ⇒   x1 = 0, x2 = 0
[ 1   8 | 0 ]           [ 0  0 | 0 ]

the system has the trivial solution only. The solution set is Span{O} = {O} ⊆ R².

Example 9.9 (One Generating Vector) Consider the matrix equation

[ 2  4   −6 ] [ x1 ]   [ 0 ]
[ 4  8  −10 ] [ x2 ] = [ 0 ]
              [ x3 ]

Since

[ 2  4   −6 | 0 ] ~ · · · ~ [ 1  2  0 | 0 ]   ⇒   x1 = −2x2, x2 is free, x3 = 0
[ 4  8  −10 | 0 ]           [ 0  0  1 | 0 ]

solutions can be written parametrically as follows:

s = [−2x2, x2, 0]^T = x2 [−2, 1, 0]^T = x2 v

where x2 is free. The solution set is Span{v} ⊆ R³.

Example 9.10 (Two Generating Vectors) Consider the matrix equation

[ 2  4  −6  8 ] [ x1 ]   [ 0 ]
[ 4  8   4  0 ] [ x2 ] = [ 0 ]
[ 3  6  −1  4 ] [ x3 ]   [ 0 ]
                [ x4 ]

Since

[ 2  4  −6  8 | 0 ]           [ 1  2  0   1 | 0 ]        x1 = −2x2 − x4, x2 is free,
[ 4  8   4  0 | 0 ] ~ · · · ~ [ 0  0  1  −1 | 0 ]   ⇒   x3 = x4, x4 is free
[ 3  6  −1  4 | 0 ]           [ 0  0  0   0 | 0 ]

solutions can be written parametrically as:

s = [−2x2 − x4, x2, x4, x4]^T = x2 [−2, 1, 0, 0]^T + x4 [−1, 0, 1, 1]^T = x2 v1 + x4 v2

where x2 and x4 are free. The solution set is Span{v1, v2} ⊆ R⁴.
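The generating vectors of a homogeneous solution set can also be computed symbolically; for example, SymPy's nullspace method (an illustration assuming SymPy is installed) recovers v1 and v2:

    from sympy import Matrix

    A = Matrix([[2, 4, -6, 8],
                [4, 8,  4, 0],
                [3, 6, -1, 4]])

    # A basis for the solution set of Ax = O (the span's generating vectors).
    for v in A.nullspace():
        print(v.T)   # (-2, 1, 0, 0) and (-1, 0, 1, 1)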


Nonhomogeneous systems and solution sets. A nonhomogeneous system is a linear system with matrix equation Ax = b where b ≠ O.

Theorem 9.11 (Solution Sets) Suppose that Ax = b is consistent and let p be a solution. Then the solution set of the system Ax = b is the set of all vectors of the form

s = p + vh

where vh is a solution of the homogeneous system Ax = O.

Example 9.12 (Two Generating Vectors, continued) Consider the matrix equation

[ 2  4  −6  8 ] [ x1 ]   [ 18 ]
[ 4  8   4  0 ] [ x2 ] = [  4 ]
[ 3  6  −1  4 ] [ x3 ]   [ 11 ]
                [ x4 ]

Since

[ 2  4  −6  8 | 18 ]           [ 1  2  0   1 |  3 ]        x1 = 3 − 2x2 − x4, x2 is free,
[ 4  8   4  0 |  4 ] ~ · · · ~ [ 0  0  1  −1 | −2 ]   ⇒   x3 = −2 + x4, x4 is free
[ 3  6  −1  4 | 11 ]           [ 0  0  0   0 |  0 ]

solutions can be written parametrically as:

s = [3 − 2x2 − x4, x2, −2 + x4, x4]^T = [3, 0, −2, 0]^T + x2 [−2, 1, 0, 0]^T + x4 [−1, 0, 1, 1]^T = p + (x2 v1 + x4 v2) = p + vh

where x2 and x4 are free. (The solution set is a shift of a span.)

9.2 Matrix Operations

An m-by-n matrix A is a rectangular array with m rows and n columns:

    [ a11  a12  · · ·  a1n ]
A = [ a21  a22  · · ·  a2n ] = [ a1 a2 · · · an ].
    [  .    .   · · ·   .  ]
    [ am1  am2  · · ·  amn ]

The (i, j)-entry of A (denoted by aij) is the element in the row i and column j position. In the first term above, the entries of A are shown explicitly. In the second term, A is written as n columns where each column is a vector in R^m. (See Section 7.1 also.)

Square matrices. A square matrix is a matrix with m = n. An n-by-n square matrix is often called a square matrix of order n. If A is a square matrix of order n, then the diagonal elements of A are the elements: a11, a22, . . ., ann.

Symmetric matrices. Let A be a square matrix. Then A is said to be a symmetric matrix if aij = aji for all i and j.


Triangular matrices. Let A be a square matrix. Then A is said to be an upper triangular matrix if aij = 0 when i > j; A is said to be a lower triangular matrix if aij = 0 when i < j; and A is said to be a triangular matrix if it is either upper or lower triangular.

Diagonal matrices, identity matrices. The square matrix A is said to be a diagonal matrix when aij = 0 when i ≠ j.

Given d1, d2, . . ., dn, Diag(d1, d2, . . . , dn) is used to denote the diagonal matrix of order n whose diagonal elements are the di's:

                           [ d1  0   0   · · ·  0  ]
                           [ 0   d2  0   · · ·  0  ]
Diag(d1, d2, . . . , dn) = [ 0   0   d3  · · ·  0  ]
                           [ .   .   .   · · ·  .  ]
                           [ 0   0   0   · · ·  dn ]

The identity matrix of order n, In, is the diagonal matrix with di = 1 for each i. We sometimes omit the subscript when the order of the matrix is clear from the problem.

Note that diagonal matrices are symmetric, upper triangular, and lower triangular.

Vectors and matrices with all ones. For convenience, we let 1_m be the vector in R^m all of whose components equal one, and we let J_{m×n} be the m-by-n matrix all of whose entries equal one. We sometimes omit the subscripts when the orders of the vectors and matrices are clear from the problem.

9.2.1 Basic Operations

In this section, we consider the operations of addition and scalar multiplication, linear combinations of matrices, matrix transpose and matrix product.

Addition and scalar multiplication. Let A and B be m-by-n matrices, and let c ∈ R be a scalar. Then matrix sum and scalar product are defined as follows:

1. The matrix sum of A and B is the m-by-n matrix

A + B = [a1 + b1 a2 + b2 · · · an + bn ] .

That is, A + B is the matrix with (i, j)-entry (A + B)ij = aij + bij .

2. The scalar product of A by c is the m-by-n matrix

cA = [ ca1 ca2 · · · can ] .

That is, cA is the matrix with (i, j)-entry (cA)ij = caij .


Linear combinations of matrices. Let A and B be m-by-n matrices, and c and d be scalars. Then the m-by-n matrix

cA + dB = [ ca1 + db1 ca2 + db2 · · · can + dbn ]

is said to be a linear combination of A and B.

The linear combination of A and B has (i, j)-entry as follows:

(cA + dB)ij = caij + dbij .

For example, if

    [ 1  −2 ]       [  0   7 ]                  [  2  −11 ]
A = [ 3  −4 ]  and  B = [  4   2 ],  then  2A − B = [  2  −10 ].
    [ 5  −6 ]       [ −2  −6 ]                  [ 12   −6 ]

Matrix transpose. The transpose of the m-by-n matrix A, denoted by A^T, is the n-by-m matrix whose (i, j)-entry is aji: (A^T)ij = aji.

For convenience, we let αi ∈ R^n (for i = 1, 2, . . . , m) be the columns of A^T and write

    [ α1^T ]
A = [   .   ]
    [ αm^T ]

when we want to view A as a “stack” of m row vectors.

Matrix product. Let A be an m-by-n matrix and let B = [ b1 b2 · · · bp ] be an n-by-p matrix. Then the matrix product, AB, is the m-by-p matrix whose jth column is the product of A with the jth column of B:

AB = [ Ab1 Ab2 · · · Abp ].

Equivalently, AB is the matrix whose (i, j)-entry is

(AB)ij = ai1 b1j + ai2 b2j + · · · + ain bnj.

(See Section 7.1.7 also.)

Dot and matrix products. Another convenient way to write the matrix product is in terms of dot products. Recall that if v and w are vectors in R^n, then their dot product is the scalar

v·w = v^T w = v1w1 + v2w2 + · · · + vnwn.

Then

     [ α1^T ]                       [ α1^T b1  α1^T b2  · · ·  α1^T bp ]
AB = [   .   ] [ b1 b2 · · · bp ] = [ α2^T b1  α2^T b2  · · ·  α2^T bp ]
     [ αm^T ]                       [    .        .     · · ·     .   ]
                                    [ αm^T b1  αm^T b2  · · ·  αm^T bp ]


Properties of the basic operations. Some properties of the basic operations are given below. In each case, the sizes of the matrices are assumed to be compatible.

1. Commutative: A + B = B + A.

2. Associative: A + (B + C) = (A + B) + C and A(BC) = (AB)C.

3. Distributive: A(B + C) = (AB) + (AC), (B + C)A = (BA) + (CA).

4. Scalar Multiples: Given c ∈ R:

c(AB) = (cA)B = A(cB), c(A + B) = (cA) + (cB) and (cA)^T = cA^T.

5. Identity: Im A = A = A In when A is an m-by-n matrix.

6. Transpose of Transpose: (A^T)^T = A.

7. Transpose of Sum: (A + B)^T = A^T + B^T.

8. Transpose of Product: (AB)^T = B^T A^T.

Note, in particular, that the matrix sum is commutative but the matrix product is not. In fact, it is possible that AB is defined but that BA is undefined.

9.2.2 Inverses of Square Matrices

The n-by-n matrix A is said to be invertible if there exists a square matrix C satisfying

AC = CA = In where In is the n-by-n identity matrix.

Otherwise, A is said to be not invertible. Square matrices that are not invertible are also called singular matrices; invertible square matrices are also called nonsingular matrices.

If A is invertible, then the matrix C is unique. To see this, suppose that the matrix D also satisfies the equality AD = DA = In. Then, using properties of matrix products,

D = DIn = D(AC) = (DA)C = InC = C.

If A is invertible, then A⁻¹ is used to denote the matrix inverse of A.

Special case: 2-by-2 matrices. Let A = [ a  b ; c  d ] be a 2-by-2 matrix. Then

A⁻¹ = (1/(ad − bc)) [ d  −b ; −c  a ] when ad − bc ≠ 0.

If ad − bc = 0, then A does not have an inverse. (Note: ad − bc is the determinant of A.)


Special case: diagonal matrices. Let A = Diag(a1, a2, . . . , an) be a diagonal matrix. If ai ≠ 0 for each i, then A is invertible with inverse

A⁻¹ = Diag(1/a1, 1/a2, . . . , 1/an).

If ai = 0 for some i, then A is not invertible.

Demonstration: Note first that if C = Diag(c1, c2, . . . , cn), then

AC = CA = Diag(a1c1, a2c2, . . . , ancn).

To obtain AC = CA = In, we need aici = 1 for each i. If ai = 0 for some i, then we cannot find solutions; otherwise, we let ci = 1/ai for each i.

Properties of inverses. Let A and B be invertible square matrices of order n and let c be a nonzero constant. Then

1. Inverse of Identity: In⁻¹ = In for each n.

2. Inverse of Inverse: (A⁻¹)⁻¹ = A.

3. Inverse of Product: (AB)⁻¹ = B⁻¹A⁻¹.

4. Inverse of Scalar Multiple: (cA)⁻¹ = (1/c)A⁻¹.

5. Inverse of Transpose: (A^T)⁻¹ = (A⁻¹)^T.

Algorithm for finding an inverse. Let A be an n-by-n matrix. Form the n-by-2n augmented matrix whose last n columns are the columns of the identity matrix: $[\,A \;\; I_n\,]$.

If A is row equivalent to the identity matrix, then the n-by-2n augmented matrix will be transformed as follows:
\[ [\,A \;\; I_n\,] \sim \cdots \sim [\,I_n \;\; A^{-1}\,]. \]
That is, the last n columns will be the inverse of A. Otherwise, A does not have an inverse.

Partial justification: For convenience, let $C = A^{-1}$ and $I_n = [\,e_1\; e_2\; \cdots\; e_n\,]$. Then $AC = I_n$ can be written as follows:
\[ AC = [\,Ac_1\; Ac_2\; \cdots\; Ac_n\,] = [\,e_1\; e_2\; \cdots\; e_n\,]. \]
Equivalently, $Ac_1 = e_1$, $Ac_2 = e_2$, . . . , $Ac_n = e_n$.

By using the n-by-2n augmented matrix, we are solving all n systems at once. The solutions to the n systems form the n columns of the inverse matrix.
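The algorithm translates directly into code. The sketch below is a minimal illustration, assuming NumPy is available; the function name gauss_jordan_inverse is ours, not from any library. It row reduces $[\,A \;\; I_n\,]$ with partial pivoting and reads off the inverse from the last n columns.

```python
import numpy as np

def gauss_jordan_inverse(A):
    """Invert A by row reducing the augmented matrix [A | I]."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    M = np.hstack([A, np.eye(n)])            # the n-by-2n augmented matrix
    for j in range(n):
        p = j + np.argmax(np.abs(M[j:, j]))  # partial pivoting for stability
        if np.isclose(M[p, j], 0.0):
            raise ValueError("A is singular")
        M[[j, p]] = M[[p, j]]                # interchange rows
        M[j] /= M[j, j]                      # scale the pivot row
        for i in range(n):                   # replacements: clear column j
            if i != j:
                M[i] -= M[i, j] * M[j]
    return M[:, n:]                          # last n columns are the inverse

A = np.array([[-7.0, 8.0], [2.0, -2.0]])
print(gauss_jordan_inverse(A))   # approximately [[1, 4], [1, 3.5]]
print(np.linalg.inv(A))          # NumPy's built-in inverse agrees
```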

Example 9.13 (2-by-2 Invertible Matrix) Let $A = \begin{bmatrix} -7 & 8 \\ 2 & -2 \end{bmatrix}$. Using the special formula for 2-by-2 matrices,
\[ A^{-1} = \frac{1}{-2} \begin{bmatrix} -2 & -8 \\ -2 & -7 \end{bmatrix} = \begin{bmatrix} 1 & 4 \\ 1 & 7/2 \end{bmatrix}. \]


Using the algorithm for finding inverses,
\[ [\,A \;\; I_2\,] = \begin{bmatrix} -7 & 8 & 1 & 0 \\ 2 & -2 & 0 & 1 \end{bmatrix} \sim \begin{bmatrix} -7 & 8 & 1 & 0 \\ 0 & 2/7 & 2/7 & 1 \end{bmatrix} \sim \begin{bmatrix} -7 & 8 & 1 & 0 \\ 0 & 1 & 1 & 7/2 \end{bmatrix} \]
\[ \sim \begin{bmatrix} -7 & 0 & -7 & -28 \\ 0 & 1 & 1 & 7/2 \end{bmatrix} \sim \begin{bmatrix} 1 & 0 & 1 & 4 \\ 0 & 1 & 1 & 7/2 \end{bmatrix} = [\,I_2 \;\; A^{-1}\,]. \]
The last two columns of the augmented matrix match the inverse computed above.

Example 9.14 (3-by-3 Invertible Matrix) Let $A = \begin{bmatrix} 0 & 1 & 2 \\ 1 & 0 & 3 \\ 4 & -3 & 8 \end{bmatrix}$. Since
\[ [\,A \;\; I_3\,] = \begin{bmatrix} 0 & 1 & 2 & 1 & 0 & 0 \\ 1 & 0 & 3 & 0 & 1 & 0 \\ 4 & -3 & 8 & 0 & 0 & 1 \end{bmatrix} \sim \cdots \sim \begin{bmatrix} 1 & 0 & 0 & -4.5 & 7.0 & -1.5 \\ 0 & 1 & 0 & -2.0 & 4.0 & -1.0 \\ 0 & 0 & 1 & 1.5 & -2.0 & 0.5 \end{bmatrix}, \]
A is invertible with inverse $A^{-1} = \begin{bmatrix} -4.5 & 7.0 & -1.5 \\ -2.0 & 4.0 & -1.0 \\ 1.5 & -2.0 & 0.5 \end{bmatrix}$.

Example 9.15 (3-by-3 Singular Matrix) Let $A = \begin{bmatrix} 0 & 1 & 2 \\ 1 & 0 & 3 \\ 4 & -3 & 6 \end{bmatrix}$. Since
\[ [\,A \;\; I_3\,] = \begin{bmatrix} 0 & 1 & 2 & 1 & 0 & 0 \\ 1 & 0 & 3 & 0 & 1 & 0 \\ 4 & -3 & 6 & 0 & 0 & 1 \end{bmatrix} \sim \cdots \sim \begin{bmatrix} 1 & 0 & 3 & 0 & 1 & 0 \\ 0 & 1 & 2 & 1 & 0 & 0 \\ 0 & 0 & 0 & 3 & -4 & 1 \end{bmatrix}, \]
A is not invertible. (Note that the third row has 3 leading zeros, so A is not row equivalent to the identity matrix.)

Elementary matrices and matrix inverses. An elementary matrix, E, is one obtained by performing a single elementary row operation on an identity matrix. Since each elementary row operation is reversible, each elementary matrix is invertible.

For example, let n = 4.

1. (Replacement) Starting with $I_4$, we replace $R_3$ with $R_3 + aR_1$. The elementary matrix E is shown on the left below. The operation is reversed by replacing $R_3$ with $R_3 - aR_1$. The inverse matrix is shown on the right below.
\[ E = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ a & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \quad \text{and} \quad E^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ -a & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \]

2. (Scaling) Starting with $I_4$, we scale $R_2$ using the nonzero constant c. The elementary matrix E is shown on the left below. The operation is reversed by scaling $R_2$ by the constant 1/c. The inverse matrix is shown on the right below.
\[ E = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & c & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \quad \text{and} \quad E^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1/c & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \]


3. (Interchange) Starting with $I_4$, we interchange $R_2$ and $R_4$. The elementary matrix E is shown below. The operation is reversed by interchanging $R_2$ and $R_4$ again. Thus, $E^{-1} = E$.
\[ E = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} = E^{-1}. \]
Note also that $E^2 = I_4$.

Let A be an n-by-n matrix. Elementary row operations on A correspond to left multiplication by the corresponding elementary matrix. If a sequence of p elementary row operations (with elementary matrices $E_1, E_2, \ldots, E_p$) transforms A to the identity matrix, then
\[ (E_p(E_{p-1} \cdots (E_2(E_1 A)))) = (E_p E_{p-1} \cdots E_2 E_1)A = I_n, \]
the inverse of A is the following product of elementary matrices:
\[ A^{-1} = E_p E_{p-1} \cdots E_2 E_1, \]
and A is the product of the inverses in the opposite order:
\[ A = E_1^{-1} E_2^{-1} \cdots E_{p-1}^{-1} E_p^{-1}. \]
These computations can be used to formally justify the algorithm for finding inverses.

Example 9.16 (2-by-2 Matrix, continued) Continuing with the 2-by-2 example, four elementary row operations were used. Written as a matrix product:
\[ E_4 E_3 E_2 E_1 A = \begin{bmatrix} -1/7 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & -8 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 7/2 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 2/7 & 1 \end{bmatrix} \begin{bmatrix} -7 & 8 \\ 2 & -2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = I_2. \]
Thus, $A^{-1} = E_4 E_3 E_2 E_1 = \begin{bmatrix} 1 & 4 \\ 1 & 7/2 \end{bmatrix}$.

Relationship to solving systems of equations. The following theorem provides an important relationship between matrix inverses and systems of linear equations.

Theorem 9.17 (Uniqueness Theorem) Let A be an invertible n-by-n matrix. Then the matrix equation Ax = b has the unique solution $x = A^{-1}b$ for any $b \in \mathbb{R}^n$.

9.2.3 Matrix Factorizations

A factorization of a matrix A is an equation expressing A as the product of two or more matrices. This section introduces the LU-factorization, which supports an important practical method for solving large systems of linear equations. Additional factorizations will be considered later in these notes.


LU-Factorization. Let A be an m-by-n matrix. An LU-factorization of A is a matrix equation A = LU, where

1. L is an m-by-m unit lower triangular matrix, and

2. U is an m-by-n (upper) echelon form matrix.

Note that L is said to be a unit lower triangular matrix if L is a lower triangular matrix whose diagonal elements satisfy $\ell_{ii} = 1$ for all i.

The m-by-n matrix A has an LU-factorization when (and only when) A is row equivalent to an echelon form matrix using a sequence of row replacements only:
\[ E_p E_{p-1} \cdots E_2 E_1 A = U \implies A = LU \quad \text{where } L = (E_p E_{p-1} \cdots E_2 E_1)^{-1}. \]
In this equation, the $E_i$'s are the elementary matrices representing the row replacements.

Example 9.18 (3-by-3 Matrix) Consider the following matrix and equivalent forms:
\[ A = \begin{bmatrix} 3 & 1 & 0 \\ -6 & 0 & -4 \\ 0 & 6 & -10 \end{bmatrix} \sim \begin{bmatrix} 3 & 1 & 0 \\ 0 & 2 & -4 \\ 0 & 6 & -10 \end{bmatrix} \sim \begin{bmatrix} 3 & 1 & 0 \\ 0 & 2 & -4 \\ 0 & 0 & 2 \end{bmatrix}. \]
The last matrix is the echelon matrix U. Since two row replacements were used,
\[ L = E_1^{-1} E_2^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 3 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 0 & 3 & 1 \end{bmatrix}. \]

Example 9.19 (3-by-4 Matrix) Consider the following matrix and equivalent forms:
\[ A = \begin{bmatrix} 2 & 4 & -1 & 5 \\ -4 & -5 & 3 & -8 \\ 2 & -5 & -4 & 1 \end{bmatrix} \sim \begin{bmatrix} 2 & 4 & -1 & 5 \\ 0 & 3 & 1 & 2 \\ 0 & -9 & -3 & -4 \end{bmatrix} \sim \begin{bmatrix} 2 & 4 & -1 & 5 \\ 0 & 3 & 1 & 2 \\ 0 & 0 & 0 & 2 \end{bmatrix}. \]
The last matrix is the echelon matrix U. Since three row replacements were used,
\[ L = E_1^{-1} E_2^{-1} E_3^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -3 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 1 & -3 & 1 \end{bmatrix}. \]

Application. If A = LU and Ax = b is consistent, then the system can be solved efficiently. Consider the following setup:
\[ b = Ax = (LU)x = L(Ux) = Ly \quad \text{where } y = Ux. \]
The efficient solution is done in two steps:

1. Solve Ly = b for y.

2. Solve Ux = y for x.

The first step is essentially a forward substitution method, and the second step is essentially a backward substitution method. The process is especially useful when the same coefficient matrix is used with several different right hand sides.
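As a concrete sketch of the two-step solve (assuming SciPy is available; lu_factor and lu_solve are SciPy routines that factor once and then reuse the factorization):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[3.0, 1.0, 0.0],
              [-6.0, 0.0, -4.0],
              [0.0, 6.0, -10.0]])

lu, piv = lu_factor(A)           # factor once: PA = LU (L and U packed in lu)

# Reuse the factorization for several right hand sides.
for b in (np.array([1.0, 2.0, 3.0]), np.array([0.0, -4.0, 2.0])):
    x = lu_solve((lu, piv), b)   # forward then backward substitution
    print(x, np.allclose(A @ x, b))
```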


Permutation matrices and factorizations. A permutation matrix P is a product of elementary matrices corresponding to row interchanges.

In the LU-factorization, L encodes information about row replacements, and the pivot positions of U encode information about scalings. If row interchanges are needed to solve a system, then a permutation matrix can be used to encode these operations. Specifically, let P be the product of interchange matrices needed to bring the rows of A into proper order for an LU-factorization, and write PA = LU. (If row interchanges are needed, then A itself does not have an LU-factorization, but the matrix PA does.)

9.2.4 Partitioned Matrices

A partitioned matrix (or blocked matrix) is one whose entries are themselves matrices. An m-by-n matrix can be partitioned in many ways, including the following 2-by-2 form:
\[ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \quad \text{where } A_{ij} \text{ is a } p_i\text{-by-}q_j \text{ matrix}, \]
$p_1 + p_2 = m$ and $q_1 + q_2 = n$. (This can be generalized to 3-by-3 forms, etc.)

Each submatrix in a partitioned matrix is called a block. Partitioning is especially useful when you can take advantage of block patterns, for example, when certain $A_{ij}$'s are zero matrices, identity matrices, diagonal matrices, or matrices all of whose entries are equal.

If all submatrices are conformable, then partitioned matrices can be added or multiplied by adding or multiplying the appropriate blocks. In particular,
\[ AB = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} \]
where $C_{ij} = A_{i1}B_{1j} + A_{i2}B_{2j}$ for each i and j.

This section considers inverses of partitioned square matrices. Additional methods for partitioned matrices will be considered in later sections of these notes.

Block diagonal matrices. A 2-by-2 block diagonal matrix is a square matrix A of order n written as follows:
\[ A = \begin{bmatrix} A_{11} & O \\ O & A_{22} \end{bmatrix} \quad \text{where } A_{11} \text{ is } p\text{-by-}p \text{ and } A_{22} \text{ is } q\text{-by-}q, \]
each O is a zero matrix, and p + q = n. (This can be generalized to 3-by-3, etc.)

A is invertible if and only if both $A_{11}$ and $A_{22}$ are invertible. Then,
\[ A^{-1} = \begin{bmatrix} A_{11}^{-1} & O \\ O & A_{22}^{-1} \end{bmatrix}. \]

Example 9.20 (4-by-4 Matrix) Let A be the 4-by-4 matrix shown on the left below. The inverse of A can be computed using the formula for 2-by-2 block diagonal matrices:
\[ A = \begin{bmatrix} 8 & -5 & 0 & 0 \\ -3 & 2 & 0 & 0 \\ 0 & 0 & 9 & 4 \\ 0 & 0 & 4 & 2 \end{bmatrix} \implies A^{-1} = \begin{bmatrix} 2 & 5 & 0 & 0 \\ 3 & 8 & 0 & 0 \\ 0 & 0 & 1 & -2 \\ 0 & 0 & -2 & 9/2 \end{bmatrix}. \]


Block upper triangular matrices. A 2-by-2 block upper triangular matrix is a square matrix A of order n written as follows:
\[ A = \begin{bmatrix} A_{11} & A_{12} \\ O & A_{22} \end{bmatrix} \quad \text{where } A_{11} \text{ is } p\text{-by-}p, \; A_{12} \text{ is } p\text{-by-}q, \; A_{22} \text{ is } q\text{-by-}q, \]
O is a zero matrix, and p + q = n. (This can be generalized to 3-by-3, etc.)

A is invertible if and only if both $A_{11}$ and $A_{22}$ are invertible. Then,
\[ A^{-1} = \begin{bmatrix} A_{11}^{-1} & C \\ O & A_{22}^{-1} \end{bmatrix}, \quad \text{where } C = -A_{11}^{-1} A_{12} A_{22}^{-1}. \]
In particular, if $A_{11}$ and $A_{22}$ are identity matrices, then $A^{-1} = \begin{bmatrix} I_p & -A_{12} \\ O & I_q \end{bmatrix}$.

Block lower triangular matrices. A 2-by-2 block lower triangular matrix is a square matrix A of order n written as follows:
\[ A = \begin{bmatrix} A_{11} & O \\ A_{21} & A_{22} \end{bmatrix} \quad \text{where } A_{11} \text{ is } p\text{-by-}p, \; A_{21} \text{ is } q\text{-by-}p, \; A_{22} \text{ is } q\text{-by-}q, \]
O is a zero matrix, and p + q = n. (This can be generalized to 3-by-3, etc.)

Since $A^T = \begin{bmatrix} A_{11}^T & A_{21}^T \\ O & A_{22}^T \end{bmatrix}$ is block upper triangular, the method above easily generalizes to handle this case.

Block elimination and Schur complements. Let M be a square matrix of order n written as follows:
\[ M = \begin{bmatrix} A & B \\ C & D \end{bmatrix} \quad \text{where the diagonal submatrices are square,} \]
and assume that A is invertible. Then block elimination subtracts $CA^{-1}$ times the first row from the second row. This leaves the Schur complement of A in the lower right corner of the resulting block upper triangular matrix, as illustrated below:
\[ \begin{bmatrix} I & O \\ -CA^{-1} & I \end{bmatrix} \begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} A & B \\ O & D - CA^{-1}B \end{bmatrix}. \]
If the Schur complement $(D - CA^{-1}B)$ is invertible, then the inverse of M can be found in block form using methods for block triangular matrices.
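A small numerical sketch of this idea, assuming NumPy; the helper schur_inverse and the block size argument are ours for illustration, using the standard block inverse formula and the 5-by-5 matrix of Example 9.21 below.

```python
import numpy as np

def schur_inverse(M, p):
    """Invert M by block elimination, using the Schur complement of the
    leading p-by-p block A. Assumes A and D - C A^{-1} B are invertible."""
    A, B = M[:p, :p], M[:p, p:]
    C, D = M[p:, :p], M[p:, p:]
    Ainv = np.linalg.inv(A)
    S = D - C @ Ainv @ B                   # Schur complement of A
    Sinv = np.linalg.inv(S)
    top_left = Ainv + Ainv @ B @ Sinv @ C @ Ainv
    return np.block([[top_left,         -Ainv @ B @ Sinv],
                     [-Sinv @ C @ Ainv,  Sinv]])

M = np.array([[3., 1., 0., -42., 0.],
              [4., 1., -2., 0., -1.],
              [2., 0., -5., 1., -7.],
              [1., 0., 1., 2., 1.],
              [1., 1., 1., 3., 4.]])
print(np.allclose(schur_inverse(M, 3), np.linalg.inv(M)))  # True
```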

Example 9.21 (5-by-5 Matrix) Consider the 5-by-5 matrix M (written as a 2-by-2 block matrix) and its Schur complement:
\[ M = \begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} 3 & 1 & 0 & -42 & 0 \\ 4 & 1 & -2 & 0 & -1 \\ 2 & 0 & -5 & 1 & -7 \\ 1 & 0 & 1 & 2 & 1 \\ 1 & 1 & 1 & 3 & 4 \end{bmatrix} \quad \text{and} \quad (D - CA^{-1}B) = \begin{bmatrix} -289 & -13 \\ 378 & 17 \end{bmatrix}. \]


Since A, D and $(D - CA^{-1}B)$ are invertible, we can compute $M^{-1}$ using block methods. Using properties of inverses,
\[ M^{-1} = \begin{bmatrix} A & B \\ O & D - CA^{-1}B \end{bmatrix}^{-1} \begin{bmatrix} I & O \\ -CA^{-1} & I \end{bmatrix} \]
\[ = \begin{bmatrix} -5 & 5 & -2 & -134 & -103 \\ 16 & -15 & 6 & 1116 & 855 \\ -2 & 2 & -1 & 479 & 366 \\ 0 & 0 & 0 & 17 & 13 \\ 0 & 0 & 0 & -378 & -289 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 7 & -7 & 3 & 1 & 0 \\ -9 & 8 & -3 & 0 & 1 \end{bmatrix} \]
\[ = \begin{bmatrix} -16 & 119 & -95 & -134 & -103 \\ 133 & -987 & 789 & 1116 & 855 \\ 57 & -423 & 338 & 479 & 366 \\ 2 & -15 & 12 & 17 & 13 \\ -45 & 334 & -267 & -378 & -289 \end{bmatrix}. \]

9.3 Determinants of Square Matrices

Determinants of square matrices were introduced in Section 7.1.8. This section gives additional information about determinants.

Recursive or cofactor definition. Let A be a square matrix of order n, and let $A_{ij}$ be the submatrix obtained by eliminating the ith row and the jth column of A. Then the determinant of A can be defined recursively as follows:
\[ \det(A) = \sum_{j=1}^{n} (-1)^{1+j} a_{1j} \det(A_{1j}). \]
(The determinant has been computed by expanding across the first row.)

The value of the determinant remains the same if we expand across the ith row:
\[ \det(A) = \sum_{j=1}^{n} (-1)^{i+j} a_{ij} \det(A_{ij}) \]
or if we expand down the jth column:
\[ \det(A) = \sum_{i=1}^{n} (-1)^{i+j} a_{ij} \det(A_{ij}) \]
for any i or j. The number
\[ c_{ij} = (-1)^{i+j} \det(A_{ij}) \]
is often called the ij-cofactor of A.
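The recursive definition translates directly into code. The sketch below is ours, for illustration only; cofactor expansion costs on the order of n! operations, so row reduction is used in practice.

```python
import numpy as np

def det_cofactor(A):
    """Determinant by cofactor expansion across the first row."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)  # submatrix A_{1j}
        total += (-1) ** j * A[0, j] * det_cofactor(minor)     # (-1)^{1+j} a_{1j} det(A_{1j})
    return total

A = np.array([[0., 1., 2.], [1., 0., 3.], [4., -3., 8.]])
print(det_cofactor(A), np.linalg.det(A))   # both approximately -2
```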


Properties of determinants. Let A and B be square matrices of order n. Then

1. Transpose: $\det(A^T) = \det(A)$.

2. Triangular: If A is a triangular matrix, then $\det(A) = a_{11} a_{22} \cdots a_{nn}$.

3. Product: $\det(AB) = \det(A)\det(B)$.

4. Idempotent: If $A^2 = A$ (A is an idempotent matrix), then $\det(A) = 0$ or 1.

5. Inverse: A is an invertible matrix if and only if $\det(A) \neq 0$. Further, if the matrix A is invertible, then $\det(A^{-1}) = 1/\det(A)$.

6. Elementary Matrices: Let E be an elementary matrix. Then
\[ \det(E) = \begin{cases} 1 & \text{when } E \text{ corresponds to row replacement} \\ -1 & \text{when } E \text{ corresponds to row interchange} \\ c & \text{when } E \text{ corresponds to scaling a row by } c \end{cases} \]

7. Zeroes: If A has a row or column of zeroes, then $\det(A) = 0$.

8. Identical: If two rows or two columns of A are identical, then $\det(A) = 0$.

Example 9.22 (3-by-3 Invertible Matrix, continued) If $A = \begin{bmatrix} 0 & 1 & 2 \\ 1 & 0 & 3 \\ 4 & -3 & 8 \end{bmatrix}$, then
\[ \det(A) = \begin{vmatrix} 0 & 1 & 2 \\ 1 & 0 & 3 \\ 4 & -3 & 8 \end{vmatrix} = 0 - \begin{vmatrix} 1 & 3 \\ 4 & 8 \end{vmatrix} + 2\begin{vmatrix} 1 & 0 \\ 4 & -3 \end{vmatrix} = 0 - (-4) + 2(-3) = -2 \neq 0. \]

Example 9.23 (3-by-3 Singular Matrix, continued) If $A = \begin{bmatrix} 0 & 1 & 2 \\ 1 & 0 & 3 \\ 4 & -3 & 6 \end{bmatrix}$, then
\[ \det(A) = \begin{vmatrix} 0 & 1 & 2 \\ 1 & 0 & 3 \\ 4 & -3 & 6 \end{vmatrix} = 0 - \begin{vmatrix} 1 & 3 \\ 4 & 6 \end{vmatrix} + 2\begin{vmatrix} 1 & 0 \\ 4 & -3 \end{vmatrix} = 0 - (-6) + 2(-3) = 0. \]

Relationship to pivots. Consider the square matrix A of order n. Let K be a product of elementary matrices needed to bring A into echelon form, and let U be the resulting echelon form matrix: KA = U.

Since $\det(K) \neq 0$ by Properties 3 and 5, and $\det(U) = \prod_i u_{ii}$ by Property 2,
\[ \det(A) = \frac{1}{\det(K)} \prod_i u_{ii} \quad \text{(using Property 3)}. \]
Thus, $\det(A) = 0$ when there are fewer than n pivot positions.


Reducing to echelon form. The determinant of A can be derived by reducing A to an echelon form matrix and observing the following:

1. If a multiple of one row is added to another, then the determinant is unchanged.

2. If two rows are interchanged, then the determinant changes sign.

3. If a row is multiplied by c, then the determinant is multiplied by c.

For example,
\[ \begin{vmatrix} 3 & 6 & 6 \\ 1 & 2 & 1 \\ 1 & 4 & 4 \end{vmatrix} = 3\begin{vmatrix} 1 & 2 & 2 \\ 1 & 2 & 1 \\ 1 & 4 & 4 \end{vmatrix} = 3\begin{vmatrix} 1 & 2 & 2 \\ 0 & 0 & -1 \\ 0 & 2 & 2 \end{vmatrix} = -3\begin{vmatrix} 1 & 2 & 2 \\ 0 & 2 & 2 \\ 0 & 0 & -1 \end{vmatrix} = -3(-2) = 6. \]

Explicit formula for the inverse using cofactors. If A is invertible, then the inverse of A can be written explicitly in terms of determinants as follows:
\[ A^{-1} = \frac{1}{\det(A)} \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1n} \\ c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & \vdots & & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{nn} \end{bmatrix}^T \quad \text{where } c_{ij} = (-1)^{i+j}\det(A_{ij}). \]
($A^{-1}$ equals the reciprocal of $\det(A)$ times the transpose of the cofactor matrix.)

This explicit formula is often useful when proving theorems about inverses, but it is not a useful computational formula.

9.4 Vector Spaces and Subspaces

This section introduces vector space theory, emphasizing topics of interest in statistics.

9.4.1 Definitions

A vector space V over the reals is a set of objects (called vectors), and operations of addition and scalar multiplication satisfying the following axioms:

1. Vector sum: Vector sum is commutative and associative.

2. Zero vector: There is a zero vector O satisfying

O + x = x = x + O for each x ∈ V .

3. Additive inverse: For each x ∈ V ,

y + x = O = x + y for some y ∈ V .


4. Scalar multiples: Given x,y ∈ V and c, d ∈ R,

c(x + y) = (cx) + (cy), (c + d)x = (cx) + (dx), (cd)x = c(dx), 1x = x.

Examples of vector spaces include

1. Rm with the usual vector addition and scalar multiplication.

2. The set of r-by-c matrices, $M_{r \times c}$, with the usual operations of matrix addition and scalar multiplication.

3. The set of polynomial functions of degree at most k, $P_k$,
\[ P_k = \{\, p(t) = a_0 + a_1 t + a_2 t^2 + \cdots + a_k t^k : a_i \in \mathbb{R} \,\}, \]
where polynomials are added by adding their values for each t, and the scalar multiple by c is obtained by multiplying each value of the polynomial by the scalar c.

4. The set of all polynomial functions, P∞,

P∞ = P1 ∪ P2 ∪ P3 ∪ · · ·

with addition and scalar multiplication as defined above.

5. The set of continuous real-valued functions with domain [0, 1], C[0, 1],
\[ C[0,1] = \{\, f : [0,1] \to \mathbb{R} : f \text{ is continuous} \,\}, \]
where functions are added by adding their values for each $x \in [0,1]$, and the scalar multiple by c is obtained by multiplying each value of the function by the scalar c.

Subspaces. A subset H ⊆ V is a subspace of the vector space V if

1. Contains zero vector : O ∈ H.

2. Closed under addition: If x,y ∈ H, then x + y ∈ H.

3. Closed under scalar multiplication: If x ∈ H, then cx ∈ H for each c ∈ R.

Note that the subset $\{O\}$ is a subspace, called the trivial subspace.

Example 9.24 (Lines in R2) The set of points in the plane satisfying y = mx + b for some m and b,
\[ L_{m,b} = \left\{ \begin{bmatrix} x \\ y \end{bmatrix} : y = mx + b \right\} \subseteq \mathbb{R}^2, \]
is a subspace of $\mathbb{R}^2$ when b = 0 (the line goes through the origin) and is not a subspace when b ≠ 0 (the line does not go through the origin).


Example 9.25 (Planes in R3) The set of points in three-space satisfying z = ax + by + c for some a, b, c,
\[ P_{a,b,c} = \left\{ \begin{bmatrix} x \\ y \\ z \end{bmatrix} : z = ax + by + c \right\} \subseteq \mathbb{R}^3, \]
is a subspace of $\mathbb{R}^3$ when c = 0 (the plane goes through the origin) and is not a subspace when c ≠ 0 (the plane does not go through the origin).

Of interest. Subspaces are vector spaces in their own right. Of interest in statistics are the vector spaces $\mathbb{R}^m$ and the vector subspaces $H \subseteq \mathbb{R}^m$.

9.4.2 Subspaces of Rm

Recall (from Section 9.1.3) that the span of the set $\{v_1, v_2, \ldots, v_n\} \subseteq \mathbb{R}^m$ is the collection of all linear combinations of the vectors $v_1, v_2, \ldots, v_n$:
\[ \mathrm{Span}\{v_1, v_2, \ldots, v_n\} = \{\, c_1 v_1 + c_2 v_2 + \cdots + c_n v_n : c_i \in \mathbb{R} \,\} \subseteq \mathbb{R}^m. \]

Spans are subspaces. Of particular interest are spans related to coefficient matrices.

Column space of A. Let $A = [\,a_1\; a_2\; \cdots\; a_n\,]$ be an m-by-n coefficient matrix. The column space of A is the span of the columns of A:
\[ \mathrm{Col}(A) = \mathrm{Span}\{a_1, a_2, \ldots, a_n\} \subseteq \mathbb{R}^m. \]
The column space of A is the collection of all b so that Ax = b is consistent.

Null space of A. Let A be an m-by-n coefficient matrix. The null space of A is the set of solutions of the homogeneous system Ax = O:
\[ \mathrm{Null}(A) = \{\, x : Ax = O \,\} \subseteq \mathbb{R}^n. \]
Since Null(A) can be written as a span of a set of vectors, it is a subspace of $\mathbb{R}^n$.

9.4.3 Linear Independence, Basis and Dimension

The set $\{v_1, v_2, \ldots, v_n\} \subseteq \mathbb{R}^m$ is said to be linearly independent when
\[ c_1 v_1 + c_2 v_2 + \cdots + c_n v_n = O \]
only when each $c_i = 0$. Otherwise, the set is said to be linearly dependent.

The following uniqueness theorem gives an important property of linearly independent sets.


Theorem 9.26 (Uniqueness Theorem) If $\{v_1, v_2, \ldots, v_n\} \subseteq \mathbb{R}^m$ is a linearly independent set and $w \in \mathrm{Span}\{v_1, v_2, \ldots, v_n\}$, then w has a unique representation as a linear combination of the $v_i$'s. That is, we can write
\[ w = c_1 v_1 + c_2 v_2 + \cdots + c_n v_n \]
for a unique ordered list of scalars $c_1, c_2, \ldots, c_n$ (called the coordinates of w).

Note that the coordinates can be found by solving the system
\[ [\,v_1\; v_2\; \cdots\; v_n\,] \begin{bmatrix} c_1 \\ \vdots \\ c_n \end{bmatrix} = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix}. \]
That is, by solving Ac = w, where A is the m-by-n matrix whose columns are the $v_i$'s.

Additional facts. Some additional facts about linear independence/dependence are:

1. If $v_i = O$ for some i, then the set $\{v_1, v_2, \ldots, v_n\}$ is linearly dependent.

2. If n > m, then the set $\{v_1, v_2, \ldots, v_n\}$ is linearly dependent.

3. Assume $v_i \neq O$ for each i. Then, the set is linearly dependent if and only if one vector in the set can be written as a linear combination of the others.

4. Let A be the m-by-n matrix whose columns are the $v_i$'s: $A = [\,v_1\; v_2\; \cdots\; v_n\,]$. Then, the set is linearly independent if and only if A has a pivot in every column.

5. With A as in the previous fact, the set is linearly independent if and only if the homogeneous system Ax = O has the trivial solution only.

Bases. A basis for the vector space $V \subseteq \mathbb{R}^m$ is a linearly independent set which spans V. That is, a basis is a set of linearly independent vectors $\{v_1, v_2, \ldots, v_n\}$ satisfying
\[ V = \mathrm{Span}\{v_1, v_2, \ldots, v_n\}. \]

Example 9.27 (Bases for R3) Bases are not unique. For example, the sets
\[ \left\{ \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \right\} \quad \text{and} \quad \left\{ \begin{bmatrix} -3 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 2 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 6 \\ -2 \\ 1 \end{bmatrix} \right\} \]
are each bases of $V = \mathbb{R}^3$.

Further, if the components of $x \in \mathbb{R}^3$ are $x_1, x_2, x_3$, then the coordinates of x

1. With respect to the first basis are $x_1, x_2, x_3$.

2. With respect to the second basis are $\frac{1}{3}(-x_1 + 2x_2 + 10x_3)$, $x_2 + 2x_3$, $x_3$.


Standard basis for Rm. The standard basis for $\mathbb{R}^m$ is the set $\{e_1, e_2, \ldots, e_m\}$, where the $e_i$'s are the columns of the identity matrix $I_m$. That is, $e_i$ is the vector whose jth component equals 0 when j ≠ i and equals 1 when j = i.

Note that the first basis given in the previous example is the standard basis for $\mathbb{R}^3$.

Basis for the span of a set of vectors. Suppose that
\[ V = \mathrm{Span}\{v_1, v_2, \ldots, v_n\} \subseteq \mathbb{R}^m. \]
A basis for V can be found using row reductions. Specifically,

Theorem 9.28 (Spanning Set Theorem) If $V = \mathrm{Span}\{v_1, v_2, \ldots, v_n\} \subseteq \mathbb{R}^m$ is not the trivial subspace, and A is the m-by-n matrix whose columns are the $v_i$'s,
\[ A = [\,v_1\; v_2\; \cdots\; v_n\,], \]
then the pivot columns of A form a basis for the vector space V.

Basis for the column space of A. In particular, the pivot columns of the coefficient matrix A form a basis for the column space of A, Col(A).

Example 9.29 (3-by-5 Matrix) Consider the following 3-by-5 coefficient matrix A and its equivalent reduced echelon form matrix:
\[ A = [\,a_1\; a_2\; a_3\; a_4\; a_5\,] = \begin{bmatrix} 0 & 3 & -6 & 6 & 4 \\ 3 & -7 & 8 & -5 & 8 \\ 3 & -9 & 12 & -9 & 6 \end{bmatrix} \sim \begin{bmatrix} 1 & 0 & -2 & 3 & 0 \\ 0 & 1 & -2 & 2 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} \]
Since the pivots are in columns 1, 2, and 5, a basis for Col(A) is the set $\{a_1, a_2, a_5\}$.
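A sketch of this computation, assuming SymPy is available; Matrix.rref returns the reduced echelon form together with the pivot column indices, and nullspace returns a null space basis.

```python
from sympy import Matrix

A = Matrix([[0, 3, -6, 6, 4],
            [3, -7, 8, -5, 8],
            [3, -9, 12, -9, 6]])

R, pivots = A.rref()                 # reduced echelon form and pivot columns
print(pivots)                        # (0, 1, 4): columns 1, 2, 5 in 1-based numbering
print([A.col(j).T for j in pivots])  # a basis for Col(A)
print(A.nullspace())                 # a basis for Null(A), one vector per free variable
```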

Basis for the null space of A. The method for writing the solution set of a homogeneous system Ax = O given in Section 9.1.3 automatically produces a set of linearly independent vectors which span Null(A). Thus, no further work is needed to find a basis for Null(A).

Example 9.30 (3-by-5 Matrix, continued) Continuing with the example above, since
\[ Ax = O \implies \begin{cases} x_1 = 2x_3 - 3x_4 \\ x_2 = 2x_3 - 2x_4 \\ x_3 \text{ is free} \\ x_4 \text{ is free} \\ x_5 = 0 \end{cases} \implies s = x_3 \begin{bmatrix} 2 \\ 2 \\ 1 \\ 0 \\ 0 \end{bmatrix} + x_4 \begin{bmatrix} -3 \\ -2 \\ 0 \\ 1 \\ 0 \end{bmatrix} = x_3 v_1 + x_4 v_2, \]
where $x_3$ and $x_4$ are free, a basis for Null(A) is the set $\{v_1, v_2\}$.


Dimension of a vector space. Although there are many choices for the basis vectors of a vector space V, the number of basis vectors does not change.

Theorem 9.31 (Basis Theorem) If V ⊆ Rm has a basis with p vectors, then

1. Every basis of V must contain exactly p vectors.

2. Any subset of V with p linearly independent vectors is a basis of V .

3. Any subset of V with p vectors which span V is a basis of V .

If V is a nontrivial subspace of $\mathbb{R}^m$, then the dimension of V, denoted by dim(V), is the number of vectors in any basis of V. By the basis theorem, this number is well-defined. For convenience, we say that the dimension of the trivial subspace, $\{O\}$, is zero.

Note that for every subspace V ⊆ Rm, 0 ≤ dim(V ) ≤ m.

9.4.4 Rank and Nullity

Let A be an m-by-n matrix. The rank of A, denoted by rank(A), is the number of pivot columns of A. Equivalently, the rank of A is the dimension of the column space of A,

rank(A) = dim(Col(A)).

The nullity of A, denoted by nullity(A), is the dimension of its null space,

nullity(A) = dim(Null(A)).

If A is the coefficient matrix of an m-by-n system of linear equations and r = rank(A), then the system has r basic variables and n − r free variables. Since the number of free variables corresponds to the dimension of the null space of A, we have

rank(A) + nullity(A) = n.

Example 9.32 (3-by-5 Matrix, continued) For example, the 3-by-5 coefficient matrix A considered earlier has rank 3 and nullity 2. The sum of these dimensions equals the number of columns of the matrix.

Subspaces related to A and $A^T$. Let A be an m-by-n coefficient matrix. There are four important subspaces related to A:

1. Column space of A: The column space of A, $\mathrm{Col}(A) \subseteq \mathbb{R}^m$.

2. Null space of A: The null space of A, $\mathrm{Null}(A) \subseteq \mathbb{R}^n$.

3. Row space of A: The column space of $A^T$, $\mathrm{Col}(A^T) \subseteq \mathbb{R}^n$.

4. Left null space of A: The null space of $A^T$, $\mathrm{Null}(A^T) \subseteq \mathbb{R}^m$.


The dimensions of these spaces are related as follows:

Theorem 9.33 (Fundamental Theorem of Linear Algebra, Part I) If A is an m-by-n coefficient matrix of rank r, then

1. $\mathrm{rank}(A) = \mathrm{rank}(A^T) = r$,

2. $\mathrm{nullity}(A) = n - r$, and

3. $\mathrm{nullity}(A^T) = m - r$.

Matrices of rank one. If the m-by-n matrix A has rank 1, then the columns of A are all multiples of a single vector, say $a_i = c_i v$ for some constants $c_i$ and vector $v \in \mathbb{R}^m$. Let $c \in \mathbb{R}^n$ be the vector whose components are the $c_i$'s. Then
\[ A = [\,c_1 v\; c_2 v\; \cdots\; c_n v\,] = v c^T. \]

9.5 Orthogonality

Let $v, w \in \mathbb{R}^m$. Recall (from Section 7.1.4) that

1. The dot product of v with w is the number
\[ v \cdot w = v^T w = v_1 w_1 + v_2 w_2 + \cdots + v_m w_m. \]
2. If θ is the angle between v and w, then
\[ v \cdot w = \|v\| \, \|w\| \cos(\theta). \]
3. The vectors v and w are said to be orthogonal if $v \cdot w = 0$.

When m = 2 or m = 3, orthogonality corresponds to being at right angles in the usual geometric sense. When m > 3, the definition of orthogonality in terms of the dot product is a useful generalization.

9.5.1 Orthogonal Complements

If $V \subseteq \mathbb{R}^m$ is a subspace, then the orthogonal complement of V is the set of all vectors $x \in \mathbb{R}^m$ which are orthogonal to every vector in V:
\[ V^\perp = \{\, x : v \cdot x = 0 \text{ for every } v \in V \,\} \subseteq \mathbb{R}^m. \]
($V^\perp$ is read "V-perp.")


Properties. Orthogonal complements satisfy the following properties:

1. Subspace: $V^\perp$ is a subspace of $\mathbb{R}^m$.

2. Trivial intersection: Since $x \cdot x = 0$ only when x = O, $V \cap V^\perp = \{O\}$.

3. Complement of complement: $(V^\perp)^\perp = V$.

4. Bases: If you pool bases for V and $V^\perp$, you get a basis for $\mathbb{R}^m$.

5. Dimensions: $\dim(V) + \dim(V^\perp) = m$.

Example 9.34 (Complements in R3) Let V be the following subspace:
\[ V = \mathrm{Span}\{v_1, v_2\} = \mathrm{Span}\left\{ \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}, \begin{bmatrix} 2 \\ 0 \\ -2 \end{bmatrix} \right\} \subseteq \mathbb{R}^3. \]
V is a two-dimensional subspace of $\mathbb{R}^3$ since the set $\{v_1, v_2\}$ is linearly independent. To find $V^\perp$ it is sufficient to find those vectors x satisfying $v_1 \cdot x = 0$ and $v_2 \cdot x = 0$.

Equivalently, we need to solve the matrix equation $\begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix} x = O$. Since
\[ \begin{bmatrix} 1 & -1 & 0 & 0 \\ 2 & 0 & -2 & 0 \end{bmatrix} \sim \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & -1 & 0 \end{bmatrix} \implies \begin{cases} x_1 = x_3 \\ x_2 = x_3 \\ x_3 \text{ is free,} \end{cases} \]
$V^\perp$ is the span of the single vector $v_3 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$. Further, $\{v_1, v_2, v_3\}$ is a basis for $\mathbb{R}^3$.

The example above suggests the second part of the Fundamental Theorem of Linear Algebra:

Theorem 9.35 (Fundamental Theorem of Linear Algebra, Part II) Let A be an m-by-n coefficient matrix. Then

1. $\mathrm{Null}(A^T)$ and $\mathrm{Col}(A)$ are orthogonal complements in $\mathbb{R}^m$.

2. $\mathrm{Null}(A)$ and $\mathrm{Col}(A^T)$ are orthogonal complements in $\mathbb{R}^n$.

9.5.2 Orthogonal Sets, Bases and Projections

The nonzero vectors $v_1, v_2, \ldots, v_n$ are said to be mutually orthogonal if $v_i \cdot v_j = 0$ when i ≠ j. A set of mutually orthogonal vectors is known as an orthogonal set.

The following theorem summarizes several important properties of orthogonal sets.

Theorem 9.36 (Orthogonal Sets) Let $\{v_1, v_2, \ldots, v_n\}$ be an orthogonal set of vectors in $\mathbb{R}^m$, and let V be the subspace spanned by these vectors. Then

1. $\{v_1, v_2, \ldots, v_n\}$ is a basis for V.

2. If $y \in V$, then
\[ y = c_1 v_1 + c_2 v_2 + \cdots + c_n v_n \quad \text{where } c_i = \frac{y \cdot v_i}{v_i \cdot v_i} \text{ for each } i. \]
3. If $y \notin V$, then
\[ \hat{y} = c_1 v_1 + c_2 v_2 + \cdots + c_n v_n \quad \text{where } c_i = \frac{y \cdot v_i}{v_i \cdot v_i} \text{ for each } i \]
is the point closest to y in V. Further, the difference between the vector y and the vector $\hat{y}$ is a vector in the orthogonal complement of V: $(y - \hat{y}) \in V^\perp$.

The first part of the theorem tells us that orthogonal sets are linearly independent. The second part gives a convenient way of finding the coordinates of a vector with respect to a basis of mutually orthogonal vectors (called an orthogonal basis). The third part generalizes the idea from analytic geometry of "dropping a perpendicular."

Projections and orthogonal decompositions. If $y \notin V$, then $\hat{y}$ in the theorem above is called the projection of y on V, and is denoted by
\[ \hat{y} = \mathrm{proj}_V(y). \]
Writing y as the sum of a vector in V and a vector in $V^\perp$,
\[ y = \hat{y} + (y - \hat{y}), \]
is called the orthogonal decomposition of y with respect to V. For each $y \in \mathbb{R}^m$, the orthogonal decomposition of y with respect to V is unique.

9.5.3 Gram-Schmidt Orthogonalization

The Gram-Schmidt orthogonalization process allows us to convert any basis of a subspace of $\mathbb{R}^m$ to an orthogonal basis for the space.

Theorem 9.37 (Gram-Schmidt Process) Let $\{v_1, v_2, \ldots, v_n\}$ be a basis for the vector space $V \subseteq \mathbb{R}^m$, and let $u_1, u_2, \ldots, u_n$ be defined as follows:
\[ u_1 = v_1, \]
\[ u_2 = v_2 - \mathrm{proj}_{V_1}(v_2) \quad \text{where } V_1 = \mathrm{Span}\{u_1\} = \mathrm{Span}\{v_1\}, \]
\[ u_3 = v_3 - \mathrm{proj}_{V_2}(v_3) \quad \text{where } V_2 = \mathrm{Span}\{u_1, u_2\} = \mathrm{Span}\{v_1, v_2\}, \]
\[ \vdots \]
\[ u_n = v_n - \mathrm{proj}_{V_{n-1}}(v_n) \quad \text{where } V_{n-1} = \mathrm{Span}\{u_1, \ldots, u_{n-1}\} = \mathrm{Span}\{v_1, \ldots, v_{n-1}\}. \]
Then $\{u_1, u_2, \ldots, u_n\}$ is an orthogonal basis of V.
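A minimal sketch of the process, assuming NumPy; the function gram_schmidt is ours for illustration. It reproduces the vectors of Example 9.38 below.

```python
import numpy as np

def gram_schmidt(vectors):
    """Convert a basis (a list of 1-D arrays) to an orthogonal basis."""
    basis = []
    for v in vectors:
        u = v.astype(float)
        for w in basis:                    # subtract the projection on each u_j
            u = u - (u @ w) / (w @ w) * w
        basis.append(u)
    return basis

v1 = np.array([1, -1, 0, 1])
v2 = np.array([2, 0, -2, 1])
v3 = np.array([4, 0, 5, 2])
for u in gram_schmidt([v1, v2, v3]):
    print(u)    # [1 -1 0 1], [1 1 -2 0], [3 3 3 0], as in Example 9.38
```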


Example 9.38 (Subspace of R4) If $V = \mathrm{Span}\left\{ \begin{bmatrix} 1 \\ -1 \\ 0 \\ 1 \end{bmatrix}, \begin{bmatrix} 2 \\ 0 \\ -2 \\ 1 \end{bmatrix}, \begin{bmatrix} 4 \\ 0 \\ 5 \\ 2 \end{bmatrix} \right\}$, then
\[ u_1 = v_1 = \begin{bmatrix} 1 \\ -1 \\ 0 \\ 1 \end{bmatrix}, \quad u_2 = v_2 - \left(\frac{v_2 \cdot u_1}{u_1 \cdot u_1}\right) u_1 = \begin{bmatrix} 2 \\ 0 \\ -2 \\ 1 \end{bmatrix} - \left(\frac{3}{3}\right) \begin{bmatrix} 1 \\ -1 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ -2 \\ 0 \end{bmatrix}, \]
and
\[ u_3 = v_3 - \left(\frac{v_3 \cdot u_1}{u_1 \cdot u_1}\right) u_1 - \left(\frac{v_3 \cdot u_2}{u_2 \cdot u_2}\right) u_2 = \begin{bmatrix} 4 \\ 0 \\ 5 \\ 2 \end{bmatrix} - \left(\frac{6}{3}\right) \begin{bmatrix} 1 \\ -1 \\ 0 \\ 1 \end{bmatrix} - \left(\frac{-6}{6}\right) \begin{bmatrix} 1 \\ 1 \\ -2 \\ 0 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \\ 3 \\ 0 \end{bmatrix}. \]
$\{u_1, u_2, u_3\}$ is an orthogonal basis of $V \subseteq \mathbb{R}^4$.

9.5.4 QR-Factorization

Orthogonal vectors make many calculations easier. For example, if $\{u_1, u_2, \ldots, u_n\}$ is an orthogonal set in $\mathbb{R}^m$, and A is the matrix whose columns are the $u_i$'s, then $A^T A$ is a diagonal matrix. This idea motivates a technique known as QR-factorization.

Orthonormal sets. The orthogonal set $\{q_1, q_2, \ldots, q_n\}$ is said to be an orthonormal set of vectors if each $q_i$ is a unit vector. (Each vector has been normalized to have length 1.)

Theorem 9.39 (Orthonormal Sets) If $\{q_1, q_2, \ldots, q_n\}$ is an orthonormal set of vectors in $\mathbb{R}^m$, Q is the m-by-n matrix whose columns are the $q_i$'s, and $x, y \in \mathbb{R}^n$, then

1. $Q^T Q = I_n$ (and $QQ^T = I_m$ when n = m).

2. $\|Qx\| = \|x\|$.

3. $(Qx) \cdot (Qy) = x \cdot y$.

4. $(Qx) \cdot (Qy) = 0$ if and only if $x \cdot y = 0$.

Orthogonal matrices. The square matrix Q is said to be an orthogonal matrix if its columns form an orthonormal set. If Q is an orthogonal matrix of order n, then the first part of the theorem above implies
\[ Q^T Q = Q Q^T = I_n \implies Q^{-1} = Q^T. \]
That is, the inverse of Q is equal to its transpose.


QR-factorization. Let $A = [\,v_1\; v_2\; \cdots\; v_n\,]$ be an m-by-n matrix whose columns are linearly independent. Use the Gram-Schmidt process to convert the $v_i$'s to $u_i$'s, let
\[ q_i = \frac{1}{\|u_i\|} u_i \quad \text{for } i = 1, 2, \ldots, n, \]
and let Q be the m-by-n matrix whose columns are the $q_i$'s.

In the Gram-Schmidt process,
\[ v_i \in \mathrm{Span}\{v_1, \ldots, v_i\} = \mathrm{Span}\{u_1, \ldots, u_i\} \quad \text{for } i = 1, 2, \ldots, n. \]
By the uniqueness theorem for coordinates, $Q^T A = R$ must be an upper triangular matrix of order n. Further, it can be shown that the diagonal elements of R are all positive. Since the matrix product $QQ^T$ acts as the identity on the column space of A (each column of A lies in Col(Q)),
\[ Q^T A = R \implies Q(Q^T A) = QR \implies A = QR. \]
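A quick numerical check, assuming NumPy; np.linalg.qr returns a Q with orthonormal columns and an upper triangular R, though possibly with sign conventions that differ from the Gram-Schmidt construction.

```python
import numpy as np

A = np.array([[1., 2., 4.],
              [-1., 0., 0.],
              [0., -2., 5.],
              [1., 1., 2.]])

Q, R = np.linalg.qr(A)                   # reduced QR: Q is 4-by-3, R is 3-by-3
print(np.allclose(Q.T @ Q, np.eye(3)))   # True: orthonormal columns
print(np.allclose(Q @ R, A))             # True: A = QR
print(R)                                 # upper triangular
```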

Example 9.40 (Subspace of R4, continued) For example, the QR-factorization of the 4-by-3 matrix whose columns are the $v_i$'s from the Gram-Schmidt example above is
\[ \begin{bmatrix} 1 & 2 & 4 \\ -1 & 0 & 0 \\ 0 & -2 & 5 \\ 1 & 1 & 2 \end{bmatrix} = \begin{bmatrix} 1/\sqrt{3} & 1/\sqrt{6} & 1/\sqrt{3} \\ -1/\sqrt{3} & 1/\sqrt{6} & 1/\sqrt{3} \\ 0 & -2/\sqrt{6} & 1/\sqrt{3} \\ 1/\sqrt{3} & 0 & 0 \end{bmatrix} \begin{bmatrix} \sqrt{3} & \sqrt{3} & 2\sqrt{3} \\ 0 & \sqrt{6} & -\sqrt{6} \\ 0 & 0 & 3\sqrt{3} \end{bmatrix}. \]

9.5.5 Linear Least Squares

One of the most useful applications of the methods from this section is to the linear least squares problem. The introduction here focuses on the general problem of finding approximate solutions to inconsistent systems.

Let A be an m-by-n coefficient matrix, and assume that Ax = b is inconsistent (b is not in the column space of A). To find approximate solutions to the original system, we propose to do the following:

1. Find the projection of b on Col(A), $\hat{b}$, and

2. Report solutions to the consistent system $Ax = \hat{b}$.

There are two key observations:

Observation 1: Since $\hat{b}$ is as close to b as possible, each approximate solution $\hat{x}$ satisfies
\[ \|b - \hat{b}\| = \|b - A\hat{x}\| \text{ is as small as possible.} \]


The difference vector is
\[ b - A\hat{x} = \begin{bmatrix} b_1 - (a_{1,1}\hat{x}_1 + a_{1,2}\hat{x}_2 + \cdots + a_{1,n}\hat{x}_n) \\ b_2 - (a_{2,1}\hat{x}_1 + a_{2,2}\hat{x}_2 + \cdots + a_{2,n}\hat{x}_n) \\ \vdots \\ b_m - (a_{m,1}\hat{x}_1 + a_{m,2}\hat{x}_2 + \cdots + a_{m,n}\hat{x}_n) \end{bmatrix} \]
and the square of the length of the difference vector is
\[ \sum_{i=1}^{m} \left( b_i - (a_{i,1}\hat{x}_1 + a_{i,2}\hat{x}_2 + \cdots + a_{i,n}\hat{x}_n) \right)^2. \]
Each approximate solution $\hat{x}$ will minimize the above sum of squared differences. For this reason the approximate solutions are called least squares solutions.

Observation 2: Since the difference vector
\[ (b - \hat{b}) = (b - A\hat{x}) \in \mathrm{Col}(A)^\perp = \mathrm{Null}(A^T) \]
(by the Fundamental Theorem of Linear Algebra), we know that
\[ O = A^T(b - A\hat{x}) = A^T b - A^T A \hat{x} \implies A^T A \hat{x} = A^T b. \]
Thus, least squares solutions can be found by solving the consistent system on the right; there is no need to explicitly compute the projection of b on the column space of A.

The matrix equation $A^T A \hat{x} = A^T b$ is often called the normal equation of the system.
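A sketch in code, assuming NumPy and using the data of Example 9.41 below: solving the normal equations directly, and comparing with np.linalg.lstsq, which solves the same least squares problem more stably via an orthogonal factorization.

```python
import numpy as np

A = np.array([[1., 0.], [1., 1.], [1., 2.]])
b = np.array([6., 0., 0.])

# Normal equations: (A^T A) x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)
print(x_normal)                              # [ 5. -3.]

# Library least squares solver (preferred numerically)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_lstsq)                               # [ 5. -3.]
```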

Example 9.41 (3-by-2 Inconsistent System) Consider the inconsistent system
\[ \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 6 \\ 0 \\ 0 \end{bmatrix}. \]
Then
\[ A^T A \hat{x} = A^T b \implies \begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 6 \\ 0 \end{bmatrix} \implies \begin{cases} x = 5 \\ y = -3. \end{cases} \]
Thus, x = 5 and y = −3 is the unique least squares solution to the system.

Example 9.42 (4-by-3 Inconsistent System) Consider the inconsistent system
\[ \begin{bmatrix} 1 & 3 & 5 \\ 1 & 1 & 0 \\ 1 & 1 & 2 \\ 1 & 3 & 3 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 \\ 5 \\ 7 \\ -3 \end{bmatrix}. \]
Then
\[ A^T A \hat{x} = A^T b \implies \begin{bmatrix} 4 & 8 & 10 \\ 8 & 20 & 26 \\ 10 & 26 & 38 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 12 \\ 12 \\ 20 \end{bmatrix} \implies \begin{cases} x = 10 \\ y = -6 \\ z = 2. \end{cases} \]
Thus, x = 10, y = −6 and z = 2 is the unique least squares solution to the system.


Remark. If the columns of the coefficient matrix A are linearly independent, then the symmetric matrix $A^T A$ is invertible and
\[ \hat{x} = (A^T A)^{-1} A^T b \]
is the unique least squares solution. (The proof uses the QR-factorization of A.)

9.6 Eigenvalues and Eigenvectors

Let A be a square matrix of order n. The constant λ is said to be an eigenvalue of A if
\[ Ax = \lambda x \quad \text{for some nonzero vector } x \in \mathbb{R}^n. \]
Each nonzero x satisfying this equation is said to be an eigenvector with eigenvalue λ.

If x is an eigenvector of A with eigenvalue λ, then so is each scalar multiple of x, since
\[ A(cx) = c(Ax) = c(\lambda x) = \lambda(cx) \quad \text{for each nonzero constant } c. \]
If we think of the collection of scalar multiples of x as defining a line in n-space, then A takes each point on the line to another point on the line.

9.6.1 Characteristic Equation

Since $x = I_n x$, where $I_n$ is the identity matrix,
\[ Ax = \lambda x \iff Ax - \lambda x = O \iff (A - \lambda I_n)x = O. \]
That is, λ is an eigenvalue of A if and only if the homogeneous system of equations on the right has nontrivial solutions. The system has nontrivial solutions if and only if $(A - \lambda I_n)$ is a singular matrix. The matrix is singular if and only if its determinant is zero.

Thus, the eigenvalues of A can be found by solving
\[ \det(A - \lambda I_n) = 0 \quad \text{for } \lambda. \]
This equation is known as the characteristic equation for the matrix A. Further, if λ is an eigenvalue of A, then its eigenvectors are the nonzero vectors in the null space of $(A - \lambda I_n)$.

Example 9.43 (2-by-2 Matrix) Let $A = \begin{bmatrix} -8 & -10 \\ 5 & 7 \end{bmatrix}$. Then
\[ \det(A - \lambda I_2) = \begin{vmatrix} -8-\lambda & -10 \\ 5 & 7-\lambda \end{vmatrix} = (-8-\lambda)(7-\lambda) + 50 = \lambda^2 + \lambda - 6 = (\lambda+3)(\lambda-2), \]
and the determinant equals 0 when λ = 2 or λ = −3.

Using the techniques of Section 9.1.3,
\[ \mathrm{Null}(A - 2I_2) = \mathrm{Span}\left\{ \begin{bmatrix} 1 \\ -1 \end{bmatrix} \right\} \quad \text{and} \quad \mathrm{Null}(A + 3I_2) = \mathrm{Span}\left\{ \begin{bmatrix} -2 \\ 1 \end{bmatrix} \right\}. \]
Finally, note that the set $\{v_1, v_2\} = \left\{ \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \begin{bmatrix} -2 \\ 1 \end{bmatrix} \right\}$ is a basis for $\mathbb{R}^2$.
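A numerical check, assuming NumPy; np.linalg.eig returns eigenvalues (in no guaranteed order) and unit length eigenvectors, so the vectors below are scalar multiples of those found by hand.

```python
import numpy as np

A = np.array([[-8., -10.],
              [5., 7.]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)     # 2 and -3 (order may vary)
print(eigenvectors)    # unit eigenvectors, proportional to (1, -1) and (-2, 1)

# Verify A x = lambda x for each eigenpair
for lam, x in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ x, lam * x))   # True, True
```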


Properties. The following are important properties of eigenvalues and eigenvectors:

1. 0 is an eigenvalue of A if and only if A is singular (does not have an inverse).

2. If A is a triangular matrix, then the eigenvalues of A are its diagonal elements.

3. If λ is an eigenvalue of A with eigenvector x and $A^k = AA \cdots A$ is the kth power of the matrix A, then $\lambda^k$ is an eigenvalue of $A^k$ with eigenvector x.

4. Suppose that A has k distinct eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_k$. If you pool the bases of the null spaces $\mathrm{Null}(A - \lambda_i I_n)$, for i = 1, 2, . . . , k, then the resulting set is linearly independent. If the resulting set has n elements, the set is a basis for $\mathbb{R}^n$.

Remarks. If A is a square matrix of order n, then the characteristic equation is a polynomial equation of degree n. Polynomials of degree n cannot always be factored into n terms of the form "(λ − c)" where the c's are real numbers, but can be factored into n linear terms if we allow some c's to be complex numbers.

9.6.2 Diagonalization

Suppose that A has n linearly independent eigenvectors,
\[ v_i \quad \text{for } i = 1, 2, \ldots, n, \]
with corresponding eigenvalues $\lambda_i$ for i = 1, 2, . . . , n.

Let P be the matrix whose columns are the $v_i$'s, and let Λ be the diagonal matrix whose diagonal elements are the $\lambda_i$'s,
\[ P = [\,v_1\; v_2\; \cdots\; v_n\,], \quad \Lambda = \mathrm{Diag}(\lambda_1, \lambda_2, \ldots, \lambda_n). \]
Then, since
\[ AP = [\,Av_1\; Av_2\; \cdots\; Av_n\,] = [\,\lambda_1 v_1\; \lambda_2 v_2\; \cdots\; \lambda_n v_n\,] = P\Lambda \]
and P is invertible, $P^{-1}AP = \Lambda$ and $A = P\Lambda P^{-1}$.

The matrix equation $A = P\Lambda P^{-1}$, where Λ is a diagonal matrix, is called a diagonalization of the matrix A. Diagonalizations do not always exist.

Theorem 9.44 (Diagonalization Theorem) The square matrix A of order n is diagonalizable if and only if A has n linearly independent eigenvectors.

Example 9.45 (Diagonalizable 3-by-3 Matrix) Let $A = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 3 & 0 \\ -1 & 1 & 2 \end{bmatrix}$.

Since A is a triangular matrix, its eigenvalues are 3, 3, 2. (Since 3 corresponds to two roots of the characteristic equation, it appears twice in this list.)

Using the techniques of Section 9.1.3,
\[ \mathrm{Null}(A - 3I_3) = \mathrm{Span}\left\{ \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix} \right\} \quad \text{and} \quad \mathrm{Null}(A - 2I_3) = \mathrm{Span}\left\{ \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \right\}. \]
Since there are three linearly independent eigenvectors, A is diagonalizable. Further, we can factor A as follows:
\[ A = P\Lambda P^{-1} = \begin{bmatrix} 1 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 3 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 2 \end{bmatrix} \begin{bmatrix} 0 & 1 & 0 \\ -1 & 1 & 0 \\ 1 & -1 & 1 \end{bmatrix}. \]

Example 9.46 (Nondiagonalizable 3-by-3 Matrix) Let $A = \begin{bmatrix} 3 & 0 & 0 \\ 2 & 3 & 0 \\ 2 & 1 & 2 \end{bmatrix}$.

As above, since A is a triangular matrix, its eigenvalues are 3, 3, 2.

Using the techniques of Section 9.1.3,
\[ \mathrm{Null}(A - 3I_3) = \mathrm{Span}\left\{ \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} \right\} \quad \text{and} \quad \mathrm{Null}(A - 2I_3) = \mathrm{Span}\left\{ \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \right\}. \]
Since there are only two linearly independent eigenvectors, A is not diagonalizable.

9.6.3 Symmetric Matrices

Recall that the square matrix A is symmetric if $a_{ij} = a_{ji}$ for all i and j. Written succinctly, the square matrix A is symmetric if $A = A^T$.

Symmetric matrices are always diagonalizable. Further, the eigenvectors can be chosen to be orthonormal. These facts are summarized in the following theorem.

Theorem 9.47 (Spectral Theorem) Let A be a symmetric matrix of order n. Then there exists a diagonal matrix Λ and an orthogonal matrix Q satisfying
\[ A = Q\Lambda Q^{-1} = Q\Lambda Q^T \]
(since the inverse of Q is its transpose).

The spectral theorem says that if A is symmetric, then (1) the diagonalization process outlined above always results in finding n linearly independent eigenvectors, and (2) eigenvectors corresponding to different eigenvalues are always orthogonal.

To construct an orthonormal eigenvector basis, you just need to convert the basis of each null space, $\mathrm{Null}(A - \lambda_i I_n)$ for each i, to an orthonormal basis using the normalized Gram-Schmidt process.


Spectral decomposition. Let $Q = [\,q_1\; q_2\; \cdots\; q_n\,]$ and let $\lambda_i$ be the eigenvalue associated with $q_i$ for each i. Then
\[ A = Q\Lambda Q^T = \lambda_1 q_1 q_1^T + \lambda_2 q_2 q_2^T + \cdots + \lambda_n q_n q_n^T \]
is called the spectral decomposition of A. The spectral decomposition breaks up A into pieces determined by its "spectrum" of eigenvalues. Each term in the spectral decomposition of A is a rank 1 matrix.
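A numerical sketch, assuming NumPy; np.linalg.eigh is the symmetric eigensolver, returning orthonormal eigenvectors, and the example matrix is ours for illustration.

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])

lam, Q = np.linalg.eigh(A)     # eigenvalues ascending, orthonormal columns
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))   # True: A = Q Lambda Q^T

# Rebuild A as a sum of rank-1 pieces: lambda_i q_i q_i^T
pieces = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(len(lam)))
print(np.allclose(pieces, A))                   # True
```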

9.6.4 Positive Definite Matrices

Let A be a symmetric matrix of order n and let Quad(x) be the following real-valued n-variable function:
\[ \mathrm{Quad}(x) = x^T A x = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j. \]
(Quad(x) is a quadratic form in x.) Then, A is said to be positive definite if
\[ \mathrm{Quad}(x) > 0 \quad \text{when } x \neq O. \]

The following theorem relates the method for determining when a matrix is positive definite given in Section 7.4.2 (on the second derivative test) to the eigenvalues of A.

Theorem 9.48 (Positive Definite Matrices) Let A be a symmetric matrix. If A has one of the following three properties, it has them all:

1. All the eigenvalues of A are positive.

2. The determinants of all the principal minors are positive.

3. $\mathrm{Quad}(x) = x^T A x > 0$ except when x = O.

To demonstrate that the first statement implies the third, for example, write $A = Q\Lambda Q^T$, as given in the spectral theorem above. Then
\[ x^T A x = x^T (Q\Lambda Q^T) x = y^T \Lambda y = \sum_{i=1}^{n} \lambda_i y_i^2 \quad \text{where } y = Q^T x. \]
Since Q is an invertible matrix (with inverse $Q^T$), $y = Q^T x \neq O$ when $x \neq O$. Further, since each $\lambda_i$ is positive, the sum on the right is positive when $y \neq O$.
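In code, positive definiteness is usually checked through the eigenvalues or a Cholesky factorization. A sketch assuming NumPy; np.linalg.cholesky raises LinAlgError when the matrix is not positive definite, and the helper is ours for illustration.

```python
import numpy as np

def is_positive_definite(A):
    """Check positive definiteness via an attempted Cholesky factorization."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

A = np.array([[2., 1.], [1., 2.]])
print(np.linalg.eigvalsh(A))       # [1. 3.]: all eigenvalues positive
print(is_positive_definite(A))     # True
print(is_positive_definite(-A))    # False
```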

9.6.5 Singular Value Decomposition

The singular value decomposition is a useful factorization method for rectangular matrices, generalizing the idea of diagonalization.


Singular values. Let A be an m-by-n matrix. By the spectral theorem, the symmetric matrix $A^T A$ can be orthogonally diagonalized. Let
\[ A^T A = V \Lambda V^T \quad \text{where } V = [\,v_1\; v_2\; \cdots\; v_n\,], \; \Lambda = \mathrm{Diag}(\lambda_1, \lambda_2, \ldots, \lambda_n), \]
the columns of V form an orthonormal set, and the $\lambda_i$'s (and $v_i$'s) have been ordered so that
\[ \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n. \]
The $\lambda_i$'s are nonnegative since they correspond to squared lengths:
\[ \|Av_i\|^2 = (Av_i)^T(Av_i) = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \geq 0 \quad \text{for each } i. \]
The singular values of A are the square roots of the eigenvalues of $A^T A$,
\[ \sigma_i = \sqrt{\lambda_i} \quad \text{for } i = 1, 2, \ldots, n, \]
written in descending order. (The singular values are the lengths of the vectors $Av_i$.)

Theorem 9.49 (Singular Values) Given the setup above, if the rank of A is r, then

1. $\sigma_i > 0$ when $i \leq r$ and $\sigma_i = 0$ when $i > r$.

2. The set $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col(A).

Decomposition. To complete the singular value decomposition, let
\[ u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \quad \text{for } i = 1, 2, \ldots, r. \]
Then $\{u_1, u_2, \ldots, u_r\}$ is an orthonormal basis of Col(A). This basis can be completed to an orthonormal basis of $\mathbb{R}^m$, say $\{u_1, u_2, \ldots, u_m\}$.

Let U be the orthogonal matrix of order m whose columns are the $u_i$'s,
\[ U = [\,u_1\; u_2\; \cdots\; u_m\,], \]
let D be the diagonal matrix of order r whose diagonal elements are the nonzero $\sigma_i$'s,
\[ D = \mathrm{Diag}(\sigma_1, \sigma_2, \ldots, \sigma_r), \]
and let Σ be the m-by-n partitioned matrix
\[ \Sigma = \begin{bmatrix} D & O \\ O & O \end{bmatrix} \quad \text{where each } O \text{ is a zero matrix.} \]
Then, since
\[ U\Sigma = [\,\sigma_1 u_1\; \cdots\; \sigma_r u_r\; 0\; \cdots\; 0\,] = [\,Av_1\; \cdots\; Av_r\; 0\; \cdots\; 0\,] = AV \]
and V is invertible (with inverse $V^T$), $U\Sigma V^T = AVV^T = A$.


The matrix equation
\[ A = U\Sigma V^T, \]
where U and V are orthogonal matrices and Σ has the form shown above, is called a singular value decomposition of A. Singular value decompositions always exist.
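A numerical sketch, assuming NumPy and using the matrix of Example 9.51 below; np.linalg.svd returns U, the singular values, and $V^T$ directly.

```python
import numpy as np

A = np.array([[-1., 13.],
              [4., 8.],
              [10., -10.]])

U, s, Vt = np.linalg.svd(A)       # full SVD: U is 3-by-3, Vt is 2-by-2
print(s)                          # approx [18.97, 9.49] = [6*sqrt(10), 3*sqrt(10)]

Sigma = np.zeros(A.shape)
Sigma[:2, :2] = np.diag(s)        # embed the singular values in a 3-by-2 Sigma
print(np.allclose(U @ Sigma @ Vt, A))   # True: A = U Sigma V^T
```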

Example 9.50 (2-by-2 Matrix of Rank 1) Let $A = \begin{bmatrix} 2 & 2 \\ 1 & 1 \end{bmatrix}$. Then
\[ A^T A = \begin{bmatrix} 5 & 5 \\ 5 & 5 \end{bmatrix} = V\Lambda V^T \quad \text{where} \quad V = [\,v_1\; v_2\,] = \begin{bmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{bmatrix} \quad \text{and} \quad \Lambda = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} = \begin{bmatrix} 10 & 0 \\ 0 & 0 \end{bmatrix}. \]
The column space of A is spanned by $Av_1$, the first singular value is $\sigma_1 = \sqrt{10}$, and
\[ u_1 = \frac{1}{\sigma_1} Av_1 = \begin{bmatrix} 2/\sqrt{5} \\ 1/\sqrt{5} \end{bmatrix}. \]
A unit vector orthogonal to $u_1$ is $u_2 = \begin{bmatrix} -1/\sqrt{5} \\ 2/\sqrt{5} \end{bmatrix}$.

Thus, a singular value decomposition of A is
\[ A = U\Sigma V^T = \begin{bmatrix} 2/\sqrt{5} & -1/\sqrt{5} \\ 1/\sqrt{5} & 2/\sqrt{5} \end{bmatrix} \begin{bmatrix} \sqrt{10} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{bmatrix}. \]

Example 9.51 (3-by-2 Matrix of Rank 2) Let $A = \begin{bmatrix} -1 & 13 \\ 4 & 8 \\ 10 & -10 \end{bmatrix}$. Then
\[ A^T A = \begin{bmatrix} 117 & -81 \\ -81 & 333 \end{bmatrix} = V\Lambda V^T \quad \text{where} \quad V = [\,v_1\; v_2\,] = \begin{bmatrix} -1/\sqrt{10} & 3/\sqrt{10} \\ 3/\sqrt{10} & 1/\sqrt{10} \end{bmatrix} \quad \text{and} \quad \Lambda = \begin{bmatrix} 360 & 0 \\ 0 & 90 \end{bmatrix}. \]
The column space of A is spanned by $\{Av_1, Av_2\}$, $\sigma_1 = 6\sqrt{10}$, $\sigma_2 = 3\sqrt{10}$,
\[ u_1 = \frac{1}{\sigma_1} Av_1 = \begin{bmatrix} 2/3 \\ 1/3 \\ -2/3 \end{bmatrix} \quad \text{and} \quad u_2 = \frac{1}{\sigma_2} Av_2 = \begin{bmatrix} 1/3 \\ 2/3 \\ 2/3 \end{bmatrix}. \]
A unit vector orthogonal to these vectors is $u_3 = \begin{bmatrix} 2/3 \\ -2/3 \\ 1/3 \end{bmatrix}$.

Thus, a singular value decomposition of A is
\[ A = U\Sigma V^T = \begin{bmatrix} 2/3 & 1/3 & 2/3 \\ 1/3 & 2/3 & -2/3 \\ -2/3 & 2/3 & 1/3 \end{bmatrix} \begin{bmatrix} 6\sqrt{10} & 0 \\ 0 & 3\sqrt{10} \\ 0 & 0 \end{bmatrix} \begin{bmatrix} -1/\sqrt{10} & 3/\sqrt{10} \\ 3/\sqrt{10} & 1/\sqrt{10} \end{bmatrix}. \]


Bases for the four important subspaces. Let A be an m-by-n matrix of rank r, and let $A = U\Sigma V^T$ be a singular value decomposition. Let $U = [\,U_r\; U_{m-r}\,]$ be a partition of U into matrices with r and (m − r) columns. Similarly, let $V = [\,V_r\; V_{n-r}\,]$ be a partition of V into matrices with r and (n − r) columns. Then

1. The columns of $U_r$ form an orthonormal basis for $\mathrm{Col}(A) \subseteq \mathbb{R}^m$.

2. The columns of $U_{m-r}$ form an orthonormal basis for $\mathrm{Null}(A^T) \subseteq \mathbb{R}^m$.

3. The columns of $V_r$ form an orthonormal basis for $\mathrm{Col}(A^T) \subseteq \mathbb{R}^n$.

4. The columns of $V_{n-r}$ form an orthonormal basis for $\mathrm{Null}(A) \subseteq \mathbb{R}^n$.

For this reason, the singular value decomposition is often thought of as the third part of the Fundamental Theorem of Linear Algebra.

Note that we can now write:
\[ A = U\Sigma V^T = [\,U_r\; U_{m-r}\,] \begin{bmatrix} D & O \\ O & O \end{bmatrix} \begin{bmatrix} V_r^T \\ V_{n-r}^T \end{bmatrix} = U_r D V_r^T, \]
where $D = \mathrm{Diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)$ is the diagonal matrix of order r whose diagonal elements are the nonzero singular values, written in descending order.

Moore-Penrose pseudoinverse. Working from the decomposition above, the matrix
\[ A^+ = V_r D^{-1} U_r^T \]
is called the Moore-Penrose pseudoinverse of the matrix A. $A^+$ is an n-by-m matrix.

Using properties of orthogonal matrices: $AA^+A = A$ and $A^+AA^+ = A^+$.

The Moore-Penrose pseudoinverse can be applied to least squares problems (Section 9.5.5). If Ax = b is inconsistent and $A^T A$ does not have an inverse, then there are infinitely many least squares solutions. By using $A^+$ as if it were a true inverse,
\[ Ax = b \implies x^+ = A^+ b, \]
we obtain the least squares solution of minimum length.
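A sketch with NumPy (np.linalg.pinv computes the Moore-Penrose pseudoinverse from the SVD), using the system of Example 9.52 below:

```python
import numpy as np

A = np.array([[1., -2.],
              [2., -4.],
              [-1., 2.]])
b = np.array([3., -4., 15.])

x_plus = np.linalg.pinv(A) @ b
print(x_plus)      # approximately (-2/3, 4/3), the minimum length solution

# Defining identities of the pseudoinverse
P = np.linalg.pinv(A)
print(np.allclose(A @ P @ A, A), np.allclose(P @ A @ P, P))   # True True
```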

Example 9.52 (Solution of Minimum Length) Consider the inconsistent system
\[ \begin{bmatrix} 1 & -2 \\ 2 & -4 \\ -1 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 3 \\ -4 \\ 15 \end{bmatrix}. \]
Then
\[ A^T A \hat{x} = A^T b \implies \begin{bmatrix} 6 & -12 \\ -12 & 24 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} -20 \\ 40 \end{bmatrix} \implies \begin{cases} x = 2y - 10/3 \\ y \text{ is free.} \end{cases} \]


Thus, there are an infinite number of least squares solutions of the form
\[ (2y - 10/3, \; y) \quad \text{where } y \text{ is free.} \]
Since
\[ A = \begin{bmatrix} -1/\sqrt{6} \\ -2/\sqrt{6} \\ 1/\sqrt{6} \end{bmatrix} \left[\sqrt{30}\right] \begin{bmatrix} -1/\sqrt{5} & 2/\sqrt{5} \end{bmatrix}, \]
the pseudoinverse is
\[ A^+ = \begin{bmatrix} -1/\sqrt{5} \\ 2/\sqrt{5} \end{bmatrix} \left[1/\sqrt{30}\right] \begin{bmatrix} -1/\sqrt{6} & -2/\sqrt{6} & 1/\sqrt{6} \end{bmatrix} \quad \text{and} \quad x^+ = A^+ b = \begin{bmatrix} -2/3 \\ 4/3 \end{bmatrix}. \]
Note that the square of the length of each solution is $f(y) = (2y - 10/3)^2 + y^2$, and that the function f(y) is minimized when y = 4/3.

Numerical linear algebra. In most practical applications of linear algebra, solutions are subject to roundoff errors and care must be taken to minimize these errors. Often, a singular value decomposition of the coefficient matrix is used in the computations, and the range of nonzero singular values is used to analyze errors.

If the coefficient matrix is invertible, then the ratio of the largest to the smallest singular value, $c = \sigma_1/\sigma_n$, is called the condition number of the matrix. The larger the condition number, the more error prone the numerical algorithm. A rule of thumb, which has been experimentally verified in a large number of trial cases, is that the computer can lose about $\log_{10}(c)$ decimal places of accuracy while solving linear equations numerically.

In linear regression analysis, for example, if two columns of the design matrix X are nearly linearly related (nearly collinear), then the condition number of $X^T X$ will be quite high, affecting estimates of coefficients and variances.
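A quick illustration, assuming NumPy (np.linalg.cond computes $\sigma_1/\sigma_n$ by default); the design matrices here are ours, chosen to show how a nearly collinear column inflates the condition number of $X^T X$.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 50)
X_good = np.column_stack([np.ones(50), x])            # well-behaved design
X_bad = np.column_stack([np.ones(50), x, x + 1e-6])   # third column nearly collinear

print(np.linalg.cond(X_good.T @ X_good))   # modest condition number
print(np.linalg.cond(X_bad.T @ X_bad))     # enormous condition number
```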

9.7 Linear Transformations

Let V and W be vector spaces and let $T : V \to W$ be a function assigning to each $v \in V$ a vector $T(v) \in W$. Then T is said to be a linear transformation if

1. $T(v_1 + v_2) = T(v_1) + T(v_2)$ for all $v_1, v_2 \in V$, and

2. $T(cv) = cT(v)$ for all $v \in V$ and $c \in \mathbb{R}$.

(That is, if T "respects" vector addition and scalar multiplication.)

Example 9.53 (Transformations on R2) For example, let $V = W = \mathbb{R}^2$,
\[ T_1\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \begin{bmatrix} 2x + 3y \\ -y \end{bmatrix} \quad \text{and} \quad T_2\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \begin{bmatrix} x - y \\ xy \end{bmatrix}. \]
Then $T_1$ satisfies the criteria for linear transformations, but $T_2$ does not. In particular,
\[ T_2\left(\begin{bmatrix} 3 \\ 3 \end{bmatrix}\right) = \begin{bmatrix} 0 \\ 9 \end{bmatrix} \neq \begin{bmatrix} 0 \\ 3 \end{bmatrix} = 3\,T_2\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right). \]


Figure 9.1: The effects of three different linear transformations on the unit square (left).

Linear combinations. If T is a linear transformation, then T(O) = O and
\[ T(c_1 v_1 + c_2 v_2 + \cdots + c_n v_n) = c_1 T(v_1) + c_2 T(v_2) + \cdots + c_n T(v_n) \]
for each linear combination of vectors in V.

9.7.1 Range and Kernel

If T is a linear transformation, then the range of T,
\[ \mathrm{Range}(T) = \{\, w : w = T(v) \text{ for some } v \in V \,\} \subseteq W, \]
is a subspace of W. Similarly, the kernel of T,
\[ \mathrm{Kernel}(T) = \{\, v : T(v) = O \,\} \subseteq V, \]
is a subspace of V.

Isomorphism. T is said to be one-to-one if its kernel is trivial, Kernel(T) = $\{O\}$. T is said to be onto if its range is the entire space, Range(T) = W.

If T is both one-to-one and onto, then T is an isomorphism between the vector spaces.

9.7.2 Linear Transformations and Matrices

If $V \subseteq \mathbb{R}^n$ and $W \subseteq \mathbb{R}^m$, then the linear transformation T corresponds to matrix multiplication. That is,
\[ T(v) = Av \quad \text{for some } m\text{-by-}n \text{ matrix } A. \]
Further, Range(T) = Col(A) and Kernel(T) = Null(A).

Example 9.54 (Planar Transformations) When $V = W = \mathbb{R}^2$, linear transformations are often viewed geometrically. For example, Figure 9.1 shows the image of the unit square (the set $[0,1] \times [0,1]$) under linear transformations whose matrices are
\[ \begin{bmatrix} 1 & 1.5 \\ 0 & 1 \end{bmatrix}, \quad \begin{bmatrix} -1.5 & 0 \\ 0 & 2 \end{bmatrix}, \quad \text{and} \quad \begin{bmatrix} -2 & 1 \\ -1 & 2.5 \end{bmatrix}, \quad \text{respectively.} \]


Figure 9.2: A transformation of R2 (left) whose range (right) is isomorphic to R2.

A linear transformation whose matrix is invertible will map parallelograms to parallelograms. Since the second matrix is a diagonal matrix, the image of the unit square is a rectangle whose sides have lengths equal to the absolute values of the diagonal elements.

In each case, the area of the image of the unit square is the absolute value of the determinant of the matrix. The resulting areas in this case are 1, 3, and 4, respectively.

9.7.3 Linearly Independent Columns

Let $T : \mathbb{R}^n \to \mathbb{R}^m$ be a linear transformation with matrix A. If the columns of A are linearly independent, then
\[ \mathrm{Range}(T) = \mathrm{Col}(A) \subseteq \mathbb{R}^m \text{ is isomorphic to } \mathbb{R}^n. \]

Example 9.55 (Dimension 2) Let $T : \mathbb{R}^2 \to \mathbb{R}^3$ be defined as follows:
\[ T(x) = A \begin{bmatrix} x_1 \\ y_1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & -1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ y_1 \end{bmatrix} = \begin{bmatrix} x_1 \\ -y_1 \\ x_1 + y_1 \end{bmatrix}. \]
Since the columns of A are linearly independent, the range of T is the 2-dimensional subspace of $\mathbb{R}^3$ shown in the right part of Figure 9.2 (the plane with equation $z_2 = x_2 - y_2$).

The figure also shows how a square grid on the $x_1y_1$-plane is transformed by T to a parallelogram grid on the surface in $x_2y_2z_2$-space with equation $z_2 = x_2 - y_2$.

9.7.4 Composition of Functions and Matrix Multiplication

Let $U \subseteq \mathbb{R}^p$, $V \subseteq \mathbb{R}^n$, and $W \subseteq \mathbb{R}^m$ be subspaces, let
\[ T_B : U \to V \quad \text{and} \quad T_A : V \to W \]
be linear transformations with matrices B and A, respectively, and let
\[ T : U \to W \text{ be the composition } T(u) = T_A(T_B(u)). \]


Figure 9.3: A linear transformation of the [−2, 2] × [−2, 2] square viewed through the singular value decomposition of the matrix of the transformation.

Then the matrix of the linear transformation T is the matrix product AB.

Example 9.56 (SVD) The planar transformation whose matrix is $A = \begin{bmatrix} 2 & -1 \\ 2 & 2 \end{bmatrix}$ maps the $x_1y_1$-plane to the $x_4y_4$-plane, shown in the first and last parts of Figure 9.3.

Consider the following singular value decomposition of A,
\[ A = U\Sigma V^T = \begin{bmatrix} 1/\sqrt{5} & -2/\sqrt{5} \\ 2/\sqrt{5} & 1/\sqrt{5} \end{bmatrix} \begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 2/\sqrt{5} & 1/\sqrt{5} \\ -1/\sqrt{5} & 2/\sqrt{5} \end{bmatrix}, \]
and its effect on the $[-2,2] \times [-2,2]$ square.

1. The $x_1y_1$-plane is mapped to the $x_2y_2$-plane by the transformation whose matrix is $V^T$. Since V is orthogonal, the image of the square is a tilted square of the same size.

2. The $x_2y_2$-plane is mapped to the $x_3y_3$-plane by the transformation whose matrix is Σ. Since Σ is diagonal, the tilted square is stretched in the horizontal and vertical directions by 3 and 2, respectively, yielding a parallelogram.

3. The $x_3y_3$-plane is mapped to the $x_4y_4$-plane by the transformation whose matrix is U. Since U is orthogonal, the image of the parallelogram is a congruent parallelogram.

The figure also shows that the unit circle is transformed to an ellipse by the action of the diagonal matrix in the singular value decomposition.

9.7.5 Diagonalization and Change of Bases

Suppose that A is a diagonalizable matrix of order n (Section 9.6.2), with diagonalization $A = P\Lambda P^{-1}$, where $P = [\,v_1\; v_2\; \cdots\; v_n\,]$ and $\Lambda = \mathrm{Diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$.

The columns of P form an eigenvector basis of $\mathbb{R}^n$. Given $x \in \mathbb{R}^n$, the coordinates of x with respect to this basis, say c, are found by solving the matrix equation:
\[ Pc = x \implies c = P^{-1}x. \]
Thus, the linear transformation whose matrix is $P^{-1}$ can be thought of as a transformation from the standard basis of $\mathbb{R}^n$ to the eigenvector basis $\{v_1, v_2, \ldots, v_n\}$. Similarly, the transformation whose matrix is P can be thought of as a change of basis from the eigenvector basis back to the standard basis.

Now, consider the transformation T whose matrix is A. Since
\[ T(x) = Ax = (P\Lambda P^{-1})x = P(\Lambda(P^{-1}x)) = P(\Lambda c), \]
the action of T is (1) to change from the standard basis to the eigenvector basis, (2) to operate in the eigenvector-coordinate system as a diagonal matrix, and (3) to interpret the results back in standard-basis form.

Powers of A. The kth power of A can be written as follows:
\[ A^k = (P\Lambda P^{-1})^k = (P\Lambda P^{-1})(P\Lambda P^{-1}) \cdots (P\Lambda P^{-1}) = P\Lambda^k P^{-1}. \]
Thus, the action of the linear transformation whose matrix is $A^k$ is (1) to change to the eigenvector basis, (2) to operate in the eigenvector-coordinate system by applying Λ k times, and (3) to interpret the results back in standard-basis form.
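A small sketch, assuming NumPy, comparing $P\Lambda^k P^{-1}$ with direct repeated multiplication for the diagonalizable matrix of Example 9.45:

```python
import numpy as np

A = np.array([[3., 0., 0.],
              [0., 3., 0.],
              [-1., 1., 2.]])

lam, P = np.linalg.eig(A)        # eigenvalues and eigenvector matrix
k = 10
Ak = P @ np.diag(lam ** k) @ np.linalg.inv(P)          # P Lambda^k P^{-1}
print(np.allclose(Ak, np.linalg.matrix_power(A, k)))   # True
```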

This geometric interpretation is especially useful in practice when considering, for example, transition matrices (from probability theory) or Leslie matrices (from population dynamics).
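A sketch of the computation in numpy, using a hypothetical diagonalizable 2-by-2 matrix chosen only for illustration:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])          # eigenvalues 5 and 2, so A is diagonalizable

lam, P = np.linalg.eig(A)           # A = P @ diag(lam) @ inv(P)
k = 5

# A^k computed through the eigenvector basis: P Lambda^k P^{-1}.
Ak = P @ np.diag(lam**k) @ np.linalg.inv(P)

print(np.allclose(Ak, np.linalg.matrix_power(A, k)))   # True
```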

9.7.6 Random Vectors and Linear Transformations

Linear transformations of vectors of random variables are commonly used. Let

X = [ X1, X2, . . . , Xn ]^T and µ = E(X) = [ E(X1), E(X2), . . . , E(Xn) ]^T

be n-by-1 vectors of random variables and their means, respectively, and let

Var(X) = E((X − µ)(X − µ)^T)

be the n-by-n matrix of covariances of the random variables. That is, Var(X) is the matrix whose ij-entry is equal to the covariance between Xi and Xj when i ≠ j, and is equal to the variance of Xi when i = j.

Linear transformations. If A is an m-by-n matrix, then the linearly transformed random vector Y = AX has an m-variate distribution with mean vector

E(Y) = E(AX) = AE(X) = Aµ

and with covariance matrix

Var(Y) = E((AX − Aµ)(AX − Aµ)^T) = E(A(X − µ)(X − µ)^T A^T) = A Var(X) A^T.


Figure 9.4: Joint density (left) and contours enclosing 20%, 40% and 60% of the probability for the standard bivariate normal distribution with correlation ρ = 0.50.

Example 9.57 (2-by-2 Matrix) Let

X = [ X1 ; X2 ], A = [ 1  1 ; 2  −4 ], and Y = AX = [ X1 + X2 ; 2X1 − 4X2 ].

If

µ = E(X) = [ 2 ; 1 ] and Var(X) = [ 1  1/2 ; 1/2  1 ],

then

E(Y) = Aµ = [ 3 ; 0 ] and Var(Y) = A Var(X) A^T = [ 3  −3 ; −3  12 ].

Note that Corr(X1, X2) = 1/2 and Corr(Y1, Y2) = Cov(Y1, Y2)/√(Var(Y1) Var(Y2)) = −3/√(3 · 12) = −1/2.
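A sketch reproducing these calculations with numpy:

```python
import numpy as np

A  = np.array([[1.0,  1.0],
               [2.0, -4.0]])
mu = np.array([2.0, 1.0])          # E(X)
V  = np.array([[1.0, 0.5],
               [0.5, 1.0]])        # Var(X)

EY   = A @ mu                      # [3. 0.]
VarY = A @ V @ A.T                 # [[ 3. -3.] [-3. 12.]]
corr = VarY[0, 1] / np.sqrt(VarY[0, 0] * VarY[1, 1])

print(EY, VarY, corr, sep="\n")    # corr = -0.5
```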

9.7.7 Example: Multivariate Normal Distribution

Let X = [ X1, . . . , Xk ]^T be a random k-tuple.

X is said to have a multivariate normal distribution if its joint PDF has the following form

f(x) = 1/((2π)^{k/2} √|V|) · exp(−(1/2)(x − µ)^T V^{-1}(x − µ)) for all x ∈ Rk.

In this formula, µ = E(X) is the k-by-1 vector of means and V = Var(X) is the k-by-k covariance matrix. (See Section 8.1.9 also.)

Note that to find the probability that a normal random k-tuple takes values in a domain D ⊆ Rk, you need to compute a k-variate integral.
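In practice, these integrals are usually evaluated numerically. A sketch using scipy's multivariate normal distribution (assuming a reasonably recent scipy), with the mean vector and covariance matrix of the k = 2 example that follows; the probability of the lower-left quadrant is a 2-variate integral:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.zeros(2)
V  = np.array([[1.0, 0.5],
               [0.5, 1.0]])
rv = multivariate_normal(mean=mu, cov=V)

print(rv.pdf([0.0, 0.0]))   # density at the origin: 1/(2 pi sqrt(|V|)) ~ 0.1838
print(rv.cdf([0.0, 0.0]))   # P(X <= 0, Y <= 0) ~ 1/3 when rho = 0.5
```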



Figure 9.5: Transformations of probability contours. (The panels show the (x1, y1)-, (x2, y2)-, and (x3, y3)-planes.)

Example 9.58 (k = 2) Let

X = [ X ; Y ], µ = 0 and V = [ 1  0.5 ; 0.5  1 ].

Then the random pair has a standard bivariate normal distribution with parameter ρ = 0.50. The left part of Figure 9.4 is a graph of the surface z = f(x, y). The right part of the figure shows contours enclosing 20%, 40% and 60% of the probability.

Each contour in the right plot is an ellipse whose major and minor axes are along the lines y = x and y = −x (gray lines), respectively. Let D1, D2, and D3 be the regions bounded by the three ellipses in increasing size.

Orthogonal diagonalization. Since V is symmetric, the spectral theorem (Section 9.6.3) implies that it can be diagonalized as follows:

V = QΛQ^T,

where Λ is a diagonal matrix of eigenvalues and Q is orthogonal.

As discussed earlier, the linear transformation whose matrix is Q^T can be thought of as a transformation from the standard basis of Rk to the basis whose elements are the columns of Q. Further, if Y = Q^T X, then

E(Y) = Q^T µ and Var(Y) = Q^T Var(X) Q = Q^T (QΛQ^T) Q = Λ.

Thus, the Yi’s are uncorrelated, with variances equal to the eigenvalues of V. (The Yi’s are, in fact, independent normal random variables.)

Example 9.59 (k = 2, continued) Continuing with the example above,

V = QΛQ^T = [ 1/√2  −1/√2 ; 1/√2  1/√2 ] [ 1.5  0 ; 0  0.5 ] [ 1/√2  1/√2 ; −1/√2  1/√2 ].

The action of the matrix product Λ^{-1/2}Q^T is illustrated in Figure 9.5.

Specifically, if we let x2 = Q^T x1, then points on the ellipses shown in the x1y1-plane (left plot) are rotated by 45 degrees in the clockwise direction to points on the ellipses shown in the x2y2-plane (center plot).


Further, if we let x3 = Λ^{-1/2}x2, then points on the rotated ellipses in the x2y2-plane (center plot) are mapped to circles in the x3y3-plane (right plot), since each coordinate is divided by the corresponding standard deviation √λi.

These transformations (plus a polar transformation) can be used to demonstrate that

P((X, Y) ∈ D1) = 0.20, P((X, Y) ∈ D2) = 0.40, P((X, Y) ∈ D3) = 0.60.
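These transformations are also easy to check by simulation. A sketch with numpy: after rotating by Q^T the sample covariance is approximately Λ, and after the further scaling by Λ^{-1/2} it is approximately the identity:

```python
import numpy as np

V = np.array([[1.0, 0.5],
              [0.5, 1.0]])
lam, Q = np.linalg.eigh(V)     # eigenvalues [0.5, 1.5] with orthogonal Q

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(2), V, size=100_000)   # one draw per row

Y = X @ Q                      # rotate: coordinates in the eigenvector basis
Z = Y / np.sqrt(lam)           # scale each coordinate by 1/sqrt(lambda_i)

print(np.cov(Y.T))             # approximately diag(0.5, 1.5)
print(np.cov(Z.T))             # approximately the identity matrix
```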

9.8 Applications in Statistics

Linear algebra is one of the most useful branches of mathematics. In statistics, linear algebra methods are used to simplify computations involving multivariate probability distributions and computations that summarize random samples. This section focuses on two applications.

9.8.1 Least Squares Estimation

Given n cases, (x1,i, x2,i, . . . , xp−1,i, yi) for i = 1, 2, . . . , n, whose values lie close to a linear equation of the form

y = β0 + β1x1 + β2x2 + · · · + βp−1xp−1,

the method of least squares can be used to estimate the unknown parameters.

The n-by-p design matrix, X, the p-by-1 vector of unknowns, β, and the n-by-1 vector of right-hand-side values, y, are defined as follows:

X = [ 1  x1,1  x2,1  · · ·  xp−1,1 ; 1  x1,2  x2,2  · · ·  xp−1,2 ; . . . ; 1  x1,n  x2,n  · · ·  xp−1,n ], β = [ β0 ; β1 ; . . . ; βp−1 ], and y = [ y1 ; y2 ; . . . ; yn ].

We wish to find least squares solutions to the inconsistent system Xβ = y, that is, solutions of the normal equations X^T Xβ = X^T y.
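A sketch of the computation in numpy, using a small set of made-up cases for illustration (real applications would substitute the observed columns):

```python
import numpy as np

# Hypothetical data: one predictor (p = 2) and five cases.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y  = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Design matrix: a column of ones (for beta_0) next to the predictor.
X = np.column_stack([np.ones_like(x1), x1])

# Solve the normal equations X^T X beta = X^T y ...
beta = np.linalg.solve(X.T @ X, X.T @ y)

# ... or use a dedicated (numerically more stable) least squares solver.
beta2, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta, beta2)   # the two solutions agree
```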

Example 9.60 (Sleep Study, Part 1) (Allison & Cicchetti, Science, 194:732-734, 1976; lib.stat.cmu.edu/DASL.) As part of a study on sleep in mammals, researchers collected information on the average hours of nondreaming and dreaming sleep for 43 different species.

The left part of Figure 9.6 compares the common logarithm of nondreaming sleep (vertical axis) to the common logarithm of dreaming sleep (horizontal axis). A simple linear prediction formula will be used to write the common logarithm of hours of nondreaming sleep as a function of the common logarithm of hours of dreaming sleep.

For these data, Xβ = y ⇒ X^T Xβ = X^T y ⇒

[ 43  8.75915 ; 8.75915  5.33664 ] [ β0 ; β1 ] = [ 38.3639 ; 9.33209 ] ⇒ β0 = 0.805178, β1 = 0.427124.

Thus, the simple linear prediction equation is y = 0.427124x + 0.805178.



Figure 9.6: Comparison of nondreaming sleep in log-hours with dreaming sleep in log-hours (left plot) and with body weight in log-kilograms (right plot) for data from the sleep study. Least squares linear fits are superimposed.

The simple linear prediction equation is shown in the left part of Figure 9.6.

Example 9.61 (Sleep Study, Part 2) The researchers also collected information on the average body weight in kilograms for the 43 species, and on other variables. They developed a danger index using a five-point scale, where a higher value corresponds to a higher risk of exposure to the elements while sleeping and/or a higher risk of predation while sleeping. Table 9.2 gives the values of the danger index for each species in the study.

The right part of Figure 9.6 compares the common logarithm of nondreaming sleep (vertical axis) to the common logarithm of body weight (horizontal axis). A linear prediction formula will be used to write the common logarithm of hours of nondreaming sleep as a function of the common logarithm of kilograms of body weight (variable 1) and the danger index (variable 2).

For these data, Xβ = y ⇒ X^T Xβ = X^T y ⇒

[ 43  13.4047  115 ; 13.4047  78.5992  63.7371 ; 115  63.7371  389 ] [ β0 ; β1 ; β2 ] = [ 38.3639 ; 3.22446 ; 95.4993 ]

⇒ β0 = 1.06671, β1 = −0.097165, β2 = −0.0539304.

Thus, the linear prediction equation is y = −0.097165x − 0.0539304d + 1.06671.

Nondreaming sleep is negatively associated with both body weight and danger index. The right part of Figure 9.6 shows the simple linear prediction formulas when d = 1, 2, 3, 4, 5.

9.8.2 Principal Components Analysis

Let X = [ x1 x2 · · · xn ] be a p-by-n data matrix, whose columns are a sample of size n from a p-variate distribution, let x̄ be the p-by-1 sample mean,

x̄ = (1/n) Σ_{i=1}^{n} xi,


Table 9.2: Danger indices for 43 species in the sleep study.

Danger = 1: Big brown bat, Cat, Chimpanzee, E. Amer. mole, Gray seal, Little brown bat, Man, Mole rat, N. Amer. opossum, Nine-banded armadillo, Red fox, Water opossum.
Danger = 2: European hedgehog, Galago, Golden hamster, Owl monkey, Phalanger, Rhesus monkey, Rock hyrax (h.b.), Tenrec, Tree shrew.
Danger = 3: African giant rat, Ground squirrel, Mountain beaver, Mouse, Musk shrew, Rat, Rock hyrax (p.h.), Tree hyrax, Vervet.
Danger = 4: Asian elephant, Baboon, Brazilian tapir, Chinchilla, Guinea pig, Patas monkey, Pig, Short tail shrew.
Danger = 5: Cow, Goat, Horse, Rabbit, Sheep.

and let X̃ be the p-by-n matrix whose jth column is the deviation of the jth observation from the sample mean: x̃j = xj − x̄.

Let S be the sample covariance matrix. S can be computed using the matrix equation

S = (1/(n − 1)) X̃ X̃^T,

where X̃ is the centered data matrix above.

Since S is symmetric, we can use the spectral theorem (Section 9.6.3) to write

S = QΛQ^T,

where Λ is a diagonal matrix of eigenvalues ordered so that λ1 ≥ λ2 ≥ · · · ≥ λp, and where Q is an orthogonal matrix.

Consider the linear transformation

Y = Q^T X̃ = [ q1·x̃1  q1·x̃2  · · ·  q1·x̃n ; q2·x̃1  q2·x̃2  · · ·  q2·x̃n ; . . . ; qp·x̃1  qp·x̃2  · · ·  qp·x̃n ],

and let SY be the sample covariance matrix of Y. Then

SY = Q^T S Q = Q^T (QΛQ^T) Q = Λ,

indicating that the p transformed centered samples are uncorrelated. Further, the first transformed centered sample (first row of Y) has the largest sample variance, the second (second row of Y) has the next largest sample variance, and so forth.

Since the ith column of Q was used to transform the ith centered sample, the vector qi is called the ith principal component of the centered data matrix X̃.
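A sketch of the whole procedure in numpy; the 3-by-n data matrix below is randomly generated for illustration, with rows as variables and columns as observations, matching the convention above:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 50)) + rng.normal(size=(1, 50))   # hypothetical correlated data

xbar = X.mean(axis=1, keepdims=True)       # p-by-1 sample mean
Xc   = X - xbar                            # centered data matrix

S = Xc @ Xc.T / (X.shape[1] - 1)           # sample covariance matrix

lam, Q = np.linalg.eigh(S)                 # eigh returns ascending eigenvalues
order  = np.argsort(lam)[::-1]             # reorder: lambda_1 >= lambda_2 >= ...
lam, Q = lam[order], Q[:, order]

Y = Q.T @ Xc                               # uncorrelated transformed samples
print(np.cov(Y))                           # equals diag(lam) up to rounding
print(lam / lam.sum())                     # proportion of variation per component
```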



Figure 9.7: Abdomen and chest measurements (left), and abdomen, chest and hip measurements (right) for 100 men. Principal component axes are highlighted in each plot.

Example 9.62 (Body Fat Study, Part 1) (Johnson, JSE 4(1), 1996) As part of a study to determine if simple body measurements could be used to predict body fat, abdomen and chest measurements in centimeters were made on 100 men. The left part of Figure 9.7 compares chest (variable 1) and abdomen (variable 2) measurements for these men.

For these data,

x̄ = [ 101.068 ; 92.183 ], S = [ 68.3283  76.3466 ; 76.3466  102.599 ],

and S = QΛQ^T where

Λ = [ λ1  0 ; 0  λ2 ] ≈ [ 163.709  0 ; 0  7.218 ] and Q = [ q1  q2 ] ≈ [ 0.625  −0.781 ; 0.781  0.625 ].

The principal component axes shown in the left part of Figure 9.7 are lines described parametrically as ℓi(t) = x̄ + tqi, for i = 1, 2. The lines are centered at x̄.

Since λ1 is much greater than λ2, most of the variation in the data is along the first principal component axis. In fact, since

λ1/(λ1 + λ2) ≈ 0.958 and λ2/(λ1 + λ2) ≈ 0.042,

about 95.8% of the total variation in the data is due to the first component.

Further, if c is the chest measurement and a is the abdomen measurement of a man in the study, then the formula

y = 0.625(c − 101.068) + 0.781(a − 92.183)

can be used as a body “size index” in later analyses.
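The spectral decomposition reported in this example can be verified directly from S; a sketch with numpy (eigenvectors may come back with opposite signs, which describe the same axes):

```python
import numpy as np

S = np.array([[68.3283,  76.3466],
              [76.3466, 102.599 ]])

lam, Q = np.linalg.eigh(S)        # ascending eigenvalues
lam, Q = lam[::-1], Q[:, ::-1]    # reorder so that lambda_1 >= lambda_2

print(lam)                        # approximately [163.709  7.218]
print(Q)                          # columns ~ +/-(0.625, 0.781) and +/-(-0.781, 0.625)
print(lam[0] / lam.sum())         # approximately 0.958
```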


Example 9.63 (Body Fat Study, Part 2) The researchers also took hip measurements in centimeters. The right part of Figure 9.7 compares chest (variable 1), abdomen (variable 2), and hip (variable 3) measurements for the 100 men.

For these data,

x̄ = [ 101.068 ; 92.183 ; 99.594 ], S = [ 68.3283  76.3466  43.8397 ; 76.3466  102.599  56.79 ; 43.8397  56.79  42.1284 ],

and S = QΛQ^T where

Λ = [ λ1  0  0 ; 0  λ2  0 ; 0  0  λ3 ] ≈ [ 196.946  0  0 ; 0  9.474  0 ; 0  0  6.635 ]

and

Q = [ q1  q2  q3 ] ≈ [ 0.565  0.588  0.579 ; 0.710  0.011  −0.704 ; 0.420  −0.809  0.412 ].

The principal component axes shown in the right part of Figure 9.7 are lines described parametrically as ℓi(t) = x̄ + tqi, for i = 1, 2, 3. The lines are centered at x̄.

Since λ1 is much greater than λ2 and λ3, most of the variation in the data is along the first principal component axis. In fact, since

λ1/(λ1 + λ2 + λ3) ≈ 0.924, λ2/(λ1 + λ2 + λ3) ≈ 0.044, λ3/(λ1 + λ2 + λ3) ≈ 0.031,

about 92.4% of the total variation in the data is due to the first component.

Further, if c is the chest measurement, a is the abdomen measurement, and h is the hip measurement of a man in the study, then the formula

y = 0.565(c − 101.068) + 0.710(a − 92.183) + 0.420(h − 99.594)

can be used as a body “size index” in later analyses.


10 Additional Reading

(1-2) An Introduction to Mathematical Statistics, by Larsen and Marx (Prentice Hall), and Mathematical Statistics and Data Analysis, by Rice (Duxbury Press), are introductions to probability theory and mathematical statistics with emphasis on applications. The book by Larsen and Marx is currently used as the textbook for the MT426-427 mathematical statistics sequence.

(3-4) Calculus, by Hughes-Hallett, Gleason, et al. (Wiley), and Multivariable Calculus, by McCallum, Hughes-Hallett, Gleason, et al. (Wiley), are gentle introductions to single variable and multivariable calculus methods. The first book is currently used as the textbook for the MT100-101 calculus sequence.

(5) Vector Calculus, by Colley (Prentice Hall), is a more sophisticated introduction to multivariable calculus. Colley emphasizes the use of linear algebra methods. This book is currently used as the textbook for the MT202 multivariable calculus course.

(6-7) Linear Algebra and Its Applications, by Lay (Addison-Wesley), and Introduction to Linear Algebra, by Strang (Wellesley-Cambridge), are good introductions to linear algebra methods. Each book emphasizes both theory and practice. The book by Lay is currently used as the textbook for the MT210 linear algebra course.

(8) Advanced Calculus with Applications in Statistics, by Khuri (Wiley), is a more advanced text designed specifically to tie topics in advanced calculus to applications in statistics.

(9-10) Introduction to Matrices with Applications in Statistics, by Graybill (Wadsworth), and Matrix Algebra Useful for Statistics, by Searle (Wiley), are well-respected texts designed specifically to tie linear algebra methods to applications in statistics. Note, however, that each book is quite old, and each uses out-of-date notation.

(11) Statistical Models, by Freedman (Cambridge University Press), is a recent text for a second course in statistics. From the Preface: “The contents of the book can be fairly described as what you have to know in order to start reading empirical papers that use statistical models.”
