COMPUTATIONAL STATISTICS
Wiley Series in Computational Statistics
Consulting Editors
Paolo Giudici, University of Pavia, Italy
Geof H. Givens, Colorado State University, USA
Bani K. Mallick, Texas A&M University, USA
The Wiley Series in Computational Statistics comprises practical guides and cutting-edge research books on new developments in computational statistics. It features quality authors with a strong applications focus. The texts in the series provide detailed coverage of statistical concepts, methods, and case studies in areas at the interface of statistics, computing, and numerics.
With sound motivation and a wealth of practical examples, the books show in concrete terms how to select and to use appropriate ranges of statistical computing techniques in particular fields of study. Readers are assumed to have a basic understanding of introductory terminology.
The series concentrates on applications of computational methods in statistics to fields of bioinformatics, genomics, epidemiology, business, engineering, finance, and applied statistics.
A complete list of titles in this series appears at the end of the volume
SECOND EDITION
COMPUTATIONAL STATISTICS
GEOF H. GIVENS AND JENNIFER A. HOETING
Department of Statistics, Colorado State University, Fort Collins, CO
Copyright © 2013 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993, or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data
Givens, Geof H.
Computational statistics / Geof H. Givens, Jennifer A. Hoeting. – 2nd ed.
p. cm.
Includes index.
ISBN 978-0-470-53331-4 (cloth)
1. Mathematical statistics–Data processing. I. Hoeting, Jennifer A. (Jennifer Ann), 1966– II. Title.
QA276.4.G58 2013
519.5–dc23
2012017381
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
To Natalie and Neil
CONTENTS
PREFACE xv
ACKNOWLEDGMENTS xvii
1 REVIEW 1
1.1 Mathematical Notation 1
1.2 Taylor's Theorem and Mathematical Limit Theory 2
1.3 Statistical Notation and Probability Distributions 4
1.4 Likelihood Inference 9
1.5 Bayesian Inference 11
1.6 Statistical Limit Theory 13
1.7 Markov Chains 14
1.8 Computing 17
PART I
OPTIMIZATION
2 OPTIMIZATION AND SOLVING NONLINEAR EQUATIONS 21
2.1 Univariate Problems 22
2.1.1 Newton's Method 26
2.1.1.1 Convergence Order 29
2.1.2 Fisher Scoring 30
2.1.3 Secant Method 30
2.1.4 Fixed-Point Iteration 32
2.1.4.1 Scaling 33
2.2 Multivariate Problems 34
2.2.1 Newton's Method and Fisher Scoring 34
2.2.1.1 Iteratively Reweighted Least Squares 36
2.2.2 Newton-Like Methods 39
2.2.2.1 Ascent Algorithms 39
2.2.2.2 Discrete Newton and Fixed-Point Methods 41
2.2.2.3 Quasi-Newton Methods 41
2.2.3 Gauss–Newton Method 44
2.2.4 Nelder–Mead Algorithm 45
2.2.5 Nonlinear Gauss–Seidel Iteration 52
Problems 54
3 COMBINATORIAL OPTIMIZATION 59
3.1 Hard Problems and NP-Completeness 59
3.1.1 Examples 61
3.1.2 Need for Heuristics 64
3.2 Local Search 65
3.3 Simulated Annealing 68
3.3.1 Practical Issues 70
3.3.1.1 Neighborhoods and Proposals 70
3.3.1.2 Cooling Schedule and Convergence 71
3.3.2 Enhancements 74
3.4 Genetic Algorithms 75
3.4.1 Definitions and the Canonical Algorithm 75
3.4.1.1 Basic Definitions 75
3.4.1.2 Selection Mechanisms and Genetic Operators 76
3.4.1.3 Allele Alphabets and Genotypic Representation 78
3.4.1.4 Initialization, Termination, and Parameter Values 79
3.4.2 Variations 80
3.4.2.1 Fitness 80
3.4.2.2 Selection Mechanisms and Updating Generations 81
3.4.2.3 Genetic Operators and Permutation Chromosomes 82
3.4.3 Initialization and Parameter Values 84
3.4.4 Convergence 84
3.5 Tabu Algorithms 85
3.5.1 Basic Definitions 86
3.5.2 The Tabu List 87
3.5.3 Aspiration Criteria 88
3.5.4 Diversification 89
3.5.5 Intensification 90
3.5.6 Comprehensive Tabu Algorithm 91
Problems 92
4 EM OPTIMIZATION METHODS 97
4.1 Missing Data, Marginalization, and Notation 97
4.2 The EM Algorithm 98
4.2.1 Convergence 102
4.2.2 Usage in Exponential Families 105
4.2.3 Variance Estimation 106
4.2.3.1 Louis's Method 106
4.2.3.2 SEM Algorithm 108
4.2.3.3 Bootstrapping 110
4.2.3.4 Empirical Information 110
4.2.3.5 Numerical Differentiation 111
4.3 EM Variants 111
4.3.1 Improving the E Step 111
4.3.1.1 Monte Carlo EM 111
4.3.2 Improving the M Step 112
4.3.2.1 ECM Algorithm 113
4.3.2.2 EM Gradient Algorithm 116
4.3.3 Acceleration Methods 118
4.3.3.1 Aitken Acceleration 118
4.3.3.2 Quasi-Newton Acceleration 119
Problems 121
PART II
INTEGRATION AND SIMULATION
5 NUMERICAL INTEGRATION 129
5.1 Newton–Cotes Quadrature 129
5.1.1 Riemann Rule 130
5.1.2 Trapezoidal Rule 134
5.1.3 Simpson's Rule 136
5.1.4 General kth-Degree Rule 138
5.2 Romberg Integration 139
5.3 Gaussian Quadrature 142
5.3.1 Orthogonal Polynomials 143
5.3.2 The Gaussian Quadrature Rule 143
5.4 Frequently Encountered Problems 146
5.4.1 Range of Integration 146
5.4.2 Integrands with Singularities or Other Extreme Behavior 146
5.4.3 Multiple Integrals 147
5.4.4 Adaptive Quadrature 147
5.4.5 Software for Exact Integration 148
Problems 148
6 SIMULATION AND MONTE CARLO INTEGRATION 151
6.1 Introduction to the Monte Carlo Method 151
6.2 Exact Simulation 152
6.2.1 Generating from Standard Parametric Families 153
6.2.2 Inverse Cumulative Distribution Function 153
6.2.3 Rejection Sampling 155
6.2.3.1 Squeezed Rejection Sampling 158
6.2.3.2 Adaptive Rejection Sampling 159
6.3 Approximate Simulation 163
6.3.1 Sampling Importance Resampling Algorithm 163
6.3.1.1 Adaptive Importance, Bridge, and Path Sampling 167
6.3.2 Sequential Monte Carlo 168
6.3.2.1 Sequential Importance Sampling for Markov Processes 169
6.3.2.2 General Sequential Importance Sampling 170
6.3.2.3 Weight Degeneracy, Rejuvenation, and Effective Sample Size 171
6.3.2.4 Sequential Importance Sampling for Hidden Markov Models 175
6.3.2.5 Particle Filters 179
6.4 Variance Reduction Techniques 180
6.4.1 Importance Sampling 180
6.4.2 Antithetic Sampling 186
6.4.3 Control Variates 189
6.4.4 Rao–Blackwellization 193
Problems 195
7 MARKOV CHAIN MONTE CARLO 201
7.1 Metropolis–Hastings Algorithm 202
7.1.1 Independence Chains 204
7.1.2 Random Walk Chains 206
7.2 Gibbs Sampling 209
7.2.1 Basic Gibbs Sampler 209
7.2.2 Properties of the Gibbs Sampler 214
7.2.3 Update Ordering 216
7.2.4 Blocking 216
7.2.5 Hybrid Gibbs Sampling 216
7.2.6 Griddy–Gibbs Sampler 218
7.3 Implementation 218
7.3.1 Ensuring Good Mixing and Convergence 219
7.3.1.1 Simple Graphical Diagnostics 219
7.3.1.2 Burn-in and Run Length 220
7.3.1.3 Choice of Proposal 222
7.3.1.4 Reparameterization 223
7.3.1.5 Comparing Chains: Effective Sample Size 224
7.3.1.6 Number of Chains 225
7.3.2 Practical Implementation Advice 226
7.3.3 Using the Results 226
Problems 230
8 ADVANCED TOPICS IN MCMC 237
8.1 Adaptive MCMC 237
8.1.1 Adaptive Random Walk Metropolis-within-Gibbs Algorithm 238
8.1.2 General Adaptive Metropolis-within-Gibbs Algorithm 240
8.1.3 Adaptive Metropolis Algorithm 247
8.2 Reversible Jump MCMC 250
8.2.1 RJMCMC for Variable Selection in Regression 253
8.3 Auxiliary Variable Methods 256
8.3.1 Simulated Tempering 257
8.3.2 Slice Sampler 258
8.4 Other Metropolis–Hastings Algorithms 260
8.4.1 Hit-and-Run Algorithm 260
8.4.2 Multiple-Try Metropolis–Hastings Algorithm 261
8.4.3 Langevin Metropolis–Hastings Algorithm 262
8.5 Perfect Sampling 264
8.5.1 Coupling from the Past 264
8.5.1.1 Stochastic Monotonicity and Sandwiching 267
8.6 Markov Chain Maximum Likelihood 268
8.7 Example: MCMC for Markov Random Fields 269
8.7.1 Gibbs Sampling for Markov Random Fields 270
8.7.2 Auxiliary Variable Methods for Markov Random Fields 274
8.7.3 Perfect Sampling for Markov Random Fields 277
Problems 279
PART III
BOOTSTRAPPING
9 BOOTSTRAPPING 287
9.1 The Bootstrap Principle 287
9.2 Basic Methods 288
9.2.1 Nonparametric Bootstrap 288
9.2.2 Parametric Bootstrap 289
9.2.3 Bootstrapping Regression 290
9.2.4 Bootstrap Bias Correction 291
9.3 Bootstrap Inference 292
9.3.1 Percentile Method 292
9.3.1.1 Justification for the Percentile Method 293
9.3.2 Pivoting 294
9.3.2.1 Accelerated Bias-Corrected Percentile Method, BCa 294
9.3.2.2 The Bootstrap t 296
9.3.2.3 Empirical Variance Stabilization 298
9.3.2.4 Nested Bootstrap and Prepivoting 299
9.3.3 Hypothesis Testing 301
9.4 Reducing Monte Carlo Error 302
9.4.1 Balanced Bootstrap 302
9.4.2 Antithetic Bootstrap 302
9.5 Bootstrapping Dependent Data 303
9.5.1 Model-Based Approach 304
9.5.2 Block Bootstrap 304
9.5.2.1 Nonmoving Block Bootstrap 304
9.5.2.2 Moving Block Bootstrap 306
9.5.2.3 Blocks-of-Blocks Bootstrapping 307
9.5.2.4 Centering and Studentizing 309
9.5.2.5 Block Size 311
9.6 Bootstrap Performance 315
9.6.1 Independent Data Case 315
9.6.2 Dependent Data Case 316
9.7 Other Uses of the Bootstrap 316
9.8 Permutation Tests 317
Problems 319
PART IV
DENSITY ESTIMATION AND SMOOTHING
10 NONPARAMETRIC DENSITY ESTIMATION 325
10.1 Measures of Performance 326
10.2 Kernel Density Estimation 327
10.2.1 Choice of Bandwidth 329
10.2.1.1 Cross-Validation 332
10.2.1.2 Plug-in Methods 335
10.2.1.3 Maximal Smoothing Principle 338
10.2.2 Choice of Kernel 339
10.2.2.1 Epanechnikov Kernel 339
10.2.2.2 Canonical Kernels and Rescalings 340
10.3 Nonkernel Methods 341
10.3.1 Logspline 341
10.4 Multivariate Methods 345
10.4.1 The Nature of the Problem 345
10.4.2 Multivariate Kernel Estimators 346
10.4.3 Adaptive Kernels and Nearest Neighbors 348
10.4.3.1 Nearest Neighbor Approaches 349
10.4.3.2 Variable-Kernel Approaches and Transformations 350
10.4.4 Exploratory Projection Pursuit 353
Problems 359
11 BIVARIATE SMOOTHING 363
11.1 Predictor–Response Data 363
11.2 Linear Smoothers 365
11.2.1 Constant-Span Running Mean 366
11.2.1.1 Effect of Span 368
11.2.1.2 Span Selection for Linear Smoothers 369
11.2.2 Running Lines and Running Polynomials 372
11.2.3 Kernel Smoothers 374
11.2.4 Local Regression Smoothing 374
11.2.5 Spline Smoothing 376
11.2.5.1 Choice of Penalty 377
11.3 Comparison of Linear Smoothers 377
11.4 Nonlinear Smoothers 379
11.4.1 Loess 379
11.4.2 Supersmoother 381
11.5 Confidence Bands 384
11.6 General Bivariate Data 388
Problems 389
12 MULTIVARIATE SMOOTHING 393
12.1 Predictor–Response Data 393
12.1.1 Additive Models 394
12.1.2 Generalized Additive Models 397
12.1.3 Other Methods Related to Additive Models 399
12.1.3.1 Projection Pursuit Regression 399
12.1.3.2 Neural Networks 402
12.1.3.3 Alternating Conditional Expectations 403
12.1.3.4 Additivity and Variance Stabilization 404
12.1.4 Tree-Based Methods 405
12.1.4.1 Recursive Partitioning: Regression Trees 406
12.1.4.2 Tree Pruning 409
12.1.4.3 Classification Trees 411
12.1.4.4 Other Issues for Tree-Based Methods 412
12.2 General Multivariate Data 413
12.2.1 Principal Curves 413
12.2.1.1 Definition and Motivation 413
12.2.1.2 Estimation 415
12.2.1.3 Span Selection 416
Problems 416
DATA ACKNOWLEDGMENTS 421
REFERENCES 423
INDEX 457
PREFACE
This book covers most topics needed to develop a broad and thorough working knowledge of modern computational statistics. We seek to develop a practical understanding of how and why existing methods work, enabling readers to use modern statistical methods effectively. Since many new methods are built from components of existing techniques, our ultimate goal is to provide scientists with the tools they need to contribute new ideas to the field.
A growing challenge in science is that there is so much of it. While the pursuit of important new methods and the honing of existing approaches is a worthy goal, there is also a need to organize and distill the teeming jungle of ideas. We attempt to do that here. Our choice of topics reflects our view of what constitutes the core of the evolving field of computational statistics, and what will be interesting and useful for our readers.
Our use of the adjective modern in the first sentence of this preface is potentially troublesome. There is no way that this book can cover all the latest, greatest techniques. We have not even tried. We have instead aimed to provide a reasonably up-to-date survey of a broad portion of the field, while leaving room for diversions and esoterica.
The foundations of optimization and numerical integration are covered in this book. We include these venerable topics because (i) they are cornerstones of frequentist and Bayesian inference, (ii) routine application of available software often fails for hard problems, and (iii) the methods themselves are often secondary components of other statistical computing algorithms. Some topics we have omitted represent important areas of past and present research in the field, but their priority here is lowered by the availability of high-quality software. For example, the generation of pseudo-random numbers is a classic topic, but one that we prefer to address by giving students reliable software. Finally, some topics (e.g., principal curves and tabu search) are included simply because they are interesting and provide very different perspectives on familiar problems. Perhaps a future researcher may draw ideas from such topics to design a creative and effective new algorithm.
In this second edition we have both updated and broadened our coverage, and we now provide computer code. For example, we have added new MCMC topics to reflect continued activity in that popular area. A notable increase in breadth is our inclusion of more methods relevant for problems where statistical dependency is important, such as block bootstrapping and sequential importance sampling. This second edition provides extensive new support in R. Specifically, code for the examples in this book is available from the book website, www.stat.colostate.edu/computationalstatistics.
Our target audience includes graduate students in statistics and related fields, statisticians, and quantitative empirical scientists in other fields. We hope such readers may use the book when applying standard methods and developing new methods.
The level of mathematics expected of the reader does not extend much beyond Taylor series and linear algebra. Breadth of mathematical training is more helpful than depth. Essential review is provided in Chapter 1. More advanced readers will find greater mathematical detail in the wide variety of high-quality books available on specific topics, many of which are referenced in the text. Other readers, caring less about analytical details, may prefer to focus on our descriptions of algorithms and examples.
The expected level of statistics is equivalent to that obtained by a graduate student in his or her first year of study of the theory of statistics and probability. An understanding of maximum likelihood methods, Bayesian methods, elementary asymptotic theory, Markov chains, and linear models is most important. Many of these topics are reviewed in Chapter 1.
With respect to computer programming, we find that good students can learn as they go. However, a working knowledge of a suitable language allows implementation of the ideas covered in this book to progress much more quickly. We have chosen to forgo any language-specific examples, algorithms, or coding in the text. For those wishing to learn a language while they study this book, we recommend that you choose a high-level, interactive package that permits the flexible design of graphical displays and includes supporting statistics and probability functions, such as R and MATLAB.1 These are the sort of languages often used by researchers during the development of new statistical computing techniques, and they are suitable for implementing all the methods we describe, except in some cases for problems of vast scope or complexity. We use R and recommend it. Although lower-level languages such as C++ could also be used, they are more appropriate for professional-grade implementation of algorithms after researchers have refined the methodology.
The book is organized into four major parts: optimization (Chapters 2, 3, and 4), integration and simulation (Chapters 5, 6, 7, and 8), bootstrapping (Chapter 9), and density estimation and smoothing (Chapters 10, 11, and 12). The chapters are written to stand independently, so a course can be built by selecting the topics one wishes to teach. For a one-semester course, our selection typically weights most heavily topics from Chapters 2, 3, 6, 7, 9, 10, and 11. With a leisurely pace or more thorough coverage, a shorter list of topics could still easily fill a semester course. There is sufficient material here to provide a thorough one-year course of study, notwithstanding any supplemental topics one might wish to teach.
A variety of homework problems are included at the end of each chapter. Some are straightforward, while others require the student to develop a thorough understanding of the model/method being used, to carefully (and perhaps cleverly) code a suitable technique, and to devote considerable attention to the interpretation of results. A few exercises invite open-ended exploration of methods and ideas. We are sometimes asked for solutions to the exercises, but we prefer to sequester them to preserve the challenge for future students and readers.
The datasets discussed in the examples and exercises are available from the book website, www.stat.colostate.edu/computationalstatistics. The R code is also provided there. Finally, the website includes an errata. Responsibility for all errors lies with us.
1 R is available for free from www.r-project.org. Information about MATLAB can be found at www.mathworks.com.
ACKNOWLEDGMENTS
The course upon which this book is based was developed and taught by us at Colorado State University from 1994 onwards. Thanks are due to our many students, who have been semiwilling guinea pigs over the years. We also thank our colleagues in the Statistics Department for their continued support. The late Richard Tweedie merits particular acknowledgment for his mentoring during the early years of our careers.
We owe a great deal of intellectual debt to Adrian Raftery, who deserves special thanks not only for his teaching and advising but also for his unwavering support and his seemingly inexhaustible supply of good ideas. In addition, we thank our influential advisors and teachers at the University of Washington Statistics Department, including David Madigan, Werner Stuetzle, and Judy Zeh. Of course, each of our chapters could be expanded into a full-length book, and great scholars have already done so. We owe much to their efforts, upon which we relied when developing our course and our manuscript.
Portions of the first edition were written at the Department of Mathematics and Statistics, University of Otago, in Dunedin, New Zealand, whose faculty we thank for graciously hosting us during our sabbatical in 2003. Much of our work on the second edition was undertaken during our sabbatical visit to the Australia Commonwealth Scientific and Research Organization in 2009–2010, sponsored by CSIRO Mathematics, Informatics and Statistics, and hosted at the Longpocket Laboratory in Indooroopilly, Australia. We thank our hosts and colleagues there for their support.
Our manuscript has been greatly improved through the constructive reviews of John Bickham, Ben Bird, Kate Cowles, Jan Hannig, Alan Herlihy, David Hunter, Devin Johnson, Michael Newton, Doug Nychka, Steve Sain, David W. Scott, N. Scott Urquhart, Haonan Wang, Darrell Whitley, and eight anonymous referees. We also thank the sharp-eyed readers listed in the errata for their suggestions and corrections. Our editor Steve Quigley and the folks at Wiley were supportive and helpful during the publication process. We thank Nelida Pohl for permission to adapt her photograph in the cover design of the first edition. We thank Melinda Stelzer for permission to use her painting "Champagne Circuit," 2001, for the cover of the second edition. More about her art can be found at www.facebook.com/geekchicart. We also owe a special note of thanks to Zube (aka John Dzubera), who kept our own computers running despite our best efforts to the contrary.
Funding from National Science Foundation (NSF) CAREER grant SBR-9875508 was a significant source of support for the first author during the preparation of the first edition. He also thanks his colleagues and friends in the North Slope Borough, Alaska, Department of Wildlife Management for their longtime research support. The second author gratefully acknowledges the support of STAR Research Assistance Agreement CR-829095 awarded to Colorado State University by the U.S. Environmental Protection Agency (EPA). The views expressed here are solely those of the authors. NSF and EPA do not endorse any products or commercial services mentioned herein.
Finally, we thank our parents for enabling and supporting our educations, and for providing us with the "stubbornness gene" necessary for graduate school, the tenure track, or book publication: take your pick. The second edition is dedicated to our kids, Natalie and Neil, for continuing to show us what is important and what is not.
Geof H. Givens
Jennifer A. Hoeting
CHAPTER 1
REVIEW
This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.
1.1 MATHEMATICAL NOTATION
We use boldface to distinguish a vector x = (x1, . . . , xp) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f1(x), . . . , fp(x)). The transpose of M is denoted M^T.
Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x1, . . . , xn)^T. Let I denote an identity matrix, and 1 and 0 denote vectors of ones and zeros, respectively.
A symmetric square matrix M is positive definite if x^T M x > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite or positive semidefinite if x^T M x ≥ 0 for all nonzero vectors x.
The derivative of a function f evaluated at x is denoted f prime(x) When x =(x1 xp) the gradient of f at x is
f prime(x) =(
df (x)
dx1
df (x)
dxp
)
The Hessian matrix for f at x is f″(x), having (i, j)th element equal to d²f(x)/(dx_i dx_j). The negative Hessian has important uses in statistical inference.
Let J(x) denote the Jacobian matrix evaluated at x for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to df_i(x)/dx_j.
A functional is a real-valued function on a space of functions. For example, if T(f) = \int_0^1 f(x)\,dx, then the functional T maps suitably integrable functions onto the real line.
The indicator function 1{A} equals 1 if A is true and 0 otherwise. The real line is denoted ℝ, and p-dimensional real space is ℝ^p.
Computational Statistics, Second Edition. Geof H. Givens and Jennifer A. Hoeting.
© 2013 John Wiley & Sons, Inc. Published 2013 by John Wiley & Sons, Inc.
1.2 TAYLOR'S THEOREM AND MATHEMATICAL LIMIT THEORY
First we define standard "big oh" and "little oh" notation for describing the relative orders of convergence of functions. Let the functions f and g be defined on a common, possibly infinite interval. Let z0 be a point in this interval or a boundary point of it (i.e., −∞ or ∞). We require g(z) ≠ 0 for all z ≠ z0 in a neighborhood of z0. Then we say

f(z) = O(g(z))   (1.1)

if there exists a constant M such that |f(z)| ≤ M|g(z)| as z → z0. For example, (n + 1)/(3n²) = O(n⁻¹), and it is understood that we are considering n → ∞. If lim_{z→z0} f(z)/g(z) = 0, then we say

f(z) = o(g(z))   (1.2)

For example, f(x0 + h) − f(x0) = hf′(x0) + o(h) as h → 0 if f is differentiable at x0. The same notation can be used for describing the convergence of a sequence x_n as n → ∞, by letting f(n) = x_n.
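The example above can be checked numerically: if (n + 1)/(3n²) = O(n⁻¹), then the ratio of the two functions must remain bounded as n grows. A minimal sketch in Python (our own illustration; the bound M = 2/3 is one convenient choice):

```python
# Our illustration of big-oh: f(n) = (n + 1) / (3 n^2) is O(1/n) because
# f(n) / (1/n) = (n + 1) / (3n), which never exceeds 2/3 for n >= 1 and
# decreases toward the limit 1/3.

def f(n):
    return (n + 1) / (3 * n**2)

def g(n):
    return 1 / n

ratios = [f(n) / g(n) for n in range(1, 10001)]
assert all(r <= 2 / 3 for r in ratios)  # so M = 2/3 works in |f| <= M |g|
print(max(ratios), ratios[-1])          # maximum at n = 1; tail approaches 1/3
```

The same ratio test shows f is not o(1/n), since the ratio tends to 1/3 rather than 0.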
Taylor's theorem provides a polynomial approximation to a function f. Suppose f has finite (n + 1)th derivative on (a, b) and continuous nth derivative on [a, b]. Then for any x0 ∈ [a, b] distinct from x, the Taylor series expansion of f about x0 is

f(x) = \sum_{i=0}^{n} \frac{1}{i!} f^{(i)}(x_0) (x - x_0)^i + R_n,   (1.3)

where f^{(i)}(x_0) is the ith derivative of f evaluated at x_0, and

R_n = \frac{1}{(n+1)!} f^{(n+1)}(\xi) (x - x_0)^{n+1}   (1.4)

for some point ξ in the interval between x and x0. As |x − x0| → 0, note that R_n = O(|x − x0|^{n+1}).
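To make the expansion concrete, consider f(x) = exp(x) about x0 = 0, where every derivative equals exp(x0) = 1; this example is ours, not the text's. The Python sketch below compares the nth-order Taylor polynomial to the true value and checks the Lagrange remainder bound implied by (1.4):

```python
import math

def taylor_exp(x, n, x0=0.0):
    """nth-order Taylor polynomial of exp about x0; every derivative is exp(x0)."""
    return sum(math.exp(x0) * (x - x0) ** i / math.factorial(i)
               for i in range(n + 1))

x, n = 0.5, 4
approx = taylor_exp(x, n)
remainder = abs(math.exp(x) - approx)

# Lagrange form (1.4): |R_n| = exp(xi) |x - x0|^(n+1) / (n + 1)! for some xi
# between 0 and x, so exp(x) bounds exp(xi) here.
bound = math.exp(x) * abs(x) ** (n + 1) / math.factorial(n + 1)
assert remainder <= bound
print(approx, remainder, bound)
```

Halving |x − x0| shrinks the remainder by roughly 2^{n+1}, illustrating the O(|x − x0|^{n+1}) rate.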
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x0 ≠ x. Then

f(x) = f(x_0) + \sum_{i=1}^{n} \frac{1}{i!} D^{(i)}(f; x_0, x - x_0) + R_n,   (1.5)

where

D^{(i)}(f; x, y) = \sum_{j_1=1}^{p} \cdots \sum_{j_i=1}^{p} \left( \frac{d^i}{dt_{j_1} \cdots dt_{j_i}} f(t) \Big|_{t=x} \right) \prod_{k=1}^{i} y_{j_k}   (1.6)
and

R_n = \frac{1}{(n+1)!} D^{(n+1)}(f; \xi, x - x_0)   (1.7)

for some ξ on the line segment joining x and x0. As |x − x0| → 0, note that R_n = O(|x − x0|^{n+1}).
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

\int_0^1 f(x)\,dx = \frac{f(0) + f(1)}{2} - \sum_{i=1}^{n-1} \frac{b_{2i} \left( f^{(2i-1)}(1) - f^{(2i-1)}(0) \right)}{(2i)!} - \frac{b_{2n} f^{(2n)}(\xi)}{(2n)!},   (1.8)

where 0 ≤ ξ ≤ 1, f^{(j)} is the jth derivative of f, and b_j = B_j(0) can be determined using the recursion relation

\sum_{j=0}^{m} \binom{m+1}{j} B_j(z) = (m + 1) z^m   (1.9)

initialized with B_0(z) = 1. The proof of this result is based on repeated integrations by parts [376].
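The recursion (1.9), evaluated at z = 0, determines the constants b_j directly: the right side is 1 when m = 0 and 0 for m ≥ 1, and the coefficient of b_m is C(m + 1, m) = m + 1. A minimal Python sketch using exact rational arithmetic (the function name is ours):

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(n_max):
    """Compute b_j = B_j(0) from recursion (1.9) at z = 0:
    sum_{j=0}^{m} C(m+1, j) b_j = 0 for m >= 1, with b_0 = B_0(0) = 1."""
    b = [Fraction(1)]                     # B_0(z) = 1, so b_0 = 1
    for m in range(1, n_max + 1):
        acc = sum(Fraction(comb(m + 1, j)) * b[j] for j in range(m))
        b.append(-acc / (m + 1))          # solve for b_m; C(m+1, m) = m + 1
    return b

b = bernoulli_numbers(6)
print([str(x) for x in b])  # ['1', '-1/2', '1/6', '0', '-1/30', '0', '1/42']
```

These are the Bernoulli numbers; only the even-indexed b_{2i} enter (1.8), since b_3 = b_5 = · · · = 0.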
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

\frac{df(x)}{dx_i} \approx \frac{f(x + \epsilon_i e_i) - f(x - \epsilon_i e_i)}{2 \epsilon_i},   (1.10)

where ε_i is a small number and e_i is the unit vector in the ith coordinate direction. Typically, one might start with, say, ε_i = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller ε_i. The approximation will generally improve until ε_i becomes small enough that the calculation is degraded and eventually dominated by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via

\frac{d^2 f(x)}{dx_i\, dx_j} \approx \frac{1}{4 \epsilon_i \epsilon_j} \Big( f(x + \epsilon_i e_i + \epsilon_j e_j) - f(x + \epsilon_i e_i - \epsilon_j e_j) - f(x - \epsilon_i e_i + \epsilon_j e_j) + f(x - \epsilon_i e_i - \epsilon_j e_j) \Big)   (1.11)

with similar sequential precision improvements.
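Equations (1.10) and (1.11) translate directly into code. The following Python sketch is our own illustration; the test function and step sizes are arbitrary choices:

```python
def fd_gradient(f, x, eps=1e-5):
    """Central-difference approximation (1.10) to the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        grad.append((f(xp) - f(xm)) / (2 * eps))
    return grad

def fd_second(f, x, i, j, eps=1e-4):
    """Approximation (1.11) to d^2 f / (dx_i dx_j) at x."""
    def shift(si, sj):
        y = list(x)
        y[i] += si * eps
        y[j] += sj * eps
        return f(y)
    return (shift(1, 1) - shift(1, -1)
            - shift(-1, 1) + shift(-1, -1)) / (4 * eps ** 2)

# Illustrative test function f(x) = x1^2 * x2, with known derivatives
# df/dx1 = 2 x1 x2, df/dx2 = x1^2, and d^2 f / (dx1 dx2) = 2 x1.
f = lambda x: x[0] ** 2 * x[1]
x = [2.0, 3.0]
print(fd_gradient(f, x))      # close to [12, 4]
print(fd_second(f, x, 0, 1))  # close to 4
```

Shrinking eps further eventually makes the results worse, not better, which is the subtractive-cancellation effect described above.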
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS
We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ∼ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters also will be denoted with a conditioning bar, as in f(x|α, β). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|α) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example, by using f(·|α) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are f_X and f_Y, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.
The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or f_{X|Y}(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|μ) = f(x|y, μ)f(y|μ). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.
The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^{1/2}/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so

g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y)   (1.12)

for all x, y ∈ I and all 0 < λ < 1. Then Jensen's inequality states that E{g(X)} ≥ g(E{X}) for any random variable X having P[X ∈ I] = 1.
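Jensen's inequality is easy to observe numerically. For the convex choice g(x) = x², the gap E{g(X)} − g(E{X}) is exactly var{X}, so the sample version of the inequality must hold for any nondegenerate sample. A small Monte Carlo sketch (our own example, not from the text):

```python
import random

random.seed(1)
g = lambda x: x * x                      # convex on the whole real line

xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
mean = sum(xs) / len(xs)
e_g = sum(g(x) for x in xs) / len(xs)    # sample analog of E{g(X)}
g_e = g(mean)                            # g evaluated at the sample mean

# For g(x) = x^2, e_g - g_e is the (biased) sample variance, hence >= 0;
# with N(0, 1) draws it should be near 1.
assert e_g >= g_e
print(e_g, g_e, e_g - g_e)
```

For a concave g the inequality reverses, which is why, for example, E{log X} ≤ log E{X}.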
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

n! = n(n − 1)(n − 2) · · · (3)(2)(1), with 0! = 1,   (1.13)

\binom{n}{k} = \frac{n!}{k!(n-k)!},   (1.14)
TABLE 1.1 Notation and description for common probability distributions of discrete random variables.

Bernoulli. X ∼ Bernoulli(p); 0 ≤ p ≤ 1. Density: f(x) = p^x (1 − p)^{1−x}; x = 0 or 1. Moments: E{X} = p; var{X} = p(1 − p).

Binomial. X ∼ Bin(n, p); 0 ≤ p ≤ 1; n ∈ {1, 2, . . .}. Density: f(x) = \binom{n}{x} p^x (1 − p)^{n−x}; x = 0, 1, . . . , n. Moments: E{X} = np; var{X} = np(1 − p).

Multinomial. X ∼ Multinomial(n, p); p = (p1, . . . , pk) with 0 ≤ p_i ≤ 1 and \sum_{i=1}^{k} p_i = 1; n ∈ {1, 2, . . .}. Density: f(x) = \binom{n}{x_1 \cdots x_k} \prod_{i=1}^{k} p_i^{x_i}; x = (x1, . . . , xk) with x_i ∈ {0, 1, . . . , n} and \sum_{i=1}^{k} x_i = n. Moments: E{X} = np; var{X_i} = np_i(1 − p_i); cov{X_i, X_j} = −n p_i p_j.

Negative Binomial. X ∼ NegBin(r, p); 0 ≤ p ≤ 1; r ∈ {1, 2, . . .}. Density: f(x) = \binom{x + r − 1}{r − 1} p^r (1 − p)^x; x ∈ {0, 1, . . .}. Moments: E{X} = r(1 − p)/p; var{X} = r(1 − p)/p².

Poisson. X ∼ Poisson(λ); λ > 0. Density: f(x) = (λ^x / x!) exp{−λ}; x ∈ {0, 1, . . .}. Moments: E{X} = λ; var{X} = λ.
TABLE 1.2 Notation and description for some common probability distributions of continuous random variables

Beta: X ~ Beta(α, β), α > 0 and β > 0. Density f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1}(1 − x)^{β−1} for 0 ≤ x ≤ 1. E X = α/(α + β); var X = αβ/[(α + β)²(α + β + 1)].

Cauchy: X ~ Cauchy(α, β), α ∈ ℝ and β > 0. Density f(x) = 1/(πβ[1 + ((x − α)/β)²]) for x ∈ ℝ. E X and var X are nonexistent.

Chi-square: X ~ χ²_ν, ν > 0. This is the Gamma(ν/2, 1/2) distribution, with x > 0. E X = ν; var X = 2ν.

Dirichlet: X ~ Dirichlet(α), α = (α1, …, αk) with αi > 0, and α0 = Σ_{i=1}^k αi. Density f(x) = Γ(α0) Π_{i=1}^k xi^{αi−1} / Π_{i=1}^k Γ(αi) for x = (x1, …, xk) with 0 ≤ xi ≤ 1 and Σ_{i=1}^k xi = 1. E X = α/α0; var{Xi} = αi(α0 − αi)/[α0²(α0 + 1)]; cov{Xi, Xj} = −αiαj/[α0²(α0 + 1)].

Exponential: X ~ Exp(λ), λ > 0. Density f(x) = λ exp{−λx} for x > 0. E X = 1/λ; var X = 1/λ².

Gamma: X ~ Gamma(r, λ), λ > 0 and r > 0. Density f(x) = λ^r x^{r−1} exp{−λx}/Γ(r) for x > 0. E X = r/λ; var X = r/λ².
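The moment formulas in these tables are easy to spot-check numerically. The following sketch (in Python here, though the book's examples use R) integrates the Gamma(r, λ) density by a simple midpoint rule and compares the results against E X = r/λ and var X = r/λ²:

```python
import math

# Midpoint-rule integration of the Gamma(r, lambda) density from Table 1.2,
# checking that it integrates to 1 with mean r/lambda and variance r/lambda^2.
r, lam = 3.0, 2.0
f = lambda x: lam**r * x**(r - 1) * math.exp(-lam * x) / math.gamma(r)

def integrate(h, a, b, n=100000):
    """Composite midpoint rule for a smooth integrand on [a, b]."""
    step = (b - a) / n
    return step * sum(h(a + (i + 0.5) * step) for i in range(n))

total = integrate(f, 0.0, 30.0)                 # tail mass beyond 30 is negligible
mean = integrate(lambda x: x * f(x), 0.0, 30.0)
var = integrate(lambda x: (x - mean)**2 * f(x), 0.0, 30.0)

assert abs(total - 1.0) < 1e-4
assert abs(mean - r / lam) < 1e-4               # E X = 1.5
assert abs(var - r / lam**2) < 1e-4             # var X = 0.75
```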
TABLE 1.3 Notation and description for more common probability distributions of continuous random variables

Lognormal: X ~ Lognormal(μ, σ²), μ ∈ ℝ and σ > 0. Density f(x) = (1/(x√(2πσ²))) exp{−(log x − μ)²/(2σ²)} for x > 0. E X = exp{μ + σ²/2}; var X = exp{2μ + 2σ²} − exp{2μ + σ²}.

Multivariate Normal: X ~ N_k(μ, Σ), μ = (μ1, …, μk) ∈ ℝ^k and Σ positive definite. Density f(x) = exp{−(x − μ)ᵀ Σ⁻¹ (x − μ)/2} / ((2π)^{k/2} |Σ|^{1/2}) for x = (x1, …, xk) ∈ ℝ^k. E X = μ; var X = Σ.

Normal: X ~ N(μ, σ²), μ ∈ ℝ and σ > 0. Density f(x) = (1/√(2πσ²)) exp{−(x − μ)²/(2σ²)} for x ∈ ℝ. E X = μ; var X = σ².

Student's t: X ~ t_ν, ν > 0. Density f(x) = [Γ((ν + 1)/2)/(Γ(ν/2)√(πν))] (1 + x²/ν)^{−(ν+1)/2} for x ∈ ℝ. E X = 0 if ν > 1; var X = ν/(ν − 2) if ν > 2.

Uniform: X ~ Unif(a, b), a, b ∈ ℝ and a < b. Density f(x) = 1/(b − a) for x ∈ [a, b]. E X = (a + b)/2; var X = (b − a)²/12.

Weibull: X ~ Weibull(a, b), a > 0 and b > 0. Density f(x) = abx^{b−1} exp{−ax^b} for x > 0. E X = Γ(1 + 1/b)/a^{1/b}; var X = [Γ(1 + 2/b) − Γ(1 + 1/b)²]/a^{2/b}.
(n choose k1, …, km) = n! / Π_{i=1}^m ki!, where n = Σ_{i=1}^m ki,    (1.15)
Γ(r) = (r − 1)! for r = 1, 2, …, and Γ(r) = ∫₀^∞ t^{r−1} exp{−t} dt for general r > 0.    (1.16)
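A quick numerical check of (1.16) with Python's standard library (an illustrative aside; the book's own code is in R) confirms that the gamma function interpolates the factorial:

```python
import math

# Gamma(r) = (r - 1)! at positive integers, and Gamma(1/2) = sqrt(pi).
for r in range(1, 8):
    assert abs(math.gamma(r) - math.factorial(r - 1)) < 1e-9

assert abs(math.gamma(0.5) - math.sqrt(math.pi)) < 1e-12
```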
It is worth knowing that

Γ(1/2) = √π and Γ(n + 1/2) = (1 × 3 × 5 × · · · × (2n − 1)) √π / 2^n

for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as
f(x|γ) = c1(x) c2(γ) exp{ Σ_{i=1}^k yi(x) θi(γ) }    (1.17)
for nonnegative functions c1 and c2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θi(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The yi(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that
E{y(X)} = κ′(θ)    (1.18)

and

var{y(X)} = κ″(θ),    (1.19)

where κ(θ) = −log c3(θ), letting c3(θ) denote the reexpression of c2(γ) in terms of the canonical parameters θ = (θ1, …, θk), and y(X) = (y1(X), …, yk(X)). These results can be rewritten in terms of the original parameters γ as
E{ Σ_{i=1}^k (dθi(γ)/dγj) yi(X) } = −(d/dγj) log c2(γ)    (1.20)

and

var{ Σ_{i=1}^k (dθi(γ)/dγj) yi(X) } = −(d²/dγj²) log c2(γ) − E{ Σ_{i=1}^k (d²θi(γ)/dγj²) yi(X) }.    (1.21)
Example 1.1 (Poisson) The Poisson distribution belongs to the exponential family with c1(x) = 1/x!, c2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E X = κ′(θ) = exp{θ} = λ and var X = κ″(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
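Example 1.1 can also be verified numerically: with θ = log λ and κ(θ) = exp{θ}, finite-difference derivatives of κ recover the Poisson mean and variance. (A Python sketch; the step size h and the tolerances are arbitrary choices of ours.)

```python
import math

# Poisson as a one-parameter exponential family: theta = log(lambda),
# kappa(theta) = exp(theta). Check E X = kappa'(theta) and var X = kappa''(theta)
# numerically against the known Poisson moments.
lam = 3.0
theta = math.log(lam)
kappa = math.exp                                  # kappa(theta) = e^theta

h = 1e-5
kappa1 = (kappa(theta + h) - kappa(theta - h)) / (2 * h)                  # central difference
kappa2 = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h**2  # second difference

assert abs(kappa1 - lam) < 1e-6   # E X = lambda
assert abs(kappa2 - lam) < 1e-4   # var X = lambda
```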
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X1, …, Xp) denote a p-dimensional random variable with continuous density function f. Suppose that
U = g(X) = (g1(X), …, gp(X)) = (U1, …, Up),    (1.22)
where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

f(u) = f(g⁻¹(u)) |J(u)|,    (1.23)

where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g⁻¹ evaluated at u, having (i, j)th element dxi/duj, where these derivatives are assumed to be continuous over the support region of U.
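For a univariate illustration of (1.23), suppose X ~ Exp(1) and U = g(X) = exp(X). Then g⁻¹(u) = log u and |J(u)| = 1/u, so f(u) = exp{−log u}/u = 1/u² for u > 1. The sketch below (Python, with an ad hoc midpoint-rule integrator) confirms that integrating this transformed density reproduces the exact probability P(U ≤ b) = P(X ≤ log b):

```python
import math

# Change of variables: X ~ Exp(1), U = exp(X), f_U(u) = f_X(log u) * (1/u).
f_X = lambda x: math.exp(-x)                      # Exp(1) density
f_U = lambda u: f_X(math.log(u)) / u              # density of U on (1, infinity)

def integrate(h, a, b, n=100000):
    """Composite midpoint rule on [a, b]."""
    step = (b - a) / n
    return step * sum(h(a + (i + 0.5) * step) for i in range(n))

b = 10.0
exact = 1.0 - math.exp(-math.log(b))              # P(X <= log b) = 1 - 1/b = 0.9
approx = integrate(f_U, 1.0, b)
assert abs(approx - exact) < 1e-6
```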
1.4 LIKELIHOOD INFERENCE
If X1, …, Xn are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ1, …, θp), then the joint likelihood function is

L(θ) = Π_{i=1}^n f(xi|θ).    (1.24)
When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x1, …, xn|θ) viewed as a function of θ.
The observed data x1, …, xn might have been realized under many different values for θ. The parameters for which observing x1, …, xn would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x1, …, xn that maximizes L(θ), then θ̂ = ϑ(X1, …, Xn) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.
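As a simple instance, for i.i.d. Poisson data the MLE of λ is the sample mean; a crude grid search over the log likelihood (a Python sketch with invented data) finds the same maximizer:

```python
import math

# MLE for iid Poisson data: the maximizer of the likelihood is x-bar.
x = [2, 4, 1, 3, 0, 5, 2, 3]

def loglik(lam):
    # additive constants (the -log x_i! terms) omitted; they do not affect the maximizer
    return sum(xi * math.log(lam) - lam for xi in x)

xbar = sum(x) / len(x)                    # 2.5
grid = [0.1 * k for k in range(1, 81)]    # lambda in (0, 8]
best = max(grid, key=loglik)
assert abs(best - xbar) < 0.05            # grid maximizer sits at x-bar
```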
It is typically easier to work with the log likelihood function,

l(θ) = log L(θ),    (1.25)
which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x1, …, xn but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations

l′(θ) = 0,    (1.26)
where

l′(θ) = (dl(θ)/dθ1, …, dl(θ)/dθp)

is called the score function. The score function satisfies

E{l′(θ)} = 0,    (1.27)
where the expectation is taken with respect to the distribution of X1, …, Xn. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.
The MLE has a sampling distribution because it depends on the realization of the random variables X1, …, Xn. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: when the log likelihood is very pointy, the location of the maximum is more precisely known.
To make this precise, let l″(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθi dθj). The Fisher information matrix is defined as

I(θ) = E{l′(θ) l′(θ)ᵀ} = −E{l″(θ)},    (1.28)
where the expectations are taken with respect to the distribution of X1, …, Xn. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l″(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.
Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)⁻¹, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is N_p(θ*, I(θ*)⁻¹). Since the true parameter values are unknown, I(θ*)⁻¹ must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)⁻¹. Alternatively, it is also reasonable to use −l″(θ̂)⁻¹. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)⁻¹. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)⁻¹ can be found in [127, 182, 371, 470].
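To make these ideas concrete, the sketch below applies Newton's method (treated fully in Chapter 2) to the Poisson score equation and uses the observed Fisher information to obtain a standard error; the data are invented for illustration:

```python
import math

# Newton's method on the score equation for iid Poisson data, then a standard
# error from the observed Fisher information -l''(lambda-hat).
x = [2, 4, 1, 3, 0, 5, 2, 3]
n, s = len(x), sum(x)

score = lambda lam: s / lam - n            # l'(lambda)
hess = lambda lam: -s / lam**2             # l''(lambda)

lam = 1.0                                  # starting value
for _ in range(25):
    lam = lam - score(lam) / hess(lam)     # Newton update

assert abs(lam - s / n) < 1e-8             # MLE is the sample mean, 2.5
se = math.sqrt(-1.0 / hess(lam))           # sqrt of inverse observed information
assert abs(se - math.sqrt(lam / n)) < 1e-8 # here this equals sqrt(lambda-hat / n)
```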
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is

L(φ|μ̂(φ)) = max_μ L(μ, φ).    (1.29)

Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993, or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data

Givens, Geof H.
Computational statistics / Geof H. Givens, Jennifer A. Hoeting. – 2nd ed.
p. cm.
Includes index.
ISBN 978-0-470-53331-4 (cloth)
1. Mathematical statistics–Data processing. I. Hoeting, Jennifer A. (Jennifer Ann), 1966– II. Title.
QA276.4.G58 2013
519.5–dc23
2012017381
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
To Natalie and Neil
CONTENTS

PREFACE xv

ACKNOWLEDGMENTS xvii

1 REVIEW 1
1.1 Mathematical Notation 1
1.2 Taylor's Theorem and Mathematical Limit Theory 2
1.3 Statistical Notation and Probability Distributions 4
1.4 Likelihood Inference 9
1.5 Bayesian Inference 11
1.6 Statistical Limit Theory 13
1.7 Markov Chains 14
1.8 Computing 17

PART I OPTIMIZATION

2 OPTIMIZATION AND SOLVING NONLINEAR EQUATIONS 21
2.1 Univariate Problems 22
2.1.1 Newton's Method 26
2.1.1.1 Convergence Order 29
2.1.2 Fisher Scoring 30
2.1.3 Secant Method 30
2.1.4 Fixed-Point Iteration 32
2.1.4.1 Scaling 33
2.2 Multivariate Problems 34
2.2.1 Newton's Method and Fisher Scoring 34
2.2.1.1 Iteratively Reweighted Least Squares 36
2.2.2 Newton-Like Methods 39
2.2.2.1 Ascent Algorithms 39
2.2.2.2 Discrete Newton and Fixed-Point Methods 41
2.2.2.3 Quasi-Newton Methods 41
2.2.3 Gauss–Newton Method 44
2.2.4 Nelder–Mead Algorithm 45
2.2.5 Nonlinear Gauss–Seidel Iteration 52
Problems 54

3 COMBINATORIAL OPTIMIZATION 59
3.1 Hard Problems and NP-Completeness 59
3.1.1 Examples 61
3.1.2 Need for Heuristics 64
3.2 Local Search 65
3.3 Simulated Annealing 68
3.3.1 Practical Issues 70
3.3.1.1 Neighborhoods and Proposals 70
3.3.1.2 Cooling Schedule and Convergence 71
3.3.2 Enhancements 74
3.4 Genetic Algorithms 75
3.4.1 Definitions and the Canonical Algorithm 75
3.4.1.1 Basic Definitions 75
3.4.1.2 Selection Mechanisms and Genetic Operators 76
3.4.1.3 Allele Alphabets and Genotypic Representation 78
3.4.1.4 Initialization, Termination, and Parameter Values 79
3.4.2 Variations 80
3.4.2.1 Fitness 80
3.4.2.2 Selection Mechanisms and Updating Generations 81
3.4.2.3 Genetic Operators and Permutation Chromosomes 82
3.4.3 Initialization and Parameter Values 84
3.4.4 Convergence 84
3.5 Tabu Algorithms 85
3.5.1 Basic Definitions 86
3.5.2 The Tabu List 87
3.5.3 Aspiration Criteria 88
3.5.4 Diversification 89
3.5.5 Intensification 90
3.5.6 Comprehensive Tabu Algorithm 91
Problems 92

4 EM OPTIMIZATION METHODS 97
4.1 Missing Data, Marginalization, and Notation 97
4.2 The EM Algorithm 98
4.2.1 Convergence 102
4.2.2 Usage in Exponential Families 105
4.2.3 Variance Estimation 106
4.2.3.1 Louis's Method 106
4.2.3.2 SEM Algorithm 108
4.2.3.3 Bootstrapping 110
4.2.3.4 Empirical Information 110
4.2.3.5 Numerical Differentiation 111
4.3 EM Variants 111
4.3.1 Improving the E Step 111
4.3.1.1 Monte Carlo EM 111
4.3.2 Improving the M Step 112
4.3.2.1 ECM Algorithm 113
4.3.2.2 EM Gradient Algorithm 116
4.3.3 Acceleration Methods 118
4.3.3.1 Aitken Acceleration 118
4.3.3.2 Quasi-Newton Acceleration 119
Problems 121

PART II INTEGRATION AND SIMULATION

5 NUMERICAL INTEGRATION 129
5.1 Newton–Cotes Quadrature 129
5.1.1 Riemann Rule 130
5.1.2 Trapezoidal Rule 134
5.1.3 Simpson's Rule 136
5.1.4 General kth-Degree Rule 138
5.2 Romberg Integration 139
5.3 Gaussian Quadrature 142
5.3.1 Orthogonal Polynomials 143
5.3.2 The Gaussian Quadrature Rule 143
5.4 Frequently Encountered Problems 146
5.4.1 Range of Integration 146
5.4.2 Integrands with Singularities or Other Extreme Behavior 146
5.4.3 Multiple Integrals 147
5.4.4 Adaptive Quadrature 147
5.4.5 Software for Exact Integration 148
Problems 148

6 SIMULATION AND MONTE CARLO INTEGRATION 151
6.1 Introduction to the Monte Carlo Method 151
6.2 Exact Simulation 152
6.2.1 Generating from Standard Parametric Families 153
6.2.2 Inverse Cumulative Distribution Function 153
6.2.3 Rejection Sampling 155
6.2.3.1 Squeezed Rejection Sampling 158
6.2.3.2 Adaptive Rejection Sampling 159
6.3 Approximate Simulation 163
6.3.1 Sampling Importance Resampling Algorithm 163
6.3.1.1 Adaptive Importance, Bridge, and Path Sampling 167
6.3.2 Sequential Monte Carlo 168
6.3.2.1 Sequential Importance Sampling for Markov Processes 169
6.3.2.2 General Sequential Importance Sampling 170
6.3.2.3 Weight Degeneracy, Rejuvenation, and Effective Sample Size 171
6.3.2.4 Sequential Importance Sampling for Hidden Markov Models 175
6.3.2.5 Particle Filters 179
6.4 Variance Reduction Techniques 180
6.4.1 Importance Sampling 180
6.4.2 Antithetic Sampling 186
6.4.3 Control Variates 189
6.4.4 Rao–Blackwellization 193
Problems 195

7 MARKOV CHAIN MONTE CARLO 201
7.1 Metropolis–Hastings Algorithm 202
7.1.1 Independence Chains 204
7.1.2 Random Walk Chains 206
7.2 Gibbs Sampling 209
7.2.1 Basic Gibbs Sampler 209
7.2.2 Properties of the Gibbs Sampler 214
7.2.3 Update Ordering 216
7.2.4 Blocking 216
7.2.5 Hybrid Gibbs Sampling 216
7.2.6 Griddy–Gibbs Sampler 218
7.3 Implementation 218
7.3.1 Ensuring Good Mixing and Convergence 219
7.3.1.1 Simple Graphical Diagnostics 219
7.3.1.2 Burn-in and Run Length 220
7.3.1.3 Choice of Proposal 222
7.3.1.4 Reparameterization 223
7.3.1.5 Comparing Chains: Effective Sample Size 224
7.3.1.6 Number of Chains 225
7.3.2 Practical Implementation Advice 226
7.3.3 Using the Results 226
Problems 230

8 ADVANCED TOPICS IN MCMC 237
8.1 Adaptive MCMC 237
8.1.1 Adaptive Random Walk Metropolis-within-Gibbs Algorithm 238
8.1.2 General Adaptive Metropolis-within-Gibbs Algorithm 240
8.1.3 Adaptive Metropolis Algorithm 247
8.2 Reversible Jump MCMC 250
8.2.1 RJMCMC for Variable Selection in Regression 253
8.3 Auxiliary Variable Methods 256
8.3.1 Simulated Tempering 257
8.3.2 Slice Sampler 258
8.4 Other Metropolis–Hastings Algorithms 260
8.4.1 Hit-and-Run Algorithm 260
8.4.2 Multiple-Try Metropolis–Hastings Algorithm 261
8.4.3 Langevin Metropolis–Hastings Algorithm 262
8.5 Perfect Sampling 264
8.5.1 Coupling from the Past 264
8.5.1.1 Stochastic Monotonicity and Sandwiching 267
8.6 Markov Chain Maximum Likelihood 268
8.7 Example: MCMC for Markov Random Fields 269
8.7.1 Gibbs Sampling for Markov Random Fields 270
8.7.2 Auxiliary Variable Methods for Markov Random Fields 274
8.7.3 Perfect Sampling for Markov Random Fields 277
Problems 279

PART III BOOTSTRAPPING

9 BOOTSTRAPPING 287
9.1 The Bootstrap Principle 287
9.2 Basic Methods 288
9.2.1 Nonparametric Bootstrap 288
9.2.2 Parametric Bootstrap 289
9.2.3 Bootstrapping Regression 290
9.2.4 Bootstrap Bias Correction 291
9.3 Bootstrap Inference 292
9.3.1 Percentile Method 292
9.3.1.1 Justification for the Percentile Method 293
9.3.2 Pivoting 294
9.3.2.1 Accelerated Bias-Corrected Percentile Method, BCa 294
9.3.2.2 The Bootstrap t 296
9.3.2.3 Empirical Variance Stabilization 298
9.3.2.4 Nested Bootstrap and Prepivoting 299
9.3.3 Hypothesis Testing 301
9.4 Reducing Monte Carlo Error 302
9.4.1 Balanced Bootstrap 302
9.4.2 Antithetic Bootstrap 302
9.5 Bootstrapping Dependent Data 303
9.5.1 Model-Based Approach 304
9.5.2 Block Bootstrap 304
9.5.2.1 Nonmoving Block Bootstrap 304
9.5.2.2 Moving Block Bootstrap 306
9.5.2.3 Blocks-of-Blocks Bootstrapping 307
9.5.2.4 Centering and Studentizing 309
9.5.2.5 Block Size 311
9.6 Bootstrap Performance 315
9.6.1 Independent Data Case 315
9.6.2 Dependent Data Case 316
9.7 Other Uses of the Bootstrap 316
9.8 Permutation Tests 317
Problems 319

PART IV DENSITY ESTIMATION AND SMOOTHING

10 NONPARAMETRIC DENSITY ESTIMATION 325
10.1 Measures of Performance 326
10.2 Kernel Density Estimation 327
10.2.1 Choice of Bandwidth 329
10.2.1.1 Cross-Validation 332
10.2.1.2 Plug-in Methods 335
10.2.1.3 Maximal Smoothing Principle 338
10.2.2 Choice of Kernel 339
10.2.2.1 Epanechnikov Kernel 339
10.2.2.2 Canonical Kernels and Rescalings 340
10.3 Nonkernel Methods 341
10.3.1 Logspline 341
10.4 Multivariate Methods 345
10.4.1 The Nature of the Problem 345
10.4.2 Multivariate Kernel Estimators 346
10.4.3 Adaptive Kernels and Nearest Neighbors 348
10.4.3.1 Nearest Neighbor Approaches 349
10.4.3.2 Variable-Kernel Approaches and Transformations 350
10.4.4 Exploratory Projection Pursuit 353
Problems 359

11 BIVARIATE SMOOTHING 363
11.1 Predictor–Response Data 363
11.2 Linear Smoothers 365
11.2.1 Constant-Span Running Mean 366
11.2.1.1 Effect of Span 368
11.2.1.2 Span Selection for Linear Smoothers 369
11.2.2 Running Lines and Running Polynomials 372
11.2.3 Kernel Smoothers 374
11.2.4 Local Regression Smoothing 374
11.2.5 Spline Smoothing 376
11.2.5.1 Choice of Penalty 377
11.3 Comparison of Linear Smoothers 377
11.4 Nonlinear Smoothers 379
11.4.1 Loess 379
11.4.2 Supersmoother 381
11.5 Confidence Bands 384
11.6 General Bivariate Data 388
Problems 389

12 MULTIVARIATE SMOOTHING 393
12.1 Predictor–Response Data 393
12.1.1 Additive Models 394
12.1.2 Generalized Additive Models 397
12.1.3 Other Methods Related to Additive Models 399
12.1.3.1 Projection Pursuit Regression 399
12.1.3.2 Neural Networks 402
12.1.3.3 Alternating Conditional Expectations 403
12.1.3.4 Additivity and Variance Stabilization 404
12.1.4 Tree-Based Methods 405
12.1.4.1 Recursive Partitioning: Regression Trees 406
12.1.4.2 Tree Pruning 409
12.1.4.3 Classification Trees 411
12.1.4.4 Other Issues for Tree-Based Methods 412
12.2 General Multivariate Data 413
12.2.1 Principal Curves 413
12.2.1.1 Definition and Motivation 413
12.2.1.2 Estimation 415
12.2.1.3 Span Selection 416
Problems 416

DATA ACKNOWLEDGMENTS 421

REFERENCES 423

INDEX 457
PREFACE

This book covers most topics needed to develop a broad and thorough working knowledge of modern computational statistics. We seek to develop a practical understanding of how and why existing methods work, enabling readers to use modern statistical methods effectively. Since many new methods are built from components of existing techniques, our ultimate goal is to provide scientists with the tools they need to contribute new ideas to the field.

A growing challenge in science is that there is so much of it. While the pursuit of important new methods and the honing of existing approaches is a worthy goal, there is also a need to organize and distill the teeming jungle of ideas. We attempt to do that here. Our choice of topics reflects our view of what constitutes the core of the evolving field of computational statistics, and what will be interesting and useful for our readers.

Our use of the adjective modern in the first sentence of this preface is potentially troublesome. There is no way that this book can cover all the latest, greatest techniques. We have not even tried. We have instead aimed to provide a reasonably up-to-date survey of a broad portion of the field, while leaving room for diversions and esoterica.

The foundations of optimization and numerical integration are covered in this book. We include these venerable topics because (i) they are cornerstones of frequentist and Bayesian inference, (ii) routine application of available software often fails for hard problems, and (iii) the methods themselves are often secondary components of other statistical computing algorithms. Some topics we have omitted represent important areas of past and present research in the field, but their priority here is lowered by the availability of high-quality software. For example, the generation of pseudo-random numbers is a classic topic, but one that we prefer to address by giving students reliable software. Finally, some topics (e.g., principal curves and tabu search) are included simply because they are interesting and provide very different perspectives on familiar problems. Perhaps a future researcher may draw ideas from such topics to design a creative and effective new algorithm.

In this second edition we have both updated and broadened our coverage, and we now provide computer code. For example, we have added new MCMC topics to reflect continued activity in that popular area. A notable increase in breadth is our inclusion of more methods relevant for problems where statistical dependency is important, such as block bootstrapping and sequential importance sampling. This second edition provides extensive new support in R. Specifically, code for the examples in this book is available from the book website, www.stat.colostate.edu/computationalstatistics.

Our target audience includes graduate students in statistics and related fields, statisticians, and quantitative empirical scientists in other fields. We hope such readers may use the book when applying standard methods and developing new methods.

The level of mathematics expected of the reader does not extend much beyond Taylor series and linear algebra. Breadth of mathematical training is more helpful than depth. Essential review is provided in Chapter 1. More advanced readers will find greater mathematical detail in the wide variety of high-quality books available on specific topics, many of which are referenced in the text. Other readers, caring less about analytical details, may prefer to focus on our descriptions of algorithms and examples.

The expected level of statistics is equivalent to that obtained by a graduate student in his or her first year of study of the theory of statistics and probability. An understanding of maximum likelihood methods, Bayesian methods, elementary asymptotic theory, Markov chains, and linear models is most important. Many of these topics are reviewed in Chapter 1.

With respect to computer programming, we find that good students can learn as they go. However, a working knowledge of a suitable language allows implementation of the ideas covered in this book to progress much more quickly. We have chosen to forgo any language-specific examples, algorithms, or coding in the text. For those wishing to learn a language while they study this book, we recommend that you choose a high-level, interactive package that permits the flexible design of graphical displays and includes supporting statistics and probability functions, such as R and MATLAB.¹ These are the sort of languages often used by researchers during the development of new statistical computing techniques, and they are suitable for implementing all the methods we describe, except in some cases for problems of vast scope or complexity. We use R and recommend it. Although lower-level languages such as C++ could also be used, they are more appropriate for professional-grade implementation of algorithms after researchers have refined the methodology.

The book is organized into four major parts: optimization (Chapters 2, 3, and 4), integration and simulation (Chapters 5, 6, 7, and 8), bootstrapping (Chapter 9), and density estimation and smoothing (Chapters 10, 11, and 12). The chapters are written to stand independently, so a course can be built by selecting the topics one wishes to teach. For a one-semester course, our selection typically weights most heavily topics from Chapters 2, 3, 6, 7, 9, 10, and 11. With a leisurely pace or more thorough coverage, a shorter list of topics could still easily fill a semester course. There is sufficient material here to provide a thorough one-year course of study, notwithstanding any supplemental topics one might wish to teach.

A variety of homework problems are included at the end of each chapter. Some are straightforward, while others require the student to develop a thorough understanding of the model/method being used, to carefully (and perhaps cleverly) code a suitable technique, and to devote considerable attention to the interpretation of results. A few exercises invite open-ended exploration of methods and ideas. We are sometimes asked for solutions to the exercises, but we prefer to sequester them to preserve the challenge for future students and readers.

The datasets discussed in the examples and exercises are available from the book website, www.stat.colostate.edu/computationalstatistics. The R code is also provided there. Finally, the website includes an errata. Responsibility for all errors lies with us.

¹ R is available for free from www.r-project.org. Information about MATLAB can be found at www.mathworks.com.
ACKNOWLEDGMENTS

The course upon which this book is based was developed and taught by us at Colorado State University from 1994 onwards. Thanks are due to our many students, who have been semiwilling guinea pigs over the years. We also thank our colleagues in the Statistics Department for their continued support. The late Richard Tweedie merits particular acknowledgment for his mentoring during the early years of our careers.

We owe a great deal of intellectual debt to Adrian Raftery, who deserves special thanks not only for his teaching and advising but also for his unwavering support and his seemingly inexhaustible supply of good ideas. In addition, we thank our influential advisors and teachers at the University of Washington Statistics Department, including David Madigan, Werner Stuetzle, and Judy Zeh. Of course, each of our chapters could be expanded into a full-length book, and great scholars have already done so. We owe much to their efforts, upon which we relied when developing our course and our manuscript.

Portions of the first edition were written at the Department of Mathematics and Statistics, University of Otago, in Dunedin, New Zealand, whose faculty we thank for graciously hosting us during our sabbatical in 2003. Much of our work on the second edition was undertaken during our sabbatical visit to the Australia Commonwealth Scientific and Research Organization in 2009–2010, sponsored by CSIRO Mathematics, Informatics and Statistics, and hosted at the Longpocket Laboratory in Indooroopilly, Australia. We thank our hosts and colleagues there for their support.

Our manuscript has been greatly improved through the constructive reviews of John Bickham, Ben Bird, Kate Cowles, Jan Hannig, Alan Herlihy, David Hunter, Devin Johnson, Michael Newton, Doug Nychka, Steve Sain, David W. Scott, N. Scott Urquhart, Haonan Wang, Darrell Whitley, and eight anonymous referees. We also thank the sharp-eyed readers listed in the errata for their suggestions and corrections. Our editor Steve Quigley and the folks at Wiley were supportive and helpful during the publication process. We thank Nelida Pohl for permission to adapt her photograph in the cover design of the first edition. We thank Melinda Stelzer for permission to use her painting "Champagne Circuit," 2001, for the cover of the second edition. More about her art can be found at www.facebook.com/geekchicart. We also owe a special note of thanks to Zube (a.k.a. John Dzubera), who kept our own computers running despite our best efforts to the contrary.

Funding from National Science Foundation (NSF) CAREER grant SBR-9875508 was a significant source of support for the first author during the preparation of the first edition. He also thanks his colleagues and friends in the North Slope Borough, Alaska, Department of Wildlife Management for their longtime research support. The second author gratefully acknowledges the support of STAR Research Assistance Agreement CR-829095, awarded to Colorado State University by the U.S. Environmental Protection Agency (EPA). The views expressed here are solely those of the authors. NSF and EPA do not endorse any products or commercial services mentioned herein.

Finally, we thank our parents for enabling and supporting our educations, and for providing us with the "stubbornness gene" necessary for graduate school, the tenure track, or book publication, take your pick. The second edition is dedicated to our kids, Natalie and Neil, for continuing to show us what is important and what is not.

Geof H. Givens
Jennifer A. Hoeting
CHAPTER 1

REVIEW

This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.
1.1 MATHEMATICAL NOTATION
We use boldface to distinguish a vector x = (x1, …, xp) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f1(x), …, fp(x)). The transpose of M is denoted Mᵀ.

Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x1, …, xn)ᵀ. Let I denote an identity matrix, and 1 and 0 denote vectors of ones and zeros, respectively.

A symmetric square matrix M is positive definite if xᵀMx > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite or positive semidefinite if xᵀMx ≥ 0 for all nonzero vectors x.
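For a symmetric 2 × 2 matrix, the eigenvalue condition reduces to Sylvester's criterion (positive leading principal minors), which is easy to check directly. A small Python illustration:

```python
# Positive definiteness for a symmetric 2x2 matrix M = [[a, b], [b, c]]:
# both eigenvalues are positive iff a > 0 and det M = a*c - b^2 > 0
# (Sylvester's criterion, equivalent to the eigenvalue condition).
def is_positive_definite_2x2(a, b, c):
    return a > 0 and a * c - b * b > 0

assert is_positive_definite_2x2(2.0, 1.0, 2.0)      # eigenvalues 1 and 3
assert not is_positive_definite_2x2(1.0, 2.0, 1.0)  # det = -3 < 0, indefinite
```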
The derivative of a function f evaluated at x is denoted f′(x). When x = (x1, …, xp), the gradient of f at x is

f′(x) = (df(x)/dx1, …, df(x)/dxp).

The Hessian matrix for f at x is f″(x), having (i, j)th element equal to d²f(x)/(dxi dxj). The negative Hessian has important uses in statistical inference.
Let J(x) denote the Jacobian matrix evaluated at x for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to dfi(x)/dxj.
A functional is a real-valued function on a space of functions For example ifT (f ) = int 1
0 f (x) dx then the functional T maps suitably integrable functions onto thereal line
The indicator function 1A equals 1 if A is true and 0 otherwise The real lineis denoted and p-dimensional real space is p
1.2 TAYLOR'S THEOREM AND MATHEMATICAL LIMIT THEORY
First we define standard "big oh" and "little oh" notation for describing the relative orders of convergence of functions. Let the functions f and g be defined on a common, possibly infinite, interval. Let z0 be a point in this interval or a boundary point of it (i.e., −∞ or ∞). We require g(z) ≠ 0 for all z ≠ z0 in a neighborhood of z0. Then we say

    f(z) = O(g(z))    (1.1)

if there exists a constant M such that |f(z)| ≤ M|g(z)| as z → z0. For example, (n + 1)/(3n²) = O(n⁻¹), where it is understood that we are considering n → ∞. If lim_{z→z0} f(z)/g(z) = 0, then we say

    f(z) = o(g(z)).    (1.2)

For example, f(x0 + h) − f(x0) = h f'(x0) + o(h) as h → 0 if f is differentiable at x0. The same notation can be used for describing the convergence of a sequence xn as n → ∞, by letting f(n) = xn.
Taylor's theorem provides a polynomial approximation to a function f. Suppose f has a finite (n + 1)th derivative on (a, b) and a continuous nth derivative on [a, b]. Then for any x0 ∈ [a, b] distinct from x, the Taylor series expansion of f about x0 is

    f(x) = \sum_{i=0}^{n} \frac{1}{i!} f^{(i)}(x_0) (x - x_0)^i + R_n,    (1.3)

where f^{(i)}(x_0) is the ith derivative of f evaluated at x0, and

    R_n = \frac{1}{(n+1)!} f^{(n+1)}(\xi) (x - x_0)^{n+1}    (1.4)

for some point ξ in the interval between x and x0. As |x − x0| → 0, note that R_n = O(|x − x0|^{n+1}).
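This remainder order is easy to check numerically. In the sketch below (Python, not from the book; the choices f(x) = eˣ, x0 = 0, and n = 3 are arbitrary), halving h should shrink the remainder by roughly 2^{n+1} = 16:

```python
import math

def taylor_approx(x, x0, n):
    # nth-order Taylor polynomial of exp about x0; every derivative of exp is exp
    return sum(math.exp(x0) * (x - x0) ** i / math.factorial(i) for i in range(n + 1))

x0, n = 0.0, 3
for h in (0.1, 0.05, 0.025):
    r = math.exp(x0 + h) - taylor_approx(x0 + h, x0, n)
    # R_n = O(h^{n+1}), so r / h^{n+1} should stay roughly constant (about 1/4! here)
    print(h, r / h ** (n + 1))
```

The printed ratios stabilize near 1/4! ≈ 0.0417, confirming the O(h⁴) behavior.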
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x0 ≠ x. Then

    f(x) = f(x_0) + \sum_{i=1}^{n} \frac{1}{i!} D^{(i)}(f; x_0, x - x_0) + R_n,    (1.5)

where

    D^{(i)}(f; x, y) = \sum_{j_1=1}^{p} \cdots \sum_{j_i=1}^{p} \left( \frac{\partial^i}{\partial t_{j_1} \cdots \partial t_{j_i}} f(t) \Big|_{t=x} \right) \prod_{k=1}^{i} y_{j_k}    (1.6)

and

    R_n = \frac{1}{(n+1)!} D^{(n+1)}(f; \xi, x - x_0)    (1.7)

for some ξ on the line segment joining x and x0. As |x − x0| → 0, note that R_n = O(|x − x0|^{n+1}).
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

    \int_0^1 f(x)\,dx = \frac{f(0) + f(1)}{2} - \sum_{i=1}^{n-1} \frac{b_{2i} \left[ f^{(2i-1)}(1) - f^{(2i-1)}(0) \right]}{(2i)!} - \frac{b_{2n} f^{(2n)}(\xi)}{(2n)!},    (1.8)

where 0 ≤ ξ ≤ 1, f^{(j)} is the jth derivative of f, and b_j = B_j(0) can be determined using the recursion relation

    \sum_{j=0}^{m} \binom{m+1}{j} B_j(z) = (m + 1) z^m,    (1.9)

initialized with B_0(z) = 1. The proof of this result is based on repeated integrations by parts [376].
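The recursion (1.9) is straightforward to implement. A small sketch (Python, using exact rational arithmetic; not from the book) evaluates it at z = 0 to recover the numbers b_j = B_j(0):

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(n):
    # At z = 0, (1.9) reads sum_{j=0}^{m} C(m+1, j) b_j = (m+1) * 0^m,
    # whose right-hand side is 1 for m = 0 and 0 for m >= 1; solve for b_m.
    b = [Fraction(1)]  # b_0 = B_0(0) = 1
    for m in range(1, n + 1):
        s = sum(comb(m + 1, j) * b[j] for j in range(m))
        b.append(-s / Fraction(m + 1))  # coefficient of b_m is C(m+1, m) = m+1
    return b

print(bernoulli_numbers(4))  # 1, -1/2, 1/6, 0, -1/30
```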
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

    \frac{\partial f(x)}{\partial x_i} \approx \frac{f(x + \epsilon_i e_i) - f(x - \epsilon_i e_i)}{2 \epsilon_i},    (1.10)

where ε_i is a small number and e_i is the unit vector in the ith coordinate direction. Typically, one might start with, say, ε_i = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller ε_i. The approximation will generally improve until ε_i becomes small enough that the calculation is degraded, and eventually dominated, by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via

    \frac{\partial^2 f(x)}{\partial x_i\, \partial x_j} \approx \frac{1}{4 \epsilon_i \epsilon_j} \left[ f(x + \epsilon_i e_i + \epsilon_j e_j) - f(x + \epsilon_i e_i - \epsilon_j e_j) - f(x - \epsilon_i e_i + \epsilon_j e_j) + f(x - \epsilon_i e_i - \epsilon_j e_j) \right]    (1.11)

with similar sequential precision improvements.
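The trade-off between truncation error and subtractive cancellation is easy to see numerically. In this sketch (Python; the choice f = sin, x = 1, and the particular ε sequence are arbitrary), the error of the approximation (1.10) first shrinks with ε and is then degraded by roundoff:

```python
import math

def central_diff(f, x, eps):
    # two-sided difference approximation (1.10) to the derivative of f at x
    return (f(x + eps) - f(x - eps)) / (2 * eps)

exact = math.cos(1.0)  # true derivative of sin at x = 1
errors = {eps: abs(central_diff(math.sin, 1.0, eps) - exact)
          for eps in (1e-2, 1e-4, 1e-6, 1e-11)}
for eps, err in errors.items():
    print(eps, err)
# the error improves from eps = 1e-2 to 1e-4, but by eps = 1e-11
# subtractive cancellation dominates and the error grows again
```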
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS
We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ~ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters will also be denoted with a conditioning bar, as in f(x|α, β). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|α) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example, by using f(·|α) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are fX and fY, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.
The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or fX|Y(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|μ) = f(x|y, μ) f(y|μ). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.
The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X, or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1_{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^{1/2}/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so

    g(\lambda x + (1 - \lambda) y) \le \lambda g(x) + (1 - \lambda) g(y)    (1.12)

for all x, y ∈ I and all 0 < λ < 1. Then Jensen's inequality states that E{g(X)} ≥ g(E{X}) for any random variable X having P[X ∈ I] = 1.
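A quick Monte Carlo sketch (Python, not from the book; the choices g(x) = x² and X ~ Exp(1) are arbitrary) illustrates the inequality, since here E{g(X)} = 2 while g(E{X}) = 1:

```python
import random

random.seed(1)
xs = [random.expovariate(1.0) for _ in range(100_000)]  # X ~ Exp(1), so E{X} = 1

g = lambda x: x * x  # a convex function
mean_g = sum(g(x) for x in xs) / len(xs)  # estimates E{g(X)} = 2
g_mean = g(sum(xs) / len(xs))             # estimates g(E{X}) = 1
print(mean_g, g_mean)  # Jensen: E{g(X)} >= g(E{X})
```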
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

    n! = n(n-1)(n-2) \cdots (3)(2)(1), \quad with\ 0! = 1,    (1.13)

    \binom{n}{k} = \frac{n!}{k!\,(n-k)!},    (1.14)
TABLE 1.1 Notation and description for common probability distributions of discrete random variables.

Bernoulli. X ~ Bernoulli(p) with 0 ≤ p ≤ 1. Density: f(x) = p^x (1 − p)^(1−x) for x = 0 or 1. Moments: E{X} = p, var{X} = p(1 − p).

Binomial. X ~ Bin(n, p) with 0 ≤ p ≤ 1 and n ∈ {1, 2, ...}. Density: f(x) = \binom{n}{x} p^x (1 − p)^(n−x) for x = 0, 1, ..., n. Moments: E{X} = np, var{X} = np(1 − p).

Multinomial. X ~ Multinomial(n, p) with p = (p1, ..., pk), 0 ≤ pi ≤ 1, Σ_{i=1}^{k} pi = 1, and n ∈ {1, 2, ...}. Density: f(x) = \binom{n}{x_1, \ldots, x_k} ∏_{i=1}^{k} pi^{xi} for x = (x1, ..., xk) with xi ∈ {0, 1, ..., n} and Σ_{i=1}^{k} xi = n. Moments: E{X} = np, var{Xi} = npi(1 − pi), cov{Xi, Xj} = −n pi pj.

Negative Binomial. X ~ NegBin(r, p) with 0 ≤ p ≤ 1 and r ∈ {1, 2, ...}. Density: f(x) = \binom{x + r - 1}{r - 1} p^r (1 − p)^x for x ∈ {0, 1, ...}. Moments: E{X} = r(1 − p)/p, var{X} = r(1 − p)/p².

Poisson. X ~ Poisson(λ) with λ > 0. Density: f(x) = (λ^x / x!) exp{−λ} for x ∈ {0, 1, ...}. Moments: E{X} = λ, var{X} = λ.
TABLE 1.2 Notation and description for some common probability distributions of continuous random variables.

Beta. X ~ Beta(α, β) with α > 0 and β > 0. Density: f(x) = [Γ(α + β) / (Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1) for 0 ≤ x ≤ 1. Moments: E{X} = α/(α + β), var{X} = αβ / [(α + β)²(α + β + 1)].

Cauchy. X ~ Cauchy(α, β) with α ∈ ℝ and β > 0. Density: f(x) = 1 / (πβ [1 + ((x − α)/β)²]) for x ∈ ℝ. Moments: E{X} and var{X} are nonexistent.

Chi-square. X ~ χ²_ν with ν > 0. Density: that of Gamma(ν/2, 1/2), for x > 0. Moments: E{X} = ν, var{X} = 2ν.

Dirichlet. X ~ Dirichlet(α) with α = (α1, ..., αk), αi > 0, and α0 = Σ_{i=1}^{k} αi. Density: f(x) = [Γ(α0) / ∏_{i=1}^{k} Γ(αi)] ∏_{i=1}^{k} xi^(αi−1) for x = (x1, ..., xk) with 0 ≤ xi ≤ 1 and Σ_{i=1}^{k} xi = 1. Moments: E{X} = α/α0, var{Xi} = αi(α0 − αi) / [α0²(α0 + 1)], cov{Xi, Xj} = −αi αj / [α0²(α0 + 1)].

Exponential. X ~ Exp(λ) with λ > 0. Density: f(x) = λ exp{−λx} for x > 0. Moments: E{X} = 1/λ, var{X} = 1/λ².

Gamma. X ~ Gamma(r, λ) with λ > 0 and r > 0. Density: f(x) = [λ^r x^(r−1) / Γ(r)] exp{−λx} for x > 0. Moments: E{X} = r/λ, var{X} = r/λ².
TABLE 1.3 Notation and description for more common probability distributions of continuous random variables.

Lognormal. X ~ Lognormal(μ, σ²) with μ ∈ ℝ and σ > 0. Density: f(x) = [1 / (x√(2πσ²))] exp{−(1/2)((log x − μ)/σ)²} for x > 0. Moments: E{X} = exp{μ + σ²/2}, var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}.

Multivariate Normal. X ~ N_k(μ, Σ) with μ = (μ1, ..., μk) ∈ ℝ^k and Σ positive definite. Density: f(x) = exp{−(x − μ)^T Σ⁻¹ (x − μ)/2} / [(2π)^(k/2) |Σ|^(1/2)] for x = (x1, ..., xk) ∈ ℝ^k. Moments: E{X} = μ, var{X} = Σ.

Normal. X ~ N(μ, σ²) with μ ∈ ℝ and σ > 0. Density: f(x) = [1/√(2πσ²)] exp{−(1/2)((x − μ)/σ)²} for x ∈ ℝ. Moments: E{X} = μ, var{X} = σ².

Student's t. X ~ t_ν with ν > 0. Density: f(x) = [Γ((ν + 1)/2) / (Γ(ν/2)√(πν))] (1 + x²/ν)^(−(ν+1)/2) for x ∈ ℝ. Moments: E{X} = 0 if ν > 1; var{X} = ν/(ν − 2) if ν > 2.

Uniform. X ~ Unif(a, b) with a, b ∈ ℝ and a < b. Density: f(x) = 1/(b − a) for x ∈ [a, b]. Moments: E{X} = (a + b)/2, var{X} = (b − a)²/12.

Weibull. X ~ Weibull(a, b) with a > 0 and b > 0. Density: f(x) = a b x^(b−1) exp{−a x^b} for x > 0. Moments: E{X} = Γ(1 + 1/b)/a^(1/b), var{X} = [Γ(1 + 2/b) − Γ(1 + 1/b)²]/a^(2/b).
    \binom{n}{k_1, \ldots, k_m} = \frac{n!}{\prod_{i=1}^{m} k_i!}, \quad where\ n = \sum_{i=1}^{m} k_i,    (1.15)

    \Gamma(r) = \begin{cases} (r-1)! & for\ r = 1, 2, \ldots, \\ \int_0^\infty t^{r-1} e^{-t}\,dt & for\ general\ r > 0. \end{cases}    (1.16)

It is worth knowing that

    \Gamma(1/2) = \sqrt{\pi} \quad and \quad \Gamma\!\left(n + \tfrac{1}{2}\right) = \frac{1 \times 3 \times 5 \times \cdots \times (2n-1)\, \sqrt{\pi}}{2^n}
for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as

    f(x|\gamma) = c_1(x)\, c_2(\gamma) \exp\left\{ \sum_{i=1}^{k} y_i(x)\, \theta_i(\gamma) \right\}    (1.17)

for nonnegative functions c1 and c2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θi(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The yi(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that
    E\{y(X)\} = \kappa'(\theta)    (1.18)

and

    var\{y(X)\} = \kappa''(\theta),    (1.19)

where κ(θ) = −log c3(θ), letting c3(θ) denote the reexpression of c2(γ) in terms of the canonical parameters θ = (θ1, ..., θk), and y(X) = (y1(X), ..., yk(X)). These results can be rewritten in terms of the original parameters γ as

    E\left\{ \sum_{i=1}^{k} \frac{\partial \theta_i(\gamma)}{\partial \gamma_j}\, y_i(X) \right\} = -\frac{\partial}{\partial \gamma_j} \log c_2(\gamma)    (1.20)

and

    var\left\{ \sum_{i=1}^{k} \frac{\partial \theta_i(\gamma)}{\partial \gamma_j}\, y_i(X) \right\} = -\frac{\partial^2}{\partial \gamma_j^2} \log c_2(\gamma) - E\left\{ \sum_{i=1}^{k} \frac{\partial^2 \theta_i(\gamma)}{\partial \gamma_j^2}\, y_i(X) \right\}.    (1.21)
Example 1.1 (Poisson). The Poisson distribution belongs to the exponential family, with c1(x) = 1/x!, c2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ'(θ) = exp{θ} = λ and var{X} = κ''(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that ∂θ/∂λ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
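A simulation sketch (Python, not from the book; the rate λ = 3 and the inversion sampler are illustrative choices, since the standard library has no Poisson generator) agrees with these moment calculations: the sample mean and sample variance both settle near λ.

```python
import math
import random

def rpois(lam, rng):
    # inversion from the cdf: smallest x with F(x) >= U, using f(x) = lam^x e^{-lam} / x!
    u, x = rng.random(), 0
    p = math.exp(-lam)  # f(0)
    cdf = p
    while u > cdf:
        x += 1
        p *= lam / x  # f(x) = f(x-1) * lam / x
        cdf += p
    return x

rng = random.Random(7)
lam = 3.0  # theta = log(lam); kappa(theta) = exp(theta), so both moments equal lam
xs = [rpois(lam, rng) for _ in range(200_000)]
m = sum(xs) / len(xs)
v = sum((x - m) ** 2 for x in xs) / len(xs)
print(m, v)  # both approximately 3
```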
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X1, ..., Xp) denote a p-dimensional random variable with continuous density function f. Suppose that

    U = g(X) = (g_1(X), \ldots, g_p(X)) = (U_1, \ldots, U_p),    (1.22)

where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

    f(u) = f(g^{-1}(u))\, |J(u)|,    (1.23)

where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g⁻¹ evaluated at u, having (i, j)th element ∂xi/∂uj, where these derivatives are assumed to be continuous over the support region of U.
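As a concrete sketch of (1.23) (Python, not from the book; the transformation U = X^{1/2} with X ~ Exp(1) is an arbitrary example): g⁻¹(u) = u² has |J(u)| = 2u, so U has density f(u) = 2u exp{−u²}, a Weibull(1, 2) density in the notation of Table 1.3. Simulation confirms the formula:

```python
import math
import random

random.seed(3)
us = [math.sqrt(random.expovariate(1.0)) for _ in range(200_000)]  # U = g(X) = X^{1/2}

# By (1.23), with g^{-1}(u) = u^2 and |J(u)| = |dx/du| = 2u:
# f_U(u) = exp(-u^2) * 2u, so P(a < U < b) = exp(-a^2) - exp(-b^2)
a, b = 0.5, 1.0
exact = math.exp(-a * a) - math.exp(-b * b)
empirical = sum(a < u < b for u in us) / len(us)
print(exact, empirical)  # the two probabilities agree closely
```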
1.4 LIKELIHOOD INFERENCE
If X1, ..., Xn are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ1, ..., θp), then the joint likelihood function is

    L(\theta) = \prod_{i=1}^{n} f(x_i | \theta).    (1.24)
When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x1, ..., xn|θ) viewed as a function of θ.
The observed data, x1, ..., xn, might have been realized under many different values for θ. The parameters for which observing x1, ..., xn would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x1, ..., xn that maximizes L(θ), then θ̂ = ϑ(X1, ..., Xn) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.
It is typically easier to work with the log likelihood function

    l(\theta) = \log L(\theta),    (1.25)

which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x1, ..., xn but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations

    l'(\theta) = 0,    (1.26)

where

    l'(\theta) = \left( \frac{\partial l(\theta)}{\partial \theta_1}, \ldots, \frac{\partial l(\theta)}{\partial \theta_p} \right)
is called the score function. The score function satisfies

    E\{l'(\theta)\} = 0,    (1.27)

where the expectation is taken with respect to the distribution of X1, ..., Xn. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.
The MLE has a sampling distribution because it depends on the realization of the random variables X1, ..., Xn. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: when the log likelihood is very pointy, the location of the maximum is more precisely known.
To make this precise, let l''(θ) denote the p × p matrix having (i, j)th element given by ∂²l(θ)/(∂θi ∂θj). The Fisher information matrix is defined as

    I(\theta) = E\{l'(\theta)\, l'(\theta)^{T}\} = -E\{l''(\theta)\},    (1.28)

where the expectations are taken with respect to the distribution of X1, ..., Xn. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l''(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.
Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)⁻¹, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is N_p(θ*, I(θ*)⁻¹). Since the true parameter values are unknown, I(θ*)⁻¹ must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)⁻¹. Alternatively, it is also reasonable to use −l''(θ̂)⁻¹. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)⁻¹. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)⁻¹ can be found in [127, 182, 371, 470].
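These estimates are simple to compute in practice. A sketch (Python, with made-up data, not from the book) for i.i.d. Poisson(λ) counts, where l'(λ) = Σxi/λ − n gives λ̂ = x̄ in closed form and the observed Fisher information is −l''(λ̂) = Σxi/λ̂²:

```python
import math

# hypothetical i.i.d. Poisson counts
x = [2, 4, 3, 1, 5, 2, 0, 3, 4, 2]
n = len(x)

lam_hat = sum(x) / n              # MLE: solves l'(lambda) = sum(x)/lambda - n = 0
obs_info = sum(x) / lam_hat ** 2  # observed Fisher information -l''(lambda_hat)
se = 1 / math.sqrt(obs_info)      # square root of inverse information estimates the SE
print(lam_hat, se)
```

Here the observed and expected information agree at λ̂, and the standard error reduces to √(x̄/n).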
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is

    L(\phi | \hat{\mu}(\phi)) = \max_{\mu} L(\mu, \phi).    (1.29)

Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
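A grid-evaluation sketch of (1.29) (Python, with made-up data, not from the book; here θ = (a, b) for a Weibull(a, b) model as in Table 1.3): for fixed b, maximizing over the nuisance parameter a gives â(b) = n/Σxiᵇ in closed form, so the profile log likelihood for b can be evaluated and maximized directly.

```python
import math

# hypothetical positive observations treated as Weibull(a, b),
# with density a b x^{b-1} exp{-a x^b}
x = [0.8, 1.2, 0.5, 1.9, 1.1, 0.7, 1.4, 0.9]
n = len(x)

def profile_loglik(b):
    # constrained maximization over the nuisance parameter a: for fixed b,
    # dl/da = n/a - sum(x_i^b) = 0 gives a_hat(b) = n / sum(x_i^b)
    sxb = sum(xi ** b for xi in x)
    a_hat = n / sxb
    return (n * math.log(a_hat) + n * math.log(b)
            + (b - 1) * sum(math.log(xi) for xi in x)
            - a_hat * sxb)

# evaluate on a grid of b values; the grid maximizer approximates the MLE of b
grid = [0.5 + 0.05 * i for i in range(80)]
b_best = max(grid, key=profile_loglik)
print(b_best, profile_loglik(b_best))
```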
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993, or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data

Givens, Geof H.
  Computational statistics / Geof H. Givens, Jennifer A. Hoeting. – 2nd ed.
    p. cm.
  Includes index.
  ISBN 978-0-470-53331-4 (cloth)
  1. Mathematical statistics–Data processing. I. Hoeting, Jennifer A. (Jennifer Ann), 1966– II. Title.
  QA276.4.G58 2013
  519.5–dc23
  2012017381
Printed in the United States of America
To Natalie and Neil
CONTENTS

PREFACE xv
ACKNOWLEDGMENTS xvii

1 REVIEW 1
1.1 Mathematical Notation 1
1.2 Taylor's Theorem and Mathematical Limit Theory 2
1.3 Statistical Notation and Probability Distributions 4
1.4 Likelihood Inference 9
1.5 Bayesian Inference 11
1.6 Statistical Limit Theory 13
1.7 Markov Chains 14
1.8 Computing 17

PART I
OPTIMIZATION

2 OPTIMIZATION AND SOLVING NONLINEAR EQUATIONS 21
2.1 Univariate Problems 22
2.1.1 Newton's Method 26
2.1.1.1 Convergence Order 29
2.1.2 Fisher Scoring 30
2.1.3 Secant Method 30
2.1.4 Fixed-Point Iteration 32
2.1.4.1 Scaling 33
2.2 Multivariate Problems 34
2.2.1 Newton's Method and Fisher Scoring 34
2.2.1.1 Iteratively Reweighted Least Squares 36
2.2.2 Newton-Like Methods 39
2.2.2.1 Ascent Algorithms 39
2.2.2.2 Discrete Newton and Fixed-Point Methods 41
2.2.2.3 Quasi-Newton Methods 41
2.2.3 Gauss–Newton Method 44
2.2.4 Nelder–Mead Algorithm 45
2.2.5 Nonlinear Gauss–Seidel Iteration 52
Problems 54

3 COMBINATORIAL OPTIMIZATION 59
3.1 Hard Problems and NP-Completeness 59
3.1.1 Examples 61
3.1.2 Need for Heuristics 64
3.2 Local Search 65
3.3 Simulated Annealing 68
3.3.1 Practical Issues 70
3.3.1.1 Neighborhoods and Proposals 70
3.3.1.2 Cooling Schedule and Convergence 71
3.3.2 Enhancements 74
3.4 Genetic Algorithms 75
3.4.1 Definitions and the Canonical Algorithm 75
3.4.1.1 Basic Definitions 75
3.4.1.2 Selection Mechanisms and Genetic Operators 76
3.4.1.3 Allele Alphabets and Genotypic Representation 78
3.4.1.4 Initialization, Termination, and Parameter Values 79
3.4.2 Variations 80
3.4.2.1 Fitness 80
3.4.2.2 Selection Mechanisms and Updating Generations 81
3.4.2.3 Genetic Operators and Permutation Chromosomes 82
3.4.3 Initialization and Parameter Values 84
3.4.4 Convergence 84
3.5 Tabu Algorithms 85
3.5.1 Basic Definitions 86
3.5.2 The Tabu List 87
3.5.3 Aspiration Criteria 88
3.5.4 Diversification 89
3.5.5 Intensification 90
3.5.6 Comprehensive Tabu Algorithm 91
Problems 92

4 EM OPTIMIZATION METHODS 97
4.1 Missing Data, Marginalization, and Notation 97
4.2 The EM Algorithm 98
4.2.1 Convergence 102
4.2.2 Usage in Exponential Families 105
4.2.3 Variance Estimation 106
4.2.3.1 Louis's Method 106
4.2.3.2 SEM Algorithm 108
4.2.3.3 Bootstrapping 110
4.2.3.4 Empirical Information 110
4.2.3.5 Numerical Differentiation 111
4.3 EM Variants 111
4.3.1 Improving the E Step 111
4.3.1.1 Monte Carlo EM 111
4.3.2 Improving the M Step 112
4.3.2.1 ECM Algorithm 113
4.3.2.2 EM Gradient Algorithm 116
4.3.3 Acceleration Methods 118
4.3.3.1 Aitken Acceleration 118
4.3.3.2 Quasi-Newton Acceleration 119
Problems 121

PART II
INTEGRATION AND SIMULATION

5 NUMERICAL INTEGRATION 129
5.1 Newton–Cotes Quadrature 129
5.1.1 Riemann Rule 130
5.1.2 Trapezoidal Rule 134
5.1.3 Simpson's Rule 136
5.1.4 General kth-Degree Rule 138
5.2 Romberg Integration 139
5.3 Gaussian Quadrature 142
5.3.1 Orthogonal Polynomials 143
5.3.2 The Gaussian Quadrature Rule 143
5.4 Frequently Encountered Problems 146
5.4.1 Range of Integration 146
5.4.2 Integrands with Singularities or Other Extreme Behavior 146
5.4.3 Multiple Integrals 147
5.4.4 Adaptive Quadrature 147
5.4.5 Software for Exact Integration 148
Problems 148

6 SIMULATION AND MONTE CARLO INTEGRATION 151
6.1 Introduction to the Monte Carlo Method 151
6.2 Exact Simulation 152
6.2.1 Generating from Standard Parametric Families 153
6.2.2 Inverse Cumulative Distribution Function 153
6.2.3 Rejection Sampling 155
6.2.3.1 Squeezed Rejection Sampling 158
6.2.3.2 Adaptive Rejection Sampling 159
6.3 Approximate Simulation 163
6.3.1 Sampling Importance Resampling Algorithm 163
6.3.1.1 Adaptive Importance, Bridge, and Path Sampling 167
6.3.2 Sequential Monte Carlo 168
6.3.2.1 Sequential Importance Sampling for Markov Processes 169
6.3.2.2 General Sequential Importance Sampling 170
6.3.2.3 Weight Degeneracy, Rejuvenation, and Effective Sample Size 171
6.3.2.4 Sequential Importance Sampling for Hidden Markov Models 175
6.3.2.5 Particle Filters 179
6.4 Variance Reduction Techniques 180
6.4.1 Importance Sampling 180
6.4.2 Antithetic Sampling 186
6.4.3 Control Variates 189
6.4.4 Rao–Blackwellization 193
Problems 195

7 MARKOV CHAIN MONTE CARLO 201
7.1 Metropolis–Hastings Algorithm 202
7.1.1 Independence Chains 204
7.1.2 Random Walk Chains 206
7.2 Gibbs Sampling 209
7.2.1 Basic Gibbs Sampler 209
7.2.2 Properties of the Gibbs Sampler 214
7.2.3 Update Ordering 216
7.2.4 Blocking 216
7.2.5 Hybrid Gibbs Sampling 216
7.2.6 Griddy–Gibbs Sampler 218
7.3 Implementation 218
7.3.1 Ensuring Good Mixing and Convergence 219
7.3.1.1 Simple Graphical Diagnostics 219
7.3.1.2 Burn-in and Run Length 220
7.3.1.3 Choice of Proposal 222
7.3.1.4 Reparameterization 223
7.3.1.5 Comparing Chains: Effective Sample Size 224
7.3.1.6 Number of Chains 225
7.3.2 Practical Implementation Advice 226
7.3.3 Using the Results 226
Problems 230

8 ADVANCED TOPICS IN MCMC 237
8.1 Adaptive MCMC 237
8.1.1 Adaptive Random Walk Metropolis-within-Gibbs Algorithm 238
8.1.2 General Adaptive Metropolis-within-Gibbs Algorithm 240
8.1.3 Adaptive Metropolis Algorithm 247
8.2 Reversible Jump MCMC 250
8.2.1 RJMCMC for Variable Selection in Regression 253
8.3 Auxiliary Variable Methods 256
8.3.1 Simulated Tempering 257
8.3.2 Slice Sampler 258
8.4 Other Metropolis–Hastings Algorithms 260
8.4.1 Hit-and-Run Algorithm 260
8.4.2 Multiple-Try Metropolis–Hastings Algorithm 261
8.4.3 Langevin Metropolis–Hastings Algorithm 262
8.5 Perfect Sampling 264
8.5.1 Coupling from the Past 264
8.5.1.1 Stochastic Monotonicity and Sandwiching 267
8.6 Markov Chain Maximum Likelihood 268
8.7 Example: MCMC for Markov Random Fields 269
8.7.1 Gibbs Sampling for Markov Random Fields 270
8.7.2 Auxiliary Variable Methods for Markov Random Fields 274
8.7.3 Perfect Sampling for Markov Random Fields 277
Problems 279

PART III
BOOTSTRAPPING

9 BOOTSTRAPPING 287
9.1 The Bootstrap Principle 287
9.2 Basic Methods 288
9.2.1 Nonparametric Bootstrap 288
9.2.2 Parametric Bootstrap 289
9.2.3 Bootstrapping Regression 290
9.2.4 Bootstrap Bias Correction 291
9.3 Bootstrap Inference 292
9.3.1 Percentile Method 292
9.3.1.1 Justification for the Percentile Method 293
9.3.2 Pivoting 294
9.3.2.1 Accelerated Bias-Corrected Percentile Method, BCa 294
9.3.2.2 The Bootstrap t 296
9.3.2.3 Empirical Variance Stabilization 298
9.3.2.4 Nested Bootstrap and Prepivoting 299
9.3.3 Hypothesis Testing 301
9.4 Reducing Monte Carlo Error 302
9.4.1 Balanced Bootstrap 302
9.4.2 Antithetic Bootstrap 302
9.5 Bootstrapping Dependent Data 303
9.5.1 Model-Based Approach 304
9.5.2 Block Bootstrap 304
9.5.2.1 Nonmoving Block Bootstrap 304
9.5.2.2 Moving Block Bootstrap 306
9.5.2.3 Blocks-of-Blocks Bootstrapping 307
9.5.2.4 Centering and Studentizing 309
9.5.2.5 Block Size 311
9.6 Bootstrap Performance 315
9.6.1 Independent Data Case 315
9.6.2 Dependent Data Case 316
9.7 Other Uses of the Bootstrap 316
9.8 Permutation Tests 317
Problems 319

PART IV
DENSITY ESTIMATION AND SMOOTHING

10 NONPARAMETRIC DENSITY ESTIMATION 325
10.1 Measures of Performance 326
10.2 Kernel Density Estimation 327
10.2.1 Choice of Bandwidth 329
10.2.1.1 Cross-Validation 332
10.2.1.2 Plug-in Methods 335
10.2.1.3 Maximal Smoothing Principle 338
10.2.2 Choice of Kernel 339
10.2.2.1 Epanechnikov Kernel 339
10.2.2.2 Canonical Kernels and Rescalings 340
10.3 Nonkernel Methods 341
10.3.1 Logspline 341
10.4 Multivariate Methods 345
10.4.1 The Nature of the Problem 345
10.4.2 Multivariate Kernel Estimators 346
10.4.3 Adaptive Kernels and Nearest Neighbors 348
10.4.3.1 Nearest Neighbor Approaches 349
10.4.3.2 Variable-Kernel Approaches and Transformations 350
10.4.4 Exploratory Projection Pursuit 353
Problems 359

11 BIVARIATE SMOOTHING 363
11.1 Predictor–Response Data 363
11.2 Linear Smoothers 365
11.2.1 Constant-Span Running Mean 366
11.2.1.1 Effect of Span 368
11.2.1.2 Span Selection for Linear Smoothers 369
11.2.2 Running Lines and Running Polynomials 372
11.2.3 Kernel Smoothers 374
11.2.4 Local Regression Smoothing 374
11.2.5 Spline Smoothing 376
11.2.5.1 Choice of Penalty 377
11.3 Comparison of Linear Smoothers 377
11.4 Nonlinear Smoothers 379
11.4.1 Loess 379
11.4.2 Supersmoother 381
11.5 Confidence Bands 384
11.6 General Bivariate Data 388
Problems 389

12 MULTIVARIATE SMOOTHING 393
12.1 Predictor–Response Data 393
12.1.1 Additive Models 394
12.1.2 Generalized Additive Models 397
12.1.3 Other Methods Related to Additive Models 399
12.1.3.1 Projection Pursuit Regression 399
12.1.3.2 Neural Networks 402
12.1.3.3 Alternating Conditional Expectations 403
12.1.3.4 Additivity and Variance Stabilization 404
12.1.4 Tree-Based Methods 405
12.1.4.1 Recursive Partitioning: Regression Trees 406
12.1.4.2 Tree Pruning 409
12.1.4.3 Classification Trees 411
12.1.4.4 Other Issues for Tree-Based Methods 412
12.2 General Multivariate Data 413
12.2.1 Principal Curves 413
12.2.1.1 Definition and Motivation 413
12.2.1.2 Estimation 415
12.2.1.3 Span Selection 416
Problems 416

DATA ACKNOWLEDGMENTS 421
REFERENCES 423
INDEX 457
PREFACE
This book covers most topics needed to develop a broad and thorough working knowledge of modern computational statistics. We seek to develop a practical understanding of how and why existing methods work, enabling readers to use modern statistical methods effectively. Since many new methods are built from components of existing techniques, our ultimate goal is to provide scientists with the tools they need to contribute new ideas to the field.
A growing challenge in science is that there is so much of it. While the pursuit of important new methods and the honing of existing approaches is a worthy goal, there is also a need to organize and distill the teeming jungle of ideas. We attempt to do that here. Our choice of topics reflects our view of what constitutes the core of the evolving field of computational statistics, and what will be interesting and useful for our readers.
Our use of the adjective modern in the first sentence of this preface is potentially troublesome. There is no way that this book can cover all the latest, greatest techniques. We have not even tried. We have instead aimed to provide a reasonably up-to-date survey of a broad portion of the field, while leaving room for diversions and esoterica.
The foundations of optimization and numerical integration are covered in this book. We include these venerable topics because (i) they are cornerstones of frequentist and Bayesian inference, (ii) routine application of available software often fails for hard problems, and (iii) the methods themselves are often secondary components of other statistical computing algorithms. Some topics we have omitted represent important areas of past and present research in the field, but their priority here is lowered by the availability of high-quality software. For example, the generation of pseudo-random numbers is a classic topic, but one that we prefer to address by giving students reliable software. Finally, some topics (e.g., principal curves and tabu search) are included simply because they are interesting and provide very different perspectives on familiar problems. Perhaps a future researcher may draw ideas from such topics to design a creative and effective new algorithm.
In this second edition we have both updated and broadened our coverage, and we now provide computer code. For example, we have added new MCMC topics to reflect continued activity in that popular area. A notable increase in breadth is our inclusion of more methods relevant for problems where statistical dependency is important, such as block bootstrapping and sequential importance sampling. This second edition provides extensive new support in R. Specifically, code for the examples in this book is available from the book website, www.stat.colostate.edu/computationalstatistics.
Our target audience includes graduate students in statistics and related fields, statisticians, and quantitative empirical scientists in other fields. We hope such readers may use the book when applying standard methods and developing new methods.
The level of mathematics expected of the reader does not extend much beyond Taylor series and linear algebra. Breadth of mathematical training is more helpful than depth. Essential review is provided in Chapter 1. More advanced readers will find greater mathematical detail in the wide variety of high-quality books available on specific topics, many of which are referenced in the text. Other readers, caring less about analytical details, may prefer to focus on our descriptions of algorithms and examples.
The expected level of statistics is equivalent to that obtained by a graduate student in his or her first year of study of the theory of statistics and probability. An understanding of maximum likelihood methods, Bayesian methods, elementary asymptotic theory, Markov chains, and linear models is most important. Many of these topics are reviewed in Chapter 1.
With respect to computer programming, we find that good students can learn as they go. However, a working knowledge of a suitable language allows implementation of the ideas covered in this book to progress much more quickly. We have chosen to forgo any language-specific examples, algorithms, or coding in the text. For those wishing to learn a language while they study this book, we recommend that you choose a high-level, interactive package that permits the flexible design of graphical displays and includes supporting statistics and probability functions, such as R and MATLAB.¹ These are the sort of languages often used by researchers during the development of new statistical computing techniques, and they are suitable for implementing all the methods we describe, except in some cases for problems of vast scope or complexity. We use R and recommend it. Although lower-level languages such as C++ could also be used, they are more appropriate for professional-grade implementation of algorithms after researchers have refined the methodology.
The book is organized into four major parts: optimization (Chapters 2, 3, and 4), integration and simulation (Chapters 5, 6, 7, and 8), bootstrapping (Chapter 9), and density estimation and smoothing (Chapters 10, 11, and 12). The chapters are written to stand independently, so a course can be built by selecting the topics one wishes to teach. For a one-semester course, our selection typically weights most heavily topics from Chapters 2, 3, 6, 7, 9, 10, and 11. With a leisurely pace or more thorough coverage, a shorter list of topics could still easily fill a semester course. There is sufficient material here to provide a thorough one-year course of study, notwithstanding any supplemental topics one might wish to teach.
A variety of homework problems are included at the end of each chapter. Some are straightforward, while others require the student to develop a thorough understanding of the model/method being used, to carefully (and perhaps cleverly) code a suitable technique, and to devote considerable attention to the interpretation of results. A few exercises invite open-ended exploration of methods and ideas. We are sometimes asked for solutions to the exercises, but we prefer to sequester them to preserve the challenge for future students and readers.
The datasets discussed in the examples and exercises are available from the book website, www.stat.colostate.edu/computationalstatistics. The R code is also provided there. Finally, the website includes an errata. Responsibility for all errors lies with us.
¹R is available for free from www.r-project.org. Information about MATLAB can be found at www.mathworks.com.
ACKNOWLEDGMENTS
The course upon which this book is based was developed and taught by us at Colorado State University from 1994 onwards. Thanks are due to our many students, who have been semiwilling guinea pigs over the years. We also thank our colleagues in the Statistics Department for their continued support. The late Richard Tweedie merits particular acknowledgment for his mentoring during the early years of our careers.
We owe a great deal of intellectual debt to Adrian Raftery, who deserves special thanks not only for his teaching and advising but also for his unwavering support and his seemingly inexhaustible supply of good ideas. In addition, we thank our influential advisors and teachers at the University of Washington Statistics Department, including David Madigan, Werner Stuetzle, and Judy Zeh. Of course, each of our chapters could be expanded into a full-length book, and great scholars have already done so. We owe much to their efforts, upon which we relied when developing our course and our manuscript.
Portions of the first edition were written at the Department of Mathematics and Statistics, University of Otago, in Dunedin, New Zealand, whose faculty we thank for graciously hosting us during our sabbatical in 2003. Much of our work on the second edition was undertaken during our sabbatical visit to Australia's Commonwealth Scientific and Industrial Research Organisation in 2009–2010, sponsored by CSIRO Mathematics, Informatics and Statistics, and hosted at the Longpocket Laboratory in Indooroopilly, Australia. We thank our hosts and colleagues there for their support.
Our manuscript has been greatly improved through the constructive reviews of John Bickham, Ben Bird, Kate Cowles, Jan Hannig, Alan Herlihy, David Hunter, Devin Johnson, Michael Newton, Doug Nychka, Steve Sain, David W. Scott, N. Scott Urquhart, Haonan Wang, Darrell Whitley, and eight anonymous referees. We also thank the sharp-eyed readers listed in the errata for their suggestions and corrections. Our editor Steve Quigley and the folks at Wiley were supportive and helpful during the publication process. We thank Nelida Pohl for permission to adapt her photograph in the cover design of the first edition. We thank Melinda Stelzer for permission to use her painting "Champagne Circuit," 2001, for the cover of the second edition; more about her art can be found at www.facebook.com/geekchicart. We also owe a special note of thanks to Zube (a.k.a. John Dzubera), who kept our own computers running despite our best efforts to the contrary.
Funding from National Science Foundation (NSF) CAREER grant SBR-9875508 was a significant source of support for the first author during the preparation of the first edition. He also thanks his colleagues and friends in the North Slope Borough, Alaska, Department of Wildlife Management for their longtime research support. The second author gratefully acknowledges the support of STAR Research
Assistance Agreement CR-829095 awarded to Colorado State University by the U.S. Environmental Protection Agency (EPA). The views expressed here are solely those of the authors. NSF and EPA do not endorse any products or commercial services mentioned herein.
Finally, we thank our parents for enabling and supporting our educations, and for providing us with the "stubbornness gene" necessary for graduate school, the tenure track, or book publication: take your pick. The second edition is dedicated to our kids, Natalie and Neil, for continuing to show us what is important and what is not.
Geof H. Givens
Jennifer A. Hoeting
CHAPTER 1

REVIEW
This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.
1.1 MATHEMATICAL NOTATION
We use boldface to distinguish a vector x = (x1, ..., xp) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f1(x), ..., fp(x)). The transpose of M is denoted M^T.

Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x1, ..., xn)^T. Let I denote an identity matrix, and 1 and 0 denote vectors of ones and zeros, respectively.

A symmetric square matrix M is positive definite if x^T M x > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite or positive semidefinite if x^T M x ≥ 0 for all nonzero vectors x.
The derivative of a function f, evaluated at x, is denoted f'(x). When x = (x1, ..., xp), the gradient of f at x is

    f'(x) = (df(x)/dx1, ..., df(x)/dxp).
The Hessian matrix for f at x is f''(x), having (i, j)th element equal to d²f(x)/(dxi dxj). The negative Hessian has important uses in statistical inference.

Let J(x) denote the Jacobian matrix evaluated at x for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to dfi(x)/dxj.
A functional is a real-valued function on a space of functions. For example, if T(f) = ∫₀¹ f(x) dx, then the functional T maps suitably integrable functions onto the real line.
The indicator function 1{A} equals 1 if A is true and 0 otherwise. The real line is denoted ℝ, and p-dimensional real space is ℝ^p.
1.2 TAYLOR'S THEOREM AND MATHEMATICAL LIMIT THEORY
First we define standard "big oh" and "little oh" notation for describing the relative orders of convergence of functions. Let the functions f and g be defined on a common, possibly infinite interval. Let z0 be a point in this interval or a boundary point of it (i.e., −∞ or ∞). We require g(z) ≠ 0 for all z ≠ z0 in a neighborhood of z0. Then we say

    f(z) = O(g(z))    (1.1)

if there exists a constant M such that |f(z)| ≤ M|g(z)| as z → z0. For example, (n + 1)/(3n²) = O(n⁻¹), where it is understood that we are considering n → ∞. If lim_{z→z0} f(z)/g(z) = 0, then we say

    f(z) = o(g(z)).    (1.2)

For example, f(x0 + h) − f(x0) = hf'(x0) + o(h) as h → 0 if f is differentiable at x0. The same notation can be used for describing the convergence of a sequence {xn} as n → ∞, by letting f(n) = xn.
Taylor's theorem provides a polynomial approximation to a function f. Suppose f has a finite (n + 1)th derivative on (a, b) and a continuous nth derivative on [a, b]. Then for any x0 ∈ [a, b] distinct from x, the Taylor series expansion of f about x0 is

    f(x) = Σ_{i=0}^{n} (1/i!) f^(i)(x0) (x − x0)^i + Rn,    (1.3)

where f^(i)(x0) is the ith derivative of f evaluated at x0, and

    Rn = (1/(n + 1)!) f^(n+1)(ξ) (x − x0)^(n+1)    (1.4)

for some point ξ in the interval between x and x0. As |x − x0| → 0, note that Rn = O(|x − x0|^(n+1)).
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x0 ≠ x. Then

    f(x) = f(x0) + Σ_{i=1}^{n} (1/i!) D^(i)(f; x0, x − x0) + Rn,    (1.5)

where

    D^(i)(f; x, y) = Σ_{j1=1}^{p} ··· Σ_{ji=1}^{p} [ (d^i / (dt_{j1} ··· dt_{ji})) f(t) |_{t=x} ] Π_{k=1}^{i} y_{jk}    (1.6)
and

    Rn = (1/(n + 1)!) D^(n+1)(f; ξ, x − x0)    (1.7)

for some ξ on the line segment joining x and x0. As |x − x0| → 0, note that Rn = O(|x − x0|^(n+1)).
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

    ∫₀¹ f(x) dx = [f(0) + f(1)]/2 − Σ_{i=1}^{n−1} [b_{2i}/(2i)!] (f^(2i−1)(1) − f^(2i−1)(0)) − [b_{2n}/(2n)!] f^(2n)(ξ),    (1.8)

where 0 ≤ ξ ≤ 1, f^(j) is the jth derivative of f, and bj = Bj(0) can be determined using the recursion relation

    Σ_{j=0}^{m} C(m + 1, j) Bj(z) = (m + 1) z^m,    (1.9)

initialized with B0(z) = 1. The proof of this result is based on repeated integrations by parts [376].
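The recursion (1.9), evaluated at z = 0, determines the constants bj = Bj(0) directly: for m ≥ 1 the right-hand side vanishes, so each b_m can be solved from the earlier values. A minimal Python sketch using exact rational arithmetic from the standard library (the function name is ours):

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(n):
    """b_0, ..., b_n with b_j = B_j(0), from the recursion (1.9) at z = 0:
    sum_{j=0}^{m} C(m+1, j) b_j = 0 for m >= 1, with b_0 = 1."""
    b = [Fraction(1)]
    for m in range(1, n + 1):
        s = sum(comb(m + 1, j) * b[j] for j in range(m))
        b.append(-s / (m + 1))   # since C(m+1, m) = m + 1
    return b

print([str(v) for v in bernoulli_numbers(6)])
# ['1', '-1/2', '1/6', '0', '-1/30', '0', '1/42']
```

Only the even-indexed values (and b1) are nonzero, which is why (1.8) involves b_{2i} alone.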
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

    df(x)/dxi ≈ [f(x + εi ei) − f(x − εi ei)] / (2εi),    (1.10)

where εi is a small number and ei is the unit vector in the ith coordinate direction. Typically, one might start with, say, εi = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller εi. The approximation will generally improve until εi becomes small enough that the calculation is degraded and eventually dominated by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via

    d²f(x)/(dxi dxj) ≈ [1/(4 εi εj)] [f(x + εi ei + εj ej) − f(x + εi ei − εj ej) − f(x − εi ei + εj ej) + f(x − εi ei − εj ej)]    (1.11)

with similar sequential precision improvements.
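As an illustration of (1.10), a central-difference gradient takes only a few lines. This is a hedged Python sketch (the book's supporting code is in R); the test function is our own:

```python
import math

def grad_central(f, x, eps=1e-5):
    """Approximate the gradient of f at x by central differences, Eq. (1.10)."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps   # perturb the ith coordinate up
        xm[i] -= eps   # and down
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g

f = lambda v: math.sin(v[0]) * math.exp(v[1])   # test function (ours)
g = grad_central(f, [0.5, 1.0])
print(g)  # close to [cos(0.5)*e, sin(0.5)*e]
```

Shrinking eps much below about the square root of machine precision eventually degrades the result through the subtractive cancellation noted above.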
4 CHAPTER 1 REVIEW
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS
We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ~ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters also will be denoted with a conditioning bar, as in f(x|α, β). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|α) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example, by using f(·|α) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are fX and fY, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.

The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or fX|Y(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|μ) = f(x|y, μ)f(y|μ). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.
The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^(1/2)/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so

    g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y)    (1.12)

for all x, y ∈ I and all 0 < λ < 1. Then Jensen's inequality states that E{g(X)} ≥ g(E{X}) for any random variable X having P[X ∈ I] = 1.
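A quick numerical check of Jensen's inequality with the convex choice g(x) = x²: for any sample, the sample analog of E{g(X)} − g(E{X}) is exactly the (nonnegative) sample variance. A small Python sketch with simulated standard normal draws (the simulation setup is ours):

```python
import random

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]

g = lambda t: t * t                       # a convex function
mean = sum(xs) / len(xs)                  # sample analog of E{X}
e_g = sum(g(t) for t in xs) / len(xs)     # sample analog of E{g(X)}
print(e_g >= g(mean))  # True; the gap e_g - g(mean) is the sample variance
```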
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

    n! = n(n − 1)(n − 2) ··· (3)(2)(1), with 0! = 1,    (1.13)

    C(n, k) = n! / (k!(n − k)!),    (1.14)
TABLE 1.1 Notation and description for common probability distributions of discrete random variables

Bernoulli: X ~ Bernoulli(p), 0 ≤ p ≤ 1. Density f(x) = p^x (1 − p)^(1−x) for x = 0 or 1. Mean E{X} = p; variance var{X} = p(1 − p).

Binomial: X ~ Bin(n, p), 0 ≤ p ≤ 1 and n ∈ {1, 2, ...}. Density f(x) = C(n, x) p^x (1 − p)^(n−x) for x = 0, 1, ..., n. Mean E{X} = np; variance var{X} = np(1 − p).

Multinomial: X ~ Multinomial(n, p), p = (p1, ..., pk), 0 ≤ pi ≤ 1 with Σ_{i=1}^{k} pi = 1, and n ∈ {1, 2, ...}. Density f(x) = C(n; x1, ..., xk) Π_{i=1}^{k} pi^{xi} for x = (x1, ..., xk), xi ∈ {0, 1, ..., n}, Σ_{i=1}^{k} xi = n. Mean E{X} = np; var{Xi} = n pi(1 − pi); cov{Xi, Xj} = −n pi pj.

Negative Binomial: X ~ NegBin(r, p), 0 ≤ p ≤ 1 and r ∈ {1, 2, ...}. Density f(x) = C(x + r − 1, r − 1) p^r (1 − p)^x for x ∈ {0, 1, ...}. Mean E{X} = r(1 − p)/p; variance var{X} = r(1 − p)/p².

Poisson: X ~ Poisson(λ), λ > 0. Density f(x) = λ^x exp{−λ}/x! for x ∈ {0, 1, ...}. Mean E{X} = λ; variance var{X} = λ.
TABLE 1.2 Notation and description for some common probability distributions of continuous random variables

Beta: X ~ Beta(α, β), α > 0 and β > 0. Density f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1) for 0 ≤ x ≤ 1. Mean E{X} = α/(α + β); variance var{X} = αβ/[(α + β)²(α + β + 1)].

Cauchy: X ~ Cauchy(α, β), α ∈ ℝ and β > 0. Density f(x) = 1/(πβ[1 + ((x − α)/β)²]) for x ∈ ℝ. Mean and variance are nonexistent.

Chi-square: X ~ χ²_ν, ν > 0. Density is that of a Gamma(ν/2, 1/2) for x > 0. Mean E{X} = ν; variance var{X} = 2ν.

Dirichlet: X ~ Dirichlet(α), α = (α1, ..., αk), αi > 0, and α0 = Σ_{i=1}^{k} αi. Density f(x) = Γ(α0) Π_{i=1}^{k} xi^(αi−1) / Π_{i=1}^{k} Γ(αi) for x = (x1, ..., xk), 0 ≤ xi ≤ 1, Σ_{i=1}^{k} xi = 1. Mean E{X} = α/α0; var{Xi} = αi(α0 − αi)/[α0²(α0 + 1)]; cov{Xi, Xj} = −αiαj/[α0²(α0 + 1)].

Exponential: X ~ Exp(λ), λ > 0. Density f(x) = λ exp{−λx} for x > 0. Mean E{X} = 1/λ; variance var{X} = 1/λ².

Gamma: X ~ Gamma(r, λ), λ > 0 and r > 0. Density f(x) = λ^r x^(r−1) exp{−λx}/Γ(r) for x > 0. Mean E{X} = r/λ; variance var{X} = r/λ².
TABLE 1.3 Notation and description for more common probability distributions of continuous random variables

Lognormal: X ~ Lognormal(μ, σ²), μ ∈ ℝ and σ > 0. Density f(x) = [1/(x √(2πσ²))] exp{−(1/2)((log x − μ)/σ)²} for x > 0. Mean E{X} = exp{μ + σ²/2}; variance var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}.

Multivariate Normal: X ~ N_k(μ, Σ), μ = (μ1, ..., μk) ∈ ℝ^k and Σ positive definite. Density f(x) = exp{−(x − μ)^T Σ⁻¹ (x − μ)/2} / [(2π)^(k/2) |Σ|^(1/2)] for x = (x1, ..., xk) ∈ ℝ^k. Mean E{X} = μ; variance var{X} = Σ.

Normal: X ~ N(μ, σ²), μ ∈ ℝ and σ > 0. Density f(x) = [1/√(2πσ²)] exp{−(1/2)((x − μ)/σ)²} for x ∈ ℝ. Mean E{X} = μ; variance var{X} = σ².

Student's t: X ~ t_ν, ν > 0. Density f(x) = [Γ((ν + 1)/2)/(Γ(ν/2)√(πν))] (1 + x²/ν)^(−(ν+1)/2) for x ∈ ℝ. Mean E{X} = 0 if ν > 1; variance var{X} = ν/(ν − 2) if ν > 2.

Uniform: X ~ Unif(a, b), a, b ∈ ℝ and a < b. Density f(x) = 1/(b − a) for x ∈ [a, b]. Mean E{X} = (a + b)/2; variance var{X} = (b − a)²/12.

Weibull: X ~ Weibull(a, b), a > 0 and b > 0. Density f(x) = a b x^(b−1) exp{−a x^b} for x > 0. Mean E{X} = Γ(1 + 1/b)/a^(1/b); variance var{X} = [Γ(1 + 2/b) − Γ(1 + 1/b)²]/a^(2/b).
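Entries in these tables can be verified numerically. For instance, the following Python sketch (standard library only; function and variable names are ours) integrates against the Beta(α, β) density on a midpoint grid and recovers the tabulated mean α/(α + β) and variance αβ/[(α + β)²(α + β + 1)] for α = 2, β = 3:

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density as given in Table 1.2."""
    return (math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
            * x ** (a - 1) * (1 - x) ** (b - 1))

a, b, n = 2.0, 3.0, 10_000
xs = [(i + 0.5) / n for i in range(n)]                            # midpoint grid on [0, 1]
mean = sum(x * beta_pdf(x, a, b) for x in xs) / n                 # ~ a/(a+b) = 0.4
var = sum((x - mean) ** 2 * beta_pdf(x, a, b) for x in xs) / n    # ~ 0.04
print(mean, var)
```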
    C(n; k1, ..., km) = n! / Π_{i=1}^{m} ki!,  where n = Σ_{i=1}^{m} ki,    (1.15)

    Γ(r) = (r − 1)!  for r = 1, 2, ...,  and  Γ(r) = ∫₀^∞ t^(r−1) exp{−t} dt  for general r > 0.    (1.16)

It is worth knowing that

    Γ(1/2) = √π  and  Γ(n + 1/2) = (1 × 3 × 5 × ··· × (2n − 1)) √π / 2^n
for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as

    f(x|γ) = c1(x) c2(γ) exp{ Σ_{i=1}^{k} yi(x) θi(γ) }    (1.17)

for nonnegative functions c1 and c2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θi(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The yi(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that

    E{y(X)} = κ'(θ)    (1.18)

and

    var{y(X)} = κ''(θ),    (1.19)

where κ(θ) = −log c3(θ), letting c3(θ) denote the reexpression of c2(γ) in terms of the canonical parameters θ = (θ1, ..., θk), and y(X) = (y1(X), ..., yk(X)). These results can be rewritten in terms of the original parameters γ as
    E{ Σ_{i=1}^{k} (dθi(γ)/dγj) yi(X) } = −(d/dγj) log c2(γ)    (1.20)

and

    var{ Σ_{i=1}^{k} (dθi(γ)/dγj) yi(X) } = −(d²/dγj²) log c2(γ) − E{ Σ_{i=1}^{k} (d²θi(γ)/dγj²) yi(X) }.    (1.21)
Example 1.1 (Poisson). The Poisson distribution belongs to the exponential family, with c1(x) = 1/x!, c2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ'(θ) = exp{θ} = λ and var{X} = κ''(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
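The moment identities (1.18) and (1.19) can be checked numerically for this example: with κ(θ) = exp{θ} and θ = log λ, both κ'(θ) and κ''(θ) should equal λ. A small Python sketch using central finite differences as in (1.10) (the setup is ours):

```python
import math

lam = 3.0
theta = math.log(lam)   # canonical parameter theta = log(lambda)
kappa = math.exp        # kappa(theta) = exp(theta) for the Poisson family

# central finite differences for kappa'(theta) and kappa''(theta)
h = 1e-4
kprime = (kappa(theta + h) - kappa(theta - h)) / (2 * h)
kdblprime = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h ** 2
print(kprime, kdblprime)  # both approximately lam = 3.0
```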
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X1, ..., Xp) denote a p-dimensional random variable with continuous density function f. Suppose that

    U = g(X) = (g1(X), ..., gp(X)) = (U1, ..., Up),    (1.22)

where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

    f(u) = f(g⁻¹(u)) |J(u)|,    (1.23)

where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g⁻¹ evaluated at u, having (i, j)th element dxi/duj, where these derivatives are assumed to be continuous over the support region of U.
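A familiar special case of (1.23): if X ~ N(μ, σ²) and U = exp(X), then x = g⁻¹(u) = log u and |J(u)| = 1/u, which yields the lognormal density of Table 1.3. A Python sketch (function names are ours) comparing the Jacobian-based density with the direct formula for μ = 0, σ = 1:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / math.sqrt(2 * math.pi * sigma ** 2)

def lognormal_pdf_via_jacobian(u, mu=0.0, sigma=1.0):
    """Density of U = exp(X) for X ~ N(mu, sigma^2), via Equation (1.23):
    x = g^{-1}(u) = log(u), and |J(u)| = |dx/du| = 1/u."""
    return normal_pdf(math.log(u), mu, sigma) / u

u = 2.5
# direct lognormal formula from Table 1.3 with mu = 0, sigma = 1
direct = math.exp(-0.5 * math.log(u) ** 2) / (u * math.sqrt(2 * math.pi))
print(lognormal_pdf_via_jacobian(u), direct)  # the two agree
```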
1.4 LIKELIHOOD INFERENCE
If X1, ..., Xn are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ1, ..., θp), then the joint likelihood function is

    L(θ) = Π_{i=1}^{n} f(xi|θ).    (1.24)

When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x1, ..., xn|θ) viewed as a function of θ.
The observed data, x1, ..., xn, might have been realized under many different values for θ. The parameters for which observing x1, ..., xn would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x1, ..., xn that maximizes L(θ), then θ̂ = ϑ(X1, ..., Xn) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.
It is typically easier to work with the log likelihood function

    l(θ) = log L(θ),    (1.25)

which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x1, ..., xn but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations

    l'(θ) = 0,    (1.26)

where

    l'(θ) = (dl(θ)/dθ1, ..., dl(θ)/dθp)
is called the score function. The score function satisfies

    E{l'(θ)} = 0,    (1.27)

where the expectation is taken with respect to the distribution of X1, ..., Xn. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.

The MLE has a sampling distribution because it depends on the realization of the random variables X1, ..., Xn. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: When the log likelihood is very pointy, the location of the maximum is more precisely known.

To make this precise, let l''(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθi dθj). The Fisher information matrix is defined as

    I(θ) = E{l'(θ) l'(θ)^T} = −E{l''(θ)},    (1.28)

where the expectations are taken with respect to the distribution of X1, ..., Xn. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l''(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.

Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)⁻¹, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is Np(θ*, I(θ*)⁻¹). Since the true parameter values are unknown, I(θ*)⁻¹ must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)⁻¹. Alternatively, it is also reasonable to use −l''(θ̂)⁻¹. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)⁻¹. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)⁻¹ can be found in [127, 182, 371, 470].
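For a concrete case where everything is available in closed form, consider i.i.d. Poisson(λ) data: the MLE is λ̂ = x̄, the observed Fisher information is −l''(λ̂) = Σxi/λ̂² = n/λ̂, and the estimated standard error is the square root of its inverse. A Python sketch with a small, made-up illustrative sample:

```python
import math

# illustrative (made-up) i.i.d. Poisson(lambda) observations
x = [2, 4, 1, 3, 0, 2, 5, 3, 2, 1]
n, s = len(x), sum(x)

# log likelihood l(lambda) = s*log(lambda) - n*lambda (additive constants dropped)
lam_hat = s / n                 # solves the score equation s/lambda - n = 0
obs_info = s / lam_hat ** 2     # observed Fisher information -l''(lam_hat) = n/lam_hat
se = math.sqrt(1 / obs_info)    # estimated standard error of the MLE
print(lam_hat, se)              # 2.3 and about 0.48
```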
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is

    L(φ|μ̂(φ)) = max_μ L(μ, φ).    (1.29)

Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993, or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data

Givens, Geof H.
Computational statistics / Geof H. Givens, Jennifer A. Hoeting. – 2nd ed.
p. cm.
Includes index.
ISBN 978-0-470-53331-4 (cloth)
1. Mathematical statistics–Data processing. I. Hoeting, Jennifer A. (Jennifer Ann), 1966– II. Title.
QA276.4.G58 2013
519.5–dc23
2012017381

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1
To Natalie and Neil
CONTENTS
PREFACE xv
ACKNOWLEDGMENTS xvii
1 REVIEW 1
11 Mathematical Notation 1
12 Taylorrsquos Theorem and Mathematical Limit Theory 2
13 Statistical Notation and Probability Distributions 4
14 Likelihood Inference 9
15 Bayesian Inference 11
16 Statistical Limit Theory 13
17 Markov Chains 14
18 Computing 17
PART I
OPTIMIZATION
2 OPTIMIZATION AND SOLVING NONLINEAR EQUATIONS 21
21 Univariate Problems 22
211 Newtonrsquos Method 26
2111 Convergence Order 29
212 Fisher Scoring 30213 Secant Method 30214 Fixed-Point Iteration 32
2141 Scaling 33
22 Multivariate Problems 34
221 Newtonrsquos Method and Fisher Scoring 34
2211 Iteratively Reweighted Least Squares 36
222 Newton-Like Methods 39
2221 Ascent Algorithms 39
2222 Discrete Newton and Fixed-Point Methods 41
2223 Quasi-Newton Methods 41
vii
viii CONTENTS
223 GaussndashNewton Method 44224 NelderndashMead Algorithm 45225 Nonlinear GaussndashSeidel Iteration 52
Problems 54
3 COMBINATORIAL OPTIMIZATION 59
31 Hard Problems and NP-Completeness 59311 Examples 61312 Need for Heuristics 64
32 Local Search 65
33 Simulated Annealing 68331 Practical Issues 70
3311 Neighborhoods and Proposals 70
3312 Cooling Schedule and Convergence 71332 Enhancements 74
34 Genetic Algorithms 75341 Definitions and the Canonical Algorithm 75
3411 Basic Definitions 75
3412 Selection Mechanisms and Genetic Operators 76
3413 Allele Alphabets and Genotypic Representation 78
3414 Initialization Termination and Parameter Values 79342 Variations 80
3421 Fitness 80
3422 Selection Mechanisms and Updating Generations 81
3423 Genetic Operators and PermutationChromosomes 82
343 Initialization and Parameter Values 84344 Convergence 84
35 Tabu Algorithms 85351 Basic Definitions 86352 The Tabu List 87353 Aspiration Criteria 88354 Diversification 89355 Intensification 90356 Comprehensive Tabu Algorithm 91
Problems 92
4 EM OPTIMIZATION METHODS 97
41 Missing Data Marginalization and Notation 97
42 The EM Algorithm 98421 Convergence 102422 Usage in Exponential Families 105
CONTENTS ix
423 Variance Estimation 106
4231 Louisrsquos Method 106
4232 SEM Algorithm 108
4233 Bootstrapping 110
4234 Empirical Information 110
4235 Numerical Differentiation 111
43 EM Variants 111431 Improving the E Step 111
4311 Monte Carlo EM 111432 Improving the M Step 112
4321 ECM Algorithm 113
4322 EM Gradient Algorithm 116433 Acceleration Methods 118
4331 Aitken Acceleration 118
4332 Quasi-Newton Acceleration 119
Problems 121
PART II
INTEGRATION AND SIMULATION
5 NUMERICAL INTEGRATION 129
51 NewtonndashCotes Quadrature 129511 Riemann Rule 130512 Trapezoidal Rule 134513 Simpsonrsquos Rule 136514 General kth-Degree Rule 138
52 Romberg Integration 139
53 Gaussian Quadrature 142531 Orthogonal Polynomials 143532 The Gaussian Quadrature Rule 143
54 Frequently Encountered Problems 146541 Range of Integration 146542 Integrands with Singularities or Other Extreme Behavior 146543 Multiple Integrals 147544 Adaptive Quadrature 147545 Software for Exact Integration 148
Problems 148
6 SIMULATION AND MONTE CARLO INTEGRATION 151
61 Introduction to the Monte Carlo Method 151
62 Exact Simulation 152
x CONTENTS
621 Generating from Standard Parametric Families 153622 Inverse Cumulative Distribution Function 153623 Rejection Sampling 155
6231 Squeezed Rejection Sampling 158
6232 Adaptive Rejection Sampling 159
63 Approximate Simulation 163
631 Sampling Importance Resampling Algorithm 163
6311 Adaptive Importance Bridge and Path Sampling 167
632 Sequential Monte Carlo 168
6321 Sequential Importance Sampling for Markov Processes 169
6322 General Sequential Importance Sampling 170
6323 Weight Degeneracy Rejuvenation and EffectiveSample Size 171
6324 Sequential Importance Sampling for HiddenMarkov Models 175
6325 Particle Filters 179
64 Variance Reduction Techniques 180
641 Importance Sampling 180642 Antithetic Sampling 186643 Control Variates 189644 RaondashBlackwellization 193
Problems 195
7 MARKOV CHAIN MONTE CARLO 201
71 MetropolisndashHastings Algorithm 202
711 Independence Chains 204712 Random Walk Chains 206
72 Gibbs Sampling 209
721 Basic Gibbs Sampler 209722 Properties of the Gibbs Sampler 214723 Update Ordering 216724 Blocking 216725 Hybrid Gibbs Sampling 216726 GriddyndashGibbs Sampler 218
73 Implementation 218
731 Ensuring Good Mixing and Convergence 219
7311 Simple Graphical Diagnostics 219
7312 Burn-in and Run Length 220
7313 Choice of Proposal 222
7314 Reparameterization 223
7315 Comparing Chains Effective Sample Size 224
7316 Number of Chains 225
CONTENTS xi
732 Practical Implementation Advice 226733 Using the Results 226
Problems 230
8 ADVANCED TOPICS IN MCMC 237
8.1 Adaptive MCMC 237
8.1.1 Adaptive Random Walk Metropolis-within-Gibbs Algorithm 238
8.1.2 General Adaptive Metropolis-within-Gibbs Algorithm 240
8.1.3 Adaptive Metropolis Algorithm 247
8.2 Reversible Jump MCMC 250
8.2.1 RJMCMC for Variable Selection in Regression 253
8.3 Auxiliary Variable Methods 256
8.3.1 Simulated Tempering 257
8.3.2 Slice Sampler 258
8.4 Other Metropolis–Hastings Algorithms 260
8.4.1 Hit-and-Run Algorithm 260
8.4.2 Multiple-Try Metropolis–Hastings Algorithm 261
8.4.3 Langevin Metropolis–Hastings Algorithm 262
8.5 Perfect Sampling 264
8.5.1 Coupling from the Past 264
8.5.1.1 Stochastic Monotonicity and Sandwiching 267
8.6 Markov Chain Maximum Likelihood 268
8.7 Example: MCMC for Markov Random Fields 269
8.7.1 Gibbs Sampling for Markov Random Fields 270
8.7.2 Auxiliary Variable Methods for Markov Random Fields 274
8.7.3 Perfect Sampling for Markov Random Fields 277
Problems 279
PART III
BOOTSTRAPPING
9 BOOTSTRAPPING 287
9.1 The Bootstrap Principle 287
9.2 Basic Methods 288
9.2.1 Nonparametric Bootstrap 288
9.2.2 Parametric Bootstrap 289
9.2.3 Bootstrapping Regression 290
9.2.4 Bootstrap Bias Correction 291
9.3 Bootstrap Inference 292
9.3.1 Percentile Method 292
9.3.1.1 Justification for the Percentile Method 293
9.3.2 Pivoting 294
9.3.2.1 Accelerated Bias-Corrected Percentile Method, BCa 294
9.3.2.2 The Bootstrap t 296
9.3.2.3 Empirical Variance Stabilization 298
9.3.2.4 Nested Bootstrap and Prepivoting 299
9.3.3 Hypothesis Testing 301
9.4 Reducing Monte Carlo Error 302
9.4.1 Balanced Bootstrap 302
9.4.2 Antithetic Bootstrap 302
9.5 Bootstrapping Dependent Data 303
9.5.1 Model-Based Approach 304
9.5.2 Block Bootstrap 304
9.5.2.1 Nonmoving Block Bootstrap 304
9.5.2.2 Moving Block Bootstrap 306
9.5.2.3 Blocks-of-Blocks Bootstrapping 307
9.5.2.4 Centering and Studentizing 309
9.5.2.5 Block Size 311
9.6 Bootstrap Performance 315
9.6.1 Independent Data Case 315
9.6.2 Dependent Data Case 316
9.7 Other Uses of the Bootstrap 316
9.8 Permutation Tests 317
Problems 319
PART IV
DENSITY ESTIMATION AND SMOOTHING
10 NONPARAMETRIC DENSITY ESTIMATION 325
10.1 Measures of Performance 326
10.2 Kernel Density Estimation 327
10.2.1 Choice of Bandwidth 329
10.2.1.1 Cross-Validation 332
10.2.1.2 Plug-in Methods 335
10.2.1.3 Maximal Smoothing Principle 338
10.2.2 Choice of Kernel 339
10.2.2.1 Epanechnikov Kernel 339
10.2.2.2 Canonical Kernels and Rescalings 340
10.3 Nonkernel Methods 341
10.3.1 Logspline 341
10.4 Multivariate Methods 345
10.4.1 The Nature of the Problem 345
10.4.2 Multivariate Kernel Estimators 346
10.4.3 Adaptive Kernels and Nearest Neighbors 348
10.4.3.1 Nearest Neighbor Approaches 349
10.4.3.2 Variable-Kernel Approaches and Transformations 350
10.4.4 Exploratory Projection Pursuit 353
Problems 359
11 BIVARIATE SMOOTHING 363
11.1 Predictor–Response Data 363
11.2 Linear Smoothers 365
11.2.1 Constant-Span Running Mean 366
11.2.1.1 Effect of Span 368
11.2.1.2 Span Selection for Linear Smoothers 369
11.2.2 Running Lines and Running Polynomials 372
11.2.3 Kernel Smoothers 374
11.2.4 Local Regression Smoothing 374
11.2.5 Spline Smoothing 376
11.2.5.1 Choice of Penalty 377
11.3 Comparison of Linear Smoothers 377
11.4 Nonlinear Smoothers 379
11.4.1 Loess 379
11.4.2 Supersmoother 381
11.5 Confidence Bands 384
11.6 General Bivariate Data 388
Problems 389
12 MULTIVARIATE SMOOTHING 393
12.1 Predictor–Response Data 393
12.1.1 Additive Models 394
12.1.2 Generalized Additive Models 397
12.1.3 Other Methods Related to Additive Models 399
12.1.3.1 Projection Pursuit Regression 399
12.1.3.2 Neural Networks 402
12.1.3.3 Alternating Conditional Expectations 403
12.1.3.4 Additivity and Variance Stabilization 404
12.1.4 Tree-Based Methods 405
12.1.4.1 Recursive Partitioning: Regression Trees 406
12.1.4.2 Tree Pruning 409
12.1.4.3 Classification Trees 411
12.1.4.4 Other Issues for Tree-Based Methods 412
12.2 General Multivariate Data 413
12.2.1 Principal Curves 413
12.2.1.1 Definition and Motivation 413
12.2.1.2 Estimation 415
12.2.1.3 Span Selection 416
Problems 416
DATA ACKNOWLEDGMENTS 421
REFERENCES 423
INDEX 457
PREFACE
This book covers most topics needed to develop a broad and thorough working knowledge of modern computational statistics. We seek to develop a practical understanding of how and why existing methods work, enabling readers to use modern statistical methods effectively. Since many new methods are built from components of existing techniques, our ultimate goal is to provide scientists with the tools they need to contribute new ideas to the field.

A growing challenge in science is that there is so much of it. While the pursuit of important new methods and the honing of existing approaches is a worthy goal, there is also a need to organize and distill the teeming jungle of ideas. We attempt to do that here. Our choice of topics reflects our view of what constitutes the core of the evolving field of computational statistics, and what will be interesting and useful for our readers.

Our use of the adjective modern in the first sentence of this preface is potentially troublesome. There is no way that this book can cover all the latest, greatest techniques. We have not even tried. We have instead aimed to provide a reasonably up-to-date survey of a broad portion of the field, while leaving room for diversions and esoterica.

The foundations of optimization and numerical integration are covered in this book. We include these venerable topics because (i) they are cornerstones of frequentist and Bayesian inference; (ii) routine application of available software often fails for hard problems; and (iii) the methods themselves are often secondary components of other statistical computing algorithms. Some topics we have omitted represent important areas of past and present research in the field, but their priority here is lowered by the availability of high-quality software. For example, the generation of pseudo-random numbers is a classic topic, but one that we prefer to address by giving students reliable software. Finally, some topics (e.g., principal curves and tabu search) are included simply because they are interesting and provide very different perspectives on familiar problems. Perhaps a future researcher may draw ideas from such topics to design a creative and effective new algorithm.

In this second edition we have both updated and broadened our coverage, and we now provide computer code. For example, we have added new MCMC topics to reflect continued activity in that popular area. A notable increase in breadth is our inclusion of more methods relevant for problems where statistical dependency is important, such as block bootstrapping and sequential importance sampling. This second edition provides extensive new support in R. Specifically, code for the examples in this book is available from the book website, www.stat.colostate.edu/computationalstatistics.

Our target audience includes graduate students in statistics and related fields, statisticians, and quantitative empirical scientists in other fields. We hope such readers may use the book when applying standard methods and developing new methods.
The level of mathematics expected of the reader does not extend much beyond Taylor series and linear algebra. Breadth of mathematical training is more helpful than depth. Essential review is provided in Chapter 1. More advanced readers will find greater mathematical detail in the wide variety of high-quality books available on specific topics, many of which are referenced in the text. Other readers, caring less about analytical details, may prefer to focus on our descriptions of algorithms and examples.

The expected level of statistics is equivalent to that obtained by a graduate student in his or her first year of study of the theory of statistics and probability. An understanding of maximum likelihood methods, Bayesian methods, elementary asymptotic theory, Markov chains, and linear models is most important. Many of these topics are reviewed in Chapter 1.

With respect to computer programming, we find that good students can learn as they go. However, a working knowledge of a suitable language allows implementation of the ideas covered in this book to progress much more quickly. We have chosen to forgo any language-specific examples, algorithms, or coding in the text. For those wishing to learn a language while they study this book, we recommend that you choose a high-level, interactive package that permits the flexible design of graphical displays and includes supporting statistics and probability functions, such as R and MATLAB.¹ These are the sort of languages often used by researchers during the development of new statistical computing techniques, and they are suitable for implementing all the methods we describe, except in some cases for problems of vast scope or complexity. We use R and recommend it. Although lower-level languages such as C++ could also be used, they are more appropriate for professional-grade implementation of algorithms after researchers have refined the methodology.

The book is organized into four major parts: optimization (Chapters 2, 3, and 4), integration and simulation (Chapters 5, 6, 7, and 8), bootstrapping (Chapter 9), and density estimation and smoothing (Chapters 10, 11, and 12). The chapters are written to stand independently, so a course can be built by selecting the topics one wishes to teach. For a one-semester course, our selection typically weights most heavily topics from Chapters 2, 3, 6, 7, 9, 10, and 11. With a leisurely pace or more thorough coverage, a shorter list of topics could still easily fill a semester course. There is sufficient material here to provide a thorough one-year course of study, notwithstanding any supplemental topics one might wish to teach.

A variety of homework problems are included at the end of each chapter. Some are straightforward, while others require the student to develop a thorough understanding of the model/method being used, to carefully (and perhaps cleverly) code a suitable technique, and to devote considerable attention to the interpretation of results. A few exercises invite open-ended exploration of methods and ideas. We are sometimes asked for solutions to the exercises, but we prefer to sequester them to preserve the challenge for future students and readers.

The datasets discussed in the examples and exercises are available from the book website, www.stat.colostate.edu/computationalstatistics. The R code is also provided there. Finally, the website includes an errata. Responsibility for all errors lies with us.

¹R is available for free from www.r-project.org. Information about MATLAB can be found at www.mathworks.com.
ACKNOWLEDGMENTS
The course upon which this book is based was developed and taught by us at Colorado State University from 1994 onwards. Thanks are due to our many students, who have been semiwilling guinea pigs over the years. We also thank our colleagues in the Statistics Department for their continued support. The late Richard Tweedie merits particular acknowledgment for his mentoring during the early years of our careers.

We owe a great deal of intellectual debt to Adrian Raftery, who deserves special thanks not only for his teaching and advising, but also for his unwavering support and his seemingly inexhaustible supply of good ideas. In addition, we thank our influential advisors and teachers at the University of Washington Statistics Department, including David Madigan, Werner Stuetzle, and Judy Zeh. Of course, each of our chapters could be expanded into a full-length book, and great scholars have already done so. We owe much to their efforts, upon which we relied when developing our course and our manuscript.

Portions of the first edition were written at the Department of Mathematics and Statistics, University of Otago, in Dunedin, New Zealand, whose faculty we thank for graciously hosting us during our sabbatical in 2003. Much of our work on the second edition was undertaken during our sabbatical visit to the Australia Commonwealth Scientific and Research Organization in 2009–2010, sponsored by CSIRO Mathematics, Informatics and Statistics, and hosted at the Longpocket Laboratory in Indooroopilly, Australia. We thank our hosts and colleagues there for their support.

Our manuscript has been greatly improved through the constructive reviews of John Bickham, Ben Bird, Kate Cowles, Jan Hannig, Alan Herlihy, David Hunter, Devin Johnson, Michael Newton, Doug Nychka, Steve Sain, David W. Scott, N. Scott Urquhart, Haonan Wang, Darrell Whitley, and eight anonymous referees. We also thank the sharp-eyed readers listed in the errata for their suggestions and corrections. Our editor Steve Quigley and the folks at Wiley were supportive and helpful during the publication process. We thank Nelida Pohl for permission to adapt her photograph in the cover design of the first edition. We thank Melinda Stelzer for permission to use her painting "Champagne Circuit," 2001, for the cover of the second edition. More about her art can be found at www.facebook.com/geekchicart. We also owe a special note of thanks to Zube (a.k.a. John Dzubera), who kept our own computers running despite our best efforts to the contrary.

Funding from National Science Foundation (NSF) CAREER grant SBR-9875508 was a significant source of support for the first author during the preparation of the first edition. He also thanks his colleagues and friends in the North Slope Borough, Alaska, Department of Wildlife Management for their longtime research support. The second author gratefully acknowledges the support of STAR Research
Assistance Agreement CR-829095 awarded to Colorado State University by the U.S. Environmental Protection Agency (EPA). The views expressed here are solely those of the authors. NSF and EPA do not endorse any products or commercial services mentioned herein.

Finally, we thank our parents for enabling and supporting our educations, and for providing us with the "stubbornness gene" necessary for graduate school, the tenure track, or book publication: take your pick. The second edition is dedicated to our kids, Natalie and Neil, for continuing to show us what is important and what is not.
Geof H. Givens
Jennifer A. Hoeting
CHAPTER 1
REVIEW
This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.
1.1 MATHEMATICAL NOTATION
We use boldface to distinguish a vector x = (x1, ..., xp) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f1(x), ..., fp(x)). The transpose of M is denoted M^T.

Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x1, ..., xn)^T. Let I denote an identity matrix, and 1 and 0 denote vectors of ones and zeros, respectively.

A symmetric square matrix M is positive definite if x^T M x > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite or positive semidefinite if x^T M x ≥ 0 for all nonzero vectors x.
The derivative of a function f, evaluated at x, is denoted f'(x). When x = (x1, ..., xp), the gradient of f at x is

f'(x) = \left( \frac{df(x)}{dx_1}, \ldots, \frac{df(x)}{dx_p} \right).
The Hessian matrix for f at x is f''(x), having (i, j)th element equal to d²f(x)/(dx_i dx_j). The negative Hessian has important uses in statistical inference.

Let J(x) denote the Jacobian matrix evaluated at x for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to df_i(x)/dx_j.

A functional is a real-valued function on a space of functions. For example, if T(f) = \int_0^1 f(x)\, dx, then the functional T maps suitably integrable functions onto the real line.

The indicator function 1_{A} equals 1 if A is true and 0 otherwise. The real line is denoted ℝ, and p-dimensional real space is ℝ^p.
Computational Statistics, Second Edition. Geof H. Givens and Jennifer A. Hoeting.
© 2013 John Wiley & Sons, Inc. Published 2013 by John Wiley & Sons, Inc.
1.2 TAYLOR'S THEOREM AND MATHEMATICAL LIMIT THEORY
First we define standard "big oh" and "little oh" notation for describing the relative orders of convergence of functions. Let the functions f and g be defined on a common, possibly infinite interval. Let z0 be a point in this interval or a boundary point of it (i.e., −∞ or ∞). We require g(z) ≠ 0 for all z ≠ z0 in a neighborhood of z0. Then we say

f(z) = O(g(z))   (1.1)

if there exists a constant M such that |f(z)| ≤ M|g(z)| as z → z0. For example, (n + 1)/(3n²) = O(n⁻¹), where it is understood that we are considering n → ∞. If lim_{z→z0} f(z)/g(z) = 0, then we say

f(z) = o(g(z))   (1.2)

For example, f(x0 + h) − f(x0) = hf'(x0) + o(h) as h → 0 if f is differentiable at x0. The same notation can be used for describing the convergence of a sequence {x_n} as n → ∞, by letting f(n) = x_n.
Taylor's theorem provides a polynomial approximation to a function f. Suppose f has finite (n + 1)th derivative on (a, b) and continuous nth derivative on [a, b]. Then for any x0 ∈ [a, b] distinct from x, the Taylor series expansion of f about x0 is

f(x) = \sum_{i=0}^{n} \frac{1}{i!} f^{(i)}(x_0)(x - x_0)^i + R_n,   (1.3)

where f^{(i)}(x_0) is the ith derivative of f evaluated at x_0, and

R_n = \frac{1}{(n+1)!} f^{(n+1)}(\xi)(x - x_0)^{n+1}   (1.4)

for some point ξ in the interval between x and x0. As |x − x0| → 0, note that R_n = O(|x − x0|^{n+1}).
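The book's supporting code is in R; purely as an illustration of (1.3) and (1.4), the following Python sketch (our own, not from the text) evaluates a degree-4 Taylor polynomial for f(x) = exp(x) about x0 = 0 and checks the error against the Lagrange remainder bound.

```python
import math

def taylor_poly(f_derivs, x0, x, n):
    """Evaluate the degree-n Taylor polynomial of f about x0 at x,
    given a function f_derivs(i, t) returning the ith derivative at t."""
    return sum(f_derivs(i, x0) * (x - x0) ** i / math.factorial(i)
               for i in range(n + 1))

# For f(x) = exp(x), every derivative is exp itself.
exp_derivs = lambda i, t: math.exp(t)

x0, x, n = 0.0, 0.5, 4
approx = taylor_poly(exp_derivs, x0, x, n)
err = abs(math.exp(x) - approx)

# Bound on |R_n| from (1.4): the (n+1)th derivative on [x0, x] is at most exp(x).
bound = math.exp(x) * abs(x - x0) ** (n + 1) / math.factorial(n + 1)
print(approx, err, bound)  # the error respects the remainder bound
```

The error shrinks like |x − x0|^{n+1}, matching the order claimed after (1.4).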
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x0 ≠ x. Then

f(x) = f(x_0) + \sum_{i=1}^{n} \frac{1}{i!} D^{(i)}(f; x_0, x - x_0) + R_n,   (1.5)

where

D^{(i)}(f; x, y) = \sum_{j_1=1}^{p} \cdots \sum_{j_i=1}^{p} \left( \frac{d^i}{dt_{j_1} \cdots dt_{j_i}} f(t) \bigg|_{t=x} \right) \prod_{k=1}^{i} y_{j_k}   (1.6)
and

R_n = \frac{1}{(n+1)!} D^{(n+1)}(f; \xi, x - x_0)   (1.7)

for some ξ on the line segment joining x and x0. As |x − x0| → 0, note that R_n = O(|x − x0|^{n+1}).
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

\int_0^1 f(x)\, dx = \frac{f(0) + f(1)}{2} - \sum_{i=1}^{n-1} \frac{b_{2i}\left( f^{(2i-1)}(1) - f^{(2i-1)}(0) \right)}{(2i)!} - \frac{b_{2n}\, f^{(2n)}(\xi)}{(2n)!},   (1.8)

where 0 ≤ ξ ≤ 1, f^{(j)} is the jth derivative of f, and b_j = B_j(0) can be determined using the recursion relation

\sum_{j=0}^{m} \binom{m+1}{j} B_j(z) = (m+1) z^m,   (1.9)

initialized with B_0(z) = 1. The proof of this result is based on repeated integrations by parts [376].
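As a small illustration (ours, in Python rather than the book's R), the recursion (1.9) evaluated at z = 0 yields the numbers b_j, and the b_2 correction in (1.8) visibly sharpens the simple endpoint average for the integral of exp over [0, 1].

```python
import math

# Bernoulli numbers b_j = B_j(0) via the recursion
# sum_{j=0}^{m} C(m+1, j) B_j(z) = (m+1) z^m, evaluated at z = 0.
def bernoulli_numbers(m_max):
    b = [1.0]
    for m in range(1, m_max + 1):
        # At z = 0 the right-hand side is 0 for m >= 1, so solve for b_m.
        s = sum(math.comb(m + 1, j) * b[j] for j in range(m))
        b.append(-s / (m + 1))
    return b

b = bernoulli_numbers(4)  # [1, -1/2, 1/6, 0, -1/30]

# Euler-Maclaurin with n = 2 for f = exp on [0, 1]:
# endpoint average minus the b_2 correction term.
f = math.exp
approx = (f(0) + f(1)) / 2 - b[2] * (f(1) - f(0)) / math.factorial(2)
truth = math.e - 1.0
gap = truth - approx

# What remains is the b_4 remainder term, f''''(xi)/720 for some xi in [0, 1].
print(approx, gap)  # gap lies between 1/720 and e/720
```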
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

\frac{df(x)}{dx_i} \approx \frac{f(x + \epsilon_i e_i) - f(x - \epsilon_i e_i)}{2\epsilon_i},   (1.10)

where ε_i is a small number and e_i is the unit vector in the ith coordinate direction. Typically, one might start with, say, ε_i = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller ε_i. The approximation will generally improve until ε_i becomes small enough that the calculation is degraded and eventually dominated by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via

\frac{d^2 f(x)}{dx_i\, dx_j} \approx \frac{1}{4\epsilon_i \epsilon_j} \Big( f(x + \epsilon_i e_i + \epsilon_j e_j) - f(x + \epsilon_i e_i - \epsilon_j e_j) - f(x - \epsilon_i e_i + \epsilon_j e_j) + f(x - \epsilon_i e_i - \epsilon_j e_j) \Big)   (1.11)

with similar sequential precision improvements.
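A minimal Python sketch of the central-difference approximation (1.10) follows (our illustration; the book's code is in R, and the test function here is ours).

```python
def grad_fd(f, x, eps=1e-5):
    """Central-difference approximation (1.10) to the gradient of f at x."""
    p = len(x)
    g = []
    for i in range(p):
        xp = list(x); xm = list(x)
        xp[i] += eps   # step +eps along the ith coordinate
        xm[i] -= eps   # step -eps along the ith coordinate
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g

# Example: f(x) = x1^2 + 3*x1*x2, with exact gradient (2*x1 + 3*x2, 3*x1).
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
g = grad_fd(f, [1.0, 2.0])
print(g)  # close to [8.0, 3.0]
```

Because f is quadratic, the central difference is exact up to roundoff here; for general f the error behaves as described above as ε shrinks.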
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS
We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ~ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters also will be denoted with a conditioning bar, as in f(x|α, β). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|α) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example, by using f(·|α) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are fX and fY, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.

The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or fX|Y(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|μ) = f(x|y, μ)f(y|μ). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.

The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1_{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^{1/2}/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so

g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y)   (1.12)

for all x, y ∈ I and all 0 < λ < 1. Then Jensen's inequality states that E{g(X)} ≥ g(E{X}) for any random variable X having P[X ∈ I] = 1.
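Jensen's inequality is easy to check numerically for a simple discrete random variable; the following Python fragment (our illustration, not from the text) uses the convex function g(x) = exp(x).

```python
import math

# Jensen's inequality with g(x) = exp(x) and a two-point random
# variable X taking values 0 and 2, each with probability 1/2.
values, probs = [0.0, 2.0], [0.5, 0.5]

e_x = sum(p * v for p, v in zip(probs, values))              # E{X} = 1
e_gx = sum(p * math.exp(v) for p, v in zip(probs, values))   # E{g(X)}
print(e_gx, math.exp(e_x))  # E{g(X)} >= g(E{X})
```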
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

n! = n(n − 1)(n − 2) \cdots (3)(2)(1), with 0! = 1,   (1.13)

\binom{n}{k} = \frac{n!}{k!(n-k)!},   (1.14)
TABLE 1.1 Notation and description for common probability distributions of discrete random variables.

Bernoulli: X ~ Bernoulli(p), 0 ≤ p ≤ 1.
Density: f(x) = p^x (1 − p)^{1−x}, x = 0 or 1.
Moments: E{X} = p; var{X} = p(1 − p).

Binomial: X ~ Bin(n, p), 0 ≤ p ≤ 1, n ∈ {1, 2, ...}.
Density: f(x) = \binom{n}{x} p^x (1 − p)^{n−x}, x = 0, 1, ..., n.
Moments: E{X} = np; var{X} = np(1 − p).

Multinomial: X ~ Multinomial(n, p), p = (p1, ..., pk), 0 ≤ pi ≤ 1 with Σ_{i=1}^k pi = 1, and n ∈ {1, 2, ...}.
Density: f(x) = \binom{n}{x_1, \ldots, x_k} \prod_{i=1}^{k} p_i^{x_i}, x = (x1, ..., xk), xi ∈ {0, 1, ..., n}, Σ_{i=1}^k xi = n.
Moments: E{X} = np; var{Xi} = n pi(1 − pi); cov{Xi, Xj} = −n pi pj.

Negative Binomial: X ~ NegBin(r, p), 0 ≤ p ≤ 1, r ∈ {1, 2, ...}.
Density: f(x) = \binom{x + r - 1}{r - 1} p^r (1 − p)^x, x ∈ {0, 1, ...}.
Moments: E{X} = r(1 − p)/p; var{X} = r(1 − p)/p².

Poisson: X ~ Poisson(λ), λ > 0.
Density: f(x) = (λ^x / x!) exp{−λ}, x ∈ {0, 1, ...}.
Moments: E{X} = λ; var{X} = λ.
TABLE 1.2 Notation and description for some common probability distributions of continuous random variables.

Beta: X ~ Beta(α, β), α > 0 and β > 0.
Density: f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1 - x)^{\beta-1}, 0 ≤ x ≤ 1.
Moments: E{X} = α/(α + β); var{X} = αβ/[(α + β)²(α + β + 1)].

Cauchy: X ~ Cauchy(α, β), α ∈ ℝ and β > 0.
Density: f(x) = \frac{1}{\pi\beta \left[ 1 + \left( \frac{x - \alpha}{\beta} \right)^2 \right]}, x ∈ ℝ.
Moments: E{X} and var{X} are nonexistent.

Chi-square: X ~ χ²_ν, ν > 0.
Density: same as Gamma(ν/2, 1/2), x > 0.
Moments: E{X} = ν; var{X} = 2ν.

Dirichlet: X ~ Dirichlet(α), α = (α1, ..., αk), αi > 0, with α0 = Σ_{i=1}^k αi.
Density: f(x) = \frac{\Gamma(\alpha_0) \prod_{i=1}^{k} x_i^{\alpha_i - 1}}{\prod_{i=1}^{k} \Gamma(\alpha_i)}, x = (x1, ..., xk), 0 ≤ xi ≤ 1, Σ_{i=1}^k xi = 1.
Moments: E{X} = α/α0; var{Xi} = αi(α0 − αi)/[α0²(α0 + 1)]; cov{Xi, Xj} = −αiαj/[α0²(α0 + 1)].

Exponential: X ~ Exp(λ), λ > 0.
Density: f(x) = λ exp{−λx}, x > 0.
Moments: E{X} = 1/λ; var{X} = 1/λ².

Gamma: X ~ Gamma(r, λ), λ > 0 and r > 0.
Density: f(x) = \frac{\lambda^r x^{r-1}}{\Gamma(r)} \exp\{-\lambda x\}, x > 0.
Moments: E{X} = r/λ; var{X} = r/λ².
TABLE 1.3 Notation and description for more common probability distributions of continuous random variables.

Lognormal: X ~ Lognormal(μ, σ²), μ ∈ ℝ and σ > 0.
Density: f(x) = \frac{1}{x\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2} \left( \frac{\log x - \mu}{\sigma} \right)^2 \right\}, x > 0.
Moments: E{X} = exp{μ + σ²/2}; var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}.

Multivariate Normal: X ~ N_k(μ, Σ), μ = (μ1, ..., μk) ∈ ℝ^k, Σ positive definite.
Density: f(x) = \frac{\exp\{ -(x - \mu)^T \Sigma^{-1} (x - \mu)/2 \}}{(2\pi)^{k/2} |\Sigma|^{1/2}}, x = (x1, ..., xk) ∈ ℝ^k.
Moments: E{X} = μ; var{X} = Σ.

Normal: X ~ N(μ, σ²), μ ∈ ℝ and σ > 0.
Density: f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right\}, x ∈ ℝ.
Moments: E{X} = μ; var{X} = σ².

Student's t: X ~ t_ν, ν > 0.
Density: f(x) = \frac{\Gamma((\nu + 1)/2)}{\Gamma(\nu/2)\sqrt{\pi\nu}} \left( 1 + \frac{x^2}{\nu} \right)^{-(\nu+1)/2}, x ∈ ℝ.
Moments: E{X} = 0 if ν > 1; var{X} = ν/(ν − 2) if ν > 2.

Uniform: X ~ Unif(a, b), a, b ∈ ℝ and a < b.
Density: f(x) = 1/(b − a), x ∈ [a, b].
Moments: E{X} = (a + b)/2; var{X} = (b − a)²/12.

Weibull: X ~ Weibull(a, b), a > 0 and b > 0.
Density: f(x) = abx^{b−1} exp{−ax^b}, x > 0.
Moments: E{X} = Γ(1 + 1/b)/a^{1/b}; var{X} = [Γ(1 + 2/b) − Γ(1 + 1/b)²]/a^{2/b}.
\binom{n}{k_1, \ldots, k_m} = \frac{n!}{\prod_{i=1}^{m} k_i!}, \quad \text{where } n = \sum_{i=1}^{m} k_i,   (1.15)

\Gamma(r) = \begin{cases} (r-1)! & \text{for } r = 1, 2, \ldots, \\ \int_0^\infty t^{r-1} \exp\{-t\}\, dt & \text{for general } r > 0. \end{cases}   (1.16)
It is worth knowing that

\Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi} \quad \text{and} \quad \Gamma\left(n + \tfrac{1}{2}\right) = \frac{1 \times 3 \times 5 \times \cdots \times (2n-1)\sqrt{\pi}}{2^n}

for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as

f(x|\gamma) = c_1(x)\, c_2(\gamma) \exp\left\{ \sum_{i=1}^{k} y_i(x)\, \theta_i(\gamma) \right\}   (1.17)
for nonnegative functions c1 and c2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θ_i(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The y_i(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that

E{y(X)} = κ'(θ)   (1.18)

and

var{y(X)} = κ''(θ),   (1.19)

where κ(θ) = −log c₃(θ), letting c₃(θ) denote the reexpression of c₂(γ) in terms of the canonical parameters θ = (θ1, ..., θk), and y(X) = (y1(X), ..., yk(X)). These results can be rewritten in terms of the original parameters γ as

E\left\{ \sum_{i=1}^{k} \frac{d\theta_i(\gamma)}{d\gamma_j}\, y_i(X) \right\} = -\frac{d}{d\gamma_j} \log c_2(\gamma)   (1.20)

and

var\left\{ \sum_{i=1}^{k} \frac{d\theta_i(\gamma)}{d\gamma_j}\, y_i(X) \right\} = -\frac{d^2}{d\gamma_j^2} \log c_2(\gamma) - E\left\{ \sum_{i=1}^{k} \frac{d^2\theta_i(\gamma)}{d\gamma_j^2}\, y_i(X) \right\}.   (1.21)
Example 1.1 (Poisson) The Poisson distribution belongs to the exponential family, with c1(x) = 1/x!, c2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ'(θ) = exp{θ} = λ and var{X} = κ''(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
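The identities E{y(X)} = κ'(θ) and var{y(X)} = κ''(θ) can be verified numerically for the Poisson case. In this Python sketch (ours, not the book's R code), the derivatives of κ(θ) = exp{θ} are approximated by finite differences as in (1.10) and compared with the mean and variance computed directly from the mass function.

```python
import math

lam = 3.0
theta = math.log(lam)
kappa = lambda t: math.exp(t)  # kappa(theta) for the Poisson family

# Finite-difference first and second derivatives of kappa (cf. (1.10))
eps = 1e-5
k1 = (kappa(theta + eps) - kappa(theta - eps)) / (2 * eps)
k2 = (kappa(theta + eps) - 2 * kappa(theta) + kappa(theta - eps)) / eps ** 2

# Direct mean and variance from the Poisson mass function
# (the tail beyond x = 59 is negligible for lambda = 3).
pmf = lambda x: lam ** x * math.exp(-lam) / math.factorial(x)
mean = sum(x * pmf(x) for x in range(60))
var = sum((x - mean) ** 2 * pmf(x) for x in range(60))

print(k1, mean)  # both are lambda = 3
print(k2, var)
```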
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X1, ..., Xp) denote a p-dimensional random variable with continuous density function f. Suppose that

U = g(X) = (g1(X), ..., gp(X)) = (U1, ..., Up),   (1.22)

where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

f(u) = f(g^{-1}(u)) |J(u)|,   (1.23)

where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g^{-1} evaluated at u, having (i, j)th element dx_i/du_j, where these derivatives are assumed to be continuous over the support region of U.
1.4 LIKELIHOOD INFERENCE
If X1, ..., Xn are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ1, ..., θp), then the joint likelihood function is

L(\theta) = \prod_{i=1}^{n} f(x_i|\theta).   (1.24)

When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x1, ..., xn|θ) viewed as a function of θ.

The observed data, x1, ..., xn, might have been realized under many different values for θ. The parameters for which observing x1, ..., xn would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x1, ..., xn that maximizes L(θ), then θ̂ = ϑ(X1, ..., Xn) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.
It is typically easier to work with the log likelihood function

l(θ) = log L(θ),   (1.25)

which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x1, ..., xn but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations

l'(θ) = 0,   (1.26)

where

l'(\theta) = \left( \frac{dl(\theta)}{d\theta_1}, \ldots, \frac{dl(\theta)}{d\theta_p} \right)
is called the score function. The score function satisfies

E{l'(θ)} = 0,   (1.27)

where the expectation is taken with respect to the distribution of X1, ..., Xn. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.
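For the Poisson model of Example 1.1, the score equation can be solved in closed form, yielding the sample mean. The following Python sketch (ours, not the book's R code) confirms on a small fixed dataset that the score vanishes at the MLE and that the MLE beats nearby parameter values.

```python
import math

# Poisson log likelihood, score, and MLE for a small fixed dataset.
data = [2, 4, 3, 1, 5]
n = len(data)

loglik = lambda lam: sum(x * math.log(lam) - lam - math.log(math.factorial(x))
                         for x in data)
score = lambda lam: sum(x / lam - 1 for x in data)  # dl/dlambda

mle = sum(data) / n          # closed form: the sample mean
print(mle, score(mle))       # the score vanishes at the MLE

# The MLE dominates nearby values of lambda:
print(loglik(mle) > loglik(mle - 0.5), loglik(mle) > loglik(mle + 0.5))
```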
The MLE has a sampling distribution because it depends on the realization of the random variables X1, ..., Xn. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: when the log likelihood is very pointy, the location of the maximum is more precisely known.
To make this precise, let l''(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθ_i dθ_j). The Fisher information matrix is defined as

I(\theta) = E\{l'(\theta)\, l'(\theta)^T\} = -E\{l''(\theta)\},   (1.28)

where the expectations are taken with respect to the distribution of X1, ..., Xn. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l''(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.

Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)⁻¹, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is N_p(θ*, I(θ*)⁻¹). Since the true parameter values are unknown, I(θ*)⁻¹ must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)⁻¹. Alternatively, it is also reasonable to use −l''(θ̂)⁻¹. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)⁻¹. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)⁻¹ can be found in [127, 182, 371, 470].
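As a small Python illustration (ours, with a made-up dataset), the observed Fisher information for a Poisson model evaluated at the MLE (the sample mean) yields a standard-error estimate for the MLE.

```python
import math

# Observed Fisher information for the Poisson model on a fixed dataset,
# and the resulting standard error of the MLE.
data = [2, 4, 3, 1, 5]
n = len(data)
mle = sum(data) / n

# For the Poisson, l''(lambda) = -sum(x)/lambda^2,
# so the observed information -l''(lambda) is sum(x)/lambda^2.
obs_info = sum(data) / mle ** 2
se = math.sqrt(1.0 / obs_info)
print(obs_info, se)  # equals n/mle and sqrt(mle/n) at the MLE
```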
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is

L(\phi|\hat{\mu}(\phi)) = \max_{\mu} L(\mu, \phi).   (1.29)

Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
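A sketch of (1.29) in Python (our illustration, with an invented dataset) for a Normal(μ, σ²) model, profiling over μ: for every value of σ², the maximizing μ is the sample mean, so the inner maximization is available in closed form.

```python
import math

# Profile log likelihood for sigma^2 in a Normal(mu, sigma^2) model:
# for each sigma^2, mu is set to its maximizing value (the sample mean).
data = [1.2, 0.7, 2.1, 1.6, 0.9]
n = len(data)
xbar = sum(data) / n

def loglik(mu, s2):
    return sum(-0.5 * math.log(2 * math.pi * s2) - (x - mu) ** 2 / (2 * s2)
               for x in data)

def profile_loglik(s2):
    # Inner maximization over mu; for the normal model the optimum is xbar.
    return loglik(xbar, s2)

# The profile dominates the full log likelihood at any other mu:
s2 = 0.8
print(profile_loglik(s2) >= loglik(xbar + 0.3, s2))  # True
```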
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993, or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data
Givens, Geof H.
Computational statistics / Geof H. Givens, Jennifer A. Hoeting. – 2nd ed.
p. cm.
Includes index.
ISBN 978-0-470-53331-4 (cloth)
1. Mathematical statistics – Data processing. I. Hoeting, Jennifer A. (Jennifer Ann), 1966– II. Title.
QA276.4.G58 2013
519.5–dc23
2012017381
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
To Natalie and Neil
CONTENTS
PREFACE xv
ACKNOWLEDGMENTS xvii
1 REVIEW 1
1.1 Mathematical Notation 1
1.2 Taylor's Theorem and Mathematical Limit Theory 2
1.3 Statistical Notation and Probability Distributions 4
1.4 Likelihood Inference 9
1.5 Bayesian Inference 11
1.6 Statistical Limit Theory 13
1.7 Markov Chains 14
1.8 Computing 17
PART I
OPTIMIZATION
2 OPTIMIZATION AND SOLVING NONLINEAR EQUATIONS 21
2.1 Univariate Problems 22
2.1.1 Newton's Method 26
2.1.1.1 Convergence Order 29
2.1.2 Fisher Scoring 30
2.1.3 Secant Method 30
2.1.4 Fixed-Point Iteration 32
2.1.4.1 Scaling 33
2.2 Multivariate Problems 34
2.2.1 Newton's Method and Fisher Scoring 34
2.2.1.1 Iteratively Reweighted Least Squares 36
2.2.2 Newton-Like Methods 39
2.2.2.1 Ascent Algorithms 39
2.2.2.2 Discrete Newton and Fixed-Point Methods 41
2.2.2.3 Quasi-Newton Methods 41
2.2.3 Gauss–Newton Method 44
2.2.4 Nelder–Mead Algorithm 45
2.2.5 Nonlinear Gauss–Seidel Iteration 52
Problems 54
3 COMBINATORIAL OPTIMIZATION 59
3.1 Hard Problems and NP-Completeness 59
3.1.1 Examples 61
3.1.2 Need for Heuristics 64
3.2 Local Search 65
3.3 Simulated Annealing 68
3.3.1 Practical Issues 70
3.3.1.1 Neighborhoods and Proposals 70
3.3.1.2 Cooling Schedule and Convergence 71
3.3.2 Enhancements 74
3.4 Genetic Algorithms 75
3.4.1 Definitions and the Canonical Algorithm 75
3.4.1.1 Basic Definitions 75
3.4.1.2 Selection Mechanisms and Genetic Operators 76
3.4.1.3 Allele Alphabets and Genotypic Representation 78
3.4.1.4 Initialization, Termination, and Parameter Values 79
3.4.2 Variations 80
3.4.2.1 Fitness 80
3.4.2.2 Selection Mechanisms and Updating Generations 81
3.4.2.3 Genetic Operators and Permutation Chromosomes 82
3.4.3 Initialization and Parameter Values 84
3.4.4 Convergence 84
3.5 Tabu Algorithms 85
3.5.1 Basic Definitions 86
3.5.2 The Tabu List 87
3.5.3 Aspiration Criteria 88
3.5.4 Diversification 89
3.5.5 Intensification 90
3.5.6 Comprehensive Tabu Algorithm 91
Problems 92
4 EM OPTIMIZATION METHODS 97
4.1 Missing Data, Marginalization, and Notation 97
4.2 The EM Algorithm 98
4.2.1 Convergence 102
4.2.2 Usage in Exponential Families 105
4.2.3 Variance Estimation 106
4.2.3.1 Louis's Method 106
4.2.3.2 SEM Algorithm 108
4.2.3.3 Bootstrapping 110
4.2.3.4 Empirical Information 110
4.2.3.5 Numerical Differentiation 111
4.3 EM Variants 111
4.3.1 Improving the E Step 111
4.3.1.1 Monte Carlo EM 111
4.3.2 Improving the M Step 112
4.3.2.1 ECM Algorithm 113
4.3.2.2 EM Gradient Algorithm 116
4.3.3 Acceleration Methods 118
4.3.3.1 Aitken Acceleration 118
4.3.3.2 Quasi-Newton Acceleration 119
Problems 121
PART II
INTEGRATION AND SIMULATION
5 NUMERICAL INTEGRATION 129
5.1 Newton–Cotes Quadrature 129
5.1.1 Riemann Rule 130
5.1.2 Trapezoidal Rule 134
5.1.3 Simpson's Rule 136
5.1.4 General kth-Degree Rule 138
5.2 Romberg Integration 139
5.3 Gaussian Quadrature 142
5.3.1 Orthogonal Polynomials 143
5.3.2 The Gaussian Quadrature Rule 143
5.4 Frequently Encountered Problems 146
5.4.1 Range of Integration 146
5.4.2 Integrands with Singularities or Other Extreme Behavior 146
5.4.3 Multiple Integrals 147
5.4.4 Adaptive Quadrature 147
5.4.5 Software for Exact Integration 148
Problems 148
6 SIMULATION AND MONTE CARLO INTEGRATION 151
6.1 Introduction to the Monte Carlo Method 151
6.2 Exact Simulation 152
6.2.1 Generating from Standard Parametric Families 153
6.2.2 Inverse Cumulative Distribution Function 153
6.2.3 Rejection Sampling 155
6.2.3.1 Squeezed Rejection Sampling 158
6.2.3.2 Adaptive Rejection Sampling 159
6.3 Approximate Simulation 163
6.3.1 Sampling Importance Resampling Algorithm 163
6.3.1.1 Adaptive Importance, Bridge, and Path Sampling 167
6.3.2 Sequential Monte Carlo 168
6.3.2.1 Sequential Importance Sampling for Markov Processes 169
6.3.2.2 General Sequential Importance Sampling 170
6.3.2.3 Weight Degeneracy, Rejuvenation, and Effective Sample Size 171
6.3.2.4 Sequential Importance Sampling for Hidden Markov Models 175
6.3.2.5 Particle Filters 179
6.4 Variance Reduction Techniques 180
6.4.1 Importance Sampling 180
6.4.2 Antithetic Sampling 186
6.4.3 Control Variates 189
6.4.4 Rao–Blackwellization 193
Problems 195
7 MARKOV CHAIN MONTE CARLO 201
7.1 Metropolis–Hastings Algorithm 202
7.1.1 Independence Chains 204
7.1.2 Random Walk Chains 206
7.2 Gibbs Sampling 209
7.2.1 Basic Gibbs Sampler 209
7.2.2 Properties of the Gibbs Sampler 214
7.2.3 Update Ordering 216
7.2.4 Blocking 216
7.2.5 Hybrid Gibbs Sampling 216
7.2.6 Griddy–Gibbs Sampler 218
7.3 Implementation 218
7.3.1 Ensuring Good Mixing and Convergence 219
7.3.1.1 Simple Graphical Diagnostics 219
7.3.1.2 Burn-in and Run Length 220
7.3.1.3 Choice of Proposal 222
7.3.1.4 Reparameterization 223
7.3.1.5 Comparing Chains: Effective Sample Size 224
7.3.1.6 Number of Chains 225
7.3.2 Practical Implementation Advice 226
7.3.3 Using the Results 226
Problems 230
8 ADVANCED TOPICS IN MCMC 237
8.1 Adaptive MCMC 237
8.1.1 Adaptive Random Walk Metropolis-within-Gibbs Algorithm 238
8.1.2 General Adaptive Metropolis-within-Gibbs Algorithm 240
8.1.3 Adaptive Metropolis Algorithm 247
8.2 Reversible Jump MCMC 250
8.2.1 RJMCMC for Variable Selection in Regression 253
8.3 Auxiliary Variable Methods 256
8.3.1 Simulated Tempering 257
8.3.2 Slice Sampler 258
8.4 Other Metropolis–Hastings Algorithms 260
8.4.1 Hit-and-Run Algorithm 260
8.4.2 Multiple-Try Metropolis–Hastings Algorithm 261
8.4.3 Langevin Metropolis–Hastings Algorithm 262
8.5 Perfect Sampling 264
8.5.1 Coupling from the Past 264
8.5.1.1 Stochastic Monotonicity and Sandwiching 267
8.6 Markov Chain Maximum Likelihood 268
8.7 Example: MCMC for Markov Random Fields 269
8.7.1 Gibbs Sampling for Markov Random Fields 270
8.7.2 Auxiliary Variable Methods for Markov Random Fields 274
8.7.3 Perfect Sampling for Markov Random Fields 277
Problems 279
PART III
BOOTSTRAPPING
9 BOOTSTRAPPING 287
9.1 The Bootstrap Principle 287
9.2 Basic Methods 288
9.2.1 Nonparametric Bootstrap 288
9.2.2 Parametric Bootstrap 289
9.2.3 Bootstrapping Regression 290
9.2.4 Bootstrap Bias Correction 291
9.3 Bootstrap Inference 292
9.3.1 Percentile Method 292
9.3.1.1 Justification for the Percentile Method 293
9.3.2 Pivoting 294
9.3.2.1 Accelerated Bias-Corrected Percentile Method, BCa 294
9.3.2.2 The Bootstrap t 296
9.3.2.3 Empirical Variance Stabilization 298
9.3.2.4 Nested Bootstrap and Prepivoting 299
9.3.3 Hypothesis Testing 301
9.4 Reducing Monte Carlo Error 302
9.4.1 Balanced Bootstrap 302
9.4.2 Antithetic Bootstrap 302
9.5 Bootstrapping Dependent Data 303
9.5.1 Model-Based Approach 304
9.5.2 Block Bootstrap 304
9.5.2.1 Nonmoving Block Bootstrap 304
9.5.2.2 Moving Block Bootstrap 306
9.5.2.3 Blocks-of-Blocks Bootstrapping 307
9.5.2.4 Centering and Studentizing 309
9.5.2.5 Block Size 311
9.6 Bootstrap Performance 315
9.6.1 Independent Data Case 315
9.6.2 Dependent Data Case 316
9.7 Other Uses of the Bootstrap 316
9.8 Permutation Tests 317
Problems 319
PART IV
DENSITY ESTIMATION AND SMOOTHING
10 NONPARAMETRIC DENSITY ESTIMATION 325
10.1 Measures of Performance 326
10.2 Kernel Density Estimation 327
10.2.1 Choice of Bandwidth 329
10.2.1.1 Cross-Validation 332
10.2.1.2 Plug-in Methods 335
10.2.1.3 Maximal Smoothing Principle 338
10.2.2 Choice of Kernel 339
10.2.2.1 Epanechnikov Kernel 339
10.2.2.2 Canonical Kernels and Rescalings 340
10.3 Nonkernel Methods 341
10.3.1 Logspline 341
10.4 Multivariate Methods 345
10.4.1 The Nature of the Problem 345
10.4.2 Multivariate Kernel Estimators 346
10.4.3 Adaptive Kernels and Nearest Neighbors 348
10.4.3.1 Nearest Neighbor Approaches 349
10.4.3.2 Variable-Kernel Approaches and Transformations 350
10.4.4 Exploratory Projection Pursuit 353
Problems 359
11 BIVARIATE SMOOTHING 363
11.1 Predictor–Response Data 363
11.2 Linear Smoothers 365
11.2.1 Constant-Span Running Mean 366
11.2.1.1 Effect of Span 368
11.2.1.2 Span Selection for Linear Smoothers 369
11.2.2 Running Lines and Running Polynomials 372
11.2.3 Kernel Smoothers 374
11.2.4 Local Regression Smoothing 374
11.2.5 Spline Smoothing 376
11.2.5.1 Choice of Penalty 377
11.3 Comparison of Linear Smoothers 377
11.4 Nonlinear Smoothers 379
11.4.1 Loess 379
11.4.2 Supersmoother 381
11.5 Confidence Bands 384
11.6 General Bivariate Data 388
Problems 389
12 MULTIVARIATE SMOOTHING 393
12.1 Predictor–Response Data 393
12.1.1 Additive Models 394
12.1.2 Generalized Additive Models 397
12.1.3 Other Methods Related to Additive Models 399
12.1.3.1 Projection Pursuit Regression 399
12.1.3.2 Neural Networks 402
12.1.3.3 Alternating Conditional Expectations 403
12.1.3.4 Additivity and Variance Stabilization 404
12.1.4 Tree-Based Methods 405
12.1.4.1 Recursive Partitioning: Regression Trees 406
12.1.4.2 Tree Pruning 409
12.1.4.3 Classification Trees 411
12.1.4.4 Other Issues for Tree-Based Methods 412
12.2 General Multivariate Data 413
12.2.1 Principal Curves 413
12.2.1.1 Definition and Motivation 413
12.2.1.2 Estimation 415
12.2.1.3 Span Selection 416
Problems 416
DATA ACKNOWLEDGMENTS 421
REFERENCES 423
INDEX 457
PREFACE
This book covers most topics needed to develop a broad and thorough working knowledge of modern computational statistics. We seek to develop a practical understanding of how and why existing methods work, enabling readers to use modern statistical methods effectively. Since many new methods are built from components of existing techniques, our ultimate goal is to provide scientists with the tools they need to contribute new ideas to the field.

A growing challenge in science is that there is so much of it. While the pursuit of important new methods and the honing of existing approaches is a worthy goal, there is also a need to organize and distill the teeming jungle of ideas. We attempt to do that here. Our choice of topics reflects our view of what constitutes the core of the evolving field of computational statistics, and what will be interesting and useful for our readers.

Our use of the adjective modern in the first sentence of this preface is potentially troublesome. There is no way that this book can cover all the latest, greatest techniques. We have not even tried. We have instead aimed to provide a reasonably up-to-date survey of a broad portion of the field, while leaving room for diversions and esoterica.

The foundations of optimization and numerical integration are covered in this book. We include these venerable topics because (i) they are cornerstones of frequentist and Bayesian inference, (ii) routine application of available software often fails for hard problems, and (iii) the methods themselves are often secondary components of other statistical computing algorithms. Some topics we have omitted represent important areas of past and present research in the field, but their priority here is lowered by the availability of high-quality software. For example, the generation of pseudo-random numbers is a classic topic, but one that we prefer to address by giving students reliable software. Finally, some topics (e.g., principal curves and tabu search) are included simply because they are interesting and provide very different perspectives on familiar problems. Perhaps a future researcher may draw ideas from such topics to design a creative and effective new algorithm.

In this second edition we have both updated and broadened our coverage, and we now provide computer code. For example, we have added new MCMC topics to reflect continued activity in that popular area. A notable increase in breadth is our inclusion of more methods relevant for problems where statistical dependency is important, such as block bootstrapping and sequential importance sampling. This second edition provides extensive new support in R. Specifically, code for the examples in this book is available from the book website, www.stat.colostate.edu/computationalstatistics.

Our target audience includes graduate students in statistics and related fields, statisticians, and quantitative empirical scientists in other fields. We hope such readers may use the book when applying standard methods and developing new methods.
The level of mathematics expected of the reader does not extend much beyond Taylor series and linear algebra. Breadth of mathematical training is more helpful than depth. Essential review is provided in Chapter 1. More advanced readers will find greater mathematical detail in the wide variety of high-quality books available on specific topics, many of which are referenced in the text. Other readers, caring less about analytical details, may prefer to focus on our descriptions of algorithms and examples.

The expected level of statistics is equivalent to that obtained by a graduate student in his or her first year of study of the theory of statistics and probability. An understanding of maximum likelihood methods, Bayesian methods, elementary asymptotic theory, Markov chains, and linear models is most important. Many of these topics are reviewed in Chapter 1.

With respect to computer programming, we find that good students can learn as they go. However, a working knowledge of a suitable language allows implementation of the ideas covered in this book to progress much more quickly. We have chosen to forgo any language-specific examples, algorithms, or coding in the text. For those wishing to learn a language while they study this book, we recommend that you choose a high-level, interactive package that permits the flexible design of graphical displays and includes supporting statistics and probability functions, such as R and MATLAB.¹ These are the sort of languages often used by researchers during the development of new statistical computing techniques, and they are suitable for implementing all the methods we describe, except in some cases for problems of vast scope or complexity. We use R and recommend it. Although lower-level languages such as C++ could also be used, they are more appropriate for professional-grade implementation of algorithms after researchers have refined the methodology.

The book is organized into four major parts: optimization (Chapters 2, 3, and 4), integration and simulation (Chapters 5, 6, 7, and 8), bootstrapping (Chapter 9), and density estimation and smoothing (Chapters 10, 11, and 12). The chapters are written to stand independently, so a course can be built by selecting the topics one wishes to teach. For a one-semester course, our selection typically weights most heavily topics from Chapters 2, 3, 6, 7, 9, 10, and 11. With a leisurely pace or more thorough coverage, a shorter list of topics could still easily fill a semester course. There is sufficient material here to provide a thorough one-year course of study, notwithstanding any supplemental topics one might wish to teach.

A variety of homework problems are included at the end of each chapter. Some are straightforward, while others require the student to develop a thorough understanding of the model/method being used, to carefully (and perhaps cleverly) code a suitable technique, and to devote considerable attention to the interpretation of results. A few exercises invite open-ended exploration of methods and ideas. We are sometimes asked for solutions to the exercises, but we prefer to sequester them to preserve the challenge for future students and readers.

The datasets discussed in the examples and exercises are available from the book website, www.stat.colostate.edu/computationalstatistics. The R code is also provided there. Finally, the website includes an errata. Responsibility for all errors lies with us.

¹ R is available for free from www.r-project.org. Information about MATLAB can be found at www.mathworks.com.
ACKNOWLEDGMENTS
The course upon which this book is based was developed and taught by us at Colorado State University from 1994 onwards. Thanks are due to our many students, who have been semiwilling guinea pigs over the years. We also thank our colleagues in the Statistics Department for their continued support. The late Richard Tweedie merits particular acknowledgment for his mentoring during the early years of our careers.

We owe a great deal of intellectual debt to Adrian Raftery, who deserves special thanks not only for his teaching and advising, but also for his unwavering support and his seemingly inexhaustible supply of good ideas. In addition, we thank our influential advisors and teachers at the University of Washington Statistics Department, including David Madigan, Werner Stuetzle, and Judy Zeh. Of course, each of our chapters could be expanded into a full-length book, and great scholars have already done so. We owe much to their efforts, upon which we relied when developing our course and our manuscript.

Portions of the first edition were written at the Department of Mathematics and Statistics, University of Otago, in Dunedin, New Zealand, whose faculty we thank for graciously hosting us during our sabbatical in 2003. Much of our work on the second edition was undertaken during our sabbatical visit to the Australia Commonwealth Scientific and Research Organization in 2009–2010, sponsored by CSIRO Mathematics, Informatics and Statistics, and hosted at the Longpocket Laboratory in Indooroopilly, Australia. We thank our hosts and colleagues there for their support.

Our manuscript has been greatly improved through the constructive reviews of John Bickham, Ben Bird, Kate Cowles, Jan Hannig, Alan Herlihy, David Hunter, Devin Johnson, Michael Newton, Doug Nychka, Steve Sain, David W. Scott, N. Scott Urquhart, Haonan Wang, Darrell Whitley, and eight anonymous referees. We also thank the sharp-eyed readers listed in the errata for their suggestions and corrections. Our editor Steve Quigley and the folks at Wiley were supportive and helpful during the publication process. We thank Nelida Pohl for permission to adapt her photograph in the cover design of the first edition. We thank Melinda Stelzer for permission to use her painting "Champagne Circuit," 2001, for the cover of the second edition. More about her art can be found at www.facebook.com/geekchicart. We also owe a special note of thanks to Zube (a.k.a. John Dzubera), who kept our own computers running despite our best efforts to the contrary.
Funding from National Science Foundation (NSF) CAREER grant SBR-9875508 was a significant source of support for the first author during the preparation of the first edition. He also thanks his colleagues and friends in the North Slope Borough, Alaska, Department of Wildlife Management for their longtime research support. The second author gratefully acknowledges the support of STAR Research Assistance Agreement CR-829095, awarded to Colorado State University by the U.S. Environmental Protection Agency (EPA). The views expressed here are solely those of the authors. NSF and EPA do not endorse any products or commercial services mentioned herein.
Finally, we thank our parents for enabling and supporting our educations, and for providing us with the "stubbornness gene" necessary for graduate school, the tenure track, or book publication: take your pick. The second edition is dedicated to our kids, Natalie and Neil, for continuing to show us what is important and what is not.

Geof H. Givens
Jennifer A. Hoeting
CHAPTER 1

REVIEW
This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.
1.1 MATHEMATICAL NOTATION
We use boldface to distinguish a vector x = (x1, ..., xp) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f1(x), ..., fp(x)). The transpose of M is denoted M^T.

Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x1, ..., xn)^T. Let I denote an identity matrix, and 1 and 0 denote vectors of ones and zeros, respectively.

A symmetric square matrix M is positive definite if x^T M x > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite or positive semidefinite if x^T M x ≥ 0 for all nonzero vectors x.

The derivative of a function f, evaluated at x, is denoted f′(x). When x = (x1, ..., xp), the gradient of f at x is

$$f'(\mathbf{x}) = \left( \frac{df(\mathbf{x})}{dx_1}, \ldots, \frac{df(\mathbf{x})}{dx_p} \right).$$

The Hessian matrix for f at x is f″(x), having (i, j)th element equal to d²f(x)/(dx_i dx_j). The negative Hessian has important uses in statistical inference.

Let J(x) denote the Jacobian matrix, evaluated at x, for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to df_i(x)/dx_j.

A functional is a real-valued function on a space of functions. For example, if T(f) = ∫_0^1 f(x) dx, then the functional T maps suitably integrable functions onto the real line.

The indicator function 1_{A} equals 1 if A is true and 0 otherwise. The real line is denoted ℜ, and p-dimensional real space is ℜ^p.
1.2 TAYLOR'S THEOREM AND MATHEMATICAL LIMIT THEORY
First we define standard "big oh" and "little oh" notation for describing the relative orders of convergence of functions. Let the functions f and g be defined on a common, possibly infinite, interval. Let z0 be a point in this interval or a boundary point of it (i.e., −∞ or ∞). We require g(z) ≠ 0 for all z ≠ z0 in a neighborhood of z0. Then we say

$$f(z) = O(g(z)) \tag{1.1}$$

if there exists a constant M such that |f(z)| ≤ M |g(z)| as z → z0. For example, (n + 1)/(3n²) = O(n^{-1}), where it is understood that we are considering n → ∞. If lim_{z→z0} f(z)/g(z) = 0, then we say

$$f(z) = o(g(z)). \tag{1.2}$$

For example, f(x0 + h) − f(x0) = h f′(x0) + o(h) as h → 0 if f is differentiable at x0. The same notation can be used for describing the convergence of a sequence {x_n} as n → ∞, by letting f(n) = x_n.
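A tiny numeric check of the little-o claim above (our illustration, with f = exp): dividing the remainder by h gives ratios that vanish as h → 0.

```python
import math

# Numeric check (ours): for differentiable f, the remainder
# f(x0 + h) - f(x0) - h f'(x0) is o(h), so the ratio to h vanishes as h -> 0.
f = fp = math.exp          # f and its derivative coincide for exp
x0 = 1.0
ratios = [(f(x0 + h) - f(x0) - h * fp(x0)) / h for h in (1e-1, 1e-2, 1e-3)]
print(ratios)  # decreasing toward 0
```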
Taylor's theorem provides a polynomial approximation to a function f. Suppose f has a finite (n + 1)th derivative on (a, b) and a continuous nth derivative on [a, b]. Then for any x0 ∈ [a, b] distinct from x, the Taylor series expansion of f about x0 is

$$f(x) = \sum_{i=0}^{n} \frac{1}{i!} f^{(i)}(x_0)(x - x_0)^i + R_n, \tag{1.3}$$

where f^{(i)}(x0) is the ith derivative of f evaluated at x0, and

$$R_n = \frac{1}{(n+1)!} f^{(n+1)}(\xi)(x - x_0)^{n+1} \tag{1.4}$$

for some point ξ in the interval between x and x0. As |x − x0| → 0, note that R_n = O(|x − x0|^{n+1}).
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x0 ≠ x. Then

$$f(\mathbf{x}) = f(\mathbf{x}_0) + \sum_{i=1}^{n} \frac{1}{i!} D^{(i)}(f; \mathbf{x}_0, \mathbf{x} - \mathbf{x}_0) + R_n, \tag{1.5}$$

where

$$D^{(i)}(f; \mathbf{x}, \mathbf{y}) = \sum_{j_1=1}^{p} \cdots \sum_{j_i=1}^{p} \left( \frac{d^i}{dt_{j_1} \cdots dt_{j_i}} f(\mathbf{t}) \Bigg|_{\mathbf{t}=\mathbf{x}} \right) \prod_{k=1}^{i} y_{j_k} \tag{1.6}$$

and

$$R_n = \frac{1}{(n+1)!} D^{(n+1)}(f; \boldsymbol{\xi}, \mathbf{x} - \mathbf{x}_0) \tag{1.7}$$

for some ξ on the line segment joining x and x0. As |x − x0| → 0, note that R_n = O(|x − x0|^{n+1}).
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

$$\int_0^1 f(x)\,dx = \frac{f(0) + f(1)}{2} - \sum_{i=1}^{n-1} \frac{b_{2i}\left(f^{(2i-1)}(1) - f^{(2i-1)}(0)\right)}{(2i)!} - \frac{b_{2n}\, f^{(2n)}(\xi)}{(2n)!}, \tag{1.8}$$

where 0 ≤ ξ ≤ 1, f^{(j)} is the jth derivative of f, and b_j = B_j(0) can be determined using the recursion relation

$$\sum_{j=0}^{m} \binom{m+1}{j} B_j(z) = (m+1) z^m, \tag{1.9}$$

initialized with B0(z) = 1. The proof of this result is based on repeated integrations by parts [376].
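Evaluating the recursion (1.9) at z = 0 gives Σ_{j=0}^{m} C(m+1, j) b_j = 0 for m ≥ 1, which yields the b_j directly. A short sketch of ours, using exact rational arithmetic:

```python
from fractions import Fraction
from math import comb

# Sketch (ours): evaluating the recursion (1.9) at z = 0 gives, for m >= 1,
# sum_{j=0}^{m} C(m+1, j) * b_j = 0, since z^m = 0 there. Solving for b_m
# (note C(m+1, m) = m + 1) yields the b_j exactly.
def bernoulli_b(m_max):
    b = [Fraction(1)]                     # B_0(0) = 1
    for m in range(1, m_max + 1):
        s = sum(comb(m + 1, j) * b[j] for j in range(m))
        b.append(-s / Fraction(m + 1))    # solve for b_m
    return b

b = bernoulli_b(6)
print(b)  # b_0..b_6: 1, -1/2, 1/6, 0, -1/30, 0, 1/42
```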
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

$$\frac{df(\mathbf{x})}{dx_i} \approx \frac{f(\mathbf{x} + \epsilon_i \mathbf{e}_i) - f(\mathbf{x} - \epsilon_i \mathbf{e}_i)}{2\epsilon_i}, \tag{1.10}$$

where ε_i is a small number and e_i is the unit vector in the ith coordinate direction. Typically, one might start with, say, ε_i = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller ε_i. The approximation will generally improve until ε_i becomes small enough that the calculation is degraded, and eventually dominated, by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via

$$\frac{d^2 f(\mathbf{x})}{dx_i\,dx_j} \approx \frac{1}{4\epsilon_i \epsilon_j} \Big( f(\mathbf{x} + \epsilon_i \mathbf{e}_i + \epsilon_j \mathbf{e}_j) - f(\mathbf{x} + \epsilon_i \mathbf{e}_i - \epsilon_j \mathbf{e}_j) - f(\mathbf{x} - \epsilon_i \mathbf{e}_i + \epsilon_j \mathbf{e}_j) + f(\mathbf{x} - \epsilon_i \mathbf{e}_i - \epsilon_j \mathbf{e}_j) \Big) \tag{1.11}$$

with similar sequential precision improvements.
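A sketch (ours, with an invented test function) of the central-difference approximation (1.10) for one gradient coordinate, showing the error shrinking and then growing again once roundoff takes over:

```python
import numpy as np

# Sketch (ours, invented test function): central-difference estimate (1.10)
# of df/dx_1 for f(x) = sin(x_1) + x_1 * x_2^2 at x = (1, 2). The error
# improves as eps shrinks, then worsens once roundoff dominates.
def f(x):
    return np.sin(x[0]) + x[0] * x[1]**2

x = np.array([1.0, 2.0])
exact = np.cos(1.0) + 2.0**2                         # true df/dx_1
errs = {}
for eps in (1e-2, 1e-4, 1e-10):
    e1 = np.array([eps, 0.0])                        # eps * unit vector e_1
    errs[eps] = abs((f(x + e1) - f(x - e1)) / (2 * eps) - exact)
print(errs)  # error falls from eps = 1e-2 to 1e-4, then rises at 1e-10
```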
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS
We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ∼ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters also will be denoted with a conditioning bar, as in f(x|α, β). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|α) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example, by using f(·|α) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are f_X and f_Y, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.

The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or f_{X|Y}(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|μ) = f(x|y, μ) f(y|μ). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.

The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X, or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1_{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^{1/2}/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so

$$g(\lambda x + (1 - \lambda) y) \le \lambda g(x) + (1 - \lambda) g(y) \tag{1.12}$$
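A quick Monte Carlo illustration of ours, with an assumed standard normal X and the convex choice g(x) = e^x:

```python
import numpy as np

# Monte Carlo illustration (ours): Jensen's inequality with convex
# g(x) = exp(x) and X ~ N(0, 1), where E{g(X)} = exp(1/2) > g(E{X}) = 1.
rng = np.random.default_rng(1)
xs = rng.normal(size=100_000)
lhs = np.exp(xs).mean()                              # estimates E{g(X)}
rhs = np.exp(xs.mean())                              # estimates g(E{X})
print(lhs > rhs)  # True
```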
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

$$n! = n(n-1)(n-2) \cdots (3)(2)(1), \quad \text{with } 0! = 1, \tag{1.13}$$

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}, \tag{1.14}$$
TABLE 1.1 Notation and description for common probability distributions of discrete random variables.

Bernoulli: X ∼ Bernoulli(p), 0 ≤ p ≤ 1.
  Density: f(x) = p^x (1 − p)^{1−x}; x = 0 or 1.
  Mean and variance: E{X} = p, var{X} = p(1 − p).

Binomial: X ∼ Bin(n, p), 0 ≤ p ≤ 1, n ∈ {1, 2, ...}.
  Density: f(x) = C(n, x) p^x (1 − p)^{n−x}; x = 0, 1, ..., n.
  Mean and variance: E{X} = np, var{X} = np(1 − p).

Multinomial: X ∼ Multinomial(n, p), p = (p1, ..., pk), 0 ≤ p_i ≤ 1 with Σ_{i=1}^k p_i = 1, n ∈ {1, 2, ...}.
  Density: f(x) = C(n; x1, ..., xk) ∏_{i=1}^k p_i^{x_i}; x = (x1, ..., xk), x_i ∈ {0, 1, ..., n}, Σ_{i=1}^k x_i = n.
  Mean and variance: E{X} = np, var{X_i} = n p_i (1 − p_i), cov{X_i, X_j} = −n p_i p_j.

Negative Binomial: X ∼ NegBin(r, p), 0 ≤ p ≤ 1, r ∈ {1, 2, ...}.
  Density: f(x) = C(x + r − 1, r − 1) p^r (1 − p)^x; x ∈ {0, 1, ...}.
  Mean and variance: E{X} = r(1 − p)/p, var{X} = r(1 − p)/p².

Poisson: X ∼ Poisson(λ), λ > 0.
  Density: f(x) = (λ^x / x!) exp{−λ}; x ∈ {0, 1, ...}.
  Mean and variance: E{X} = λ, var{X} = λ.
TABLE 1.2 Notation and description for some common probability distributions of continuous random variables.

Beta: X ∼ Beta(α, β), α > 0 and β > 0.
  Density: f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1}; 0 ≤ x ≤ 1.
  Mean and variance: E{X} = α/(α + β), var{X} = αβ/[(α + β)²(α + β + 1)].

Cauchy: X ∼ Cauchy(α, β), α ∈ ℜ and β > 0.
  Density: f(x) = 1/(πβ[1 + ((x − α)/β)²]); x ∈ ℜ.
  Mean and variance: E{X} and var{X} are nonexistent.

Chi-square: X ∼ χ²_ν, ν > 0.
  Density: that of Gamma(ν/2, 1/2); x > 0.
  Mean and variance: E{X} = ν, var{X} = 2ν.

Dirichlet: X ∼ Dirichlet(α), α = (α1, ..., αk), α_i > 0, α0 = Σ_{i=1}^k α_i.
  Density: f(x) = Γ(α0) ∏_{i=1}^k x_i^{α_i−1} / ∏_{i=1}^k Γ(α_i); x = (x1, ..., xk), 0 ≤ x_i ≤ 1, Σ_{i=1}^k x_i = 1.
  Mean and variance: E{X} = α/α0, var{X_i} = α_i(α0 − α_i)/[α0²(α0 + 1)], cov{X_i, X_j} = −α_i α_j/[α0²(α0 + 1)].

Exponential: X ∼ Exp(λ), λ > 0.
  Density: f(x) = λ exp{−λx}; x > 0.
  Mean and variance: E{X} = 1/λ, var{X} = 1/λ².

Gamma: X ∼ Gamma(r, λ), λ > 0 and r > 0.
  Density: f(x) = [λ^r x^{r−1}/Γ(r)] exp{−λx}; x > 0.
  Mean and variance: E{X} = r/λ, var{X} = r/λ².
TABLE 1.3 Notation and description for more common probability distributions of continuous random variables.

Lognormal: X ∼ Lognormal(μ, σ²), μ ∈ ℜ and σ > 0.
  Density: f(x) = [1/(x√(2πσ²))] exp{−½((log x − μ)/σ)²}; x > 0.
  Mean and variance: E{X} = exp{μ + σ²/2}, var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}.

Multivariate Normal: X ∼ N_k(μ, Σ), μ = (μ1, ..., μk) ∈ ℜ^k, Σ positive definite.
  Density: f(x) = exp{−(x − μ)^T Σ^{-1} (x − μ)/2} / ((2π)^{k/2} |Σ|^{1/2}); x = (x1, ..., xk) ∈ ℜ^k.
  Mean and variance: E{X} = μ, var{X} = Σ.

Normal: X ∼ N(μ, σ²), μ ∈ ℜ and σ > 0.
  Density: f(x) = (1/√(2πσ²)) exp{−½((x − μ)/σ)²}; x ∈ ℜ.
  Mean and variance: E{X} = μ, var{X} = σ².

Student's t: X ∼ t_ν, ν > 0.
  Density: f(x) = [Γ((ν + 1)/2)/(Γ(ν/2)√(πν))] (1 + x²/ν)^{−(ν+1)/2}; x ∈ ℜ.
  Mean and variance: E{X} = 0 if ν > 1; var{X} = ν/(ν − 2) if ν > 2.

Uniform: X ∼ Unif(a, b), a, b ∈ ℜ and a < b.
  Density: f(x) = 1/(b − a); x ∈ [a, b].
  Mean and variance: E{X} = (a + b)/2, var{X} = (b − a)²/12.

Weibull: X ∼ Weibull(a, b), a > 0 and b > 0.
  Density: f(x) = a b x^{b−1} exp{−a x^b}; x > 0.
  Mean and variance: E{X} = Γ(1 + 1/b)/a^{1/b}, var{X} = [Γ(1 + 2/b) − Γ(1 + 1/b)²]/a^{2/b}.
$$\binom{n}{k_1, \ldots, k_m} = \frac{n!}{\prod_{i=1}^{m} k_i!}, \quad \text{where } n = \sum_{i=1}^{m} k_i, \tag{1.15}$$

$$\Gamma(r) = \begin{cases} (r-1)! & \text{for } r = 1, 2, \ldots, \\[4pt] \int_0^\infty t^{r-1} \exp\{-t\}\,dt & \text{for general } r > 0. \end{cases} \tag{1.16}$$

It is worth knowing that Γ(1/2) = √π and Γ(n + 1/2) = (1 × 3 × 5 × ··· × (2n − 1)) √π / 2^n for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as

$$f(x|\boldsymbol{\gamma}) = c_1(x)\, c_2(\boldsymbol{\gamma}) \exp\left\{ \sum_{i=1}^{k} y_i(x)\, \theta_i(\boldsymbol{\gamma}) \right\} \tag{1.17}$$

for nonnegative functions c1 and c2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θ_i(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The y_i(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that
E{y(X)} = κ′(θ)  (1.18)

and

var{y(X)} = κ″(θ),  (1.19)

where κ(θ) = −log c_3(θ), letting c_3(θ) denote the reexpression of c_2(γ) in terms of the canonical parameters θ = (θ_1, ..., θ_k), and y(X) = (y_1(X), ..., y_k(X)). These results can be rewritten in terms of the original parameters γ as
E{∑_{i=1}^k (dθ_i(γ)/dγ_j) y_i(X)} = −(d/dγ_j) log c_2(γ)  (1.20)

and

var{∑_{i=1}^k (dθ_i(γ)/dγ_j) y_i(X)} = −(d^2/dγ_j^2) log c_2(γ) − E{∑_{i=1}^k (d^2θ_i(γ)/dγ_j^2) y_i(X)}.  (1.21)
Example 1.1 (Poisson) The Poisson distribution belongs to the exponential family with c_1(x) = 1/x!, c_2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ′(θ) = exp{θ} = λ and var{X} = κ″(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
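The derivative identities in this example can be checked numerically. The following Python sketch (an aside, not from the book) differentiates κ(θ) = exp{θ} by central differences at θ = log λ and recovers the Poisson mean and variance, both equal to λ:

```python
import math

# Numerical sketch of Example 1.1 (not from the book): with theta = log(lambda)
# and kappa(theta) = exp(theta), kappa'(theta) and kappa''(theta) should
# recover the Poisson mean and variance, both equal to lambda.

def kappa(theta):
    return math.exp(theta)

def first_diff(f, x, h=1e-6):
    # central first difference
    return (f(x + h) - f(x - h)) / (2 * h)

def second_diff(f, x, h=1e-4):
    # central second difference
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

lam = 3.7
theta = math.log(lam)
assert abs(first_diff(kappa, theta) - lam) < 1e-6    # E{X} = lambda
assert abs(second_diff(kappa, theta) - lam) < 1e-5   # var{X} = lambda
```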
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X_1, ..., X_p) denote a p-dimensional random variable with continuous density function f. Suppose that
U = g(X) = (g_1(X), ..., g_p(X)) = (U_1, ..., U_p),  (1.22)
where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is
f(u) = f(g^{−1}(u)) |J(u)|,  (1.23)
where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g^{−1} evaluated at u, having (i, j)th element dx_i/du_j, where these derivatives are assumed to be continuous over the support region of U.
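As a concrete check of (1.23) (a Python aside, not from the book), take U = g(X) = exp(X) with X ∼ N(μ, σ^2). Then g^{−1}(u) = log u, the one-by-one Jacobian of g^{−1} is dx/du = 1/u, and f(g^{−1}(u))|J(u)| should reproduce the Lognormal(μ, σ^2) density of Table 1.3:

```python
import math

# Sketch (not from the book): transformation of variables via (1.23).
def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / math.sqrt(2 * math.pi * sigma**2)

def density_via_jacobian(u, mu, sigma):
    # equation (1.23): f_X(g^{-1}(u)) * |dx/du| with dx/du = 1/u
    return normal_pdf(math.log(u), mu, sigma) * abs(1.0 / u)

def lognormal_pdf(u, mu, sigma):
    # closed form from Table 1.3
    return (math.exp(-((math.log(u) - mu) ** 2) / (2 * sigma**2))
            / (u * math.sqrt(2 * math.pi * sigma**2)))

mu, sigma = 0.5, 1.2  # arbitrary illustrative values
for u in (0.1, 1.0, 3.0):
    assert abs(density_via_jacobian(u, mu, sigma) - lognormal_pdf(u, mu, sigma)) < 1e-9
```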
1.4 LIKELIHOOD INFERENCE
If X_1, ..., X_n are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ_1, ..., θ_p), then the joint likelihood function is
L(θ) = ∏_{i=1}^n f(x_i|θ).  (1.24)
When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x_1, ..., x_n|θ) viewed as a function of θ.
The observed data x_1, ..., x_n might have been realized under many different values for θ. The parameters for which observing x_1, ..., x_n would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x_1, ..., x_n that maximizes L(θ), then θ̂ = ϑ(X_1, ..., X_n) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.
It is typically easier to work with the log likelihood function
l(θ) = log L(θ),  (1.25)
which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x_1, ..., x_n but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations
l′(θ) = 0,  (1.26)
where
l′(θ) = (dl(θ)/dθ_1, ..., dl(θ)/dθ_p)
is called the score function. The score function satisfies
E{l′(θ)} = 0,  (1.27)
where the expectation is taken with respect to the distribution of X_1, ..., X_n. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.
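As a minimal illustration of solving (1.26) numerically (a Python aside, not from the book, which develops such methods properly in Chapter 2), consider the location MLE for a hypothetical Cauchy(θ, 1) sample, whose score equation has no closed-form solution:

```python
# Sketch (not from the book): root-finding on the score equation (1.26).
# For an assumed Cauchy(theta, 1) sample the score is
#   l'(theta) = sum_i 2 (x_i - theta) / (1 + (x_i - theta)^2).

def score(theta, xs):
    return sum(2 * (x - theta) / (1 + (x - theta) ** 2) for x in xs)

def bisect(f, lo, hi, tol=1e-10):
    # assumes f(lo) and f(hi) bracket a root
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

xs = [0.2, 1.1, -0.4, 0.8, 0.3]  # hypothetical data
# the score is positive at min(xs) and negative at max(xs), bracketing a root
theta_hat = bisect(lambda t: score(t, xs), min(xs), max(xs))
assert abs(score(theta_hat, xs)) < 1e-6
```

(The Cauchy likelihood can be multimodal, so in general one should check that the root found is the global maximizer.)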
The MLE has a sampling distribution because it depends on the realization of the random variables X_1, ..., X_n. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: when the log likelihood is very pointy, the location of the maximum is more precisely known.
To make this precise, let l″(θ) denote the p × p matrix having (i, j)th element given by d^2 l(θ)/(dθ_i dθ_j). The Fisher information matrix is defined as
I(θ) = E{l′(θ) l′(θ)^T} = −E{l″(θ)},  (1.28)
where the expectations are taken with respect to the distribution of X_1, ..., X_n. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l″(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.
Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)^{−1}, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is N_p(θ*, I(θ*)^{−1}). Since the true parameter values are unknown, I(θ*)^{−1} must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)^{−1}. Alternatively, it is also reasonable to use −l″(θ̂)^{−1}. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)^{−1}. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)^{−1} can be found in [127, 182, 371, 470].
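The observed-information route to a standard error can be sketched in a few lines of Python (an aside, not from the book). For hypothetical i.i.d. Exp(λ) data, l(λ) = n log λ − λ ∑x_i, the MLE is 1/x̄, and −l″(λ) = n/λ^2, so the standard error is λ̂/√n; a numerical second difference of the log likelihood recovers the same answer:

```python
import math

# Sketch (not from the book): SE of an MLE from observed Fisher information.
def loglik(lam, xs):
    return len(xs) * math.log(lam) - lam * sum(xs)

def observed_info(lam, xs, h=1e-4):
    # numerical -l''(lambda) via a central second difference
    return -(loglik(lam + h, xs) - 2 * loglik(lam, xs) + loglik(lam - h, xs)) / h**2

xs = [0.5, 1.2, 0.3, 2.0, 0.9, 1.5]   # hypothetical data
lam_hat = len(xs) / sum(xs)           # MLE: 1 / xbar
se = 1.0 / math.sqrt(observed_info(lam_hat, xs))
# closed form for this model: lam_hat / sqrt(n)
assert abs(se - lam_hat / math.sqrt(len(xs))) < 1e-4
```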
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is
L(φ | μ̂(φ)) = max_μ L(μ, φ).  (1.29)
Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
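The construction in (1.29) can be sketched numerically (a Python aside, not from the book). For a hypothetical N(μ, σ^2) sample with σ fixed, the inner maximization over μ is solved by the sample mean, so a crude grid search over μ should agree with plugging in x̄:

```python
import math

# Sketch (not from the book): profile log likelihood (1.29), profiling out mu.
def loglik(mu, sigma, xs):
    # additive constant -n/2 log(2 pi) dropped, as the text permits
    return (-len(xs) * math.log(sigma)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma**2))

xs = [1.0, 2.5, 0.7, 1.9, 2.2]   # hypothetical data
xbar = sum(xs) / len(xs)
sigma = 0.8                       # one fixed value of the retained parameter
grid = [i * 0.001 for i in range(4000)]   # crude grid of mu values in [0, 4)
profile_grid = max(loglik(mu, sigma, xs) for mu in grid)
profile_exact = loglik(xbar, sigma, xs)   # inner max solved analytically
assert abs(profile_grid - profile_exact) < 1e-9
```

Repeating this over a range of σ values traces out the whole profile likelihood curve.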
To Natalie and Neil
CONTENTS
PREFACE xv
ACKNOWLEDGMENTS xvii
1 REVIEW 1
1.1 Mathematical Notation 1
1.2 Taylor's Theorem and Mathematical Limit Theory 2
1.3 Statistical Notation and Probability Distributions 4
1.4 Likelihood Inference 9
1.5 Bayesian Inference 11
1.6 Statistical Limit Theory 13
1.7 Markov Chains 14
1.8 Computing 17
PART I
OPTIMIZATION
2 OPTIMIZATION AND SOLVING NONLINEAR EQUATIONS 21
2.1 Univariate Problems 22
2.1.1 Newton's Method 26
2.1.1.1 Convergence Order 29
2.1.2 Fisher Scoring 30
2.1.3 Secant Method 30
2.1.4 Fixed-Point Iteration 32
2.1.4.1 Scaling 33
2.2 Multivariate Problems 34
2.2.1 Newton's Method and Fisher Scoring 34
2.2.1.1 Iteratively Reweighted Least Squares 36
2.2.2 Newton-Like Methods 39
2.2.2.1 Ascent Algorithms 39
2.2.2.2 Discrete Newton and Fixed-Point Methods 41
2.2.2.3 Quasi-Newton Methods 41
2.2.3 Gauss–Newton Method 44
2.2.4 Nelder–Mead Algorithm 45
2.2.5 Nonlinear Gauss–Seidel Iteration 52
Problems 54
3 COMBINATORIAL OPTIMIZATION 59
3.1 Hard Problems and NP-Completeness 59
3.1.1 Examples 61
3.1.2 Need for Heuristics 64
3.2 Local Search 65
3.3 Simulated Annealing 68
3.3.1 Practical Issues 70
3.3.1.1 Neighborhoods and Proposals 70
3.3.1.2 Cooling Schedule and Convergence 71
3.3.2 Enhancements 74
3.4 Genetic Algorithms 75
3.4.1 Definitions and the Canonical Algorithm 75
3.4.1.1 Basic Definitions 75
3.4.1.2 Selection Mechanisms and Genetic Operators 76
3.4.1.3 Allele Alphabets and Genotypic Representation 78
3.4.1.4 Initialization, Termination, and Parameter Values 79
3.4.2 Variations 80
3.4.2.1 Fitness 80
3.4.2.2 Selection Mechanisms and Updating Generations 81
3.4.2.3 Genetic Operators and Permutation Chromosomes 82
3.4.3 Initialization and Parameter Values 84
3.4.4 Convergence 84
3.5 Tabu Algorithms 85
3.5.1 Basic Definitions 86
3.5.2 The Tabu List 87
3.5.3 Aspiration Criteria 88
3.5.4 Diversification 89
3.5.5 Intensification 90
3.5.6 Comprehensive Tabu Algorithm 91
Problems 92
4 EM OPTIMIZATION METHODS 97
4.1 Missing Data, Marginalization, and Notation 97
4.2 The EM Algorithm 98
4.2.1 Convergence 102
4.2.2 Usage in Exponential Families 105
4.2.3 Variance Estimation 106
4.2.3.1 Louis's Method 106
4.2.3.2 SEM Algorithm 108
4.2.3.3 Bootstrapping 110
4.2.3.4 Empirical Information 110
4.2.3.5 Numerical Differentiation 111
4.3 EM Variants 111
4.3.1 Improving the E Step 111
4.3.1.1 Monte Carlo EM 111
4.3.2 Improving the M Step 112
4.3.2.1 ECM Algorithm 113
4.3.2.2 EM Gradient Algorithm 116
4.3.3 Acceleration Methods 118
4.3.3.1 Aitken Acceleration 118
4.3.3.2 Quasi-Newton Acceleration 119
Problems 121
PART II
INTEGRATION AND SIMULATION
5 NUMERICAL INTEGRATION 129
5.1 Newton–Cotes Quadrature 129
5.1.1 Riemann Rule 130
5.1.2 Trapezoidal Rule 134
5.1.3 Simpson's Rule 136
5.1.4 General kth-Degree Rule 138
5.2 Romberg Integration 139
5.3 Gaussian Quadrature 142
5.3.1 Orthogonal Polynomials 143
5.3.2 The Gaussian Quadrature Rule 143
5.4 Frequently Encountered Problems 146
5.4.1 Range of Integration 146
5.4.2 Integrands with Singularities or Other Extreme Behavior 146
5.4.3 Multiple Integrals 147
5.4.4 Adaptive Quadrature 147
5.4.5 Software for Exact Integration 148
Problems 148
6 SIMULATION AND MONTE CARLO INTEGRATION 151
6.1 Introduction to the Monte Carlo Method 151
6.2 Exact Simulation 152
6.2.1 Generating from Standard Parametric Families 153
6.2.2 Inverse Cumulative Distribution Function 153
6.2.3 Rejection Sampling 155
6.2.3.1 Squeezed Rejection Sampling 158
6.2.3.2 Adaptive Rejection Sampling 159
6.3 Approximate Simulation 163
6.3.1 Sampling Importance Resampling Algorithm 163
6.3.1.1 Adaptive Importance, Bridge, and Path Sampling 167
6.3.2 Sequential Monte Carlo 168
6.3.2.1 Sequential Importance Sampling for Markov Processes 169
6.3.2.2 General Sequential Importance Sampling 170
6.3.2.3 Weight Degeneracy, Rejuvenation, and Effective Sample Size 171
6.3.2.4 Sequential Importance Sampling for Hidden Markov Models 175
6.3.2.5 Particle Filters 179
6.4 Variance Reduction Techniques 180
6.4.1 Importance Sampling 180
6.4.2 Antithetic Sampling 186
6.4.3 Control Variates 189
6.4.4 Rao–Blackwellization 193
Problems 195
7 MARKOV CHAIN MONTE CARLO 201
7.1 Metropolis–Hastings Algorithm 202
7.1.1 Independence Chains 204
7.1.2 Random Walk Chains 206
7.2 Gibbs Sampling 209
7.2.1 Basic Gibbs Sampler 209
7.2.2 Properties of the Gibbs Sampler 214
7.2.3 Update Ordering 216
7.2.4 Blocking 216
7.2.5 Hybrid Gibbs Sampling 216
7.2.6 Griddy–Gibbs Sampler 218
7.3 Implementation 218
7.3.1 Ensuring Good Mixing and Convergence 219
7.3.1.1 Simple Graphical Diagnostics 219
7.3.1.2 Burn-in and Run Length 220
7.3.1.3 Choice of Proposal 222
7.3.1.4 Reparameterization 223
7.3.1.5 Comparing Chains: Effective Sample Size 224
7.3.1.6 Number of Chains 225
7.3.2 Practical Implementation Advice 226
7.3.3 Using the Results 226
Problems 230
8 ADVANCED TOPICS IN MCMC 237
8.1 Adaptive MCMC 237
8.1.1 Adaptive Random Walk Metropolis-within-Gibbs Algorithm 238
8.1.2 General Adaptive Metropolis-within-Gibbs Algorithm 240
8.1.3 Adaptive Metropolis Algorithm 247
8.2 Reversible Jump MCMC 250
8.2.1 RJMCMC for Variable Selection in Regression 253
8.3 Auxiliary Variable Methods 256
8.3.1 Simulated Tempering 257
8.3.2 Slice Sampler 258
8.4 Other Metropolis–Hastings Algorithms 260
8.4.1 Hit-and-Run Algorithm 260
8.4.2 Multiple-Try Metropolis–Hastings Algorithm 261
8.4.3 Langevin Metropolis–Hastings Algorithm 262
8.5 Perfect Sampling 264
8.5.1 Coupling from the Past 264
8.5.1.1 Stochastic Monotonicity and Sandwiching 267
8.6 Markov Chain Maximum Likelihood 268
8.7 Example: MCMC for Markov Random Fields 269
8.7.1 Gibbs Sampling for Markov Random Fields 270
8.7.2 Auxiliary Variable Methods for Markov Random Fields 274
8.7.3 Perfect Sampling for Markov Random Fields 277
Problems 279
PART III
BOOTSTRAPPING
9 BOOTSTRAPPING 287
9.1 The Bootstrap Principle 287
9.2 Basic Methods 288
9.2.1 Nonparametric Bootstrap 288
9.2.2 Parametric Bootstrap 289
9.2.3 Bootstrapping Regression 290
9.2.4 Bootstrap Bias Correction 291
9.3 Bootstrap Inference 292
9.3.1 Percentile Method 292
9.3.1.1 Justification for the Percentile Method 293
9.3.2 Pivoting 294
9.3.2.1 Accelerated Bias-Corrected Percentile Method, BCa 294
9.3.2.2 The Bootstrap t 296
9.3.2.3 Empirical Variance Stabilization 298
9.3.2.4 Nested Bootstrap and Prepivoting 299
9.3.3 Hypothesis Testing 301
9.4 Reducing Monte Carlo Error 302
9.4.1 Balanced Bootstrap 302
9.4.2 Antithetic Bootstrap 302
9.5 Bootstrapping Dependent Data 303
9.5.1 Model-Based Approach 304
9.5.2 Block Bootstrap 304
9.5.2.1 Nonmoving Block Bootstrap 304
9.5.2.2 Moving Block Bootstrap 306
9.5.2.3 Blocks-of-Blocks Bootstrapping 307
9.5.2.4 Centering and Studentizing 309
9.5.2.5 Block Size 311
9.6 Bootstrap Performance 315
9.6.1 Independent Data Case 315
9.6.2 Dependent Data Case 316
9.7 Other Uses of the Bootstrap 316
9.8 Permutation Tests 317
Problems 319
PART IV
DENSITY ESTIMATION AND SMOOTHING
10 NONPARAMETRIC DENSITY ESTIMATION 325
10.1 Measures of Performance 326
10.2 Kernel Density Estimation 327
10.2.1 Choice of Bandwidth 329
10.2.1.1 Cross-Validation 332
10.2.1.2 Plug-in Methods 335
10.2.1.3 Maximal Smoothing Principle 338
10.2.2 Choice of Kernel 339
10.2.2.1 Epanechnikov Kernel 339
10.2.2.2 Canonical Kernels and Rescalings 340
10.3 Nonkernel Methods 341
10.3.1 Logspline 341
10.4 Multivariate Methods 345
10.4.1 The Nature of the Problem 345
10.4.2 Multivariate Kernel Estimators 346
10.4.3 Adaptive Kernels and Nearest Neighbors 348
10.4.3.1 Nearest Neighbor Approaches 349
10.4.3.2 Variable-Kernel Approaches and Transformations 350
10.4.4 Exploratory Projection Pursuit 353
Problems 359
11 BIVARIATE SMOOTHING 363
11.1 Predictor–Response Data 363
11.2 Linear Smoothers 365
11.2.1 Constant-Span Running Mean 366
11.2.1.1 Effect of Span 368
11.2.1.2 Span Selection for Linear Smoothers 369
11.2.2 Running Lines and Running Polynomials 372
11.2.3 Kernel Smoothers 374
11.2.4 Local Regression Smoothing 374
11.2.5 Spline Smoothing 376
11.2.5.1 Choice of Penalty 377
11.3 Comparison of Linear Smoothers 377
11.4 Nonlinear Smoothers 379
11.4.1 Loess 379
11.4.2 Supersmoother 381
11.5 Confidence Bands 384
11.6 General Bivariate Data 388
Problems 389
12 MULTIVARIATE SMOOTHING 393
12.1 Predictor–Response Data 393
12.1.1 Additive Models 394
12.1.2 Generalized Additive Models 397
12.1.3 Other Methods Related to Additive Models 399
12.1.3.1 Projection Pursuit Regression 399
12.1.3.2 Neural Networks 402
12.1.3.3 Alternating Conditional Expectations 403
12.1.3.4 Additivity and Variance Stabilization 404
12.1.4 Tree-Based Methods 405
12.1.4.1 Recursive Partitioning: Regression Trees 406
12.1.4.2 Tree Pruning 409
12.1.4.3 Classification Trees 411
12.1.4.4 Other Issues for Tree-Based Methods 412
12.2 General Multivariate Data 413
12.2.1 Principal Curves 413
12.2.1.1 Definition and Motivation 413
12.2.1.2 Estimation 415
12.2.1.3 Span Selection 416
Problems 416
DATA ACKNOWLEDGMENTS 421
REFERENCES 423
INDEX 457
PREFACE
This book covers most topics needed to develop a broad and thorough working knowledge of modern computational statistics. We seek to develop a practical understanding of how and why existing methods work, enabling readers to use modern statistical methods effectively. Since many new methods are built from components of existing techniques, our ultimate goal is to provide scientists with the tools they need to contribute new ideas to the field.
A growing challenge in science is that there is so much of it. While the pursuit of important new methods and the honing of existing approaches is a worthy goal, there is also a need to organize and distill the teeming jungle of ideas. We attempt to do that here. Our choice of topics reflects our view of what constitutes the core of the evolving field of computational statistics, and what will be interesting and useful for our readers.
Our use of the adjective modern in the first sentence of this preface is potentially troublesome. There is no way that this book can cover all the latest, greatest techniques. We have not even tried. We have instead aimed to provide a reasonably up-to-date survey of a broad portion of the field, while leaving room for diversions and esoterica.
The foundations of optimization and numerical integration are covered in this book. We include these venerable topics because (i) they are cornerstones of frequentist and Bayesian inference, (ii) routine application of available software often fails for hard problems, and (iii) the methods themselves are often secondary components of other statistical computing algorithms. Some topics we have omitted represent important areas of past and present research in the field, but their priority here is lowered by the availability of high-quality software. For example, the generation of pseudo-random numbers is a classic topic, but one that we prefer to address by giving students reliable software. Finally, some topics (e.g., principal curves and tabu search) are included simply because they are interesting and provide very different perspectives on familiar problems. Perhaps a future researcher may draw ideas from such topics to design a creative and effective new algorithm.
In this second edition we have both updated and broadened our coverage, and we now provide computer code. For example, we have added new MCMC topics to reflect continued activity in that popular area. A notable increase in breadth is our inclusion of more methods relevant for problems where statistical dependency is important, such as block bootstrapping and sequential importance sampling. This second edition provides extensive new support in R. Specifically, code for the examples in this book is available from the book website, www.stat.colostate.edu/computationalstatistics.
Our target audience includes graduate students in statistics and related fields, statisticians, and quantitative empirical scientists in other fields. We hope such readers may use the book when applying standard methods and developing new methods.
The level of mathematics expected of the reader does not extend much beyond Taylor series and linear algebra. Breadth of mathematical training is more helpful than depth. Essential review is provided in Chapter 1. More advanced readers will find greater mathematical detail in the wide variety of high-quality books available on specific topics, many of which are referenced in the text. Other readers, caring less about analytical details, may prefer to focus on our descriptions of algorithms and examples.
The expected level of statistics is equivalent to that obtained by a graduate student in his or her first year of study of the theory of statistics and probability. An understanding of maximum likelihood methods, Bayesian methods, elementary asymptotic theory, Markov chains, and linear models is most important. Many of these topics are reviewed in Chapter 1.
With respect to computer programming, we find that good students can learn as they go. However, a working knowledge of a suitable language allows implementation of the ideas covered in this book to progress much more quickly. We have chosen to forgo any language-specific examples, algorithms, or coding in the text. For those wishing to learn a language while they study this book, we recommend that you choose a high-level, interactive package that permits the flexible design of graphical displays and includes supporting statistics and probability functions, such as R and MATLAB.¹ These are the sort of languages often used by researchers during the development of new statistical computing techniques, and they are suitable for implementing all the methods we describe, except in some cases for problems of vast scope or complexity. We use R and recommend it. Although lower-level languages such as C++ could also be used, they are more appropriate for professional-grade implementation of algorithms after researchers have refined the methodology.
The book is organized into four major parts: optimization (Chapters 2, 3, and 4), integration and simulation (Chapters 5, 6, 7, and 8), bootstrapping (Chapter 9), and density estimation and smoothing (Chapters 10, 11, and 12). The chapters are written to stand independently, so a course can be built by selecting the topics one wishes to teach. For a one-semester course, our selection typically weights most heavily topics from Chapters 2, 3, 6, 7, 9, 10, and 11. With a leisurely pace or more thorough coverage, a shorter list of topics could still easily fill a semester course. There is sufficient material here to provide a thorough one-year course of study, notwithstanding any supplemental topics one might wish to teach.
A variety of homework problems are included at the end of each chapter. Some are straightforward, while others require the student to develop a thorough understanding of the model/method being used, to carefully (and perhaps cleverly) code a suitable technique, and to devote considerable attention to the interpretation of results. A few exercises invite open-ended exploration of methods and ideas. We are sometimes asked for solutions to the exercises, but we prefer to sequester them to preserve the challenge for future students and readers.
The datasets discussed in the examples and exercises are available from the book website, www.stat.colostate.edu/computationalstatistics. The R code is also provided there. Finally, the website includes an errata. Responsibility for all errors lies with us.
¹R is available for free from www.r-project.org. Information about MATLAB can be found at www.mathworks.com.
ACKNOWLEDGMENTS
The course upon which this book is based was developed and taught by us at Colorado State University from 1994 onwards. Thanks are due to our many students, who have been semiwilling guinea pigs over the years. We also thank our colleagues in the Statistics Department for their continued support. The late Richard Tweedie merits particular acknowledgment for his mentoring during the early years of our careers.
We owe a great deal of intellectual debt to Adrian Raftery, who deserves special thanks not only for his teaching and advising but also for his unwavering support and his seemingly inexhaustible supply of good ideas. In addition, we thank our influential advisors and teachers at the University of Washington Statistics Department, including David Madigan, Werner Stuetzle, and Judy Zeh. Of course, each of our chapters could be expanded into a full-length book, and great scholars have already done so. We owe much to their efforts, upon which we relied when developing our course and our manuscript.
Portions of the first edition were written at the Department of Mathematics and Statistics, University of Otago, in Dunedin, New Zealand, whose faculty we thank for graciously hosting us during our sabbatical in 2003. Much of our work on the second edition was undertaken during our sabbatical visit to the Australia Commonwealth Scientific and Research Organization in 2009–2010, sponsored by CSIRO Mathematics, Informatics and Statistics, and hosted at the Longpocket Laboratory in Indooroopilly, Australia. We thank our hosts and colleagues there for their support.
Our manuscript has been greatly improved through the constructive reviews of John Bickham, Ben Bird, Kate Cowles, Jan Hannig, Alan Herlihy, David Hunter, Devin Johnson, Michael Newton, Doug Nychka, Steve Sain, David W. Scott, N. Scott Urquhart, Haonan Wang, Darrell Whitley, and eight anonymous referees. We also thank the sharp-eyed readers listed in the errata for their suggestions and corrections. Our editor Steve Quigley and the folks at Wiley were supportive and helpful during the publication process. We thank Nelida Pohl for permission to adapt her photograph in the cover design of the first edition. We thank Melinda Stelzer for permission to use her painting "Champagne Circuit," 2001, for the cover of the second edition. More about her art can be found at www.facebook.com/geekchicart. We also owe a special note of thanks to Zube (a.k.a. John Dzubera), who kept our own computers running despite our best efforts to the contrary.
Funding from National Science Foundation (NSF) CAREER grant SBR-9875508 was a significant source of support for the first author during the preparation of the first edition. He also thanks his colleagues and friends in the North Slope Borough, Alaska, Department of Wildlife Management for their longtime research support. The second author gratefully acknowledges the support of STAR Research Assistance Agreement CR-829095 awarded to Colorado State University by the U.S. Environmental Protection Agency (EPA). The views expressed here are solely those of the authors. NSF and EPA do not endorse any products or commercial services mentioned herein.
Finally, we thank our parents for enabling and supporting our educations, and for providing us with the "stubbornness gene" necessary for graduate school, the tenure track, or book publication: take your pick. The second edition is dedicated to our kids, Natalie and Neil, for continuing to show us what is important and what is not.
Geof H. Givens
Jennifer A. Hoeting
CHAPTER 1

REVIEW
This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.
1.1 MATHEMATICAL NOTATION
We use boldface to distinguish a vector x = (x_1, ..., x_p) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f_1(x), ..., f_p(x)). The transpose of M is denoted M^T.
Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x_1, ..., x_n)^T. Let I denote an identity matrix, and 1 and 0 denote vectors of ones and zeros, respectively.
A symmetric square matrix M is positive definite if x^T M x > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite or positive semidefinite if x^T M x ≥ 0 for all nonzero vectors x.
The derivative of a function f, evaluated at x, is denoted f′(x). When x = (x_1, ..., x_p), the gradient of f at x is

f′(x) = (df(x)/dx_1, ..., df(x)/dx_p).
The Hessian matrix for f at x is f″(x), having (i, j)th element equal to d^2 f(x)/(dx_i dx_j). The negative Hessian has important uses in statistical inference.
Let J(x) denote the Jacobian matrix evaluated at x for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to df_i(x)/dx_j.
A functional is a real-valued function on a space of functions. For example, if T(f) = ∫_0^1 f(x) dx, then the functional T maps suitably integrable functions onto the real line.
The indicator function 1{A} equals 1 if A is true and 0 otherwise. The real line is denoted ℝ, and p-dimensional real space is ℝ^p.
Computational Statistics, Second Edition. Geof H. Givens and Jennifer A. Hoeting. © 2013 John Wiley & Sons, Inc. Published 2013 by John Wiley & Sons, Inc.
1.2 TAYLOR'S THEOREM AND MATHEMATICAL LIMIT THEORY
First we define standard "big oh" and "little oh" notation for describing the relative orders of convergence of functions. Let the functions f and g be defined on a common, possibly infinite interval. Let z_0 be a point in this interval or a boundary point of it (i.e., −∞ or ∞). We require g(z) ≠ 0 for all z ≠ z_0 in a neighborhood of z_0. Then we say
f(z) = O(g(z))  (1.1)

if there exists a constant M such that |f(z)| ≤ M|g(z)| as z → z_0. For example, (n + 1)/(3n^2) = O(n^{−1}), where it is understood that we are considering n → ∞. If lim_{z→z_0} f(z)/g(z) = 0, then we say

f(z) = o(g(z)).  (1.2)

For example, f(x_0 + h) − f(x_0) = hf′(x_0) + o(h) as h → 0 if f is differentiable at x_0. The same notation can be used for describing the convergence of a sequence x_n as n → ∞, by letting f(n) = x_n.
Taylor's theorem provides a polynomial approximation to a function f. Suppose f has finite (n + 1)th derivative on (a, b) and continuous nth derivative on [a, b]. Then for any x_0 ∈ [a, b] distinct from x, the Taylor series expansion of f about x_0 is

f(x) = ∑_{i=0}^n (1/i!) f^{(i)}(x_0) (x − x_0)^i + R_n,  (1.3)

where f^{(i)}(x_0) is the ith derivative of f evaluated at x_0, and

R_n = (1/(n + 1)!) f^{(n+1)}(ξ) (x − x_0)^{n+1}  (1.4)

for some point ξ in the interval between x and x_0. As |x − x_0| → 0, note that R_n = O(|x − x_0|^{n+1}).
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x_0 ≠ x. Then

f(x) = f(x_0) + ∑_{i=1}^n (1/i!) D^{(i)}(f; x_0, x − x_0) + R_n,  (1.5)

where

D^{(i)}(f; x, y) = ∑_{j_1=1}^p ··· ∑_{j_i=1}^p ( (d^i / (dt_{j_1} ··· dt_{j_i})) f(t) |_{t=x} ) ∏_{k=1}^i y_{j_k}  (1.6)

and

R_n = (1/(n + 1)!) D^{(n+1)}(f; ξ, x − x_0)  (1.7)

for some ξ on the line segment joining x and x_0. As |x − x_0| → 0, note that R_n = O(|x − x_0|^{n+1}).
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

∫_0^1 f(x) dx = (f(0) + f(1))/2 − ∑_{i=1}^{n−1} b_{2i} (f^{(2i−1)}(1) − f^{(2i−1)}(0)) / (2i)! − b_{2n} f^{(2n)}(ξ) / (2n)!,  (1.8)

where 0 ≤ ξ ≤ 1, f^{(j)} is the jth derivative of f, and b_j = B_j(0) can be determined using the recursion relation

∑_{j=0}^m (m + 1 choose j) B_j(z) = (m + 1) z^m,  (1.9)

initialized with B_0(z) = 1. The proof of this result is based on repeated integrations by parts [376].
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

df(x)/dx_i ≈ (f(x + ε_i e_i) − f(x − ε_i e_i)) / (2 ε_i),  (1.10)

where ε_i is a small number and e_i is the unit vector in the ith coordinate direction. Typically, one might start with, say, ε_i = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller ε_i. The approximation will generally improve until ε_i becomes small enough that the calculation is degraded and eventually dominated by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via

d^2 f(x) / (dx_i dx_j) ≈ (1 / (4 ε_i ε_j)) ( f(x + ε_i e_i + ε_j e_j) − f(x + ε_i e_i − ε_j e_j) − f(x − ε_i e_i + ε_j e_j) + f(x − ε_i e_i − ε_j e_j) ),  (1.11)

with similar sequential precision improvements.
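Approximation (1.10) takes only a few lines to implement. The sketch below (a Python aside, not from the book) applies it to the hypothetical function f(x) = x_1^2 x_2 + sin(x_2), whose exact partial derivatives, 2 x_1 x_2 and x_1^2 + cos(x_2), are easy to verify by hand:

```python
import math

# Sketch (not from the book): central-difference gradient components (1.10).
def f(x):
    return x[0] ** 2 * x[1] + math.sin(x[1])

def grad_i(f, x, i, eps=1e-5):
    # f(x + eps_i e_i) - f(x - eps_i e_i), divided by 2 eps_i
    xp, xm = list(x), list(x)
    xp[i] += eps
    xm[i] -= eps
    return (f(xp) - f(xm)) / (2 * eps)

x = [1.5, 0.7]
assert abs(grad_i(f, x, 0) - 2 * 1.5 * 0.7) < 1e-8
assert abs(grad_i(f, x, 1) - (1.5**2 + math.cos(0.7))) < 1e-8
```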
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS
We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ∼ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters also will be denoted with a conditioning bar, as in f(x|α, β). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|α) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example, by using f(·|α) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are fX and fY, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.
The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or fX|Y(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|μ) = f(x|y, μ)f(y|μ). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.
The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^{1/2}/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so

g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y)    (1.12)

for all x, y ∈ I and all 0 < λ < 1. Then Jensen's inequality states that E{g(X)} ≥ g(E{X}) for any random variable X having P[X ∈ I] = 1.
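Although this book's examples are written in R, a quick simulation in any language makes the inequality concrete. The following Python sketch (ours, not from the text) compares Monte Carlo estimates of E{g(X)} and g(E{X}) for the convex function g(x) = x² and X ~ Unif(0, 1), where the exact values are 1/3 and 1/4.

```python
import random

random.seed(1)

def g(x):
    return x * x  # convex on the whole real line, so Jensen's inequality applies

# Monte Carlo approximation of both sides of E{g(X)} >= g(E{X}) for X ~ Unif(0,1)
sample = [random.uniform(0.0, 1.0) for _ in range(100_000)]
mean_of_g = sum(g(x) for x in sample) / len(sample)  # approximates E{g(X)} = 1/3
g_of_mean = g(sum(sample) / len(sample))             # approximates g(E{X}) = 1/4
```

With 100,000 draws the estimated E{g(X)} lands near 1/3 and g(E{X}) near 1/4, and the inequality holds with room to spare.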
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

n! = n(n − 1)(n − 2) · · · (3)(2)(1),  with 0! = 1,    (1.13)

(n choose k) = n!/[k!(n − k)!],    (1.14)
TABLE 1.1 Notation and description for common probability distributions of discrete random variables

Bernoulli
  Notation and parameter space: X ~ Bernoulli(p), 0 ≤ p ≤ 1
  Density and sample space: f(x) = p^x (1 − p)^{1−x}; x = 0 or 1
  Mean and variance: E{X} = p; var{X} = p(1 − p)

Binomial
  Notation and parameter space: X ~ Bin(n, p), 0 ≤ p ≤ 1, n ∈ {1, 2, ...}
  Density and sample space: f(x) = (n choose x) p^x (1 − p)^{n−x}; x = 0, 1, ..., n
  Mean and variance: E{X} = np; var{X} = np(1 − p)

Multinomial
  Notation and parameter space: X ~ Multinomial(n, p), p = (p1, ..., pk), 0 ≤ pi ≤ 1 with Σ_{i=1}^k pi = 1, n ∈ {1, 2, ...}
  Density and sample space: f(x) = (n choose x1 · · · xk) ∏_{i=1}^k pi^{xi}; x = (x1, ..., xk), xi ∈ {0, 1, ..., n} with Σ_{i=1}^k xi = n
  Mean and variance: E{X} = np; var{Xi} = npi(1 − pi); cov{Xi, Xj} = −npipj

Negative Binomial
  Notation and parameter space: X ~ NegBin(r, p), 0 ≤ p ≤ 1, r ∈ {1, 2, ...}
  Density and sample space: f(x) = (x + r − 1 choose r − 1) p^r (1 − p)^x; x ∈ {0, 1, ...}
  Mean and variance: E{X} = r(1 − p)/p; var{X} = r(1 − p)/p²

Poisson
  Notation and parameter space: X ~ Poisson(λ), λ > 0
  Density and sample space: f(x) = λ^x exp{−λ}/x!; x ∈ {0, 1, ...}
  Mean and variance: E{X} = λ; var{X} = λ
TABLE 1.2 Notation and description for some common probability distributions of continuous random variables

Beta
  Notation and parameter space: X ~ Beta(α, β), α > 0 and β > 0
  Density and sample space: f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1}; 0 ≤ x ≤ 1
  Mean and variance: E{X} = α/(α + β); var{X} = αβ/[(α + β)²(α + β + 1)]

Cauchy
  Notation and parameter space: X ~ Cauchy(α, β), α ∈ ℝ and β > 0
  Density and sample space: f(x) = 1/(πβ[1 + ((x − α)/β)²]); x ∈ ℝ
  Mean and variance: E{X} is nonexistent; var{X} is nonexistent

Chi-square
  Notation and parameter space: X ~ χ²_ν, ν > 0
  Density and sample space: same as Gamma(ν/2, 1/2); x > 0
  Mean and variance: E{X} = ν; var{X} = 2ν

Dirichlet
  Notation and parameter space: X ~ Dirichlet(α), α = (α1, ..., αk), αi > 0, with α0 = Σ_{i=1}^k αi
  Density and sample space: f(x) = [Γ(α0)/∏_{i=1}^k Γ(αi)] ∏_{i=1}^k xi^{αi−1}; x = (x1, ..., xk), 0 ≤ xi ≤ 1 with Σ_{i=1}^k xi = 1
  Mean and variance: E{X} = α/α0; var{Xi} = αi(α0 − αi)/[α0²(α0 + 1)]; cov{Xi, Xj} = −αiαj/[α0²(α0 + 1)]

Exponential
  Notation and parameter space: X ~ Exp(λ), λ > 0
  Density and sample space: f(x) = λ exp{−λx}; x > 0
  Mean and variance: E{X} = 1/λ; var{X} = 1/λ²

Gamma
  Notation and parameter space: X ~ Gamma(r, λ), λ > 0 and r > 0
  Density and sample space: f(x) = [λ^r x^{r−1}/Γ(r)] exp{−λx}; x > 0
  Mean and variance: E{X} = r/λ; var{X} = r/λ²
TABLE 1.3 Notation and description for more common probability distributions of continuous random variables

Lognormal
  Notation and parameter space: X ~ Lognormal(μ, σ²), μ ∈ ℝ and σ > 0
  Density and sample space: f(x) = [1/(x√(2πσ²))] exp{−(1/2)((log x − μ)/σ)²}; x > 0
  Mean and variance: E{X} = exp{μ + σ²/2}; var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}

Multivariate Normal
  Notation and parameter space: X ~ N_k(μ, Σ), μ = (μ1, ..., μk) ∈ ℝ^k, Σ positive definite
  Density and sample space: f(x) = exp{−(x − μ)^T Σ^{−1} (x − μ)/2}/[(2π)^{k/2} |Σ|^{1/2}]; x = (x1, ..., xk) ∈ ℝ^k
  Mean and variance: E{X} = μ; var{X} = Σ

Normal
  Notation and parameter space: X ~ N(μ, σ²), μ ∈ ℝ and σ > 0
  Density and sample space: f(x) = [1/√(2πσ²)] exp{−(1/2)((x − μ)/σ)²}; x ∈ ℝ
  Mean and variance: E{X} = μ; var{X} = σ²

Student's t
  Notation and parameter space: X ~ t_ν, ν > 0
  Density and sample space: f(x) = [Γ((ν + 1)/2)/(Γ(ν/2)√(πν))] (1 + x²/ν)^{−(ν+1)/2}; x ∈ ℝ
  Mean and variance: E{X} = 0 if ν > 1; var{X} = ν/(ν − 2) if ν > 2

Uniform
  Notation and parameter space: X ~ Unif(a, b), a, b ∈ ℝ and a < b
  Density and sample space: f(x) = 1/(b − a); x ∈ [a, b]
  Mean and variance: E{X} = (a + b)/2; var{X} = (b − a)²/12

Weibull
  Notation and parameter space: X ~ Weibull(a, b), a > 0 and b > 0
  Density and sample space: f(x) = abx^{b−1} exp{−ax^b}; x > 0
  Mean and variance: E{X} = Γ(1 + 1/b)/a^{1/b}; var{X} = [Γ(1 + 2/b) − Γ(1 + 1/b)²]/a^{2/b}
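The moment formulas in these tables are easy to sanity-check by simulation. As an illustration (not from the text; the book's own code is in R), the Python sketch below compares Monte Carlo moment estimates for a Beta(2, 3) sample against the Table 1.2 entries E{X} = α/(α + β) and var{X} = αβ/[(α + β)²(α + β + 1)].

```python
import random

random.seed(2)

def beta_moments_mc(alpha, beta, n=200_000):
    """Monte Carlo estimates of E{X} and var{X} for X ~ Beta(alpha, beta)."""
    xs = [random.betavariate(alpha, beta) for _ in range(n)]
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / n
    return m, v

a, b = 2.0, 3.0
mean_mc, var_mc = beta_moments_mc(a, b)
mean_formula = a / (a + b)                          # Table 1.2: 0.4
var_formula = a * b / ((a + b) ** 2 * (a + b + 1))  # Table 1.2: 0.04
```

The simulated moments agree with the tabulated formulas to within Monte Carlo error.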
(n choose k1 · · · km) = n!/∏_{i=1}^m ki!,  where n = Σ_{i=1}^m ki,    (1.15)

Γ(r) = (r − 1)!  for r = 1, 2, ...;  Γ(r) = ∫₀^∞ t^{r−1} exp{−t} dt  for general r > 0.    (1.16)

It is worth knowing that

Γ(1/2) = √π  and  Γ(n + 1/2) = [1 × 3 × 5 × · · · × (2n − 1)] √π / 2^n

for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as

f(x|γ) = c1(x) c2(γ) exp{ Σ_{i=1}^k yi(x) θi(γ) }    (1.17)
for nonnegative functions c1 and c2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θi(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The yi(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that
E{y(X)} = κ′(θ)    (1.18)

and

var{y(X)} = κ″(θ),    (1.19)
where κ(θ) = −log c3(θ), letting c3(θ) denote the reexpression of c2(γ) in terms of the canonical parameters θ = (θ1, ..., θk), and y(X) = (y1(X), ..., yk(X)). These results can be rewritten in terms of the original parameters γ as
E{ Σ_{i=1}^k [dθi(γ)/dγj] yi(X) } = −(d/dγj) log c2(γ)    (1.20)

and

var{ Σ_{i=1}^k [dθi(γ)/dγj] yi(X) } = −(d²/dγj²) log c2(γ) − E{ Σ_{i=1}^k [d²θi(γ)/dγj²] yi(X) }.    (1.21)
Example 1.1 (Poisson). The Poisson distribution belongs to the exponential family, with c1(x) = 1/x!, c2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ′(θ) = exp{θ} = λ and var{X} = κ″(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
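The identities E{y(X)} = κ′(θ) and var{y(X)} = κ″(θ) can be checked numerically for this example. In the Python sketch below (an illustration of ours, not the book's R code), κ(θ) = exp{θ} is differentiated by central finite differences, in the spirit of (1.10), and both derivatives recover λ.

```python
import math

lam = 3.0
theta = math.log(lam)  # canonical parameter theta = log(lambda)
kappa = math.exp       # for the Poisson family, kappa(theta) = exp(theta)

eps = 1e-5
# central difference approximations to kappa'(theta) and kappa''(theta)
kappa_prime = (kappa(theta + eps) - kappa(theta - eps)) / (2 * eps)
kappa_double_prime = (kappa(theta + eps) - 2 * kappa(theta) + kappa(theta - eps)) / eps ** 2
# both should be close to lambda, matching E{X} = var{X} = lambda
```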
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X1, ..., Xp) denote a p-dimensional random variable with continuous density function f. Suppose that

U = g(X) = (g1(X), ..., gp(X)) = (U1, ..., Up),    (1.22)
where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

f(u) = f(g^{−1}(u)) |J(u)|,    (1.23)

where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g^{−1} evaluated at u, having (i, j)th element dxi/duj, where these derivatives are assumed to be continuous over the support region of U.
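As a one-dimensional illustration of (1.23) (our Python sketch, not an example from the text), take X ~ N(0, 1) and U = g(X) = exp{X}. Then g^{−1}(u) = log u with dx/du = 1/u, so f_U(u) = φ(log u)/u, where φ is the standard normal density; the code confirms this against a finite-difference derivative of the cumulative distribution function Φ(log u).

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal cumulative distribution function, via erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def f_U(u):
    """Density of U = exp(X) from (1.23): f(g^{-1}(u)) * |J(u)| = phi(log u) / u."""
    return phi(math.log(u)) / u

# the density should match the numerical derivative of F_U(u) = Phi(log u)
u, h = 2.0, 1e-6
finite_diff = (Phi(math.log(u + h)) - Phi(math.log(u - h))) / (2.0 * h)
```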
1.4 LIKELIHOOD INFERENCE
If X1, ..., Xn are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ1, ..., θp), then the joint likelihood function is

L(θ) = ∏_{i=1}^n f(xi|θ).    (1.24)
When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x1, ..., xn|θ) viewed as a function of θ.

The observed data, x1, ..., xn, might have been realized under many different values for θ. The parameters for which observing x1, ..., xn would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x1, ..., xn that maximizes L(θ), then θ̂ = ϑ(X1, ..., Xn) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.
It is typically easier to work with the log likelihood function

l(θ) = log L(θ),    (1.25)

which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x1, ..., xn but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations

l′(θ) = 0,    (1.26)

where

l′(θ) = ( dl(θ)/dθ1, ..., dl(θ)/dθp )
is called the score function. The score function satisfies

E{l′(θ)} = 0,    (1.27)

where the expectation is taken with respect to the distribution of X1, ..., Xn. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.
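For instance, for i.i.d. Poisson data the score equation l′(λ) = Σxi/λ − n = 0 does have the closed-form solution λ̂ = x̄, which makes it a convenient test case for a numerical root-finder. The Python sketch below (ours; the book's code is in R, and the counts are made up) applies Newton iterations to the score function and recovers the sample mean.

```python
# Poisson MLE by applying Newton's method to the score equation
# l'(lambda) = s / lambda - n = 0, where s = sum of the observations.
data = [2, 4, 1, 0, 3, 5, 2, 2]  # hypothetical counts
n, s = len(data), sum(data)

def score(lam):
    return s / lam - n

def score_derivative(lam):
    return -s / lam ** 2

lam = 1.0  # starting value
for _ in range(50):
    lam -= score(lam) / score_derivative(lam)

analytic_mle = s / n  # the known closed-form answer, for comparison
```

Newton's method for root-finding of this kind is the subject of Chapter 2.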
The MLE has a sampling distribution because it depends on the realization of the random variables X1, ..., Xn. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: When the log likelihood is very pointy, the location of the maximum is more precisely known.
To make this precise, let l″(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθi dθj). The Fisher information matrix is defined as

I(θ) = E{l′(θ) l′(θ)^T} = −E{l″(θ)},    (1.28)

where the expectations are taken with respect to the distribution of X1, ..., Xn. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l″(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.
Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)^{−1}, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is N_p(θ*, I(θ*)^{−1}). Since the true parameter values are unknown, I(θ*)^{−1} must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)^{−1}. Alternatively, it is also reasonable to use −l″(θ̂)^{−1}. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)^{−1}. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)^{−1} can be found in [127, 182, 371, 470].
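For a concrete check (our Python sketch with made-up counts, not from the text): for i.i.d. Poisson data the observed Fisher information −l″(λ̂) can be approximated by a central second difference of the log likelihood, and the resulting standard error estimate matches the analytic value √(λ̂/n).

```python
import math

data = [2, 4, 1, 0, 3, 5, 2, 2]  # hypothetical counts
n, s = len(data), sum(data)
lam_hat = s / n  # the Poisson MLE

def loglik(lam):
    # additive constants (the -log x_i! terms) are omitted, as the text permits
    return s * math.log(lam) - n * lam

# observed Fisher information -l''(lam_hat) by a central second difference
eps = 1e-4
observed_info = -(loglik(lam_hat + eps) - 2 * loglik(lam_hat) + loglik(lam_hat - eps)) / eps ** 2
se_numeric = math.sqrt(1.0 / observed_info)
se_analytic = math.sqrt(lam_hat / n)  # since -l''(lam_hat) = n / lam_hat here
```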
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is

L(φ|μ̂(φ)) = max_μ L(μ, φ).    (1.29)

Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
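As a small worked example (our Python sketch with made-up data, not from the text): for a N(μ, σ²) model with μ the parameter of interest, the inner maximization over σ² has the closed form σ̂²(μ) = (1/n)Σ(xi − μ)², so the profile log likelihood is, up to a constant, −(n/2) log σ̂²(μ). A grid search over μ then maximizes the profile at the sample mean, as theory predicts.

```python
import math

data = [1.2, 0.8, 2.1, 1.5, 0.9, 1.7]  # hypothetical observations
n = len(data)

def profile_loglik(mu):
    """Profile log likelihood for mu: the nuisance sigma^2 is maximized out in closed form."""
    sigma2_hat = sum((x - mu) ** 2 for x in data) / n
    return -0.5 * n * math.log(sigma2_hat)

# grid search over candidate values of mu
grid = [i / 1000.0 for i in range(500, 2500)]
mu_profile = max(grid, key=profile_loglik)
sample_mean = sum(data) / n  # the maximizer predicted by theory
```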
CONTENTS
PREFACE xv
ACKNOWLEDGMENTS xvii
1 REVIEW 1
1.1 Mathematical Notation 1
1.2 Taylor's Theorem and Mathematical Limit Theory 2
1.3 Statistical Notation and Probability Distributions 4
1.4 Likelihood Inference 9
1.5 Bayesian Inference 11
1.6 Statistical Limit Theory 13
1.7 Markov Chains 14
1.8 Computing 17
PART I
OPTIMIZATION
2 OPTIMIZATION AND SOLVING NONLINEAR EQUATIONS 21
2.1 Univariate Problems 22
2.1.1 Newton's Method 26
2.1.1.1 Convergence Order 29
2.1.2 Fisher Scoring 30
2.1.3 Secant Method 30
2.1.4 Fixed-Point Iteration 32
2.1.4.1 Scaling 33
2.2 Multivariate Problems 34
2.2.1 Newton's Method and Fisher Scoring 34
2.2.1.1 Iteratively Reweighted Least Squares 36
2.2.2 Newton-Like Methods 39
2.2.2.1 Ascent Algorithms 39
2.2.2.2 Discrete Newton and Fixed-Point Methods 41
2.2.2.3 Quasi-Newton Methods 41
2.2.3 Gauss–Newton Method 44
2.2.4 Nelder–Mead Algorithm 45
2.2.5 Nonlinear Gauss–Seidel Iteration 52
Problems 54
3 COMBINATORIAL OPTIMIZATION 59
3.1 Hard Problems and NP-Completeness 59
3.1.1 Examples 61
3.1.2 Need for Heuristics 64
3.2 Local Search 65
3.3 Simulated Annealing 68
3.3.1 Practical Issues 70
3.3.1.1 Neighborhoods and Proposals 70
3.3.1.2 Cooling Schedule and Convergence 71
3.3.2 Enhancements 74
3.4 Genetic Algorithms 75
3.4.1 Definitions and the Canonical Algorithm 75
3.4.1.1 Basic Definitions 75
3.4.1.2 Selection Mechanisms and Genetic Operators 76
3.4.1.3 Allele Alphabets and Genotypic Representation 78
3.4.1.4 Initialization, Termination, and Parameter Values 79
3.4.2 Variations 80
3.4.2.1 Fitness 80
3.4.2.2 Selection Mechanisms and Updating Generations 81
3.4.2.3 Genetic Operators and Permutation Chromosomes 82
3.4.3 Initialization and Parameter Values 84
3.4.4 Convergence 84
3.5 Tabu Algorithms 85
3.5.1 Basic Definitions 86
3.5.2 The Tabu List 87
3.5.3 Aspiration Criteria 88
3.5.4 Diversification 89
3.5.5 Intensification 90
3.5.6 Comprehensive Tabu Algorithm 91
Problems 92
4 EM OPTIMIZATION METHODS 97
4.1 Missing Data, Marginalization, and Notation 97
4.2 The EM Algorithm 98
4.2.1 Convergence 102
4.2.2 Usage in Exponential Families 105
4.2.3 Variance Estimation 106
4.2.3.1 Louis's Method 106
4.2.3.2 SEM Algorithm 108
4.2.3.3 Bootstrapping 110
4.2.3.4 Empirical Information 110
4.2.3.5 Numerical Differentiation 111
4.3 EM Variants 111
4.3.1 Improving the E Step 111
4.3.1.1 Monte Carlo EM 111
4.3.2 Improving the M Step 112
4.3.2.1 ECM Algorithm 113
4.3.2.2 EM Gradient Algorithm 116
4.3.3 Acceleration Methods 118
4.3.3.1 Aitken Acceleration 118
4.3.3.2 Quasi-Newton Acceleration 119
Problems 121
PART II
INTEGRATION AND SIMULATION
5 NUMERICAL INTEGRATION 129
5.1 Newton–Cotes Quadrature 129
5.1.1 Riemann Rule 130
5.1.2 Trapezoidal Rule 134
5.1.3 Simpson's Rule 136
5.1.4 General kth-Degree Rule 138
5.2 Romberg Integration 139
5.3 Gaussian Quadrature 142
5.3.1 Orthogonal Polynomials 143
5.3.2 The Gaussian Quadrature Rule 143
5.4 Frequently Encountered Problems 146
5.4.1 Range of Integration 146
5.4.2 Integrands with Singularities or Other Extreme Behavior 146
5.4.3 Multiple Integrals 147
5.4.4 Adaptive Quadrature 147
5.4.5 Software for Exact Integration 148
Problems 148
6 SIMULATION AND MONTE CARLO INTEGRATION 151
6.1 Introduction to the Monte Carlo Method 151
6.2 Exact Simulation 152
6.2.1 Generating from Standard Parametric Families 153
6.2.2 Inverse Cumulative Distribution Function 153
6.2.3 Rejection Sampling 155
6.2.3.1 Squeezed Rejection Sampling 158
6.2.3.2 Adaptive Rejection Sampling 159
6.3 Approximate Simulation 163
6.3.1 Sampling Importance Resampling Algorithm 163
6.3.1.1 Adaptive Importance, Bridge, and Path Sampling 167
6.3.2 Sequential Monte Carlo 168
6.3.2.1 Sequential Importance Sampling for Markov Processes 169
6.3.2.2 General Sequential Importance Sampling 170
6.3.2.3 Weight Degeneracy, Rejuvenation, and Effective Sample Size 171
6.3.2.4 Sequential Importance Sampling for Hidden Markov Models 175
6.3.2.5 Particle Filters 179
6.4 Variance Reduction Techniques 180
6.4.1 Importance Sampling 180
6.4.2 Antithetic Sampling 186
6.4.3 Control Variates 189
6.4.4 Rao–Blackwellization 193
Problems 195
7 MARKOV CHAIN MONTE CARLO 201
71 MetropolisndashHastings Algorithm 202
711 Independence Chains 204712 Random Walk Chains 206
72 Gibbs Sampling 209
721 Basic Gibbs Sampler 209722 Properties of the Gibbs Sampler 214723 Update Ordering 216724 Blocking 216725 Hybrid Gibbs Sampling 216726 GriddyndashGibbs Sampler 218
73 Implementation 218
731 Ensuring Good Mixing and Convergence 219
7311 Simple Graphical Diagnostics 219
7312 Burn-in and Run Length 220
7313 Choice of Proposal 222
7314 Reparameterization 223
7315 Comparing Chains Effective Sample Size 224
7316 Number of Chains 225
7.3.2 Practical Implementation Advice 226
7.3.3 Using the Results 226
Problems 230
8 ADVANCED TOPICS IN MCMC 237
8.1 Adaptive MCMC 237
8.1.1 Adaptive Random Walk Metropolis-within-Gibbs Algorithm 238
8.1.2 General Adaptive Metropolis-within-Gibbs Algorithm 240
8.1.3 Adaptive Metropolis Algorithm 247
8.2 Reversible Jump MCMC 250
8.2.1 RJMCMC for Variable Selection in Regression 253
8.3 Auxiliary Variable Methods 256
8.3.1 Simulated Tempering 257
8.3.2 Slice Sampler 258
8.4 Other Metropolis–Hastings Algorithms 260
8.4.1 Hit-and-Run Algorithm 260
8.4.2 Multiple-Try Metropolis–Hastings Algorithm 261
8.4.3 Langevin Metropolis–Hastings Algorithm 262
8.5 Perfect Sampling 264
8.5.1 Coupling from the Past 264
8.5.1.1 Stochastic Monotonicity and Sandwiching 267
8.6 Markov Chain Maximum Likelihood 268
8.7 Example: MCMC for Markov Random Fields 269
8.7.1 Gibbs Sampling for Markov Random Fields 270
8.7.2 Auxiliary Variable Methods for Markov Random Fields 274
8.7.3 Perfect Sampling for Markov Random Fields 277
Problems 279
PART III
BOOTSTRAPPING
9 BOOTSTRAPPING 287
9.1 The Bootstrap Principle 287
9.2 Basic Methods 288
9.2.1 Nonparametric Bootstrap 288
9.2.2 Parametric Bootstrap 289
9.2.3 Bootstrapping Regression 290
9.2.4 Bootstrap Bias Correction 291
9.3 Bootstrap Inference 292
9.3.1 Percentile Method 292
9.3.1.1 Justification for the Percentile Method 293
9.3.2 Pivoting 294
9.3.2.1 Accelerated Bias-Corrected Percentile Method, BCa 294
9.3.2.2 The Bootstrap t 296
9.3.2.3 Empirical Variance Stabilization 298
9.3.2.4 Nested Bootstrap and Prepivoting 299
9.3.3 Hypothesis Testing 301
9.4 Reducing Monte Carlo Error 302
9.4.1 Balanced Bootstrap 302
9.4.2 Antithetic Bootstrap 302
9.5 Bootstrapping Dependent Data 303
9.5.1 Model-Based Approach 304
9.5.2 Block Bootstrap 304
9.5.2.1 Nonmoving Block Bootstrap 304
9.5.2.2 Moving Block Bootstrap 306
9.5.2.3 Blocks-of-Blocks Bootstrapping 307
9.5.2.4 Centering and Studentizing 309
9.5.2.5 Block Size 311
9.6 Bootstrap Performance 315
9.6.1 Independent Data Case 315
9.6.2 Dependent Data Case 316
9.7 Other Uses of the Bootstrap 316
9.8 Permutation Tests 317
Problems 319
PART IV
DENSITY ESTIMATION AND SMOOTHING
10 NONPARAMETRIC DENSITY ESTIMATION 325
10.1 Measures of Performance 326
10.2 Kernel Density Estimation 327
10.2.1 Choice of Bandwidth 329
10.2.1.1 Cross-Validation 332
10.2.1.2 Plug-in Methods 335
10.2.1.3 Maximal Smoothing Principle 338
10.2.2 Choice of Kernel 339
10.2.2.1 Epanechnikov Kernel 339
10.2.2.2 Canonical Kernels and Rescalings 340
10.3 Nonkernel Methods 341
10.3.1 Logspline 341
10.4 Multivariate Methods 345
10.4.1 The Nature of the Problem 345
10.4.2 Multivariate Kernel Estimators 346
10.4.3 Adaptive Kernels and Nearest Neighbors 348
10.4.3.1 Nearest Neighbor Approaches 349
10.4.3.2 Variable-Kernel Approaches and Transformations 350
10.4.4 Exploratory Projection Pursuit 353
Problems 359
11 BIVARIATE SMOOTHING 363
11.1 Predictor–Response Data 363
11.2 Linear Smoothers 365
11.2.1 Constant-Span Running Mean 366
11.2.1.1 Effect of Span 368
11.2.1.2 Span Selection for Linear Smoothers 369
11.2.2 Running Lines and Running Polynomials 372
11.2.3 Kernel Smoothers 374
11.2.4 Local Regression Smoothing 374
11.2.5 Spline Smoothing 376
11.2.5.1 Choice of Penalty 377
11.3 Comparison of Linear Smoothers 377
11.4 Nonlinear Smoothers 379
11.4.1 Loess 379
11.4.2 Supersmoother 381
11.5 Confidence Bands 384
11.6 General Bivariate Data 388
Problems 389
12 MULTIVARIATE SMOOTHING 393
12.1 Predictor–Response Data 393
12.1.1 Additive Models 394
12.1.2 Generalized Additive Models 397
12.1.3 Other Methods Related to Additive Models 399
12.1.3.1 Projection Pursuit Regression 399
12.1.3.2 Neural Networks 402
12.1.3.3 Alternating Conditional Expectations 403
12.1.3.4 Additivity and Variance Stabilization 404
12.1.4 Tree-Based Methods 405
12.1.4.1 Recursive Partitioning: Regression Trees 406
12.1.4.2 Tree Pruning 409
12.1.4.3 Classification Trees 411
12.1.4.4 Other Issues for Tree-Based Methods 412
12.2 General Multivariate Data 413
12.2.1 Principal Curves 413
12.2.1.1 Definition and Motivation 413
12.2.1.2 Estimation 415
12.2.1.3 Span Selection 416
Problems 416
DATA ACKNOWLEDGMENTS 421
REFERENCES 423
INDEX 457
PREFACE
This book covers most topics needed to develop a broad and thorough working knowledge of modern computational statistics. We seek to develop a practical understanding of how and why existing methods work, enabling readers to use modern statistical methods effectively. Since many new methods are built from components of existing techniques, our ultimate goal is to provide scientists with the tools they need to contribute new ideas to the field.

A growing challenge in science is that there is so much of it. While the pursuit of important new methods and the honing of existing approaches is a worthy goal, there is also a need to organize and distill the teeming jungle of ideas. We attempt to do that here. Our choice of topics reflects our view of what constitutes the core of the evolving field of computational statistics, and what will be interesting and useful for our readers.

Our use of the adjective modern in the first sentence of this preface is potentially troublesome. There is no way that this book can cover all the latest, greatest techniques. We have not even tried. We have instead aimed to provide a reasonably up-to-date survey of a broad portion of the field, while leaving room for diversions and esoterica.

The foundations of optimization and numerical integration are covered in this book. We include these venerable topics because (i) they are cornerstones of frequentist and Bayesian inference, (ii) routine application of available software often fails for hard problems, and (iii) the methods themselves are often secondary components of other statistical computing algorithms. Some topics we have omitted represent important areas of past and present research in the field, but their priority here is lowered by the availability of high-quality software. For example, the generation of pseudo-random numbers is a classic topic, but one that we prefer to address by giving students reliable software. Finally, some topics (e.g., principal curves and tabu search) are included simply because they are interesting and provide very different perspectives on familiar problems. Perhaps a future researcher may draw ideas from such topics to design a creative and effective new algorithm.

In this second edition we have both updated and broadened our coverage, and we now provide computer code. For example, we have added new MCMC topics to reflect continued activity in that popular area. A notable increase in breadth is our inclusion of more methods relevant for problems where statistical dependency is important, such as block bootstrapping and sequential importance sampling. This second edition provides extensive new support in R. Specifically, code for the examples in this book is available from the book website, www.stat.colostate.edu/computationalstatistics.

Our target audience includes graduate students in statistics and related fields, statisticians, and quantitative empirical scientists in other fields. We hope such readers may use the book when applying standard methods and developing new methods.
The level of mathematics expected of the reader does not extend much beyond Taylor series and linear algebra. Breadth of mathematical training is more helpful than depth. Essential review is provided in Chapter 1. More advanced readers will find greater mathematical detail in the wide variety of high-quality books available on specific topics, many of which are referenced in the text. Other readers, caring less about analytical details, may prefer to focus on our descriptions of algorithms and examples.

The expected level of statistics is equivalent to that obtained by a graduate student in his or her first year of study of the theory of statistics and probability. An understanding of maximum likelihood methods, Bayesian methods, elementary asymptotic theory, Markov chains, and linear models is most important. Many of these topics are reviewed in Chapter 1.

With respect to computer programming, we find that good students can learn as they go. However, a working knowledge of a suitable language allows implementation of the ideas covered in this book to progress much more quickly. We have chosen to forgo any language-specific examples, algorithms, or coding in the text. For those wishing to learn a language while they study this book, we recommend that you choose a high-level, interactive package that permits the flexible design of graphical displays and includes supporting statistics and probability functions, such as R and MATLAB.¹ These are the sort of languages often used by researchers during the development of new statistical computing techniques, and they are suitable for implementing all the methods we describe, except in some cases for problems of vast scope or complexity. We use R and recommend it. Although lower-level languages such as C++ could also be used, they are more appropriate for professional-grade implementation of algorithms after researchers have refined the methodology.

The book is organized into four major parts: optimization (Chapters 2, 3, and 4), integration and simulation (Chapters 5, 6, 7, and 8), bootstrapping (Chapter 9), and density estimation and smoothing (Chapters 10, 11, and 12). The chapters are written to stand independently, so a course can be built by selecting the topics one wishes to teach. For a one-semester course, our selection typically weights most heavily topics from Chapters 2, 3, 6, 7, 9, 10, and 11. With a leisurely pace or more thorough coverage, a shorter list of topics could still easily fill a semester course. There is sufficient material here to provide a thorough one-year course of study, notwithstanding any supplemental topics one might wish to teach.

A variety of homework problems are included at the end of each chapter. Some are straightforward, while others require the student to develop a thorough understanding of the model/method being used, to carefully (and perhaps cleverly) code a suitable technique, and to devote considerable attention to the interpretation of results. A few exercises invite open-ended exploration of methods and ideas. We are sometimes asked for solutions to the exercises, but we prefer to sequester them to preserve the challenge for future students and readers.

The datasets discussed in the examples and exercises are available from the book website, www.stat.colostate.edu/computationalstatistics. The R code is also provided there. Finally, the website includes an errata. Responsibility for all errors lies with us.

¹R is available for free from www.r-project.org. Information about MATLAB can be found at www.mathworks.com.
ACKNOWLEDGMENTS
The course upon which this book is based was developed and taught by us at Colorado State University from 1994 onwards. Thanks are due to our many students, who have been semiwilling guinea pigs over the years. We also thank our colleagues in the Statistics Department for their continued support. The late Richard Tweedie merits particular acknowledgment for his mentoring during the early years of our careers.

We owe a great deal of intellectual debt to Adrian Raftery, who deserves special thanks not only for his teaching and advising, but also for his unwavering support and his seemingly inexhaustible supply of good ideas. In addition, we thank our influential advisors and teachers at the University of Washington Statistics Department, including David Madigan, Werner Stuetzle, and Judy Zeh. Of course, each of our chapters could be expanded into a full-length book, and great scholars have already done so. We owe much to their efforts, upon which we relied when developing our course and our manuscript.

Portions of the first edition were written at the Department of Mathematics and Statistics, University of Otago, in Dunedin, New Zealand, whose faculty we thank for graciously hosting us during our sabbatical in 2003. Much of our work on the second edition was undertaken during our sabbatical visit to the Australia Commonwealth Scientific and Research Organization in 2009–2010, sponsored by CSIRO Mathematics, Informatics and Statistics, and hosted at the Longpocket Laboratory in Indooroopilly, Australia. We thank our hosts and colleagues there for their support.

Our manuscript has been greatly improved through the constructive reviews of John Bickham, Ben Bird, Kate Cowles, Jan Hannig, Alan Herlihy, David Hunter, Devin Johnson, Michael Newton, Doug Nychka, Steve Sain, David W. Scott, N. Scott Urquhart, Haonan Wang, Darrell Whitley, and eight anonymous referees. We also thank the sharp-eyed readers listed in the errata for their suggestions and corrections. Our editor Steve Quigley and the folks at Wiley were supportive and helpful during the publication process. We thank Nelida Pohl for permission to adapt her photograph in the cover design of the first edition. We thank Melinda Stelzer for permission to use her painting "Champagne Circuit," 2001, for the cover of the second edition. More about her art can be found at www.facebook.com/geekchicart. We also owe a special note of thanks to Zube (a.k.a. John Dzubera), who kept our own computers running despite our best efforts to the contrary.

Funding from National Science Foundation (NSF) CAREER grant SBR-9875508 was a significant source of support for the first author during the preparation of the first edition. He also thanks his colleagues and friends in the North Slope Borough, Alaska, Department of Wildlife Management for their longtime research support. The second author gratefully acknowledges the support of STAR Research
Assistance Agreement CR-829095, awarded to Colorado State University by the U.S. Environmental Protection Agency (EPA). The views expressed here are solely those of the authors. NSF and EPA do not endorse any products or commercial services mentioned herein.

Finally, we thank our parents for enabling and supporting our educations, and for providing us with the "stubbornness gene" necessary for graduate school, the tenure track, or book publication: take your pick. The second edition is dedicated to our kids, Natalie and Neil, for continuing to show us what is important and what is not.

Geof H. Givens
Jennifer A. Hoeting
CHAPTER 1
REVIEW

This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.
1.1 MATHEMATICAL NOTATION
We use boldface to distinguish a vector x = (x1, ..., xp) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f1(x), ..., fp(x)). The transpose of M is denoted M^T.

Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x1, ..., xn)^T. Let I denote an identity matrix, and 1 and 0 denote vectors of ones and zeros, respectively.

A symmetric square matrix M is positive definite if x^T M x > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite or positive semidefinite if x^T M x ≥ 0 for all nonzero vectors x.
The derivative of a function f, evaluated at x, is denoted f′(x). When x = (x1, ..., xp), the gradient of f at x is

f′(x) = ( df(x)/dx1, ..., df(x)/dxp ).
The Hessian matrix for f at x is f″(x), having (i, j)th element equal to d²f(x)/(dxi dxj). The negative Hessian has important uses in statistical inference.

Let J(x) denote the Jacobian matrix evaluated at x for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to dfi(x)/dxj.

A functional is a real-valued function on a space of functions. For example, if T(f) = ∫₀¹ f(x) dx, then the functional T maps suitably integrable functions onto the real line.

The indicator function 1_{A} equals 1 if A is true and 0 otherwise. The real line is denoted ℝ, and p-dimensional real space is ℝ^p.
Computational Statistics Second Edition Geof H Givens and Jennifer A Hoetingcopy 2013 John Wiley amp Sons Inc Published 2013 by John Wiley amp Sons Inc
1.2 TAYLOR'S THEOREM AND MATHEMATICAL LIMIT THEORY
First, we define standard "big oh" and "little oh" notation for describing the relative orders of convergence of functions. Let the functions f and g be defined on a common, possibly infinite interval. Let z0 be a point in this interval or a boundary point of it (i.e., −∞ or ∞). We require g(z) ≠ 0 for all z ≠ z0 in a neighborhood of z0. Then we say

f(z) = O(g(z))    (1.1)

if there exists a constant M such that |f(z)| ≤ M|g(z)| as z → z0. For example, (n + 1)/(3n²) = O(n^{−1}), and it is understood that we are considering n → ∞. If lim_{z→z0} f(z)/g(z) = 0, then we say

f(z) = o(g(z)).    (1.2)

For example, f(x0 + h) − f(x0) = hf′(x0) + o(h) as h → 0 if f is differentiable at x0. The same notation can be used for describing the convergence of a sequence {xn} as n → ∞, by letting f(n) = xn.
Taylor's theorem provides a polynomial approximation to a function f. Suppose f has finite (n + 1)th derivative on (a, b) and continuous nth derivative on [a, b]. Then for any x0 ∈ [a, b] distinct from x, the Taylor series expansion of f about x0 is

f(x) = Σ_{i=0}^n (1/i!) f^{(i)}(x0)(x − x0)^i + Rn,    (1.3)

where f^{(i)}(x0) is the ith derivative of f evaluated at x0, and

Rn = [1/(n + 1)!] f^{(n+1)}(ξ)(x − x0)^{n+1}    (1.4)

for some point ξ in the interval between x and x0. As |x − x0| → 0, note that Rn = O(|x − x0|^{n+1}).
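The order of the remainder is easy to see numerically. In this Python sketch (ours, not from the text), exp(h) is expanded about x0 = 0 with n = 2; halving h should reduce the remainder by about a factor of 2³ = 8, consistent with Rn = O(|x − x0|^{n+1}).

```python
import math

def taylor2(h):
    """Second-order Taylor expansion of exp about 0."""
    return 1.0 + h + h * h / 2.0

steps = [0.1, 0.05, 0.025]
remainders = [abs(math.exp(h) - taylor2(h)) for h in steps]
# each halving of h should shrink the remainder by roughly 2**3 = 8
ratios = [remainders[i] / remainders[i + 1] for i in range(len(steps) - 1)]
```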
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x0 \neq x. Then

    f(x) = f(x_0) + \sum_{i=1}^{n} \frac{1}{i!} D^{(i)}(f; x_0, x - x_0) + R_n    (1.5)

where

    D^{(i)}(f; x, y) = \sum_{j_1=1}^{p} \cdots \sum_{j_i=1}^{p} \left( \frac{\partial^i}{\partial t_{j_1} \cdots \partial t_{j_i}} f(t) \Big|_{t=x} \right) \prod_{k=1}^{i} y_{j_k}    (1.6)
and

    R_n = \frac{1}{(n+1)!} D^{(n+1)}(f; \xi, x - x_0)    (1.7)

for some \xi on the line segment joining x and x0. As |x - x_0| \to 0, note that R_n = O(|x - x_0|^{n+1}).
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

    \int_0^1 f(x)\,dx = \frac{f(0) + f(1)}{2} - \sum_{i=1}^{n-1} \frac{b_{2i} \left( f^{(2i-1)}(1) - f^{(2i-1)}(0) \right)}{(2i)!} - \frac{b_{2n} f^{(2n)}(\xi)}{(2n)!}    (1.8)

where 0 \le \xi \le 1, f^{(j)} is the jth derivative of f, and b_j = B_j(0) can be determined using the recursion relation

    \sum_{j=0}^{m} \binom{m+1}{j} B_j(z) = (m+1) z^m    (1.9)

initialized with B_0(z) = 1. The proof of this result is based on repeated integrations by parts [376].
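The recursion (1.9) is simple to run at z = 0 to generate the b_j. A short Python sketch (ours, not from the text), using exact rational arithmetic so no roundoff enters:

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(n):
    # b_j = B_j(0), from (1.9) at z = 0: for m >= 1 the right-hand side
    # (m+1) * 0**m vanishes, so b_m solves sum_{j<=m} C(m+1, j) b_j = 0,
    # and the coefficient of b_m is C(m+1, m) = m + 1
    b = [Fraction(1)]  # B_0(z) = 1
    for m in range(1, n + 1):
        b.append(-sum(comb(m + 1, j) * b[j] for j in range(m)) / (m + 1))
    return b

print(bernoulli_numbers(4))  # b_0..b_4 = 1, -1/2, 1/6, 0, -1/30
```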
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

    \frac{\partial f(x)}{\partial x_i} \approx \frac{f(x + \epsilon_i e_i) - f(x - \epsilon_i e_i)}{2 \epsilon_i}    (1.10)

where \epsilon_i is a small number and e_i is the unit vector in the ith coordinate direction. Typically, one might start with, say, \epsilon_i = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller \epsilon_i. The approximation will generally improve until \epsilon_i becomes small enough that the calculation is degraded, and eventually dominated, by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via

    \frac{\partial^2 f(x)}{\partial x_i \partial x_j} \approx \frac{1}{4 \epsilon_i \epsilon_j} \left( f(x + \epsilon_i e_i + \epsilon_j e_j) - f(x + \epsilon_i e_i - \epsilon_j e_j) - f(x - \epsilon_i e_i + \epsilon_j e_j) + f(x - \epsilon_i e_i - \epsilon_j e_j) \right)    (1.11)

with similar sequential precision improvements.
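The degradation from subtractive cancellation is easy to observe in practice. The following Python sketch (our illustration, not from the text) applies the central difference (1.10) to f = sin at x = 1, where the true derivative is cos(1):

```python
import math

def central_diff(f, x, eps):
    # central finite-difference approximation (1.10) to f'(x)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for eps in [1e-1, 1e-3, 1e-5, 1e-7, 1e-11]:
    err = abs(central_diff(math.sin, 1.0, eps) - math.cos(1.0))
    # the error first shrinks with eps (truncation error is O(eps**2)) and
    # then grows again once subtractive cancellation dominates
    print(f"eps = {eps:.0e}   error = {err:.2e}")
```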
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS
We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ~ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters will also be denoted with a conditioning bar, as in f(x|\alpha, \beta). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|\alpha) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example, by using f(\cdot|\alpha) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are fX and fY, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.

The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or fX|Y(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|\mu) = f(x|y, \mu) f(y|\mu). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.

The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X, or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1_{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^{1/2}/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so

    g(\lambda x + (1 - \lambda) y) \le \lambda g(x) + (1 - \lambda) g(y)    (1.12)

for all x, y \in I and all 0 < \lambda < 1. Then Jensen's inequality states that E{g(X)} \ge g(E{X}) for any random variable X having P[X \in I] = 1.
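For instance, with the convex function g(x) = x^2 the inequality reduces to E{X^2} \ge (E{X})^2, i.e., var{X} \ge 0. A quick Monte Carlo sketch in Python (ours, not from the text):

```python
import random

random.seed(1)
x = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # draws from X ~ N(0, 1)

g = lambda t: t * t                   # a convex function on I = (-inf, inf)
e_gx = sum(g(t) for t in x) / len(x)  # estimates E{g(X)} = 1
g_ex = g(sum(x) / len(x))             # estimates g(E{X}) = 0
print(e_gx, g_ex)                     # Jensen: the first dominates the second
```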
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

    n! = n(n-1)(n-2) \cdots (3)(2)(1), \quad \text{with } 0! = 1    (1.13)

    \binom{n}{k} = \frac{n!}{k!\,(n-k)!}    (1.14)
TABLE 1.1 Notation and description for common probability distributions of discrete random variables

Bernoulli: X ~ Bernoulli(p), 0 \le p \le 1; f(x) = p^x (1-p)^{1-x}, x = 0 or 1; E{X} = p, var{X} = p(1-p).

Binomial: X ~ Bin(n, p), 0 \le p \le 1, n \in {1, 2, ...}; f(x) = \binom{n}{x} p^x (1-p)^{n-x}, x = 0, 1, ..., n; E{X} = np, var{X} = np(1-p).

Multinomial: X ~ Multinomial(n, p), p = (p1, ..., pk), 0 \le pi \le 1 with \sum_{i=1}^k pi = 1, n \in {1, 2, ...}; f(x) = \binom{n}{x_1, \ldots, x_k} \prod_{i=1}^k p_i^{x_i}, x = (x1, ..., xk) with xi \in {0, 1, ..., n} and \sum_{i=1}^k xi = n; E{X} = np, var{Xi} = npi(1 - pi), cov{Xi, Xj} = -npipj.

Negative Binomial: X ~ NegBin(r, p), 0 \le p \le 1, r \in {1, 2, ...}; f(x) = \binom{x + r - 1}{r - 1} p^r (1-p)^x, x \in {0, 1, ...}; E{X} = r(1-p)/p, var{X} = r(1-p)/p^2.

Poisson: X ~ Poisson(\lambda), \lambda > 0; f(x) = \lambda^x \exp{-\lambda} / x!, x \in {0, 1, ...}; E{X} = \lambda, var{X} = \lambda.
TABLE 1.2 Notation and description for some common probability distributions of continuous random variables

Beta: X ~ Beta(\alpha, \beta), \alpha > 0 and \beta > 0; f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1}, 0 \le x \le 1; E{X} = \alpha/(\alpha+\beta), var{X} = \alpha\beta / [(\alpha+\beta)^2 (\alpha+\beta+1)].

Cauchy: X ~ Cauchy(\alpha, \beta), \alpha \in ℝ and \beta > 0; f(x) = \frac{1}{\pi\beta \left[ 1 + \left( \frac{x-\alpha}{\beta} \right)^2 \right]}, x \in ℝ; E{X} and var{X} are nonexistent.

Chi-square: X ~ \chi^2_\nu, \nu > 0; f(x) is the Gamma(\nu/2, 1/2) density, x > 0; E{X} = \nu, var{X} = 2\nu.

Dirichlet: X ~ Dirichlet(\alpha), \alpha = (\alpha_1, ..., \alpha_k), \alpha_i > 0, \alpha_0 = \sum_{i=1}^k \alpha_i; f(x) = \frac{\Gamma(\alpha_0) \prod_{i=1}^k x_i^{\alpha_i - 1}}{\prod_{i=1}^k \Gamma(\alpha_i)}, x = (x1, ..., xk) with 0 \le xi \le 1 and \sum_{i=1}^k xi = 1; E{X} = \alpha/\alpha_0, var{Xi} = \alpha_i(\alpha_0 - \alpha_i) / [\alpha_0^2 (\alpha_0 + 1)], cov{Xi, Xj} = -\alpha_i \alpha_j / [\alpha_0^2 (\alpha_0 + 1)].

Exponential: X ~ Exp(\lambda), \lambda > 0; f(x) = \lambda \exp{-\lambda x}, x > 0; E{X} = 1/\lambda, var{X} = 1/\lambda^2.

Gamma: X ~ Gamma(r, \lambda), \lambda > 0 and r > 0; f(x) = \frac{\lambda^r x^{r-1}}{\Gamma(r)} \exp{-\lambda x}, x > 0; E{X} = r/\lambda, var{X} = r/\lambda^2.
TABLE 1.3 Notation and description for more common probability distributions of continuous random variables

Lognormal: X ~ Lognormal(\mu, \sigma^2), \mu \in ℝ and \sigma > 0; f(x) = \frac{1}{x\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2} \left( \frac{\log x - \mu}{\sigma} \right)^2 \right\}, x > 0; E{X} = \exp{\mu + \sigma^2/2}, var{X} = \exp{2\mu + 2\sigma^2} - \exp{2\mu + \sigma^2}.

Multivariate Normal: X ~ N_k(\mu, \Sigma), \mu = (\mu_1, ..., \mu_k) \in ℝ^k, \Sigma positive definite; f(x) = \frac{\exp\{ -(x - \mu)^T \Sigma^{-1} (x - \mu)/2 \}}{(2\pi)^{k/2} |\Sigma|^{1/2}}, x = (x1, ..., xk) \in ℝ^k; E{X} = \mu, var{X} = \Sigma.

Normal: X ~ N(\mu, \sigma^2), \mu \in ℝ and \sigma > 0; f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right\}, x \in ℝ; E{X} = \mu, var{X} = \sigma^2.

Student's t: X ~ t_\nu, \nu > 0; f(x) = \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)\sqrt{\pi\nu}} \left( 1 + \frac{x^2}{\nu} \right)^{-(\nu+1)/2}, x \in ℝ; E{X} = 0 if \nu > 1, var{X} = \nu/(\nu - 2) if \nu > 2.

Uniform: X ~ Unif(a, b), a, b \in ℝ and a < b; f(x) = \frac{1}{b - a}, x \in [a, b]; E{X} = (a + b)/2, var{X} = (b - a)^2/12.

Weibull: X ~ Weibull(a, b), a > 0 and b > 0; f(x) = a b x^{b-1} \exp{-a x^b}, x > 0; E{X} = \frac{\Gamma(1 + 1/b)}{a^{1/b}}, var{X} = \frac{\Gamma(1 + 2/b) - \Gamma(1 + 1/b)^2}{a^{2/b}}.
    \binom{n}{k_1, \ldots, k_m} = \frac{n!}{\prod_{i=1}^{m} k_i!}, \quad \text{where } n = \sum_{i=1}^{m} k_i    (1.15)

    \Gamma(r) = \begin{cases} (r-1)! & \text{for } r = 1, 2, \ldots \\ \int_0^\infty t^{r-1} \exp\{-t\}\, dt & \text{for general } r > 0 \end{cases}    (1.16)
It is worth knowing that

    \Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi} \quad \text{and} \quad \Gamma\left(n + \tfrac{1}{2}\right) = \frac{1 \times 3 \times 5 \times \cdots \times (2n-1)\,\sqrt{\pi}}{2^n}

for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as

    f(x|\gamma) = c_1(x)\, c_2(\gamma) \exp\left\{ \sum_{i=1}^{k} y_i(x)\, \theta_i(\gamma) \right\}    (1.17)
for nonnegative functions c1 and c2. The vector \gamma denotes the familiar parameters, such as \lambda for the Poisson density and p for the binomial density. The real-valued \theta_i(\gamma) are the natural, or canonical, parameters, which are usually transformations of \gamma. The y_i(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that

    E\{y(X)\} = \kappa'(\theta)    (1.18)

and

    var\{y(X)\} = \kappa''(\theta)    (1.19)

where \kappa(\theta) = -\log c_3(\theta), letting c_3(\theta) denote the reexpression of c_2(\gamma) in terms of the canonical parameters \theta = (\theta_1, \ldots, \theta_k), and y(X) = (y_1(X), \ldots, y_k(X)). These results can be rewritten in terms of the original parameters \gamma as

    E\left\{ \sum_{i=1}^{k} \frac{\partial \theta_i(\gamma)}{\partial \gamma_j} y_i(X) \right\} = -\frac{\partial}{\partial \gamma_j} \log c_2(\gamma)    (1.20)

and

    var\left\{ \sum_{i=1}^{k} \frac{\partial \theta_i(\gamma)}{\partial \gamma_j} y_i(X) \right\} = -\frac{\partial^2}{\partial \gamma_j^2} \log c_2(\gamma) - E\left\{ \sum_{i=1}^{k} \frac{\partial^2 \theta_i(\gamma)}{\partial \gamma_j^2} y_i(X) \right\}    (1.21)
Example 1.1 (Poisson). The Poisson distribution belongs to the exponential family, with c_1(x) = 1/x!, c_2(\lambda) = \exp{-\lambda}, y(x) = x, and \theta(\lambda) = \log \lambda. Deriving moments in terms of \theta, we have \kappa(\theta) = \exp{\theta}, so E{X} = \kappa'(\theta) = \exp{\theta} = \lambda and var{X} = \kappa''(\theta) = \exp{\theta} = \lambda. The same results may be obtained with (1.20) and (1.21), noting that d\theta/d\lambda = 1/\lambda. For example, (1.20) gives E{X/\lambda} = 1.
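A quick numerical check of Example 1.1 (our sketch, not from the text): \kappa'(\theta) = \kappa''(\theta) = \exp{\theta} = \lambda should match the sample mean and variance of Poisson draws. The inversion sampler below is a simple stand-in for a library routine:

```python
import math, random

random.seed(2)
lam = 3.0
theta = math.log(lam)          # canonical parameter theta = log(lambda)
kappa_prime = math.exp(theta)  # kappa'(theta) = kappa''(theta) = lambda

def rpois(lam):
    # basic inversion sampler for Poisson(lam); fine for small lam
    u, x, p = random.random(), 0, math.exp(-lam)
    cdf = p
    while u > cdf and x < 1000:  # the bound is just a numerical safety guard
        x += 1
        p *= lam / x
        cdf += p
    return x

draws = [rpois(lam) for _ in range(200_000)]
m = sum(draws) / len(draws)
v = sum((d - m) ** 2 for d in draws) / len(draws)
print(m, v, kappa_prime)  # sample mean and variance both near lambda = 3
```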
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X1, ..., Xp) denote a p-dimensional random variable with continuous density function f. Suppose that

    U = g(X) = (g_1(X), \ldots, g_p(X)) = (U_1, \ldots, U_p)    (1.22)

where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

    f(u) = f(g^{-1}(u))\, |J(u)|    (1.23)

where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g^{-1} evaluated at u, having (i, j)th element \partial x_i / \partial u_j, where these derivatives are assumed to be continuous over the support region of U.
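As an illustration (ours, not from the text): if X ~ Exp(1) and U = g(X) = X^2, which is one-to-one on x > 0, then x = \sqrt{u}, dx/du = 1/(2\sqrt{u}), and (1.23) gives f_U(u) = \exp{-\sqrt{u}} / (2\sqrt{u}). The Python sketch below checks P[U \le 1] both by simulation and by integrating the transformed density:

```python
import math, random

random.seed(3)

# X ~ Exp(1), U = g(X) = X**2, one-to-one on x > 0 with inverse x = sqrt(u);
# (1.23): f_U(u) = f_X(sqrt(u)) * |dx/du| = exp(-sqrt(u)) / (2 * sqrt(u))
f_U = lambda u: math.exp(-math.sqrt(u)) / (2.0 * math.sqrt(u))

# P[U <= 1] by Monte Carlo on the original scale ...
u_draws = [random.expovariate(1.0) ** 2 for _ in range(200_000)]
p_mc = sum(ui <= 1.0 for ui in u_draws) / len(u_draws)

# ... and by midpoint-rule integration of the transformed density over (0, 1]
# (the integrand is integrable despite the singularity at u = 0)
n = 100_000
p_quad = sum(f_U((i + 0.5) / n) for i in range(n)) / n

print(p_mc, p_quad, 1 - math.exp(-1))  # all three roughly agree
```

Both estimates approximate the exact value P[X \le 1] = 1 - e^{-1} \approx 0.632.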
1.4 LIKELIHOOD INFERENCE
If X1, ..., Xn are independent and identically distributed (i.i.d.), each having density f(x|\theta) that depends on a vector of p unknown parameters \theta = (\theta_1, \ldots, \theta_p), then the joint likelihood function is

    L(\theta) = \prod_{i=1}^{n} f(x_i | \theta)    (1.24)
When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x1, ..., xn|\theta), viewed as a function of \theta.

The observed data x1, ..., xn might have been realized under many different values for \theta. The parameters for which observing x1, ..., xn would be most likely constitute the maximum likelihood estimate of \theta. In other words, if \vartheta is the function of x1, ..., xn that maximizes L(\theta), then \hat{\theta} = \vartheta(X1, ..., Xn) is the maximum likelihood estimator (MLE) for \theta. MLEs are invariant to transformation, so the MLE of a transformation of \theta equals the transformation of \hat{\theta}.
It is typically easier to work with the log likelihood function

    l(\theta) = \log L(\theta)    (1.25)

which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x1, ..., xn but not \theta) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different \theta. Note that maximizing L(\theta) with respect to \theta is equivalent to solving the system of equations

    l'(\theta) = 0    (1.26)

where

    l'(\theta) = \left( \frac{\partial l(\theta)}{\partial \theta_1}, \ldots, \frac{\partial l(\theta)}{\partial \theta_p} \right)
is called the score function. The score function satisfies

    E\{l'(\theta)\} = 0    (1.27)

where the expectation is taken with respect to the distribution of X1, ..., Xn. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.
The MLE has a sampling distribution because it depends on the realization of the random variables X1, ..., Xn. The MLE may be biased or unbiased for \theta, yet under quite general conditions it is asymptotically unbiased as n \to \infty. The sampling variance of the MLE depends on the average curvature of the log likelihood: when the log likelihood is very pointy, the location of the maximum is more precisely known.
To make this precise, let l''(\theta) denote the p \times p matrix having (i, j)th element given by \partial^2 l(\theta) / (\partial \theta_i\, \partial \theta_j). The Fisher information matrix is defined as

    I(\theta) = E\{l'(\theta)\, l'(\theta)^T\} = -E\{l''(\theta)\}    (1.28)

where the expectations are taken with respect to the distribution of X1, ..., Xn. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(\theta) may sometimes be called the expected Fisher information to distinguish it from -l''(\theta), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(\theta) that improves as n increases.
Under regularity conditions, the asymptotic variance–covariance matrix of the MLE \hat{\theta} is I(\theta^*)^{-1}, where \theta^* denotes the true value of \theta. Indeed, as n \to \infty, the limiting distribution of \hat{\theta} is N_p(\theta^*, I(\theta^*)^{-1}). Since the true parameter values are unknown, I(\theta^*)^{-1} must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(\hat{\theta})^{-1}. Alternatively, it is also reasonable to use -l''(\hat{\theta})^{-1}. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(\theta^*)^{-1}. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(\theta^*)^{-1} can be found in [127, 182, 371, 470].
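For instance (our sketch, not from the text), for an i.i.d. Poisson(\lambda) sample the MLE is \hat{\lambda} = \bar{x} and the observed information is -l''(\hat{\lambda}) = \sum x_i / \hat{\lambda}^2 = n/\hat{\lambda}, so the estimated standard error is (\hat{\lambda}/n)^{1/2}:

```python
import math

x = [2, 4, 3, 1, 5, 3, 2, 4]   # made-up Poisson counts
n, s = len(x), sum(x)

lam_hat = s / n                 # MLE: l'(lam) = s/lam - n = 0
obs_info = s / lam_hat ** 2     # observed Fisher information -l''(lam_hat) = n/lam_hat
se = math.sqrt(1.0 / obs_info)  # square root of the inverse information

print(lam_hat, se)  # 3.0 and sqrt(3/8) ~ 0.612
```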
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If \theta = (\mu, \phi), then the profile likelihood for \phi is

    L(\phi | \hat{\mu}(\phi)) = \max_{\mu} L(\mu, \phi)    (1.29)

Thus, for each possible \phi, a value of \mu is chosen to maximize L(\mu, \phi). This optimal \mu is a function of \phi. The profile likelihood is the function that maps \phi to the value of the full likelihood evaluated at \phi and its corresponding optimal \mu. Note that the \phi
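A minimal Python sketch of (1.29) (ours, not from the text): for normal data with the mean \phi as the parameter of interest and the variance as the nuisance parameter, the inner maximization has the closed form \hat{v}(\phi) = mean{(x_i - \phi)^2}; the data values below are made up:

```python
import math

x = [1.2, 0.8, 2.1, 1.5, 0.4, 1.9]  # made-up normal observations
n = len(x)

def profile_loglik(phi):
    # nuisance variance replaced by its conditional MLE given the mean phi
    v_hat = sum((xi - phi) ** 2 for xi in x) / n
    return -0.5 * n * (math.log(v_hat) + 1)  # log L(phi, v_hat(phi)) + const

# grid search over phi; the profile peaks at the full MLE, the sample mean
grid = [i / 1000 for i in range(3001)]
phi_best = max(grid, key=profile_loglik)
print(phi_best, sum(x) / n)  # agree to the grid resolution
```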
2.2.3 Gauss–Newton Method 44
2.2.4 Nelder–Mead Algorithm 45
2.2.5 Nonlinear Gauss–Seidel Iteration 52
Problems 54
3 COMBINATORIAL OPTIMIZATION 59
3.1 Hard Problems and NP-Completeness 59
3.1.1 Examples 61
3.1.2 Need for Heuristics 64
3.2 Local Search 65
3.3 Simulated Annealing 68
3.3.1 Practical Issues 70
3.3.1.1 Neighborhoods and Proposals 70
3.3.1.2 Cooling Schedule and Convergence 71
3.3.2 Enhancements 74
3.4 Genetic Algorithms 75
3.4.1 Definitions and the Canonical Algorithm 75
3.4.1.1 Basic Definitions 75
3.4.1.2 Selection Mechanisms and Genetic Operators 76
3.4.1.3 Allele Alphabets and Genotypic Representation 78
3.4.1.4 Initialization, Termination, and Parameter Values 79
3.4.2 Variations 80
3.4.2.1 Fitness 80
3.4.2.2 Selection Mechanisms and Updating Generations 81
3.4.2.3 Genetic Operators and Permutation Chromosomes 82
3.4.3 Initialization and Parameter Values 84
3.4.4 Convergence 84
3.5 Tabu Algorithms 85
3.5.1 Basic Definitions 86
3.5.2 The Tabu List 87
3.5.3 Aspiration Criteria 88
3.5.4 Diversification 89
3.5.5 Intensification 90
3.5.6 Comprehensive Tabu Algorithm 91
Problems 92
4 EM OPTIMIZATION METHODS 97
4.1 Missing Data, Marginalization, and Notation 97
4.2 The EM Algorithm 98
4.2.1 Convergence 102
4.2.2 Usage in Exponential Families 105
4.2.3 Variance Estimation 106
4.2.3.1 Louis's Method 106
4.2.3.2 SEM Algorithm 108
4.2.3.3 Bootstrapping 110
4.2.3.4 Empirical Information 110
4.2.3.5 Numerical Differentiation 111
4.3 EM Variants 111
4.3.1 Improving the E Step 111
4.3.1.1 Monte Carlo EM 111
4.3.2 Improving the M Step 112
4.3.2.1 ECM Algorithm 113
4.3.2.2 EM Gradient Algorithm 116
4.3.3 Acceleration Methods 118
4.3.3.1 Aitken Acceleration 118
4.3.3.2 Quasi-Newton Acceleration 119
Problems 121
PART II
INTEGRATION AND SIMULATION
5 NUMERICAL INTEGRATION 129
5.1 Newton–Cotes Quadrature 129
5.1.1 Riemann Rule 130
5.1.2 Trapezoidal Rule 134
5.1.3 Simpson's Rule 136
5.1.4 General kth-Degree Rule 138
5.2 Romberg Integration 139
5.3 Gaussian Quadrature 142
5.3.1 Orthogonal Polynomials 143
5.3.2 The Gaussian Quadrature Rule 143
5.4 Frequently Encountered Problems 146
5.4.1 Range of Integration 146
5.4.2 Integrands with Singularities or Other Extreme Behavior 146
5.4.3 Multiple Integrals 147
5.4.4 Adaptive Quadrature 147
5.4.5 Software for Exact Integration 148
Problems 148
6 SIMULATION AND MONTE CARLO INTEGRATION 151
6.1 Introduction to the Monte Carlo Method 151
6.2 Exact Simulation 152
6.2.1 Generating from Standard Parametric Families 153
6.2.2 Inverse Cumulative Distribution Function 153
6.2.3 Rejection Sampling 155
6.2.3.1 Squeezed Rejection Sampling 158
6.2.3.2 Adaptive Rejection Sampling 159
6.3 Approximate Simulation 163
6.3.1 Sampling Importance Resampling Algorithm 163
6.3.1.1 Adaptive Importance, Bridge, and Path Sampling 167
6.3.2 Sequential Monte Carlo 168
6.3.2.1 Sequential Importance Sampling for Markov Processes 169
6.3.2.2 General Sequential Importance Sampling 170
6.3.2.3 Weight Degeneracy, Rejuvenation, and Effective Sample Size 171
6.3.2.4 Sequential Importance Sampling for Hidden Markov Models 175
6.3.2.5 Particle Filters 179
6.4 Variance Reduction Techniques 180
6.4.1 Importance Sampling 180
6.4.2 Antithetic Sampling 186
6.4.3 Control Variates 189
6.4.4 Rao–Blackwellization 193
Problems 195
7 MARKOV CHAIN MONTE CARLO 201
7.1 Metropolis–Hastings Algorithm 202
7.1.1 Independence Chains 204
7.1.2 Random Walk Chains 206
7.2 Gibbs Sampling 209
7.2.1 Basic Gibbs Sampler 209
7.2.2 Properties of the Gibbs Sampler 214
7.2.3 Update Ordering 216
7.2.4 Blocking 216
7.2.5 Hybrid Gibbs Sampling 216
7.2.6 Griddy–Gibbs Sampler 218
7.3 Implementation 218
7.3.1 Ensuring Good Mixing and Convergence 219
7.3.1.1 Simple Graphical Diagnostics 219
7.3.1.2 Burn-in and Run Length 220
7.3.1.3 Choice of Proposal 222
7.3.1.4 Reparameterization 223
7.3.1.5 Comparing Chains: Effective Sample Size 224
7.3.1.6 Number of Chains 225
7.3.2 Practical Implementation Advice 226
7.3.3 Using the Results 226
Problems 230
8 ADVANCED TOPICS IN MCMC 237
8.1 Adaptive MCMC 237
8.1.1 Adaptive Random Walk Metropolis-within-Gibbs Algorithm 238
8.1.2 General Adaptive Metropolis-within-Gibbs Algorithm 240
8.1.3 Adaptive Metropolis Algorithm 247
8.2 Reversible Jump MCMC 250
8.2.1 RJMCMC for Variable Selection in Regression 253
8.3 Auxiliary Variable Methods 256
8.3.1 Simulated Tempering 257
8.3.2 Slice Sampler 258
8.4 Other Metropolis–Hastings Algorithms 260
8.4.1 Hit-and-Run Algorithm 260
8.4.2 Multiple-Try Metropolis–Hastings Algorithm 261
8.4.3 Langevin Metropolis–Hastings Algorithm 262
8.5 Perfect Sampling 264
8.5.1 Coupling from the Past 264
8.5.1.1 Stochastic Monotonicity and Sandwiching 267
8.6 Markov Chain Maximum Likelihood 268
8.7 Example: MCMC for Markov Random Fields 269
8.7.1 Gibbs Sampling for Markov Random Fields 270
8.7.2 Auxiliary Variable Methods for Markov Random Fields 274
8.7.3 Perfect Sampling for Markov Random Fields 277
Problems 279
PART III
BOOTSTRAPPING
9 BOOTSTRAPPING 287
9.1 The Bootstrap Principle 287
9.2 Basic Methods 288
9.2.1 Nonparametric Bootstrap 288
9.2.2 Parametric Bootstrap 289
9.2.3 Bootstrapping Regression 290
9.2.4 Bootstrap Bias Correction 291
9.3 Bootstrap Inference 292
9.3.1 Percentile Method 292
9.3.1.1 Justification for the Percentile Method 293
9.3.2 Pivoting 294
9.3.2.1 Accelerated Bias-Corrected Percentile Method, BCa 294
9.3.2.2 The Bootstrap t 296
9.3.2.3 Empirical Variance Stabilization 298
9.3.2.4 Nested Bootstrap and Prepivoting 299
9.3.3 Hypothesis Testing 301
9.4 Reducing Monte Carlo Error 302
9.4.1 Balanced Bootstrap 302
9.4.2 Antithetic Bootstrap 302
9.5 Bootstrapping Dependent Data 303
9.5.1 Model-Based Approach 304
9.5.2 Block Bootstrap 304
9.5.2.1 Nonmoving Block Bootstrap 304
9.5.2.2 Moving Block Bootstrap 306
9.5.2.3 Blocks-of-Blocks Bootstrapping 307
9.5.2.4 Centering and Studentizing 309
9.5.2.5 Block Size 311
9.6 Bootstrap Performance 315
9.6.1 Independent Data Case 315
9.6.2 Dependent Data Case 316
9.7 Other Uses of the Bootstrap 316
9.8 Permutation Tests 317
Problems 319
PART IV
DENSITY ESTIMATION AND SMOOTHING
10 NONPARAMETRIC DENSITY ESTIMATION 325
10.1 Measures of Performance 326
10.2 Kernel Density Estimation 327
10.2.1 Choice of Bandwidth 329
10.2.1.1 Cross-Validation 332
10.2.1.2 Plug-in Methods 335
10.2.1.3 Maximal Smoothing Principle 338
10.2.2 Choice of Kernel 339
10.2.2.1 Epanechnikov Kernel 339
10.2.2.2 Canonical Kernels and Rescalings 340
10.3 Nonkernel Methods 341
10.3.1 Logspline 341
10.4 Multivariate Methods 345
10.4.1 The Nature of the Problem 345
10.4.2 Multivariate Kernel Estimators 346
10.4.3 Adaptive Kernels and Nearest Neighbors 348
10.4.3.1 Nearest Neighbor Approaches 349
10.4.3.2 Variable-Kernel Approaches and Transformations 350
10.4.4 Exploratory Projection Pursuit 353
Problems 359
11 BIVARIATE SMOOTHING 363
11.1 Predictor–Response Data 363
11.2 Linear Smoothers 365
11.2.1 Constant-Span Running Mean 366
11.2.1.1 Effect of Span 368
11.2.1.2 Span Selection for Linear Smoothers 369
11.2.2 Running Lines and Running Polynomials 372
11.2.3 Kernel Smoothers 374
11.2.4 Local Regression Smoothing 374
11.2.5 Spline Smoothing 376
11.2.5.1 Choice of Penalty 377
11.3 Comparison of Linear Smoothers 377
11.4 Nonlinear Smoothers 379
11.4.1 Loess 379
11.4.2 Supersmoother 381
11.5 Confidence Bands 384
11.6 General Bivariate Data 388
Problems 389
12 MULTIVARIATE SMOOTHING 393
12.1 Predictor–Response Data 393
12.1.1 Additive Models 394
12.1.2 Generalized Additive Models 397
12.1.3 Other Methods Related to Additive Models 399
12.1.3.1 Projection Pursuit Regression 399
12.1.3.2 Neural Networks 402
12.1.3.3 Alternating Conditional Expectations 403
12.1.3.4 Additivity and Variance Stabilization 404
12.1.4 Tree-Based Methods 405
12.1.4.1 Recursive Partitioning: Regression Trees 406
12.1.4.2 Tree Pruning 409
12.1.4.3 Classification Trees 411
12.1.4.4 Other Issues for Tree-Based Methods 412
12.2 General Multivariate Data 413
12.2.1 Principal Curves 413
12.2.1.1 Definition and Motivation 413
12.2.1.2 Estimation 415
12.2.1.3 Span Selection 416
Problems 416
DATA ACKNOWLEDGMENTS 421
REFERENCES 423
INDEX 457
PREFACE
This book covers most topics needed to develop a broad and thorough working knowledge of modern computational statistics. We seek to develop a practical understanding of how and why existing methods work, enabling readers to use modern statistical methods effectively. Since many new methods are built from components of existing techniques, our ultimate goal is to provide scientists with the tools they need to contribute new ideas to the field.
A growing challenge in science is that there is so much of it. While the pursuit of important new methods and the honing of existing approaches is a worthy goal, there is also a need to organize and distill the teeming jungle of ideas. We attempt to do that here. Our choice of topics reflects our view of what constitutes the core of the evolving field of computational statistics, and what will be interesting and useful for our readers.
Our use of the adjective modern in the first sentence of this preface is potentially troublesome. There is no way that this book can cover all the latest, greatest techniques. We have not even tried. We have instead aimed to provide a reasonably up-to-date survey of a broad portion of the field, while leaving room for diversions and esoterica.
The foundations of optimization and numerical integration are covered in this book. We include these venerable topics because (i) they are cornerstones of frequentist and Bayesian inference; (ii) routine application of available software often fails for hard problems; and (iii) the methods themselves are often secondary components of other statistical computing algorithms. Some topics we have omitted represent important areas of past and present research in the field, but their priority here is lowered by the availability of high-quality software. For example, the generation of pseudo-random numbers is a classic topic, but one that we prefer to address by giving students reliable software. Finally, some topics (e.g., principal curves and tabu search) are included simply because they are interesting and provide very different perspectives on familiar problems. Perhaps a future researcher may draw ideas from such topics to design a creative and effective new algorithm.
In this second edition we have both updated and broadened our coverage, and we now provide computer code. For example, we have added new MCMC topics to reflect continued activity in that popular area. A notable increase in breadth is our inclusion of more methods relevant for problems where statistical dependency is important, such as block bootstrapping and sequential importance sampling. This second edition provides extensive new support in R. Specifically, code for the examples in this book is available from the book website, www.stat.colostate.edu/computationalstatistics.
Our target audience includes graduate students in statistics and related fields, statisticians, and quantitative empirical scientists in other fields. We hope such readers may use the book when applying standard methods and developing new methods.
The level of mathematics expected of the reader does not extend much beyond Taylor series and linear algebra. Breadth of mathematical training is more helpful than depth. Essential review is provided in Chapter 1. More advanced readers will find greater mathematical detail in the wide variety of high-quality books available on specific topics, many of which are referenced in the text. Other readers, caring less about analytical details, may prefer to focus on our descriptions of algorithms and examples.
The expected level of statistics is equivalent to that obtained by a graduate student in his or her first year of study of the theory of statistics and probability. An understanding of maximum likelihood methods, Bayesian methods, elementary asymptotic theory, Markov chains, and linear models is most important. Many of these topics are reviewed in Chapter 1.
With respect to computer programming, we find that good students can learn as they go. However, a working knowledge of a suitable language allows implementation of the ideas covered in this book to progress much more quickly. We have chosen to forgo any language-specific examples, algorithms, or coding in the text. For those wishing to learn a language while they study this book, we recommend that you choose a high-level, interactive package that permits the flexible design of graphical displays and includes supporting statistics and probability functions, such as R and MATLAB.¹ These are the sort of languages often used by researchers during the development of new statistical computing techniques, and they are suitable for implementing all the methods we describe, except in some cases for problems of vast scope or complexity. We use R and recommend it. Although lower-level languages such as C++ could also be used, they are more appropriate for professional-grade implementation of algorithms after researchers have refined the methodology.
The book is organized into four major parts: optimization (Chapters 2, 3, and 4), integration and simulation (Chapters 5, 6, 7, and 8), bootstrapping (Chapter 9), and density estimation and smoothing (Chapters 10, 11, and 12). The chapters are written to stand independently, so a course can be built by selecting the topics one wishes to teach. For a one-semester course, our selection typically weights most heavily topics from Chapters 2, 3, 6, 7, 9, 10, and 11. With a leisurely pace or more thorough coverage, a shorter list of topics could still easily fill a semester course. There is sufficient material here to provide a thorough one-year course of study, notwithstanding any supplemental topics one might wish to teach.
A variety of homework problems are included at the end of each chapter. Some are straightforward, while others require the student to develop a thorough understanding of the model/method being used, to carefully (and perhaps cleverly) code a suitable technique, and to devote considerable attention to the interpretation of results. A few exercises invite open-ended exploration of methods and ideas. We are sometimes asked for solutions to the exercises, but we prefer to sequester them to preserve the challenge for future students and readers.
The datasets discussed in the examples and exercises are available from the book website, www.stat.colostate.edu/computationalstatistics. The R code is also provided there. Finally, the website includes an errata. Responsibility for all errors lies with us.
¹R is available for free from www.r-project.org. Information about MATLAB can be found at www.mathworks.com.
ACKNOWLEDGMENTS
The course upon which this book is based was developed and taught by us at Colorado State University from 1994 onwards. Thanks are due to our many students, who have been semiwilling guinea pigs over the years. We also thank our colleagues in the Statistics Department for their continued support. The late Richard Tweedie merits particular acknowledgment for his mentoring during the early years of our careers.
We owe a great deal of intellectual debt to Adrian Raftery, who deserves special thanks not only for his teaching and advising but also for his unwavering support and his seemingly inexhaustible supply of good ideas. In addition, we thank our influential advisors and teachers at the University of Washington Statistics Department, including David Madigan, Werner Stuetzle, and Judy Zeh. Of course, each of our chapters could be expanded into a full-length book, and great scholars have already done so. We owe much to their efforts, upon which we relied when developing our course and our manuscript.
Portions of the first edition were written at the Department of Mathematics and Statistics, University of Otago, in Dunedin, New Zealand, whose faculty we thank for graciously hosting us during our sabbatical in 2003. Much of our work on the second edition was undertaken during our sabbatical visit to the Australia Commonwealth Scientific and Research Organization in 2009–2010, sponsored by CSIRO Mathematics, Informatics and Statistics, and hosted at the Longpocket Laboratory in Indooroopilly, Australia. We thank our hosts and colleagues there for their support.
Our manuscript has been greatly improved through the constructive reviews of John Bickham, Ben Bird, Kate Cowles, Jan Hannig, Alan Herlihy, David Hunter, Devin Johnson, Michael Newton, Doug Nychka, Steve Sain, David W. Scott, N. Scott Urquhart, Haonan Wang, Darrell Whitley, and eight anonymous referees. We also thank the sharp-eyed readers listed in the errata for their suggestions and corrections. Our editor Steve Quigley and the folks at Wiley were supportive and helpful during the publication process. We thank Nelida Pohl for permission to adapt her photograph in the cover design of the first edition. We thank Melinda Stelzer for permission to use her painting "Champagne Circuit," 2001, for the cover of the second edition. More about her art can be found at www.facebook.com/geekchicart. We also owe a special note of thanks to Zube (a.k.a. John Dzubera), who kept our own computers running despite our best efforts to the contrary.
Funding from National Science Foundation (NSF) CAREER grant SBR-9875508 was a significant source of support for the first author during the preparation of the first edition. He also thanks his colleagues and friends in the North Slope Borough, Alaska, Department of Wildlife Management for their longtime research support. The second author gratefully acknowledges the support of STAR Research Assistance Agreement CR-829095 awarded to Colorado State University by the U.S. Environmental Protection Agency (EPA). The views expressed here are solely those of the authors. NSF and EPA do not endorse any products or commercial services mentioned herein.
Finally we thank our parents for enabling and supporting our educations and forproviding us with the ldquostubbornness generdquo necessary for graduate school the tenuretrack or book publicationmdashtake your pick The second edition is dedicated to ourkids Natalie and Neil for continuing to show us what is important and what is not
Geof H. Givens
Jennifer A. Hoeting
CHAPTER 1

REVIEW
This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.
1.1 MATHEMATICAL NOTATION
We use boldface to distinguish a vector x = (x1, …, xp) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f1(x), …, fp(x)). The transpose of M is denoted M^T.

Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x1 … xn)^T. Let I denote an identity matrix, and let 1 and 0 denote vectors of ones and zeros, respectively.

A symmetric square matrix M is positive definite if x^T M x > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite, or positive semidefinite, if x^T M x ≥ 0 for all nonzero vectors x.
The derivative of a function f, evaluated at x, is denoted f′(x). When x = (x1, …, xp), the gradient of f at x is

    f′(x) = ( df(x)/dx1, …, df(x)/dxp ).
The Hessian matrix for f at x is f″(x), having (i, j)th element equal to d²f(x)/(dxi dxj). The negative Hessian has important uses in statistical inference.

Let J(x) denote the Jacobian matrix, evaluated at x, for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to dfi(x)/dxj.
A functional is a real-valued function on a space of functions. For example, if T(f) = ∫₀¹ f(x) dx, then the functional T maps suitably integrable functions onto the real line.

The indicator function 1{A} equals 1 if A is true and 0 otherwise. The real line is denoted ℝ, and p-dimensional real space is ℝ^p.
Computational Statistics, Second Edition. Geof H. Givens and Jennifer A. Hoeting.
© 2013 John Wiley & Sons, Inc. Published 2013 by John Wiley & Sons, Inc.
1.2 TAYLOR'S THEOREM AND MATHEMATICAL LIMIT THEORY
First we define standard "big oh" and "little oh" notation for describing the relative orders of convergence of functions. Let the functions f and g be defined on a common, possibly infinite, interval. Let z0 be a point in this interval or a boundary point of it (i.e., −∞ or ∞). We require g(z) ≠ 0 for all z ≠ z0 in a neighborhood of z0. Then we say

    f(z) = O(g(z))    (1.1)

if there exists a constant M such that |f(z)| ≤ M |g(z)| as z → z0. For example, (n + 1)/(3n²) = O(n^(−1)), and it is understood that we are considering n → ∞. If lim_{z→z0} f(z)/g(z) = 0, then we say

    f(z) = o(g(z)).    (1.2)

For example, f(x0 + h) − f(x0) = h f′(x0) + o(h) as h → 0 if f is differentiable at x0. The same notation can be used for describing the convergence of a sequence {xn} as n → ∞, by letting f(n) = xn.
Taylor's theorem provides a polynomial approximation to a function f. Suppose f has a finite (n + 1)th derivative on (a, b) and a continuous nth derivative on [a, b]. Then for any x0 ∈ [a, b] distinct from x, the Taylor series expansion of f about x0 is

    f(x) = Σ_{i=0}^{n} (1/i!) f^(i)(x0) (x − x0)^i + Rn,    (1.3)

where f^(i)(x0) is the ith derivative of f evaluated at x0, and

    Rn = (1/(n + 1)!) f^(n+1)(ξ) (x − x0)^(n+1)    (1.4)

for some point ξ in the interval between x and x0. As |x − x0| → 0, note that Rn = O(|x − x0|^(n+1)).
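To make the order of the remainder concrete, the following short Python sketch (an illustration added here; the book's own code, in R, is on its website) expands exp about x0 = 0 with n = 4 and checks that halving |x − x0| shrinks the error by roughly 2^(n+1) = 32, as Rn = O(|x − x0|^(n+1)) predicts.

```python
import math

def taylor_exp(x, x0=0.0, n=4):
    # Partial sum of the Taylor expansion (1.3) of exp about x0;
    # every derivative of exp evaluated at x0 is exp(x0)
    return sum(math.exp(x0) * (x - x0) ** i / math.factorial(i)
               for i in range(n + 1))

# Rn = O(|x - x0|^(n+1)): halving the displacement should cut the
# approximation error by about 2^5 = 32 when n = 4
err1 = abs(math.exp(0.2) - taylor_exp(0.2))
err2 = abs(math.exp(0.1) - taylor_exp(0.1))
ratio = err1 / err2  # roughly 32
```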
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x0 ≠ x. Then

    f(x) = f(x0) + Σ_{i=1}^{n} (1/i!) D^(i)(f; x0, x − x0) + Rn,    (1.5)

where

    D^(i)(f; x, y) = Σ_{j1=1}^{p} ··· Σ_{ji=1}^{p} ( d^i f(t) / (dt_{j1} ··· dt_{ji}) |_{t=x} ) ∏_{k=1}^{i} y_{jk}    (1.6)
and

    Rn = (1/(n + 1)!) D^(n+1)(f; ξ, x − x0)    (1.7)

for some ξ on the line segment joining x and x0. As |x − x0| → 0, note that Rn = O(|x − x0|^(n+1)).
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

    ∫₀¹ f(x) dx = [f(0) + f(1)]/2 − Σ_{i=1}^{n−1} b_{2i} [f^(2i−1)(1) − f^(2i−1)(0)] / (2i)! − b_{2n} f^(2n)(ξ) / (2n)!,    (1.8)

where 0 ≤ ξ ≤ 1, f^(j) is the jth derivative of f, and bj = Bj(0) can be determined using the recursion relation

    Σ_{j=0}^{m} (m + 1 choose j) Bj(z) = (m + 1) z^m,    (1.9)

initialized with B0(z) = 1. The proof of this result is based on repeated integrations by parts [376].
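Recursion (1.9), evaluated at z = 0, yields the constants bj directly: for m ≥ 1 the right-hand side vanishes, so each bm can be solved from the earlier values. A brief Python sketch (illustrative only, using exact rational arithmetic):

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(n):
    # b_j = B_j(0) from (1.9) at z = 0: for m >= 1 the right side is zero,
    # so sum_{j=0}^{m} C(m+1, j) b_j = 0, initialized with b_0 = 1
    b = [Fraction(1)]
    for m in range(1, n + 1):
        s = sum(comb(m + 1, j) * b[j] for j in range(m))
        b.append(-s / Fraction(m + 1))
    return b

# b_0, ..., b_4 = 1, -1/2, 1/6, 0, -1/30
```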
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

    df(x)/dxi ≈ [f(x + εi ei) − f(x − εi ei)] / (2 εi),    (1.10)

where εi is a small number and ei is the unit vector in the ith coordinate direction. Typically, one might start with, say, εi = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller εi. The approximation will generally improve until εi becomes small enough that the calculation is degraded, and eventually dominated, by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via
    d²f(x)/(dxi dxj) ≈ (1/(4 εi εj)) [ f(x + εi ei + εj ej) − f(x + εi ei − εj ej) − f(x − εi ei + εj ej) + f(x − εi ei − εj ej) ],    (1.11)

with similar sequential precision improvements.
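A central-difference gradient in the style of (1.10) can be sketched in a few lines of Python (an illustration; the test function and common step size here are arbitrary choices, not from the text):

```python
def fd_gradient(f, x, eps=1e-5):
    # Approximate each component of the gradient by the central
    # difference (1.10), using the same step size eps for every coordinate
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        grad.append((f(xp) - f(xm)) / (2 * eps))
    return grad

# f(x) = x1^2 + 3*x1*x2 has exact gradient (2*x1 + 3*x2, 3*x1)
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
g = fd_gradient(f, [1.0, 2.0])  # close to [8.0, 3.0]
```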
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS
We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ~ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters will also be denoted with a conditioning bar, as in f(x|α, β). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|α) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example, by using f(·|α) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are fX and fY, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.

The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or fX|Y(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity, we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|μ) = f(x|y, μ) f(y|μ). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.

The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^(1/2)/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so

    g(λx + (1 − λ)y) ≤ λ g(x) + (1 − λ) g(y)    (1.12)

for all x, y ∈ I and all 0 < λ < 1. Then Jensen's inequality states that E{g(X)} ≥ g(E{X}) for any random variable X having P[X ∈ I] = 1.
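A quick numerical check of the inequality (an illustration added here, with an arbitrary two-point distribution and the convex choice g(x) = x²):

```python
# X takes the values 1 and 3 with probability 1/2 each; g(x) = x^2 is convex,
# so Jensen's inequality requires E{g(X)} >= g(E{X})
values, probs = [1.0, 3.0], [0.5, 0.5]
g = lambda x: x ** 2

e_of_g = sum(p * g(v) for v, p in zip(values, probs))  # E{g(X)} = 5.0
g_of_e = g(sum(p * v for v, p in zip(values, probs)))  # g(E{X}) = 4.0
assert e_of_g >= g_of_e
```

The gap e_of_g − g_of_e is exactly var{X} here, since g is the squaring function.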
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

    n! = n(n − 1)(n − 2) ··· (3)(2)(1), with 0! = 1,    (1.13)

    (n choose k) = n! / (k! (n − k)!),    (1.14)
TABLE 1.1 Notation and description for common probability distributions of discrete random variables

Bernoulli: X ~ Bernoulli(p), 0 ≤ p ≤ 1.
  Density: f(x) = p^x (1 − p)^(1−x), x = 0 or 1.
  Moments: E{X} = p, var{X} = p(1 − p).

Binomial: X ~ Bin(n, p), 0 ≤ p ≤ 1, n ∈ {1, 2, …}.
  Density: f(x) = (n choose x) p^x (1 − p)^(n−x), x = 0, 1, …, n.
  Moments: E{X} = np, var{X} = np(1 − p).

Multinomial: X ~ Multinomial(n, p), p = (p1, …, pk), 0 ≤ pi ≤ 1, Σ_{i=1}^{k} pi = 1, n ∈ {1, 2, …}.
  Density: f(x) = (n choose x1, …, xk) ∏_{i=1}^{k} pi^(xi), x = (x1, …, xk), xi ∈ {0, 1, …, n}, Σ_{i=1}^{k} xi = n.
  Moments: E{X} = np, var{Xi} = npi(1 − pi), cov{Xi, Xj} = −n pi pj.

Negative Binomial: X ~ NegBin(r, p), 0 ≤ p ≤ 1, r ∈ {1, 2, …}.
  Density: f(x) = (x + r − 1 choose r − 1) p^r (1 − p)^x, x ∈ {0, 1, …}.
  Moments: E{X} = r(1 − p)/p, var{X} = r(1 − p)/p².

Poisson: X ~ Poisson(λ), λ > 0.
  Density: f(x) = (λ^x / x!) exp{−λ}, x ∈ {0, 1, …}.
  Moments: E{X} = λ, var{X} = λ.
TABLE 1.2 Notation and description for some common probability distributions of continuous random variables

Beta: X ~ Beta(α, β), α > 0 and β > 0.
  Density: f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1), 0 ≤ x ≤ 1.
  Moments: E{X} = α/(α + β), var{X} = αβ/[(α + β)²(α + β + 1)].

Cauchy: X ~ Cauchy(α, β), α ∈ ℝ and β > 0.
  Density: f(x) = 1 / (πβ [1 + ((x − α)/β)²]), x ∈ ℝ.
  Moments: E{X} and var{X} are nonexistent.

Chi-square: X ~ χ²_ν, ν > 0.
  Density: f(x) is the Gamma(ν/2, 1/2) density, x > 0.
  Moments: E{X} = ν, var{X} = 2ν.

Dirichlet: X ~ Dirichlet(α), α = (α1, …, αk), αi > 0, α0 = Σ_{i=1}^{k} αi.
  Density: f(x) = Γ(α0) ∏_{i=1}^{k} xi^(αi−1) / ∏_{i=1}^{k} Γ(αi), x = (x1, …, xk), 0 ≤ xi ≤ 1, Σ_{i=1}^{k} xi = 1.
  Moments: E{X} = α/α0, var{Xi} = αi(α0 − αi)/[α0²(α0 + 1)], cov{Xi, Xj} = −αi αj/[α0²(α0 + 1)].

Exponential: X ~ Exp(λ), λ > 0.
  Density: f(x) = λ exp{−λx}, x > 0.
  Moments: E{X} = 1/λ, var{X} = 1/λ².

Gamma: X ~ Gamma(r, λ), λ > 0 and r > 0.
  Density: f(x) = [λ^r x^(r−1) / Γ(r)] exp{−λx}, x > 0.
  Moments: E{X} = r/λ, var{X} = r/λ².
TABLE 1.3 Notation and description for more common probability distributions of continuous random variables

Lognormal: X ~ Lognormal(μ, σ²), μ ∈ ℝ and σ > 0.
  Density: f(x) = [1/(x √(2πσ²))] exp{−(log x − μ)²/(2σ²)}, x > 0.
  Moments: E{X} = exp{μ + σ²/2}, var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}.

Multivariate Normal: X ~ Nk(μ, Σ), μ = (μ1, …, μk) ∈ ℝ^k, Σ positive definite.
  Density: f(x) = exp{−(x − μ)^T Σ⁻¹ (x − μ)/2} / ((2π)^(k/2) |Σ|^(1/2)), x = (x1, …, xk) ∈ ℝ^k.
  Moments: E{X} = μ, var{X} = Σ.

Normal: X ~ N(μ, σ²), μ ∈ ℝ and σ > 0.
  Density: f(x) = [1/√(2πσ²)] exp{−(x − μ)²/(2σ²)}, x ∈ ℝ.
  Moments: E{X} = μ, var{X} = σ².

Student's t: X ~ t_ν, ν > 0.
  Density: f(x) = [Γ((ν + 1)/2) / (Γ(ν/2) √(πν))] (1 + x²/ν)^(−(ν+1)/2), x ∈ ℝ.
  Moments: E{X} = 0 if ν > 1; var{X} = ν/(ν − 2) if ν > 2.

Uniform: X ~ Unif(a, b), a, b ∈ ℝ and a < b.
  Density: f(x) = 1/(b − a), x ∈ [a, b].
  Moments: E{X} = (a + b)/2, var{X} = (b − a)²/12.

Weibull: X ~ Weibull(a, b), a > 0 and b > 0.
  Density: f(x) = a b x^(b−1) exp{−a x^b}, x > 0.
  Moments: E{X} = Γ(1 + 1/b)/a^(1/b), var{X} = [Γ(1 + 2/b) − Γ(1 + 1/b)²]/a^(2/b).
    (n choose k1, …, km) = n! / ∏_{i=1}^{m} ki!,  where n = Σ_{i=1}^{m} ki,    (1.15)

    Γ(r) = (r − 1)!  for r = 1, 2, …,  and  Γ(r) = ∫₀^∞ t^(r−1) exp{−t} dt  for general r > 0.    (1.16)
It is worth knowing that

    Γ(1/2) = √π  and  Γ(n + 1/2) = [1 × 3 × 5 × ··· × (2n − 1)] √π / 2^n

for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as
    f(x|γ) = c1(x) c2(γ) exp{ Σ_{i=1}^{k} yi(x) θi(γ) }    (1.17)

for nonnegative functions c1 and c2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θi(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The yi(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that

    E{y(X)} = κ′(θ)    (1.18)

and

    var{y(X)} = κ″(θ),    (1.19)

where κ(θ) = −log c3(θ), letting c3(θ) denote the reexpression of c2(γ) in terms of the canonical parameters θ = (θ1, …, θk), and y(X) = (y1(X), …, yk(X)). These results can be rewritten in terms of the original parameters γ as
    E{ Σ_{i=1}^{k} (dθi(γ)/dγj) yi(X) } = −(d/dγj) log c2(γ)    (1.20)
and

    var{ Σ_{i=1}^{k} (dθi(γ)/dγj) yi(X) } = −(d²/dγj²) log c2(γ) − E{ Σ_{i=1}^{k} (d²θi(γ)/dγj²) yi(X) }.    (1.21)
Example 1.1 (Poisson) The Poisson distribution belongs to the exponential family, with c1(x) = 1/x!, c2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ′(θ) = exp{θ} = λ and var{X} = κ″(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
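The identities E{X} = κ′(θ) and var{X} = κ″(θ) can also be confirmed numerically for the Poisson case, differentiating κ(θ) = exp{θ} by central differences (a sketch; the value λ = 3 and the step h are arbitrary choices):

```python
import math

def kappa(theta):
    # For the Poisson family, kappa(theta) = exp(theta)
    return math.exp(theta)

lam = 3.0
theta = math.log(lam)
h = 1e-5
# Central-difference estimates of kappa'(theta) and kappa''(theta),
# both of which should equal lambda
mean = (kappa(theta + h) - kappa(theta - h)) / (2 * h)
var = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h ** 2
```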
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X1, …, Xp) denote a p-dimensional random variable with continuous density function f. Suppose that

    U = g(X) = (g1(X), …, gp(X)) = (U1, …, Up),    (1.22)

where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

    f(u) = f(g⁻¹(u)) |J(u)|,    (1.23)
where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g⁻¹ evaluated at u, having (i, j)th element dxi/duj, where these derivatives are assumed to be continuous over the support region of U.
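As a one-dimensional illustration of (1.23) (a sketch added here, not the book's example): if X ~ Exp(1) and U = √X, then g⁻¹(u) = u², |dx/du| = 2u, and f(u) = 2u exp{−u²}. Integrating this transformed density numerically recovers the probability computed directly from X:

```python
import math

def f_u(u):
    # Density of U = sqrt(X) for X ~ Exp(1) via (1.23):
    # f(u) = f_X(g^{-1}(u)) * |dx/du| = exp(-u^2) * 2u
    return math.exp(-u * u) * 2 * u

def midpoint(f, a, b, n=10000):
    # Simple midpoint-rule quadrature
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# P(U <= 2) should equal P(X <= 4) = 1 - exp(-4)
p = midpoint(f_u, 0.0, 2.0)
```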
1.4 LIKELIHOOD INFERENCE
If X1, …, Xn are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ1, …, θp), then the joint likelihood function is

    L(θ) = ∏_{i=1}^{n} f(xi|θ).    (1.24)

When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x1, …, xn|θ) viewed as a function of θ.
The observed data, x1, …, xn, might have been realized under many different values for θ. The parameters for which observing x1, …, xn would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x1, …, xn that maximizes L(θ), then θ̂ = ϑ(X1, …, Xn) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.
It is typically easier to work with the log likelihood function

    l(θ) = log L(θ),    (1.25)

which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x1, …, xn, but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations

    l′(θ) = 0,    (1.26)

where

    l′(θ) = ( dl(θ)/dθ1, …, dl(θ)/dθp )
is called the score function. The score function satisfies

    E{l′(θ)} = 0,    (1.27)

where the expectation is taken with respect to the distribution of X1, …, Xn. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.

The MLE has a sampling distribution because it depends on the realization of the random variables X1, …, Xn. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: when the log likelihood is very pointy, the location of the maximum is more precisely known.
To make this precise, let l″(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθi dθj). The Fisher information matrix is defined as

    I(θ) = E{l′(θ) l′(θ)^T} = −E{l″(θ)},    (1.28)

where the expectations are taken with respect to the distribution of X1, …, Xn. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l″(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.

Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)⁻¹, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is Np(θ*, I(θ*)⁻¹). Since the true parameter values are unknown, I(θ*)⁻¹ must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)⁻¹. Alternatively, it is also reasonable to use −l″(θ̂)⁻¹. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)⁻¹. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)⁻¹ can be found in [127, 182, 371, 470].
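These ideas can be sketched numerically. The fragment below (illustrative; the data are made up, and the Poisson model is chosen only because its MLE is the sample mean, so the answer can be checked) solves the score equation (1.26) by Newton's method and takes a standard error from the observed Fisher information:

```python
import math

# i.i.d. Poisson data: l'(lambda) = sum(x)/lambda - n and
# -l''(lambda) = sum(x)/lambda^2
x = [2, 4, 3, 1, 5, 3, 2, 4]
n, s = len(x), sum(x)

def score(lam):
    return s / lam - n

def observed_info(lam):  # -l''(lambda), the observed Fisher information
    return s / lam ** 2

lam = 1.0
for _ in range(50):  # Newton's method: lam <- lam + l'(lam) / (-l''(lam))
    lam += score(lam) / observed_info(lam)

se = 1.0 / math.sqrt(observed_info(lam))  # sqrt of inverse observed information
# lam converges to the sample mean 3.0, and se = sqrt(3/8)
```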
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is

    L(φ|μ̂(φ)) = max over μ of L(μ, φ).    (1.29)

Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
CONTENTS
4.2.3 Variance Estimation 106
4.2.3.1 Louis's Method 106
4.2.3.2 SEM Algorithm 108
4.2.3.3 Bootstrapping 110
4.2.3.4 Empirical Information 110
4.2.3.5 Numerical Differentiation 111
4.3 EM Variants 111
4.3.1 Improving the E Step 111
4.3.1.1 Monte Carlo EM 111
4.3.2 Improving the M Step 112
4.3.2.1 ECM Algorithm 113
4.3.2.2 EM Gradient Algorithm 116
4.3.3 Acceleration Methods 118
4.3.3.1 Aitken Acceleration 118
4.3.3.2 Quasi-Newton Acceleration 119
Problems 121
PART II
INTEGRATION AND SIMULATION
5 NUMERICAL INTEGRATION 129
5.1 Newton–Cotes Quadrature 129
5.1.1 Riemann Rule 130
5.1.2 Trapezoidal Rule 134
5.1.3 Simpson's Rule 136
5.1.4 General kth-Degree Rule 138
5.2 Romberg Integration 139
5.3 Gaussian Quadrature 142
5.3.1 Orthogonal Polynomials 143
5.3.2 The Gaussian Quadrature Rule 143
5.4 Frequently Encountered Problems 146
5.4.1 Range of Integration 146
5.4.2 Integrands with Singularities or Other Extreme Behavior 146
5.4.3 Multiple Integrals 147
5.4.4 Adaptive Quadrature 147
5.4.5 Software for Exact Integration 148
Problems 148
6 SIMULATION AND MONTE CARLO INTEGRATION 151
6.1 Introduction to the Monte Carlo Method 151
6.2 Exact Simulation 152
6.2.1 Generating from Standard Parametric Families 153
6.2.2 Inverse Cumulative Distribution Function 153
6.2.3 Rejection Sampling 155
6.2.3.1 Squeezed Rejection Sampling 158
6.2.3.2 Adaptive Rejection Sampling 159
6.3 Approximate Simulation 163
6.3.1 Sampling Importance Resampling Algorithm 163
6.3.1.1 Adaptive Importance, Bridge, and Path Sampling 167
6.3.2 Sequential Monte Carlo 168
6.3.2.1 Sequential Importance Sampling for Markov Processes 169
6.3.2.2 General Sequential Importance Sampling 170
6.3.2.3 Weight Degeneracy, Rejuvenation, and Effective Sample Size 171
6.3.2.4 Sequential Importance Sampling for Hidden Markov Models 175
6.3.2.5 Particle Filters 179
6.4 Variance Reduction Techniques 180
6.4.1 Importance Sampling 180
6.4.2 Antithetic Sampling 186
6.4.3 Control Variates 189
6.4.4 Rao–Blackwellization 193
Problems 195
7 MARKOV CHAIN MONTE CARLO 201
7.1 Metropolis–Hastings Algorithm 202
7.1.1 Independence Chains 204
7.1.2 Random Walk Chains 206
7.2 Gibbs Sampling 209
7.2.1 Basic Gibbs Sampler 209
7.2.2 Properties of the Gibbs Sampler 214
7.2.3 Update Ordering 216
7.2.4 Blocking 216
7.2.5 Hybrid Gibbs Sampling 216
7.2.6 Griddy–Gibbs Sampler 218
7.3 Implementation 218
7.3.1 Ensuring Good Mixing and Convergence 219
7.3.1.1 Simple Graphical Diagnostics 219
7.3.1.2 Burn-in and Run Length 220
7.3.1.3 Choice of Proposal 222
7.3.1.4 Reparameterization 223
7.3.1.5 Comparing Chains: Effective Sample Size 224
7.3.1.6 Number of Chains 225
7.3.2 Practical Implementation Advice 226
7.3.3 Using the Results 226
Problems 230

8 ADVANCED TOPICS IN MCMC 237
8.1 Adaptive MCMC 237
8.1.1 Adaptive Random Walk Metropolis-within-Gibbs Algorithm 238
8.1.2 General Adaptive Metropolis-within-Gibbs Algorithm 240
8.1.3 Adaptive Metropolis Algorithm 247
8.2 Reversible Jump MCMC 250
8.2.1 RJMCMC for Variable Selection in Regression 253
8.3 Auxiliary Variable Methods 256
8.3.1 Simulated Tempering 257
8.3.2 Slice Sampler 258
8.4 Other Metropolis–Hastings Algorithms 260
8.4.1 Hit-and-Run Algorithm 260
8.4.2 Multiple-Try Metropolis–Hastings Algorithm 261
8.4.3 Langevin Metropolis–Hastings Algorithm 262
8.5 Perfect Sampling 264
8.5.1 Coupling from the Past 264
8.5.1.1 Stochastic Monotonicity and Sandwiching 267
8.6 Markov Chain Maximum Likelihood 268
8.7 Example: MCMC for Markov Random Fields 269
8.7.1 Gibbs Sampling for Markov Random Fields 270
8.7.2 Auxiliary Variable Methods for Markov Random Fields 274
8.7.3 Perfect Sampling for Markov Random Fields 277
Problems 279
PART III
BOOTSTRAPPING
9 BOOTSTRAPPING 287
9.1 The Bootstrap Principle 287
9.2 Basic Methods 288
9.2.1 Nonparametric Bootstrap 288
9.2.2 Parametric Bootstrap 289
9.2.3 Bootstrapping Regression 290
9.2.4 Bootstrap Bias Correction 291
9.3 Bootstrap Inference 292
9.3.1 Percentile Method 292
9.3.1.1 Justification for the Percentile Method 293
9.3.2 Pivoting 294
9.3.2.1 Accelerated Bias-Corrected Percentile Method, BCa 294
9.3.2.2 The Bootstrap t 296
9.3.2.3 Empirical Variance Stabilization 298
9.3.2.4 Nested Bootstrap and Prepivoting 299
9.3.3 Hypothesis Testing 301
9.4 Reducing Monte Carlo Error 302
9.4.1 Balanced Bootstrap 302
9.4.2 Antithetic Bootstrap 302
9.5 Bootstrapping Dependent Data 303
9.5.1 Model-Based Approach 304
9.5.2 Block Bootstrap 304
9.5.2.1 Nonmoving Block Bootstrap 304
9.5.2.2 Moving Block Bootstrap 306
9.5.2.3 Blocks-of-Blocks Bootstrapping 307
9.5.2.4 Centering and Studentizing 309
9.5.2.5 Block Size 311
9.6 Bootstrap Performance 315
9.6.1 Independent Data Case 315
9.6.2 Dependent Data Case 316
9.7 Other Uses of the Bootstrap 316
9.8 Permutation Tests 317
Problems 319
PART IV
DENSITY ESTIMATION AND SMOOTHING
10 NONPARAMETRIC DENSITY ESTIMATION 325
10.1 Measures of Performance 326
10.2 Kernel Density Estimation 327
10.2.1 Choice of Bandwidth 329
10.2.1.1 Cross-Validation 332
10.2.1.2 Plug-in Methods 335
10.2.1.3 Maximal Smoothing Principle 338
10.2.2 Choice of Kernel 339
10.2.2.1 Epanechnikov Kernel 339
10.2.2.2 Canonical Kernels and Rescalings 340
10.3 Nonkernel Methods 341
10.3.1 Logspline 341
10.4 Multivariate Methods 345
10.4.1 The Nature of the Problem 345
10.4.2 Multivariate Kernel Estimators 346
10.4.3 Adaptive Kernels and Nearest Neighbors 348
10.4.3.1 Nearest Neighbor Approaches 349
10.4.3.2 Variable-Kernel Approaches and Transformations 350
10.4.4 Exploratory Projection Pursuit 353
Problems 359
11 BIVARIATE SMOOTHING 363
11.1 Predictor–Response Data 363
11.2 Linear Smoothers 365
11.2.1 Constant-Span Running Mean 366
11.2.1.1 Effect of Span 368
11.2.1.2 Span Selection for Linear Smoothers 369
11.2.2 Running Lines and Running Polynomials 372
11.2.3 Kernel Smoothers 374
11.2.4 Local Regression Smoothing 374
11.2.5 Spline Smoothing 376
11.2.5.1 Choice of Penalty 377
11.3 Comparison of Linear Smoothers 377
11.4 Nonlinear Smoothers 379
11.4.1 Loess 379
11.4.2 Supersmoother 381
11.5 Confidence Bands 384
11.6 General Bivariate Data 388
Problems 389
12 MULTIVARIATE SMOOTHING 393
12.1 Predictor–Response Data 393
12.1.1 Additive Models 394
12.1.2 Generalized Additive Models 397
12.1.3 Other Methods Related to Additive Models 399
12.1.3.1 Projection Pursuit Regression 399
12.1.3.2 Neural Networks 402
12.1.3.3 Alternating Conditional Expectations 403
12.1.3.4 Additivity and Variance Stabilization 404
12.1.4 Tree-Based Methods 405
12.1.4.1 Recursive Partitioning: Regression Trees 406
12.1.4.2 Tree Pruning 409
12.1.4.3 Classification Trees 411
12.1.4.4 Other Issues for Tree-Based Methods 412
12.2 General Multivariate Data 413
12.2.1 Principal Curves 413
12.2.1.1 Definition and Motivation 413
12.2.1.2 Estimation 415
12.2.1.3 Span Selection 416
Problems 416
DATA ACKNOWLEDGMENTS 421
REFERENCES 423
INDEX 457
PREFACE
This book covers most topics needed to develop a broad and thorough working knowledge of modern computational statistics. We seek to develop a practical understanding of how and why existing methods work, enabling readers to use modern statistical methods effectively. Since many new methods are built from components of existing techniques, our ultimate goal is to provide scientists with the tools they need to contribute new ideas to the field.
A growing challenge in science is that there is so much of it. While the pursuit of important new methods and the honing of existing approaches is a worthy goal, there is also a need to organize and distill the teeming jungle of ideas. We attempt to do that here. Our choice of topics reflects our view of what constitutes the core of the evolving field of computational statistics, and what will be interesting and useful for our readers.
Our use of the adjective modern in the first sentence of this preface is potentially troublesome. There is no way that this book can cover all the latest, greatest techniques. We have not even tried. We have instead aimed to provide a reasonably up-to-date survey of a broad portion of the field, while leaving room for diversions and esoterica.
The foundations of optimization and numerical integration are covered in this book. We include these venerable topics because (i) they are cornerstones of frequentist and Bayesian inference, (ii) routine application of available software often fails for hard problems, and (iii) the methods themselves are often secondary components of other statistical computing algorithms. Some topics we have omitted represent important areas of past and present research in the field, but their priority here is lowered by the availability of high-quality software. For example, the generation of pseudo-random numbers is a classic topic, but one that we prefer to address by giving students reliable software. Finally, some topics (e.g., principal curves and tabu search) are included simply because they are interesting and provide very different perspectives on familiar problems. Perhaps a future researcher may draw ideas from such topics to design a creative and effective new algorithm.
In this second edition, we have both updated and broadened our coverage, and we now provide computer code. For example, we have added new MCMC topics to reflect continued activity in that popular area. A notable increase in breadth is our inclusion of more methods relevant for problems where statistical dependency is important, such as block bootstrapping and sequential importance sampling. This second edition provides extensive new support in R. Specifically, code for the examples in this book is available from the book website, www.stat.colostate.edu/computationalstatistics.
Our target audience includes graduate students in statistics and related fields, statisticians, and quantitative empirical scientists in other fields. We hope such readers may use the book when applying standard methods and developing new methods.
The level of mathematics expected of the reader does not extend much beyond Taylor series and linear algebra. Breadth of mathematical training is more helpful than depth. Essential review is provided in Chapter 1. More advanced readers will find greater mathematical detail in the wide variety of high-quality books available on specific topics, many of which are referenced in the text. Other readers, caring less about analytical details, may prefer to focus on our descriptions of algorithms and examples.
The expected level of statistics is equivalent to that obtained by a graduate student in his or her first year of study of the theory of statistics and probability. An understanding of maximum likelihood methods, Bayesian methods, elementary asymptotic theory, Markov chains, and linear models is most important. Many of these topics are reviewed in Chapter 1.
With respect to computer programming, we find that good students can learn as they go. However, a working knowledge of a suitable language allows implementation of the ideas covered in this book to progress much more quickly. We have chosen to forgo any language-specific examples, algorithms, or coding in the text. For those wishing to learn a language while they study this book, we recommend that you choose a high-level, interactive package that permits the flexible design of graphical displays and includes supporting statistics and probability functions, such as R and MATLAB.¹ These are the sort of languages often used by researchers during the development of new statistical computing techniques, and they are suitable for implementing all the methods we describe, except in some cases for problems of vast scope or complexity. We use R and recommend it. Although lower-level languages such as C++ could also be used, they are more appropriate for professional-grade implementation of algorithms after researchers have refined the methodology.
The book is organized into four major parts: optimization (Chapters 2, 3, and 4), integration and simulation (Chapters 5, 6, 7, and 8), bootstrapping (Chapter 9), and density estimation and smoothing (Chapters 10, 11, and 12). The chapters are written to stand independently, so a course can be built by selecting the topics one wishes to teach. For a one-semester course, our selection typically weights most heavily topics from Chapters 2, 3, 6, 7, 9, 10, and 11. With a leisurely pace or more thorough coverage, a shorter list of topics could still easily fill a semester course. There is sufficient material here to provide a thorough one-year course of study, notwithstanding any supplemental topics one might wish to teach.
A variety of homework problems are included at the end of each chapter. Some are straightforward, while others require the student to develop a thorough understanding of the model/method being used, to carefully (and perhaps cleverly) code a suitable technique, and to devote considerable attention to the interpretation of results. A few exercises invite open-ended exploration of methods and ideas. We are sometimes asked for solutions to the exercises, but we prefer to sequester them to preserve the challenge for future students and readers.
The datasets discussed in the examples and exercises are available from the book website, www.stat.colostate.edu/computationalstatistics. The R code is also provided there. Finally, the website includes an errata. Responsibility for all errors lies with us.
¹R is available for free from www.r-project.org. Information about MATLAB can be found at www.mathworks.com.
ACKNOWLEDGMENTS
The course upon which this book is based was developed and taught by us at Colorado State University from 1994 onwards. Thanks are due to our many students, who have been semiwilling guinea pigs over the years. We also thank our colleagues in the Statistics Department for their continued support. The late Richard Tweedie merits particular acknowledgment for his mentoring during the early years of our careers.
We owe a great deal of intellectual debt to Adrian Raftery, who deserves special thanks not only for his teaching and advising, but also for his unwavering support and his seemingly inexhaustible supply of good ideas. In addition, we thank our influential advisors and teachers at the University of Washington Statistics Department, including David Madigan, Werner Stuetzle, and Judy Zeh. Of course, each of our chapters could be expanded into a full-length book, and great scholars have already done so. We owe much to their efforts, upon which we relied when developing our course and our manuscript.
Portions of the first edition were written at the Department of Mathematicsand Statistics University of Otago in Dunedin New Zealand whose faculty wethank for graciously hosting us during our sabbatical in 2003 Much of our workon the second edition was undertaken during our sabbatical visit to the AustraliaCommonwealth Scientific and Research Organization in 2009ndash2010 sponsored byCSIRO Mathematics Informatics and Statistics and hosted at the Longpocket Lab-oratory in Indooroopilly Australia We thank our hosts and colleagues there for theirsupport
Our manuscript has been greatly improved through the constructive reviewsof John Bickham Ben Bird Kate Cowles Jan Hannig Alan Herlihy David HunterDevin Johnson Michael Newton Doug Nychka Steve Sain David W Scott N ScottUrquhart Haonan Wang Darrell Whitley and eight anonymous referees We alsothank the sharp-eyed readers listed in the errata for their suggestions and correctionsOur editor Steve Quigley and the folks at Wiley were supportive and helpful duringthe publication process We thank Nelida Pohl for permission to adapt her photographin the cover design of the first edition We thank Melinda Stelzer for permission to useher painting ldquoChampagne Circuitrdquo 2001 for the cover of the second edition Moreabout her art can be found at wwwfacebookcomgeekchicart We also owe specialnote of thanks to Zube (aka John Dzubera) who kept our own computers runningdespite our best efforts to the contrary
Funding from National Science Foundation (NSF) CAREER grant SBR-9875508 was a significant source of support for the first author during the preparation of the first edition. He also thanks his colleagues and friends in the North Slope Borough, Alaska, Department of Wildlife Management for their longtime research support. The second author gratefully acknowledges the support of STAR Research Assistance Agreement CR-829095 awarded to Colorado State University by the U.S. Environmental Protection Agency (EPA). The views expressed here are solely those of the authors. NSF and EPA do not endorse any products or commercial services mentioned herein.
Finally, we thank our parents for enabling and supporting our educations, and for providing us with the “stubbornness gene” necessary for graduate school, the tenure track, or book publication—take your pick. The second edition is dedicated to our kids, Natalie and Neil, for continuing to show us what is important and what is not.

Geof H. Givens
Jennifer A. Hoeting
CHAPTER 1

REVIEW
This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.
1.1 MATHEMATICAL NOTATION
We use boldface to distinguish a vector x = (x1, ..., xp) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f1(x), ..., fp(x)). The transpose of M is denoted Mᵀ.

Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x1, ..., xn)ᵀ. Let I denote an identity matrix, and 1 and 0 denote vectors of ones and zeros, respectively.

A symmetric square matrix M is positive definite if xᵀMx > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite or positive semidefinite if xᵀMx ≥ 0 for all nonzero vectors x.
The derivative of a function f, evaluated at x, is denoted f′(x). When x = (x1, ..., xp), the gradient of f at x is

    f′(x) = (df(x)/dx1, ..., df(x)/dxp).

The Hessian matrix for f at x is f′′(x), having (i, j)th element equal to d²f(x)/(dxi dxj). The negative Hessian has important uses in statistical inference.

Let J(x) denote the Jacobian matrix evaluated at x for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to dfi(x)/dxj.
A functional is a real-valued function on a space of functions. For example, if T(f) = ∫₀¹ f(x) dx, then the functional T maps suitably integrable functions onto the real line.

The indicator function 1{A} equals 1 if A is true and 0 otherwise. The real line is denoted ℝ, and p-dimensional real space is ℝᵖ.
Computational Statistics, Second Edition. Geof H. Givens and Jennifer A. Hoeting.
© 2013 John Wiley & Sons, Inc. Published 2013 by John Wiley & Sons, Inc.
1.2 TAYLOR'S THEOREM AND MATHEMATICAL LIMIT THEORY
First we define standard “big oh” and “little oh” notation for describing the relative orders of convergence of functions. Let the functions f and g be defined on a common, possibly infinite, interval. Let z0 be a point in this interval or a boundary point of it (i.e., −∞ or ∞). We require g(z) ≠ 0 for all z ≠ z0 in a neighborhood of z0. Then we say

    f(z) = O(g(z))   (1.1)

if there exists a constant M such that |f(z)| ≤ M|g(z)| as z → z0. For example, (n + 1)/(3n²) = O(n⁻¹), and it is understood that we are considering n → ∞. If lim_{z→z0} f(z)/g(z) = 0, then we say

    f(z) = o(g(z)).   (1.2)

For example, f(x0 + h) − f(x0) = hf′(x0) + o(h) as h → 0 if f is differentiable at x0. The same notation can be used for describing the convergence of a sequence xn as n → ∞, by letting f(n) = xn.
Taylor's theorem provides a polynomial approximation to a function f. Suppose f has a finite (n + 1)th derivative on (a, b) and a continuous nth derivative on [a, b]. Then for any x0 ∈ [a, b] distinct from x, the Taylor series expansion of f about x0 is

    f(x) = Σ_{i=0}^{n} (1/i!) f⁽ⁱ⁾(x0)(x − x0)ⁱ + Rn,   (1.3)

where f⁽ⁱ⁾(x0) is the ith derivative of f evaluated at x0, and

    Rn = (1/(n + 1)!) f⁽ⁿ⁺¹⁾(ξ)(x − x0)ⁿ⁺¹   (1.4)

for some point ξ in the interval between x and x0. As |x − x0| → 0, note that Rn = O(|x − x0|ⁿ⁺¹).
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x0 ≠ x. Then

    f(x) = f(x0) + Σ_{i=1}^{n} (1/i!) D⁽ⁱ⁾(f; x0, x − x0) + Rn,   (1.5)

where

    D⁽ⁱ⁾(f; x, y) = Σ_{j1=1}^{p} ··· Σ_{ji=1}^{p} ( ∂ⁱf(t) / (∂t_{j1} ··· ∂t_{ji}) |_{t=x} ) Π_{k=1}^{i} y_{jk}   (1.6)

and

    Rn = (1/(n + 1)!) D⁽ⁿ⁺¹⁾(f; ξ, x − x0)   (1.7)

for some ξ on the line segment joining x and x0. As |x − x0| → 0, note that Rn = O(|x − x0|ⁿ⁺¹).
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives on [0, 1], then

    ∫₀¹ f(x) dx = (f(0) + f(1))/2 − Σ_{i=1}^{n−1} b_{2i}(f⁽²ⁱ⁻¹⁾(1) − f⁽²ⁱ⁻¹⁾(0))/(2i)! − b_{2n} f⁽²ⁿ⁾(ξ)/(2n)!,   (1.8)

where 0 ≤ ξ ≤ 1, f⁽ʲ⁾ is the jth derivative of f, and bⱼ = Bⱼ(0) can be determined using the recursion relation

    Σ_{j=0}^{m} (m + 1 choose j) Bⱼ(z) = (m + 1)zᵐ,   (1.9)

initialized with B0(z) = 1. The proof of this result is based on repeated integrations by parts [376].
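The recursion (1.9) and the correction terms in (1.8) are easy to verify numerically. The sketch below (Python; the book's examples use R) computes the first few bⱼ = Bⱼ(0) exactly with rational arithmetic, then shows that a single Euler–Maclaurin correction already improves the trapezoid approximation of ∫₀¹ eˣ dx.

```python
import math
from fractions import Fraction

def bernoulli_numbers(m_max):
    """b_j = B_j(0) from (1.9): setting z = 0 makes the right side vanish
    for m >= 1, so b_m = -sum_{j<m} C(m+1, j) b_j / (m + 1)."""
    b = [Fraction(1)]  # B_0(z) = 1
    for m in range(1, m_max + 1):
        s = sum(math.comb(m + 1, j) * b[j] for j in range(m))
        b.append(-s / (m + 1))
    return b

b = bernoulli_numbers(4)  # [1, -1/2, 1/6, 0, -1/30]

# Apply (1.8) with n = 2 to f = exp on [0, 1]: one correction term (i = 1).
f0, f1 = 1.0, math.e
trapezoid = (f0 + f1) / 2
corr = float(b[2]) * (math.e - 1) / math.factorial(2)  # b_2 (f'(1) - f'(0)) / 2!
approx = trapezoid - corr
exact = math.e - 1
print(trapezoid - exact, approx - exact)  # correction shrinks the error
```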
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

    df(x)/dxi ≈ (f(x + εi ei) − f(x − εi ei)) / (2εi),   (1.10)

where εi is a small number and ei is the unit vector in the ith coordinate direction. Typically, one might start with, say, εi = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller εi. The approximation will generally improve until εi becomes small enough that the calculation is degraded, and eventually dominated, by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via

    d²f(x)/(dxi dxj) ≈ (1/(4εiεj)) (f(x + εi ei + εj ej) − f(x + εi ei − εj ej) − f(x − εi ei + εj ej) + f(x − εi ei − εj ej)),   (1.11)

with similar sequential precision improvements.
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS
We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ∼ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters also will be denoted with a conditioning bar, as in f(x|α, β). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|α) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example, by using f(·|α) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are fX and fY, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.
The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or fX|Y(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|μ) = f(x|y, μ)f(y|μ). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.
The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^{1/2}/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so

    g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y)   (1.12)

for all x, y ∈ I and all 0 < λ < 1. Then Jensen's inequality states that E{g(X)} ≥ g(E{X}) for any random variable X having P[X ∈ I] = 1.
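A quick Monte Carlo illustration of Jensen's inequality (our own toy setup, in Python rather than the book's R), with the convex function g(x) = eˣ and X uniform on (0, 1):

```python
import math
import random

# E{g(X)} = e - 1 and g(E{X}) = e^(1/2); Jensen guarantees the first is larger.
random.seed(1)
xs = [random.random() for _ in range(100_000)]
mean_of_g = sum(math.exp(x) for x in xs) / len(xs)  # estimates E{g(X)}
g_of_mean = math.exp(sum(xs) / len(xs))             # estimates g(E{X})
print(mean_of_g, g_of_mean, mean_of_g >= g_of_mean)
```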
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

    n! = n(n − 1)(n − 2) ··· (3)(2)(1), with 0! = 1,   (1.13)

    (n choose k) = n! / (k!(n − k)!),   (1.14)
TABLE 1.1 Notation and description for common probability distributions of discrete random variables

Bernoulli: X ∼ Bernoulli(p), 0 ≤ p ≤ 1. Density f(x) = pˣ(1 − p)^(1−x) for x = 0 or 1. E{X} = p; var{X} = p(1 − p).

Binomial: X ∼ Bin(n, p), 0 ≤ p ≤ 1, n ∈ {1, 2, ...}. Density f(x) = (n choose x) pˣ(1 − p)^(n−x) for x = 0, 1, ..., n. E{X} = np; var{X} = np(1 − p).

Multinomial: X ∼ Multinomial(n, p), p = (p1, ..., pk), 0 ≤ pi ≤ 1 with Σ_{i=1}^{k} pi = 1, and n ∈ {1, 2, ...}. Density f(x) = (n choose x1, ..., xk) Π_{i=1}^{k} pi^{xi} for x = (x1, ..., xk) with xi ∈ {0, 1, ..., n} and Σ_{i=1}^{k} xi = n. E{X} = np; var{Xi} = npi(1 − pi); cov{Xi, Xj} = −npipj.

Negative Binomial: X ∼ NegBin(r, p), 0 ≤ p ≤ 1, r ∈ {1, 2, ...}. Density f(x) = (x + r − 1 choose r − 1) pʳ(1 − p)ˣ for x ∈ {0, 1, ...}. E{X} = r(1 − p)/p; var{X} = r(1 − p)/p².

Poisson: X ∼ Poisson(λ), λ > 0. Density f(x) = λˣ exp{−λ}/x! for x ∈ {0, 1, ...}. E{X} = λ; var{X} = λ.
TABLE 1.2 Notation and description for some common probability distributions of continuous random variables

Beta: X ∼ Beta(α, β), α > 0 and β > 0. Density f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^(α−1)(1 − x)^(β−1) for 0 ≤ x ≤ 1. E{X} = α/(α + β); var{X} = αβ/((α + β)²(α + β + 1)).

Cauchy: X ∼ Cauchy(α, β), α ∈ ℝ and β > 0. Density f(x) = 1/(πβ[1 + ((x − α)/β)²]) for x ∈ ℝ. E{X} and var{X} are nonexistent.

Chi-square: X ∼ χ²_ν, ν > 0. Density equals that of Gamma(ν/2, 1/2), for x > 0. E{X} = ν; var{X} = 2ν.

Dirichlet: X ∼ Dirichlet(α), α = (α1, ..., αk), αi > 0, α0 = Σ_{i=1}^{k} αi. Density f(x) = Γ(α0) Π_{i=1}^{k} xi^(αi−1) / Π_{i=1}^{k} Γ(αi) for x = (x1, ..., xk) with 0 ≤ xi ≤ 1 and Σ_{i=1}^{k} xi = 1. E{X} = α/α0; var{Xi} = αi(α0 − αi)/(α0²(α0 + 1)); cov{Xi, Xj} = −αiαj/(α0²(α0 + 1)).

Exponential: X ∼ Exp(λ), λ > 0. Density f(x) = λ exp{−λx} for x > 0. E{X} = 1/λ; var{X} = 1/λ².

Gamma: X ∼ Gamma(r, λ), λ > 0 and r > 0. Density f(x) = λʳ x^(r−1) exp{−λx}/Γ(r) for x > 0. E{X} = r/λ; var{X} = r/λ².
TABLE 1.3 Notation and description for more common probability distributions of continuous random variables

Lognormal: X ∼ Lognormal(μ, σ²), μ ∈ ℝ and σ > 0. Density f(x) = (1/(x√(2πσ²))) exp{−(log x − μ)²/(2σ²)} for x > 0. E{X} = exp{μ + σ²/2}; var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}.

Multivariate Normal: X ∼ Nk(μ, Σ), μ = (μ1, ..., μk) ∈ ℝᵏ, Σ positive definite. Density f(x) = exp{−(x − μ)ᵀΣ⁻¹(x − μ)/2} / ((2π)^(k/2)|Σ|^(1/2)) for x = (x1, ..., xk) ∈ ℝᵏ. E{X} = μ; var{X} = Σ.

Normal: X ∼ N(μ, σ²), μ ∈ ℝ and σ > 0. Density f(x) = (1/√(2πσ²)) exp{−(x − μ)²/(2σ²)} for x ∈ ℝ. E{X} = μ; var{X} = σ².

Student's t: X ∼ t_ν, ν > 0. Density f(x) = (Γ((ν + 1)/2)/(Γ(ν/2)√(πν))) (1 + x²/ν)^(−(ν+1)/2) for x ∈ ℝ. E{X} = 0 if ν > 1; var{X} = ν/(ν − 2) if ν > 2.

Uniform: X ∼ Unif(a, b), a, b ∈ ℝ and a < b. Density f(x) = 1/(b − a) for x ∈ [a, b]. E{X} = (a + b)/2; var{X} = (b − a)²/12.

Weibull: X ∼ Weibull(a, b), a > 0 and b > 0. Density f(x) = abx^(b−1) exp{−axᵇ} for x > 0. E{X} = Γ(1 + 1/b)/a^(1/b); var{X} = (Γ(1 + 2/b) − Γ(1 + 1/b)²)/a^(2/b).
    (n choose k1, ..., km) = n! / (Π_{i=1}^{m} ki!), where n = Σ_{i=1}^{m} ki,   (1.15)

    Γ(r) = (r − 1)! for r = 1, 2, ..., and Γ(r) = ∫₀^∞ t^(r−1) exp{−t} dt for general r > 0.   (1.16)
It is worth knowing that

    Γ(1/2) = √π  and  Γ(n + 1/2) = (1 × 3 × 5 × ··· × (2n − 1)) √π / 2ⁿ

for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as
    f(x|γ) = c1(x) c2(γ) exp{ Σ_{i=1}^{k} yi(x) θi(γ) }   (1.17)

for nonnegative functions c1 and c2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θi(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The yi(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that
    E{y(X)} = κ′(θ)   (1.18)

and

    var{y(X)} = κ′′(θ),   (1.19)

where κ(θ) = −log c3(θ), letting c3(θ) denote the reexpression of c2(γ) in terms of the canonical parameters θ = (θ1, ..., θk), and y(X) = (y1(X), ..., yk(X)). These results can be rewritten in terms of the original parameters γ as
    E{ Σ_{i=1}^{k} (dθi(γ)/dγj) yi(X) } = −(d/dγj) log c2(γ)   (1.20)

and

    var{ Σ_{i=1}^{k} (dθi(γ)/dγj) yi(X) } = −(d²/dγj²) log c2(γ) − E{ Σ_{i=1}^{k} (d²θi(γ)/dγj²) yi(X) }.   (1.21)
Example 1.1 (Poisson) The Poisson distribution belongs to the exponential family, with c1(x) = 1/x!, c2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ′(θ) = exp{θ} = λ and var{X} = κ′′(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
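Example 1.1 can also be verified numerically. The Python sketch below (the book's own code is in R) confirms that the exponential-family factorization reproduces the Poisson density, and that κ′(θ) = exp{θ} = λ matches E{X} computed directly:

```python
import math

# Exponential-family form of the Poisson density:
# f(x|λ) = c1(x) c2(λ) exp{y(x) θ(λ)}, with c1(x) = 1/x!, c2(λ) = exp{-λ},
# y(x) = x, and θ(λ) = log λ.
lam = 3.7
theta = math.log(lam)
for x in range(20):
    poisson = lam ** x * math.exp(-lam) / math.factorial(x)
    exp_family = (1 / math.factorial(x)) * math.exp(-lam) * math.exp(x * theta)
    assert abs(poisson - exp_family) < 1e-12

# κ(θ) = exp{θ}, so E{X} = κ'(θ) = λ; compare with the mean computed directly
# (the tail beyond x = 120 is negligible for λ = 3.7).
mean = sum(x * lam ** x * math.exp(-lam) / math.factorial(x) for x in range(120))
print(abs(mean - lam) < 1e-9)  # True
```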
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X1, ..., Xp) denote a p-dimensional random variable with continuous density function f. Suppose that

    U = g(X) = (g1(X), ..., gp(X)) = (U1, ..., Up),   (1.22)

where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

    f(u) = f(g⁻¹(u)) |J(u)|,   (1.23)

where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g⁻¹ evaluated at u, having (i, j)th element dxi/duj, where these derivatives are assumed to be continuous over the support region of U.
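As a concrete instance of (1.23), consider U = exp(X) with X ∼ N(0, 1): here g⁻¹(u) = log u has Jacobian dx/du = 1/u, and the resulting density is the Lognormal(0, 1) density of Table 1.3. A quick Python check (the book's examples use R):

```python
import math

def normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def lognormal_pdf(u):
    # Lognormal(0, 1) density from Table 1.3
    return math.exp(-math.log(u) ** 2 / 2) / (u * math.sqrt(2 * math.pi))

# (1.23): f(u) = f(g^{-1}(u)) |J(u)| = normal_pdf(log u) * |1/u|.
max_gap = max(abs(normal_pdf(math.log(u)) * abs(1 / u) - lognormal_pdf(u))
              for u in [0.3, 1.0, 2.5])
print(max_gap < 1e-12)  # True
```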
1.4 LIKELIHOOD INFERENCE
If X1, ..., Xn are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ1, ..., θp), then the joint likelihood function is

    L(θ) = Π_{i=1}^{n} f(xi|θ).   (1.24)

When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x1, ..., xn|θ) viewed as a function of θ.
The observed data, x1, ..., xn, might have been realized under many different values for θ. The parameters for which observing x1, ..., xn would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x1, ..., xn that maximizes L(θ), then θ̂ = ϑ(X1, ..., Xn) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.
It is typically easier to work with the log likelihood function,

    l(θ) = log L(θ),   (1.25)

which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x1, ..., xn but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations

    l′(θ) = 0,   (1.26)

where

    l′(θ) = (dl(θ)/dθ1, ..., dl(θ)/dθp)

is called the score function. The score function satisfies

    E{l′(θ)} = 0,   (1.27)

where the expectation is taken with respect to the distribution of X1, ..., Xn. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.
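As a toy illustration of solving (1.26) numerically, the sketch below (in Python; the book's code is in R; the counts are hypothetical data) finds the Poisson MLE by bisection on the score function and recovers the closed-form answer, the sample mean:

```python
# Hypothetical i.i.d. Poisson counts.
x = [2, 4, 3, 0, 5, 3, 2, 4]
n, total = len(x), sum(x)

def score(lam):
    # l(λ) = Σ xi log λ - nλ + constant, so l'(λ) = total/λ - n.
    return total / lam - n

# The score is positive at the lower endpoint and negative at the upper one,
# so bisection brackets the root of (1.26).
lo, hi = 1e-8, 1e3
for _ in range(100):
    mid = (lo + hi) / 2
    if score(mid) > 0:
        lo = mid
    else:
        hi = mid
mle = (lo + hi) / 2
print(mle, total / n)  # both 2.875
```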
The MLE has a sampling distribution because it depends on the realization of the random variables X1, ..., Xn. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: when the log likelihood is very pointy, the location of the maximum is more precisely known.
To make this precise, let l′′(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθi dθj). The Fisher information matrix is defined as

    I(θ) = E{l′(θ)l′(θ)ᵀ} = −E{l′′(θ)},   (1.28)

where the expectations are taken with respect to the distribution of X1, ..., Xn. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l′′(θ̂), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.
Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)⁻¹, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is Np(θ*, I(θ*)⁻¹). Since the true parameter values are unknown, I(θ*)⁻¹ must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)⁻¹. Alternatively, it is also reasonable to use −l′′(θ̂)⁻¹. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)⁻¹. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)⁻¹ can be found in [127, 182, 371, 470].
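To illustrate these standard error estimates, the following sketch (in Python; the book uses R; the counts are hypothetical) computes the observed Fisher information for a Poisson sample via a finite-difference approximation to l′′(λ̂), as in Section 1.2, and compares it with the exact value n/λ̂:

```python
import math

# For i.i.d. Poisson data, l''(λ) = -Σxi/λ², so the observed information at
# the MLE λ̂ = x̄ is n/λ̂, and the standard error estimate is sqrt(λ̂/n).
x = [2, 4, 3, 0, 5, 3, 2, 4]  # hypothetical counts
n, total = len(x), sum(x)
mle = total / n

def loglik(lam):
    return total * math.log(lam) - n * lam  # additive constant dropped

# Central second difference approximating l''(λ̂).
eps = 1e-4
second = (loglik(mle + eps) - 2 * loglik(mle) + loglik(mle - eps)) / eps ** 2
obs_info = -second                  # approximately n / mle
se = math.sqrt(1 / obs_info)        # approximately sqrt(mle / n)
print(obs_info, n / mle, se)
```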
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is

    L(φ|μ̂(φ)) = max_μ L(μ, φ).   (1.29)

Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
CONTENTS

6.2.1 Generating from Standard Parametric Families 153
6.2.2 Inverse Cumulative Distribution Function 153
6.2.3 Rejection Sampling 155
6.2.3.1 Squeezed Rejection Sampling 158
6.2.3.2 Adaptive Rejection Sampling 159
6.3 Approximate Simulation 163
6.3.1 Sampling Importance Resampling Algorithm 163
6.3.1.1 Adaptive Importance, Bridge, and Path Sampling 167
6.3.2 Sequential Monte Carlo 168
6.3.2.1 Sequential Importance Sampling for Markov Processes 169
6.3.2.2 General Sequential Importance Sampling 170
6.3.2.3 Weight Degeneracy, Rejuvenation, and Effective Sample Size 171
6.3.2.4 Sequential Importance Sampling for Hidden Markov Models 175
6.3.2.5 Particle Filters 179
6.4 Variance Reduction Techniques 180
6.4.1 Importance Sampling 180
6.4.2 Antithetic Sampling 186
6.4.3 Control Variates 189
6.4.4 Rao–Blackwellization 193
Problems 195

7 MARKOV CHAIN MONTE CARLO 201
7.1 Metropolis–Hastings Algorithm 202
7.1.1 Independence Chains 204
7.1.2 Random Walk Chains 206
7.2 Gibbs Sampling 209
7.2.1 Basic Gibbs Sampler 209
7.2.2 Properties of the Gibbs Sampler 214
7.2.3 Update Ordering 216
7.2.4 Blocking 216
7.2.5 Hybrid Gibbs Sampling 216
7.2.6 Griddy–Gibbs Sampler 218
7.3 Implementation 218
7.3.1 Ensuring Good Mixing and Convergence 219
7.3.1.1 Simple Graphical Diagnostics 219
7.3.1.2 Burn-in and Run Length 220
7.3.1.3 Choice of Proposal 222
7.3.1.4 Reparameterization 223
7.3.1.5 Comparing Chains: Effective Sample Size 224
7.3.1.6 Number of Chains 225
7.3.2 Practical Implementation Advice 226
7.3.3 Using the Results 226
Problems 230

8 ADVANCED TOPICS IN MCMC 237
8.1 Adaptive MCMC 237
8.1.1 Adaptive Random Walk Metropolis-within-Gibbs Algorithm 238
8.1.2 General Adaptive Metropolis-within-Gibbs Algorithm 240
8.1.3 Adaptive Metropolis Algorithm 247
8.2 Reversible Jump MCMC 250
8.2.1 RJMCMC for Variable Selection in Regression 253
8.3 Auxiliary Variable Methods 256
8.3.1 Simulated Tempering 257
8.3.2 Slice Sampler 258
8.4 Other Metropolis–Hastings Algorithms 260
8.4.1 Hit-and-Run Algorithm 260
8.4.2 Multiple-Try Metropolis–Hastings Algorithm 261
8.4.3 Langevin Metropolis–Hastings Algorithm 262
8.5 Perfect Sampling 264
8.5.1 Coupling from the Past 264
8.5.1.1 Stochastic Monotonicity and Sandwiching 267
8.6 Markov Chain Maximum Likelihood 268
8.7 Example: MCMC for Markov Random Fields 269
8.7.1 Gibbs Sampling for Markov Random Fields 270
8.7.2 Auxiliary Variable Methods for Markov Random Fields 274
8.7.3 Perfect Sampling for Markov Random Fields 277
Problems 279

PART III
BOOTSTRAPPING

9 BOOTSTRAPPING 287
9.1 The Bootstrap Principle 287
9.2 Basic Methods 288
9.2.1 Nonparametric Bootstrap 288
9.2.2 Parametric Bootstrap 289
9.2.3 Bootstrapping Regression 290
9.2.4 Bootstrap Bias Correction 291
9.3 Bootstrap Inference 292
9.3.1 Percentile Method 292
9.3.1.1 Justification for the Percentile Method 293
9.3.2 Pivoting 294
9.3.2.1 Accelerated Bias-Corrected Percentile Method, BCa 294
9.3.2.2 The Bootstrap t 296
9.3.2.3 Empirical Variance Stabilization 298
9.3.2.4 Nested Bootstrap and Prepivoting 299
9.3.3 Hypothesis Testing 301
9.4 Reducing Monte Carlo Error 302
9.4.1 Balanced Bootstrap 302
9.4.2 Antithetic Bootstrap 302
9.5 Bootstrapping Dependent Data 303
9.5.1 Model-Based Approach 304
9.5.2 Block Bootstrap 304
9.5.2.1 Nonmoving Block Bootstrap 304
9.5.2.2 Moving Block Bootstrap 306
9.5.2.3 Blocks-of-Blocks Bootstrapping 307
9.5.2.4 Centering and Studentizing 309
9.5.2.5 Block Size 311
9.6 Bootstrap Performance 315
9.6.1 Independent Data Case 315
9.6.2 Dependent Data Case 316
9.7 Other Uses of the Bootstrap 316
9.8 Permutation Tests 317
Problems 319
PART IV
DENSITY ESTIMATION AND SMOOTHING

10 NONPARAMETRIC DENSITY ESTIMATION 325
10.1 Measures of Performance 326
10.2 Kernel Density Estimation 327
10.2.1 Choice of Bandwidth 329
10.2.1.1 Cross-Validation 332
10.2.1.2 Plug-in Methods 335
10.2.1.3 Maximal Smoothing Principle 338
10.2.2 Choice of Kernel 339
10.2.2.1 Epanechnikov Kernel 339
10.2.2.2 Canonical Kernels and Rescalings 340
10.3 Nonkernel Methods 341
10.3.1 Logspline 341
10.4 Multivariate Methods 345
10.4.1 The Nature of the Problem 345
10.4.2 Multivariate Kernel Estimators 346
10.4.3 Adaptive Kernels and Nearest Neighbors 348
10.4.3.1 Nearest Neighbor Approaches 349
10.4.3.2 Variable-Kernel Approaches and Transformations 350
10.4.4 Exploratory Projection Pursuit 353
Problems 359

11 BIVARIATE SMOOTHING 363
11.1 Predictor–Response Data 363
11.2 Linear Smoothers 365
11.2.1 Constant-Span Running Mean 366
11.2.1.1 Effect of Span 368
11.2.1.2 Span Selection for Linear Smoothers 369
11.2.2 Running Lines and Running Polynomials 372
11.2.3 Kernel Smoothers 374
11.2.4 Local Regression Smoothing 374
11.2.5 Spline Smoothing 376
11.2.5.1 Choice of Penalty 377
11.3 Comparison of Linear Smoothers 377
11.4 Nonlinear Smoothers 379
11.4.1 Loess 379
11.4.2 Supersmoother 381
11.5 Confidence Bands 384
11.6 General Bivariate Data 388
Problems 389

12 MULTIVARIATE SMOOTHING 393
12.1 Predictor–Response Data 393
12.1.1 Additive Models 394
12.1.2 Generalized Additive Models 397
12.1.3 Other Methods Related to Additive Models 399
12.1.3.1 Projection Pursuit Regression 399
12.1.3.2 Neural Networks 402
12.1.3.3 Alternating Conditional Expectations 403
12.1.3.4 Additivity and Variance Stabilization 404
12.1.4 Tree-Based Methods 405
12.1.4.1 Recursive Partitioning: Regression Trees 406
12.1.4.2 Tree Pruning 409
12.1.4.3 Classification Trees 411
12.1.4.4 Other Issues for Tree-Based Methods 412
12.2 General Multivariate Data 413
12.2.1 Principal Curves 413
12.2.1.1 Definition and Motivation 413
12.2.1.2 Estimation 415
12.2.1.3 Span Selection 416
Problems 416

DATA ACKNOWLEDGMENTS 421
REFERENCES 423
INDEX 457
PREFACE
This book covers most topics needed to develop a broad and thorough working knowledge of modern computational statistics. We seek to develop a practical understanding of how and why existing methods work, enabling readers to use modern statistical methods effectively. Since many new methods are built from components of existing techniques, our ultimate goal is to provide scientists with the tools they need to contribute new ideas to the field.

A growing challenge in science is that there is so much of it. While the pursuit of important new methods and the honing of existing approaches is a worthy goal, there is also a need to organize and distill the teeming jungle of ideas. We attempt to do that here. Our choice of topics reflects our view of what constitutes the core of the evolving field of computational statistics, and what will be interesting and useful for our readers.

Our use of the adjective modern in the first sentence of this preface is potentially troublesome. There is no way that this book can cover all the latest, greatest techniques. We have not even tried. We have instead aimed to provide a reasonably up-to-date survey of a broad portion of the field, while leaving room for diversions and esoterica.

The foundations of optimization and numerical integration are covered in this book. We include these venerable topics because (i) they are cornerstones of frequentist and Bayesian inference; (ii) routine application of available software often fails for hard problems; and (iii) the methods themselves are often secondary components of other statistical computing algorithms. Some topics we have omitted represent important areas of past and present research in the field, but their priority here is lowered by the availability of high-quality software. For example, the generation of pseudo-random numbers is a classic topic, but one that we prefer to address by giving students reliable software. Finally, some topics (e.g., principal curves and tabu search) are included simply because they are interesting and provide very different perspectives on familiar problems. Perhaps a future researcher may draw ideas from such topics to design a creative and effective new algorithm.

In this second edition we have both updated and broadened our coverage, and we now provide computer code. For example, we have added new MCMC topics to reflect continued activity in that popular area. A notable increase in breadth is our inclusion of more methods relevant for problems where statistical dependency is important, such as block bootstrapping and sequential importance sampling. This second edition provides extensive new support in R. Specifically, code for the examples in this book is available from the book website, www.stat.colostate.edu/computationalstatistics.

Our target audience includes graduate students in statistics and related fields, statisticians, and quantitative empirical scientists in other fields. We hope such readers may use the book when applying standard methods and developing new methods.
The level of mathematics expected of the reader does not extend much beyond Taylor series and linear algebra. Breadth of mathematical training is more helpful than depth. Essential review is provided in Chapter 1. More advanced readers will find greater mathematical detail in the wide variety of high-quality books available on specific topics, many of which are referenced in the text. Other readers, caring less about analytical details, may prefer to focus on our descriptions of algorithms and examples.

The expected level of statistics is equivalent to that obtained by a graduate student in his or her first year of study of the theory of statistics and probability. An understanding of maximum likelihood methods, Bayesian methods, elementary asymptotic theory, Markov chains, and linear models is most important. Many of these topics are reviewed in Chapter 1.

With respect to computer programming, we find that good students can learn as they go. However, a working knowledge of a suitable language allows implementation of the ideas covered in this book to progress much more quickly. We have chosen to forgo any language-specific examples, algorithms, or coding in the text. For those wishing to learn a language while they study this book, we recommend that you choose a high-level, interactive package that permits the flexible design of graphical displays and includes supporting statistics and probability functions, such as R and MATLAB.1 These are the sort of languages often used by researchers during the development of new statistical computing techniques, and they are suitable for implementing all the methods we describe, except in some cases for problems of vast scope or complexity. We use R and recommend it. Although lower-level languages such as C++ could also be used, they are more appropriate for professional-grade implementation of algorithms after researchers have refined the methodology.
The book is organized into four major parts optimization (Chapters 2 3 and4) integration and simulation (Chapters 5 6 7 and 8) bootstrapping (Chapter 9)and density estimation and smoothing (Chapters 10 11 and 12) The chapters arewritten to stand independently so a course can be built by selecting the topics onewishes to teach For a one-semester course our selection typically weights mostheavily topics from Chapters 2 3 6 7 9 10 and 11 With a leisurely pace or morethorough coverage a shorter list of topics could still easily fill a semester courseThere is sufficient material here to provide a thorough one-year course of studynotwithstanding any supplemental topics one might wish to teach
A variety of homework problems are included at the end of each chapter Someare straightforward while others require the student to develop a thorough under-standing of the modelmethod being used to carefully (and perhaps cleverly) code asuitable technique and to devote considerable attention to the interpretation of resultsA few exercises invite open-ended exploration of methods and ideas We are some-times asked for solutions to the exercises but we prefer to sequester them to preservethe challenge for future students and readers
The datasets discussed in the examples and exercises are available from the bookwebsite wwwstatcolostateeducomputationalstatistics The R code is also providedthere Finally the website includes an errata Responsibility for all errors lies with us
1R is available for free from wwwr-projectorg Information about MATLAB can be found atwwwmathworkscom
ACKNOWLEDGMENTS
The course upon which this book is based was developed and taught by us at Colorado State University from 1994 onwards. Thanks are due to our many students, who have been semiwilling guinea pigs over the years. We also thank our colleagues in the Statistics Department for their continued support. The late Richard Tweedie merits particular acknowledgment for his mentoring during the early years of our careers.
We owe a great deal of intellectual debt to Adrian Raftery, who deserves special thanks not only for his teaching and advising but also for his unwavering support and his seemingly inexhaustible supply of good ideas. In addition, we thank our influential advisors and teachers at the University of Washington Statistics Department, including David Madigan, Werner Stuetzle, and Judy Zeh. Of course, each of our chapters could be expanded into a full-length book, and great scholars have already done so. We owe much to their efforts, upon which we relied when developing our course and our manuscript.
Portions of the first edition were written at the Department of Mathematics and Statistics, University of Otago, in Dunedin, New Zealand, whose faculty we thank for graciously hosting us during our sabbatical in 2003. Much of our work on the second edition was undertaken during our sabbatical visit to the Australia Commonwealth Scientific and Research Organization in 2009–2010, sponsored by CSIRO Mathematics, Informatics and Statistics, and hosted at the Longpocket Laboratory in Indooroopilly, Australia. We thank our hosts and colleagues there for their support.
Our manuscript has been greatly improved through the constructive reviews of John Bickham, Ben Bird, Kate Cowles, Jan Hannig, Alan Herlihy, David Hunter, Devin Johnson, Michael Newton, Doug Nychka, Steve Sain, David W. Scott, N. Scott Urquhart, Haonan Wang, Darrell Whitley, and eight anonymous referees. We also thank the sharp-eyed readers listed in the errata for their suggestions and corrections. Our editor Steve Quigley and the folks at Wiley were supportive and helpful during the publication process. We thank Nelida Pohl for permission to adapt her photograph in the cover design of the first edition. We thank Melinda Stelzer for permission to use her painting "Champagne Circuit," 2001, for the cover of the second edition. More about her art can be found at www.facebook.com/geekchicart. We also owe a special note of thanks to Zube (a.k.a. John Dzubera), who kept our own computers running despite our best efforts to the contrary.
Funding from National Science Foundation (NSF) CAREER grant SBR-9875508 was a significant source of support for the first author during the preparation of the first edition. He also thanks his colleagues and friends in the North Slope Borough, Alaska, Department of Wildlife Management for their longtime research support. The second author gratefully acknowledges the support of STAR Research
Assistance Agreement CR-829095, awarded to Colorado State University by the U.S. Environmental Protection Agency (EPA). The views expressed here are solely those of the authors. NSF and EPA do not endorse any products or commercial services mentioned herein.
Finally, we thank our parents for enabling and supporting our educations, and for providing us with the "stubbornness gene" necessary for graduate school, the tenure track, or book publication: take your pick. The second edition is dedicated to our kids, Natalie and Neil, for continuing to show us what is important and what is not.

Geof H. Givens
Jennifer A. Hoeting
CHAPTER 1

REVIEW

This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.

1.1 MATHEMATICAL NOTATION

We use boldface to distinguish a vector x = (x_1, ..., x_p) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f_1(x), ..., f_p(x)). The transpose of M is denoted M^T.

Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x_1, ..., x_n)^T. Let I denote an identity matrix, and 1 and 0 denote vectors of ones and zeros, respectively.

A symmetric square matrix M is positive definite if x^T M x > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite or positive semidefinite if x^T M x ≥ 0 for all nonzero vectors x.
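The eigenvalue condition above suggests an easy numerical check. A minimal Python sketch (our own illustration, not from the text; the function name and tolerance are arbitrary choices):

```python
import numpy as np

def is_positive_definite(M, tol=1e-10):
    # A symmetric matrix is positive definite iff all its eigenvalues
    # are positive; eigvalsh exploits (and assumes) symmetry.
    return bool(np.all(np.linalg.eigvalsh(M) > tol))

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # eigenvalues 1 and 3: positive definite
N = np.array([[1.0, 2.0],
              [2.0, 1.0]])   # eigenvalues -1 and 3: not positive definite
```

For these examples, `is_positive_definite(M)` is True and `is_positive_definite(N)` is False.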
The derivative of a function f, evaluated at x, is denoted f'(x). When x = (x_1, ..., x_p), the gradient of f at x is

f'(x) = \left( \frac{df(x)}{dx_1}, \ldots, \frac{df(x)}{dx_p} \right).
The Hessian matrix for f at x is f''(x), having (i, j)th element equal to d²f(x)/(dx_i dx_j). The negative Hessian has important uses in statistical inference.

Let J(x) denote the Jacobian matrix evaluated at x for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to df_i(x)/dx_j.

A functional is a real-valued function on a space of functions. For example, if T(f) = \int_0^1 f(x)\,dx, then the functional T maps suitably integrable functions onto the real line.

The indicator function 1_{A} equals 1 if A is true and 0 otherwise. The real line is denoted ℝ, and p-dimensional real space is ℝ^p.
Computational Statistics, Second Edition. Geof H. Givens and Jennifer A. Hoeting. © 2013 John Wiley & Sons, Inc. Published 2013 by John Wiley & Sons, Inc.
1.2 TAYLOR'S THEOREM AND MATHEMATICAL LIMIT THEORY
First we define standard "big oh" and "little oh" notation for describing the relative orders of convergence of functions. Let the functions f and g be defined on a common, possibly infinite, interval. Let z_0 be a point in this interval or a boundary point of it (i.e., −∞ or ∞). We require g(z) ≠ 0 for all z ≠ z_0 in a neighborhood of z_0. Then we say

f(z) = O(g(z))    (1.1)

if there exists a constant M such that |f(z)| ≤ M|g(z)| as z → z_0. For example, (n + 1)/(3n²) = O(n^{-1}), where it is understood that we are considering n → ∞. If lim_{z→z_0} f(z)/g(z) = 0, then we say

f(z) = o(g(z)).    (1.2)

For example, f(x_0 + h) − f(x_0) = h f'(x_0) + o(h) as h → 0 if f is differentiable at x_0. The same notation can be used for describing the convergence of a sequence x_n as n → ∞, by letting f(n) = x_n.

Taylor's theorem provides a polynomial approximation to a function f. Suppose f has a finite (n + 1)th derivative on (a, b) and a continuous nth derivative on [a, b]. Then for any x_0 ∈ [a, b] distinct from x, the Taylor series expansion of f about x_0 is
f(x) = \sum_{i=0}^{n} \frac{1}{i!} f^{(i)}(x_0)\,(x - x_0)^i + R_n,    (1.3)

where f^{(i)}(x_0) is the ith derivative of f evaluated at x_0, and

R_n = \frac{1}{(n+1)!} f^{(n+1)}(\xi)\,(x - x_0)^{n+1}    (1.4)
for some point ξ in the interval between x and x_0. As |x − x_0| → 0, note that R_n = O(|x − x_0|^{n+1}).
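To see the order of the remainder concretely, take f = exp and x_0 = 0, for which the expansion is Σ x^i/i!. The sketch below (our own Python illustration, not from the text) checks that halving |x − x_0| shrinks the error of the n = 2 polynomial by roughly 2^{n+1} = 8:

```python
import math

def taylor_exp(x, n):
    # n-th order Taylor polynomial of exp about x0 = 0: sum_i x^i / i!
    return sum(x**i / math.factorial(i) for i in range(n + 1))

n = 2
err1 = abs(math.exp(0.10) - taylor_exp(0.10, n))
err2 = abs(math.exp(0.05) - taylor_exp(0.05, n))
ratio = err1 / err2   # close to 2**(n+1) = 8, since R_n = O(|x - x0|^{n+1})
```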
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x_0 ≠ x. Then
f(x) = f(x_0) + \sum_{i=1}^{n} \frac{1}{i!} D^{(i)}(f; x_0, x - x_0) + R_n,    (1.5)

where

D^{(i)}(f; x, y) = \sum_{j_1=1}^{p} \cdots \sum_{j_i=1}^{p} \left( \left. \frac{d^i}{dt_{j_1} \cdots dt_{j_i}} f(t) \right|_{t=x} \right) \prod_{k=1}^{i} y_{j_k}    (1.6)
and

R_n = \frac{1}{(n+1)!} D^{(n+1)}(f; \xi, x - x_0)    (1.7)

for some ξ on the line segment joining x and x_0. As |x − x_0| → 0, note that R_n = O(|x − x_0|^{n+1}).
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

\int_0^1 f(x)\,dx = \frac{f(0) + f(1)}{2} - \sum_{i=1}^{n-1} \frac{b_{2i}\left[f^{(2i-1)}(1) - f^{(2i-1)}(0)\right]}{(2i)!} - \frac{b_{2n}\, f^{(2n)}(\xi)}{(2n)!},    (1.8)
where 0 ≤ ξ ≤ 1, f^{(j)} is the jth derivative of f, and b_j = B_j(0) can be determined using the recursion relation

\sum_{j=0}^{m} \binom{m+1}{j} B_j(z) = (m+1) z^m,    (1.9)

initialized with B_0(z) = 1. The proof of this result is based on repeated integrations by parts [376].
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

\frac{df(x)}{dx_i} \approx \frac{f(x + \epsilon_i e_i) - f(x - \epsilon_i e_i)}{2\epsilon_i},    (1.10)

where ε_i is a small number and e_i is the unit vector in the ith coordinate direction. Typically, one might start with, say, ε_i = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller ε_i. The approximation will generally improve until ε_i becomes small enough that the calculation is degraded, and eventually dominated, by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via
\frac{d^2 f(x)}{dx_i\, dx_j} \approx \frac{1}{4\epsilon_i \epsilon_j} \left( f(x + \epsilon_i e_i + \epsilon_j e_j) - f(x + \epsilon_i e_i - \epsilon_j e_j) - f(x - \epsilon_i e_i + \epsilon_j e_j) + f(x - \epsilon_i e_i - \epsilon_j e_j) \right),    (1.11)

with similar sequential precision improvements.
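A direct transcription of (1.10) into Python (our own sketch; a single fixed ε is used here rather than the shrinking sequence described above, and the test function is an arbitrary choice):

```python
import numpy as np

def central_difference_gradient(f, x, eps=1e-5):
    # Approximate each component of the gradient of f at x by a
    # central finite difference in the corresponding coordinate.
    x = np.asarray(x, dtype=float)
    grad = np.empty_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

f = lambda x: x[0]**2 + 3 * x[0] * x[1]        # gradient (2 x0 + 3 x1, 3 x0)
g = central_difference_gradient(f, [1.0, 2.0])  # approximately (8, 3)
```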
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS

We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ∼ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters will also be denoted with a conditioning bar, as in f(x|α, β). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|α) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example by using f(·|α) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are f_X and f_Y, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.

The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or f_{X|Y}(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity, we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|μ) = f(x|y, μ) f(y|μ). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.

The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X, or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1_{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^{1/2}/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so that

g(\lambda x + (1 - \lambda) y) \le \lambda g(x) + (1 - \lambda) g(y)    (1.12)

for all x, y ∈ I and all 0 < λ < 1. Then Jensen's inequality states that E{g(X)} ≥ g(E{X}) for any random variable X having P[X ∈ I] = 1.
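A quick Monte Carlo check of Jensen's inequality (our own illustration; the convex g and the distribution of X are arbitrary choices):

```python
import math
import random

random.seed(1)

# g(x) = exp(x) is convex and X ~ Unif(0, 1), so E{g(X)} >= g(E{X}):
# here E{exp(X)} = e - 1 while exp(E{X}) = exp(1/2).
xs = [random.random() for _ in range(100_000)]
mean_of_g = sum(math.exp(x) for x in xs) / len(xs)
g_of_mean = math.exp(sum(xs) / len(xs))
```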
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

n! = n(n-1)(n-2) \cdots (3)(2)(1), \quad with 0! = 1,    (1.13)

\binom{n}{k} = \frac{n!}{k!\,(n-k)!},    (1.14)
TABLE 1.1 Notation and description for common probability distributions of discrete random variables.

Bernoulli: X ∼ Bernoulli(p), 0 ≤ p ≤ 1. Density f(x) = p^x (1 − p)^{1−x} for x = 0 or 1. E{X} = p; var{X} = p(1 − p).

Binomial: X ∼ Bin(n, p), 0 ≤ p ≤ 1 and n ∈ {1, 2, ...}. Density f(x) = \binom{n}{x} p^x (1 − p)^{n−x} for x = 0, 1, ..., n. E{X} = np; var{X} = np(1 − p).

Multinomial: X ∼ Multinomial(n, p), where p = (p_1, ..., p_k) with 0 ≤ p_i ≤ 1 and Σ_{i=1}^k p_i = 1, and n ∈ {1, 2, ...}. Density f(x) = \binom{n}{x_1, \ldots, x_k} Π_{i=1}^k p_i^{x_i} for x = (x_1, ..., x_k) with x_i ∈ {0, 1, ..., n} and Σ_{i=1}^k x_i = n. E{X} = np; var{X_i} = np_i(1 − p_i); cov{X_i, X_j} = −np_i p_j.

Negative Binomial: X ∼ NegBin(r, p), 0 ≤ p ≤ 1 and r ∈ {1, 2, ...}. Density f(x) = \binom{x + r - 1}{r - 1} p^r (1 − p)^x for x ∈ {0, 1, ...}. E{X} = r(1 − p)/p; var{X} = r(1 − p)/p².

Poisson: X ∼ Poisson(λ), λ > 0. Density f(x) = (λ^x/x!) exp{−λ} for x ∈ {0, 1, ...}. E{X} = λ; var{X} = λ.
TABLE 1.2 Notation and description for some common probability distributions of continuous random variables.

Beta: X ∼ Beta(α, β), α > 0 and β > 0. Density f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1}(1 − x)^{β−1} for 0 ≤ x ≤ 1. E{X} = α/(α + β); var{X} = αβ/[(α + β)²(α + β + 1)].

Cauchy: X ∼ Cauchy(α, β), α ∈ ℝ and β > 0. Density f(x) = 1/(πβ[1 + ((x − α)/β)²]) for x ∈ ℝ. E{X} is nonexistent; var{X} is nonexistent.

Chi-square: X ∼ χ²_ν, ν > 0. Density equals that of Gamma(ν/2, 1/2), x > 0. E{X} = ν; var{X} = 2ν.

Dirichlet: X ∼ Dirichlet(α), where α = (α_1, ..., α_k) with α_i > 0 and α_0 = Σ_{i=1}^k α_i. Density f(x) = Γ(α_0) Π_{i=1}^k x_i^{α_i−1} / Π_{i=1}^k Γ(α_i) for x = (x_1, ..., x_k) with 0 ≤ x_i ≤ 1 and Σ_{i=1}^k x_i = 1. E{X} = α/α_0; var{X_i} = α_i(α_0 − α_i)/[α_0²(α_0 + 1)]; cov{X_i, X_j} = −α_iα_j/[α_0²(α_0 + 1)].

Exponential: X ∼ Exp(λ), λ > 0. Density f(x) = λ exp{−λx} for x > 0. E{X} = 1/λ; var{X} = 1/λ².

Gamma: X ∼ Gamma(r, λ), λ > 0 and r > 0. Density f(x) = λ^r x^{r−1} exp{−λx}/Γ(r) for x > 0. E{X} = r/λ; var{X} = r/λ².
TABLE 1.3 Notation and description for more common probability distributions of continuous random variables.

Lognormal: X ∼ Lognormal(μ, σ²), μ ∈ ℝ and σ > 0. Density f(x) = [1/(x√(2πσ²))] exp{−(log x − μ)²/(2σ²)} for x > 0. E{X} = exp{μ + σ²/2}; var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}.

Multivariate Normal: X ∼ N_k(μ, Σ), μ = (μ_1, ..., μ_k) ∈ ℝ^k and Σ positive definite. Density f(x) = exp{−(x − μ)^T Σ^{−1}(x − μ)/2}/[(2π)^{k/2}|Σ|^{1/2}] for x = (x_1, ..., x_k) ∈ ℝ^k. E{X} = μ; var{X} = Σ.

Normal: X ∼ N(μ, σ²), μ ∈ ℝ and σ > 0. Density f(x) = [1/√(2πσ²)] exp{−(x − μ)²/(2σ²)} for x ∈ ℝ. E{X} = μ; var{X} = σ².

Student's t: X ∼ t_ν, ν > 0. Density f(x) = [Γ((ν + 1)/2)/(Γ(ν/2)√(πν))] (1 + x²/ν)^{−(ν+1)/2} for x ∈ ℝ. E{X} = 0 if ν > 1; var{X} = ν/(ν − 2) if ν > 2.

Uniform: X ∼ Unif(a, b), a, b ∈ ℝ and a < b. Density f(x) = 1/(b − a) for x ∈ [a, b]. E{X} = (a + b)/2; var{X} = (b − a)²/12.

Weibull: X ∼ Weibull(a, b), a > 0 and b > 0. Density f(x) = abx^{b−1} exp{−ax^b} for x > 0. E{X} = Γ(1 + 1/b)/a^{1/b}; var{X} = [Γ(1 + 2/b) − Γ(1 + 1/b)²]/a^{2/b}.
\binom{n}{k_1, \ldots, k_m} = \frac{n!}{\prod_{i=1}^{m} k_i!}, \quad where n = \sum_{i=1}^{m} k_i,    (1.15)

\Gamma(r) = \begin{cases} (r-1)! & for r = 1, 2, \ldots \\ \int_0^\infty t^{r-1} \exp\{-t\}\, dt & for general r > 0. \end{cases}    (1.16)
It is worth knowing that

\Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi} \quad and \quad \Gamma\left(n + \tfrac{1}{2}\right) = \frac{1 \times 3 \times 5 \times \cdots \times (2n-1)\,\sqrt{\pi}}{2^n}

for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as
f(x|\gamma) = c_1(x)\, c_2(\gamma) \exp\left\{ \sum_{i=1}^{k} y_i(x)\, \theta_i(\gamma) \right\}    (1.17)

for nonnegative functions c_1 and c_2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θ_i(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The y_i(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that
E\{y(X)\} = \kappa'(\theta)    (1.18)

and

var\{y(X)\} = \kappa''(\theta),    (1.19)

where κ(θ) = −log c_3(θ), letting c_3(θ) denote the reexpression of c_2(γ) in terms of the canonical parameters θ = (θ_1, ..., θ_k), and y(X) = (y_1(X), ..., y_k(X)). These results can be rewritten in terms of the original parameters γ as
E\left\{ \sum_{i=1}^{k} \frac{d\theta_i(\gamma)}{d\gamma_j}\, y_i(X) \right\} = -\frac{d}{d\gamma_j} \log c_2(\gamma)    (1.20)

and

var\left\{ \sum_{i=1}^{k} \frac{d\theta_i(\gamma)}{d\gamma_j}\, y_i(X) \right\} = -\frac{d^2}{d\gamma_j^2} \log c_2(\gamma) - E\left\{ \sum_{i=1}^{k} \frac{d^2\theta_i(\gamma)}{d\gamma_j^2}\, y_i(X) \right\}.    (1.21)
Example 1.1 (Poisson). The Poisson distribution belongs to the exponential family, with c_1(x) = 1/x!, c_2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ'(θ) = exp{θ} = λ and var{X} = κ''(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
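The moment identities are easy to verify by simulation. In this Python sketch (ours, not from the text; λ = 3 is an arbitrary choice), the sample mean and variance both estimate λ, and the specialization E{X/λ} = 1 of (1.20) is checked as well:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 3.0
x = rng.poisson(lam, size=200_000)

# kappa(theta) = exp{theta} with theta = log(lam), so
# kappa'(theta) = kappa''(theta) = lam: mean and variance agree.
sample_mean = x.mean()           # near lam
sample_var = x.var()             # near lam
score_check = (x / lam).mean()   # near 1, illustrating E{X/lam} = 1
```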
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X_1, ..., X_p) denote a p-dimensional random variable with continuous density function f. Suppose that

U = g(X) = (g_1(X), \ldots, g_p(X)) = (U_1, \ldots, U_p),    (1.22)

where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

f(u) = f(g^{-1}(u))\, |J(u)|,    (1.23)

where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g^{-1} evaluated at u, having (i, j)th element dx_i/du_j, where these derivatives are assumed to be continuous over the support region of U.
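As a worked instance of (1.23) (our own example, not from the text): let X ∼ Exp(1) and U = g(X) = X², which is one-to-one on x > 0. Then g^{-1}(u) = √u, |J(u)| = 1/(2√u), and f_U(u) = exp{−√u}/(2√u). The Python sketch below sanity-checks this density two ways:

```python
import math
import random

def f_U(u):
    # Transformed density via (1.23): f_X(g^{-1}(u)) * |J(u)|
    return math.exp(-math.sqrt(u)) / (2 * math.sqrt(u))

# Check 1: Monte Carlo, P(U <= 1) = P(X <= 1) = 1 - e^{-1}.
random.seed(2)
u = [random.expovariate(1.0) ** 2 for _ in range(100_000)]
mc = sum(ui <= 1.0 for ui in u) / len(u)

# Check 2: midpoint-rule integral of f_U over (0, 1] (the 1/sqrt(u)
# singularity at 0 is integrable), which should also be 1 - e^{-1}.
n = 100_000
integral = sum(f_U((k + 0.5) / n) / n for k in range(n))
```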
1.4 LIKELIHOOD INFERENCE

If X_1, ..., X_n are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ_1, ..., θ_p), then the joint likelihood function is

L(\theta) = \prod_{i=1}^{n} f(x_i|\theta).    (1.24)

When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x_1, ..., x_n|θ), viewed as a function of θ.

The observed data, x_1, ..., x_n, might have been realized under many different values for θ. The parameters for which observing x_1, ..., x_n would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x_1, ..., x_n that maximizes L(θ), then θ̂ = ϑ(X_1, ..., X_n) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.
It is typically easier to work with the log likelihood function

l(\theta) = \log L(\theta),    (1.25)

which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x_1, ..., x_n but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations

l'(\theta) = 0,    (1.26)
where

l'(\theta) = \left( \frac{dl(\theta)}{d\theta_1}, \ldots, \frac{dl(\theta)}{d\theta_p} \right)
is called the score function. The score function satisfies

E\{l'(\theta)\} = 0,    (1.27)

where the expectation is taken with respect to the distribution of X_1, ..., X_n. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.
The MLE has a sampling distribution because it depends on the realization of the random variables X_1, ..., X_n. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: when the log likelihood is very pointy, the location of the maximum is more precisely known.
To make this precise, let l''(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθ_i dθ_j). The Fisher information matrix is defined as

I(\theta) = E\{l'(\theta)\, l'(\theta)^T\} = -E\{l''(\theta)\},    (1.28)

where the expectations are taken with respect to the distribution of X_1, ..., X_n. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l''(θ̂), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.
Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)^{−1}, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is N_p(θ*, I(θ*)^{−1}). Since the true parameter values are unknown, I(θ*)^{−1} must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)^{−1}. Alternatively, it is also reasonable to use −l''(θ̂)^{−1}. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)^{−1}. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)^{−1} can be found in [127, 182, 371, 470].
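For a scalar parameter these recipes are one-liners. The Python sketch below (ours, not from the text; the Poisson sample is made up) estimates the standard error of λ̂ from the observed Fisher information, computed by a central finite difference:

```python
import math

data = [2, 1, 4, 3, 0, 2, 5, 1, 2, 3]   # hypothetical Poisson sample
lam_hat = sum(data) / len(data)          # MLE: the sample mean, 2.3

def l(lam):
    # Poisson log likelihood up to an additive constant
    return sum(x * math.log(lam) - lam for x in data)

# Observed Fisher information -l''(lam_hat) via a central second
# difference, then se = sqrt of the inverse information.
eps = 1e-4
l2 = (l(lam_hat + eps) - 2 * l(lam_hat) + l(lam_hat - eps)) / eps**2
se = math.sqrt(-1 / l2)
# Analytically l''(lam) = -sum(x)/lam^2, so se = sqrt(lam_hat/n).
```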
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is

L(\phi|\hat{\mu}(\phi)) = \max_{\mu} L(\mu, \phi).    (1.29)

Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ̂ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ̂. Note that the φ that maximizes the profile likelihood is also the maximum likelihood estimate φ̂.
CONTENTS

7.3.2 Practical Implementation Advice 226
7.3.3 Using the Results 226
Problems 230
8 ADVANCED TOPICS IN MCMC 237
8.1 Adaptive MCMC 237
8.1.1 Adaptive Random Walk Metropolis-within-Gibbs Algorithm 238
8.1.2 General Adaptive Metropolis-within-Gibbs Algorithm 240
8.1.3 Adaptive Metropolis Algorithm 247
8.2 Reversible Jump MCMC 250
8.2.1 RJMCMC for Variable Selection in Regression 253
8.3 Auxiliary Variable Methods 256
8.3.1 Simulated Tempering 257
8.3.2 Slice Sampler 258
8.4 Other Metropolis–Hastings Algorithms 260
8.4.1 Hit-and-Run Algorithm 260
8.4.2 Multiple-Try Metropolis–Hastings Algorithm 261
8.4.3 Langevin Metropolis–Hastings Algorithm 262
8.5 Perfect Sampling 264
8.5.1 Coupling from the Past 264
8.5.1.1 Stochastic Monotonicity and Sandwiching 267
8.6 Markov Chain Maximum Likelihood 268
8.7 Example: MCMC for Markov Random Fields 269
8.7.1 Gibbs Sampling for Markov Random Fields 270
8.7.2 Auxiliary Variable Methods for Markov Random Fields 274
8.7.3 Perfect Sampling for Markov Random Fields 277
Problems 279
PART III
BOOTSTRAPPING
9 BOOTSTRAPPING 287
9.1 The Bootstrap Principle 287
9.2 Basic Methods 288
9.2.1 Nonparametric Bootstrap 288
9.2.2 Parametric Bootstrap 289
9.2.3 Bootstrapping Regression 290
9.2.4 Bootstrap Bias Correction 291
9.3 Bootstrap Inference 292
9.3.1 Percentile Method 292
9.3.1.1 Justification for the Percentile Method 293
9.3.2 Pivoting 294
9.3.2.1 Accelerated Bias-Corrected Percentile Method, BCa 294
9.3.2.2 The Bootstrap t 296
9.3.2.3 Empirical Variance Stabilization 298
9.3.2.4 Nested Bootstrap and Prepivoting 299
9.3.3 Hypothesis Testing 301
9.4 Reducing Monte Carlo Error 302
9.4.1 Balanced Bootstrap 302
9.4.2 Antithetic Bootstrap 302
9.5 Bootstrapping Dependent Data 303
9.5.1 Model-Based Approach 304
9.5.2 Block Bootstrap 304
9.5.2.1 Nonmoving Block Bootstrap 304
9.5.2.2 Moving Block Bootstrap 306
9.5.2.3 Blocks-of-Blocks Bootstrapping 307
9.5.2.4 Centering and Studentizing 309
9.5.2.5 Block Size 311
9.6 Bootstrap Performance 315
9.6.1 Independent Data Case 315
9.6.2 Dependent Data Case 316
9.7 Other Uses of the Bootstrap 316
9.8 Permutation Tests 317
Problems 319
PART IV
DENSITY ESTIMATION AND SMOOTHING
10 NONPARAMETRIC DENSITY ESTIMATION 325
10.1 Measures of Performance 326
10.2 Kernel Density Estimation 327
10.2.1 Choice of Bandwidth 329
10.2.1.1 Cross-Validation 332
10.2.1.2 Plug-in Methods 335
10.2.1.3 Maximal Smoothing Principle 338
10.2.2 Choice of Kernel 339
10.2.2.1 Epanechnikov Kernel 339
10.2.2.2 Canonical Kernels and Rescalings 340
10.3 Nonkernel Methods 341
10.3.1 Logspline 341
10.4 Multivariate Methods 345
10.4.1 The Nature of the Problem 345
10.4.2 Multivariate Kernel Estimators 346
10.4.3 Adaptive Kernels and Nearest Neighbors 348
10.4.3.1 Nearest Neighbor Approaches 349
10.4.3.2 Variable-Kernel Approaches and Transformations 350
10.4.4 Exploratory Projection Pursuit 353
Problems 359
11 BIVARIATE SMOOTHING 363
11.1 Predictor–Response Data 363
11.2 Linear Smoothers 365
11.2.1 Constant-Span Running Mean 366
11.2.1.1 Effect of Span 368
11.2.1.2 Span Selection for Linear Smoothers 369
11.2.2 Running Lines and Running Polynomials 372
11.2.3 Kernel Smoothers 374
11.2.4 Local Regression Smoothing 374
11.2.5 Spline Smoothing 376
11.2.5.1 Choice of Penalty 377
11.3 Comparison of Linear Smoothers 377
11.4 Nonlinear Smoothers 379
11.4.1 Loess 379
11.4.2 Supersmoother 381
11.5 Confidence Bands 384
11.6 General Bivariate Data 388
Problems 389
12 MULTIVARIATE SMOOTHING 393
12.1 Predictor–Response Data 393
12.1.1 Additive Models 394
12.1.2 Generalized Additive Models 397
12.1.3 Other Methods Related to Additive Models 399
12.1.3.1 Projection Pursuit Regression 399
12.1.3.2 Neural Networks 402
12.1.3.3 Alternating Conditional Expectations 403
12.1.3.4 Additivity and Variance Stabilization 404
12.1.4 Tree-Based Methods 405
12.1.4.1 Recursive Partitioning: Regression Trees 406
12.1.4.2 Tree Pruning 409
12.1.4.3 Classification Trees 411
12.1.4.4 Other Issues for Tree-Based Methods 412
12.2 General Multivariate Data 413
12.2.1 Principal Curves 413
12.2.1.1 Definition and Motivation 413
12.2.1.2 Estimation 415
12.2.1.3 Span Selection 416
Problems 416
DATA ACKNOWLEDGMENTS 421
REFERENCES 423
INDEX 457
PREFACE
This book covers most topics needed to develop a broad and thorough working knowledge of modern computational statistics. We seek to develop a practical understanding of how and why existing methods work, enabling readers to use modern statistical methods effectively. Since many new methods are built from components of existing techniques, our ultimate goal is to provide scientists with the tools they need to contribute new ideas to the field.
A growing challenge in science is that there is so much of it. While the pursuit of important new methods and the honing of existing approaches is a worthy goal, there is also a need to organize and distill the teeming jungle of ideas. We attempt to do that here. Our choice of topics reflects our view of what constitutes the core of the evolving field of computational statistics, and what will be interesting and useful for our readers.
Our use of the adjective modern in the first sentence of this preface is potentially troublesome. There is no way that this book can cover all the latest, greatest techniques. We have not even tried. We have instead aimed to provide a reasonably up-to-date survey of a broad portion of the field, while leaving room for diversions and esoterica.
The foundations of optimization and numerical integration are covered in this book. We include these venerable topics because (i) they are cornerstones of frequentist and Bayesian inference, (ii) routine application of available software often fails for hard problems, and (iii) the methods themselves are often secondary components of other statistical computing algorithms. Some topics we have omitted represent important areas of past and present research in the field, but their priority here is lowered by the availability of high-quality software. For example, the generation of pseudo-random numbers is a classic topic, but one that we prefer to address by giving students reliable software. Finally, some topics (e.g., principal curves and tabu search) are included simply because they are interesting and provide very different perspectives on familiar problems. Perhaps a future researcher may draw ideas from such topics to design a creative and effective new algorithm.
In this second edition we have both updated and broadened our coverage, and we now provide computer code. For example, we have added new MCMC topics to reflect continued activity in that popular area. A notable increase in breadth is our inclusion of more methods relevant for problems where statistical dependency is important, such as block bootstrapping and sequential importance sampling. This second edition provides extensive new support in R. Specifically, code for the examples in this book is available from the book website, www.stat.colostate.edu/computationalstatistics.
Our target audience includes graduate students in statistics and related fields, statisticians, and quantitative empirical scientists in other fields. We hope such readers may use the book when applying standard methods and developing new methods.
The level of mathematics expected of the reader does not extend much beyond Taylor series and linear algebra. Breadth of mathematical training is more helpful than depth. Essential review is provided in Chapter 1. More advanced readers will find greater mathematical detail in the wide variety of high-quality books available on specific topics, many of which are referenced in the text. Other readers, caring less about analytical details, may prefer to focus on our descriptions of algorithms and examples.
The expected level of statistics is equivalent to that obtained by a graduate student in his or her first year of study of the theory of statistics and probability. An understanding of maximum likelihood methods, Bayesian methods, elementary asymptotic theory, Markov chains, and linear models is most important. Many of these topics are reviewed in Chapter 1.
With respect to computer programming, we find that good students can learn as they go. However, a working knowledge of a suitable language allows implementation of the ideas covered in this book to progress much more quickly. We have chosen to forgo any language-specific examples, algorithms, or coding in the text. For those wishing to learn a language while they study this book, we recommend that you choose a high-level, interactive package that permits the flexible design of graphical displays and includes supporting statistics and probability functions, such as R and MATLAB.¹ These are the sorts of languages often used by researchers during the development of new statistical computing techniques, and they are suitable for implementing all the methods we describe, except in some cases for problems of vast scope or complexity. We use R and recommend it. Although lower-level languages such as C++ could also be used, they are more appropriate for professional-grade implementation of algorithms after researchers have refined the methodology.

The book is organized into four major parts: optimization (Chapters 2, 3, and 4), integration and simulation (Chapters 5, 6, 7, and 8), bootstrapping (Chapter 9), and density estimation and smoothing (Chapters 10, 11, and 12). The chapters are written to stand independently, so a course can be built by selecting the topics one wishes to teach. For a one-semester course, our selection typically weights most heavily topics from Chapters 2, 3, 6, 7, 9, 10, and 11. With a leisurely pace or more thorough coverage, a shorter list of topics could still easily fill a semester course. There is sufficient material here to provide a thorough one-year course of study, notwithstanding any supplemental topics one might wish to teach.

A variety of homework problems are included at the end of each chapter. Some are straightforward, while others require the student to develop a thorough understanding of the model/method being used, to carefully (and perhaps cleverly) code a suitable technique, and to devote considerable attention to the interpretation of results. A few exercises invite open-ended exploration of methods and ideas. We are sometimes asked for solutions to the exercises, but we prefer to sequester them to preserve the challenge for future students and readers.

The datasets discussed in the examples and exercises are available from the book website, www.stat.colostate.edu/computationalstatistics. The R code is also provided there. Finally, the website includes an errata. Responsibility for all errors lies with us.

¹R is available for free from www.r-project.org. Information about MATLAB can be found at www.mathworks.com.
ACKNOWLEDGMENTS
The course upon which this book is based was developed and taught by us at Colorado State University from 1994 onwards. Thanks are due to our many students, who have been semiwilling guinea pigs over the years. We also thank our colleagues in the Statistics Department for their continued support. The late Richard Tweedie merits particular acknowledgment for his mentoring during the early years of our careers.

We owe a great deal of intellectual debt to Adrian Raftery, who deserves special thanks not only for his teaching and advising, but also for his unwavering support and his seemingly inexhaustible supply of good ideas. In addition, we thank our influential advisors and teachers at the University of Washington Statistics Department, including David Madigan, Werner Stuetzle, and Judy Zeh. Of course, each of our chapters could be expanded into a full-length book, and great scholars have already done so. We owe much to their efforts, upon which we relied when developing our course and our manuscript.

Portions of the first edition were written at the Department of Mathematics and Statistics, University of Otago, in Dunedin, New Zealand, whose faculty we thank for graciously hosting us during our sabbatical in 2003. Much of our work on the second edition was undertaken during our sabbatical visit to the Commonwealth Scientific and Industrial Research Organisation (CSIRO) in Australia in 2009–2010, sponsored by CSIRO Mathematics, Informatics and Statistics, and hosted at the Longpocket Laboratory in Indooroopilly, Australia. We thank our hosts and colleagues there for their support.

Our manuscript has been greatly improved through the constructive reviews of John Bickham, Ben Bird, Kate Cowles, Jan Hannig, Alan Herlihy, David Hunter, Devin Johnson, Michael Newton, Doug Nychka, Steve Sain, David W. Scott, N. Scott Urquhart, Haonan Wang, Darrell Whitley, and eight anonymous referees. We also thank the sharp-eyed readers listed in the errata for their suggestions and corrections. Our editor Steve Quigley and the folks at Wiley were supportive and helpful during the publication process. We thank Nelida Pohl for permission to adapt her photograph in the cover design of the first edition. We thank Melinda Stelzer for permission to use her painting "Champagne Circuit," 2001, for the cover of the second edition; more about her art can be found at www.facebook.com/geekchicart. We also owe a special note of thanks to Zube (a.k.a. John Dzubera), who kept our own computers running despite our best efforts to the contrary.

Funding from National Science Foundation (NSF) CAREER grant SBR-9875508 was a significant source of support for the first author during the preparation of the first edition. He also thanks his colleagues and friends in the North Slope Borough, Alaska, Department of Wildlife Management for their longtime research support. The second author gratefully acknowledges the support of STAR Research
Assistance Agreement CR-829095, awarded to Colorado State University by the U.S. Environmental Protection Agency (EPA). The views expressed here are solely those of the authors; NSF and EPA do not endorse any products or commercial services mentioned herein.

Finally, we thank our parents for enabling and supporting our educations, and for providing us with the "stubbornness gene" necessary for graduate school, the tenure track, or book publication (take your pick). The second edition is dedicated to our kids, Natalie and Neil, for continuing to show us what is important and what is not.

Geof H. Givens
Jennifer A. Hoeting
CHAPTER 1

REVIEW

This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.

1.1 MATHEMATICAL NOTATION
We use boldface to distinguish a vector x = (x1, ..., xp) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f1(x), ..., fp(x)). The transpose of M is denoted M^T.

Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x1 ... xn)^T. Let I denote an identity matrix, and 1 and 0 denote vectors of ones and zeros, respectively.

A symmetric square matrix M is positive definite if x^T M x > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite or positive semidefinite if x^T M x ≥ 0 for all nonzero vectors x.
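The equivalence between the quadratic-form definition and the eigenvalue condition can be checked numerically. The following Python sketch is our own illustration (the helper names `eig_sym2` and `quad_form` are ours): for a symmetric 2 × 2 matrix [[a, b], [b, c]] the eigenvalues have a closed form, and the smallest eigenvalue is positive exactly when x^T M x > 0 for the sampled nonzero x.

```python
import math, random

def eig_sym2(a, b, c):
    """Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    return mean - radius, mean + radius

def quad_form(a, b, c, x1, x2):
    """x^T M x for M = [[a, b], [b, c]] and x = (x1, x2)."""
    return a * x1 * x1 + 2 * b * x1 * x2 + c * x2 * x2

random.seed(1)
for (a, b, c) in [(2.0, 0.5, 1.0),    # positive definite
                  (1.0, 2.0, 1.0)]:   # indefinite: one negative eigenvalue
    smallest_eig = eig_sym2(a, b, c)[0]
    all_positive = all(
        quad_form(a, b, c, random.gauss(0, 1), random.gauss(0, 1)) > 0
        for _ in range(10_000)
    )
    # The eigenvalue test and the sampled quadratic-form test agree.
    print(smallest_eig > 0, all_positive)
```

The first matrix prints `True True`, while the second prints `False False`, matching the eigenvalue characterization stated above.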
The derivative of a function f evaluated at x is denoted f′(x). When x = (x1, ..., xp), the gradient of f at x is

f′(x) = ( df(x)/dx1, ..., df(x)/dxp ).

The Hessian matrix for f at x is f″(x), having (i, j)th element equal to d²f(x)/(dxi dxj). The negative Hessian has important uses in statistical inference.

Let J(x) denote the Jacobian matrix evaluated at x for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to dfi(x)/dxj.

A functional is a real-valued function on a space of functions. For example, if T(f) = ∫₀¹ f(x) dx, then the functional T maps suitably integrable functions onto the real line.

The indicator function 1{A} equals 1 if A is true and 0 otherwise. The real line is denoted ℜ, and p-dimensional real space is ℜ^p.
Computational Statistics, Second Edition. Geof H. Givens and Jennifer A. Hoeting. © 2013 John Wiley & Sons, Inc. Published 2013 by John Wiley & Sons, Inc.
1.2 TAYLOR'S THEOREM AND MATHEMATICAL LIMIT THEORY
First we define standard "big oh" and "little oh" notation for describing the relative orders of convergence of functions. Let the functions f and g be defined on a common, possibly infinite, interval. Let z0 be a point in this interval or a boundary point of it (i.e., −∞ or ∞). We require g(z) ≠ 0 for all z ≠ z0 in a neighborhood of z0. Then we say

f(z) = O(g(z))   (1.1)

if there exists a constant M such that |f(z)| ≤ M|g(z)| as z → z0. For example, (n + 1)/(3n²) = O(n⁻¹), where it is understood that we are considering n → ∞. If lim_{z→z0} f(z)/g(z) = 0, then we say

f(z) = o(g(z)).   (1.2)

For example, f(x0 + h) − f(x0) = hf′(x0) + o(h) as h → 0 if f is differentiable at x0. The same notation can be used for describing the convergence of a sequence xn as n → ∞, by letting f(n) = xn.
Taylor's theorem provides a polynomial approximation to a function f. Suppose f has a finite (n + 1)th derivative on (a, b) and a continuous nth derivative on [a, b]. Then for any x0 ∈ [a, b] distinct from x, the Taylor series expansion of f about x0 is

f(x) = Σ_{i=0}^{n} (1/i!) f^(i)(x0) (x − x0)^i + Rn,   (1.3)

where f^(i)(x0) is the ith derivative of f evaluated at x0, and

Rn = (1/(n + 1)!) f^(n+1)(ξ) (x − x0)^(n+1)   (1.4)

for some point ξ in the interval between x and x0. As |x − x0| → 0, note that Rn = O(|x − x0|^(n+1)).
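The order of the remainder can be seen numerically. In the Python sketch below (our own illustration, using f = exp about x0 = 0), halving the step h shrinks the remainder Rn by roughly a factor of 2^(n+1), consistent with Rn = O(h^(n+1)).

```python
import math

def taylor_remainder(h, n, x0=0.0):
    """R_n for f = exp expanded about x0, evaluated at x0 + h."""
    poly = sum(math.exp(x0) * h ** i / math.factorial(i) for i in range(n + 1))
    return math.exp(x0 + h) - poly

n = 3
r1 = taylor_remainder(0.1, n)   # remainder at h = 0.1
r2 = taylor_remainder(0.05, n)  # remainder at h = 0.05
# The ratio is close to 2**(n+1) = 16, as R_n = O(h**(n+1)) predicts.
print(r1 / r2)
```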
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x0 ≠ x. Then

f(x) = f(x0) + Σ_{i=1}^{n} (1/i!) D^(i)(f; x0, x − x0) + Rn,   (1.5)

where

D^(i)(f; x, y) = Σ_{j1=1}^{p} ··· Σ_{ji=1}^{p} [ d^i f(t) / (dt_{j1} ··· dt_{ji}), evaluated at t = x ] Π_{k=1}^{i} y_{jk}   (1.6)

and

Rn = (1/(n + 1)!) D^(n+1)(f; ξ, x − x0)   (1.7)

for some ξ on the line segment joining x and x0. As |x − x0| → 0, note that Rn = O(|x − x0|^(n+1)).
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

∫₀¹ f(x) dx = [f(0) + f(1)]/2 − Σ_{i=1}^{n−1} b_{2i} [f^(2i−1)(1) − f^(2i−1)(0)] / (2i)! − b_{2n} f^(2n)(ξ) / (2n)!,   (1.8)

where 0 ≤ ξ ≤ 1, f^(j) is the jth derivative of f, and bj = Bj(0) can be determined using the recursion relation

Σ_{j=0}^{m} (m + 1 choose j) Bj(z) = (m + 1) z^m,   (1.9)

initialized with B0(z) = 1. The proof of this result is based on repeated integrations by parts [376].
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

df(x)/dxi ≈ [f(x + εi ei) − f(x − εi ei)] / (2εi),   (1.10)

where εi is a small number and ei is the unit vector in the ith coordinate direction. Typically, one might start with, say, εi = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller εi. The approximation will generally improve until εi becomes small enough that the calculation is degraded, and eventually dominated, by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via

d²f(x)/(dxi dxj) ≈ [1/(4 εi εj)] ( f(x + εi ei + εj ej) − f(x + εi ei − εj ej) − f(x − εi ei + εj ej) + f(x − εi ei − εj ej) ),   (1.11)

with similar sequential precision improvements.
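The improve-then-degrade behavior of (1.10) is easy to observe. The Python sketch below is our own illustration, with f = sin at x = 1 standing in for a generic differentiable function; the error shrinks with ε (truncation error is O(ε²)) until subtractive cancellation takes over at very small ε.

```python
import math

def central_diff(f, x, eps):
    """Central-difference approximation (1.10) to f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# The true derivative of sin at x = 1 is cos(1).
x, exact = 1.0, math.cos(1.0)
for eps in [1e-2, 1e-4, 1e-6, 1e-8, 1e-12]:
    err = abs(central_diff(math.sin, x, eps) - exact)
    print(f"eps = {eps:.0e}   error = {err:.2e}")
# The error first shrinks as eps decreases, then grows again once
# roundoff error from subtractive cancellation dominates.
```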
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS
We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ~ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters will also be denoted with a conditioning bar, as in f(x|α, β). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|α) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example, by using f(·|α) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are fX and fY, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.

The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or fX|Y(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity, we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|μ) = f(x|y, μ)f(y|μ). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.

The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X, or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^(1/2)/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so

g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y)   (1.12)

for all x, y ∈ I and all 0 < λ < 1. Then Jensen's inequality states that E{g(X)} ≥ g(E{X}) for any random variable X having P[X ∈ I] = 1.
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

n! = n(n − 1)(n − 2) ··· (3)(2)(1), with 0! = 1,   (1.13)

(n choose k) = n! / (k!(n − k)!),   (1.14)
TABLE 1.1 Notation and description for common probability distributions of discrete random variables.

Bernoulli: X ~ Bernoulli(p), 0 ≤ p ≤ 1.
  Density: f(x) = p^x (1 − p)^(1−x), x = 0 or 1.
  Mean and variance: E{X} = p, var{X} = p(1 − p).

Binomial: X ~ Bin(n, p), 0 ≤ p ≤ 1, n ∈ {1, 2, ...}.
  Density: f(x) = (n choose x) p^x (1 − p)^(n−x), x = 0, 1, ..., n.
  Mean and variance: E{X} = np, var{X} = np(1 − p).

Multinomial: X ~ Multinomial(n, p), p = (p1, ..., pk), 0 ≤ pi ≤ 1, Σ_{i=1}^{k} pi = 1, n ∈ {1, 2, ...}.
  Density: f(x) = (n choose x1, ..., xk) Π_{i=1}^{k} pi^{xi}, x = (x1, ..., xk), xi ∈ {0, 1, ..., n}, Σ_{i=1}^{k} xi = n.
  Mean and variance: E{X} = np, var{Xi} = n pi (1 − pi), cov{Xi, Xj} = −n pi pj.

Negative Binomial: X ~ NegBin(r, p), 0 ≤ p ≤ 1, r ∈ {1, 2, ...}.
  Density: f(x) = (x + r − 1 choose r − 1) p^r (1 − p)^x, x ∈ {0, 1, ...}.
  Mean and variance: E{X} = r(1 − p)/p, var{X} = r(1 − p)/p².

Poisson: X ~ Poisson(λ), λ > 0.
  Density: f(x) = (λ^x / x!) exp{−λ}, x ∈ {0, 1, ...}.
  Mean and variance: E{X} = λ, var{X} = λ.
TABLE 1.2 Notation and description for some common probability distributions of continuous random variables.

Beta: X ~ Beta(α, β), α > 0 and β > 0.
  Density: f(x) = [Γ(α + β) / (Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1), 0 ≤ x ≤ 1.
  Mean and variance: E{X} = α/(α + β), var{X} = αβ / [(α + β)²(α + β + 1)].

Cauchy: X ~ Cauchy(α, β), α ∈ ℜ and β > 0.
  Density: f(x) = 1 / (πβ [1 + ((x − α)/β)²]), x ∈ ℜ.
  Mean and variance: E{X} and var{X} are nonexistent.

Chi-square: X ~ χ²_ν, ν > 0.
  Density: f(x) is the Gamma(ν/2, 1/2) density, x > 0.
  Mean and variance: E{X} = ν, var{X} = 2ν.

Dirichlet: X ~ Dirichlet(α), α = (α1, ..., αk), αi > 0, α0 = Σ_{i=1}^{k} αi.
  Density: f(x) = Γ(α0) Π_{i=1}^{k} xi^(αi−1) / Π_{i=1}^{k} Γ(αi), x = (x1, ..., xk), 0 ≤ xi ≤ 1, Σ_{i=1}^{k} xi = 1.
  Mean and variance: E{X} = α/α0, var{Xi} = αi(α0 − αi) / [α0²(α0 + 1)], cov{Xi, Xj} = −αiαj / [α0²(α0 + 1)].

Exponential: X ~ Exp(λ), λ > 0.
  Density: f(x) = λ exp{−λx}, x > 0.
  Mean and variance: E{X} = 1/λ, var{X} = 1/λ².

Gamma: X ~ Gamma(r, λ), λ > 0 and r > 0.
  Density: f(x) = [λ^r x^(r−1) / Γ(r)] exp{−λx}, x > 0.
  Mean and variance: E{X} = r/λ, var{X} = r/λ².
TABLE 1.3 Notation and description for more common probability distributions of continuous random variables.

Lognormal: X ~ Lognormal(μ, σ²), μ ∈ ℜ and σ > 0.
  Density: f(x) = [1/(x sqrt(2πσ²))] exp{ −(1/2) [(log x − μ)/σ]² }, x > 0.
  Mean and variance: E{X} = exp{μ + σ²/2}, var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}.

Multivariate Normal: X ~ N_k(μ, Σ), μ = (μ1, ..., μk) ∈ ℜ^k, Σ positive definite.
  Density: f(x) = exp{ −(x − μ)^T Σ⁻¹ (x − μ)/2 } / [(2π)^(k/2) |Σ|^(1/2)], x = (x1, ..., xk) ∈ ℜ^k.
  Mean and variance: E{X} = μ, var{X} = Σ.

Normal: X ~ N(μ, σ²), μ ∈ ℜ and σ > 0.
  Density: f(x) = [1/sqrt(2πσ²)] exp{ −(1/2) [(x − μ)/σ]² }, x ∈ ℜ.
  Mean and variance: E{X} = μ, var{X} = σ².

Student's t: X ~ t_ν, ν > 0.
  Density: f(x) = [Γ((ν + 1)/2) / (Γ(ν/2) sqrt(πν))] (1 + x²/ν)^(−(ν+1)/2), x ∈ ℜ.
  Mean and variance: E{X} = 0 if ν > 1; var{X} = ν/(ν − 2) if ν > 2.

Uniform: X ~ Unif(a, b), a, b ∈ ℜ and a < b.
  Density: f(x) = 1/(b − a), x ∈ [a, b].
  Mean and variance: E{X} = (a + b)/2, var{X} = (b − a)²/12.

Weibull: X ~ Weibull(a, b), a > 0 and b > 0.
  Density: f(x) = a b x^(b−1) exp{−a x^b}, x > 0.
  Mean and variance: E{X} = Γ(1 + 1/b) / a^(1/b), var{X} = [Γ(1 + 2/b) − Γ(1 + 1/b)²] / a^(2/b).
(n choose k1, ..., km) = n! / (Π_{i=1}^{m} ki!), where n = Σ_{i=1}^{m} ki,   (1.15)

Γ(r) = (r − 1)! for r = 1, 2, ..., and Γ(r) = ∫₀^∞ t^(r−1) exp{−t} dt for general r > 0.   (1.16)

It is worth knowing that Γ(1/2) = √π and

Γ(n + 1/2) = [1 × 3 × 5 × ··· × (2n − 1)] √π / 2^n

for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as

f(x|γ) = c1(x) c2(γ) exp{ Σ_{i=1}^{k} yi(x) θi(γ) }   (1.17)
for nonnegative functions c1 and c2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θi(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The yi(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that
E{y(X)} = κ′(θ)   (1.18)

and

var{y(X)} = κ″(θ),   (1.19)

where κ(θ) = −log c3(θ), letting c3(θ) denote the reexpression of c2(γ) in terms of the canonical parameters θ = (θ1, ..., θk), and y(X) = (y1(X), ..., yk(X)). These results can be rewritten in terms of the original parameters γ as
E{ Σ_{i=1}^{k} [dθi(γ)/dγj] yi(X) } = −(d/dγj) log c2(γ)   (1.20)

and

var{ Σ_{i=1}^{k} [dθi(γ)/dγj] yi(X) } = −(d²/dγj²) log c2(γ) − E{ Σ_{i=1}^{k} [d²θi(γ)/dγj²] yi(X) }.   (1.21)
Example 1.1 (Poisson). The Poisson distribution belongs to the exponential family, with c1(x) = 1/x!, c2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ′(θ) = exp{θ} = λ and var{X} = κ″(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
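The identities E{X} = κ′(θ) and var{X} = κ″(θ) in Example 1.1 can be checked without any calculus, by differentiating κ(θ) = exp{θ} numerically with the finite differences of Section 1.2. The Python sketch below is our own illustration for λ = 3.

```python
import math

lam = 3.0
theta = math.log(lam)   # canonical parameter for the Poisson: theta = log(lambda)
kappa = math.exp        # kappa(theta) = exp(theta)

eps = 1e-5
# Central differences (1.10)-style approximations to kappa' and kappa''.
kprime = (kappa(theta + eps) - kappa(theta - eps)) / (2 * eps)
kdoubleprime = (kappa(theta + eps) - 2 * kappa(theta) + kappa(theta - eps)) / eps ** 2

# Both are approximately lam = 3, matching E{X} = var{X} = lambda.
print(kprime, kdoubleprime)
```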
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X1, ..., Xp) denote a p-dimensional random variable with continuous density function f. Suppose that

U = g(X) = (g1(X), ..., gp(X)) = (U1, ..., Up),   (1.22)
where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

f(u) = f(g⁻¹(u)) |J(u)|,   (1.23)

where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g⁻¹ evaluated at u, having (i, j)th element dxi/duj, where these derivatives are assumed to be continuous over the support region of U.
1.4 LIKELIHOOD INFERENCE
If X1, ..., Xn are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ1, ..., θp), then the joint likelihood function is

L(θ) = Π_{i=1}^{n} f(xi|θ).   (1.24)

When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x1, ..., xn|θ) viewed as a function of θ.
The observed data, x1, ..., xn, might have been realized under many different values for θ. The parameters for which observing x1, ..., xn would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x1, ..., xn that maximizes L(θ), then θ̂ = ϑ(X1, ..., Xn) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.
It is typically easier to work with the log likelihood function,

l(θ) = log L(θ),   (1.25)

which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x1, ..., xn but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations

l′(θ) = 0,   (1.26)

where

l′(θ) = ( dl(θ)/dθ1, ..., dl(θ)/dθp )
is called the score function. The score function satisfies

E{l′(θ)} = 0,   (1.27)

where the expectation is taken with respect to the distribution of X1, ..., Xn. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.

The MLE has a sampling distribution because it depends on the realization of the random variables X1, ..., Xn. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: when the log likelihood is very pointy, the location of the maximum is more precisely known.
To make this precise, let l″(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθi dθj). The Fisher information matrix is defined as

I(θ) = E{l′(θ) l′(θ)^T} = −E{l″(θ)},   (1.28)

where the expectations are taken with respect to the distribution of X1, ..., Xn. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l″(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.
Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)⁻¹, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is Np(θ*, I(θ*)⁻¹). Since the true parameter values are unknown, I(θ*)⁻¹ must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)⁻¹. Alternatively, it is also reasonable to use −l″(θ̂)⁻¹. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)⁻¹. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)⁻¹ can be found in [127, 182, 371, 470].
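These ideas can be put together in a small worked example. The Python sketch below is our own illustration (the data generator `rpois` and the one-parameter Poisson setting are our choices, not tied to any particular example elsewhere): for i.i.d. Poisson(λ) data the MLE is the sample mean, and the observed Fisher information −l″(λ̂) = n/λ̂ yields the standard error sqrt(λ̂/n). We verify numerically that the score is zero at the MLE and recover the standard error from a finite-difference second derivative.

```python
import math, random

random.seed(2)
n, lam_true = 500, 4.0

def rpois(lam):
    """Poisson draw via the product-of-uniforms method."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p < L:
            return k
        k += 1

x = [rpois(lam_true) for _ in range(n)]

def loglik(lam):
    """Poisson log likelihood, omitting additive constants not involving lam."""
    return sum(xi * math.log(lam) - lam for xi in x)

mle = sum(x) / n                      # closed-form MLE: the sample mean
eps = 1e-5
score_at_mle = (loglik(mle + eps) - loglik(mle - eps)) / (2 * eps)
obs_info = -(loglik(mle + eps) - 2 * loglik(mle) + loglik(mle - eps)) / eps ** 2
se = 1.0 / math.sqrt(obs_info)        # estimated standard error of the MLE

print(mle, score_at_mle, se)          # score ~ 0; se ~ sqrt(mle / n)
```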
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is

L(φ|μ̂(φ)) = max_μ L(μ, φ).   (1.29)

Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
9.3.2.1 Accelerated Bias-Corrected Percentile Method, BCa 294
9.3.2.2 The Bootstrap t 296
9.3.2.3 Empirical Variance Stabilization 298
9.3.2.4 Nested Bootstrap and Prepivoting 299
9.3.3 Hypothesis Testing 301
9.4 Reducing Monte Carlo Error 302
9.4.1 Balanced Bootstrap 302
9.4.2 Antithetic Bootstrap 302
9.5 Bootstrapping Dependent Data 303
9.5.1 Model-Based Approach 304
9.5.2 Block Bootstrap 304
9.5.2.1 Nonmoving Block Bootstrap 304
9.5.2.2 Moving Block Bootstrap 306
9.5.2.3 Blocks-of-Blocks Bootstrapping 307
9.5.2.4 Centering and Studentizing 309
9.5.2.5 Block Size 311
9.6 Bootstrap Performance 315
9.6.1 Independent Data Case 315
9.6.2 Dependent Data Case 316
9.7 Other Uses of the Bootstrap 316
9.8 Permutation Tests 317
Problems 319

PART IV
DENSITY ESTIMATION AND SMOOTHING

10 NONPARAMETRIC DENSITY ESTIMATION 325
10.1 Measures of Performance 326
10.2 Kernel Density Estimation 327
10.2.1 Choice of Bandwidth 329
10.2.1.1 Cross-Validation 332
10.2.1.2 Plug-in Methods 335
10.2.1.3 Maximal Smoothing Principle 338
10.2.2 Choice of Kernel 339
10.2.2.1 Epanechnikov Kernel 339
10.2.2.2 Canonical Kernels and Rescalings 340
10.3 Nonkernel Methods 341
10.3.1 Logspline 341
10.4 Multivariate Methods 345
10.4.1 The Nature of the Problem 345
10.4.2 Multivariate Kernel Estimators 346
10.4.3 Adaptive Kernels and Nearest Neighbors 348
10.4.3.1 Nearest Neighbor Approaches 349
10.4.3.2 Variable-Kernel Approaches and Transformations 350
10.4.4 Exploratory Projection Pursuit 353
Problems 359

11 BIVARIATE SMOOTHING 363
11.1 Predictor–Response Data 363
11.2 Linear Smoothers 365
11.2.1 Constant-Span Running Mean 366
11.2.1.1 Effect of Span 368
11.2.1.2 Span Selection for Linear Smoothers 369
11.2.2 Running Lines and Running Polynomials 372
11.2.3 Kernel Smoothers 374
11.2.4 Local Regression Smoothing 374
11.2.5 Spline Smoothing 376
11.2.5.1 Choice of Penalty 377
11.3 Comparison of Linear Smoothers 377
11.4 Nonlinear Smoothers 379
11.4.1 Loess 379
11.4.2 Supersmoother 381
11.5 Confidence Bands 384
11.6 General Bivariate Data 388
Problems 389

12 MULTIVARIATE SMOOTHING 393
12.1 Predictor–Response Data 393
12.1.1 Additive Models 394
12.1.2 Generalized Additive Models 397
12.1.3 Other Methods Related to Additive Models 399
12.1.3.1 Projection Pursuit Regression 399
12.1.3.2 Neural Networks 402
12.1.3.3 Alternating Conditional Expectations 403
12.1.3.4 Additivity and Variance Stabilization 404
12.1.4 Tree-Based Methods 405
12.1.4.1 Recursive Partitioning Regression Trees 406
12.1.4.2 Tree Pruning 409
12.1.4.3 Classification Trees 411
12.1.4.4 Other Issues for Tree-Based Methods 412
12.2 General Multivariate Data 413
12.2.1 Principal Curves 413
12.2.1.1 Definition and Motivation 413
12.2.1.2 Estimation 415
12.2.1.3 Span Selection 416
Problems 416

DATA ACKNOWLEDGMENTS 421
REFERENCES 423
INDEX 457
PREFACE
This book covers most topics needed to develop a broad and thorough working knowledge of modern computational statistics. We seek to develop a practical understanding of how and why existing methods work, enabling readers to use modern statistical methods effectively. Since many new methods are built from components of existing techniques, our ultimate goal is to provide scientists with the tools they need to contribute new ideas to the field.

A growing challenge in science is that there is so much of it. While the pursuit of important new methods and the honing of existing approaches is a worthy goal, there is also a need to organize and distill the teeming jungle of ideas. We attempt to do that here. Our choice of topics reflects our view of what constitutes the core of the evolving field of computational statistics, and what will be interesting and useful for our readers.

Our use of the adjective modern in the first sentence of this preface is potentially troublesome. There is no way that this book can cover all the latest, greatest techniques. We have not even tried. We have instead aimed to provide a reasonably up-to-date survey of a broad portion of the field, while leaving room for diversions and esoterica.

The foundations of optimization and numerical integration are covered in this book. We include these venerable topics because (i) they are cornerstones of frequentist and Bayesian inference, (ii) routine application of available software often fails for hard problems, and (iii) the methods themselves are often secondary components of other statistical computing algorithms. Some topics we have omitted represent important areas of past and present research in the field, but their priority here is lowered by the availability of high-quality software. For example, the generation of pseudo-random numbers is a classic topic, but one that we prefer to address by giving students reliable software. Finally, some topics (e.g., principal curves and tabu search) are included simply because they are interesting and provide very different perspectives on familiar problems. Perhaps a future researcher may draw ideas from such topics to design a creative and effective new algorithm.

In this second edition, we have both updated and broadened our coverage, and we now provide computer code. For example, we have added new MCMC topics to reflect continued activity in that popular area. A notable increase in breadth is our inclusion of more methods relevant for problems where statistical dependency is important, such as block bootstrapping and sequential importance sampling. This second edition provides extensive new support in R. Specifically, code for the examples in this book is available from the book website, www.stat.colostate.edu/computationalstatistics.

Our target audience includes graduate students in statistics and related fields, statisticians, and quantitative empirical scientists in other fields. We hope such readers may use the book when applying standard methods and developing new methods.
The level of mathematics expected of the reader does not extend much beyond Taylor series and linear algebra. Breadth of mathematical training is more helpful than depth. Essential review is provided in Chapter 1. More advanced readers will find greater mathematical detail in the wide variety of high-quality books available on specific topics, many of which are referenced in the text. Other readers, caring less about analytical details, may prefer to focus on our descriptions of algorithms and examples.

The expected level of statistics is equivalent to that obtained by a graduate student in his or her first year of study of the theory of statistics and probability. An understanding of maximum likelihood methods, Bayesian methods, elementary asymptotic theory, Markov chains, and linear models is most important. Many of these topics are reviewed in Chapter 1.
With respect to computer programming we find that good students can learn asthey go However a working knowledge of a suitable language allows implementationof the ideas covered in this book to progress much more quickly We have chosento forgo any language-specific examples algorithms or coding in the text For thosewishing to learn a language while they study this book we recommend that you choosea high-level interactive package that permits the flexible design of graphical displaysand includes supporting statistics and probability functions such as R and MATLAB1
These are the sort of languages often used by researchers during the development ofnew statistical computing techniques and they are suitable for implementing all themethods we describe except in some cases for problems of vast scope or complexityWe use R and recommend it Although lower-level languages such as C++ couldalso be used they are more appropriate for professional-grade implementation ofalgorithms after researchers have refined the methodology
The book is organized into four major parts optimization (Chapters 2 3 and4) integration and simulation (Chapters 5 6 7 and 8) bootstrapping (Chapter 9)and density estimation and smoothing (Chapters 10 11 and 12) The chapters arewritten to stand independently so a course can be built by selecting the topics onewishes to teach For a one-semester course our selection typically weights mostheavily topics from Chapters 2 3 6 7 9 10 and 11 With a leisurely pace or morethorough coverage a shorter list of topics could still easily fill a semester courseThere is sufficient material here to provide a thorough one-year course of studynotwithstanding any supplemental topics one might wish to teach
A variety of homework problems are included at the end of each chapter Someare straightforward while others require the student to develop a thorough under-standing of the modelmethod being used to carefully (and perhaps cleverly) code asuitable technique and to devote considerable attention to the interpretation of resultsA few exercises invite open-ended exploration of methods and ideas We are some-times asked for solutions to the exercises but we prefer to sequester them to preservethe challenge for future students and readers
The datasets discussed in the examples and exercises are available from the bookwebsite wwwstatcolostateeducomputationalstatistics The R code is also providedthere Finally the website includes an errata Responsibility for all errors lies with us
1R is available for free from wwwr-projectorg Information about MATLAB can be found atwwwmathworkscom
ACKNOWLEDGMENTS
The course upon which this book is based was developed and taught by us at ColoradoState University from 1994 onwards Thanks are due to our many students who havebeen semiwilling guinea pigs over the years We also thank our colleagues in theStatistics Department for their continued support The late Richard Tweedie meritsparticular acknowledgment for his mentoring during the early years of our careers
We owe a great deal of intellectual debt to Adrian Raftery who deserves specialthanks not only for his teaching and advising but also for his unwavering support andhis seemingly inexhaustible supply of good ideas In addition we thank our influentialadvisors and teachers at the University of Washington Statistics Department includingDavid Madigan Werner Stuetzle and Judy Zeh Of course each of our chapters couldbe expanded into a full-length book and great scholars have already done so We owemuch to their efforts upon which we relied when developing our course and ourmanuscript
Portions of the first edition were written at the Department of Mathematics and Statistics, University of Otago, in Dunedin, New Zealand, whose faculty we thank for graciously hosting us during our sabbatical in 2003. Much of our work on the second edition was undertaken during our sabbatical visit to Australia's Commonwealth Scientific and Industrial Research Organisation in 2009–2010, sponsored by CSIRO Mathematics, Informatics and Statistics, and hosted at the Longpocket Laboratory in Indooroopilly, Australia. We thank our hosts and colleagues there for their support.
Our manuscript has been greatly improved through the constructive reviews of John Bickham, Ben Bird, Kate Cowles, Jan Hannig, Alan Herlihy, David Hunter, Devin Johnson, Michael Newton, Doug Nychka, Steve Sain, David W. Scott, N. Scott Urquhart, Haonan Wang, Darrell Whitley, and eight anonymous referees. We also thank the sharp-eyed readers listed in the errata for their suggestions and corrections. Our editor Steve Quigley and the folks at Wiley were supportive and helpful during the publication process. We thank Nelida Pohl for permission to adapt her photograph in the cover design of the first edition. We thank Melinda Stelzer for permission to use her painting "Champagne Circuit," 2001, for the cover of the second edition. More about her art can be found at www.facebook.com/geekchicart. We also owe a special note of thanks to Zube (a.k.a. John Dzubera), who kept our own computers running despite our best efforts to the contrary.
Funding from National Science Foundation (NSF) CAREER grant SBR-9875508 was a significant source of support for the first author during the preparation of the first edition. He also thanks his colleagues and friends in the North Slope Borough, Alaska, Department of Wildlife Management for their longtime research support. The second author gratefully acknowledges the support of STAR Research
Assistance Agreement CR-829095 awarded to Colorado State University by the US Environmental Protection Agency (EPA). The views expressed here are solely those of the authors. NSF and EPA do not endorse any products or commercial services mentioned herein.
Finally, we thank our parents for enabling and supporting our educations, and for providing us with the "stubbornness gene" necessary for graduate school, the tenure track, or book publication (take your pick). The second edition is dedicated to our kids, Natalie and Neil, for continuing to show us what is important and what is not.
Geof H. Givens
Jennifer A. Hoeting
CHAPTER 1

REVIEW
This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.
1.1 MATHEMATICAL NOTATION
We use boldface to distinguish a vector x = (x1, …, xp) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f1(x), …, fp(x)). The transpose of M is denoted M^T.

Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x1, …, xn)^T. Let I denote an identity matrix, and 1 and 0 denote vectors of ones and zeros, respectively.
A symmetric square matrix M is positive definite if x^T M x > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite or positive semidefinite if x^T M x ≥ 0 for all nonzero vectors x.
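This equivalence is easy to check numerically. Below is a small Python sketch (the book's supporting code is in R; the 2 × 2 matrix here is made up for illustration) that computes the eigenvalues of a symmetric 2 × 2 matrix in closed form and compares them against the sign of the quadratic form x^T M x:

```python
import math
import random

def eigenvalues_2x2(m):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]], smallest first."""
    (a, b), (_, d) = m
    mid = (a + d) / 2.0
    disc = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return mid - disc, mid + disc

def quad_form(m, x):
    """x^T M x for a 2x2 matrix m and a length-2 vector x."""
    (a, b), (c, d) = m
    return a * x[0] ** 2 + (b + c) * x[0] * x[1] + d * x[1] ** 2

M = [[2.0, 1.0], [1.0, 2.0]]                 # symmetric; eigenvalues 1 and 3
assert min(eigenvalues_2x2(M)) > 0           # all eigenvalues positive ...

random.seed(1)
for _ in range(1000):                        # ... so x^T M x > 0 for nonzero x
    x = (random.uniform(-5, 5), random.uniform(-5, 5))
    assert quad_form(M, x) > 0.0
```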
The derivative of a function f, evaluated at x, is denoted f′(x). When x = (x1, …, xp), the gradient of f at x is

    f′(x) = (df(x)/dx1, …, df(x)/dxp).
The Hessian matrix for f at x is f″(x), having (i, j)th element equal to d²f(x)/(dxi dxj). The negative Hessian has important uses in statistical inference.

Let J(x) denote the Jacobian matrix evaluated at x for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to dfi(x)/dxj.
A functional is a real-valued function on a space of functions. For example, if T(f) = ∫₀¹ f(x) dx, then the functional T maps suitably integrable functions onto the real line.

The indicator function 1_{A} equals 1 if A is true and 0 otherwise. The real line is denoted ℝ, and p-dimensional real space is ℝ^p.
Computational Statistics, Second Edition. Geof H. Givens and Jennifer A. Hoeting.
© 2013 John Wiley & Sons, Inc. Published 2013 by John Wiley & Sons, Inc.
1.2 TAYLOR'S THEOREM AND MATHEMATICAL LIMIT THEORY
First we define standard "big oh" and "little oh" notation for describing the relative orders of convergence of functions. Let the functions f and g be defined on a common, possibly infinite interval. Let z0 be a point in this interval or a boundary point of it (i.e., −∞ or ∞). We require g(z) ≠ 0 for all z ≠ z0 in a neighborhood of z0. Then we say
    f(z) = O(g(z))                                                    (1.1)

if there exists a constant M such that |f(z)| ≤ M|g(z)| as z → z0. For example, (n + 1)/(3n²) = O(n⁻¹), and it is understood that we are considering n → ∞. If lim_{z→z0} f(z)/g(z) = 0, then we say

    f(z) = o(g(z)).                                                   (1.2)

For example, f(x0 + h) − f(x0) = hf′(x0) + o(h) as h → 0 if f is differentiable at x0. The same notation can be used for describing the convergence of a sequence xn as n → ∞, by letting f(n) = xn.
Taylor's theorem provides a polynomial approximation to a function f. Suppose f has finite (n + 1)th derivative on (a, b) and continuous nth derivative on [a, b]. Then for any x0 ∈ [a, b] distinct from x, the Taylor series expansion of f about x0 is

    f(x) = Σ_{i=0}^{n} (1/i!) f^(i)(x0) (x − x0)^i + Rn,              (1.3)

where f^(i)(x0) is the ith derivative of f evaluated at x0, and

    Rn = [1/(n + 1)!] f^(n+1)(ξ) (x − x0)^(n+1)                       (1.4)

for some point ξ in the interval between x and x0. As |x − x0| → 0, note that Rn = O(|x − x0|^(n+1)).
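The order of the remainder can be checked numerically: for an nth-order expansion, halving |x − x0| should shrink the error by a factor of about 2^(n+1). A brief Python sketch (not from the book; exp about x0 = 0 is chosen only as a convenient test case):

```python
import math

def taylor_exp(x, n):
    """n-th order Taylor polynomial of exp about x0 = 0."""
    return sum(x ** i / math.factorial(i) for i in range(n + 1))

# With n = 2, each halving of the step should cut the error by about 2^3 = 8,
# consistent with Rn = O(|x - x0|^(n+1)).
errors = [abs(math.exp(h) - taylor_exp(h, 2)) for h in (0.1, 0.05, 0.025)]
for bigger, smaller in zip(errors, errors[1:]):
    assert 7.0 < bigger / smaller < 9.0
```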
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x0 ≠ x. Then

    f(x) = f(x0) + Σ_{i=1}^{n} (1/i!) D^(i)(f; x0, x − x0) + Rn,      (1.5)

where

    D^(i)(f; x, y) = Σ_{j1=1}^{p} ··· Σ_{ji=1}^{p} [ d^i f(t) / (dt_{j1} ··· dt_{ji}) |_{t=x} ] Π_{k=1}^{i} y_{jk}    (1.6)

and

    Rn = [1/(n + 1)!] D^(n+1)(f; ξ, x − x0)                           (1.7)

for some ξ on the line segment joining x and x0. As |x − x0| → 0, note that Rn = O(|x − x0|^(n+1)).
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

    ∫₀¹ f(x) dx = [f(0) + f(1)]/2 − Σ_{i=1}^{n−1} b_{2i} [f^(2i−1)(1) − f^(2i−1)(0)] / (2i)! − b_{2n} f^(2n)(ξ) / (2n)!,    (1.8)

where 0 ≤ ξ ≤ 1, f^(j) is the jth derivative of f, and bj = Bj(0) can be determined using the recursion relation

    Σ_{j=0}^{m} C(m + 1, j) Bj(z) = (m + 1) z^m                       (1.9)

initialized with B0(z) = 1, where C(m + 1, j) denotes a binomial coefficient. The proof of this result is based on repeated integrations by parts [376].
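Setting z = 0 in (1.9) makes the right-hand side vanish for m ≥ 1, which gives a simple recursion for the Bernoulli numbers bj = Bj(0). The Python sketch below (an illustrative reimplementation, not the book's R code) uses exact rational arithmetic:

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(n):
    """b_0, ..., b_n via sum_{j=0}^{m} C(m+1, j) b_j = 0 for m >= 1."""
    b = [Fraction(1)]                        # B_0(z) = 1, so b_0 = 1
    for m in range(1, n + 1):
        s = sum(comb(m + 1, j) * b[j] for j in range(m))
        b.append(-s / (m + 1))               # solve for b_m; C(m+1, m) = m + 1
    return b

b = bernoulli_numbers(6)
assert b[1] == Fraction(-1, 2)
assert b[2] == Fraction(1, 6)                # the b_2 appearing in (1.8)
assert b[3] == 0 and b[5] == 0               # odd-index b_j vanish for j >= 3
assert b[4] == Fraction(-1, 30)
assert b[6] == Fraction(1, 42)
```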
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

    df(x)/dxi ≈ [f(x + εi ei) − f(x − εi ei)] / (2εi),                (1.10)

where εi is a small number and ei is the unit vector in the ith coordinate direction. Typically, one might start with, say, εi = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller εi. The approximation will generally improve until εi becomes small enough that the calculation is degraded, and eventually dominated, by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via

    d²f(x)/(dxi dxj) ≈ [1/(4εi εj)] [f(x + εi ei + εj ej) − f(x + εi ei − εj ej)
                        − f(x − εi ei + εj ej) + f(x − εi ei − εj ej)],    (1.11)

with similar sequential precision improvements.
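A minimal Python sketch of (1.10) and (1.11) follows (not from the book; the test function and fixed step sizes are made up, whereas in practice one would try the shrinking sequence of εi described above):

```python
def partial(f, x, i, eps=1e-6):
    """Central difference (1.10) for the i-th partial derivative of f at x."""
    xp, xm = list(x), list(x)
    xp[i] += eps
    xm[i] -= eps
    return (f(xp) - f(xm)) / (2.0 * eps)

def partial2(f, x, i, j, eps=1e-4):
    """Central difference (1.11) for d^2 f / (dx_i dx_j) at x."""
    def shift(si, sj):
        y = list(x)
        y[i] += si * eps
        y[j] += sj * eps
        return f(y)
    return (shift(1, 1) - shift(1, -1) - shift(-1, 1) + shift(-1, -1)) / (4.0 * eps * eps)

f = lambda v: v[0] ** 2 * v[1]               # gradient (2*x1*x2, x1^2); d2f/dx1dx2 = 2*x1
assert abs(partial(f, [1.0, 2.0], 0) - 4.0) < 1e-6
assert abs(partial(f, [1.0, 2.0], 1) - 1.0) < 1e-6
assert abs(partial2(f, [1.0, 2.0], 0, 1) - 2.0) < 1e-4
```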
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS
We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ~ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters will also be denoted with a conditioning bar, as in f(x|α, β). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|α) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example, by using f(·|α) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are fX and fY, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.
The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or f_{X|Y}(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|μ) = f(x|y, μ)f(y|μ). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.
The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X, or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1_{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^{1/2}/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so

    g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y)                            (1.12)

for all x, y ∈ I and all 0 < λ < 1. Then Jensen's inequality states that E{g(X)} ≥ g(E{X}) for any random variable X having P[X ∈ I] = 1.
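For instance, with the convex choice g(x) = x² and X uniform on the made-up support {1, 2, 3}, the inequality is immediate to verify (a Python sketch, not from the book):

```python
g = lambda x: x * x                               # convex on the whole real line

support = [1.0, 2.0, 3.0]                         # X uniform on {1, 2, 3}
ex = sum(support) / len(support)                  # E{X} = 2
egx = sum(g(x) for x in support) / len(support)   # E{g(X)} = 14/3

assert abs(egx - 14.0 / 3.0) < 1e-12
assert egx >= g(ex)                               # Jensen: E{g(X)} >= g(E{X})
```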
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

    n! = n(n − 1)(n − 2) ··· (3)(2)(1), with 0! = 1,                  (1.13)

    C(n, k) = n! / [k!(n − k)!]  (the binomial coefficient),          (1.14)
TABLE 1.1  Notation and description for common probability distributions of discrete random variables

Bernoulli: X ~ Bernoulli(p), 0 ≤ p ≤ 1.
  Density: f(x) = p^x (1 − p)^(1−x), x = 0 or 1.
  Moments: E{X} = p; var{X} = p(1 − p).

Binomial: X ~ Bin(n, p), 0 ≤ p ≤ 1, n ∈ {1, 2, …}.
  Density: f(x) = C(n, x) p^x (1 − p)^(n−x), x = 0, 1, …, n.
  Moments: E{X} = np; var{X} = np(1 − p).

Multinomial: X ~ Multinomial(n, p), p = (p1, …, pk), 0 ≤ pi ≤ 1, Σ_{i=1}^{k} pi = 1, n ∈ {1, 2, …}.
  Density: f(x) = C(n; x1, …, xk) Π_{i=1}^{k} pi^{xi}, x = (x1, …, xk), xi ∈ {0, 1, …, n}, Σ_{i=1}^{k} xi = n.
  Moments: E{X} = np; var{Xi} = npi(1 − pi); cov{Xi, Xj} = −npipj.

Negative Binomial: X ~ NegBin(r, p), 0 ≤ p ≤ 1, r ∈ {1, 2, …}.
  Density: f(x) = C(x + r − 1, r − 1) p^r (1 − p)^x, x ∈ {0, 1, …}.
  Moments: E{X} = r(1 − p)/p; var{X} = r(1 − p)/p².

Poisson: X ~ Poisson(λ), λ > 0.
  Density: f(x) = (λ^x / x!) exp{−λ}, x ∈ {0, 1, …}.
  Moments: E{X} = λ; var{X} = λ.
TABLE 1.2  Notation and description for some common probability distributions of continuous random variables

Beta: X ~ Beta(α, β), α > 0 and β > 0.
  Density: f(x) = [Γ(α + β) / (Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1), 0 ≤ x ≤ 1.
  Moments: E{X} = α/(α + β); var{X} = αβ / [(α + β)²(α + β + 1)].

Cauchy: X ~ Cauchy(α, β), α ∈ ℝ and β > 0.
  Density: f(x) = 1 / (πβ[1 + ((x − α)/β)²]), x ∈ ℝ.
  Moments: E{X} and var{X} are nonexistent.

Chi-square: X ~ χ²_ν, ν > 0.
  Density: f(x) is the Gamma(ν/2, 1/2) density, x > 0.
  Moments: E{X} = ν; var{X} = 2ν.

Dirichlet: X ~ Dirichlet(α), α = (α1, …, αk), αi > 0, α0 = Σ_{i=1}^{k} αi.
  Density: f(x) = Γ(α0) Π_{i=1}^{k} xi^(αi−1) / Π_{i=1}^{k} Γ(αi), x = (x1, …, xk), 0 ≤ xi ≤ 1, Σ_{i=1}^{k} xi = 1.
  Moments: E{X} = α/α0; var{Xi} = αi(α0 − αi) / [α0²(α0 + 1)]; cov{Xi, Xj} = −αiαj / [α0²(α0 + 1)].

Exponential: X ~ Exp(λ), λ > 0.
  Density: f(x) = λ exp{−λx}, x > 0.
  Moments: E{X} = 1/λ; var{X} = 1/λ².

Gamma: X ~ Gamma(r, λ), λ > 0 and r > 0.
  Density: f(x) = [λ^r x^(r−1) / Γ(r)] exp{−λx}, x > 0.
  Moments: E{X} = r/λ; var{X} = r/λ².
TABLE 1.3  Notation and description for more common probability distributions of continuous random variables

Lognormal: X ~ Lognormal(μ, σ²), μ ∈ ℝ and σ > 0.
  Density: f(x) = [1 / (x√(2πσ²))] exp{−(log x − μ)² / (2σ²)}, x > 0.
  Moments: E{X} = exp{μ + σ²/2}; var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}.

Multivariate Normal: X ~ N_k(μ, Σ), μ = (μ1, …, μk) ∈ ℝ^k, Σ positive definite.
  Density: f(x) = exp{−(x − μ)^T Σ⁻¹ (x − μ)/2} / [(2π)^(k/2) |Σ|^(1/2)], x = (x1, …, xk) ∈ ℝ^k.
  Moments: E{X} = μ; var{X} = Σ.

Normal: X ~ N(μ, σ²), μ ∈ ℝ and σ > 0.
  Density: f(x) = [1/√(2πσ²)] exp{−(x − μ)² / (2σ²)}, x ∈ ℝ.
  Moments: E{X} = μ; var{X} = σ².

Student's t: X ~ t_ν, ν > 0.
  Density: f(x) = [Γ((ν + 1)/2) / (Γ(ν/2)√(πν))] (1 + x²/ν)^(−(ν+1)/2), x ∈ ℝ.
  Moments: E{X} = 0 if ν > 1; var{X} = ν/(ν − 2) if ν > 2.

Uniform: X ~ Unif(a, b), a, b ∈ ℝ and a < b.
  Density: f(x) = 1/(b − a), x ∈ [a, b].
  Moments: E{X} = (a + b)/2; var{X} = (b − a)²/12.

Weibull: X ~ Weibull(a, b), a > 0 and b > 0.
  Density: f(x) = abx^(b−1) exp{−ax^b}, x > 0.
  Moments: E{X} = Γ(1 + 1/b)/a^(1/b); var{X} = [Γ(1 + 2/b) − Γ(1 + 1/b)²]/a^(2/b).
    C(n; k1, …, km) = n! / Π_{i=1}^{m} ki!,  where n = Σ_{i=1}^{m} ki,    (1.15)

    Γ(r) = (r − 1)!  for r = 1, 2, …,  and  Γ(r) = ∫₀^∞ t^(r−1) exp{−t} dt  for general r > 0.    (1.16)
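Both branches of (1.16) can be verified numerically in Python (a sketch, not from the book; the crude Riemann sum and its truncation point are arbitrary illustrative choices):

```python
import math

# Gamma(r) = (r - 1)! at positive integers
for r in range(1, 8):
    assert math.isclose(math.gamma(r), math.factorial(r - 1))

# For general r > 0, approximate the defining integral of t^(r-1) exp(-t)
def gamma_integral(r, steps=100000, upper=40.0):
    h = upper / steps
    return h * sum((k * h) ** (r - 1) * math.exp(-k * h) for k in range(1, steps))

assert math.isclose(gamma_integral(2.5), math.gamma(2.5), rel_tol=1e-4)
```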
It is worth knowing that

    Γ(1/2) = √π  and  Γ(n + 1/2) = [1 × 3 × 5 × ··· × (2n − 1)] √π / 2^n

for positive integer n.

Many of the distributions commonly used in statistics are members of an
exponential family. A k-parameter exponential family density can be expressed as

    f(x|γ) = c1(x) c2(γ) exp{ Σ_{i=1}^{k} yi(x) θi(γ) }               (1.17)
for nonnegative functions c1 and c2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θi(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The yi(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that

    E{y(X)} = κ′(θ)                                                   (1.18)

and

    var{y(X)} = κ″(θ),                                                (1.19)

where κ(θ) = −log c3(θ), letting c3(θ) denote the reexpression of c2(γ) in terms of the canonical parameters θ = (θ1, …, θk), and y(X) = (y1(X), …, yk(X)). These results can be rewritten in terms of the original parameters γ as
    E{ Σ_{i=1}^{k} [dθi(γ)/dγj] yi(X) } = −(d/dγj) log c2(γ)          (1.20)

and

    var{ Σ_{i=1}^{k} [dθi(γ)/dγj] yi(X) } = −(d²/dγj²) log c2(γ) − E{ Σ_{i=1}^{k} [d²θi(γ)/dγj²] yi(X) }.    (1.21)
Example 1.1 (Poisson). The Poisson distribution belongs to the exponential family, with c1(x) = 1/x!, c2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ′(θ) = exp{θ} = λ and var{X} = κ″(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
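The example can be confirmed numerically by assembling the Poisson mass function from its exponential family pieces and computing moments by direct summation (a Python sketch, not from the book; λ = 3 is an arbitrary choice):

```python
import math

lam = 3.0
theta = math.log(lam)                        # canonical parameter theta = log(lambda)

def pois(x):
    """c1(x) * c2(lambda) * exp{y(x) * theta} with c1 = 1/x!, c2 = exp{-lambda}."""
    return (1.0 / math.factorial(x)) * math.exp(-lam) * math.exp(x * theta)

xs = range(60)                               # the omitted tail is negligible
mean = sum(x * pois(x) for x in xs)
var = sum((x - mean) ** 2 * pois(x) for x in xs)

# kappa(theta) = exp{theta}, so both derivatives equal exp{theta} = lambda
assert abs(mean - lam) < 1e-9
assert abs(var - lam) < 1e-9
```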
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X1, …, Xp) denote a p-dimensional random variable with continuous density function f. Suppose that

    U = g(X) = (g1(X), …, gp(X)) = (U1, …, Up),                       (1.22)

where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

    f(u) = f(g⁻¹(u)) |J(u)|,                                          (1.23)

where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g⁻¹ evaluated at u, having (i, j)th element dxi/duj, where these derivatives are assumed to be continuous over the support region of U.
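As an illustration of (1.23), take U = exp(X) with X ~ N(0, 1). Then g⁻¹(u) = log u, the Jacobian is d(log u)/du = 1/u, and the transformed density must agree with the Lognormal(0, 1) density of Table 1.3 (a Python sketch, not from the book):

```python
import math

def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def u_density(u):
    """Density of U = exp(X) via (1.23): f_X(g^{-1}(u)) * |J(u)| with |J(u)| = 1/u."""
    return norm_pdf(math.log(u)) / u

def lognorm_pdf(u):
    """Lognormal(0, 1) density from Table 1.3, for comparison."""
    return math.exp(-0.5 * math.log(u) ** 2) / (u * math.sqrt(2.0 * math.pi))

for u in (0.5, 1.0, 2.0, 5.0):
    assert math.isclose(u_density(u), lognorm_pdf(u))
```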
1.4 LIKELIHOOD INFERENCE
If X1, …, Xn are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ1, …, θp), then the joint likelihood function is

    L(θ) = Π_{i=1}^{n} f(xi|θ).                                       (1.24)

When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x1, …, xn|θ) viewed as a function of θ.
The observed data, x1, …, xn, might have been realized under many different values for θ. The parameters for which observing x1, …, xn would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x1, …, xn that maximizes L(θ), then θ̂ = ϑ(X1, …, Xn) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.
It is typically easier to work with the log likelihood function

    l(θ) = log L(θ),                                                  (1.25)

which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x1, …, xn but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations

    l′(θ) = 0,                                                        (1.26)

where

    l′(θ) = (dl(θ)/dθ1, …, dl(θ)/dθp)
is called the score function. The score function satisfies

    E{l′(θ)} = 0,                                                     (1.27)

where the expectation is taken with respect to the distribution of X1, …, Xn. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.
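One such numerical approach is Newton's method applied to the score equation, iterating θ ← θ − l′(θ)/l″(θ). The Python sketch below is illustrative only (not the book's code): it fits the location θ of a Cauchy(θ, 1) model, whose MLE has no closed form, to five made-up observations.

```python
def score(theta, data):
    """l'(theta) for iid Cauchy(theta, 1) data."""
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in data)

def score_deriv(theta, data):
    """l''(theta) for the same model."""
    return sum(2.0 * ((x - theta) ** 2 - 1.0) / (1.0 + (x - theta) ** 2) ** 2
               for x in data)

def newton_mle(data, theta0, iters=50):
    theta = theta0
    for _ in range(iters):
        theta -= score(theta, data) / score_deriv(theta, data)
    return theta

data = [-1.3, -0.4, 0.0, 0.4, 1.3]           # symmetric about 0, so the MLE is 0
theta_hat = newton_mle(data, theta0=0.1)
assert abs(score(theta_hat, data)) < 1e-10   # solves l'(theta) = 0
assert abs(theta_hat) < 1e-8
```

Starting values matter: Newton's method on a Cauchy likelihood can diverge from a poor θ0, one motivation for the optimization methods developed later in the book.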
The MLE has a sampling distribution because it depends on the realization of the random variables X1, …, Xn. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: when the log likelihood is very pointy, the location of the maximum is more precisely known.
To make this precise, let l″(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθi dθj). The Fisher information matrix is defined as

    I(θ) = E{l′(θ) l′(θ)^T} = −E{l″(θ)},                              (1.28)

where the expectations are taken with respect to the distribution of X1, …, Xn. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l″(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.
Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)⁻¹, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is N_p(θ*, I(θ*)⁻¹). Since the true parameter values are unknown, I(θ*)⁻¹ must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)⁻¹. Alternatively, it is also reasonable to use −l″(θ̂)⁻¹. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)⁻¹. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)⁻¹ can be found in [127, 182, 371, 470].
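For the Poisson model the two estimates coincide exactly at the MLE, which makes a convenient sanity check (a Python sketch with made-up count data, not from the book):

```python
import math

data = [2, 4, 3, 0, 5, 3, 1, 6]              # made-up Poisson-like counts
n = len(data)
lam_hat = sum(data) / n                      # Poisson MLE: the sample mean

obs_info = sum(data) / lam_hat ** 2          # -l''(lam_hat) = sum(x_i)/lam^2
exp_info = n / lam_hat                       # I(lam) = n/lam, evaluated at lam_hat
assert math.isclose(obs_info, exp_info)      # equal at lam_hat = sample mean

se = 1.0 / math.sqrt(obs_info)               # estimated standard error of lam_hat
assert math.isclose(se, math.sqrt(lam_hat / n))
```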
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is

    L(φ|μ̂(φ)) = max_μ L(μ, φ).                                       (1.29)

Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
CONTENTS

10.4.2 Multivariate Kernel Estimators
10.4.3 Adaptive Kernels and Nearest Neighbors
  10.4.3.1 Nearest Neighbor Approaches
  10.4.3.2 Variable-Kernel Approaches and Transformations
10.4.4 Exploratory Projection Pursuit
Problems

11 BIVARIATE SMOOTHING
11.1 Predictor–Response Data
11.2 Linear Smoothers
  11.2.1 Constant-Span Running Mean
    11.2.1.1 Effect of Span
    11.2.1.2 Span Selection for Linear Smoothers
  11.2.2 Running Lines and Running Polynomials
  11.2.3 Kernel Smoothers
  11.2.4 Local Regression Smoothing
  11.2.5 Spline Smoothing
    11.2.5.1 Choice of Penalty
11.3 Comparison of Linear Smoothers
11.4 Nonlinear Smoothers
  11.4.1 Loess
  11.4.2 Supersmoother
11.5 Confidence Bands
11.6 General Bivariate Data
Problems

12 MULTIVARIATE SMOOTHING
12.1 Predictor–Response Data
  12.1.1 Additive Models
  12.1.2 Generalized Additive Models
  12.1.3 Other Methods Related to Additive Models
    12.1.3.1 Projection Pursuit Regression
    12.1.3.2 Neural Networks
    12.1.3.3 Alternating Conditional Expectations
    12.1.3.4 Additivity and Variance Stabilization
  12.1.4 Tree-Based Methods
    12.1.4.1 Recursive Partitioning Regression Trees
    12.1.4.2 Tree Pruning
    12.1.4.3 Classification Trees
    12.1.4.4 Other Issues for Tree-Based Methods
12.2 General Multivariate Data
  12.2.1 Principal Curves
    12.2.1.1 Definition and Motivation
    12.2.1.2 Estimation
    12.2.1.3 Span Selection
Problems

DATA ACKNOWLEDGMENTS

REFERENCES

INDEX
PREFACE
This book covers most topics needed to develop a broad and thorough working knowledge of modern computational statistics. We seek to develop a practical understanding of how and why existing methods work, enabling readers to use modern statistical methods effectively. Since many new methods are built from components of existing techniques, our ultimate goal is to provide scientists with the tools they need to contribute new ideas to the field.

A growing challenge in science is that there is so much of it. While the pursuit of important new methods and the honing of existing approaches is a worthy goal, there is also a need to organize and distill the teeming jungle of ideas. We attempt to do that here. Our choice of topics reflects our view of what constitutes the core of the evolving field of computational statistics, and what will be interesting and useful for our readers.

Our use of the adjective modern in the first sentence of this preface is potentially troublesome. There is no way that this book can cover all the latest, greatest techniques. We have not even tried. We have instead aimed to provide a reasonably up-to-date survey of a broad portion of the field, while leaving room for diversions and esoterica.

The foundations of optimization and numerical integration are covered in this book. We include these venerable topics because (i) they are cornerstones of frequentist and Bayesian inference, (ii) routine application of available software often fails for hard problems, and (iii) the methods themselves are often secondary components of other statistical computing algorithms. Some topics we have omitted represent important areas of past and present research in the field, but their priority here is lowered by the availability of high-quality software. For example, the generation of pseudo-random numbers is a classic topic, but one that we prefer to address by giving students reliable software. Finally, some topics (e.g., principal curves and tabu search) are included simply because they are interesting and provide very different perspectives on familiar problems. Perhaps a future researcher may draw ideas from such topics to design a creative and effective new algorithm.

In this second edition we have both updated and broadened our coverage, and we now provide computer code. For example, we have added new MCMC topics to reflect continued activity in that popular area. A notable increase in breadth is our inclusion of more methods relevant for problems where statistical dependency is important, such as block bootstrapping and sequential importance sampling. This second edition provides extensive new support in R. Specifically, code for the examples in this book is available from the book website, www.stat.colostate.edu/computationalstatistics.

Our target audience includes graduate students in statistics and related fields, statisticians, and quantitative empirical scientists in other fields. We hope such readers may use the book when applying standard methods and developing new methods.
The level of mathematics expected of the reader does not extend much beyond Taylor series and linear algebra. Breadth of mathematical training is more helpful than depth. Essential review is provided in Chapter 1. More advanced readers will find greater mathematical detail in the wide variety of high-quality books available on specific topics, many of which are referenced in the text. Other readers, caring less about analytical details, may prefer to focus on our descriptions of algorithms and examples.

The expected level of statistics is equivalent to that obtained by a graduate student in his or her first year of study of the theory of statistics and probability. An understanding of maximum likelihood methods, Bayesian methods, elementary asymptotic theory, Markov chains, and linear models is most important. Many of these topics are reviewed in Chapter 1.

With respect to computer programming, we find that good students can learn as they go. However, a working knowledge of a suitable language allows implementation of the ideas covered in this book to progress much more quickly. We have chosen to forgo any language-specific examples, algorithms, or coding in the text. For those wishing to learn a language while they study this book, we recommend that you choose a high-level, interactive package that permits the flexible design of graphical displays and includes supporting statistics and probability functions, such as R and MATLAB.1 These are the sort of languages often used by researchers during the development of new statistical computing techniques, and they are suitable for implementing all the methods we describe, except in some cases for problems of vast scope or complexity. We use R and recommend it. Although lower-level languages such as C++ could also be used, they are more appropriate for professional-grade implementation of algorithms after researchers have refined the methodology.
The book is organized into four major parts optimization (Chapters 2 3 and4) integration and simulation (Chapters 5 6 7 and 8) bootstrapping (Chapter 9)and density estimation and smoothing (Chapters 10 11 and 12) The chapters arewritten to stand independently so a course can be built by selecting the topics onewishes to teach For a one-semester course our selection typically weights mostheavily topics from Chapters 2 3 6 7 9 10 and 11 With a leisurely pace or morethorough coverage a shorter list of topics could still easily fill a semester courseThere is sufficient material here to provide a thorough one-year course of studynotwithstanding any supplemental topics one might wish to teach
A variety of homework problems are included at the end of each chapter Someare straightforward while others require the student to develop a thorough under-standing of the modelmethod being used to carefully (and perhaps cleverly) code asuitable technique and to devote considerable attention to the interpretation of resultsA few exercises invite open-ended exploration of methods and ideas We are some-times asked for solutions to the exercises but we prefer to sequester them to preservethe challenge for future students and readers
The datasets discussed in the examples and exercises are available from the bookwebsite wwwstatcolostateeducomputationalstatistics The R code is also providedthere Finally the website includes an errata Responsibility for all errors lies with us
1R is available for free from wwwr-projectorg Information about MATLAB can be found atwwwmathworkscom
ACKNOWLEDGMENTS
The course upon which this book is based was developed and taught by us at ColoradoState University from 1994 onwards Thanks are due to our many students who havebeen semiwilling guinea pigs over the years We also thank our colleagues in theStatistics Department for their continued support The late Richard Tweedie meritsparticular acknowledgment for his mentoring during the early years of our careers
We owe a great deal of intellectual debt to Adrian Raftery who deserves specialthanks not only for his teaching and advising but also for his unwavering support andhis seemingly inexhaustible supply of good ideas In addition we thank our influentialadvisors and teachers at the University of Washington Statistics Department includingDavid Madigan Werner Stuetzle and Judy Zeh Of course each of our chapters couldbe expanded into a full-length book and great scholars have already done so We owemuch to their efforts upon which we relied when developing our course and ourmanuscript
CHAPTER 1
REVIEW
This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.
1.1 MATHEMATICAL NOTATION
We use boldface to distinguish a vector x = (x1, ..., xp) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f1(x), ..., fp(x)). The transpose of M is denoted M^T.
Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x1, ..., xn)^T. Let I denote an identity matrix, and 1 and 0 denote vectors of ones and zeros, respectively.
A symmetric square matrix M is positive definite if x^T M x > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite or positive semidefinite if x^T M x ≥ 0 for all nonzero vectors x.
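The eigenvalue characterization is easy to check numerically. The following sketch (ours, in Python with NumPy, not part of the book's R code) tests two symmetric matrices:

```python
import numpy as np

def is_positive_definite(M, tol=1e-12):
    """Check positive definiteness of a symmetric matrix via its eigenvalues."""
    M = np.asarray(M, dtype=float)
    assert np.allclose(M, M.T), "this test applies to symmetric matrices"
    # eigvalsh is the eigenvalue routine for symmetric (Hermitian) matrices
    return bool(np.all(np.linalg.eigvalsh(M) > tol))

A = np.array([[2.0, 1.0], [1.0, 2.0]])  # eigenvalues 1 and 3: positive definite
B = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues -1 and 3: not
print(is_positive_definite(A), is_positive_definite(B))  # True False
```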
The derivative of a function f, evaluated at x, is denoted f′(x). When x = (x1, ..., xp), the gradient of f at x is

f'(x) = \left( \frac{df(x)}{dx_1}, \ldots, \frac{df(x)}{dx_p} \right)
The Hessian matrix for f at x is f″(x), having (i, j)th element equal to d²f(x)/(dx_i dx_j). The negative Hessian has important uses in statistical inference.
Let J(x) denote the Jacobian matrix evaluated at x for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to df_i(x)/dx_j.
A functional is a real-valued function on a space of functions. For example, if T(f) = \int_0^1 f(x)\,dx, then the functional T maps suitably integrable functions onto the real line.
The indicator function 1{A} equals 1 if A is true and 0 otherwise. The real line is denoted ℝ, and p-dimensional real space is ℝ^p.
Computational Statistics, Second Edition. Geof H. Givens and Jennifer A. Hoeting.
© 2013 John Wiley & Sons, Inc. Published 2013 by John Wiley & Sons, Inc.
1.2 TAYLOR'S THEOREM AND MATHEMATICAL LIMIT THEORY
First we define standard "big oh" and "little oh" notation for describing the relative orders of convergence of functions. Let the functions f and g be defined on a common, possibly infinite, interval. Let z0 be a point in this interval or a boundary point of it (i.e., −∞ or ∞). We require g(z) ≠ 0 for all z ≠ z0 in a neighborhood of z0. Then we say
f(z) = O(g(z))    (1.1)
if there exists a constant M such that |f(z)| ≤ M|g(z)| as z → z0. For example, (n + 1)/(3n²) = O(n⁻¹), and it is understood that we are considering n → ∞. If lim_{z→z0} f(z)/g(z) = 0, then we say
f(z) = o(g(z))    (1.2)
For example, f(x0 + h) − f(x0) = hf′(x0) + o(h) as h → 0 if f is differentiable at x0. The same notation can be used for describing the convergence of a sequence {x_n} as n → ∞, by letting f(n) = x_n.
Taylor's theorem provides a polynomial approximation to a function f. Suppose f has finite (n + 1)th derivative on (a, b) and continuous nth derivative on [a, b]. Then for any x0 ∈ [a, b] distinct from x, the Taylor series expansion of f about x0 is
f(x) = \sum_{i=0}^{n} \frac{1}{i!} f^{(i)}(x_0)(x - x_0)^i + R_n    (1.3)
where f^{(i)}(x0) is the ith derivative of f evaluated at x0, and
R_n = \frac{1}{(n+1)!} f^{(n+1)}(\xi)(x - x_0)^{n+1}    (1.4)
for some point ξ in the interval between x and x0. As |x − x0| → 0, note that R_n = O(|x − x0|^{n+1}).
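The order of the remainder is easy to see numerically. In the sketch below (our illustration, not from the book), f(x) = exp(x) is expanded about x0 = 0 with n = 3; since R_3 = O(|x|^4), halving x should shrink the error roughly 16-fold:

```python
from math import exp, factorial

def taylor_exp(x, n=3):
    """Degree-n Taylor polynomial (1.3) of exp about x0 = 0."""
    return sum(x**i / factorial(i) for i in range(n + 1))

err1 = abs(exp(0.2) - taylor_exp(0.2))
err2 = abs(exp(0.1) - taylor_exp(0.1))
print(err1 / err2)  # roughly 2**4 = 16, as the O(|x - x0|**4) remainder predicts
```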
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x0 ≠ x. Then
f(x) = f(x_0) + \sum_{i=1}^{n} \frac{1}{i!} D^{(i)}(f; x_0, x - x_0) + R_n    (1.5)
where
D^{(i)}(f; x, y) = \sum_{j_1=1}^{p} \cdots \sum_{j_i=1}^{p} \left( \frac{d^i}{dt_{j_1} \cdots dt_{j_i}} f(t) \bigg|_{t=x} \right) \prod_{k=1}^{i} y_{j_k}    (1.6)
and
R_n = \frac{1}{(n+1)!} D^{(n+1)}(f; \xi, x - x_0)    (1.7)
for some ξ on the line segment joining x and x0. As |x − x0| → 0, note that R_n = O(|x − x0|^{n+1}).
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

\int_0^1 f(x)\,dx = \frac{f(0) + f(1)}{2} - \sum_{i=1}^{n-1} \frac{b_{2i}\left(f^{(2i-1)}(1) - f^{(2i-1)}(0)\right)}{(2i)!} - \frac{b_{2n} f^{(2n)}(\xi)}{(2n)!}    (1.8)
where 0 ≤ ξ ≤ 1, f^{(j)} is the jth derivative of f, and b_j = B_j(0) can be determined using the recursion relation
\sum_{j=0}^{m} \binom{m+1}{j} B_j(z) = (m+1) z^m    (1.9)
initialized with B0(z) = 1. The proof of this result is based on repeated integration by parts [376].
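Evaluating the recursion (1.9) at z = 0 determines the numbers b_j directly: for m ≥ 1 the right side vanishes, so each b_m can be solved for from its predecessors. A short sketch (ours, using exact rational arithmetic):

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(n):
    """Compute b_0, ..., b_n from (1.9) at z = 0:
    b_0 = 1, and sum_{j=0}^{m} C(m+1, j) b_j = 0 for m >= 1."""
    b = [Fraction(1)]
    for m in range(1, n + 1):
        s = sum(comb(m + 1, j) * b[j] for j in range(m))
        b.append(-s / (m + 1))  # coefficient of b_m is C(m+1, m) = m + 1
    return b

b = bernoulli_numbers(8)
print(b[:5])  # b_0, ..., b_4 are 1, -1/2, 1/6, 0, -1/30
```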
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by
\frac{df(x)}{dx_i} \approx \frac{f(x + \epsilon_i e_i) - f(x - \epsilon_i e_i)}{2\epsilon_i}    (1.10)
where ε_i is a small number and e_i is the unit vector in the ith coordinate direction. Typically, one might start with, say, ε_i = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller ε_i. The approximation will generally improve until ε_i becomes small enough that the calculation is degraded and eventually dominated by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via
\frac{d^2 f(x)}{dx_i\, dx_j} \approx \frac{1}{4\epsilon_i \epsilon_j} \Big( f(x + \epsilon_i e_i + \epsilon_j e_j) - f(x + \epsilon_i e_i - \epsilon_j e_j) - f(x - \epsilon_i e_i + \epsilon_j e_j) + f(x - \epsilon_i e_i - \epsilon_j e_j) \Big)    (1.11)

with similar sequential precision improvements.
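The tradeoff between truncation error and subtractive cancellation can be seen in a few lines. This sketch (our Python illustration; the test function and point are arbitrary) applies (1.10) to f(x) = exp(x1) + x1 x2², whose true gradient at (1, 2) is (e + 4, 4):

```python
import numpy as np

def grad_fd(f, x, eps):
    """Central-difference approximation (1.10) to the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

f = lambda x: np.exp(x[0]) + x[0] * x[1] ** 2
x = np.array([1.0, 2.0])
exact = np.array([np.e + 4.0, 4.0])
errs = {eps: np.max(np.abs(grad_fd(f, x, eps) - exact))
        for eps in (1e-2, 1e-6, 1e-12)}
print(errs)  # error improves from 1e-2 to 1e-6, then roundoff degrades it at 1e-12
```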
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS
We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ∼ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters also will be denoted with a conditioning bar, as in f(x|α, β). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|α) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example, by using f(·|α) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are fX and fY, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.
The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or fX|Y(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|μ) = f(x|y, μ) f(y|μ). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.
The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^{1/2}/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite, open interval I, so
g(\lambda x + (1 - \lambda)y) \le \lambda g(x) + (1 - \lambda) g(y)    (1.12)
for all x, y ∈ I and all 0 < λ < 1. Then Jensen's inequality states that E{g(X)} ≥ g(E{X}) for any random variable X having P[X ∈ I] = 1.
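A quick Monte Carlo illustration (ours, not from the text) with the convex function g(x) = x² and X ∼ Exp(1/2), so E{X} = 2: here E{X²} = var{X} + (E{X})² = 8, strictly larger than g(E{X}) = 4.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=200_000)  # E{X} = 2

lhs = np.mean(x**2)   # estimates E{g(X)} = 8
rhs = np.mean(x)**2   # estimates g(E{X}) = 4
print(lhs, rhs)       # about 8 versus about 4, consistent with Jensen's inequality
```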
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:
n! = n(n-1)(n-2) \cdots (3)(2)(1), \quad \text{with } 0! = 1    (1.13)

\binom{n}{k} = \frac{n!}{k!\,(n-k)!}    (1.14)
TABLE 1.1  Notation and description for common probability distributions of discrete random variables

Bernoulli
  Notation and parameter space:  X ∼ Bernoulli(p), 0 ≤ p ≤ 1
  Density and sample space:      f(x) = p^x (1 − p)^{1−x}, x = 0 or 1
  Mean and variance:             E{X} = p, var{X} = p(1 − p)

Binomial
  Notation and parameter space:  X ∼ Bin(n, p), 0 ≤ p ≤ 1, n ∈ {1, 2, ...}
  Density and sample space:      f(x) = \binom{n}{x} p^x (1 − p)^{n−x}, x = 0, 1, ..., n
  Mean and variance:             E{X} = np, var{X} = np(1 − p)

Multinomial
  Notation and parameter space:  X ∼ Multinomial(n, p), p = (p1, ..., pk), 0 ≤ pi ≤ 1, Σ_{i=1}^{k} pi = 1, n ∈ {1, 2, ...}
  Density and sample space:      f(x) = \binom{n}{x_1 \cdots x_k} \prod_{i=1}^{k} p_i^{x_i}, x = (x1, ..., xk), xi ∈ {0, 1, ..., n}, Σ_{i=1}^{k} xi = n
  Mean and variance:             E{X} = np, var{Xi} = npi(1 − pi), cov{Xi, Xj} = −npipj

Negative Binomial
  Notation and parameter space:  X ∼ NegBin(r, p), 0 ≤ p ≤ 1, r ∈ {1, 2, ...}
  Density and sample space:      f(x) = \binom{x + r - 1}{r - 1} p^r (1 − p)^x, x ∈ {0, 1, ...}
  Mean and variance:             E{X} = r(1 − p)/p, var{X} = r(1 − p)/p²

Poisson
  Notation and parameter space:  X ∼ Poisson(λ), λ > 0
  Density and sample space:      f(x) = (λ^x / x!) exp{−λ}, x ∈ {0, 1, ...}
  Mean and variance:             E{X} = λ, var{X} = λ
TABLE 1.2  Notation and description for some common probability distributions of continuous random variables

Beta
  Notation and parameter space:  X ∼ Beta(α, β), α > 0 and β > 0
  Density and sample space:      f(x) = [Γ(α + β)/(Γ(α) Γ(β))] x^{α−1} (1 − x)^{β−1}, 0 ≤ x ≤ 1
  Mean and variance:             E{X} = α/(α + β), var{X} = αβ/[(α + β)²(α + β + 1)]

Cauchy
  Notation and parameter space:  X ∼ Cauchy(α, β), α ∈ ℝ and β > 0
  Density and sample space:      f(x) = 1/(πβ[1 + ((x − α)/β)²]), x ∈ ℝ
  Mean and variance:             E{X} is nonexistent, var{X} is nonexistent

Chi-square
  Notation and parameter space:  X ∼ χ²_ν, ν > 0
  Density and sample space:      f(x) is the Gamma(ν/2, 1/2) density, x > 0
  Mean and variance:             E{X} = ν, var{X} = 2ν

Dirichlet
  Notation and parameter space:  X ∼ Dirichlet(α), α = (α1, ..., αk), αi > 0, α0 = Σ_{i=1}^{k} αi
  Density and sample space:      f(x) = Γ(α0) ∏_{i=1}^{k} x_i^{αi−1} / ∏_{i=1}^{k} Γ(αi), x = (x1, ..., xk), 0 ≤ xi ≤ 1, Σ_{i=1}^{k} xi = 1
  Mean and variance:             E{X} = α/α0, var{Xi} = αi(α0 − αi)/[α0²(α0 + 1)], cov{Xi, Xj} = −αiαj/[α0²(α0 + 1)]

Exponential
  Notation and parameter space:  X ∼ Exp(λ), λ > 0
  Density and sample space:      f(x) = λ exp{−λx}, x > 0
  Mean and variance:             E{X} = 1/λ, var{X} = 1/λ²

Gamma
  Notation and parameter space:  X ∼ Gamma(r, λ), λ > 0 and r > 0
  Density and sample space:      f(x) = λ^r x^{r−1} exp{−λx}/Γ(r), x > 0
  Mean and variance:             E{X} = r/λ, var{X} = r/λ²
TABLE 1.3  Notation and description for more common probability distributions of continuous random variables

Lognormal
  Notation and parameter space:  X ∼ Lognormal(μ, σ²), μ ∈ ℝ and σ > 0
  Density and sample space:      f(x) = [1/(x √(2πσ²))] exp{−(1/2)((log x − μ)/σ)²}, x > 0
  Mean and variance:             E{X} = exp{μ + σ²/2}, var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}

Multivariate Normal
  Notation and parameter space:  X ∼ N_k(μ, Σ), μ = (μ1, ..., μk) ∈ ℝ^k, Σ positive definite
  Density and sample space:      f(x) = exp{−(x − μ)^T Σ^{−1} (x − μ)/2} / [(2π)^{k/2} |Σ|^{1/2}], x = (x1, ..., xk) ∈ ℝ^k
  Mean and variance:             E{X} = μ, var{X} = Σ

Normal
  Notation and parameter space:  X ∼ N(μ, σ²), μ ∈ ℝ and σ > 0
  Density and sample space:      f(x) = [1/√(2πσ²)] exp{−(1/2)((x − μ)/σ)²}, x ∈ ℝ
  Mean and variance:             E{X} = μ, var{X} = σ²

Student's t
  Notation and parameter space:  X ∼ t_ν, ν > 0
  Density and sample space:      f(x) = [Γ((ν + 1)/2)/(Γ(ν/2) √(πν))] (1 + x²/ν)^{−(ν+1)/2}, x ∈ ℝ
  Mean and variance:             E{X} = 0 if ν > 1; var{X} = ν/(ν − 2) if ν > 2

Uniform
  Notation and parameter space:  X ∼ Unif(a, b), a, b ∈ ℝ and a < b
  Density and sample space:      f(x) = 1/(b − a), x ∈ [a, b]
  Mean and variance:             E{X} = (a + b)/2, var{X} = (b − a)²/12

Weibull
  Notation and parameter space:  X ∼ Weibull(a, b), a > 0 and b > 0
  Density and sample space:      f(x) = a b x^{b−1} exp{−a x^b}, x > 0
  Mean and variance:             E{X} = Γ(1 + 1/b)/a^{1/b}, var{X} = [Γ(1 + 2/b) − Γ(1 + 1/b)²]/a^{2/b}
\binom{n}{k_1, \ldots, k_m} = \frac{n!}{\prod_{i=1}^{m} k_i!}, \quad \text{where } n = \sum_{i=1}^{m} k_i    (1.15)
\Gamma(r) = \begin{cases} (r-1)! & \text{for } r = 1, 2, \ldots \\ \int_0^\infty t^{r-1} \exp\{-t\}\,dt & \text{for general } r > 0 \end{cases}    (1.16)
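The integer and half-integer values of the gamma function in (1.16) are easy to verify with Python's standard library (our check, not from the book):

```python
import math

# Gamma interpolates the factorial: Gamma(r) = (r-1)! at positive integers r
for r in range(1, 8):
    assert math.isclose(math.gamma(r), math.factorial(r - 1))

# Gamma(1/2) = sqrt(pi), so squaring it recovers pi
print(math.gamma(0.5) ** 2)  # approximately 3.141592653589793
```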
It is worth knowing that

\Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi} \quad \text{and} \quad \Gamma\left(n + \tfrac{1}{2}\right) = \frac{1 \times 3 \times 5 \times \cdots \times (2n-1)}{2^n}\sqrt{\pi}
for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as
f(x|\gamma) = c_1(x)\, c_2(\gamma) \exp\left\{ \sum_{i=1}^{k} y_i(x)\, \theta_i(\gamma) \right\}    (1.17)
for nonnegative functions c1 and c2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θi(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The yi(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that
E\{y(X)\} = \kappa'(\theta)    (1.18)

and

var\{y(X)\} = \kappa''(\theta)    (1.19)
where κ(θ) = −log c3(θ), letting c3(θ) denote the reexpression of c2(γ) in terms of the canonical parameters θ = (θ1, ..., θk), and y(X) = (y1(X), ..., yk(X)). These results can be rewritten in terms of the original parameters γ as
E\left\{ \sum_{i=1}^{k} \frac{d\theta_i(\gamma)}{d\gamma_j}\, y_i(X) \right\} = -\frac{d}{d\gamma_j} \log c_2(\gamma)    (1.20)
and
var\left\{ \sum_{i=1}^{k} \frac{d\theta_i(\gamma)}{d\gamma_j}\, y_i(X) \right\} = -\frac{d^2}{d\gamma_j^2} \log c_2(\gamma) - E\left\{ \sum_{i=1}^{k} \frac{d^2\theta_i(\gamma)}{d\gamma_j^2}\, y_i(X) \right\}    (1.21)
Example 1.1 (Poisson)  The Poisson distribution belongs to the exponential family, with c1(x) = 1/x!, c2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ′(θ) = exp{θ} = λ and var{X} = κ″(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
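The conclusion of Example 1.1 can be checked by simulation (our sketch, with an arbitrary λ): the sample mean and sample variance of Poisson draws should both be near λ = exp{θ}.

```python
import numpy as np

rng = np.random.default_rng(7)
lam = 3.5
x = rng.poisson(lam, size=500_000)  # X ~ Poisson(lam), theta = log(lam)

# E{X} = var{X} = lam, as derived from kappa(theta) = exp(theta)
print(x.mean(), x.var())  # both close to 3.5
```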
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X1, ..., Xp) denote a p-dimensional random variable with continuous density function f. Suppose that
U = g(X) = (g_1(X), \ldots, g_p(X)) = (U_1, \ldots, U_p)    (1.22)
where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is
f(u) = f(g^{-1}(u))\, |J(u)|    (1.23)
where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g⁻¹ evaluated at u, having (i, j)th element dx_i/du_j, where these derivatives are assumed to be continuous over the support region of U.
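As a one-dimensional illustration of (1.23) (ours, not from the book): if X ∼ Exp(1) and U = X², then g⁻¹(u) = √u, dx/du = 1/(2√u), and so f_U(u) = exp{−√u}/(2√u) for u > 0. Simulation agrees with this formula:

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.exponential(1.0, size=400_000) ** 2  # U = X**2 with X ~ Exp(1)

# CDF check: P(U <= 1) = P(X <= 1) = 1 - exp(-1)
print(np.mean(u <= 1.0), 1 - np.exp(-1))         # both about 0.632

# Density check near u = 1: f_U(1) = exp(-1)/2 from (1.23)
dens_hat = np.mean((u > 0.9) & (u < 1.1)) / 0.2  # crude histogram estimate
print(dens_hat, np.exp(-1.0) / 2)                # both about 0.184
```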
1.4 LIKELIHOOD INFERENCE
If X1, ..., Xn are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ1, ..., θp), then the joint likelihood function is
L(\theta) = \prod_{i=1}^{n} f(x_i|\theta)    (1.24)
When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x1, ..., xn|θ) viewed as a function of θ.
The observed data x1, ..., xn might have been realized under many different values for θ. The parameters for which observing x1, ..., xn would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x1, ..., xn that maximizes L(θ), then θ̂ = ϑ(X1, ..., Xn) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.
It is typically easier to work with the log likelihood function,

l(\theta) = \log L(\theta)    (1.25)
which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x1, ..., xn but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations
l'(\theta) = 0    (1.26)
where
l'(\theta) = \left( \frac{dl(\theta)}{d\theta_1}, \ldots, \frac{dl(\theta)}{d\theta_p} \right)
is called the score function. The score function satisfies
E\{l'(\theta)\} = 0    (1.27)
where the expectation is taken with respect to the distribution of X1, ..., Xn. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.
The MLE has a sampling distribution because it depends on the realization of the random variables X1, ..., Xn. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: When the log likelihood is very pointy, the location of the maximum is more precisely known.
To make this precise, let l″(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθi dθj). The Fisher information matrix is defined as
I(\theta) = E\{l'(\theta)\, l'(\theta)^T\} = -E\{l''(\theta)\}    (1.28)
where the expectations are taken with respect to the distribution of X1, ..., Xn. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l″(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.
Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)⁻¹, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is N_p(θ*, I(θ*)⁻¹). Since the true parameter values are unknown, I(θ*)⁻¹ must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)⁻¹. Alternatively, it is also reasonable to use −l″(θ̂)⁻¹. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)⁻¹. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)⁻¹ can be found in [127, 182, 371, 470].
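As a small worked sketch (ours, not the book's R code), consider i.i.d. Exp(λ) data, for which l(λ) = n log λ − λ Σx_i. The score equation gives the closed-form MLE λ̂ = 1/x̄, and the observed Fisher information −l″(λ̂) = n/λ̂² yields the standard error λ̂/√n:

```python
import numpy as np

rng = np.random.default_rng(3)
true_lam = 2.0
x = rng.exponential(scale=1 / true_lam, size=2_000)

lam_hat = 1 / x.mean()          # solves l'(lambda) = n/lambda - sum(x) = 0
obs_info = x.size / lam_hat**2  # observed Fisher information, -l''(lam_hat)
se = np.sqrt(1 / obs_info)      # estimated standard error = lam_hat / sqrt(n)
print(lam_hat, se)              # lam_hat near 2.0; se near 0.045
```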
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is
L(\phi\,|\,\hat{\mu}(\phi)) = \max_{\mu} L(\mu, \phi)    (1.29)
Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
PREFACE
This book covers most topics needed to develop a broad and thorough working knowledge of modern computational statistics. We seek to develop a practical understanding of how and why existing methods work, enabling readers to use modern statistical methods effectively. Since many new methods are built from components of existing techniques, our ultimate goal is to provide scientists with the tools they need to contribute new ideas to the field.
A growing challenge in science is that there is so much of it. While the pursuit of important new methods and the honing of existing approaches is a worthy goal, there is also a need to organize and distill the teeming jungle of ideas. We attempt to do that here. Our choice of topics reflects our view of what constitutes the core of the evolving field of computational statistics, and what will be interesting and useful for our readers.
Our use of the adjective modern in the first sentence of this preface is potentially troublesome. There is no way that this book can cover all the latest, greatest techniques. We have not even tried. We have instead aimed to provide a reasonably up-to-date survey of a broad portion of the field, while leaving room for diversions and esoterica.
The foundations of optimization and numerical integration are covered in this book. We include these venerable topics because (i) they are cornerstones of frequentist and Bayesian inference; (ii) routine application of available software often fails for hard problems; and (iii) the methods themselves are often secondary components of other statistical computing algorithms. Some topics we have omitted represent important areas of past and present research in the field, but their priority here is lowered by the availability of high-quality software. For example, the generation of pseudo-random numbers is a classic topic, but one that we prefer to address by giving students reliable software. Finally, some topics (e.g., principal curves and tabu search) are included simply because they are interesting and provide very different perspectives on familiar problems. Perhaps a future researcher may draw ideas from such topics to design a creative and effective new algorithm.
In this second edition we have both updated and broadened our coverage, and we now provide computer code. For example, we have added new MCMC topics to reflect continued activity in that popular area. A notable increase in breadth is our inclusion of more methods relevant for problems where statistical dependency is important, such as block bootstrapping and sequential importance sampling. This second edition provides extensive new support in R. Specifically, code for the examples in this book is available from the book website, www.stat.colostate.edu/computationalstatistics.
Our target audience includes graduate students in statistics and related fields, statisticians, and quantitative empirical scientists in other fields. We hope such readers may use the book when applying standard methods and developing new methods.
The level of mathematics expected of the reader does not extend much beyond Taylor series and linear algebra. Breadth of mathematical training is more helpful than depth. Essential review is provided in Chapter 1. More advanced readers will find greater mathematical detail in the wide variety of high-quality books available on specific topics, many of which are referenced in the text. Other readers, caring less about analytical details, may prefer to focus on our descriptions of algorithms and examples.
The expected level of statistics is equivalent to that obtained by a graduate student in his or her first year of study of the theory of statistics and probability. An understanding of maximum likelihood methods, Bayesian methods, elementary asymptotic theory, Markov chains, and linear models is most important. Many of these topics are reviewed in Chapter 1.
With respect to computer programming, we find that good students can learn as they go. However, a working knowledge of a suitable language allows implementation of the ideas covered in this book to progress much more quickly. We have chosen to forgo any language-specific examples, algorithms, or coding in the text. For those wishing to learn a language while they study this book, we recommend that you choose a high-level, interactive package that permits the flexible design of graphical displays and includes supporting statistics and probability functions, such as R and MATLAB.¹ These are the sort of languages often used by researchers during the development of new statistical computing techniques, and they are suitable for implementing all the methods we describe, except in some cases for problems of vast scope or complexity. We use R and recommend it. Although lower-level languages such as C++ could also be used, they are more appropriate for professional-grade implementation of algorithms after researchers have refined the methodology.
The book is organized into four major parts: optimization (Chapters 2, 3, and 4), integration and simulation (Chapters 5, 6, 7, and 8), bootstrapping (Chapter 9), and density estimation and smoothing (Chapters 10, 11, and 12). The chapters are written to stand independently, so a course can be built by selecting the topics one wishes to teach. For a one-semester course, our selection typically weights most heavily topics from Chapters 2, 3, 6, 7, 9, 10, and 11. With a leisurely pace or more thorough coverage, a shorter list of topics could still easily fill a semester course. There is sufficient material here to provide a thorough one-year course of study, notwithstanding any supplemental topics one might wish to teach.
A variety of homework problems are included at the end of each chapter. Some are straightforward, while others require the student to develop a thorough understanding of the model/method being used, to carefully (and perhaps cleverly) code a suitable technique, and to devote considerable attention to the interpretation of results. A few exercises invite open-ended exploration of methods and ideas. We are sometimes asked for solutions to the exercises, but we prefer to sequester them to preserve the challenge for future students and readers.
The datasets discussed in the examples and exercises are available from the book website, www.stat.colostate.edu/computationalstatistics. The R code is also provided there. Finally, the website includes an errata. Responsibility for all errors lies with us.
¹ R is available for free from www.r-project.org. Information about MATLAB can be found at www.mathworks.com.
ACKNOWLEDGMENTS
The course upon which this book is based was developed and taught by us at Colorado State University from 1994 onwards. Thanks are due to our many students, who have been semiwilling guinea pigs over the years. We also thank our colleagues in the Statistics Department for their continued support. The late Richard Tweedie merits particular acknowledgment for his mentoring during the early years of our careers.
We owe a great deal of intellectual debt to Adrian Raftery, who deserves special thanks not only for his teaching and advising, but also for his unwavering support and his seemingly inexhaustible supply of good ideas. In addition, we thank our influential advisors and teachers at the University of Washington Statistics Department, including David Madigan, Werner Stuetzle, and Judy Zeh. Of course, each of our chapters could be expanded into a full-length book, and great scholars have already done so. We owe much to their efforts, upon which we relied when developing our course and our manuscript.
Portions of the first edition were written at the Department of Mathematics and Statistics, University of Otago, in Dunedin, New Zealand, whose faculty we thank for graciously hosting us during our sabbatical in 2003. Much of our work on the second edition was undertaken during our sabbatical visit to the Australia Commonwealth Scientific and Research Organization in 2009–2010, sponsored by CSIRO Mathematics, Informatics and Statistics, and hosted at the Longpocket Laboratory in Indooroopilly, Australia. We thank our hosts and colleagues there for their support.
Our manuscript has been greatly improved through the constructive reviews of John Bickham, Ben Bird, Kate Cowles, Jan Hannig, Alan Herlihy, David Hunter, Devin Johnson, Michael Newton, Doug Nychka, Steve Sain, David W. Scott, N. Scott Urquhart, Haonan Wang, Darrell Whitley, and eight anonymous referees. We also thank the sharp-eyed readers listed in the errata for their suggestions and corrections. Our editor Steve Quigley and the folks at Wiley were supportive and helpful during the publication process. We thank Nelida Pohl for permission to adapt her photograph in the cover design of the first edition. We thank Melinda Stelzer for permission to use her painting "Champagne Circuit," 2001, for the cover of the second edition. More about her art can be found at www.facebook.com/geekchicart. We also owe a special note of thanks to Zube (a.k.a. John Dzubera), who kept our own computers running despite our best efforts to the contrary.
Funding from National Science Foundation (NSF) CAREER grant SBR-9875508 was a significant source of support for the first author during the preparation of the first edition. He also thanks his colleagues and friends in the North Slope Borough, Alaska, Department of Wildlife Management for their longtime research support. The second author gratefully acknowledges the support of STAR Research Assistance Agreement CR-829095, awarded to Colorado State University by the U.S. Environmental Protection Agency (EPA). The views expressed here are solely those of the authors. NSF and EPA do not endorse any products or commercial services mentioned herein.
Finally, we thank our parents for enabling and supporting our educations, and for providing us with the "stubbornness gene" necessary for graduate school, the tenure track, or book publication—take your pick. The second edition is dedicated to our kids, Natalie and Neil, for continuing to show us what is important and what is not.
Geof H. Givens
Jennifer A. Hoeting
CHAPTER 1REVIEW
This chapter reviews notation and background material in mathematics probabilityand statistics Readers may wish to skip this chapter and turn directly to Chapter 2returning here only as needed
11 MATHEMATICAL NOTATION
We use boldface to distinguish a vector x = (x1 xp) or a matrix M from a scalarvariable x or a constant M A vector-valued function f evaluated at x is also boldfacedas in f(x) = (f1(x) fp(x)) The transpose of M is denoted MT
Unless otherwise specified all vectors are considered to be column vectors sofor example an n times p matrix can be written as M = (x1 xn)T Let I denote anidentity matrix and 1 and 0 denote vectors of ones and zeros respectively
A symmetric square matrix M is positive definite if xTMx gt 0 for all nonzerovectors x Positive definiteness is equivalent to the condition that all eigenvalues ofM are positive M is nonnegative definite or positive semidefinite if xTMx ge 0 for allnonzero vectors x
The derivative of a function f evaluated at x is denoted f prime(x) When x =(x1 xp) the gradient of f at x is
f prime(x) =(
df (x)
dx1
df (x)
dxp
)
The Hessian matrix for f at x is f primeprime(x) having (i j)th element equal to d2f (x)(dxi dxj) The negative Hessian has important uses in statistical inference
Let J(x) denote the Jacobian matrix evaluated at x for the one-to-one mappingy = f(x) The (i j)th element of J(x) is equal to dfi(x)dxj
A functional is a real-valued function on a space of functions For example ifT (f ) = int 1
0 f (x) dx then the functional T maps suitably integrable functions onto thereal line
The indicator function 1A equals 1 if A is true and 0 otherwise The real lineis denoted and p-dimensional real space is p
Computational Statistics Second Edition Geof H Givens and Jennifer A Hoetingcopy 2013 John Wiley amp Sons Inc Published 2013 by John Wiley amp Sons Inc
1
2 CHAPTER 1 REVIEW
12 TAYLORrsquoS THEOREM AND MATHEMATICALLIMIT THEORY
First we define standard ldquobig ohrdquo and ldquolittle ohrdquo notation for describing the relativeorders of convergence of functions Let the functions f and g be defined on a commonpossibly infinite interval Let z0 be a point in this interval or a boundary point of it(ie minusinfin or infin) We require g(z) = 0 for all z = z0 in a neighborhood of z0 Thenwe say
f (z) = O(g(z)) (11)
if there exists a constant M such that |f (z)| le M|g(z)| as z rarr z0 For example(n + 1)(3n2) = O(nminus1) and it is understood that we are considering n rarr infin Iflimzrarrz0 f (z)g(z) = 0 then we say
f (z) = O(g(z)) (12)
For example f (x0 + h) minus f (x0) = hf prime(x0) + O(h) as h rarr 0 if f is differentiable atx0 The same notation can be used for describing the convergence of a sequence xnas n rarr infin by letting f (n) = xn
Taylorrsquos theorem provides a polynomial approximation to a function f Supposef has finite (n + 1)th derivative on (a b) and continuous nth derivative on [a b] Thenfor any x0 isin [a b] distinct from x the Taylor series expansion of f about x0 is
f (x) =nsum
i=0
1
if (i)(x0)(x minus x0)i + Rn (13)
where f (i)(x0) is the ith derivative of f evaluated at x0 and
Rn = 1
(n + 1)f (n+1)(ξ)(x minus x0)n+1 (14)
for some point ξ in the interval between x and x0 As |x minus x0| rarr 0 note that Rn =O(|x minus x0|n+1)
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x0 ≠ x. Then

f(x) = f(x0) + Σ_{i=1}^{n} (1/i!) D^(i)(f; x0, x − x0) + Rn,    (1.5)

where

D^(i)(f; x, y) = Σ_{j1=1}^{p} ··· Σ_{ji=1}^{p} ( d^i f(t) / (dt_{j1} ··· dt_{ji}) |_{t=x} ) Π_{k=1}^{i} y_{jk}    (1.6)

and

Rn = (1/(n + 1)!) D^(n+1)(f; ξ, x − x0)    (1.7)

for some ξ on the line segment joining x and x0. As |x − x0| → 0, note that Rn = O(|x − x0|^(n+1)).
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

∫₀¹ f(x) dx = (f(0) + f(1))/2 − Σ_{i=1}^{n−1} b_{2i} (f^(2i−1)(1) − f^(2i−1)(0)) / (2i)! − b_{2n} f^(2n)(ξ) / (2n)!,    (1.8)

where 0 ≤ ξ ≤ 1, f^(j) is the jth derivative of f, and bj = Bj(0) can be determined using the recursion relation

Σ_{j=0}^{m} C(m + 1, j) Bj(z) = (m + 1) z^m,    (1.9)

initialized with B0(z) = 1, where C(m + 1, j) denotes a binomial coefficient as in (1.14). The proof of this result is based on repeated integrations by parts [376].
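The recursion (1.9) is straightforward to implement. A sketch of ours using exact rational arithmetic, which also checks the n = 1 case of (1.8): there the sum is empty, so the trapezoid value (f(0) + f(1))/2 misses the integral by exactly b2 f″(ξ)/2!, which for f(x) = x² (with f″ ≡ 2) is 1/6.

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(m):
    # b_j = B_j(0) from (1.9) at z = 0: sum_{j=0}^{k} C(k+1, j) B_j(0) = 0 for k >= 1
    b = [Fraction(1)]                                  # B_0(0) = 1
    for k in range(1, m + 1):
        s = sum(comb(k + 1, j) * b[j] for j in range(k))
        b.append(-s / Fraction(k + 1))
    return b

b = bernoulli_numbers(4)
print(b)  # b_1 = -1/2, b_2 = 1/6, b_3 = 0, b_4 = -1/30

# n = 1 case of (1.8) for f(x) = x^2 on [0, 1]
trapezoid = Fraction(0**2 + 1**2, 2)                   # (f(0) + f(1)) / 2
exact = Fraction(1, 3)                                 # integral of x^2 on [0, 1]
assert trapezoid - exact == b[2] * 2 / Fraction(2)     # b_2 * f'' / 2!
```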
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

df(x)/dxi ≈ (f(x + εi ei) − f(x − εi ei)) / (2εi),    (1.10)

where εi is a small number and ei is the unit vector in the ith coordinate direction. Typically, one might start with, say, εi = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller εi. The approximation will generally improve until εi becomes small enough that the calculation is degraded, and eventually dominated, by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via

d²f(x)/(dxi dxj) ≈ (1/(4εi εj)) (f(x + εi ei + εj ej) − f(x + εi ei − εj ej) − f(x − εi ei + εj ej) + f(x − εi ei − εj ej)),    (1.11)

with similar sequential precision improvements.
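The improve-then-degrade behavior just described is easy to observe. An illustrative sketch of ours (the test function and step sizes are arbitrary choices) applying (1.10) to f = sin at x = 1, where the exact derivative is cos(1):

```python
import math

def central_diff(f, x, eps):
    # two-point central difference approximation (1.10), one dimension
    return (f(x + eps) - f(x - eps)) / (2 * eps)

exact = math.cos(1.0)
errors = {k: abs(central_diff(math.sin, 1.0, 10.0**-k) - exact)
          for k in range(1, 13)}
best = min(errors, key=errors.get)
# truncation error (~ eps^2) shrinks at first; subtractive cancellation
# (~ machine epsilon / eps) takes over once eps is too small
assert errors[1] > errors[best] < errors[12]
print(best, errors[best])  # the sweet spot is typically near eps = 1e-5 or 1e-6
```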
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS

We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ~ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters also will be denoted with a conditioning bar, as in f(x|α, β). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|α) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example, by using f(·|α) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are fX and fY, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.

The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or fX|Y(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|μ) = f(x|y, μ) f(y|μ). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.

The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X, or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^(1/2)/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so that

g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y)    (1.12)

for all x, y ∈ I and all 0 < λ < 1. Then Jensen's inequality states that E{g(X)} ≥ g(E{X}) for any random variable X having P[X ∈ I] = 1.
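A quick Monte Carlo illustration of ours: with the convex choice g(x) = x² and X ~ Exp(1), Jensen's inequality asserts E{X²} ≥ (E{X})², and indeed the two sides are about 2 and 1.

```python
import random

random.seed(7)
xs = [random.expovariate(1.0) for _ in range(100_000)]   # X ~ Exp(1)
mean_x = sum(xs) / len(xs)
mean_gx = sum(x * x for x in xs) / len(xs)               # E{g(X)} with g(x) = x^2
assert mean_gx >= mean_x**2                              # Jensen: E{g(X)} >= g(E{X})
print(mean_x**2, mean_gx)                                # roughly 1 and 2
```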
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

n! = n(n − 1)(n − 2) ··· (3)(2)(1), with 0! = 1,    (1.13)

C(n, k) = n! / (k! (n − k)!),    (1.14)
TABLE 1.1 Notation and description for common probability distributions of discrete random variables

Bernoulli: X ~ Bernoulli(p), 0 ≤ p ≤ 1.
  Density: f(x) = p^x (1 − p)^(1−x), x = 0 or 1.
  Mean and variance: E{X} = p, var{X} = p(1 − p).

Binomial: X ~ Bin(n, p), 0 ≤ p ≤ 1, n ∈ {1, 2, ...}.
  Density: f(x) = C(n, x) p^x (1 − p)^(n−x), x = 0, 1, ..., n.
  Mean and variance: E{X} = np, var{X} = np(1 − p).

Multinomial: X ~ Multinomial(n, p), p = (p1, ..., pk), 0 ≤ pi ≤ 1, Σ_{i=1}^{k} pi = 1, n ∈ {1, 2, ...}.
  Density: f(x) = C(n; x1, ..., xk) Π_{i=1}^{k} pi^{xi}, x = (x1, ..., xk), xi ∈ {0, 1, ..., n}, Σ_{i=1}^{k} xi = n.
  Mean and variance: E{X} = np, var{Xi} = npi(1 − pi), cov{Xi, Xj} = −npipj.

Negative Binomial: X ~ NegBin(r, p), 0 ≤ p ≤ 1, r ∈ {1, 2, ...}.
  Density: f(x) = C(x + r − 1, r − 1) p^r (1 − p)^x, x ∈ {0, 1, ...}.
  Mean and variance: E{X} = r(1 − p)/p, var{X} = r(1 − p)/p².

Poisson: X ~ Poisson(λ), λ > 0.
  Density: f(x) = λ^x exp{−λ}/x!, x ∈ {0, 1, ...}.
  Mean and variance: E{X} = λ, var{X} = λ.
TABLE 1.2 Notation and description for some common probability distributions of continuous random variables

Beta: X ~ Beta(α, β), α > 0 and β > 0.
  Density: f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^(α−1) (1 − x)^(β−1), 0 ≤ x ≤ 1.
  Mean and variance: E{X} = α/(α + β), var{X} = αβ/((α + β)²(α + β + 1)).

Cauchy: X ~ Cauchy(α, β), α ∈ ℝ and β > 0.
  Density: f(x) = 1/(πβ[1 + ((x − α)/β)²]), x ∈ ℝ.
  Mean and variance: E{X} is nonexistent, var{X} is nonexistent.

Chi-square: X ~ χ²_ν, ν > 0.
  Density: that of Gamma(ν/2, 1/2), x > 0.
  Mean and variance: E{X} = ν, var{X} = 2ν.

Dirichlet: X ~ Dirichlet(α), α = (α1, ..., αk), αi > 0, α0 = Σ_{i=1}^{k} αi.
  Density: f(x) = (Γ(α0)/Π_{i=1}^{k} Γ(αi)) Π_{i=1}^{k} xi^(αi−1), x = (x1, ..., xk), 0 ≤ xi ≤ 1, Σ_{i=1}^{k} xi = 1.
  Mean and variance: E{X} = α/α0, var{Xi} = αi(α0 − αi)/(α0²(α0 + 1)), cov{Xi, Xj} = −αiαj/(α0²(α0 + 1)).

Exponential: X ~ Exp(λ), λ > 0.
  Density: f(x) = λ exp{−λx}, x > 0.
  Mean and variance: E{X} = 1/λ, var{X} = 1/λ².

Gamma: X ~ Gamma(r, λ), λ > 0 and r > 0.
  Density: f(x) = (λ^r x^(r−1)/Γ(r)) exp{−λx}, x > 0.
  Mean and variance: E{X} = r/λ, var{X} = r/λ².
TABLE 1.3 Notation and description for more common probability distributions of continuous random variables

Lognormal: X ~ Lognormal(μ, σ²), μ ∈ ℝ and σ > 0.
  Density: f(x) = (1/(x √(2πσ²))) exp{−(log x − μ)²/(2σ²)}, x > 0.
  Mean and variance: E{X} = exp{μ + σ²/2}, var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}.

Multivariate Normal: X ~ Nk(μ, Σ), μ = (μ1, ..., μk) ∈ ℝ^k, Σ positive definite.
  Density: f(x) = exp{−(x − μ)ᵀ Σ⁻¹ (x − μ)/2} / ((2π)^(k/2) |Σ|^(1/2)), x = (x1, ..., xk) ∈ ℝ^k.
  Mean and variance: E{X} = μ, var{X} = Σ.

Normal: X ~ N(μ, σ²), μ ∈ ℝ and σ > 0.
  Density: f(x) = (1/√(2πσ²)) exp{−(x − μ)²/(2σ²)}, x ∈ ℝ.
  Mean and variance: E{X} = μ, var{X} = σ².

Student's t: X ~ t_ν, ν > 0.
  Density: f(x) = (Γ((ν + 1)/2)/(Γ(ν/2) √(πν))) (1 + x²/ν)^(−(ν+1)/2), x ∈ ℝ.
  Mean and variance: E{X} = 0 if ν > 1, var{X} = ν/(ν − 2) if ν > 2.

Uniform: X ~ Unif(a, b), a, b ∈ ℝ and a < b.
  Density: f(x) = 1/(b − a), x ∈ [a, b].
  Mean and variance: E{X} = (a + b)/2, var{X} = (b − a)²/12.

Weibull: X ~ Weibull(a, b), a > 0 and b > 0.
  Density: f(x) = abx^(b−1) exp{−ax^b}, x > 0.
  Mean and variance: E{X} = Γ(1 + 1/b)/a^(1/b), var{X} = (Γ(1 + 2/b) − Γ(1 + 1/b)²)/a^(2/b).
C(n; k1, ..., km) = n! / (Π_{i=1}^{m} ki!), where n = Σ_{i=1}^{m} ki,    (1.15)

Γ(r) = (r − 1)! for r = 1, 2, ..., and Γ(r) = ∫₀^∞ t^(r−1) exp{−t} dt for general r > 0.    (1.16)
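Both cases of (1.16) can be verified numerically. A small sketch of ours using Python's math.gamma and a crude midpoint Riemann sum of the integral (truncating at t = 50, an assumption that the tail is negligible):

```python
import math

# integer case of (1.16): Gamma(r) = (r - 1)!
for r in range(1, 10):
    assert math.gamma(r) == math.factorial(r - 1)

# general case: midpoint Riemann sum of the integral for r = 2.5
r, dt = 2.5, 1e-3
approx = sum(((i + 0.5) * dt)**(r - 1) * math.exp(-(i + 0.5) * dt) * dt
             for i in range(int(50 / dt)))
assert math.isclose(approx, math.gamma(r), rel_tol=1e-4)
print(math.gamma(r))  # about 1.3293
```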
It is worth knowing that Γ(1/2) = √π and Γ(n + 1/2) = (1 × 3 × 5 × ··· × (2n − 1)) √π / 2^n for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as

f(x|γ) = c1(x) c2(γ) exp{Σ_{i=1}^{k} yi(x) θi(γ)}    (1.17)
for nonnegative functions c1 and c2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θi(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The yi(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that

E{y(X)} = κ′(θ)    (1.18)

and

var{y(X)} = κ″(θ),    (1.19)

where κ(θ) = −log c3(θ), letting c3(θ) denote the reexpression of c2(γ) in terms of the canonical parameters θ = (θ1, ..., θk), and y(X) = (y1(X), ..., yk(X)). These results can be rewritten in terms of the original parameters γ as

E{Σ_{i=1}^{k} (dθi(γ)/dγj) yi(X)} = −(d/dγj) log c2(γ)    (1.20)

and

var{Σ_{i=1}^{k} (dθi(γ)/dγj) yi(X)} = −(d²/dγj²) log c2(γ) − E{Σ_{i=1}^{k} (d²θi(γ)/dγj²) yi(X)}.    (1.21)
Example 1.1 (Poisson) The Poisson distribution belongs to the exponential family, with c1(x) = 1/x!, c2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ′(θ) = exp{θ} = λ and var{X} = κ″(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
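The claims in this example are easy to confirm by brute force. A sketch of ours that computes the mean and variance directly from the Poisson mass function (truncating the sum at x = 60, which is harmless for λ = 3):

```python
import math

lam = 3.0
pmf = lambda x: lam**x * math.exp(-lam) / math.factorial(x)  # Poisson(lambda) mass
mean = sum(x * pmf(x) for x in range(60))
second = sum(x * x * pmf(x) for x in range(60))
var = second - mean**2
# kappa'(theta) = kappa''(theta) = e^theta = lambda, matching (1.18)-(1.19)
assert math.isclose(mean, lam) and math.isclose(var, lam)
print(mean, var)
```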
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X1, ..., Xp) denote a p-dimensional random variable with continuous density function f. Suppose that

U = g(X) = (g1(X), ..., gp(X)) = (U1, ..., Up),    (1.22)

where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

f(u) = f(g⁻¹(u)) |J(u)|,    (1.23)

where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g⁻¹ evaluated at u, having (i, j)th element dxi/duj, where these derivatives are assumed to be continuous over the support region of U.
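A one-dimensional illustration of (1.23), ours: take X ~ N(0, 1) and U = g(X) = exp(X), so g⁻¹(u) = log u with Jacobian dx/du = 1/u. Then (1.23) yields the Lognormal(0, 1) density, which we can cross-check against a numerical derivative of the cdf FU(u) = Φ(log u).

```python
import math

phi = lambda z: math.exp(-z * z / 2) / math.sqrt(2 * math.pi)  # N(0,1) density
Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))         # N(0,1) cdf

f_U = lambda u: phi(math.log(u)) / u   # (1.23): f(g^{-1}(u)) |J(u)| with |J(u)| = 1/u
u, h = 2.0, 1e-5
numeric = (Phi(math.log(u + h)) - Phi(math.log(u - h))) / (2 * h)
assert math.isclose(f_U(u), numeric, rel_tol=1e-6)
print(f_U(u))  # the Lognormal(0, 1) density at u = 2, about 0.157
```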
1.4 LIKELIHOOD INFERENCE

If X1, ..., Xn are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ1, ..., θp), then the joint likelihood function is

L(θ) = Π_{i=1}^{n} f(xi|θ).    (1.24)

When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x1, ..., xn|θ) viewed as a function of θ.

The observed data, x1, ..., xn, might have been realized under many different values for θ. The parameters for which observing x1, ..., xn would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x1, ..., xn that maximizes L(θ), then θ̂ = ϑ(X1, ..., Xn) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.

It is typically easier to work with the log likelihood function

l(θ) = log L(θ),    (1.25)

which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x1, ..., xn but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations

l′(θ) = 0,    (1.26)

where

l′(θ) = (dl(θ)/dθ1, ..., dl(θ)/dθp)
is called the score function. The score function satisfies

E{l′(θ)} = 0,    (1.27)

where the expectation is taken with respect to the distribution of X1, ..., Xn. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.
The MLE has a sampling distribution because it depends on the realization of the random variables X1, ..., Xn. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: when the log likelihood is very pointy, the location of the maximum is more precisely known.

To make this precise, let l″(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθi dθj). The Fisher information matrix is defined as

I(θ) = E{l′(θ) l′(θ)ᵀ} = −E{l″(θ)},    (1.28)

where the expectations are taken with respect to the distribution of X1, ..., Xn. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l″(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.

Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)⁻¹, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is Np(θ*, I(θ*)⁻¹). Since the true parameter values are unknown, I(θ*)⁻¹ must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)⁻¹. Alternatively, it is also reasonable to use −l″(θ̂)⁻¹. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)⁻¹. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)⁻¹ can be found in [127, 182, 371, 470].
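For the Poisson model these quantities are available in closed form, which permits a compact check. A sketch of ours with hypothetical counts: the score Σxi/λ − n gives λ̂ = x̄, and −l″(λ̂) = Σxi/λ̂² = n/λ̂, so the observed- and expected-information standard errors coincide at λ̂:

```python
import math

xs = [4, 2, 7, 3, 5, 4, 6, 3, 2, 4]      # hypothetical Poisson counts
n = len(xs)
lam_hat = sum(xs) / n                     # MLE, solving the score equation
obs_info = sum(xs) / lam_hat**2           # observed Fisher information -l''(lam_hat)
se = 1 / math.sqrt(obs_info)              # estimated standard error of the MLE
assert math.isclose(se, math.sqrt(lam_hat / n))
print(lam_hat, se)  # 4.0 and about 0.632
```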
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is

L(φ|μ̂(φ)) = max_μ L(μ, φ).    (1.29)

Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
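As a concrete sketch of (1.29) (ours, with hypothetical data): for the N(μ, σ²) likelihood, take φ = σ and treat μ as the nuisance parameter. A grid search for the inner maximization shows that the maximizing μ is x̄ for every σ:

```python
import math

xs = [1.2, 0.7, 2.1, 1.6, 0.9, 1.4]      # hypothetical data
xbar = sum(xs) / len(xs)

def loglik(mu, sigma):
    # N(mu, sigma^2) log likelihood for the sample xs
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu)**2 / (2 * sigma**2) for x in xs)

mu_grid = [i / 1000 for i in range(3001)]            # candidate nuisance values
profile = {}
for sigma in (0.5, 1.0, 2.0):
    mu_best = max(mu_grid, key=lambda m: loglik(m, sigma))
    profile[sigma] = loglik(mu_best, sigma)          # profile log likelihood at sigma
    assert abs(mu_best - xbar) < 1e-3                # inner maximizer is xbar each time
print(profile)
```

In this model the inner maximizer does not depend on σ, so profiling is trivial; in general μ̂(φ) changes with φ and each profile evaluation requires a fresh constrained maximization.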
PREFACE
This book covers most topics needed to develop a broad and thorough working knowledge of modern computational statistics. We seek to develop a practical understanding of how and why existing methods work, enabling readers to use modern statistical methods effectively. Since many new methods are built from components of existing techniques, our ultimate goal is to provide scientists with the tools they need to contribute new ideas to the field.

A growing challenge in science is that there is so much of it. While the pursuit of important new methods and the honing of existing approaches is a worthy goal, there is also a need to organize and distill the teeming jungle of ideas. We attempt to do that here. Our choice of topics reflects our view of what constitutes the core of the evolving field of computational statistics, and what will be interesting and useful for our readers.

Our use of the adjective modern in the first sentence of this preface is potentially troublesome. There is no way that this book can cover all the latest, greatest techniques. We have not even tried. We have instead aimed to provide a reasonably up-to-date survey of a broad portion of the field, while leaving room for diversions and esoterica.

The foundations of optimization and numerical integration are covered in this book. We include these venerable topics because (i) they are cornerstones of frequentist and Bayesian inference, (ii) routine application of available software often fails for hard problems, and (iii) the methods themselves are often secondary components of other statistical computing algorithms. Some topics we have omitted represent important areas of past and present research in the field, but their priority here is lowered by the availability of high-quality software. For example, the generation of pseudo-random numbers is a classic topic, but one that we prefer to address by giving students reliable software. Finally, some topics (e.g., principal curves and tabu search) are included simply because they are interesting and provide very different perspectives on familiar problems. Perhaps a future researcher may draw ideas from such topics to design a creative and effective new algorithm.

In this second edition we have both updated and broadened our coverage, and we now provide computer code. For example, we have added new MCMC topics to reflect continued activity in that popular area. A notable increase in breadth is our inclusion of more methods relevant for problems where statistical dependency is important, such as block bootstrapping and sequential importance sampling. This second edition provides extensive new support in R. Specifically, code for the examples in this book is available from the book website, www.stat.colostate.edu/computationalstatistics.

Our target audience includes graduate students in statistics and related fields, statisticians, and quantitative empirical scientists in other fields. We hope such readers may use the book when applying standard methods and developing new methods.
The level of mathematics expected of the reader does not extend much beyond Taylor series and linear algebra. Breadth of mathematical training is more helpful than depth. Essential review is provided in Chapter 1. More advanced readers will find greater mathematical detail in the wide variety of high-quality books available on specific topics, many of which are referenced in the text. Other readers, caring less about analytical details, may prefer to focus on our descriptions of algorithms and examples.

The expected level of statistics is equivalent to that obtained by a graduate student in his or her first year of study of the theory of statistics and probability. An understanding of maximum likelihood methods, Bayesian methods, elementary asymptotic theory, Markov chains, and linear models is most important. Many of these topics are reviewed in Chapter 1.

With respect to computer programming, we find that good students can learn as they go. However, a working knowledge of a suitable language allows implementation of the ideas covered in this book to progress much more quickly. We have chosen to forgo any language-specific examples, algorithms, or coding in the text. For those wishing to learn a language while they study this book, we recommend that you choose a high-level, interactive package that permits the flexible design of graphical displays and includes supporting statistics and probability functions, such as R and MATLAB.¹ These are the sort of languages often used by researchers during the development of new statistical computing techniques, and they are suitable for implementing all the methods we describe, except in some cases for problems of vast scope or complexity. We use R and recommend it. Although lower-level languages such as C++ could also be used, they are more appropriate for professional-grade implementation of algorithms after researchers have refined the methodology.

The book is organized into four major parts: optimization (Chapters 2, 3, and 4), integration and simulation (Chapters 5, 6, 7, and 8), bootstrapping (Chapter 9), and density estimation and smoothing (Chapters 10, 11, and 12). The chapters are written to stand independently, so a course can be built by selecting the topics one wishes to teach. For a one-semester course, our selection typically weights most heavily topics from Chapters 2, 3, 6, 7, 9, 10, and 11. With a leisurely pace or more thorough coverage, a shorter list of topics could still easily fill a semester course. There is sufficient material here to provide a thorough one-year course of study, notwithstanding any supplemental topics one might wish to teach.

A variety of homework problems are included at the end of each chapter. Some are straightforward, while others require the student to develop a thorough understanding of the model/method being used, to carefully (and perhaps cleverly) code a suitable technique, and to devote considerable attention to the interpretation of results. A few exercises invite open-ended exploration of methods and ideas. We are sometimes asked for solutions to the exercises, but we prefer to sequester them to preserve the challenge for future students and readers.

The datasets discussed in the examples and exercises are available from the book website, www.stat.colostate.edu/computationalstatistics. The R code is also provided there. Finally, the website includes an errata. Responsibility for all errors lies with us.

¹R is available for free from www.r-project.org. Information about MATLAB can be found at www.mathworks.com.
ACKNOWLEDGMENTS
The course upon which this book is based was developed and taught by us at Colorado State University from 1994 onwards. Thanks are due to our many students, who have been semiwilling guinea pigs over the years. We also thank our colleagues in the Statistics Department for their continued support. The late Richard Tweedie merits particular acknowledgment for his mentoring during the early years of our careers.

We owe a great deal of intellectual debt to Adrian Raftery, who deserves special thanks not only for his teaching and advising, but also for his unwavering support and his seemingly inexhaustible supply of good ideas. In addition, we thank our influential advisors and teachers at the University of Washington Statistics Department, including David Madigan, Werner Stuetzle, and Judy Zeh. Of course, each of our chapters could be expanded into a full-length book, and great scholars have already done so. We owe much to their efforts, upon which we relied when developing our course and our manuscript.

Portions of the first edition were written at the Department of Mathematics and Statistics, University of Otago, in Dunedin, New Zealand, whose faculty we thank for graciously hosting us during our sabbatical in 2003. Much of our work on the second edition was undertaken during our sabbatical visit to the Australia Commonwealth Scientific and Research Organization in 2009–2010, sponsored by CSIRO Mathematics, Informatics and Statistics, and hosted at the Longpocket Laboratory in Indooroopilly, Australia. We thank our hosts and colleagues there for their support.

Our manuscript has been greatly improved through the constructive reviews of John Bickham, Ben Bird, Kate Cowles, Jan Hannig, Alan Herlihy, David Hunter, Devin Johnson, Michael Newton, Doug Nychka, Steve Sain, David W. Scott, N. Scott Urquhart, Haonan Wang, Darrell Whitley, and eight anonymous referees. We also thank the sharp-eyed readers listed in the errata for their suggestions and corrections. Our editor Steve Quigley and the folks at Wiley were supportive and helpful during the publication process. We thank Nelida Pohl for permission to adapt her photograph in the cover design of the first edition. We thank Melinda Stelzer for permission to use her painting "Champagne Circuit," 2001, for the cover of the second edition; more about her art can be found at www.facebook.com/geekchicart. We also owe a special note of thanks to Zube (a.k.a. John Dzubera), who kept our own computers running despite our best efforts to the contrary.

Funding from National Science Foundation (NSF) CAREER grant SBR-9875508 was a significant source of support for the first author during the preparation of the first edition. He also thanks his colleagues and friends in the North Slope Borough, Alaska, Department of Wildlife Management for their longtime research support. The second author gratefully acknowledges the support of STAR Research
Assistance Agreement CR-829095, awarded to Colorado State University by the US Environmental Protection Agency (EPA). The views expressed here are solely those of the authors. NSF and EPA do not endorse any products or commercial services mentioned herein.

Finally, we thank our parents for enabling and supporting our educations, and for providing us with the "stubbornness gene" necessary for graduate school, the tenure track, or book publication (take your pick). The second edition is dedicated to our kids, Natalie and Neil, for continuing to show us what is important and what is not.

Geof H. Givens
Jennifer A. Hoeting
CHAPTER 1

REVIEW

This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.

1.1 MATHEMATICAL NOTATION

We use boldface to distinguish a vector x = (x1, ..., xp) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f1(x), ..., fp(x)). The transpose of M is denoted Mᵀ.

Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x1, ..., xn)ᵀ. Let I denote an identity matrix, and 1 and 0 denote vectors of ones and zeros, respectively.

A symmetric square matrix M is positive definite if xᵀMx > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite or positive semidefinite if xᵀMx ≥ 0 for all nonzero vectors x.

The derivative of a function f, evaluated at x, is denoted f′(x). When x = (x1, ..., xp), the gradient of f at x is

f′(x) = (df(x)/dx1, ..., df(x)/dxp).

The Hessian matrix for f at x is f″(x), having (i, j)th element equal to d²f(x)/(dxi dxj). The negative Hessian has important uses in statistical inference.

Let J(x) denote the Jacobian matrix evaluated at x for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to dfi(x)/dxj.

A functional is a real-valued function on a space of functions. For example, if T(f) = ∫₀¹ f(x) dx, then the functional T maps suitably integrable functions onto the real line.
The indicator function 1A equals 1 if A is true and 0 otherwise The real lineis denoted and p-dimensional real space is p
Computational Statistics Second Edition Geof H Givens and Jennifer A Hoetingcopy 2013 John Wiley amp Sons Inc Published 2013 by John Wiley amp Sons Inc
1
2 CHAPTER 1 REVIEW
12 TAYLORrsquoS THEOREM AND MATHEMATICALLIMIT THEORY
First we define standard ldquobig ohrdquo and ldquolittle ohrdquo notation for describing the relativeorders of convergence of functions Let the functions f and g be defined on a commonpossibly infinite interval Let z0 be a point in this interval or a boundary point of it(ie minusinfin or infin) We require g(z) = 0 for all z = z0 in a neighborhood of z0 Thenwe say
f (z) = O(g(z)) (11)
if there exists a constant M such that |f (z)| le M|g(z)| as z rarr z0 For example(n + 1)(3n2) = O(nminus1) and it is understood that we are considering n rarr infin Iflimzrarrz0 f (z)g(z) = 0 then we say
f (z) = O(g(z)) (12)
For example f (x0 + h) minus f (x0) = hf prime(x0) + O(h) as h rarr 0 if f is differentiable atx0 The same notation can be used for describing the convergence of a sequence xnas n rarr infin by letting f (n) = xn
Taylorrsquos theorem provides a polynomial approximation to a function f Supposef has finite (n + 1)th derivative on (a b) and continuous nth derivative on [a b] Thenfor any x0 isin [a b] distinct from x the Taylor series expansion of f about x0 is
f (x) =nsum
i=0
1
if (i)(x0)(x minus x0)i + Rn (13)
where f (i)(x0) is the ith derivative of f evaluated at x0 and
Rn = 1
(n + 1)f (n+1)(ξ)(x minus x0)n+1 (14)
for some point ξ in the interval between x and x0 As |x minus x0| rarr 0 note that Rn =O(|x minus x0|n+1)
The multivariate version of Taylorrsquos theorem is analogous Suppose f is areal-valued function of a p-dimensional variable x possessing continuous partialderivatives of all orders up to and including n + 1 with respect to all coordinates inan open convex set containing x and x0 = x Then
f (x) = f (x0) +nsum
i=1
1
iD(i)(f x0 x minus x0) + Rn (15)
where
D(i)(f x y) =psum
j1=1
middot middot middotpsum
ji=1
(di
dtj1 middot middot middot dtji
f (t)
∣∣∣∣t=x
) iprodk=1
yjk
(16)
12 TAYLORrsquoS THEOREM AND MATHEMATICAL LIMIT THEORY 3
and
Rn = 1
(n + 1)D(n+1)(f ξ x minus x0) (17)
for some ξ on the line segment joining x and x0 As |x minus x0| rarr 0 note that Rn =O(|x minus x0|n+1)
The EulerndashMaclaurin formula is useful in many asymptotic analyses If f has2n continuous derivatives in [0 1] thenint 1
0f (x) dx =f (0) + f (1)
2
minusnminus1sumi=0
b2i(f (2iminus1)(1) minus f (2iminus1)(0))
(2i)minus b2nf
(2n)(ξ)
(2n) (18)
where 0 le ξ le 1 f (j) is the jth derivative of f and bj = Bj(0) can be determinedusing the recursion relation
msumj=0
(m + 1
j
)Bj(z) = (m + 1)zm (19)
initialized with B0(z) = 1 The proof of this result is based on repeated integrationsby parts [376]
Finally we note that it is sometimes desirable to approximate the derivative ofa function numerically using finite differences For example the ith component ofthe gradient of f at x can be approximated by
df (x)
dxi
asymp f (x + εiei) minus f (x minus εiei)
2εi
(110)
where εi is a small number and ei is the unit vector in the ith coordinate directionTypically one might start with say εi = 001 or 0001 and approximate the desiredderivative for a sequence of progressively smaller εi The approximation willgenerally improve until εi becomes small enough that the calculation is degraded andeventually dominated by computer roundoff error introduced by subtractive cancel-lation Introductory discussion of this approach and a more sophisticated Richardsonextrapolation strategy for obtaining greater precision are provided in [376] Finitedifferences can also be used to approximate the second derivative of f at x via
df (x)
dxi dxj
asymp 1
4εiεj
(f (x + εiei + εjej) minus f (x + εiei minus εjej)
minus f (x minus εiei + εjej) + f (x minus εiei minus εjej)
)(111)
with similar sequential precision improvements
4 CHAPTER 1 REVIEW
13 STATISTICAL NOTATION AND PROBABILITYDISTRIBUTIONS
We use capital letters to denote random variables such as Y or X and lowercaseletters to represent specific realized values of random variables such as y or x Theprobability density function of X is denoted f the cumulative distribution functionis F We use the notation X sim f (x) to mean that X is distributed with density f (x)Frequently the dependence of f (x) on one or more parameters also will be denotedwith a conditioning bar as in f (x|α β) Because of the diversity of topics covered inthis book we want to be careful to distinguish when f (x|α) refers to a density functionas opposed to the evaluation of that density at a point x When the meaning is unclearfrom the context we will be explicit for example by using f (middot|α) to denote thefunction When it is important to distinguish among several densities we may adoptsubscripts referring to specific random variables so that the density functions forX and Y are fX and fY respectively We use the same notation for distributions ofdiscrete random variables and in the Bayesian context
The conditional distribution of X given that Y equals y (ie X|Y = y) is de-scribed by the density denoted f (x|y) or fX|Y (x|y) In this case we write that X|Y hasdensity f (x|Y ) For notational simplicity we allow density functions to be implicitlyspecified by their arguments so we may use the same symbol say f to refer to manydistinct functions as in the equation f (x y|μ) = f (x|y μ)f (y|μ) Finally f (X) andF (X) are random variables the evaluations of the density and cumulative distributionfunctions respectively at the random argument X
The expectation of a random variable is denoted EX Unless specificallymentioned the distribution with respect to which an expectation is taken is the dis-tribution of X or should be implicit from the context To denote the probability ofan event A we use P[A] = E1A The conditional expectation of X|Y = y isEX|y When Y is unknown EX|Y is a random variable that depends on Y Otherattributes of the distribution of X and Y include varX covX Y corX Y andcvX = varX12EX These quantities are the variance of X the covariance andcorrelation of X and Y and the coefficient of variation of X respectively
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so

    g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y)    (1.12)

for all x, y ∈ I and all 0 < λ < 1. Then Jensen's inequality states that E{g(X)} ≥ g(E{X}) for any random variable X having P[X ∈ I] = 1.
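The inequality is easy to verify numerically. Below is a minimal sketch (our illustration, not from the text); the convex function g(x) = x² and the normal distribution for X are arbitrary choices:

```python
import random

# Monte Carlo check of Jensen's inequality with g(x) = x^2 (convex).
# The distribution N(1, 2^2) is an arbitrary illustrative choice.
random.seed(1)
x = [random.gauss(1.0, 2.0) for _ in range(100_000)]

mean_x = sum(x) / len(x)                      # estimate of E{X}
mean_gx = sum(xi**2 for xi in x) / len(x)     # estimate of E{g(X)}
g_mean = mean_x**2                            # g(E{X})

print(mean_gx >= g_mean)  # True, as Jensen's inequality requires
```

Here E{g(X)} = var{X} + E{X}² = 5 while g(E{X}) = 1, so the gap is substantial.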
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

    n! = n(n − 1)(n − 2) · · · (3)(2)(1), with 0! = 1,    (1.13)

    C(n, k) = n! / (k!(n − k)!),    (1.14)
TABLE 1.1 Notation and description for common probability distributions of discrete random variables

Bernoulli
  Notation and parameter space: X ∼ Bernoulli(p); 0 ≤ p ≤ 1
  Density and sample space: f(x) = p^x (1 − p)^(1−x); x = 0 or 1
  Mean and variance: E{X} = p; var{X} = p(1 − p)

Binomial
  X ∼ Bin(n, p); 0 ≤ p ≤ 1; n ∈ {1, 2, . . .}
  f(x) = C(n, x) p^x (1 − p)^(n−x); x = 0, 1, . . . , n
  E{X} = np; var{X} = np(1 − p)

Multinomial
  X ∼ Multinomial(n, p); p = (p1, . . . , pk); 0 ≤ pi ≤ 1 and Σ_{i=1}^{k} pi = 1; n ∈ {1, 2, . . .}
  f(x) = C(n; x1, . . . , xk) ∏_{i=1}^{k} pi^xi; x = (x1, . . . , xk); xi ∈ {0, 1, . . . , n}; Σ_{i=1}^{k} xi = n
  E{X} = np; var{Xi} = npi(1 − pi); cov{Xi, Xj} = −npipj

Negative Binomial
  X ∼ NegBin(r, p); 0 ≤ p ≤ 1; r ∈ {1, 2, . . .}
  f(x) = C(x + r − 1, r − 1) p^r (1 − p)^x; x ∈ {0, 1, . . .}
  E{X} = r(1 − p)/p; var{X} = r(1 − p)/p²

Poisson
  X ∼ Poisson(λ); λ > 0
  f(x) = λ^x exp{−λ}/x!; x ∈ {0, 1, . . .}
  E{X} = λ; var{X} = λ
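The table entries can be checked by simulation. A quick sketch (ours; the values n = 20 and p = 0.3 are arbitrary) compares empirical moments of binomial draws with np and np(1 − p):

```python
import random

# Empirical check of the Binomial row of Table 1.1:
# E{X} = np and var{X} = np(1 - p).
random.seed(2)
n, p = 20, 0.3
draws = [sum(random.random() < p for _ in range(n)) for _ in range(50_000)]

mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)

print(abs(mean - n * p) < 0.1)           # True: close to np = 6
print(abs(var - n * p * (1 - p)) < 0.2)  # True: close to np(1 - p) = 4.2
```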
TABLE 1.2 Notation and description for some common probability distributions of continuous random variables

Beta
  Notation and parameter space: X ∼ Beta(α, β); α > 0 and β > 0
  Density and sample space: f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1); 0 ≤ x ≤ 1
  Mean and variance: E{X} = α/(α + β); var{X} = αβ/[(α + β)²(α + β + 1)]

Cauchy
  X ∼ Cauchy(α, β); α ∈ ℝ and β > 0
  f(x) = 1/(πβ[1 + ((x − α)/β)²]); x ∈ ℝ
  E{X} is nonexistent; var{X} is nonexistent

Chi-square
  X ∼ χ²_ν; ν > 0
  f(x) is the Gamma(ν/2, 1/2) density; x > 0
  E{X} = ν; var{X} = 2ν

Dirichlet
  X ∼ Dirichlet(α); α = (α1, . . . , αk); αi > 0; α0 = Σ_{i=1}^{k} αi
  f(x) = Γ(α0) ∏_{i=1}^{k} xi^(αi−1) / ∏_{i=1}^{k} Γ(αi); x = (x1, . . . , xk); 0 ≤ xi ≤ 1; Σ_{i=1}^{k} xi = 1
  E{X} = α/α0; var{Xi} = αi(α0 − αi)/[α0²(α0 + 1)]; cov{Xi, Xj} = −αiαj/[α0²(α0 + 1)]

Exponential
  X ∼ Exp(λ); λ > 0
  f(x) = λ exp{−λx}; x > 0
  E{X} = 1/λ; var{X} = 1/λ²

Gamma
  X ∼ Gamma(r, λ); λ > 0 and r > 0
  f(x) = λ^r x^(r−1) exp{−λx}/Γ(r); x > 0
  E{X} = r/λ; var{X} = r/λ²
TABLE 1.3 Notation and description for more common probability distributions of continuous random variables

Lognormal
  Notation and parameter space: X ∼ Lognormal(μ, σ²); μ ∈ ℝ and σ > 0
  Density and sample space: f(x) = [1/(x√(2πσ²))] exp{−½((log x − μ)/σ)²}; x > 0
  Mean and variance: E{X} = exp{μ + σ²/2}; var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}

Multivariate Normal
  X ∼ Nk(μ, Σ); μ = (μ1, . . . , μk) ∈ ℝ^k; Σ positive definite
  f(x) = exp{−(x − μ)ᵀ Σ⁻¹ (x − μ)/2} / [(2π)^(k/2) |Σ|^(1/2)]; x = (x1, . . . , xk) ∈ ℝ^k
  E{X} = μ; var{X} = Σ

Normal
  X ∼ N(μ, σ²); μ ∈ ℝ and σ > 0
  f(x) = [1/√(2πσ²)] exp{−½((x − μ)/σ)²}; x ∈ ℝ
  E{X} = μ; var{X} = σ²

Student's t
  X ∼ t_ν; ν > 0
  f(x) = [Γ((ν + 1)/2)/(Γ(ν/2)√(πν))] (1 + x²/ν)^(−(ν+1)/2); x ∈ ℝ
  E{X} = 0 if ν > 1; var{X} = ν/(ν − 2) if ν > 2

Uniform
  X ∼ Unif(a, b); a, b ∈ ℝ and a < b
  f(x) = 1/(b − a); x ∈ [a, b]
  E{X} = (a + b)/2; var{X} = (b − a)²/12

Weibull
  X ∼ Weibull(a, b); a > 0 and b > 0
  f(x) = abx^(b−1) exp{−ax^b}; x > 0
  E{X} = Γ(1 + 1/b)/a^(1/b); var{X} = [Γ(1 + 2/b) − Γ(1 + 1/b)²]/a^(2/b)
    C(n; k1, . . . , km) = n! / (k1! · · · km!), where n = Σ_{i=1}^{m} ki,    (1.15)

    Γ(r) = (r − 1)! for r = 1, 2, . . . , and Γ(r) = ∫_0^∞ t^(r−1) exp{−t} dt for general r > 0.    (1.16)

It is worth knowing that

    Γ(1/2) = √π and Γ(n + 1/2) = (1 × 3 × 5 × · · · × (2n − 1)) √π / 2^n

for positive integer n.
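These Gamma function facts are easy to confirm with a few lines of code (our check, using Python's standard math.gamma):

```python
import math

# Gamma(1/2) = sqrt(pi)
assert math.isclose(math.gamma(0.5), math.sqrt(math.pi))

# Gamma(n + 1/2) = (1 * 3 * 5 * ... * (2n - 1)) * sqrt(pi) / 2^n
for n in range(1, 8):
    odd_prod = 1
    for k in range(1, 2 * n, 2):      # odd factors 1, 3, ..., 2n - 1
        odd_prod *= k
    rhs = odd_prod * math.sqrt(math.pi) / 2 ** n
    assert math.isclose(math.gamma(n + 0.5), rhs)

print("half-integer Gamma identity holds for n = 1, ..., 7")
```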
Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as

    f(x|γ) = c1(x) c2(γ) exp{ Σ_{i=1}^{k} yi(x) θi(γ) }    (1.17)

for nonnegative functions c1 and c2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θi(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The yi(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that
    E{y(X)} = κ′(θ)    (1.18)

and

    var{y(X)} = κ″(θ),    (1.19)

where κ(θ) = −log c3(θ), letting c3(θ) denote the reexpression of c2(γ) in terms of the canonical parameters θ = (θ1, . . . , θk), and y(X) = (y1(X), . . . , yk(X)). These results can be rewritten in terms of the original parameters γ as

    E{ Σ_{i=1}^{k} (dθi(γ)/dγj) yi(X) } = −(d/dγj) log c2(γ)    (1.20)

and

    var{ Σ_{i=1}^{k} (dθi(γ)/dγj) yi(X) } = −(d²/dγj²) log c2(γ) − E{ Σ_{i=1}^{k} (d²θi(γ)/dγj²) yi(X) }.    (1.21)
Example 1.1 (Poisson) The Poisson distribution belongs to the exponential family, with c1(x) = 1/x!, c2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ′(θ) = exp{θ} = λ and var{X} = κ″(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
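Example 1.1 can be confirmed by simulation. The sketch below (our illustration; λ = 3 and the crude inversion sampler are our own choices) compares κ′(θ) and κ″(θ) with the empirical mean and variance of Poisson draws:

```python
import math
import random

# For Poisson(lambda): theta = log(lambda) and kappa(theta) = exp(theta),
# so kappa'(theta) = kappa''(theta) = lambda.
lam = 3.0
theta = math.log(lam)
kappa_prime = math.exp(theta)
kappa_double_prime = math.exp(theta)

def rpois(lam, rng):
    """Crude Poisson sampler by CDF inversion (illustrative only)."""
    u, k = rng.random(), 0
    p = math.exp(-lam)        # P(X = 0)
    cum = p
    while u > cum:
        k += 1
        p *= lam / k          # P(X = k) from P(X = k - 1)
        cum += p
    return k

rng = random.Random(3)
draws = [rpois(lam, rng) for _ in range(100_000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)

print(abs(mean - kappa_prime) < 0.05)         # True: E{X} = kappa'(theta)
print(abs(var - kappa_double_prime) < 0.15)   # True: var{X} = kappa''(theta)
```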
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X1, . . . , Xp) denote a p-dimensional random variable with continuous density function f. Suppose that

    U = g(X) = (g1(X), . . . , gp(X)) = (U1, . . . , Up),    (1.22)

where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

    f(u) = f(g⁻¹(u)) |J(u)|,    (1.23)

where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g⁻¹ evaluated at u, having (i, j)th element dxi/duj, where these derivatives are assumed to be continuous over the support region of U.
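As a concrete instance of (1.23), take U = g(X) = exp{X} with X ∼ N(μ, σ²); then g⁻¹(u) = log u and |J(u)| = 1/u, which reproduces the lognormal density of Table 1.3. A sketch (our own; the values of μ and σ are arbitrary):

```python
import math

mu, sigma = 0.5, 1.2   # arbitrary illustrative parameters

def f_X(x):
    """N(mu, sigma^2) density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / math.sqrt(2 * math.pi * sigma ** 2)

def f_U_jacobian(u):
    """Density of U = exp(X) via (1.23): f(g^{-1}(u)) |J(u)|, with |J(u)| = 1/u."""
    return f_X(math.log(u)) / u

def f_U_lognormal(u):
    """Lognormal density as listed in Table 1.3."""
    return (math.exp(-0.5 * ((math.log(u) - mu) / sigma) ** 2)
            / (u * math.sqrt(2 * math.pi * sigma ** 2)))

for u in (0.1, 0.5, 1.0, 2.0, 10.0):
    assert math.isclose(f_U_jacobian(u), f_U_lognormal(u))
print("change-of-variables density matches the lognormal form")
```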
1.4 LIKELIHOOD INFERENCE

If X1, . . . , Xn are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ1, . . . , θp), then the joint likelihood function is

    L(θ) = ∏_{i=1}^{n} f(xi|θ).    (1.24)

When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x1, . . . , xn|θ) viewed as a function of θ.

The observed data, x1, . . . , xn, might have been realized under many different values for θ. The parameters for which observing x1, . . . , xn would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x1, . . . , xn that maximizes L(θ), then θ̂ = ϑ(X1, . . . , Xn) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of the MLE of θ.

It is typically easier to work with the log likelihood function,

    l(θ) = log L(θ),    (1.25)

which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x1, . . . , xn but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations

    l′(θ) = 0,    (1.26)

where

    l′(θ) = (dl(θ)/dθ1, . . . , dl(θ)/dθp)
is called the score function. The score function satisfies

    E{l′(θ)} = 0,    (1.27)

where the expectation is taken with respect to the distribution of X1, . . . , Xn. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.

The MLE has a sampling distribution because it depends on the realization of the random variables X1, . . . , Xn. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: when the log likelihood is very pointy, the location of the maximum is more precisely known.

To make this precise, let l″(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθi dθj). The Fisher information matrix is defined as

    I(θ) = E{l′(θ) l′(θ)ᵀ} = −E{l″(θ)},    (1.28)

where the expectations are taken with respect to the distribution of X1, . . . , Xn. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l″(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.

Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)⁻¹, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is Np(θ*, I(θ*)⁻¹). Since the true parameter values are unknown, I(θ*)⁻¹ must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)⁻¹. Alternatively, it is also reasonable to use −l″(θ̂)⁻¹. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)⁻¹. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)⁻¹ can be found in [127, 182, 371, 470].
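For a concrete one-parameter case, consider i.i.d. Poisson data, where the MLE has the closed form λ̂ = x̄ and the observed information is −l″(λ) = Σxi/λ². The sketch below (our illustration; the data and the grid search are our own devices) recovers the MLE numerically and attaches a standard error from the observed information:

```python
import math

# Fixed "data" for illustration only.
x = [2, 4, 1, 3, 0, 5, 2, 2, 3, 1, 4, 2, 3, 1, 2]
n, s = len(x), sum(x)

def loglik(lam):
    # Poisson log likelihood; the additive constant -sum(log(x_i!))
    # is omitted, as the text permits.
    return s * math.log(lam) - n * lam

# Crude maximization over a grid (a closed form exists: lambda_hat = xbar).
grid = [0.01 * k for k in range(1, 1001)]
lam_hat = max(grid, key=loglik)

obs_info = s / lam_hat ** 2          # -l''(lambda_hat)
se = math.sqrt(1.0 / obs_info)       # estimated standard error

print(abs(lam_hat - s / n) < 0.01)   # True: the grid maximum sits at xbar
print(se > 0)                        # True
```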
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is

    L(φ|μ̂(φ)) = max_μ L(μ, φ).    (1.29)

Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
PREFACE

The level of mathematics expected of the reader does not extend much beyond Taylor series and linear algebra. Breadth of mathematical training is more helpful than depth. Essential review is provided in Chapter 1. More advanced readers will find greater mathematical detail in the wide variety of high-quality books available on specific topics, many of which are referenced in the text. Other readers, caring less about analytical details, may prefer to focus on our descriptions of algorithms and examples.

The expected level of statistics is equivalent to that obtained by a graduate student in his or her first year of study of the theory of statistics and probability. An understanding of maximum likelihood methods, Bayesian methods, elementary asymptotic theory, Markov chains, and linear models is most important. Many of these topics are reviewed in Chapter 1.

With respect to computer programming, we find that good students can learn as they go. However, a working knowledge of a suitable language allows implementation of the ideas covered in this book to progress much more quickly. We have chosen to forgo any language-specific examples, algorithms, or coding in the text. For those wishing to learn a language while they study this book, we recommend that you choose a high-level, interactive package that permits the flexible design of graphical displays and includes supporting statistics and probability functions, such as R and MATLAB.¹ These are the sort of languages often used by researchers during the development of new statistical computing techniques, and they are suitable for implementing all the methods we describe, except in some cases for problems of vast scope or complexity. We use R and recommend it. Although lower-level languages such as C++ could also be used, they are more appropriate for professional-grade implementation of algorithms after researchers have refined the methodology.

The book is organized into four major parts: optimization (Chapters 2, 3, and 4), integration and simulation (Chapters 5, 6, 7, and 8), bootstrapping (Chapter 9), and density estimation and smoothing (Chapters 10, 11, and 12). The chapters are written to stand independently, so a course can be built by selecting the topics one wishes to teach. For a one-semester course, our selection typically weights most heavily topics from Chapters 2, 3, 6, 7, 9, 10, and 11. With a leisurely pace or more thorough coverage, a shorter list of topics could still easily fill a semester course. There is sufficient material here to provide a thorough one-year course of study, notwithstanding any supplemental topics one might wish to teach.

A variety of homework problems are included at the end of each chapter. Some are straightforward, while others require the student to develop a thorough understanding of the model/method being used, to carefully (and perhaps cleverly) code a suitable technique, and to devote considerable attention to the interpretation of results. A few exercises invite open-ended exploration of methods and ideas. We are sometimes asked for solutions to the exercises, but we prefer to sequester them to preserve the challenge for future students and readers.

The datasets discussed in the examples and exercises are available from the book website, www.stat.colostate.edu/computationalstatistics. The R code is also provided there. Finally, the website includes an errata. Responsibility for all errors lies with us.
¹R is available for free from www.r-project.org. Information about MATLAB can be found at www.mathworks.com.
ACKNOWLEDGMENTS

The course upon which this book is based was developed and taught by us at Colorado State University from 1994 onwards. Thanks are due to our many students, who have been semiwilling guinea pigs over the years. We also thank our colleagues in the Statistics Department for their continued support. The late Richard Tweedie merits particular acknowledgment for his mentoring during the early years of our careers.

We owe a great deal of intellectual debt to Adrian Raftery, who deserves special thanks not only for his teaching and advising but also for his unwavering support and his seemingly inexhaustible supply of good ideas. In addition, we thank our influential advisors and teachers at the University of Washington Statistics Department, including David Madigan, Werner Stuetzle, and Judy Zeh. Of course, each of our chapters could be expanded into a full-length book, and great scholars have already done so. We owe much to their efforts, upon which we relied when developing our course and our manuscript.

Portions of the first edition were written at the Department of Mathematics and Statistics, University of Otago, in Dunedin, New Zealand, whose faculty we thank for graciously hosting us during our sabbatical in 2003. Much of our work on the second edition was undertaken during our sabbatical visit to the Australia Commonwealth Scientific and Research Organization in 2009–2010, sponsored by CSIRO Mathematics, Informatics and Statistics, and hosted at the Longpocket Laboratory in Indooroopilly, Australia. We thank our hosts and colleagues there for their support.

Our manuscript has been greatly improved through the constructive reviews of John Bickham, Ben Bird, Kate Cowles, Jan Hannig, Alan Herlihy, David Hunter, Devin Johnson, Michael Newton, Doug Nychka, Steve Sain, David W. Scott, N. Scott Urquhart, Haonan Wang, Darrell Whitley, and eight anonymous referees. We also thank the sharp-eyed readers listed in the errata for their suggestions and corrections. Our editor Steve Quigley and the folks at Wiley were supportive and helpful during the publication process. We thank Nelida Pohl for permission to adapt her photograph in the cover design of the first edition. We thank Melinda Stelzer for permission to use her painting "Champagne Circuit," 2001, for the cover of the second edition. More about her art can be found at www.facebook.com/geekchicart. We also owe a special note of thanks to Zube (a.k.a. John Dzubera), who kept our own computers running despite our best efforts to the contrary.

Funding from National Science Foundation (NSF) CAREER grant SBR-9875508 was a significant source of support for the first author during the preparation of the first edition. He also thanks his colleagues and friends in the North Slope Borough, Alaska, Department of Wildlife Management for their longtime research support. The second author gratefully acknowledges the support of STAR Research Assistance Agreement CR-829095, awarded to Colorado State University by the U.S. Environmental Protection Agency (EPA). The views expressed here are solely those of the authors. NSF and EPA do not endorse any products or commercial services mentioned herein.

Finally, we thank our parents for enabling and supporting our educations, and for providing us with the "stubbornness gene" necessary for graduate school, the tenure track, or book publication: take your pick. The second edition is dedicated to our kids, Natalie and Neil, for continuing to show us what is important and what is not.

Geof H. Givens
Jennifer A. Hoeting
CHAPTER 1

REVIEW

This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.
1.1 MATHEMATICAL NOTATION

We use boldface to distinguish a vector x = (x1, . . . , xp) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f1(x), . . . , fp(x)). The transpose of M is denoted Mᵀ.

Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x1, . . . , xn)ᵀ. Let I denote an identity matrix, and 1 and 0 denote vectors of ones and zeros, respectively.

A symmetric square matrix M is positive definite if xᵀMx > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite or positive semidefinite if xᵀMx ≥ 0 for all nonzero vectors x.
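A small numerical illustration (ours, with a hand-checkable 2 × 2 matrix) connects the eigenvalue criterion to the quadratic-form definition:

```python
import random

# M = [[2, 1], [1, 2]] is symmetric with eigenvalues 3 and 1, so it is
# positive definite: x^T M x > 0 for every nonzero x.
a, b, c = 2.0, 1.0, 2.0   # M = [[a, b], [b, c]]

# Eigenvalues of a symmetric 2x2 matrix in closed form.
center = (a + c) / 2
radius = (((a - c) / 2) ** 2 + b * b) ** 0.5
eigs = (center + radius, center - radius)
print(eigs)  # (3.0, 1.0): both positive

# Spot-check the definition x^T M x > 0 on random nonzero vectors.
rng = random.Random(0)
for _ in range(1000):
    x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    if (x1, x2) != (0.0, 0.0):
        assert a * x1 ** 2 + 2 * b * x1 * x2 + c * x2 ** 2 > 0
```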
The derivative of a function f, evaluated at x, is denoted f′(x). When x = (x1, . . . , xp), the gradient of f at x is

    f′(x) = (df(x)/dx1, . . . , df(x)/dxp).

The Hessian matrix for f at x is f″(x), having (i, j)th element equal to d²f(x)/(dxi dxj). The negative Hessian has important uses in statistical inference.

Let J(x) denote the Jacobian matrix evaluated at x for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to dfi(x)/dxj.

A functional is a real-valued function on a space of functions. For example, if T(f) = ∫_0^1 f(x) dx, then the functional T maps suitably integrable functions onto the real line.

The indicator function 1{A} equals 1 if A is true and 0 otherwise. The real line is denoted ℝ, and p-dimensional real space is ℝ^p.
Computational Statistics, Second Edition. Geof H. Givens and Jennifer A. Hoeting.
© 2013 John Wiley & Sons, Inc. Published 2013 by John Wiley & Sons, Inc.
1.2 TAYLOR'S THEOREM AND MATHEMATICAL LIMIT THEORY

First, we define standard "big oh" and "little oh" notation for describing the relative orders of convergence of functions. Let the functions f and g be defined on a common, possibly infinite, interval. Let z0 be a point in this interval or a boundary point of it (i.e., −∞ or ∞). We require g(z) ≠ 0 for all z ≠ z0 in a neighborhood of z0. Then we say

    f(z) = O(g(z))    (1.1)

if there exists a constant M such that |f(z)| ≤ M|g(z)| as z → z0. For example, (n + 1)/(3n²) = O(n⁻¹), where it is understood that we are considering n → ∞. If lim_{z→z0} f(z)/g(z) = 0, then we say

    f(z) = o(g(z)).    (1.2)

For example, f(x0 + h) − f(x0) = hf′(x0) + o(h) as h → 0 if f is differentiable at x0. The same notation can be used for describing the convergence of a sequence xn as n → ∞, by letting f(n) = xn.

Taylor's theorem provides a polynomial approximation to a function f. Suppose f has a finite (n + 1)th derivative on (a, b) and a continuous nth derivative on [a, b]. Then for any x0 ∈ [a, b] distinct from x, the Taylor series expansion of f about x0 is

    f(x) = Σ_{i=0}^{n} (1/i!) f^(i)(x0)(x − x0)^i + Rn,    (1.3)

where f^(i)(x0) is the ith derivative of f evaluated at x0, and

    Rn = (1/(n + 1)!) f^(n+1)(ξ)(x − x0)^(n+1)    (1.4)

for some point ξ in the interval between x and x0. As |x − x0| → 0, note that Rn = O(|x − x0|^(n+1)).
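The remainder order in (1.4) is easy to see numerically. A sketch (ours) expands f(x) = exp{x} about x0 = 0 to order n = 3 and watches the error scale like h⁴:

```python
import math

def taylor_exp(h, n):
    """Order-n Taylor polynomial of exp about 0, evaluated at h."""
    return sum(h ** i / math.factorial(i) for i in range(n + 1))

n = 3
for h in (0.1, 0.05, 0.025):
    err = abs(math.exp(h) - taylor_exp(h, n))
    ratio = err / h ** (n + 1)
    # The ratio is roughly constant (about 1/4! = 0.0417),
    # consistent with R_n = O(h^{n+1}).
    print(round(ratio, 3))
```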
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x0 ≠ x. Then

    f(x) = f(x0) + Σ_{i=1}^{n} (1/i!) D^(i)(f; x0, x − x0) + Rn,    (1.5)

where

    D^(i)(f; x, y) = Σ_{j1=1}^{p} · · · Σ_{ji=1}^{p} ( d^i f(t) / (dt_{j1} · · · dt_{ji}) |_{t=x} ) ∏_{k=1}^{i} y_{jk}    (1.6)

and

    Rn = (1/(n + 1)!) D^(n+1)(f; ξ, x − x0)    (1.7)

for some ξ on the line segment joining x and x0. As |x − x0| → 0, note that Rn = O(|x − x0|^(n+1)).
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

    ∫_0^1 f(x) dx = (f(0) + f(1))/2 − Σ_{i=1}^{n−1} [b_{2i}/(2i)!] (f^(2i−1)(1) − f^(2i−1)(0)) − b_{2n} f^(2n)(ξ)/(2n)!,    (1.8)

where 0 ≤ ξ ≤ 1, f^(j) is the jth derivative of f, and bj = Bj(0) can be determined using the recursion relation

    Σ_{j=0}^{m} C(m + 1, j) Bj(z) = (m + 1)z^m,    (1.9)

initialized with B0(z) = 1. The proof of this result is based on repeated integrations by parts [376].
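Recursion (1.9), evaluated at z = 0, gives the constants bj directly: for m ≥ 1 the right side vanishes, so each new Bm(0) is determined by its predecessors. A short sketch (ours) computes them in exact arithmetic:

```python
from fractions import Fraction
from math import comb

def bernoulli_constants(n):
    """Compute b_j = B_j(0) for j = 0, ..., n from recursion (1.9) at z = 0."""
    b = [Fraction(1)]                      # B_0(0) = 1
    for m in range(1, n + 1):
        # sum_{j=0}^{m} C(m+1, j) B_j(0) = 0 for m >= 1; solve for B_m(0).
        s = sum(comb(m + 1, j) * b[j] for j in range(m))
        b.append(-s / Fraction(m + 1))
    return b

b = bernoulli_constants(6)
print(b[1], b[2], b[4], b[6])  # -1/2 1/6 -1/30 1/42
```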
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

    df(x)/dxi ≈ (f(x + εi ei) − f(x − εi ei)) / (2εi),    (1.10)

where εi is a small number and ei is the unit vector in the ith coordinate direction. Typically, one might start with, say, εi = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller εi. The approximation will generally improve until εi becomes small enough that the calculation is degraded and eventually dominated by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via

    d²f(x)/(dxi dxj) ≈ [1/(4εiεj)] ( f(x + εi ei + εj ej) − f(x + εi ei − εj ej) − f(x − εi ei + εj ej) + f(x − εi ei − εj ej) ),    (1.11)

with similar sequential precision improvements.
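A short sketch (ours; the test function and ε values are arbitrary choices) applies (1.10) and (1.11) to f(x1, x2) = sin(x1) exp{x2}, whose derivatives are known exactly:

```python
import math

def f(x):
    return math.sin(x[0]) * math.exp(x[1])

def central_diff(f, x, i, eps=1e-5):
    """First derivative via the central difference (1.10)."""
    xp, xm = list(x), list(x)
    xp[i] += eps
    xm[i] -= eps
    return (f(xp) - f(xm)) / (2 * eps)

def mixed_diff(f, x, i, j, eps=1e-4):
    """Mixed second derivative via (1.11), with eps_i = eps_j = eps."""
    def at(di, dj):
        y = list(x)
        y[i] += di
        y[j] += dj
        return f(y)
    return (at(eps, eps) - at(eps, -eps) - at(-eps, eps) + at(-eps, -eps)) / (4 * eps * eps)

x = [0.7, 0.3]
exact = math.cos(0.7) * math.exp(0.3)   # equals both df/dx1 and d2f/(dx1 dx2) here

g1 = central_diff(f, x, 0)
m12 = mixed_diff(f, x, 0, 1)
print(abs(g1 - exact) < 1e-8)    # True
print(abs(m12 - exact) < 1e-6)   # True
```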
Profile likelihoods provide an informative way to graph a higher-dimensionallikelihood surface to make inference about some parameters while treating othersas nuisance parameters and to facilitate various optimization strategies The profilelikelihood is obtained by constrained maximization of the full likelihood with respectto parameters to be ignored If θ = (μ φ) then the profile likelihood for φ is
L(φ|μ(φ)) = maxμ
L(μ φ) (129)
Thus for each possible φ a value of μ is chosen to maximize L(μ φ) This optimalμ is a function of φ The profile likelihood is the function that maps φ to the valueof the full likelihood evaluated at φ and its corresponding optimal μ Note that the φ
ACKNOWLEDGMENTS
The course upon which this book is based was developed and taught by us at Colorado State University from 1994 onwards. Thanks are due to our many students, who have been semiwilling guinea pigs over the years. We also thank our colleagues in the Statistics Department for their continued support. The late Richard Tweedie merits particular acknowledgment for his mentoring during the early years of our careers.

We owe a great deal of intellectual debt to Adrian Raftery, who deserves special thanks not only for his teaching and advising but also for his unwavering support and his seemingly inexhaustible supply of good ideas. In addition, we thank our influential advisors and teachers at the University of Washington Statistics Department, including David Madigan, Werner Stuetzle, and Judy Zeh. Of course, each of our chapters could be expanded into a full-length book, and great scholars have already done so. We owe much to their efforts, upon which we relied when developing our course and our manuscript.

Portions of the first edition were written at the Department of Mathematics and Statistics, University of Otago, in Dunedin, New Zealand, whose faculty we thank for graciously hosting us during our sabbatical in 2003. Much of our work on the second edition was undertaken during our sabbatical visit to the Australia Commonwealth Scientific and Research Organization in 2009–2010, sponsored by CSIRO Mathematics, Informatics and Statistics, and hosted at the Longpocket Laboratory in Indooroopilly, Australia. We thank our hosts and colleagues there for their support.

Our manuscript has been greatly improved through the constructive reviews of John Bickham, Ben Bird, Kate Cowles, Jan Hannig, Alan Herlihy, David Hunter, Devin Johnson, Michael Newton, Doug Nychka, Steve Sain, David W. Scott, N. Scott Urquhart, Haonan Wang, Darrell Whitley, and eight anonymous referees. We also thank the sharp-eyed readers listed in the errata for their suggestions and corrections. Our editor Steve Quigley and the folks at Wiley were supportive and helpful during the publication process. We thank Nelida Pohl for permission to adapt her photograph in the cover design of the first edition. We thank Melinda Stelzer for permission to use her painting "Champagne Circuit," 2001, for the cover of the second edition; more about her art can be found at www.facebook.com/geekchicart. We also owe a special note of thanks to Zube (a.k.a. John Dzubera), who kept our own computers running despite our best efforts to the contrary.

Funding from National Science Foundation (NSF) CAREER grant SBR-9875508 was a significant source of support for the first author during the preparation of the first edition. He also thanks his colleagues and friends in the North Slope Borough, Alaska, Department of Wildlife Management for their longtime research support. The second author gratefully acknowledges the support of STAR Research Assistance Agreement CR-829095, awarded to Colorado State University by the U.S. Environmental Protection Agency (EPA). The views expressed here are solely those of the authors; NSF and EPA do not endorse any products or commercial services mentioned herein.

Finally, we thank our parents for enabling and supporting our educations, and for providing us with the "stubbornness gene" necessary for graduate school, the tenure track, or book publication: take your pick. The second edition is dedicated to our kids, Natalie and Neil, for continuing to show us what is important and what is not.
Geof H. Givens
Jennifer A. Hoeting
CHAPTER 1
REVIEW
This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.
1.1 MATHEMATICAL NOTATION
We use boldface to distinguish a vector x = (x1, ..., xp) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f1(x), ..., fp(x)). The transpose of M is denoted M^T.

Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x1, ..., xn)^T. Let I denote an identity matrix, and let 1 and 0 denote vectors of ones and zeros, respectively.

A symmetric square matrix M is positive definite if x^T M x > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite or positive semidefinite if x^T M x ≥ 0 for all nonzero vectors x.
The derivative of a function f, evaluated at x, is denoted f'(x). When x = (x1, ..., xp), the gradient of f at x is

f'(x) = \left( \frac{df(x)}{dx_1}, \ldots, \frac{df(x)}{dx_p} \right).
The Hessian matrix for f at x is f''(x), having (i, j)th element equal to d²f(x)/(dx_i dx_j). The negative Hessian has important uses in statistical inference.

Let J(x) denote the Jacobian matrix evaluated at x for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to df_i(x)/dx_j.

A functional is a real-valued function on a space of functions. For example, if T(f) = \int_0^1 f(x)\, dx, then the functional T maps suitably integrable functions onto the real line.

The indicator function 1{A} equals 1 if A is true and 0 otherwise. The real line is denoted ℝ, and p-dimensional real space is ℝ^p.
Computational Statistics, Second Edition. Geof H. Givens and Jennifer A. Hoeting. © 2013 John Wiley & Sons, Inc. Published 2013 by John Wiley & Sons, Inc.
1.2 TAYLOR'S THEOREM AND MATHEMATICAL LIMIT THEORY
First we define standard "big oh" and "little oh" notation for describing the relative orders of convergence of functions. Let the functions f and g be defined on a common, possibly infinite, interval. Let z_0 be a point in this interval or a boundary point of it (i.e., −∞ or ∞). We require g(z) ≠ 0 for all z ≠ z_0 in a neighborhood of z_0. Then we say

f(z) = O(g(z))   (1.1)

if there exists a constant M such that |f(z)| ≤ M|g(z)| as z → z_0. For example, (n + 1)/(3n²) = O(n^{-1}), where it is understood that we are considering n → ∞. If lim_{z→z_0} f(z)/g(z) = 0, then we say

f(z) = o(g(z)).   (1.2)

For example, f(x_0 + h) − f(x_0) = h f'(x_0) + o(h) as h → 0 if f is differentiable at x_0. The same notation can be used for describing the convergence of a sequence {x_n} as n → ∞, by letting f(n) = x_n.
Taylor's theorem provides a polynomial approximation to a function f. Suppose f has a finite (n + 1)th derivative on (a, b) and a continuous nth derivative on [a, b]. Then for any x_0 ∈ [a, b] distinct from x, the Taylor series expansion of f about x_0 is

f(x) = \sum_{i=0}^{n} \frac{1}{i!} f^{(i)}(x_0) (x - x_0)^i + R_n,   (1.3)

where f^{(i)}(x_0) is the ith derivative of f evaluated at x_0, and

R_n = \frac{1}{(n+1)!} f^{(n+1)}(\xi) (x - x_0)^{n+1}   (1.4)

for some point ξ in the interval between x and x_0. As |x − x_0| → 0, note that R_n = O(|x − x_0|^{n+1}).
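The shrinking remainder in (1.4) is easy to observe numerically. Below is a minimal Python sketch (the language and the helper name `taylor_poly` are our own choices for illustration, not from the text) that expands exp about x_0 = 0 and watches the error decay as the order n grows:

```python
import math

def taylor_poly(derivs, x0, x, n):
    """Degree-n Taylor polynomial (1.3) of f about x0, evaluated at x.
    derivs(i, x0) must return the ith derivative of f at x0."""
    return sum(derivs(i, x0) * (x - x0) ** i / math.factorial(i)
               for i in range(n + 1))

# Every derivative of exp is exp itself.
exp_derivs = lambda i, x0: math.exp(x0)

x0, x = 0.0, 0.5
for n in (1, 2, 4, 8):
    err = abs(taylor_poly(exp_derivs, x0, x, n) - math.exp(x))
    print(n, err)  # error shrinks like |x - x0|^(n+1), as (1.4) predicts
```

Each doubling of n sharply reduces the error, consistent with the O(|x − x_0|^{n+1}) rate.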
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x_0 ≠ x. Then

f(x) = f(x_0) + \sum_{i=1}^{n} \frac{1}{i!} D^{(i)}(f; x_0, x - x_0) + R_n,   (1.5)

where

D^{(i)}(f; x, y) = \sum_{j_1=1}^{p} \cdots \sum_{j_i=1}^{p} \left( \frac{d^i}{dt_{j_1} \cdots dt_{j_i}} f(t) \Big|_{t=x} \right) \prod_{k=1}^{i} y_{j_k}   (1.6)
and

R_n = \frac{1}{(n+1)!} D^{(n+1)}(f; \xi, x - x_0)   (1.7)

for some ξ on the line segment joining x and x_0. As |x − x_0| → 0, note that R_n = O(|x − x_0|^{n+1}).
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

\int_0^1 f(x)\, dx = \frac{f(0) + f(1)}{2} - \sum_{i=1}^{n-1} \frac{b_{2i} \left( f^{(2i-1)}(1) - f^{(2i-1)}(0) \right)}{(2i)!} - \frac{b_{2n} f^{(2n)}(\xi)}{(2n)!},   (1.8)

where 0 ≤ ξ ≤ 1, f^{(j)} is the jth derivative of f, and b_j = B_j(0) can be determined using the recursion relation

\sum_{j=0}^{m} \binom{m+1}{j} B_j(z) = (m+1) z^m,   (1.9)

initialized with B_0(z) = 1. The proof of this result is based on repeated integrations by parts [376].
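Evaluating the recursion (1.9) at z = 0 yields the constants b_j needed in (1.8). A short Python sketch (the function name is ours; exact rational arithmetic comes from the standard library):

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(n):
    """b_0, ..., b_n, where b_j = B_j(0), via recursion (1.9) at z = 0:
    sum_{j=0}^{m} C(m+1, j) * b_j = (m+1) * 0^m  (= 1 if m = 0, else 0)."""
    b = [Fraction(1)]                      # B_0(z) = 1, so b_0 = 1
    for m in range(1, n + 1):
        s = sum(comb(m + 1, j) * b[j] for j in range(m))
        b.append(-s / (m + 1))             # solve the recursion for b_m
    return b

b = bernoulli_numbers(6)
# b_1 = -1/2, b_2 = 1/6, b_3 = 0, b_4 = -1/30, b_5 = 0, b_6 = 1/42
print(b)
```

The odd-index values beyond b_1 vanish, which is why (1.8) involves only the even-index constants b_{2i}.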
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

\frac{df(x)}{dx_i} \approx \frac{f(x + \epsilon_i e_i) - f(x - \epsilon_i e_i)}{2 \epsilon_i},   (1.10)

where ε_i is a small number and e_i is the unit vector in the ith coordinate direction. Typically, one might start with, say, ε_i = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller ε_i. The approximation will generally improve until ε_i becomes small enough that the calculation is degraded, and eventually dominated, by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via

\frac{d^2 f(x)}{dx_i\, dx_j} \approx \frac{1}{4 \epsilon_i \epsilon_j} \Big( f(x + \epsilon_i e_i + \epsilon_j e_j) - f(x + \epsilon_i e_i - \epsilon_j e_j) - f(x - \epsilon_i e_i + \epsilon_j e_j) + f(x - \epsilon_i e_i - \epsilon_j e_j) \Big),   (1.11)

with similar sequential precision improvements.
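As an illustration, (1.10) and (1.11) take only a few lines of code. This Python sketch (our own, not the authors') checks the central differences against a function whose gradient and Hessian are known analytically:

```python
import math

def grad_fd(f, x, eps=1e-5):
    """Central-difference approximation (1.10) to the gradient of f at x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g

def hess_fd(f, x, eps=1e-4):
    """Central-difference approximation (1.11) to the Hessian of f at x."""
    p = len(x)
    H = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            def shifted(si, sj):
                z = list(x)
                z[i] += si * eps   # when i == j the shifts accumulate,
                z[j] += sj * eps   # exactly as (1.11) requires
                return f(z)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * eps * eps)
    return H

# Test on f(x) = x0^2 * x1 + sin(x0), whose gradient is
# (2*x0*x1 + cos(x0), x0^2) and Hessian [[2*x1 - sin(x0), 2*x0], [2*x0, 0]].
f = lambda x: x[0] ** 2 * x[1] + math.sin(x[0])
print(grad_fd(f, [1.0, 2.0]))  # approximately [4.5403, 1.0]
```

Shrinking `eps` below about the square root (for 1.10) or fourth root (for 1.11) of machine precision degrades the answers, illustrating the subtractive-cancellation effect described above.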
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS
We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ~ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters will also be denoted with a conditioning bar, as in f(x|α, β). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|α) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example, by using f(·|α) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are f_X and f_Y, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.

The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or f_{X|Y}(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|μ) = f(x|y, μ) f(y|μ). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.

The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X, or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^{1/2}/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so

g(\lambda x + (1 - \lambda) y) \le \lambda g(x) + (1 - \lambda) g(y)   (1.12)

for all x, y ∈ I and all 0 < λ < 1. Then Jensen's inequality states that E{g(X)} ≥ g(E{X}) for any random variable X having P[X ∈ I] = 1.
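Since a sample average is an expectation under the empirical distribution, Jensen's inequality also holds exactly for sample averages. A small Python illustration (our own construction, taking g(x) = e^x, which is convex):

```python
import math
import random

random.seed(1)
# X ~ Unif(0, 2) and g(x) = exp(x), which is convex, so (1.12) applies.
xs = [random.uniform(0.0, 2.0) for _ in range(100_000)]

mean_gx = sum(math.exp(x) for x in xs) / len(xs)   # estimates E{g(X)}
g_meanx = math.exp(sum(xs) / len(xs))              # estimates g(E{X})

# By convexity the inequality holds exactly for the sample versions too.
print(mean_gx >= g_meanx)              # prints True
print(mean_gx, (math.e ** 2 - 1) / 2)  # Monte Carlo estimate vs exact E{e^X}
```

Here the exact values are E{e^X} = (e² − 1)/2 ≈ 3.19 and g(E{X}) = e ≈ 2.72, so the gap is substantial.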
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

n! = n(n-1)(n-2) \cdots (3)(2)(1), \text{ with } 0! = 1,   (1.13)

\binom{n}{k} = \frac{n!}{k!\,(n-k)!},   (1.14)
TABLE 1.1 Notation and description for common probability distributions of discrete random variables

Bernoulli: X ~ Bernoulli(p), 0 ≤ p ≤ 1. Density f(x) = p^x (1-p)^{1-x} for x = 0 or 1. E{X} = p; var{X} = p(1-p).

Binomial: X ~ Bin(n, p), 0 ≤ p ≤ 1 and n ∈ {1, 2, ...}. Density f(x) = \binom{n}{x} p^x (1-p)^{n-x} for x = 0, 1, ..., n. E{X} = np; var{X} = np(1-p).

Multinomial: X ~ Multinomial(n, p), p = (p1, ..., pk) with 0 ≤ p_i ≤ 1, \sum_{i=1}^{k} p_i = 1, and n ∈ {1, 2, ...}. Density f(x) = \binom{n}{x_1 \cdots x_k} \prod_{i=1}^{k} p_i^{x_i} for x = (x1, ..., xk), x_i ∈ {0, 1, ..., n}, \sum_{i=1}^{k} x_i = n. E{X} = np; var{X_i} = n p_i (1 - p_i); cov{X_i, X_j} = -n p_i p_j.

Negative Binomial: X ~ NegBin(r, p), 0 ≤ p ≤ 1 and r ∈ {1, 2, ...}. Density f(x) = \binom{x + r - 1}{r - 1} p^r (1-p)^x for x ∈ {0, 1, ...}. E{X} = r(1-p)/p; var{X} = r(1-p)/p².

Poisson: X ~ Poisson(λ), λ > 0. Density f(x) = \frac{\lambda^x}{x!} \exp\{-\lambda\} for x ∈ {0, 1, ...}. E{X} = λ; var{X} = λ.
TABLE 1.2 Notation and description for some common probability distributions of continuous random variables

Beta: X ~ Beta(α, β), α > 0 and β > 0. Density f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1} for 0 ≤ x ≤ 1. E{X} = α/(α+β); var{X} = αβ/((α+β)²(α+β+1)).

Cauchy: X ~ Cauchy(α, β), α ∈ ℝ and β > 0. Density f(x) = \frac{1}{\pi\beta \left[ 1 + \left( \frac{x-\alpha}{\beta} \right)^2 \right]} for x ∈ ℝ. E{X} and var{X} are nonexistent.

Chi-square: X ~ χ²_ν, ν > 0. Density equal to that of Gamma(ν/2, 1/2), for x > 0. E{X} = ν; var{X} = 2ν.

Dirichlet: X ~ Dirichlet(α), α = (α1, ..., αk) with α_i > 0, and α_0 = \sum_{i=1}^{k} \alpha_i. Density f(x) = \frac{\Gamma(\alpha_0) \prod_{i=1}^{k} x_i^{\alpha_i - 1}}{\prod_{i=1}^{k} \Gamma(\alpha_i)} for x = (x1, ..., xk), 0 ≤ x_i ≤ 1, \sum_{i=1}^{k} x_i = 1. E{X} = α/α_0; var{X_i} = \frac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2(\alpha_0 + 1)}; cov{X_i, X_j} = \frac{-\alpha_i \alpha_j}{\alpha_0^2(\alpha_0 + 1)}.

Exponential: X ~ Exp(λ), λ > 0. Density f(x) = λ exp{-λx} for x > 0. E{X} = 1/λ; var{X} = 1/λ².

Gamma: X ~ Gamma(r, λ), λ > 0 and r > 0. Density f(x) = \frac{\lambda^r x^{r-1}}{\Gamma(r)} \exp\{-\lambda x\} for x > 0. E{X} = r/λ; var{X} = r/λ².
TABLE 1.3 Notation and description for more common probability distributions of continuous random variables

Lognormal: X ~ Lognormal(μ, σ²), μ ∈ ℝ and σ > 0. Density f(x) = \frac{1}{x\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2} \left( \frac{\log x - \mu}{\sigma} \right)^2 \right\} for x > 0. E{X} = exp{μ + σ²/2}; var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}.

Multivariate Normal: X ~ N_k(μ, Σ), μ = (μ1, ..., μk) ∈ ℝ^k and Σ positive definite. Density f(x) = \frac{\exp\{ -(x-\mu)^T \Sigma^{-1} (x-\mu)/2 \}}{(2\pi)^{k/2} |\Sigma|^{1/2}} for x = (x1, ..., xk) ∈ ℝ^k. E{X} = μ; var{X} = Σ.

Normal: X ~ N(μ, σ²), μ ∈ ℝ and σ > 0. Density f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right\} for x ∈ ℝ. E{X} = μ; var{X} = σ².

Student's t: X ~ t_ν, ν > 0. Density f(x) = \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)\sqrt{\pi\nu}} \left( 1 + \frac{x^2}{\nu} \right)^{-(\nu+1)/2} for x ∈ ℝ. E{X} = 0 if ν > 1; var{X} = \frac{\nu}{\nu - 2} if ν > 2.

Uniform: X ~ Unif(a, b), a, b ∈ ℝ and a < b. Density f(x) = \frac{1}{b - a} for x ∈ [a, b]. E{X} = (a + b)/2; var{X} = (b − a)²/12.

Weibull: X ~ Weibull(a, b), a > 0 and b > 0. Density f(x) = a b x^{b-1} \exp\{-a x^b\} for x > 0. E{X} = \frac{\Gamma(1 + 1/b)}{a^{1/b}}; var{X} = \frac{\Gamma(1 + 2/b) - \Gamma(1 + 1/b)^2}{a^{2/b}}.
\binom{n}{k_1, \ldots, k_m} = \frac{n!}{\prod_{i=1}^{m} k_i!}, \text{ where } n = \sum_{i=1}^{m} k_i,   (1.15)
\Gamma(r) = \begin{cases} (r-1)! & \text{for } r = 1, 2, \ldots \\ \int_0^\infty t^{r-1} \exp\{-t\}\, dt & \text{for general } r > 0. \end{cases}   (1.16)

It is worth knowing that \Gamma\left(\frac{1}{2}\right) = \sqrt{\pi} and \Gamma\left(n + \frac{1}{2}\right) = \frac{1 \times 3 \times 5 \times \cdots \times (2n-1)\,\sqrt{\pi}}{2^n} for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as

f(x|\gamma) = c_1(x)\, c_2(\gamma) \exp\left\{ \sum_{i=1}^{k} y_i(x)\, \theta_i(\gamma) \right\}   (1.17)
for nonnegative functions c_1 and c_2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θ_i(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The y_i(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that

E\{y(X)\} = \kappa'(\theta)   (1.18)

and

var\{y(X)\} = \kappa''(\theta),   (1.19)

where \kappa(\theta) = -\log c_3(\theta), letting c_3(\theta) denote the reexpression of c_2(\gamma) in terms of the canonical parameters \theta = (\theta_1, \ldots, \theta_k), and y(X) = (y_1(X), \ldots, y_k(X)). These results can be rewritten in terms of the original parameters γ as

E\left\{ \sum_{i=1}^{k} \frac{d\theta_i(\gamma)}{d\gamma_j}\, y_i(X) \right\} = -\frac{d}{d\gamma_j} \log c_2(\gamma)   (1.20)

and

var\left\{ \sum_{i=1}^{k} \frac{d\theta_i(\gamma)}{d\gamma_j}\, y_i(X) \right\} = -\frac{d^2}{d\gamma_j^2} \log c_2(\gamma) - E\left\{ \sum_{i=1}^{k} \frac{d^2\theta_i(\gamma)}{d\gamma_j^2}\, y_i(X) \right\}.   (1.21)
Example 1.1 (Poisson). The Poisson distribution belongs to the exponential family, with c_1(x) = 1/x!, c_2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ'(θ) = exp{θ} = λ and var{X} = κ''(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1, so E{X} = λ.
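The identity (1.18) can also be checked numerically for this example: with κ(θ) = e^θ, a finite-difference derivative of κ at θ = log λ should reproduce the Poisson mean. A Python sketch (our own construction):

```python
import math

lam = 3.0
theta = math.log(lam)             # canonical parameter for the Poisson
kappa = lambda t: math.exp(t)     # kappa(theta) = exp(theta), from Example 1.1

# E{X} computed directly from the Poisson mass function.
mean_direct = sum(x * lam ** x * math.exp(-lam) / math.factorial(x)
                  for x in range(100))

# kappa'(theta) via the central difference (1.10).
h = 1e-6
kappa_prime = (kappa(theta + h) - kappa(theta - h)) / (2 * h)

print(mean_direct, kappa_prime)   # both approximately lambda = 3
```

Both routes yield λ, as (1.18) promises.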
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X1, ..., Xp) denote a p-dimensional random variable with continuous density function f. Suppose that

U = g(X) = (g_1(X), \ldots, g_p(X)) = (U_1, \ldots, U_p),   (1.22)

where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

f(u) = f(g^{-1}(u))\, |J(u)|,   (1.23)

where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g^{-1} evaluated at u, having (i, j)th element dx_i/du_j, where these derivatives are assumed to be continuous over the support region of U.
1.4 LIKELIHOOD INFERENCE
If X_1, ..., X_n are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ_1, ..., θ_p), then the joint likelihood function is

L(\theta) = \prod_{i=1}^{n} f(x_i | \theta).   (1.24)

When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x_1, ..., x_n|θ), viewed as a function of θ.
The observed data x_1, ..., x_n might have been realized under many different values for θ. The parameters for which observing x_1, ..., x_n would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ̂ is the function of x_1, ..., x_n that maximizes L(θ), then θ̂ = ϑ̂(X_1, ..., X_n) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.
It is typically easier to work with the log likelihood function,

l(\theta) = \log L(\theta),   (1.25)

which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x_1, ..., x_n but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations

l'(\theta) = 0,   (1.26)

where

l'(\theta) = \left( \frac{dl(\theta)}{d\theta_1}, \ldots, \frac{dl(\theta)}{d\theta_p} \right)
is called the score function. The score function satisfies

E\{l'(\theta)\} = 0,   (1.27)

where the expectation is taken with respect to the distribution of X_1, ..., X_n. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.
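As a preview of such numerical methods, here is a minimal Python sketch (the data and names are invented for illustration) of Newton's method applied to the score equation (1.26) for an i.i.d. Exponential(λ) sample, a case where the MLE 1/x̄ is also available in closed form for comparison:

```python
# Newton's method on the score equation (1.26) for an i.i.d.
# Exponential(lambda) sample: l'(lambda) = n/lambda - sum(x) and
# l''(lambda) = -n/lambda^2.  The closed-form MLE is 1/xbar = n/sum(x).
data = [0.8, 1.2, 0.3, 2.1, 0.9, 1.5, 0.6, 1.1]   # invented illustrative data
n, s = len(data), sum(data)

lam = 1.0                        # starting value
for _ in range(25):
    score = n / lam - s          # l'(lambda)
    curv = -n / lam ** 2         # l''(lambda)
    lam -= score / curv          # Newton update

print(lam, n / s)   # the iterate matches the closed-form MLE
```

With a reasonable starting value the iteration converges quadratically to n/Σx_i, the root of the score function.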
The MLE has a sampling distribution because it depends on the realization of the random variables X_1, ..., X_n. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: when the log likelihood is very pointy, the location of the maximum is more precisely known.

To make this precise, let l''(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθ_i dθ_j). The Fisher information matrix is defined as

I(\theta) = E\{l'(\theta)\, l'(\theta)^T\} = -E\{l''(\theta)\},   (1.28)

where the expectations are taken with respect to the distribution of X_1, ..., X_n. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l''(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.
Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)^{−1}, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is N_p(θ*, I(θ*)^{−1}). Since the true parameter values are unknown, I(θ*)^{−1} must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)^{−1}. Alternatively, it is also reasonable to use −l''(θ̂)^{−1}. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)^{−1}. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)^{−1} can be found in [127, 182, 371, 470].
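As an illustration (our own, not from the text): for an i.i.d. Exponential(λ) sample, the observed information −l''(λ̂) = n/λ̂² gives the standard error λ̂/√n, and a finite-difference second derivative of the log likelihood (cf. (1.11)) reproduces the analytic value:

```python
import math

# For an i.i.d. Exponential(lambda) sample (invented data, for illustration),
# l(lambda) = n * log(lambda) - lambda * sum(x).  The observed information is
# -l''(lambda-hat) = n / lambda-hat^2, yielding SE = lambda-hat / sqrt(n).
data = [0.8, 1.2, 0.3, 2.1, 0.9, 1.5, 0.6, 1.1]
n, s = len(data), sum(data)
loglik = lambda lam: n * math.log(lam) - lam * s

lam_hat = n / s                       # closed-form MLE
obs_info = n / lam_hat ** 2           # analytic -l''(lam_hat)
se = math.sqrt(1.0 / obs_info)        # estimated standard error of the MLE

# The same information via a finite-difference second derivative.
h = 1e-5
obs_info_fd = -(loglik(lam_hat + h) - 2 * loglik(lam_hat)
                + loglik(lam_hat - h)) / h ** 2

print(se, obs_info, obs_info_fd)      # the two information values agree
```

This mirrors the practical workflow described above: when expectations are hard, differentiate the realized log likelihood (analytically or numerically) at θ̂ and invert.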
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is

L(\phi | \hat{\mu}(\phi)) = \max_{\mu} L(\mu, \phi).   (1.29)

Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
varX = p(1 minus p)
Binomial X simBin(n p)0 le p le 1n isin 1 2
f (x) =(
n
x
)px(1 minus p)nminusx
x = 0 1 n
EX = np
varX = np(1 minus p)
Multinomial X simMultinomial(n p)p = (p1 pk)0 le pi le 1 and n isin 1 2 sumk
i=1 pi = 1
f (x) =(
n
x1 xk
)prodk
i=1 pxi
i
x = (x1 xk) and xi isin 0 1 nsumk
i=1 xi = n
EX = npvarXi = npi(1 minus pi)covXi Xj = minusnpipj
NegativeBinomial
X simNegBin(r p)0 le p le 1r isin 1 2
f (x) =(
x + r minus 1
r minus 1
)pr(1 minus p)x
x isin 0 1 EX = r(1 minus p)pvarX = r(1 minus p)p2
Poisson X simPoisson(λ) f (x) = λx
xexpminusλ EX = λ
λ gt 0 x isin 0 1 varX = λ
5
TABLE 12 Notation and description for some common probability distributions of continuous random variables
Notation and Density and Mean and
Name Parameter Space Sample Space Variance
Beta X simBeta(α β)α gt 0 and β gt 0
f (x) = (α + β)
(α)(β)xαminus1(1 minus x)βminus1
0 le x le 1
EX = α
α + β
varX = αβ
(α + β)2(α + β + 1)
Cauchy X simCauchy(α β)α isin and β gt 0
f (x) = 1
πβ
[1 +
(x minus α
β
)2]
x isin
EX is nonexistentvarX is nonexistent
Chi-square X sim χ2ν
ν gt 0f (x) = Gamma(ν2 12)x gt 0
EX = ν
varX = 2ν
Dirichlet X simDirichlet(α)α = (α1 αk)αi gt 0α0 = sumk
i=1 αi
f (x) = (α0)prodk
i=1 xαiminus1iprodk
i=1 (αi)x = (x1 xk) and 0 le xi le 1sumk
i=1 xi = 1
EX = αα0
varXi = αi(α0 minus αi)
α20(α0 + 1)
covXi Xj = minusαiαj
α20(α0 + 1)
Exponential X simExp(λ)λ gt 0
f (x) = λ expminusλxx gt 0
EX = 1λ
varX = 1λ2
Gamma X simGamma(r λ)λ gt 0 and r gt 0
f (x) = λrxrminus1
(r)expminusλx
x gt 0
EX = rλ
varX = rλ2
6
TABLE 13 Notation and description for more common probability distributions of continuous random variables
Notation and Density and Mean and
Name Parameter Space Sample Space Variance
Lognormal X simLognormal(μ σ2)μ isin and σ gt 0
f (x) = 1
xradic
2πσ2exp
minus1
2
(logx minus μ
σ
)2
x isin EX = expμ + σ22varX = exp2μ + 2σ2 minus exp2μ + σ2
MultivariateNormal
X sim Nk(μ )μ = (μ1 μk) isin k
positive definite
f (x) = expminus(x minus μ)T minus1(x minus μ)2(2π)k2||12
x = (x1 xk) isin k
EX = μ
varX =
Normal X simN(μ σ2)μ isin and σ gt 0
f (x) = 1radic2πσ2
exp
minus1
2
(x minus μ
σ
)2
x isin EX = μ
varX = σ2
Studentrsquos t X sim tν
ν gt 0f (x) = ((ν + 1)2)
(ν2)radic
πν
(1 + x2
ν
)minus(ν+1)2
x isin EX = 0 if ν gt 1
varX = ν
ν + 2if ν gt 2
Uniform X simUnif(a b)a b isin and a lt b
f (x) = 1
b minus ax isin [a b]
EX = (a + b)2varX = (b minus a)212
Weibull X simWeibull(a b)a gt 0 and b gt 0
f (x) = abxbminus1 expminusaxbx gt 0
EX = (1 + 1b)
a1b
varX = (1 + 2b) minus (1 + 1b)2
a2b
7
8 CHAPTER 1 REVIEW(n
k1 km
)= nprodm
i=1 kiwhere n =
msumi=1
ki (115)
(r) =
(r minus 1) for r = 1 2 int infin0 trminus1 expminust dt for general r gt 0
(116)
It is worth knowing that
(
12
)= radic
π and (n + 1
2
)= 1 times 3 times 5 times middot middot middot times (2n minus 1)
radicπ2n
for positive integer nMany of the distributions commonly used in statistics are members of an
exponential family A k-parameter exponential family density can be expressed as
f (x|γ) = c1(x)c2(γ) exp
ksum
i=1
yi(x)θi(γ)
(117)
for nonnegative functions c1 and c2 The vectorγ denotes the familiar parameters suchas λ for the Poisson density and p for the binomial density The real-valued θi(γ) arethe natural or canonical parameters which are usually transformations of γ Theyi(x) are the sufficient statistics for the canonical parameters It is straightforwardto show
Ey(X) =κprime(θ) (118)
and
vary(X) =κprimeprime(θ) (119)
where κ(θ) = minus log c3(θ) letting c3(θ) denote the reexpression of c2(γ) in terms ofthe canonical parameters θ = (θ1 θk) and y(X) = (y1(X) yk(X)) Theseresults can be rewritten in terms of the original parameters γ as
E
ksum
i=1
dθi(γ)
dγj
yi(X)
= minus d
dγj
log c2(γ) (120)
and
var
ksum
i=1
dθi(γ)
dγj
yi(X)
= minus d2
dγ2j
log c2(γ) minus E
ksum
i=1
d2θi(γ)
dγ2j
yi(X)
(121)
Example 11 (Poisson) The Poisson distribution belongs to the exponential familywith c1(x) = 1x c2(λ) = expminusλ y(x) = x and θ(λ) = log λ Deriving momentsin terms of θ we have κ(θ) = expθ so EX = κprime(θ) = expθ = λ and varX =κprimeprime(θ) = expθ = λ The same results may be obtained with (120) and (121) notingthat dθdλ = 1λ For example (120) gives E Xλ = 1
It is also important to know how the distribution of a random variable changeswhen it is transformed Let X = (X1 Xp) denote a p-dimensional random
14 LIKELIHOOD INFERENCE 9
variable with continuous density function f Suppose that
U = g(X) = (g1(X) gp(X)) = (U1 Up) (122)
where g is a one-to-one function mapping the support region of f onto the space ofall u = g(x) for which x satisfies f (x) gt 0 To derive the probability distribution ofU from that of X we need to use the Jacobian matrix The density of the transformedvariables is
f (u) = f (gminus1(u)) |J(u)| (123)
where |J(u)| is the absolute value of the determinant of the Jacobian matrix of gminus1
evaluated at u having (i j)th element dxiduj where these derivatives are assumedto be continuous over the support region of U
14 LIKELIHOOD INFERENCE
If X1 Xn are independent and identically distributed (iid) each having densityf (x | θ) that depends on a vector of p unknown parameters θ = (θ1 θp) then thejoint likelihood function is
L(θ) =nprod
i=1
f (xi|θ) (124)
When the data are not iid the joint likelihood is still expressed as the joint densityf (x1 xn|θ) viewed as a function of θ
The observed data x1 xn might have been realized under many differentvalues for θ The parameters for which observing x1 xn would be most likelyconstitute the maximum likelihood estimate of θ In other words if ϑ is the functionof x1 xn that maximizes L(θ) then θ = ϑ(X1 Xn) is the maximum likeli-hood estimator (MLE) for θ MLEs are invariant to transformation so the MLE of atransformation of θ equals the transformation of θ
It is typically easier to work with the log likelihood function
l (θ) = log L(θ) (125)
which has the same maximum as the original likelihood since log is a strictly mono-tonic function Furthermore any additive constants (involving possibly x1 xn
but not θ) may be omitted from the log likelihood without changing the location of itsmaximum or differences between log likelihoods at different θ Note that maximizingL(θ) with respect to θ is equivalent to solving the system of equations
lprime(θ) = 0 (126)
where
lprime(θ) =(
dl(θ)
dθ1
dl(θ)
dθn
)
10 CHAPTER 1 REVIEW
is called the score function The score function satisfies
Elprime(θ) = 0 (127)
where the expectation is taken with respect to the distribution of X1 Xn Some-times an analytical solution to (126) provides the MLE this book describes a varietyof methods that can be used when the MLE cannot be solved for in closed form Itis worth noting that there are pathological circumstances where the MLE is not asolution of the score equation or the MLE is not unique see [127] for examples
The MLE has a sampling distribution because it depends on the realizationof the random variables X1 Xn The MLE may be biased or unbiased for θ yetunder quite general conditions it is asymptotically unbiased as n rarr infin The samplingvariance of the MLE depends on the average curvature of the log likelihood When thelog likelihood is very pointy the location of the maximum is more precisely known
To make this precise let lprimeprime(θ) denote the p times p matrix having (i j)th elementgiven by d2l(θ)(dθidθj) The Fisher information matrix is defined as
I(θ) = Elprime(θ)lprime(θ)T = minusElprimeprime(θ) (128)
where the expectations are taken with respect to the distribution of X1 Xn Thefinal equality in (128) requires mild assumptions which are satisfied for example inexponential families I(θ) may sometimes be called the expected Fisher informationto distinguish it from minuslprimeprime(θ) which is the observed Fisher information There aretwo reasons why the observed Fisher information is quite useful First it can becalculated even if the expectations in (128) cannot easily be computed Second it isa good approximation to I(θ) that improves as n increases
Under regularity conditions the asymptotic variancendashcovariance matrix ofthe MLE θ is I(θlowast)minus1 where θlowast denotes the true value of θ Indeed as n rarr infin thelimiting distribution of θ is Np(θlowast I(θlowast)minus1) Since the true parameter values areunknown I(θlowast)minus1 must be estimated in order to estimate the variancendashcovariancematrix of the MLE An obvious approach is to use I(θ)minus1 Alternatively it is alsoreasonable to use minuslprimeprime(θ)minus1 Standard errors for individual parameter MLEs canbe estimated by taking the square root of the diagonal elements of the chosenestimate of I(θlowast)minus1 A thorough introduction to maximum likelihood theory andthe relative merits of these estimates of I(θlowast)minus1 can be found in [127 182 371 470]
Profile likelihoods provide an informative way to graph a higher-dimensionallikelihood surface to make inference about some parameters while treating othersas nuisance parameters and to facilitate various optimization strategies The profilelikelihood is obtained by constrained maximization of the full likelihood with respectto parameters to be ignored If θ = (μ φ) then the profile likelihood for φ is
L(φ|μ(φ)) = maxμ
L(μ φ) (129)
Thus for each possible φ a value of μ is chosen to maximize L(μ φ) This optimalμ is a function of φ The profile likelihood is the function that maps φ to the valueof the full likelihood evaluated at φ and its corresponding optimal μ Note that the φ
CHAPTER 1

REVIEW

This chapter reviews notation and background material in mathematics, probability, and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, returning here only as needed.

1.1 MATHEMATICAL NOTATION

We use boldface to distinguish a vector x = (x1, ..., xp) or a matrix M from a scalar variable x or a constant M. A vector-valued function f evaluated at x is also boldfaced, as in f(x) = (f1(x), ..., fp(x)). The transpose of M is denoted M^T.

Unless otherwise specified, all vectors are considered to be column vectors, so, for example, an n × p matrix can be written as M = (x1, ..., xn)^T. Let I denote an identity matrix, and 1 and 0 denote vectors of ones and zeros, respectively.

A symmetric square matrix M is positive definite if x^T M x > 0 for all nonzero vectors x. Positive definiteness is equivalent to the condition that all eigenvalues of M are positive. M is nonnegative definite or positive semidefinite if x^T M x ≥ 0 for all nonzero vectors x.
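The eigenvalue characterization of positive definiteness is easy to check numerically. A minimal sketch (the function name and matrices are our own illustration, not from the text):

```python
import numpy as np

def is_positive_definite(M, tol=0.0):
    """Check positive definiteness of a symmetric matrix via its eigenvalues."""
    eigenvalues = np.linalg.eigvalsh(M)  # eigvalsh is for symmetric matrices
    return bool(np.all(eigenvalues > tol))

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # eigenvalues 1 and 3, so positive definite
N = np.array([[1.0, 2.0],
              [2.0, 1.0]])   # eigenvalues -1 and 3, so not positive definite

print(is_positive_definite(M))  # True
print(is_positive_definite(N))  # False
```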
The derivative of a function f evaluated at x is denoted f′(x). When x = (x1, ..., xp), the gradient of f at x is

    f'(\mathbf{x}) = \left( \frac{df(\mathbf{x})}{dx_1}, \ldots, \frac{df(\mathbf{x})}{dx_p} \right)

The Hessian matrix for f at x is f″(x), having (i, j)th element equal to d²f(x)/(dxi dxj). The negative Hessian has important uses in statistical inference.

Let J(x) denote the Jacobian matrix evaluated at x for the one-to-one mapping y = f(x). The (i, j)th element of J(x) is equal to dfi(x)/dxj.

A functional is a real-valued function on a space of functions. For example, if T(f) = ∫₀¹ f(x) dx, then the functional T maps suitably integrable functions onto the real line.

The indicator function 1{A} equals 1 if A is true and 0 otherwise. The real line is denoted ℝ, and p-dimensional real space is ℝ^p.

Computational Statistics, Second Edition. Geof H. Givens and Jennifer A. Hoeting.
© 2013 John Wiley & Sons, Inc. Published 2013 by John Wiley & Sons, Inc.
1.2 TAYLOR'S THEOREM AND MATHEMATICAL LIMIT THEORY

First we define standard "big oh" and "little oh" notation for describing the relative orders of convergence of functions. Let the functions f and g be defined on a common, possibly infinite, interval. Let z0 be a point in this interval or a boundary point of it (i.e., −∞ or ∞). We require g(z) ≠ 0 for all z ≠ z0 in a neighborhood of z0. Then we say

    f(z) = O(g(z))    (1.1)

if there exists a constant M such that |f(z)| ≤ M|g(z)| as z → z0. For example, (n + 1)/(3n²) = O(n⁻¹), where it is understood that we are considering n → ∞. If lim_{z→z0} f(z)/g(z) = 0, then we say

    f(z) = o(g(z))    (1.2)

For example, f(x0 + h) − f(x0) = h f′(x0) + o(h) as h → 0 if f is differentiable at x0. The same notation can be used for describing the convergence of a sequence {xn} as n → ∞, by letting f(n) = xn.
Taylor's theorem provides a polynomial approximation to a function f. Suppose f has finite (n + 1)th derivative on (a, b) and continuous nth derivative on [a, b]. Then for any x0 ∈ [a, b] distinct from x, the Taylor series expansion of f about x0 is

    f(x) = \sum_{i=0}^{n} \frac{1}{i!} f^{(i)}(x_0) (x - x_0)^i + R_n    (1.3)

where f^{(i)}(x0) is the ith derivative of f evaluated at x0, and

    R_n = \frac{1}{(n+1)!} f^{(n+1)}(\xi) (x - x_0)^{n+1}    (1.4)

for some point ξ in the interval between x and x0. As |x − x0| → 0, note that Rn = O(|x − x0|^{n+1}).
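As a concrete check (our own toy example, not from the text), the expansion of exp about x0 = 0 already tracks the function closely near 0 with a few terms, and the error shrinks at the rate (1.4) predicts:

```python
import math

def taylor_exp(x, n, x0=0.0):
    """n-term Taylor expansion (1.3) of exp about x0; every derivative of exp is exp."""
    return sum(math.exp(x0) * (x - x0) ** i / math.factorial(i) for i in range(n + 1))

x = 0.1
for n in (1, 2, 3):
    error = abs(math.exp(x) - taylor_exp(x, n))
    print(n, error)  # error is roughly of size x**(n+1) / (n+1)!
```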
The multivariate version of Taylor's theorem is analogous. Suppose f is a real-valued function of a p-dimensional variable x, possessing continuous partial derivatives of all orders up to and including n + 1 with respect to all coordinates, in an open convex set containing x and x0 ≠ x. Then

    f(x) = f(x_0) + \sum_{i=1}^{n} \frac{1}{i!} D^{(i)}(f; x_0, x - x_0) + R_n    (1.5)

where

    D^{(i)}(f; x, y) = \sum_{j_1=1}^{p} \cdots \sum_{j_i=1}^{p} \left( \frac{d^i}{dt_{j_1} \cdots dt_{j_i}} f(t) \Big|_{t=x} \right) \prod_{k=1}^{i} y_{j_k}    (1.6)
and

    R_n = \frac{1}{(n+1)!} D^{(n+1)}(f; \xi, x - x_0)    (1.7)

for some ξ on the line segment joining x and x0. As |x − x0| → 0, note that Rn = O(|x − x0|^{n+1}).

The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

    \int_0^1 f(x)\,dx = \frac{f(0) + f(1)}{2} - \sum_{i=1}^{n-1} \frac{b_{2i} \left( f^{(2i-1)}(1) - f^{(2i-1)}(0) \right)}{(2i)!} - \frac{b_{2n} f^{(2n)}(\xi)}{(2n)!}    (1.8)

where 0 ≤ ξ ≤ 1, f^{(j)} is the jth derivative of f, and bj = Bj(0) can be determined using the recursion relation

    \sum_{j=0}^{m} \binom{m+1}{j} B_j(z) = (m+1) z^m    (1.9)

initialized with B0(z) = 1. The proof of this result is based on repeated integrations by parts [376].
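The recursion (1.9) can be run directly. At z = 0 the right-hand side (m + 1)z^m vanishes for every m ≥ 1, which gives each b_m = B_m(0) in terms of the earlier ones. A short sketch using exact rational arithmetic (the function name is ours):

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(m_max):
    """Compute b_j = B_j(0) from recursion (1.9) evaluated at z = 0, where
    sum_{j=0}^{m} C(m+1, j) b_j = 0 for every m >= 1."""
    b = [Fraction(1)]  # B_0(z) = 1
    for m in range(1, m_max + 1):
        s = sum(comb(m + 1, j) * b[j] for j in range(m))
        b.append(-s / (m + 1))  # solve for b_m, since C(m+1, m) = m + 1
    return b

# b_0, ..., b_4 are 1, -1/2, 1/6, 0, -1/30
print(bernoulli_numbers(4))
```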
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

    \frac{df(x)}{dx_i} \approx \frac{f(x + \epsilon_i e_i) - f(x - \epsilon_i e_i)}{2\epsilon_i}    (1.10)

where εi is a small number and ei is the unit vector in the ith coordinate direction. Typically, one might start with, say, εi = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller εi. The approximation will generally improve until εi becomes small enough that the calculation is degraded, and eventually dominated, by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via

    \frac{d^2 f(x)}{dx_i\,dx_j} \approx \frac{1}{4\epsilon_i \epsilon_j} \Big( f(x + \epsilon_i e_i + \epsilon_j e_j) - f(x + \epsilon_i e_i - \epsilon_j e_j) - f(x - \epsilon_i e_i + \epsilon_j e_j) + f(x - \epsilon_i e_i - \epsilon_j e_j) \Big)    (1.11)

with similar sequential precision improvements.
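The central difference (1.10) is a one-line computation. A minimal sketch for scalar f (our own example; the true derivative of sin is cos, so the error is directly observable as ε shrinks):

```python
import math

def central_difference(f, x, eps):
    """Central difference approximation (1.10) to f'(x) for scalar f."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# True derivative of sin at 1.0 is cos(1.0).
x = 1.0
for eps in (1e-1, 1e-3, 1e-5):
    approx = central_difference(math.sin, x, eps)
    print(eps, abs(approx - math.cos(x)))  # error shrinks roughly like eps**2
```

Pushing ε much smaller (around 1e-11 and below in double precision) reverses the trend, illustrating the subtractive-cancellation effect described above.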
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS

We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ∼ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters will also be denoted with a conditioning bar, as in f(x|α, β). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|α) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example, by using f(·|α) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are fX and fY, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.

The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or fX|Y(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity, we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|μ) = f(x|y, μ) f(y|μ). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.

The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X, or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^{1/2}/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so

    g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y)    (1.12)

for all x, y ∈ I and all 0 < λ < 1. Then Jensen's inequality states that E{g(X)} ≥ g(E{X}) for any random variable X having P[X ∈ I] = 1.
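A quick Monte Carlo illustration of Jensen's inequality (our own example) with the convex function g(x) = x², for which E{g(X)} = E{X²} ≥ E{X}² = g(E{X}):

```python
import random

random.seed(1)

# Standard normal draws: E{X**2} = 1 while E{X}**2 = 0, so the gap is large.
draws = [random.gauss(0.0, 1.0) for _ in range(100_000)]

mean_of_square = sum(x ** 2 for x in draws) / len(draws)   # approximates E{X**2}
square_of_mean = (sum(draws) / len(draws)) ** 2            # approximates E{X}**2

print(mean_of_square >= square_of_mean)  # True
```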
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

    n! = n(n-1)(n-2) \cdots (3)(2)(1), \text{ with } 0! = 1    (1.13)

    \binom{n}{k} = \frac{n!}{k!(n-k)!}    (1.14)
TABLE 1.1  Notation and description for common probability distributions of discrete random variables

Bernoulli
    Notation and parameter space: X ∼ Bernoulli(p); 0 ≤ p ≤ 1
    Density and sample space: f(x) = p^x (1 − p)^{1−x}; x = 0 or 1
    Mean and variance: E{X} = p; var{X} = p(1 − p)

Binomial
    Notation and parameter space: X ∼ Bin(n, p); 0 ≤ p ≤ 1; n ∈ {1, 2, ...}
    Density and sample space: f(x) = \binom{n}{x} p^x (1 − p)^{n−x}; x = 0, 1, ..., n
    Mean and variance: E{X} = np; var{X} = np(1 − p)

Multinomial
    Notation and parameter space: X ∼ Multinomial(n, p); p = (p1, ..., pk) with 0 ≤ pi ≤ 1 and Σ_{i=1}^{k} pi = 1; n ∈ {1, 2, ...}
    Density and sample space: f(x) = \binom{n}{x_1 \cdots x_k} ∏_{i=1}^{k} p_i^{x_i}; x = (x1, ..., xk) with xi ∈ {0, 1, ..., n} and Σ_{i=1}^{k} xi = n
    Mean and variance: E{X} = np; var{Xi} = npi(1 − pi); cov{Xi, Xj} = −n pi pj

Negative Binomial
    Notation and parameter space: X ∼ NegBin(r, p); 0 ≤ p ≤ 1; r ∈ {1, 2, ...}
    Density and sample space: f(x) = \binom{x + r − 1}{r − 1} p^r (1 − p)^x; x ∈ {0, 1, ...}
    Mean and variance: E{X} = r(1 − p)/p; var{X} = r(1 − p)/p²

Poisson
    Notation and parameter space: X ∼ Poisson(λ); λ > 0
    Density and sample space: f(x) = λ^x exp{−λ}/x!; x ∈ {0, 1, ...}
    Mean and variance: E{X} = λ; var{X} = λ
TABLE 1.2  Notation and description for some common probability distributions of continuous random variables

Beta
    Notation and parameter space: X ∼ Beta(α, β); α > 0 and β > 0
    Density and sample space: f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1}; 0 ≤ x ≤ 1
    Mean and variance: E{X} = α/(α + β); var{X} = αβ/[(α + β)²(α + β + 1)]

Cauchy
    Notation and parameter space: X ∼ Cauchy(α, β); α ∈ ℝ and β > 0
    Density and sample space: f(x) = 1/(πβ[1 + ((x − α)/β)²]); x ∈ ℝ
    Mean and variance: E{X} is nonexistent; var{X} is nonexistent

Chi-square
    Notation and parameter space: X ∼ χ²_ν; ν > 0
    Density and sample space: f(x) = Gamma(ν/2, 1/2) density; x > 0
    Mean and variance: E{X} = ν; var{X} = 2ν

Dirichlet
    Notation and parameter space: X ∼ Dirichlet(α); α = (α1, ..., αk) with αi > 0; α0 = Σ_{i=1}^{k} αi
    Density and sample space: f(x) = Γ(α0) ∏_{i=1}^{k} x_i^{αi−1} / ∏_{i=1}^{k} Γ(αi); x = (x1, ..., xk) with 0 ≤ xi ≤ 1 and Σ_{i=1}^{k} xi = 1
    Mean and variance: E{X} = α/α0; var{Xi} = αi(α0 − αi)/[α0²(α0 + 1)]; cov{Xi, Xj} = −αi αj/[α0²(α0 + 1)]

Exponential
    Notation and parameter space: X ∼ Exp(λ); λ > 0
    Density and sample space: f(x) = λ exp{−λx}; x > 0
    Mean and variance: E{X} = 1/λ; var{X} = 1/λ²

Gamma
    Notation and parameter space: X ∼ Gamma(r, λ); λ > 0 and r > 0
    Density and sample space: f(x) = [λ^r x^{r−1}/Γ(r)] exp{−λx}; x > 0
    Mean and variance: E{X} = r/λ; var{X} = r/λ²
TABLE 1.3  Notation and description for more common probability distributions of continuous random variables

Lognormal
    Notation and parameter space: X ∼ Lognormal(μ, σ²); μ ∈ ℝ and σ > 0
    Density and sample space: f(x) = [1/(x√(2πσ²))] exp{−((log x − μ)/σ)²/2}; x > 0
    Mean and variance: E{X} = exp{μ + σ²/2}; var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}

Multivariate Normal
    Notation and parameter space: X ∼ N_k(μ, Σ); μ = (μ1, ..., μk) ∈ ℝ^k; Σ positive definite
    Density and sample space: f(x) = exp{−(x − μ)^T Σ⁻¹ (x − μ)/2} / [(2π)^{k/2} |Σ|^{1/2}]; x = (x1, ..., xk) ∈ ℝ^k
    Mean and variance: E{X} = μ; var{X} = Σ

Normal
    Notation and parameter space: X ∼ N(μ, σ²); μ ∈ ℝ and σ > 0
    Density and sample space: f(x) = [1/√(2πσ²)] exp{−((x − μ)/σ)²/2}; x ∈ ℝ
    Mean and variance: E{X} = μ; var{X} = σ²

Student's t
    Notation and parameter space: X ∼ t_ν; ν > 0
    Density and sample space: f(x) = [Γ((ν + 1)/2)/(Γ(ν/2)√(πν))] (1 + x²/ν)^{−(ν+1)/2}; x ∈ ℝ
    Mean and variance: E{X} = 0 if ν > 1; var{X} = ν/(ν − 2) if ν > 2

Uniform
    Notation and parameter space: X ∼ Unif(a, b); a, b ∈ ℝ and a < b
    Density and sample space: f(x) = 1/(b − a); x ∈ [a, b]
    Mean and variance: E{X} = (a + b)/2; var{X} = (b − a)²/12

Weibull
    Notation and parameter space: X ∼ Weibull(a, b); a > 0 and b > 0
    Density and sample space: f(x) = a b x^{b−1} exp{−a x^b}; x > 0
    Mean and variance: E{X} = Γ(1 + 1/b)/a^{1/b}; var{X} = [Γ(1 + 2/b) − Γ(1 + 1/b)²]/a^{2/b}
    \binom{n}{k_1, \ldots, k_m} = \frac{n!}{\prod_{i=1}^{m} k_i!} \quad \text{where } n = \sum_{i=1}^{m} k_i    (1.15)

    \Gamma(r) = \begin{cases} (r-1)! & \text{for } r = 1, 2, \ldots \\ \int_0^\infty t^{r-1} \exp\{-t\}\,dt & \text{for general } r > 0 \end{cases}    (1.16)
It is worth knowing that

    \Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi} \quad \text{and} \quad \Gamma\!\left(n + \tfrac{1}{2}\right) = \frac{1 \times 3 \times 5 \times \cdots \times (2n-1)}{2^n} \sqrt{\pi}

for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as

    f(x|\gamma) = c_1(x)\, c_2(\gamma) \exp\left\{ \sum_{i=1}^{k} y_i(x)\, \theta_i(\gamma) \right\}    (1.17)
for nonnegative functions c1 and c2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θi(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The yi(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that

    E{y(X)} = κ′(θ)    (1.18)

and

    var{y(X)} = κ″(θ)    (1.19)

where κ(θ) = −log c3(θ), letting c3(θ) denote the reexpression of c2(γ) in terms of the canonical parameters θ = (θ1, ..., θk), and y(X) = (y1(X), ..., yk(X)). These results can be rewritten in terms of the original parameters γ as
    E\left\{ \sum_{i=1}^{k} \frac{d\theta_i(\gamma)}{d\gamma_j}\, y_i(X) \right\} = -\frac{d}{d\gamma_j} \log c_2(\gamma)    (1.20)

and

    \text{var}\left\{ \sum_{i=1}^{k} \frac{d\theta_i(\gamma)}{d\gamma_j}\, y_i(X) \right\} = -\frac{d^2}{d\gamma_j^2} \log c_2(\gamma) - E\left\{ \sum_{i=1}^{k} \frac{d^2\theta_i(\gamma)}{d\gamma_j^2}\, y_i(X) \right\}    (1.21)
Example 1.1 (Poisson)  The Poisson distribution belongs to the exponential family, with c1(x) = 1/x!, c2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ′(θ) = exp{θ} = λ and var{X} = κ″(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
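The derivatives in Example 1.1 can be checked numerically with the finite differences of Section 1.2 (our own illustration): for the Poisson family, both κ′(θ) and κ″(θ) should equal λ at θ = log λ.

```python
import math

lam = 3.7
theta = math.log(lam)
eps = 1e-5

kappa = math.exp  # kappa(theta) = exp(theta) for the Poisson family

# Central difference (1.10) for kappa'(theta) and a second difference for kappa''(theta).
kappa_prime = (kappa(theta + eps) - kappa(theta - eps)) / (2 * eps)
kappa_double = (kappa(theta + eps) - 2 * kappa(theta) + kappa(theta - eps)) / eps ** 2

print(abs(kappa_prime - lam) < 1e-6, abs(kappa_double - lam) < 1e-4)  # True True
```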
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X1, ..., Xp) denote a p-dimensional random variable with continuous density function f. Suppose that

    U = g(X) = (g_1(X), \ldots, g_p(X)) = (U_1, \ldots, U_p)    (1.22)

where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

    f(u) = f(g^{-1}(u))\, |J(u)|    (1.23)

where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g⁻¹ evaluated at u, having (i, j)th element dxi/duj, where these derivatives are assumed to be continuous over the support region of U.
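A one-dimensional illustration of (1.23) (our own example): if X ∼ Exp(1) and U = g(X) = X², then g⁻¹(u) = √u and dx/du = 1/(2√u), so f_U(u) = exp{−√u}/(2√u), with CDF F_U(u) = 1 − exp{−√u}. A Monte Carlo check:

```python
import math
import random

random.seed(2)

def cdf_u(u):
    """CDF implied by the transformed density f_U(u) = exp(-sqrt(u)) / (2*sqrt(u))."""
    return 1.0 - math.exp(-math.sqrt(u))

# Simulate U = X**2 for X ~ Exp(1) and compare the empirical CDF to the formula.
draws = [random.expovariate(1.0) ** 2 for _ in range(200_000)]
u0 = 2.0
empirical = sum(d <= u0 for d in draws) / len(draws)

print(abs(empirical - cdf_u(u0)))  # small Monte Carlo error
```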
1.4 LIKELIHOOD INFERENCE

If X1, ..., Xn are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ1, ..., θp), then the joint likelihood function is

    L(\theta) = \prod_{i=1}^{n} f(x_i|\theta)    (1.24)

When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x1, ..., xn|θ) viewed as a function of θ.
The observed data x1, ..., xn might have been realized under many different values for θ. The parameters for which observing x1, ..., xn would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x1, ..., xn that maximizes L(θ), then θ̂ = ϑ(X1, ..., Xn) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.
It is typically easier to work with the log likelihood function

    l(θ) = log L(θ)    (1.25)

which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x1, ..., xn but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations

    l′(θ) = 0    (1.26)

where

    l'(\theta) = \left( \frac{dl(\theta)}{d\theta_1}, \ldots, \frac{dl(\theta)}{d\theta_p} \right)
is called the score function. The score function satisfies

    E{l′(θ)} = 0    (1.27)

where the expectation is taken with respect to the distribution of X1, ..., Xn. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.
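A minimal sketch of solving the score equation (1.26) numerically (our own toy example, with data invented for illustration): for X_i ∼ Gamma(r = 2, λ) with r known, l′(λ) = 2n/λ − Σxi and l″(λ) = −2n/λ², so Newton's method applies, and the closed form λ̂ = 2/x̄ lets us verify the answer.

```python
data = [0.8, 1.5, 2.2, 0.9, 3.1, 1.7]  # made-up sample
n = len(data)
total = sum(data)

def score(lam):
    """l'(lambda) for Gamma(2, lambda) data."""
    return 2 * n / lam - total

def score_deriv(lam):
    """l''(lambda); negative everywhere, so l is concave in lambda."""
    return -2 * n / lam ** 2

lam = 1.0  # starting value
for _ in range(25):
    lam = lam - score(lam) / score_deriv(lam)  # Newton update

print(lam, 2 * n / total)  # the two values agree
```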
The MLE has a sampling distribution because it depends on the realization of the random variables X1, ..., Xn. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: when the log likelihood is very pointy, the location of the maximum is more precisely known.

To make this precise, let l″(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθi dθj). The Fisher information matrix is defined as

    I(θ) = E{l′(θ) l′(θ)^T} = −E{l″(θ)}    (1.28)

where the expectations are taken with respect to the distribution of X1, ..., Xn. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l″(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.

Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)⁻¹, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is N_p(θ*, I(θ*)⁻¹). Since the true parameter values are unknown, I(θ*)⁻¹ must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)⁻¹. Alternatively, it is also reasonable to use −l″(θ̂)⁻¹. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)⁻¹. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)⁻¹ can be found in [127, 182, 371, 470].
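A standard-error sketch along these lines (our own example, with made-up counts): for X_i ∼ Poisson(λ), l(λ) = Σxi log λ − nλ + constant, the MLE is the sample mean, the observed information is −l″(λ̂) = Σxi/λ̂², and the expected information is I(λ) = n/λ; at the MLE the two coincide.

```python
import math

data = [4, 7, 3, 5, 6, 2, 5, 4]  # made-up Poisson counts
n = len(data)
lam_hat = sum(data) / n                   # MLE of lambda (sample mean)

observed_info = sum(data) / lam_hat ** 2  # -l''(lambda_hat)
expected_info = n / lam_hat               # I(lambda) evaluated at lambda_hat

se = math.sqrt(1.0 / observed_info)       # estimated standard error of lambda_hat
print(se, math.sqrt(lam_hat / n))         # the two estimates agree for the Poisson
```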
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is

    L(\phi|\hat{\mu}(\phi)) = \max_{\mu} L(\mu, \phi)    (1.29)

Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
2 CHAPTER 1 REVIEW
12 TAYLORrsquoS THEOREM AND MATHEMATICALLIMIT THEORY
First we define standard ldquobig ohrdquo and ldquolittle ohrdquo notation for describing the relativeorders of convergence of functions Let the functions f and g be defined on a commonpossibly infinite interval Let z0 be a point in this interval or a boundary point of it(ie minusinfin or infin) We require g(z) = 0 for all z = z0 in a neighborhood of z0 Thenwe say
f (z) = O(g(z)) (11)
if there exists a constant M such that |f (z)| le M|g(z)| as z rarr z0 For example(n + 1)(3n2) = O(nminus1) and it is understood that we are considering n rarr infin Iflimzrarrz0 f (z)g(z) = 0 then we say
f (z) = O(g(z)) (12)
For example f (x0 + h) minus f (x0) = hf prime(x0) + O(h) as h rarr 0 if f is differentiable atx0 The same notation can be used for describing the convergence of a sequence xnas n rarr infin by letting f (n) = xn
Taylorrsquos theorem provides a polynomial approximation to a function f Supposef has finite (n + 1)th derivative on (a b) and continuous nth derivative on [a b] Thenfor any x0 isin [a b] distinct from x the Taylor series expansion of f about x0 is
f (x) =nsum
i=0
1
if (i)(x0)(x minus x0)i + Rn (13)
where f (i)(x0) is the ith derivative of f evaluated at x0 and
Rn = 1
(n + 1)f (n+1)(ξ)(x minus x0)n+1 (14)
for some point ξ in the interval between x and x0 As |x minus x0| rarr 0 note that Rn =O(|x minus x0|n+1)
The multivariate version of Taylorrsquos theorem is analogous Suppose f is areal-valued function of a p-dimensional variable x possessing continuous partialderivatives of all orders up to and including n + 1 with respect to all coordinates inan open convex set containing x and x0 = x Then
f (x) = f (x0) +nsum
i=1
1
iD(i)(f x0 x minus x0) + Rn (15)
where
D(i)(f x y) =psum
j1=1
middot middot middotpsum
ji=1
(di
dtj1 middot middot middot dtji
f (t)
∣∣∣∣t=x
) iprodk=1
yjk
(16)
12 TAYLORrsquoS THEOREM AND MATHEMATICAL LIMIT THEORY 3
and
Rn = 1
(n + 1)D(n+1)(f ξ x minus x0) (17)
for some ξ on the line segment joining x and x0 As |x minus x0| rarr 0 note that Rn =O(|x minus x0|n+1)
The Euler–Maclaurin formula is useful in many asymptotic analyses. If f has 2n continuous derivatives in [0, 1], then

$$\int_0^1 f(x)\,dx = \frac{f(0) + f(1)}{2} - \sum_{i=1}^{n-1} \frac{b_{2i}\left(f^{(2i-1)}(1) - f^{(2i-1)}(0)\right)}{(2i)!} - \frac{b_{2n}\, f^{(2n)}(\xi)}{(2n)!} \tag{1.8}$$

where 0 ≤ ξ ≤ 1, f^{(j)} is the jth derivative of f, and b_j = B_j(0) can be determined using the recursion relation

$$\sum_{j=0}^{m} \binom{m+1}{j} B_j(z) = (m+1)z^m \tag{1.9}$$

initialized with B_0(z) = 1. The proof of this result is based on repeated integrations by parts [376].
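The recursion (1.9) is easy to run mechanically. A small Python sketch (not from the text) evaluates it at z = 0, where the right-hand side vanishes for m ≥ 1, to obtain the numbers b_j = B_j(0):

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(m_max):
    # b_j = B_j(0); at z = 0 the recursion reads
    # sum_{j=0}^{m} C(m+1, j) b_j = 0 for m >= 1, with b_0 = 1,
    # so (m+1) b_m = -sum_{j<m} C(m+1, j) b_j
    b = [Fraction(1)]
    for m in range(1, m_max + 1):
        s = sum(comb(m + 1, j) * b[j] for j in range(m))
        b.append(-s / (m + 1))
    return b

print(bernoulli_numbers(4))  # b_0..b_4 = 1, -1/2, 1/6, 0, -1/30
```

Exact rational arithmetic (`Fraction`) avoids roundoff in the alternating sums.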
Finally, we note that it is sometimes desirable to approximate the derivative of a function numerically, using finite differences. For example, the ith component of the gradient of f at x can be approximated by

$$\frac{df(\mathbf{x})}{dx_i} \approx \frac{f(\mathbf{x} + \epsilon_i \mathbf{e}_i) - f(\mathbf{x} - \epsilon_i \mathbf{e}_i)}{2\epsilon_i} \tag{1.10}$$

where ε_i is a small number and e_i is the unit vector in the ith coordinate direction. Typically, one might start with, say, ε_i = 0.01 or 0.001 and approximate the desired derivative for a sequence of progressively smaller ε_i. The approximation will generally improve until ε_i becomes small enough that the calculation is degraded, and eventually dominated, by computer roundoff error introduced by subtractive cancellation. Introductory discussion of this approach and a more sophisticated Richardson extrapolation strategy for obtaining greater precision are provided in [376]. Finite differences can also be used to approximate the second derivative of f at x via

$$\frac{d^2 f(\mathbf{x})}{dx_i\, dx_j} \approx \frac{1}{4\epsilon_i \epsilon_j} \Big( f(\mathbf{x} + \epsilon_i \mathbf{e}_i + \epsilon_j \mathbf{e}_j) - f(\mathbf{x} + \epsilon_i \mathbf{e}_i - \epsilon_j \mathbf{e}_j) - f(\mathbf{x} - \epsilon_i \mathbf{e}_i + \epsilon_j \mathbf{e}_j) + f(\mathbf{x} - \epsilon_i \mathbf{e}_i - \epsilon_j \mathbf{e}_j) \Big) \tag{1.11}$$

with similar sequential precision improvements.
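The degradation from subtractive cancellation is easy to see in practice. A minimal Python sketch (not from the text; f = sin and x = 1 are arbitrary choices) applies (1.10) with shrinking ε:

```python
import math

def central_diff(f, x, eps):
    # two-sided finite-difference approximation to f'(x)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

exact = math.cos(1.0)  # true derivative of sin at x = 1
for eps in [1e-1, 1e-3, 1e-5, 1e-7, 1e-11]:
    err = abs(central_diff(math.sin, 1.0, eps) - exact)
    print(eps, err)
```

The error first falls (the truncation error is O(ε²)) and then rises again once roundoff from the subtraction dominates, exactly the behavior described above.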
1.3 STATISTICAL NOTATION AND PROBABILITY DISTRIBUTIONS
We use capital letters to denote random variables, such as Y or X, and lowercase letters to represent specific realized values of random variables, such as y or x. The probability density function of X is denoted f; the cumulative distribution function is F. We use the notation X ~ f(x) to mean that X is distributed with density f(x). Frequently, the dependence of f(x) on one or more parameters will also be denoted with a conditioning bar, as in f(x|α, β). Because of the diversity of topics covered in this book, we want to be careful to distinguish when f(x|α) refers to a density function as opposed to the evaluation of that density at a point x. When the meaning is unclear from the context, we will be explicit, for example, by using f(·|α) to denote the function. When it is important to distinguish among several densities, we may adopt subscripts referring to specific random variables, so that the density functions for X and Y are f_X and f_Y, respectively. We use the same notation for distributions of discrete random variables and in the Bayesian context.
The conditional distribution of X given that Y equals y (i.e., X|Y = y) is described by the density denoted f(x|y), or f_{X|Y}(x|y). In this case, we write that X|Y has density f(x|Y). For notational simplicity we allow density functions to be implicitly specified by their arguments, so we may use the same symbol, say f, to refer to many distinct functions, as in the equation f(x, y|μ) = f(x|y, μ) f(y|μ). Finally, f(X) and F(X) are random variables: the evaluations of the density and cumulative distribution functions, respectively, at the random argument X.
The expectation of a random variable is denoted E{X}. Unless specifically mentioned, the distribution with respect to which an expectation is taken is the distribution of X, or should be implicit from the context. To denote the probability of an event A, we use P[A] = E{1_{A}}. The conditional expectation of X|Y = y is E{X|y}. When Y is unknown, E{X|Y} is a random variable that depends on Y. Other attributes of the distribution of X and Y include var{X}, cov{X, Y}, cor{X, Y}, and cv{X} = var{X}^{1/2}/E{X}. These quantities are the variance of X, the covariance and correlation of X and Y, and the coefficient of variation of X, respectively.
A useful result regarding expectations is Jensen's inequality. Let g be a convex function on a possibly infinite open interval I, so

$$g(\lambda x + (1 - \lambda)y) \le \lambda g(x) + (1 - \lambda) g(y) \tag{1.12}$$

for all x, y ∈ I and all 0 < λ < 1. Then Jensen's inequality states that E{g(X)} ≥ g(E{X}) for any random variable X having P[X ∈ I] = 1.
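A Monte Carlo check of Jensen's inequality (a Python sketch, not from the text; the convex choice g(x) = x² and a standard normal X are illustrative assumptions):

```python
import random

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]

def g(x):
    return x * x  # convex on the whole real line

mean_of_g = sum(g(x) for x in xs) / len(xs)  # estimates E{g(X)} = 1 here
g_of_mean = g(sum(xs) / len(xs))             # estimates g(E{X}) = 0 here
print(mean_of_g, g_of_mean)  # the first dominates the second
```

The gap mean_of_g − g_of_mean estimates E{X²} − (E{X})² = var{X}, which is one way to see that Jensen's inequality with g(x) = x² is just the nonnegativity of variance.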
Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous distributions used throughout this book. We refer to the following well-known combinatorial constants:

$$n! = n(n-1)(n-2)\cdots(3)(2)(1), \quad \text{with } 0! = 1 \tag{1.13}$$

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!} \tag{1.14}$$
TABLE 1.1 Notation and description for common probability distributions of discrete random variables.

- Bernoulli: X ~ Bernoulli(p), 0 ≤ p ≤ 1. Density f(x) = p^x (1 − p)^{1−x}, x = 0 or 1. E{X} = p, var{X} = p(1 − p).
- Binomial: X ~ Bin(n, p), 0 ≤ p ≤ 1 and n ∈ {1, 2, ...}. Density f(x) = C(n, x) p^x (1 − p)^{n−x}, x = 0, 1, ..., n. E{X} = np, var{X} = np(1 − p).
- Multinomial: X ~ Multinomial(n, p), p = (p_1, ..., p_k), 0 ≤ p_i ≤ 1 with Σ_{i=1}^k p_i = 1, and n ∈ {1, 2, ...}. Density f(x) = [n!/(x_1! ··· x_k!)] ∏_{i=1}^k p_i^{x_i}, x = (x_1, ..., x_k) with x_i ∈ {0, 1, ..., n} and Σ_{i=1}^k x_i = n. E{X} = np, var{X_i} = np_i(1 − p_i), cov{X_i, X_j} = −np_i p_j.
- Negative Binomial: X ~ NegBin(r, p), 0 ≤ p ≤ 1 and r ∈ {1, 2, ...}. Density f(x) = C(x + r − 1, r − 1) p^r (1 − p)^x, x ∈ {0, 1, ...}. E{X} = r(1 − p)/p, var{X} = r(1 − p)/p².
- Poisson: X ~ Poisson(λ), λ > 0. Density f(x) = λ^x exp{−λ}/x!, x ∈ {0, 1, ...}. E{X} = λ, var{X} = λ.
TABLE 1.2 Notation and description for some common probability distributions of continuous random variables.

- Beta: X ~ Beta(α, β), α > 0 and β > 0. Density f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1}, 0 ≤ x ≤ 1. E{X} = α/(α + β), var{X} = αβ/[(α + β)²(α + β + 1)].
- Cauchy: X ~ Cauchy(α, β), α ∈ ℝ and β > 0. Density f(x) = 1/(πβ[1 + ((x − α)/β)²]), x ∈ ℝ. E{X} and var{X} are nonexistent.
- Chi-square: X ~ χ²_ν, ν > 0. Density equal to that of Gamma(ν/2, 1/2), x > 0. E{X} = ν, var{X} = 2ν.
- Dirichlet: X ~ Dirichlet(α), α = (α_1, ..., α_k) with α_i > 0, and α_0 = Σ_{i=1}^k α_i. Density f(x) = Γ(α_0) ∏_{i=1}^k x_i^{α_i−1} / ∏_{i=1}^k Γ(α_i), x = (x_1, ..., x_k) with 0 ≤ x_i ≤ 1 and Σ_{i=1}^k x_i = 1. E{X} = α/α_0, var{X_i} = α_i(α_0 − α_i)/[α_0²(α_0 + 1)], cov{X_i, X_j} = −α_i α_j/[α_0²(α_0 + 1)].
- Exponential: X ~ Exp(λ), λ > 0. Density f(x) = λ exp{−λx}, x > 0. E{X} = 1/λ, var{X} = 1/λ².
- Gamma: X ~ Gamma(r, λ), λ > 0 and r > 0. Density f(x) = λ^r x^{r−1} exp{−λx}/Γ(r), x > 0. E{X} = r/λ, var{X} = r/λ².
TABLE 1.3 Notation and description for more common probability distributions of continuous random variables.

- Lognormal: X ~ Lognormal(μ, σ²), μ ∈ ℝ and σ > 0. Density f(x) = [1/(x√(2πσ²))] exp{−½((log x − μ)/σ)²}, x > 0. E{X} = exp{μ + σ²/2}, var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}.
- Multivariate Normal: X ~ N_k(μ, Σ), μ = (μ_1, ..., μ_k) ∈ ℝ^k and Σ positive definite. Density f(x) = exp{−(x − μ)ᵀ Σ⁻¹ (x − μ)/2} / [(2π)^{k/2} |Σ|^{1/2}], x = (x_1, ..., x_k) ∈ ℝ^k. E{X} = μ, var{X} = Σ.
- Normal: X ~ N(μ, σ²), μ ∈ ℝ and σ > 0. Density f(x) = [1/√(2πσ²)] exp{−½((x − μ)/σ)²}, x ∈ ℝ. E{X} = μ, var{X} = σ².
- Student's t: X ~ t_ν, ν > 0. Density f(x) = [Γ((ν + 1)/2)/(Γ(ν/2)√(πν))] (1 + x²/ν)^{−(ν+1)/2}, x ∈ ℝ. E{X} = 0 if ν > 1, var{X} = ν/(ν − 2) if ν > 2.
- Uniform: X ~ Unif(a, b), a, b ∈ ℝ and a < b. Density f(x) = 1/(b − a), x ∈ [a, b]. E{X} = (a + b)/2, var{X} = (b − a)²/12.
- Weibull: X ~ Weibull(a, b), a > 0 and b > 0. Density f(x) = a b x^{b−1} exp{−a x^b}, x > 0. E{X} = Γ(1 + 1/b)/a^{1/b}, var{X} = [Γ(1 + 2/b) − Γ(1 + 1/b)²]/a^{2/b}.
$$\binom{n}{k_1, \ldots, k_m} = \frac{n!}{\prod_{i=1}^{m} k_i!}, \quad \text{where } n = \sum_{i=1}^{m} k_i \tag{1.15}$$

$$\Gamma(r) = \begin{cases} (r-1)! & \text{for } r = 1, 2, \ldots \\ \int_0^\infty t^{r-1} \exp\{-t\}\, dt & \text{for general } r > 0 \end{cases} \tag{1.16}$$
It is worth knowing that

$$\Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi} \quad \text{and} \quad \Gamma\!\left(n + \tfrac{1}{2}\right) = \frac{1 \times 3 \times 5 \times \cdots \times (2n-1)}{2^n}\, \sqrt{\pi}$$

for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as

$$f(x|\boldsymbol{\gamma}) = c_1(x)\, c_2(\boldsymbol{\gamma}) \exp\left\{ \sum_{i=1}^{k} y_i(x)\, \theta_i(\boldsymbol{\gamma}) \right\} \tag{1.17}$$
for nonnegative functions c_1 and c_2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θ_i(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The y_i(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that
$$\mathrm{E}\{\mathbf{y}(X)\} = \kappa'(\boldsymbol{\theta}) \tag{1.18}$$

and

$$\mathrm{var}\{\mathbf{y}(X)\} = \kappa''(\boldsymbol{\theta}) \tag{1.19}$$

where κ(θ) = −log c_3(θ), letting c_3(θ) denote the reexpression of c_2(γ) in terms of the canonical parameters θ = (θ_1, ..., θ_k), and y(X) = (y_1(X), ..., y_k(X)). These results can be rewritten in terms of the original parameters γ as
$$\mathrm{E}\left\{ \sum_{i=1}^{k} \frac{d\theta_i(\boldsymbol{\gamma})}{d\gamma_j}\, y_i(X) \right\} = -\frac{d}{d\gamma_j} \log c_2(\boldsymbol{\gamma}) \tag{1.20}$$

and

$$\mathrm{var}\left\{ \sum_{i=1}^{k} \frac{d\theta_i(\boldsymbol{\gamma})}{d\gamma_j}\, y_i(X) \right\} = -\frac{d^2}{d\gamma_j^2} \log c_2(\boldsymbol{\gamma}) - \mathrm{E}\left\{ \sum_{i=1}^{k} \frac{d^2\theta_i(\boldsymbol{\gamma})}{d\gamma_j^2}\, y_i(X) \right\} \tag{1.21}$$
Example 1.1 (Poisson). The Poisson distribution belongs to the exponential family, with c_1(x) = 1/x!, c_2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ′(θ) = exp{θ} = λ and var{X} = κ″(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
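The identities E{X} = κ′(θ) and var{X} = κ″(θ) can be checked numerically for the Poisson case by differentiating κ(θ) = exp{θ} with finite differences (a Python sketch, not from the text; λ = 3 is an arbitrary choice):

```python
import math

lam = 3.0
theta = math.log(lam)          # canonical parameter: theta = log(lambda)

def kappa(t):
    return math.exp(t)         # kappa(theta) = exp{theta} for the Poisson

h = 1e-5
mean = (kappa(theta + h) - kappa(theta - h)) / (2 * h)                  # ~ kappa'(theta)
var = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h**2   # ~ kappa''(theta)
print(mean, var)  # both approximately lambda = 3
```

Both finite-difference derivatives return λ, matching E{X} = var{X} = λ for the Poisson.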
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X_1, ..., X_p) denote a p-dimensional random variable with continuous density function f. Suppose that

$$\mathbf{U} = g(\mathbf{X}) = (g_1(\mathbf{X}), \ldots, g_p(\mathbf{X})) = (U_1, \ldots, U_p) \tag{1.22}$$

where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

$$f(\mathbf{u}) = f(g^{-1}(\mathbf{u}))\, |J(\mathbf{u})| \tag{1.23}$$

where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g^{−1} evaluated at u, having (i, j)th element dx_i/du_j, where these derivatives are assumed to be continuous over the support region of U.
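For instance, if X ~ N(μ, σ²) and U = exp{X}, then g⁻¹(u) = log u and |J(u)| = 1/u, and the Jacobian formula reproduces the lognormal density of Table 1.3. A Python sketch (not from the text; μ = 0.5, σ = 1.2, and the evaluation point are arbitrary):

```python
import math

mu, sigma = 0.5, 1.2

def f_X(x):
    # N(mu, sigma^2) density of X
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / math.sqrt(2 * math.pi * sigma**2)

def f_U(u):
    # density of U = exp(X): f(g^{-1}(u)) |J(u)| with g^{-1}(u) = log u, |J(u)| = 1/u
    return f_X(math.log(u)) / u

def lognormal_pdf(u):
    # Lognormal(mu, sigma^2) density written out directly
    return (math.exp(-0.5 * ((math.log(u) - mu) / sigma) ** 2)
            / (u * math.sqrt(2 * math.pi * sigma**2)))

print(f_U(2.0), lognormal_pdf(2.0))  # the two agree
```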
1.4 LIKELIHOOD INFERENCE
If X_1, ..., X_n are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ_1, ..., θ_p), then the joint likelihood function is

$$L(\boldsymbol{\theta}) = \prod_{i=1}^{n} f(x_i|\boldsymbol{\theta}) \tag{1.24}$$

When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x_1, ..., x_n|θ) viewed as a function of θ.
The observed data, x_1, ..., x_n, might have been realized under many different values for θ. The parameters for which observing x_1, ..., x_n would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x_1, ..., x_n that maximizes L(θ), then θ̂ = ϑ(X_1, ..., X_n) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.
It is typically easier to work with the log likelihood function,

$$l(\boldsymbol{\theta}) = \log L(\boldsymbol{\theta}) \tag{1.25}$$

which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x_1, ..., x_n but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations

$$l'(\boldsymbol{\theta}) = \mathbf{0} \tag{1.26}$$

where

$$l'(\boldsymbol{\theta}) = \left( \frac{dl(\boldsymbol{\theta})}{d\theta_1}, \ldots, \frac{dl(\boldsymbol{\theta})}{d\theta_p} \right)$$
is called the score function. The score function satisfies

$$\mathrm{E}\{l'(\boldsymbol{\theta})\} = \mathbf{0} \tag{1.27}$$

where the expectation is taken with respect to the distribution of X_1, ..., X_n. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.
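As a toy illustration of solving the score equation numerically (a Python sketch, not from the text; the data are made up), Newton's method applied to the Poisson score l′(λ) = Σx_i/λ − n recovers the closed-form MLE λ̂ = x̄:

```python
data = [2, 4, 3, 5, 1, 4, 3]       # hypothetical i.i.d. Poisson counts
n, s = len(data), sum(data)

def score(lam):
    return s / lam - n             # l'(lambda) for the Poisson log likelihood

def score_deriv(lam):
    return -s / lam**2             # l''(lambda)

lam = 1.0                          # starting value
for _ in range(25):
    lam -= score(lam) / score_deriv(lam)   # Newton update

print(lam, s / n)  # both equal the sample mean
```

Here the root is available in closed form, which makes the example a useful correctness check for the iterative machinery developed in later chapters.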
The MLE has a sampling distribution because it depends on the realization of the random variables X_1, ..., X_n. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: When the log likelihood is very pointy, the location of the maximum is more precisely known.
To make this precise, let l″(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθ_i dθ_j). The Fisher information matrix is defined as

$$I(\boldsymbol{\theta}) = \mathrm{E}\{l'(\boldsymbol{\theta})\, l'(\boldsymbol{\theta})^{\mathsf{T}}\} = -\mathrm{E}\{l''(\boldsymbol{\theta})\} \tag{1.28}$$

where the expectations are taken with respect to the distribution of X_1, ..., X_n. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l″(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.
Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)⁻¹, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is N_p(θ*, I(θ*)⁻¹). Since the true parameter values are unknown, I(θ*)⁻¹ must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)⁻¹. Alternatively, it is also reasonable to use −l″(θ̂)⁻¹. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)⁻¹. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)⁻¹ can be found in [127, 182, 371, 470].
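Continuing the Poisson toy calculation (a sketch, not from the text; the data are made up), the observed Fisher information −l″(λ̂) = Σx_i/λ̂² yields a standard error estimate for λ̂:

```python
import math

data = [2, 4, 3, 5, 1, 4, 3]   # hypothetical i.i.d. Poisson counts
n, s = len(data), sum(data)
lam_hat = s / n                # Poisson MLE (sample mean)

obs_info = s / lam_hat**2      # observed Fisher information, -l''(lam_hat)
se = math.sqrt(1 / obs_info)   # estimated standard error of lam_hat
print(lam_hat, se)
```

For the Poisson this reduces to se = √(λ̂/n), the same answer the expected information I(λ̂) = n/λ̂ would give, since here −l″(λ̂) = n/λ̂ exactly.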
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is

$$L(\phi\,|\,\hat{\mu}(\phi)) = \max_{\mu} L(\mu, \phi) \tag{1.29}$$

Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
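A profile likelihood is straightforward to compute by brute force. A Python sketch (not from the text; the data, the N(μ, φ) model, and the grid are illustrative assumptions) profiles out μ for each fixed variance φ:

```python
import math

data = [1.2, 0.7, 2.1, 1.6, 0.9]   # hypothetical sample
n = len(data)

def loglik(mu, phi):
    # log likelihood for i.i.d. N(mu, phi) observations
    return sum(-0.5 * math.log(2 * math.pi * phi) - (x - mu) ** 2 / (2 * phi)
               for x in data)

mu_grid = [i / 100 for i in range(0, 301)]   # crude grid for the nuisance mu

def profile_loglik(phi):
    # maximize over mu at this fixed phi
    return max(loglik(mu, phi) for mu in mu_grid)

# For the normal model the maximizing mu is the sample mean for every phi.
best_mu = max(mu_grid, key=lambda m: loglik(m, 0.5))
print(best_mu, sum(data) / n)
```

The grid search confirms that the constrained maximizer μ̂(φ) here does not depend on φ at all, which is a special feature of the normal model rather than a general property.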
12 TAYLORrsquoS THEOREM AND MATHEMATICAL LIMIT THEORY 3
and
Rn = 1
(n + 1)D(n+1)(f ξ x minus x0) (17)
for some ξ on the line segment joining x and x0 As |x minus x0| rarr 0 note that Rn =O(|x minus x0|n+1)
The EulerndashMaclaurin formula is useful in many asymptotic analyses If f has2n continuous derivatives in [0 1] thenint 1
0f (x) dx =f (0) + f (1)
2
minusnminus1sumi=0
b2i(f (2iminus1)(1) minus f (2iminus1)(0))
(2i)minus b2nf
(2n)(ξ)
(2n) (18)
where 0 le ξ le 1 f (j) is the jth derivative of f and bj = Bj(0) can be determinedusing the recursion relation
msumj=0
(m + 1
j
)Bj(z) = (m + 1)zm (19)
initialized with B0(z) = 1 The proof of this result is based on repeated integrationsby parts [376]
Finally we note that it is sometimes desirable to approximate the derivative ofa function numerically using finite differences For example the ith component ofthe gradient of f at x can be approximated by
df (x)
dxi
asymp f (x + εiei) minus f (x minus εiei)
2εi
(110)
where εi is a small number and ei is the unit vector in the ith coordinate directionTypically one might start with say εi = 001 or 0001 and approximate the desiredderivative for a sequence of progressively smaller εi The approximation willgenerally improve until εi becomes small enough that the calculation is degraded andeventually dominated by computer roundoff error introduced by subtractive cancel-lation Introductory discussion of this approach and a more sophisticated Richardsonextrapolation strategy for obtaining greater precision are provided in [376] Finitedifferences can also be used to approximate the second derivative of f at x via
df (x)
dxi dxj
asymp 1
4εiεj
(f (x + εiei + εjej) minus f (x + εiei minus εjej)
minus f (x minus εiei + εjej) + f (x minus εiei minus εjej)
)(111)
with similar sequential precision improvements
4 CHAPTER 1 REVIEW
13 STATISTICAL NOTATION AND PROBABILITYDISTRIBUTIONS
We use capital letters to denote random variables such as Y or X and lowercaseletters to represent specific realized values of random variables such as y or x Theprobability density function of X is denoted f the cumulative distribution functionis F We use the notation X sim f (x) to mean that X is distributed with density f (x)Frequently the dependence of f (x) on one or more parameters also will be denotedwith a conditioning bar as in f (x|α β) Because of the diversity of topics covered inthis book we want to be careful to distinguish when f (x|α) refers to a density functionas opposed to the evaluation of that density at a point x When the meaning is unclearfrom the context we will be explicit for example by using f (middot|α) to denote thefunction When it is important to distinguish among several densities we may adoptsubscripts referring to specific random variables so that the density functions forX and Y are fX and fY respectively We use the same notation for distributions ofdiscrete random variables and in the Bayesian context
The conditional distribution of X given that Y equals y (ie X|Y = y) is de-scribed by the density denoted f (x|y) or fX|Y (x|y) In this case we write that X|Y hasdensity f (x|Y ) For notational simplicity we allow density functions to be implicitlyspecified by their arguments so we may use the same symbol say f to refer to manydistinct functions as in the equation f (x y|μ) = f (x|y μ)f (y|μ) Finally f (X) andF (X) are random variables the evaluations of the density and cumulative distributionfunctions respectively at the random argument X
The expectation of a random variable is denoted EX Unless specificallymentioned the distribution with respect to which an expectation is taken is the dis-tribution of X or should be implicit from the context To denote the probability ofan event A we use P[A] = E1A The conditional expectation of X|Y = y isEX|y When Y is unknown EX|Y is a random variable that depends on Y Otherattributes of the distribution of X and Y include varX covX Y corX Y andcvX = varX12EX These quantities are the variance of X the covariance andcorrelation of X and Y and the coefficient of variation of X respectively
A useful result regarding expectations is Jensenrsquos inequality Let g be a convexfunction on a possibly infinite open interval I so
g(λx + (1 minus λ)y) le λg(x) + (1 minus λ)g(y) (112)
for all x y isin I and all 0 lt λ lt 1 Then Jensenrsquos inequality states that Eg(X) geg(EX) for any random variable X having P[X isin I] = 1
Tables 11 12 and 13 provide information about many discrete and contin-uous distributions used throughout this book We refer to the following well-knowncombinatorial constants
n = n(n minus 1)(n minus 2) middot middot middot (3)(2)(1) with 0 = 1 (113)(n
k
)= n
k(n minus k) (114)
TABLE 11 Notation and description for common probability distributions of discrete random variables
Notation and Density and Mean and
Name Parameter Space Sample Space Variance
Bernoulli X simBernoulli(p)0 le p le 1
f (x) = px(1 minus p)1minusx
x = 0 or 1EX = p
varX = p(1 minus p)
Binomial X simBin(n p)0 le p le 1n isin 1 2
f (x) =(
n
x
)px(1 minus p)nminusx
x = 0 1 n
EX = np
varX = np(1 minus p)
Multinomial X simMultinomial(n p)p = (p1 pk)0 le pi le 1 and n isin 1 2 sumk
i=1 pi = 1
f (x) =(
n
x1 xk
)prodk
i=1 pxi
i
x = (x1 xk) and xi isin 0 1 nsumk
i=1 xi = n
EX = npvarXi = npi(1 minus pi)covXi Xj = minusnpipj
NegativeBinomial
X simNegBin(r p)0 le p le 1r isin 1 2
f (x) =(
x + r minus 1
r minus 1
)pr(1 minus p)x
x isin 0 1 EX = r(1 minus p)pvarX = r(1 minus p)p2
Poisson X simPoisson(λ) f (x) = λx
xexpminusλ EX = λ
λ gt 0 x isin 0 1 varX = λ
5
TABLE 12 Notation and description for some common probability distributions of continuous random variables
Notation and Density and Mean and
Name Parameter Space Sample Space Variance
Beta X simBeta(α β)α gt 0 and β gt 0
f (x) = (α + β)
(α)(β)xαminus1(1 minus x)βminus1
0 le x le 1
EX = α
α + β
varX = αβ
(α + β)2(α + β + 1)
Cauchy X simCauchy(α β)α isin and β gt 0
f (x) = 1
πβ
[1 +
(x minus α
β
)2]
x isin
EX is nonexistentvarX is nonexistent
Chi-square X sim χ2ν
ν gt 0f (x) = Gamma(ν2 12)x gt 0
EX = ν
varX = 2ν
Dirichlet X simDirichlet(α)α = (α1 αk)αi gt 0α0 = sumk
i=1 αi
f (x) = (α0)prodk
i=1 xαiminus1iprodk
i=1 (αi)x = (x1 xk) and 0 le xi le 1sumk
i=1 xi = 1
EX = αα0
varXi = αi(α0 minus αi)
α20(α0 + 1)
covXi Xj = minusαiαj
α20(α0 + 1)
Exponential X simExp(λ)λ gt 0
f (x) = λ expminusλxx gt 0
EX = 1λ
varX = 1λ2
Gamma X simGamma(r λ)λ gt 0 and r gt 0
f (x) = λrxrminus1
(r)expminusλx
x gt 0
EX = rλ
varX = rλ2
6
TABLE 13 Notation and description for more common probability distributions of continuous random variables
Notation and Density and Mean and
Name Parameter Space Sample Space Variance
Lognormal X simLognormal(μ σ2)μ isin and σ gt 0
f (x) = 1
xradic
2πσ2exp
minus1
2
(logx minus μ
σ
)2
x isin EX = expμ + σ22varX = exp2μ + 2σ2 minus exp2μ + σ2
MultivariateNormal
X sim Nk(μ )μ = (μ1 μk) isin k
positive definite
f (x) = expminus(x minus μ)T minus1(x minus μ)2(2π)k2||12
x = (x1 xk) isin k
EX = μ
varX =
Normal X simN(μ σ2)μ isin and σ gt 0
f (x) = 1radic2πσ2
exp
minus1
2
(x minus μ
σ
)2
x isin EX = μ
varX = σ2
Studentrsquos t X sim tν
ν gt 0f (x) = ((ν + 1)2)
(ν2)radic
πν
(1 + x2
ν
)minus(ν+1)2
x isin EX = 0 if ν gt 1
varX = ν
ν + 2if ν gt 2
Uniform X simUnif(a b)a b isin and a lt b
f (x) = 1
b minus ax isin [a b]
EX = (a + b)2varX = (b minus a)212
Weibull X simWeibull(a b)a gt 0 and b gt 0
f (x) = abxbminus1 expminusaxbx gt 0
EX = (1 + 1b)
a1b
varX = (1 + 2b) minus (1 + 1b)2
a2b
7
8 CHAPTER 1 REVIEW(n
k1 km
)= nprodm
i=1 kiwhere n =
msumi=1
ki (115)
(r) =
(r minus 1) for r = 1 2 int infin0 trminus1 expminust dt for general r gt 0
(116)
It is worth knowing that
(
12
)= radic
π and (n + 1
2
)= 1 times 3 times 5 times middot middot middot times (2n minus 1)
radicπ2n
for positive integer nMany of the distributions commonly used in statistics are members of an
exponential family A k-parameter exponential family density can be expressed as
f (x|γ) = c1(x)c2(γ) exp
ksum
i=1
yi(x)θi(γ)
(117)
for nonnegative functions c1 and c2 The vectorγ denotes the familiar parameters suchas λ for the Poisson density and p for the binomial density The real-valued θi(γ) arethe natural or canonical parameters which are usually transformations of γ Theyi(x) are the sufficient statistics for the canonical parameters It is straightforwardto show
Ey(X) =κprime(θ) (118)
and
vary(X) =κprimeprime(θ) (119)
where κ(θ) = minus log c3(θ) letting c3(θ) denote the reexpression of c2(γ) in terms ofthe canonical parameters θ = (θ1 θk) and y(X) = (y1(X) yk(X)) Theseresults can be rewritten in terms of the original parameters γ as
E
ksum
i=1
dθi(γ)
dγj
yi(X)
= minus d
dγj
log c2(γ) (120)
and
var
ksum
i=1
dθi(γ)
dγj
yi(X)
= minus d2
dγ2j
log c2(γ) minus E
ksum
i=1
d2θi(γ)
dγ2j
yi(X)
(121)
Example 11 (Poisson) The Poisson distribution belongs to the exponential familywith c1(x) = 1x c2(λ) = expminusλ y(x) = x and θ(λ) = log λ Deriving momentsin terms of θ we have κ(θ) = expθ so EX = κprime(θ) = expθ = λ and varX =κprimeprime(θ) = expθ = λ The same results may be obtained with (120) and (121) notingthat dθdλ = 1λ For example (120) gives E Xλ = 1
It is also important to know how the distribution of a random variable changeswhen it is transformed Let X = (X1 Xp) denote a p-dimensional random
14 LIKELIHOOD INFERENCE 9
variable with continuous density function f Suppose that
U = g(X) = (g1(X) gp(X)) = (U1 Up) (122)
where g is a one-to-one function mapping the support region of f onto the space ofall u = g(x) for which x satisfies f (x) gt 0 To derive the probability distribution ofU from that of X we need to use the Jacobian matrix The density of the transformedvariables is
f (u) = f (gminus1(u)) |J(u)| (123)
where |J(u)| is the absolute value of the determinant of the Jacobian matrix of gminus1
evaluated at u having (i j)th element dxiduj where these derivatives are assumedto be continuous over the support region of U
14 LIKELIHOOD INFERENCE
If X1 Xn are independent and identically distributed (iid) each having densityf (x | θ) that depends on a vector of p unknown parameters θ = (θ1 θp) then thejoint likelihood function is
L(θ) =nprod
i=1
f (xi|θ) (124)
When the data are not iid the joint likelihood is still expressed as the joint densityf (x1 xn|θ) viewed as a function of θ
The observed data x1 xn might have been realized under many differentvalues for θ The parameters for which observing x1 xn would be most likelyconstitute the maximum likelihood estimate of θ In other words if ϑ is the functionof x1 xn that maximizes L(θ) then θ = ϑ(X1 Xn) is the maximum likeli-hood estimator (MLE) for θ MLEs are invariant to transformation so the MLE of atransformation of θ equals the transformation of θ
It is typically easier to work with the log likelihood function
l (θ) = log L(θ) (125)
which has the same maximum as the original likelihood since log is a strictly mono-tonic function Furthermore any additive constants (involving possibly x1 xn
but not θ) may be omitted from the log likelihood without changing the location of itsmaximum or differences between log likelihoods at different θ Note that maximizingL(θ) with respect to θ is equivalent to solving the system of equations
lprime(θ) = 0 (126)
where
lprime(θ) =(
dl(θ)
dθ1
dl(θ)
dθn
)
10 CHAPTER 1 REVIEW
is called the score function The score function satisfies
Elprime(θ) = 0 (127)
where the expectation is taken with respect to the distribution of X1 Xn Some-times an analytical solution to (126) provides the MLE this book describes a varietyof methods that can be used when the MLE cannot be solved for in closed form Itis worth noting that there are pathological circumstances where the MLE is not asolution of the score equation or the MLE is not unique see [127] for examples
The MLE has a sampling distribution because it depends on the realizationof the random variables X1 Xn The MLE may be biased or unbiased for θ yetunder quite general conditions it is asymptotically unbiased as n rarr infin The samplingvariance of the MLE depends on the average curvature of the log likelihood When thelog likelihood is very pointy the location of the maximum is more precisely known
To make this precise let lprimeprime(θ) denote the p times p matrix having (i j)th elementgiven by d2l(θ)(dθidθj) The Fisher information matrix is defined as
I(θ) = Elprime(θ)lprime(θ)T = minusElprimeprime(θ) (128)
where the expectations are taken with respect to the distribution of X1 Xn Thefinal equality in (128) requires mild assumptions which are satisfied for example inexponential families I(θ) may sometimes be called the expected Fisher informationto distinguish it from minuslprimeprime(θ) which is the observed Fisher information There aretwo reasons why the observed Fisher information is quite useful First it can becalculated even if the expectations in (128) cannot easily be computed Second it isa good approximation to I(θ) that improves as n increases
Under regularity conditions the asymptotic variancendashcovariance matrix ofthe MLE θ is I(θlowast)minus1 where θlowast denotes the true value of θ Indeed as n rarr infin thelimiting distribution of θ is Np(θlowast I(θlowast)minus1) Since the true parameter values areunknown I(θlowast)minus1 must be estimated in order to estimate the variancendashcovariancematrix of the MLE An obvious approach is to use I(θ)minus1 Alternatively it is alsoreasonable to use minuslprimeprime(θ)minus1 Standard errors for individual parameter MLEs canbe estimated by taking the square root of the diagonal elements of the chosenestimate of I(θlowast)minus1 A thorough introduction to maximum likelihood theory andthe relative merits of these estimates of I(θlowast)minus1 can be found in [127 182 371 470]
Profile likelihoods provide an informative way to graph a higher-dimensionallikelihood surface to make inference about some parameters while treating othersas nuisance parameters and to facilitate various optimization strategies The profilelikelihood is obtained by constrained maximization of the full likelihood with respectto parameters to be ignored If θ = (μ φ) then the profile likelihood for φ is
L(φ|μ(φ)) = maxμ
L(μ φ) (129)
Thus for each possible φ a value of μ is chosen to maximize L(μ φ) This optimalμ is a function of φ The profile likelihood is the function that maps φ to the valueof the full likelihood evaluated at φ and its corresponding optimal μ Note that the φ
4 CHAPTER 1 REVIEW
13 STATISTICAL NOTATION AND PROBABILITYDISTRIBUTIONS
We use capital letters to denote random variables such as Y or X and lowercaseletters to represent specific realized values of random variables such as y or x Theprobability density function of X is denoted f the cumulative distribution functionis F We use the notation X sim f (x) to mean that X is distributed with density f (x)Frequently the dependence of f (x) on one or more parameters also will be denotedwith a conditioning bar as in f (x|α β) Because of the diversity of topics covered inthis book we want to be careful to distinguish when f (x|α) refers to a density functionas opposed to the evaluation of that density at a point x When the meaning is unclearfrom the context we will be explicit for example by using f (middot|α) to denote thefunction When it is important to distinguish among several densities we may adoptsubscripts referring to specific random variables so that the density functions forX and Y are fX and fY respectively We use the same notation for distributions ofdiscrete random variables and in the Bayesian context
The conditional distribution of X given that Y equals y (ie X|Y = y) is de-scribed by the density denoted f (x|y) or fX|Y (x|y) In this case we write that X|Y hasdensity f (x|Y ) For notational simplicity we allow density functions to be implicitlyspecified by their arguments so we may use the same symbol say f to refer to manydistinct functions as in the equation f (x y|μ) = f (x|y μ)f (y|μ) Finally f (X) andF (X) are random variables the evaluations of the density and cumulative distributionfunctions respectively at the random argument X
The expectation of a random variable is denoted EX Unless specificallymentioned the distribution with respect to which an expectation is taken is the dis-tribution of X or should be implicit from the context To denote the probability ofan event A we use P[A] = E1A The conditional expectation of X|Y = y isEX|y When Y is unknown EX|Y is a random variable that depends on Y Otherattributes of the distribution of X and Y include varX covX Y corX Y andcvX = varX12EX These quantities are the variance of X the covariance andcorrelation of X and Y and the coefficient of variation of X respectively
A useful result regarding expectations is Jensenrsquos inequality Let g be a convexfunction on a possibly infinite open interval I so
g(λx + (1 minus λ)y) le λg(x) + (1 minus λ)g(y) (112)
for all x y isin I and all 0 lt λ lt 1 Then Jensenrsquos inequality states that Eg(X) geg(EX) for any random variable X having P[X isin I] = 1
Tables 11 12 and 13 provide information about many discrete and contin-uous distributions used throughout this book We refer to the following well-knowncombinatorial constants
n = n(n minus 1)(n minus 2) middot middot middot (3)(2)(1) with 0 = 1 (113)(n
k
)= n
k(n minus k) (114)
TABLE 11 Notation and description for common probability distributions of discrete random variables
Notation and Density and Mean and
Name Parameter Space Sample Space Variance
Bernoulli X simBernoulli(p)0 le p le 1
f (x) = px(1 minus p)1minusx
x = 0 or 1EX = p
varX = p(1 minus p)
Binomial X simBin(n p)0 le p le 1n isin 1 2
f (x) =(
n
x
)px(1 minus p)nminusx
x = 0 1 n
EX = np
varX = np(1 minus p)
Multinomial X simMultinomial(n p)p = (p1 pk)0 le pi le 1 and n isin 1 2 sumk
i=1 pi = 1
f (x) =(
n
x1 xk
)prodk
i=1 pxi
i
x = (x1 xk) and xi isin 0 1 nsumk
i=1 xi = n
EX = npvarXi = npi(1 minus pi)covXi Xj = minusnpipj
NegativeBinomial
X simNegBin(r p)0 le p le 1r isin 1 2
f (x) =(
x + r minus 1
r minus 1
)pr(1 minus p)x
x isin 0 1 EX = r(1 minus p)pvarX = r(1 minus p)p2
Poisson X simPoisson(λ) f (x) = λx
xexpminusλ EX = λ
λ gt 0 x isin 0 1 varX = λ
5
TABLE 1.2 Notation and description for some common probability distributions of continuous random variables

Beta — X ~ Beta(α, β), α > 0 and β > 0.
  Density: f(x) = [Γ(α + β) / (Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1}, 0 ≤ x ≤ 1.
  Mean and variance: E{X} = α/(α + β), var{X} = αβ / [(α + β)²(α + β + 1)].

Cauchy — X ~ Cauchy(α, β), α ∈ ℝ and β > 0.
  Density: f(x) = 1 / (πβ[1 + ((x − α)/β)²]), x ∈ ℝ.
  Mean and variance: E{X} and var{X} are nonexistent.

Chi-square — X ~ χ²_ν, ν > 0.
  Density: same as Gamma(ν/2, 1/2), x > 0.
  Mean and variance: E{X} = ν, var{X} = 2ν.

Dirichlet — X ~ Dirichlet(α), α = (α1, . . . , αk) with αi > 0; let α0 = Σ_{i=1}^k αi.
  Density: f(x) = Γ(α0) Π_{i=1}^k xi^{αi−1} / Π_{i=1}^k Γ(αi), x = (x1, . . . , xk) with 0 ≤ xi ≤ 1 and Σ_{i=1}^k xi = 1.
  Mean and variance: E{X} = α/α0, var{Xi} = αi(α0 − αi) / [α0²(α0 + 1)], cov{Xi, Xj} = −αiαj / [α0²(α0 + 1)].

Exponential — X ~ Exp(λ), λ > 0.
  Density: f(x) = λ exp{−λx}, x > 0.
  Mean and variance: E{X} = 1/λ, var{X} = 1/λ².

Gamma — X ~ Gamma(r, λ), λ > 0 and r > 0.
  Density: f(x) = λ^r x^{r−1} exp{−λx} / Γ(r), x > 0.
  Mean and variance: E{X} = r/λ, var{X} = r/λ².
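The tabulated moment formulas are easy to verify numerically. For instance, for Beta(2, 3) the table gives E{X} = 2/5 = 0.4 and var{X} = 6/(25 · 6) = 0.04. The following sketch (an added illustration, not part of the text) integrates the Beta density with a simple midpoint rule.

```python
import math

def beta_pdf(x, a, b):
    # Table 1.2 Beta density: Gamma(a+b)/(Gamma(a)Gamma(b)) x^(a-1) (1-x)^(b-1) on [0, 1]
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * x ** (a - 1) * (1 - x) ** (b - 1)

def moment(k, a, b, n=100_000):
    # midpoint-rule approximation of E{X^k} = integral of x^k f(x) over [0, 1]
    h = 1.0 / n
    return sum(((i + 0.5) * h) ** k * beta_pdf((i + 0.5) * h, a, b) * h
               for i in range(n))

a, b = 2.0, 3.0
m1 = moment(1, a, b)                  # should approach alpha/(alpha+beta) = 0.4
var_hat = moment(2, a, b) - m1 ** 2   # should approach 0.04 per the table

assert abs(m1 - a / (a + b)) < 1e-6
assert abs(var_hat - a * b / ((a + b) ** 2 * (a + b + 1))) < 1e-6
```

The same check works for any entry in Tables 1.2 and 1.3 whose density is available in closed form.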
TABLE 1.3 Notation and description for more common probability distributions of continuous random variables

Lognormal — X ~ Lognormal(μ, σ²), μ ∈ ℝ and σ > 0.
  Density: f(x) = [1 / (x√(2πσ²))] exp{−(1/2)((log x − μ)/σ)²}, x > 0.
  Mean and variance: E{X} = exp{μ + σ²/2}, var{X} = exp{2μ + 2σ²} − exp{2μ + σ²}.

Multivariate Normal — X ~ N_k(μ, Σ), μ = (μ1, . . . , μk) ∈ ℝ^k, Σ positive definite.
  Density: f(x) = exp{−(x − μ)^T Σ⁻¹ (x − μ)/2} / [(2π)^{k/2} |Σ|^{1/2}], x = (x1, . . . , xk) ∈ ℝ^k.
  Mean and variance: E{X} = μ, var{X} = Σ.

Normal — X ~ N(μ, σ²), μ ∈ ℝ and σ > 0.
  Density: f(x) = [1/√(2πσ²)] exp{−(1/2)((x − μ)/σ)²}, x ∈ ℝ.
  Mean and variance: E{X} = μ, var{X} = σ².

Student's t — X ~ t_ν, ν > 0.
  Density: f(x) = [Γ((ν + 1)/2) / (Γ(ν/2)√(πν))] (1 + x²/ν)^{−(ν+1)/2}, x ∈ ℝ.
  Mean and variance: E{X} = 0 if ν > 1, var{X} = ν/(ν − 2) if ν > 2.

Uniform — X ~ Unif(a, b), a, b ∈ ℝ and a < b.
  Density: f(x) = 1/(b − a), x ∈ [a, b].
  Mean and variance: E{X} = (a + b)/2, var{X} = (b − a)²/12.

Weibull — X ~ Weibull(a, b), a > 0 and b > 0.
  Density: f(x) = abx^{b−1} exp{−ax^b}, x > 0.
  Mean and variance: E{X} = Γ(1 + 1/b)/a^{1/b}, var{X} = [Γ(1 + 2/b) − Γ(1 + 1/b)²]/a^{2/b}.
    (n choose k1, . . . , km) = n! / (k1! · · · km!),  where n = Σ_{i=1}^m ki,              (1.15)

    Γ(r) = (r − 1)! for r = 1, 2, . . . , and Γ(r) = ∫₀^∞ t^{r−1} exp{−t} dt for general r > 0.   (1.16)
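Both constants are easy to sanity-check in code. The sketch below (an added illustration) evaluates the multinomial coefficient of (1.15) for a small example and confirms Γ(r) = (r − 1)! at integer arguments.

```python
import math

# Multinomial coefficient (1.15): n! / (k1! * ... * km!) with n = sum of the ki
ks = [3, 2, 4]
n = sum(ks)
coef = math.factorial(n)
for k in ks:
    coef //= math.factorial(k)
assert coef == 1260          # 9! / (3! 2! 4!) = 362880 / 288

# Gamma function (1.16): Gamma(r) = (r - 1)! for positive integer r
for r in range(1, 10):
    assert abs(math.gamma(r) - math.factorial(r - 1)) < 1e-9 * math.factorial(r - 1)
```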
It is worth knowing that

    Γ(1/2) = √π  and  Γ(n + 1/2) = [1 × 3 × 5 × · · · × (2n − 1)] √π / 2^n
for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as

    f(x|γ) = c1(x) c2(γ) exp{ Σ_{i=1}^k yi(x) θi(γ) }                       (1.17)

for nonnegative functions c1 and c2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θi(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The yi(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that

    E{y(X)} = κ′(θ)                                                         (1.18)

and

    var{y(X)} = κ″(θ),                                                      (1.19)

where κ(θ) = −log c3(θ), letting c3(θ) denote the reexpression of c2(γ) in terms of the canonical parameters θ = (θ1, . . . , θk), and y(X) = (y1(X), . . . , yk(X)). These results can be rewritten in terms of the original parameters γ as
    E{ Σ_{i=1}^k [dθi(γ)/dγj] yi(X) } = −(d/dγj) log c2(γ)                  (1.20)

and

    var{ Σ_{i=1}^k [dθi(γ)/dγj] yi(X) } = −(d²/dγj²) log c2(γ) − E{ Σ_{i=1}^k [d²θi(γ)/dγj²] yi(X) }.   (1.21)
Example 1.1 (Poisson). The Poisson distribution belongs to the exponential family, with c1(x) = 1/x!, c2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ′(θ) = exp{θ} = λ and var{X} = κ″(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
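To see (1.18) and (1.19) in action numerically, the sketch below (an added illustration, not the book's code) differentiates κ(θ) = exp{θ} by central finite differences at θ = log λ, recovering E{X} = var{X} = λ.

```python
import math

lam = 3.7                       # a hypothetical Poisson rate
theta = math.log(lam)           # canonical parameter theta(lambda) = log(lambda)
kappa = math.exp                # kappa(theta) = exp{theta} for the Poisson family

h = 1e-5
# central first difference approximates kappa'(theta) = E{X}
kprime = (kappa(theta + h) - kappa(theta - h)) / (2 * h)
# central second difference approximates kappa''(theta) = var{X}
kdouble = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h ** 2

assert abs(kprime - lam) < 1e-4
assert abs(kdouble - lam) < 1e-3
```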
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X1, . . . , Xp) denote a p-dimensional random variable with continuous density function f. Suppose that

    U = g(X) = (g1(X), . . . , gp(X)) = (U1, . . . , Up),                    (1.22)

where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is

    f(u) = f(g⁻¹(u)) |J(u)|,                                                 (1.23)

where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g⁻¹ evaluated at u, having (i, j)th element dxi/duj, where these derivatives are assumed to be continuous over the support region of U.
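A one-dimensional instance of (1.23): if X ~ N(μ, σ²) and U = g(X) = exp{X}, then x = g⁻¹(u) = log u and |J(u)| = |dx/du| = 1/u, so f_U(u) = f_X(log u)/u, which is the Lognormal(μ, σ²) density of Table 1.3. The sketch below (an added illustration with hypothetical parameter values) checks this identity pointwise.

```python
import math

mu, sigma = 0.3, 0.8   # hypothetical parameters for X ~ N(mu, sigma^2)

def normal_pdf(x):
    # Normal density from Table 1.3
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / math.sqrt(2 * math.pi * sigma ** 2)

def lognormal_pdf(u):
    # Lognormal density from Table 1.3
    return math.exp(-0.5 * ((math.log(u) - mu) / sigma) ** 2) / (
        u * math.sqrt(2 * math.pi * sigma ** 2))

# U = exp{X} is one-to-one with inverse x = log u and |J(u)| = 1/u,
# so (1.23) gives f_U(u) = f_X(log u) * (1/u).
for u in (0.1, 0.5, 1.0, 2.0, 7.3):
    assert abs(normal_pdf(math.log(u)) / u - lognormal_pdf(u)) < 1e-12
```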
1.4 LIKELIHOOD INFERENCE

If X1, . . . , Xn are independent and identically distributed (i.i.d.), each having density f(x|θ) that depends on a vector of p unknown parameters θ = (θ1, . . . , θp), then the joint likelihood function is

    L(θ) = Π_{i=1}^n f(xi|θ).                                               (1.24)

When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x1, . . . , xn|θ) viewed as a function of θ.

The observed data, x1, . . . , xn, might have been realized under many different values for θ. The parameters for which observing x1, . . . , xn would be most likely constitute the maximum likelihood estimate of θ. In other words, if θ̂ is the function of x1, . . . , xn that maximizes L(θ), then θ̂ = θ̂(X1, . . . , Xn) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of the MLE of θ.
It is typically easier to work with the log likelihood function,

    l(θ) = log L(θ),                                                        (1.25)

which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x1, . . . , xn but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations

    l′(θ) = 0,                                                              (1.26)

where

    l′(θ) = (dl(θ)/dθ1, . . . , dl(θ)/dθp)

is called the score function. The score function satisfies

    E{l′(θ)} = 0,                                                           (1.27)

where the expectation is taken with respect to the distribution of X1, . . . , Xn. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.
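As a concrete instance where (1.26) does have an analytical solution, consider i.i.d. Poisson data: dropping additive constants, l(λ) = Σxi log λ − nλ, so the score is l′(λ) = Σxi/λ − n, whose root is the sample mean. The sketch below (an added illustration with made-up counts) solves the score equation with Newton's method and confirms it lands on the closed form.

```python
# hypothetical observed Poisson counts
x = [2, 4, 1, 0, 3, 5, 2, 2]
n = len(x)
s = sum(x)

def score(lam):
    # l'(lambda) = sum(x_i)/lambda - n for the Poisson log likelihood
    return s / lam - n

# Newton's method on the score equation l'(lambda) = 0
lam = 1.0
for _ in range(50):
    deriv = -s / lam ** 2          # l''(lambda)
    lam -= score(lam) / deriv

assert abs(lam - s / n) < 1e-10    # the closed-form MLE is the sample mean
```

The same Newton iteration is the prototype for the root-finding and optimization methods developed later in the book for models without a closed-form MLE.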
The MLE has a sampling distribution because it depends on the realization of the random variables X1, . . . , Xn. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: when the log likelihood is very pointy, the location of the maximum is more precisely known.

To make this precise, let l″(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθi dθj). The Fisher information matrix is defined as

    I(θ) = E{l′(θ) l′(θ)^T} = −E{l″(θ)},                                    (1.28)

where the expectations are taken with respect to the distribution of X1, . . . , Xn. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l″(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.

Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)⁻¹, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is N_p(θ*, I(θ*)⁻¹). Since the true parameter values are unknown, I(θ*)⁻¹ must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)⁻¹. Alternatively, it is also reasonable to use −l″(θ̂)⁻¹. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)⁻¹. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)⁻¹ can be found in [127, 182, 371, 470].
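For the i.i.d. Poisson model these quantities are simple: the expected information is I(λ) = n/λ and the observed information is −l″(λ) = Σxi/λ², and the two coincide when evaluated at the MLE since Σxi = nλ̂. The sketch below (an added illustration with made-up counts) computes both and the resulting standard error √(λ̂/n).

```python
import math

# hypothetical observed Poisson counts
x = [2, 4, 1, 0, 3, 5, 2, 2]
n = len(x)
lam_hat = sum(x) / n                       # Poisson MLE (the sample mean)

expected_info = n / lam_hat                # I(lambda) = n/lambda, evaluated at the MLE
observed_info = sum(x) / lam_hat ** 2      # -l''(lambda) = sum(x_i)/lambda^2 at the MLE

# For this model the two estimates agree exactly at lambda-hat.
assert abs(expected_info - observed_info) < 1e-12

se = math.sqrt(1 / expected_info)          # estimated standard error of lambda-hat
assert abs(se - math.sqrt(lam_hat / n)) < 1e-12
```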
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to the parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is

    L(φ|μ̂(φ)) = max_μ L(μ, φ).                                              (1.29)

Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
Profile likelihoods provide an informative way to graph a higher-dimensionallikelihood surface to make inference about some parameters while treating othersas nuisance parameters and to facilitate various optimization strategies The profilelikelihood is obtained by constrained maximization of the full likelihood with respectto parameters to be ignored If θ = (μ φ) then the profile likelihood for φ is
L(φ|μ(φ)) = maxμ
L(μ φ) (129)
Thus for each possible φ a value of μ is chosen to maximize L(μ φ) This optimalμ is a function of φ The profile likelihood is the function that maps φ to the valueof the full likelihood evaluated at φ and its corresponding optimal μ Note that the φ
\[
\binom{n}{k_1, \ldots, k_m} = \frac{n!}{\prod_{i=1}^{m} k_i!}, \qquad \text{where } n = \sum_{i=1}^{m} k_i. \tag{1.15}
\]
\[
\Gamma(r) =
\begin{cases}
(r-1)! & \text{for } r = 1, 2, \ldots, \\[4pt]
\int_0^\infty t^{r-1} e^{-t}\, dt & \text{for general } r > 0.
\end{cases} \tag{1.16}
\]
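Equation (1.16) is easy to sanity-check numerically. The sketch below (illustrative, not from the text) uses Python's standard `math.gamma` and a simple midpoint-rule approximation to the defining integral:

```python
import math

# Gamma(r) = (r-1)! for positive integers r.
for r in range(1, 8):
    assert math.isclose(math.gamma(r), math.factorial(r - 1))

# For general r > 0, Gamma(r) equals the integral of t^(r-1) * exp(-t) over (0, inf),
# approximated here by a midpoint rule on a truncated grid.
def gamma_quad(r, upper=60.0, n=200_000):
    h = upper / n
    return sum(((i + 0.5) * h) ** (r - 1) * math.exp(-(i + 0.5) * h)
               for i in range(n)) * h

assert math.isclose(gamma_quad(2.5), math.gamma(2.5), rel_tol=1e-6)
```

The truncation point and grid size are rough choices; the tail beyond t = 60 is negligible for moderate r.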
It is worth knowing that
\[
\Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi} \qquad \text{and} \qquad
\Gamma\!\left(n + \tfrac{1}{2}\right) = \frac{1 \times 3 \times 5 \times \cdots \times (2n-1)}{2^n}\,\sqrt{\pi}
\]
for positive integer n.

Many of the distributions commonly used in statistics are members of an exponential family. A k-parameter exponential family density can be expressed as
\[
f(x \mid \gamma) = c_1(x)\, c_2(\gamma) \exp\left\{ \sum_{i=1}^{k} y_i(x)\, \theta_i(\gamma) \right\} \tag{1.17}
\]
for nonnegative functions c_1 and c_2. The vector γ denotes the familiar parameters, such as λ for the Poisson density and p for the binomial density. The real-valued θ_i(γ) are the natural, or canonical, parameters, which are usually transformations of γ. The y_i(x) are the sufficient statistics for the canonical parameters. It is straightforward to show that
\[
\mathrm{E}\{y(X)\} = \kappa'(\theta) \tag{1.18}
\]
and
\[
\mathrm{var}\{y(X)\} = \kappa''(\theta), \tag{1.19}
\]
where κ(θ) = −log c_3(θ), letting c_3(θ) denote the reexpression of c_2(γ) in terms of the canonical parameters θ = (θ_1, ..., θ_k), and y(X) = (y_1(X), ..., y_k(X)). These results can be rewritten in terms of the original parameters γ as
\[
\mathrm{E}\left\{ \sum_{i=1}^{k} \frac{d\theta_i(\gamma)}{d\gamma_j}\, y_i(X) \right\} = -\frac{d}{d\gamma_j} \log c_2(\gamma) \tag{1.20}
\]
and
\[
\mathrm{var}\left\{ \sum_{i=1}^{k} \frac{d\theta_i(\gamma)}{d\gamma_j}\, y_i(X) \right\}
= -\frac{d^2}{d\gamma_j^2} \log c_2(\gamma)
- \mathrm{E}\left\{ \sum_{i=1}^{k} \frac{d^2\theta_i(\gamma)}{d\gamma_j^2}\, y_i(X) \right\}. \tag{1.21}
\]
Example 1.1 (Poisson). The Poisson distribution belongs to the exponential family, with c_1(x) = 1/x!, c_2(λ) = exp{−λ}, y(x) = x, and θ(λ) = log λ. Deriving moments in terms of θ, we have κ(θ) = exp{θ}, so E{X} = κ′(θ) = exp{θ} = λ and var{X} = κ″(θ) = exp{θ} = λ. The same results may be obtained with (1.20) and (1.21), noting that dθ/dλ = 1/λ. For example, (1.20) gives E{X/λ} = 1.
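The moment calculations in Example 1.1 can be verified numerically: with κ(θ) = exp{θ}, finite-difference approximations to κ′ and κ″ at θ = log λ should both recover λ. A minimal sketch (illustrative, not from the text):

```python
import math

lam = 3.7
theta = math.log(lam)          # canonical parameter theta = log(lambda)
kappa = lambda t: math.exp(t)  # kappa(theta) = -log c3(theta) = exp(theta)

h = 1e-5
# Central differences approximating kappa'(theta) and kappa''(theta).
kprime = (kappa(theta + h) - kappa(theta - h)) / (2 * h)                        # E{X}
kdoubleprime = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h**2  # var{X}

assert math.isclose(kprime, lam, rel_tol=1e-6)       # mean equals lambda
assert math.isclose(kdoubleprime, lam, rel_tol=1e-3) # variance equals lambda
```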
It is also important to know how the distribution of a random variable changes when it is transformed. Let X = (X_1, ..., X_p) denote a p-dimensional random variable with continuous density function f. Suppose that
\[
U = g(X) = (g_1(X), \ldots, g_p(X)) = (U_1, \ldots, U_p), \tag{1.22}
\]
where g is a one-to-one function mapping the support region of f onto the space of all u = g(x) for which x satisfies f(x) > 0. To derive the probability distribution of U from that of X, we need to use the Jacobian matrix. The density of the transformed variables is
\[
f(u) = f(g^{-1}(u))\, |J(u)|, \tag{1.23}
\]
where |J(u)| is the absolute value of the determinant of the Jacobian matrix of g^{-1} evaluated at u, having (i, j)th element dx_i/du_j, where these derivatives are assumed to be continuous over the support region of U.
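As a concrete check of (1.23), consider the scalar transformation U = g(X) = e^X with X ~ N(0, 1). Then g⁻¹(u) = log u and |J(u)| = 1/u, so (1.23) yields the lognormal density. The sketch below (an illustration, not from the text) confirms that the transformed density integrates to approximately 1:

```python
import math

def f_X(x):
    # Standard normal density.
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def f_U(u):
    # Density of U = exp(X) via (1.23): f_X(g^{-1}(u)) * |J(u)|, with |J(u)| = 1/u.
    return f_X(math.log(u)) / u

# Midpoint-rule integral of f_U over (0, 100]; the tail beyond 100 is negligible.
n, upper = 200_000, 100.0
h = upper / n
total = sum(f_U((i + 0.5) * h) for i in range(n)) * h
assert abs(total - 1.0) < 1e-3
```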
1.4 LIKELIHOOD INFERENCE
If X_1, ..., X_n are independent and identically distributed (i.i.d.), each having density f(x | θ) that depends on a vector of p unknown parameters θ = (θ_1, ..., θ_p), then the joint likelihood function is
\[
L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta). \tag{1.24}
\]
When the data are not i.i.d., the joint likelihood is still expressed as the joint density f(x_1, ..., x_n | θ) viewed as a function of θ.
The observed data x_1, ..., x_n might have been realized under many different values for θ. The parameters for which observing x_1, ..., x_n would be most likely constitute the maximum likelihood estimate of θ. In other words, if ϑ is the function of x_1, ..., x_n that maximizes L(θ), then θ̂ = ϑ(X_1, ..., X_n) is the maximum likelihood estimator (MLE) for θ. MLEs are invariant to transformation, so the MLE of a transformation of θ equals the transformation of θ̂.
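Invariance can be illustrated with a quick numerical sketch (not from the text): for i.i.d. Poisson data the MLE of λ is the sample mean, and maximizing the likelihood reparameterized in θ = log λ lands at the log of that same maximizer.

```python
import math

x = [2, 4, 1, 3, 5, 2, 0, 3]            # toy count data
n, s = len(x), sum(x)

def loglik_lambda(lam):                  # Poisson log likelihood, constants dropped
    return s * math.log(lam) - n * lam

def loglik_theta(theta):                 # same likelihood in theta = log(lambda)
    return s * theta - n * math.exp(theta)

# Grid-search maximizers in each parameterization.
lam_hat = max((0.01 * k for k in range(1, 1001)), key=loglik_lambda)
theta_hat = max((-3 + 0.001 * k for k in range(1, 6001)), key=loglik_theta)

assert math.isclose(lam_hat, s / n)                  # analytic MLE: the sample mean
assert abs(theta_hat - math.log(lam_hat)) < 0.002    # invariance: theta_hat ~ log(lam_hat)
```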
It is typically easier to work with the log likelihood function
\[
l(\theta) = \log L(\theta), \tag{1.25}
\]
which has the same maximum as the original likelihood, since log is a strictly monotonic function. Furthermore, any additive constants (involving possibly x_1, ..., x_n but not θ) may be omitted from the log likelihood without changing the location of its maximum or differences between log likelihoods at different θ. Note that maximizing L(θ) with respect to θ is equivalent to solving the system of equations
\[
l'(\theta) = \mathbf{0}, \tag{1.26}
\]
where
\[
l'(\theta) = \left( \frac{dl(\theta)}{d\theta_1}, \ldots, \frac{dl(\theta)}{d\theta_p} \right)
\]
is called the score function. The score function satisfies
\[
\mathrm{E}\{l'(\theta)\} = \mathbf{0}, \tag{1.27}
\]
where the expectation is taken with respect to the distribution of X_1, ..., X_n. Sometimes an analytical solution to (1.26) provides the MLE; this book describes a variety of methods that can be used when the MLE cannot be solved for in closed form. It is worth noting that there are pathological circumstances where the MLE is not a solution of the score equation, or the MLE is not unique; see [127] for examples.
The MLE has a sampling distribution because it depends on the realization of the random variables X_1, ..., X_n. The MLE may be biased or unbiased for θ, yet under quite general conditions it is asymptotically unbiased as n → ∞. The sampling variance of the MLE depends on the average curvature of the log likelihood: when the log likelihood is very pointy, the location of the maximum is more precisely known.
To make this precise, let l″(θ) denote the p × p matrix having (i, j)th element given by d²l(θ)/(dθ_i dθ_j). The Fisher information matrix is defined as
\[
I(\theta) = \mathrm{E}\{l'(\theta)\, l'(\theta)^{\mathsf{T}}\} = -\mathrm{E}\{l''(\theta)\}, \tag{1.28}
\]
where the expectations are taken with respect to the distribution of X_1, ..., X_n. The final equality in (1.28) requires mild assumptions, which are satisfied, for example, in exponential families. I(θ) may sometimes be called the expected Fisher information to distinguish it from −l″(θ), which is the observed Fisher information. There are two reasons why the observed Fisher information is quite useful. First, it can be calculated even if the expectations in (1.28) cannot easily be computed. Second, it is a good approximation to I(θ) that improves as n increases.
Under regularity conditions, the asymptotic variance–covariance matrix of the MLE θ̂ is I(θ*)⁻¹, where θ* denotes the true value of θ. Indeed, as n → ∞, the limiting distribution of θ̂ is N_p(θ*, I(θ*)⁻¹). Since the true parameter values are unknown, I(θ*)⁻¹ must be estimated in order to estimate the variance–covariance matrix of the MLE. An obvious approach is to use I(θ̂)⁻¹. Alternatively, it is also reasonable to use −l″(θ̂)⁻¹. Standard errors for individual parameter MLEs can be estimated by taking the square root of the diagonal elements of the chosen estimate of I(θ*)⁻¹. A thorough introduction to maximum likelihood theory and the relative merits of these estimates of I(θ*)⁻¹ can be found in [127, 182, 371, 470].
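As an illustrative sketch (not from the text), consider i.i.d. Poisson(λ) data. Here λ̂ = x̄, the expected information is I(λ) = n/λ, and the observed information is −l″(λ̂) = nx̄/λ̂² = n/x̄, so the two estimates coincide at the MLE, giving standard error sqrt(x̄/n):

```python
import math, random

random.seed(1)
lam_true, n = 4.0, 500

def rpois(lam):
    # Poisson draw by inverting the CDF (standard library only).
    u, p, cdf, k = random.random(), math.exp(-lam), math.exp(-lam), 0
    while u > cdf and k < 1000:
        k += 1
        p *= lam / k
        cdf += p
    return k

x = [rpois(lam_true) for _ in range(n)]
lam_hat = sum(x) / n                       # MLE of lambda

obs_info = sum(x) / lam_hat**2             # -l''(lam_hat) = n / x_bar
exp_info = n / lam_hat                     # I(lambda) evaluated at lam_hat
assert math.isclose(obs_info, exp_info)    # the two coincide at the Poisson MLE

se = math.sqrt(1 / obs_info)               # standard error sqrt(lam_hat / n)
assert abs(lam_hat - lam_true) < 0.5       # estimate close to truth for n = 500
```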
Profile likelihoods provide an informative way to graph a higher-dimensional likelihood surface, to make inference about some parameters while treating others as nuisance parameters, and to facilitate various optimization strategies. The profile likelihood is obtained by constrained maximization of the full likelihood with respect to parameters to be ignored. If θ = (μ, φ), then the profile likelihood for φ is
\[
L(\phi \mid \hat{\mu}(\phi)) = \max_{\mu} L(\mu, \phi). \tag{1.29}
\]
Thus, for each possible φ, a value of μ is chosen to maximize L(μ, φ). This optimal μ is a function of φ. The profile likelihood is the function that maps φ to the value of the full likelihood evaluated at φ and its corresponding optimal μ. Note that the φ
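The constrained maximization in (1.29) can be sketched numerically. In the hypothetical example below (not from the text), θ = (μ, σ) for a normal sample: for every fixed σ the inner maximizer of the nuisance parameter μ is the sample mean, and maximizing the profile over σ recovers the usual MLE of σ.

```python
import math

x = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.2]   # toy data
n, xbar = len(x), sum(x) / len(x)

def loglik(mu, sigma):
    # Normal log likelihood, additive constants dropped.
    return -n * math.log(sigma) - sum((xi - mu) ** 2 for xi in x) / (2 * sigma**2)

mu_grid = [3 + 0.005 * k for k in range(601)]   # grid over the nuisance parameter

def profile(sigma):
    # Profile log likelihood: maximize over mu for each fixed sigma, as in (1.29).
    return max(loglik(mu, sigma) for mu in mu_grid)

# The inner maximizer is the sample mean for every sigma ...
best_mu = max(mu_grid, key=lambda mu: loglik(mu, 1.0))
assert abs(best_mu - xbar) < 0.01

# ... and maximizing the profile over sigma recovers the usual MLE of sigma.
sigma_grid = [0.2 + 0.005 * k for k in range(401)]
sigma_hat = max(sigma_grid, key=profile)
sigma_mle = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / n)
assert abs(sigma_hat - sigma_mle) < 0.01
```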