
1 Preliminaries

This chapter describes several of the issues and methods of statistical inference. The discussion is at a fairly general level, and many of the details are not addressed here.

The purpose of an exploration of data may be rather limited and ad hoc, perhaps just to reach a yes-no decision about an hypothesis, or the purpose may be more general, perhaps to gain understanding of some natural phenomenon. The process of understanding often begins with general questions about the structure of the data. At any stage of the analysis, our understanding is facilitated by means of a model.

A model is a description that embodies our current understanding of a phenomenon. In an operational sense, we can formulate a model either as a description of a data-generating process, or as a prescription for processing data. The model is often expressed as a set of equations that relate data elements to each other. It may include probability distributions for the data elements. If any of the data elements are considered to be realizations of random variables, the model is a stochastic model.

A model should not limit our analysis; rather, the model should be able to evolve. The process of understanding involves successive refinements of the model. The refinements proceed from vague models to more specific ones. An exploratory data analysis may begin by mining the data to identify interesting properties. These properties generally raise questions that are to be explored further.

A family of models may have a common form within which the members of the family are distinguished by values of parameters. For example, the family of normal probability distributions has a single form of a probability density function that has two parameters. If this form of model is chosen to represent the properties of a dataset, we may seek confidence intervals for values of the two parameters or perform statistical tests of hypothesized values of the parameters.

In models that are not as mathematically tractable as the normal probability model (and many realistic models are not), computationally intensive methods involving simulations, resamplings, and multiple views may be used to make inferences about the parameters of a model.

1.1 Discovering Structure: Data Structures and Structure in Data

The components of statistical datasets are “observations” and “variables”. In general, “data structures” are ways of organizing data to take advantage of the relationships among the variables constituting the dataset. Data structures may express hierarchical relationships, crossed relationships (as in “relational” databases), or more complicated aspects of the data (as in “object-oriented” databases).

In data analysis, “structure in the data” is of interest. Structure in the data includes such nonparametric features as modes, gaps, or clusters in the data, the symmetry of the data, and other general aspects of the shape of the data. Because many classical techniques of statistical analysis rely on an assumption of normality of the data, the most interesting structure in the data may be those aspects of the data that deviate most from normality.

Sometimes, it is possible to express the structure in the data in terms of mathematical models. Prior to doing this, graphical displays may be used to discover qualitative structure in the data. Patterns observed in the data may suggest explicit statements of the structure or of relationships among the variables in the dataset. The process of building models of relationships is an iterative one, and graphical displays are useful throughout the process. Graphs comparing data and the fitted models are used to refine the models.

Multiple Analyses and Multiple Views

Effective use of graphics often requires multiple views. For multivariate data, plots of individual variables or combinations of variables can be produced quickly and used to get a general idea of the properties of the data. The data should be inspected from various perspectives. Instead of a single histogram to depict the general shape of univariate data, for example, multiple histograms with different bin widths and different bin locations may provide more insight.
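
As a minimal sketch in R (the hypothetical sample x and the bin counts are assumptions for illustration), several histograms of the same data with different numbers of suggested bins can be drawn side by side:

set.seed(1)
x <- rgamma(200, shape = 3)        # hypothetical univariate sample
op <- par(mfrow = c(2, 2))         # four panels
for (nb in c(5, 10, 20, 40)) {
  hist(x, breaks = nb,             # a single number is only a suggested bin count
       main = paste(nb, "suggested bins"), xlab = "x")
}
par(op)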

Sometimes, a few data points in a display can completely obscure interesting structure in the other data points. A zooming window to restrict the scope of the display and simultaneously restore the scale to an appropriate viewing size can reveal structure. A zooming window can be used with any graphics software whether the software supports it or not; zooming can be accomplished by deletion of the points in the dataset outside of the window.
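
A minimal sketch of this kind of “zooming” in R, assuming bivariate data (x, y) with one outlying point and a hypothetical window X ≤ 4, Y ≤ 1:

set.seed(1)
x <- c(runif(99, 0, 4), 8)                        # hypothetical data; the last point is outlying
y <- c(0.2 * x[1:99] + rnorm(99, sd = 0.05), 20)
keep <- x <= 4 & y <= 1                           # the zooming window
plot(x[keep], y[keep], xlab = "X", ylab = "Y")    # zoomed display via deletion of outside points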

Scaling the axes can also be used effectively to reveal structure. The relative scale is called the “aspect ratio”. In Figure 1.1, which is a plot of a bivariate dataset, we form a zooming window that deletes a single observation. The greater magnification and the changed aspect ratio clearly show a relationship between X and Y in a region close to the origin that may not hold for the full range of data. A simple statement of this relationship, however, would not extrapolate outside the window to the outlying point.

The use of a zooming window is not “deletion of outliers”; it is focusing in on a subset of the data and is done independently of whatever is believed about the data outside of the window.

[Figure 1.1. Scales Matter. Two scatterplots of Y against X for the same bivariate dataset: the full data, and a zoomed view, with a changed aspect ratio, of the region near the origin.]

One type of structure that may go undetected is that arising from the order in which the data were collected. For data that are recognized as a time series by the analyst, this is obviously not a problem, but often there is a time dependency in the data that is not recognized immediately. “Time” or “location” may not be an explicit variable on the dataset, even though it may be an important variable. The index of the observation within the dataset may be a surrogate variable for time, and characteristics of the data may vary as the index varies. Often it is useful to make plots in which one axis is the index number of the observations. More subtle time dependencies are those in which the values of the variables are not directly related to time, but relationships among variables are changing over time. The identification of such time dependencies is much more difficult, and often requires fitting a model and plotting residuals. Another strictly graphical way of observing changes in relationships over time is by using a sequence of graphical displays.


Simple Plots May Reveal the Unexpected

A simple plot of the data will often reveal structure or other characteristics of the data that numerical summaries do not.

An important property of data that is often easily seen in a graph is the unit of measurement. Data on continuous variables are often rounded or measured on a coarse grid. This may indicate other problems in the collection of the data. The horizontal lines in Figure 1.2 indicate that the data do not come from a continuous distribution. Whether we can use methods of data analysis that assume continuity depends on the coarseness of the grid or of the measurement; that is, on the extent to which the data are discrete or the extent to which they have been discretized.

[Figure 1.2. Discrete Data, Rounded Data, or Data Measured Imprecisely. A scatterplot of Y against X in which the points fall along horizontal lines.]

I discuss graphics further in Chapter 7. The emphasis in that chapter, however, is on the use of graphics for discovery. The field of statistical graphics is much broader, of course, and includes many issues of design of graphical displays for conveying information (rather than for discovering information).

1.2 Modeling and Statistical Inference

The process of building models involves successive refinements. The evolution of the models proceeds from vague, tentative models to more complete ones, and our understanding of the process being modeled grows in this process.


The usual statements about statistical methods regarding bias, variance, and so on are made in the context of a model. It is not possible to measure the bias or the variance of a procedure to select a model, except in the relatively simple case of selection from some well-defined and simple set of possible models. Only within the context of rigid assumptions (a “metamodel”) can we do a precise statistical analysis of model selection. Even the simple cases of selection of variables in linear regression analysis under the usual assumptions about the distribution of residuals (and this is a highly idealized situation) present more problems to the analyst than are generally recognized.

1.2.1 Random Variables and Probability Models

A random variable is a type of function whose values are real numbers. I generally use upper-case italic letters to denote random variables, and I denote the set of all real numbers by IR. (See pages 421 and 422 in Appendix C for notation that I use.) Some of the simplest models used in statistics are probability models for random variables.

For a random variable X, a probability model specifies the probability that X is in a given set of real numbers (specifically, a Borel set, but we will not dwell on technical details here). This probability can be expressed in terms of the probability that the random variable is less than or equal to some given numbers; hence, the fundamental function in a probability model is the cumulative distribution function, or CDF, which for the random variable X yields

Pr(X ≤ x)

for any real number x. We often denote the CDF of X as PX(x) or FX(x). (See page 423 and the following pages in Appendix C for a summary of notation relating to random variables and functions of random variables.)

It is clear that the CDF PX(x) is nondecreasing in x, that PX(x) → 0 as x → −∞, that PX(x) → 1 as x → ∞, and that PX(x) is continuous from the right.

Notice that if x is a fixed real number, then PX(x) is also a fixed real number. If X or Y is a random variable, then PX(X) or PX(Y) is also a random variable. The notation “PX(x)” implies that X is a random variable.

There are two classes of random variables that we will distinguish: “discrete random variables” and “continuous random variables”. We will avoid technical details concerning random variables in this text, but we can think of discrete random variables as those that can take on only a countable set of values with positive probability, while continuous random variables can assume any of an uncountable number of values (such as any number in an interval). (Out of habit, I will sometimes refer to a continuous random variable as an “absolutely continuous random variable”. For purposes of this text, that distinction can be ignored.) A probability model for a continuous random variable yields probabilities for the random variable taking values within an interval, and assigns 0 to the probability that it assumes any specific value. In general, the CDF of a continuous random variable is continuous, while the CDF of a discrete random variable is a step function, which is discontinuous.

A very important property of the CDF for continuous random variables is that if the continuous random variable X has CDF PX(x), then

PX(X) ∼ U(0, 1), (1.1)

where the symbol “∼” means “is distributed as”, and U(0, 1) represents the uniform distribution over the interval (0, 1). This property is used in many areas of statistics, notably in computational statistics for the generation of random numbers (see the discussion of the inverse CDF method in Chapter 2).
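
A small R sketch of property (1.1); the exponential distribution here is an assumed example:

set.seed(42)
x <- rexp(10000, rate = 1)    # X from an exponential(1) distribution
u <- pexp(x, rate = 1)        # P_X(X)
hist(u, freq = FALSE)         # approximately flat over (0, 1)
## Conversely, qexp(runif(10000), rate = 1) generates exponential variates
## by applying the inverse CDF to uniform variates.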

We call the derivative of a CDF PX the probability density function, or PDF, and we often denote it by the corresponding lower-case letter, pX or fX. (If the CDF is not differentiable in the usual sense, the PDF is a special type of derivative, and may denote the probability of a single point.)

There are several probability models that are useful over various ranges of applications. We call a probability model a “distribution”. Two common ones are the normal model, which we denote as N(µ, σ^2), and the uniform model mentioned above. Notation of the form N(µ, σ^2) may denote a specific distribution (if µ and σ^2 are assumed fixed), or it may denote the family of distributions. The concept of families of distributions is pervasive in statistics; the simple tasks of statistical inference often involve inference about specific values of the parameters in a given (assumed) family of distributions. Appendix D, beginning on page 435, lists several of the more useful families of distributions.

An important class of families of probability distributions consists of those whose probability densities are of the form

p(y | θ) = h(y) exp(θ^T g(y) − a(θ)),  if y ∈ Y,
         = 0,  otherwise, (1.2)

where θ is an m-vector, g(·) is an m-vector-valued function, and Y is some set that does not depend on θ. This class is called the exponential class of distributions, and a family of probability distributions in this class is called an exponential family.

The exponential class of probability distributions includes many of the familiar families of distributions, such as the normal, the binomial, the Poisson, the gamma, the Pareto, and the negative binomial.

Given any distribution, transformations of the random variable can be used to form various families of related distributions. For example, if X is a random variable, and Y = aX + b, with a ≠ 0, the family of distributions of all such Y is called a location-scale family. Many common families of distributions, for example the normal family, are location-scale families.

Often in statistics, we suppose that we have a random sample from a particular distribution. There are various types of random samples, but we will use the term in the sense of “simple random sample”, unless we qualify the data-generating process in some way. A random sample is the same as a set of random variables that are independent and have the same distribution; that is, the random variables are independent and identically distributed, or iid. We often use notation of the form X1, . . . , Xn ∼iid N(µ, σ^2), for example, to indicate that the n random variables are independent and identically normally distributed.

1.2.2 Descriptive Statistics and Inferential Statistics

We can distinguish statistical activities that involve:

• data collection;
• descriptions of a given dataset;
• inference within the context of a model or family of models; and
• model selection.

In any given application, it is likely that all of these activities will come into play. Sometimes (and often, ideally!), a statistician can specify how data are to be collected, either in surveys or in experiments. We will not be concerned with this aspect of the process in this text.

Once data are available, either from a survey or designed experiment, or just observational data, a statistical analysis begins by considering general descriptions of the dataset. These descriptions include ensemble characteristics, such as averages and spreads, and identification of extreme points. The descriptions are in the form of various summary statistics and graphical displays. The descriptive analyses may be computationally intensive for large datasets, especially if there are a large number of variables. The computationally intensive approach also involves multiple views of the data, including consideration of various transformations of the data. We discuss these methods in Chapters 5 and 7 and in Part II.

A stochastic model is often expressed as a probability density function or as a cumulative distribution function of a random variable. In a simple linear regression model with normal errors,

Y = β0 + β1x + E, (1.3)

for example, the model may be expressed by use of the probability density function for the random variable E. (Notice that Y and E are written in uppercase because they represent random variables.) The probability density function for Y is

p(y) = (1/(√(2π) σ)) exp(−(y − β0 − β1x)^2 / (2σ^2)). (1.4)

In this model, x is an observable covariate; σ, β0, and β1 are unobservable (and, generally, unknown) parameters; and 2 and π are constants. Statistical inference about parameters includes estimation or tests of their values or statements about their probability distributions based on observations of the elements of the model.
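
As a hedged illustration in R, one can simulate data from model (1.3) with assumed parameter values β0 = 2, β1 = 0.5, and σ = 1, and then estimate the parameters by least squares:

set.seed(123)
n <- 100
x <- runif(n, 0, 10)                  # observable covariate
y <- 2 + 0.5 * x + rnorm(n, sd = 1)   # realizations of Y under the assumed parameter values
fit <- lm(y ~ x)
coef(fit)                             # estimates of beta0 and beta1
summary(fit)$sigma                    # estimate of sigma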

The elements of a stochastic model include observable random variables, observable covariates, unobservable parameters, and constants. Some random variables in the model may be considered to be “responses”. The covariates may be considered to affect the response; they may or may not be random variables. The parameters are variable within a class of models, but for a specific data model the parameters are constants. The parameters may be considered to be unobservable random variables, and in that sense, a specific data model is defined by a realization of the parameter random variable. In the model, written as

Y = f(x; β) + E, (1.5)

we identify a “systematic component”, f(x; β), and a “random component”, E.

1.2.3 Model Building and Computational Inference

The selection of an appropriate model may be very difficult, and almost always involves not only questions of how well the model corresponds to the observed data, but also the tractability of the model; that is, whether or not the model is usable. In some cases, usability means the existence of closed-form expressions or formulas that can be evaluated by ordinary arithmetic. The methods of computational statistics allow a much wider range of tractability than can be contemplated in mathematical statistics.

Statistical analyses generally are undertaken with the purpose of making a decision about a dataset, about a population from which a sample dataset is available, or in making a prediction about a future event. Much of the theory of statistics developed during the middle third of the twentieth century was concerned with formal inference; that is, use of a sample to make decisions about stochastic models based on probabilities that would result if a given model was indeed the data-generating process. The heuristic paradigm calls for rejection of a model if the probability is small that data arising from the model would be similar to the observed sample. This process can be quite tedious because of the wide range of models that should be explored and because some of the models may not yield mathematically tractable estimators or test statistics. Computationally intensive methods include exploration of a range of models, many of which may be mathematically intractable.

In a different approach employing the same paradigm, the statistical methods may involve direct simulation of the hypothesized data-generating process rather than formal computations of probabilities that would result under a given model of the data-generating process. We refer to this approach as computational inference. We discuss methods of computational inference in Chapters 2, 3, and 4. In a variation of computational inference, we may not even attempt to develop a model of the data-generating process; rather, we build decision rules directly from the data. This is often the approach in clustering and classification, which we discuss in Chapters 10 and 12. Computational inference is rooted in classical statistical inference. In subsequent sections of the current chapter, we discuss general techniques used in statistical inference.

1.3 The Role of the Empirical Cumulative Distribution Function

Methods of statistical inference are based on an assumption (often implicit) that a discrete uniform distribution with mass points at the observed values of a random sample is asymptotically the same as the distribution governing the data-generating process. Thus, the distribution function of this discrete uniform distribution is a model of the distribution function of the data-generating process.

For a given set of univariate data, y1, . . . , yn, the empirical cumulative distribution function, or ECDF, is

Pn(y) = #{yi, s.t. yi ≤ y} / n. (1.6)

The ECDF is the basic function used in many methods of computational inference.
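
In R, the ECDF of a sample is available through the ecdf function; a minimal sketch with a hypothetical sample y:

set.seed(1)
y <- rnorm(25)     # hypothetical sample
Pn <- ecdf(y)      # the ECDF P_n as a step function
Pn(0)              # proportion of observations <= 0
plot(Pn)           # the familiar staircase plot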

Although the ECDF has similar definitions for univariate and multivariate random variables, it is most useful in the univariate case. An equivalent expression for univariate random variables, in terms of intervals on the real line, is

Pn(y) = (1/n) Σ_{i=1}^n I_(−∞,y](yi), (1.7)

where I is the indicator function. (See page 427 for the definition and some of the properties of the indicator function. The measure dI_(−∞,a](x), which we use in equation (1.15) below, is particularly interesting.)

It is easy to see that the ECDF is pointwise unbiased for the CDF; that is, if the yi are independent realizations of random variables Yi, each with CDF P(·), for a given y,

E(Pn(y)) = E( (1/n) Σ_{i=1}^n I_(−∞,y](Yi) )
         = (1/n) Σ_{i=1}^n E( I_(−∞,y](Yi) )
         = Pr(Y ≤ y)
         = P(y). (1.8)

Similarly, we find


V(Pn(y)) = P(y)(1 − P(y))/n; (1.9)

indeed, at a fixed point y, nPn(y) is a binomial random variable with parameters n and π = P(y). Because Pn is a function of the order statistics, which form a complete sufficient statistic for P, there is no unbiased estimator of P(y) with smaller variance.

We also define the empirical probability density function (EPDF) as the derivative of the ECDF:

pn(y) = (1/n) Σ_{i=1}^n δ(y − yi), (1.10)

where δ is the Dirac delta function. The EPDF is just a series of spikes at points corresponding to the observed values. It is not as useful as the ECDF. It is, however, unbiased at any point for the probability density function at that point.

The ECDF and the EPDF can be used as estimators of the corresponding population functions, but there are better estimators (see Chapter 9).

1.3.1 Statistical Functions of the CDF and the ECDF

In many models of interest, a parameter can be expressed as a functional of the probability density function or of the cumulative distribution function of a random variable in the model. The mean of a distribution, for example, can be expressed as a functional Θ of the CDF P:

Θ(P) = ∫_{IR^d} y dP(y). (1.11)

A functional that defines a parameter is called a statistical function.

Estimation of Statistical Functions

A common task in statistics is to use a random sample to estimate the parameters of a probability distribution. If the statistic T from a random sample is used to estimate the parameter θ, we measure the performance of T by the magnitude of the bias,

|E(T) − θ|, (1.12)

by the variance,

V(T) = E( (T − E(T)) (T − E(T))^T ), (1.13)

by the mean squared error,

E( (T − θ)^T (T − θ) ), (1.14)


and by other expected values of measures of the distance from T to θ. (These expressions above are for the scalar case, but similar expressions apply to vectors T and θ, in which case the bias is a vector, the variance is the variance-covariance matrix, and the mean squared error is a dot product and hence a scalar.)

If E(T) = θ, T is unbiased for θ. For sample size n, if E(T) = θ + O(n^{−1/2}), T is said to be first-order accurate for θ; if E(T) = θ + O(n^{−1}), it is second-order accurate. (See page 428 for the definition of O(·). Convergence of E(T) can also be expressed as a stochastic convergence of T, in which case we use the notation OP(·).)

The order of the mean squared error is an important characteristic of an estimator. For good estimators of location, the order of the mean squared error is typically O(n^{−1}). Good estimators of probability densities, however, typically have mean squared errors of at least order O(n^{−4/5}) (see Chapter 9).

Estimation Using the ECDF

There are many ways to construct an estimator and to make inferences about the population. In the univariate case especially, we often use data to make inferences about a parameter by applying the statistical function to the ECDF. An estimator of a parameter that is defined in this way is called a plug-in estimator. A plug-in estimator for a given parameter is the same functional of the ECDF as the parameter is of the CDF.

For the mean of the model, for example, we use the estimate that is the same functional of the ECDF as the population mean in equation (1.11),

Θ(Pn) = ∫_{−∞}^{∞} y dPn(y)
      = ∫_{−∞}^{∞} y d( (1/n) Σ_{i=1}^n I_(−∞,y](yi) )
      = (1/n) Σ_{i=1}^n ∫_{−∞}^{∞} y dI_(−∞,y](yi)
      = (1/n) Σ_{i=1}^n yi
      = ȳ. (1.15)

The sample mean is thus a plug-in estimator of the population mean. Such an estimator is called a method of moments estimator. This is an important type of plug-in estimator. The method of moments results in estimates of the population moments E(Y^r) that are the corresponding sample moments. Population parameters that are functions of population moments are estimated by the corresponding functions of sample moments.
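
A minimal sketch in R (the gamma sample is an assumption for illustration): the sample mean and the sample second central moment are the plug-in estimates of the population mean and variance.

set.seed(2)
y <- rgamma(100, shape = 4, rate = 2)   # hypothetical sample
mean(y)                                 # plug-in estimate of E(Y)
mean((y - mean(y))^2)                   # plug-in estimate of V(Y) (divides by n, not n - 1)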


Statistical properties of plug-in estimators are generally relatively easy to determine. In some cases, the statistical properties, such as expectation and variance, are optimal in some sense.

In addition to estimation based on the ECDF, other methods of computational statistics make use of the ECDF. In some cases, such as in bootstrap methods, the ECDF is a surrogate for the CDF. In other cases, such as Monte Carlo methods, an ECDF for an estimator is constructed by repeated sampling, and that ECDF is used to make inferences using the observed value of the estimator from the given sample.

Viewed as a statistical function, Θ denotes a specific functional form. Any functional of the ECDF is a function of the data, so we may also use the notation Θ(Y1, . . . , Yn). Often, however, the notation is cleaner if we use another letter to denote the function of the data; for example, T(Y1, . . . , Yn), even if it might be the case that

T(Y1, . . . , Yn) = Θ(Pn).

We will also often use the same letter that denotes the functional of the sample to represent the random variable computed from a random sample; that is, we may write

T = T(Y1, . . . , Yn).

As usual, we will use t to denote a realization of the random variable T.

Use of the ECDF in statistical inference does not require many assumptions about the distribution. Other methods discussed below are based on information or assumptions about the data-generating process.

1.3.2 Order Statistics

In a set of iid random variables X1, . . . , Xn, it is often of interest to consider the ranked values Xi1 ≤ · · · ≤ Xin. These are called the order statistics and are denoted as X(1:n), . . . , X(n:n). For 1 ≤ k ≤ n, we refer to X(k:n) as the kth order statistic. We often use the simpler notation X(k), assuming that n is some fixed and known value. Also, we sometimes drop the parentheses in the other representation, Xk:n.

If the CDF of the n iid random variables is P(x) and the PDF is p(x), we can get the PDF of the kth order statistic by forming the joint density, and integrating out all variables except the kth order statistic. This yields

pX(k:n)(x) = ( n! / ((k − 1)!(n − k)!) ) (P(x))^{k−1} (1 − P(x))^{n−k} p(x). (1.16)

Clearly, the order statistics X(1:n), . . . , X(n:n) from an iid sample of random variables X1, . . . , Xn are neither independent nor identically distributed.

From equation (1.16), and the fact that if the random variable X has CDF P, then P(X) ∼ U(0, 1) (expression (1.1) on page 10), we see that the distribution of the kth order statistic from a U(0, 1) is the beta distribution with parameters k and n − k + 1; that is,

X(k:n) ∼ beta(k, n − k + 1), (1.17)

if X(k:n) is the kth order statistic in a sample of size n from a uniform(0, 1) distribution.
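
Expression (1.17) is easy to check by simulation; a minimal R sketch for the assumed values k = 3 and n = 10:

set.seed(7)
k <- 3; n <- 10
xk <- replicate(5000, sort(runif(n))[k])      # 3rd order statistic of U(0,1) samples of size 10
qqplot(qbeta(ppoints(5000), k, n - k + 1), xk,
       xlab = "beta(3, 8) quantiles", ylab = "simulated X(3:10)")
abline(0, 1)                                  # the points should lie near this line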

Another important fact about order statistics from a uniform distribution is that they have simple conditional distributions. Given the (k + 1)th order statistic, Uk+1:n = v, the conditional joint distribution of vU1:k, . . . , vUk:k is the same as the joint distribution of U1:n, . . . , Uk:n; that is,

(vU1:k, . . . , vUk:k) =d (U1:n, . . . , Uk:n). (1.18)

(See Exercise 1.8.)

Quantiles

For α ∈ (0, 1), the α quantile of the distribution with CDF P is the value x(α) such that P(x(α)) = α if such a value exists. (For a univariate random variable, this is a single point. For a d-variate random variable, it is a (d − 1)-dimensional object that is generally nonunique.)

In a discrete distribution, for a given value of α, there may be no value x(α) such that P(x(α)) = α. We can define a useful concept of “quantile”, however, that always exists. For a univariate probability distribution with CDF P, we define the function P−1 on the open interval (0, 1) as

P−1(α) = inf{x, s.t. P (x) ≥ α}. (1.19)

We call P−1 the quantile function. Notice that if P is strictly increasing, the quantile function is the ordinary inverse of the cumulative distribution function. If P is not strictly increasing, the quantile function can be interpreted as a generalized inverse of the cumulative distribution function. This definition is reasonable (at the expense of overloading the notation used for the ordinary inverse of a function) because, while a CDF may not be an invertible function, it is monotonic nondecreasing. Notice that for the univariate random variable X with CDF P, if x(α) = P−1(α), then x(α) is the α quantile of X as above.

In a discrete distribution, the quantile function is a step function, and the quantile is the same for values of α in an interval.

The quantile function, just as the CDF, fully determines a probability distribution.

It is clear that x(α) is a functional of the CDF, say Ξα(P). The functional is very simple. It is

Ξα(P) = P−1(α),

where P−1 is the quantile function.

If P(x) = α, we say the quantile-level of x is α. (We also sometimes use the term “quantile” by itself in a slightly different way: if P(x) = α, we say the quantile of x is α.)


Empirical Quantiles

The quantile function P_n^{-1} associated with an ECDF Pn leads to the order statistics on which the ECDF is based. These quantiles do not correspond symmetrically to the quantiles of the underlying population. For a quantile-level α < 2/n, P_n^{-1}(α) = x(1:n); then for each increase in α of 1/n, P_n^{-1} increases to the next order statistic, until finally, it is x(n:n) for the single limiting value α = 1. This likely disconnect between the quantiles of the ECDF and the quantiles of the underlying distribution leads us to consider other definitions for the empirical quantile, or sample quantile.

For a given sample from a continuous distribution, the intervals between successive order statistics are unbiased estimates of intervals of equal probability. For quantile-levels between 1/n and 1 − 1/n, this leads to estimates of quantiles that are between two specific order statistics; that is, for α between 1/n and 1 − 1/n, an unbiased estimate of the α quantile would be between xi:n and xi+1:n, where i/n ≤ α ≤ (i + 1)/n. For α between 1/n and 1 − 1/n, we therefore define the empirical quantile as a convex linear combination of two successive order statistics:

qi = λxi:n + (1 − λ)xi+1:n, (1.20)

for 0 ≤ λ ≤ 1. We define the quantile-level of such a quantile as

pi = (i − ι) / (n + ν) (1.21)

for some ι ∈ [0, 1/2] and some ν ∈ [0, 1]. This expression is based on a linear interpolation of the ECDF between the order statistics. The values of ι and ν determine the value of λ in equation (1.20).

We would like to choose values of ι and ν in expression (1.21) that make the empirical quantiles of a random sample correspond closely to those of the population. The best choices depend on the distribution of the population, but of course, those are generally unknown. A certain symmetry may be imposed by requiring ν = 1 − 2ι. For a discrete distribution, it is generally best to take ι = ν = 0. A common pair of choices in continuous distributions is ι = 1/2 and ν = 0. Another reasonable pair of choices is ι = 3/8 and ν = 1/4. This corresponds to values that match quantiles of a normal distribution.

Hyndman and Fan (1996) identify nine different versions of the expression (1.21) that are used in statistical computer packages. In some cases, the software only uses one version; in other cases, the user is given a choice.

The R function quantile allows the user to choose among nine types, and for the chosen type returns the empirical quantiles and the quantile-levels. The R function ppoints generates values of pi as in equation (1.21). For n less than 11, ppoints uses the values ι = 3/8 and ν = 1/4, and for larger values of n, it uses ι = 1/2 and ν = 0.
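
A brief R illustration, with a hypothetical sample y of size 20:

set.seed(3)
y <- rnorm(20)
quantile(y, probs = c(0.25, 0.5, 0.75), type = 7)   # one of the nine types
ppoints(20)   # quantile-levels p_i as in (1.21); here n > 10, so iota = 1/2 and nu = 0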


Empirical quantiles are used in Monte Carlo inference, in nonparametric inference, and in graphical displays for comparing a sample with a standard distribution or with another sample.

Empirical quantiles can also be used as estimators of the population quantiles, but there are other estimators for quantiles of a continuous distribution. Some, such as the Kaigh-Lachenbruch estimator and the Harrell-Davis estimator, use a weighted combination of multiple data points instead of just a single one, as in the simple estimators above. (See Kaigh and Lachenbruch (1982) and Harrell and Davis (1982), and also see Exercise 1.10.)

If a covariate is available, it may be possible to use it to improve the quantile estimate. This is often the case in simulation studies, where the covariate is called a control variable.

1.3.3 q-q Plots

The quantile-quantile plot or q-q plot is a useful graphical display for comparing two distributions, two samples, or a sample and a given theoretical distribution. The most common use is for comparing an observed sample with a given theoretical or reference distribution. Either order statistics of the sample or sample quantiles are plotted along one axis and either expected values of order statistics or quantiles of the reference distribution corresponding to appropriate quantile-levels are plotted along the other axis. The choices have to do with the appropriate definitions of empirical quantiles, as in equation (1.20), or of population quantiles corresponding to quantile-levels, as in equation (1.21). The sample points may be on the vertical axis and the population points on the horizontal axis, or they can be plotted in the other way.

The extent to which the q-q scatterplot fails to lie along a straight line provides a visual assessment of whether the points plotted along the two axes are from distributions or samples whose quantiles are linearly related. The points will fall along a straight line if the two distributions are in the same location-scale family; that is, if the two underlying random variables have a relationship of the form Y = aX + b, where a ≠ 0.

One of the most common reference distributions, of course, is the normal. The normal family of distributions is a location-scale family, so a q-q plot for a normal reference distribution does not depend on the mean or variance. The R function qqnorm plots order statistics from a given sample against normal quantiles. The R function qqplot creates more general q-q plots.

Figure 1.3 shows a q-q plot that compares a sample with two gamma reference distributions. The sample was generated from a gamma(10, 10) distribution, and the order statistics from that sample are plotted along the vertical axes. The reference distribution on the left-hand side in Figure 1.3 is a gamma(10, 1) distribution, which is in the same scale family as a gamma(10, 10); hence, the points fall close to a straight line. This plot was produced by the R statements


## q-q plot of the sample x against gamma(10,1) quantiles, with a least-squares line
plot(qgamma(ppoints(length(x)), 10), sort(x))
abline(lsfit(qgamma(ppoints(length(x)), 10), sort(x)))

Note that the “worst fit” is in the tails; that is, near the end points of the plot. This is characteristic of q-q plots and is a result of the greater skewness of the extreme order statistics.

[Figure 1.3. Quantile-Quantile Plots for Comparing the Sample to Gamma Distributions. Left panel: the sample x against Gamma(10,1) scores; right panel: the sample x against Gamma(1,1) scores.]

The reference distribution in the plot on the right-hand side in Figure 1.3 is a gamma(1, 1) distribution. The points do not seem to lie on a straight line. The extremes of the sample do not match the quantiles well at all. The pattern that we observe for the smaller observations (that is, that they are below a straight line that fits most of the data) is characteristic of data with a heavier left tail than the reference distribution to which it is being compared. Conversely, the larger observations, being below the straight line, indicate that the data have a lighter right tail than the reference distribution.

An important property of the q-q plot is that its shape is independent of the location and the scale of the data. In the plot on the left side in Figure 1.3, the sample is from a gamma distribution with a scale parameter of 10, but the distribution quantiles are from a population with a scale parameter of 1.

For a random sample from the distribution against whose quantiles it is plotted, the points generally deviate most from a straight line in the tails. This is because of the larger variability of the extreme order statistics. Also, because the distributions of the extreme statistics are skewed, the deviation from a straight line is in a specific direction (toward lighter tails) more than half of the time (see Exercise 1.11, page 60).


A q-q plot is an informal visual goodness-of-fit test. (There is no “significance level”.) The sup absolute difference between the ECDF and the reference CDF is the Kolmogorov distance, which is the basis for the Kolmogorov test (and the Kolmogorov-Smirnov test) for distributions. The Kolmogorov distance does poorly in measuring differences in the tails of the distribution. A q-q plot, on the other hand, generally is very good in revealing differences in the tails.

A q-q plot is a useful application of the ECDF. As I have mentioned, the ECDF is not so meaningful for multivariate distributions. Plots based on the ECDF of a multivariate dataset are generally difficult to interpret.
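
A minimal sketch of the two R functions mentioned above, using an assumed exponential sample:

set.seed(4)
y <- rexp(100)                               # a skewed sample
qqnorm(y); qqline(y)                         # q-q plot against normal quantiles
qqplot(qexp(ppoints(length(y))), sort(y))    # q-q plot against exponential(1) quantiles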

1.4 The Role of Optimization in Inference

Important classes of estimators are defined as the points at which some function that involves the parameter and the random variable achieves an optimum. There are, of course, many functions that involve the parameter and the random variable. One example is the probability density, and taking the value of the parameter that maximizes the joint probability density is the method of “maximum likelihood”, which we will discuss in Section 1.4.4, beginning on page 41.

In the general setup of the problem, we have a function s(x), called the objective function, and we wish to find the value of x, called the decision variable, that yields the minimum value of s. We write the problem as

min_x s(x). (1.22)

An optimization problem may be ill-posed in the sense that there is no solution; for example, min_x x is ill-posed.

“Optimization” means either maximization or minimization, but we often generically discuss the problem in terms of minimization. If we are interested in the maximum rather than the minimum, we merely consider −s(x) rather than s(x). In statistical applications, the nature of the function determines the meaning of “optimized”; if the function is the probability density, for example, “optimized” would logically mean “maximized”. (This leads to maximum likelihood estimation, as mentioned above.)

Constrained Optimization Problems

A common problem arising in applications is to find a point at which a function s(x) achieves a maximum or minimum value. This is the general statement of an optimization problem. Sometimes the problem is constrained:

min_x s(x) (1.23)
s.t. g(x) ≤ b.


Here we assume s is a scalar function and x and b are vectors. The function s is the objective function, the variable x is the decision variable, and the inequalities g(x) ≤ b are called the constraints. Any value of the decision variable that satisfies the constraints is called a feasible point. We also write the constrained optimization problem in the form

min_{x∈X} s(x),

where X is the set of feasible points.

A constrained optimization problem may be ill-posed in the sense that there is no solution; for example, if IR+ represents the set of positive real numbers, the problem

min_{x∈IR+} x

is ill-posed. The corresponding problem over the set of nonnegative real numbers, however, is well-posed.

There are many variations of optimization problems, most of which can be expressed in the form of problem (1.23). If some constraint is an equality, gi(x) = bi, we form two inequalities in problem (1.23): gi(x) ≤ bi and −gi(x) ≤ −bi. If there are no constraints, that is, if we have an unconstrained optimization problem as in expression (1.22), we remove g(x) ≤ b from the problem statement.

Optimization for Statistical Inference

The objective function in statistical inference may be some measure of the departure of observations from their predicted values, based on a model, or it may be the likelihood of observing those particular values of the observable random variable. In optimization for statistical inference, once an objective function is chosen, observations on the random variable are taken and are then considered to be fixed; the parameter in the function is the decision variable. The function is then optimized with respect to the parameter variable.

In discussing the use of optimization in statistical estimation, we must be careful to distinguish between a symbol that represents a fixed parameter and a symbol that represents a “variable” parameter. When we denote a probability density function as p(y | θ), we generally expect “θ” to represent a fixed, but possibly unknown, parameter. A family of probability models specifies a parameter space, say Θ, that determines the possible values of θ in that model.

In an estimation method that involves optimizing some function, “θ” is often used as a variable placeholder. I prefer to use some other symbol for a variable quantity that serves in the place of the parameter; therefore, in the following discussion, I will sometimes use “t” in place of “θ” when I want to treat it as a variable. For a family of probability models with parameter space Θ, we define a set T ⊇ Θ and require t ∈ T. (T is often the closure of the parameter space, Θ.) For other common symbols of parameters, such as β, when I want to treat them as variables in an optimization algorithm, I use corresponding Latin letters, such as b.

In an iterative algorithm, I use t(k) to represent a fixed value in the kth iteration.

Some Comments on Optimization for Statistical Inference

The solution to an optimization problem is in some sense “best” for that particular problem and its objective function; this may mean, however, that it is considerably less good for some other optimization problem. It is often the case, therefore, that an optimal solution is not robust to assumptions about the phenomenon being studied.

Optimization, because it is performed in the context of a set of assumptions, is likely to magnify the effects of the assumptions.

1.4.1 Computational Methods for Optimization

In some cases, the optimization problem can be solved in closed form, but in most cases the problem must be solved iteratively. An iterative solution follows a sequence of values of the decision variable x(0), x(1), . . . , x(k), x(k+1), . . .. An iterative solution method requires

0. a starting point x(0);
1. a method for deciding whether to stop and accept the current iterate;
2. given an iterate x(k), a method for generating a possible new point x(k+1);
3. a method for deciding whether the possible new point should be accepted as the next iterate x(k+1).

In step 1 the decision is usually based on the difference between x(k) and x(k−1) or between s(x(k)) and s(x(k−1)).

The main differences in computational methods for optimization occur in steps 2 and 3, and often depend on the properties of s(x) and g(x).

In step 2, the choice of the next point to investigate depends on the nature of the space of feasible points. An important distinction is whether or not the space is dense, in which case the problem is called continuous optimization. In continuous optimization, the decision variable is effectively a real number or a real vector. For countable feasible spaces, the problem is called discrete optimization or combinatorial optimization. In combinatorial optimization, the decision variable, instead of a real number, may be a more complicated mathematical object, such as a permutation. In that case, we often use the term “state”, rather than “variable” or “point”.


Also in step 2, for constrained optimization problems, some methods may allow consideration of a point x(k+1) that does not satisfy the constraints, so long as the objective function is defined at the point. Ultimately, of course, the solution sequence must return to the feasible points. Some methods may make different computational iterations depending on how close the point is to the boundary of the feasible space, that is, depending on the quantity ‖g(x(k)) − b‖.

In step 3, the obvious decision criterion is whether or not

s(x(k+1)) < s(x(k)). (1.24)

If the inequality holds, then x(k+1) is better than x(k), and so it should be taken as the value of x(k+1). Sometimes, however, even if inequality (1.24) is not satisfied, x(k+1) may be accepted as the next point. This is done by computing an appropriate probability for accepting x(k+1) and then generating a 0-1 value based on that probability. Methods that do this are called stochastic optimization algorithms. The “appropriate probability” depends on the algorithm, of course, but it is usually some function of

Δs = |s(x(k+1)) − s(x(k))|. (1.25)
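
The text does not prescribe a particular rule; the following R sketch shows one possible acceptance rule of this kind, in the style of simulated annealing. The function name accept_candidate and the cooling schedule are assumptions for illustration.

accept_candidate <- function(s_old, s_new, k, temp0 = 1) {
  if (s_new < s_old) return(TRUE)     # inequality (1.24) holds; always accept
  delta_s <- abs(s_new - s_old)       # as in expression (1.25)
  prob <- exp(-delta_s * k / temp0)   # acceptance probability shrinks with delta_s and with k
  runif(1) < prob                     # generate the 0-1 decision
}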

Other variations on the basic computational steps involve conditional iterations of steps 2 and 3 for subsets of the decision variables. If x1 and x2 represent subvectors of the vector of decision variables, it may be more efficient to update x1(k) to get x1(k+1) for a fixed value x2(k), and then update x2(k) to get x2(k+1) for the fixed value x1(k+1). This results in a zigzag path toward the optimum, but each step generally requires fewer computations. This idea, of course, could be applied to more than two subvectors of the decision vector. The subvectors can even overlap. Once these sub-iterations over steps 2 and 3 are completed, the method continues with step 1. This idea is exploited in a class of optimization algorithms that use the so-called conjugate gradient method.

Local and Global Optima

A function may have a local minimum, that is, a point at which the function is no larger than it is at any of the other points in a neighborhood of the given point. If the function at the given point is smaller than at all other points in the neighborhood, it is a strict local minimum. If the neighborhood contains all feasible points, then a point at which the function is no larger than at any of the other points is a global minimum. If the function at a global minimum is smaller than at all other feasible points, then it is the only global minimum; it is also called a strict global minimum or a unique minimum.

Unfortunately, there is no reliable way of determining whether or not a local minimum is the global minimum. Of course, if we know something of the properties of the objective function, we may know that some local minimum is a global minimum. Sometimes, all we seek in an optimization problem is a local optimum, but if we want a global optimum, we may try to find as many different local optima as is reasonably possible, and then choose the optimum one among those.

Handling Constraints

Constraints can be handled in different ways. One obvious approach is just to ignore the constraints and see if the unconstrained solution satisfies the constraints. This may be much quicker than addressing the constraints at each iteration, and in the happy case that the constraints are satisfied, the problem is solved. If the constraints are not satisfied, there may be some ad hoc ways of adjusting the solution so as to satisfy the constraints but remain optimal within the constraint space.

Another obvious way of addressing the constraints is to make sure that the candidate point x(k+1) in step 2 satisfies the constraints. In methods that do this, it is sometimes useful to identify the constraints that are “active”; that is, for which gi(x(k)) = bi, because the potential values for x(k+1) are more limited by gi. Sometimes remaining within the constraint space on all iterations overly restricts the speed of the method. A compromise may be to add some positive function of the amount of the constraint violations, ‖b − g(x)‖, to the objective function.

Newton’s Method and Variations

If the objective function s is differentiable, the problem is one of continuous optimization and basic principles of calculus may be used to find an optimum.

If s is twice differentiable, one algorithm is Newton’s method, in which the minimizing value of x is obtained as a limit of the iterates

x(k) = x(k−1) − (Hs(x(k−1)))^{−1} ∇s(x(k−1)), (1.26)

where Hs(x) denotes the Hessian of s and ∇s(x) denotes the gradient of s, both evaluated at x. (Newton’s method is sometimes called the Newton-Raphson method.)

For various computational considerations, instead of the exact Hessian, a matrix $\tilde{H}_s$ approximating the Hessian is often used. In this case, the technique is called a quasi-Newton method.

The direction of steepest descent from the current point is the negative of the gradient, $-\nabla s(x)$, at that point. Newton's method chooses a direction of descent
$$d = x^{(k)} - x^{(k-1)}$$
by the solution of equations of the form $H_s(x)\, d = -\nabla s(x)$, that is,
$$d = -\left( H_s(x) \right)^{-1} \nabla s(x), \qquad (1.27)$$
similar to the expression in equation (1.26). These equations adjust a straight gradient descent in the direction of $-\nabla s(x)$ by the inverse of the Hessian.

The distance to move in that direction, which by the solution of the equations is
$$\left\| x^{(k)} - x^{(k-1)} \right\|,$$
may be too great, so we may damp it by multiplying it by a quantity between 0 and 1. This damping factor, call it $\alpha$, is a tuning parameter of the method. The new point for the next iteration is taken as the old $x$ plus $\alpha d$, that is, $x^{(k)} = x^{(k-1)} + \alpha d$.
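A minimal sketch of a damped Newton iteration, assuming user-supplied gradient and Hessian functions (all names and defaults are illustrative):

import numpy as np

def damped_newton(grad, hess, x0, alpha=0.5, tol=1e-8, max_iter=100):
    # Solve H_s(x) d = -grad s(x) for the Newton direction (equation (1.27)),
    # then move to x + alpha*d, where alpha in (0, 1] is the damping factor.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = np.linalg.solve(hess(x), -grad(x))
        x_new = x + alpha * d
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

A quasi-Newton method would replace hess(x) by a cheaper approximation updated from gradient differences.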

Figure 1.4 shows one step from the point $x^{(k)}$ to $x_{\mathrm{N}}^{(k+1)}$ using Newton's method and to $x_{\mathrm{sd}}^{(k+1)}$ using a straight steepest descent. Newton's direction is closer to the direction of the true minimum of the function.

Figure 1.4. A Steepest Descent Step and a Newton Step

Nelder-Mead Method and Variations

The Nelder-Mead simplex method is a direct search method for continuous optimization that does not use derivatives. The steps are chosen so as to ensure a local descent, but neither the gradient nor an approximation to it is used. In this method, to find the minimum of a function, $s$, of $m$ variables, a set of $m + 1$ extreme points (a simplex) is chosen to start with, and iterations proceed by replacing the point that has the largest value of the function with a point that has a smaller value. The new point is chosen by reflecting the point with the largest value through the opposite face of the simplex. This yields a new simplex and the procedure continues.

Figure 1.5 shows one step in the Nelder-Mead method, for $m = 2$. The three points $x_1^{(k)}$, $x_2^{(k)}$, and $x_3^{(k)}$ form the simplex (in this case a triangle) at the $k$th step. The value at $x_3^{(k)}$ is largest, so that point is reflected through the line segment formed by points $x_1^{(k)}$ and $x_2^{(k)}$ (the opposite face of the simplex), and the next simplex consists of those two points and the point $x_r$.

Figure 1.5. One Nelder-Mead Iteration

The most obvious tuning parameter is how far to reflect the point with the largest value; for example, in the context of Figure 1.5, how far to extend the line segment formed by points $x_3^{(k)}$ and $x_m$. There are other tuning parameters that affect the choice of $x_m$ (in the example, $x_m$ is the midpoint of the line segment) and other variations on the basic method.

Although the Nelder-Mead algorithm may be slow to converge, it is a very useful method for several reasons. The computations in any iteration of the algorithm are not extensive. No derivatives are needed; in fact, not even the function values themselves are needed, only their relative values. The method is therefore well-suited to noisy functions, that is, functions that cannot be evaluated exactly.
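As an illustration of how the method might be invoked in practice (assuming SciPy is available; the test function and tolerances are arbitrary choices, not from the text):

import numpy as np
from scipy.optimize import minimize

def s(x):
    # an arbitrary smooth test function of m = 2 variables
    return (x[0] - 1.0)**2 + 5.0 * (x[1] + 2.0)**2

result = minimize(s, x0=np.array([0.0, 0.0]), method="Nelder-Mead",
                  options={"xatol": 1e-6, "fatol": 1e-6})
print(result.x)   # approximately (1, -2)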


There have been many suggestions for improving the Nelder-Mead method. Most have concentrated on the stopping criteria or the tuning parameters. The various tuning parameters allow considerable flexibility, but there are no good general guidelines for their selection.

Simulated Annealing

Simulated annealing is a stochastic method that simulates the thermodynamic process in which a metal is heated to its melting temperature and then is allowed to cool slowly so that its structure is frozen at the crystal configuration of lowest energy. In this process, the molecules reach a state of minimum energy (ideally), but the molecules go through continuous rearrangements, moving toward a lower energy level as they gradually lose mobility due to the cooling. The rearrangements do not result in a monotonic decrease in energy. This is analogous to a sequence of optimization iterations that occasionally go uphill. If the function has local minima, going uphill occasionally is desirable, and this is one of the main advantages of the simulated annealing method of optimization.

The decision variable in this kind of optimization problem is the state of the system. The state, of course, can be characterized by a vector $x$, and so the annealing process is the same as finding the vector that minimizes a function $s(x)$, just as in problem (1.23).

In simulated annealing, the decision in step 3 of the general method outlined on page 23 depends on $\Delta s$ in equation (1.25), and on a tuning parameter, $T^{(k)}$, called a "temperature" parameter. When the temperature is high, that is, $T^{(k)}$ is relatively large, the probability of acceptance of any given point is high, and the process corresponds to a pure random walk. When the temperature is low, however, the probability of accepting any given point is low and at some point, only downhill points are accepted. An "annealing schedule" determines how the tuning parameter is adjusted.

Algorithm 1.1 is a general formulation of simulated annealing that follows the general optimization procedure described on page 23. There are two functions that are required, one to compute and update the tuning parameter $T^{(k)}$, and one to compute the probability of accepting an "uphill" point. I discuss those briefly in later sections.

Algorithm 1.1 Simulated Annealing

0. Set $k = 0$ and initialize state $x^{(0)}$.
1. Compute $T^{(k)}$.
2. Set $i = 0$ and $j = 0$.
3. Generate state $x^{(k+1)}$ and compute $\Delta s = s(x^{(k+1)}) - s(x^{(k)})$.
4. Based on $\Delta s$, decide whether to move from state $x^{(k)}$ to state $x^{(k+1)}$:
   if $\Delta s \leq 0$, accept;
   otherwise, accept with a probability $P(\Delta s, T^{(k)})$.
   If state $x^{(k+1)}$ is accepted, set $i = i + 1$.
5. If $i$ is equal to the limit for the number of successes at a given value $T^{(k)}$, go to step 1.
6. Set $j = j + 1$. If $j$ is less than the limit for the number of iterations at a given temperature, go to step 3.
7. If $i = 0$, deliver $s$ as the optimum; otherwise, if $k < k_{\max}$, set $k = k + 1$ and go to step 1; otherwise, issue the message that the algorithm did not converge in $k_{\max}$ iterations.

For optimization of a continuous function over a region, the state is a point in that region. A new state or point may be selected by choosing a radius $r$ and a point on the $d$-dimensional sphere of radius $r$ centered at the previous point. For a continuous objective function, the movement in step 3 of Algorithm 1.1 may be a random direction to step in the domain of the objective function. In combinatorial optimization, the selection of a new state in step 3 may be a random rearrangement of a given configuration.

Parameters of the Simulated Annealing Algorithm

There are a number of tuning parameters to choose in the simulated annealing algorithm. These include such relatively simple things as the number of repetitions or when to adjust the temperature. The probability of acceptance and the type of temperature adjustments present more complicated choices.

One approach is to assume that at a given temperature, $T^{(k)}$, the states have a known probability density (or set of probabilities, if the set of states is countable), $p_S(x, T)$, and then to define an acceptance probability to move from state $x^{(k)}$ to $x^{(k+1)}$ in terms of the relative change in the probability density from $p_S(x^{(k)}, T)$ to $p_S(x^{(k+1)}, T)$. In the original application of Metropolis et al., the objective function was the energy of a given configuration, and the probability of an energy change of $\Delta s$ at temperature $T$ is proportional to $\exp(-\Delta s / T)$.

Even when there is no underlying probability model, the probability in step 4 of Algorithm 1.1 is often taken as
$$P(\Delta s, T^{(k)}) = e^{-\Delta s / T^{(k)}}, \qquad (1.28)$$
although a completely different form could be used. The exponential distribution models energy changes in ensembles of molecules, but otherwise it has no intrinsic relationship to a given optimization problem.
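A minimal sketch of Algorithm 1.1 for a continuous objective function, using the acceptance probability (1.28) and a geometric cooling schedule of the form (1.29) with a constant factor (the step-generation rule, defaults, and names are illustrative assumptions, not prescriptions from the text):

import numpy as np

def simulated_annealing(s, x0, temp0=1.0, b=0.95, radius=0.1,
                        n_inner=50, k_max=200, rng=None):
    # Random-walk minimization of s; uphill moves are accepted with
    # probability exp(-delta_s / T), as in equation (1.28).
    rng = np.random.default_rng() if rng is None else rng
    x, temp = np.asarray(x0, dtype=float), temp0
    best_x, best_s = x.copy(), s(x)
    for _ in range(k_max):
        for _ in range(n_inner):
            step = rng.normal(size=x.size)
            cand = x + radius * step / np.linalg.norm(step)            # step 3
            delta = s(cand) - s(x)
            if delta <= 0 or rng.random() < np.exp(-delta / temp):     # step 4
                x = cand
                if s(x) < best_s:
                    best_x, best_s = x.copy(), s(x)
        temp *= b          # cooling, T^(k+1) = b T^(k), equation (1.29)
    return best_x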


The probability can be tuned in the early stages of the computations so that some reasonable proportion of uphill steps are taken.

There are various ways the temperature can be updated in step 1. The probability of the method converging to the global optimum depends on a slow decrease in the temperature. In practice, the temperature is generally decreased by some proportion of its current value:
$$T^{(k+1)} = b_k T^{(k)}. \qquad (1.29)$$
We would like to decrease $T$ as rapidly as possible, yet have a high probability of determining the global optimum. Geman and Geman (1984) showed that under the assumptions that the energy distribution is Gaussian and the acceptance probability is of the form (1.28), the probability of convergence goes to 1 if the temperature decreases as the inverse of the logarithm of the time, that is, if $b_k = (\log(k))^{-1}$ in equation (1.29). Under the assumption that the energy distribution is Cauchy, a similar argument allows $b_k = k^{-1}$, and a uniform distribution over bounded regions allows $b_k = \exp(-c_k k^{1/d})$, where $c_k$ is some constant, and $d$ is the number of dimensions.

A constant temperature is often used in simulated annealing for optimization of continuous functions. The additive and multiplicative adjustments, $c_k$ and $b_k$, are usually taken as constants, rather than varying with $k$.

In some cases it may be desirable to exercise more control over the random walk that forms the basis of simulated annealing. For example, we may keep a list of "good" points, perhaps the $m$ best points found so far. After some iterations, we may return to one or more of the good states and begin the walk anew.

When gradient information is available, even in a limited form, simulated annealing is generally not as efficient as other methods that use that information. The main advantages of simulated annealing include its simplicity, its ability to move away from local optima, and the wide range of problems to which it can be applied.

Simulated annealing proceeds as a random walk through the domain of the objective function.

Simulated annealing has been successfully used in a range of optimization problems, including probability density smoothing, classification, construction of minimum volume ellipsoids, and optimal design.

Optimization of Combinations and Permutations

The traveling salesperson problem can serve as a prototype of discrete optimization problems. The domain of this problem is a set of permutations; nevertheless, the problems are sometimes called "combinatorial" optimization problems. In the traveling salesperson problem, a state is an ordered list of points ("cities") and the objective function is the total distance between all the points in the order given (plus the return distance from the last point to the first point). (A closed path that visits each of the vertices of a graph exactly once is a Hamiltonian closed path or circuit. A connected but incomplete graph may not contain a Hamiltonian circuit, and in most statements of the traveling salesperson problem the solution does not depend on the existence of a Hamiltonian circuit; that is, the salesperson may be allowed to visit a given city more than once.)

Given a state $x^{(k)}$, the basic task, as in step 2 of the general outline on page 23, is to generate a possible new state $x^{(k+1)}$. One simple rearrangement of the list is the reversal of a sublist, that is, for example,
$$(1, 2, 3, 4, 5, 6, 7, 8, 9) \rightarrow (1, 6, 5, 4, 3, 2, 7, 8, 9).$$
Another simple rearrangement is the movement of a sublist to some other point in the list, for example,
$$(1, 2, 3, 4, 5, 6, 7, 8, 9) \rightarrow (1, 7, 8, 2, 3, 4, 5, 6, 9).$$
Both of these rearrangements are called "2-changes", because in the graph defining the salesperson's circuit, exactly two edges are replaced by two others.
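A sketch of these two rearrangements for a tour stored as a Python list, with the sublist positions chosen at random (names and the random selection scheme are purely illustrative):

import random

def reverse_sublist(tour):
    # reversal of a sublist, e.g. (1,2,3,4,5,6,7,8,9) -> (1,6,5,4,3,2,7,8,9)
    i, j = sorted(random.sample(range(len(tour)), 2))
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

def move_sublist(tour):
    # movement of a sublist to some other point in the list
    i, j = sorted(random.sample(range(len(tour)), 2))
    sub, rest = tour[i:j + 1], tour[:i] + tour[j + 1:]
    k = random.randrange(len(rest) + 1)
    return rest[:k] + sub + rest[k:]

Either function can serve as the state-generation rule in step 3 of Algorithm 1.1 when simulated annealing is applied to the traveling salesperson problem.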

The traveling salesperson problem is a good example to illustrate the simulated annealing method, but there are more efficient methods based on cutting planes in linear programming algorithms. We will not discuss linear programming algorithms here.

Evolutionary Algorithms

There are many variations of methods that use evolutionary strategies. These methods are inspired by biological evolution, and often use terminology from biology. Genetic algorithms mimic the behavior of organisms in a competitive environment in which only the fittest and their offspring survive. Decision variables correspond to "genotypes" or "chromosomes"; a point or a state is represented by a string (usually a bit string); and new values of the decision variables are produced from existing points by "crossover" or "mutation". The set of points at any stage constitutes a "population". The points that survive from one stage to another are those yielding lower values of the objective function.

The ideas of numerical optimization using processes similar to biological evolution are old.

1.4.2 Fitting Statistical Models by Optimization

A statistical model such as $Y = \beta_0 + \beta_1 x + E$ in equation (1.3) specifies a family of probability distributions of the data. The specific member of that family of probability distributions depends on the values of the parameters in the model, $\beta_0$, $\beta_1$, and $\sigma^2$ (see equation (1.4)). Simpler models, such as the statement "$X$ is distributed as N$(\mu, \sigma^2)$", also specify families of probability distributions. An important step in statistical inference is to use observed data to estimate the parameters in the model. This is called "fitting the model". The question is, how do we estimate the parameters? What properties should the estimators have?

In the following pages we discuss two general ways in which optimization is used to estimate parameters or to fit models. One is to minimize deviations of observed values from what a model would predict (think "least squares", as an example). This is an intuitive procedure that may be chosen without regard to the nature of the data-generating process. The justification for a particular form of the objective function, however, may arise from assumptions about a probability distribution underlying the data-generating process. Another way in which optimization is used in statistical inference is in maximizing the "likelihood", which we will discuss more precisely in Section 1.4.4, beginning on page 41. The correct likelihood function depends on the probability distribution underlying the data-generating process, which, of course, is not known and can only be assumed. How poor the maximum likelihood estimator is depends on both the true distribution and the assumed distribution.

In the discussion below, we briefly describe particular optimization techniques that assume that the objective function is a continuous function of the decision variables, or the parameters. We also assume that there are no a priori constraints on the values of the parameters. Techniques appropriate for other situations, such as for discrete optimization and constrained optimization, are available in the general literature on optimization.

We must also realize that mathematical expressions below do not necessarily imply computational methods. There are many additional considerations for the numerical computations. A standard example of this point is in the solution of the linear full-rank system of $n$ equations in $n$ unknowns: $Ax = b$. While we may write the solution as $x = A^{-1} b$, we would almost never compute the solution by forming the inverse and then multiplying $b$ by it (see Gentle (2007), Chapter 6).
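In NumPy, for instance, the point is the difference between the following two lines (a small illustration; the matrix and right-hand side are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
b = rng.standard_normal(5)

x = np.linalg.solve(A, b)        # preferred: solve the system directly
x_bad = np.linalg.inv(A) @ b     # same result in exact arithmetic, but needlessly
                                 # forms the inverse and is numerically poorer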

1.4.3 Estimation by Minimizing Residuals

In many applications, we can express the expected value of a random variable as a function of a parameter. For example, in the model $Y = \beta_0 + \beta_1 x + E$ in equation (1.3), we have
$$\mathrm{E}(Y) = \beta_0 + \beta_1 x, \qquad (1.30)$$
which involves an observable covariate. In general, we may write the expected value as
$$\mathrm{E}(Y) = f(x, \theta), \qquad (1.31)$$
where $f$ is some function, $x$ may be some observable covariates (possibly a vector), and $\theta$ is some unobservable parameter (possibly a vector). (The more difficult and interesting problems, of course, involve the determination of the form of the function $f$, but for the time being, we concentrate on the simpler problem of determining an appropriate value of $\theta$, assuming that the form of the function $f$ is known.)

Assuming that we have observations $y_1, \ldots, y_n$ on $Y$ (and observations on the covariates if there are any), we ask what value of $\theta$ would make the model fit the data best. There are various ways of approaching this question, but a reasonable first step would be to look at the differences in the observations and their expected (or "predicted") values:
$$y_i - f(x_i, \theta).$$
For any particular value of $\theta$, say $t$, we have the residuals,
$$r_i(t) = y_i - f(x_i, t). \qquad (1.32)$$
A reasonable estimator of $\theta$ would be the value of $t$ that minimizes some norm of $r(t)$, the $n$-vector of residuals $r_i(t)$.

Notice that I am using $t$ in place of $\theta$. The logical reason for doing this is that $\theta$ itself is some unknown constant. I want to work with a variable whose value I can choose according to my own criterion. (While you should understand this reasoning, I will often use the same symbol for the variable that I use for the constant unknown parameter. This sloppiness is common in the statistical literature.)

We have now formulated our estimation problem as an optimization problem:
$$\min_t \; \| r(t) \|, \qquad (1.33)$$
where $\|\cdot\|$ represents a vector norm. We often choose the norm as the $L_p$ norm, which we sometimes represent as $\|\cdot\|_p$. For the $n$-vector $v$, the $L_p$ vector norm is
$$\left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p}.$$
I will discuss vector norms more fully in Chapter 5, beginning on page 149. For the $L_p$ norm, we minimize a function of an $L_p$ norm of the residuals,
$$s_p(t) = \sum_{i=1}^{n} \left| y_i - f(x_i, t) \right|^p, \qquad (1.34)$$
for some $p \geq 1$, to obtain an $L_p$ estimator. Simple choices are the sum of the absolute values and the sum of the squares. The latter choice yields the least squares estimator.

More generally, we could minimize

$$s_\rho(t) = \sum_{i=1}^{n} \rho\left( y_i - f(x_i, t) \right) \qquad (1.35)$$
for some nonnegative function $\rho(\cdot)$ to obtain an "M estimator". (The name comes from the similarity of this objective function to the objective function for some maximum likelihood estimators.)
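A sketch of an $L_p$ estimator computed with a general-purpose optimizer (the mean function, data, and starting values are invented for illustration; Nelder-Mead is used because for $p = 1$ the objective is not differentiable everywhere):

import numpy as np
from scipy.optimize import minimize

def lp_estimate(f, x, y, t0, p=1.0):
    # minimize s_p(t) = sum_i |y_i - f(x_i, t)|^p, as in equation (1.34)
    def s_p(t):
        return np.sum(np.abs(y - f(x, t)) ** p)
    return minimize(s_p, t0, method="Nelder-Mead").x

# example: a least-absolute-deviations (p = 1) fit of a straight line
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=x.size)
theta_hat = lp_estimate(lambda x, t: t[0] + t[1] * x, x, y, t0=[0.0, 0.0], p=1.0)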

Constrained Minimization of Residuals

Some models have restrictions on the parameters. For example, we may know from first principles that a certain quantity $y$ increases if another quantity $x$ increases. Hence, if we use a simple linear model for the relationship of $y$ to $x$,
$$y = \beta_0 + \beta_1 x + \epsilon,$$
we may require that in any fitted model, the value for the slope, $\beta_1$, is nonnegative. To fit the model by minimizing the residuals formed from observed data, we have the constrained optimization problem,
$$\min_{b_0, b_1} \; \| r(b_0, b_1) \| \quad \text{s.t.} \quad b_1 \geq 0.$$
As indicated above, there are various ways of handling constraints in optimization problems. Often, in a problem of the type described here where the constraints arise from properties of the phenomenon itself, fitting an appropriate model by minimizing residuals yields estimates that satisfy the constraints. If the fitted values do not satisfy the constraints, there may be cause to reconsider the model, the way in which the data were collected, or other aspects of the problem.
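A sketch of this constrained fit using a bound-constrained nonlinear least squares routine (the synthetic data and starting values are only for illustration):

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 40)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)

def residuals(b):
    # r_i(b0, b1) = y_i - (b0 + b1 x_i)
    return y - (b[0] + b[1] * x)

# b0 unrestricted, b1 constrained to be nonnegative
fit = least_squares(residuals, x0=[0.0, 0.0],
                    bounds=([-np.inf, 0.0], [np.inf, np.inf]))
print(fit.x)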

Other Types of Residuals

Recall our original estimation problem. We have a model, say
$$Y = f(x, \theta) + E,$$
where $E$ is a random variable with unknown variance, say $\sigma^2$. We have formulated an optimization problem involving residuals formed from the decision variable $t$ for the estimation of $\theta$. What about estimation of $\sigma^2$? To use the same heuristic, that is, to form a residual involving $\sigma^2$, we would use the residuals that involve $t$:
$$g\left( \sum_{i=1}^{n} (r_i(t))^2 \right) - v^2, \qquad (1.36)$$
where $g$ is some function and where the variable $v^2$ is used in place of $\sigma^2$. This residual may look somewhat unfamiliar, and there is no obvious way of choosing $g$. In certain simple models, however, we can formulate the expression $g\left( \sum_{i=1}^{n} (r_i(t))^2 \right)$ in terms of the expected values of $Y - f(x, \theta)$. In the case of the model (1.3), $Y = \beta_0 + \beta_1 x + E$, this leads to the estimator $v^2 = \sum_{i=1}^{n} (r_i(t))^2 / (n - 2)$.

Formation of residuals in other estimation problems may not always be so straightforward. For example, consider the problem of estimation of the parameters $\alpha$ and $\beta$ in a beta$(\alpha, \beta)$ distribution. If the random variable $Y$ has this distribution, then $\mathrm{E}(Y) = \alpha/(\alpha + \beta)$. Given observations $y_1, \ldots, y_n$, we could form residuals similar to what we did above:
$$r_{1i}(a, b) = y_i - a/(a + b).$$
Minimizing a norm of this, however, does not yield unique individual values for $a$ and $b$.

We need other, mathematically independent, residuals. ("Mathematically independent" refers to unique solutions of equations, not statistical independence of random variables.) If the random variable $Y$ has the beta distribution as above, then we know
$$\mathrm{E}(Y^2) = \mathrm{V}(Y) + (\mathrm{E}(Y))^2 = \frac{\alpha^2(\alpha + \beta + 1) + \alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}.$$
We could now form a second set of residuals:
$$r_{2i}(a, b) = y_i^2 - \frac{a^2(a + b + 1) + ab}{(a + b)^2(a + b + 1)}.$$

We now form a minimization problem involving these two sets of residuals. There are, of course, various minimization problems that could be formulated. The most direct problem, analogous to (1.33), is
$$\min_{a,b} \; \left( \| r_1(a, b) \| + \| r_2(a, b) \| \right).$$
Another optimization problem is the sequential one:
$$\min_{a,b} \; \| r_1(a, b) \|, \qquad (1.37)$$
and then, subject to a conditional minimum,
$$\min_{a,b} \; \| r_2(a, b) \|. \qquad (1.38)$$
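A sketch of the most direct formulation for the beta example, keeping $a$ and $b$ positive with simple bounds (the optimizer, starting values, and data are illustrative choices):

import numpy as np
from scipy.optimize import minimize

def fit_beta(y):
    # minimize ||r1(a,b)|| + ||r2(a,b)|| over a > 0, b > 0
    def objective(ab):
        a, b = ab
        m1 = a / (a + b)                                                  # E(Y)
        m2 = (a**2 * (a + b + 1) + a * b) / ((a + b)**2 * (a + b + 1))    # E(Y^2)
        return np.linalg.norm(y - m1) + np.linalg.norm(y**2 - m2)
    return minimize(objective, x0=[1.0, 1.0], method="L-BFGS-B",
                    bounds=[(1e-6, None), (1e-6, None)]).x

y = np.random.default_rng(3).beta(2.0, 5.0, size=200)
print(fit_beta(y))      # should be roughly (2, 5)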

In Exercise 1.14a, you are asked to estimate the parameters in a gamma distribution using these ideas.

Computational Methods for Estimation by Minimizing Residuals

Standard techniques for optimization can be used to determine estimates that minimize various functions of the residuals, that is, for some appropriate function of the residuals $s(\cdot)$, to solve
$$\min_t \; s(t). \qquad (1.39)$$
Here, I am thinking of the model parameter $\theta$, and so I am using $t$ as a variable in place of $\theta$. I will denote the optimal value of $t$, according to the criteria embodied in the function $s(\cdot)$, as $\hat{\theta}$.

Except for special forms of the objective function, the algorithms to solve expression (1.39) are iterative. If $s$ is twice differentiable (and in practice, $s$ is usually chosen to be twice differentiable, at least piecewise), one algorithm is Newton's method (as in equation (1.26) on page 25), in which the minimizing value of $t$, $\hat{\theta}$, is obtained as a limit of the iterates
$$t^{(k)} = t^{(k-1)} - \left( H_s(t^{(k-1)}) \right)^{-1} \nabla s(t^{(k-1)}), \qquad (1.40)$$
where $H_s(t)$ denotes the Hessian of $s$ and $\nabla s(t)$ denotes the gradient of $s$, both evaluated at $t$. Instead of Newton's method, a quasi-Newton method is often used for computational efficiency.

There are various ways of deciding when an iterative optimization algorithm has converged. In general, convergence criteria are based on the size of the change in $t^{(k)}$ from $t^{(k-1)}$, or the size of the change in $s(t^{(k)})$ from $s(t^{(k-1)})$.

Statistical Properties of Minimum-Residual Estimators

It is generally difficult to determine the variance or other high-order statistical properties of an estimator defined as above (that is, defined as the minimizer of some function of the residuals). In many cases, all that is possible is to approximate the variance of the estimator in terms of some relationship that holds for a normal distribution. (In robust statistical methods, for example, it is common to see a "scale estimate" expressed in terms of some mysterious constant times a function of some transformation of the residuals.)

There are two issues that affect both the computational method and the statistical properties of the estimator defined as the solution to the optimization problem. One consideration has to do with the acceptable values of the parameter $\theta$. In order for the model to make sense, it may be necessary that the parameter be in some restricted range. In some models, a parameter must be positive, for example. In these cases, the optimization problem has constraints. Such a problem is more difficult to solve than an unconstrained problem. Statistical properties of the solution are also more difficult to determine. More extreme cases of restrictions on the parameter may require the parameter to take values in a countable set. Obviously, in such cases, Newton's method cannot be used because the derivatives cannot be defined. In those cases, a combinatorial optimization algorithm must be used instead. Other situations in which the function is not differentiable also present problems for the optimization algorithm. In such cases, if the domain is continuous, a descending sequence of simplexes can be used.

Secondly, it may turn out that the optimization problem (1.39) has local minima. This depends on the nature of the function $f(\cdot)$ in equation (1.31). Local minima present problems for the computation of the solution because the algorithm may get stuck in a local optimum. Local minima also present conceptual problems concerning the appropriateness of the estimation criterion itself. As long as there is a unique global optimum, it seems reasonable to seek it and to ignore local optima. It is not so clear what to do if there are multiple points at which the global optimum is attained.

Least Squares Estimation

Least squares estimators are generally more tractable than estimators based on other functions of the residuals. They are more tractable both in terms of solving the optimization problem to obtain the estimate and in approximating statistical properties of the estimators, such as their variances.

Consider the model in equation (1.31), $\mathrm{E}(Y) = f(x, \theta)$, and assume that $\theta$ is an $m$-vector and that $f(\cdot)$ is a smooth function in $\theta$. Letting $y$ be the $n$-vector of observations, we can write the least squares objective function corresponding to equation (1.34) as
$$s(t) = \left( r(t) \right)^{\mathrm{T}} r(t), \qquad (1.41)$$

where the superscript T indicates the transpose of a vector or matrix.

The gradient and the Hessian for a least squares problem have special structures that involve the Jacobian of the residuals, $J_r(t)$. The gradient of $s$ is
$$\nabla s(t) = 2 \left( J_r(t) \right)^{\mathrm{T}} r(t). \qquad (1.42)$$
Taking derivatives of $\nabla s(t)$, we see that the Hessian of $s$ can be written in terms of the Jacobian of $r$ and the individual residuals:
$$H_s(t) = 2 \left( J_r(t) \right)^{\mathrm{T}} J_r(t) + 2 \sum_{i=1}^{n} r_i(t) H_{r_i}(t). \qquad (1.43)$$
In the vicinity of the solution $\hat{\theta}$, the residuals $r_i(t)$ should be small, and $H_s(t)$ may be approximated by neglecting the second term:
$$H_s(t) \approx 2 \left( J_r(t) \right)^{\mathrm{T}} J_r(t).$$

Using equation (1.42) and this approximation for equation (1.43) in the adjusted gradient descent equation (1.27), we have the system of equations
$$\left( J_r(t^{(k-1)}) \right)^{\mathrm{T}} J_r(t^{(k-1)}) \, d^{(k)} = - \left( J_r(t^{(k-1)}) \right)^{\mathrm{T}} r(t^{(k-1)}) \qquad (1.44)$$
to be solved for $d^{(k)}$, where
$$d^{(k)} \propto t^{(k)} - t^{(k-1)}.$$

It is clear that the solution $d^{(k)}$ is a descent direction; that is, if $\nabla s(t^{(k-1)}) \neq 0$,
$$\left( d^{(k)} \right)^{\mathrm{T}} \nabla s(t^{(k-1)}) = - \left( J_r(t^{(k-1)}) \, d^{(k)} \right)^{\mathrm{T}} J_r(t^{(k-1)}) \, d^{(k)} < 0.$$

The update step is determined by a line search in the appropriate direction:
$$t^{(k)} - t^{(k-1)} = \alpha^{(k)} d^{(k)}.$$
This method is called the Gauss-Newton algorithm. (The method is also sometimes called the "modified Gauss-Newton algorithm" because many years ago no damping was used in the Gauss-Newton algorithm, and $\alpha^{(k)}$ was taken as the constant 1. Without an adjustment to the step, the Gauss-Newton method tends to overshoot the minimum in the direction $d^{(k)}$.) In practice, rather than a full search to determine the best value of $\alpha^{(k)}$, we just consider the sequence of values $1, \frac{1}{2}, \frac{1}{4}, \ldots$ and take the largest value so that $s(t^{(k)}) < s(t^{(k-1)})$. The algorithm terminates when the change is small.

If the residuals are not small or if $J_r(t^{(k)})$ is poorly conditioned, the Gauss-Newton method can perform very poorly. One possibility is to add a conditioning matrix to the coefficient matrix in equation (1.44). A simple choice is $\tau^{(k)} I_m$, and the equation for the update becomes
$$\left( \left( J_r(t^{(k-1)}) \right)^{\mathrm{T}} J_r(t^{(k-1)}) + \tau^{(k)} I_m \right) d^{(k)} = - \left( J_r(t^{(k-1)}) \right)^{\mathrm{T}} r(t^{(k-1)}),$$
where $I_m$ is the $m \times m$ identity matrix. A better choice may be an $m \times m$ scaling matrix, $S^{(k)}$, that takes into account the variability in the columns of $J_r(t^{(k-1)})$; hence, we have for the update

$$\left( \left( J_r(t^{(k-1)}) \right)^{\mathrm{T}} J_r(t^{(k-1)}) + \lambda^{(k)} \left( S^{(k)} \right)^{\mathrm{T}} S^{(k)} \right) d^{(k)} = - \left( J_r(t^{(k-1)}) \right)^{\mathrm{T}} r(t^{(k-1)}). \qquad (1.45)$$
The basic requirement for the matrix $\left( S^{(k)} \right)^{\mathrm{T}} S^{(k)}$ is that it improve the condition of the coefficient matrix. There are various ways of choosing this matrix. One is to transform the matrix $\left( J_r(t^{(k-1)}) \right)^{\mathrm{T}} J_r(t^{(k-1)})$ so that it has 1's along the diagonal (this is equivalent to forming a correlation matrix from a variance-covariance matrix), and to use the scaling vector to form $S^{(k)}$. The nonnegative factor $\lambda^{(k)}$ can be chosen to control the extent of the adjustment. The sequence $\lambda^{(k)}$ must go to 0 for the algorithm to converge.

Equation (1.45) can be thought of as a Lagrange multiplier formulation of the constrained problem,
$$\min_x \; \tfrac{1}{2} \left\| J_r(t^{(k-1)}) \, x + r(t^{(k-1)}) \right\| \quad \text{s.t.} \quad \left\| S^{(k)} x \right\| \leq \delta_k. \qquad (1.46)$$

The Lagrange multiplier $\lambda^{(k)}$ is zero if $d^{(k)}$ from equation (1.44) satisfies $\| d^{(k)} \| \leq \delta_k$; otherwise, it is chosen so that
$$\left\| S^{(k)} d^{(k)} \right\| = \delta_k.$$
Use of an adjustment such as in equation (1.45) is called the Levenberg-Marquardt algorithm. This is probably the most widely used method for nonlinear least squares.
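In practice these updates are rarely coded from scratch; for instance, SciPy's nonlinear least squares routine offers a Levenberg-Marquardt option along with trust-region variants. A sketch for an exponential-decay mean function (the model, data, and starting values are invented for illustration):

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(4)
x = np.linspace(0.0, 4.0, 60)
y = 2.5 * np.exp(-1.3 * x) + rng.normal(scale=0.05, size=x.size)

def residuals(t):
    # r_i(t) = y_i - f(x_i, t) with f(x, t) = t0 * exp(-t1 * x)
    return y - t[0] * np.exp(-t[1] * x)

fit = least_squares(residuals, x0=[1.0, 1.0], method="lm")
print(fit.x)     # approximately (2.5, 1.3)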

Variance of Least Squares Estimators

In a data-generating process following the simple model $\mathrm{E}(Y) = f(\theta)$, if $Y$ has finite second moment, then the sample mean $\bar{Y}$ is a consistent estimator (in mean square) of $f(\theta)$. Furthermore, if $Y$ has finite fourth moment, then the minimum residual norm $\left( r(\hat{\theta}) \right)^{\mathrm{T}} r(\hat{\theta})$ divided by $(n - m)$ is a consistent estimator of the variance of $Y$, say $\sigma^2$, that is,
$$\sigma^2 = \mathrm{E}(Y - f(\theta))^2.$$
We denote this estimator as $\widehat{\sigma^2}$:
$$\widehat{\sigma^2} = \left( r(\hat{\theta}) \right)^{\mathrm{T}} r(\hat{\theta}) / (n - m).$$
This estimator, strictly speaking, is not a least squares estimator of $\sigma^2$, but it is based on least squares estimators of the other parameters, and so we call it the least squares estimator.

The variance of the least squares estimator $\hat{\theta}$, however, is not easy to work out, except in special cases. In the simplest case, $f$ is linear and $Y$ has a normal distribution, and we have the familiar linear regression estimates of $\theta$ and of the variance of the estimator of $\theta$.

Without the linearity property, however, even with the assumption of normality, it may not be possible to write a simple expression for the variance-covariance matrix of an estimator that is defined as the solution to the least squares optimization problem. Using a linear approximation, however, we may estimate an approximate variance-covariance matrix for $\hat{\theta}$ as
$$\left( \left( J_r(\hat{\theta}) \right)^{\mathrm{T}} J_r(\hat{\theta}) \right)^{-1} \widehat{\sigma^2}. \qquad (1.47)$$
Compare this linear approximation to the expression for the estimated variance-covariance matrix of the least squares estimator $\hat{\beta}$ in the linear regression model $\mathrm{E}(Y) = X\beta$, in which $J_r(\beta)$ is just $X$. An estimate of $\sigma^2$ based on the least squares estimates of the parameters is the sum of the squared residuals divided by $n - m$, where $m$ is the number of estimated elements in $\theta$.
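Continuing the nonlinear least squares sketch above, the approximation (1.47) can be computed from the Jacobian returned by the fitting routine (a sketch; `fit` is assumed to be a scipy.optimize.least_squares result based on n observations and m parameters):

import numpy as np

def approx_vcov(fit, n, m):
    # (J_r^T J_r)^{-1} * sigma2_hat, with sigma2_hat = r^T r / (n - m)
    J = fit.jac                                   # Jacobian of the residuals
    sigma2_hat = fit.fun @ fit.fun / (n - m)      # sum of squared residuals / (n - m)
    return np.linalg.inv(J.T @ J) * sigma2_hat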

If the residuals are small, the Hessian is approximately equal to the crossproduct of the Jacobian, as we see from equation (1.43), so an alternate expression for the estimated variance-covariance matrix is
$$2 \left( H_s(\hat{\theta}) \right)^{-1} \widehat{\sigma^2}. \qquad (1.48)$$


This latter expression is more useful if Newton's method or a quasi-Newton method is used instead of the Gauss-Newton method for the solution of the least squares problem because in these methods the Hessian or an approximate Hessian is used in the computations.

Iteratively Reweighted Least Squares

Often in applications, the residuals in equation (1.32) are not given equal weight for estimating $\theta$. This may be because the reliability or precision of the observations may be different. For weighted least squares, instead of equation (1.41) we have the objective function
$$s_w(t) = \sum_{i=1}^{n} w_i \left( r_i(t) \right)^2. \qquad (1.49)$$
The weights add no complexity to the problem, and the Gauss-Newton methods of the previous section apply immediately, with
$$\tilde{r}(t) = W r(t), \qquad (1.50)$$
where $W$ is a diagonal matrix containing the weights.

The simplicity of the computations for weighted least squares suggests a more general usage of the method. Suppose that we are to minimize some other $L_p$ norm of the residuals $r_i$, as in equation (1.34). The objective function can be written as
$$s_p(t) = \sum_{i=1}^{n} \frac{1}{|y_i - f(x_i, t)|^{2-p}} \, |y_i - f(x_i, t)|^2. \qquad (1.51)$$

This leads to an iteration on the least squares solutions. Beginning with $y_i - f(x_i, t^{(0)}) = 1$, we form the recursion that results from the approximation
$$s_p(t^{(k)}) \approx \sum_{i=1}^{n} \frac{1}{\left| y_i - f(x_i, t^{(k-1)}) \right|^{2-p}} \, \left| y_i - f(x_i, t^{(k)}) \right|^2. \qquad (1.52)$$
Hence, we solve a weighted least squares problem, and then form a new weighted least squares problem using the residuals from the previous problem. This method is called iteratively reweighted least squares, or IRLS. The iterations over the residuals are outside the loops of iterations to solve the least squares problems, so in nonlinear least squares, IRLS results in nested iterations.

There are some problems with the use of reciprocals of powers of residuals as weights. The most obvious problem arises from very small residuals. This is usually handled by use of a fixed large number as the weight.
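A sketch of IRLS for an $L_1$ fit of a linear model, in which each weighted least squares problem has a closed-form solution; the safeguard against very small residuals appears here as a floor on the denominators (all names, defaults, and the safeguard itself are illustrative choices):

import numpy as np

def irls_l1(X, y, n_iter=50, eps=1e-8):
    # L1 (least absolute deviations) fit of y ~ X b by iteratively
    # reweighted least squares; weights are 1/|residual|, capped via eps.
    b = np.linalg.lstsq(X, y, rcond=None)[0]          # start from ordinary LS
    for _ in range(n_iter):
        r = y - X @ b
        w = 1.0 / np.maximum(np.abs(r), eps)          # |r_i|^(p-2) with p = 1
        W = np.diag(w)
        b = np.linalg.solve(X.T @ W @ X, X.T @ W @ y) # weighted LS solution
    return b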

Iteratively reweighted least squares can also be applied to other norms, that is, other than $L_p$ norms,
$$s_\rho(t) = \sum_{i=1}^{n} \rho\left( y_i - f(x_i, t) \right),$$

but the approximations for the updates may not be as good.

1.4.4 Estimation by Maximum Likelihood

One of the most commonly used approaches to statistical estimation is maximum likelihood. The concept has an intuitive appeal, and the estimators based on this approach have a number of desirable mathematical properties, at least for broad classes of distributions.

Given a sample $y_1, \ldots, y_n$ from a distribution with probability density or probability mass function $p(y \mid \theta)$, a reasonable estimate of $\theta$ is the value that maximizes the joint density or joint probability with variable $t$ at the observed sample value: $\prod_i p(y_i \mid t)$, so long as that maximizing value is within the parameter space of the model, or at least within the closure of that space. We define the likelihood function as a function of a variable in place of the parameter:
$$L_n(t \,;\, y) = \prod_{i=1}^{n} p(y_i \mid t), \quad \text{for } t \in \Theta, \qquad (1.53)$$

where $\Theta$ is the parameter space of the probability model.

Note the reversal in roles of variables and parameters. The likelihood function appears to represent a "posterior probability", but that is not an appropriate interpretation. $L_n$ is not even a PDF (in $t$), even though $\prod_{i=1}^{n} p(y_i \mid t)$ is a PDF (in the variables $y_i$).

Just as in the case of estimation by minimizing residuals, the more difficult and interesting problems involve the determination of the form of the function $p(y_i \mid \theta)$. In these sections, as above, however, we concentrate on the simpler problem of determining an appropriate value of $t$ as an estimator of $\theta$, assuming that the form of $p$ is known.

The value of $t$ for which $L$ attains its maximum value is the maximum likelihood estimate (MLE) of $\theta$ for the given data, $y$. The data (that is, the realizations of the variables in the density function) are considered as fixed, and the parameters are considered as variables of the optimization problem,
$$\max_{t \in \Theta} \; L_n(t \,;\, y). \qquad (1.54)$$

This optimization problem often is more difficult than the optimization problem (1.33) that results from an estimation approach based on minimization of some norm of a residual vector. As we discussed in that case, there can be both computational and statistical problems associated either with restrictions on the set of possible parameter values or with the existence of local optima of the objective function. These problems also occur in maximum likelihood estimation.


In practice the functions of residuals that are minimized are almost always differentiable, and the optimum occurs at a stationary point. This, however, is often not the case in maximum likelihood estimation. A standard example in which the MLE does not occur at a stationary point is a distribution in which the range depends on the parameter, and the simplest such distribution is the uniform U$(0, \theta)$. In this case, the MLE is the maximum order statistic.

In addition to problems due to constraints or due to local optima, other problems may arise if the likelihood function is unbounded. The conceptual difficulties resulting from an unbounded likelihood are much deeper. In practice, for computing estimates in the unbounded case, the general likelihood principle may be retained, and the optimization problem redefined to include a penalty that keeps the function bounded. Adding a penalty to form a bounded objective function in the optimization problem, or to dampen the solution, is called penalized maximum likelihood estimation.

For a broad class of distributions, the maximum likelihood criterion yields estimators with good statistical properties. A set of conditions that guarantee certain optimality properties is called the "regular case".

Maximum likelihood estimation is particularly straightforward for distributions in an exponential family, that is, with PDF of the form
$$p(y \mid \theta) = h(y) \exp\left( \theta^{\mathrm{T}} g(y) - a(\theta) \right).$$
(Here, if the PDF is zero over any region, then $h(y)$, which does not depend on $\theta$, must be zero over that region. Compare this expression with equation (1.2) on page 10, where the support of the distribution was given differently.) Whenever $g(\cdot)$ and $a(\cdot)$ are sufficiently smooth, the MLE in an exponential family has certain optimal statistical properties.

The log-likelihood function,
$$l_{L_n}(t \,;\, y) = \log L_n(t \,;\, y), \qquad (1.55)$$
is a sum rather than a product. The form of the log-likelihood in an exponential family is particularly simple:
$$l_{L_n}(t \,;\, y) = \sum_{i=1}^{n} t^{\mathrm{T}} g(y_i) - n \, a(t) + c,$$
where $c$ depends on the $y_i$ but is constant with respect to the variable of interest.

The logarithm is monotone, so the optimization problem (1.54) can be solved by solving the maximization problem with the log-likelihood function:
$$\max_{t \in \Theta} \; l_{L_n}(t \,;\, y). \qquad (1.56)$$
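A simple numerical illustration of (1.56) is to minimize the negative log-likelihood with a general-purpose optimizer (the gamma model, data, and starting values here are invented for the example; when closed-form MLEs exist they should of course be used instead):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

rng = np.random.default_rng(5)
y = rng.gamma(shape=3.0, scale=2.0, size=200)

def neg_log_likelihood(t):
    # t = (shape, scale); the objective is -l_L(t; y) = -sum_i log p(y_i | t)
    if t[0] <= 0.0 or t[1] <= 0.0:       # stay inside the parameter space
        return np.inf
    return -np.sum(gamma.logpdf(y, a=t[0], scale=t[1]))

mle = minimize(neg_log_likelihood, x0=[1.0, 1.0], method="Nelder-Mead").x
print(mle)      # should be near (3, 2)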

In the following discussion, we will find it convenient to drop the subscript $n$ in the notation for the likelihood and the log-likelihood. We will also often work with the likelihood and log-likelihood as if there is only one observation. (A general definition of a likelihood function is any nonnegative function that is proportional to the density or the probability mass function; that is, it is the same as the density or the probability mass function except that the arguments are switched, and its integral or sum over the domain of the random variable need not be 1.)

If the likelihood is twice differentiable and if the range does not depend on the parameter, Newton's method (see equation (1.26)) could be used to solve the optimization problem (1.56). Newton's equation
$$H_{l_L}(t^{(k-1)} \,;\, y) \, d^{(k)} = \nabla l_L(t^{(k-1)} \,;\, y) \qquad (1.57)$$
is used to determine the step direction in the $k$th iteration. A quasi-Newton method, as we mentioned on page 26, uses a matrix $\tilde{H}_{l_L}(t^{(k-1)})$ in place of the Hessian $H_{l_L}(t^{(k-1)})$.

Remember that mathematical expressions do not necessarily imply computational methods. There are many additional considerations for the numerical computations, and the expressions above and below, such as equations (1.60), (1.62), and (1.63), rarely should be used directly in a computer program.

Statistical Models

In the foregoing I have been careful, at the cost of some notational complexity, to distinguish between a parameter and a variable in a computational method used to estimate that parameter. Most other authors do not do this, and I do not always do it myself. It is important to make the distinction in data analysis when the objective is to fit a statistical model. In analysis of a statistical model (as opposed to using the model to analyze data), it is often important to think of the parameter as a variable. In fact, we may take the derivative of a PDF with respect to the parameter, which obviously requires that the parameter be a variable.

In this section, we will consider some important useful relationships between the log-likelihood function and properties of probability models as they range over the parameter space, that is, when the parameter is considered to be a variable.

If it exists, the derivative of the log-likelihood is the relative rate of change, with respect to the parameter placeholder $\theta$, of the probability density function at a fixed observation. If $\theta$ is a scalar, some positive function of the derivative, such as its square or its absolute value, is obviously a measure of the effect of change in the parameter or in the estimate of the parameter. More generally, an outer product of the derivative with itself is a useful measure of the changes in the components of the parameter at any given point in the parameter space:
$$\nabla l_L(\theta \,;\, y) \left( \nabla l_L(\theta \,;\, y) \right)^{\mathrm{T}}.$$


The average of this quantity with respect to the probability density of the random variable $Y$,
$$I(\theta \mid Y) = \mathrm{E}_{\theta}\left( \nabla l_L(\theta \mid Y) \left( \nabla l_L(\theta \mid Y) \right)^{\mathrm{T}} \right), \qquad (1.58)$$
is called the information matrix, or the Fisher information matrix. It is a measure of the information that an observation on $Y$ contains about the parameter $\theta$.

The optimization problem (1.54) or (1.56) can be solved by Newton's method, equation (1.26) on page 25, or by a quasi-Newton method. (We should first note that this is a maximization problem, so the signs are reversed from our previous discussion of a minimization problem.)

If $\theta$ is a scalar, the expected value of the square of the first derivative is the expected value of the negative of the second derivative,
$$\mathrm{E}\left( \left( \frac{\partial}{\partial\theta} l_L(\theta \,;\, y) \right)^2 \right) = -\mathrm{E}\left( \frac{\partial^2}{\partial\theta^2} l_L(\theta \,;\, y) \right),$$
or, in general,
$$\mathrm{E}\left( \nabla l_L(\theta \,;\, y) \left( \nabla l_L(\theta \,;\, y) \right)^{\mathrm{T}} \right) = -\mathrm{E}\left( H_{l_L}(\theta \,;\, y) \right). \qquad (1.59)$$
The expected value of the second derivative, or an approximation of it, can be used in a Newton-like method to solve the maximization problem.

Computations

A common quasi-Newton method for optimizing $l_L(\theta \,;\, y)$ is Fisher scoring, in which the Hessian in Newton's method is replaced by its expected value. The expected value can be replaced by an estimate, such as the sample mean. The iterates then are
$$\theta^{(k)} = \theta^{(k-1)} - \left( \widetilde{E}(\theta^{(k-1)}) \right)^{-1} \nabla l_L(\theta^{(k-1)} \,;\, y), \qquad (1.60)$$
where $\widetilde{E}(\theta^{(k-1)})$ is an estimate or an approximation of
$$\mathrm{E}\left( H_{l_L}\left( \theta^{(k-1)} \mid Y \right) \right), \qquad (1.61)$$
which is itself an approximation of $\mathrm{E}_{\theta^*}\left( H_{l_L}(\theta \mid Y) \right)$. By equation (1.59), this is the negative of the Fisher information matrix if the differentiation and expectation operators can be interchanged. (This is one of the "regularity conditions" we alluded to earlier.) The most common practice is to take $\widetilde{E}(\theta^{(k-1)})$ as the Hessian evaluated at the current value of the iterations on $\theta$, that is, as $H_{l_L}(\theta^{(k-1)} \,;\, y)$. This is called the observed information matrix.


In some cases a covariate $x_i$ may be associated with the observed $y_i$, and the distribution of $Y$ with given covariate $x_i$ has a parameter $\mu$ that is a function of $x_i$ and $\theta$. (The linear regression model is an example, with $\mu_i = x_i^{\mathrm{T}} \theta$.) We may in general write $\mu = x_i(\theta)$. In these cases, another quasi-Newton method may be useful. The Hessian in equation (1.57) is replaced by
$$\left( X(\theta^{(k-1)}) \right)^{\mathrm{T}} K(\theta^{(k-1)}) \, X(\theta^{(k-1)}), \qquad (1.62)$$
where $K(\theta^{(k-1)})$ is a positive definite matrix that may depend on the current value $\theta^{(k-1)}$. (Again, think of this in the context of a regression model, but not necessarily linear regression.) This method is called the Delta algorithm because of its similarity to the delta method for approximating a variance-covariance matrix (described on page 51).

In some cases, when $\theta$ is a vector, the optimization problem (1.54) or (1.56) can be solved by alternating iterations on the elements of $\theta$. In this approach, iterations based on equations such as (1.57) are
$$H_{l_L}\left( \theta_i^{(k-1)} \,;\, \theta_j^{(k-1)}, y \right) d_i^{(k)} = \nabla l_L\left( \theta_i^{(k-1)} \,;\, \theta_j^{(k-1)}, y \right), \qquad (1.63)$$
where $\theta = (\theta_i, \theta_j)$ (or $(\theta_j, \theta_i)$), $d_i$ is the update direction for $\theta_i$, and $\theta_j$ is considered to be constant in this step. In the next step, the indices $i$ and $j$ are exchanged. This is called componentwise optimization. For some objective functions, the optimal value of $\theta_i$ for fixed $\theta_j$ can be determined in closed form. In such cases, componentwise optimization may be the best method.

Sometimes, we may be interested in the MLE of $\theta_i$ given a fixed value of $\theta_j$, so the iterations do not involve an interchange of $i$ and $j$ as in componentwise optimization. Separating the arguments of the likelihood or log-likelihood function in this manner leads to what is called profile likelihood, or concentrated likelihood.

As a purely computational device, the separation of $\theta$ into smaller vectors makes for a smaller optimization problem for which the number of computations is reduced by more than a linear amount. The iterations tend to zigzag toward the solution, so convergence may be quite slow. If, however, the Hessian is block diagonal, or almost block diagonal (with sparse off-diagonal submatrices), two successive steps of the alternating method are essentially equivalent to one step with the full $\theta$. The rate of convergence would be the same as that with the full $\theta$. Because the total number of computations in the two steps is less than the number of computations in a single step with a full $\theta$, the method may be more efficient in this case.

Statistical Properties of MLE

Under suitable regularity conditions we referred to earlier, maximum likelihood estimators have a number of desirable properties. For most distributions used as models in practical applications, the MLEs are consistent. Furthermore, in those cases, the MLE $\hat{\theta}$ is asymptotically normal (with mean $\theta^*$) with variance-covariance matrix
$$\left( \mathrm{E}_{\theta^*}\left( -H_{l_L}\left( \theta^* \mid Y \right) \right) \right)^{-1}, \qquad (1.64)$$
which is the inverse of the Fisher information matrix. A consistent estimator of the variance-covariance matrix is the inverse of the Hessian at $\hat{\theta}$. (Note that there are two kinds of asymptotic properties and convergence issues. Some involve the iterative algorithm, and the others are the usual statistical asymptotics in terms of the sample size.)

EM Methods

As we mentioned above, the computational burden in a single iteration for solving an optimization problem can be reduced by more than a linear amount by separating the decision variable into two subvectors. The optimum is then approached by alternating between computations involving the two subvectors, and the iterations proceed in a zigzag path to the solution. Each of the individual sequences of iterations is simpler than the sequence of iterations on the full decision variable.

In the case of maximum likelihood problems, there is an alternating method that arises from an entirely different approach. This method alternates between updating $\theta^{(k)}$ using maximum likelihood and conditional expected values. The method is called the EM method because the alternating steps involve an expectation and a maximization. The method was described and analyzed by Dempster et al. (1977). Many additional details and alternatives as well as many examples of applications of the EM algorithm are discussed by Ng et al. (2012).

The EM methods can be explained most easily in terms of a random sample that consists of two components, one observed and one unobserved or missing. A simple example of missing data occurs in life-testing, when, for example, a number of electrical units are switched on and the time when each fails is recorded. In such an experiment, it is usually necessary to curtail the recordings prior to the failure of all units. The failure times of the units still working are unobserved. The data are said to be right censored. The number of censored observations and the time of the censoring obviously provide information about the distribution of the failure times.

The missing data can be missing observations on the same random variable that yields the observed sample, as in the case of the censoring example; or the missing data can be from a different random variable that is related somehow to the random variable observed.

Many common applications of EM methods do involve missing-data problems, but this is not necessary. Often, an EM method can be constructed based on an artificial "missing" random variable to supplement the observable data.


Let $Y = (X, U)$, and assume that we have observations on $X$ but not on $U$. We wish to estimate the parameter $\theta$, which figures in the distribution of both components of $Y$. An EM method uses the observations on $X$ to obtain a value of $\theta^{(k)}$ that increases the likelihood and then uses a conditional expectation of $U$ that increases the likelihood further.

Let $L_c(\theta \,;\, x, u)$ denote the likelihood of the complete sample, including the values of $U$ that are not available. The likelihood for the observed $X$ is
$$L(\theta \,;\, x) = \int L_c(\theta \,;\, x, u) \, \mathrm{d}u.$$
The corresponding log-likelihoods are $l_{L_c}(\theta \,;\, x, u)$ and $l_L(\theta \,;\, x)$.

The EM approach to maximizing $L(\theta \,;\, x)$ has two alternating steps. The first one begins with a value $\theta^{(0)}$ and computes the expected value of the unobserved variable, given that value of $\theta$ and the observed value of $X$. The steps are iterated until convergence.

• E step: compute $q^{(k)}(\theta) = \mathrm{E}_{U \mid x, \theta^{(k-1)}}\left( l_{L_c}(\theta \mid x, U) \right)$.
• M step: determine $\theta^{(k)}$ to maximize $q^{(k)}(\theta)$, subject to any constraints on acceptable values of $\theta$.

The sequence $\theta^{(1)}, \theta^{(2)}, \ldots$ converges to a local maximum of the observed-data likelihood $L(\theta \,;\, x)$ under fairly general conditions (including, of course, the nonexistence of a local maximum near enough to $\theta^{(0)}$). The EM method can be very slow to converge, however.

For a simple example of the EM method, see Exercise 1.17, in which the problem discussed in the article by Dempster et al. (1977) is described. In this case, the unobserved data is a result of the model being overparametrized. Another simple example is the case of a sample from a mixture of distributions. The missing data are the indicators of which distribution gave rise to the individual observations (see Exercise 1.18). As mentioned above, life-testing experiments or other experiments in which there is an "outcome" often yield missing data because some of the experimental units are removed from the experiment before the outcome actually occurs. The times-to-outcome of the censored experimental units are not observed; only bounds on the times are known. For instance, consider the simple exponential model for failure times,
$$p(t) = \frac{1}{\theta} e^{-t/\theta}.$$
A total of $n$ items are put on test, and their ending times are to be recorded. However, as is often the case, only the failure times $x_1, \ldots, x_{n_1}$ of the first $n_1$ items to fail are recorded, and for the remaining $n - n_1$ items all we know are the times at which they were taken from the test (or "censored"). This often occurs at once, so there is just one censoring time, but we can consider a more general setup in which there are different censoring times, $c_{n_1+1}, \ldots, c_n$. The log-likelihood is


\[ l(\theta) = -n_1 \log(\theta) - \Bigl(\sum_{i=1}^{n_1} t_i + \sum_{i=n_1+1}^{n} c_i\Bigr)\Big/\theta. \tag{1.65} \]

(In Exercise 1.19a, you are asked to supply the reasons for the term $\sum_{i=n_1+1}^{n} c_i/\theta$.) The MLE of θ can easily be obtained, because the maximum clearly occurs at

\[ \hat{\theta} = \Bigl(\sum_{i=1}^{n_1} t_i + \sum_{i=n_1+1}^{n} c_i\Bigr)\Big/ n_1. \]

Although the MLE can be obtained in closed form, this example provides useful insight into the EM method, and in Exercise 1.19b, you are asked to formulate this estimation problem as one to be solved by an EM method.
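As a quick illustration (not from the text), the following R fragment simulates a right-censored sample from the exponential model and computes the closed-form MLE above; the sample size, the true θ, and the single censoring time tc are arbitrary choices made only for this sketch.

## Sketch: closed-form MLE of theta under censoring at a single time tc
## (n, theta.true, and tc are assumed values for illustration)
set.seed(42)
n          <- 100
theta.true <- 10
tc         <- 8                          # common censoring time
y     <- rexp(n, rate = 1/theta.true)    # latent failure times
t.obs <- y[y <= tc]                      # observed (uncensored) failure times
n1    <- length(t.obs)
theta.hat <- (sum(t.obs) + (n - n1) * tc) / n1
theta.hat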

As a further example of the EM method, consider an experiment described by Flury and Zoppe (2000). It is assumed that the lifetime of light bulbs follows an exponential distribution with mean θ. To estimate θ, n light bulbs were tested until they all failed. Their failure times were recorded as x_1, . . . , x_n. In a separate experiment, m bulbs were tested, but the individual failure times were not recorded. Only the number of bulbs, r, that had failed at time t was recorded. The missing data are the failure times of the bulbs in the second experiment, u_1, . . . , u_m. We have

\[ l_{L_c}(\theta\,;\,x, u) = -n(\log\theta + \bar{x}/\theta) - \sum_{i=1}^{m}(\log\theta + u_i/\theta). \]

The expected value, $\mathrm{E}_{U|x,\theta^{(k-1)}}$, of this is

\[ q^{(k)}(\theta) = -(n+m)\log\theta - \frac{1}{\theta}\Bigl(n\bar{x} + (m-r)\bigl(t + \theta^{(k-1)}\bigr) + r\bigl(\theta^{(k-1)} - t\,h^{(k-1)}\bigr)\Bigr), \]

where $h^{(k-1)}$ is given by

\[ h^{(k-1)} = \frac{e^{-t/\theta^{(k-1)}}}{1 - e^{-t/\theta^{(k-1)}}}. \]

The kth M step determines the maximum, which, given $\theta^{(k-1)}$, occurs at

\[ \theta^{(k)} = \frac{1}{n+m}\Bigl(n\bar{x} + (m-r)\bigl(t + \theta^{(k-1)}\bigr) + r\bigl(\theta^{(k-1)} - t\,h^{(k-1)}\bigr)\Bigr). \]

Starting with a positive number θ^(0), this equation is iterated until convergence.
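A minimal R sketch of this iteration, under the assumptions of the example, is given below. The function name and the particular data values are illustrative only; the update inside the loop is exactly the M-step expression displayed above.

## Sketch: EM iteration for the two-experiment light bulb example
## n, xbar : size and mean of the fully observed failure times
## m, r, t : size of the second sample, number failed by time t, and t itself
em.lightbulb <- function(n, xbar, m, r, t, theta0, tol = 1e-8, maxit = 1000) {
  theta <- theta0
  for (k in 1:maxit) {
    h <- exp(-t / theta) / (1 - exp(-t / theta))
    theta.new <- (n * xbar + (m - r) * (t + theta) + r * (theta - t * h)) / (n + m)
    if (abs(theta.new - theta) <= tol * (abs(theta) + tol)) break
    theta <- theta.new
  }
  theta.new
}
## example call with made-up data
em.lightbulb(n = 50, xbar = 9.8, m = 40, r = 28, t = 10, theta0 = 5)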

This example is interesting because if we assume that the lifetime of the light bulbs has a uniform distribution, U(0, θ) (such bulbs are called “heavybulbs”!), the EM algorithm cannot be applied. As we have pointed out above, maximum likelihood methods must be used with some care whenever the range of the distribution depends on the parameter. In this case, however, there is another problem. It is in computing q^(k)(θ), which does not exist for θ < θ^(k−1).


Although in the paper that first provided a solid description of the EM method (Dempster et al. (1977)), specific techniques were used for the computations in the two steps, it is not necessary for the EM method to use those same inner-loop algorithms. There are various other ways to perform each of these computations. A number of papers since 1977 have suggested other specific methods for the computations and have given new names such as MCEM, ECM, SEM, and so on, to methods based on those alternative inner-loop computations.

For the expectation step, there are not as many choices. In the happy case of an exponential family or some other nice distributions, the expectation can be computed in closed form. Otherwise, computing the expectation is a numerical quadrature problem. There are various procedures for quadrature, including Monte Carlo (see page 83). The additional Monte Carlo computations add a lot to the overall time required for convergence of the EM method. Even the variance-reducing methods discussed in Section 2.6 can do little to speed up the method. Monte Carlo samples from previous expectation steps may be used in order to reduce the costs of Monte Carlo sampling.

An additional problem in using Monte Carlo in the expectation step may be that the distribution of Y is difficult to simulate. The versatile Gibbs method (page 80) is often useful in this context.

The convergence criterion for optimization methods that involve Monte Carlo generally should be tighter than for deterministic methods.

For the maximization step, there are more choices, as we have seen in the discussion of maximum likelihood estimation above.

For the maximization step, Dempster et al. (1977) suggested requiring only an increase in the expected value; that is, take θ^(k) so that $q^{(k)}(\theta^{(k)}) \ge q^{(k-1)}(\theta^{(k-1)})$. This is called a generalized EM algorithm, or GEM. There are several simple possibilities for achieving an increase. One way is just to take θ^(k) as the point resulting from a single Newton step. Another way of simplifying the M step is just to use a componentwise maximization, as in the update step of equation (1.63) on page 45; that is, if θ = (θ_1, θ_2), first $\theta_1^{(k)}$ is determined to maximize q subject to the constraint $\theta_2 = \theta_2^{(k-1)}$; then $\theta_2^{(k)}$ is determined to maximize q subject to the constraint $\theta_1 = \theta_1^{(k)}$. This sometimes simplifies the maximization problem so that it can be done in closed form.

As is usual for estimators defined as solutions to optimization problems, we may have some difficulty in determining the statistical properties of the estimators. One method of estimating the variance-covariance matrix of the estimator is by use of the gradient and Hessian of the complete-data log-likelihood, $l_{L_c}(\theta\,;\,x, u)$, as in equation (1.48). Bootstrap methods (see Chapter 4) can also be used for estimation of the variance-covariance matrix.


1.5 Inference about Functions of Parameters

Often, instead of the parameter θ, we may be interested in g(θ), where g(·) is some function. Depending on the complexity of the function, the methods of inference on g(θ) and the properties of those methods may be very similar to the methods and properties of inference on θ itself, or they may be very different.

Functions of Parameters and Functions of Estimators

If the function g(·) is monotonic or has certain other regularity properties, it may be the case that the estimator that results from the minimum residuals principle or from the maximum likelihood principle is invariant; that is, the estimator of g(θ) is merely the function g(·) evaluated at the solution to the optimization problem for estimating θ.

Let T be a statistic used as an estimator of θ, and assume that E(T) and V(T) both exist and are finite. The statistical properties of T for estimating θ do not necessarily carry over to g(T) for estimating g(θ).

If T is unbiased for θ, it is probably not the case that g(T) is unbiased for g(θ). As an example of why a function of an unbiased estimator may not be unbiased, consider a simple case in which T and g(T) are scalars and g is a strictly convex function. (A function g is a strictly convex function if for any two distinct points x and y in the domain of g, $g\bigl(\tfrac{1}{2}(x+y)\bigr) < \tfrac{1}{2}\bigl(g(x)+g(y)\bigr)$.) In this case, if T is not degenerate,

\[ \mathrm{E}(g(T)) > g(\mathrm{E}(T)), \tag{1.66} \]

so g(T) is positively biased for g(θ). (This relation is Jensen's inequality.) The opposite inequality applies to a strictly concave function, in which case the bias is negative.
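A small Monte Carlo illustration of this bias, not taken from the text, is easy to construct in R. Here T is the mean of 10 standard normal variables and g(t) = e^t, which is strictly convex, so the simulated mean of g(T) should exceed g(E(T)) = e^0 = 1.

## Sketch: Jensen's inequality for a convex g, by simulation
set.seed(1)
gT <- replicate(100000, exp(mean(rnorm(10))))
mean(gT)   # noticeably greater than 1, illustrating the positive bias
exp(0)     # g(E(T)) = 1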

It is often possible to adjust g(T) to be unbiased for g(θ); and properties of T, such as sufficiency for θ, may carry over to the adjusted g(T). Some of the applications of the jackknife and the bootstrap that we discuss later are for making adjustments to estimators of g(θ) that are based on estimators of θ.

The variance of R = g(T) can often be approximated in terms of the variance of T. Let T and θ be m-vectors and R be a k-vector. In a simple but common case, we may know that T in a sample of size n has an approximate normal distribution with mean θ and some variance-covariance matrix, say V(T), and g is a smooth function (that is, it can be approximated by a truncated Taylor series about θ):

\[ R = g(T) \approx g(\theta) + \mathrm{J}_g(\theta)(T - \theta) + \tfrac{1}{2}(T-\theta)^{\mathrm{T}} \mathrm{H}_g(\theta)\, (T-\theta). \]

Assuming the variance of T is O(n^{-1}), the remaining terms in the expansion go to zero in probability at the rate of at least n^{-1}.


This yields the approximations

\[ \mathrm{E}(R) \approx g(\theta) \tag{1.67} \]

and

\[ \mathrm{V}(R) \approx \mathrm{J}_g(\theta)\, \mathrm{V}(T)\, \bigl(\mathrm{J}_g(\theta)\bigr)^{\mathrm{T}}. \tag{1.68} \]

This method of approximation of the variance is called the delta method.
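For a scalar illustration of the delta method (an assumed example, not one from the text), take T to be the mean of n exponential variables with mean θ and g(t) = log t. Then V(T) = θ²/n and the delta-method approximation gives V(g(T)) ≈ (1/θ)² · θ²/n = 1/n, which a small simulation can check.

## Sketch: checking a scalar delta-method approximation by simulation
set.seed(1)
theta <- 5
n     <- 200
gT <- replicate(50000, log(mean(rexp(n, rate = 1/theta))))
var(gT)   # should be close to the delta-method value below
1/n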

A common form of a simple estimator that may be difficult to analyze and may have unexpected properties is a ratio of two statistics,

\[ R = \frac{T}{S}, \]

where S is a scalar. An example is a studentized statistic, in which T is a sample mean and S is a function of squared deviations. If the underlying distribution is normal, a statistic of this form may have a well-known and tractable distribution. In particular, if T is a mean and S is a function of an independent chi-squared random variable, the distribution is that of a Student's t. If the underlying distribution has heavy tails, however, the distribution of R may have unexpectedly light tails. An asymmetric underlying distribution may also cause the distribution of R to be very different from a Student's t distribution. If the underlying distribution is positively skewed, the distribution of R may be negatively skewed (see Exercise 1.20).

Linear Estimators

A functional Θ is linear if, for any two functions f and g in the domain of Θ and any real number a,

\[ \Theta(af + g) = a\,\Theta(f) + \Theta(g). \]

A statistic is linear if it is a linear functional of the ECDF. A linear statistic can be computed from a sample using an online algorithm, and linear statistics from two samples can be combined by addition. Strictly speaking, this definition excludes statistics such as means, but such statistics are essentially linear in the sense that they can be combined by a linear combination if the sample sizes are known.
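The following R lines (an illustrative sketch only) show the point: sums from two samples combine by simple addition, while means combine by a linear combination that requires the two sample sizes.

## Sketch: combining a linear statistic (a sum) and a mean from two samples
set.seed(1)
x1 <- rnorm(30)
x2 <- rnorm(50)
sum.combined  <- sum(x1) + sum(x2)                       # sums simply add
mean.combined <- (30 * mean(x1) + 50 * mean(x2)) / 80    # means need the sample sizes
all.equal(mean.combined, mean(c(x1, x2)))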

1.6 Probability Statements in Statistical Inference

There are two important instances in statistical inference in which statements about probability are associated with the decisions of the inferential methods. In hypothesis testing, under assumptions about the distributions, we base our inferential methods on probabilities of two types of errors. In confidence intervals the decisions are associated with probability statements about coverage


of the parameters. For both cases the probability statements are based on the distribution of a random sample, Y_1, . . . , Y_n.

In computational inference, probabilities associated with hypothesis tests or confidence intervals are estimated by simulation of a hypothesized data-generating process or by resampling of an observed sample.

1.6.1 Tests of Hypotheses

Often statistical inference involves testing a “null” hypothesis, H_0, about the parameter. In a simple case, for example, we may test the hypothesis

\[ H_0:\ \theta = \theta_0 \]

versus an alternative hypothesis that θ takes on some other value or is in some set that does not include θ_0. The straightforward way of performing the test involves use of a test statistic, T, computed from a random sample of data, Y_1, . . . , Y_n. Associated with T is a rejection region C such that if the null hypothesis is true, Pr(T ∈ C) is some preassigned (small) value, α, and Pr(T ∈ C) is greater than α if the null hypothesis is not true. Thus, C is a region of more “extreme” values of the test statistic if the null hypothesis is true. If T ∈ C, the null hypothesis is rejected. It is desirable that the test have a high probability of rejecting the null hypothesis if indeed the null hypothesis is not true. The probability of rejection of the null hypothesis is called the power of the test.

A procedure for testing that is mechanically equivalent to this is to compute the test statistic t and then to determine the probability that T is more extreme than t. In this approach, the realized value of the test statistic determines a region C_t of more extreme values. The probability that the test statistic is in C_t if the null hypothesis is true, Pr(T ∈ C_t), is called the “p-value” or “significance level” of the realized test statistic.

If the distribution of T under the null hypothesis is known, the critical region or the p-value can be determined. If the distribution of T is not known, some other approach must be used. A common method is to use some approximation to the distribution. The objective is to approximate a quantile of T under the null hypothesis. The approximation is often based on an asymptotic distribution of the test statistic. In Monte Carlo tests, discussed in Section 2.3, the quantile of T is estimated by simulation of the distribution of the underlying data.
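The flavor of such a Monte Carlo test can be sketched in a few lines of R. The null model, the test statistic, and the data below are assumptions made only for the illustration: under H_0 the data are exponential with mean θ_0, the test statistic is the sample mean, and large values are taken as extreme.

## Sketch: approximating a p-value by simulating the test statistic under H0
set.seed(1)
theta0 <- 1
y      <- rexp(25, rate = 1/1.4)      # "observed" data, generated here for illustration
t.obs  <- mean(y)
t.sim  <- replicate(10000, mean(rexp(25, rate = 1/theta0)))
p.value <- mean(t.sim >= t.obs)       # one-sided Monte Carlo p-value
p.value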

1.6.2 Confidence Intervals

Our usual notion of a confidence interval relies on a frequency approach to probability, and it leads to the definition of a 1 − α confidence interval for the (scalar) parameter θ as the random interval (T_L, T_U) that has the property

\[ \Pr(T_L \le \theta \le T_U) = 1 - \alpha. \tag{1.69} \]


This is also called a (1 − α)100% confidence interval. The endpoints of the interval, T_L and T_U, are functions of a sample, Y_1, . . . , Y_n. The interval (T_L, T_U) is not uniquely determined.

The concept extends easily to vector-valued parameters. Rather than taking vectors T_L and T_U, however, we generally define an ellipsoidal region, whose shape is determined by the covariances of the estimators.

A realization of the random interval, say (t_L, t_U), is also called a confidence interval. Although it may seem natural to state that the “probability that θ is in (t_L, t_U) is 1 − α”, this statement can be misleading unless a certain underlying probability structure is assumed.

In practice, the interval is usually specified with respect to an estimator of θ, T. If we know the sampling distribution of T − θ, we may determine c_1 and c_2 such that

\[ \Pr(c_1 \le T - \theta \le c_2) = 1 - \alpha; \tag{1.70} \]

and hence

\[ \Pr(T - c_2 \le \theta \le T - c_1) = 1 - \alpha. \]

If either T_L or T_U in equation (1.69) is infinite or corresponds to a bound on acceptable values of θ, the confidence interval is one-sided. For two-sided confidence intervals, we may seek to make the probability on either side of T to be equal, to make c_1 = −c_2, and/or to minimize |c_1| or |c_2|. This is similar in spirit to seeking an estimator with small variance.

For forming confidence intervals, we generally use a function of the sample that also involves the parameter of interest, f(T, θ). The confidence interval is then formed by separating the parameter from the sample values.

Whenever the distribution depends on parameters other than the one of interest, we may be able to form only conditional confidence intervals that depend on the value of the other parameters. A class of functions that are particularly useful for forming confidence intervals in the presence of such nuisance parameters are called pivotal values, or pivotal functions. A function f(T, θ) is said to be a pivotal function if its distribution does not depend on any unknown parameters. This allows exact confidence intervals to be formed for the parameter θ. We first form

\[ \Pr\bigl(f_{(\alpha/2)} \le f(T, \theta) \le f_{(1-\alpha/2)}\bigr) = 1 - \alpha, \tag{1.71} \]

where $f_{(\alpha/2)}$ and $f_{(1-\alpha/2)}$ are quantiles of the distribution of f(T, θ); that is,

\[ \Pr\bigl(f(T, \theta) \le f_{(\pi)}\bigr) = \pi. \]

If, as in the case considered above, f(T, θ) = T − θ, the resulting confidence interval has the form

\[ \Pr\bigl(T - f_{(1-\alpha/2)} \le \theta \le T - f_{(\alpha/2)}\bigr) = 1 - \alpha. \]

For example, suppose that Y_1, . . . , Y_n is a random sample from a N(µ, σ²) distribution, and $\bar{Y}$ is the sample mean. The quantity


\[ f(\bar{Y}, \mu) = \frac{\sqrt{n(n-1)}\,(\bar{Y} - \mu)}{\sqrt{\sum\bigl(Y_i - \bar{Y}\bigr)^2}} \tag{1.72} \]

has a Student's t distribution with n − 1 degrees of freedom, no matter what is the value of σ². This is one of the most commonly used pivotal values.

The pivotal value in equation (1.72) can be used to form a confidence interval for µ by first writing

\[ \Pr\bigl(t_{(\alpha/2)} \le f(\bar{Y}, \mu) \le t_{(1-\alpha/2)}\bigr) = 1 - \alpha, \]

where $t_{(\pi)}$ is a percentile from the Student's t distribution. Then, after making substitutions for $f(\bar{Y}, \mu)$, we form the familiar confidence interval for µ:

\[ \bigl(\bar{Y} - t_{(1-\alpha/2)}\, s/\sqrt{n},\ \ \bar{Y} - t_{(\alpha/2)}\, s/\sqrt{n}\bigr), \tag{1.73} \]

where s² is the usual sample variance, $\sum(Y_i - \bar{Y})^2/(n-1)$.
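The interval (1.73) is simple to compute directly; the following R lines (an illustrative sketch with simulated data) do so and compare the result with the interval reported by t.test.

## Sketch: the confidence interval (1.73) computed directly
set.seed(1)
y <- rnorm(20, mean = 10, sd = 2)
alpha <- 0.05
n <- length(y); ybar <- mean(y); s <- sd(y)
c(ybar - qt(1 - alpha/2, n - 1) * s / sqrt(n),
  ybar - qt(alpha/2,     n - 1) * s / sqrt(n))
t.test(y, conf.level = 1 - alpha)$conf.int   # agrees with the interval above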

Other similar pivotal values have F distributions. For example, consider the usual linear regression model in which the n-vector random variable Y has a N_n(Xβ, σ²I) distribution, where X is an n × m known matrix, and the m-vector β and the scalar σ² are unknown. A pivotal value useful in making inferences about β is

\[ g(\hat{\beta}, \beta) = \frac{\bigl(X(\hat{\beta} - \beta)\bigr)^{\mathrm{T}} X(\hat{\beta} - \beta)/m}{(Y - X\hat{\beta})^{\mathrm{T}}(Y - X\hat{\beta})/(n - m)}, \tag{1.74} \]

where

\[ \hat{\beta} = (X^{\mathrm{T}}X)^{+} X^{\mathrm{T}} Y. \]

The random variable $g(\hat{\beta}, \beta)$ for any finite value of σ² has an F distribution with m and n − m degrees of freedom.

For a given parameter and family of distributions, there may be multiple pivotal values. For purposes of statistical inference, such considerations as unbiasedness and minimum variance may guide the choice of a pivotal value to use. Alternatively, it may not be possible to identify a pivotal quantity for a particular parameter. In that case, we may seek an approximate pivot. A function is asymptotically pivotal if a sequence of linear transformations of the function is pivotal in the limit as n → ∞.

If the distribution of T is known, c_1 and c_2 in equation (1.70) can be determined. If the distribution of T is not known, some other approach must be used. A method for computational inference, discussed in Section 4.3, is to use bootstrap samples from the ECDF.


1.7 Computational Methods

Computations are an important part of almost any statistical activity, whether it is analysis, presentation of the results of a statistical analysis, or even development of statistical theory. The extent and nature of the computations distinguish the subfield of computational statistics. There are many details and general issues of computations that are important, but in this text, I do not go into these details. I have discussed them rather extensively in Chapter 10 of Gentle (2007) and in Chapters 2 and 3 of Gentle (2009). In this section, I briefly discuss two areas of numerical analysis, and then make some brief comments on programming and software.

Mathematics provides us with models for data and for computations. There are two significant ways in which mathematical models differ from the actual quantities we work with and how we work with them.

• The actual numbers we work with are not the same as the mathematical abstraction IR, the real numbers.
• Useful mathematical expressions often do not indicate good computational methods.

1.7.1 Computer Arithmetic

Numerical quantities can be represented in various ways in a computer. The standard representation that corresponds most closely to the mathematical reals IR is the floating-point system, IF, which itself is an interesting mathematical structure.

In a standard computer implementation of IF, there are of order 10^19 different real numbers that can be represented directly. There are also a few special values in IF, such as a value that corresponds in some ways to ∞ and one that corresponds to −∞. There is also a special value that is “Not a Number” (NaN). This is what would result from dividing 0 by 0, for example. Obviously, not all values in IR can be represented exactly in IF, so we must accept approximations of real numbers.

Numerical quantities are represented in IF using a fixed number of bits separated into three distinct components of a number: its sign, its order of magnitude (expressed as a power of 2), and its value within a fixed range determined by its magnitude. This representation immediately implies that the values represented cannot be dense; in fact, the number of different values is finite, and that, in turn, implies that

• there is a largest number;
• there is a smallest positive number;
• there is a number that is next largest after the number 1;
• there is a number that is next smallest before the number 1;


and so on.

The computer arithmetic built onto IF corresponds closely to the arithmetic on IR, but the fact that many numbers in IR cannot be represented exactly in IF means that the arithmetic in the two systems cannot be exactly the same. Some basic properties of arithmetic on IR that cannot be preserved in a finite system are existence of multiplicative inverses, associativity (of addition or of multiplication), and distribution of multiplication over addition.

The IEEE Standard 754 (which is also the ISO/IEC/IEEE 60559 standard) prescribes how the real numbers are to be represented in either 32 or 64 bits, and some minimal requirements for arithmetic on those numbers. In the IEEE Standard 754 representation of 64-bit floating-point numbers, 1 bit is given to the sign of the number, 11 bits are given to the exponent, and 52 bits are given to the mantissa, which, together with an implicit leading bit, provides 53 bits of precision.

The fact that the three components of the floating-point representation each utilize a fixed number of bits means that the number of distinct values within one range of values is different from the number of distinct values within another range of the same length. For example, there are twice as many distinct values between 1 and 2 that can be represented in IF as there are between 2 and 3.
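In R, several of these special values of IF can be inspected directly through the .Machine list, and simple sums show that arithmetic in IF is not exact; this is only a brief illustration of the properties just described.

## Sketch: a few properties of IF as exposed in R (IEEE 754 64-bit arithmetic)
.Machine$double.xmax          # the largest representable number
.Machine$double.xmin          # the smallest positive normalized number
1 + .Machine$double.eps       # the number next largest after 1
1 - .Machine$double.neg.eps   # the number next smallest before 1
(0.1 + 0.2) - 0.3 == 0        # FALSE: 0.1, 0.2, and 0.3 are not in IF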

1.7.2 Algorithms

In most interesting computational problems, the computations can be performed in more than one way. Obviously, we would want to use a method that accomplishes the task in the least amount of time and in such a way that the solution, which is in IF, is as close as possible to the true value in IR. (I will use the term “algorithm” loosely to mean a computational method.)

Mathematical Expressions and Computational Methods

There are many useful mathematical expressions that should not be followed in order to compute the quantity represented by the expression. For example, the solution to the linear system of equations Ax = b, if A is square and of full rank, is A^{-1}b. Without thinking too deeply about this problem, it should be obvious that there are better ways of finding x than by computing A^{-1} and then multiplying it by b.

Another important example in statistics is to compute the least-squares estimate of β in the full-rank model y = Xβ + ε. In a full-rank model, this estimator is given by (X^T X)^{-1} X^T y. To evaluate this quantity directly from the expression, we would first form X^T X, then take its inverse, then form X^T y, and finally multiply X^T y by the inverse of X^T X. Such computations are subject to bad accumulated rounding error. Better computational methods to compute the least squares estimate of β do not form X^T X, whose condition number is the square of the condition number of X.
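The following R sketch illustrates both points with assumed data; it is not meant as a recipe for production code. For the linear system, solve(A, b) solves the system directly rather than forming A^{-1}; for least squares, a QR-based solution avoids forming X^T X.

## Sketch: preferred computational forms for two common expressions
set.seed(1)
A <- crossprod(matrix(rnorm(25), 5, 5)) + diag(5)   # a full-rank 5 x 5 matrix
b <- rnorm(5)
x1 <- solve(A) %*% b     # follows the mathematical expression A^{-1} b
x2 <- solve(A, b)        # better: solve the system directly
all.equal(c(x1), x2)

X <- cbind(1, rnorm(100))
y <- X %*% c(2, 3) + rnorm(100)
b1 <- solve(t(X) %*% X) %*% t(X) %*% y   # forms X^T X explicitly
b2 <- qr.solve(X, y)                     # QR-based least squares; avoids X^T X
all.equal(c(b1), c(b2))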


Iterative Methods

For some quantities that we want to evaluate there is a closed form, such as the expression above for the least-squares estimate of β in the linear model. For many quantities of interest, however, there is no closed-form expression. Rather, we have only statements of their properties, such as in the case of least-squares estimates in nonlinear models. In that case, we must use a sequence of iterations, perhaps the Newton iterates in equation (1.26) on page 25.

Convergence Criteria

Iterative methods need a rule for deciding when to stop and take the latest iterate as the final result. Convergence criteria are usually based on changes in the magnitude of some quantity from one iteration to the next. In general, we should never base the convergence decision on no change; that is, we should not test that the value of one floating-point variable is equal to the value of another floating-point variable. The test for convergence should be based on two values' being within a small range of each other.
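A simple version of such a test, written as an R helper function with an assumed default tolerance, is shown below; the point is that the comparison is relative rather than a test for exact equality.

## Sketch: a relative-change convergence test rather than a test for equality
converged <- function(new, old, tol = sqrt(.Machine$double.eps)) {
  abs(new - old) <= tol * (abs(old) + tol)
}
converged(3.14159265358, 3.14159265359)   # TRUE: the relative change is tiny
converged(1, 2)                           # FALSE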

Online Algorithms

An important feature of some algorithms is that an individual datum need enter the computations only once; that is, the computations can be made and updated by passing through the data only once. Such algorithms are called online. For example, algorithms to compute a sum or to compute the least-squares solutions in a linear model can be online, while an algorithm to compute the median of a set of data cannot be online. Iterative algorithms, in general, cannot be online.
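As a small illustration (not from the text), an online computation of the sample mean can be written so that each datum is used exactly once and then discarded.

## Sketch: an online (one-pass) computation of the sample mean
online.mean <- function(x) {
  m <- 0
  for (i in seq_along(x)) {
    m <- m + (x[i] - m) / i    # update uses only the current datum
  }
  m
}
x <- rnorm(1000)
all.equal(online.mean(x), mean(x))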

Computational Efficiency and Order of Computations

For any given problem there is usually more than one algorithm. Some algorithms may be expected to be more accurate than others; that is, they are less affected by roundoff, or else they are less affected by the characteristics of the dataset to which they are applied. Also, some algorithms generally take less time to perform the computations than other algorithms. We often measure the efficiency of algorithms in terms of the number of floating-point operations. Such counts are often expressed as an order in terms of some measure of the size of the problem. For example, to sort n elements, where n is a large number, one algorithm may require approximately n² comparisons, whereas another algorithm may require only approximately n log(n) comparisons. We denote these levels of efficiency as O(n²) and O(n log(n)). The values of the data, not just the number of data elements, may affect the efficiency of algorithms, so measures of efficiency must be based on some specific characteristics of datasets. We sometimes speak of a worst-case order or an average order.


1.7.3 Software and Programming

The ability to program well in at least one programming language is a necessity for a substantial proportion of statisticians. For programming examples in this book, I will generally use R, and for many people who work in computational statistics, it is sufficient to be able to program well in R.

Well-constructed programs in any programming language generally must make full use of the higher-level constructs of the language. In R, for example, higher-level constructs for numerical computations include vectors and matrices. Well-constructed programs in R are “vectorized”; that is, they avoid explicit program loops by use of functions designed to operate on arrays.
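As a simple illustration of vectorization (an assumed example), the two computations below give the same result, but the second avoids an explicit loop and is both shorter and much faster in R.

## Sketch: an explicit loop versus a vectorized computation
x <- runif(100000)
s1 <- 0
for (xi in x) s1 <- s1 + xi^2   # explicit loop
s2 <- sum(x^2)                  # vectorized form
all.equal(s1, s2)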

Exercises

Some of these exercises require you to simulate output from a given data-generating process by using random number generators (see Appendix B). For these exercises you do not need to know much about random number generation, which we will briefly discuss in Chapter 2. Some exercises require you to conduct small Monte Carlo studies to assess the performance of statistical techniques (see Appendix A).

1.1. a) How would you describe, in nontechnical terms, the structure of the dataset displayed in Figure 1.1, page 7?

b) How would you describe the structure of the dataset in more precise mathematical terms? (Obviously, without having the actual data, your equations must contain unknown quantities. The question is meant to make you think about how you would do this; that is, what would be the components of your model.)

1.2. a) We stated that the normal, the binomial, the Poisson, the gamma, the Pareto, and the negative binomial families of probability distributions were of the exponential class. Show that this is true by expressing the PDF of each distribution in the form of equation (1.2). (See Appendix D, beginning on page 435, for descriptions of these families of distributions.)

b) Which of the distributions mentioned above are also location-scale families?

c) Name some other families of probability distributions that are location-scale families. In particular, is the usual two-parameter gamma family a location-scale family? Is the three-parameter gamma family a location-scale family?

1.3. Prove that if X is a random variable with an absolutely continuous distribution function P_X, the random variable P_X(X) has a U(0, 1) distribution.

1.4. Show that the variance of the ECDF at a point y is the expression in equation (1.9) on page 14. Use the definition of the variance in terms of


expected values, and represent $\mathrm{E}\bigl((P_n(y))^2\bigr)$ in a manner similar to how $\mathrm{E}\bigl(P_n(y)\bigr)$ was represented in equations (1.8). Do not just say that it is a binomial random variable.
1.5. The variance functional.

a) Express the variance of a random variable as a functional of its CDF, as was done in equation (1.11) for the mean.
b) What is the same functional of the ECDF?
c) What is the plug-in estimate of the variance?
d) What are the statistical properties of the plug-in estimate of the variance? (Is it unbiased? Is it consistent? Is it an MLE? etc.)
1.6. For a distribution with CDF F and given α ∈ (0, 1), the functional that defines the α quantile is the quantile function, equation (1.19). What is the plug-in estimator of the α quantile? Is it unbiased for some α and for some nondegenerate distribution?

1.7. Give examples of
a) a parameter that is defined by a linear functional of the distribution function, and
b) a parameter that is not a linear functional of the distribution function.
c) Is the variance a linear functional of the distribution function?

1.8. a) Let U_1, . . . , U_n, with n ≥ 2, be a random sample from a U(0, 1) distribution with order statistics U_{1:n}, . . . , U_{n:n}. Show that if U_{n:n} = v, the conditional distribution of vU_{n−1:n}, given U_{n:n} = v, is the same as the unconditional distribution of U_{n−1:n}. (This is a simple special case of expression (1.18).)

b) Generate a random sample of 100 U(0, 1) variates. Call the order statistics from the sample u_{1:100}, . . . , u_{100:100}. Let v = u_{51:100}. Now generate an independent random sample of 50 U(0, v) variates. Call the order statistics from the sample w_{1:50}, . . . , w_{50:50}. Plot w_{1:50}, . . . , w_{50:50} versus u_{1:50}, . . . , u_{50:50}. What do you observe?

1.9. Assume a random sample of size 10 from a normal distribution. With ν = 1 − 2ι in equation (1.21), determine the value of ι that makes the empirical quantile of the 9th order statistic unbiased for the normal quantile corresponding to 0.90.

1.10. The Harrell-Davis quantile estimator (see Harrell and Davis (1982)) is based on the expected value of the kth order statistic in a sample of size n, which we can work out using equation (1.16):

\[ \mathrm{E}(X_{(k)}) = \frac{n!}{(k-1)!\,(n-k)!} \int_{-\infty}^{\infty} x\,\bigl(F(x)\bigr)^{k-1}\bigl(1 - F(x)\bigr)^{n-k}\, dF(x). \]

If F in equation (1.16) is invertible, then we can make a change of variables y = F(x) in the integral:

\[ \mathrm{E}(X_{(k)}) = \frac{n!}{(k-1)!\,(n-k)!} \int_{0}^{1} F^{-1}(y)\, y^{k-1}(1-y)^{n-k}\, dy. \]


Now, for the α quantile, we consider the quantity α(n + 1). As n increases without bound, E(X_{(α(n+1))}) → F^{-1}(α); hence, we consider E(X_{(α(n+1))}) in the form above. We write the combinatorial coefficient n!/((k − 1)!(n − k)!) as 1/B(k, n − k + 1), where B(k, n − k + 1) is the complete beta function (see Appendix C), a form that remains meaningful when its arguments are not integers. Also, in place of the CDF, we use the ECDF. For the expected value of the “α(n + 1)th order statistic” (note that α(n + 1) is not necessarily an integer), we have, in the form of the integral above,

\[ x_{\alpha} = \frac{1}{B\bigl(\alpha(n+1),\,(n+1)(1-\alpha)\bigr)} \int_{0}^{1} F_n^{-1}(y)\, y^{\alpha(n+1)-1}(1-y)^{(n+1)(1-\alpha)-1}\, dy. \]

Because $F_n^{-1}$ is a step function of the sample values that is constant on the intervals ((i − 1)/n, i/n) and the steps are the order statistics, this expression can be written as a linear combination of the order statistics:

\[ x_{\alpha} = \sum_{i=1}^{n} W_{n,i}\, x_{(i)}, \]

where

\[ W_{n,i} = \frac{1}{B\bigl(\alpha(n+1),\,(n+1)(1-\alpha)\bigr)} \int_{(i-1)/n}^{i/n} y^{\alpha(n+1)-1}(1-y)^{(n+1)(1-\alpha)-1}\, dy = P(i/n) - P((i-1)/n), \]

where P(·) is the CDF of the beta distribution with parameters α(n + 1) and (n + 1)(1 − α).
a) Write a function that accepts a univariate dataset and, for given α,

computes the Harrell-Davis α quantile estimate. (The R function hdquantile in the Hmisc package computes this, but you should write one yourself. All you need are functions to sort and to compute the beta CDF.)

b) Conduct a small Monte Carlo study to compare the Harrell-Davis estimate of the median with the sample median as an estimate of the median of a Weibull distribution (see Table D.6 in Appendix D for a definition of the Weibull distribution). The sample median for an odd-sized sample is the central order statistic, and the sample median for an even-sized sample is the average of the two central order statistics. Use sample sizes of 10, 50, and 100. Compare the estimators based on their bias and their variance. The median of the Weibull(α, β) distribution is (β log 2)^{1/α}.

1.11. Generate a sample of pseudorandom numbers from a normal (0,1) distribution and produce a quantile plot of the sample against a normal (0,1) distribution, similar to Figure 1.3 on page 20. Do the tails of the sample seem light? (How can you tell?) If they do not, generate another sample


and plot it. Does the erratic tail behavior indicate problems with the random number generator? Why might you expect often (more than 50% of the time) to see samples with light tails? What is the relationship between the median of X_{1:n} and z_α, or the median of X_{n:n} and z_{1−α}, where α is the quantile level used in producing the q-q plot?

1.12. Consider the least squares estimator of β in the usual linear regression model, E(Y) = Xβ.
a) Use expression (1.47) on page 39 to derive the variance-covariance matrix for the estimator.
b) Use expression (1.48) to derive the variance-covariance matrix for the estimator.
1.13. a) For t_i = 1, . . . , 1000, generate 1,000 observations from the model

\[ y_i = \beta_1 e^{-(\beta_2 t_i + \beta_3 t_i^2)} + \epsilon_i, \]

where β_1 = 10, β_2 = 0.001, β_3 = 0.000001, and the ε_i are iid N(0, σ²) with σ² = 1.
b) Use least squares to estimate β_1, β_2, β_3, and σ².
c) Use both equations (1.47) and (1.48) to estimate the variance-covariance matrix of (β_1, β_2, β_3).
1.14. Assume a random sample y_1, . . . , y_n from a gamma distribution with parameters α and β.
a) What are the least squares estimates of α and β? (Recall E(Y) = αβ and V(Y) = αβ².)
b) Write a function in a language such as R, Matlab, or Fortran that accepts a sample of size n and computes the least squares estimator of α and β and an approximation of the variance-covariance matrix using both expression (1.47) and expression (1.48).
c) Try out your program in Exercise 1.14b by generating a sample of size 500 from a gamma(2,3) distribution and then computing the estimates. (See Appendix B for information on software for generating random deviates.)
d) Formulate the optimization problem for determining the MLE of α and β. Does this problem have a closed-form solution?
e) Write a function in a language such as R, Matlab, or Fortran that accepts a sample of size n and computes the maximum likelihood estimator of α and β and computes an approximation of the variance-covariance matrix using expression (1.64), page 46.
f) Try out your program in Exercise 1.14e by computing the estimates from an artificial sample of size 500 from a gamma(2,3) distribution.

1.15. Iteratively reweighted least squares and L_p norms.
a) Write a program to use IRLS to determine the L_p estimates of β in the linear regression model y = Xβ + E. The program should accept p, the n-vector y, and the n × m matrix X, and should return the L_p estimates of β as well as the sample mean squared error and the L_p norm of the


residuals. Make sure that your program handles small residuals well. Residuals that are numerically zero can either be assigned a very large value or, counter-intuitively, omitted.

b) Let n = 100, m = 3, and β = (1, 1, 1). Assume that the model contains an intercept; that is, the first column of X is constant (all 1's). Now generate random data for the matrix X. Generate the values independently and identically from a U(1, 10) distribution. Now generate y from the model and X plus an error that is N(0, 1). Finally, use your program to compute the L_1, L_{1.5}, L_2, and L_3 estimates of β.

c) Now, take the same data as in Exercise 1.15b, and add 10 to y_1 and to y_2. Use your program to compute the L_1, L_{1.5}, L_2, and L_3 estimates of β, and compare these estimates with the estimates from data without the outliers.

d) Combine the two previous parts into a small Monte Carlo study to compare the performance of the L_1 and L_2 estimators in the presence of location outliers.

1.16. For the random variable Y with a distribution in an exponential family and whose density is expressed in the form of equation (1.2) on page 10, and assuming that the first two moments of g(Y) exist and a(·) is twice differentiable, show that

\[ \mathrm{E}(g(Y)) = \nabla a(\theta) \]

and

\[ \mathrm{V}(g(Y)) = \mathrm{H}_a(\theta). \]

Hint: First show that

\[ \mathrm{E}\bigl(\nabla \log(p(Y \mid \theta))\bigr) = 0, \]

where the differentiation is with respect to θ.
1.17. Dempster et al. (1977) consider the multinomial distribution with four

outcomes, that is, the multinomial with probability function,

\[ p(x_1, x_2, x_3, x_4) = \frac{n!}{x_1!\,x_2!\,x_3!\,x_4!}\, \pi_1^{x_1}\pi_2^{x_2}\pi_3^{x_3}\pi_4^{x_4}, \]

with n = x_1 + x_2 + x_3 + x_4 and 1 = π_1 + π_2 + π_3 + π_4. They assumed that the probabilities are related by a single parameter, θ:

\[ \pi_1 = \frac{1}{2} + \frac{1}{4}\theta, \qquad \pi_2 = \frac{1}{4} - \frac{1}{4}\theta, \qquad \pi_3 = \frac{1}{4} - \frac{1}{4}\theta, \qquad \pi_4 = \frac{1}{4}\theta, \]


where 0 ≤ θ ≤ 1. (This model goes back to an example discussed by Fisher, 1925, in Statistical Methods for Research Workers.) Given an observation (x_1, x_2, x_3, x_4), the log-likelihood function is

l(θ) = x1 log(2 + θ) + (x2 + x3) log(1 − θ) + x4 log(θ) + c

and

\[ dl(\theta)/d\theta = \frac{x_1}{2 + \theta} - \frac{x_2 + x_3}{1 - \theta} + \frac{x_4}{\theta}. \]

The objective is to estimate θ.
a) Determine the MLE of θ. (Just solve a simple polynomial equation.) Evaluate the estimate using the data that Dempster et al. used: n = 197 and x = (125, 18, 20, 34).

b) Although the optimum is easily found as in the previous part of this exercise, it is instructive to use Newton's method (as in equation (1.26) on page 25). Write a program to determine the solution by Newton's method, starting with θ^(0) = 0.5.
c) Write a program to determine the solution by scoring (which is the quasi-Newton method given in equation (1.60) on page 44), again starting with θ^(0) = 0.5.
d) Write a program to determine the solution by the EM algorithm, again starting with θ^(0) = 0.5.

particularly simple problem.)1.18. A common application of EM methods is for estimation of the parameters

in a mixture distribution. In a simple instance of such a problem, we haveobservations y1, . . . , yn some of which come from population P1 and therest from population P2. We can think of the data as pairs, (yi, ui), wherethe unobserved ui is either 1 or 2, depending on whether the observationis from the first or the second population.Consider the model in which the proportion ω of the observations are froma N(µ1, σ

21) distribution and 1−ω of the observations are from a N(µ2, σ

22)

distribution.

a) Let ω = 0.7, µ_1 = 0, µ_2 = 2, and σ_1² = σ_2² = 1. Generate 1,000 observations from this mixture distribution. (That means, of course, that there may be more or less than 70% from the first distribution, and the observations themselves are not grouped into 2 classes.)
b) Assume you know that σ_1² = σ_2² = 1. Use the EM method to estimate ω, µ_1, and µ_2.
c) Now assume you do not know σ_1² and σ_2². Use the EM method to estimate ω, µ_1, µ_2, σ_1², and σ_2².

1.19. Consider the simple exponential model for failure times

\[ p(t) = \frac{1}{\theta}\, e^{-t/\theta}, \]


and data from n items that are put on test, and for which only the failure times of the first n_1 items to fail are recorded, t_1, . . . , t_{n_1}. For the remaining n − n_1 items all we know are the times at which they were taken from the test (or “censored”), c_{n_1+1}, . . . , c_n.
a) Show that the log-likelihood is as given in equation (1.65) on page 48, and that its maximum occurs at

\[ \hat{\theta} = \Bigl(\sum_{i=1}^{n_1} t_i + \sum_{i=n_1+1}^{n} c_i\Bigr)\Big/ n_1. \]

Hint: For the term involving the censoring times, think in terms of the probability of the sample, which includes both the observed times and the known censoring times.

b) Formulate the problem in the EM framework, where the sum

\[ \sum_{i=1}^{n_1} t_i + \sum_{i=n_1+1}^{n} c_i \]

is estimated in the E step.

tribution with parameter θ. Consider the random variable formed as aStudent’s t,

T =X − θ√

S2/n, (1.75)

where X is the sample mean and S2 is the sample variance,

1

n − 1

∑(Xi − X)2.

a) For the simple case n = 2, show that the distribution of T is negatively skewed (although the distribution of X is positively skewed). The distribution is quite complicated.

b) Use Monte Carlo samples to explore the skewness of T for larger sample sizes, n = 10, 20, 100. (Note that the denominator in equation (1.75) in general is $\sqrt{S^2/n}$.)

c) Give a heuristic explanation of the negative skewness of T.
The properties illustrated in the exercise relate to the robustness of statistical procedures that use Student's t. While those procedures may be robust to some departures from normality, they are often not robust to skewness. These properties also have relevance to the use of statistics like a Student's t in the bootstrap.

1.21. Assume X_1, . . . , X_n are iid Bernoulli(π). An unbiased estimator of π is $\bar{X}$. Suppose we wish to estimate the variance π(1 − π).
a) Is T = $\bar{X}(1 - \bar{X})$ unbiased for π(1 − π)?
b) Use the delta method to approximate the variance of T.


c) Work out the exact variance of T.
1.22. A function T of a random variable X with distribution parametrized by θ is said to be sufficient for θ if the conditional distribution of X given T(X) does not depend on θ. Discuss (compare and contrast) pivotal and sufficient functions. (Start with the basics: Are they statistics? In what way do they both depend on some universe of discourse, that is, on some family of distributions?)

1.23. Consider again the problem in Exercise 1.20. Use Monte Carlo to study the properties of one-sided confidence intervals for θ, where the confidence intervals are formed as if the pivotal statistic in equation (1.75) had a Student's t distribution with n − 1 degrees of freedom. Specifically,
a) For a sample size of n = 100, form lower 95% confidence intervals for θ, and for the value of θ that you choose for the study, estimate the coverage probability.
b) For a sample size of n = 100, form upper 95% confidence intervals for θ, and for the value of θ that you choose for the study, estimate the coverage probability.

1.24. Use the pivotal value $g(\hat{\beta}, \beta)$ in equation (1.74) on page 54 to form a (1 − α)100% confidence region for β in the usual linear regression model.

1.25. Assume a random sample y_1, . . . , y_n from a normal distribution with mean µ and variance σ². Determine an unbiased estimator of σ based on the sample variance, s². (Note that s² is sufficient and unbiased for σ².)

1.26. The properties of IF listed on page 55 imply some interesting properties of arithmetic in IF.
a) Explain why some numbers in IF cannot have multiplicative inverses. (This is the case in any finite positional representation of numbers.)
b) Let x be the largest number in IF. Use the sum x + 1 − 1 to show that associativity of addition does not hold. (You can make an assumption, any reasonable assumption, about what happens when you add 1 to x.)
c) Let x be the largest number in IF. Use the product 2x/2 to show that associativity of multiplication does not hold.
d) Let x be the largest number in IF. Use the expression ½(x + x) to show that multiplication does not distribute over addition.
e) Let y be the number that is next largest after the number 1 in IF, and let z be the number that is next smallest before the number 1. Which is larger, y − 1 or 1 − z, and why?

1.27. Why are there twice as many distinct values between 1 and 2 that can be represented in IF as there are between 2 and 3 that can be represented in that system? (You do not need to know the details of the representation to answer this. All you need are the given facts that the representation uses a binary base, and that the mantissa uses a fixed number of bits.)

1.28. a) An online algorithm to compute the sample variance is the following.
0. set xbar = s2 = 0
1. for i = 1:n, set xbar = xbar + x[i];


set s2 = s2 + x[i]^2

2. set xbar = xbar/n; set s2 = (s2 - n*xbar^2)/(n-1)

Suppose that for any value, we can carry at most 10 significant digits. Now suppose that the data (the x[i]'s) are of order 10,000, and we have 100,000 observations. What is the approximate number of significant digits of s2?

b) Devise a better online algorithm.
1.29. Consider the problem of computing the product of an n × m matrix A and an m × p matrix B. The (i, j)th element in the product C = AB is given by

\[ C_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj}. \]

a) What is the order of the number of operations required to compute C?
b) If A and B are n × n matrices, what is the order of the number of operations required to compute C?
c) If A is an n × n matrix, what is the order of the number of operations required to compute A^T A?
