
3 Estimation Theory

3.1 Overview

Estimation theory is a branch of statistics with wide application in science and engineering, and in particular forms the backbone of system identification. It deals with estimating an unknown parameter from measurements. The map relating the unknown parameter to the measurement may be described either by a mathematical model or by a probability density function (PDF). Usually the model is assumed to be linear, or the PDF is assumed to be Gaussian, as this simplifies the solution of the estimation problem: the estimate can then be expressed in closed form, is linear in the measurements, and is computationally efficient.

However, there are many practical cases where the above assumptions may not hold. In these cases it is assumed that the model is nonlinear or the PDF is non-Gaussian. A set of measurements, in general, contains a mixture of both "good data" and "bad data" points. Loosely speaking, good data are those which are close to the mean while bad data are located away from the mean. The bad data points are due to faulty instruments, unexpected failure of communication links, intermittent faults, and modeling errors, to name a few. As the bad data points are due to unpredictable causes, it is difficult to characterize them statistically. Bad data can only be defined loosely as data containing errors that are worse than one would normally expect. Gaussian PDFs model random variables which are clustered close to their mean, while a non-Gaussian PDF, such as a Laplacian or Cauchy, models data which contains a mix of good and bad data. A zero-mean unit-variance Gaussian PDF f_y(y) has thin tails and asymptotically approaches exp(−y²/(2σ_y²)). About 68% of values drawn from a Gaussian distribution are within one standard deviation σ_y of the mean; about 95% of the values lie within two standard deviations; and about 99.7% are within three standard deviations. This justifies modeling good data by a Gaussian PDF. A non-Gaussian PDF with a thicker tail, such as a Laplacian or Cauchy, models bad data. A Laplacian PDF asymptotically approaches exp(−√2 |y|/σ_y), while a Cauchy PDF approaches α/(πy²). The thicker the tails, the higher the percentage of the data that migrates towards the tails.

Figure 3.1 shows Gaussian, Laplace, and Cauchy PDFs and the random variables generated by these PDFs. On the left, subfigures (A), (B), and (C) show respectively the Gaussian, the Laplace, and the Cauchy PDFs. The Gauss and Laplace PDFs both have unit mean and unit variance. The Cauchy has unit median and infinite variance. The subfigures (D), (E), and (F) on the right show respectively the Gaussian, Laplace, and Cauchy random variables. The random variables which lie within an interval of [−1, 1] around the unit mean (or median) are assumed to be good data and those lying outside the interval are assumed to be bad data. The lines separating good data from the bad data are indicated in black at y = 0 and y = 2, while the mean is indicated by a line at y = 1.

From Figure 3.1 we can deduce that the Gaussian generates the largest percentage of good data while the Cauchy generates the smallest percentage.



Figure 3.1 Gaussian, Laplace, and Cauchy random variables (panels A–C: Gaussian, Laplacian, and Cauchy PDFs versus y; panels D–F: realizations of the corresponding random variables versus time)

In many practical cases, the PDF of the data may be unknown or partially known. An ad hoc or incorrect choice of PDF may result in very poor performance of the estimator. For example, choosing a Gaussian PDF (assuming the data to be good) when the PDF is Cauchy (data is very bad) will mean the estimator performance will be catastrophic, as the variance of the estimation error will be infinitely large. To deal with this statistical uncertainty, an estimator which gives an adequate performance over a class of PDFs is desirable. Such an estimator is said to be robust. A widely used approach to robust estimation is the so-called "Min-Max" or game-theoretic approach [1, 2]. Nature tries to maximize the covariance of the estimation error by choosing the worst-case PDF, while the engineer minimizes the error by choosing the best estimator. The engineer is thus forced to obtain an estimator by minimizing the estimation error for the worst-case PDF. An example of the worst-case PDF obtained using Min-Max is the "part Gauss-part Laplace" PDF. It is a thin-tailed Gaussian over the region containing good data and a thicker-tailed Laplacian over the bad data region.

The choice of performance measure used to obtain an estimate of the non-random parameter plays an important role. The intuitive and direct approach of choosing the measure to be the variance of the estimation error yields an estimate that is a function of the unknown parameter itself. This is meaningless, as the estimate of an unknown cannot be a function of that unknown. Hence, alternative indirect schemes have been proposed so that the estimators have desirable properties such as unbiasedness and efficiency. An estimator is unbiased if the expectation of the estimate is equal to the true value of the estimated parameter. In other words, if a number of experiments are performed, then the mean value of the estimates approaches the true parameter value. The property of unbiasedness alone is not sufficient to ensure the good performance of an estimator. The variance of the estimator is an important measure. Ideally, the estimator must be unbiased and have the smallest possible variance.

The earliest stimulus for the development of estimation theory was apparently provided by astronomical studies in which the motion of planets and comets was studied using telescopic measurements. To solve this problem, Carl Friedrich Gauss, around 1795, proposed the linear least-squares method, which is still


the most popular approach. Least-squares estimation is a method of fitting the measurements (data or observation values) to a specified linear model. A best fit is obtained by minimizing the sum of the squares of the residuals, where a residual is defined as the error between the measured value (data or observed value) and the value obtained using the model. The linear least-squares estimator is unbiased. However, it is not efficient unless the measurements are independent and identically distributed (i.i.d.) random variables. The least-squares method may perform poorly when the PDF is non-Gaussian.

The question arises as to what is the lowest possible covariance of the estimation error. The answer to this question may be obtained from the Cramer–Rao lower bound, which gives the lowest possible covariance that is achievable by an unbiased estimator. The Cramer–Rao lower bound (CRLB), named in honor of Harald Cramér and Calyampudi Radhakrishna Rao, who were among the first to derive it, expresses a lower bound on the covariance of estimators of a deterministic parameter. The inverse of the lower bound is called the Fisher information. An unbiased estimator which achieves this lower bound is said to be efficient. An estimator which achieves the lowest possible mean squared error among all unbiased methods is therefore the minimum variance unbiased (MVU) estimator. The next question is how to determine an efficient estimator.

In 1922, R. A. Fisher introduced the method termed Maximum Likelihood Estimation (MLE) for the case when the PDF of the measurement is known. The unknown parameter is determined by maximizing the probability that the observed data occurs. It is based on maximizing a likelihood function, which is the PDF of the data expressed as a function of the parameter to be estimated. The maximum likelihood estimator is widely used, as the estimate gives the minimum estimation error covariance and serves as a gold standard for evaluating the performance of other estimators. It has the following properties:

Small sample property: If an efficient estimator exists, it is given by the maximum likelihood estimator.
Large sample properties: The maximum-likelihood estimator possesses a number of attractive and desirable properties. As the sample size increases to infinity, sequences of maximum-likelihood estimators have these properties:

∙ Consistency: the sequence of MLEs converges in probability to the value being estimated.
∙ Asymptotic normality: as the sample size increases, the distribution of the MLE tends to the Gaussian distribution with mean equal to the true value and covariance matrix equal to the inverse of the Fisher information matrix.
∙ Efficiency: the covariance of the estimate attains the Cramer–Rao lower bound as the sample size tends to infinity.

In general, the estimates are implicit nonlinear functions of the parameter to be estimated and the estimate is obtained recursively. If the PDF of the measurement data is Gaussian, the maximum likelihood estimation method simplifies to a weighted least-squares method where the weight is the inverse of the noise covariance matrix.

3.2 Map Relating Measurement and the Parameter

3.2.1 Mathematical Model
Consider the problem of estimating an Mx1 parameter θ = [θ_1 θ_2 θ_3 ⋯ θ_M]^T where the map relating the parameter to the Nx1 measurement y = [y(0) y(1) y(2) ⋯ y(N−1)]^T is given by a linear algebraic model

y = Hθ + v    (3.1)

where v is zero-mean measurement noise with covariance Σ_v and H is an NxM matrix. Figure 3.2 shows the map relating the unknown parameter θ and the measurement y, where v is additive noise.
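To make the setup concrete, the following is a minimal Python sketch (the values chosen for N, M, H, θ, and the noise level are illustrative assumptions, not taken from the text) that generates synthetic measurements from the linear model (3.1):

import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 2                          # number of measurements and of parameters (assumed)
H = rng.standard_normal((N, M))        # known NxM data matrix (assumed)
theta = np.array([1.0, -0.5])          # true Mx1 parameter (assumed)
sigma_v = 0.1                          # noise standard deviation (assumed)
v = sigma_v * rng.standard_normal(N)   # zero-mean measurement noise
y = H @ theta + v                      # measurement vector, Eq. (3.1)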


Figure 3.2 Map relating the unknown parameter and the measurement (the parameter θ is mapped through H and the additive noise v is added to form y)

3.2.2 Probabilistic Model
The measurement y is a random variable with distribution f_y(y) which belongs to a family of PDFs parameterized by θ. In estimation theory the parameter θ is unknown and the objective is to estimate θ from y. To emphasize the probabilistic map relating y and θ, the PDF f_y(y) governing the measurement y is denoted explicitly in terms of both y and θ as:

f_y(y; θ) = f_y(y(0), y(1), y(2), … , y(N − 1); θ)    (3.2)

where f_y(y; θ) is the PDF governing the measurement y parameterized by θ. Examples of typical PDFs are given below.

3.2.2.1 Gaussian PDF: Denoted f_g(y; θ)

The most common example is the Gaussian PDF given by

f_g(y; θ) = (1/√((2π)^N |Σ_v|)) exp{ −(1/2)(y − μ_y)^T Σ_v^{−1}(y − μ_y) }    (3.3)

The mean μ_y = E[y] and the covariance of y, denoted cov(y), are:

E[y] = Hθ    (3.4)

cov(y) = Σ_v    (3.5)

3.2.2.2 Uniform PDF: Denoted f_u(y; θ)

A continuous uniform PDF for scalar y and scalar H = 1:

f_u(y; θ) = 1/(b − a)   for a ≤ y ≤ b
          = 0           for y < a or y > b    (3.6)

The mean and the variance are:

E[y] = μ_y = θ = (b + a)/2    (3.7)

var(y) = σ_y² = (b − a)²/12    (3.8)


Treating the unknown parameter θ as its mean, that is θ = μ_y, the uniform PDF may be expressed in terms of its mean and variance as

f_u(y; θ) = 1/(2σ_y√3)   for −σ_y√3 ≤ y − μ_y ≤ σ_y√3
          = 0            otherwise    (3.9)

The uniform PDF is used to model a random variable y when its variations around its mean are equally probable.

3.2.2.3 Laplacian PDF: Denoted f_e(y; θ)

The Laplace (also called double exponential) distribution is a continuous probability distribution named after Pierre-Simon Laplace; it is called the double exponential because it may be interpreted as two exponential distributions spliced together back-to-back. For the scalar case it takes the form:

f_e(y; θ) = (1/(√2 σ_y)) exp( −√2 |y − μ_y| / σ_y )    (3.10)

The Laplace PDF has thicker tails than the normal PDF and is used to model a measurement y with larger random variations around its mean μ_y = Hθ and variance σ_y².

3.2.2.4 Worst-Case PDF: Part Gauss-Part Laplace, Denoted f_ge(y; θ)

The part Gauss-part Laplace is a worst-case PDF for obtaining a robust estimator, and plays an important role in parameter estimation when the PDF of the measurement is unknown, as will be seen later. The measurement space is divided into two regions Υ_gd and Υ_bd, where Υ_gd is a finite region around the mean μ_y of y and Υ_bd = ℜ − Υ_gd is the rest of the measurement space. Υ_gd and Υ_bd are defined as follows:

Υ_gd = {y : |y − μ_y| ≤ a_gd}    (3.11)

Υ_bd = {y : |y − μ_y| > a_gd}    (3.12)

where a_gd > 0 is a scalar separating the good from the bad data points. The part Gauss-part Laplace PDF f_ge(y; θ) is defined as

f_ge(y; θ) = κ exp{ −(y − μ_y)²/(2σ_ge²) }                              for y ∈ Υ_gd
           = κ exp{ a_gd²/(2σ_ge²) } exp{ −a_gd |y − μ_y| / σ_ge² }      for y ∈ Υ_bd    (3.13)

where μ_y = E[y] = Hθ, and κ and σ_ge² are determined such that (i) the integral of f_ge(y; θ) is unity, ensuring thereby that f_ge(y; θ) is a PDF, and (ii) the constraint on the variance is met. Similar to the Laplacian PDF, the worst-case PDF has exponentially decaying tails.
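As an illustration only, the following minimal sketch evaluates the part Gauss-part Laplace shape of Eq. (3.13) up to the normalizing constant κ; the function name and the parameter values are assumptions made here, and the normalization and variance constraint are not enforced:

import numpy as np

def part_gauss_part_laplace_unnormalized(y, mu_y, sigma_ge, a_gd):
    # Gaussian shape on the good-data region, exponential (Laplacian) tails outside it
    r = np.abs(y - mu_y)
    gauss = np.exp(-(y - mu_y)**2 / (2 * sigma_ge**2))
    laplace = np.exp(a_gd**2 / (2 * sigma_ge**2)) * np.exp(-a_gd * r / sigma_ge**2)
    return np.where(r <= a_gd, gauss, laplace)

y = np.linspace(-10, 10, 5)
print(part_gauss_part_laplace_unnormalized(y, mu_y=0.0, sigma_ge=1.0, a_gd=2.0))

The two branches agree at |y − μ_y| = a_gd, so the sketched density is continuous across the boundary between the good and bad data regions.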


3.2.2.5 Cauchy PDF: Denoted f_c(y; θ)

The Cauchy distribution is a continuous distribution. Its mean does not exist and its variance is infinite. Its median and mode are equal to the location parameter μ_y. The Cauchy has very thick tails compared to both the normal and Laplace PDFs and is used to model a measurement y with very large random variations around its median μ_y.

f_c(y; θ) = α / (π[(y − μ_y)² + α²])    (3.14)

where α is the scale parameter and 1/(πα) is the peak amplitude.

3.2.2.6 Comparisons of the PDFs

Gaussian, Laplacian, part Gauss-part Laplace, Cauchy, and uniform PDFs are shown in Figure 3.3. All the PDFs except the Cauchy have zero mean, and the Cauchy has zero median. The variance of the Gaussian, the Laplacian, and the part Gauss-part Laplace was 30, while the Cauchy has infinite variance. Note that the tails of the Gaussian PDF are the thinnest while those of the Cauchy are the thickest, with the Laplacian and part Gauss-part Laplace thicker than the Gaussian but thinner than the Cauchy. The uniform is flat over the entire displayed range of y.

3.2.3 Likelihood Function
The likelihood function plays a central role in an estimation method termed maximum likelihood estimation. A function closely related to the PDF, called the "likelihood function," serves as the performance measure for maximum likelihood estimation.

Figure 3.3 Gauss, Laplace, part Gauss-part Laplace, Cauchy, and uniform PDFs (plotted versus y)


A likelihood function of the random variable y is the PDF f_y(y; θ) of y expressed as a function of the parameter θ. It may be loosely defined as the conditional PDF of θ given y, with a subtle distinction between the two. In the case of the conditional PDF, denoted f_y(y; θ) for convenience, the parameter θ is fixed and y is a random variable, while in the case of the likelihood function, denoted L(θ|y), the parameter θ is a variable and the random variable y is fixed:

L(θ|y) = f_y(y; θ)    (3.15)

As commonly used PDFs are exponential functions of θ, it is convenient to work with the natural logarithm of the likelihood function, termed the log-likelihood function, rather than the likelihood function itself. The log-likelihood function, denoted l(θ|y), is defined as

l(θ|y) = ln(L(θ|y))    (3.16)

where ln(.) is the natural logarithm. The log-likelihood function l(θ|y) is the cost function used in obtaining the maximum likelihood estimator.
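As a small illustration, the following is a minimal sketch of the log-likelihood of Eq. (3.16) for the special case of i.i.d. scalar Gaussian measurements y(k) = θ + v(k) with known noise variance; this particular form is an assumption used only for this example (it corresponds to the Gaussian case treated later in Example 3.1):

import numpy as np

def gaussian_log_likelihood(theta, y, sigma_v):
    # l(theta|y) = ln L(theta|y) for i.i.d. Gaussian measurements
    y = np.asarray(y, dtype=float)
    N = y.size
    return -0.5 * N * np.log(2 * np.pi * sigma_v**2) \
           - np.sum((y - theta)**2) / (2 * sigma_v**2)

y = [0.9, 1.2, 1.1, 0.8]
print(gaussian_log_likelihood(1.0, y, sigma_v=0.2))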

3.3 Properties of Estimators

3.3.1 Indirect Approach to Estimation
Let us formulate the problem of estimation of the unknown parameter θ from the measurement y using the direct approach of minimizing the obvious performance metric, namely the covariance of the estimation error:

min_θ̂ { E[(θ − θ̂)(θ − θ̂)^T] }    (3.17)

where θ̂ is an estimate of θ and θ − θ̂ is the parameter estimation error. Since the covariance matrix is positive semi-definite, the minimum value of the covariance matrix is zero, and the optimal estimate θ̂ is:

θ̂ = θ    (3.18)

This solution makes no sense as the estimate θ̂ is expressed as a function of the unknown θ. We must seek an estimator which expresses the estimate as a function of the measurement. The estimate must be some function of the measurement:

θ̂ = φ(y)    (3.19)

We will consider two intuitive approaches to obtaining an estimate as a function of the measurement, based on assuming (i) the noise is zero-mean and (ii) the median of the noise is zero.

Noise is zero-mean: Let us now employ an intuitively simple approach to estimate θ using a scalar linear model:

y(k) = Hθ + v(k)    (3.20)

where y(k) is the measurement, H is a known scalar, and v(k) is a zero-mean random variable. Invoking the zero-mean property of the measurement noise, E[v(k)] = 0, the estimate θ̂ satisfies:

E[y(k) − Hθ̂] = 0    (3.21)

Simplifying we get

θ̂ = (1/H) E[y(k)]    (3.22)


Remarks The expression for the estimate θ̂ may be obtained by minimizing the 2-norm of the error y(k) − Hθ̂ given by

min_θ̂ E[(y(k) − Hθ̂)²]    (3.23)

Differentiating the 2-norm E[(y(k) − Hθ̂)²] with respect to θ̂ and setting it to zero yields Eq. (3.21). Let us verify the performance of the estimator by computing its mean E[θ̂]. Substituting for y(k) in Eq. (3.22) using Eq. (3.20) and E[v(k)] = 0, we get:

E[θ̂] = E[Hθ + v]/H = θ + E[v]/H = θ    (3.24)

This shows that the performance of the estimate is ideal. However, the optimal estimate involves computation of an ensemble average of the measurement, E[y(k)]. In practice, as only a finite and single realization of the measurement is available, an estimate of the ensemble average, namely the time average of the measurement, is employed and the estimate θ̂ becomes:

θ̂ = (1/H) ( (1/N) Σ_{k=0}^{N−1} y(k) )    (3.25)

where the time average (1/N) Σ_{k=0}^{N−1} y(k) is an estimate of the ensemble average E[y(k)], and N is the number of measurements. The estimator θ̂ is a linear function of the measurements y(k).

Let us analyze the asymptotic behavior of the estimator (3.25) in terms of its bias and the variance.

3.3.2 Unbiasedness of the Estimator
An estimator θ̂ is an unbiased estimator of θ if

E[θ̂] = θ      if θ is not random
E[θ̂] = E[θ]   if θ is random    (3.26)

Let us now verify the unbiasedness of the estimator θ̂. Substituting for y(k) in Eq. (3.25) using Eq. (3.20) we get

θ̂ = (1/H)( (1/N) Σ_{k=0}^{N−1} Hθ ) + (1/H)( (1/N) Σ_{k=0}^{N−1} v(k) )    (3.27)

Simplifying we get

θ̂ = θ + (1/H)( (1/N) Σ_{k=0}^{N−1} v(k) )    (3.28)

Taking the expectation on both sides and invoking E[v(k)] = 0 yields:

E[θ̂] = θ    (3.29)


This shows that the estimate is unbiased. The definition of an unbiased estimator is given by Eq. (3.26). We will derive a condition for unbiasedness for the class of linear estimators given by

θ̂ = Fy    (3.30)

The following lemma gives the condition.

Lemma 3.1 Consider a linear model y = Hθ + v. A linear estimator θ̂ = Fy is unbiased if and only if

FH = I    (3.31)

Proof: Substituting for y in Eq. (3.30) using y = Hθ + v we get

θ̂ = FHθ + Fv    (3.32)

Taking the expectation on both sides and noting that v is a zero-mean random variable we get

E[θ̂] = FH E[θ] + F E[v] = FH E[θ]

From the definition of unbiasedness, we conclude FH = I. In the scalar example (3.20) the condition (3.31) holds.

3.3.3 Variance of the Estimator: Scalar Case
Let us now examine the variance of the estimator, as unbiasedness alone is not sufficient to deduce its performance. For simplicity we will assume that v(k) is zero-mean white noise with variance σ_v². The variance of the unbiased estimator θ̂ is obtained using Eq. (3.28). Squaring and taking the expectation yields:

E[(θ̂ − θ)²] = (1/H²) E[ ( (1/N) Σ_{k=0}^{N−1} v(k) )² ] = (1/N)(σ_v²/H²)    (3.33)

This shows that the variance of the estimator is a function of the noise-to-signal ratio σ_v²/H² and is inversely proportional to N. The asymptotic behavior of the variance is given by:

lim_{N→∞} E[(θ̂ − θ)²] = 0    (3.34)

The variance is asymptotically zero.
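As a quick numerical check of Eqs. (3.25), (3.29), and (3.33), the following is a minimal Monte Carlo sketch; the values of H, θ, σ_v, N, and the number of trials are assumptions chosen only for illustration:

import numpy as np

rng = np.random.default_rng(1)
H, theta, sigma_v, N = 2.0, 1.5, 0.5, 100       # assumed scalar model values
trials = 20000
y = H * theta + sigma_v * rng.standard_normal((trials, N))
theta_hat = y.mean(axis=1) / H                   # time-average estimator, Eq. (3.25)
print(theta_hat.mean())                          # close to theta: unbiased, Eq. (3.29)
print(theta_hat.var(), sigma_v**2 / (N * H**2))  # sample variance matches Eq. (3.33)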

3.3.4 Median of the Data Samples
The median of a set of sample values is the numerical value that separates the higher half from the lower half of the sample values. In the case of a finite number of samples, the median is determined by


sorting the sample values from the lowest to the highest and picking the middle one. Let y = [y(0) y(1) y(2) ⋯ y(N−1)]^T be N data samples. Sort the data from the lowest to the highest values, x_0 < x_1 < x_2 < x_3 < ⋯ < x_{N−1}, where x_0 = min_k y(k) and x_{N−1} = max_k y(k); {x_i} is termed the order statistics of {y(k)}. Then the median of y, denoted median(y), is given by

median(y) = x_{(N−1)/2}                    N odd
          = (1/2)( x_{N/2−1} + x_{N/2} )   N even    (3.35)

A median is also a central point that minimizes the average of the absolute deviations. A formal definition of a median is that it is the value that minimizes the 1-norm of the estimation error:

min_θ̂ { E|y − Hθ̂| }    (3.36)

Thus, when the noise is zero-mean the estimator is determined from the minimization of the 2-norm (3.23), while for the case of zero median the estimator is obtained from the minimization of the 1-norm (3.36). The question arises as to how to evaluate an estimator. This is addressed in the next section.
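The following is a minimal sketch of the median computation in Eq. (3.35) using the 0-based order statistics defined above; the function name and the sample values are assumptions for illustration:

import numpy as np

def sample_median(y):
    x = np.sort(np.asarray(y, dtype=float))   # order statistics x_0 < x_1 < ... < x_{N-1}
    N = x.size
    if N % 2 == 1:
        return x[(N - 1) // 2]                # middle sample, N odd
    return 0.5 * (x[N // 2 - 1] + x[N // 2])  # average of the two middle samples, N even

y = [3.0, -1.0, 7.0, 2.0, 100.0]              # one bad data point (100.0)
print(sample_median(y), np.median(y))          # both give 3.0; the bad data point has no effect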

3.3.5 Small and Large Sample Properties
Often, the performance of an estimator is evaluated assuming that there are an infinite number of data samples. A theoretical evaluation of the estimation error for small samples is very difficult. The question arises of whether there are measures that can evaluate the performance of an estimator when the sample size is small. The commonly used measures are

∙ unbiasedness
∙ efficiency

The property of unbiasedness alone is not enough to ensure the good performance of an estimator. The variance of the estimator is an important measure. Ideally, the estimator must be unbiased with the smallest possible variance. This topic is addressed next in the section on the Cramer–Rao lower bound. It will be shown later that an unbiased estimator is efficient if its covariance attains the Cramer–Rao lower bound.

3.3.6 Large Sample Properties
In the case of a non-Gaussian PDF, the estimates may not satisfy the desirable properties of unbiasedness and efficiency when the number of data samples is small. However, they may satisfy these properties when the number of samples is infinitely large. The behavior of an estimator for a large sample size is termed asymptotic behavior and is used as a measure of the performance of an estimator. Asymptotic behaviors include consistency, asymptotic unbiasedness, and asymptotic efficiency.

3.3.6.1 Consistent Estimator

In practice, a single estimator is constructed as a function of the number of data samples N, and a sequence of estimators is obtained as the sample size N grows to infinity. If this sequence of estimators converges in probability to the true value, then the estimator is said to be consistent.


Let θ̂_N be an estimator of θ from N data samples y = [y(0) y(1) y(2) ⋯ y(N−1)]^T. Then θ̂_N is a consistent estimator of θ if θ̂_N converges to θ in probability:

lim_{N→∞} P{|θ − θ̂_N| > ε} = 0    (3.37)

for every ε > 0.

3.3.6.2 Asymptotically Unbiased Estimator

An estimator θ̂_N is said to be an asymptotically unbiased estimator of θ if it becomes unbiased as the number of data samples tends to infinity:

lim_{N→∞} E[θ̂_N] = θ      if θ is not random
lim_{N→∞} E[θ̂_N] = E[θ]   if θ is random    (3.38)

3.3.6.3 Asymptotically Efficient Estimator

Ideally, the estimator must be unbiased with the smallest possible variance for a finite number of data samples. In general, this property may not hold for small sample sizes. If the property of efficiency holds for an infinitely large number of samples N, then the estimator is said to be asymptotically efficient. It will be shown later that an estimator is asymptotically efficient if its covariance attains the Cramer–Rao lower bound.

3.4 Cramer–Rao Inequality

Definition An unbiased estimator denoted θ̂_0 is said to be more efficient than any other unbiased estimator θ̂ = φ(y) if the covariance of θ̂_0 is smaller than that of θ̂ = φ(y):

cov(θ̂_0) ≤ cov(θ̂)    (3.39)

where cov(θ̂) = E[(θ − θ̂)(θ − θ̂)^T], and if P and Q are two positive semi-definite matrices then P ≥ Q means that P − Q is positive semi-definite.

The Cramer–Rao lower bound gives the lowest possible covariance that is achievable by an unbiased estimator. The lower bound is equal to the inverse of the Fisher information. An unbiased estimator which achieves this lower bound is said to be efficient. An estimator which achieves the lowest possible mean squared error among all unbiased methods is therefore the minimum variance unbiased (MVU) estimator. We will first consider the scalar case and then extend the result to the vector case. We will assume the following:

∙ The derivation of the Cramer–Rao lower bound assumes weak conditions on the PDF f_y(y; θ) and the estimator θ̂(y).
∙ f_y(y; θ) > 0.
∙ ∂f_y(y; θ)/∂θ exists and is finite.


∙ The PDF f_y(y; θ) is twice differentiable with respect to θ and satisfies the following regularity condition:

E[ ∂l(θ|y)/∂θ ] = 0  for all θ    (3.40)

where ∂l(θ|y)/∂θ is the gradient of l(θ|y) with respect to θ.
∙ The operations of integration with respect to y and differentiation with respect to θ can be interchanged in E[θ̂(y)]:

(∂/∂θ) ∫_{−∞}^{∞} θ̂(y) f_y(y) dy = ∫_{−∞}^{∞} θ̂(y) (∂f_y(y)/∂θ) dy    (3.41)

This condition, that the integration and differentiation can be swapped, holds for all well-behaved PDFs which satisfy the following:

1. f_y(y; θ) has bounded support in y, and the bounds do not depend on θ; or
2. f_y(y; θ) has infinite support, is continuously differentiable, and the integral converges uniformly for all θ.
3. The estimate θ̂ is unbiased, that is E[θ̂] = θ.

Remark The uniform PDF does not satisfy the regularity condition and hence the Cramer–Rao inequality cannot be applied to it.

3.4.1 Scalar Case: θ and θ̂ Scalars while y is an Nx1 Vector

Lemma 3.2 The Cramer–Rao inequality for an unbiased estimator θ̂ is

var(θ̂) ≥ 1/I_F    (3.42)

where var(θ̂) = E[(θ̂ − θ)²] and I_F is the Fisher information given by

I_F = E[ (∂l(θ|y)/∂θ)² ] = −E[ ∂²l(θ|y)/∂θ² ]    (3.43)

∂l(θ|y)/∂θ is the partial derivative of l(θ|y) with respect to θ and is given by

∂l(θ|y)/∂θ = (H/σ_v²) f′_y(y; θ)/f_y(y; θ)    (3.44)

where f′_y(y; θ) = ∂f_y(y; θ)/∂θ, evaluated at the true value of θ. The Fisher information I_F is a constant or a function of θ, but it is not a function of the measurement y.


Corollary 3.2 An estimator θ̂ attains the Cramer–Rao lower bound,

var(θ̂) = 1/I_F    (3.45)

if and only if

∂l(θ|y)/∂θ = I_F (θ̂ − θ)    (3.46)

See the Appendix for the proof.

3.4.2 Vector Case: θ is an Mx1 Vector

Lemma 3.3 The Cramer–Rao inequality for an unbiased estimator θ̂ is

cov(θ̂) ≥ I_F^{−1}    (3.47)

where P ≥ Q is interpreted as P − Q being positive semi-definite; cov(θ̂) = E[(θ̂ − θ)(θ̂ − θ)^T]; and I_F is the MxM Fisher information matrix given by

I_F = E[ (∂l(θ|y)/∂θ)(∂l(θ|y)/∂θ)^T ] = −E[ ∂²l(θ|y)/∂θ² ]    (3.48)

where ∂²l(θ|y)/∂θ² is the Hessian matrix of the log-likelihood function l(θ|y), and

∂l(θ|y)/∂θ = H^T Σ^{−1} f′_y(y; θ)/f_y(y; θ)    (3.49)

where f′_y(y; θ) = ∂f_y(y; θ)/∂θ. The Fisher information matrix I_F is a constant or a function of θ, but it is not a function of the measurement y. The element ij of I_F, denoted [I_F]_ij, is:

[I_F]_ij = E[ (∂l(θ|y)/∂θ_i)(∂l(θ|y)/∂θ_j) ] = −E[ ∂²l(θ|y)/(∂θ_i ∂θ_j) ]    (3.50)

Corollary 3.3 An estimator θ̂ attains the Cramer–Rao lower bound,

cov(θ̂) = I_F^{−1}    (3.51)

if and only if

∂l(θ|y)/∂θ = I_F (θ̂ − θ)    (3.52)

See the Appendix for the proof.

In view of the Cramer–Rao inequality, we may now define an efficient estimator θ̂, given loosely by Eq. (3.39), precisely as follows:


Definition An estimator θ̂ is said to be efficient if its variance attains the Cramer–Rao lower bound (3.45) for the scalar case or Eq. (3.51) for the vector case, or equivalently satisfies the condition (3.46) for the scalar case or Eq. (3.52) for the vector case.

The estimator is asymptotically efficient if it attains the Cramer–Rao lower bound as the number of data samples approaches infinity.

3.4.3 Illustrative Examples: Cramer–Rao Inequality

Example 3.1 i.i.d. Gaussian PDF
Single data sample: Let y be a measurement generated by the mathematical model

y = θ + v    (3.53)

The probabilistic model is given by the PDF

f_g(y; θ) = (1/√(2π σ_v²)) exp{ −(y − θ)²/(2σ_v²) }    (3.54)

The log-likelihood function l(θ|y) is

l(θ|y) = −(1/2) ln(2π σ_v²) − (y − θ)²/(2σ_v²)    (3.55)

Taking the partial derivative with respect to θ yields

∂l(θ|y)/∂θ = (1/σ_v²)(y − θ)    (3.56)

Comparing the right-hand sides of Eqs. (3.56) and (3.46) we deduce the following:

θ̂ = y    (3.57)

The Fisher information is constant:

I_F(θ) = 1/σ_v²    (3.58)

It is the reciprocal of the variance of the measurement noise σ_v².

N data samples: Let y = [y(0) y(1) y(2) ⋯ y(N−1)]^T be an Nx1 measurement where y(k), k = 0, 1, 2, …, N−1 are independent and identically distributed Gaussian random variables. The mathematical model that generates the measurements is:

y(k) = θ + v(k),  k = 0, 1, 2, …, N − 1    (3.59)

Equivalently, the probabilistic model is given by the PDF

f_g(y; θ) = (1/(2π σ_v²)^{N/2}) exp{ −(1/(2σ_v²)) Σ_{k=0}^{N−1} (y(k) − θ)² }    (3.60)


The log-likelihood function l(θ|y) becomes

l(θ|y) = −(N/2) ln(2π σ_v²) − (1/(2σ_v²)) Σ_{k=0}^{N−1} (y(k) − θ)²    (3.61)

Taking the partial derivative yields

∂l(θ|y)/∂θ = (1/σ_v²) Σ_{k=0}^{N−1} (y(k) − θ) = (N/σ_v²)( (1/N) Σ_{k=0}^{N−1} y(k) − θ )    (3.62)

Let us verify whether an efficient estimator exists using Eq. (3.46):

∂l(θ|y)/∂θ = (N/σ_v²)( (1/N) Σ_{k=0}^{N−1} y(k) − θ )    (3.63)

Comparing the right-hand side of Eq. (3.46) with that of Eq. (3.63) we deduce the following:

θ̂ = (1/N) Σ_{k=0}^{N−1} y(k)    (3.64)

Since {y(k)} is i.i.d., the Fisher information from Eq. (3.62) is:

I_F = E[ (∂l(θ|y)/∂θ)² ] = N/σ_v²    (3.65)

The Fisher information is a constant, N times the reciprocal of the noise variance σ_v²:

I_F = N/σ_v²    (3.66)

Properties of the estimator: The estimator for the Gaussian PDF is the mean. Let us verify whether θ̂ is unbiased. Taking the expectation of Eq. (3.64) we get:

E[θ̂] = (1/N) Σ_{k=0}^{N−1} E[y(k)]    (3.67)

Using Eq. (3.59) we get

E[θ̂] = (1/N) Σ_{k=0}^{N−1} E[y(k)] = (1/N) Σ_{k=0}^{N−1} θ + (1/N) Σ_{k=0}^{N−1} E[v(k)] = θ    (3.68)

Since E[θ̂] = θ, the estimator θ̂ is unbiased. The estimator θ̂ is efficient as it satisfies the Cramer–Rao equality condition (3.46). The estimator is unbiased and efficient for any sample size N.


Example 3.2 i.i.d. Laplacian PDF
Single data sample: The mathematical and the probabilistic models are given by Eqs. (3.53) and (3.10). We will consider a simplified PDF given by

f_e(y; θ) = (1/√(2σ_v²)) exp( −(√2/σ_v) |y − θ| )    (3.69)

The log-likelihood function l(θ|y) is

l(θ|y) = −(1/2) ln(2σ_v²) − (√2/σ_v) |y − θ|    (3.70)

Expanding the definition of |y − θ|, we get

l(θ|y) = −(1/2) ln(2σ_v²) − (√2/σ_v)(y − θ)   for y > θ
       = −(1/2) ln(2σ_v²)                      for y = θ
       = −(1/2) ln(2σ_v²) + (√2/σ_v)(y − θ)   for y < θ    (3.71)

It is differentiable everywhere except at y = θ. Taking the partial derivative yields

∂l(θ|y)/∂θ = (√2/σ_v) sign(y − θ) = √2/σ_v for y > θ;  0 for y = θ;  −√2/σ_v for y < θ    (3.72)

The Fisher information is:

I_F = E[ (∂l(θ|y)/∂θ)² ] = 2/σ_v²    (3.73)

The Cramer–Rao inequality is

var(θ̂) ≥ σ_v²/2    (3.74)

N data samples: Let y = [y(0) y(1) y(2) ⋯ y(N−1)]^T be an Nx1 measurement where y(k), k = 0, 1, 2, …, N−1 are independent and identically distributed Laplacian random variables. The mathematical model that generates the measurements is given by Eq. (3.59). Equivalently, the probabilistic model is:

f_e(y; θ) = (1/√(2σ_v²))^N exp( −(√2/σ_v) Σ_{k=0}^{N−1} |y(k) − θ| )    (3.75)


Taking the natural logarithm, the log-likelihood function l(θ|y) becomes

l(θ|y) = −(N/2) ln(2σ_v²) − (√2/σ_v) Σ_{k=0}^{N−1} |y(k) − θ|    (3.76)

Taking the partial derivative yields

∂l(θ|y)/∂θ = (√2/σ_v) Σ_{k=0}^{N−1} sign(y(k) − θ)    (3.77)

The partial derivative of the log-likelihood function is nonlinear: each term is a positive constant for y(k) − θ > 0 and a negative constant for y(k) − θ < 0. This type of nonlinear function is termed a hard limiter. The Fisher information is:

I_F = E[ (∂l(θ|y)/∂θ)² ] = 2N/σ_v²    (3.78)

The Cramer–Rao inequality is

var(θ̂) ≥ σ_v²/(2N)    (3.79)

The properties of the estimator are deferred to the next section on maximum likelihood estimation.

Example 3.3 Part Gauss-part Laplace PDF
Consider the PDF given by Eq. (3.13). Assuming μ_y = θ, the PDF becomes

f_ge(y; θ) = κ exp{ −(y − θ)²/(2σ_ge²) }                              for |y − θ| ≤ a_gd
           = κ exp{ a_gd²/(2σ_ge²) } exp{ −a_gd |y − θ| / σ_ge² }      for |y − θ| > a_gd    (3.80)

From the Appendix, the expression for the derivative of the log-likelihood function is

∂l(θ|y)/∂θ = (y − θ)/σ_ge²               for |y − θ| ≤ a_gd
           = (a_gd/σ_ge²) sign(y − θ)     for |y − θ| > a_gd    (3.81)

The derivative of the log-likelihood function is nonlinear: it is linear over Υ_gd, a positive constant for y − θ > a_gd, and a negative constant for y − θ < −a_gd. This type of nonlinear function is termed a soft limiter.
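For illustration, the following is a minimal sketch of the soft limiter of Eq. (3.81); the function name and the example values are assumptions made here:

import numpy as np

def soft_limiter(y, theta, a_gd, sigma_ge):
    # Derivative of the log-likelihood, Eq. (3.81): linear on the good-data
    # region |y - theta| <= a_gd, saturated (constant) outside it
    r = y - theta
    return np.where(np.abs(r) <= a_gd,
                    r / sigma_ge**2,
                    (a_gd / sigma_ge**2) * np.sign(r))

y = np.array([-5.0, -1.0, 0.5, 1.0, 5.0])
print(soft_limiter(y, theta=0.0, a_gd=2.0, sigma_ge=1.0))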

The Fisher information is:

I_F = E[ (∂l(θ|y)/∂θ)² ] = (1/σ_ge⁴)( σ_gd² + a_gd²(1 − λ_gd) )    (3.82)


where σ_gd² = ∫_{θ−a_gd}^{θ+a_gd} (y − θ)² f_y(y) dy is called the "partial variance" and λ_gd = ∫_{θ−a_gd}^{θ+a_gd} f_y(y) dy is called the "partial probability" of y over the region Υ_gd. The partial variance and partial probability are respectively the variance and the probability of the random variable restricted to a finite region around the mean, which in this case is the good data region.

Remarks The part Gauss-part Laplace PDF is the worst-case PDF in the sense that it minimizes the Fisher information over the class of all continuous PDFs constrained by the partial variance:

f_ge(y; θ) = arg min_{f_y(y)} { I_F }   such that   ∫_{θ−a_gd}^{θ+a_gd} (y − θ)² f_y(y) dy ≤ σ_gd²    (3.83)

One might have expected the worst-case PDF f_ge(y; θ) to have tails thicker than those of the Cauchy, that is, to generate more bad data than the Cauchy PDF.

Example 3.4 i.i.d. Cauchy PDF
Single data sample: The mathematical and the probabilistic models are given by Eqs. (3.53) and (3.14) respectively. We will consider a simplified PDF given by

f_c(y; θ) = α / (π[(y − θ)² + α²])    (3.84)

where α is the scale parameter and θ is the median (or the mode or location parameter) of the PDF. The log-likelihood function l(θ|y) becomes

l(θ|y) = ln(α) − ln(π) − ln[(y − θ)² + α²]    (3.85)

The partial derivative of l(θ|y) is:

∂l(θ|y)/∂θ = 2(y − θ) / [(y − θ)² + α²]    (3.86)

The derivative of the log-likelihood function is nonlinear; it asymptotically decays as 1/|y|. The Fisher information, using the Appendix, yields:

I_F = E[ (∂l(θ|y)/∂θ)² ] = 1/(2α²)    (3.87)

N data samples: Let y = [y(0) y(1) y(2) ⋯ y(N−1)]^T be an Nx1 measurement where y(k), k = 0, 1, 2, …, N−1 are independent and identically distributed Cauchy random variables. The mathematical model that generates the measurements is given by Eq. (3.59). Equivalently, the probabilistic model is:

f_c(y; θ) = Π_{k=0}^{N−1} ( α / (π[(y(k) − θ)² + α²]) )    (3.88)


Taking the natural logarithm, the log-likelihood l(θ|y) is:

l(θ|y) = N ln(α) − N ln(π) − Σ_{k=0}^{N−1} ln[(y(k) − θ)² + α²]    (3.89)

Taking the partial derivative yields

∂l(θ|y)/∂θ = Σ_{k=0}^{N−1} ( 2(y(k) − θ) / [(y(k) − θ)² + α²] )    (3.90)

The Fisher information, using the Appendix, becomes

I_F = E[ (∂l(θ|y)/∂θ)² ] = E[ ( Σ_{k=0}^{N−1} 2(y(k) − θ) / [(y(k) − θ)² + α²] )² ]    (3.91)

Using Eqs. (3.86) and (3.87) we get

I_F = E[ (∂ ln f_y(y; θ)/∂θ)² ] = N/(2α²)    (3.92)

The properties of the estimator are deferred to the next section on maximum likelihood estimation.

Remarks Comparing Eqs. (3.58) and (3.66); (3.73) and (3.78); and (3.87) and (3.92), we deduce that the Fisher information for N independent and identically distributed data samples y = [y(0) y(1) y(2) ⋯ y(N−1)]^T is N times the Fisher information for a single data sample y = θ + v.

The performance of an estimator depends upon the likelihood function. The log-likelihood function l(θ|y) and its partial derivative ∂l(θ|y)/∂θ are shown in Figure 3.4. The top figures show the PDFs, the middle figures show the log-likelihood functions, and the bottom figures show the negatives of the partial derivative of the log-likelihood function for the Gaussian, the Laplacian, the part Gauss-part Laplace, and the Cauchy PDFs.

Example 3.5 Multivariate Gaussian PDF
Consider the problem of estimating the Mx1 parameter θ = [θ_1 θ_2 θ_3 ⋯ θ_M]^T where the linear mathematical model and the equivalent probabilistic model are:

y = Hθ + v    (3.93)

where v is zero-mean Gaussian measurement noise with covariance Σ_v and H is an NxM matrix. The probabilistic model is

f_g(y; θ) = (1/√((2π)^N |Σ_v|)) exp{ −(1/2)(y − Hθ)^T Σ_v^{−1} (y − Hθ) }    (3.94)


Figure 3.4 Log-likelihood function and its partial derivatives (columns: Gaussian, Laplace, part Gauss-part Laplace, and Cauchy; rows: PDF, log-likelihood, and negative of the partial derivative, all plotted versus y)

Taking the natural logarithm, the log-likelihood function l(θ|y) becomes

l(θ|y) = −(N/2) ln(2π) − (1/2) ln|Σ_v| − (1/2)(y − Hθ)^T Σ_v^{−1} (y − Hθ)    (3.95)

Expanding the right-hand side and considering only the terms in θ, the gradient vector becomes:

∂l(θ|y)/∂θ = −(1/2) (∂/∂θ)( y^T Σ_v^{−1} y − θ^T H^T Σ_v^{−1} y − y^T Σ_v^{−1} Hθ + θ^T H^T Σ_v^{−1} Hθ )    (3.96)

The vector calculus for any Mx1 vectors p and s and symmetric MxM matrix R is given by:

∂(θ^T p)/∂θ = p ;  ∂(s^T θ)/∂θ = s ;  ∂(θ^T Rθ)/∂θ = 2Rθ    (3.97)

Using the vector calculus, the expression for the gradient vector becomes

∂l(θ|y)/∂θ = H^T Σ_v^{−1} y − H^T Σ_v^{−1} Hθ = H^T Σ_v^{−1} (y − Hθ)    (3.98)

The Fisher information matrix I_F given by Eq. (3.48) becomes

I_F = E[ (∂l(θ|y)/∂θ)(∂l(θ|y)/∂θ)^T ] = H^T Σ_v^{−1} E[(y − Hθ)(y − Hθ)^T] Σ_v^{−1} H    (3.99)


From the model (3.93) we get

E[(y − Hθ)(y − Hθ)^T] = E[vv^T] = Σ_v    (3.100)

Using (3.100), the expression (3.99) for I_F becomes

I_F = H^T Σ_v^{−1} H    (3.101)

Let us verify whether an efficient estimator exists. Using Eqs. (3.98) and (3.101), the condition for the Cramer–Rao equality (3.52) becomes

H^T Σ_v^{−1} (y − Hθ) = H^T Σ_v^{−1} H (θ̂ − θ)    (3.102)

Equating both sides yields

H^T Σ_v^{−1} H θ̂ = H^T Σ_v^{−1} y    (3.103)

The efficient estimator exists and is given by

θ̂ = (H^T Σ_v^{−1} H)^{−1} H^T Σ_v^{−1} y    (3.104)

Remarks The log-likelihood function l(θ|y) for the various PDFs is characterized as follows:
Gaussian: l(θ|y) is quadratic in y − μ_y and its partial derivative ∂l(θ|y)/∂θ is linear in y.
Laplacian: l(θ|y) is the absolute value of y − μ_y and ∂l(θ|y)/∂θ is a hard limiter.
Part Gauss-part Laplace: l(θ|y) is part quadratic in y − μ_y for y − μ_y ∈ Υ_gd and part absolute value for y − μ_y ∈ Υ_bd; ∂l(θ|y)/∂θ is a soft limiter, that is, it is part linear in y − μ_y for y − μ_y ∈ Υ_gd and part constant for y − μ_y ∈ Υ_bd.

Properties of the estimator: The estimator for the multivariate Gaussian PDF is the weighted least-squares estimate (3.104). Let us verify whether θ̂ is unbiased. Taking the expectation of Eq. (3.104) we get:

E[θ̂] = (H^T Σ_v^{−1} H)^{−1} H^T Σ_v^{−1} E[y]    (3.105)

The expression for E[y] using the linear model (3.93) is:

E[y] = H E[θ] + E[v]    (3.106)

Since E[v] = 0 and θ is deterministic we get:

E[y] = Hθ    (3.107)

Substituting for E[y] in (3.105) we get:

E[θ̂] = (H^T Σ_v^{−1} H)^{−1} H^T Σ_v^{−1} Hθ = θ    (3.108)

Since E[θ̂] = θ, θ̂ is unbiased. It is efficient as θ̂ satisfies the Cramer–Rao equality condition Eq. (3.52).
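The following is a minimal numerical sketch of the efficient estimator (3.104); the matrices and parameter values are assumptions for illustration, and np.linalg.solve is used rather than forming the inverse explicitly:

import numpy as np

rng = np.random.default_rng(2)
N, M = 50, 2
H = rng.standard_normal((N, M))                    # assumed NxM data matrix
theta = np.array([2.0, -1.0])                      # assumed true parameter
var_v = rng.uniform(0.1, 1.0, N)                   # assumed noise variances (diagonal Sigma_v)
y = H @ theta + np.sqrt(var_v) * rng.standard_normal(N)
Si = np.diag(1.0 / var_v)                          # Sigma_v^{-1}
theta_hat = np.linalg.solve(H.T @ Si @ H, H.T @ Si @ y)   # Eq. (3.104)
print(theta_hat)   # close to theta; its covariance is (H^T Sigma_v^{-1} H)^{-1}, Eq. (3.101)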


Figure 3.5 PDFs of different variances: small, medium, and large (zero-mean Gaussian PDFs of increasing variance plotted versus y)

3.4.4 Fisher Information
The quantity in the denominator of the Cramer–Rao lower bound of the inequality (3.42) is called the Fisher information, named after the famous twentieth-century statistician Ronald Fisher. The Fisher information depends upon the rate of change of the PDF of y with respect to the parameter θ. The greater the expectation of the rate of change at a given value of θ, the larger will be the Fisher information: the greater the rate of change, the easier it will be to distinguish θ from its neighboring values. The largest rate of change (infinite rate of change) occurs when the PDF is a delta function located at y = θ, while the smallest rate of change (zero rate of change) occurs when the PDF has very large variance. In general, if the PDF has a peak at θ, the Fisher information will be large, and it will be small if the PDF is flat around θ. Figure 3.5 shows zero-mean (θ = 0) Gaussian PDFs with small, medium, and very large variances. It can be deduced that the smaller the variance, the greater is the rate of change of the PDF in the neighborhood of θ = 0, implying a larger Fisher information; the larger the variance, the smaller the rate of change of the PDF, implying a smaller Fisher information.

3.4.4.1 Information Content

The Fisher information is a measure of the amount of information about the parameter θ contained in the measurement y. The Fisher information is non-negative and satisfies the additive property of information measures: if there are N independent and identically distributed measurements containing information about the unknown parameter, then the Fisher information is N times the Fisher information when there is a single measurement. See Example 3.1 and Example 3.2.

Lemma 3.4 Let y(k), k = 0, 1, 2, …, N − 1 be N i.i.d. random variables containing information about θ, and let the PDFs f_y(y(k); θ), k = 0, 1, 2, …, N − 1 satisfy the regularity condition (3.40). Then

I_F^N = N I_F    (3.109)


where I_F^N = E[(∂l(θ|y)/∂θ)²] and I_F = E[(∂l(θ|y(k))/∂θ)²] are respectively the Fisher information when the N measurements y = [y(0) y(1) y(2) ⋯ y(N−1)]^T are used and when one measurement y(k) is used.

Proof: The PDF of y is f_y(y; θ). Since the {y(k)} are i.i.d. we get

l(θ|y) = Σ_{k=0}^{N−1} ln(f_y(y(k); θ))    (3.110)

Invoking the statistical independence of the measurements and using the Appendix, I_F^N is:

I_F^N = Σ_{k=0}^{N−1} E[ (∂l(θ|y(k))/∂θ)² ]    (3.111)

Since the {y(k)} are i.i.d. random variables, I_F is the same for all y(k):

I_F = E[ (∂l(θ|y(k))/∂θ)² ]   for all k    (3.112)

Hence Eq. (3.109) holds.

Remarks Thanks to the additive property (3.109), the larger the number of i.i.d. data samples, the higher is the Fisher information and, as a result, the lower is the Cramer–Rao lower bound.

Example 3.1 and Example 3.2 illustrate the additive property of the Fisher information.

3.5 Maximum Likelihood Estimation
There are two generally accepted methods of parameter estimation, namely least-squares estimation and maximum likelihood estimation. The maximum likelihood or ML estimator is widely used as the estimate gives the minimum estimation error covariance and serves as a gold standard for evaluating the performance of other estimators. It is efficient, as it achieves the Cramer–Rao lower bound whenever an efficient estimator exists. It is based on maximizing a likelihood function, the PDF of the data expressed as a function of the parameter to be estimated. In general the estimates are implicit nonlinear functions of the parameter to be estimated and the estimate is obtained recursively. If the PDF of the measurement data is Gaussian, the maximum likelihood estimation method simplifies to a weighted least-squares method where the weight is the inverse of the noise covariance matrix.

3.5.1 Formulation of Maximum Likelihood Estimation
Consider the problem of estimating a non-random parameter θ from y = Hθ + v. The optimal estimate θ̂ is obtained by maximizing the likelihood function L(θ|y) of the random variable y, which is essentially the PDF of y expressed as a function of the parameter to be estimated, θ. It is the likelihood that the measurement y is characterized by the parameter θ. The Maximum Likelihood (ML) estimate is obtained


by maximizing the likelihood function. As commonly used PDFs are exponential functions of θ, the log-likelihood function l(θ|y) = ln f_y(y; θ), instead of the likelihood function L(θ|y), is commonly employed. The ML estimate is obtained by maximizing the log-likelihood function

θ̂ = arg max_θ { l(θ|y) }    (3.113)

It is obtained by setting the partial derivatives of l(θ|y) with respect to θ to zero:

∂l(θ|y)/∂θ = 0    (3.114)

Lemma 3.5 If the measurement model is (i) linear, y = Hθ + v, (ii) v is a zero-mean multivariate Gaussian random variable with cov(v) = Σ_v, and (iii) the data matrix H has full rank and is constant, then the maximum likelihood estimator is unbiased and efficient and is given by

θ̂ = (H^T Σ_v^{−1} H)^{−1} H^T Σ_v^{−1} y    (3.115)

Proof: The PDF of y is

f_g(y; θ) = (1/√((2π)^N det(Σ_v))) exp{ −(1/2)(y − Hθ)^T Σ_v^{−1} (y − Hθ) }    (3.116)

The log-likelihood function l(θ|y) is

l(θ|y) = −(1/2)(y − Hθ)^T Σ_v^{−1} (y − Hθ) − (N/2) ln 2π − (1/2) ln det Σ_v    (3.117)

The ML estimate is obtained from maximizing the log-likelihood function (3.113):

θ̂ = arg max_θ { −(1/2)(y − Hθ)^T Σ_v^{−1} (y − Hθ) − (N/2) ln 2π − (1/2) ln det Σ_v }    (3.118)

Equivalently, differentiating l(θ|y) with respect to θ and setting it to zero, the ML estimate θ̂(y) satisfies

∂l(θ|y)/∂θ = H^T Σ_v^{−1} y − H^T Σ_v^{−1} Hθ = 0    (3.119)

The ML estimator θ̂ is that value of θ that satisfies the above equation:

θ̂ = (H^T Σ_v^{−1} H)^{−1} H^T Σ_v^{−1} y    (3.120)

Note: The maximum likelihood estimator is identical to that obtained using the Cramer–Rao equality condition (3.52). The maximum likelihood estimator is both unbiased and efficient. From Eq. (3.46) we deduce that the estimate θ̂ is efficient with Fisher information

I_F = H^T Σ_v^{−1} H    (3.121)


Remarks If the measurement model is linear and the measurement noise is Gaussian, then the maximum likelihood estimator is linear in the measurement, unbiased, and efficient.

When the model is linear and the noise is Gaussian, the cost function for the ML estimator is the log-likelihood function given by Eq. (3.117). For estimating θ, only the first term on the right-hand side is considered in the optimization, as the remaining two terms are not functions of θ. The cost function reduces to the following minimization problem after suppressing the negative sign and the factor 2 in the denominator:

θ̂ = arg min_θ { (y − Hθ)^T Σ_v^{−1} (y − Hθ) }    (3.122)

In statistics, the term y − Hθ is called the residual: the residual is the error between the measurement and its estimate using the model that relates the measurement y and θ. The estimation scheme given by Eq. (3.122) is termed the weighted least-squares method.

The ML estimator for the case when the model is linear and the noise is zero-mean Gaussian is identical to the weighted least-squares estimator.

3.5.2 Illustrative Examples: Maximum Likelihood Estimation of Mean or Median

Example 3.6 i.i.d. Gaussian PDF
Consider the scalar version of Example 3.5 with H = [1 1 1 ⋯ 1]^T and Σ_v = Iσ_v². The unknown parameter θ is a scalar with y = [y(0) y(1) y(2) ⋯ y(N−1)]^T:

y(k) = θ + v(k),  k = 0, 1, 2, 3, …, N − 1    (3.123)

We will consider two cases: (i) v(k) is i.i.d. zero-mean Gaussian noise; (ii) v(k) is zero-mean independent Gaussian noise but not identically distributed.

(i) v is independent and identically distributed zero-mean Gaussian noise with covariance Σ_v = Iσ_v². Substituting Σ_v = Iσ_v², the log-likelihood function (3.117) reduces to

l(θ|y) = −(1/(2σ_v²)) Σ_{k=0}^{N−1} (y(k) − θ)² − (N/2) ln 2π − N ln σ_v    (3.124)

The expression for the estimate θ̂ given by (3.118) becomes:

θ̂ = arg max_θ { −(1/(2σ_v²)) Σ_{k=0}^{N−1} (y(k) − θ)² − (N/2) ln 2π − N ln σ_v }    (3.125)

Differentiating and setting to zero yields:

∂l(θ|y)/∂θ = (1/σ_v²) Σ_{k=0}^{N−1} (y(k) − θ) = 0    (3.126)


Or, substituting Σ_v = Iσ_v² in Eq. (3.120), the estimate θ̂ becomes

θ̂ = (1/N) Σ_{k=0}^{N−1} y(k)    (3.127)

The Fisher information (3.121) becomes

I_F(θ) = H^T Σ_v^{−1} H = N/σ_v²    (3.128)

The variance of the estimation error is

E[(θ − θ̂)²] = I_F^{−1} = (H^T Σ_v^{−1} H)^{−1} = (H^T H)^{−1} σ_v² = σ_v²/N    (3.129)

(ii) v is zero-mean independent but not identically distributed Gaussian noise with covariance Σ_v = diag(σ_v0², σ_v1², …, σ_v,N−1²). The estimate θ̂ may easily be deduced:

θ̂ = (H^T Σ_v^{−1} H)^{−1} H^T Σ_v^{−1} y = ( Σ_{i=0}^{N−1} 1/σ_vi² )^{−1} Σ_{i=0}^{N−1} y(i)/σ_vi²    (3.130)

The Fisher information, whose inverse is the variance of the estimation error, is:

I_F = H^T Σ_v^{−1} H = Σ_{i=0}^{N−1} 1/σ_vi²    (3.131)

The estimator is unbiased and efficient.
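The following is a minimal sketch of the weighted average in Eq. (3.130) for independent but not identically distributed noise; the per-sample variances and the true parameter are assumptions chosen for illustration:

import numpy as np

rng = np.random.default_rng(3)
theta = 4.0
var_i = np.array([0.1, 0.5, 2.0, 0.2, 1.0])        # assumed noise variances sigma_vi^2
y = theta + np.sqrt(var_i) * rng.standard_normal(var_i.size)
w = 1.0 / var_i
theta_hat = np.sum(w * y) / np.sum(w)              # Eq. (3.130)
print(theta_hat, 1.0 / np.sum(w))                  # estimate and inverse Fisher information, Eq. (3.131)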

Example 3.7 i.i.d. Laplacian PDF
Consider the linear measurement model (3.123), where v(k) is i.i.d. zero-mean random noise with the Laplacian PDF given by Eq. (3.75):

f_e(y; θ) = (1/√(2σ_v²))^N exp( −(√2/σ_v) Σ_{k=0}^{N−1} |y(k) − θ| )    (3.132)

The log-likelihood function l(θ|y) is

l(θ|y) = −(N/2) ln(2σ_v²) − (√2/σ_v) Σ_{k=0}^{N−1} |y(k) − θ|    (3.133)

The ML estimate is obtained from maximizing the log-likelihood function (3.113):

θ̂ = arg max_θ { −(N/2) ln(2σ_v²) − (√2/σ_v) Σ_{k=0}^{N−1} |y(k) − θ| }    (3.134)

Page 27: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

Estimation Theory 143

Equivalently, differentiating ln fy(y) with respect to 𝜽 and setting to zero the ML estimate is obtained.

The estimator �� is obtained by setting Eq. (3.77) to zero:

𝛿l(𝜃|y)𝛿𝜃

=√

2𝜎v

N−1∑k=0

sign (y(k) − 𝜃) = 0 (3.135)

The estimator �� is that value of 𝜃 that solves the above equation. The ML estimate �� is that value 𝜃 thatmust satisfy Eq. (3.135), making the summation of all the signs of the terms (y(k) − 𝜃) equal zero. If wechoose the estimate to be the median of all the sample values, the number of positive and negative signsof the terms (y(k) − 𝜃) will be equal, ensuring the sum of the signs is zero. Thus it may be deduced thatthe maximum likelihood estimator is the median of y =

[y(0) y(1) y(2) . y(N − 1)

]T:

�� = median([ y(0) y(1) y(2) . y(N − 1) ]T ) (3.136)

The median is determined by sorting the sample values from the lowest to highest values and picking themiddle one so that the measurement data is split into two equal parts: in one part the data samples areall larger and in the other the data samples are all smaller than the median. Hence sign

(y(k) − ��

), k =

0, i, 2,… , N − 1 will have an equal number of 1’s and −1’s with the result

N−1∑k=0

sign(y(k) − ��

)= 0 (3.137)

Hence √2

𝜎v

N−1∑k=0

sign(y(k) − ��

)= 0 (3.138)

Consider the expression of the estimator (3.136). Expressing the median in terms of 𝜃 and v using themodel (3.123) we get

�� = 𝜃 + median([ v(0) v(1) v(2) . v(N − 1) ]T ) (3.139)

Using the definition of the median (3.35)

�� =⎧⎪⎨⎪⎩

𝜃 + v(𝓁) N odd

𝜃 + 12

(v(N∕2) + v(N∕2 + 1)) N even(3.140)

where the index 𝓁 separates the lower half from the upper half while the index m is the center that dividesthe upper and the lower halves. Taking expectation on both sides, using the definition of the median(3.35), and since v(k) is zero-mean, yields

E[

��]= 𝜃 (3.141)

Thus the median is an unbiased estimator.It is not possible to show that the median satisfies the Cramer–Rao equality condition (3.46). It can be

shown, however, that it asymptotically efficient.

Page 28: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

144 Identification of Physical Systems

Example 3.8 i.i.d. part Gauss-part Laplacian PDFThe expression for the part Gauss-part Laplacian PDF given by Eq. (3.80) becomes:

fge (y; 𝜃) =

⎧⎪⎪⎨⎪⎪⎩𝜅N exp

{− 1

2𝜎2v

N−1∑k=0

(y(k) − 𝜃)2

} |y(k) − 𝜃| ≤ agd

𝜅N exp

{Na2

gd

2𝜎2v

}exp

{−

agd

2𝜎2v

N−1∑k=0

|y(k) − 𝜃|} |y(k) − 𝜃| > agd

(3.142)

The log-likelihood function l (𝜽|y) is

l (𝜽|y) =

⎧⎪⎪⎨⎪⎪⎩ln

(𝜅N

)− 1

2𝜎2y

N−1∑k=0

(y(k) − 𝜃)2 |y(k) − 𝜃| ≤ agd

ln(

𝜅N)+

Na2gd

2𝜎2y

−agd

2𝜎2y

N−1∑k=0

|y(k) − 𝜃| |y(k) − 𝜃| > agd

(3.143)

The ML estimate is obtained from maximizing the log-likelihood

�� = argmax𝜽

{l (𝜽|y)} (3.144)

Equivalently the ML estimate is determined by finding the roots of the derivative of the log-likelihoodfunction

𝛿l(𝜃|y)𝛿𝜃

= 0 (3.145)

In this case there is no closed form solution. The estimate is computed recursively using Eq. (3.145).The maximum likelihood estimator although inefficient for finite samples, is asymptotically efficient.

Remark A robust estimator is obtained using the maximum likelihood method when the PDF isunknown using the Min-Max approach. A ML estimator is sought which minimizes the log-likelihoodfunction for the worst-case PDF, namely fge(y; 𝜃). It is interesting that the worst-case PDF is asymptot-ically Laplacian with exponentially decaying tails. Intuitively, one would have expected the worst-casePDF to have tails thicker than that of Cauchy. That is, the worst-case PDF is expected to generate morebad data than Cauchy.

Example 3.9 i.i.d. Cauchy PDFConsider the linear measurement model (3.123) with v(k) is i.i.d. zero-mean random noise with CauchyPDF. The log-likelihood function is given by Eq. (3.89). The ML estimate is obtained from maximizingthe log-likelihood function (3.113):

�� = argmax𝜽

{N ln(𝛼) − N ln(𝜋) −

N−1∑k=0

ln(

𝜋[(y(k) − 𝜃)2 + 𝛼2])}

(3.146)

Page 29: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

Estimation Theory 145

Equivalently the ML estimate is determined by finding the roots of Eq. (3.114) whose expression is givenin Eq. (3.90):

𝛿l(𝜃|y)𝛿𝜃

=N−1∑k=0

(2 (y(k) − 𝜃)[

(y(k) − 𝜃)2 + 𝛼2]) = 0 (3.147)

There is no closed form solution and the estimate is computed recursively.

The Cauchy distribution is a “pathological distribution,” which is often used to model a measurementdata with large outliers. Its mean does not exist and its variance is infinite. But its median and mode existand are well defined. Finding the roots of Eq. (3.147) is computationally burdensome. One simple androbust, although suboptimal, solution is to estimate the median value of the samples.

Remarks The cost function, denoted c(𝜃|y), for the ML estimate of Gaussian, Laplace, and Cauchy isderived from the l(𝜃|y) given respectively in Eqs. (3.60), (3.133), (3.143), and (3.89). The cost functionis a convex function of y − 𝜇y. Since the constant terms do not contribute to the constant, for comparingthe cost functions associated with different PDFs, a constant term is subtracted to the likelihood functionl(𝜃|y) so that the cost function is zero when y = 𝜇y:

c(𝜃|y) = 0 when y = 𝜇y (3.148)

The cost function for different PDFs when 𝜇y = 𝜃 becomes:

c (𝜽|y) =

⎧⎪⎪⎪⎪⎨⎪⎪⎪⎪⎩

12𝜎2

v

N−1∑i=0

(y(k) − 𝜃)2 Gaussian√2

𝜎v

N−1∑k=0

|y(k) − 𝜃| Laplacian

N−1∑k=0

ln(𝜋[(y(k) − 𝜃)2 + 𝛼2]) − N ln(𝜋 𝛼2) Cauchy

(3.149)

Similarly, the cost function for part Gauss-part Laplace using the expression (3.143) is:

c (𝜽|y) =

⎧⎪⎪⎨⎪⎪⎩

12𝜎2

y

N−1∑k=0

(y(k) − 𝜃)2 |y(k) − 𝜃| ≤ agd

agd

2𝜎2y

N−1∑k=0

|y(k) − 𝜃| |y(k) − 𝜃| > agd

(3.150)

The ML estimator �� of the mean 𝜃 is determined by minimizing the cost function with respect to 𝜃:

�� = arg min𝜃

{c(𝜃|y)} (3.151)

In the case when the PDF is not known, then the robust estimator of 𝜃 is found by minimizing the costfunction for the worst-case PDF:

�� = arg min𝜃

{maxfy∈ℂ

{c(𝜃|y)}}

(3.152)

Page 30: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

146 Identification of Physical Systems

Remarks The functional forms of the cost function c(𝜃|y) for different PDFs are as follows:

1. Gaussian: It is a L2 metric (or distance function) of the y − 𝜃, that is it is a quadratic functionof y − 𝜃.

2. Laplace: It is a L1 metric of y − 𝜃, that is it is an absolute value function of y − 𝜃.3. A part Gauss-part Laplace: It is a compromise between a L2 and a L1 of y − 𝜃, that is it is a quadratic

function for small y − 𝜃 and an absolute function for large y − 𝜃.4. Cauchy: It is a log-quadratic function of y − 𝜃.

The weightings of cost function depend upon the PDF.

1. All cost functions, except that of the worst-case (part Gauss-part Laplace), do not discriminate betweenthe good and bad data. They weigh equally all the measurement deviation y − 𝜃.

2. The Gaussian cost function is a quadratic function of all y − 𝜃 including both good and bad data.The ML estimator using Gaussian PDF performs poorly in the presence of bad data and gives anacceptable performance when the measurement data is “good.”

3. The part Gauss-part Laplace (worst-case) cost function is employed when the PDF of the measurementdata is unknown or partially known. It discriminates between the good and the bad data, therebyensuring adequate performance of the ML estimator in the face measurement data (including theworst case of bad data) generated by a class of all PDFs. It is a quadratic function of y − 𝜃 for allgood data, |y − 𝜃| ≤ agd, and an absolute function of y − 𝜃 for all bad data, |y − 𝜃| > agd.

4. Similar to the worst-case cost function, the Laplace cost function weights bad measurement data asan absolute function of y − 𝜃 for all |y − 𝜃| > agd. Hence the ML estimator using Laplacian PDF isrobust to the presence of bad data.

5. The Cauchy cost function is a log-quadratic function of all y − 𝜃. It assigns the lowest weights,especially to the bad data. The ML estimator penalizes the bad data the most.

6. The ML estimates for Gaussian and Laplace are easy to compute. For the Gaussian PDF it is the meanof the data samples, while for the Laplace it is the median of the data samples. In the case when thePDF is not known, it is simpler to employ the mean when the data is believed to be “good” and themedian when data is believed to be “bad.”

The performance of the ML estimators for Gauss, Laplace, part Gauss-part Laplace, and Cauchy PDFsare illustrated in Figure 3.6. The ML estimate of 𝜃 = 1 is obtained for a linear model y(k) = 𝜃 + v(k) forthe cases when v(k) zero-mean Gaussian, Laplacian, part Gauss-part Laplace, and Cauchy i.i.d. randomvariables. The variance for Gaussian, Laplacian, and part Gauss-part Laplace random variables is unity.The ML estimates are obtained for both the correct as well as the incorrect choice of PDFs. The leftmostsubfigures (a), (f), (k), and (p) show the Gaussian, the Laplace, the part Gauss-part Laplace, and theCauchy measurement random variables respectively. The ML estimates assuming Gaussian, Laplace,part Gauss-part Laplace. and Cauchy are shown in (i) the top subfigures (b), (c), (d), and (e) respectivelywhen the true PDF is Gaussian, (ii) the middle subfigures (g), (h), (i), and (j) respectively when thetrue PDF is Laplace, (iii) the subfigures (l), (m), (n), and (o) respectively when the true PDF is partGauss-part Laplace, and (iv) the bottom subfigures (q), (r), (s), and (t) when the true PDF is Cauchy.The subfigures (b), (h), and (o) show the ML estimate when the true PDF is assumed. The rest show theestimates with the incorrect choice of PDF. The captions “est G,” “est L,” “est P,” and “est C” indicatethat the estimates are computed assuming Gaussian, Laplacian, part Gauss-part Laplace, and CauchyPDFs respectively. Similarly “data G,” “data L,” “data P,” and “data C” indicate that measurement data isgenerated respectively assuming Gaussian, Laplace, part Gauss-part Laplace, and Cauchy PDFs. Thesecost functions associated with the PDFs are shown in Figure 3.7.

Table 3.1 shows the performance of the ML estimator. The ideal case, when the assumed PDF isequal to the true PDF, and the practical case, when the assumed PDF is different from the true PDF, are

Page 31: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

Estimation Theory 147

0 50 100

-1

0

1

2

3A: data G

0 50 1000.8

1

1.2

B: est G

0 50 100

0.8

1

1.2

C: est L

0 50 100

0.8

1

1.2

D: est P

0 50 100

0.8

1

1.2

1.4

E: est C

0 50 100-10

1

23

F: data L

0 50 100

1

1.2

G: est G

0 50 100

1

1.2H: est L

0 50 100

1

1.2

I: est P

0 50 100

1

1.2

J: est C

0 50 100

0

2

4

K: data P

0 50 100

1

1.2L: est G

0 50 1000.8

1

M: est L

0 50 100

0.9

1

1.1

N: est P

0 50 100

0.9

1

1.1

O: est C

0 50 100

-5

0

5

10

P: data C

0 50 100

-50

0

50

Q: est G

0 50 100

0.8

1

1.2R: est L

0 50 100

0.8

1

1.2

S: est P

0 50 100

0.8

1

1.2

T: est C

Figure 3.6 Performance of the ML estimators for various PDFs

considered. The ML estimates of the mean and the variance of the estimation errors are given when themeasurement data comes from the Gaussian, the Laplacian, the part Gauss-part Laplace, and the CauchyPDFs. The leftmost column indicates the true PDFs (Gaussian, Laplacian, part Gauss-part Laplace, orCauchy). The rows are associated with the true PDF, and the columns with the assumed PDFs. The MLestimate of the mean and the variance of the estimate are given at the intersection of the row (denoting thetrue PDF) and the columns (denoting the assumed PDFs). For example, The ML estimation of the meanand variance of the estimation error when the true PDF is Gaussian and the assumed PDF is Cauchy aregiven respectively by the entries in the first row and fifth column, and the first row and the ninth column.

We can deduce the following from the Figure 3.6 and the Table 3.1:

1. The performance of the ML estimator is the best when the assumed PDF is equal to the true PDF.2. In the case when the PDF is unknown, the ML estimator based on the worst-case PDF is robust to

uncertainties in the PDF generating the measurement data. The estimator performs adequately overall measurement data PDFs.

3. The performance of the estimator assuming a Gaussian PDF when the data is bad, such as Cauchydistributed measurements, is catastrophic.

4. The performance of the estimator assuming Laplace PDF is acceptable.

Page 32: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

148 Identification of Physical Systems

-80 -60 -40 -20 0 20 40 60 80 1000

1

2

3

4

5

y

Cost

Gauss

Laplace

Cauchypart Gauss-part Laplace

Cost functions for Gauss, Laplace part Gauss-part Laplace, and Cauchy

Figure 3.7 Cost functions: Gaussian, Laplace, part Gauss-part Laplace, and Cauchy PDFs

3.5.3 Illustrative Examples: Maximum Likelihood Estimation of Meanand Variance

So far we have concentrated on estimating the location parameter 𝜃 which is a mean or median or modelof a PDF. However, there is a need to estimate the variance from the measurement data for determiningthe PDF and to evaluate the performance of the estimator.

Example 3.10 i.i.d. Gaussian PDFConsider a scalar version of Example 3.4 with H = [ 1 1 1 . 1 ]T and 𝚺v = I𝜎2

v . From Eq. (3.124),the log-likelihood function for the unknown parameters 𝜃 and the variance 𝜎2

v denoted l(𝜃, 𝜎2v |y) is:

l(𝜃, 𝜎2v |y) = − 1

2𝜎2v

N−1∑k=0

(y(k) − 𝜃)2 − N2ln 2𝜋 − N

2ln 𝜎2

v , (3.153)

Table 3.1 Performance of the ML estimator

Estimate of the mean Estimate of the variance

Gauss Laplace Part. Cauchy Gauss Laplace Part. Cauchy

Gauss 1.0212 1.0169 1.0203 1.0125 0.0125 0.0164 0.0133 0.0203Laplace 1.0045 1.0092 1.0063 1.0067 0.0104 0.0061 0.0080 0.0062Part . . . 1.0007 1.0018 1.0043 1.0037 0.0072 0.0057 0.0036 0.0050Cauchy 1.7102 1.0098 1.0044 1.0094 27.2594 0.0072 0.0103 0.0060

Page 33: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

Estimation Theory 149

The ML estimates �� and ��v are obtained from maximizing the log-likelihood function l(

𝜃, 𝜎v|y):

[��

��2v

]= argmax

𝜎v ,𝜃

{− 1

2𝜎2v

N−1∑k=0

(y(k) − 𝜃)2 − N2ln 2𝜋 − N

2ln 𝜎2

v

}(3.154)

Partial differentiation of l(

𝜃, 𝜎2v |y) with respect to 𝜃 and 𝜎v we get

𝛿

𝛿𝜃l(

𝜃, 𝜎2v |y) = 1

𝜎2v

N−1∑k=0

(y(k) − 𝜃) (3.155)

𝛿

𝛿𝜎2v

l(

𝜃, 𝜎2v |y) = 1

2𝜎4v

N−1∑k=0

(y(k) − 𝜃)2 − N2𝜎2

v

(3.156)

Setting the partial derivatives (3.155) and (3.156) to zero, the ML estimates �� and ��2v are those values of

𝜃 and 𝜎2v which satisfy the following equation:

⎡⎢⎢⎢⎣𝛿

𝛿𝜃l(

𝜃, 𝜎2v |y)

𝛿

𝛿𝜎2v

l(

𝜃, 𝜎2v |y)

⎤⎥⎥⎥⎦ =

⎡⎢⎢⎢⎢⎢⎣

1𝜎2

v

N−1∑k=0

(y(k) − 𝜃)

12𝜎4

v

N−1∑k=0

(y(k) − 𝜃)2 − N2𝜎2

v

⎤⎥⎥⎥⎥⎥⎦=

[00

](3.157)

The ML estimate �� satisfies:

1𝜎2

v

N−1∑k=0

(y(k) − ��

)= 0 (3.158)

From Eq. (3.158) the ML estimate �� of 𝜃 is:

�� = 1N

N∑i=1

y(i) (3.159)

The variance of the estimator �� from Eq. (3.129) is given by

var(

��)=

𝜎2v

N(3.160)

The ML estimate ��2v of 𝜎2

v is:

��2v = 1

N

N−1∑k=0

(y(k) − ��

)2(3.161)

It has been already established that �� is unbiased and efficient in Example 3.4.

Page 34: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

150 Identification of Physical Systems

Let us verify the properties of ��2v . We will show that ��2

v is biased. Substituting y(k) = 𝜃 + v(k) in the

expression �� = 1

N

N∑i=1

y(i) in Eq. (3.161) yields

E[

��2v

]= 1

N

N−1∑k=0

E⎡⎢⎢⎣(

𝜃 + v(k) − 1N

N−1∑i=0

(𝜃 + v(i))

)2⎤⎥⎥⎦ = 1N

N−1∑k=0

E⎡⎢⎢⎣(

v(k) − 1N

N−1∑i=0

v(i)

)2⎤⎥⎥⎦ (3.162)

Expanding and using the fact that {v(i)} is i.i.d. we get

E[

��2v

]= 1

N

N−1∑k=0

E⎡⎢⎢⎣(

1N

((N − 1) v(k) −

N−1∑i=0 ,i≠k

v(i)

))2⎤⎥⎥⎦ = N − 1N

𝜎2v (3.163)

It has been shown that the variance of the estimator ��2v is given by

var(

��2v

)= 2(N − 1)

N2𝜎4

v (3.164)

Equation (3.163) shows that ��2v is “slightly” biased since E

[��2

v

]= N−1

N𝜎2

v ≠ 𝜎2v . It is, however, asymp-

totically unbiased. We may fix the problem of bias by simply redefining the estimator (3.161) as:

��2v unb =

NN − 1

��2v = 1

N − 1

N−1∑k=0

(y(k) − ��

)2(3.165)

where ��2v unb is the redefined estimator of the variance. Let us now verify whether it is unbiased. Taking

the expectation and using Eq. (3.163) yields:

E[

��2v unb

]= 𝜎2

v (3.166)

Consider the Fisher information matrix (3.50):

From Eq. (3.155) E

[𝜕2l

(𝜃, 𝜎v|y)𝜕𝜃2

]= − N

𝜎2v

; E

[𝜕2l

(𝜃, 𝜎v|y)

𝜕𝜃𝜕𝜎2v

]= − 1

𝜎4v

N−1∑k=0

E [(y(k) − 𝜃)] = 0; and

using Eq. (3.156) E

[𝜕2l

(𝜃, 𝜎v|y)

𝜕(

𝜎2v

)2

]= − 1

𝜎6v

N−1∑k=0

E[(y(k) − 𝜃)2

]+ N

2𝜎4v= − N

𝜎4v+ N

2𝜎4v= − N

2𝜎4v

.

Hence the Fisher information matrix becomes

IF =

⎡⎢⎢⎢⎢⎢⎣−E

[𝜕2l

(𝜃, 𝜎v|y)𝜕𝜃2

]−E

[𝜕2l

(𝜃, 𝜎v|y)

𝜕𝜃𝜕𝜎2v

]

−E

[𝜕2l

(𝜃, 𝜎v|y)

𝜕𝜃𝜕𝜎2v

]−E

[𝜕2l

(𝜃, 𝜎v|y)

𝜕(

𝜎2v

)2

]⎤⎥⎥⎥⎥⎥⎦=

⎡⎢⎢⎢⎣N

𝜎2v

0

0N

2𝜎4v

⎤⎥⎥⎥⎦ (3.167)

Let us verify the efficiency of the estimators �� and ��2v by comparing their variances with the diagonal

elements of the inverse of the Fisher information matrix I−1F . From expressions of the variances (3.160)

and (3.164) with diagonal elements of I−1F = diag

(𝜎2

v

N

2𝜎4v

N

), it can be deduced that (a) �� is efficient

whereas ��2v is asymptotically efficient.

Page 35: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

Estimation Theory 151

Example 3.11 i.i.d. Gaussian PDFWe will use the results of the linear model y = H𝜽 + v in Lemma 3.5 with 𝚺v = I𝜎2

v . The log-likelihoodfunction for the unknown parameters 𝜽 and the variance 𝜎2

v denoted l(𝜽, 𝜎2

v |y) is:

l (𝜽|y) = −(y − H𝜽)T (y − H𝜽)

2𝜎2v

− N2ln 2𝜋 − 1

2ln 𝜎2

v (3.168)

The ML estimates �� and ��v are obtained from maximizing the log-likelihood function l(𝜽, 𝜎v|y):

[��

��2v

]= argmax

𝜎v ,𝜃

{−

(y − H𝜽)T (y − H𝜽)2𝜎2

v

− N2ln 2𝜋 − N

2ln 𝜎2

v

}(3.169)

Partial differentiation of l(

𝜃, 𝜎2v |y) with respect to 𝜃 and 𝜎v gives

𝛿

𝛿𝜽l (𝜽|y) = 1

𝜎2v

HT (y − H𝜽) (3.170)

𝛿

𝛿𝜎2v

l(𝜽, 𝜎2

v |y) = 12𝜎4

v

(y − H𝜽)T (y − H𝜽) − N2𝜎2

v

(3.171)

Setting the partial derivatives (3.155) and (3.156) to zero, the ML estimates �� and ��2v are those values of

𝜽 and 𝜎2v which satisfy the following equation:

⎡⎢⎢⎢⎣𝛿

𝛿𝜃l(𝜽, 𝜎2

v |y)𝛿

𝛿𝜎2v

l(𝜽, 𝜎2

v |y)⎤⎥⎥⎥⎦ =

⎡⎢⎢⎢⎣1

𝜎2v

HT (y − H𝜽)

12𝜎4

v

(y − H𝜽)T (y − H𝜽) − N2𝜎2

v

⎤⎥⎥⎥⎦ =

[0

0

](3.172)

The ML estimator �� is that value of 𝜽 that satisfies the above equation:

�� =(HT H

)−1HT y (3.173)

The ML estimate ��2v of 𝜎2

v is:

��2v = 1

N

(y − H��

)T (y − H��

)(3.174)

Invoking Lemma 3.6 in the Appendix, the expectation of the estimate of the variance is:

E[

��2v

]= N − 1

N𝜎2

v (3.175)

This shows that ��2v is “slightly” biased. It is, however, asymptotically unbiased. We may fix the problem

of bias by simply redefining the estimator (3.161) as:

��2v unb =

NN − 1

��2v = 1

N − 1

(y − H��

)T (y − H��

)(3.176)

Page 36: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

152 Identification of Physical Systems

Example 3.12 Multivariate Gaussian PDFLet y = H𝜽 + v be a linear model, v a zero-mean Gaussian random variable with cov(v) = 𝚺v. The PDFof y is:

fy(y) = 1√(2𝜋)N|𝚺v| exp

{−1

2(y − H𝜽)T 𝚺−1

v (y − H𝜽)}

(3.177)

Let yp = [ y1 y2 y3 . yp ]T be p realizations of y obtained by performing p independent experiments.We will assume the sequence {yi : i = 1, 2, 3,… , p} to be i.i.d. Gaussian distributed random vectors. Theproblem is to estimate the mean 𝜽 and the covariance matrix 𝚺v from p realizations yi: i = 1, 2,… , p.The joint PDF of yp is

fy (yp) = 1√(2𝜋)Np|𝚺v|p

exp

{−1

2

p∑i=1

(yi − H𝜽)T𝚺−1v (yi − H𝜽)

}(3.178)

The log-likelihood function l(𝜽,𝚺−1v |yp) given in Eq. (3.117) becomes:

l(𝜽,𝚺−1v |yp) = −1

2

p∑i=1

(yi − H𝜽)T𝚺−1v (yi − H𝜽) − c −

p

2ln |𝚺v| (3.179)

where c =Np

2ln 2𝜋 is a constant term. The ML estimates �� and ��v are those values of 𝜽 and 𝚺v that

maximize the log-likelihood function l(𝜽,𝚺−1

v |yp):

[��

��v

]= arg max

𝚺−1v ,𝜃

{−1

2

p∑i=1

(yi − H𝜽)TΣ−1v (yi − H𝜽) − c −

p

2ln |𝚺v|} (3.180)

Partial differentiation with respect to 𝜽 and 𝚺−1v yields

𝛿

𝛿𝜽l(𝜽,𝚺−1

v |yp) =p∑

i=1

HT𝚺−1v (yi − H𝜽) (3.181)

𝛿

𝛿𝚺−1v

l(𝜽,𝚺−1

v |yp)= 𝛿

𝛿𝚺−1v

{−1

2

p∑i=1

(yi − H𝜽)T𝚺−1v (yi − H𝜽) − c −

p

2ln |Σv|} (3.182)

Expressing the quadratic term in Eq. (3.182) as a trace of the matrix product yields

− 12

𝛿

𝛿𝚺−1v

(trace

(p∑

i=1

(yi − H𝜽)T𝚺−1v (yi − H𝜽)

))− 𝛿

𝛿𝚺−1v

(p

2ln |𝚺v|) (3.183)

Using the properties of trace and determinant:trace {ABC} = trace {BCA}; |A| = 1|A−1| ; ln |A| = − ln |A−1| yields

− 12

𝛿

𝛿𝚺−1v

(trace

(𝚺−1

v

p∑i=1

(yi − H𝜽)(yi − H𝜽)T

))+

p

2𝛿

𝛿𝚺−1v

(ln |𝚺−1

v |) (3.184)

Page 37: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

Estimation Theory 153

Computing the partial differentiation of the trace and the logarithm terms with respect to Σ−1v using

matrix calculus: 𝛿

𝛿Atrace {AB} = 𝛿

𝛿Atrace {BA} = BT ; 𝛿

𝛿Aln(|A|) = (A−1)T we get:

𝛿

𝛿𝚺−1v

trace

{𝚺−1

v

12

p∑i=1

(yi − H𝜽)(yi − H𝜽)T

}= 1

2

p∑i=1

(yi − H𝜽)(yi − H𝜽)T (3.185)

𝛿 ln(|𝚺−1

v |)𝛿𝚺−1

v

=(𝚺v

)T = 𝚺v (3.186)

Using the above expressions, the result of partial differentiation (3.182) becomes

𝛿

𝛿𝚺−1v

l(𝜽,𝚺−1

v |yp)= 1

2

(−

p∑i=1

(yi − H𝜽)(yi − H𝜽)T + 𝚺vp

)(3.187)

Setting the partial derivatives (3.181) and (3.187) to zero, the ML estimates �� and ��v are those values of𝜽 and 𝚺v which satisfy the following equation:

⎡⎢⎢⎢⎣𝛿

𝛿𝜽l(𝜽,Σ−1

v |yp)

𝛿

𝛿𝚺−1v

l(𝜽,𝚺−1v |yp)

⎤⎥⎥⎥⎦ =

⎡⎢⎢⎢⎢⎢⎣

p∑i=1

HT𝚺−1v (yi − H𝜽)

12

(−

p∑i=1

(yi − H𝜽)(yi − H𝜽)T + 𝚺vp

)⎤⎥⎥⎥⎥⎥⎦=

[0

0

](3.188)

Solving for 𝜽 and 𝚺v yields

�� = 1p

p∑i=1

(HT𝚺−1

v HT)−1H𝚺−1

v yi

��v =1p

p∑i=1

(yi − H��)(yi − H��)T

(3.189)

Example 3.13 i.i.d. Laplacian PDFConsider the linear measurement model (3.123), v(k) is i.i.d. zero-mean random noise with LaplacianPDF given by Eq. (3.75)

fy(y; 𝜃) =⎛⎜⎜⎜⎝

1√2𝜎2

v

⎞⎟⎟⎟⎠N

exp

(−

√2

𝜎v

N−1∑k=0

|y(k) − 𝜃|) (3.190)

The log-likelihood function for estimating the mean 𝜃 and standard deviation 𝜎v, l(

𝜃, 𝜎v|y) is

l(

𝜃, 𝜎v|y) = −N2ln

(2𝜎v

)−

√2

𝜎v

N−1∑k=0

|y(k) − 𝜃| (3.191)

The ML estimates �� and ��v are obtained from maximizing the log-likelihood function l(𝜃, 𝜎v|y):[��

��v

]= argmax

𝜎v ,𝜃

{−N

2ln(2𝜎v) −

√2

𝜎v

N−1∑k=0

|y(k) − 𝜃|} (3.192)

Page 38: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

154 Identification of Physical Systems

Partial differentiation of l(𝜃, 𝜎2v |y) with respect to 𝜃 and 𝜎v gives

𝛿

𝛿𝜃l(

𝜃, 𝜎2v |y) =

√2

𝜎v

N−1∑k=0

sign (y(k) − 𝜃) (3.193)

𝛿

𝛿𝜎v

l(

𝜃, 𝜎2v |y) =

√2

𝜎2v

N−1∑k=0

|y(k) − 𝜃| − N2𝜎v

(3.194)

Setting the partial derivatives (3.193) and (3.194) to zero, the ML estimates �� and ��v are those values of𝜃 and 𝜎v which satisfy the following equation:

⎡⎢⎢⎢⎢⎣𝛿

𝛿𝜃l(

𝜃, 𝜎2v |y)

𝛿

𝛿𝜎v

l(

𝜃, 𝜎2v |y)

⎤⎥⎥⎥⎥⎦=

⎡⎢⎢⎢⎢⎢⎣

√2

𝜎v

N−1∑k=0

sign (y(k) − 𝜃)√2

𝜎2v

N−1∑k=0

|y(k) − 𝜃| − N2𝜎v

⎤⎥⎥⎥⎥⎥⎦=

[0

0

](3.195)

The ML estimate �� is given by

�� = median([ y(0) y(1) y(2) . y(N − 1) ]T ) (3.196)

The ML estimate ��v is:

��v =√

2N

N−1∑k=0

|y(k) − ��| (3.197)

3.5.4 Properties of Maximum Likelihood Estimator1. The Cramer–Rao lower bound depends upon the PDF of the data, and one has to know the PDF either

a priori or estimate it from the data before we can compute the lower bound. Hence in many practicalcases it is not possible to compute the lower bound.

2. An efficient estimator may not always exist. However if the PDF is Gaussian, an efficient estimatoralways exists.

3. If the PDF is Gaussian, both the maximum likelihood and the best linear least-squares estimators areefficient.

4. If an efficient estimator exists, it is given by the maximum likelihood estimator.

3.6 Summary

Mathematical model

y = H𝜽 + v

Probabilistic model:

fy(y;𝜽) = fy(y(0), y(1), y(2),… , y(N − 1);𝜽)

Page 39: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

Estimation Theory 155

Gaussian PDF: denoted fg(y;𝜽)

fg(y;𝜽) = 1√(2𝜋)N|Σv| exp

{−1

2(y − 𝝁y)

T𝚺−1v (y − 𝝁y)

}

𝝁y = E [y] ; E [y] = H𝜃; cov(y) = Σv

Uniform PDF: denoted fu(y; 𝜃)

fu(y; 𝜃) =⎧⎪⎨⎪⎩

1b − a

for a ≤ y ≤ b

0 for y < a or y > b

E[y] = 𝜇y = 𝜃 = b + a2

; var(y) = 𝜎2y = (b − a)2

12

Laplacian PDF: denoted fe(y; 𝜃)

fe(y; 𝜃) = 1√2𝜎y

exp⎛⎜⎜⎝−

√2|y − 𝜇y|

𝜎y

⎞⎟⎟⎠Worst-case PDF: Part Gauss-part Laplace denoted fge(y; 𝜃)

fge(y; 𝜃) =

⎧⎪⎪⎨⎪⎪⎩𝜅 exp

{− 1

2𝜎2ge

(y − 𝜇y)2

}y ∈ Υgd

𝜅 exp

{a2

gd

2𝜎2ge

}exp

{−

agd

2𝜎2ge

|y − 𝝁y|} y ∈ Υbd

where Υgd = {y : |y − 𝜇y| ≤ agd} and Υbd = {y : |y − 𝜇y| > agd}

Cauchy PDF: denoted fc(y; 𝜃)

fc(y; 𝜃) = 𝛼

𝜋[(y − 𝜇y)2 + 𝛼2]

Likelihood function

L (𝜽|y) = fy(y;𝜽)

The log-likelihood function

l (𝜽|y) = ln (L (𝜽|y))

Page 40: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

156 Identification of Physical Systems

Properties of estimatorsScalar case:

y(k) = H𝜃 + v(k)

min��

E[(y(k) − H��)2

]�� = 1

H

(1N

N−1∑k=0

y(k)

)E

[��]= 𝜃

Unbiasedness of the estimator

E[��] =

{𝜽 if 𝜽 is non-random

E[𝜽] if 𝜽 is random

A linear estimator, �� = Fy is unbiased if and only if FH = I

Variance of the Estimator: Scalar Case

E[(�� − 𝜃)2] = 1H2

E⎡⎢⎢⎣(

1N

N−1∑k=0

v(k)

)2⎤⎥⎥⎦ = 1N2

(𝜎2

v

H2

)

lim itN→∞

E[(�� − 𝜃)2] = 0

Median of the data samples

median(y) =⎧⎪⎨⎪⎩

x N+12

N odd

12

(x N

2+ x N

2+1

)N even

A formal definition:

min��

{E|y − H��|}Small and large sample properties:Consistent estimator��N is a consistent estimator of 𝜃 if ��N converges to 𝜽 in probability:

lim itN→∞

P{|𝜽 − ��| > 𝜀} = 0, for every 𝜀 > 0.

Asymptotically unbiased estimator:

lim itN→∞

E[��N

]=

{𝜽 if 𝜽 is random

E [𝜽] if 𝜽 is non-random

Asymptotically efficient estimatorIf the property of efficiency holds for infinitely large number samples N, then the estimator is said to beasymptotically efficient.

Page 41: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

Estimation Theory 157

Cramer–Rao InequalityThe Cramer–Rao inequality of an unbiased estimator �� is

var(��) ≥ 1IF

IF is the Fisher information, IF = E

[(𝜕l(𝜃|y)

𝜕𝜃

)2]= −E

[𝜕2l(𝜃|y)

𝜕𝜃2

].

var(��) = 1

IFif and only if

𝛿l(𝜃|y)𝛿𝜃

= IF

(�� − 𝜃

)Vector Case: 𝜽 is a Mx1 Vector

cov(��) ≥ I−1F

where IF = E

[(𝜕l (𝜽|y)

𝜕𝜃

)(𝜕l (𝜽|y)

𝜕𝜃

)T]= −E

[𝜕2l (𝜽|y)

𝜕𝜃2

]is the Fisher information matrix cov(��) =

I−1F if and only if

𝛿l (𝜽|y)𝛿𝜽

= IF

(�� − 𝜽

).

If y(k), k = 0, 1, 2,… , N − 1 be N i.i.d. random variables then INF = NIF where IN

F = E

[(𝜕l (𝜽|y)

𝜕𝜃

)2]

and IF = E

[(𝜕l (𝜽|y)

𝜕𝜃

)2]

.

Maximum Likelihood Estimation

�� =(HT𝚺−1

v H)−1

HT𝚺−1v y

3.7 Appendix: Cauchy–Schwarz Inequality

⎛⎜⎜⎝∞

∫−∞

(f (y))2 dy⎞⎟⎟⎠⎛⎜⎜⎝

∫−∞

(g(y))2 dy⎞⎟⎟⎠ ≥

⎛⎜⎜⎝∞

∫−∞

f (y)g(y)dy⎞⎟⎟⎠

2

(3.198)

where equality is achieved if and only if

f (y) = cg(y) (3.199)

where c is a constant, which is it not a function of y.

3.8 Appendix: Cramer–Rao Lower BoundRegularity Condition: The derivation of the Cramer–Rao lower bound assumes two weak regularityconditions: one on the PDF fy(y) and the other on the estimator �� (y).

1. fy(y) > 0

2.𝛿fy(y)

𝛿𝜃exists and is finite

Page 42: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

158 Identification of Physical Systems

3. The operations of integration with respect to y and differentiation with respect to 𝜽 can be interchangedin E

[�� (y)

].

𝛿

𝛿𝜃

∫−∞

��(y)fy(y)dy =

∫−∞

��(y)𝛿fy(y)

𝛿𝜃dy

This conditions that the integration and differentiation can be swapped holds for all well-behaved PDFs:

1. If fy(y) has bounded support in x, and the bounds do not depend on 𝜃.2. If fy(y) has infinite support, is continuously differentiable, and the integral converges uniformly for

all 𝜽.

Cramer–Rao lower bound: We will first consider the scalar case and then extend the result to a vectorcase.

3.8.1 Scalar Case

Theorem 3.1 Let y = [ y(0) y(1) y(2) . y(N − 1) ]T is a Nx1 vector of measurements character-ized by the probability density function (PDF) fy(y;𝜽), the measurement y is a function of an unknown

scalar parameter 𝜃. Let �� be an unbiased estimator of 𝜃 and only a function of y and not a function ofthe unknown parameter 𝜃. If the regularity conditions hold then

E[(��(y) − 𝜃)2] ≥(

E

[(𝛿 ln fy(y)

𝛿𝜃

)2])−1

(3.200)

where IF (𝜃) = E

[(𝛿 ln fy(y)

𝛿𝜃

)2]

is the Fisher information.

Proof: Since �� is an unbiased estimator of 𝜃, the following identity holds for all 𝜃

E[

��]− 𝜃 = 0 (3.201)

Expanding the expression of the expectation operator in terms of the PDF fy(y;𝜽) we get

∫−∞

(��(y) − 𝜃

)fy(y;𝜽)dy = 0 (3.202)

The estimate �� is a function only of y and not of 𝜃 and hence partial differentiation with respect to 𝜃 andinvoking the regularity condition yields

∫−∞

(�� (y) − 𝜃

) 𝛿fy(y;𝜽)

𝛿𝜃dy −

∫−∞

fy(y;𝜽)dy = 0 (3.203)

Page 43: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

Estimation Theory 159

Since∞∫

−∞fy(y;𝜽)dy = 1 we get

∫−∞

(��(y) − 𝜽

) 𝛿fy(y;𝜽)

𝛿𝜃dy = 1 (3.204)

Dividing and multiplying the term inside the integral by fy(y;𝜽), we get

∫−∞

(��(y) − 𝜃

) 𝛿fy(y;𝜽)

𝛿𝜃

fy(y;𝜽)fy(y;𝜽)dy = 1 (3.205)

Since𝛿 ln fy(y;𝜽)

𝛿𝜃=

𝛿fy(y;𝜽)

𝛿𝜃

fy(y;𝜽)we get

∫−∞

(��(y) − 𝜃

) 𝛿 ln fy(y;𝜽)

𝛿𝜃fy(y;𝜽)dy = 1 (3.206)

Expressing the integrand using fy(y;𝜽) =√

fy(y;𝜽)√

fy(y;𝜽)(as fy(y;𝜽) ≥ 0) we get

∫−∞

[(��(y) − 𝜃

)√fy(y;𝜽)

] [𝛿 ln fy(y;𝜽)

𝛿𝜃

√fy(y;𝜽)

]dy = 1 (3.207)

Using the Cauchy–Schwarz inequality given in 3.7 Appendix, we get

⎛⎜⎜⎝∞

∫−∞

(��(y) − 𝜃)2fy(y;𝜽)dy⎞⎟⎟⎠⎛⎜⎜⎝

∫−∞

(𝛿 ln fy(y;𝜽)

𝛿𝜃

)2

fy(y;𝜽)dy⎞⎟⎟⎠ ≥ 1 (3.208)

Expressing this in terms of the expectation we get

E[(��(y) − 𝜃)2]E

[(𝛿 ln fy(y;𝜽)

𝛿𝜃

)2]≥ 1 (3.209)

Equivalently

E[(��(y) − 𝜃)2] ≥ 1IF(𝜃)

(3.210)

Corollary 3.1 The minimum variance unbiased estimator �� (an estimator which achieves the lowestbound on the variance) is the one that satisfies Eq. (3.209) with equality. From Cauchy–Schwarz inequalitythis implies

��(y) − 𝜃 = IF

𝛿 ln fy(y;𝜽)

𝛿𝜽(3.211)

Page 44: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

160 Identification of Physical Systems

3.8.2 Vector Case

Theorem 3.2 Let y = [ y(0) y(1) y(2) . y(N − 1) ]T is a Nx1 vector of measurements character-ized by the probability density function (PDF) fy(y;𝜽). The measurement y is function of an unknown

Mx1 parameter 𝜽. Let �� be an unbiased estimator of 𝜽 and only a function of y and not a function of theunknown parameter 𝜽. If the regularity conditions hold then

cov(��) = E[(��(y) − 𝜽)(��(y) − 𝜽)T ] ≥ I−1F (𝜽) (3.212)

where the MxM Fisher information matrix IF(𝜽) = E

[(𝛿 ln fy(y)

𝛿𝜽

)(𝛿 ln fy(y)

𝛿𝜽

)T]

. The inequality

implies that the difference cov(��)− I−1

F (𝜽) is positive semi-definite.

Proof: [3] Since ��(y) is unbiased, E[(��(y) − 𝜽)] = 0. Partial differentiation with respect to 𝜽 andinvoking the regularity condition yields

∫−∞

(��(y) − 𝜽)

(𝛿 ln fy(y;𝜽 )

𝛿𝜽

)T

fy(y;𝜽)dy = I (3.213)

Pre-multiplying by aT and post-multiplying by b where a and b are Mx1 vectors we get

∫−∞

aT (��(y) − 𝜽)

(𝛿 ln fy(y;𝜽)

𝛿𝜽

)T

b fy(y;𝜽)dy = aT b (3.214)

Defining two scalars h(y) = aT (��(y) − 𝜽) and g(y) =(

𝛿 ln fy(y;𝜽)

𝛿𝜽

)T

b Eq. (3.214) we get

∫−∞

h(y)g(y)fy(y;𝜽)dy = aT b (3.215)

Using the Cauchy–Schwarz inequality given in 3.7 Appendix

⎛⎜⎜⎝∞

∫−∞

h2(y)fy(y)dy⎞⎟⎟⎠⎛⎜⎜⎝

∫−∞

g2(y)fy(y)dy⎞⎟⎟⎠ ≥ (

aT b)2

(3.216)

where

h2(y) = aT (��(y) − 𝜽)(��(y) − 𝜽)T a

g2(y) = bT

(𝛿 ln fy(y;𝜽)

𝛿𝜃

)(𝛿 ln fy(y;𝜽)

𝛿𝜃

)T

b(3.217)

Expressing (3.216) in terms of cov(��), and IF(𝜽) we get

aT cov(��)

a bT IFb ≥ (aT b

)2(3.218)

Page 45: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

Estimation Theory 161

Since b is arbitrary choosing b = I−1F (𝜽)a we get

aT cov(��)a ≥ aT I−1F (𝜽)a (3.219)

Thus

cov(��) ≥ I−1F (𝜽) (3.220)

Corollary 3.2 The minimum variance unbiased estimator �� (an estimator which achieves the lowestbound on the variance) is the one that satisfies Eq. (3.220) with equality. From the Cauchy–Schwarzinequality this implies

��(y) − 𝜽 = IF (𝜽)𝛿 ln fy(y)

𝛿𝜽(3.221)

3.9 Appendix: Fisher Information: Cauchy PDF

IF = E

[(𝜕 ln(fy(y; 𝜃))

𝜕𝜃

)2]=

∫−∞

(𝜕 ln(fy(y; 𝜃))

𝜕𝜃

)2

fy(y; 𝜃)dy (3.222)

Substituting𝜕 ln(fy(y; 𝜃))

𝜕𝜃=

2y

y2 + 𝛼2; fy(y; 𝜃) = 𝛼

𝜋(

𝛼2 + y2) we get

IF = 4𝛼

𝜋

∫−∞

y2(y2 + 𝛼2

)3dy (3.223)

The integral term is:

∫y2

(y2 + 𝛼2)3dy = 1

8𝛼3tan

( y

𝛼

)+

y

8𝛼2(y2 + 𝛼2)−

y

4(y2 + 𝛼2)2(3.224)

Substituting the upper limit of ∞ and the lower limit of integration −∞ for y in the expression on theright-hand side of Eq. (3.224), the Fisher information given by Eq. (3.223) is:

IF = 12𝛼2

(3.225)

3.10 Appendix: Fisher Information for i.i.d. PDFThe PDF of y is fy(y;𝜽). Since {y(k)} are i.i.d. we get

ln(fy (y; 𝜃)

)=

N−1∑k=0

ln(fy (y(k); 𝜃)

)(3.226)

The Fisher information IF is:

IF = E⎡⎢⎢⎣(

N−1∑k=0

ln(fy(y(k); 𝜃))

)2⎤⎥⎥⎦ (3.227)

Page 46: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

162 Identification of Physical Systems

Expanding the summation term we get

IF =N−1∑k=0

E

[(𝜕 ln(fy(y(k); 𝜃))

𝜕𝜃

)2]+

∑i

∑j≠i

E

[(𝜕 ln(fy(y(i); 𝜃))

𝜕𝜃

)(𝜕 ln(fy(y(j); 𝜃))

𝜕𝜃

)](3.228)

Expanding, and since y(i) and y(j) are independent yields, the product term becomes:

E

[(𝜕 ln(fy(y(i); 𝜃))

𝜕𝜃

)(𝜕 ln(fy(y(j); 𝜃))

𝜕𝜃

)]= E

[(𝜕 ln(fy(y(i); 𝜃))

𝜕𝜃

)]E

[(𝜕 ln(fy(y(j); 𝜃))

𝜕𝜃

)](3.229)

Hence IF becomes

IF =N−1∑k=0

E

[(𝜕 ln(fy(y(k); 𝜃))

𝜕𝜃

)2]+

∑i

∑j≠i

E

[(𝜕 ln(fy(y(i); 𝜃))

𝜕𝜃

)]E

[(𝜕 ln(fy(y(j); 𝜃))

𝜕𝜃

)](3.230)

Imposing the regularity condition (3.40), the product terms on the right-hand sides vanish. We get:

IF =N−1∑k=0

E

[(𝜕 ln(fy(y(k); 𝜃))

𝜕𝜃

)2]

(3.231)

3.11 Appendix: Projection OperatorLet H be some NxM matrix with N ≥ M and rank (H) = Mr. The projection operator denoted Pr associ-ated with H is defined as

Pr = HH† = H(HT H

)−1HT (3.232)

Properties of the projection operator are:Pr = HH† has following properties:

1. PTr = Pr is symmetrical

2. P2r = Pr hence Pm

r = Pr for m = 1, 2, 3,…,3. Eigenvalues of I − Pr are only ones and zeros

Let Mr be the rank of a NxM matrix H◦ Mr eigenvalues will be ones◦ The rest of the N − Mr eigenvalues will be zeros◦ trace

(I − Pr

)= N − Mr.

4. I − Pr projects a vector on to a space perpendicular to the range space of the matrix H5. If H is non-singular square matrix then Pr = I ; I − Pr = 0.

Lemma 3.6 Let y = H𝜽 + v and H have full rank:rank (H) = M. The ML estimator �� of 𝜽 is

�� =(HT H

)−1HT y, (3.233)

Page 47: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

Estimation Theory 163

and the ML estimator ��2v of 𝜎v is

��2v = 1

N

(y − H��

)T (y − H��

)(3.234)

Then

E[

��2v

]=

(1 − M

N

)𝜎2

v (3.235)

Proof: Substituting �� =(HT H

)−1HT y and

��2v = 1

N

(y − H

(HT H

)−1HT y

)T (y − H

(HT H

)−1HT y

)(3.236)

Substituting y = H𝜽 + v in the expression H(HT H

)−1HT y we get

y − H(HT H

)−1HT y = H𝜽 + v − H

(HT H

)−1HT (H𝜽 + v) (3.237)

Simplifying we get

y − H(HT H

)−1HT y = v − H

(HT H

)−1HT v =

(I − H

(HT H

)−1HT

)v (3.238)

Using the definition of projection operator Pr (3.232) we get

y − H(HT H

)−1HT y ==

(I − Pr

)v (3.239)

Using Eq. (3.239), the expression (3.236) for ��2v becomes:

��2v = 1

NvT

(I − Pr

)T (I − Pr

)v (3.240)

Using the properties of the projection operator yields

��2v = 1

NvT

(I − Pr

)v (3.241)

Taking expectation yields

E[

��2v

]= 1

NE

[vT

(I − Pr

)v]

(3.242)

Since {v(k)} is zero mean i.i.d. with variance 𝜎2v , only the diagonal elements of I − P will contribute to

the non-zero values of the quadratic term on the right, and we get

E[

��2v

]= 1

Ntrace

(I − Pr

)𝜎2

v (3.243)

Using the property trace(I − Pr

)= N − M we get Eq. (3.235).

Page 48: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

164 Identification of Physical Systems

3.12 Appendix: Fisher Information: Part Gauss-Part LaplaceConsider the PDF given by Eq. (3.13). Assuming 𝜇y = 𝜃 the PDF becomes

fy(y) =

⎧⎪⎪⎨⎪⎪⎩𝜅 exp

{− 1

2𝜎2y

(y − 𝜃)2

}−agd ≤ y − 𝜃 ≤ agd

𝜅 exp

{a2

gd

2𝜎2y

}exp

{− 1

2𝜎2y

|y − 𝜃|agd

} |y − 𝜃| > agd

(3.244)

Taking the logarithm of the PDF yields

ln(fy(y)

)=

⎧⎪⎪⎨⎪⎪⎩ln (𝜅) − 1

2𝜎2y

(y − 𝜃)2 −agd ≤ y − 𝜃 ≤ agd

ln (𝜅) +a2

gd

2𝜎2y

−agd

2𝜎2y

|y − 𝜃| |y − 𝜃| > agd

(3.245)

Differentiating with respect to 𝜃 yields

𝛿

𝛿𝜃ln(fy(y)) =

⎧⎪⎪⎨⎪⎪⎩

1𝜎2

y

(y − 𝜃) −agd ≤ y − 𝜃 ≤ agd

agd

𝜎2y

sign(y − 𝜃) |y − 𝜃| > agd

(3.246)

The Fisher information is:

IF = E

[(𝜕 ln(fy(y; 𝜃))

𝜕𝜃

)2]=

∫−∞

(𝜕 ln(fy(y; 𝜃))

𝜕𝜃

)2

fy(y)dy (3.247)

Splitting the integration interval over Υgd and Υbd yields

IF =

agd

∫−agd

(𝜕 ln(fy(y; 𝜃))

𝜕𝜃

)2

fy(y)dy + ∫|y−𝜃|>agd

(𝜕 ln(fy(y; 𝜃))

𝜕𝜃

)2

fy(y)dy (3.248)

Using Eq. (3.246) we get

IF = 1𝜎4

y

agd

∫−agd

(y − 𝜃)2fy(y)dy +a2

gd

𝜎4y

∫|y−𝜃|>agd

fy(y)dy (3.249)

Page 49: Identification of Physical Systems (Applications to Condition Monitoring, Fault Diagnosis, Soft Sensor and Controller Design) || Estimation Theory

Estimation Theory 165

Since

∫−∞

fy(y)dy = 1 we get

∫|y−𝜃|>agd

fy(y)dy = 1 −

agd

∫−agd

fy(y)dy (3.250)

Hence the expression for Fisher information becomes

IF = 1𝜎4

y

agd

∫−agd

(y − 𝜃)2fy(y)dy +a2

gd

𝜎4y

⎛⎜⎜⎜⎝1 −

agd

∫−agd

fy(y)dy

⎞⎟⎟⎟⎠ (3.251)

Simplifying we get

IF = 1𝜎4

y

(𝜎2

gd + a2gd

(1 − 𝜆gd

))(3.252)

where 𝜎2gd =

y=𝜃+agd∫y=𝜃−agd

(y − 𝜃)2fy(y)dy is called “partial variance” and 𝜆gd =agd∫

−agd

fy(y)dy is called “partial

probability” of y over the region Υgd.

Problem3.1 Consider the same example of Gaussian measurements. Verify whether the following estimator �� is

(i) biased, (ii) efficient, (iii) asymptotically biased and asymptotically efficient.

References

[1] Doraiswami, R. (1976) A decision theoretic approach to parameter estimation. IEEE Transactions on AutomaticControl, 21(6), 860–866.

[2] Huber, P.J. (1964). Robust estimation of location parameter. Annals of Mathematical Statistics, (35), 73–102.[3] Kay, S.M. (1993) Fundamentals of Signal Processing: Estimation theory, Prentice Hall, New Jersey.

Further Readings

Haykin, S. (2001) Adaptive Filter Theory, Prentice Hall, New Jersey.Mendel, J. (1995) Lessons in Estimation Theory in Signal Processing, Communications and Control, Prentice-Hall,

New Jersey.Mitra, S.K. (2006) Digital Signal Processing: A Computer-Based Approach, McGraw Hill Higher Education, Boston.Moon, T.K. and Stirling, W.C. (2000) Mathematical Methods and Algorithms for Signal Processing, Prentice Hall,

NJ.Mix, D.F. (1995) Random Signal Processing, Prentice Hall, New Jersey.Olofsson, P. (2005). Probability, Statistics and Stochastic Process, John Wiley and Sons, New Jersey.Oppenheim, A.V. and Schafer, R.W. (2010) Discrete-Time Signal Processing, Prentice-Hall, New Jersey.Proakis, J.G. and Manolakis, D.G. (2007) Digital Signal Processing: Principles, Algorithms and Applications, Prentice

Hall, New Jersey.