
5 Linear Least-Squares Estimation

5.1 Overview

Linear least-squares estimation is a method of fitting the measurements (data or observation values) to a specified linear model. A best fit is obtained by minimizing the sum of the squares of the residual, where a residual is defined as the error between the measured value (data or observed value) and the value obtained using the model. Carl Friedrich Gauss invented the method of least squares in 1794 for predicting planetary motion, which is completely described by six parameters. A linear model that can best describe the measurement data is generally derived from physical laws. The model is a set of algebraic equations governing the measurements and is completely described by a set of parameters, termed herein the feature vector. In order to attenuate the effect of measurement errors on the estimation accuracy, the number of measurements is generally chosen to be much larger than the number of parameters to be estimated, resulting in what is commonly known as an over-determined set of equations. Further, the measurement data may not contain sufficient information about the parameters, resulting in an ill-conditioned set of equations. The estimation of model parameters has wide application in many areas of science and engineering, including system identification, controller design, fault diagnosis, and condition monitoring. The least-squares method of Gauss is still widely used for estimating unknown parameters from measurement data. The main reason is that it does not require any probabilistic assumptions, such as the underlying PDF of the measurement error, which is generally unknown a priori and is difficult to estimate a posteriori.

A generalized version of the least-squares method, popularly known as the weighted least-squares method, is developed. The least-squares and the more general weighted least-squares estimates are unbiased, and the weighted least-squares estimate is the best linear unbiased estimator. Most importantly, the optimal estimate is the solution of a set of linear equations (assuming that the measurement model is linear), which can be efficiently computed using the Singular Value Decomposition (SVD). The least-squares estimation has a very useful geometric interpretation: the residual is orthogonal to the hyperplane generated by the columns of the data matrix. This is called the orthogonality principle.

5.2 Linear Least-Squares Approach

The linear least-squares approach includes the following topics:

∙ Linear algebraic model
∙ Objective function
∙ Least-squares estimation



∙ Normal equation: the equation governing the optimal estimate
∙ Properties of the least-squares estimate
∙ Illustrative examples
∙ Cramer–Rao inequality
∙ Maximum likelihood method.

5.2.1 Linear Algebraic Model

We will restrict ourselves to the case where the set of algebraic equations relating the unknown parameters and the measurements is linear. A solution to a set of linear equations is important as it forms the backbone of estimation theory. A brief background is presented in this section. Consider a set of linear equations

y = Hθ + v    (5.1)

where y is a (N × 1) vector of measurements, H: ℜ^M → ℜ^N is a (N × M) matrix, θ is a (M × 1) deterministic parameter vector to be estimated, and v is an (N × 1) error vector. It is assumed that the matrix H has full rank. In system identification, which is the focus of this chapter, the objective is to estimate the unknown vector, termed the feature vector, which is formed from the numerator and denominator coefficients of the system transfer function, from the input–output data using the measurement Eq. (5.1). In view of this application, H and θ are termed herein the data matrix and the feature vector. The error term v is generally assumed to be a zero-mean white noise process or a zero-mean colored noise process. In the case when the error term is not zero-mean, the unknown mean may be included in the model by augmenting the feature vector θ and the data matrix H. Estimation of the unknown parameters depends upon the structure of the matrix, including its dimension and its rank. The following cases are considered:

∙ If (i) the number of equations N is strictly less than the number of unknown parameters M, N < M, or (ii) the number of equations is equal to the number of parameters, N = M, but the resulting square matrix is singular, the set of equations is called under-determined. In this case a solution always exists, but it is not unique as there are infinitely many solutions. In this case a constrained version of the least-squares method is employed to find a solution which (i) has minimum norm, and (ii) gives the best fit between the measurement and its estimate.

∙ If the number of equations N is greater than the number of unknown parameters M, N > M, termed over-determined equations, a solution may not exist. This case is the most common in system identification. In order to attenuate the effect of measurement errors on the parameter estimates, a larger number of measurements (or observations) compared to the number of unknown parameters is made. Due to the presence of measurement errors the observation may not lie in the range space of the matrix of the observation model. The least-squares method obtains the best fit between the measurement and the estimate obtained using the model.

We will focus mainly on the over-determined set of equations.

5.2.2 Least-Squares Method

The term least-squares method describes a frequently used approach to solving an over-determined or an inaccurate set of equations in an indirect sense. Instead of minimizing the parameter estimation error directly, the error between the measured value (data or observed value) and the value obtained using the model, termed the residual, is minimized: a best fit between the measurement and its estimate is obtained by minimizing the sum of the squares of the weighted residual. A model that can best describe the measurement data is derived from physical laws.


5.2.3 Objective Function

5.2.3.1 Minimization of the Parameter Estimation Error

Let us first consider a direct approach to estimating the unknown parameter θ. Let θ̂ be an estimate of θ. The optimal θ̂ is obtained by minimizing the sum of the squares of the parameter estimation error θ − θ̂:

min_{θ̂} {(θ − θ̂)^T (θ − θ̂)}    (5.2)

Since the objective function is non-negative, the optimal estimate is the one that makes the objective function equal to zero. Hence the optimal θ̂ is given by

θ̂ = θ    (5.3)

Since we do not know θ, the optimal solution given by Eq. (5.3) is meaningless as it cannot be implemented. An alternative method, minimizing the error in the estimate of the measurement, termed the residual, is used instead of the parameter estimation error.

5.2.3.2 Minimization of the Residual

The estimate θ̂ of the unknown parameter θ is obtained by minimizing the sum of the squares of the residual, denoted e, which is defined as

e = y − ŷ    (5.4)

The residual is the difference between the measurement y and its estimate ŷ = Hθ̂, which is obtained using the model y = Hθ. The objective function is J = (y − Hθ̂)^T (y − Hθ̂), and the least-squares estimation problem is formulated as

min_{θ̂} {(y − Hθ̂)^T (y − Hθ̂)}    (5.5)

A more general form of the least-squares problem, termed the weighted least-squares problem, is to weight the measurements based on a priori knowledge of the measurement accuracy, so that J = (y − Hθ̂)^T W (y − Hθ̂):

min_{θ̂} {(y − Hθ̂)^T W (y − Hθ̂)}    (5.6)

where W is some N × N symmetric and positive definite matrix, W = W^T. Since the measurement noise covariance Σ_v = E[v v^T] is a measure of the error in the measurement data y, the weighting matrix W is chosen to be the inverse of the covariance matrix. The larger Σ_v^{-1}, the more accurate the measurement, and vice versa: more weight is given to those measurements for which the elements of Σ_v^{-1} are larger. The weight is chosen as

W = Σ_v^{-1}    (5.7)


5.2.3.3 Un-Weighted and Weighted Least-Squares

The weighted least-squares problem may be posed as an un-weighted least-squares problem by "filtering the data." Consider the measurement model given by Eq. (5.1). Since W is a positive definite matrix, it can be expressed as a product of its square roots,

W = W^{1/2} W^{1/2}    (5.8)

Pre-multiplying both sides by the filter operator W^{1/2} we get

ȳ = H̄θ + v̄    (5.9)

where ȳ = W^{1/2} y, v̄ = W^{1/2} v and H̄ = W^{1/2} H. Substituting for W using Eq. (5.8), the weighted objective function becomes

J = (y − Hθ̂)^T W^{1/2} W^{1/2} (y − Hθ̂) = (ȳ − H̄θ̂)^T (ȳ − H̄θ̂)    (5.10)

Thus the weighted objective function can be expressed as an un-weighted objective function. The least-squares problem Eq. (5.5) becomes

min_{θ̂} {(ȳ − H̄θ̂)^T (ȳ − H̄θ̂)}    (5.11)

In view of this, there is no loss of generality in considering an un-weighted least-squares problem with W = I, obtained by filtering the data with W^{1/2}.
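As a brief numerical illustration of this filtering (whitening) step, the following Python/NumPy sketch, which is not from the book and uses an illustrative data matrix and noise model, checks that the weighted least-squares estimate coincides with the un-weighted estimate computed from the filtered data W^{1/2} y and W^{1/2} H.

# Minimal sketch (illustrative data): weighted LS equals ordinary LS on filtered data.
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 2
H = rng.normal(size=(N, M))                 # data matrix
theta_true = np.array([1.0, -0.5])
sigma = rng.uniform(0.5, 2.0, size=N)       # heteroscedastic noise standard deviations
y = H @ theta_true + sigma * rng.normal(size=N)

W = np.diag(1.0 / sigma**2)                 # W = Sigma_v^{-1}
W_half = np.diag(1.0 / sigma)               # W^{1/2}

# Weighted least squares via the normal equation (5.17)
theta_wls = np.linalg.solve(H.T @ W @ H, H.T @ W @ y)

# Equivalent un-weighted least squares on the filtered data, Eqs. (5.9)-(5.11)
H_bar, y_bar = W_half @ H, W_half @ y
theta_filtered, *_ = np.linalg.lstsq(H_bar, y_bar, rcond=None)

print(np.allclose(theta_wls, theta_filtered))   # True: the two estimates coincide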

Comments Objective functions other than the 2-norm of the residual may be used, such as the 1-norm and the infinity norm of the residuals. They are as follows:

∙ 1-norm of the residuals: the objective is to minimize the sum of the absolute values of the residuals {e(i)}, where e(i) is the ith element of the N × 1 residual vector e:

min_θ { ∑_{i=1}^{N} |e(i)| }    (5.12)

It is computationally more difficult than the least-squares approach. However, it is more robust, as the estimated parameters are less sensitive to the presence of spurious data points or outliers.

∙ Infinity norm of the residual: the objective is to minimize the largest residual

min_θ max_i {e(i)}    (5.13)

This is also known as a Chebyshev fit (instead of a least-squares fit) of the data. It is frequently used in the design of digital filters and in the development of approximations.

Both 1-norm and infinity-norm optimizations are reformulated as linear programming problems.
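As a hedged sketch (not from the book) of how the 1-norm problem (5.12) can be recast as a linear program, the following Python snippet uses scipy.optimize.linprog with auxiliary variables t bounding the absolute residuals; the data, the outlier, and the variable names are illustrative assumptions.

# Sketch: 1-norm fit as an LP, minimize sum(t) s.t. -t <= y - H@theta <= t.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
N, M = 30, 2
H = np.column_stack([np.ones(N), np.arange(N)])
theta_true = np.array([1.0, 0.5])
y = H @ theta_true + 0.1 * rng.normal(size=N)
y[5] += 10.0                                   # a single outlier

# Decision variables x = [theta (M entries), t (N entries)]; minimize 1^T t
c = np.concatenate([np.zeros(M), np.ones(N)])
A_ub = np.block([[ H, -np.eye(N)],             #  H theta - t <= y
                 [-H, -np.eye(N)]])            # -H theta - t <= -y
b_ub = np.concatenate([y, -y])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (M + N))
theta_l1 = res.x[:M]

theta_l2, *_ = np.linalg.lstsq(H, y, rcond=None)
print("1-norm fit:", theta_l1, " least-squares fit:", theta_l2)
# The 1-norm estimate is far less affected by the outlier than the 2-norm estimate.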


5.2.4 Optimal Least-Squares Estimate: Normal Equation

Consider the weighted least-squares problem formulated in Eq. (5.6). The optimal least-squares estimate is obtained by differentiating the objective function J with respect to each element of θ̂. For notational simplicity we will use the results of vector calculus. The optimal solution satisfies

dJ/dθ̂ = 0    (5.14)

Expanding the objective function J yields

J = y^T W y − y^T W H θ̂ − θ̂^T H^T W y + θ̂^T H^T W H θ̂    (5.15)

Differentiating the scalar J with respect to the vector θ̂ and setting it equal to zero, using the vector calculus results d/dx {a^T x} = a and d/dx {x^T P x} = 2Px for symmetric P, we get

dJ/dθ̂ = −H^T W y − (y^T W H)^T + 2 H^T W H θ̂ = −2 H^T W y + 2 H^T W H θ̂ = 0    (5.16)

The optimal estimate of θ, denoted θ̂, is the solution of the following equation:

H^T W H θ̂ = H^T W y    (5.17)

This equation is often called the normal equation. If H has full rank, rank{H} = min(M, N) = M, or equivalently H^T W H is invertible, then the optimal least-squares estimate has a closed-form solution given by

θ̂ = (H^T W H)^{-1} H^T W y    (5.18)

The expression (H^T W H)^{-1} H^T W is called the pseudo-inverse, denoted H†:

H† = (H^T W H)^{-1} H^T W    (5.19)

The estimate of the measurement, ŷ = Hθ̂, is given by

ŷ = Hθ̂ = H (H^T W H)^{-1} H^T W y    (5.20)

One of the important advantages of the least-squares method is that it has a closed-form solution, and the optimal solution is a linear function of the measurements, thanks to the fact that (i) the model is linear and (ii) the objective function is quadratic. In the case when the data matrix H is rank deficient, that is, rank{H} < M, and consequently H^T W H is not invertible, the SVD (Singular Value Decomposition) of the data matrix is employed to compute the optimal least-squares estimate θ̂. The model y = Hθ + v and the estimate θ̂ = (H^T H)^{-1} H^T y are shown in Figure 5.1.
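The following short Python sketch (not from the book; the dimensions, noise covariance, and variable names are illustrative) computes the weighted least-squares estimate of Eq. (5.18) by solving the normal equation (5.17) directly, which avoids forming the matrix inverse explicitly.

# Sketch: weighted least-squares estimate via the normal equation.
import numpy as np

rng = np.random.default_rng(2)
N, M = 100, 3
H = rng.normal(size=(N, M))                     # full-rank data matrix
theta = rng.normal(size=M)                      # true feature vector
Sigma_v = np.diag(rng.uniform(0.5, 1.5, size=N))
v = rng.multivariate_normal(np.zeros(N), Sigma_v)
y = H @ theta + v

W = np.linalg.inv(Sigma_v)                              # W = Sigma_v^{-1}
theta_hat = np.linalg.solve(H.T @ W @ H, H.T @ W @ y)   # normal equation (5.17)
H_dagger = np.linalg.solve(H.T @ W @ H, H.T @ W)        # pseudo-inverse, Eq. (5.19)
y_hat = H @ theta_hat                                   # estimate of y, Eq. (5.20)

print(theta, theta_hat)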

Figure 5.1 Visual representation of the linear model and the optimal least-squares estimate

5.2.5 Geometric Interpretation of Least-Squares Estimate: Orthogonality Principle

We will show that when the least-squares estimate is optimal, the residual is orthogonal to the column vectors of the data matrix. For notational convenience let us assume W = I. Rewriting the normal equation (5.17) we get

H^T (y − ŷ) = 0    (5.21)

This shows that the residual is orthogonal to the data matrix. Expressing the data matrix explicitly in terms of its M column vectors, H = [h_1 h_2 ⋯ h_M], the orthogonality condition (5.21) becomes

h_i^T (y − Hθ̂) = 0,  i = 1, 2, 3, …, M    (5.22)

A geometrical interpretation of Eq. (5.22) is that the residual is orthogonal to each of the column vectors of the data matrix: y ∈ ℜ^N, θ̂ ∈ ℜ^M, ŷ = Hθ̂ ∈ span(h_1 h_2 ⋯ h_M), and the measurement y, its estimate ŷ, and the residual y − Hθ̂, which is perpendicular to the span of H, form a right-angled triangle. This is the well-known orthogonality principle. The residual represents the part of the measurement y that is not modeled by ŷ = Hθ̂, as M < N. The expression for the residual given in Eq. (5.4) plays an important role in evaluating the performance of the least-squares estimator. Substituting ŷ = Hθ̂ we get

e = y − Hθ̂    (5.23)

Using Eq. (5.18) and the definition of the pseudo-inverse yields

e = y − H (H^T W H)^{-1} H^T W y = y − H H† y    (5.24)

This has an interesting geometric interpretation. Define an operator P_r given by

P_r = H (H^T W H)^{-1} H^T W = H H†    (5.25)

Using Eq. (5.20), ŷ and e in terms of the projection operator P_r become

ŷ = P_r y    (5.26)

e = y − ŷ = (I − P_r) y    (5.27)

Thus P_r projects the measurement y onto the range space of the data matrix H so that ŷ = P_r y, while I − P_r is called the orthogonal complement projector, which projects the measurement y so that e = (I − P_r) y: e is orthogonal to ŷ, as shown in Figure 5.2.

Figure 5.2 The measurement, its estimate and the residual form a right-angled triangle

Substituting for ŷ in Eq. (5.27) and using Eq. (5.1) we get

e = (I − P_r)(Hθ + v) = Hθ + v − P_r Hθ − P_r v = (I − P_r) v    (5.28)

Consider the inner product ŷ^T e. Using Eqs. (5.26) and (5.27), and P_r² = P_r, we get

e^T ŷ = y^T P_r (I − P_r) y = y^T (P_r − P_r²) y = 0    (5.29)

Taking the expectation we get

E[e^T ŷ] = 0    (5.30)

The residual is thus orthogonal to the estimate. See Figure 5.2.
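A quick numerical check of the orthogonality principle, written as a hedged Python sketch (not from the book, with W = I and randomly generated data), is given below: the residual is orthogonal both to the columns of H and to the estimate ŷ.

# Sketch: verify Eqs. (5.21)-(5.30) numerically for W = I.
import numpy as np

rng = np.random.default_rng(3)
N, M = 200, 4
H = rng.normal(size=(N, M))
y = H @ rng.normal(size=M) + rng.normal(size=N)

Pr = H @ np.linalg.solve(H.T @ H, H.T)   # projection operator, Eq. (5.25) with W = I
y_hat = Pr @ y                           # Eq. (5.26)
e = y - y_hat                            # Eq. (5.27)

print(np.max(np.abs(H.T @ e)))           # ~0: residual orthogonal to columns of H
print(float(e @ y_hat))                  # ~0: residual orthogonal to the estimate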

5.3 Performance of the Least-Squares Estimator

We will evaluate the performance of the least-squares estimator considering unbiasedness, the covariance of the estimation error, the mean-squared residual error, and comparison with other linear unbiased estimators.

5.3.1 Unbiasedness of the Least-Squares Estimate

The estimate is a random variable since the measurement is corrupted by a noise term. The question arises whether the expectation of the estimate of a parameter will be equal to its true value, that is, will the average of all the estimates obtained from an infinite number of experiments be equal to the true value. If so, then the estimate is said to be unbiased. Unbiasedness of an estimator is important as it indicates whether the model and/or the measurement are subject to a systemic error. The formal definition of an unbiased estimator is

E[θ̂] = θ    (5.31)

We will show that the least-squares estimate is unbiased if (i) the data matrix H is deterministic and (ii) the noise v is a zero-mean random variable, E[v] = 0. Substituting for θ̂ in the definition and using Eq. (5.18) yields

E[θ̂] = E[(H^T W H)^{-1} H^T W y] = (H^T W H)^{-1} H^T W E[y]    (5.32)


Substituting y = Hθ + v and since E[v] = 0, we get

E[θ̂] = (H^T W H)^{-1} H^T W E[Hθ + v] = (H^T W H)^{-1} H^T W Hθ = θ    (5.33)

Thus the least-squares estimate is unbiased. In order to evaluate the accuracy of an estimator, unbiasedness alone is not sufficient. An estimator which is both unbiased and has a low covariance of the estimation error is desirable.

5.3.1.1 Illustrative Example: Unbiasedness, Whiteness, and Orthogonality

An illustrative example is given to compare the performance of the estimator with a zero-mean white noise process with different variances.

Example 5.1 Consider the measurement model (5.1) where H = [1 1 ⋯ 1]^T is a N × 1 vector, the true parameter θ = 1, and v is a N × 1 measurement noise vector, with N = 500 samples. The estimates of the parameter were computed using N data samples and 100 experiments were performed. Figure 5.3 shows (i) the estimate θ̂, (ii) the orthogonality of the residual e and the measurement estimate ŷ, and (iii) the whiteness of the residual, for the noise standard deviations σ_v = 1 and σ_v = 0.25. For the case when σ_v = 1, subfigures (a), (b), and (c) show respectively the estimate vs. experiments, the inner product of the residual and the estimate vs. experiments (verifying the orthogonality property), and the auto-correlation of the residual vs. time lag. Similarly, subfigures (d), (e), and (f) show the performance when σ_v = 0.25.

Subfigures (a) and (d) show that the estimates are unbiased. The larger the variance of the noise, the larger is the variance of the estimator. Hence the measure of performance of the estimator should include the variance of the estimator besides the unbiasedness. The auto-correlation of the residual is a good visual indicator of the whiteness of the measurement noise, as shown in subfigures (c) and (f). The maximum value of the auto-correlation, which occurs at zero lag, indicates the variance of the noise. Orthogonality of the residual and the estimated measurement (computed using the time average) verifies the theoretical result (5.30) derived using the ensemble average.

Figure 5.3 Unbiasedness of the estimate, whiteness of the residual and orthogonality (panels: A/D estimate vs. experiment, B/E inner product of residual and estimate vs. experiment, C/F auto-correlation of the residual vs. time lag)
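A hedged Python sketch in the spirit of Example 5.1 (this is not the authors' code; the random seed and the reported statistics are illustrative choices) is given below. It repeats the experiment for the two noise levels and reports the mean and spread of the estimates together with the near-zero inner product of the residual and the estimate.

# Sketch: Monte Carlo check of unbiasedness and orthogonality for H = ones.
import numpy as np

rng = np.random.default_rng(4)
N, n_exp, theta = 500, 100, 1.0
for sigma_v in (1.0, 0.25):
    H = np.ones(N)
    estimates, inner_products = [], []
    for _ in range(n_exp):
        y = H * theta + sigma_v * rng.normal(size=N)
        theta_hat = y.mean()                          # LS estimate for H = ones
        e = y - H * theta_hat                         # residual
        estimates.append(theta_hat)
        inner_products.append(e @ (H * theta_hat))    # orthogonality check
    print(f"sigma_v={sigma_v}: mean estimate {np.mean(estimates):.4f}, "
          f"estimate std {np.std(estimates):.4f}, "
          f"max |e^T y_hat| {np.max(np.abs(inner_products)):.2e}")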

5.3.2 Covariance of the Estimation Error

The main measure of performance of an estimator is the covariance of the estimation error, denoted cov(θ̂), given by

cov(θ̂) = E[(θ − θ̂)(θ − θ̂)^T]    (5.34)

Consider the expression for θ̂ given in Eq. (5.18). Substituting y = Hθ + v we get

θ̂ = (H^T W H)^{-1} H^T W (Hθ + v) = θ + (H^T W H)^{-1} H^T W v    (5.35)

Substituting Eq. (5.35) in Eq. (5.34) we get

cov(θ̂) = (H^T W H)^{-1} H^T W Σ_v W H (H^T W H)^{-1}    (5.36)

Let us analyze the special case when v is a zero-mean white noise process with covariance Σ_v and W = Σ_v^{-1}. In this case we get

cov(θ̂) = (H^T Σ_v^{-1} H)^{-1}    (5.37)

5.3.2.1 Asymptotic Behavior of the Parameter Estimate

Let us now analyze the asymptotic behavior of the estimate θ̂ when the number of data samples N is asymptotically large. Consider the expression for θ̂ given in Eq. (5.35). Dividing and multiplying the second term on the right by N we get

θ̂ = θ + (H^T W H / N)^{-1} (H^T W v / N)    (5.38)

We will assume the following [1]:

lim_{N→∞} (H^T W H / N)^{-1} = Σ_H^{-1} < ∞    (5.39)

lim_{N→∞} H^T W v / N = 0    (5.40)

The condition (5.39) merely states that the "sample covariance" Σ_H of the weighted data matrix W^{1/2} H exists and is non-singular:

lim_{N→∞} (H^T W H / N) = Σ_H > 0    (5.41)


Consider the condition (5.40). Expanding the left-hand side yields

lim_{N→∞} H^T W v / N = lim_{N→∞} (1/N) ∑_{j=1}^{N} h_j v_j = 0    (5.42)

where h_j is the jth column vector of H^T W. This condition essentially states that the average value of linear combinations of a zero-mean random variable is asymptotically zero. Neither of these conditions is restrictive for our problems, as we assume that the data matrix has full rank and the noise is a zero-mean random variable. In view of these two assumptions, Eq. (5.38) becomes

θ̂ → θ as N → ∞    (5.43)

5.3.3 Properties of the Residual

Expressions for the mean-squared value, the asymptotic value, and the auto-correlation of the residual are considered.

5.3.3.1 Mean-Squared Residual

Consider the expression for the minimum mean-squared residual, denoted σ²_res = E[e^T e / N]. Using the expression (5.28) we get

σ²_res = E[v^T (I − P_r)^T (I − P_r) v / N]    (5.44)

Using the property of the projection operator, P_r² = P_r, yields

σ²_res = E[v^T (I − P_r) v / N]    (5.45)

Using the property of the trace of a matrix, trace{AB} = trace{BA}, and Σ_v = E[v v^T], we get

σ²_res = trace{(I − P_r) Σ_v} / N    (5.46)

Let us analyze the minimum mean-squared residual when v is a zero-mean white noise with covariance Σ_v = σ_v² I. In this case we get

σ²_res = σ_v² trace{(I − P_r)} / N    (5.47)

Assuming that the data matrix H has full rank and using the property of the projection operator trace{(I − P_r)} = N − M, we get

σ²_res = (1 − M/N) σ_v²    (5.48)

It is interesting to note that the minimum mean-squared residual depends only upon the number of unknown parameters M, the number of data samples N, and the noise variance σ_v², and not on the data matrix H. Consider the expression for the mean-squared residual when the number of data samples N is infinitely large. Using Eq. (5.48) we get

lim_{N→∞} σ²_res = σ_v²    (5.49)
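The relation (5.48) is easy to verify numerically. The following hedged Python sketch (not from the book; the dimensions and noise level are arbitrary) averages the mean-squared residual over repeated experiments and compares it with (1 − M/N) σ_v².

# Sketch: numerical check of Eq. (5.48).
import numpy as np

rng = np.random.default_rng(5)
N, M, sigma_v, n_exp = 200, 5, 0.7, 2000
H = rng.normal(size=(N, M))
ms_res = 0.0
for _ in range(n_exp):
    v = sigma_v * rng.normal(size=N)
    y = H @ rng.normal(size=M) + v
    theta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
    e = y - H @ theta_hat
    ms_res += (e @ e) / N
print(ms_res / n_exp, (1 - M / N) * sigma_v**2)   # the two values agree closely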


5.3.3.2 Covariance and the Mean-Squared Value of the Residual

We will assume that the weighting matrix is optimal, W = Σ_v^{-1}. Consider the expression for the covariance of the residual, denoted cov(res) = E[e e^T]. Using the expression e = (I − P_r) v given by Eq. (5.28) we get

cov(res) = E[e e^T] = E[(I − P_r) v v^T (I − P_r)^T] = (I − P_r) Σ_v (I − P_r)^T    (5.50)

Simplifying using the property of the projection operator P_r² = P_r yields

cov(res) = (I − P_r) Σ_v (I − P_r)^T = Σ_v − Σ_v P_r^T − P_r Σ_v + P_r Σ_v P_r^T    (5.51)

Substituting P_r = H (H^T Σ_v^{-1} H)^{-1} H^T Σ_v^{-1} and P_r^T = Σ_v^{-1} H (H^T Σ_v^{-1} H)^{-1} H^T yields

cov(res) = Σ_v − H (H^T Σ_v^{-1} H)^{-1} H^T    (5.52)

Let us compute the mean-squared residual σ²_res = E[e^T e / N]. Taking the trace of both sides of Eq. (5.52) and dividing by N we get

σ²_res = E[e^T e / N] = (1/N) trace{Σ_v} − (1/N) trace{H (H^T Σ_v^{-1} H)^{-1} H^T}    (5.53)

Using the property of the trace, trace{ABC} = trace{CBA}, we get

σ²_res = (1/N) trace{Σ_v} − (1/N) trace{H^T H (H^T Σ_v^{-1} H)^{-1}}    (5.54)

Dividing and multiplying the second term on the right by N we get

σ²_res = (1/N) trace{Σ_v} − (1/N) trace{(H^T H / N) (H^T Σ_v^{-1} H / N)^{-1}}    (5.55)

Invoking the finiteness assumption (5.39), lim_{N→∞} H^T H / N < ∞ and lim_{N→∞} H^T Σ_v^{-1} H / N < ∞, we get

lim_{N→∞} σ²_res = σ_v²    (5.56)

5.3.3.3 Asymptotic Expression of the Residual

Consider the expression for the residual (5.28). Expanding the expression we get

e = (I − P_r) v = (I − H (H^T W H)^{-1} H^T W) v    (5.57)

Similarly to Eq. (5.38), dividing and multiplying by N,

e = v − H (H^T W H / N)^{-1} (H^T W v / N)    (5.58)


Taking the limit and using the condition (5.40) we get

e → v as N → ∞ (5.59)

It is interesting to note that the larger the number of data samples N, the closer the residual is to the noise. Let us compute the auto-correlation of the residual for the case when N is large. Clearly the auto-correlation of the residual will approach that of the noise as N becomes large.

E [e(n)e(n − m)] → E [v(n)v(n − m)] as N → ∞ (5.60)

where e(n) and v(n) are the nth elements of e and v respectively. If the noise v is a zero-mean white noise, the auto-correlation of the residual is a delta function:

E[e(n) e(n − m)] = E[e²(n)] δ(m) = { E[e²(n)],  m = 0;  0,  m ≠ 0 }    (5.61)

We say that the residual is white if the auto-correlation is a delta function as given in Eq. (5.61). There is a simple test, termed the whiteness test, to verify whether the residual is white [1].

5.3.3.4 Illustrative Example: Performance with White and Colored Noise

An illustrative example is given to compare the performance of the estimator in the presence of white and colored measurement noise processes.

Example 5.2 Consider the measurement model (5.1) where H = [1 1 ⋯ 1]^T is a N × 1 vector, the true parameter θ = 1, and v is a N × 1 measurement noise vector, with N = 2000 samples. Two cases of measurement noise are considered:

Case 1: v is a zero-mean white noise process with unit covariance, Σ_v = I.

Case 2: v is a zero-mean colored noise process generated as the output of a filter H_v(z) = (1 − a) z^{-1} / (1 − a z^{-1}) with a = 0.98.

Figure 5.4 shows the asymptotic behavior of the parameter estimate, the residual, and its auto-correlation for case 1 and case 2, when v is a zero-mean white noise process and a zero-mean colored noise process respectively. The estimates of the parameters were computed using N data samples and 100 experiments were performed. For case 1, subfigures (a), (b), and (c) show respectively (i) the true parameter and its estimates vs. the experiments, (ii) a single realization of the residual and the measurement noise, and (iii) the auto-correlation of the residual and of the measurement noise. Similarly, subfigures (d), (e), and (f) show the results for case 2 when the noise is colored.

In both cases, when the number of data samples is large, (i) the residual and the auto-correlation of the residual are close to those of the noise, and (ii) the parameter estimates are close to the true value. The variance of the parameter estimation error is cov(θ̂) = 0.0011 for the white noise case, while for the colored noise the covariance is larger, cov(θ̂) = 0.0688. The auto-correlation of the residual is a delta function only for the zero-mean white noise case, as it should be. With colored noise, a larger number of data samples N is required to achieve the same desired asymptotic behavior obtained with white noise.

The residual and the measurement noise, and the auto-correlation of the residual and that of the noise, are not distinguishable in subfigures (b), (c), (e), and (f). The simulation results confirm the theoretical results (based on the ensemble average) (5.43), (5.59), and (5.60).
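A hedged Python sketch in the spirit of Example 5.2 (not the authors' code; it assumes scipy.signal.lfilter is available and uses illustrative settings) is given below. It compares the variance of the parameter estimation error for white noise and for colored noise generated by the filter H_v(z) = (1 − a)z^{-1}/(1 − a z^{-1}) with a = 0.98.

# Sketch: same LS estimator with white vs. colored measurement noise.
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(6)
N, n_exp, theta, a = 2000, 100, 1.0, 0.98
H = np.ones(N)

def run(colored: bool) -> float:
    est = []
    for _ in range(n_exp):
        w = rng.normal(size=N)
        v = lfilter([0.0, 1 - a], [1.0, -a], w) if colored else w
        y = H * theta + v
        est.append(y.mean())                  # LS estimate for H = ones
    return float(np.var(est))                 # variance of the estimation error

print("white noise:   var =", run(False))
print("colored noise: var =", run(True))      # noticeably larger, as in the text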

Figure 5.4 Asymptotic performance of the estimator with white and colored noise (panels: A/D parameter and estimate vs. experiment, B/E residual and noise vs. time, C/F auto-correlation of residual and noise vs. time lag)

Pathological case: An interesting pathological case occurs when the data matrix H is square and non-singular with N = M, and the noise is a zero-mean white noise process. From Eq. (5.48) it can be deduced that the mean-squared residual is zero even if the noise variance is non-zero and the number of data samples is finite and merely equal to the number of unknown parameters. However, cov(θ̂) is non-zero, as can be deduced from Eq. (5.36).

5.3.4 Model and Systemic Errors: Bias and Variance Errors

It is important to emphasize that the properties of the least-squares estimator hold if the linear model given by Eq. (5.1) and the statistics of the measurement noise are both accurate. In practice, the above assumptions may not hold. The structure of an assumed model, termed the nominal model and denoted H_0, and the true model H generally differ, as the structure of the true model, characterized by the order of the numerator, the order of the denominator, and the delay, is unknown unless extreme care is taken to validate the model. For example, the structure of the model may be derived from physical laws and its validation performed by an analysis of the residual. Further, there will be systemic errors such as those resulting from inaccurate mean and covariance of the measurement noise, and external inputs contaminating the measurements, including low-frequency disturbances, drifts, offsets, trends, and non-zero-mean noise.

There will be a bias error due to model mismatch, and both bias and variance errors due to the measurement noise and disturbance. If, however, the noise and the disturbance are zero-mean, there will be only a variance error.

5.3.4.1 Model Error

Let us consider the effect of modeling error on the covariance of the estimation error and the residual. The true data matrix H of the measurement model (5.1) is not known accurately, and is assumed to be H_0. The assumed or nominal model is given by

y = H_0 θ + v    (5.62)


When the nominal data matrix H_0 is employed in computing the estimate θ̂ and the residual e, we get

θ̂ = (H_0^T W_0 H_0)^{-1} H_0^T W_0 y    (5.63)

e = (I − H_0 (H_0^T W_0 H_0)^{-1} H_0^T W_0) y    (5.64)

Substituting for y using the true model y = Hθ + v in Eq. (5.63) we get

θ̂ = (H_0^T W_0 H_0)^{-1} H_0^T W_0 Hθ + (H_0^T W_0 H_0)^{-1} H_0^T W_0 v    (5.65)

The estimate is biased:

E[θ̂] = (H_0^T W_0 H_0)^{-1} H_0^T W_0 Hθ ≠ θ    (5.66)

Consider the residual given by Eq. (5.64). Substituting for y using the true model,

e = (I − H_0 (H_0^T W_0 H_0)^{-1} H_0^T W_0) Hθ + (I − H_0 (H_0^T W_0 H_0)^{-1} H_0^T W_0) v    (5.67)

Assuming that the conditions given by Eqs. (5.39) and (5.40), on the asymptotic behavior of the covariance matrix of the estimation error and of the noise respectively, hold, we get

e → (I − H_0 (H_0^T W_0 H_0)^{-1} H_0^T W_0) Hθ + v as N → ∞    (5.68)

This shows that the residual asymptotically approaches the noise plus a bias term. The mean of the residual e is asymptotically non-zero, given by

E[e] → (I − H_0 (H_0^T W_0 H_0)^{-1} H_0^T W_0) Hθ as N → ∞    (5.69)

The term (I − H_0 (H_0^T W_0 H_0)^{-1} H_0^T W_0) Hθ is termed the bias error. As a result of the bias error, the auto-correlation of the residual will not be a delta function.
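The bias predicted by Eq. (5.66) can be checked numerically. The following hedged Python sketch (not from the book) uses, purely for illustration, the 4 × 2 true matrix and the 4 × 1 nominal matrix of Example 5.3 below, with W_0 = I and an illustrative true parameter, and compares the Monte Carlo mean of the estimate with the asymptotic value (H_0^T H_0)^{-1} H_0^T Hθ.

# Sketch: bias of the estimate under a model-structure (order) error.
import numpy as np

rng = np.random.default_rng(7)
H  = np.array([[2., 1.], [3., 4.], [2., 5.], [4., 5.]])   # true 4x2 data matrix (as in Example 5.3)
H0 = np.array([[2.], [3.], [2.], [4.]])                   # nominal 4x1 data matrix (wrong order)
theta = np.array([1.0, 2.0])                              # illustrative true parameters
n_exp, est = 5000, []
for _ in range(n_exp):
    y = H @ theta + 0.1 * rng.normal(size=4)
    est.append(np.linalg.lstsq(H0, y, rcond=None)[0][0])
bias_pred = np.linalg.solve(H0.T @ H0, H0.T @ (H @ theta)).item()   # mean predicted by Eq. (5.66)
print(np.mean(est), bias_pred)   # the Monte Carlo mean matches the biased value, not theta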

5.3.4.2 Systemic Error

In this case there is a systemic error but no modeling error, that is, H = H_0. The systemic error may be modeled as an unknown additive term μ:

y = Hθ + μ + v    (5.70)

Using Eq. (5.18), substituting for y, and simplifying, the estimate θ̂ becomes

θ̂ = θ + (H^T W H)^{-1} H^T W μ + (H^T W H)^{-1} H^T W v    (5.71)

The estimate is biased:

E[θ̂] = θ + (H^T W H)^{-1} H^T W μ ≠ θ    (5.72)


Assuming W = Σ_v^{-1}, the covariance of the estimation error is

cov(θ̂) = (H^T Σ_v^{-1} H)^{-1}    (5.73)

Consider the expression for the residual given by Eq. (5.24). Substituting for y we get

e = μ + v − H (H^T W H)^{-1} H^T W v    (5.74)

Assuming that the conditions given by Eqs. (5.39) and (5.40), on the asymptotic behavior of the covariance matrix of the estimation error and of the noise respectively, hold, we get

e → μ + v as N → ∞    (5.75)

This shows that the residual asymptotically approaches the noise plus a bias term. Due to the presence of the systemic error there will be a bias error μ and a covariance error (H^T Σ_v^{-1} H)^{-1}. If, however, the noise and the disturbance are zero-mean, there will be only a covariance error. The auto-correlation of the residual will not be a delta function.

Example 5.3 Illustrative example: performance with model and systemic errors. Consider the model (5.62). An example of a second-order system is considered with a 4 × 2 data matrix H. First, the modeling error is simulated by selecting a wrong structure, assuming that the model order is 1 instead of the true order 2, so that H_0 is a 4 × 1 vector. The actual and the assumed data matrices were

H = [2 1; 3 4; 2 5; 4 5]  and  H_0 = [2; 3; 2; 4]

Then a systemic error was introduced by adding a bias term to the model, simulating a constant disturbance in the true model. In this case it was assumed that the true and the assumed data matrices are equal, H_0 = H. Figure 5.5 shows the results of the simulation when there are modeling and systemic errors. Subfigures (a) and (b) show respectively the true parameters and the estimate (note that, due to the modeling error resulting from assuming that the model order is 1 instead of 2, there is only one estimate but two unknown parameters) and the auto-correlation of the residual together with the auto-correlation of the noise. Similarly, subfigures (c) and (d) show the true parameters and their estimates and the auto-correlations when there is a systemic error resulting from an additive bias term μ.

Comment When there is a modeling and/or a systemic error, there will be a bias in the parameter estimates and the auto-correlation of the residual will not be a delta function. The auto-correlation of the residual is a good indicator of a systemic error, a model error, or both.

Lemma 5.1 Let θ̃ be some linear unbiased estimate of θ given by

θ̃ = F y    (5.76)

Then

cov(θ̂) ≤ cov(θ̃)    (5.77)

where, from (5.37), cov(θ̂) = (H^T Σ_v^{-1} H)^{-1}.


Figure 5.5 Estimates and correlations with model and systemic errors (panels: A estimate with model error, B correlations of residual and noise with model error, C estimate with systemic error, D correlations of residual and noise with systemic error)

Proof: Using Eq. (5.1) we get

θ̃ = F Hθ + F v    (5.78)

Taking the expectation we get

E[θ̃] = F Hθ    (5.79)

Since θ̃ is unbiased, that is, E[θ̃] = θ, we get

F H = I    (5.80)

Substituting Eq. (5.80) in Eq. (5.78) yields

θ̃ = θ + F v    (5.81)

The covariance of the parameter estimation error, cov(θ̃), becomes

cov(θ̃) = F E[v v^T] F^T = F Σ_v F^T    (5.82)


Recall the covariance of the least-squares estimate (5.37), cov(θ̂) = (H^T Σ_v^{-1} H)^{-1}. Define the following positive semi-definite matrix, denoted P_FH:

P_FH = (F − (H^T W H)^{-1} H^T W) Σ_v (F − (H^T W H)^{-1} H^T W)^T    (5.83)

where W = Σ_v^{-1}. Simplifying using W = Σ_v^{-1},

P_FH = F Σ_v F^T + (H^T W H)^{-1} − F H (H^T W H)^{-1} − (H^T W H)^{-1} H^T F^T    (5.84)

Using Eq. (5.80) we get

P_FH = F Σ_v F^T − (H^T W H)^{-1}    (5.85)

Since P_FH is positive semi-definite (P_FH ≥ 0 from Eq. (5.83)), we deduce

F Σ_v F^T ≥ (H^T W H)^{-1}    (5.86)

Comparing the expressions for cov(θ̂) and cov(θ̃) from Eqs. (5.37) and (5.82), and using the inequality (5.86), we conclude

cov(θ̂) ≤ cov(θ̃)    (5.87)

Comment The weighted least-squares estimator with W = Σ_v^{-1} is the best linear unbiased estimator. In a later section on the Cramer–Rao inequality, we will show that the weighted least-squares estimator is the best if the PDF of the noise is Gaussian.

5.4 Illustrative Examples

Example 5.4 Scalar model

y = Hθ + v    (5.88)

where H and θ are scalars.

Solution: H† = (H^T W H)^{-1} H^T W = 1/H, P_r = H H† = 1, 1 − P_r = 0.

θ̂ = H† y = y / H    (5.89)

ŷ = P_r y = y    (5.90)

e = y − ŷ = (1 − P_r) y = 0    (5.91)

cov(θ̂) = (H^T H)^{-1} σ_v² = σ_v² / H²    (5.92)

Comment This is a pathological case. The residual is zero even though the output is corrupted by noise. However, the covariance of the parameter estimation error is not zero.


Example 5.5 Vector model with scalar parameter

y = Hθ + v    (5.93)

where H is a N × 1 vector of all ones, H = [1 1 ⋯ 1]^T, and y = [y(1) y(2) ⋯ y(N)]^T is a N × 1 vector.

Solution: H† = (H^T W H)^{-1} H^T W, P_r = H H†.

a. W = I

The error term v is an independent and identically distributed zero-mean white noise with covariance E[v v^T] = σ_v² I.

H† = (H^T H)^{-1} H^T, H^T H = N, H^T y = ∑_{i=1}^{N} y(i), P_r = H H† = (1/N) I_1, and I − P_r = I − (1/N) I_1

where 1 = [1 1 ⋯ 1]^T is a N × 1 vector of all ones and I_1 = 1 1^T is a N × N matrix whose elements are all one. The estimate θ̂ is

θ̂ = H† y = (H^T H)^{-1} H^T y = (1/N) ∑_{i=1}^{N} y(i)    (5.94)

ŷ = P_r y = 1 ((1/N) ∑_{i=1}^{N} y(i))    (5.95)

The covariance of the estimation error is

cov(θ̂) = σ_v² (H^T H)^{-1} = σ_v² / N    (5.96)

The residual is given by

e = (I − P_r) v = v − P_r v = (I − (1/N) I_1) v    (5.97)

The covariance of the residual, Eq. (5.52), becomes

E[e e^T] = (I − (1/N) I_1) Σ_v    (5.98)

Consider the covariance of the estimation error given by Eq. (5.96). We get

cov(θ̂) → 0 as N → ∞    (5.99)

The asymptotic condition (5.39) is clearly satisfied as

lim_{N→∞} (H^T H / N)^{-1} = 1    (5.100)


Further, as the noise is a zero-mean white noise, it satisfies Eq. (5.40):

lim_{N→∞} (1/N) ∑_{i=1}^{N} v(i) = 0    (5.101)

Hence the residual approaches the noise asymptotically:

lim_{N→∞} e = v    (5.102)

b. W = Σ_v^{-1} = diag(1/σ²_v1, 1/σ²_v2, ⋯, 1/σ²_vN)

The error term v is a zero-mean, independent but not identically distributed white noise with covariance Σ_v = diag(σ²_v1, σ²_v2, ⋯, σ²_vN), so that W = Σ_v^{-1} = diag(1/σ²_v1, 1/σ²_v2, ⋯, 1/σ²_vN), and

(H^T W H)^{-1} = (∑_{i=1}^{N} 1/σ²_vi)^{-1}

H† = (H^T W H)^{-1} H^T W = (∑_{i=1}^{N} 1/σ²_vi)^{-1} [1/σ²_v1  1/σ²_v2  ⋯  1/σ²_vN]

P_r = H (H^T Σ_v^{-1} H)^{-1} H^T Σ_v^{-1}

The weighted least-squares estimate θ̂ = H† y is

θ̂ = (H^T Σ_v^{-1} H)^{-1} H^T Σ_v^{-1} y = (∑_{i=1}^{N} 1/σ²_vi)^{-1} ∑_{i=1}^{N} y(i)/σ²_vi    (5.103)

The covariance of the estimation error is

cov(θ̂) = (∑_{i=1}^{N} 1/σ²_vi)^{-1}    (5.104)

The residual is given by

e = (I − P_r) v = v − H (H^T Σ_v^{-1} H)^{-1} H^T Σ_v^{-1} v = (I − (∑_{i=1}^{N} 1/σ²_vi)^{-1} I_1 Σ_v^{-1}) v    (5.105)

The covariance of the residual, Eq. (5.52), becomes

cov(e) = Σ_v − H (H^T Σ_v^{-1} H)^{-1} H^T = Σ_v − (∑_{i=1}^{N} 1/σ²_vi)^{-1} I_1    (5.106)

Consider the covariance of the estimation error given by Eq. (5.104). We get

cov(θ̂) = (∑_{i=1}^{N} 1/σ²_vi)^{-1} → 0 as N → ∞    (5.107)


The asymptotic condition (5.39) is clearly satisfied as

lim_{N→∞} H^T Σ_v^{-1} H / N = lim_{N→∞} (1/N) ∑_{i=1}^{N} 1/σ²_vi < ∞    (5.108)

Further, as the noise is a zero-mean white noise, it satisfies Eq. (5.40):

lim_{N→∞} (1/N) ∑_{i=1}^{N} v(i)/σ²_vi = 0    (5.109)

Hence the residual approaches the noise asymptotically:

lim_{N→∞} e = v    (5.110)

c. W = Σ_v^{-1}

The error term v is a colored noise with a non-diagonal covariance matrix Σ_v. The pseudo-inverse and the projection matrix are H† = (H^T Σ_v^{-1} H)^{-1} H^T Σ_v^{-1} and P_r = H (H^T Σ_v^{-1} H)^{-1} H^T Σ_v^{-1}. The information matrix H^T Σ_v^{-1} H is a positive scalar as H = [1 1 ⋯ 1]^T.

θ̂ = H† y = (H^T Σ_v^{-1} H)^{-1} H^T Σ_v^{-1} y    (5.111)

The covariance of the estimation error is

cov(θ̂) = (H^T Σ_v^{-1} H)^{-1}    (5.112)

The residual is given by

e = v − P_r v = v − H (H^T Σ_v^{-1} H)^{-1} H^T Σ_v^{-1} v    (5.113)

Consider the covariance of the estimation error given by Eq. (5.112). Let h̄_i be the ith row of H̄ = Σ_v^{-1/2} H, so that H^T Σ_v^{-1} H = ∑_{i=1}^{N} h̄_i^T h̄_i. Since the terms {h̄_i^T h̄_i} are positive and H^T Σ_v^{-1} H = ∑_{i=1}^{N} h̄_i^T h̄_i is a sum of positive values, the expression for the covariance, Eq. (5.112), becomes

cov(θ̂) = (∑_{i=1}^{N} h̄_i^T h̄_i)^{-1} → 0 as N → ∞    (5.114)

Since H^T Σ_v^{-1} H is a positive definite matrix, using the Appendix, the asymptotic condition (5.39) is clearly satisfied as

0 < lim_{N→∞} (H^T Σ_v^{-1} H / N)^{-1} < ∞    (5.115)

Further, as the noise is zero-mean, it satisfies Eq. (5.40):

lim_{N→∞} H^T W v / N = 0    (5.116)


Hence the residual approaches the noise asymptotically:

lim_{N→∞} e = v    (5.117)

5.4.1 Non-Zero-Mean Measurement Noise

In some applications there is a bias error due to a calibration error, an offset, or some other external input effect on the measurements. If the bias error is an unknown constant, we may combine the unknown bias error with the desired parameter to be estimated, and estimate an augmented unknown parameter formed of the desired parameter and the bias parameter.

Example 5.6 Non-zero-mean error term

E[v] = μ    (5.118)

The linear model can be rewritten with an error term that is zero-mean:

y = Hθ + μ + v    (5.119)

where v is a zero-mean term. Define the augmented feature vector θ_ag = [θ^T μ]^T and the augmented data matrix H_ag = [H 1], where 1 is a N × 1 vector of all ones. Then the estimate of the augmented vector, θ̂_ag, becomes

θ̂_ag = (H_ag^T W H_ag)^{-1} H_ag^T W y    (5.120)
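A minimal Python sketch of this augmentation (not from the book; the model sizes and noise level are illustrative, and W = I is used for simplicity) is given below: the unknown constant bias μ is estimated jointly with θ by appending a column of ones to the data matrix.

# Sketch: joint estimation of theta and a constant measurement bias, Eq. (5.120).
import numpy as np

rng = np.random.default_rng(8)
N, theta_true, mu = 500, np.array([2.0, -1.0]), 0.8
H = rng.normal(size=(N, 2))
y = H @ theta_true + mu + 0.2 * rng.normal(size=N)

H_ag = np.column_stack([H, np.ones(N)])                  # augmented data matrix [H 1]
theta_ag, *_ = np.linalg.lstsq(H_ag, y, rcond=None)      # W = I for simplicity
print("theta estimate:", theta_ag[:2], " bias estimate:", theta_ag[2])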

Remarks

∙ The over-determined set of equations attenuates the effect of noise.
∙ The larger the number of data samples N compared to the number of unknown parameters M, the smaller the covariance of the estimation error of the least-squares estimator in the presence of noise, as can be deduced from Eqs. (5.96), (5.107), and (5.114).
∙ The larger the number of data samples N compared to the number of unknown parameters M, the smaller the mean-squared residual, as shown in Eqs. (5.98), (5.105), and (5.113).

5.5 Cramer–Rao Lower Bound

The most desirable property of an estimator is that it is unbiased and has the lowest possible parameter error covariance. In Section 5.3 we showed that the least-squares estimator is unbiased and has the lowest possible error covariance among the class of all linear unbiased estimators. A question arises as to how the linear least-squares estimator compares with the class of all (instead of the restricted class of linear) unbiased estimators. The CRLB gives the theoretically minimal variance for the class of all unbiased estimators of a deterministic parameter. It is widely used to measure the efficiency of an estimator. A measure of efficiency is the ratio of the theoretically minimal variance given by the CRLB to the actual variance of the estimator. This measure is less than or equal to 1. An estimator with efficiency 1.0 is said to be an "efficient estimator." In many applications, for mathematical tractability, an estimator is derived by making simplifying assumptions. The performance of the estimator is evaluated using the measure of efficiency. For example, the least-squares estimator is efficient if the PDF is Gaussian and the weighting matrix is chosen to be the inverse of the covariance matrix of the measurement noise. The derivation of the CRLB is simple. The estimator is assumed to be unbiased and the PDF is assumed to be absolutely continuous. Differentiating the expression for an unbiased estimator with respect to the parameter, the CRLB is derived using the Cauchy–Schwartz inequality. The lower bound is shown to be equal to the inverse of the Fisher information. Although there exists a lower bound, an estimator which achieves the lower bound may not exist. A condition for the existence of the lower bound is given. It is shown that, if an estimator achieving the lower bound exists, it is given by the maximum likelihood estimator.

Let y = [y(1) y(2) ⋯ y(N)]^T be a N × 1 vector of measurements characterized by the probability density function (PDF) f_y(y). The measurement y is a function of an unknown parameter vector θ. Let θ̂ be an unbiased estimator of θ that is only a function of y and not a function of the unknown parameter θ. Then

cov(θ̂) = E[(θ̂ − θ)(θ̂ − θ)^T] ≥ I_F^{-1}    (5.121)

where the M × M Fisher information matrix is

I_F(θ) = E[(∂ ln f_y(y)/∂θ)(∂ ln f_y(y)/∂θ)^T]

5.6 Maximum Likelihood Estimation

The maximum likelihood estimator is widely used as the estimate gives the minimum estimation error covariance, and it serves as a gold standard for evaluating the performance of other estimators. It is efficient, as it achieves the Cramer–Rao lower bound whenever an estimator attaining the bound exists. It is based on maximizing a likelihood function of the PDF of the data expressed as a function of the parameter to be estimated. In general the estimates are implicit nonlinear functions of the parameter to be estimated and the estimate is obtained recursively. If the PDF of the measurement data is Gaussian, the maximum likelihood estimation method simplifies to a weighted least-squares method where the weight is the inverse of the noise covariance matrix. The ML estimate is obtained by maximizing the log-likelihood function

θ̂ = argmax_θ {log f_y(y)}    (5.122)

5.6.1 Illustrative Examples

Example 5.7 Efficiency of the least-squares estimate. Consider the measurement model y = Hθ + v, where v is zero-mean Gaussian with covariance Σ_v. The PDF of y is

f_y(y) = (1 / √((2π)^N det(Σ_v))) exp{−(y − Hθ)^T Σ_v^{-1} (y − Hθ) / 2}    (5.123)

The log-likelihood function ln f_y(y) becomes

ln f_y(y) = −(y − Hθ)^T Σ_v^{-1} (y − Hθ) / 2 − (N/2) ln 2π − (1/2) ln det Σ_v    (5.124)

Differentiating ln f_y(y) with respect to θ yields

∂ ln f_y(y) / ∂θ = H^T Σ_v^{-1} (y − Hθ)    (5.125)


The Fisher information is given by

I_F(θ) = E[(∂ ln f_y(y)/∂θ)(∂ ln f_y(y)/∂θ)^T] = E[H^T Σ_v^{-1} (y − Hθ)(y − Hθ)^T Σ_v^{-1} H]    (5.126)

Using y = Hθ + v we get

I_F = H^T Σ_v^{-1} H    (5.127)

Using the expression for the covariance of the estimation error given by Eq. (5.37), we get

cov(θ̂) = (H^T Σ_v^{-1} H)^{-1} = I_F^{-1}(θ)    (5.128)

We can deduce from the inequality (5.121) that the estimator is efficient, and the Fisher information is

I_F = H^T Σ_v^{-1} H    (5.129)

Example 5.8 Maximum likelihood and the least-squares approach. Consider the Gaussian PDF given in Example 5.7. Differentiating ln f_y(y) with respect to θ and setting it to zero, the ML estimate θ̂_ML satisfies

∂ ln f_y(y) / ∂θ = H^T Σ_v^{-1} (y − Hθ) = 0

Assuming that H has full rank, (H^T Σ_v^{-1} H) is non-singular. Multiplying by (H^T Σ_v^{-1} H)^{-1},

(H^T Σ_v^{-1} H)^{-1} ∂ ln f_y(y)/∂θ = (H^T Σ_v^{-1} H)^{-1} H^T Σ_v^{-1} y − θ = 0    (5.130)

The ML estimator θ̂_ML is the value of θ that satisfies the above equation:

θ̂_ML = (H^T Σ_v^{-1} H)^{-1} H^T Σ_v^{-1} y    (5.131)

This shows that the weighted least-squares estimate and the maximum likelihood estimate are identical if W = Σ_v^{-1}.

Comments If the PDF of the noise v is zero-mean Gaussian, then

∙ The weighted least-squares estimator with weight W = Σ_v^{-1} and the ML estimator are both efficient as they attain the Cramer–Rao lower bound. They both yield the minimum variance unbiased estimator.
∙ The Cramer–Rao lower bound for both the ML and the weighted least-squares method is I_F^{-1}(θ) = (H^T Σ_v^{-1} H)^{-1}.


5.6.1.1 Illustrative Example

Consider the measurement model with a scalar unknown parameter but vector measurements, y = Hθ + v, given in Example 5.7.

a) The error term v is an independent and identically distributed zero-mean white noise with covariance E[v v^T] = σ_v² I. Choose W = I. The estimate is θ̂ = (1/N) ∑_{i=1}^{N} y(i) and the variance of the estimation error is var(θ̂) = σ_v² / N. Let us verify whether the estimate is efficient. The Cramer–Rao lower bound is given by

I_F^{-1}(θ) = (H^T Σ_v^{-1} H)^{-1} = σ_v² / N    (5.132)

It is efficient, as the covariance of the estimator equals the Cramer–Rao lower bound. The Fisher information is

I_F(θ) = H^T Σ_v^{-1} H = N / σ_v²    (5.133)

b) The error term v is a zero-mean, independent but not identically distributed white noise with covariance Σ_v = diag(σ²_v1, σ²_v2, ⋯, σ²_vN), and W = Σ_v^{-1} = diag(1/σ²_v1, 1/σ²_v2, ⋯, 1/σ²_vN). The estimate and the variance of the estimation error are

θ̂ = (∑_{i=1}^{N} 1/σ²_vi)^{-1} ∑_{i=1}^{N} y(i)/σ²_vi  and  var(θ̂) = (∑_{i=1}^{N} 1/σ²_vi)^{-1}

The Cramer–Rao lower bound is given by

I_F^{-1}(θ) = (H^T Σ_v^{-1} H)^{-1} = (∑_{i=1}^{N} 1/σ²_vi)^{-1}    (5.134)

It is efficient and the Fisher information is

I_F(θ) = H^T Σ_v^{-1} H = ∑_{i=1}^{N} 1/σ²_vi    (5.135)

5.7 Least-Squares Solution of Under-Determined System

If (i) the number of equations N is less than the number of unknowns M or (ii) the number of equations N is equal to the number of unknowns M but the resulting square matrix is singular, the system of equations is called under-determined. In this case:

∙ A solution will always exist as the observation is always in the range space of the matrix.
∙ There are infinitely many solutions.

Since there are infinitely many solutions, it is preferable to choose from the infinite set of solutions the one which has minimum norm. Let H be a N × M data matrix with N < M. Assuming H has full rank, which implies H H^T is invertible, the optimal solution θ̂ is

θ̂ = H^T (H H^T)^{-1} y    (5.136)

Figure 5.6 Visual representation of the under-determined system

Similarly to the over-determined case, the projection matrix is given by

P_r = H^T (H H^T)^{-1} H    (5.137)

As an under-determined system has infinitely many solutions, the general solution is given by

θ̂ = H^T (H H^T)^{-1} y + (I − P_r) θ_0    (5.138)

where θ_0 is arbitrary. It can be verified that θ̂ is a solution of Hθ̂ = y. The estimate ŷ of y becomes

ŷ = Hθ̂ = H H^T (H H^T)^{-1} y = y    (5.139)

Hence the residual e is

e = 0    (5.140)

Comments One cannot compare θ and θ̂. In general θ̂ ≠ θ, since θ̂ is the minimum norm solution whereas θ is one of the infinitely many solutions of Hθ = y.

A visual representation of the linear equation y = Hθ and its solution θ̂ = H^T (H H^T)^{-1} y is given in Figure 5.6. See the Appendix for the proof of the least-squares estimate.
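A minimal Python sketch (not from the book; dimensions and data are illustrative) of the minimum-norm solution (5.136) and of the family of solutions (5.138) is given below; both members satisfy Hθ = y, and the minimum-norm solution has the smaller norm.

# Sketch: minimum-norm and general solutions of an under-determined system.
import numpy as np

rng = np.random.default_rng(9)
N, M = 3, 6                                   # fewer equations than unknowns
H = rng.normal(size=(N, M))
y = rng.normal(size=N)

theta_min = H.T @ np.linalg.solve(H @ H.T, y)        # Eq. (5.136); equals np.linalg.pinv(H) @ y
Pr = H.T @ np.linalg.solve(H @ H.T, H)               # projection matrix, Eq. (5.137)
theta_0 = rng.normal(size=M)                         # arbitrary vector
theta_gen = theta_min + (np.eye(M) - Pr) @ theta_0   # general solution, Eq. (5.138)

print(np.allclose(H @ theta_min, y), np.allclose(H @ theta_gen, y))   # both solve H theta = y
print(np.linalg.norm(theta_min) <= np.linalg.norm(theta_gen))         # minimum norm property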

5.8 Singular Value Decomposition

The SVD is one of the most elegant algorithms in numerical linear algebra for providing quantitative information about the structure of a system of linear equations. The SVD provides a robust solution of both the over-determined and the under-determined least-squares problems, matrix approximation, conditioning of ill-conditioned matrices, and principal component analysis. It is employed in a variety of signal processing applications including spectrum analysis, filter design, system identification, model order reduction, estimation, image compression, and data reduction. It is the basis for MIMO robustness analysis using SVD plots of various frequency responses. It also appears in many of the standard algorithms of robust control, such as H∞ and H2 synthesis and state-space balancing. It can be used to improve the SNR for large arrays by identifying the directions or linear combinations of sensors which have the greatest sensitivity to the system's important parameters, such as modes of vibration. It has an important role in least-squares solutions with application to signal processing, estimation, and system identification.

Theorem Every N × M matrix H of rank r ≤ min(M, N) can be decomposed as

H = U S V^T    (5.141)

where


∙ U is a N × N unitary matrix,

U = [u_1 u_2 ⋯ u_r u_{r+1} ⋯ u_N],  U = [U_1 U_2],  U_1 = [u_1 u_2 ⋯ u_r],  U_2 = [u_{r+1} u_{r+2} ⋯ u_N],  U^T U = U U^T = I

where u_i is a N × 1 vector; U is called the left singular matrix of H.
∙ V is a M × M unitary matrix,

V = [v_1 v_2 ⋯ v_r v_{r+1} ⋯ v_M],  V = [V_1 V_2],  V_1 = [v_1 v_2 ⋯ v_r],  V_2 = [v_{r+1} v_{r+2} ⋯ v_M],  V V^T = V^T V = I

where v_i is a M × 1 vector; V is called the right singular matrix of H.
∙ S is a N × M rectangular matrix given by

S = [Σ 0; 0 0],  Σ = diag[σ_1 σ_2 σ_3 ⋯ σ_r]

where {σ_i} are the singular values of H, which are positive:

σ_1 ≥ σ_2 ≥ σ_3 ≥ ⋯ ≥ σ_r > 0 and σ_{r+1} = σ_{r+2} = ⋯ = σ_M = 0

The matrix H is decomposed into r rank-one N × M matrices {uᵢvᵢᵀ}:

H = \sum_{i=1}^{r} \sigma_i u_i v_i^T   (5.142)

Thus HᵀH and HHᵀ have the same eigenvalues {σᵢ²}, but with {vᵢ} and {uᵢ} respectively as eigenvectors.

The choice of HᵀH or HHᵀ in the definition of the singular values is arbitrary. While these two matrices have different sizes and, therefore, a different number of eigenvalues (M eigenvalues for HᵀH and N for HHᵀ), they have the same non-zero eigenvalues, the positive square roots of which are called the singular values of the matrix H. Finally, it is obvious that the singular values of H and Hᵀ are the same: the singular values of H (Hᵀ) are the positive square roots of the eigenvalues of HᵀH (HHᵀ). A pictorial representation of the SVD, namely H = USVᵀ, is shown in Figure 5.7.

Figure 5.7 Pictorial representation of the Singular Value Decomposition
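The decomposition in Eq. (5.141) and its rank-one expansion in Eq. (5.142) are easy to verify numerically. The sketch below (a minimal NumPy example; the test matrix is arbitrary) computes the SVD and reconstructs H from the rank-one terms σᵢuᵢvᵢᵀ.

```python
import numpy as np

# Arbitrary 3 x 2 test matrix (N = 3, M = 2)
H = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Full SVD: U is N x N, Vt is M x M, s holds the singular values
U, s, Vt = np.linalg.svd(H, full_matrices=True)

# Rebuild H as the sum of rank-one matrices sigma_i * u_i * v_i^T, Eq. (5.142)
H_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))

print(np.allclose(H, H_rebuilt))          # True
print(np.allclose(U @ U.T, np.eye(3)),    # U and V are unitary
      np.allclose(Vt @ Vt.T, np.eye(2)))
```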



5.8.1 Illustrative Example: Singular Values and Eigenvalues of Square Matrices

Example 5.9 Non-symmetric square matrix
Consider a square matrix H:

H = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}

The eigenvalue–eigenvector decomposition of H is

\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} -0.8246 & -0.4160 \\ 0.5658 & -0.9094 \end{bmatrix} \begin{bmatrix} -0.3723 & 0 \\ 0 & 5.3723 \end{bmatrix} \begin{bmatrix} -0.9231 & 0.4222 \\ -0.5743 & -0.8370 \end{bmatrix}

The singular value decomposition H = USVᵀ has

U = \begin{bmatrix} -0.4046 & -0.9145 \\ -0.9145 & 0.4046 \end{bmatrix}, \quad S = \begin{bmatrix} 5.4650 & 0 \\ 0 & 0.3660 \end{bmatrix}, \quad V = \begin{bmatrix} -0.5760 & 0.8174 \\ -0.8174 & -0.5760 \end{bmatrix}

Example 5.10 Symmetric square matrix

H = \begin{bmatrix} 1 & 2 \\ 2 & 3 \end{bmatrix}

The eigenvalue–eigenvector decomposition of H is

\begin{bmatrix} 1 & 2 \\ 2 & 3 \end{bmatrix} = \begin{bmatrix} -0.8507 & 0.5257 \\ 0.5257 & 0.8507 \end{bmatrix} \begin{bmatrix} -0.2361 & 0 \\ 0 & 4.2361 \end{bmatrix} \begin{bmatrix} -0.8507 & 0.5257 \\ 0.5257 & 0.8507 \end{bmatrix}

The singular value decomposition H = USVᵀ has

U = \begin{bmatrix} -0.5257 & -0.8507 \\ -0.8507 & 0.5257 \end{bmatrix}, \quad S = \begin{bmatrix} 4.2361 & 0 \\ 0 & 0.2361 \end{bmatrix}, \quad V = \begin{bmatrix} -0.5257 & 0.8507 \\ -0.8507 & -0.5257 \end{bmatrix}

Comment The SVD and the eigenvalue decomposition of a square matrix are in general different. If, however, the matrix is symmetric, then the singular values are the absolute values of the eigenvalues.

Example 5.11 Rectangular matrix

H = \begin{bmatrix} 1 & 2 \\ 2 & 4 \\ 3 & 6 \end{bmatrix}

The SVD of H is given by

U = \begin{bmatrix} -0.2673 & 0.9562 & 0.1195 \\ -0.5345 & -0.0439 & -0.8440 \\ -0.8018 & -0.2895 & 0.5228 \end{bmatrix}, \quad V = \begin{bmatrix} -0.4472 & -0.8944 \\ -0.8944 & 0.4472 \end{bmatrix}, \quad S = \begin{bmatrix} 8.3666 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}

Its rank is 1 as it has only one non-zero singular value. The columns of H are linearly dependent.
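A quick numerical check of Examples 5.9–5.11 (a minimal NumPy sketch that simply reproduces the matrices above) confirms that the eigenvalues and singular values of a non-symmetric square matrix differ, that they agree up to sign for a symmetric matrix, and that the rank of the rectangular matrix equals the number of non-zero singular values.

```python
import numpy as np

H1 = np.array([[1.0, 2.0], [3.0, 4.0]])               # Example 5.9
H2 = np.array([[1.0, 2.0], [2.0, 3.0]])               # Example 5.10 (symmetric)
H3 = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])   # Example 5.11

# Non-symmetric: eigenvalues and singular values differ
print(np.sort(np.abs(np.linalg.eigvals(H1))), np.linalg.svd(H1, compute_uv=False))

# Symmetric: singular values are the absolute values of the eigenvalues
print(np.sort(np.abs(np.linalg.eigvals(H2))), np.sort(np.linalg.svd(H2, compute_uv=False)))

# Rectangular: rank = number of non-zero singular values (here 1)
s = np.linalg.svd(H3, compute_uv=False)
print(s, np.sum(s > 1e-10))
```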



5.8.2 Computation of the Least-Squares Estimate Using the SVD

The least-squares problem was formulated and solved assuming that the data matrix H is of full rank. Thanks to the SVD, the full-rank restriction may be lifted. Consider the linear measurement model (5.1), where the data matrix H need not be of full rank; its rank is r ≤ min(M, N).

5.8.2.1 Un-Weighted Least-Squares Estimate

The least-squares estimate θ̂ is given by

\hat{\theta} = V \begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix} U^T y   (5.143)

where \Sigma^{-1} = \mathrm{diag}( \sigma_1^{-1} \; \sigma_2^{-1} \; \cdots \; \sigma_r^{-1} ) and U, V, and Σ are obtained from the SVD of the data matrix H:

H = U \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix} V^T   (5.144)

Compare the estimate θ̂ with the full-rank case given by Eq. (5.18). The pseudo-inverse becomes

H^{\dagger} = V \begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix} U^T   (5.145)

The pseudo-inverse H† may be expressed as a linear combination of r rank-one matrices:

H^{\dagger} = \sum_{i=1}^{r} \frac{1}{\sigma_i} v_i u_i^T   (5.146)
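The following sketch (a minimal NumPy example; the rank-deficient data matrix and observations are chosen purely for illustration) computes the least-squares estimate of Eq. (5.143) directly from the SVD and checks it against the pseudo-inverse of Eq. (5.145).

```python
import numpy as np

# Rank-deficient data matrix (N = 4, M = 3, rank 2) and a noisy observation
H = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 1.0, 3.0]])
y = np.array([1.1, 1.9, 3.1, 5.0])

U, s, Vt = np.linalg.svd(H, full_matrices=True)
r = np.sum(s > 1e-10 * s[0])              # numerical rank

# Pseudo-inverse H^dagger = sum_i (1/sigma_i) v_i u_i^T, Eq. (5.146)
H_pinv = sum((1.0 / s[i]) * np.outer(Vt[i, :], U[:, i]) for i in range(r))

theta_hat = H_pinv @ y                    # least-squares estimate, Eq. (5.143)
print(theta_hat)
print(np.allclose(H_pinv, np.linalg.pinv(H)))   # agrees with NumPy's pinv
```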

5.8.2.2 Weighted Least-Squares Estimate

In the case of the weighted least-squares method, the transformation given in Eq. (5.9) is employed, where W = W^{1/2}W^{1/2}, \bar{y} = W^{1/2}y, \bar{v} = W^{1/2}v, and \bar{H} = W^{1/2}H. The transformed variables \bar{y} = W^{1/2}y and \bar{H} = W^{1/2}H are employed instead of y and H, respectively, in the SVD computations given in Eqs. (5.143) and (5.144).
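A small sketch of this transformation is given below (a minimal NumPy example; the weights, data matrix, and observations are illustrative). It applies W^{1/2} to y and H, reuses the SVD-based estimate, and checks the result against the closed-form weighted least-squares solution (HᵀWH)⁻¹HᵀWy.

```python
import numpy as np

# Illustrative over-determined problem with a diagonal weight matrix
H = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.1, 2.9, 4.2, 4.8])
w = np.array([1.0, 2.0, 4.0, 8.0])        # weights (e.g., inverse noise variances)

W_half = np.diag(np.sqrt(w))              # W = W^{1/2} W^{1/2}
y_bar, H_bar = W_half @ y, W_half @ H     # transformed measurement and data matrix

# SVD-based estimate applied to the transformed problem, Eqs. (5.143)-(5.144)
U, s, Vt = np.linalg.svd(H_bar, full_matrices=False)
theta_svd = Vt.T @ np.diag(1.0 / s) @ U.T @ y_bar

# Closed-form weighted least squares for comparison
W = np.diag(w)
theta_wls = np.linalg.solve(H.T @ W @ H, H.T @ W @ y)

print(np.allclose(theta_svd, theta_wls))  # True
```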

∙ The bounds on the covariance of the estimation error are

\| \mathrm{cov}(\hat{\theta}) \| \le \begin{cases} \dfrac{1}{\sigma_{\min}^2\!\left(\Sigma_v^{-1/2} H\right)} & \text{if } E[v v^T] = \Sigma_v \\[2ex] \dfrac{\sigma_v^2}{\sigma_{\min}^2(H)} & \text{if } E[v v^T] = \sigma_v^2 I \end{cases}   (5.147)

∙ The bounds on the residual are

\sigma_{res}^2 \le \begin{cases} \dfrac{1}{N} \mathrm{trace}(\Sigma_v) & \text{if } E[v v^T] = \Sigma_v \\[1ex] \sigma_v^2 & \text{if } E[v v^T] = \sigma_v^2 I \end{cases}   (5.148)



Comment If the data matrix H is not of maximal rank, r < min(M, N), the SVD gives a least-squares solution θ̂ = H†y such that ‖θ̂‖ has the minimum norm.

5.8.2.3 Ill-Conditioned Matrices

Matrices in which small errors in the matrix elements are substantially amplified to produce large deviations in the solution are referred to as ill-conditioned matrices. Ill-conditioned matrices are those which are close to being rank deficient. Associated with each matrix is a number, called the condition number, which indicates the degree to which it is ill-conditioned. The condition number of the matrix H, denoted κ(H), is defined in terms of the norms of the matrix and of its pseudo-inverse. Since the norm of the matrix is its maximum singular value and the norm of the pseudo-inverse is the reciprocal of its minimum singular value, the condition number is

\kappa(H) = \|H\| \, \|H^{\dagger}\| = \frac{\sigma_{\max}(H)}{\sigma_{\min}(H)}   (5.149)

The condition number κ(H) gives a measure of how much errors in the data matrix H and in the measurement y may be magnified in computing the solution of y = Hθ. Thus it is desirable to have κ(H) close to 1. A poorly conditioned data matrix H will have "small" singular values. As a consequence, the components of H† associated with small singular values in Eq. (5.146) will be amplified. Hence the pseudo-inverse will be very sensitive to any small changes in the singular vectors uᵢ, vᵢ, or both. To avoid this problem, set the small singular values to zero to obtain a truncated Σ, and compute the pseudo-inverse (5.145) using the truncated Σ. However, determining which of the singular values may be deemed small is problem dependent and requires good judgment.

Example 5.12 Ill-conditioned matrix

Consider the measurement model y = Hθ + v where

H = \begin{bmatrix} 1+\varepsilon & 2 \\ 2+\varepsilon & 4 \\ 3+\varepsilon & 6 \end{bmatrix}, \quad \varepsilon = 0.0001, \quad \theta = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad y = \begin{bmatrix} 5 \\ 10 \\ 15 \end{bmatrix}

The data matrix is ill-conditioned as its column vectors are almost linearly dependent. The singular values of H are 8.3673 and 0.0005. The least-squares estimate θ̂ is given by

\hat{\theta} = V \begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix} U^T y = \begin{bmatrix} -1.6666 \\ 3.3333 \end{bmatrix}, \qquad \hat{y} = \begin{bmatrix} 5.0019 \\ 10.0005 \\ 14.9990 \end{bmatrix}

The SVD of H is

U = \begin{bmatrix} -0.2674 & 0.4081 & -0.8729 \\ -0.5345 & -0.8166 & -0.2180 \\ -0.8018 & 0.4083 & 0.4365 \end{bmatrix}, \quad V = \begin{bmatrix} -0.4473 & 0.8944 \\ -0.8944 & -0.4473 \end{bmatrix}, \quad S = \begin{bmatrix} 8.3673 & 0 \\ 0 & 0.0005 \\ 0 & 0 \end{bmatrix}

The condition number is \kappa(H) = \dfrac{\sigma_{\max}(H)}{\sigma_{\min}(H)} = \dfrac{8.3673}{0.0005} \approx 1.7 \times 10^{4}.



Since the smallest singular value 0.0005 is very small compared to the largest, 8.3673, let us zero the small singular value and compute the inverse using the truncated Σ = 8.3673:

\hat{\theta} = V \begin{bmatrix} 1/8.3673 & 0 \\ 0 & 0 \end{bmatrix} U^T y = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \qquad \hat{y} = \begin{bmatrix} 5.0003 \\ 9.9999 \\ 15.0 \end{bmatrix}

The estimate is very accurate with truncation. However, extreme care is required in deciding when a singular value is small.
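The truncation step of Example 5.12 can be tried numerically with the sketch below (a minimal NumPy example; the truncation tolerance is an arbitrary choice). The un-truncated estimate obtained in full double precision need not coincide digit-for-digit with the rounded hand computation above, but it is also far from the true parameter θ = [1, 2]ᵀ, while the truncated estimate recovers it.

```python
import numpy as np

eps = 1e-4
H = np.array([[1 + eps, 2.0],
              [2 + eps, 4.0],
              [3 + eps, 6.0]])
y = np.array([5.0, 10.0, 15.0])

U, s, Vt = np.linalg.svd(H, full_matrices=False)
print(s)                                   # one large, one tiny singular value
print(s[0] / s[1])                         # large condition number

# Full pseudo-inverse: the tiny singular value makes the estimate very sensitive
theta_full = Vt.T @ np.diag(1.0 / s) @ U.T @ y
print(theta_full)                          # far from the true parameter [1, 2]

# Truncated SVD: drop singular values below a chosen tolerance
tol = 1e-3 * s[0]
keep = s > tol
theta_trunc = Vt.T[:, keep] @ np.diag(1.0 / s[keep]) @ U[:, keep].T @ y
print(theta_trunc)                         # approx [1, 2]
```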

The effect of ill-conditioning on the accuracy of the estimate is also evident from the bound on the covariance of the estimation error, which grows as the smallest singular value shrinks:

\| \mathrm{cov}(\hat{\theta}) \| \le \begin{cases} \dfrac{1}{\sigma_{\min}^2\!\left(\Sigma_v^{-1/2} H\right)} & \text{if } E[v v^T] = \Sigma_v \\[2ex] \dfrac{\sigma_v^2}{\sigma_{\min}^2(H)} & \text{if } E[v v^T] = \sigma_v^2 I \end{cases}

5.9 Summary

Model

y = H\theta + v

Objective Function

\min_{\hat{\theta}} \left\{ (y - H\hat{\theta})^T W (y - H\hat{\theta}) \right\}

Least-Squares Estimate
∙ For the under-determined, over-determined, and non-singular square cases of the matrix H:

\hat{\theta} = \begin{cases} H^T (H H^T)^{-1} y & \text{if } N < M \\ H^{-1} y & \text{if } N = M, \; \det(H) \neq 0 \\ (H^T W H)^{-1} H^T W y & \text{if } N > M \end{cases}

∙ The least-squares estimator is unbiased:

E[\hat{\theta}] = \theta

∙ The least-squares estimator is the best linear unbiased estimator.
∙ Orthogonality condition:

E[e^T \hat{y}] = 0

∙ Covariance of the estimation error

\mathrm{cov}(\hat{\theta}) = \begin{cases} \left( H^T \Sigma_v^{-1} H \right)^{-1} & \text{if } E[v v^T] = \Sigma_v \\[1ex] \sigma_v^2 \left( H^T H \right)^{-1} & \text{if } E[v v^T] = \sigma_v^2 I \end{cases}



∙ Residual

\sigma_{res}^2 = \begin{cases} \mathrm{trace}\left\{ \frac{1}{N} (I - P_r) \Sigma_v \right\} & \text{if } E[v v^T] = \Sigma_v \\[1ex] \left( 1 - \frac{M}{N} \right) \sigma_v^2 & \text{if } E[v v^T] = \sigma_v^2 I \end{cases}

∙ Asymptotic properties as N → ∞:

\hat{\theta} \to \theta \quad \text{as } N \to \infty

\lim_{N \to \infty} \sigma_{res}^2 = \sigma_v^2

e \to v \quad \text{as } N \to \infty

E[e(n) e(n-m)] \to E[v(n) v(n-m)] \quad \text{as } N \to \infty

If v is zero-mean white noise, then

E[e(n) e(n-m)] = \sigma_v^2 \, \delta(m) = \begin{cases} \sigma_v^2 & m = 0 \\ 0 & m \neq 0 \end{cases}

Un-Weighted and Weighted Least-Squares

\bar{y} = \bar{H} \theta + \bar{v}

where W = W^{1/2} W^{1/2}, \bar{y} = W^{1/2} y, \bar{v} = W^{1/2} v, and \bar{H} = W^{1/2} H.

Model and Systemic Errors

E[\hat{\theta}] = \theta + (H^T W H)^{-1} H^T W \mu \neq \theta

e = \mu + v - H (H^T W H)^{-1} H^T W v

e \to \mu + v \quad \text{as } N \to \infty

Hence the auto-correlation of the residual will not be a delta function.

Augment the Parameter Vector for Systemic Error

\theta_{ag} = \begin{bmatrix} \theta & \mu \end{bmatrix}^T, \quad H_{ag} = \begin{bmatrix} H & \mathbf{1} \end{bmatrix}

\hat{\theta}_{ag} = \left( H_{ag}^T W H_{ag} \right)^{-1} H_{ag}^T W y
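A brief sketch of this augmentation is given below (a minimal NumPy example; the true parameters, offset, and noise level are invented for illustration). Appending a column of ones to the data matrix absorbs a constant systemic error µ and removes the bias from the estimate of θ.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
t = np.linspace(0.0, 1.0, N)
H = np.column_stack([t, t**2])            # data matrix without an intercept column
theta_true = np.array([1.0, 2.0])
mu = 0.5                                  # constant systemic (offset) error
y = H @ theta_true + mu + 0.05 * rng.standard_normal(N)

# Ordinary least squares on H alone: the ignored offset mu biases the estimate
theta_ls, *_ = np.linalg.lstsq(H, y, rcond=None)

# Augmented model: H_ag = [H 1], theta_ag = [theta mu]^T absorbs the offset
H_ag = np.column_stack([H, np.ones(N)])
theta_ag, *_ = np.linalg.lstsq(H_ag, y, rcond=None)

print(theta_ls)    # biased estimate of [1, 2]
print(theta_ag)    # approx [1, 2, 0.5]
```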

Cramer–Rao Lower Bound
Scalar case

\mathrm{var}(\hat{\theta}) = E\left[ (\hat{\theta}(y) - \theta)^2 \right] \ge \left( E\left[ \left( \frac{\partial \ln f_y(y)}{\partial \theta} \right)^2 \right] \right)^{-1}



Vector case

\mathrm{cov}(\hat{\theta}) = E\left[ (\hat{\theta}(y) - \theta)(\hat{\theta}(y) - \theta)^T \right] \ge I_F^{-1}(\theta)

Efficient Estimator
An unbiased estimator which achieves the Cramer–Rao lower bound is said to be efficient.

Maximum Likelihood Estimation

\hat{\theta} = \arg\max_{\theta} \left\{ \log f_y(y) \right\}

If the PDF of the noise v is zero-mean Gaussian, then

∙ The weighted least-squares estimator with weight W = Σ_v⁻¹ and the ML estimator are both efficient, as they attain the Cramer–Rao lower bound. They both yield the minimum variance unbiased estimator.
∙ The Cramer–Rao lower bound for both the ML and the weighted least-squares method is

I_F^{-1}(\theta) = \left( H^T \Sigma_v^{-1} H \right)^{-1}

Least-Squares Estimates for Under-Determined System

\hat{\theta} = H^T (H H^T)^{-1} y

\hat{y} = H \hat{\theta} = y

e = 0

H^{\dagger} = H^T (H H^T)^{-1}

Singular Value Decomposition
Every N × M matrix H can be decomposed as

H = U S V^T

The matrix H is decomposed into r rank-one N × M matrices {uᵢvᵢᵀ}:

H = \sum_{i=1}^{r} \sigma_i u_i v_i^T

The condition number κ(H) gives a measure of how close the matrix is to being rank deficient:

\kappa(H) = \|H\| \, \|H^{\dagger}\| = \frac{\sigma_{\max}(H)}{\sigma_{\min}(H)}

Computation of the Least-Squares Estimate Using the SVD

\hat{\theta} = H^{\dagger} y



where

H^{\dagger} = V \begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix} U^T, \quad \Sigma^{-1} = \mathrm{diag}( \sigma_1^{-1} \; \sigma_2^{-1} \; \cdots \; \sigma_r^{-1} )

and U, V, and Σ are obtained from the SVD of the weighted data matrix \bar{H} = W^{1/2} H:

\bar{H} = U \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix} V^T

The pseudo-inverse H† may be expressed as a linear combination of r rank-one matrices:

H^{\dagger} = \sum_{i=1}^{r} \frac{1}{\sigma_i} v_i u_i^T

The bounds on the covariance of the estimation error are

\| \mathrm{cov}(\hat{\theta}) \| \le \begin{cases} \dfrac{1}{\sigma_{\min}^2\!\left( \Sigma_v^{-1/2} H \right)} & \text{if } E[v v^T] = \Sigma_v \\[2ex] \dfrac{\sigma_v^2}{\sigma_{\min}^2(H)} & \text{if } E[v v^T] = \sigma_v^2 I \end{cases}

The bounds on the residual are

\sigma_{res}^2 \le \begin{cases} \dfrac{1}{N} \mathrm{trace}(\Sigma_v) & \text{if } E[v v^T] = \Sigma_v \\[1ex] \sigma_v^2 & \text{if } E[v v^T] = \sigma_v^2 I \end{cases}

5.10 Appendix: Properties of the Pseudo-Inverse and the Projection Operator

5.10.1 Over-Determined System

The pseudo-inverse satisfies one property of an inverse, H†H = I, but not the other, since HH† ≠ I. In fact, HH† = P_r is the projection operator associated with H.

P_r = HH† has the following properties:

∙ P_rᵀ = P_r (it is symmetric).
∙ P_r² = P_r, and hence P_r^m = P_r for m = 1, 2, 3, ….
∙ The eigenvalues of P_r (and of I − P_r) are only ones and zeros.
∙ Let M_r be the rank of the N × M matrix H. Then:
  ◦ M_r eigenvalues of P_r will be ones;
  ◦ the remaining N − M_r eigenvalues will be zeros;
  ◦ trace(I − P_r) = N − M_r.
∙ I − P_r projects a vector onto the space perpendicular to the range space of the matrix H.
∙ If H is a non-singular square matrix, then P_r = I and I − P_r = 0.
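These properties are easy to check numerically; the sketch below (a minimal NumPy example with an arbitrary full-column-rank data matrix) verifies symmetry, idempotence, the eigenvalue pattern, and the trace identity for the un-weighted projection operator.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 6, 3
H = rng.standard_normal((N, M))              # full column rank with probability 1

H_pinv = np.linalg.pinv(H)                   # equals (H^T H)^{-1} H^T here
Pr = H @ H_pinv                              # projection onto the range space of H

print(np.allclose(Pr, Pr.T))                 # symmetric
print(np.allclose(Pr @ Pr, Pr))              # idempotent
eigs = np.sort(np.linalg.eigvalsh(Pr))
print(np.round(eigs, 8))                     # N - M zeros and M ones
print(np.isclose(np.trace(np.eye(N) - Pr), N - M))   # trace(I - Pr) = N - M_r
```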



Weighted projection matrix: In the weighted least-squares method, the projection matrix is defined as

P_r = H \left( H^T W H \right)^{-1} H^T W

It is not a symmetric matrix, P_rᵀ ≠ P_r. However, the weighted projection operator satisfies all the properties listed above except the symmetry property.

5.10.2 Under-Determined System

The pseudo-inverse is given by

H^{\dagger} = H^T (H H^T)^{-1}

The pseudo-inverse satisfies one property of an inverse, HH† = I, but not the other, since H†H ≠ I. In fact, H†H is the projection operator associated with H. The projection operator is given by

P_r = H^{\dagger} H = H^T (H H^T)^{-1} H

The projection operator of the under-determined system also satisfies all the properties of the over-determined case.

5.11 Appendix: Positive Definite Matrices

∙ If A is a positive definite matrix, then A⁻¹ is also positive definite.

Lemma 5.2 If A is a positive definite matrix, then

0 < \lim_{N \to \infty} \frac{x^T A x}{N} < \infty   (5.150)

Proof: Since A is positive definite,

0 < \lambda_{\min}(A)\, x^T x < x^T A x \le \lambda_{\max}(A)\, x^T x \quad \text{for all } x \neq 0   (5.151)

where λ_min(A) and λ_max(A) are the minimum and maximum eigenvalues of A. Let m_min = min_i{x_i²} and m_max = max_i{x_i²} be respectively the minimum and the maximum positive values of x_i². Since x ≠ 0 and hence x_i² ≠ 0 for all i, we get

m_{\min} N \le x^T x = \sum_{i=1}^{N} x_i^2 \le m_{\max} N   (5.152)

Hence we get

0 < N m_{\min} \lambda_{\min}(A) < x^T A x \le N m_{\max} \lambda_{\max}(A)   (5.153)

Dividing by N and taking the limit yields

0 < m_{\min} \lambda_{\min}(A) < \lim_{N \to \infty} \frac{x^T A x}{N} \le m_{\max} \lambda_{\max}(A)   (5.154)

Hence we conclude that 0 < \lim_{N \to \infty} \frac{x^T A x}{N} < \infty.



5.12 Appendix: Singular Value Decomposition of a Matrix

Theorem 5.2 Every N × M matrix H can be decomposed as

H = USVT (5.155)

where r ≤ min(M, N) is the rank of H, and
U is an N × N unitary matrix:

U = [ u_1 \; u_2 \; \cdots \; u_r \; u_{r+1} \; \cdots \; u_N ]

U = [ U_1 \; U_2 ], \quad U_1 = [ u_1 \; u_2 \; \cdots \; u_r ], \quad U_2 = [ u_{r+1} \; u_{r+2} \; \cdots \; u_N ]

U^T U = U U^T = I

u_i is an N × 1 vector, and U is called the left singular matrix of H.
V is an M × M unitary matrix:

V = [ v_1 \; v_2 \; \cdots \; v_r \; v_{r+1} \; \cdots \; v_M ]

V = [ V_1 \; V_2 ], \quad V_1 = [ v_1 \; v_2 \; \cdots \; v_r ], \quad V_2 = [ v_{r+1} \; v_{r+2} \; \cdots \; v_M ]

V V^T = V^T V = I

v_i is an M × 1 vector, and V is called the right singular matrix of H.
S is an N × M rectangular matrix given by

S = \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix}, \quad \Sigma = \mathrm{diag}( \sigma_1 \; \sigma_2 \; \sigma_3 \; \cdots \; \sigma_r )

where σᵢ is the ith singular value of H; the singular values are positive and ordered:

\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \cdots \ge \sigma_r > 0 \quad \text{and} \quad \sigma_{r+1} = \sigma_{r+2} = \cdots = \sigma_M = 0

The matrix H is decomposed into r rank-one N × M matrices {H_i = σᵢuᵢvᵢᵀ}:

H = \sum_{i=1}^{r} \sigma_i u_i v_i^T   (5.156)

A brief outline of the proof is given below:

Proof: Consider the M × M matrix HᵀH. It is a symmetric and positive semi-definite matrix. From linear algebra we know the following:

∙ Every symmetric and positive semi-definite matrix HᵀH can be diagonalized by a unitary matrix, with diagonal elements which are real and non-negative eigenvalues. If r is the rank, then there will be r non-zero eigenvalues and the rest will be zero.



Hence

H^T H v_i = \sigma_i^2 v_i   (5.157)

where {vᵢ} form a set of orthonormal vectors and σᵢ² is the ith eigenvalue of HᵀH. Let V be the M × M matrix formed of the M × 1 orthonormal eigenvectors {vᵢ}, V = [ v_1 \; v_2 \; \cdots \; v_r \; v_{r+1} \; \cdots \; v_M ]. As V is a unitary matrix, VᵀV = I. Expressing Eq. (5.157) in matrix form we get

H^T H V = V \Gamma   (5.158)

where

\Gamma = \begin{bmatrix} \Sigma^2 & 0 \\ 0 & 0 \end{bmatrix}, \quad \Sigma^2 = \mathrm{diag}( \sigma_1^2 \; \sigma_2^2 \; \cdots \; \sigma_r^2 )

Pre-multiplying by Vᵀ and noting that V is unitary we get

V^T H^T H V = \Gamma   (5.159)

Substituting V = [ V_1 \; V_2 ], V_1 = [ v_1 \; v_2 \; \cdots \; v_r ], V_2 = [ v_{r+1} \; v_{r+2} \; \cdots \; v_M ], we get

\begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} H^T H \begin{bmatrix} V_1 & V_2 \end{bmatrix} = \begin{bmatrix} V_1^T H^T H V_1 & V_1^T H^T H V_2 \\ V_2^T H^T H V_1 & V_2^T H^T H V_2 \end{bmatrix} = \begin{bmatrix} \Sigma^2 & 0 \\ 0 & 0 \end{bmatrix}   (5.160)

Hence we get

V_2^T H^T H V_2 = 0   (5.161)

This implies

H V_2 = 0   (5.162)

V_1^T H^T H V_1 = \Sigma^2   (5.163)

V_2^T H^T H V_2 = V_2^T H^T H V_1 = 0   (5.164)

Consider U = [ U_1 \; U_2 ], U_1 = [ u_1 \; u_2 \; \cdots \; u_r ], U_2 = [ u_{r+1} \; u_{r+2} \; \cdots \; u_N ]. Define

U_1 = H V_1 \Sigma^{-1}   (5.165)

Using Eqs. (5.163) and (5.165) we get

U_1^T U_1 = \Sigma^{-1} V_1^T H^T H V_1 \Sigma^{-1} = \Sigma^{-1} \Sigma^2 \Sigma^{-1} = I   (5.166)

Hence the columns of U_1 are orthonormal. Create a matrix U_2 so that U_1^T U_2 = U_2^T U_1 = 0 and U_2^T U_2 = I, and hence

U^T U = \begin{bmatrix} U_1^T U_1 & U_1^T U_2 \\ U_2^T U_1 & U_2^T U_2 \end{bmatrix} = \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix}   (5.167)

Now examine

U^T H V = \begin{bmatrix} U_1^T \\ U_2^T \end{bmatrix} H \begin{bmatrix} V_1 & V_2 \end{bmatrix} = \begin{bmatrix} U_1^T H V_1 & U_1^T H V_2 \\ U_2^T H V_1 & U_2^T H V_2 \end{bmatrix}   (5.168)



Using Eqs. (5.162), (5.165), and (5.167) we get

U^T H V = \begin{bmatrix} \Sigma^{-1} V_1^T H^T H V_1 & 0 \\ U_2^T U_1 \Sigma & 0 \end{bmatrix} = \begin{bmatrix} \Sigma^{-1} \Sigma^2 & 0 \\ U_2^T U_1 \Sigma & 0 \end{bmatrix} = \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix} = S   (5.169)

Thus

U^T H V = S   (5.170)

Pre-multiplying by U and post-multiplying by Vᵀ we get

H = U S V^T   (5.171)

5.12.1 SVD and Eigendecompositions

Lemma 5.3 If H is a symmetric and non-negative definite M × M matrix of rank r, then its singular value decomposition and its eigendecomposition are the same.

Proof: Let qᵢ be the ith eigenvector of H associated with the ith eigenvalue λᵢ:

H q_i = \lambda_i q_i   (5.172)

Pre-multiplying by Hᵀ and noting that H and Hᵀ have the same eigenvalues, we get

H^T H q_i = \lambda_i H^T q_i = \lambda_i^2 q_i   (5.173)

That is, qᵢ is an eigenvector of HᵀH associated with λᵢ². Thus, if qᵢ is normalized, we will have

U = V = [ q_1 \; q_2 \; \cdots \; q_M ] = [ v_1 \; v_2 \; \cdots \; v_r \; v_{r+1} \; \cdots \; v_M ]   (5.174)

Moreover, since H is also assumed to be a non-negative definite matrix, its eigenvalues are non-negative real numbers. Hence

H = U S U^T   (5.175)

where S = \mathrm{diag}( \sigma_1 \; \sigma_2 \; \cdots \; \sigma_r \; 0 \; \cdots \; 0 ). Therefore

\lambda(H) = \sigma(H)   (5.176)

where λ(H) and σ(H) are respectively the eigenvalues and the singular values of H. In general,

\sigma_i(H) = \sqrt{ \lambda_i \left( H^T H \right) }   (5.177)



5.12.2 Matrix Norms

The SVD can be used to compute the 2-norm and the Frobenius norm. The 2-norm is the maximum singular value of the matrix, while the squared Frobenius norm is the sum of the squares of the singular values:

\|H\|_2 = \sigma_{\max}(H)   (5.178)

If H is invertible, then

\|H^{-1}\|_2 = \frac{1}{\sigma_{\min}(H)}   (5.179)

\|H\|_F^2 = \sum_{i=1}^{r} \sigma_i^2   (5.180)
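These identities can be confirmed with a short NumPy check (a minimal sketch on an arbitrary matrix; since the test matrix is rectangular, the pseudo-inverse plays the role of the inverse in Eq. (5.179)):

```python
import numpy as np

rng = np.random.default_rng(2)
H = rng.standard_normal((4, 3))
s = np.linalg.svd(H, compute_uv=False)

print(np.isclose(np.linalg.norm(H, 2), s.max()))              # 2-norm = sigma_max
print(np.isclose(np.linalg.norm(H, 'fro')**2, np.sum(s**2)))  # squared Frobenius norm
print(np.isclose(np.linalg.norm(np.linalg.pinv(H), 2), 1.0 / s.min()))  # 1 / sigma_min
```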

5.12.3 Least Squares Estimate for Any Arbitrary Data Matrix H

Theorem 5.3 Let H be an N × M matrix of rank r ≤ min(M, N). Then the least-squares estimate θ̂ = arg min_{θ̂} {(y − Hθ̂)ᵀ W (y − Hθ̂)} is given by

\hat{\theta} = V \begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix} U^T y   (5.181)

where W^{1/2} H = U S V^T.

Proof:

∙ We will first consider the un-weighted least-squares problem.

Substituting the SVD of H we get

J = (y - U S V^T \hat{\theta})^T (y - U S V^T \hat{\theta})   (5.182)

where S = \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix} and \Sigma = \mathrm{diag}( \sigma_1 \; \sigma_2 \; \sigma_3 \; \cdots \; \sigma_r ).

Inserting UUᵀ leaves the cost function J unaffected since UUᵀ = I, and after simplification we get

J = (y - U S V^T \hat{\theta})^T U U^T (y - U S V^T \hat{\theta}) = (U^T y - S V^T \hat{\theta})^T (U^T y - S V^T \hat{\theta})   (5.183)

Denoting y_u = Uᵀy and θ̂_v = Vᵀθ̂, we get

J = (y_u - S \hat{\theta}_v)^T (y_u - S \hat{\theta}_v)   (5.184)

Using the definitions

S = \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix}, \quad y_u = \begin{bmatrix} y_{u1} \\ y_{u2} \end{bmatrix}, \quad \hat{\theta}_v = \begin{bmatrix} \hat{\theta}_{v1} \\ \hat{\theta}_{v2} \end{bmatrix}

we get

J = \begin{bmatrix} y_{u1} - \Sigma \hat{\theta}_{v1} \\ y_{u2} \end{bmatrix}^T \begin{bmatrix} y_{u1} - \Sigma \hat{\theta}_{v1} \\ y_{u2} \end{bmatrix} = (y_{u1} - \Sigma \hat{\theta}_{v1})^T (y_{u1} - \Sigma \hat{\theta}_{v1}) + y_{u2}^T y_{u2}   (5.185)



Since the cost function J is not a function of θ̂_{v2}, we may choose any value for θ̂_{v2} without affecting J. The least-squares estimate θ̂_{v1} is given by

\hat{\theta}_{v1} = \begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix} y_u   (5.186)

The solution is not unique as θ̂_{v2} is arbitrary. The general solution θ̂_v is obtained by appending θ̂_{v2}, which is arbitrary, to the solution θ̂_{v1}:

\hat{\theta}_v = \begin{bmatrix} \hat{\theta}_{v1} \\ \hat{\theta}_{v2} \end{bmatrix}   (5.187)

The norm of θ̂_v is ‖θ̂_v‖² = ‖θ̂_{v1}‖² + ‖θ̂_{v2}‖² [3]. A solution θ̂_v which has minimum norm is obtained by setting θ̂_{v2} = 0, and the minimum norm solution becomes

\hat{\theta}_v = \hat{\theta}_{v1} = \begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix} y_u   (5.188)

Substituting for θ̂_v and y_u using y_u = Uᵀy and θ̂_v = Vᵀθ̂, and noting that VVᵀ = I, we get

\hat{\theta} = V \begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix} U^T y   (5.189)

Let us now consider the weighted least-squares problem. We employ the transformed measurement model

\bar{y} = \bar{H} \theta + \bar{v}, \quad \text{where } W = W^{1/2} W^{1/2}, \; \bar{y} = W^{1/2} y, \; \bar{v} = W^{1/2} v, \; \bar{H} = W^{1/2} H

With this transformation the weighted least-squares problem is converted into an un-weighted least-squares problem. For notational convenience, the SVD of \bar{H} = W^{1/2}H and that of H are denoted by the same singular matrices, namely U, S, and V:

\bar{H} = U S V^T   (5.190)

The least-squares estimate of the weighted least-squares problem becomes

\hat{\theta} = V \begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix} U^T \bar{y} = V \begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix} U^T W^{1/2} y   (5.191)

Comment If the data matrix H is not of maximal rank, r < min(M, N), then the solution is not unique: there are infinitely many solutions, and the least-squares solution using the SVD gives the minimum norm solution. In other words, the solution θ̂ = H†y is such that ‖θ̂‖ is minimum.



5.12.4 Pseudo-Inverse of Any Arbitrary Matrix

Theorem 5.4 Let W^{1/2} H = U S V^T. The pseudo-inverse of the N × M matrix H of rank r ≤ min(M, N) is

H^{\dagger} = V \begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix} U^T   (5.192)

Proof: Consider the symmetric, positive semi-definite M × M matrix HᵀWH:

H^T W H = V S^T U^T U S V^T = V S^T S V^T = V \begin{bmatrix} \Sigma^2 & 0 \\ 0 & 0 \end{bmatrix} V^T   (5.193)

Hence

\left( H^T W H \right)^{-1} = V \begin{bmatrix} \Sigma^{-2} & 0 \\ 0 & 0 \end{bmatrix} V^T   (5.194)

The pseudo-inverse H†, applied to the transformed measurement \bar{y} = W^{1/2}y, becomes

H^{\dagger} = \left( H^T W H \right)^{-1} H^T W^{1/2} = V \begin{bmatrix} \Sigma^{-2} & 0 \\ 0 & 0 \end{bmatrix} V^T V S^T U^T = V \begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix} U^T   (5.195)

5.12.5 Bounds on the Residual and the Covariance of the Estimation Error

Consider the covariance of the estimation error, cov(θ̂) = (HᵀΣ_v⁻¹H)⁻¹. Substituting the SVD of H and using ‖A⁻¹‖ = 1/σ_min(A), we get

\| \mathrm{cov}(\hat{\theta}) \| \le \frac{1}{\sigma_{\min}^2\!\left( \Sigma_v^{-1/2} H \right)}   (5.196)

Hence

\| \mathrm{cov}(\hat{\theta}) \| \le \begin{cases} \dfrac{1}{\sigma_{\min}^2\!\left( \Sigma_v^{-1/2} H \right)} & \text{if } E[v v^T] = \Sigma_v \\[2ex] \dfrac{\sigma_v^2}{\sigma_{\min}^2(H)} & \text{if } E[v v^T] = \sigma_v^2 I \end{cases}   (5.197)
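A quick Monte Carlo sanity check of the white-noise case of Eq. (5.197) is sketched below (a minimal NumPy example; the data matrix, noise level, and number of trials are arbitrary choices). For white noise the exact covariance attains the bound, so the Monte Carlo value should be close to it.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 50, 3
H = rng.standard_normal((N, M))
theta = np.array([1.0, -2.0, 0.5])
sigma_v = 0.2                                  # white-noise standard deviation

# Monte Carlo estimate of cov(theta_hat) over repeated noisy measurements
trials = 5000
est = np.empty((trials, M))
for k in range(trials):
    y = H @ theta + sigma_v * rng.standard_normal(N)
    est[k], *_ = np.linalg.lstsq(H, y, rcond=None)
cov_mc = np.cov(est.T)

# Right-hand side of Eq. (5.197) for the white-noise case
s_min = np.linalg.svd(H, compute_uv=False).min()
bound = sigma_v**2 / s_min**2

print(np.linalg.norm(cov_mc, 2), bound)        # the two values should be close
```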

5.13 Appendix: Least-Squares Solution for Under-Determined System

Since there are infinitely many solutions, it is preferable to choose from the infinite set of solutions the one which has the minimum norm. The problem is formulated as follows:

\min_{\theta} \left\{ \frac{1}{2} \|\theta\|^2 \right\} \quad \text{such that} \quad H\theta = y   (5.198)



where ‖θ‖² = θᵀθ is the squared 2-norm of θ. The cost function ‖θ‖² is divided by 2 for notational convenience, so that the factor of 2 cancels on differentiation. This is a constrained minimization problem, which may be solved using the Lagrange multiplier technique: convert the constrained least-squares optimization into an unconstrained one using a Lagrange multiplier vector λ:

\min_{\hat{\theta}, \lambda} \left\{ \frac{\hat{\theta}^T \hat{\theta}}{2} - \lambda^T \left( H\hat{\theta} - y \right) \right\}   (5.199)

Differentiating with respect to θ̂ and λ yields

\hat{\theta} - H^T \lambda = 0   (5.200)

H \hat{\theta} = y   (5.201)

Substituting for θ̂ in Eq. (5.201) using Eq. (5.200), and assuming H has full (row) rank, which implies HHᵀ is invertible, we get

\lambda = \left( H H^T \right)^{-1} y   (5.202)

Expressing θ̂ as a function of y by substituting for λ in Eq. (5.200), we get

\hat{\theta} = H^T \left( H H^T \right)^{-1} y   (5.203)

The estimate ŷ of y becomes

\hat{y} = H \hat{\theta} = H H^T \left( H H^T \right)^{-1} y = y   (5.204)

5.14 Appendix: Computation of the Least-Squares Estimate Using the SVD

The least-squares estimate is computed using a procedure similar to the over-determined case. Let H be an N × M matrix with N < M and of rank r ≤ min(M, N). Then the least-squares estimate θ̂ = arg min_{θ̂} {(y − Hθ̂)ᵀ W (y − Hθ̂)} is given by

\hat{\theta} = V \begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix} U^T y   (5.205)

where W^{1/2} H = U S V^T.

Proof: It is identical to the over-determined case.

References

[1] Ljung, L. (1999) System Identification: Theory for the User, Prentice Hall, New Jersey.
[2] Mendel, J. (1995) Lessons in Estimation Theory for Signal Processing, Communications and Control, Prentice Hall, New Jersey.
[3] Haykin, S. (2001) Adaptive Filter Theory, Prentice Hall, New Jersey.



Further Readings

Doraiswami, R. (1976) A decision theoretic approach to parameter estimation. IEEE Transactions on Automatic Control, 21(6), 860–866.
Kay, S.M. (1993) Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall, New Jersey.
Moon, T.K. and Stirling, W.C. (2000) Mathematical Methods and Algorithms for Signal Processing, Prentice Hall, New Jersey.