Stochastic Signals and Systems, Chapter 3
Institute of Water Acoustics, Sonar Engineering and Signal Theory
Prof. Dr.-Ing. Dieter Kraus
Source: homepages.hs-bremen.de/~krausd/iwss/sss3.pdf
TRANSCRIPT
Stochastic Signals and Systems

Contents
1 Probability Theory
2 Stochastic Processes
3 Parameter Estimation
4 Signal Detection
5 Spectrum Analysis
6 Optimal Filtering
3 Parameter Estimation
3.1 Estimating Function and Estimator
3.2 Sufficient Statistic, Exponential Families
3.3 Linear Least Squares Estimation
3.4 Confidence Intervals
3.5 Cramer-Rao Lower Bound
3.6 Maximum Likelihood Estimation
3.7 Bayesian Estimation
3.7.1 Minimum Mean Square Error Estimation
3.7.2 Minimum Mean Absolute Error Estimation
3.7.3 Maximum A Posteriori Estimation
References to Chapter 3
3 Parameter Estimation

The observations $(x_1,\dots,x_n)^T = \mathbf{x}$ are realizations of the random variables $(X_1,\dots,X_n)^T = \mathbf{X}$ with density $f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)$, which is an element of a known set $\{ f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta) : \boldsymbol\theta \in \Omega \}$ and whose parameter vector $\boldsymbol\theta = (\theta_1,\dots,\theta_p)^T$ is unknown.

Problem: For given observations $\mathbf{x}$ we are looking for an estimate $\hat{\boldsymbol\theta}$ of $\boldsymbol\theta$ that depends on $\mathbf{x}$, i.e. $\hat{\boldsymbol\theta} = \hat{\boldsymbol\theta}(\mathbf{x})$. In parameter estimation problems one distinguishes whether the parameter vector is a perfectly unknown quantity or whether a prior density of the parameter vector is supposed to be known.
In the latter case one interprets the parameter vector as a random vector, whose posterior density is used for the subsequent statistical inference. This approach leads to Bayes estimators.
3.1 Estimating Function and Estimator

For determining $\theta_i$, we denote the mapping

$$(x_1,\dots,x_n) \mapsto \hat\theta_i(x_1,\dots,x_n), \quad i = 1,\dots,p,$$

an estimating function for $\theta_i$ and its function value for a given set of observations an estimate. Since $x_1,\dots,x_n$ can be considered as random values, the estimates are also random and therefore only approximate the true values $\theta_i$.
We examine the accuracy properties of the estimates on the basis of the corresponding random variables. Therefore, we define the random variable

$$\hat\Theta_i = \hat\theta_i(X_1,\dots,X_n), \quad i = 1,\dots,p,$$

and denote it as the estimator of $\theta_i$.

To characterize the properties of the estimator $\hat{\boldsymbol\Theta} = \hat{\boldsymbol\theta}(X_1,\dots,X_n)$ completely, the density function $f_{\hat{\boldsymbol\Theta}}(\hat{\boldsymbol\theta}\,|\,\boldsymbol\theta)$, which generally depends on $\boldsymbol\theta$, has to be determined, e.g. using the methods described in Chapter 1.7.
For simplicity let $p = 1$. As an example, the densities $f_{\hat\Theta}(\hat\theta\,|\,\theta)$ and $f_{\tilde\Theta}(\tilde\theta\,|\,\theta)$ of two alternative estimators $\hat\Theta$ and $\tilde\Theta$ for $\theta$ are depicted below.

[Figure: the densities $f_{\hat\Theta}(\hat\theta\,|\,\theta)$ and $f_{\tilde\Theta}(\tilde\theta\,|\,\theta)$ of the two estimators, plotted with the true parameter value $\theta$ marked on the abscissa.]
The determination of the density function can be rather complicated. Therefore, one must often be satisfied with the determination of the moments of an estimator.

Bias: The bias or systematic error is defined by

$$\mathbf{b}(\hat{\boldsymbol\Theta}) = \big(b(\hat\Theta_1),\dots,b(\hat\Theta_p)\big)^T = \big(\mathrm{E}(\hat\Theta_1)-\theta_1,\dots,\mathrm{E}(\hat\Theta_p)-\theta_p\big)^T = \mathrm{E}(\hat{\boldsymbol\Theta}) - \boldsymbol\theta,$$

where

$$\mathrm{E}(\hat\Theta_i) = \int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty} \hat\theta_i(x_1,\dots,x_n)\, f_{X_1,\dots,X_n}(x_1,\dots,x_n\,|\,\boldsymbol\theta)\, dx_1 \cdots dx_n.$$

If $\mathbf{b}(\hat{\boldsymbol\Theta}) = \mathbf{0}$, the estimator $\hat{\boldsymbol\Theta}$ is said to be unbiased.
Covariance: The covariance matrix is defined by

$$\mathrm{Cov}(\hat{\boldsymbol\Theta}) = \big[\mathrm{Cov}(\hat\Theta_i,\hat\Theta_j)\big]_{i,j=1,\dots,p} = \Big[\mathrm{E}\big((\hat\Theta_i-\mathrm{E}(\hat\Theta_i))(\hat\Theta_j-\mathrm{E}(\hat\Theta_j))\big)\Big]_{i,j=1,\dots,p} = \mathrm{E}\Big(\big(\hat{\boldsymbol\Theta}-\mathrm{E}(\hat{\boldsymbol\Theta})\big)\big(\hat{\boldsymbol\Theta}-\mathrm{E}(\hat{\boldsymbol\Theta})\big)^T\Big),$$

where the diagonal elements represent the variances

$$\mathrm{Var}(\hat\Theta_i) = \mathrm{Cov}(\hat\Theta_i,\hat\Theta_i) = \mathrm{E}\big((\hat\Theta_i-\mathrm{E}(\hat\Theta_i))^2\big) = \mathrm{E}(\hat\Theta_i^2) - \big(\mathrm{E}(\hat\Theta_i)\big)^2$$

for $i = 1,\dots,p$.
Mean Square Error: The matrix-valued mean square error is defined by

$$\mathrm{MSE}(\hat{\boldsymbol\Theta}) = \Big[\mathrm{E}\big((\hat\Theta_i-\theta_i)(\hat\Theta_j-\theta_j)\big)\Big]_{i,j=1,\dots,p} = \mathrm{E}\big((\hat{\boldsymbol\Theta}-\boldsymbol\theta)(\hat{\boldsymbol\Theta}-\boldsymbol\theta)^T\big) = \mathrm{Cov}(\hat{\boldsymbol\Theta}) + \mathbf{b}(\hat{\boldsymbol\Theta})\,\mathbf{b}(\hat{\boldsymbol\Theta})^T,$$

where the diagonal elements represent the mean square errors of the individual components $\hat\Theta_i$, given by

$$\mathrm{MSE}(\hat\Theta_i) = \mathrm{E}\big((\hat\Theta_i-\theta_i)^2\big) = \mathrm{Var}(\hat\Theta_i) + b(\hat\Theta_i)^2$$

for $i = 1,\dots,p$.
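To make the bias-variance decomposition of the MSE concrete, here is a small Monte Carlo sketch (not part of the original slides; the distribution, the sample size and the deliberately biased "shrunken mean" are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0              # assumed true parameter for this experiment
n, trials = 25, 100_000

# 'trials' independent data sets of size n from N(theta, 1)
x = rng.normal(theta, 1.0, size=(trials, n))

estimators = {
    "sample mean":   x.mean(axis=1),        # unbiased estimator
    "shrunken mean": 0.9 * x.mean(axis=1),  # deliberately biased estimator
}

for name, est in estimators.items():
    bias = est.mean() - theta
    var = est.var()
    mse = np.mean((est - theta) ** 2)
    # up to Monte Carlo error: MSE = Var + bias^2
    print(f"{name}: bias={bias:+.4f}, var={var:.4f}, "
          f"mse={mse:.4f}, var+bias^2={var + bias**2:.4f}")
```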
Minimum Variance Unbiased (MVU) Estimator: An unbiased estimator $\hat{\boldsymbol\Theta}$ is MVU if for any unbiased estimator $\tilde{\boldsymbol\Theta}$ the following inequality holds:

$$\mathbf{a}^T\mathrm{Cov}(\hat{\boldsymbol\Theta})\,\mathbf{a} \le \mathbf{a}^T\mathrm{Cov}(\tilde{\boldsymbol\Theta})\,\mathbf{a} \quad \forall\,\mathbf{a}\in\mathbb{R}^p,\ \forall\,\boldsymbol\theta\in\Omega.$$

[Figure: variance of competing estimators plotted over $\theta$; left panel: an MVU estimator exists, right panel: no MVU estimator exists.]
Note: To emphasize that the variance is smallest for all $\boldsymbol\theta\in\Omega$, an MVU estimator is sometimes called a uniformly minimum variance unbiased (UMVU) estimator.

Minimum Mean Square Error (MMSE) Estimator: An estimator $\hat{\boldsymbol\Theta}$ provides an MMSE if for any estimator $\tilde{\boldsymbol\Theta}$ the following inequality holds:

$$\mathbf{a}^T\mathrm{MSE}(\hat{\boldsymbol\Theta})\,\mathbf{a} \le \mathbf{a}^T\mathrm{MSE}(\tilde{\boldsymbol\Theta})\,\mathbf{a} \quad \forall\,\mathbf{a}\in\mathbb{R}^p,\ \forall\,\boldsymbol\theta\in\Omega.$$

Linear Estimator: An estimator is called linear if it can be expressed as a linear function of the observations, i.e.

$$\hat{\boldsymbol\Theta} = \mathbf{A}\mathbf{X} \quad\text{with}\quad \mathbf{X} = (X_1,\dots,X_n)^T \ \text{and}\ \mathbf{A} = (a_{i,j})_{i=1,\dots,p;\,j=1,\dots,n}.$$
Exercise 3.1-1: MVU and MMSE variance estimator
Consistency: Often one is interested in the behavior of an estimator if the number of observations grows. An estimator $\hat{\boldsymbol\Theta}_n$ is said to be

– strongly consistent if $\hat{\boldsymbol\Theta}_n$ converges with probability 1 towards $\boldsymbol\theta$, i.e. $\hat{\boldsymbol\Theta}_n = \hat{\boldsymbol\theta}(X_1,\dots,X_n) \xrightarrow{\text{a.s.}} \boldsymbol\theta$ as $n\to\infty$,
– mean square consistent if $\hat{\boldsymbol\Theta}_n$ converges in the mean square sense towards $\boldsymbol\theta$, i.e. $\hat{\boldsymbol\Theta}_n = \hat{\boldsymbol\theta}(X_1,\dots,X_n) \xrightarrow{\text{m.s.}} \boldsymbol\theta$ as $n\to\infty$,
– consistent if $\hat{\boldsymbol\Theta}_n$ converges in probability towards $\boldsymbol\theta$, i.e. $\hat{\boldsymbol\Theta}_n = \hat{\boldsymbol\theta}(X_1,\dots,X_n) \xrightarrow{P} \boldsymbol\theta$ as $n\to\infty$.
Asymptotic Normality: In certain cases one can show that an estimator is asymptotically normally distributed such that

$$\sqrt{n}\,(\hat{\boldsymbol\Theta}_n - \boldsymbol\theta) \xrightarrow[n\to\infty]{} \mathcal{N}_p(\mathbf{0},\boldsymbol\Sigma).$$

Consequently, for large $n$ the density function of $\hat{\boldsymbol\Theta}_n$ can be approximated by

$$f_{\hat{\boldsymbol\Theta}_n}(\hat{\boldsymbol\theta}\,|\,\boldsymbol\theta) = \frac{n^{p/2}}{(2\pi)^{p/2}\sqrt{\det(\boldsymbol\Sigma)}}\,\exp\Big\{-\frac{n}{2}\,(\hat{\boldsymbol\theta}-\boldsymbol\theta)^T\boldsymbol\Sigma^{-1}(\hat{\boldsymbol\theta}-\boldsymbol\theta)\Big\}.$$
3.2 Sufficient Statistic, Exponential Families

A function $T = t(\mathbf{X})$ that depends only on the observation model $\mathbf{X}$ is called a statistic.

Let $\mathbf{X} = (X_1,\dots,X_n)^T$ be a model of a binary sequence with probability $p$ and $1-p$ of observing a one and a zero, respectively. Furthermore, we suppose that $X_1,\dots,X_n$ are independent. Thus, our model can be described by the probabilities

$$P(\mathbf{X}=\mathbf{x}\,|\,p) = \prod_{k=1}^{n} p^{x_k}(1-p)^{1-x_k} = p^{\sum_{k=1}^{n}x_k}\,(1-p)^{\,n-\sum_{k=1}^{n}x_k} = p^{t(\mathbf{x})}\,(1-p)^{\,n-t(\mathbf{x})}.$$
Now the following question arises: do we collect all information in terms of inference about $p$ by recording only

$$t(\mathbf{x}) = \sum_{k=1}^{n} x_k\;?$$

The statistical understanding of the collection of all information is quantified in the following definition.

Definition: A statistic $T = t(\mathbf{X})$ is called sufficient for the parameter $\boldsymbol\theta$ if the conditional distribution of $\mathbf{X}$ given $T = t(\mathbf{x})$ is independent of $\boldsymbol\theta$ for all $t$, i.e.

$$F_{\mathbf{X}}\big(\mathbf{x}\,|\,T=t(\mathbf{x});\boldsymbol\theta\big) = F_{\mathbf{X}}\big(\mathbf{x}\,|\,T=t(\mathbf{x})\big).$$

Hence, $T$ contains "all information" about $\boldsymbol\theta$ included in $\mathbf{x}$.
Exercise 3.2-1: Sufficiency of $t(\mathbf{x}) = \sum_{k=1}^{n} x_k$
Because the conditional distribution has to be determined, a direct evaluation of sufficiency is usually difficult. Fortunately, the following theorem exists whose conditions can be verified easily.

Theorem (factorization theorem for densities): A necessary and sufficient condition for a statistic $T = t(\mathbf{X})$ to be sufficient is that there exist non-negative functions $g(t\,|\,\boldsymbol\theta)$ and $h(\mathbf{x})$ such that $f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)$ satisfies

$$f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta) = g\big(t(\mathbf{x})\,|\,\boldsymbol\theta\big)\cdot h(\mathbf{x}).$$
Exercise 3.2-2: Proof of the Theorem
Exercise 3.2-3: Sufficient statistic for mean and variance
Minimal Sufficient Statistic: A sufficient statistic $T$ is said to be minimal if of all sufficient statistics it provides the greatest possible reduction of data, i.e. if for any sufficient statistic $T'$ there exists a function $s$ such that $T = s(T')$.

Complete Sufficient Statistic: A sufficient statistic $T$ is said to be complete (or unique) if the condition

$$\mathrm{E}_{\boldsymbol\theta}\big(f(T)\big) = 0 \quad \forall\,\boldsymbol\theta\in\Omega$$

implies $f(T) = 0$ with probability 1 for all $\boldsymbol\theta$.

Note: Completeness ensures minimality.
Exercise 3.2-4: Minimal and complete sufficient statistics
Uniqueness of the unbiased estimating function: Completeness implies that there is just one estimating function of the sufficient statistic that provides an unbiased estimator of $\boldsymbol\theta$. Let $T$ be a complete sufficient statistic and $f_1$ and $f_2$ be two functions such that

$$\mathrm{E}_{\boldsymbol\theta}\big(f_1(T)\big) = \mathrm{E}_{\boldsymbol\theta}\big(f_2(T)\big) = \boldsymbol\theta \quad \forall\,\boldsymbol\theta\in\Omega.$$

Then

$$\mathrm{E}_{\boldsymbol\theta}\big(f_1(T)-f_2(T)\big) = \mathrm{E}_{\boldsymbol\theta}\big(f(T)\big) = 0 \quad \forall\,\boldsymbol\theta\in\Omega,$$

and due to the completeness of $T$,

$$f(T) = 0 \;\Rightarrow\; f_1(T) = f_2(T) \ \text{with probability 1} \quad \forall\,\boldsymbol\theta\in\Omega.$$
Theorem (Rao-Blackwell): Let $\tilde{\boldsymbol\Theta}$ be an unbiased estimator of $\boldsymbol\theta$ and $T = t(\mathbf{X})$ be a sufficient statistic for $\boldsymbol\theta$. Then the estimator defined by

$$\hat{\boldsymbol\Theta} = \mathrm{E}(\tilde{\boldsymbol\Theta}\,|\,T)$$

is unbiased and improves on $\tilde{\boldsymbol\Theta}$ as follows:

$$\mathbf{a}^T\mathrm{Cov}(\hat{\boldsymbol\Theta})\,\mathbf{a} \le \mathbf{a}^T\mathrm{Cov}(\tilde{\boldsymbol\Theta})\,\mathbf{a} \quad \forall\,\mathbf{a}\in\mathbb{R}^p.$$

Theorem (Lehmann-Scheffe): If in addition to the assumptions employed for the Rao-Blackwell theorem the sufficient statistic is complete, then

$$\hat{\boldsymbol\Theta} = \mathrm{E}(\tilde{\boldsymbol\Theta}\,|\,T)$$

is the unique MVU estimator of $\boldsymbol\theta$.
Exercise 3.2-5: Proof of the Theorems
Exponential Families: A family of distributions $\{F_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\}$ is forming a $k$-dimensional exponential family if the distributions $F_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)$ have densities of the form

$$f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta) = h(\mathbf{x})\cdot\exp\Big(\sum_{i=1}^{k}\xi_i(\boldsymbol\theta)\,t_i(\mathbf{x}) - B(\boldsymbol\theta)\Big).$$

Frequently, it is more convenient to use the $\xi_i$ as the parameters and write the density in the canonical form

$$f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\xi) = h(\mathbf{x})\cdot\exp\Big(\sum_{i=1}^{k}\xi_i\,t_i(\mathbf{x}) - A(\boldsymbol\xi)\Big).$$
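As a standard illustration (a well-known fact, not taken from these slides), the univariate Gaussian $\mathcal{N}(\mu,\sigma^2)$ forms a $k=2$ exponential family in canonical form:

$$f_X(x\,|\,\boldsymbol\xi) = \frac{1}{\sqrt{2\pi}}\exp\big(\xi_1 x + \xi_2 x^2 - A(\boldsymbol\xi)\big), \qquad \xi_1 = \frac{\mu}{\sigma^2},\quad \xi_2 = -\frac{1}{2\sigma^2},$$

with $t_1(x) = x$, $t_2(x) = x^2$, $h(x) = 1/\sqrt{2\pi}$ and $A(\boldsymbol\xi) = \mu^2/(2\sigma^2) + \ln\sigma = -\xi_1^2/(4\xi_2) + \tfrac12\ln\big(-1/(2\xi_2)\big)$.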
Applying the factorization theorem for densities one can easily observe that

$$\mathbf{T} = (T_1,\dots,T_k)^T = \big(t_1(\mathbf{X}),\dots,t_k(\mathbf{X})\big)^T$$

constitutes a sufficient statistic for the exponential family.

Note: The parameter space $\Xi\subset\mathbb{R}^k$ of the natural parameter vector $\boldsymbol\xi = (\xi_1,\dots,\xi_k)^T$ is convex. If the exponential family is of full rank, i.e. the parameter space $\Xi$ contains a $k$-dimensional rectangle, then $\mathbf{T} = (T_1,\dots,T_k)^T$ is complete.
For exponential families one can claim that the integral

$$\int_{\mathbb{R}^n} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\xi)\,d\mathbf{x} = \int_{\mathbb{R}^n} h(\mathbf{x})\cdot\exp\Big(\sum_{i=1}^{k}\xi_i\,t_i(\mathbf{x}) - A(\boldsymbol\xi)\Big)\,d\mathbf{x} = 1$$

has derivatives of all orders with respect to the $\xi_i$, which can be obtained by differentiating under the integral sign. Exploiting the properties of the integral we can find

$$\mathrm{E}(T_i) = \mathrm{E}\big(t_i(\mathbf{X})\big) = \frac{\partial}{\partial\xi_i}A(\boldsymbol\xi)$$

and

$$\mathrm{Cov}(T_i,T_j) = \mathrm{Cov}\big(t_i(\mathbf{X}),t_j(\mathbf{X})\big) = \frac{\partial^2}{\partial\xi_i\,\partial\xi_j}A(\boldsymbol\xi).$$
Exercise 3.2-6: Rayleigh distribution, mean and variance of the natural sufficient statistic
3.3 Linear Least Squares Estimation

Consider the linear model

$$\mathbf{X} = \mathbf{H}\boldsymbol\theta + \mathbf{Z} \quad\text{with}\quad X_i = h_{i1}\theta_1 + \dots + h_{ip}\theta_p + Z_i,\ i = 1,\dots,n,$$

where $\mathbf{X}$ and $\mathbf{Z}$ are $n\times 1$ vectors modeling the measurements and the measurement noise, respectively. Furthermore, $\mathbf{H}$ denotes a known $n\times p$ matrix and $\boldsymbol\theta$ the $p\times 1$ parameter vector that has to be estimated. The measurement noise is supposed to be statistically characterized by $\mathrm{E}(\mathbf{Z}) = \mathbf{0}$ and $\mathrm{Cov}(\mathbf{Z}) = \mathrm{E}(\mathbf{Z}\mathbf{Z}^T) = \sigma_Z^2\,\mathbf{I}$. Hence, the measurement model $\mathbf{X}$ possesses the mean vector $\mathrm{E}(\mathbf{X}) = \mathbf{H}\boldsymbol\theta$ and covariance matrix $\mathrm{Cov}(\mathbf{X}) = \sigma_Z^2\,\mathbf{I}$.
Exercise 3.3-1: Model examples:
• polynomial curve fitting
• amplitude and phase estimation of sinusoids
• FIR filter identification
Now, the least squares criterion can be expressed by

$$q(\boldsymbol\theta) = \sum_{i=1}^{n}\big(x_i - (h_{i1}\theta_1 + \dots + h_{ip}\theta_p)\big)^2 = (\mathbf{x}-\mathbf{H}\boldsymbol\theta)^T(\mathbf{x}-\mathbf{H}\boldsymbol\theta) = \mathbf{x}^T\mathbf{x} - 2\boldsymbol\theta^T\mathbf{H}^T\mathbf{x} + \boldsymbol\theta^T\mathbf{H}^T\mathbf{H}\boldsymbol\theta.$$

Differentiating with respect to $\boldsymbol\theta$ by using the identities

$$\nabla_{\boldsymbol\theta}\,\boldsymbol\theta^T\mathbf{A}\mathbf{a} = \mathbf{A}\mathbf{a}, \qquad \nabla_{\boldsymbol\theta}\,\boldsymbol\theta^T\mathbf{A}\boldsymbol\theta = (\mathbf{A}+\mathbf{A}^T)\boldsymbol\theta$$

results after equating to zero in the following so-called normal equation system:

$$\mathbf{H}^T\mathbf{H}\,\boldsymbol\theta = \mathbf{H}^T\mathbf{x}.$$

A solution of the normal equation system is given by

$$\hat{\boldsymbol\theta} = (\mathbf{H}^T\mathbf{H})^{-}\mathbf{H}^T\mathbf{x},$$
where $(\mathbf{H}^T\mathbf{H})^{-}$ denotes a generalized inverse.

1) If $\mathrm{rank}(\mathbf{H}) = p$, the number of unknown parameters, then

$$(\mathbf{H}^T\mathbf{H})^{-} = (\mathbf{H}^T\mathbf{H})^{-1}$$

is the ordinary inverse.

2) Let $\mathrm{rank}(\mathbf{H}) = M < p$, i.e. either $n < p$ or the columns of $\mathbf{H}$ are linearly dependent. Then

$$(\mathbf{H}^T\mathbf{H})^{-} = (\mathbf{H}^T\mathbf{H})^{+} = \sum_{m=1}^{M}\lambda_m^{-1}\,\mathbf{u}_m\mathbf{u}_m^T$$

is the Moore-Penrose inverse, where $\lambda_1,\dots,\lambda_M$ and $\mathbf{u}_1,\dots,\mathbf{u}_M$ denote the non-zero eigenvalues and corresponding eigenvectors of $\mathbf{H}^T\mathbf{H}$, respectively.
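A minimal numerical sketch of both cases (the design matrix and data below are invented; `numpy.linalg.pinv` supplies the Moore-Penrose inverse in the rank-deficient case):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
H = rng.normal(size=(n, p))                  # full-column-rank design matrix
theta_true = np.array([1.0, -2.0, 0.5])
x = H @ theta_true + 0.1 * rng.normal(size=n)

# case 1: rank(H) = p, solve the normal equations with the ordinary inverse
theta_ls = np.linalg.solve(H.T @ H, H.T @ x)

# case 2: linearly dependent columns, Moore-Penrose (minimum-norm) solution
H_def = np.column_stack([H[:, 0], H[:, 0], H[:, 1]])
theta_mp = np.linalg.pinv(H_def.T @ H_def) @ H_def.T @ x

print(theta_ls)   # close to theta_true
print(theta_mp)   # minimum-norm solution of the singular normal equations
```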
Supplement: A generalized inverse of $\mathbf{A}$ is defined by the property

$$\mathbf{A}\mathbf{A}^{-}\mathbf{A} = \mathbf{A}.$$

It is not unique. The Moore-Penrose inverse of $\mathbf{A}$ is defined by the properties

$$\mathbf{A}\mathbf{A}^{+}\mathbf{A} = \mathbf{A}, \quad \mathbf{A}^{+}\mathbf{A}\mathbf{A}^{+} = \mathbf{A}^{+}, \quad (\mathbf{A}\mathbf{A}^{+})^T = \mathbf{A}\mathbf{A}^{+}, \quad (\mathbf{A}^{+}\mathbf{A})^T = \mathbf{A}^{+}\mathbf{A}.$$

It is unique and provides the minimum length solution of the linear equation system

$$\mathbf{A}\mathbf{x} = \mathbf{b}.$$

The Moore-Penrose inverse $\mathbf{A}^{+}$ can be determined by employing the singular value decomposition of $\mathbf{A}$.
Exercise 3.3-2: Normal equation system and generalized inverse
Properties of the least squares estimator: The mean vector is given by

$$\mathrm{E}(\hat{\boldsymbol\Theta}) = (\mathbf{H}^T\mathbf{H})^{-}\mathbf{H}^T\mathrm{E}(\mathbf{X}) = (\mathbf{H}^T\mathbf{H})^{-}\mathbf{H}^T\mathbf{H}\boldsymbol\theta = \begin{cases}\boldsymbol\theta & \mathrm{rank}(\mathbf{H}) = p\\ \mathbf{U}_1\mathbf{U}_1^T\boldsymbol\theta = (\mathbf{I}-\mathbf{U}_2\mathbf{U}_2^T)\boldsymbol\theta & \mathrm{rank}(\mathbf{H}) = M < p\end{cases},$$

where the matrices $\mathbf{U}_1$ and $\mathbf{U}_2$ can be derived from the singular value decomposition

$$\mathbf{H}^T\mathbf{H} = \mathbf{U}\,\mathrm{diag}(\lambda_1,\dots,\lambda_M,0,\dots,0)\,\mathbf{U}^T$$

with

$$\mathbf{U} = (\mathbf{U}_1\ \mathbf{U}_2), \quad \mathbf{U}_1 = (\mathbf{u}_1,\dots,\mathbf{u}_M), \quad \mathbf{U}_2 = (\mathbf{u}_{M+1},\dots,\mathbf{u}_p).$$
The covariance matrix of the least squares estimator

$$\mathbf{C}_{\hat\Theta\hat\Theta} = \mathrm{Cov}(\hat{\boldsymbol\Theta}) = \mathrm{E}\Big(\big(\hat{\boldsymbol\Theta}-\mathrm{E}(\hat{\boldsymbol\Theta})\big)\big(\hat{\boldsymbol\Theta}-\mathrm{E}(\hat{\boldsymbol\Theta})\big)^T\Big)$$

results after exploiting

$$\mathrm{E}(\hat{\boldsymbol\Theta}) = (\mathbf{H}^T\mathbf{H})^{-}\mathbf{H}^T\mathbf{H}\boldsymbol\theta$$

in the expression

$$\begin{aligned}\mathbf{C}_{\hat\Theta\hat\Theta} &= \mathrm{E}\big((\mathbf{H}^T\mathbf{H})^{-}\mathbf{H}^T(\mathbf{X}-\mathbf{H}\boldsymbol\theta)(\mathbf{X}-\mathbf{H}\boldsymbol\theta)^T\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-}\big)\\ &= (\mathbf{H}^T\mathbf{H})^{-}\mathbf{H}^T\,\mathrm{E}\big((\mathbf{X}-\mathbf{H}\boldsymbol\theta)(\mathbf{X}-\mathbf{H}\boldsymbol\theta)^T\big)\,\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-}\\ &= (\mathbf{H}^T\mathbf{H})^{-}\mathbf{H}^T(\sigma_Z^2\mathbf{I})\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-} = \sigma_Z^2\,(\mathbf{H}^T\mathbf{H})^{-}.\end{aligned}$$
Exercise 3.3-3: LSE for p = 1 and sample mean
Theorem (Gauß-Markov): Given the model

$$\mathbf{X} = \mathbf{H}\boldsymbol\theta + \mathbf{Z} \quad\text{with}\quad X_i = h_{i1}\theta_1 + \dots + h_{ip}\theta_p + Z_i,\ i = 1,\dots,n,$$

where

$$\mathrm{E}(\mathbf{X}) = \mathbf{H}\boldsymbol\theta, \quad \mathrm{Cov}(\mathbf{X}) = \mathrm{Cov}(\mathbf{Z}) = \sigma_Z^2\,\mathbf{I}$$

and $\mathrm{rank}(\mathbf{H}) = p$. Then the best linear unbiased estimator (BLUE) is the least squares estimator

$$\hat{\boldsymbol\Theta} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{X}.$$
Exercise 3.3-4: Proof of the Gauß-Markov Theorem
Consequently, the minimum of the sum of squares is

$$\begin{aligned} q(\hat{\boldsymbol\theta}) &= \big(\mathbf{x}-\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}\big)^T\big(\mathbf{x}-\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}\big)\\ &= (\mathbf{x}-\mathbf{P}\mathbf{x})^T(\mathbf{x}-\mathbf{P}\mathbf{x}) = \mathbf{x}^T(\mathbf{I}-\mathbf{P})^T(\mathbf{I}-\mathbf{P})\mathbf{x} = \mathbf{x}^T\mathbf{P}^{\perp T}\mathbf{P}^{\perp}\mathbf{x} = \mathbf{x}^T\mathbf{P}^{\perp}\mathbf{x},\end{aligned}$$

where

$$\mathbf{P} = \mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T \quad\text{and}\quad \mathbf{P}^{\perp} = \mathbf{I}-\mathbf{P} = \mathbf{I}-\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T$$

are projection matrices, which project a vector $\mathbf{a}\in\mathbb{R}^n$ by $\mathbf{P}\mathbf{a}$ and $\mathbf{P}^{\perp}\mathbf{a}$ into $R(\mathbf{H})$ and $N(\mathbf{H}^T)$, respectively.

$R(\mathbf{H})$: range of $\mathbf{H}$, i.e. the space spanned by the columns of $\mathbf{H}$
$N(\mathbf{H}^T)$: nullspace of $\mathbf{H}^T$, i.e. the space that is orthogonal to $R(\mathbf{H})$
Thus, $\mathbf{H}^T\mathbf{P}^{\perp} = \mathbf{0}$, $\mathbf{P}^{\perp}\mathbf{H} = \mathbf{0}$ and $\mathbf{P}\mathbf{P}^{\perp} = \mathbf{P}^{\perp}\mathbf{P} = \mathbf{0}$. The projection matrices are also symmetric and idempotent, i.e.

$$\mathbf{P}^T = \mathbf{P}, \quad \mathbf{P}\mathbf{P} = \mathbf{P} \quad\text{and}\quad \mathbf{P}^{\perp T} = \mathbf{P}^{\perp}, \quad \mathbf{P}^{\perp}\mathbf{P}^{\perp} = \mathbf{P}^{\perp}.$$

Moreover, by employing the trace of a square matrix

$$\mathrm{tr}(\mathbf{A}) = \sum_{i=1}^{n} a_{ii} \quad\text{with}\quad \mathbf{A} = (a_{ij})_{i,j=1,\dots,n}$$

together with its property

$$\mathrm{tr}(\mathbf{A}\mathbf{B}\mathbf{C}) = \mathrm{tr}(\mathbf{B}\mathbf{C}\mathbf{A}) = \mathrm{tr}(\mathbf{C}\mathbf{A}\mathbf{B}),$$

where $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ are $n\times k$, $k\times l$ and $l\times n$ matrices, the minimum of the sum of squares can be expressed by

$$q(\hat{\boldsymbol\theta}) = \mathbf{x}^T\mathbf{P}^{\perp}\mathbf{x} = \mathrm{tr}(\mathbf{P}^{\perp}\mathbf{x}\mathbf{x}^T).$$
Now we want to consider how the variance $\sigma_Z^2$ of the noise model can be estimated. The expected value of the minimum of the sum of squares provides

$$\begin{aligned}\mathrm{E}\big(q(\hat{\boldsymbol\Theta})\big) &= \mathrm{E}\big(\mathrm{tr}(\mathbf{P}^{\perp}\mathbf{X}\mathbf{X}^T)\big) = \mathrm{tr}\Big(\mathbf{P}^{\perp}\,\mathrm{E}\big((\mathbf{H}\boldsymbol\theta+\mathbf{Z})(\mathbf{H}\boldsymbol\theta+\mathbf{Z})^T\big)\Big)\\ &= \mathrm{tr}\Big(\mathbf{P}^{\perp}\big(\mathbf{H}\boldsymbol\theta\boldsymbol\theta^T\mathbf{H}^T + \mathrm{E}(\mathbf{Z})\boldsymbol\theta^T\mathbf{H}^T + \mathbf{H}\boldsymbol\theta\,\mathrm{E}(\mathbf{Z}^T) + \mathrm{E}(\mathbf{Z}\mathbf{Z}^T)\big)\Big)\\ &= \mathrm{tr}\big(\mathbf{P}^{\perp}(\mathbf{H}\boldsymbol\theta\boldsymbol\theta^T\mathbf{H}^T + \sigma_Z^2\mathbf{I})\big) = \sigma_Z^2\,\mathrm{tr}(\mathbf{P}^{\perp}) = (n-p)\,\sigma_Z^2,\end{aligned}$$

where

$$\mathrm{tr}(\mathbf{P}^{\perp}) = \mathrm{tr}\big(\mathbf{I}_n - \mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\big) = n - \mathrm{tr}\big((\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{H}\big) = n - \mathrm{tr}(\mathbf{I}_p) = n - p$$

has been exploited.
This result motivates

$$S^2 = q(\hat{\boldsymbol\Theta})/(n-p)$$

as an unbiased estimator for $\sigma_Z^2$.

Theorem: If in addition to the Gauß-Markov theorem $\mathbf{Z}$ obeys a normal distribution, i.e. $\mathbf{Z}\sim\mathcal{N}_n(\mathbf{0},\sigma_Z^2\mathbf{I})$, the following holds:

a) $\hat{\boldsymbol\Theta} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{X} \sim \mathcal{N}_p\big(\boldsymbol\theta,\,\sigma_Z^2(\mathbf{H}^T\mathbf{H})^{-1}\big)
b) $\hat{\boldsymbol\Theta}$ and $S^2$ are stochastically independent
c) $(n-p)\,S^2/\sigma_Z^2 \sim \chi^2_{n-p}$
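A short simulation (invented data; not from the slides) illustrating that $S^2 = q(\hat{\boldsymbol\Theta})/(n-p)$ is indeed unbiased for $\sigma_Z^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2, trials = 40, 4, 0.5, 20_000

H = rng.normal(size=(n, p))
theta = rng.normal(size=p)
# projector onto N(H^T); q(theta_hat) = x^T P_perp x
P_perp = np.eye(n) - H @ np.linalg.inv(H.T @ H) @ H.T

s2 = np.empty(trials)
for k in range(trials):
    x = H @ theta + np.sqrt(sigma2) * rng.normal(size=n)
    s2[k] = x @ P_perp @ x / (n - p)

print(s2.mean())   # approximately sigma2 = 0.5
```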
Exercise 3.3-5: Proof of the Theorem
Exercise 3.3-6: Amplitude and phase estimation of sinusoids
Exercise 3.3-7: System identification (FIR filter)
In the following we generalize the linear model

$$\mathbf{X} = \mathbf{H}\boldsymbol\theta + \mathbf{Z} \quad\text{with}\quad X_i = h_{i1}\theta_1 + \dots + h_{ip}\theta_p + Z_i,\ i = 1,\dots,n,$$

such that the measurement noise possesses the mean $\mathrm{E}(\mathbf{Z}) = \mathbf{0}$ and the covariance matrix $\mathrm{Cov}(\mathbf{Z}) = \sigma_Z^2\,\mathbf{C}_{ZZ}$.

Prewhitening: If $\mathbf{C}_{ZZ}$ is decomposed, e.g. by the Cholesky decomposition $\mathbf{C}_{ZZ} = \mathbf{C}\mathbf{C}^T$, and

$$\mathbf{Y} = \mathbf{C}^{-1}\mathbf{X}, \quad \mathbf{K} = \mathbf{C}^{-1}\mathbf{H}, \quad \mathbf{W} = \mathbf{C}^{-1}\mathbf{Z}$$

are introduced, the linear model can be reformulated to

$$\mathbf{Y} = \mathbf{K}\boldsymbol\theta + \mathbf{W}$$

with

$$\mathrm{E}(\mathbf{Y}) = \mathbf{K}\boldsymbol\theta = \mathbf{C}^{-1}\mathbf{H}\boldsymbol\theta \quad\text{and}\quad \mathrm{Cov}(\mathbf{Y}) = \mathrm{Cov}(\mathbf{W}) = \sigma_Z^2\,\mathbf{I}.$$
Hence, the weighted least squares criterion for the generalized linear model can be derived as follows:

$$q(\boldsymbol\theta) = (\mathbf{y}-\mathbf{K}\boldsymbol\theta)^T(\mathbf{y}-\mathbf{K}\boldsymbol\theta) = \big(\mathbf{C}^{-1}(\mathbf{x}-\mathbf{H}\boldsymbol\theta)\big)^T\big(\mathbf{C}^{-1}(\mathbf{x}-\mathbf{H}\boldsymbol\theta)\big) = (\mathbf{x}-\mathbf{H}\boldsymbol\theta)^T(\mathbf{C}^{-1})^T\mathbf{C}^{-1}(\mathbf{x}-\mathbf{H}\boldsymbol\theta) = (\mathbf{x}-\mathbf{H}\boldsymbol\theta)^T\mathbf{C}_{ZZ}^{-1}(\mathbf{x}-\mathbf{H}\boldsymbol\theta).$$

The normal equation system

$$\mathbf{K}^T\mathbf{K}\,\boldsymbol\theta = \mathbf{K}^T\mathbf{y} \quad\text{resp.}\quad \mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H}\,\boldsymbol\theta = \mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{x}$$

and its solution

$$\hat{\boldsymbol\theta} = (\mathbf{K}^T\mathbf{K})^{-}\mathbf{K}^T\mathbf{y} = (\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-}\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{x}$$

are obtained analogously to the white noise case.
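A sketch of the prewhitening route in code (the AR(1)-like noise covariance and all data are invented): decompose $\mathbf{C}_{ZZ} = \mathbf{C}\mathbf{C}^T$ with `numpy.linalg.cholesky`, whiten, and solve the ordinary normal equations, which is equivalent to the weighted solution above.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 2
H = rng.normal(size=(n, p))
theta = np.array([1.5, -0.7])

# correlated noise covariance, assumed known (AR(1)-like structure)
i = np.arange(n)
C_zz = 0.8 ** np.abs(i[:, None] - i[None, :])
C = np.linalg.cholesky(C_zz)              # C_zz = C C^T
x = H @ theta + C @ rng.normal(size=n)    # sample with correlated noise

# prewhitening: y = C^{-1} x, K = C^{-1} H, then ordinary least squares
y = np.linalg.solve(C, x)
K = np.linalg.solve(C, H)
theta_gls = np.linalg.solve(K.T @ K, K.T @ y)
print(theta_gls)   # close to theta
```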
Properties of the generalized least squares estimator: The mean vector is given by

$$\mathrm{E}(\hat{\boldsymbol\Theta}) = (\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-}\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathrm{E}(\mathbf{X}) = (\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-}\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H}\boldsymbol\theta = \begin{cases}\boldsymbol\theta & \mathrm{rank}(\mathbf{H}) = p\\ \mathbf{U}_1\mathbf{U}_1^T\boldsymbol\theta = (\mathbf{I}-\mathbf{U}_2\mathbf{U}_2^T)\boldsymbol\theta & \mathrm{rank}(\mathbf{H}) = M < p\end{cases},$$

where the matrices $\mathbf{U}_1$ and $\mathbf{U}_2$ can be derived from the singular value decomposition

$$\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H} = \mathbf{U}\,\mathrm{diag}(\lambda_1,\dots,\lambda_M,0,\dots,0)\,\mathbf{U}^T$$

with

$$\mathbf{U} = (\mathbf{U}_1\ \mathbf{U}_2), \quad \mathbf{U}_1 = (\mathbf{u}_1,\dots,\mathbf{u}_M), \quad \mathbf{U}_2 = (\mathbf{u}_{M+1},\dots,\mathbf{u}_p).$$
The covariance matrix of the generalized least squares estimator

$$\mathbf{C}_{\hat\Theta\hat\Theta} = \mathrm{Cov}(\hat{\boldsymbol\Theta}) = \mathrm{E}\Big(\big(\hat{\boldsymbol\Theta}-\mathrm{E}(\hat{\boldsymbol\Theta})\big)\big(\hat{\boldsymbol\Theta}-\mathrm{E}(\hat{\boldsymbol\Theta})\big)^T\Big)$$

results after exploiting

$$\mathrm{E}(\hat{\boldsymbol\Theta}) = (\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-}\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H}\boldsymbol\theta$$

in the expression

$$\begin{aligned}\mathbf{C}_{\hat\Theta\hat\Theta} &= \mathrm{E}\big((\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-}\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}(\mathbf{X}-\mathbf{H}\boldsymbol\theta)(\mathbf{X}-\mathbf{H}\boldsymbol\theta)^T\mathbf{C}_{ZZ}^{-1}\mathbf{H}(\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-}\big)\\ &= (\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-}\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\,\mathrm{E}\big((\mathbf{X}-\mathbf{H}\boldsymbol\theta)(\mathbf{X}-\mathbf{H}\boldsymbol\theta)^T\big)\,\mathbf{C}_{ZZ}^{-1}\mathbf{H}(\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-}\\ &= (\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-}\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}(\sigma_Z^2\mathbf{C}_{ZZ})\mathbf{C}_{ZZ}^{-1}\mathbf{H}(\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-} = \sigma_Z^2\,(\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-}.\end{aligned}$$
Theorem (Gauß-Markov): Given the model

$$\mathbf{X} = \mathbf{H}\boldsymbol\theta + \mathbf{Z} \quad\text{with}\quad X_i = h_{i1}\theta_1 + \dots + h_{ip}\theta_p + Z_i,\ i = 1,\dots,n,$$

where

$$\mathrm{E}(\mathbf{X}) = \mathbf{H}\boldsymbol\theta, \quad \mathrm{Cov}(\mathbf{X}) = \mathrm{Cov}(\mathbf{Z}) = \sigma_Z^2\,\mathbf{C}_{ZZ}$$

and $\mathrm{rank}(\mathbf{H}) = p$. Then the best linear unbiased estimator (BLUE) is the least squares estimator

$$\hat{\boldsymbol\Theta} = (\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{X}.$$
Hence, the minimum of the weighted sum of squares is

$$\begin{aligned} q(\hat{\boldsymbol\theta}) &= \mathrm{tr}(\mathbf{P}^{\perp}\mathbf{y}\mathbf{y}^T) = \mathrm{tr}\Big(\big(\mathbf{I}-\mathbf{K}(\mathbf{K}^T\mathbf{K})^{-1}\mathbf{K}^T\big)\mathbf{y}\mathbf{y}^T\Big)\\ &= \mathrm{tr}\Big(\big(\mathbf{I}-\mathbf{C}^{-1}\mathbf{H}\big(\mathbf{H}^T(\mathbf{C}^{-1})^T\mathbf{C}^{-1}\mathbf{H}\big)^{-1}\mathbf{H}^T(\mathbf{C}^{-1})^T\big)\,\mathbf{C}^{-1}\mathbf{x}\mathbf{x}^T(\mathbf{C}^{-1})^T\Big)\\ &= \mathrm{tr}\Big(\mathbf{C}_{ZZ}^{-1}\big(\mathbf{I}-\mathbf{H}(\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\big)\mathbf{x}\mathbf{x}^T\Big)\\ &= \mathrm{tr}\Big(\mathbf{C}_{ZZ}^{-1}\big(\mathbf{I}-\mathbf{P}_{\mathbf{C}_{ZZ}^{-1}}\big)\mathbf{x}\mathbf{x}^T\Big) = \mathrm{tr}\Big(\mathbf{C}_{ZZ}^{-1}\,\mathbf{P}_{\mathbf{C}_{ZZ}^{-1}}^{\perp}\mathbf{x}\mathbf{x}^T\Big),\end{aligned}$$

where

$$\mathbf{P}_{\mathbf{C}_{ZZ}^{-1}} = \mathbf{H}(\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{C}_{ZZ}^{-1} \quad\text{and}\quad \mathbf{P}_{\mathbf{C}_{ZZ}^{-1}}^{\perp} = \mathbf{I}-\mathbf{P}_{\mathbf{C}_{ZZ}^{-1}}$$

again represent projection matrices.
Again, we are interested in estimating the variance $\sigma_Z^2$ of the noise. The expected value of the minimum of the weighted sum of squares results in

$$\begin{aligned}\mathrm{E}\big(q(\hat{\boldsymbol\Theta})\big) &= \mathrm{E}\Big(\mathrm{tr}\big(\mathbf{C}_{ZZ}^{-1}\mathbf{P}_{\mathbf{C}_{ZZ}^{-1}}^{\perp}\mathbf{X}\mathbf{X}^T\big)\Big) = \mathrm{tr}\Big(\mathbf{C}_{ZZ}^{-1}\mathbf{P}_{\mathbf{C}_{ZZ}^{-1}}^{\perp}\,\mathrm{E}\big((\mathbf{H}\boldsymbol\theta+\mathbf{Z})(\mathbf{H}\boldsymbol\theta+\mathbf{Z})^T\big)\Big)\\ &= \mathrm{tr}\Big(\mathbf{C}_{ZZ}^{-1}\mathbf{P}_{\mathbf{C}_{ZZ}^{-1}}^{\perp}\big(\mathbf{H}\boldsymbol\theta\boldsymbol\theta^T\mathbf{H}^T + \mathrm{E}(\mathbf{Z})\boldsymbol\theta^T\mathbf{H}^T + \mathbf{H}\boldsymbol\theta\,\mathrm{E}(\mathbf{Z}^T) + \mathrm{E}(\mathbf{Z}\mathbf{Z}^T)\big)\Big)\\ &= \mathrm{tr}\Big(\mathbf{C}_{ZZ}^{-1}\mathbf{P}_{\mathbf{C}_{ZZ}^{-1}}^{\perp}\big(\mathbf{H}\boldsymbol\theta\boldsymbol\theta^T\mathbf{H}^T + \sigma_Z^2\mathbf{C}_{ZZ}\big)\Big) = \sigma_Z^2\,\mathrm{tr}\big(\mathbf{P}_{\mathbf{C}_{ZZ}^{-1}}^{\perp}\big) = (n-p)\,\sigma_Z^2,\end{aligned}$$

where

$$\mathrm{tr}\big(\mathbf{P}_{\mathbf{C}_{ZZ}^{-1}}^{\perp}\big) = \mathrm{tr}\Big(\mathbf{I}_n - \mathbf{H}(\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\Big) = n - \mathrm{tr}\Big((\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H}\Big) = n - \mathrm{tr}(\mathbf{I}_p) = n - p$$

has been utilized.
This result motivates

$$S^2 = q(\hat{\boldsymbol\Theta})/(n-p)$$

as an unbiased estimator for $\sigma_Z^2$.

Theorem: If in addition to the Gauß-Markov theorem $\mathbf{Z}$ obeys a normal distribution, i.e. $\mathbf{Z}\sim\mathcal{N}_n(\mathbf{0},\sigma_Z^2\mathbf{C}_{ZZ})$, the following holds:

a) $\hat{\boldsymbol\Theta} = (\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{X} \sim \mathcal{N}_p\big(\boldsymbol\theta,\,\sigma_Z^2(\mathbf{H}^T\mathbf{C}_{ZZ}^{-1}\mathbf{H})^{-1}\big)
b) $\hat{\boldsymbol\Theta}$ and $S^2$ are stochastically independent
c) $(n-p)\,S^2/\sigma_Z^2 \sim \chi^2_{n-p}$
3.4 Confidence Intervals

Now, an unknown parameter $\theta$ is considered, where the density $f_{\mathbf{X}}(\mathbf{x}\,|\,\theta)$ of $\mathbf{X}$ and an estimator $\hat\Theta = \hat\theta(\mathbf{X})$ for $\theta$ possessing the density $f_{\hat\Theta}(\hat\theta\,|\,\theta)$ are given. With the knowledge of $f_{\hat\Theta}(\hat\theta\,|\,\theta)$ and a given $\alpha$, e.g. $\alpha = 0.05$, we can derive from

$$1-\alpha = P\big(\theta - a_1 < \hat\Theta \le \theta + a_2\big) = F_{\hat\Theta}(\theta + a_2\,|\,\theta) - F_{\hat\Theta}(\theta - a_1\,|\,\theta)$$

the probability equation

$$1-\alpha = P\big(\theta - a_1 < \hat\Theta \le \theta + a_2\big) = P\big(-a_1 < \hat\Theta - \theta \le a_2\big) = P\big(-\hat\Theta - a_1 < -\theta \le a_2 - \hat\Theta\big) = P\big(\hat\Theta - a_2 \le \theta < \hat\Theta + a_1\big),$$
which states that the random interval $[\hat\Theta - a_2,\,\hat\Theta + a_1)$ covers the parameter $\theta$ with probability $1-\alpha$.

For given observations $\mathbf{x}$ the set

$$C(\mathbf{x}) = \big\{\theta : \hat\theta(\mathbf{x}) - a_2 \le \theta < \hat\theta(\mathbf{x}) + a_1\big\}$$

is called

– confidence interval for $\theta$ with the confidence coefficient $1-\alpha$,

or alternatively

– interval estimate to the confidence level $1-\alpha$.
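A minimal numerical sketch for the simplest case, a Gaussian mean with known variance and a symmetric interval ($a_1 = a_2$); this anticipates Exercise 3.4-1, and the data and SciPy usage are my own illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
sigma, n, alpha = 2.0, 50, 0.05
x = rng.normal(10.0, sigma, size=n)    # invented data, true mean 10

theta_hat = x.mean()                   # estimator of the mean
a = norm.ppf(1 - alpha / 2) * sigma / np.sqrt(n)
print(f"{1 - alpha:.0%} confidence interval: "
      f"[{theta_hat - a:.3f}, {theta_hat + a:.3f})")
```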
In the case of an unknown parameter vector $\boldsymbol\theta$ an interval estimate can be constructed if a function $g(\mathbf{x}\,|\,\boldsymbol\theta)$ exists such that the corresponding random variable

$$Y = g(\mathbf{X}\,|\,\boldsymbol\theta)$$

obeys a known distribution which is independent of $\boldsymbol\theta$.

Frequently one can find such a function which depends on $\mathbf{x}$ only through the estimate $\hat{\boldsymbol\theta}(\mathbf{x})$, i.e.

$$g(\mathbf{x}\,|\,\boldsymbol\theta) = h\big(\hat{\boldsymbol\theta}\,|\,\boldsymbol\theta\big).$$

Examples are given in the subsequent exercises.
Exercise 3.4-1: Signal + white noise, confidence interval for the
• signal amplitude estimate (known noise variance)
• signal amplitude estimate (unknown noise variance)
• noise variance estimate

Exercise 3.4-2: Sinusoids + white noise, confidence interval for the amplitude estimates (unknown noise variance)
3.5 Cramer-Rao Lower Bound

In the following only unbiased estimators, i.e.

$$\mathrm{E}(\hat{\boldsymbol\Theta}) = \int_{\mathbb{R}^n}\hat{\boldsymbol\theta}(\mathbf{x})\,f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x} = \boldsymbol\theta,$$

are considered, whose covariance matrix

$$\mathbf{C}_{\hat\Theta\hat\Theta} = \mathrm{Cov}(\hat{\boldsymbol\Theta}) = \mathrm{E}\big((\hat{\boldsymbol\Theta}-\boldsymbol\theta)(\hat{\boldsymbol\Theta}-\boldsymbol\theta)^T\big)$$

exists. The Cramer-Rao inequality now provides a lower bound on the covariance matrix of any unbiased estimator of $\boldsymbol\theta$. Under the following regularity conditions certain expected values can be determined, which are of importance for deriving the Cramer-Rao inequality.
Regularity conditions:

a) The parameter set $\Omega$ is an interval in $\mathbb{R}^p$.

b) The gradient $\nabla_{\boldsymbol\theta}\ln\big(f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\big)$ exists for any $\mathbf{x}$ and $\boldsymbol\theta$.

c) The gradient of $\int_{\mathbb{R}^n} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x}$ with respect to $\boldsymbol\theta$ can be obtained by taking the gradient under the integral sign, i.e. $\nabla_{\boldsymbol\theta}\int_{\mathbb{R}^n} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x} = \int_{\mathbb{R}^n}\nabla_{\boldsymbol\theta} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x}$.

d) The gradient of $\mathrm{E}\big(\hat{\boldsymbol\theta}(\mathbf{X})\big)$ with respect to $\boldsymbol\theta$ can be obtained by taking the gradient under the integral sign, i.e. $\nabla_{\boldsymbol\theta}\int_{\mathbb{R}^n}\hat{\boldsymbol\theta}(\mathbf{x})\,f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x} = \int_{\mathbb{R}^n}\hat{\boldsymbol\theta}(\mathbf{x})\,\nabla_{\boldsymbol\theta} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x}$.
Corollary: If the former regularity conditions hold, then

1) $\mathrm{E}\Big(\nabla_{\boldsymbol\theta}\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\Big) = \mathbf{0}$

2) $\mathrm{E}\Big(\nabla_{\boldsymbol\theta}\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\,(\hat{\boldsymbol\Theta}-\boldsymbol\theta)^T\Big) = \mathbf{I}$

Proof of 1):

$$\begin{aligned}\mathrm{E}\Big(\nabla_{\boldsymbol\theta}\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\Big) &= \int_{\mathbb{R}^n}\nabla_{\boldsymbol\theta}\ln\big(f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\big)\,f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x} = \int_{\mathbb{R}^n}\frac{\nabla_{\boldsymbol\theta} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)}{f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)}\,f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x}\\ &= \int_{\mathbb{R}^n}\nabla_{\boldsymbol\theta} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x} = \nabla_{\boldsymbol\theta}\int_{\mathbb{R}^n} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x} = \nabla_{\boldsymbol\theta}1 = \mathbf{0}\end{aligned}$$
Proof of 2):

$$\begin{aligned}\mathrm{E}\Big(\nabla_{\boldsymbol\theta}\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\,(\hat{\boldsymbol\Theta}-\boldsymbol\theta)^T\Big) &= \int_{\mathbb{R}^n}\nabla_{\boldsymbol\theta}\ln\big(f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\big)\,\big(\hat{\boldsymbol\theta}(\mathbf{x})-\boldsymbol\theta\big)^T f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x}\\ &= \int_{\mathbb{R}^n}\frac{\nabla_{\boldsymbol\theta} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)}{f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)}\,\big(\hat{\boldsymbol\theta}(\mathbf{x})-\boldsymbol\theta\big)^T f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x} = \int_{\mathbb{R}^n}\nabla_{\boldsymbol\theta} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,\big(\hat{\boldsymbol\theta}(\mathbf{x})-\boldsymbol\theta\big)^T d\mathbf{x}\\ &= \int_{\mathbb{R}^n}\nabla_{\boldsymbol\theta} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,\hat{\boldsymbol\theta}(\mathbf{x})^T d\mathbf{x} - \int_{\mathbb{R}^n}\nabla_{\boldsymbol\theta} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x}\;\boldsymbol\theta^T\\ &= \nabla_{\boldsymbol\theta}\boldsymbol\theta^T - (\nabla_{\boldsymbol\theta}1)\,\boldsymbol\theta^T = \mathbf{I}\end{aligned}$$
Definition (Fisher information matrix): The $p\times p$ Fisher information matrix is defined by

$$\mathbf{I}(\boldsymbol\theta) = \mathrm{E}\Big(\nabla_{\boldsymbol\theta}\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\,\nabla_{\boldsymbol\theta}^T\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\Big) = \int_{\mathbb{R}^n}\nabla_{\boldsymbol\theta}\ln\big(f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\big)\,\nabla_{\boldsymbol\theta}^T\ln\big(f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\big)\,f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x}.$$

Hence, the entries of the Fisher information matrix are for $i,j = 1,\dots,p$ given by

$$I_{ij}(\boldsymbol\theta) = \mathrm{E}\bigg(\frac{\partial\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)}{\partial\theta_i}\cdot\frac{\partial\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)}{\partial\theta_j}\bigg) = \int_{\mathbb{R}^n}\frac{\partial\ln\big(f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\big)}{\partial\theta_i}\cdot\frac{\partial\ln\big(f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\big)}{\partial\theta_j}\,f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x}.$$
Theorem (Cramer-Rao inequality): Suppose the former assumptions and regularity conditions hold and the Fisher information matrix $\mathbf{I}(\boldsymbol\theta)$ is positive definite. Then the variance of any estimator $\hat\Theta_a = \mathbf{a}^T\hat{\boldsymbol\Theta}$ is bounded from below by

$$\mathrm{Var}(\hat\Theta_a) = \mathbf{a}^T\mathbf{C}_{\hat\Theta\hat\Theta}\,\mathbf{a} \ge \mathbf{a}^T\mathbf{I}(\boldsymbol\theta)^{-1}\mathbf{a} \quad \forall\,\mathbf{a}\in\mathbb{R}^p.$$

If in particular $\mathbf{a} = \mathbf{e}_i$, i.e. the $i$-th Cartesian unit vector, the inequality provides the lower limit

$$\mathrm{Var}(\hat\Theta_i) = \mathbf{e}_i^T\mathbf{C}_{\hat\Theta\hat\Theta}\,\mathbf{e}_i \ge \mathbf{e}_i^T\mathbf{I}(\boldsymbol\theta)^{-1}\mathbf{e}_i = \big(\mathbf{I}(\boldsymbol\theta)^{-1}\big)_{ii}, \quad i = 1,\dots,p,$$

for the variance of the $i$-th component of $\hat{\boldsymbol\Theta}$.
Proof: First, the random vector

$$\mathbf{Y} = \hat{\boldsymbol\Theta} - \boldsymbol\theta - \mathbf{I}(\boldsymbol\theta)^{-1}\nabla_{\boldsymbol\theta}\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)$$

is introduced and its second order moment matrix

$$\begin{aligned}\mathrm{E}(\mathbf{Y}\mathbf{Y}^T) &= \mathrm{E}\big((\hat{\boldsymbol\Theta}-\boldsymbol\theta)(\hat{\boldsymbol\Theta}-\boldsymbol\theta)^T\big) - \mathrm{E}\Big((\hat{\boldsymbol\Theta}-\boldsymbol\theta)\,\nabla_{\boldsymbol\theta}^T\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\Big)\,\mathbf{I}(\boldsymbol\theta)^{-1}\\ &\quad - \mathbf{I}(\boldsymbol\theta)^{-1}\,\mathrm{E}\Big(\nabla_{\boldsymbol\theta}\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\,(\hat{\boldsymbol\Theta}-\boldsymbol\theta)^T\Big) + \mathbf{I}(\boldsymbol\theta)^{-1}\,\mathrm{E}\Big(\nabla_{\boldsymbol\theta}\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\,\nabla_{\boldsymbol\theta}^T\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\Big)\,\mathbf{I}(\boldsymbol\theta)^{-1}\end{aligned}$$

is determined. After exploiting the former corollary, the
second order moment matrix can be simplified to

$$\mathrm{E}(\mathbf{Y}\mathbf{Y}^T) = \mathrm{E}\big((\hat{\boldsymbol\Theta}-\boldsymbol\theta)(\hat{\boldsymbol\Theta}-\boldsymbol\theta)^T\big) - \mathbf{I}(\boldsymbol\theta)^{-1}.$$

Since

$$\mathbf{a}^T\mathrm{E}(\mathbf{Y}\mathbf{Y}^T)\,\mathbf{a} = \mathrm{E}(\mathbf{a}^T\mathbf{Y}\mathbf{Y}^T\mathbf{a}) = \mathrm{E}\big((\mathbf{a}^T\mathbf{Y})^2\big) \ge 0 \quad \forall\,\mathbf{a}\in\mathbb{R}^p,$$

the second order moment matrix is positive semidefinite. Thus, we can write

$$\mathbf{a}^T\mathrm{E}(\mathbf{Y}\mathbf{Y}^T)\,\mathbf{a} = \mathbf{a}^T\mathbf{C}_{\hat\Theta\hat\Theta}\,\mathbf{a} - \mathbf{a}^T\mathbf{I}(\boldsymbol\theta)^{-1}\mathbf{a} \ge 0 \quad \forall\,\mathbf{a}\in\mathbb{R}^p,$$

and the assertion

$$\mathbf{a}^T\mathbf{C}_{\hat\Theta\hat\Theta}\,\mathbf{a} \ge \mathbf{a}^T\mathbf{I}(\boldsymbol\theta)^{-1}\mathbf{a} \quad \forall\,\mathbf{a}\in\mathbb{R}^p$$

of the theorem follows.
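A standard worked example (not from the slides): for $n$ i.i.d. observations $X_k\sim\mathcal{N}(\theta,\sigma^2)$ with known $\sigma^2$,

$$\ln\big(f_{\mathbf{X}}(\mathbf{x}\,|\,\theta)\big) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{k=1}^{n}(x_k-\theta)^2, \qquad \frac{\partial\ln\big(f_{\mathbf{X}}(\mathbf{x}\,|\,\theta)\big)}{\partial\theta} = \frac{1}{\sigma^2}\sum_{k=1}^{n}(x_k-\theta),$$

so that $I(\theta) = n/\sigma^2$ and $\mathrm{Var}(\hat\Theta)\ge\sigma^2/n$; the sample mean attains this bound.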
Under the following regularity condition an alternative and often more convenient expression for calculating the information matrix can be presented.

Regularity condition:

e) The Hessian matrix of $\int_{\mathbb{R}^n} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x}$ with respect to $\boldsymbol\theta$ can be obtained by taking all the required second partial derivatives under the integral sign, i.e.

$$\nabla_{\boldsymbol\theta}\nabla_{\boldsymbol\theta}^T\int_{\mathbb{R}^n} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x} = \int_{\mathbb{R}^n}\nabla_{\boldsymbol\theta}\nabla_{\boldsymbol\theta}^T f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x}.$$
Corollary: If the former regularity condition holds, the Fisher information matrix can alternatively be determined by

$$\mathbf{I}(\boldsymbol\theta) = \mathrm{E}\Big(\nabla_{\boldsymbol\theta}\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\,\nabla_{\boldsymbol\theta}^T\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\Big) = -\mathrm{E}\Big(\nabla_{\boldsymbol\theta}\nabla_{\boldsymbol\theta}^T\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\Big).$$

Consequently, the elements of the Fisher information matrix can be expressed by

$$I_{ij}(\boldsymbol\theta) = \mathrm{E}\bigg(\frac{\partial\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)}{\partial\theta_i}\cdot\frac{\partial\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)}{\partial\theta_j}\bigg) = -\mathrm{E}\bigg(\frac{\partial^2\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)}{\partial\theta_i\,\partial\theta_j}\bigg), \quad i,j = 1,\dots,p.$$
Proof:

$$\begin{aligned}\mathrm{E}\Big(\nabla_{\boldsymbol\theta}\nabla_{\boldsymbol\theta}^T\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\Big) &= \mathrm{E}\bigg(\nabla_{\boldsymbol\theta}\frac{\nabla_{\boldsymbol\theta}^T f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)}{f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)}\bigg)\\ &= \mathrm{E}\bigg(\frac{\nabla_{\boldsymbol\theta}\nabla_{\boldsymbol\theta}^T f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\cdot f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta) - \nabla_{\boldsymbol\theta} f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\cdot\nabla_{\boldsymbol\theta}^T f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)}{f_{\mathbf{X}}^2(\mathbf{X}\,|\,\boldsymbol\theta)}\bigg)\\ &= \mathrm{E}\bigg(\frac{\nabla_{\boldsymbol\theta}\nabla_{\boldsymbol\theta}^T f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)}{f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)}\bigg) - \mathrm{E}\bigg(\frac{\nabla_{\boldsymbol\theta} f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\cdot\nabla_{\boldsymbol\theta}^T f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)}{f_{\mathbf{X}}^2(\mathbf{X}\,|\,\boldsymbol\theta)}\bigg)\\ &= \int_{\mathbb{R}^n}\nabla_{\boldsymbol\theta}\nabla_{\boldsymbol\theta}^T f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x} - \mathrm{E}\Big(\nabla_{\boldsymbol\theta}\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\,\nabla_{\boldsymbol\theta}^T\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\Big)\\ &= \nabla_{\boldsymbol\theta}\nabla_{\boldsymbol\theta}^T 1 - \mathbf{I}(\boldsymbol\theta) = -\mathbf{I}(\boldsymbol\theta)\end{aligned}$$
Theorem: Let $f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\xi)$ belong to the exponential family in canonical form and $\mathbf{T}$ denote the corresponding sufficient statistic.

1) The regularity conditions a) - e) are satisfied and the information matrix is positive definite:

$$\mathbf{I}(\boldsymbol\xi) = \mathrm{E}\Big(\nabla_{\boldsymbol\xi}\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\xi)\big)\,\nabla_{\boldsymbol\xi}^T\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\xi)\big)\Big) = -\mathrm{E}\Big(\nabla_{\boldsymbol\xi}\nabla_{\boldsymbol\xi}^T\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\xi)\big)\Big) = \nabla_{\boldsymbol\xi}\nabla_{\boldsymbol\xi}^T A(\boldsymbol\xi) = \mathrm{Cov}(\mathbf{T}) = \mathbf{C}_{TT}.$$

Thus, for unbiased estimators the Cramer-Rao inequality holds.
2) Let the vector-valued function

$$\boldsymbol\xi = \mathbf{g}(\boldsymbol\theta), \quad \boldsymbol\xi\in\mathbb{R}^k,\ \boldsymbol\theta\in\mathbb{R}^p \ \text{with}\ p \le k,$$

be one-to-one and differentiable with

$$\mathbf{G}(\boldsymbol\theta) = \nabla_{\boldsymbol\theta}\,\mathbf{g}(\boldsymbol\theta)^T = \bigg(\frac{\partial g_j(\boldsymbol\theta)}{\partial\theta_i}\bigg)_{i=1,\dots,p;\,j=1,\dots,k}.$$

Then the regularity conditions a) - e) are satisfied with respect to $\boldsymbol\theta$ and the information matrix can be expressed by

$$\mathbf{I}(\boldsymbol\theta) = -\mathrm{E}\Big(\nabla_{\boldsymbol\theta}\nabla_{\boldsymbol\theta}^T\ln\big(f_{\mathbf{X}}(\mathbf{X}\,|\,\boldsymbol\theta)\big)\Big) = \mathbf{G}(\boldsymbol\theta)\,\mathbf{I}(\boldsymbol\xi)\big|_{\boldsymbol\xi=\mathbf{g}(\boldsymbol\theta)}\,\mathbf{G}(\boldsymbol\theta)^T = \mathbf{G}(\boldsymbol\theta)\,\big(\nabla_{\boldsymbol\xi}\nabla_{\boldsymbol\xi}^T A(\boldsymbol\xi)\big)\big|_{\boldsymbol\xi=\mathbf{g}(\boldsymbol\theta)}\,\mathbf{G}(\boldsymbol\theta)^T.$$
The question immediately arises whether there exists an unbiased estimator for which the information inequality takes the equality sign. An estimator whose covariance matrix coincides with the inverse Fisher information matrix (the Cramer-Rao lower bound) is called efficient.

Efficient estimators can be achieved only in special cases, since the random vector $\mathbf{Y}$, introduced for proving the information inequality, must be the zero vector with probability 1.
Exercise 3.5-1: (Non-)efficiency of linear least squares estimators when the noise is i.i.d. and normally distributed
The previous exercise showed that the deviation of the covariance matrix from the inverse Fisher information matrix can be neglected for large samples. This motivates the definition of the limit

$$\boldsymbol\Gamma(\boldsymbol\theta) = \lim_{n\to\infty}\frac{1}{n}\,\mathbf{I}_n(\boldsymbol\theta),$$

where $n$ indicates the growing sample size. If the matrix $\boldsymbol\Gamma(\boldsymbol\theta)$ exists and is not singular, then unbiased and consistent estimators $\hat{\boldsymbol\Theta}_n$ are of interest that satisfy the limit

$$\lim_{n\to\infty} n\,\mathbf{C}_{\hat\Theta_n\hat\Theta_n} = \boldsymbol\Gamma(\boldsymbol\theta)^{-1}.$$

Simultaneously, such estimators are then even consistent in the mean square sense.
Estimators used in practice are often neither unbiased nor mean square consistent. In these cases the limiting distributions of such estimators are considered. If the limiting distributions have the property

$$\sqrt{n}\,(\hat{\boldsymbol\Theta}_n - \boldsymbol\theta) \xrightarrow[n\to\infty]{} \mathcal{N}_p\big(\mathbf{0},\mathbf{D}(\boldsymbol\theta)\big) \quad\text{and}\quad \mathbf{D}(\boldsymbol\theta) = \boldsymbol\Gamma(\boldsymbol\theta)^{-1},$$

the estimators are said to be asymptotically efficient. Furthermore, under certain regularity conditions (similar to the former) one can show that in the class of asymptotically normally distributed estimators the following lower bound holds:

$$\mathbf{a}^T\mathbf{D}(\boldsymbol\theta)\,\mathbf{a} \ge \mathbf{a}^T\boldsymbol\Gamma(\boldsymbol\theta)^{-1}\mathbf{a} \quad \forall\,\mathbf{a}\in\mathbb{R}^p.$$
Exercise 3.5-2: Asymptotic efficiency of linear least squares estimators when the noise is i.i.d. and normally distributed
3.6 Maximum Likelihood Estimation

The maximum likelihood method is one of the most important procedures for the construction of estimators. For given observations $\mathbf{x}$ the procedure selects as maximum likelihood estimate $\hat{\boldsymbol\theta}(\mathbf{x})$ that $\boldsymbol\theta$ for which the density function $f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)$ takes its maximum. Thus, the maximum likelihood estimate can be interpreted heuristically as that parameter vector that makes the occurrence of the observed data most likely. Instead of determining the maximum of $f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)$, one can, due to the monotonicity of the logarithmic function, alternatively maximize $\ln\big(f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\big)$.
Definition: Suppose $\mathbf{X}$ possesses the density $f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)$ with $\boldsymbol\theta\in\Omega$ and the vector of observations $\mathbf{x} = (x_1,\dots,x_n)^T$ is given. Then $f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)$ is called the likelihood function and

$$\hat{\boldsymbol\theta} = \arg\max_{\boldsymbol\theta\in\Omega} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta) = \arg\max_{\boldsymbol\theta\in\Omega}\ln\big(f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\big)$$

is called the maximum likelihood estimate (MLE) for $\boldsymbol\theta$.

If the gradient of $f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)$ with respect to $\boldsymbol\theta$ exists and if $f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)$ is positive for the given $\mathbf{x}$, one can try to find the MLE by solving the likelihood equation system

$$\nabla_{\boldsymbol\theta}\ln\big(f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\big)\Big|_{\boldsymbol\theta=\hat{\boldsymbol\theta}(\mathbf{x})} = \mathbf{0}.$$
Remarks:
• Generally the likelihood equation system is nonlinear.
• The likelihood equation system can possess several solutions.
• The solutions of the likelihood equation system do not necessarily correspond to relative maxima.
• If the absolute maximum of the likelihood function lies on the boundary of $\Omega$, it cannot be described by a solution of the likelihood equation system.
• Consequently, one generally has to employ sophisticated numerical optimization techniques for determining maximum likelihood estimates, as sketched below.
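A sketch of such a numerical maximization (exponential distribution with invented data; the closed form $\hat\lambda = 1/\bar x$ serves as a cross-check, and SciPy is assumed to be available):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
x = rng.exponential(scale=1 / 3.0, size=200)   # true rate lambda = 3

def neg_log_lik(lam):
    # -ln f(x | lambda) for i.i.d. Exponential(lambda) observations
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded")
print(res.x, 1 / x.mean())   # numerical MLE vs. closed form 1/x_bar
```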
Properties of Maximum Likelihood Estimation:

1) Let $\boldsymbol\theta = \mathbf{g}(\boldsymbol\eta)$ with $\mathbf{g}:\mathbb{R}^q\to\mathbb{R}^p$, $p \le q$, be a continuous mapping of the parameter vector $\boldsymbol\eta$ and $\hat{\boldsymbol\eta}$ the MLE of $\boldsymbol\eta$. Then the MLE of $\boldsymbol\theta$ is given by $\hat{\boldsymbol\theta} = \mathbf{g}(\hat{\boldsymbol\eta})$.

2) Suppose $t(\mathbf{x})$ is a sufficient transformation for $\boldsymbol\theta$, i.e.

$$f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta) = g\big(t(\mathbf{x})\,|\,\boldsymbol\theta\big)\cdot h(\mathbf{x}).$$

Then the maximum likelihood estimate $\hat{\boldsymbol\theta}$ depends on $\mathbf{x}$ only through $t(\mathbf{x})$.

3) If $f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\xi)$ belongs to the exponential family in canonical form, the related likelihood equation system

$$\nabla_{\boldsymbol\xi}\ln\big(f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\xi)\big) = \mathbf{t}(\mathbf{x}) - \nabla_{\boldsymbol\xi}A(\boldsymbol\xi) = \mathbf{0}$$

possesses a unique solution.
Exercise 3.6-1: Maximum likelihood estimation for linear models when the noise is i.i.d. and normally distributed

Exercise 3.6-2: Sinusoids + white noise, maximum likelihood estimation of amplitude, phase and frequency when the noise is i.i.d. and normally distributed

Exercise 3.6-3: Example of inconsistent maximum likelihood estimation
3.7 Bayesian Estimation

In the Bayesian theory of parameter estimation the unknown parameter vector $\boldsymbol\theta$ is treated itself as a realization of a random experiment that obeys its own so-called prior distribution $f_{\boldsymbol\Theta}(\boldsymbol\theta)$. The objective is to use $f_{\boldsymbol\Theta}(\boldsymbol\theta)$ together with the measurements $\mathbf{x}$ drawn from the distribution $f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)$ to turn the prior distribution into a posterior distribution

$$f_{\boldsymbol\Theta}(\boldsymbol\theta\,|\,\mathbf{x}) = \frac{f_{\mathbf{X}\boldsymbol\Theta}(\mathbf{x},\boldsymbol\theta)}{f_{\mathbf{X}}(\mathbf{x})} = \frac{f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,f_{\boldsymbol\Theta}(\boldsymbol\theta)}{\int_{\mathbb{R}^p} f_{\mathbf{X}\boldsymbol\Theta}(\mathbf{x},\boldsymbol\theta)\,d\boldsymbol\theta} = \frac{f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,f_{\boldsymbol\Theta}(\boldsymbol\theta)}{\int_{\mathbb{R}^p} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,f_{\boldsymbol\Theta}(\boldsymbol\theta)\,d\boldsymbol\theta}$$

that is used for the statistical inference.
Loss function: The quality of the estimate $\hat{\boldsymbol\theta}(\mathbf{x})$ is measured by a real-valued loss function $L(\boldsymbol\theta,\hat{\boldsymbol\theta}(\mathbf{x}))$, e.g. by the squared Euclidean distance between $\boldsymbol\theta$ and $\hat{\boldsymbol\theta}(\mathbf{x})$, i.e.

$$L\big(\boldsymbol\theta,\hat{\boldsymbol\theta}(\mathbf{x})\big) = \big(\boldsymbol\theta-\hat{\boldsymbol\theta}(\mathbf{x})\big)^T\big(\boldsymbol\theta-\hat{\boldsymbol\theta}(\mathbf{x})\big).$$

Risk or loss average: The risk $R_L(\hat{\boldsymbol\theta}\,|\,\boldsymbol\theta)$ is the loss averaged over the distribution of the measurements for any fixed parameter vector $\boldsymbol\theta\in\Omega$, i.e.

$$R_L(\hat{\boldsymbol\theta}\,|\,\boldsymbol\theta) = \mathrm{E}_{\mathbf{X}|\boldsymbol\theta}\Big(L\big(\boldsymbol\theta,\hat{\boldsymbol\theta}(\mathbf{X})\big)\Big) = \int_{\mathbb{R}^n} L\big(\boldsymbol\theta,\hat{\boldsymbol\theta}(\mathbf{x})\big)\,f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,d\mathbf{x}.$$
Bayes risk: The Bayes risk $R_L(\hat{\boldsymbol\theta})$ is the risk averaged over the prior distribution $f_{\boldsymbol\Theta}(\boldsymbol\theta)$, i.e.

$$\begin{aligned} R_L(\hat{\boldsymbol\theta}) &= \mathrm{E}_{\boldsymbol\Theta}\big(R_L(\hat{\boldsymbol\theta}\,|\,\boldsymbol\Theta)\big) = \mathrm{E}_{\boldsymbol\Theta}\Big(\mathrm{E}_{\mathbf{X}|\boldsymbol\theta}\big(L(\boldsymbol\Theta,\hat{\boldsymbol\theta}(\mathbf{X}))\big)\Big)\\ &= \int_{\mathbb{R}^p}\int_{\mathbb{R}^n} L\big(\boldsymbol\theta,\hat{\boldsymbol\theta}(\mathbf{x})\big)\,f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,f_{\boldsymbol\Theta}(\boldsymbol\theta)\,d\mathbf{x}\,d\boldsymbol\theta = \int_{\mathbb{R}^{n+p}} L\big(\boldsymbol\theta,\hat{\boldsymbol\theta}(\mathbf{x})\big)\,f_{\mathbf{X}\boldsymbol\Theta}(\mathbf{x},\boldsymbol\theta)\,d\mathbf{x}\,d\boldsymbol\theta = \mathrm{E}_{\mathbf{X}\boldsymbol\Theta}\big(L(\boldsymbol\Theta,\hat{\boldsymbol\theta}(\mathbf{X}))\big).\end{aligned}$$

The Bayes risk measures the average risk one incurs if the estimator $\hat{\boldsymbol\Theta} = \hat{\boldsymbol\theta}(\mathbf{X})$ is used for the random experiment at hand. Hence, $R_L(\hat{\boldsymbol\theta})$ depends only on the estimating function $\hat{\boldsymbol\theta}(\mathbf{x})$.
Bayes estimator: Minimization of the Bayes risk $R_L(\hat{\boldsymbol\theta})$ over the class of all estimating functions $\hat{\boldsymbol\theta}(\mathbf{x})$ for which $R_L(\hat{\boldsymbol\theta})$ exists provides the Bayes estimating function

$$\hat{\boldsymbol\theta}_L(\mathbf{x}) = \arg\min_{\hat{\boldsymbol\theta}(\mathbf{x})} R_L(\hat{\boldsymbol\theta}) = \arg\min_{\hat{\boldsymbol\theta}(\mathbf{x})}\mathrm{E}_{\mathbf{X}\boldsymbol\Theta}\big(L(\boldsymbol\Theta,\hat{\boldsymbol\theta}(\mathbf{X}))\big)$$

and therefore the Bayes estimator $\hat{\boldsymbol\Theta}_L = \hat{\boldsymbol\theta}_L(\mathbf{X})$.

In the following, Bayes estimators are considered in more detail for the square error, the absolute error and uniform error loss functions.
3.7.1 Minimum Mean Square Error Estimation

Non-Linear Minimum Mean Square Error Estimation: To estimate the random parameter vector $\boldsymbol\Theta$ by means of the realization of the random vector $\mathbf{X}$ in the minimum mean square error sense we have to find an estimating function $\hat{\boldsymbol\theta}(\mathbf{x})$ such that

$$R_{MSE}(\hat{\boldsymbol\theta}) = \mathrm{E}\Big(\big(\boldsymbol\Theta-\hat{\boldsymbol\theta}(\mathbf{X})\big)^T\big(\boldsymbol\Theta-\hat{\boldsymbol\theta}(\mathbf{X})\big)\Big) = \int_{\mathbb{R}^{n+p}}\big(\boldsymbol\theta-\hat{\boldsymbol\theta}(\mathbf{x})\big)^T\big(\boldsymbol\theta-\hat{\boldsymbol\theta}(\mathbf{x})\big)\,f_{\mathbf{X}\boldsymbol\Theta}(\mathbf{x},\boldsymbol\theta)\,d\mathbf{x}\,d\boldsymbol\theta$$

is minimum. Over the class of all functions for which the expected value exists, $R_{MSE}(\hat{\boldsymbol\theta})$ is minimized by
$$\hat{\boldsymbol\theta}_{MMSE}(\mathbf{x}) = \mathrm{E}\big(\boldsymbol\theta\,|\,\mathbf{X}=\mathbf{x}\big),$$

cf. Exercise 1.9-18.

Linear Minimum Mean Square Error Estimation: Now, we wish to estimate the random parameter vector $\boldsymbol\Theta$ by the linear estimating function

$$\hat{\boldsymbol\theta}(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{b}$$

such that the mean square error

$$R_{LMSE}(\mathbf{A},\mathbf{b}) = \mathrm{E}\Big(\big(\boldsymbol\Theta-(\mathbf{A}\mathbf{X}+\mathbf{b})\big)^T\big(\boldsymbol\Theta-(\mathbf{A}\mathbf{X}+\mathbf{b})\big)\Big) = \int_{\mathbb{R}^{n+p}}\big(\boldsymbol\theta-(\mathbf{A}\mathbf{x}+\mathbf{b})\big)^T\big(\boldsymbol\theta-(\mathbf{A}\mathbf{x}+\mathbf{b})\big)\,f_{\mathbf{X}\boldsymbol\Theta}(\mathbf{x},\boldsymbol\theta)\,d\mathbf{x}\,d\boldsymbol\theta$$
is minimized by varying the parameter matrix $\mathbf{A}$ and vector $\mathbf{b}$. Thus, we have to solve the minimization problem

$$(\hat{\mathbf{A}},\hat{\mathbf{b}}) = \arg\min_{\mathbf{A},\mathbf{b}} R_{LMSE}(\mathbf{A},\mathbf{b}).$$

The mean square error is minimum if

$$\frac{\partial}{\partial\mathbf{A}} R_{LMSE}(\mathbf{A},\mathbf{b}) = -2\,\mathrm{E}\Big(\big(\boldsymbol\Theta-(\mathbf{A}\mathbf{X}+\mathbf{b})\big)\mathbf{X}^T\Big) = \mathbf{0}, \qquad \nabla_{\mathbf{b}} R_{LMSE}(\mathbf{A},\mathbf{b}) = -2\,\mathrm{E}\big(\boldsymbol\Theta-(\mathbf{A}\mathbf{X}+\mathbf{b})\big) = \mathbf{0}$$

are satisfied. Applying the expectation operator we obtain

$$\mathbf{R}_{\Theta X} - \mathbf{A}\mathbf{R}_{XX} - \mathbf{b}\boldsymbol\mu_X^T = \mathbf{0} \quad\text{and}\quad \boldsymbol\mu_\Theta - \mathbf{A}\boldsymbol\mu_X - \mathbf{b} = \mathbf{0},$$

where

$$\mathbf{R}_{XX} = \mathrm{E}(\mathbf{X}\mathbf{X}^T), \quad \mathbf{R}_{\Theta X} = \mathrm{E}(\boldsymbol\Theta\mathbf{X}^T), \quad \boldsymbol\mu_X = \mathrm{E}(\mathbf{X}) \quad\text{and}\quad \boldsymbol\mu_\Theta = \mathrm{E}(\boldsymbol\Theta).$$
Resolving for $\hat{\mathbf{A}}$ and $\hat{\mathbf{b}}$ leads after some manipulations to

$$\hat{\mathbf{A}} = \big(\mathbf{R}_{\Theta X}-\boldsymbol\mu_\Theta\boldsymbol\mu_X^T\big)\big(\mathbf{R}_{XX}-\boldsymbol\mu_X\boldsymbol\mu_X^T\big)^{-1}, \qquad \hat{\mathbf{b}} = \boldsymbol\mu_\Theta - \big(\mathbf{R}_{\Theta X}-\boldsymbol\mu_\Theta\boldsymbol\mu_X^T\big)\big(\mathbf{R}_{XX}-\boldsymbol\mu_X\boldsymbol\mu_X^T\big)^{-1}\boldsymbol\mu_X$$

and therefore to the estimating function

$$\hat{\boldsymbol\theta}_{LMMSE}(\mathbf{x}) = \hat{\mathbf{A}}\mathbf{x} + \hat{\mathbf{b}} = \boldsymbol\mu_\Theta + \big(\mathbf{R}_{\Theta X}-\boldsymbol\mu_\Theta\boldsymbol\mu_X^T\big)\big(\mathbf{R}_{XX}-\boldsymbol\mu_X\boldsymbol\mu_X^T\big)^{-1}\big(\mathbf{x}-\boldsymbol\mu_X\big).$$

After exploiting the relationships between the covariance and the first and second order moments,

$$\mathbf{C}_{XX} = \mathbf{R}_{XX}-\boldsymbol\mu_X\boldsymbol\mu_X^T \quad\text{and}\quad \mathbf{C}_{\Theta X} = \mathbf{R}_{\Theta X}-\boldsymbol\mu_\Theta\boldsymbol\mu_X^T,$$

the estimating function can also be expressed by

$$\hat{\boldsymbol\theta}_{LMMSE}(\mathbf{x}) = \boldsymbol\mu_\Theta + \mathbf{C}_{\Theta X}\mathbf{C}_{XX}^{-1}\big(\mathbf{x}-\boldsymbol\mu_X\big).$$
Theorem: Let $\mathbf{Y}$ be a random vector composed of the random vectors $\mathbf{X}$ and $\boldsymbol\Theta$ and obeying the distribution

$$\mathbf{Y} = \begin{pmatrix}\mathbf{X}\\ \boldsymbol\Theta\end{pmatrix} \sim \mathcal{N}_{n+p}\Bigg(\begin{pmatrix}\boldsymbol\mu_X\\ \boldsymbol\mu_\Theta\end{pmatrix},\ \begin{pmatrix}\boldsymbol\Sigma_{XX} & \boldsymbol\Sigma_{X\Theta}\\ \boldsymbol\Sigma_{\Theta X} & \boldsymbol\Sigma_{\Theta\Theta}\end{pmatrix}\Bigg).$$

Then $\mathbf{X}$ and $\boldsymbol\Theta$ possess the marginal distributions

$$\mathbf{X}\sim\mathcal{N}_n(\boldsymbol\mu_X,\boldsymbol\Sigma_{XX}), \qquad \boldsymbol\Theta\sim\mathcal{N}_p(\boldsymbol\mu_\Theta,\boldsymbol\Sigma_{\Theta\Theta})$$

and the conditional distribution

$$\boldsymbol\Theta\,|\,\mathbf{X}=\mathbf{x} \sim \mathcal{N}_p\big(\boldsymbol\mu_\Theta + \boldsymbol\Sigma_{\Theta X}\boldsymbol\Sigma_{XX}^{-1}(\mathbf{x}-\boldsymbol\mu_X),\ \boldsymbol\Sigma_{\Theta\Theta} - \boldsymbol\Sigma_{\Theta X}\boldsymbol\Sigma_{XX}^{-1}\boldsymbol\Sigma_{X\Theta}\big),$$

cf. Exercise 1.10-7.
Theorem: Given the linear model

$$\mathbf{X} = \mathbf{H}\boldsymbol\Theta + \mathbf{Z},$$

where $\mathbf{H}$ is a known matrix, and $\boldsymbol\Theta$ and $\mathbf{Z}$ are statistically independent random vectors with $\boldsymbol\Theta\sim\mathcal{N}_p(\boldsymbol\mu_\Theta,\boldsymbol\Sigma_{\Theta\Theta})$ and $\mathbf{Z}\sim\mathcal{N}_n(\mathbf{0},\boldsymbol\Sigma_{ZZ})$. Then

$$\mathbf{Y} = \begin{pmatrix}\mathbf{X}\\ \boldsymbol\Theta\end{pmatrix} \sim \mathcal{N}_{n+p}\Bigg(\begin{pmatrix}\mathbf{H}\boldsymbol\mu_\Theta\\ \boldsymbol\mu_\Theta\end{pmatrix},\ \begin{pmatrix}\mathbf{H}\boldsymbol\Sigma_{\Theta\Theta}\mathbf{H}^T + \boldsymbol\Sigma_{ZZ} & \mathbf{H}\boldsymbol\Sigma_{\Theta\Theta}\\ \boldsymbol\Sigma_{\Theta\Theta}\mathbf{H}^T & \boldsymbol\Sigma_{\Theta\Theta}\end{pmatrix}\Bigg)$$

and the MMSE estimate can be expressed by

$$\hat{\boldsymbol\theta}_{MMSE}(\mathbf{x}) = \boldsymbol\mu_\Theta + \boldsymbol\Sigma_{\Theta\Theta}\mathbf{H}^T\big(\mathbf{H}\boldsymbol\Sigma_{\Theta\Theta}\mathbf{H}^T + \boldsymbol\Sigma_{ZZ}\big)^{-1}\big(\mathbf{x}-\mathbf{H}\boldsymbol\mu_\Theta\big).$$
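A numerical sketch of this posterior-mean formula (dimensions, prior and noise covariances are invented):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 20, 3
H = rng.normal(size=(n, p))
mu_theta = np.zeros(p)
Sigma_tt = np.eye(p)           # prior covariance of Theta
Sigma_zz = 0.1 * np.eye(n)     # noise covariance

theta = rng.multivariate_normal(mu_theta, Sigma_tt)
x = H @ theta + rng.multivariate_normal(np.zeros(n), Sigma_zz)

# theta_mmse = mu + Sigma_tt H^T (H Sigma_tt H^T + Sigma_zz)^{-1} (x - H mu)
S = H @ Sigma_tt @ H.T + Sigma_zz
theta_mmse = mu_theta + Sigma_tt @ H.T @ np.linalg.solve(S, x - H @ mu_theta)
print(theta_mmse)   # posterior mean, close to the drawn theta
print(theta)
```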
Exercise 3.7-1: Proof of the Theorem
3.7.2 Minimum Mean Absolute Error Estimation

The mean absolute error estimation is considered for the single parameter case only. To estimate the random parameter $\theta$ by means of the realization of the random vector $\mathbf{X}$ in the minimum mean absolute error sense,

$$R_{MAE}(\hat\theta) = \mathrm{E}\big(|\Theta-\hat\theta(\mathbf{X})|\big) = \int_{\mathbb{R}^{n+1}}|\theta-\hat\theta(\mathbf{x})|\,f_{\mathbf{X}\Theta}(\mathbf{x},\theta)\,d\mathbf{x}\,d\theta$$

has to be minimized. As shown in the subsequent exercise, the minimization of $R_{MAE}(\hat\theta)$ leads to

$$\int_{-\infty}^{\hat\theta_{MMAE}(\mathbf{x})} f_\Theta(\theta\,|\,\mathbf{x})\,d\theta = \int_{\hat\theta_{MMAE}(\mathbf{x})}^{\infty} f_\Theta(\theta\,|\,\mathbf{x})\,d\theta.$$

Hence, $\hat\theta_{MMAE}(\mathbf{x})$ is the median of the posterior density function $f_\Theta(\theta\,|\,\mathbf{x})$.
Exercise 3.7-2: Proof of median
3.7.3 Maximum A Posteriori Estimation

Another estimation technique that fits within the Bayesian framework is maximum a posteriori (MAP) estimation.

Single parameter case: To motivate the approach, the uniform loss function

$$L\big(\theta,\hat\theta(\mathbf{x})\big) = \begin{cases}0 & |\theta-\hat\theta(\mathbf{x})| \le \delta\\ 1 & |\theta-\hat\theta(\mathbf{x})| > \delta\end{cases} \qquad\text{where } \delta > 0,$$

is introduced. The Bayes risk is then given by

$$R_L(\hat\theta) = \mathrm{E}\big(L(\Theta,\hat\theta(\mathbf{X}))\big) = \int_{\mathbb{R}^{n+1}} L\big(\theta,\hat\theta(\mathbf{x})\big)\,f_{\mathbf{X}\Theta}(\mathbf{x},\theta)\,d\mathbf{x}\,d\theta = \int_{\mathbb{R}^n}\bigg(\int_{-\infty}^{\infty} L\big(\theta,\hat\theta(\mathbf{x})\big)\,f_\Theta(\theta\,|\,\mathbf{x})\,d\theta\bigg)\,f_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x}.$$
Since $f_{\mathbf{X}}(\mathbf{x})$ and the integral in the brackets are always positive, it is sufficient to minimize

$$\int_{-\infty}^{\infty} L\big(\theta,\hat\theta(\mathbf{x})\big)\,f_\Theta(\theta\,|\,\mathbf{x})\,d\theta = \int_{-\infty}^{\hat\theta(\mathbf{x})-\delta} f_\Theta(\theta\,|\,\mathbf{x})\,d\theta + \int_{\hat\theta(\mathbf{x})+\delta}^{\infty} f_\Theta(\theta\,|\,\mathbf{x})\,d\theta$$

or equivalently to maximize

$$1 - \bigg(\int_{-\infty}^{\hat\theta(\mathbf{x})-\delta} f_\Theta(\theta\,|\,\mathbf{x})\,d\theta + \int_{\hat\theta(\mathbf{x})+\delta}^{\infty} f_\Theta(\theta\,|\,\mathbf{x})\,d\theta\bigg) = \int_{\hat\theta(\mathbf{x})-\delta}^{\hat\theta(\mathbf{x})+\delta} f_\Theta(\theta\,|\,\mathbf{x})\,d\theta$$

for every $\mathbf{x}$. For $\delta$ arbitrarily small, $\hat\theta(\mathbf{x})$ determines the mode of $f_\Theta(\theta\,|\,\mathbf{x})$ and is termed the maximum a posteriori (MAP) estimate, i.e.

$$\hat\theta_{MAP}(\mathbf{x}) = \arg\max_\theta f_\Theta(\theta\,|\,\mathbf{x}) = \arg\max_\theta f_{\mathbf{X}}(\mathbf{x}\,|\,\theta)\,f_\Theta(\theta) = \arg\max_\theta\Big(\ln\big(f_{\mathbf{X}}(\mathbf{x}\,|\,\theta)\big) + \ln\big(f_\Theta(\theta)\big)\Big).$$
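A standard single-parameter illustration (well known, not from the slides): for $X_k\overset{\text{i.i.d.}}{\sim}\mathcal{N}(\theta,\sigma^2)$ with prior $\theta\sim\mathcal{N}(\mu_0,\sigma_0^2)$, maximizing $\ln\big(f_{\mathbf{X}}(\mathbf{x}\,|\,\theta)\big) + \ln\big(f_\Theta(\theta)\big)$ yields

$$\hat\theta_{MAP}(\mathbf{x}) = \frac{\sigma_0^2\sum_{k=1}^{n} x_k + \sigma^2\mu_0}{n\sigma_0^2 + \sigma^2},$$

which here coincides with the posterior mean, since the Gaussian posterior is symmetric and unimodal.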
Multiple parameter case: Employing the marginal posterior density functions

$$f_{\Theta_i}(\theta_i\,|\,\mathbf{x}) = \int_{\mathbb{R}^{p-1}} f_{\boldsymbol\Theta}(\boldsymbol\theta\,|\,\mathbf{x})\,d\theta_1\cdots d\theta_{i-1}\,d\theta_{i+1}\cdots d\theta_p,$$

the MAP estimates are given by

$$\hat\theta_{i,MAP}(\mathbf{x}) = \arg\max_{\theta_i} f_{\Theta_i}(\theta_i\,|\,\mathbf{x}), \quad i = 1,\dots,p.$$

However, due to the required integration we might propose the alternative vector-valued MAP estimate

$$\hat{\boldsymbol\theta}_{MAP}(\mathbf{x}) = \arg\max_{\boldsymbol\theta} f_{\boldsymbol\Theta}(\boldsymbol\theta\,|\,\mathbf{x}) = \arg\max_{\boldsymbol\theta} f_{\mathbf{X}}(\mathbf{x}\,|\,\boldsymbol\theta)\,f_{\boldsymbol\Theta}(\boldsymbol\theta).$$

Remark: Generally, the two MAP estimates are not equivalent.
Exercise 3.7-3: MMSE, MMAE and MAP estimation of the parameter of an exponential distribution, where the parameter itself also obeys an exponential distribution
Exercise 3.7-4: Signal + noise, MMSE signal amplitude estimation
References to Chapter 3

[1] M. Fisz, Probability Theory and Mathematical Statistics, Krieger Publishing Company, 1980.
[2] C.W. Helstrom, Elements of Signal Detection and Estimation, Prentice Hall, 1994.
[3] S. Kay, Fundamentals of Statistical Signal Processing, Vol. 1: Estimation Theory, Prentice Hall, 1993.
[4] E.L. Lehmann and G. Casella, Theory of Point Estimation, Springer, 2003.
[5] H.V. Poor, An Introduction to Signal Detection and Estimation, Springer, 1994.
[6] C.R. Rao, Linear Statistical Inference and Its Applications, John Wiley, 1973.
[7] L.L. Scharf, Statistical Signal Processing, Addison Wesley, 1990.