class notes

563
1/28 Review of Probability


Page 1: Class Notes

1/28

Review of Probability

Page 2: Class Notes

2/28

Random Variable

Definition: Numerical characterization of the outcome of a random event

Examples:
1) Number on a rolled die
2) Temperature at a specified time of day
3) Stock market value at close
4) Height of a wheel going over a rocky road

Page 3: Class Notes

3/28

Random Variable

Non-examples:
1) Heads or tails on a coin
2) Red or black ball from an urn

Basic Idea: we don't know how to completely determine what value will occur; we can only specify the probabilities of the RV values occurring.

But we can make these into RVs (e.g., assign heads → 1, tails → 0).

Page 4: Class Notes

4/28

Two Types of Random Variables

Random Variable

Discrete RV: die, stocks

Continuous RV: temperature, wheel height

Page 5: Class Notes

5/28

Given a continuous RV X, what is the probability that X = x₀?

Oddity: P(X = x₀) = 0. Otherwise the probability would sum to infinity.

Need to think of the Probability Density Function (PDF)

[Figure: the probability density function p_X(x) of RV X, with the area under the curve between x₀ and x₀ + Δ shaded.]

P( x₀ < X < x₀ + Δ ) = ∫_{x₀}^{x₀+Δ} p_X(x) dx = area shown

PDF for Continuous RV

Page 6: Class Notes

6/28

Most Commonly Used PDF: Gaussian

m & σ are parameters of the Gaussian pdf: m = mean of RV X, σ = standard deviation of RV X (note: σ > 0), σ² = variance of RV X.

p_X(x) = ( 1 / (√(2π) σ) ) exp( −(x − m)² / (2σ²) )

An RV X with this PDF is called a Gaussian RV.

Notation: when X has a Gaussian PDF we say X ~ N(m, σ²)

Page 7: Class Notes

7/28

" Generally: take the noise to be Zero Mean

22 2

21)( σ

πσx

x exp =

Zero-Mean Gaussian PDF

Page 8: Class Notes

8/28

[Figure: the Gaussian PDF p_X(x) for small σ and for large σ. Small σ ⇒ small variability (small uncertainty); large σ ⇒ large variability (large uncertainty). The area within ±1σ of the mean x = m is 0.683 = 68.3%.]

Effect of Variance on Gaussian PDF

Page 9: Class Notes

9/28

"Central Limit theorem (CLT)The sum of N independent RVs has a pdfthat tends to be Gaussian as N →∞

"So What! Here is what : Electronic systems generate internal noise due to random motion of electrons in electronic components. The noise is the result of summing the random effects of lots of electrons.

CLT applies Guassian Noise

Why Is Gaussian Used?
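To illustrate the CLT claim numerically, here is a minimal sketch (assuming NumPy is available; the number of terms and trial count are arbitrary illustrative choices) that sums N independent zero-mean uniform RVs and checks the sum's mean and standard deviation against the Gaussian values the CLT predicts.

```python
import numpy as np

# CLT sketch: sum N independent zero-mean uniform RVs and compare the
# empirical mean/std of the sum with the values predicted for the limiting Gaussian.
rng = np.random.default_rng(0)
N = 30                    # number of RVs summed
trials = 100_000          # number of independent sums
u = rng.uniform(-0.5, 0.5, size=(trials, N))   # each RV has variance 1/12
s = u.sum(axis=1)

print("empirical mean:", s.mean(), " (CLT predicts 0)")
print("empirical std :", s.std(),  " (CLT predicts", np.sqrt(N / 12.0), ")")
# A histogram of s would look very close to a Gaussian bell curve.
```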

Page 10: Class Notes

10/28

Describes probabilities of joint events concerning X and Y. For example, the probability that X lies in the interval [a, b] and Y lies in the interval [c, d] is given by:

Pr( a < X < b and c < Y < d ) = ∫_a^b ∫_c^d p_XY(x, y) dy dx

[Figure: the joint PDF p_XY(x, y). Graph from B. P. Lathi's book: Modern Digital & Analog Communication Systems.]

Joint PDF of RVs X and Y

Page 11: Class Notes

11/28

When you have two RVs you often ask: what is the PDF of Y if X is constrained to take on a specific value?

In other words: what is the PDF of Y conditioned on the fact that X is constrained to take on a specific value?

Ex.: Husband's salary X conditioned on wife's salary Y = $100K?

First find all wives who make EXACTLY $100K and look at how their husbands' salaries are distributed.

This depends on the joint PDF because there are two RVs, but it should only depend on the slice of the joint PDF at Y = $100K.

Now we have to adjust this to account for the fact that the joint PDF (even its slice) reflects how likely it is that Y = $100K will occur (e.g., if Y = $105K is unlikely then p_XY(x, 105) will be small); so if we divide by p_Y(105) we adjust for this.

Conditional PDF of Two RVs

Page 12: Class Notes

12/28

Conditional PDF (cont.)

Thus, the conditional PDFs are defined as (slice and normalize):

p_{Y|X}(y|x) = p_XY(x, y) / p_X(x)   if p_X(x) ≠ 0,  and 0 otherwise        (x is held fixed)

p_{X|Y}(x|y) = p_XY(x, y) / p_Y(y)   if p_Y(y) ≠ 0,  and 0 otherwise        (y is held fixed)

[Figure: the conditional PDF, obtained by slicing and normalizing the joint PDF. Graph from B. P. Lathi's book: Modern Digital & Analog Communication Systems.]

Page 13: Class Notes

13/28

Independence should be thought of as saying that:

neither RV impacts the other statistically; thus, the values that one will likely take should be irrelevant to the value that the other has taken.

In other words: conditioning doesn't change the PDF!!!

p_{Y|X}(y|x) = p_XY(x, y) / p_X(x) = p_Y(y)

p_{X|Y}(x|y) = p_XY(x, y) / p_Y(y) = p_X(x)

Independent RVs

Page 14: Class Notes

14/28

Independent and Dependent Gaussian PDFs

[Figure: contours of p_XY(x, y) for three cases: independent (zero mean), independent (non-zero mean), and dependent.]

If X & Y are independent, then the contour ellipses are aligned with either the x or y axis.

Dependent case: different slices give different normalized curves. Independent case: different slices give the same normalized curves.

Page 15: Class Notes

15/28

RVs X & Y are independent if:

p_XY(x, y) = p_X(x) p_Y(y)

Here's why:

p_{Y|X}(y|x) = p_XY(x, y) / p_X(x) = p_X(x) p_Y(y) / p_X(x) = p_Y(y)

An Independent RV Result

Page 16: Class Notes

16/28

Characterizing RVs: the PDF tells everything about an RV,

but sometimes it is more than we need (or more than we know)! So we make do with a few characteristics:

Mean of an RV (describes the centroid of the PDF)
Variance of an RV (describes the spread of the PDF)
Correlation of RVs (describes the tilt of the joint PDF)

Mean = Average = Expected Value

Symbolically: E{X}

Page 17: Class Notes

17/28

Motivate first w/ the Data Analysis View. Consider RV X = score on a test. Data: x₁, x₂, …, x_N

Possible values of RV X: V₀, V₁, V₂, …, V₁₀₀ (the scores 0, 1, 2, …, 100)

This is called the Data Analysis View, but it motivates the Data Modeling View.

N_i = # of scores of value V_i        N = total # of scores = Σ_i N_i

Test Average = (x₁ + x₂ + … + x_N)/N = (N₀V₀ + N₁V₁ + … + N₁₀₀V₁₀₀)/N = Σ_{i=0}^{100} V_i (N_i/N)

where N_i/N ≈ P(X = V_i)        (the left side is a statistic; the right side is a probability)

Motivating Idea of Mean of RV

Page 18: Class Notes

18/28

Theoretical View of Mean

" For Discrete random Variables :

Data Analysis View leads to Probability Theory:

" This Motivates form for Continuous RV:

Probability Density Function

∑=

=n

niXi xPxXE

1)(

∫∞

∞−

= dxxpxXE X )(

Probability Function

Notation: XXE =

Data Modeling

Shorthand Notation

Page 19: Class Notes

19/28

Aside: Probability vs. Statistics

Probability Theory: given a PDF model, describe how the data will likely behave. There is no DATA here!!! The PDF models how the data will likely behave:

E{X} = ∫_{−∞}^{∞} x p_X(x) dx        (x is a dummy variable; p_X is the PDF)

Statistics: given a set of data, determine how the data did behave. There is no PDF here!!! The statistic measures how the data did behave:

Avg = (1/N) Σ_{i=1}^{N} x_i        (data)

The two views are linked (≈) by the Law of Large Numbers.
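As a numerical companion to the probability-vs-statistics contrast above, the sketch below (an illustration only, assuming NumPy; the Gaussian parameters are arbitrary) evaluates E{X} = ∫ x p_X(x) dx by numerical integration and compares it with the sample average of simulated data, which the Law of Large Numbers says should agree for large N.

```python
import numpy as np

m, sigma = 2.0, 1.5                      # illustrative Gaussian parameters
x = np.linspace(m - 8 * sigma, m + 8 * sigma, 20001)
pdf = np.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Probability-theory view: E{X} = integral of x * p_X(x) dx (no data involved)
EX = np.trapz(x * pdf, x)

# Statistics view: average of actual data (no PDF involved)
rng = np.random.default_rng(1)
data = rng.normal(m, sigma, size=100_000)
avg = data.mean()

print("E{X} from PDF integral:", EX)     # ≈ 2.0
print("sample average of data:", avg)    # ≈ 2.0 by the Law of Large Numbers
```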

Page 20: Class Notes

20/28

Variance: characterizes how much you expect the RV to deviate around the mean.

There are similar Data vs. Theory views here, but let's go right to the theory!!

Variance:

σ_x² = E{ (X − m_x)² } = ∫ (x − m_x)² p_X(x) dx

Note: if zero mean,

σ² = E{X²} = ∫ x² p_X(x) dx

Variance of RV

Page 21: Class Notes

21/28

Motivating Idea of Correlation

Consider a random experiment that observes the outcomes of two RVs. Example: 2 RVs X and Y representing height and weight, respectively.

Motivate first w/ Data Analysis View.

[Figure: scatter plot of (x, y) data pairs, positively correlated.]

Page 22: Class Notes

22/28

Illustrating 3 Main Types of Correlation

Positive Correlation ("best friends"), Negative Correlation ("worst enemies"), and Zero Correlation, i.e. uncorrelated ("complete strangers").

Examples: GPA & starting salary (positive); student loans & parents' salary (negative); height & $ in pocket (zero).

[Figure: three scatter plots of (x − x̄) vs. (y − ȳ) illustrating the three cases.]

Data Analysis View:    C_xy = (1/N) Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ)
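Here is a minimal data-analysis-view sketch (assuming NumPy; the data is synthetic and the coefficients are made up) that computes the sample covariance C_xy from the formula above, plus the correlation coefficient defined a few slides later, for positively correlated data.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5000
x = rng.normal(0.0, 1.0, N)              # e.g., centered "height"
y = 0.8 * x + rng.normal(0.0, 0.6, N)    # positively correlated "weight"

# Data Analysis View: C_xy = (1/N) * sum_i (x_i - xbar)(y_i - ybar)
C_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho = C_xy / (x.std() * y.std())         # correlation coefficient, lies in [-1, 1]

print("sample covariance C_xy :", C_xy)  # ≈ 0.8 for this construction
print("correlation coefficient:", rho)
```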

Page 23: Class Notes

23/28

To capture this, define the Covariance:

σ_XY = E{ (X − X̄)(Y − Ȳ) } = ∫∫ (x − X̄)(y − Ȳ) p_XY(x, y) dx dy

If the RVs are both zero-mean:   σ_XY = E{XY}

If X = Y:   σ_XY = σ_X² = σ_Y²

If X & Y are independent, then:   σ_XY = 0

Prob. Theory View of Correlation

Page 24: Class Notes

24/28

If σ_XY = E{ (X − X̄)(Y − Ȳ) } = 0, then we say that X and Y are uncorrelated.

If σ_XY = 0, then E{XY} = X̄ Ȳ.  E{XY} is called the correlation of X & Y.

So RVs X and Y are said to be uncorrelated if σ_XY = 0, or equivalently if E{XY} = E{X}E{Y}.

Page 25: Class Notes

25/28

Independence vs. Uncorrelated

X & Y independent (the PDFs separate):   p_XY(x, y) = p_X(x) p_Y(y)

implies X & Y uncorrelated (the means separate):   E{XY} = E{X}E{Y}

Uncorrelated does NOT imply independence.

INDEPENDENCE IS A STRONGER CONDITION !!!!

Page 26: Class Notes

26/28

Covariance:   σ_XY = E{ (X − X̄)(Y − Ȳ) }

Correlation:   E{XY}        (same as the covariance if zero mean)

Correlation Coefficient:   ρ_XY = σ_XY / (σ_X σ_Y),    with  −1 ≤ ρ_XY ≤ 1

Confusing Covariance and Correlation Terminology

Page 27: Class Notes

27/28

Correlation Matrix:

R_x = E{ x xᵀ } =
  [ E{X₁X₁}   E{X₁X₂}   …   E{X₁X_N} ]
  [ E{X₂X₁}   E{X₂X₂}   …   E{X₂X_N} ]
  [    ⋮          ⋮        ⋱      ⋮    ]
  [ E{X_NX₁}  E{X_NX₂}  …   E{X_NX_N} ]

where  x = [ X₁  X₂  …  X_N ]ᵀ

Covariance Matrix:

C_x = E{ (x − x̄)(x − x̄)ᵀ }

Covariance and Correlation For Random Vectors
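The sketch below (NumPy assumed; the 3-element random vector is a made-up example) forms the correlation matrix R_x = E{x xᵀ} and the covariance matrix C_x = E{(x − x̄)(x − x̄)ᵀ} by approximating the expectations with sample averages.

```python
import numpy as np

rng = np.random.default_rng(3)
N_samples = 200_000
mean = np.array([1.0, -2.0, 0.5])
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, -0.3, 1.0]])
X = mean + rng.normal(size=(N_samples, 3)) @ A.T   # each row is a realization of x

Rx = (X.T @ X) / N_samples                 # correlation matrix E{x x^T}
Xc = X - X.mean(axis=0)
Cx = (Xc.T @ Xc) / N_samples               # covariance matrix E{(x - xbar)(x - xbar)^T}

print("R_x ≈\n", Rx)
print("C_x ≈\n", Cx)                       # close to A @ A.T
print("numpy check:\n", np.cov(X, rowvar=False, bias=True))
```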

Page 28: Class Notes

28/28

E{X + Y} = E{X} + E{Y}

var(X + Y) = E{ (X + Y)² } − ( E{X + Y} )²
           = E{X²} + 2E{XY} + E{Y²} − ( X̄² + 2X̄Ȳ + Ȳ² )
           = σ_X² + σ_Y² + 2σ_XY
           = σ_X² + σ_Y²        if X & Y are uncorrelated

E{ f(X) } = ∫ f(x) p_X(x) dx

E{aX} = a E{X}

var(aX) = a² σ_X²

A Few Properties of Expected Value

Page 29: Class Notes

1/45

Review of Matrices and Vectors

Page 30: Class Notes

2/45

Definition of Vector: A collection of complex or real numbers, generally put in a column

v = [ v₁  v₂  …  v_N ]ᵀ        (a column of N numbers; ᵀ denotes transpose)

Definition of Vector Addition: add element-by-element:

a + b = [ a₁ + b₁   a₂ + b₂   …   a_N + b_N ]ᵀ

Vectors & Vector Spaces

Page 31: Class Notes

3/45

Definition of Scalar: A real or complex number.

If the vectors of interest are complex valued then the set of scalars is taken to be complex numbers; if the vectors of interest are real valued then the set of scalars is taken to be real numbers.

Multiplying a Vector by a Scalar:

αa = [ αa₁   αa₂   …   αa_N ]ᵀ

This changes the vector's length if |α| ≠ 1

reverses its direction if α < 0

Page 32: Class Notes

4/45

Arithmetic Properties of Vectors: vector addition and scalar multiplication exhibit the following properties, pretty much like the real numbers do.

Let x, y, and z be vectors of the same dimension and let α and β be scalars; then the following properties hold:

1. Commutativity:   x + y = y + x,    αx = xα

2. Associativity:   (x + y) + z = x + (y + z),    α(βx) = (αβ)x

3. Distributivity:   α(x + y) = αx + αy,    (α + β)x = αx + βx

4. Scalar Unity & Scalar Zero:   1·x = x,    0·x = 0, where 0 is the zero vector of all zeros

Page 33: Class Notes

5/45

Definition of a Vector Space: A set V of N-dimensional vectors (with a corresponding set of scalars) such that the set of vectors is:

(i) closed under vector addition, and (ii) closed under scalar multiplication.

In other words: addition of vectors gives another vector in the set, and multiplying a vector by a scalar gives another vector in the set.

Note: this means that ANY linear combination of vectors in the space results in a vector in the space. If v₁, v₂, and v₃ are all vectors in a given vector space V, then

v = α₁v₁ + α₂v₂ + α₃v₃ = Σ_{i=1}^{3} α_i v_i

is also in the vector space V.

Page 34: Class Notes

6/45

Axioms of Vector Space: If V is a set of vectors satisfying the above definition of a vector space then it satisfies the following axioms:

1. Commutativity (see above)

2. Associativity (see above)

3. Distributivity (see above)

4. Unity and Zero Scalar (see above)

5. Existence of an Additive Identity any vector space V must have a zero vector

6. Existence of Negative Vector: For every vector v in V its negative must also be in V

So a vector space is nothing more than a set of vectors with an arithmetic structure

Page 35: Class Notes

7/45

Def. of Subspace: Given a vector space V, a subset of vectors in V that itself is closed under vector addition and scalar multiplication (using the same set of scalars) is called a subspace of V.

Examples:

1. The space R2 is a subspace of R3.

2. Any plane in R3 that passes through the origin is a subspace

3. Any line passing through the origin in R2 is a subspace of R2

4. The set R² is NOT a subspace of C² because R² isn't closed under complex scalars (a subspace must retain the original space's set of scalars)

Page 36: Class Notes

8/45

Length of a Vector (Vector Norm): For any vector v in CN

we define its length (or norm) to be

‖v‖₂ = sqrt( Σ_{i=1}^{N} |v_i|² ),     so that     ‖v‖₂² = Σ_{i=1}^{N} |v_i|²

Properties of Vector Norm:

‖αv‖₂ = |α| ‖v‖₂

‖αv₁ + βv₂‖₂ ≤ |α| ‖v₁‖₂ + |β| ‖v₂‖₂

‖v‖₂ < ∞  ∀ v ∈ Cᴺ

‖v‖₂ = 0  iff  v = 0

Geometric Structure of Vector Space

Page 37: Class Notes

9/45

Distance Between Vectors: the distance between two vectors in a vector space with the 2-norm is defined by:

d(v₁, v₂) = ‖v₁ − v₂‖₂

Note that: d(v₁, v₂) = 0 iff v₁ = v₂

[Figure: two vectors v₁ and v₂; the length of the difference vector v₁ − v₂ is the distance between them.]

Page 38: Class Notes

10/45

Angle Between Vectors & Inner Product:

Motivate the idea in R²: let v = [ 1  0 ]ᵀ and u = A [ cos θ   sin θ ]ᵀ (a vector of length A at angle θ from v).

Note that:   Σ_{i=1}^{2} u_i v_i = A cos θ · 1 + A sin θ · 0 = A cos θ

Clearly this gives a measure of the angle between the vectors.

Now we generalize this idea!

Page 39: Class Notes

11/45

Inner Product Between Vectors: define the inner product between two complex vectors in Cᴺ by:

⟨u, v⟩ = Σ_{i=1}^{N} u_i v_i*

Properties of Inner Products:

1. Impact of Scalar Multiplication:   ⟨αu, v⟩ = α⟨u, v⟩,    ⟨u, βv⟩ = β*⟨u, v⟩

2. Impact of Vector Addition:   ⟨u + w, v⟩ = ⟨u, v⟩ + ⟨w, v⟩,    ⟨u, v + z⟩ = ⟨u, v⟩ + ⟨u, z⟩

3. Linking Inner Product to Norm:   ‖v‖₂² = ⟨v, v⟩

4. Schwarz Inequality:   |⟨u, v⟩| ≤ ‖u‖₂ ‖v‖₂

5. Inner Product and Angle:   ⟨u, v⟩ = ‖u‖₂ ‖v‖₂ cos(θ)        (look back at the previous page!)

Page 40: Class Notes

12/45

Inner Product, Angle, and Orthogonality :

cos(θ) = ⟨u, v⟩ / ( ‖u‖₂ ‖v‖₂ )

(i) This lies between −1 and +1;

(ii) It measures the directional alikeness of u and v:

= +1 when u and v point in the same direction

= 0 when u and v are at a right angle

= −1 when u and v point in opposite directions

Two vectors u and v are said to be orthogonal when <u,v> = 0

If in addition, they each have unit length they are orthonormal
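A small sketch of the angle/orthogonality computations above (NumPy assumed; the vectors are arbitrary examples): the normalized inner product gives cos θ, and an inner product of zero flags orthogonality.

```python
import numpy as np

u = np.array([3.0, 1.0])
v = np.array([1.0, 2.0])
w = np.array([-1.0, 3.0])                     # chosen so that <u, w> = 0

def cos_angle(a, b):
    # <a, b> / (||a|| ||b||); for complex vectors use np.vdot for the conjugation
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("cos(angle between u and v) :", cos_angle(u, v))
print("cos(angle between u and w) :", cos_angle(u, w))    #  0.0 -> orthogonal
print("cos(angle between u and -u):", cos_angle(u, -u))   # -1.0 -> opposite directions
```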

Page 41: Class Notes

13/45

Can we find a set of prototype vectors v₁, v₂, …, v_M from which we can build all other vectors in some given vector space V by using linear combinations of the v_i?

Same ingredients, just different amounts of them!!!

v = Σ_{k=1}^{M} α_k v_k        u = Σ_{k=1}^{M} β_k v_k

What we want to be able to do is get any vector just by changing the amounts. To do this requires that the set of prototype vectors v₁, v₂, …, v_M satisfy certain conditions.

We'd also like to have the smallest number of members in the set of prototype vectors.

Building Vectors From Other Vectors

Page 42: Class Notes

14/45

Span of a Set of Vectors: A set of vectors v₁, v₂, …, v_M is said to span the vector space V if it is possible to write each vector v in V as a linear combination of vectors from the set:

v = Σ_{k=1}^{M} α_k v_k

This property establishes if there are enough vectors in the proposed prototype set to build all possible vectors in V.

It is clear that:

1. We need at least N vectors to span Cᴺ or Rᴺ, but not just any N vectors.

2. Any set of N mutually orthogonal vectors spans Cᴺ or Rᴺ (a set of vectors is mutually orthogonal if all pairs are orthogonal).

[Figure: examples in R² of a pair of vectors that spans R² and a pair that does not span R².]

Page 43: Class Notes

15/45

Linear Independence: A set of vectors v₁, v₂, …, v_M is said to be linearly independent if none of the vectors in it can be written as a linear combination of the others.

If a set of vectors is linearly dependent then there is redundancy in the set; it has more vectors than needed to be a prototype set!

For example, say that we have a set of four vectors v₁, v₂, v₃, v₄ and let's say that we know that we can build v₂ from v₁ and v₃; then every vector we can build from v₁, v₂, v₃, v₄ can also be built from only v₁, v₃, v₄.

It is clear that:

1. In Cᴺ or Rᴺ we can have no more than N linearly independent vectors.

2. Any set of mutually orthogonal vectors is linearly independent (a set of vectors is mutually orthogonal if all pairs are orthogonal).

[Figure: examples in R² of a linearly independent pair and a pair that is not linearly independent.]

Page 44: Class Notes

16/45

Basis of a Vector Space: A basis of a vector space is a set of linearly independent vectors that span the space.

Span says there are enough vectors to build everything.

Linear independence says that there are not more than needed.

Orthonormal (ON) Basis: a basis whose vectors are orthonormal to each other (all pairs of basis vectors are orthogonal and each basis vector has unit norm).

Fact: Any set of N linearly independent vectors in Cᴺ (Rᴺ) is a basis of Cᴺ (Rᴺ).

Dimension of a Vector Space: The number of vectors in any basis for a vector space is said to be the dimension of the space. Thus, Cᴺ and Rᴺ each have dimension N.

Page 45: Class Notes

17/45

Fact: For a given basis v₁, v₂, …, v_N, the expansion of a vector v in V is unique. That is, for each v there is only one, unique set of coefficients α₁, α₂, …, α_N such that

v = Σ_{k=1}^{N} α_k v_k

In other words, this expansion or decomposition is unique. Thus, for a given basis we can make a 1-to-1 correspondence between the vector v and the coefficients α₁, α₂, …, α_N.

We can write the coefficients as a vector, too:   α = [ α₁  …  α_N ]ᵀ

v = [ v₁  …  v_N ]ᵀ    ⟷ (1-to-1) ⟷    α = [ α₁  …  α_N ]ᵀ

Expansion can be viewed as a mapping (or transformation) from vector v to vector α.

We can view this transform as taking us from the original vector space into a new vector space made from the coefficient vectors of all the original vectors.

Expansion and Transformation

Page 46: Class Notes

18/45

Fact: For any given vector space there are an infinite number of possible basis sets.

The coefficients with respect to any of them provide complete information about a vector;

some of them provide more insight into the vector and are therefore more useful for certain signal processing tasks than others.

Often the key to solving a signal processing problem lies in finding the correct basis to use for expanding; this is equivalent to finding the right transform. See the discussion coming next linking the DFT to these ideas!!!!

Page 47: Class Notes

19/45

DFT from Basis Viewpoint:

If we have a discrete-time signal x[n] for n = 0, 1, …, N−1,

define the vector:

x = [ x[0]   x[1]   …   x[N−1] ]ᵀ

and define an orthogonal basis from the complex exponentials used in the IDFT:

d_k = [ 1   e^{j2πk·1/N}   e^{j2πk·2/N}   …   e^{j2πk(N−1)/N} ]ᵀ,     k = 0, 1, …, N−1

(so d₀ = [ 1  1  …  1 ]ᵀ, and d_k is the complex sinusoid at frequency k/N).

Then the IDFT equation can be viewed as an expansion of the signal vector x in terms of this complex sinusoid basis:

x = Σ_{k=0}^{N−1} (X[k]/N) d_k,     with kth coefficient  α_k = X[k]/N

coefficient vector:   α = [ X[0]/N   X[1]/N   …   X[N−1]/N ]ᵀ
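To check the basis viewpoint of the IDFT numerically, the sketch below (NumPy assumed; the signal is arbitrary) builds the vectors d_k, forms x = Σ_k (X[k]/N) d_k using the DFT computed by np.fft.fft, and confirms that the expansion reproduces the original signal.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 8
x = rng.normal(size=N)                            # arbitrary real signal x[0..N-1]

n = np.arange(N)
D = np.exp(1j * 2 * np.pi * np.outer(n, n) / N)   # columns are the basis vectors d_k
X = np.fft.fft(x)                                 # DFT coefficients X[k]

# IDFT as a basis expansion: x = sum_k (X[k]/N) * d_k
x_rebuilt = D @ (X / N)

print(np.allclose(x_rebuilt, x))                  # True: the expansion reproduces x
# The kth coefficient alpha_k = X[k]/N can also be obtained as <x, d_k> / N.
```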

Page 48: Class Notes

20/45

What's So Good About an ON Basis?: Given any basis v₁, v₂, …, v_N we can write any v in V as

v = Σ_{k=1}^{N} α_k v_k

Given the vector v, how do we find the α's? In general: hard! But for an ON basis: easy!!

If v₁, v₂, …, v_N is an ON basis then

⟨v, v_i⟩ = ⟨ Σ_{j=1}^{N} α_j v_j , v_i ⟩ = Σ_{j=1}^{N} α_j ⟨v_j, v_i⟩ = Σ_{j=1}^{N} α_j δ[i − j] = α_i

α_i = ⟨v, v_i⟩        (ith coefficient = inner product with ith ON basis vector)

Usefulness of an ON Basis

Page 49: Class Notes

21/45

Another Good Thing About an ON Basis: it preserves inner products and norms (it is isometric):

If v₁, v₂, …, v_N is an ON basis and u and v are vectors expanded as

v = Σ_{k=1}^{N} α_k v_k        u = Σ_{k=1}^{N} β_k v_k

Then….

1. ⟨v, u⟩ = ⟨α, β⟩        (preserves inner products)

2. ‖v‖₂ = ‖α‖₂ and ‖u‖₂ = ‖β‖₂        (preserves norms)

So using an ON basis provides: easy computation via inner products, and preservation of geometry (closeness, size, orientation, etc.).

Page 50: Class Notes

22/45

Example: DFT Coefficients as Inner Products:

Recall: the N-pt. IDFT is an expansion of the signal vector in terms of N orthogonal vectors. Thus

X[k] = ⟨x, d_k⟩ = Σ_{n=0}^{N−1} x[n] d_k*[n] = Σ_{n=0}^{N−1} x[n] e^{−j2πkn/N}

See reading notes for some details about normalization issues in this case

Page 51: Class Notes

23/45

Matrix: Is an array of (real or complex) numbers organized in rows and columns.

Here is a 3x4 example:

A =
  [ a₁₁  a₁₂  a₁₃  a₁₄ ]
  [ a₂₁  a₂₂  a₂₃  a₂₄ ]
  [ a₃₁  a₃₂  a₃₃  a₃₄ ]

We'll sometimes view a matrix as being built from its columns; the 3×4 example above could be written as:

A = [ a₁ | a₂ | a₃ | a₄ ],     where  a_k = [ a_{1k}  a_{2k}  a_{3k} ]ᵀ

We'll take two views of a matrix:

1. Storage for a bunch of related numbers (e.g., Cov. Matrix)

2. A transform (or mapping, or operator) acting on a vector (e.g., DFT, observation matrix, etc., as we'll see)

Matrices

Page 52: Class Notes

24/45

Matrix as Transform: Our main view of matrices will be as operators that transform one vector into another vector.

Consider the 3x4 example matrix above. We could use that matrix to transform the 4-dimensional vector v into a 3-dimensional vector u:

u = Av = [ a₁ | a₂ | a₃ | a₄ ] [ v₁  v₂  v₃  v₄ ]ᵀ = v₁a₁ + v₂a₂ + v₃a₃ + v₄a₄

Clearly u is built from the columns of matrix A; therefore, it must lie in the span of the set of vectors that make up the columns of A.

Note that the columns of A are 3-dimensional vectors, so u is too.

Page 53: Class Notes

25/45

Transforming a Vector Space: If we apply A to all the vectors in a vector space V we get a collection of vectors that are in a new space called U.

In the 3x4 example matrix above we transformed a 4-dimensional vector space V into a 3-dimensional vector space U

A 2x3 real matrix A would transform R3 into R2 :

Facts: If the mapping matrix A is square and its columns are linearly independent then

(i) the space that vectors in V get mapped to (i.e., U) has the same dimension as V (due to square part)

(ii) this mapping is reversible (i.e., invertible); there is an inverse matrix A-1 such that v = A-1u (due to square & LI part)


Page 54: Class Notes

26/45

[Figure: the matrix A maps x₁ → y₁ and x₂ → y₂, and A⁻¹ maps back.]

Transform = Matrix × Vector: a VERY useful viewpoint for all sorts of signal processing scenarios. In general we can view many linear transforms (e.g., DFT, etc.) in terms of some invertible matrix A operating on a signal vector x to give another vector y:

y_i = A x_i        x_i = A⁻¹ y_i

We can think of A and A-1 as mapping back and forth

between two vector spaces

Page 55: Class Notes

27/45

Basis Matrix & Coefficient Vector:

Suppose we have a basis v₁, v₂, …, v_N for a vector space V. Then a vector v in the space V can be written as:

v = Σ_{k=1}^{N} α_k v_k

Another view of this:

v = [ v₁ | v₂ | … | v_N ] [ α₁  α₂  …  α_N ]ᵀ = Vα

where V = [ v₁ | v₂ | … | v_N ] is the N×N basis matrix.

The Basis Matrix V transforms the coefficient vector into the original vector v

Matrix View & Basis View

Page 56: Class Notes

28/45

Three Views of Basis Matrix & Coefficient Vector:

View #1: Vector v is a linear combination of the columns of the basis matrix V:   v = Σ_{k=1}^{N} α_k v_k

View #2: Matrix V maps vector α into vector v:   v = Vα

View #3: There is a matrix, V⁻¹, that maps vector v into vector α:   α = V⁻¹v

Aside: If a matrix A is square and has linearly independent columns, then A is invertible and A-1 exists such that A A-1 = A-1A = I where I is the identity matrix having 1s on the diagonal and zeros elsewhere.

Now we have a way to go back and forth between the vector v and its coefficient vector α.

Page 57: Class Notes

29/45

Basis Matrix for ON Basis: We get a special structure!!!

Result: For an ON basis matrix V:   V⁻¹ = Vᴴ

(the superscript H denotes hermitian transpose, which consists of transposing the matrix and conjugating the elements)

To see this:

VᴴV is the matrix whose (i, j) element is the inner product of v_i and v_j. Because this is an ON basis, those inner products are 1 when i = j and 0 otherwise, so

VᴴV = I

Page 58: Class Notes

30/45

A unitary matrix is a complex matrix A whose inverse is A-1 = AH

For the real-valued matrix case we get a special case of unitary: the idea of a unitary matrix becomes an orthogonal matrix, for which A⁻¹ = Aᵀ.

Two Properties of Unitary Matrices: Let U be a unitary matrix and let y₁ = Ux₁ and y₂ = Ux₂.

1. They preserve norms: ‖y_i‖ = ‖x_i‖.

2. They preserve inner products: ⟨y₁, y₂⟩ = ⟨x₁, x₂⟩

That is, the geometry of the old space is preserved by the unitary matrix as it transforms into the new space.

(These are the same as the preservation properties of ON basis.)

Unitary and Orthogonal Matrices

Page 59: Class Notes

31/45

DFT from Unitary Matrix Viewpoint:

Consider a discrete-time signal x[n] for n = 0, 1, …, N−1.

We've already seen the DFT in a basis viewpoint:

x = Σ_{k=0}^{N−1} (X[k]/N) d_k

Now we can view the DFT as a transform from the unitary matrix viewpoint. Collect the basis vectors into the N×N matrix

D = [ d₀ | d₁ | … | d_{N−1} ],     [D]_{nk} = e^{j2πnk/N}

Then

DFT:   X̃ = Dᴴ x            IDFT:   x = (1/N) D X̃

(Actually D is not unitary, but N^{−1/2} D is unitary; see reading notes.)
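The following sketch (NumPy assumed; N is arbitrary) forms the matrix D above, verifies that N^{−1/2} D is unitary, checks that Dᴴx matches np.fft.fft, and confirms the norm-preservation (Parseval-type) property of the unitary version.

```python
import numpy as np

N = 16
n = np.arange(N)
D = np.exp(1j * 2 * np.pi * np.outer(n, n) / N)    # [D]_{nk} = exp(j 2 pi n k / N)
U = D / np.sqrt(N)                                  # N^{-1/2} D should be unitary

print("U^H U = I ?        ", np.allclose(U.conj().T @ U, np.eye(N)))

rng = np.random.default_rng(5)
x = rng.normal(size=N) + 1j * rng.normal(size=N)
X_tilde = D.conj().T @ x                            # DFT:  X~ = D^H x
print("matches np.fft.fft ?", np.allclose(X_tilde, np.fft.fft(x)))
print("IDFT recovers x ?   ", np.allclose(D @ X_tilde / N, x))

# Norm preservation holds for the unitary version U, not for D itself:
print("||U^H x|| == ||x|| ?", np.allclose(np.linalg.norm(U.conj().T @ x), np.linalg.norm(x)))
```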

Page 60: Class Notes

32/45

Geometry Preservation of Unitary Matrix Mappings

Recall unitary matrices map in such a way that the sizes of vectors and the orientation between vectors is not changed.

[Figure: A and A⁻¹ map x₁ ↔ y₁ and x₂ ↔ y₂; the unitary mapping rigidly rotates the space, leaving lengths and angles unchanged.]

Unitary mappings just rigidly rotate the space.

Page 61: Class Notes

33/45

[Figure: A and A⁻¹ map x₁ ↔ y₁ and x₂ ↔ y₂; a non-unitary mapping stretches the space, so lengths and the angles between vectors change.]

Effect of Non-Unitary Matrix Mappings

Page 62: Class Notes

34/45

More on Matrices as Transforms

y = A x        (y is m×1, A is m×n, x is n×1)

We'll limit ourselves here to real-valued vectors and matrices.

A maps any vector x in Rⁿ into some vector y in Rᵐ.

Range(A): the range space of A = the set of all vectors in Rᵐ that can be reached by the mapping. The vector y is a weighted sum of the columns of A ⇒ we may only be able to reach certain y's.

Mostly interested in two cases:   1. Tall matrix (m > n)    2. Square matrix (m = n)

Page 63: Class Notes

35/45

Range of a Tall Matrix (m > n)

The range(A) ⊂ Rᵐ.

Proof: Since y is built from the n columns of A, there are not enough columns to form a basis for Rᵐ (they don't span Rᵐ).

Range of a Square Matrix (m = n)

If the columns of A are linearly independent, the range(A) = Rᵐ, because the columns form a basis for Rᵐ.

Otherwise, the range(A) ⊂ Rᵐ, because the columns don't span Rᵐ.

Page 64: Class Notes

36/45

Rank of a Matrix: rank(A) = largest # of linearly independent columns (or rows) of matrix A

For an m×n matrix we have that rank(A) ≤ min(m,n)

An m×n matrix A has full rank when rank(A) = min(m,n)

Example: This matrix has rank 3 because the 4th column can be written as a combination of the first 3 columns:

A =
  [ 1  0  0  1 ]
  [ 0  1  0  2 ]
  [ 0  0  1  1 ]
  [ 0  0  0  0 ]
  [ 0  0  0  0 ]

(column 4 = column 1 + 2·column 2 + column 3)
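A quick numerical check of the rank example (NumPy assumed), using the matrix as reconstructed above:

```python
import numpy as np

A = np.array([[1, 0, 0, 1],
              [0, 1, 0, 2],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)

print("rank(A) =", np.linalg.matrix_rank(A))                    # 3
# Column 4 is a combination of the first three columns:
print(np.allclose(A[:, 3], A[:, 0] + 2 * A[:, 1] + A[:, 2]))    # True
```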

Page 65: Class Notes

37/45

Characterizing Tall Matrix Mappings: the Tall Matrix (m > n) Case,   y = Ax

We are interested in answering: given a vector y, what vector x mapped into it via matrix A?

If y does not lie in range(A), then there is no solution.

If y lies in range(A), then there is a solution (but not necessarily just one unique solution):
  A full rank ⇒ one solution
  A not full rank ⇒ many solutions

Page 66: Class Notes

38/45

Full-Rank Tall Matrix (m > n) Case:   y = Ax

[Figure: A maps x in Rⁿ to a vector y lying in range(A), a subset of Rᵐ.]

For a given y∈range(A)there is only one x that maps to it.

This is because the columns of A are linearly independent, and we know from our studies of vector spaces that the coefficient vector of y is unique; x is that coefficient vector.

By looking at y we can determine which x gave rise to it

Page 67: Class Notes

39/45

Non-Full-Rank Tall Matrix (m > n) Case:   y = Ax

[Figure: A maps two different vectors x₁ and x₂ in Rⁿ to the same y in range(A) ⊂ Rᵐ.]

For a given y ∈ range(A) there is more than one x that maps to it.

This is because the columns of A are linearly dependent, and that redundancy provides several ways to combine them to create y.

By looking at y we cannot determine which x gave rise to it.

Page 68: Class Notes

40/45

Characterizing Square Matrix Mappings:   y = Ax

Q: Given any y ∈ Rⁿ can we find an x ∈ Rⁿ that maps to it?

A: Not always!!!

  A full rank ⇒ one solution
  A not full rank and y ∈ range(A) ⇒ many solutions
  A not full rank and y ∉ range(A) ⇒ no solution

Careful!!! This is quite a different flow diagram than in the tall-matrix case!!!

When a square A is full rank, its range covers the complete new space; then y must be in range(A), and because the columns of A are a basis there is a way to build y.

Page 69: Class Notes

41/45

A Full-Rank Square Matrix is Invertible

A square matrix that has full rank is said to be nonsingular, or invertible. Then we can find the x that mapped to y using x = A⁻¹y.

Several ways to check if an n×n matrix A is invertible:
1. A is invertible if and only if (iff) its columns (or rows) are linearly independent (i.e., if it is full rank)
2. A is invertible iff det(A) ≠ 0
3. A is invertible if (but not only if) it is positive definite (see later)
4. A is invertible if (but not only if) all its eigenvalues are nonzero

[Diagram: the set of positive definite matrices is contained within the set of invertible matrices.]

Page 70: Class Notes

42/45

Eigenvalues and Eigenvectors of Square Matrices

If matrix A is n×n, then A maps Rⁿ → Rⁿ.

Q: For a given n×n matrix A, which vectors get mapped into being almost themselves???

More precisely: which vectors get mapped to a scalar multiple of themselves???

Even more precisely: which vectors v satisfy the following:

Av = λv

These vectors are special and are called the eigenvectors of A. The scalar λ is that e-vector's corresponding eigenvalue.

Page 71: Class Notes

43/45

Eigen-Facts for Symmetric Matrices

If an n×n real matrix A is symmetric, then: e-vectors corresponding to distinct e-values are orthogonal (and can be chosen orthonormal), the e-values are real valued, and we can decompose A as

A = V Λ Vᵀ,     where  V = [ v₁  v₂  …  v_n ],   VᵀV = I,   Λ = diag(λ₁, λ₂, …, λ_n)

If, further, A is pos. def. (semi-def.), then: the e-values are positive (non-negative), and rank(A) = # of non-zero e-values.

Pos. Def. ⇒ Full Rank (and therefore invertible)
Pos. Semi-Def. (but not Pos. Def.) ⇒ Not Full Rank (and therefore not invertible)

When A is P.D., then we can write

A⁻¹ = V Λ⁻¹ Vᵀ,     Λ⁻¹ = diag(1/λ₁, 1/λ₂, …, 1/λ_n)

For P.D. A, A⁻¹ has the same e-vectors and reciprocal e-values.
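This sketch (NumPy assumed; the matrix is an arbitrary positive definite example) verifies the eigen-facts above: A = VΛVᵀ with orthonormal V, all eigenvalues positive, and A⁻¹ = VΛ⁻¹Vᵀ.

```python
import numpy as np

# An arbitrary symmetric positive definite matrix (B^T B plus a small diagonal load).
rng = np.random.default_rng(6)
B = rng.normal(size=(4, 4))
A = B.T @ B + 0.5 * np.eye(4)

lam, V = np.linalg.eigh(A)            # real eigenvalues and orthonormal eigenvectors
Lam = np.diag(lam)

print("A = V Lam V^T ?     ", np.allclose(A, V @ Lam @ V.T))
print("V^T V = I ?         ", np.allclose(V.T @ V, np.eye(4)))
print("all e-values > 0 ?  ", bool(np.all(lam > 0)))             # positive definite
print("A^-1 = V Lam^-1 V^T?", np.allclose(np.linalg.inv(A), V @ np.diag(1.0 / lam) @ V.T))
```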

Page 72: Class Notes

44/45

We'll limit our discussion to real-valued matrices and vectors.

Quadratic Forms and Positive-(Semi)Definite Matrices

Quadratic Form = Matrix form for a 2nd-order multivariate polynomial

Example:

The quadratic form of matrix A is

Q_A(x) = xᵀ A x        (a scalar; in this 2-D example, (1×2)·(2×2)·(2×1) = 1×1)

with A = [ a₁₁  a₁₂ ; a₂₁  a₂₂ ] fixed and x = [ x₁  x₂ ]ᵀ variable, so that

Q_A(x₁, x₂) = a₁₁x₁² + a₂₂x₂² + (a₁₂ + a₂₁)x₁x₂ = Σ_i Σ_j a_ij x_i x_j

Other Matrix Issues

Page 73: Class Notes

45/45

Values of the elements of matrix A determine the characteristics of the quadratic form Q_A(x):
  If Q_A(x) ≥ 0 ∀x ≠ 0, then we say that Q_A(x) is positive semi-definite
  If Q_A(x) > 0 ∀x ≠ 0, then we say that Q_A(x) is positive definite
  Otherwise we say that Q_A(x) is non-definite

These terms carry over to the matrix that defines the quadratic form:
  If Q_A(x) ≥ 0 ∀x ≠ 0, then we say that A is positive semi-definite
  If Q_A(x) > 0 ∀x ≠ 0, then we say that A is positive definite
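A minimal sketch of the definiteness tests (NumPy assumed; the two matrices are made-up examples): evaluate the quadratic form Q_A(x) = xᵀAx for random x, and check definiteness through the eigenvalues of the symmetric matrix A.

```python
import numpy as np

def quad_form(A, x):
    # Q_A(x) = x^T A x, a scalar
    return float(x @ A @ x)

A_pd  = np.array([[2.0, 0.5], [0.5, 1.0]])    # positive definite example
A_ind = np.array([[1.0, 0.0], [0.0, -1.0]])   # non-definite example

rng = np.random.default_rng(7)
xs = rng.normal(size=(1000, 2))
print("Q_A > 0 for all sampled x (PD case)  :", all(quad_form(A_pd, x) > 0 for x in xs))
print("Q_A changes sign (non-definite case) :",
      any(quad_form(A_ind, x) > 0 for x in xs) and any(quad_form(A_ind, x) < 0 for x in xs))

# Equivalent eigenvalue tests for symmetric A:
print("min eig of A_pd :", np.linalg.eigvalsh(A_pd).min())    # > 0 -> positive definite
print("min eig of A_ind:", np.linalg.eigvalsh(A_ind).min())   # < 0 -> non-definite
```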

Page 74: Class Notes

1/15

Ch. 1 Introduction to Estimation

Page 75: Class Notes

2/15

An Example Estimation Problem: DSB Receiver

[Figure: block diagram of a DSB receiver. The message spectrum M(f) is modulated up to carrier frequency f_o, giving the transmitted spectrum S(f). The received signal passes through a BPF & amp whose electronics add noise w(t) (usually "white"); an estimation algorithm produces f̂_o and φ̂_o; an oscillator at f̂_o, φ̂_o demodulates; and an audio amp outputs the estimated message spectrum M̂(f).]

s(t; f_o, φ_o) = m(t) cos(2π f_o t + φ_o)

x(t) = s(t) + w(t)

Goal: Given   x(t) = s(t; f_o, φ_o) + w(t),   find estimates f̂_o and φ̂_o (that are optimal in some sense).

Describe with a probability model: PDF & correlation.

Page 76: Class Notes

3/15

Discrete-Time Estimation Problem

These days, we almost always work with samples of the observed signal (signal plus noise):

x[n] = s[n; f_o, φ_o] + w[n]

Our Thought Model: Each time you "observe" x[n] it contains the same s[n] but a different "realization" of the noise w[n], so the estimate is different each time. Thus f̂_o and φ̂_o are RVs.

Our Job: Given the finite data set x[0], x[1], …, x[N−1], find estimator functions that map the data into estimates:

f̂_o = g₁( x[0], x[1], …, x[N−1] ) = g₁(x)

φ̂_o = g₂( x[0], x[1], …, x[N−1] ) = g₂(x)

These are RVs… Need to describe w/ probability model

Page 77: Class Notes

4/15

PDF of Estimate

Because estimates are RVs we describe them with a PDF…

It will depend on: 1. the structure of s[n]  2. the probability model of w[n]  3. the form of the est. function g(x)

[Figure: the PDF p(f̂_o) of the estimate, centered near f_o; the mean measures the centroid, and the std. dev. & variance measure the spread.]

Desire:   E{f̂_o} = f_o     and     σ²_f̂_o = E{ (f̂_o − E{f̂_o})² }  small

Page 78: Class Notes

5/15

1.2 Mathematical Estimation ProblemGeneral Mathematical Statement of Estimation Problem:

For… Measured Data x = [ x[0] x[1] … x[N-1] ]

Unknown Parameter θ = [θ1 θ2 … θp ]

θ is Not Random x is an N-dimensional random data vector

Q: What captures all the statistical information needed for an estimation problem ?

A: Need the N-dimensional PDF of the data, parameterized by θ

p(x; θ)

In practice, we are not given the PDF!!! We choose a suitable model:
• Captures the Essence of Reality
• Leads to a Tractable Answer

We'll use p(x; θ) to find θ̂ = g(x).

Page 79: Class Notes

6/15

Ex. Estimating a DC Level in Zero Mean AWGN

Consider a single observed data point:   x[0] = θ + w[0],   where w[0] is Gaussian, zero mean, with variance σ², so x[0] ~ N(θ, σ²).

So… the needed parameterized PDF is p(x[0]; θ), which is Gaussian with a mean of θ.

So… in this case the parameterization changes the data PDF mean:

[Figure: three Gaussian PDFs p(x[0]; θ₁), p(x[0]; θ₂), p(x[0]; θ₃), identical in shape but centered at θ₁, θ₂, θ₃.]

Page 80: Class Notes

7/15

Ex. Modeling Data with Linear TrendSee Fig. 1.6 in Text

Looking at the figure we see what looks like a linear trend perturbed by some noise…

So the engineer proposes signal and noise models:

Signal Model: Linear Trend.     Noise Model: AWGN w/ zero mean.

x[n] = s[n; A, B] + w[n] = A + Bn + w[n]

AWGN = "Additive White Gaussian Noise". "White" = x[n] and x[m] are uncorrelated for n ≠ m:

E{ (w − w̄)(w − w̄)ᵀ } = σ² I

Page 81: Class Notes

8/15

Typical Assumptions for the Noise Model

• White and Gaussian is always easiest to analyze
  – Usually assumed unless you have reason to believe otherwise
  – Whiteness is usually the first assumption removed
  – Gaussian is less often removed due to the validity of the Central Limit Thm

• Zero Mean is a nearly universal assumption
  – Most practical cases have zero mean
  – But if not…  w[n] = w_zm[n] + µ: the non-zero mean µ gets grouped into the signal model, leaving the zero-mean noise w_zm[n]

• The variance of the noise doesn't always have to be known to make an estimate
  – BUT, it must be known to assess the expected "goodness" of the estimate
  – Usually we perform the "goodness" analysis as a function of the noise variance (or SNR = Signal-to-Noise Ratio)
  – The noise variance sets the SNR level of the problem

Page 82: Class Notes

9/15

Classical vs. Bayesian Estimation Approaches

If we view θ (the parameter to estimate) as Non-Random → Classical Estimation. Provides no way to include a priori information about θ.

If we view θ (the parameter to estimate) as Random → Bayesian Estimation. Allows use of some a priori PDF on θ.

The first part of the course: Classical Methods• Minimum Variance, Maximum Likelihood, Least Squares

Last part of the course: Bayesian Methods• MMSE, MAP, Wiener filter, Kalman Filter

Page 83: Class Notes

10/15

1.3 Assessing Estimator PerformanceCan only do this when the value of θ is known:

• Theoretical Analysis, Simulations, Field Tests, etc.

Recall that the estimate θ̂ = g(x) is a random variable.

Thus it has a PDF of its own… and that PDF completely displays the quality of the estimate.

Illustrate with the 1-D parameter case:

[Figure: the PDF p(θ̂) of the estimate, centered near the true θ.]

Often we just capture quality through the mean and variance of θ̂ = g(x).

Desire:   m_θ̂ = E{θ̂} = θ        (if this is true, we say the estimate is unbiased)

          σ²_θ̂ = E{ (θ̂ − E{θ̂})² }   small

Page 84: Class Notes

11/15

Equivalent View of Assessing Performance

Define the estimation error:   e = θ̂ − θ = g(x) − θ        (θ̂ and e are RVs; θ is not an RV)

Completely describe estimator quality with the error PDF p(e):

[Figure: the error PDF p(e), centered near zero.]

Desire:   m_e = E{e} = 0        (if this is true, we say the estimate is unbiased)

          σ²_e = E{ (e − E{e})² }   small

Page 85: Class Notes

12/15

Example: DC Level in AWGN

Model:   x[n] = A + w[n],   n = 0, 1, …, N−1,   where w[n] is Gaussian, zero mean, variance σ², and white (uncorrelated sample-to-sample).

PDF of an individual data sample:

p(x[i]) = ( 1/√(2πσ²) ) exp( −(x[i] − A)² / (2σ²) )

Uncorrelated Gaussian RVs are independent… so the joint PDF is the product of the individual PDFs:

p(x) = Π_{n=0}^{N−1} ( 1/√(2πσ²) ) exp( −(x[n] − A)² / (2σ²) )
     = ( 1 / (2πσ²)^{N/2} ) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² )

(property: a product of exp's gives a sum inside the exp)

Page 86: Class Notes

13/15

Each data sample has the same mean (A), which is the thing we are trying to estimate… so, we can imagine trying to estimate A by finding the sample mean of the data:

Â = (1/N) Σ_{n=0}^{N−1} x[n]        (a statistic; we analyze it with probability theory)

Let's analyze the quality of this estimator…

• Is it unbiased?

E{Â} = E{ (1/N) Σ_{n=0}^{N−1} x[n] } = (1/N) Σ_{n=0}^{N−1} E{x[n]} = (1/N) Σ_{n=0}^{N−1} A = A

⇒ E{Â} = A.   Yes! Unbiased!

• Can we get a small variance?

var(Â) = var( (1/N) Σ_{n=0}^{N−1} x[n] ) = (1/N²) Σ_{n=0}^{N−1} var(x[n])        (due to independence: white & Gauss. ⇒ indep.)
       = (1/N²) · N σ²

⇒ var(Â) = σ²/N.   We can make the variance small by increasing N!!!
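The following Monte Carlo sketch (NumPy assumed; A, σ², and N are arbitrary illustrative values) checks the two results above: the sample mean is unbiased, and its variance is approximately σ²/N.

```python
import numpy as np

A, sigma2, N = 3.0, 2.0, 50          # true DC level, noise variance, record length
trials = 200_000
rng = np.random.default_rng(8)

x = A + np.sqrt(sigma2) * rng.normal(size=(trials, N))   # each row: x[0..N-1]
A_hat = x.mean(axis=1)                                   # sample-mean estimate per trial

print("mean of A_hat:", A_hat.mean(), " (should be ≈ A =", A, ")")
print("var of A_hat :", A_hat.var(),  " (theory: sigma^2/N =", sigma2 / N, ")")
```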

Page 87: Class Notes

14/15

Theoretical Analysis vs. Simulations

• Ideally we'd like to always be able to theoretically analyze the problem to find the bias and variance of the estimator
  – Theoretical results show how performance depends on the problem specifications

• But sometimes we make use of simulations
  – to verify that our theoretical analysis is correct
  – sometimes we can't find theoretical results

Page 88: Class Notes

15/15

Course Goal = Find Optimal Estimators

• There are several different definitions or criteria for optimality!
• Most Logical: Minimum MSE (Mean-Square-Error)
  – See Sect. 2.4
  – To see this result:

mse(θ̂) = E{ (θ̂ − θ)² }
        = E{ [ (θ̂ − E{θ̂}) + (E{θ̂} − θ) ]² }
        = var(θ̂) + 2 E{ θ̂ − E{θ̂} } ( E{θ̂} − θ ) + ( E{θ̂} − θ )²        (the middle term is 0)
        = var(θ̂) + b²(θ)

where   b(θ) = E{θ̂} − θ   is the bias.

Although MSE makes sense as a criterion, estimators that minimize it usually rely on the unknown θ itself.

Page 89: Class Notes

Chapter 2

Minimum Variance Unbiased Estimators

Page 90: Class Notes

MVU

Ch. 2: Minimum Variance Unbiased Est.

Basic Idea of MVU: Out of all unbiased estimates, find the one with the lowest variance

(This avoids the realizability problem of MSE)

2.3 Unbiased Estimators

An estimator is unbiased if   E{θ̂} = θ   for all θ.

Page 91: Class Notes

Example: Estimate DC in White Uniform Noise

x[n] = A + w[n],    n = 0, 1, …, N−1

Unbiased Estimator:

Â = (1/N) Σ_{n=0}^{N−1} x[n]

Same as before:   E{Â} = A regardless of the value of A.

Page 92: Class Notes

Biased Estimator:

Ǎ = (1/N) Σ_{n=0}^{N−1} x̌[n]

Note: if A ≥ 1 then x̌[n] = x[n]  ⇒  E{Ǎ} = A;   if A < 1 then E{Ǎ} ≠ A

⇒  Bias = 0 if A ≥ 1,    Bias ≠ 0 if A < 1

Biased Est.

Page 93: Class Notes

MVUE = Minimum Variance Unbiased Estimator

2.4 Minimum Variance Criterion

mse(θ̂) = var(θ̂) + b²(θ)        (note: the bias term b²(θ) = 0 for MVU)

So, MVU could also be called "Minimum MSE Unbiased Est." (Recall the problem with the MMSE criterion.)

Constrain the bias to be zero, then find the estimator that minimizes the variance.

Page 94: Class Notes

2.5 Existence of the MVU Estimator

Sometimes there is no MVUE… this can happen 2 ways:

1. There may be no unbiased estimators
2. None of the unbiased estimators has a uniformly minimum variance

Ex. of #2: Assume there are only 3 unbiased estimators for a problem, θ̂_i = g_i(x), i = 1, 2, 3. Two possible cases:

[Figure: plots of var(θ̂_i) vs. θ for the three estimators. In one case a single estimator has the lowest variance for every θ, so ∃ an MVU; in the other case no single estimator is lowest for all θ, so ∄ an MVU.]

Page 95: Class Notes

2.6 Finding the MVU Estimator

Even if the MVU estimator exists, we may not be able to find it!! There is no known "turn the crank" method.

Three Approaches to Finding the MVUE:

1. Determine the Cramer-Rao Lower Bound (CRLB) and see if some estimator satisfies it (Ch. 3 & 4). (Note: the MVU can exist but not achieve the CRLB.)

2. Apply the Rao-Blackwell-Lehmann-Scheffe Theorem. Rare in practice… we'll skip Ch. 5.

3. Restrict to linear unbiased estimators & find the MVLU (Ch. 6). This only gives the true MVU if the problem is linear.

Page 96: Class Notes

2.7 Vector Parameter

When we wish to estimate multiple parameters we group them into a vector:   θ = [ θ₁  θ₂  …  θ_p ]ᵀ

Then an estimator is notated as:   θ̂ = [ θ̂₁  θ̂₂  …  θ̂_p ]ᵀ

The unbiased requirement becomes:   E{θ̂} = θ

The minimum variance requirement becomes: for each i,   var(θ̂_i) = min over all unbiased estimates.

Page 97: Class Notes

Chapter 3 Cramer-Rao Lower Bound

Page 98: Class Notes

Abbreviated: CRLB or sometimes just CRB

The CRLB is a lower bound on the variance of any unbiased estimator:

If θ̂ is an unbiased estimator of θ, then   σ²_θ̂ ≥ CRLB(θ)   ⇒   σ_θ̂ ≥ √CRLB(θ)

The CRLB tells us the best we can ever expect to be able to do (w/ an unbiased estimator).

What is the Cramer-Rao Lower Bound

Page 99: Class Notes

1. Feasibility studies (e.g., sensor usefulness, etc.): can we meet our specifications?

2. Judgment of proposed estimators: estimators that don't achieve the CRLB are looked down upon in the technical literature.

3. Can sometimes provide the form for the MVU est.

4. Demonstrates the importance of physical and/or signal parameters to the estimation problem. E.g., we'll see that a signal's BW determines delay est. accuracy ⇒ radars should use wide-BW signals.

Some Uses of the CRLB

Page 100: Class Notes

3.3 Estimation Accuracy Considerations

Q: What determines how well you can estimate θ?

Recall: the data vector x consists of samples from a random process that depends on θ ⇒ the PDF describes that dependence: p(x; θ).

Clearly, if p(x; θ) depends strongly/weakly on θ, we should be able to estimate θ well/poorly.

See surface plots vs. x & θ for 2 cases: 1. strong dependence on θ  2. weak dependence on θ

⇒ Should look at p(x;θ ) as a function of θ for fixed value of observed data x

Page 101: Class Notes

Surface Plot Examples of p(x;θ )

Page 102: Class Notes

Ex. 3.1: PDF Dependence for DC Level in Noise

x[0] = A + w[0],    w[0] ~ N(0, σ²)

Then the parameter-dependent PDF of the data point x[0] is:

p(x[0]; A) = ( 1/√(2πσ²) ) exp( −(x[0] − A)² / (2σ²) )

[Figure: the PDF as a surface over (x[0], A). Say we observe x[0] = 3; then slice the surface at x[0] = 3 to get p(x[0] = 3; A) as a function of A.]

Page 103: Class Notes

The LF = the PDF p(x;θ )

but as a function of parameter θ w/ the data vector x fixed

Define: Likelihood Function (LF)

We will also often need the Log Likelihood Function (LLF):

LLF = lnLF = ln p(x;θ )

Page 104: Class Notes

LF Characteristics that Affect Accuracy

Intuitively: the sharpness of the LF sets the accuracy. But how??? Sharpness is measured using curvature:

− ∂²ln p(x; θ) / ∂θ²    evaluated at the true value of θ, given the data

Curvature ↑ ⇒ PDF concentration ↑ ⇒ Accuracy ↑

But this is for a particular set of data; we want it in general. So: average over the random data vector to give the average curvature (the expected sharpness of the LF):

− E{ ∂²ln p(x; θ) / ∂θ² }    evaluated at the true value of θ        (E is w.r.t. p(x; θ))

Page 105: Class Notes

3.4 Cramer-Rao Lower Bound

Theorem 3.1: CRLB for a Scalar Parameter

Assume the regularity condition is met:

E{ ∂ln p(x; θ) / ∂θ } = 0   ∀θ        (E is w.r.t. p(x; θ))

Then

σ²_θ̂ ≥ 1 / ( − E{ ∂²ln p(x; θ) / ∂θ² } |_{θ = true value} )

where

E{ ∂²ln p(x; θ) / ∂θ² } = ∫ ( ∂²ln p(x; θ) / ∂θ² ) p(x; θ) dx

The right-hand side is the CRLB.

Page 106: Class Notes

1. Write the log-likelihood function as a function of θ: ln p(x; θ)

2. Fix x and take the 2nd partial of the LLF: ∂²ln p(x; θ)/∂θ²

3. If the result still depends on x: fix θ and take the expected value w.r.t. x. Otherwise skip this step.

4. The result may still depend on θ: evaluate at each specific value of θ desired.

5. Negate and form the reciprocal.

Steps to Find the CRLB

Page 107: Class Notes

Example 3.3: CRLB for DC in AWGN

x[n] = A + w[n],   n = 0, 1, …, N−1,   w[n] ~ N(0, σ²) & white

Need the likelihood function. Due to whiteness the joint PDF is a product, and the product of exp's gives a sum inside the exp:

p(x; A) = Π_{n=0}^{N−1} ( 1/√(2πσ²) ) exp( −(x[n] − A)² / (2σ²) )
        = ( 1 / (2πσ²)^{N/2} ) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² )

Page 108: Class Notes

Now take ln to get the LLF:

ln p(x; A) = −ln[ (2πσ²)^{N/2} ] − (1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²

(the first term does not depend on A, so its partial w.r.t. A is 0)

Now take the first partial w.r.t. A:

∂ln p(x; A)/∂A = (1/σ²) Σ_{n=0}^{N−1} (x[n] − A) = (N/σ²)( x̄ − A )        (!)        (x̄ is the sample mean)

Now take the partial again:

∂²ln p(x; A)/∂A² = −N/σ²

This doesn't depend on x, so we don't need to take E{·}.

Page 109: Class Notes

Since the result doesn't depend on x or A, all we do is negate and form the reciprocal to get the CRLB:

CRLB = 1 / ( − E{ ∂²ln p(x; θ)/∂θ² } |_{θ = true value} ) = σ²/N

var(Â) ≥ σ²/N

• Doesn't depend on A
• Increases linearly with σ²
• Decreases inversely with N

[Figure: CRLB vs. σ² (linear growth, for fixed N); CRLB vs. A (flat, for fixed N & σ²); CRLB vs. N (for fixed σ²: doubling the data halves the CRLB!).]

Page 110: Class Notes

Continuation of Theorem 3.1 on the CRLB

There exists an unbiased estimator that attains the CRLB iff:

∂ln p(x; θ)/∂θ = I(θ) [ g(x) − θ ]        (!)

for some functions I(θ) and g(x). Furthermore, the estimator that achieves the CRLB is then given by:

θ̂ = g(x),     with   var(θ̂) = 1/I(θ) = CRLB

Since no unbiased estimator can do better, this is the MVU estimate!!

This gives a possible way to find the MVU: compute ∂ln p(x; θ)/∂θ (we need to anyway) and check whether it can be put in the form (!). If so, then g(x) is the MVU estimator.

Page 111: Class Notes

Revisit Example 3.3 to Find the MVU Estimate

For DC level in AWGN we found in (!) that:

∂ln p(x; A)/∂A = (N/σ²)( x̄ − A )

This has the form I(A)[ g(x) − A ] with

g(x) = x̄ = (1/N) Σ_{n=0}^{N−1} x[n] = Â      and      I(A) = N/σ²   ⇒   var(Â) = σ²/N = CRLB

So for the DC level in AWGN: the sample mean is the MVUE!!

Page 112: Class Notes

Definition: Efficient Estimator

An estimator that is:

unbiased and

attains the CRLB

is said to be an Efficient Estimator

Notes:

Not all estimators are efficient (see next example: Phase Est.)

Not even all MVU estimators are efficient

So there are times when our 1st-partial test won't work!!!!

Page 113: Class Notes

Example 3.4: CRLB for Phase Estimation

This is related to the DSB carrier estimation problem we used for motivation in the notes for Ch. 1, except here we have a pure sinusoid and we only wish to estimate its phase.

Signal Model:   x[n] = A cos(2π f_o n + φ) + w[n] = s[n; φ] + w[n],    with w[n] AWGN w/ zero mean & variance σ²

Assumptions:
1. 0 < f_o < ½   (f_o is in cycles/sample)
2. A and f_o are known (we'll remove this assumption later)

Signal-to-Noise Ratio:   Signal Power = A²/2,   Noise Power = σ²,   so   SNR = A²/(2σ²)

Page 114: Class Notes

Problem: Find the CRLB for estimating the phase.

We need the PDF (exploit whiteness and the exponential form):

p(x; φ) = ( 1 / (2πσ²)^{N/2} ) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} [ x[n] − A cos(2π f_o n + φ) ]² )

Now taking the log gets rid of the exponential; then taking the partial derivative gives (see book for details):

∂ln p(x; φ)/∂φ = −(A/σ²) Σ_{n=0}^{N−1} [ x[n] sin(2π f_o n + φ) − (A/2) sin(4π f_o n + 2φ) ]

Taking the partial derivative again:

∂²ln p(x; φ)/∂φ² = −(A/σ²) Σ_{n=0}^{N−1} [ x[n] cos(2π f_o n + φ) − A cos(4π f_o n + 2φ) ]

This still depends on the random vector x, so we need E{·}.

Page 115: Class Notes

Taking the expected value:

− E{ ∂²ln p(x; φ)/∂φ² } = (A/σ²) Σ_{n=0}^{N−1} [ E{x[n]} cos(2π f_o n + φ) − A cos(4π f_o n + 2φ) ]

E{x[n]} = A cos(2π f_o n + φ). Plug that in, get a cos² term, use a trig identity, and get:

− E{ ∂²ln p(x; φ)/∂φ² } = (A²/(2σ²)) Σ_{n=0}^{N−1} [ 1 − cos(4π f_o n + 2φ) ]  ≈  N A²/(2σ²) = N × SNR

(since Σ_{n=0}^{N−1} cos(4π f_o n + 2φ) << N if f_o is not near 0 or ½)

Page 116: Class Notes

Now invert to get the CRLB:

var(φ̂) ≥ 1 / ( N × SNR )

[Figure: CRLB vs. N for fixed SNR (doubling the data halves the CRLB!) and CRLB vs. SNR (non-dB) for fixed N (doubling the SNR halves the CRLB; the CRLB halves for every 3 dB increase in SNR).]
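To see how good the ≈ N·SNR approximation is, the sketch below (NumPy assumed; the parameter values are arbitrary) evaluates the exact Fisher-information sum from the derivation above and compares the resulting CRLB with 1/(N·SNR).

```python
import numpy as np

A, sigma2, N = 1.0, 0.5, 100
f0, phi = 0.13, 0.4                      # f0 in cycles/sample, away from 0 and 1/2
SNR = A**2 / (2 * sigma2)

n = np.arange(N)
# Exact Fisher information from the derivation:
#   I(phi) = (A^2 / (2 sigma^2)) * sum_n [1 - cos(4 pi f0 n + 2 phi)]
I_exact = (A**2 / (2 * sigma2)) * np.sum(1 - np.cos(4 * np.pi * f0 * n + 2 * phi))

print("exact CRLB       :", 1 / I_exact)
print("approx 1/(N*SNR) :", 1 / (N * SNR))
```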

Page 117: Class Notes

Does an efficient estimator exist for this problem? The CRLB theorem says there is one only if

∂ln p(x; θ)/∂θ = I(θ) [ g(x) − θ ]

Our earlier result was:

∂ln p(x; φ)/∂φ = −(A/σ²) Σ_{n=0}^{N−1} [ x[n] sin(2π f_o n + φ) − (A/2) sin(4π f_o n + 2φ) ]

which cannot be put in that form ⇒ an efficient estimator does NOT exist!!!

We'll see later, though, an estimator for which var(φ̂) → CRLB as N → ∞ or as SNR → ∞. Such an estimator is called an asymptotically efficient estimator. (We'll see such a phase estimator in Ch. 7 on MLE.)

[Figure: var(φ̂) of such an estimator approaching the CRLB as N grows.]

Page 118: Class Notes

1

Alternate Form for the CRLB

var(θ̂) ≥ 1 / E{ ( ∂ln p(x; θ)/∂θ )² }

See Appendix 3A for the derivation. Sometimes it is easier to find the CRLB this way.

This also gives a new viewpoint of the CRLB, from Gardner's paper (IEEE Trans. on Info. Theory, July 1979), posted on BB. Consider the normalized version of this form of the CRLB:

var(θ̂)/θ² ≥ 1 / ( θ² E{ ( ∂ln p(x; θ)/∂θ )² } )

We'll derive this in a way that will re-interpret the CRLB.

Page 119: Class Notes

2

Consider the incremental sensitivity of p(x; θ) to changes in θ:

If θ → θ + Δθ, then it causes p(x; θ) → p(x; θ + Δθ). How sensitive is p(x; θ) to that change??

S̃_p(x; θ) = (% change in p(x; θ)) / (% change in θ) = [ Δp(x; θ) / p(x; θ) ] / [ Δθ / θ ]

Now let Δθ → 0:

S_p(x; θ) = lim_{Δθ→0} S̃_p(x; θ) = ( θ / p(x; θ) ) ∂p(x; θ)/∂θ = θ ∂ln p(x; θ)/∂θ

(Recall from calculus:   ∂ln f(x)/∂x = (1/f(x)) ∂f(x)/∂x )

So the normalized CRLB can be written as

var(θ̂)/θ² ≥ 1 / ( θ² E{ ( ∂ln p(x; θ)/∂θ )² } ) = 1 / E{ [ S_p(x; θ) ]² }

Interpretation:   Normalized CRLB = inverse mean-square sensitivity

Page 120: Class Notes

3

Definition of Fisher Information

The denominator in the CRLB is called the Fisher Information I(θ):

I(θ) = − E{ ∂²ln p(x; θ)/∂θ² }

It is a measure of the expected "goodness" of the data for the purpose of making an estimate.

It has the properties needed for an information measure (as does Shannon information):

1. I(θ) ≥ 0   (easy to see using the alternate form of the CRLB)

2. I(θ) is additive for independent observations, which follows from:   ln p(x; θ) = Σ_n ln p(x[n]; θ)

If each per-sample information I_n(θ) is the same, then I(θ) = N × I_n(θ).

Page 121: Class Notes

4

3.5 CRLB for Signals in AWGN

When our data is signal + AWGN we get a simple form for the CRLB.

Signal Model:   x[n] = s[n; θ] + w[n],   n = 0, 1, …, N−1,   with w[n] white, Gaussian, zero mean.

Q: What is the CRLB?

First write the likelihood function:

p(x; θ) = ( 1 / (2πσ²)^{N/2} ) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} ( x[n] − s[n; θ] )² )

Differentiate the log LF twice to get:

∂²ln p(x; θ)/∂θ² = (1/σ²) Σ_{n=0}^{N−1} { ( x[n] − s[n; θ] ) ∂²s[n; θ]/∂θ² − ( ∂s[n; θ]/∂θ )² }

This depends on the random x[n], so we must take E{·}.

Page 122: Class Notes

5

E{ ∂²ln p(x; θ)/∂θ² } = (1/σ²) Σ_{n=0}^{N−1} { E{ x[n] − s[n; θ] } ∂²s[n; θ]/∂θ² − ( ∂s[n; θ]/∂θ )² }
                      = −(1/σ²) Σ_{n=0}^{N−1} ( ∂s[n; θ]/∂θ )²        (since E{ x[n] − s[n; θ] } = 0)

Then using this we get the CRLB for a signal in AWGN:

var(θ̂) ≥ σ² / Σ_{n=0}^{N−1} ( ∂s[n; θ]/∂θ )²

Note: ( ∂s[n; θ]/∂θ )² tells how sensitive the signal is to the parameter. If the signal is very sensitive to a parameter change, then the CRLB is small and we can get a very accurate estimate!

Page 123: Class Notes

6

Ex. 3.5: CRLB for the Frequency of a Sinusoid

Signal Model:   x[n] = A cos(2π f_o n + φ) + w[n],    0 < f_o < ½,    n = 0, 1, …, N−1

Applying the signal-in-AWGN result with ∂s[n; f_o]/∂f_o = −2πnA sin(2π f_o n + φ) gives

var(f̂_o) ≥ 1 / ( 2 · SNR · Σ_{n=0}^{N−1} [ 2πn sin(2π f_o n + φ) ]² )

[Figure (note: there is an error in the book's version of this plot): the bound on the variance, in (cycles/sample)², and the bound on the std. dev., in cycles/sample, each plotted vs. f_o from 0 to 0.5 cycles/sample.]

The signal is less sensitive if f_o is near 0 or ½.
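The general bound var(θ̂) ≥ σ² / Σ (∂s[n;θ]/∂θ)² is easy to evaluate numerically. The sketch below (NumPy assumed; the parameter values, including φ = 0, are illustrative choices) does so for the sinusoid-frequency example using a central-difference approximation of ∂s[n; f_o]/∂f_o.

```python
import numpy as np

A, sigma2, N, phi = 1.0, 0.1, 64, 0.0
n = np.arange(N)

def s(f0):
    return A * np.cos(2 * np.pi * f0 * n + phi)

def crlb_freq(f0, df=1e-6):
    ds_df = (s(f0 + df) - s(f0 - df)) / (2 * df)     # numerical d s[n;f0] / d f0
    return sigma2 / np.sum(ds_df ** 2)               # CRLB = sigma^2 / sum (ds/dtheta)^2

for f0 in [0.0005, 0.1, 0.25, 0.4, 0.4995]:
    print(f"f0 = {f0:6.4f}   CRLB = {crlb_freq(f0):.3e}")
# With phi = 0 the bound is noticeably larger at frequencies near 0 and 1/2,
# where the signal is least sensitive to changes in f0.
```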

Page 124: Class Notes

7

3.6 Transformation of Parameters

Say there is a parameter θ with known CRLB_θ. But imagine that we instead are interested in estimating some other parameter α that is a function of θ:  α = g(θ).

Q: What is CRLB_α?

var(α̂) ≥ CRLB_α = ( ∂g(θ)/∂θ )² CRLB_θ        (proved in Appendix 3B)

The factor ( ∂g(θ)/∂θ )² captures the sensitivity of α to θ. A large ∂g/∂θ means a small error in θ gives a larger error in α, which increases the CRLB (i.e., worsens accuracy).

Page 125: Class Notes

8

Example: Speed of a Vehicle From Elapsed Time

[Figure: two laser sensors a known distance D apart; a start laser and a stop laser measure the elapsed time T as the vehicle passes.]

Measure the elapsed time T; the possible accuracy is set by CRLB_T. But we really want to measure the speed V = D/T. Find CRLB_V:

CRLB_V = ( ∂(D/T)/∂T )² × CRLB_T = ( D/T² )² × CRLB_T = (D²/T⁴) × CRLB_T = (V⁴/D²) × CRLB_T

Accuracy bound:

σ_V ≥ (V²/D) √CRLB_T     (m/s)

Less accurate at high speeds (quadratic); more accurate over large distances.

Page 126: Class Notes

9

Effect of Transformation on Efficiency

Suppose you have an efficient estimator θ̂ of θ, but you are really interested in estimating α = g(θ), and you plan to use α̂ = g(θ̂).

Q: Is this an efficient estimator of α???

A: Theorem: If g(θ) has the affine form g(θ) = aθ + b, then α̂ = g(θ̂) is efficient.

Proof: First,

var(α̂) = var(aθ̂ + b) = a² var(θ̂) = a² CRLB_θ        (the last step because θ̂ is efficient)

Now, what is CRLB_α? Using the transformation result:

CRLB_α = ( ∂(aθ + b)/∂θ )² CRLB_θ = a² CRLB_θ

⇒ var(α̂) = CRLB_α   ⇒   Efficient!

Page 127: Class Notes

10

Asymptotic Efficiency Under TransformationIf the mapping α = g(θ ) is not affine this result does NOT hold

But if the number of data samples used is large, then the estimator is approximately efficient (Asymptotically Efficient)

[Figure: the nonlinear mapping α = g(θ) together with the PDF of θ̂. Small-N case: the PDF is widely spread over the nonlinear mapping. Large-N case: the PDF is concentrated onto a locally linearized section of the mapping.]

Page 128: Class Notes

1

3.7 CRLB for the Vector Parameter Case

Vector parameter:   θ = [ θ₁  θ₂  …  θ_p ]ᵀ        Its estimate:   θ̂ = [ θ̂₁  θ̂₂  …  θ̂_p ]ᵀ

Assume that the estimate is unbiased:   E{θ̂} = θ

For a scalar parameter we looked at its variance, but for a vector parameter we look at its covariance matrix:

C_θ̂ = E{ (θ̂ − θ)(θ̂ − θ)ᵀ }

For example, for θ = [ x  y  z ]ᵀ:

C_θ̂ =
  [ var(x)      cov(x, y)   cov(x, z) ]
  [ cov(y, x)   var(y)      cov(y, z) ]
  [ cov(z, x)   cov(z, y)   var(z)    ]

Page 129: Class Notes

2

Fisher Information Matrix

For the vector parameter case, the Fisher info becomes the Fisher Information Matrix (FIM) I(θ), whose (m, n)th element is given by:

[I(θ)]_{mn} = − E{ ∂²ln p(x; θ) / ∂θ_m ∂θ_n },    m, n = 1, 2, …, p

evaluated at the true value of θ.

Page 130: Class Notes

3

The CRLB Matrix

Then, under the same kind of regularity conditions, the CRLB matrix is the inverse of the FIM:

CRLB = I⁻¹(θ)

So what this means is:

σ²_θ̂_n = [ C_θ̂ ]_{nn} ≥ [ I⁻¹(θ) ]_{nn}        (!)

The diagonal elements of the inverse FIM bound the parameter variances, which are the diagonal elements of the parameter covariance matrix. For example, for θ = [ x  y  z ]ᵀ:

  [ var(x)      cov(x, y)   cov(x, z) ]                    [ b₁₁  b₁₂  b₁₃ ]
  [ cov(y, x)   var(y)      cov(y, z) ]     and  I⁻¹(θ) =  [ b₂₁  b₂₂  b₂₃ ]
  [ cov(z, x)   cov(z, y)   var(z)    ]                    [ b₃₁  b₃₂  b₃₃ ]

(!)

4

More General Form of The CRLB Matrix

More General Form of The CRLB Matrix

C_θ̂ − I⁻¹(θ)   is positive semi-definite

The mathematical notation for this is:

C_θ̂ − I⁻¹(θ) ≥ 0        (!!)

Note: property #5 about p.d. matrices on p. 573 states that (!!) ⇒ (!).

Page 132: Class Notes

5

CRLB Off-Diagonal Elements InsightLet θ = [xe ye]T represent the 2-D x-y location of a transmitter (emitter) to be estimated.

Consider the two cases of scatter plots for the estimated location:

[Figure: two scatter plots of (x̂_e, ŷ_e). Both cases have the same σ_x̂ₑ and σ_ŷₑ, but in one the error cloud is aligned with the axes while in the other it is tilted.]

Each case has the same variances, but the location accuracy characteristics are very different. ⇒ This is the effect of the off-diagonal elements of the covariance.

Should consider effect of off-diagonal CRLB elements!!!

Not In Book

Page 133: Class Notes

6

CRLB Matrix and Error Ellipsoids        (Not in book)

Assume θ̂ = [ x̂_e  ŷ_e ]ᵀ is 2-D Gaussian w/ zero mean (only for convenience) and covariance matrix C_θ̂. Then its PDF is given by:

p(θ̂) = ( 1 / ( (2π)^{N/2} |C_θ̂|^{1/2} ) ) exp( −½ θ̂ᵀ C_θ̂⁻¹ θ̂ )

(the exponent is a quadratic form; recall it is scalar-valued)

So the equi-height contours of this PDF are given by the values of θ̂ such that:

θ̂ᵀ A θ̂ = k   (some constant),     where for ease we let  A = C_θ̂⁻¹

Note: A is symmetric, so a₁₂ = a₂₁, because any covariance matrix is symmetric and the inverse of a symmetric matrix is symmetric.

Page 134: Class Notes

7

What does this look like?

a₁₁ x_e² + 2 a₁₂ x_e y_e + a₂₂ y_e² = k

An ellipse!!! (Look it up in your calculus book!!!)

Recall: If a₁₂ = 0, then the ellipse is aligned w/ the axes & a₁₁ and a₂₂ control the size of the ellipse along the axes.

Note: a₁₂ = 0 ⇒

C_θ̂⁻¹ = [ a₁₁  0 ; 0  a₂₂ ]   ⇒   C_θ̂ = [ 1/a₁₁  0 ; 0  1/a₂₂ ]   ⇒   x̂_e & ŷ_e are uncorrelated

Note: a₁₂ ≠ 0 ⇒ x̂_e & ŷ_e are correlated

where   C_θ̂ = [ σ²_x̂ₑ   σ_x̂ₑŷₑ ;  σ_x̂ₑŷₑ   σ²_ŷₑ ]

Page 135: Class Notes

8

Error Ellipsoids and Correlation        (Not in book)

[Figure: error ellipses in the (x̂_e, ŷ_e) plane. If x̂_e & ŷ_e are uncorrelated, the ellipse axes are aligned with the coordinate axes, with extents set by ~σ²_x̂ₑ and ~σ²_ŷₑ. If x̂_e & ŷ_e are correlated, the ellipse is tilted.]

Choosing the k value: for the 2-D case

k = −2 ln(1 − P_e)

where P_e is the probability that the estimate will lie inside the ellipse. See the posted paper by Torrieri.

Page 136: Class Notes

9

Ellipsoids and Eigen-Structure        (Not in book)

Consider a symmetric matrix A & its quadratic form xᵀAx.

Ellipsoid:   xᵀAx = k,   or   ⟨x, Ax⟩ = k

The principal axes of the ellipse are orthogonal to each other and are orthogonal to the tangent line on the ellipse.

[Figure: an ellipse in the (x₁, x₂) plane with its two principal axes drawn.]

Theorem: The principal axes of the ellipsoid xᵀAx = k are eigenvectors of matrix A.

Page 137: Class Notes

10

Proof: From multi-dimensional calculus, the gradient of a scalar-valued function φ(x₁, …, x_n) is orthogonal to the surface φ = constant:

grad φ = ∇φ(x) = ∂φ(x)/∂x = [ ∂φ/∂x₁   …   ∂φ/∂x_n ]ᵀ        (different notations for the same thing)

[Figure: level curves of φ in the (x₁, x₂) plane with the gradient vector perpendicular to them.]

See the handout posted on Blackboard on Gradients and Derivatives.

Page 138: Class Notes

11

For our quadratic form function we have:

φ(x) = xᵀAx = Σ_i Σ_j a_ij x_i x_j   ⇒   ∂φ/∂x_k = Σ_i Σ_j a_ij ∂(x_i x_j)/∂x_k        (♣)

Product rule:

∂(x_i x_j)/∂x_k = x_i δ_jk + x_j δ_ik,    where δ_ik = 1 if i = k and 0 if i ≠ k        (♣♣)

Using (♣♣) in (♣) gives:

∂φ/∂x_k = Σ_i a_ik x_i + Σ_j a_kj x_j = 2 Σ_j a_kj x_j        (by symmetry: a_ik = a_ki)

And from this we get:

∇( xᵀAx ) = 2Ax

Page 139: Class Notes

12

Since the gradient is ⊥ to the ellipse, this says Ax is ⊥ to the ellipse ⟨x, Ax⟩ = k.

[Figure: at a general point x on the ellipse, Ax points off at an angle; at a principal axis, x and Ax are aligned.]

When x is a principal axis, x and Ax are aligned:

Ax = λx   ⇒   eigenvectors are principal axes!!!

< End of Proof >

Page 140: Class Notes

13

Theorem: The length of the principal axis associated with eigenvalue λ_i is √(k/λ_i).

Proof: If x is a principal axis, then Ax = λx. Take the inner product of both sides with x:

⟨Ax, x⟩ = λ⟨x, x⟩ = λ‖x‖²    and    ⟨Ax, x⟩ = k    ⇒    λ‖x‖² = k    ⇒    ‖x‖ = √(k/λ)

< End of Proof >

Note: This says that if A has a zero eigenvalue, then the error ellipse will have an infinite-length principal axis ⇒ NOT GOOD!!

So we'll require that all λ_i > 0 ⇒ C_θ̂ must be positive definite.

Page 141: Class Notes

14

Application of Eigen-Results to Error Ellipsoids
The error ellipsoid corresponding to the estimator covariance matrix Cθ̂ must satisfy:

 θT Cθ̂^{–1} θ = k

Note that the error ellipse is formed using the inverse covariance. Thus finding the eigenvectors/eigenvalues of Cθ̂^{–1} shows the structure of the error ellipse.

Recall: A positive definite matrix A and its inverse A^{–1} have the same eigenvectors and reciprocal eigenvalues.

Thus, we could instead find the eigenvectors/eigenvalues of Cθ̂ itself, and then the principal axes have lengths set by its eigenvalues not inverted.

 Cθ̂ = I^{–1}(θ)   ← Inverse FIM!!

Page 142: Class Notes

15

Illustrate with the 2-D case:  θT Cθ̂^{–1} θ = k

Let v1, v2 & λ1, λ2 be the eigenvectors/eigenvalues of Cθ̂ (not the inverse!).

[Sketch: ellipse with principal axes along v1 and v2, with semi-axis lengths √(kλ1) and √(kλ2).]

Page 143: Class Notes

16

The CRLB/FIM Ellipse

We can re-state this in terms of the FIM

Once we find the FIM we can:
 • Find the inverse FIM
 • Find its eigenvectors → gives the principal axes
 • Find its eigenvalues → principal axis lengths are then √(kλi)

Can make an ellipse from the CRLB Matrix instead of the Cov. Matrix.

This ellipse will be the smallest error ellipse that an unbiased estimator can achieve! (A small numerical sketch follows below.)
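As a minimal sketch of the steps just listed (not from the notes; the FIM values and Pe are assumed purely for illustration), a hypothetical 2-D FIM can be inverted and eigen-decomposed to get the CRLB error-ellipse axes:

import numpy as np

# Assumed 2x2 Fisher information matrix and containment probability
J = np.array([[4.0, 1.5],
              [1.5, 2.0]])
Pe = 0.95
k = -2.0 * np.log(1.0 - Pe)          # k = -2 ln(1 - Pe) for the 2-D case

C_crlb = np.linalg.inv(J)            # CRLB matrix = inverse FIM
eigvals, eigvecs = np.linalg.eigh(C_crlb)

for lam, v in zip(eigvals, eigvecs.T):
    # principal axis direction and semi-axis length sqrt(k * lambda_i)
    print("axis direction", v, "semi-axis length", np.sqrt(k * lam))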

Page 144: Class Notes

1

3.8 Vector Transformations
Just like for the scalar case…. α = g(θ)

If you know CRLBθ you can find CRLBα:

 CRLBα = [ ∂g(θ)/∂θ ] I^{–1}(θ) [ ∂g(θ)/∂θ ]T
     (CRLB on α)      (CRLB on θ)

where ∂g(θ)/∂θ is the Jacobian matrix (see p. 46)

Example: Usually we can estimate Range (R) and Bearing (φ) directly, but might really want the emitter location (x, y).

Page 145: Class Notes

2

Example of Vector Transform

[Sketch: x-y plane with the emitter at (xe, ye); the platform measures range R and bearing φ to the emitter.]

Can estimate Range (R) and Bearing (φ) directly, but might really want the emitter location (xe, ye):

 θ = [R  φ]T  (direct parameters)   →   α = g(θ) = [xe  ye]T = [R cos φ   R sin φ]T  (mapped parameters)

 CRLBα = [ ∂g(θ)/∂θ ] CRLBθ [ ∂g(θ)/∂θ ]T

with the Jacobian matrix

 ∂g(θ)/∂θ = [ ∂xe/∂R  ∂xe/∂φ     = [ cos φ   –R sin φ
              ∂ye/∂R  ∂ye/∂φ ]       sin φ    R cos φ ]
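As a minimal sketch of this transformation (the range/bearing values and the CRLBθ entries below are assumed, not from the notes), the Jacobian maps a bound on (R, φ) into a bound on (xe, ye):

import numpy as np

R, phi = 10e3, np.deg2rad(30.0)          # assumed true range (m) and bearing (rad)
crlb_theta = np.diag([25.0, 1e-4])       # assumed bounds: var(R)=25 m^2, var(phi)=1e-4 rad^2

Jg = np.array([[np.cos(phi), -R * np.sin(phi)],    # d(xe)/dR, d(xe)/dphi
               [np.sin(phi),  R * np.cos(phi)]])   # d(ye)/dR, d(ye)/dphi

crlb_alpha = Jg @ crlb_theta @ Jg.T      # CRLB on (xe_hat, ye_hat)
print(crlb_alpha)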

Page 146: Class Notes

3

3.9 CRLB for General Gaussian Case
In Sect. 3.5 we saw the CRLB for “signal + AWGN”

For that case we saw: The PDF’s parameter-dependence showed up only in the mean of the PDF

Deterministic Signal w/ Scalar Deterministic Parameter

Now generalize to the case where:• Data is still Gaussian, but• Parameter-Dependence not restricted to Mean• Noise not restricted to White… Cov not necessarily diagonal

 x ~ N( µ(θ), C(θ) )

One way to get this case: “signal + AGN”

Random Gaussian Signal w/ Vector Deterministic Parameter

Non-White Noise

Page 147: Class Notes

4

For this case the FIM is given by: (See Appendix 3c)

 [I(θ)]ij = [ ∂µ(θ)/∂θi ]T C^{–1}(θ) [ ∂µ(θ)/∂θj ]  +  ½ tr[ C^{–1}(θ) ∂C(θ)/∂θi  C^{–1}(θ) ∂C(θ)/∂θj ]

 (first term: variability of the mean w.r.t. the parameters; second term: variability of the cov w.r.t. the parameters)

This shows the impact of signal model assumptions• deterministic signal + AGN• random Gaussian signal + AGN

Est. Cov. uses average over only noise

Est. Cov. uses average over signal & noise

Page 148: Class Notes

5

Gen. Gauss. Ex.: Time-Difference-of-Arrival
Given:  x = [ x1T  x2T ]T

[Sketch: transmitter Tx whose signal arrives at receivers Rx1 and Rx2.]

x1(t) = s(t – ∆τ) + w1(t)

x2(t) = s(t) + w2(t)

How to model the signal? • Case #1: s(t) is zero-mean, WSS, Gauss. Process• Case #2: s(t) is a deterministic signal

(Case #1 ↔ passive sonar;  Case #2 ↔ radar/comm location)

Case #1 (s(t) zero-mean, WSS, Gaussian):  µ(∆τ) = 0  ⇒  No Term #1
 All of the ∆τ-dependence is in the covariance:

  C(∆τ) = [ C11        C12(∆τ)
            C21(∆τ)    C22     ]

  with  Cii = C_{si si} + C_{wi wi}  and  Cij(∆τ) = C_{si sj}(∆τ)  for i ≠ j

Case #2 (s(t) deterministic):  C(∆τ) = C  ⇒  No Term #2
 All of the ∆τ-dependence is in the mean:

  µ(∆τ) = [ s[0; ∆τ]  s[1; ∆τ]  …  s[N–1; ∆τ]   s[0]  s[1]  …  s[N–1] ]T

  (the first block is the delayed signal seen at Rx1; the second block is the undelayed signal at Rx2)

Page 149: Class Notes

6

Comments on General Gaussian CRLB

It is interesting to note that for any given problem you may find each case used in the literature!!!

For example for the TDOA/FDOA estimation problem:• Case #1 used by M. Wax in IEEE Trans. Info Theory, Sept. 1982• Case #2 used by S. Stein in IEEE Trans. Signal Proc., Aug. 1993

See also differences in the book’s examples

We'll skip Section 3.10 and leave it as a reading assignment.

Page 150: Class Notes

1/19

3.11 CRLB Examples

1. Range Estimation sonar, radar, robotics, emitter location

2. Sinusoidal Parameter Estimation (Amp., Frequency, Phase) sonar, radar, communication receivers (recall DSB Example), etc.

3. Bearing Estimation sonar, radar, emitter location

4. Autoregressive Parameter Estimation speech processing, econometrics

We'll now apply the CRLB theory to several examples of practical signal processing problems.

We'll revisit these examples in Ch. 7, where we'll derive ML estimators that get close to achieving the CRLB.

Page 151: Class Notes

2/19

C-T Signal Model:  x(t) = s(t – τo) + w(t),   0 ≤ t ≤ T = Ts + τo,max

Transmit Pulse: s(t) nonzero over t ∈ [0, Ts]
Receive Reflection: s(t – τo)
Measure Time Delay: τo

[Sketch: transmitted pulse s(t) on [0, Ts]; received delayed pulse s(t – τo); receiver front end = BPF & Amp producing x(t); w(t) is bandlimited white Gaussian noise with PSD No/2 over |f| ≤ B.]

Ex. 1 Range Estimation Problem

Page 152: Class Notes

3/19

Sample every ∆ = 1/2B sec:  w[n] = w(n∆) is DT white Gaussian noise with Var σ² = BNo

 x[n] = s(n∆ – τo) + w[n],   n = 0, 1, …, N–1

[Sketch: PSD of w(t) flat at No/2 over |f| ≤ B; its ACF has zeros at multiples of 1/2B, so samples taken every 1/2B sec are uncorrelated.]

         ⎧ w[n]                 0 ≤ n ≤ no – 1
 x[n] =  ⎨ s(n∆ – τo) + w[n]    no ≤ n ≤ no + M – 1
         ⎩ w[n]                 no + M ≤ n ≤ N – 1

s[n;τo] has M non-zero samples starting at no

Range Estimation D-T Signal Model

Page 153: Class Notes

4/19

Now apply the standard CRLB result for signal + WGN:

 var(τ̂o) ≥ σ² / Σ_{n=0}^{N–1} ( ∂s[n; τo]/∂τo )²
        = σ² / Σ_{n=no}^{no+M–1} ( ∂s(n∆ – τo)/∂τo )²
        = σ² / Σ_{n=no}^{no+M–1} ( ∂s(t)/∂t |_{t = n∆ – τo} )²

Plug in and keep the non-zero terms. Exploit calculus!!! Use the approximation τo = ∆·no, then do a change of variables in the sum.

Range Estimation CRLB

Page 154: Class Notes

5/19

Assume the sample spacing is small enough to approximate the sum by an integral:

 var(τ̂o) ≥ σ² ∆ / ∫_0^{Ts} ( ∂s(t)/∂t )² dt  =  (No/2) / ∫_0^{Ts} ( ∂s(t)/∂t )² dt

Define the signal energy  Es = ∫_0^{Ts} s²(t) dt, so that

 var(τ̂o) ≥ 1 / [ ( Es/(No/2) ) · ( ∫_0^{Ts} (∂s(t)/∂t)² dt / Es ) ]

Using the FT differentiation theorem & Parseval:

 ∫ (∂s/∂t)² dt = ∫_{-∞}^{∞} (2πf)² |S(f)|² df,    Es = ∫_{-∞}^{∞} |S(f)|² df

Define a BW measure (Brms is the RMS BW):

 Brms² = ∫ (2πf)² |S(f)|² df / ∫ |S(f)|² df

and note that Es/(No/2) is a type of SNR.

Range Estimation CRLB (cont.)

Page 155: Class Notes

6/19

Using these ideas we arrive at the CRLB on the delay:

 var(τ̂o) ≥ 1 / ( SNR × Brms² )   (sec²)

To get the CRLB on the range use the transformation-of-parameters result, with R = cτo/2:

 CRLB_R = (∂R/∂τo)² CRLB_τo   ⇒   var(R̂) ≥ (c²/4) / ( SNR × Brms² )   (m²)

The CRLB is inversely proportional to: an SNR measure and an RMS BW measure.

So the CRLB tells us:
 • Choose a signal with large Brms
 • Ensure that SNR is large → better on nearby/large targets
 • Which is better: double the transmitted energy, or double the RMS bandwidth?

Range Estimation CRLB (cont.)
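As a minimal sketch of evaluating these bounds (the pulse shape, sample rate, and SNR below are assumed, not from the notes), the RMS bandwidth can be computed from an FFT of a sampled pulse and plugged into the delay/range CRLB forms above:

import numpy as np

fs = 10e6                                  # assumed sample rate (Hz)
t = np.arange(0, 100e-6, 1 / fs)           # 100 us pulse support
s = np.sin(2 * np.pi * 1e6 * t) * np.hanning(t.size)   # assumed transmit pulse

S = np.fft.fftshift(np.fft.fft(s))
f = np.fft.fftshift(np.fft.fftfreq(s.size, 1 / fs))
P = np.abs(S) ** 2
Brms_sq = np.sum((2 * np.pi * f) ** 2 * P) / np.sum(P)   # (2*pi*f)^2-weighted average

snr = 10.0                                 # assumed SNR measure Es/(No/2)
c = 3e8
var_tau = 1.0 / (snr * Brms_sq)            # delay bound, sec^2
var_R = (c ** 2 / 4.0) / (snr * Brms_sq)   # range bound, m^2
print(np.sqrt(var_tau), np.sqrt(var_R))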

Page 156: Class Notes

7/19

 x[n] = A cos(Ωo n + φ) + w[n],   n = 0, 1, …, N–1

Given DT signal samples of a sinusoid in noise.Estimate its amplitude, frequency, and phase

DT White Gaussian NoiseZero Mean & Variance of σ2

Ωo is DT frequency in cycles/sample: 0 < Ωo < π

Multiple parameters, so parameter vector:  θ = [A  Ωo  φ]T

Recall the SNR of a sinusoid in noise:

 SNR = Ps/Pn = (A²/2)/σ²

Ex. 2 Sinusoid Estimation CRLB Problem

Page 157: Class Notes

8/19

Approach:
 • Find the Fisher Info Matrix
 • Invert to get the CRLB matrix
 • Look at the diagonal elements to get bounds on the parameter variances

Recall: Result for FIM for general Gaussian case specialized to signal in AWGN case:

 [I(θ)]ij = (1/σ²) [ ∂s(θ)/∂θi ]T [ ∂s(θ)/∂θj ] = (1/σ²) Σ_{n=0}^{N–1} ( ∂s[n; θ]/∂θi )( ∂s[n; θ]/∂θj )

Sinusoid Estimation CRLB Approach

Page 158: Class Notes

9/19

Taking the partial derivatives and using the approximations given in the book (valid when Ωo is not near 0 or π), with θ = [A  Ωo  φ]T and all sums over n = 0, 1, …, N–1:

 [I(θ)]11 = (1/σ²) Σ cos²(Ωo n + φ) ≈ N/(2σ²)

 [I(θ)]12 = [I(θ)]21 = –(A/σ²) Σ n cos(Ωo n + φ) sin(Ωo n + φ) ≈ 0

 [I(θ)]13 = [I(θ)]31 = –(A/σ²) Σ cos(Ωo n + φ) sin(Ωo n + φ) ≈ 0

 [I(θ)]22 = (A²/σ²) Σ n² sin²(Ωo n + φ) ≈ (A²/(2σ²)) Σ n²

 [I(θ)]23 = [I(θ)]32 = (A²/σ²) Σ n sin²(Ωo n + φ) ≈ (A²/(2σ²)) Σ n

 [I(θ)]33 = (A²/σ²) Σ sin²(Ωo n + φ) ≈ N A²/(2σ²)

Sinusoid Estimation Fisher Info Elements

Page 159: Class Notes

10/19

The Fisher Info Matrix then is (θ = [A  Ωo  φ]T):

 I(θ) = (1/σ²) ⎡ N/2        0              0        ⎤
               ⎢ 0      (A²/2) Σ n²    (A²/2) Σ n   ⎥
               ⎣ 0      (A²/2) Σ n     (A²/2) N     ⎦

Recall SNR = A²/(2σ²) and the closed-form results for these sums:
 Σ_{n=0}^{N–1} n = N(N–1)/2,   Σ_{n=0}^{N–1} n² = N(N–1)(2N–1)/6

Sinusoid Estimation Fisher Info Matrix

Page 160: Class Notes

11/19

Inverting the FIM by hand gives the CRLB matrix and then extracting the diagonal elements gives the three bounds:

(using co-factor & detapproach helped by 0s)

 var(Â) ≥ 2σ²/N   (volts²)

 var(Ω̂o) ≥ 12 / ( SNR × N(N²–1) )   ((rad/sample)²)

 var(φ̂) ≥ 2(2N–1) / ( SNR × N(N+1) ) ≈ 4 / ( SNR × N )   (rad²)

Amp. Accuracy: decreases as 1/N; depends on the noise variance (not SNR)

Freq. Accuracy: decreases as 1/N³; decreases as 1/SNR

Phase Accuracy: decreases as 1/N; decreases as 1/SNR

To convert to Hz2

multiply by (Fs /2π)2

Sinusoid Estimation CRLBs
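As a minimal sketch of evaluating these three bounds (the record length, amplitude, noise variance, and sample rate below are assumed, not from the notes):

import numpy as np

N = 256
A, sigma2 = 1.0, 0.1                 # assumed amplitude and noise variance
snr = A ** 2 / (2 * sigma2)

var_A   = 2 * sigma2 / N                            # volts^2
var_Om  = 12 / (snr * N * (N ** 2 - 1))             # (rad/sample)^2
var_phi = 2 * (2 * N - 1) / (snr * N * (N + 1))     # rad^2

Fs = 1e4                                            # assumed sample rate, Hz
var_f_Hz2 = var_Om * (Fs / (2 * np.pi)) ** 2        # frequency bound converted to Hz^2
print(var_A, var_Om, var_phi, var_f_Hz2)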

Page 161: Class Notes

12/19

The CRLB for Freq. Est. referred back to the CT is

 var(f̂) ≥ 12 Fs² / ( (2π)² × SNR × N(N²–1) )   (Hz²)

Does that mean we do worse if we sample faster than Nyquist? NO!!!!! For a fixed duration T of signal: N = T·Fs.

Also keep in mind that Fs has an effect on the noise structure:
[Sketch: PSD of w(t) flat at No/2 over |f| ≤ B and its ACF; σ² = BNo, so the noise variance changes with the bandwidth.]

Frequency Estimation CRLBs and Fs

Page 162: Class Notes

13/19

Uniformly spaced linear array with M sensors: Sensor Spacing of d meters Bearing angle to target β radians

Figure 3.8 from textbook:

Simple model: the target emits or reflects the signal  s(t) = A cos(2π fo t + φ)

Propagation time to the nth sensor:  tn = t0 – (n d cos β)/c,   n = 0, 1, …, M–1

Signal at the nth sensor:  sn(t) = s(t – tn) = A cos( 2π fo [ t – t0 + (n d cos β)/c ] + φ )

Ex. 3 Bearing Estimation CRLB Problem

Page 163: Class Notes

14/19

Now instead of sampling each sensor at lots of time instants we just grab one snapshot of all M sensors at a single instant ts

 sn(ts) = A cos( 2π fo (d cos β / c) n + φ̃ ) = A cos( Ωs n + φ̃ )

 where φ̃ = φ + 2π fo (ts – t0) collects the constant phase terms

⇒ a spatial sinusoid w/ spatial frequency Ωs = 2π fo (d/c) cos β

Spatial Frequencies: ωs is in rad/meter Ωs is in rad/sensor

For sinusoidal transmitted signal Bearing Est. reduces to Frequency Est.

And we already know its FIM & CRLB!!!

Bearing Estimation Snapshot of Sensor Signals

Page 164: Class Notes

15/19

Each sample in the snapshot is corrupted by a noise sample:

 x[n] = sn(ts) + w[n] = A cos(Ωs n + φ̃) + w[n]

and these M samples make the data vector x = [x[0] x[1] … x[M–1]]T.

Each w[n] is a noise sample that comes from a different sensor, so model them as uncorrelated Gaussian RVs (same as white temporal noise); assume each sensor has the same noise variance σ².

So the parameters to consider are:  θ = [A  Ωs  φ̃]T

which get transformed to:  α = g(θ) = [A  β  φ̃]T,   with  β = arccos( c Ωs / (2π fo d) )

Parameter of interest!

Bearing Estimation Data and Parameters

Page 165: Class Notes

16/19

Using the FIM for the sinusoidal parameter problem together with the transform. of parms result (see book p. 59 for details):

 var(β̂) ≥ [ 12 / ( (2π)² × SNR ) ] × [ (M–1) / ( M(M+1) ) ] × [ 1 / ( Lr² sin²β ) ]   (rad²)

Bearing Accuracy:
 • Decreases as 1/SNR
 • Decreases as 1/M
 • Decreases as 1/Lr²
 • Depends on the actual bearing β: best at β = π/2 (broadside), impossible at β = 0 (endfire)

L = array physical length in meters; M = number of array elements; λ = c/fo = wavelength in meters (per cycle)
Define: Lr = L/λ = array length in wavelengths

Low-frequency (i.e., long wavelength) signals need very large physical lengths to achieve good accuracy

Bearing Estimation CRLB Result

Page 166: Class Notes

17/19

In speech processing (and other areas) we often model the signal as an AR random process and need to estimate the AR parameters. An AR process has a PSD given by

 Pxx(f; θ) = σu² / | 1 + Σ_{m=1}^{p} a[m] e^{–j2πfm} |²

AR Estimation Problem: Given data x[0], x[1], …, x[N–1] estimate the AR parameter vector

 θ = [ a[1]  a[2] …  a[p]   σu² ]T

This is a hard CRLB to find exactly but it has been published. The difficulty comes from the fact that there is no easy direct relationship between the parameters and the data.

It is not a signal plus noise problem

Ex. 4 AR Estimation CRLB Problem

Page 167: Class Notes

18/19

Approach: The asymptotic result we discussed is perfect here: an AR process is WSS, which is required for the asymptotic result.

Gaussian is often a reasonable assumption needed for Asymp. Result

The Asymp. Result is in terms of partial derivatives of the PSD and that is exactly the form in which the parameters are clearly displayed!

 [I(θ)]ij ≈ (N/2) ∫_{–1/2}^{1/2} ( ∂ ln Pxx(f; θ)/∂θi ) ( ∂ ln Pxx(f; θ)/∂θj ) df

Recall:

 ln Pxx(f; θ) = ln σu² – ln | 1 + Σ_{m=1}^{p} a[m] e^{–j2πfm} |²

AR Estimation CRLB Asymptotic Approach

Page 168: Class Notes

19/19

After taking these derivatives you get results that can be simplified using properties of FT and convolution.

The final result is:

 var( â[k] ) ≥ (σu²/N) [ Rxx^{–1} ]kk ,   k = 1, 2, …, p        var( σ̂u² ) ≥ 2σu⁴ / N

Both decrease as 1/N. Complicated dependence on the AC matrix!!

To get a little insight look at the 1st-order AR case (p = 1):

 var( â[1] ) ≥ ( 1 – a²[1] ) / N

Improves as the pole gets closer to the unit circle ⇒ PSDs with sharp peaks are easier to estimate.
[Sketch: z-plane showing the AR(1) pole location relative to the unit circle.]

AR Estimation CRLB Asymptotic Result

Page 169: Class Notes

1

CRLB Example: Single-Rx Emitter Location via Doppler

Received signal on the ith pass:  s(t; fi),  t ∈ [ti, ti + T],  i = 1, 2, 3, …

Received Signal Parameters Depend on Location

Estimate the Rx signal frequencies: f1, f2, f3, …, fN

Then use the measured frequencies to estimate the location (X, Y, Z, fo)

Page 170: Class Notes

2

Problem BackgroundRadar to be Located: at Unknown Location (X,Y,Z)Transmits Radar Signal at Unknown Carrier Frequency fo

Signal is intercepted by airborne receiver

Known (Navigation Data): Antenna Positions: (Xp(t), Yp(t), Zp(t))Antenna Velocities: (Vx(t), Vy(t), Vz(t))

Goal: Estimate Parameter Vector x = [X Y Z fo]T

Page 171: Class Notes

3

Physics of Problem

[Sketch: emitter and receiver; v(t) is the receiver velocity and u(t) is the unit line-of-sight vector.]

Relative motion between emitter and receiver causes a Doppler shift of the carrier frequency:

 f(x, t) = fo – (fo/c) v(t)•u(t)
        = fo – (fo/c) [ Vx(t)(Xp(t) – X) + Vy(t)(Yp(t) – Y) + Vz(t)(Zp(t) – Z) ] / √[ (Xp(t) – X)² + (Yp(t) – Y)² + (Zp(t) – Z)² ]

Because we estimate the frequency there is an error added:

 f̃(x, ti) = f(x, ti) + v(ti)

Page 172: Class Notes

4

Estimation Problem Statement
Given:

 Data Vector:  f̃(x) = [ f̃(x, t1)  f̃(x, t2)  …  f̃(x, tN) ]T    (a vector-valued function of a vector)

 Navigation Info:  Xp(t1), …, Xp(tN);  Yp(t1), …, Yp(tN);  Zp(t1), …, Zp(tN);
          Vx(t1), …, Vx(tN);  Vy(t1), …, Vy(tN);  Vz(t1), …, Vz(tN)

Estimate:  Parameter Vector  x = [X Y Z fo]T

Right now we only want to consider the CRLB.

Page 173: Class Notes

5

The CRLB
Note that this is a signal plus noise scenario:

The signal is the noise-free frequency values The noise is the error made in measuring frequency

Assume zero-mean Gaussian noise with covariance matrix C: Can use the General Gaussian Case of the CRLB Of course validity of this depends on how closely the errors

of the frequency estimator really do follow this

Our data vector is distributed according to:  f̃(x) ~ N( f(x), C )

 Only the mean shows dependence on the parameter x!

Only need the first term in the CRLB equation:

 [J]ij = [ ∂f(x)/∂xi ]T C^{–1} [ ∂f(x)/∂xj ]

I use J for the FIM instead of I to avoid confusion with the identity matrix.

Page 174: Class Notes

6

Convenient Form for FIM

To put this into an easier form to look at Define a matrix H:

Called The Jacobian of f(x)

 H = ∂f(x, t)/∂x |_{x = true value} = [ h1 | h2 | h3 | h4 ]

where

 hj = ∂f(x)/∂xj = [ ∂f(x, t1)/∂xj   ∂f(x, t2)/∂xj  …  ∂f(x, tN)/∂xj ]T

Then it is easy to verify that the FIM becomes:

 J = HT C^{–1} H

Page 175: Class Notes

7

CRLB Matrix
The Cramer-Rao bound covariance matrix then is:

 C_CRB(x) = J^{–1} = ( HT C^{–1} H )^{–1}

A closed-form expression for the partial derivatives needed for H can be computed in terms of an arbitrary set of navigation data — see the Reading Version of these notes.

Thus, for a given emitter-platform geometry it is possible to compute the matrix H and then use it to compute the CRLB covariance matrix in (5), from which an eigen-analysis can be done to determine the 4-D error ellipsoid.

Can't really plot a 4-D ellipsoid!!!

But it is possible to project this 4-D ellipsoid down into 3-D (or even 2-D) so that you can see the effect of geometry.

Page 176: Class Notes

8

Projection of Error Ellipsoids
A zero-mean Gaussian vector made up of two vectors x & y:  θ = [ xT  yT ]T

Then the PDF is:

 p(θ) = [1 / ( (2π)^{N/2} det^{1/2}(Cθ) )] exp{ –½ θT Cθ^{–1} θ },   with   Cθ = [ Cx    Cxy
                                                                                    Cyx   Cy  ]

The quadratic form in the exponential defines an ellipsoid:

 θT Cθ^{–1} θ = k

Can choose k to make the size of the ellipsoid such that θ falls inside it with a desired probability.

Q: If we are given the covariance Cθ, how is x alone distributed?

A: Extract the sub-matrix Cx out of Cθ

See also Slice Of Error Ellipsoids

Page 177: Class Notes

9

Projections of 3-D Ellipsoids onto 2-D Space

2-D projections show the expected variation in 2-D space.
[Sketch: 3-D ellipsoid in (x, y, z) projected down onto the x-y plane.]

Projection Example — Full vector: θ = [x y z]T;  Sub-vector: x = [x y]T

We want to project the 3-D ellipsoid for θ down into a 2-D ellipse for x.
The 2-D ellipse still shows the full range of variations of x and y.

Page 178: Class Notes

10

Finding Projections
To find the projection of the CRLB ellipse (a numerical sketch of these steps follows below):

 1. Invert the FIM to get CCRB
 2. Select the submatrix CCRB,sub from CCRB
 3. Invert CCRB,sub to get Jproj
 4. Compute the ellipse for the quadratic form of Jproj

Mathematically:

 CCRB,sub = P CCRB PT,    Jproj = ( P J^{–1} PT )^{–1}

P is a matrix formed from the identity matrix: keep only the rows of the variables being projected onto.

For this example, frequency-based emitter location with [X Y Z fo]T, to project the 4-D error ellipsoid onto the X-Y plane:

 P = [ 1 0 0 0
       0 1 0 0 ]
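Here is a minimal sketch of those four steps (the 4×4 FIM below is made up purely for illustration; only the P matrix follows the example above):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
J = A @ A.T + 4 * np.eye(4)          # assumed 4x4 FIM (symmetric positive definite)

P = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # keep the rows for X and Y

C_crb = np.linalg.inv(J)             # step 1: CRLB matrix
C_sub = P @ C_crb @ P.T              # step 2: sub-matrix for (X, Y)
J_proj = np.linalg.inv(C_sub)        # step 3: quadratic form of the projected ellipse

# step 4: principal axes / semi-axis lengths of the projected ellipse (for k = 1)
lam, V = np.linalg.eigh(C_sub)
print("J_proj:\n", J_proj)
print("semi-axis lengths:", np.sqrt(lam), "\naxis directions:\n", V)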

Page 179: Class Notes

11

Projections Applied to Emitter Location

Shows 2-D ellipses that result from projecting 4-D

ellipsoids

Page 180: Class Notes

12

Slices of Error Ellipsoids
Q: What happens if one parameter were perfectly known?

Capture this by setting that parameter's error to zero ⇒ slice through the error ellipsoid.

[Sketch: slicing a 3-D ellipsoid — slices show the impact of knowledge of a parameter.]

Impact:  slice = projection when the ellipsoid is not tilted;  slice < projection when the ellipsoid is tilted.

Recall: Correlation causes tilt

Page 181: Class Notes

1

Chapter 4Linear Models

Page 182: Class Notes

2

General Linear Model
Recall the signal + WGN case:  x[n] = s[n; θ] + w[n]

 x = s(θ) + w    (here, the dependence on θ is general)

Now we consider a special case — Linear Observations:  s(θ) = Hθ + b

The General Linear Model:

 x = Hθ + b + w

 • x: N×1 data vector
 • H: known “observation matrix” (N×p), full rank
 • θ: p×1 vector to be estimated
 • b: known N×1 “offset”
 • w ~ N(0, C): zero-mean, Gaussian, C is pos. def. and known

Note: Gaussian is part of the Linear Model

Page 183: Class Notes

3

Need For Full-Rank H Matrix
Note: We must assume H is full rank

Q: Why?

A: If not, the estimation problem is “ill-posed”…given vector s there are multiple θ vectors that give s:

If H is not full rank…Then for any s : ∃ θ1, θ2 such that s = Hθ1 = Hθ2

Page 184: Class Notes

4

Importance of The Linear Model

There are several reasons:

1. Some applications admit this model

2. Nonlinear models can sometimes be linearized

3. Finding the Optimal Estimator is Easy:

 θ̂_MVU = ( HT C^{–1} H )^{–1} HT C^{–1} ( x – b )   … as we'll see!!!

Page 185: Class Notes

5

MVUE for Linear ModelTheorem: The MVUE for the General Linear Model and its covariance (i.e. its accuracy performance) are given by:

 θ̂_MVU = ( HT C^{–1} H )^{–1} HT C^{–1} ( x – b )

 C_θ̂ = ( HT C^{–1} H )^{–1}

and it achieves the CRLB.

Proof: We’ll do this for the b = 0 case but it can easily be done for the more general case.

First we have that x~N(Hθ,C) because:

Ex = EHθ + w = Hθ + Εw = Hθ

covx = E(x – Hθ) (x – Hθ)T = Ew wT = C
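As a minimal sketch of the MVUE formula for the general linear model (all of H, b, C, and the true θ below are made-up values, not from the notes):

import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 3
H = rng.standard_normal((N, p))          # assumed known, full-rank observation matrix
b = np.zeros(N)                          # assumed known offset
theta_true = np.array([1.0, -2.0, 0.5])
C = 0.1 * np.eye(N)                      # assumed noise covariance

w = rng.multivariate_normal(np.zeros(N), C)
x = H @ theta_true + b + w

Ci = np.linalg.inv(C)
cov_theta = np.linalg.inv(H.T @ Ci @ H)          # estimator covariance = CRLB
theta_hat = cov_theta @ (H.T @ Ci @ (x - b))     # MVUE
print(theta_hat)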

Page 186: Class Notes

6

Recalling the CRLB Theorem… look at the partial of the LLF (for b = 0):

 ∂ ln p(x; θ)/∂θ = ∂/∂θ [ –½ ( x – Hθ )T C^{–1} ( x – Hθ ) ]
        = ∂/∂θ [ –½ ( xT C^{–1} x  –  2 xT C^{–1} Hθ  +  θT HT C^{–1} H θ ) ]

  (constant w.r.t. θ)   (linear w.r.t. θ)   (quadratic w.r.t. θ; note HT C^{–1} H is symmetric)

 using  (Hθ)T C^{–1} x = [ (Hθ)T C^{–1} x ]T = xT C^{–1} (Hθ)

Now use the results in “Gradients and Derivatives” posted on BB:

 ∂ ln p(x; θ)/∂θ = HT C^{–1} x – HT C^{–1} H θ = ( HT C^{–1} H ) [ ( HT C^{–1} H )^{–1} HT C^{–1} x – θ ] = I(θ) [ g(x) – θ ]

The CRLB Theorem says that if we have this form we have found the MVU and it achieves the CRLB of I^{–1}(θ)!!

Page 187: Class Notes

7

Whitening Filter Viewpoint (for simplicity… assume b = 0)
Assume C is positive definite (necessary for C^{–1} to exist).

Thus, from (A1.2): for pos. def. C there exists an N×N invertible matrix D such that

 C^{–1} = DT D   ⇔   C = D^{–1}(DT)^{–1}

Transform the data x using matrix D:

 x̃ = Dx = DHθ + Dw = H̃θ + w̃

Claim: w̃ is white!!

 E[w̃ w̃T] = E[(Dw)(Dw)T] = D E[w wT] DT = D C DT = D D^{–1}(DT)^{–1} DT = I

[Block diagram: x → whitening filter D → x̃ → MVUE for linear model w/ white noise → θ̂]

Page 188: Class Notes

8

Ex. 4.1: Curve Fitting
Caution: The “Linear” in “Linear Model” does not come from fitting straight lines to data — it is more general than that!!

[Sketch: data x[n] versus index n with a quadratic curve fit through it.]

 x[n] = θ1 + θ2 n + θ3 n² + w[n]

The model is quadratic in the index n… but the model is linear in the parameters:

 x = Hθ + w,    θ = [θ1  θ2  θ3]T,    H = ⎡ 1   1   1²  ⎤
                                          ⎢ 1   2   2²  ⎥
                                          ⎢ ⋮   ⋮   ⋮   ⎥
                                          ⎣ 1   N   N²  ⎦

Linear in θ’s
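As a minimal sketch of this example (synthetic data; the true θ, noise level, and n = 1…N indexing are assumed), the quadratic fit is just the linear-model estimator, which for white noise reduces to least squares:

import numpy as np

rng = np.random.default_rng(2)
N = 100
n = np.arange(1, N + 1, dtype=float)
H = np.column_stack([np.ones(N), n, n ** 2])     # columns: 1, n, n^2

theta_true = np.array([2.0, -0.3, 0.01])
x = H @ theta_true + 0.5 * rng.standard_normal(N)

# White noise (C = sigma^2 I)  =>  theta_hat = (H^T H)^-1 H^T x
theta_hat, *_ = np.linalg.lstsq(H, x, rcond=None)
print(theta_hat)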

Page 189: Class Notes

9

Ex. 4.2: Fourier Analysis (not most general)

Data Model:

 x[n] = Σ_{k=1}^{M} ak cos(2πkn/N) + Σ_{k=1}^{M} bk sin(2πkn/N) + w[n]    (w[n] is AWGN)

Parameters to estimate:  θ = [a1 … aM  b1 … bM]T  (Fourier coefficients)

Observation matrix: H has one column cos(2πkn/N) and one column sin(2πkn/N) for each k = 1, 2, …, M, with n = 0, 1, …, N–1 running down each column.

Page 190: Class Notes

10

Now apply MVUE Theorem for Linear Model:

 θ̂_MVU = ( HT H )^{–1} HT x

Using the standard orthogonality of sinusoids (see book):  HT H = (N/2) I

 ⇒  θ̂_MVU = (2/N) HT x

Each Fourier coefficient estimate is found by the inner product of a column of H with the data vector x (scaled by 2/N).

Interesting!!! Fourier Coefficients for signal + AWGN are MVU estimates of the Fourier Coefficients of the noise-free signal

COMMENT: Modeling and Estimation (are Intertwined)
 • Sometimes the parameters have some physical significance (e.g. delay of a radar signal).
 • But sometimes the parameters are part of a non-physical assumed model (e.g. Fourier).
 • Fourier coefficients for signal + AWGN are MVU estimates of the Fourier coefficients of the noise-free signal.

Page 191: Class Notes

11

Ex. 4.3: System Identification

[Block diagram: known input u[n] → unknown system H(z) → plus AWGN w[n] → observed noisy output x[n].]

Goal: Determine a model for the system. Some application areas:
 • Wireless communications (identify & equalize multipath)
 • Geophysical sensing (oil exploration)
 • Speakerphone (echo cancellation)

In many applications we assume that the system is FIR of length p (p unknown, but here we'll assume it is known):

 x[n] = Σ_{k=0}^{p–1} h[k] u[n–k] + w[n]

 x[n]: measured data;  h[k]: estimation parameters;  u[n]: known input (assume u[n] = 0, n < 0);  w[n]: AWGN

Page 192: Class Notes

12

Write FIR convolution in matrix form:

wx

θ

H

+

−−

=

!"!#$

%

!!!!!!!!!!!! "!!!!!!!!!!!! #$ &&&&

%%

%

'%

'''%

%''''%

&

&&

&&

]1[

]1[

]0[

][]1[

]1[

]0[

0

00]0[]1[]2[

00]0[]1[

000]0[

)(

ph

h

h

pNuNu

u

u

uuu

uu

u

NxpEstimate

This

Measured Data

Known Input Signal Matrix

The Theorem for the Linear Model says:

( ) xHHHθ TTMVU

1 ˆ −

=

( ) 12ˆ

−= HHCθ

Tσ and achieves the CRLB.
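As a minimal sketch of this example (synthetic data; the true FIR taps, PRN-like input, and noise level are assumed), the convolution matrix H is built from the known input and the taps are estimated by least squares:

import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(3)
N, p = 200, 4
u = rng.choice([-1.0, 1.0], size=N)          # PRN-like known input
h_true = np.array([1.0, 0.5, -0.3, 0.1])

# H[n, k] = u[n - k], with u[n] = 0 for n < 0  ->  Toeplitz structure
H = toeplitz(u, np.r_[u[0], np.zeros(p - 1)])

x = H @ h_true + 0.1 * rng.standard_normal(N)
h_hat = np.linalg.lstsq(H, x, rcond=None)[0]  # (H^T H)^-1 H^T x for white noise
print(h_hat)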

Page 193: Class Notes

13

Q: What signal u[n] is best to use ?

A: The u[n] that gives the smallest estimated variances!!

Book shows: Choosing u[n] s.t. HTH is diagonal will minimize variance

⇒ Choose u[n] to be pseudo-random noise (PRN): u[n] is ⊥ to all its shifts u[n – m]

Proof uses  C_θ̂ = σ² ( HT H )^{–1}  and the Cauchy-Schwarz Inequality (same as Schwarz Ineq.)

And Cauchy-Schwarz Inequality (same as Schwarz Ineq.)

Note: PRN has approximately flat spectrum

So from a frequency-domain view a PRN signal equally probes at all frequencies

Page 194: Class Notes

1

Chapter 6Best Linear Unbiased Estimate

(BLUE)

Page 195: Class Notes

2

Motivation for BLUEExcept for Linear Model case, the optimal MVU estimator might:

1. not even exist2. be difficult or impossible to find

⇒ Resort to a sub-optimal estimate. BLUE is one such sub-optimal estimate.

Idea for BLUE:
 1. Restrict the estimate to be linear in the data x
 2. Restrict the estimate to be unbiased
 3. Find the best one (i.e. with minimum variance)

Advantage of BLUE: needs only the 1st and 2nd moments of the PDF (mean & covariance)

Disadvantages of BLUE:
 1. Sub-optimal (in general)
 2. Sometimes totally inappropriate (see bottom of p. 134)

Page 196: Class Notes

3

6.3 Definition of BLUE (scalar case)Observed Data: x = [x[0] x[1] . . . x[N – 1] ]T

PDF: p(x;θ ) depends on unknown θ

BLUE constrained to be linear in the data:

 θ̂_BLUE = Σ_{n=0}^{N–1} an x[n] = aT x

Choose the a's to give:
 1. an unbiased estimator
 2. then minimize the variance

[Sketch: variance comparison — among all unbiased estimators, the BLUE is the best of the linear ones; the MVUE (possibly nonlinear) can do better. Note: This is not Fig. 6.1.]

Page 197: Class Notes

4

6.4 Finding The BLUE (Scalar Case)

 1. Constrain to be linear:  θ̂ = Σ_{n=0}^{N–1} an x[n]

 2. Constrain to be unbiased:  E[θ̂] = θ  ⇒  Σ_{n=0}^{N–1} an E[x[n]] = θ    (using the linear constraint)

Q: When can we meet both of these constraints?

A: Only for certain observation models (e.g., linear observations)

Page 198: Class Notes

5

Finding BLUE for Scalar Linear ObservationsConsider scalar-parameter linear observation:

x[n] = θs[n] + w[n] ⇒ Ex[n] = θs[n]

Then for the unbiased condition we need:

 E[θ̂] = Σ_{n=0}^{N–1} an E[x[n]] = θ Σ_{n=0}^{N–1} an s[n] = θ   ⇒   need  aT s = 1

This tells how to choose the weights to use in the BLUE estimator form  θ̂ = Σ_{n=0}^{N–1} an x[n].

Now… given that these constraints are met… we need to minimize the variance!!

Given that C is the covariance matrix of x we have:

 var( θ̂_BLUE ) = var( aT x ) = aT C a    (like var(aX) = a² var(X))

Page 199: Class Notes

6

Goal: minimize aTCa subject to aTs = 1

⇒ Constrained optimization

Appendix 6A: Use Lagrangian Multipliers: Minimize J = aTCa + λ(aTs – 1)

Set  ∂J/∂a = 2Ca + λs = 0   ⇒   a = –(λ/2) C^{–1} s

Apply the constraint aT s = 1:  –(λ/2) sT C^{–1} s = 1   ⇒   –λ/2 = 1 / ( sT C^{–1} s )

 ⇒  a = C^{–1} s / ( sT C^{–1} s )

 θ̂_BLUE = aT x = sT C^{–1} x / ( sT C^{–1} s ),     var(θ̂) = 1 / ( sT C^{–1} s )

Appendix 6A shows that this achieves a global minimum
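As a minimal sketch of the scalar BLUE just derived (synthetic data; the signal shape, noise covariance, and true θ are assumed), the weights a = C⁻¹s/(sᵀC⁻¹s) are applied to correlated-noise data:

import numpy as np

rng = np.random.default_rng(4)
N = 64
n = np.arange(N)
s = np.cos(2 * np.pi * 0.05 * n)                            # known signal shape
C = 0.2 * np.exp(-0.3 * np.abs(n[:, None] - n[None, :]))    # assumed noise covariance

theta_true = 3.0
w = rng.multivariate_normal(np.zeros(N), C)
x = theta_true * s + w

Ci_s = np.linalg.solve(C, s)                     # C^-1 s
a = Ci_s / (s @ Ci_s)                            # BLUE weights (satisfy a^T s = 1)
theta_blue = a @ x
var_blue = 1.0 / (s @ Ci_s)                      # BLUE variance
print(theta_blue, var_blue)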

Page 200: Class Notes

7

Applicability of BLUE

We just derived the BLUE under the following:1. Linear observations but with no constraint on the noise PDF2. No knowledge of the noise PDF other than its mean and cov!!

What does this tell us???BLUE is applicable to linear observations

But noise need not be Gaussian!!! (as was assumed in Ch. 4 Linear Model)

And all we need are the 1st and 2nd moments of the PDF!!!

But well see in the Example that we can often linearize a nonlinear model!!!

Page 201: Class Notes

8

6.5 Vector Parameter Case: Gauss-Markov Thm

Gauss-Markov Theorem: If the data can be modeled as having linear observations in noise:

 x = Hθ + w    (H is a known matrix; w has known mean & cov, but its PDF is otherwise arbitrary & unknown)

Then the BLUE is:

 θ̂_BLUE = ( HT C^{–1} H )^{–1} HT C^{–1} x

and its covariance is:

 C_θ̂ = ( HT C^{–1} H )^{–1}

Note: If noise is Gaussian then BLUE is MVUE

Page 202: Class Notes

9

Ex. 4.3: TDOA-Based Emitter Location

Tx @ (xs,ys)

Rx3(x3,y3)

Rx2(x2,y2)

Rx1(x1,y1)

s(t)

s(t – t1) s(t – t2) s(t – t3)

Hyperbola:τ12 = t2 – t1 = constant

Hyperbola:τ23 = t3 – t2 = constant

TDOA = Time-Difference-of-Arrival

Assume that the ith Rx can measure its TOA: ti

Then… from the set of TOAs… compute TDOAs

Then… from the set of TDOAs… estimate location (xs,ys)

We won’t worry about “how” they do that.Also… there are TDOA systems that never actually estimate TOAs!

Page 203: Class Notes

10

TOA Measurement ModelAssume measurements of TOAs at N receivers (only 3 shown above):

t0, t1, … ,tN-1There are measurement errors

TOA measurement model:To = Time the signal emittedRi = Range from Tx to Rxic = Speed of Propagation (for EM: c = 3x108 m/s)

ti = To + Ri/c + εi i = 0, 1, . . . , N-1

Measurement Noise ⇒ zero-mean, variance σ2, independent (but PDF unknown)(variance determined from estimator used to estimate ti’s)

Now use: Ri = [ (xs – xi)2 + (ys - yi)2 ]1/2

 ti = f(xs, ys) = To + (1/c) √( (xs – xi)² + (ys – yi)² ) + εi    ⇒  Nonlinear Model

Page 204: Class Notes

11

Linearization of TOA Model

So… we linearize the model so we can apply BLUE. Assume some rough estimate (xn, yn) is available:

 xs = xn + δxs,  ys = yn + δys    (xn, yn known; δxs, δys to be estimated)   ⇒  θ = [δxs  δys]T

Now use a truncated Taylor series to linearize Ri about (xn, yn):

 Ri ≈ Rni + [ (xn – xi)/Rni ] δxs + [ (yn – yi)/Rni ] δys  =  Rni + Ai δxs + Bi δys    (Rni, Ai, Bi known)

Apply to the TOA:

 t̃i = ti – Rni/c = To + (Ai/c) δxs + (Bi/c) δys + εi    (known coefficients)

Three unknown parameters to estimate: To, δxs, δys

Page 205: Class Notes

12

TOA Model vs. TDOA ModelTwo options now:

1. Use TOA to estimate 3 parameters: To, δxs, δys

2. Use TDOA to estimate 2 parameters: δxs, δys

Generally the fewer parameters the better…Everything else being the same.

But… here “everything else” is not the same: Options 1 & 2 have different noise models

(Option 1 has independent noise)(Option 2 has correlated noise)

In practice… we’d explore both options and see which is best.

Page 206: Class Notes

13

Conversion to TDOA Model N–1 TDOAs rather than N TOAs

TDOAs:  τi = t̃i – t̃i–1,   i = 1, 2, …, N–1

 τi = [ (Ai – Ai–1)/c ] δxs + [ (Bi – Bi–1)/c ] δys + ( εi – εi–1 )
       (known)              (known)              (correlated noise)

In matrix form:  x = Hθ + w

 x = [ τ1 τ2 … τN–1 ]T,   θ = [ δxs  δys ]T

 H = (1/c) ⎡ A1–A0        B1–B0      ⎤        w = Aε,   ε = [ ε0 ε1 … εN–1 ]T
           ⎢ A2–A1        B2–B1      ⎥
           ⎢   ⋮            ⋮        ⎥
           ⎣ AN–1–AN–2    BN–1–BN–2  ⎦

 Cw = cov(w) = σ² A AT    (see book for the structure of matrix A)

Page 207: Class Notes

14

Apply BLUE to TDOA Linearized Model

 θ̂_BLUE = ( HT Cw^{–1} H )^{–1} HT Cw^{–1} x = [ HT (A AT)^{–1} H ]^{–1} HT (A AT)^{–1} x
   (the dependence on σ² cancels out!!!)

 C_θ̂ = ( HT Cw^{–1} H )^{–1} = σ² [ HT (A AT)^{–1} H ]^{–1}
   (describes how large the location error is)

Things we can now do:1. Explore estimation error cov for different Tx/Rx geometries

• Plot error ellipses2. Analytically explore simple geometries to find trends

• See next chart (more details in book)

Page 208: Class Notes

15

Apply TDOA Result to Simple Geometry

[Sketch: Tx at range R broadside to three equally spaced receivers Rx1, Rx2, Rx3 (spacing d); α is the angle from broadside out to the end receivers.]

Then can show:

 C_θ̂ = c²σ² ⎡ 1/(2cos²α)         0            ⎤
            ⎣     0         3/(2(1 – sinα)²)  ⎦

Diagonal error cov ⇒ aligned error ellipse, and the y-error is always bigger than the x-error.

Page 209: Class Notes

16

[Plot: σx/(cσ) and σy/(cσ) versus α (degrees, 0–90) on a log scale, for the three-receiver geometry above.]

 • Used std. dev. to show units of X & Y
 • Normalized by cσ… get actual values by multiplying by your specific cσ value
 • For fixed range R: increasing Rx spacing d improves accuracy
 • For fixed spacing d: decreasing range R improves accuracy

Page 210: Class Notes

1

Chapter 7Maximum Likelihood Estimate

(MLE)

Page 211: Class Notes

2

Motivation for MLE
Problems:
 1. The MVUE often does not exist or can't be found <see Ex. 7.1 in the textbook for such a case>
 2. BLUE may not be applicable (x ≠ Hθ + w)

Solution: If the PDF is known, then MLE can always be used!!!
This makes the MLE one of the most popular practical methods.

 • Advantages:
  1. It is a “Turn-The-Crank” method
  2. “Optimal” for large enough data size
 • Disadvantages:
  1. Not optimal for small data size
  2. Can be computationally complex — may require numerical methods

Page 212: Class Notes

3

Rationale for MLEChoose the parameter value that:

makes the data you did observe…the most likely data to have been observed!!!

Consider 2 possible parameter values: θ1 & θ2

Ask the following: If θi were really the true value, what is the probability that I would get the data set I really got ?

Let this probability be Pi

So if Pi is small… it says you actually got a data set that was unlikely to occur! Not a good guess for θi!!!

But P1 ≈ p(x; θ1) dx and P2 ≈ p(x; θ2) dx

 ⇒ pick θ̂_ML so that p(x; θ̂_ML) is largest

Page 213: Class Notes

4

Definition of the MLE
θ̂_ML is the value of θ that maximizes the “Likelihood Function” p(x; θ) for the specific measured data x.

[Sketch: p(x; θ) versus θ, with θ̂_ML at the peak.]

Note: Because ln(z) is a monotonically increasing function… θ̂_ML also maximizes the log-likelihood function ln p(x; θ).

General Analytical Procedure to Find the MLE

1. Find log-likelihood function: ln p(x;θ)

2. Differentiate w.r.t θ and set to 0: ∂ln p(x;θ)/∂θ = 0

3. Solve for θ value that satisfies the equation

Page 214: Class Notes

5

Ex. 7.3: Ex. of MLE When MVUE Non-Existent

 x[n] = A + w[n],  w[n] is WGN ~ N(0, A)   ⇒   x[n] ~ N(A, A),  A > 0

Likelihood Function:

 p(x; A) = [1 / (2πA)^{N/2}] exp{ –(1/(2A)) Σ_{n=0}^{N–1} ( x[n] – A )² }

To take ln of this… use log properties. Take ∂/∂A, set = 0, and change A to Â:

 –N/(2Â) + (1/Â) Σ_{n=0}^{N–1} ( x[n] – Â ) + (1/(2Â²)) Σ_{n=0}^{N–1} ( x[n] – Â )² = 0

Expand this; the Σ x[n]/Â terms cancel:

 –N/(2Â) – N/2 + (1/(2Â²)) Σ_{n=0}^{N–1} x²[n] = 0

Page 215: Class Notes

6

Manipulate to get:

 Â² + Â – (1/N) Σ_{n=0}^{N–1} x²[n] = 0

Solve the quadratic equation to get the MLE:

 Â_ML = –1/2 + √( (1/N) Σ_{n=0}^{N–1} x²[n] + 1/4 )

Can show this estimator is biased (see bottom of p. 160), but it is asymptotically unbiased…

Use the “Law of Large Numbers”: sample mean → true mean

 (1/N) Σ_{n=0}^{N–1} x²[n]  →  E[x²[n]] = A + A²    as N → ∞

 ⇒  Â_ML → –1/2 + √( A + A² + 1/4 ) = –1/2 + (A + 1/2) = A

So can use this to show:

 var( Â_ML ) ≈ A² / ( N(A + 1/2) ) = CRLB

Asymptotically… Unbiased & Efficient

Page 216: Class Notes

7

7.5 Properties of the MLE (or… “Why We Love MLE”)

The MLE is asymptotically:

1. unbiased

2. efficient (i.e. achieves CRLB)

3. Gaussian PDF

Also, if a truly efficient estimator exists, then the ML procedure finds it !

The asymptotic properties are captured in Theorem 7.1:

If p(x;θ ) satisfies some “regularity” conditions, then the MLE is asymptotically distributed according to

 θ̂_ML ~a N( θ, I^{–1}(θ) )

where I(θ) = Fisher Information Matrix

Page 217: Class Notes

8

Size of N to Achieve Asymptotic

This Theorem only states what happens asymptotically…when N is small there is no guarantee how the MLE behaves

Q: How large must N be to achieve the asymptotic properties?

A: In practice: use “Monte Carlo Simulations” to answer this

Page 218: Class Notes

9

Monte Carlo Simulations: see Appendix 7A

Not just for the MLE!!!A methodology for doing computer simulations to evaluate performance of any estimation method Illustrate for deterministic signal s[n; θ ] in AWGN

Monte Carlo Simulation:

Data Collection:

1. Select a particular true parameter value, θtrue- you are often interested in doing this for a variety of values of θso you would run one MC simulation for each θ value of interest

2. Generate signal having true θ: s[n;θt] (call it s in matlab)

3. Generate WGN having unit variancew = randn ( size(s) );

4. Form measured data: x = s + sigma*w;- choose σ to get the desired SNR- usually want to run at many SNR values

→ do one MC simulation for each SNR value

Page 219: Class Notes

10

Data Collection (Continued):

5. Compute estimate from data x

6. Repeat steps 3-5 M times

- (call M “# of MC runs” or just “# of runs”)

7. Store all M estimates in a vector EST (assumes scalar θ)

Statistical Evaluation:

1. Compute bias

2. Compute error RMS

3. Compute the error Variance

4. Plot Histogram or Scatter Plot (if desired)

 b = (1/M) Σ_{i=1}^{M} ( θ̂i – θtrue )

 RMS = √[ (1/M) Σ_{i=1}^{M} ( θ̂i – θtrue )² ]

 VAR = (1/M) Σ_{i=1}^{M} ( θ̂i – (1/M) Σ_{j=1}^{M} θ̂j )²

Now explore (via plots) how Bias, RMS, and VAR vary with: θ value, SNR value, N value, etc.
Is b ≈ 0?  Is RMS ≈ (CRLB)½ ?
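As a minimal sketch of this Monte Carlo procedure (the scenario is assumed: sample-mean estimation of a DC level A in WGN, for which the CRLB is σ²/N):

import numpy as np

rng = np.random.default_rng(5)
A_true, sigma, N, M = 1.0, 1.0, 100, 5000   # true value, noise std, record length, # of MC runs

est = np.empty(M)
for i in range(M):
    x = A_true + sigma * rng.standard_normal(N)   # steps 3-4: generate noisy data
    est[i] = x.mean()                             # step 5: compute the estimate

bias = np.mean(est - A_true)
rms = np.sqrt(np.mean((est - A_true) ** 2))
var = np.var(est)
crlb = sigma ** 2 / N                             # CRLB for a DC level in WGN
print(bias, rms, var, crlb)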

Page 220: Class Notes

11

Ex. 7.6: Phase Estimation for a SinusoidSome Applications: 1. Demodulation of phase coherent modulations

(e.g., DSB, SSB, PSK, QAM, etc.)2. Phase-Based Bearing Estimation

Signal Model:  x[n] = A cos(2π fo n + φ) + w[n],   n = 0, 1, …, N–1
 A and fo known, φ unknown;  w[n] white ~ N(0, σ²)

Recall CRLB:  var(φ̂) ≥ 2σ² / (N A²) = 1 / ( N · SNR )

For this problem… all methods for finding the MVUE will fail!!⇒ So… try MLE!!

Page 221: Class Notes

12

So first we write the likelihood function:

 p(x; φ) = [1 / (2πσ²)^{N/2}] exp{ –(1/(2σ²)) Σ_{n=0}^{N–1} [ x[n] – A cos(2π fo n + φ) ]² }

GOAL: Find the φ that maximizes this… we end up in the same place if we maximize the LLF… equivalent to minimizing:

 J(φ) = Σ_{n=0}^{N–1} [ x[n] – A cos(2π fo n + φ) ]²

Setting ∂J(φ)/∂φ = 0 gives:

 Σ_{n=0}^{N–1} x[n] sin(2π fo n + φ̂) = A Σ_{n=0}^{N–1} sin(2π fo n + φ̂) cos(2π fo n + φ̂) ≈ 0

 (sin and cos are ⊥ when summed over full cycles)

So… the MLE phase estimate satisfies:

 Σ_{n=0}^{N–1} x[n] sin(2π fo n + φ̂) = 0    (interpret via inner product or correlation)

Page 222: Class Notes

13

Now…using a Trig Identity and then re-arranging gives:

 cos(φ̂) Σ_n x[n] sin(2π fo n) = –sin(φ̂) Σ_n x[n] cos(2π fo n)

Or…

 φ̂_ML = –arctan[ Σ_n x[n] sin(2π fo n)  /  Σ_n x[n] cos(2π fo n) ]

Recall: This is the approximate MLE. Don't need to know A or σ², but do need to know fo.

[Block diagram: I-Q signal generation — x(t) is multiplied by cos(2πfo t) and by –sin(2πfo t); each product is lowpass filtered, giving yi(t) and yq(t).]

The “sums” in the above equation play the role of the LPF's in the figure (why?).
Thus, the ML phase estimator can be viewed as: atan of the ratio of Q/I.
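As a minimal sketch of this estimator (synthetic data; N, A, fo, the true phase, and the noise level are assumed), the two correlation sums play the I and Q roles and the phase comes from their arctangent:

import numpy as np

rng = np.random.default_rng(6)
N, A, fo, phi_true, sigma = 200, 1.0, 0.11, 0.7, 0.5
n = np.arange(N)
x = A * np.cos(2 * np.pi * fo * n + phi_true) + sigma * rng.standard_normal(N)

I = np.sum(x * np.cos(2 * np.pi * fo * n))    # "in-phase" correlation sum
Q = np.sum(x * np.sin(2 * np.pi * fo * n))    # "quadrature" correlation sum
phi_hat = -np.arctan2(Q, I)                   # approximate ML phase estimate
print(phi_hat, phi_true)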

Page 223: Class Notes

14

Monte Carlo Results for ML Phase Estimation

See figures 7.3 & 7.4 in text book

Page 224: Class Notes

1

7.6 MLE for Transformed Parameters
Given the PDF p(x; θ) but we want an estimate of α = g(θ). What is the MLE for α??

Two cases:

 1. α = g(θ) is a one-to-one function:  α̂_ML maximizes p(x; g^{–1}(α))

 2. α = g(θ) is not a one-to-one function: need to define a modified likelihood function

  p̄T(x; α) = max_{θ: α = g(θ)} p(x; θ)

  • For each α, find all θ's that map to it
  • Extract the largest value of p(x; θ) over this set of θ's

  α̂_ML maximizes p̄T(x; α)

Page 225: Class Notes

2

Invariance Property of MLE Another Big Advantage of MLE!

Theorem 7.2: Invariance Property of MLEIf parameter θ is mapped according to α = g(θ ) then the MLE of α is given by

where is the MLE for θ found by maximizing p(x;θ )

)ˆ(ˆ θα g=

θ

Note: when g(θ ) is not one-to-one the MLE for α maximizes the modified likelihood function

“Proof”: Easy to see when g(θ ) is one-to-one

Otherwise… can “argue” that maximization over θ inside definition for modified LF ensures the result.

Page 226: Class Notes

3

Ex. 7.9: Estimate Power of DC Level in AWGN
 x[n] = A + w[n],  noise is N(0, σ²) & white.  Want to estimate the power:  α = A²

[Sketch: α = A² versus A — the mapping is not one-to-one.]

⇒ For each α value there are 2 PDF's to consider:

 p̄T1(x; α) = (2πσ²)^{–N/2} exp{ –(1/(2σ²)) Σ_n ( x[n] – √α )² }

 p̄T2(x; α) = (2πσ²)^{–N/2} exp{ –(1/(2σ²)) Σ_n ( x[n] + √α )² }

Then:

 α̂_ML = arg max_{α ≥ 0} max{ p̄T1(x; α), p̄T2(x; α) } = [ arg max_{–∞ < A < ∞} p(x; A) ]² = Â²_ML

A demonstration that the invariance result holds for this example.

Page 227: Class Notes

4

Ex. 7.10: Estimate Power of WGN in dB x[n] = w[n] WGN w/ var = σ2 unknown

Recall: Pnoise = σ2

Can show that the MLE for the variance is:

 P̂noise = σ̂² = (1/N) Σ_{n=0}^{N–1} x²[n]

To get the dB version of the power estimate:

Note: You may recall a result for estimating variance that divides by N–1 rather than by N … that estimator is unbiased, this estimate is biased (but asymptotically unbiased)

 P̂dB = 10 log10[ (1/N) Σ_{n=0}^{N–1} x²[n] ]    — using the Invariance Property!

Page 228: Class Notes

5

7.7: Numerical Determination of MLE
Note: In all previous examples we ended up with a closed-form expression for the MLE:  θ̂_ML = f(x)

Ex. 7.11:  x[n] = r^n + w[n],  noise is N(0, σ²) & white. Estimate r.
If –1 < r < 0 then this signal is a decaying oscillation that might be used to model:
 • A ship's “hull ping”
 • A vibrating string, etc.

To find the MLE:

 ∂ ln p(x; θ)/∂θ = 0   ⇒   Σ_{n=0}^{N–1} ( x[n] – r^n ) n r^{n–1} = 0

No closed-form solution for the MLE

Page 229: Class Notes

6

So…we can’t always find a closed-form MLE!But a main advantage of MLE is:

We can always find it numerically!!!(Not always computationally efficiently, though)

Brute Force MethodCompute p(x;θ ) on a fine grid of θ values

Advantage: Sure to Find maximum (if grid is fine enough)

Disadvantage: Lots of Computation (especially w/ a fine grid)

p(x;θ )

θ

Page 230: Class Notes

7

Iterative Methods for Numerical MLE
Step #1: Pick some “initial estimate” θ̂0
Step #2: Iteratively improve it using  θ̂_{i+1} = f(θ̂_i, x)  such that  lim_{i→∞} p(x; θ̂_i) = max_θ p(x; θ)

Hill Climbing in the Fog
[Sketch: p(x; θ) versus θ with the iterates θ̂0, θ̂1, θ̂2, … climbing toward the peak.]

Note: A so-called “greedy” maximization algorithm will always move up, even though taking an occasional step downward may be the better global strategy!

Convergence Issues:
 1. May not converge
 2. May converge, but to a local maximum
  - good initial guess is needed!!
  - can use a rough grid search to initialize
  - can use multiple initializations

Page 231: Class Notes

8

Iterative Method: Newton-Raphson MLE
The MLE is the maximum of the LF… so set the derivative to 0:

 g(θ) ≜ ∂ ln p(x; θ)/∂θ = 0

So… the MLE is a zero of g(θ). Newton-Raphson is a numerical method for finding the zero of a function… so it can be applied here… linearize g(θ) with a truncated Taylor series:

 g(θ) ≈ g(θ̂k) + [ dg(θ)/dθ ]|_{θ = θ̂k} ( θ – θ̂k )

Set this = 0 and solve for θ, calling the solution θ̂_{k+1}:

 θ̂_{k+1} = θ̂k – g(θ̂k) / [ dg(θ)/dθ ]|_{θ = θ̂k}

[Sketch: successive iterates θ̂0, θ̂1, θ̂2, … converging to the zero of g(θ).]

Page 232: Class Notes

9

Now… using our “definition of convenience” g(θ) = ∂ ln p(x; θ)/∂θ, the Newton-Raphson MLE iteration is:

 θ̂_{k+1} = θ̂k – [ ∂² ln p(x; θ)/∂θ² ]^{–1} [ ∂ ln p(x; θ)/∂θ ]  evaluated at θ = θ̂k

Iterate until a convergence criterion is met:  | θ̂_{k+1} – θ̂k | < ε

Look familiar??? Looks like I(θ), except: I(θ) is evaluated at the true θ and has an expected value. You get to choose!

Generally: For a given PDF model, compute the derivatives analytically… or compute them numerically, e.g.

 ∂ ln p(x; θ)/∂θ |_{θ̂k} ≈ [ ln p(x; θ̂k + ∆θ) – ln p(x; θ̂k) ] / ∆θ

Page 233: Class Notes

10

Convergence Issues of Newton-Raphson:
 1. May not converge
 2. May converge, but to a local maximum
  - good initial guess is needed!!
  - can use a rough grid search to initialize
  - can use multiple initializations

[Sketch: ∂ ln p(x; θ)/∂θ versus θ, showing iterates θ̂0, θ̂1, θ̂2, θ̂3 that can jump around and fail to converge to the desired zero.]

θ

Some Other Iterative MLE Methods1. Scoring Method

• Replaces second-partial term by I(θ )2. Expectation-Maximization (EM) Method

• Guarantees convergence to at least a local maximum• Good for complicated multi-parameter cases

Page 234: Class Notes

11

7.8 MLE for Vector Parameter
Another nice property of MLE is how easily it carries over to the vector parameter case.

The vector parameter is:  θ = [ θ1  θ2 …  θp ]T

θ̂_ML is the vector that satisfies:

 ∂ ln p(x; θ)/∂θ = 0

where the derivative w.r.t. a vector is

 ∂f(θ)/∂θ = [ ∂f(θ)/∂θ1   ∂f(θ)/∂θ2  …  ∂f(θ)/∂θp ]T

Page 235: Class Notes

12

Ex. 7.12: Estimate DC Level and Variance
 x[n] = A + w[n],  noise is N(0, σ²) and white.
 Estimate: DC level A and noise variance σ²   ⇒   θ = [A  σ²]T

The LF is:

 p(x; A, σ²) = [1 / (2πσ²)^{N/2}] exp{ –(1/(2σ²)) Σ_{n=0}^{N–1} ( x[n] – A )² }

Solve  ∂ ln p(x; θ)/∂θ = 0:

 ∂ ln p/∂A = (1/σ²) Σ_{n=0}^{N–1} ( x[n] – A ) = (N/σ²)( x̄ – A ) = 0

 ∂ ln p/∂σ² = –N/(2σ²) + (1/(2σ⁴)) Σ_{n=0}^{N–1} ( x[n] – A )² = 0

 ⇒  θ̂_ML = [ x̄    (1/N) Σ_n ( x[n] – x̄ )² ]T

Interesting: For this problem… first estimate A just like the scalar case, then subtract it off and estimate the variance like the scalar case.

Page 236: Class Notes

13

Properties of Vector ML
The asymptotic properties are captured in Theorem 7.3:

If p(x; θ) satisfies some “regularity” conditions, then the MLE is asymptotically distributed according to

 θ̂_ML ~a N( θ, I^{–1}(θ) )

where I(θ) = Fisher Information MatrixSo the vector ML is asymptotically:

• unbiased • efficient

Invariance Property Holds for Vector Case

If α = g (θ ), then )ˆ(ˆ MLML g θα =

Page 237: Class Notes

14

Ex. 7.12 Revisited
It can be shown that:

 E[θ̂] = [ A    (N–1)σ²/N ]T,     cov(θ̂) = ⎡ σ²/N        0            ⎤
                                            ⎣ 0       2(N–1)σ⁴/N²     ⎦

For large N then:

 E[θ̂] ≈ [ A   σ² ]T,     cov(θ̂) ≈ ⎡ σ²/N     0      ⎤  = I^{–1}(θ)
                                     ⎣ 0       2σ⁴/N   ⎦

which we see satisfies the asymptotic property.

A diagonal covariance matrix shows the estimates are uncorrelated: the error ellipse (in the Â–σ̂² plane) is aligned with the axes. This is why we could “decouple” the estimates.

Page 238: Class Notes

15

MLE for the General Gaussian Case
Let the data be general Gaussian:  x ~ N( µ(θ), C(θ) )

Thus ∂ ln p(x; θ)/∂θ will depend in general on  ∂µ(θ)/∂θ  and  ∂C(θ)/∂θ.

For each k = 1, 2, …, p set  ∂ ln p(x; θ)/∂θk = 0. This gives p simultaneous equations, the kth one being:

 –½ tr[ C^{–1}(θ) ∂C(θ)/∂θk ]  +  [ ∂µ(θ)/∂θk ]T C^{–1}(θ) [ x – µ(θ) ]  +  ½ [ x – µ(θ) ]T C^{–1}(θ) [ ∂C(θ)/∂θk ] C^{–1}(θ) [ x – µ(θ) ]  =  0
        (Term #1)                          (Term #2)                                         (Term #3)

Note: for the deterministic signal + noise case: Terms #1 & #3 are zero

This gives general conditions to find the MLE… but can’t always solve it!!!

Page 239: Class Notes

16

MLE for Linear Model Case
The signal model is:  x = Hθ + w  with the noise w ~ N(0, C).

So Terms #1 & #3 are zero and Term #2 gives (for this case we can solve these equations!):

 HT C^{–1} ( x – Hθ ) = 0

Solving this gives:

 θ̂_ML = ( HT C^{–1} H )^{–1} HT C^{–1} x

Hey! Same as Chapter 4's MVU for the linear model.
Recall: the Linear Model is specified to have Gaussian noise. For the Linear Model: ML = MVU.

 θ̂_ML ~ N( θ, ( HT C^{–1} H )^{–1} )    EXACT… Not Asymptotic!!

Page 240: Class Notes

17

Numerical Solutions for Vector Case
Obvious generalizations… see p. 187

There is one issue to be aware of, though: the numerical implementation needs ∂ ln p(x; θ)/∂θ.

For the general Gaussian case this requires ∂C^{–1}(θ)/∂θ … often hard to analytically get C^{–1}(θ) and then differentiate! So… we use (3C.2):

 ∂C^{–1}(θ)/∂θk = –C^{–1}(θ) [ ∂C(θ)/∂θk ] C^{–1}(θ)

 (get ∂C(θ)/∂θk analytically; get C^{–1}(θ) numerically)

GetNumerically

Page 241: Class Notes

18

7.9 Asymptotic MLE

Useful when data samples x[n] come from a WSS process

Reading Assignment Only

Page 242: Class Notes

1

7.10 MLE Examples
We'll now apply the MLE theory to several examples of practical signal processing problems.

These are the same examples for which we derived the CRLB in Ch. 3

1. Range Estimation – sonar, radar, robotics, emitter location

2. Sinusoidal Parameter Estimation (Amp., Frequency, Phase)– sonar, radar, communication receivers (recall DSB Example), etc.

3. Bearing Estimation – sonar, radar, emitter location

4. Autoregressive Parameter Estimation– speech processing, econometrics

See Book

We Will

Cover

Page 243: Class Notes

2

Ex. 1 Range Estimation Problem
Transmit Pulse: s(t) nonzero over t ∈ [0, Ts]
Receive Reflection: s(t – τo)
Measure Time Delay: τo

C-T Signal Model:  x(t) = s(t – τo) + w(t),   0 ≤ t ≤ T = Ts + τo,max

[Sketch: pulse s(t), delayed pulse s(t – τo), BPF & Amp front end producing x(t); w(t) is bandlimited white Gaussian noise with PSD No/2 over |f| ≤ B.]

Page 244: Class Notes

3

Range Estimation D-T Signal Model

Sample every ∆ = 1/2B sec:  w[n] = w(n∆) is DT white Gaussian noise with Var σ² = BNo

[Sketch: PSD of w(t) flat at No/2 over |f| ≤ B and its ACF with zeros at multiples of 1/2B.]

 x[n] = s[n – no] + w[n],   n = 0, 1, …, N–1

         ⎧ w[n]                0 ≤ n ≤ no – 1
 x[n] =  ⎨ s[n – no] + w[n]    no ≤ n ≤ no + M – 1
         ⎩ w[n]                no + M ≤ n ≤ N – 1

s[n; no] … has M non-zero samples starting at no,  with  no ≈ τo/∆

Page 245: Class Notes

4

Range Estimation Likelihood Function
White and Gaussian ⇒ independent ⇒ product of PDFs; 3 different PDFs — one for each subinterval:

 p(x; no) = ∏_{n=0}^{no–1} C exp{ –x²[n]/(2σ²) }  ·  ∏_{n=no}^{no+M–1} C exp{ –( x[n] – s[n – no] )²/(2σ²) }  ·  ∏_{n=no+M}^{N–1} C exp{ –x²[n]/(2σ²) },   C = 1/√(2πσ²)

    (interval #1)                     (interval #2)                                 (interval #3)

Expand the middle term to get an x²[n] term… group it with the other x²[n] terms:

 p(x; no) = C^N exp{ –Σ_{n=0}^{N–1} x²[n]/(2σ²) } · exp{ –(1/(2σ²)) Σ_{n=no}^{no+M–1} ( –2 x[n] s[n – no] + s²[n – no] ) }

The first factor does not depend on no, so the MLE must minimize the remaining sum (or maximize its negative) over values of no.

Page 246: Class Notes

5

Range Estimation ML Condition

 Σ_{n=no}^{no+M–1} ( –2 x[n] s[n – no] + s²[n – no] )

The Σ s²[n – no] term doesn't depend on no! … the summand moves with the limits as no changes. So maximize:

 Σ_{n=no}^{no+M–1} x[n] s[n – no]  =  Σ_{n=0}^{N–1} x[n] s[n – no]

 (because s[n – no] = 0 outside the summation range… so we can extend it!)

So…. the MLE implementation is based on cross-correlation: “correlate” the received signal x[n] with the transmitted signal s[n]:

 n̂o = arg max_{0 ≤ m ≤ N–M} C_xs[m],     C_xs[m] = Σ_{n=0}^{N–1} x[n] s[n – m]

Page 247: Class Notes

6

Range Estimation MLE Viewpoint

 C_xs[m] = Σ_{n=0}^{N–1} x[n] s[n – m]

[Sketch: C_xs[m] versus m, with its peak at m = no.]

Warning: When the signals are complex (e.g., ELPS) find the peak of |C_xs[m]|.

 • Think of this as an inner product for each m
 • Compare the data x[n] to all possible delays of the signal s[n]
 ! pick no to make them most alike
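As a minimal sketch of this cross-correlation delay estimator (synthetic data; the pulse shape, record length, true delay, and noise level are assumed):

import numpy as np

rng = np.random.default_rng(8)
N, M, no_true, sigma = 1000, 64, 317, 1.0
s = np.hanning(M)                       # assumed transmit pulse shape (M samples)

x = sigma * rng.standard_normal(N)
x[no_true:no_true + M] += s             # delayed pulse buried in noise

# C_xs[m] = sum_n x[n] s[n - m] for each candidate delay m
Cxs = np.array([np.dot(x[m:m + M], s) for m in range(N - M + 1)])
no_hat = int(np.argmax(Cxs))
print(no_hat, no_true)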

Page 248: Class Notes

7

Ex. 2 Sinusoid Parameter Estimation Problem
Given DT signal samples of a sinusoid in noise…. estimate its amplitude, frequency, and phase:

 x[n] = A cos(Ωo n + φ) + w[n],   n = 0, 1, …, N–1

Ωo is DT frequency in cycles/sample: 0 < Ωo < π;  w[n] is DT white Gaussian noise, zero mean & variance σ².

Multiple parameters… so parameter vector:  θ = [A  Ωo  φ]T

The likelihood function is:

 p(x; θ) = C^N exp{ –(1/(2σ²)) J(A, Ωo, φ) },   with  J(A, Ωo, φ) = Σ_{n=0}^{N–1} ( x[n] – A cos(Ωo n + φ) )²

For the MLE: minimize J(A, Ωo, φ).

For MLE: Minimize This

Page 249: Class Notes

8

Sinusoid Parameter Estimation ML Condition
To make things easier…

Define an equivalent parameter set:

[α1 α2 Ωo ]T α1 = Acos(φ) α2 = –Asin(φ)

Then… J'(α1 ,α2,Ωo) = J(A,Ωo,φ) α = [α1 α2]T

Define:

c(Ωo) = [1 cos(Ωo) cos(Ωo2) … cos(Ωo(N-1))]T

s(Ωo) = [0 sin(Ωo) sin(Ωo2) … sin(Ωo(N-1))]T

and…

H(Ωo) = [c(Ωo) s(Ωo)] an Nx2 matrix

Page 250: Class Notes

9

Then: J'(α1 ,α2,Ωo) = [x – H (Ωo) α]T [x – H (Ωo) α]

Looks like the linear model case… except for Ωo dependence of H (Ωo)

Thus, for any fixed Ωo value, the optimal α estimate is

 α̂ = [ HT(Ωo) H(Ωo) ]^{–1} HT(Ωo) x

Then plug that into J'(α1, α2, Ωo):

 J'(α̂1, α̂2, Ωo) = [ x – H(Ωo) α̂ ]T [ x – H(Ωo) α̂ ]
        = xT [ I – H(Ωo) ( HT(Ωo) H(Ωo) )^{–1} HT(Ωo) ] x
        = xT x – xT H(Ωo) ( HT(Ωo) H(Ωo) )^{–1} HT(Ωo) x

 ⇒ minimizing w.r.t. Ωo  ⇔  maximizing  xT H(Ωo) ( HT(Ωo) H(Ωo) )^{–1} HT(Ωo) x  over Ωo

Page 251: Class Notes

10

Sinusoid Parms. Exact MLE Procedure

Step 1: Find  Ω̂o = arg max_{0 ≤ Ωo ≤ π}  xT H(Ωo) ( HT(Ωo) H(Ωo) )^{–1} HT(Ωo) x   (done numerically; equivalently, minimize J')

Step 2: Use the result of Step 1 to get  α̂ = [ HT(Ω̂o) H(Ω̂o) ]^{–1} HT(Ω̂o) x

Step 3: Convert the Step 2 result by solving for Â & φ̂:

 α̂1 = Â cos(φ̂),    α̂2 = –Â sin(φ̂)

Page 252: Class Notes

11

Sinusoid Parms. Approx. MLE Procedure
First we look at a specific structure:

 xT H ( HT H )^{–1} HT x = [ cT(Ωo) x   sT(Ωo) x ] ⎡ cT c   cT s ⎤^{–1} ⎡ cT(Ωo) x ⎤
                                                    ⎣ sT c   sT s ⎦      ⎣ sT(Ωo) x ⎦

Then… if Ωo is not near 0 or π, then approximately

 HT H ≈ (N/2) I   ⇒   ( HT H )^{–1} ≈ (2/N) I

and Step 1 becomes

 Ω̂o = arg max_{0 ≤ Ω ≤ π} (2/N) | Σ_{n=0}^{N–1} x[n] exp(–jΩn) |²  =  arg max_{0 ≤ Ω ≤ π} |X(Ω)|²    ← DTFT of the data x[n]

and Steps 2 & 3 become

 Â = (2/N) | X(Ω̂o) |,    φ̂ = ∠X(Ω̂o)

Page 253: Class Notes

12

The processing is implemented as follows (a sketch of these steps follows below):

Given the data: x[n], n = 0, 1, 2, …, N–1

 1. Compute the DFT X[m], m = 0, 1, 2, …, M–1 of the data
  • Zero-pad to length M = 4N to ensure a dense grid of frequency points
  • Use the FFT algorithm for computational efficiency
 2. Find the location of the peak
  • Use quadratic interpolation of |X[m]|
 3. Find the height at the peak
  • Use quadratic interpolation of |X[m]|
 4. Find the angle at the peak
  • Use linear interpolation of ∠X[m]

[Sketch: |X(Ω)| with its peak at Ωo, and ∠X(Ω) evaluated at Ωo.]
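As a minimal sketch of this FFT-based approximate MLE (synthetic data; the parameters are assumed, and the peak is taken from the raw zero-padded grid without the interpolation steps above):

import numpy as np

rng = np.random.default_rng(9)
N, A, Om_true, phi_true, sigma = 256, 1.0, 0.3 * np.pi, -0.4, 0.7
n = np.arange(N)
x = A * np.cos(Om_true * n + phi_true) + sigma * rng.standard_normal(N)

M = 4 * N                                   # zero-pad for a dense frequency grid
X = np.fft.fft(x, M)
Om_grid = 2 * np.pi * np.arange(M) / M
keep = (Om_grid > 0) & (Om_grid < np.pi)    # search 0 < Omega < pi only

m_pk = np.flatnonzero(keep)[np.argmax(np.abs(X[keep]))]   # peak location on the grid
Om_hat = Om_grid[m_pk]
A_hat = 2.0 / N * np.abs(X[m_pk])           # A_hat = (2/N)|X(Om_hat)|
phi_hat = np.angle(X[m_pk])                 # phi_hat = angle of X(Om_hat)
print(Om_hat, Om_true, A_hat, phi_hat, phi_true)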

Page 254: Class Notes

13

Ex. 3 Bearing Estimation MLE
(Figure 3.8 from textbook: uniform linear array; the target emits or reflects the signal s(t) = A cos(2π fo t + φ) — simple model.)

Grab one “snapshot” of all M sensors at a single instant ts:

 x[n] = sn(ts) + w[n] = A cos( Ωs n + φ̃ ) + w[n]

Same as Sinusoidal Estimation!! So Compute DFT and Find Location of Peak!!

If emitted signal is not a sinusoid then you get a different MLE!!

Page 255: Class Notes

1

MLE for TDOA/FDOA Location

Overview Estimating TDOA/FDOA Estimating Geo-Location

Page 256: Class Notes

2

)(ts

tjetts 1)( 1ω−

Data Link

Data Link

tjetts 2)( 2ω−

tjetts 3)( 3ω−

MULTIPLE-PLATFORM LOCATION

Emitter to be located

Page 257: Class Notes

3

TDOA/FDOA LOCATION

[Sketch: same three-platform geometry; each TDOA defines a hyperbola and each FDOA defines another constant-difference curve.]

 TDOA (Time-Difference-Of-Arrival):   τ21 = t2 – t1 = constant,   τ23 = t2 – t3 = constant

 FDOA (Frequency-Difference-Of-Arrival):   ν21 = ω2 – ω1 = constant,   ν23 = ω2 – ω3 = constant

Page 258: Class Notes

4

Estimating TDOA/FDOA

Page 259: Class Notes

5

SIGNAL MODEL
 ! Will process the equivalent lowpass (LPE) signal, BW = B Hz, representing an RF signal with RF BW = B Hz
 ! Sampled at Fs > B complex samples/sec
 ! Collection time T sec
 ! At each receiver:

[Block diagram: RF signal XRF(f) → BPF → ADC → make LPE signal → equalize, yielding the lowpass-equivalent spectrum XLPE(f) on [–B/2, B/2].]

Page 260: Class Notes

6

DOPPLER & DELAY MODEL

[Sketch: Tx sends s(t); Rx at range R(t) receives sr(t) = s(t – τ(t)).]

Propagation time: τ(t) = R(t)/c,  with  R(t) = Ro + vt + (a/2)t² + …

Use a linear approximation — assumes a small change in velocity over the observation interval:

 sr(t) = s( t – [Ro + vt]/c ) = s( [1 – v/c] t – Ro/c )

 (time scaling by (1 – v/c) plus a time delay τd)   — for real BP signals

Page 261: Class Notes

7

Analytic Signals Model

Analytic signal of Tx:  s̃(t) = E(t) e^{j[ωc t + φ(t)]}

Now what? Notice that v << c ⇒ (1 – v/c) ≈ 1. Say v = 300 m/s (670 mph); then v/c = 300/(3×10⁸) = 10⁻⁶, so (1 – v/c) differs from 1 by only one part in 10⁶.

Now assume E(t) & φ(t) vary slowly enough that

 E( (1 – v/c)t ) ≈ E(t),    φ( (1 – v/c)t ) ≈ φ(t)    for the range of v of interest

DOPPLER & DELAY MODEL (continued)

Analytic signal of Rx:

 s̃r(t) = s̃( (1 – v/c)t – τd ) = E( (1 – v/c)t – τd ) e^{j[ ωc( (1 – v/c)t – τd ) + φ( (1 – v/c)t – τd ) ]}

Called the Narrowband Approximation

Called Narrowband Approximation

Page 262: Class Notes

8

 s̃r(t) = E(t – τd) e^{j[ ωc t – (ωc v/c) t – ωc τd + φ(t – τd) ]}
      = e^{–jωc τd} · e^{–j(ωc v/c) t} · e^{jωc t} · [ E(t – τd) e^{jφ(t – τd)} ]

 • constant phase term, α (set by ωc τd)
 • Doppler shift term, ωd = ωc v/c
 • carrier term
 • transmitted signal's LPE signal, time-shifted by τd

Narrowband Lowpass Equivalent Signal Model:

 sr(t) = e^{jα} e^{–jωd t} s(t – τd)

This is the signal that actually gets processed digitally.

DOPPLER & DELAY MODEL (continued)

Page 263: Class Notes

9

CRLB for TDOA
We already showed the CRLB for the active sensor case.

But here we need to estimate the delay between two noisy signals rather than between a noisy one and a clean one.

The only difference in the result is: replace SNR by an effective SNR given by

 1/SNReff = 1/SNR1 + 1/SNR2 + 1/(SNR1·SNR2),    SNReff ≈ min(SNR1, SNR2)

 C(TDOA) = 1 / ( 8π² × N × SNReff × Brms² )

where Brms is an effective (normalized) bandwidth of the signal computed from the DFT values S[k]:

 Brms² = (1/N²) Σ_{k=–N/2}^{N/2–1} k² |S[k]|²  /  Σ_{k=–N/2}^{N/2–1} |S[k]|²

Page 264: Class Notes

10

CRLB for TDOA (cont.)
A more familiar form for this is in terms of the C-T version of the problem:

 σ²_TDOA ≥ 1 / ( (2π Brms)² × BT × SNReff )    (seconds²),    Brms² = ∫ f² |S(f)|² df / ∫ |S(f)|² df

 BT = time-bandwidth product (≈ N, the number of samples in DT)
 B = noise bandwidth of the receiver (Hz)
 T = collection time (sec)

BT is called the Coherent Processing Gain (same effect as the DFT processing gain on a sinusoid).

For a signal with a rectangular spectrum of RF width Bs, the bound becomes:

 σ_TDOA ≥ 0.55 / ( Bs √( BT × SNReff ) )

σ

S. Stein, Algorithms for Ambiguity Function Processing, IEEE Trans. on ASSP, June 1981

Page 265: Class Notes

11

CRLB for FDOA
Here we take advantage of the time-frequency duality of the FT:

 C(FDOA) = 1 / ( 8π² × N × SNReff × Trms² )

where Trms is an effective (normalized) duration of the signal computed from the signal samples s[n]:

 Trms² = (1/N²) Σ_{n=–N/2}^{N/2–1} n² |s[n]|²  /  Σ_{n=–N/2}^{N/2–1} |s[n]|²

Again we use the same effective SNR:

 1/SNReff = 1/SNR1 + 1/SNR2 + 1/(SNR1·SNR2),    SNReff ≈ min(SNR1, SNR2)

Page 266: Class Notes

12

CRLB for FDOA (cont.)

A more familiar form for this is in terms of the C-T version of the problem:

 σ²_FDOA ≥ 1 / ( (2π Trms)² × BT × SNReff )    (Hz²),    Trms² = ∫ t² |s(t)|² dt / ∫ |s(t)|² dt

For a signal with a constant envelope of duration Ts, the bound becomes:

 σ_FDOA ≥ 0.55 / ( Ts √( BT × SNReff ) )

S. Stein, Algorithms for Ambiguity Function Processing, IEEE Trans. on ASSP, June 1981

Page 267: Class Notes

13

Interpreting CRLBs for TDOA/FDOA
The C-T versions of the two bounds side by side:

 σ²_TDOA ≥ 1 / ( (2π Brms)² × BT × SNReff )        σ²_FDOA ≥ 1 / ( (2π Trms)² × BT × SNReff )

 • BT pulls the signal up out of the noise
 • Large Brms improves TDOA accuracy
 • Large Trms improves FDOA accuracy

Two examples of accuracy bounds:

 SNR1    SNR2     T = Ts     B = Bs     σTDOA      σFDOA
 3 dB    30 dB    1 ms       1 MHz      17.4 ns    17.4 Hz
 3 dB    30 dB    100 ms     10 kHz     1.7 µs     0.17 Hz

Page 268: Class Notes

14

MLE for TDOA/FDOA

S. Stein, "Differential Delay/Doppler ML Estimation with Unknown Signals," IEEE Trans. on SP, August 1993

We already showed that the ML Estimate of delay for the active sensor case is the Cross-Correlation of the time signals:

   C(τ) = ∫_0^T s_1(t) s_2*(t + τ) dt              → Find the Peak of |C(τ)|

By time-frequency duality the ML estimate for Doppler shift should be the Cross-Correlation of the FTs, which is mathematically equivalent to

   C(ω) = ∫_0^T s_1(t) s_2*(t) e^{−jωt} dt         → Find the Peak of |C(ω)|

The ML estimate of the TDOA/FDOA together has been shown to be:

   A(ω, τ) = ∫_0^T s_1(t) s_2*(t + τ) e^{−jωt} dt  → Find the Peak of |A(ω,τ)|

Page 269: Class Notes

15

ML Estimator for TDOA/FDOA (cont.)

[Block diagram: the LPE Rx signals s_1(t) and s_2(t) at two receivers, where one is a delayed (τ_d), Doppler-shifted (ω_d), and phase-rotated (e^{jα}) version of the other, are compared (cross-correlated) for all Delays τ and Dopplers ω; the surface |A(ω,τ)| peaks at (ω_d, τ_d). Find the Peak of |A(ω,τ)|.]

Called: Ambiguity Function / Complex Ambiguity Function (CAF) / Cross-Correlation Surface

Page 270: Class Notes

16

ML Estimator for TDOA/FDOA (cont.)

How well do we expect the Cross-Correlation Processing to perform?

Well, it is just the ML estimator, so it is not necessarily optimum.

But we know that an ML estimate is asymptotically:
  • Unbiased
  • Efficient (that means it achieves the CRLB)
  • Gaussian

   θ̂_ML ~ N( θ, I⁻¹(θ) )

Those are some VERY nice properties that we can make use of in our location accuracy analysis!!!

Page 271: Class Notes

17

Properties of the CAF

Consider when τ = τ_d:

   A(ω, τ_d) = ∫_0^T |s(t)|² e^{jω_d t} e^{−jωt} dt

like a windowed FT of a sinusoid, where the window is |s(t)|².
[Plot: |A(ω, τ_d)| vs. ω, peaked at ω_d, main-lobe width ~ 1/T]

Consider when ω = ω_d:

   A(ω_d, τ) = ∫_0^T s(t) s*(t + τ − τ_d) dt

a correlation.
[Plot: |A(ω_d, τ)| vs. τ, peaked at τ_d, main-lobe width ~ 1/BW]

Page 272: Class Notes

18

TDOA ACCURACY REVISITED

TDOA Accuracy depends on:
  » Effective SNR: SNR_eff
  » RMS Width: B_rms = RMS Bandwidth

   B_rms² = ∫ f² |S(f)|² df / ∫ |S(f)|² df

[Plots: the cross-correlation function vs. TDOA has main-lobe width ~ 1/B_rms.
 Narrow B_rms case ⇒ wide XCorr peak ⇒ poor accuracy.
 Wide B_rms case ⇒ narrow XCorr peak ⇒ good accuracy.
 Low effective SNR causes spurious peaks on the XCorr function; a narrow XCorr function is less susceptible to spurious peaks.]

Page 273: Class Notes

19

FDOA ACCURACY REVISITED

FDOA Accuracy depends on:
  » Effective SNR: SNR_eff
  » RMS Width: D_rms = RMS Duration

   D_rms² = ∫ t² |s(t)|² dt / ∫ |s(t)|² dt

[Plots: the cross-correlation function vs. FDOA has main-lobe width ~ 1/D_rms.
 Narrow D_rms case ⇒ wide XCorr peak ⇒ poor accuracy.
 Wide D_rms case ⇒ narrow XCorr peak ⇒ good accuracy.
 Low effective SNR causes spurious peaks on the XCorr function; a narrow XCorr function is less susceptible to spurious peaks.]

Page 274: Class Notes

20

COMPUTING THE AMBIGUITY FUNCTION

Direct computation based on the equation for the ambiguity function leads to computationally inefficient methods.

In EECE 521 notes we showed how to use decimation to efficiently compute the ambiguity function
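For reference, here is a minimal Python sketch of the direct (inefficient) CAF computation mentioned above; it is not the decimation-based method from the EECE 521 notes. The function name, lag grid, and Doppler grid are illustrative assumptions, and the circular shift is only adequate for a sketch.

```python
import numpy as np

def caf(s1, s2, fs, max_lag, dopplers):
    """Direct computation of A(f, tau) = sum_n s1[n]*conj(s2[n+tau])*exp(-j*2*pi*f*n/fs)."""
    N = len(s1)
    lags = np.arange(-max_lag, max_lag + 1)
    n = np.arange(N)
    A = np.zeros((len(dopplers), len(lags)), dtype=complex)
    for i, f in enumerate(dopplers):
        carrier = np.exp(-1j * 2 * np.pi * f * n / fs)
        for k, lag in enumerate(lags):
            s2_shift = np.roll(s2, -lag)            # s2[n + lag] (circular shift; fine for a sketch)
            A[i, k] = np.sum(s1 * np.conj(s2_shift) * carrier)
    return A, lags

# The TDOA/FDOA estimates come from the peak of |A|:
# A, lags = caf(s1, s2, fs, max_lag=50, dopplers=np.arange(-200, 201, 10))
# i, k = np.unravel_index(np.argmax(np.abs(A)), A.shape)
```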

Page 275: Class Notes

21

Estimating Geo-Location

Page 276: Class Notes

22

TDOA/FDOA LOCATION

Centralized Network of P Platforms → P-Choose-2 Pairs
[Diagram: P platforms connected by data links to a central processing node]

# TDOA Measurements: P-Choose-2
# FDOA Measurements: P-Choose-2

Warning: Watch out for the Correlation Effect Due to Signal-Data-In-Common

Page 277: Class Notes

23

TDOA/FDOA LOCATION

Pair-Wise Network of P Platforms → P/2 Pairs
[Diagram: P platforms grouped into disjoint pairs]

# TDOA Measurements: P/2
# FDOA Measurements: P/2

There are many ways to select the P/2 pairs. Warning: Not all pairings are equally good!!! (In the diagram, the dashed pairs are better.)

Page 278: Class Notes

24

TDOA/FDOA Measurement Model

Given N TDOA/FDOA measurements with corresponding 2×2 Cov. Matrices:

   (τ̂_1, ν̂_1), (τ̂_2, ν̂_2), …, (τ̂_N, ν̂_N)          C_1, C_2, …, C_N

For notational purposes define the 2N measurements r(n), n = 1, 2, …, 2N:

   r_{2n−1} = τ̂_n,   n = 1, 2, …, N
   r_{2n}   = ν̂_n,   n = 1, 2, …, N

   Data Vector:   r = [ r_1  r_2  …  r_2N ]^T

Now, those are the TDOA/FDOA estimates, so the true values are notated as:

   (τ_1, ν_1), (τ_2, ν_2), …, (τ_N, ν_N)

   s_{2n−1} = τ_n,   n = 1, 2, …, N
   s_{2n}   = ν_n,   n = 1, 2, …, N

   Signal Vector:   s = [ s_1  s_2  …  s_2N ]^T

Assume a pair-wise network, so the TDOA/FDOA pairs are uncorrelated.

Page 279: Class Notes

25

TDOA/FDOA Measurement Model (cont.)

Each of these measurements r(n) has an error ε(n) associated with it, so

   r = s + ε

Because these measurements were estimated using an ML estimator (with a sufficiently large number of signal samples) we know that the error vector ε is a zero-mean Gaussian vector with cov. matrix C given by:

   C = diag{ C_1, C_2, …, C_N }        (block diagonal; assumes that the TDOA/FDOA pairs are uncorrelated!!!)

The true TDOA/FDOA values depend on:
  • Emitter Parms: (xe, ye, ze) and transmit frequency fe  →  xe = [ xe  ye  ze  fe ]^T
  • Receivers' Nav Data (positions & velocities): the totality of it is called xr

   r = s(xe; xr) + ε        Deterministic Signal + Gaussian Noise; the signal is nonlinearly related to the parms

To complete the model we need to know how s(xe; xr) depends on xe and xr. Thus we need to find TDOA & FDOA as functions of xe and xr.
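A small Python sketch of the bookkeeping above: it interleaves the N TDOA/FDOA estimates into the data vector r following the convention r = [τ̂_1, ν̂_1, τ̂_2, ν̂_2, …] and builds the block-diagonal covariance C. The function name is my own; it is not from the notes.

```python
import numpy as np
from scipy.linalg import block_diag

def stack_measurements(tdoa, fdoa, covs):
    """Interleave N TDOA/FDOA estimates and build C = diag(C1, ..., CN)."""
    r = np.empty(2 * len(tdoa))
    r[0::2] = tdoa                    # tau_1, tau_2, ... at the odd (1-based) positions
    r[1::2] = fdoa                    # nu_1, nu_2, ...  at the even (1-based) positions
    C = block_diag(*covs)             # each covs[k] is the 2x2 covariance of (tau_k, nu_k)
    return r, C
```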

Page 280: Class Notes

26

TDOA/FDOA Measurement Model (cont.)

Here we'll simplify to the x-y plane; the extension is straightforward.

Two Receivers with: (x1, y1, Vx1, Vy1) and (x2, y2, Vx2, Vy2).   Emitter with: (xe, ye).
(Let Ri be the range between Receiver i and the emitter; c is the speed of light.)

The TDOA and FDOA are given by:

   s_1(xe, ye) = τ_1 = (R_1 − R_2)/c
               = (1/c) [ sqrt( (x1 − xe)² + (y1 − ye)² ) − sqrt( (x2 − xe)² + (y2 − ye)² ) ]

   s_2(xe, ye, fe) = ν_1 = −(fe/c) · d/dt (R_1 − R_2)
               = −(fe/c) [ ( Vx1(x1 − xe) + Vy1(y1 − ye) ) / sqrt( (x1 − xe)² + (y1 − ye)² )
                           − ( Vx2(x2 − xe) + Vy2(y2 − ye) ) / sqrt( (x2 − xe)² + (y2 − ye)² ) ]
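A short Python sketch of the 2-D TDOA/FDOA model above for one receiver pair, under the same constant-velocity geometry and the sign convention used in the reconstruction (ν = −(fe/c)·d/dt(R1 − R2)). Names and example numbers are illustrative.

```python
import numpy as np

C_LIGHT = 3e8  # m/s

def tdoa_fdoa(xe, ye, fe, rx1, rx2):
    """TDOA (s) and FDOA (Hz) for two receivers; each rx = (x, y, Vx, Vy)."""
    def range_and_rate(rx):
        x, y, vx, vy = rx
        dx, dy = x - xe, y - ye
        R = np.hypot(dx, dy)                 # range to the emitter
        Rdot = (vx * dx + vy * dy) / R       # range rate
        return R, Rdot
    R1, R1dot = range_and_rate(rx1)
    R2, R2dot = range_and_rate(rx2)
    return (R1 - R2) / C_LIGHT, -(fe / C_LIGHT) * (R1dot - R2dot)

# Example: two receivers moving in +x, emitter off to the side
print(tdoa_fdoa(2e4, 1e4, 1e9, (0, 0, 200, 0), (5e3, 0, 200, 0)))
```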

Page 281: Class Notes

27

CRLB for Geo-Location via TDOA/FDOA

Recall: For the General Gaussian Data case the CRLB depends on a FIM that has structure like this:

   [J_x(θ)]_{mn} = [ ∂μ(θ)/∂θ_m ]^T C^{−1}(θ) [ ∂μ(θ)/∂θ_n ] + (1/2) tr[ C^{−1}(θ) (∂C(θ)/∂θ_m) C^{−1}(θ) (∂C(θ)/∂θ_n) ]

   (1st term: variability of the mean w.r.t. the parms; 2nd term: variability of the cov. w.r.t. the parms)

Here we have a deterministic signal plus Gaussian noise, so we only have the 1st term. Using the notation introduced here gives

   C_CRLB(x_e) = [ ( ∂s(x_e; x_r)/∂x_e )^T C^{−1} ( ∂s(x_e; x_r)/∂x_e ) ]^{−1} = [ H^T C^{−1} H ]^{−1}      ($)

H ≜ ∂s/∂x_e is called the Jacobian; for 3-D location with TDOA/FDOA it will be a 2N × 4 matrix whose columns are the derivatives of s w.r.t. each of the 4 parameters.

Page 282: Class Notes

28

CRLB for Geo-Loc. via TDOA/FDOA (cont.)

TDOA/FDOA Jacobian (2N × 4; columns are ∂/∂x_e, ∂/∂y_e, ∂/∂z_e, ∂/∂f_e):

   H(x_e) ≜ ∂s(x_e)/∂x_e =
      [ ∂s_1(x_e)/∂x_e     ∂s_1(x_e)/∂y_e     ∂s_1(x_e)/∂z_e     ∂s_1(x_e)/∂f_e  ]
      [ ∂s_2(x_e)/∂x_e     ∂s_2(x_e)/∂y_e     ∂s_2(x_e)/∂z_e     ∂s_2(x_e)/∂f_e  ]
      [        ⋮                   ⋮                   ⋮                   ⋮      ]
      [ ∂s_2N(x_e)/∂x_e    ∂s_2N(x_e)/∂y_e    ∂s_2N(x_e)/∂z_e    ∂s_2N(x_e)/∂f_e ]

The Jacobian can be computed for any desired Rx-Emitter Scenario. Then plug it into ($) to compute the CRLB for that scenario:

   C_CRLB(x_e) = [ H^T C^{−1} H ]^{−1}
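A minimal Python sketch of evaluating ($): it forms the Jacobian by central finite differences of a user-supplied measurement function s(x_e) (e.g., built from the pairwise TDOA/FDOA model above) and returns (HᵀC⁻¹H)⁻¹. The function name and differencing step are my own choices.

```python
import numpy as np

def crlb_location(s_func, xe, C, eps=1e-3):
    """CRLB covariance (H^T C^-1 H)^-1 with H = ds/dx_e from central differences.
    s_func(xe) must return the 2N-vector of true TDOA/FDOA values for emitter parms xe."""
    xe = np.asarray(xe, dtype=float)
    s0 = np.asarray(s_func(xe))
    H = np.zeros((s0.size, xe.size))
    for j in range(xe.size):
        dx = np.zeros_like(xe)
        dx[j] = eps * max(1.0, abs(xe[j]))
        H[:, j] = (np.asarray(s_func(xe + dx)) - np.asarray(s_func(xe - dx))) / (2 * dx[j])
    Cinv = np.linalg.inv(C)
    return np.linalg.inv(H.T @ Cinv @ H)
```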

Page 283: Class Notes

29

CRLB Studies

The Location CRLB can be used to study various aspects of the emitter location problem. It can be used to study the effect of:
  • Rx-Emitter Geometry and/or Platform Velocity
  • TDOA accuracy vs. FDOA accuracy
  • Number of Platforms
  • Platform Pairings
  • Etc., Etc., Etc.

Once you have computed the CRLB Covariance C_CRLB you can use it to compute and plot error ellipsoids.

Faster than doing Monte Carlo Simulation Runs!!!

[Figure: two sensors with velocity V and an emitter; the error ellipse has axes proportional to k1·σTDOA and k2·σFDOA]

Assumes the Geo-Location Error is Gaussian (usually reasonably valid).

Page 284: Class Notes

30

CRLB Studies (cont.)

Error Ellipsoids: If our C_CRLB is 4×4, how do we get 2-D ellipses to plot??? Projections!

[Figure: a 3-D error ellipsoid in x-y-z space and its projections onto 2-D planes]

Projections of 3-D Ellipsoids onto 2-D Space: the 2-D projections show the expected variation in 2-D space.

Page 285: Class Notes

31

CRLB Studies (cont.)

Another useful thing that can be computed from the CRLB is the CEP for the location problem. For the 2-D case:

Circular Error Probable = radius of a circle that, when centered at the estimate's mean, contains 50% of the estimates.

   CEP ≈ 0.75 · sqrt( λ_1 + λ_2 ) = 0.75 · sqrt( σ_1² + σ_2² )        (accurate to within 10%)

where λ_1, λ_2 are the eigenvalues and σ_1², σ_2² the diagonal elements of the 2-D Cov. Matrix.

[Figure: CEP contours (CEP = 10, 20, 40, 80) plotted vs. Cross Range and Down Range]

CEP Contour Plots are Good Ways to Assess Location Performance.
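A tiny Python helper implementing the CEP approximation above from a 2×2 location covariance; the example matrix is made up.

```python
import numpy as np

def cep_from_cov(C2):
    """Approximate CEP: 0.75*sqrt(lambda1 + lambda2) = 0.75*sqrt(trace(C2))."""
    lam = np.linalg.eigvalsh(np.asarray(C2, dtype=float))
    return 0.75 * np.sqrt(lam.sum())

print(cep_from_cov([[100.0, 20.0], [20.0, 64.0]]))   # meters, for an assumed covariance
```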

Page 286: Class Notes

32

CRLB Studies (cont.)

Geometry and TDOA vs. FDOA Trade-Offs

[Figure: error ellipses for a sensor pair and targets at several geometries, comparing TDOA-only, FDOA-only, and combined TDOA/FDOA processing; depending on the geometry, TDOA is more important, FDOA is more important, or both are important.]

Page 287: Class Notes

33

Estimator for Geo-Location via TDOA/FDOA

Because we have used the ML estimator to get the TDOA/FDOA estimates, the ML's asymptotic properties tell us that we have Gaussian TDOA/FDOA measurements.

Because the TDOA/FDOA measurement model is nonlinear, it is unlikely that we can find a truly optimal estimate, so we again resort to the ML. For the ML of a Nonlinear Signal in Gaussian noise we generally have to proceed numerically.

One way to do Numerical MLE is ML Newton-Raphson (we need the vector version):

   θ_{k+1} = θ_k − [ ∂² ln p(x;θ) / ∂θ ∂θ^T ]^{−1} [ ∂ ln p(x;θ) / ∂θ ]   evaluated at θ = θ_k

   (Gradient: p×1 vector;  Hessian: p×p matrix)

However, the Hessian requires a second derivative. This can add complexity in practice. Alternative: Gauss-Newton Nonlinear Least Squares, based on linearizing the model.

Page 288: Class Notes

1

Chapter 8: Least-Squares Estimation

Page 289: Class Notes

2

8.3 The Least-Squares (LS) Approach

All the previous methods we've studied required a probabilistic model for the data: we needed the PDF p(x;θ).
For a Signal + Noise problem we needed: a Signal Model & a Noise Model.

Least-Squares is not statistically based!!!
  ⇒ Do NOT need a PDF Model
  ⇒ Do NEED a Deterministic Signal Model

[Block diagram, similar to Fig. 8.1(a): the signal model s[n;θ] plus model error δ[n] gives the true signal strue[n;θ]; adding noise (measurement error) w[n] gives the data]

   x[n] = strue[n;θ] + w[n]
        = s[n;θ] + e[n]          where e[n] = δ[n] + w[n] is the model & measurement error

Page 290: Class Notes

3

Least-Squares Criterion

[Block diagram: the data x[n] minus the signal model s[n;θ̂] gives the residual ε[n]; choose the estimate θ̂ to make this "residual" small]

Minimize the LS Cost:

   J(θ) = Σ_{n=0}^{N−1} ε²[n] = Σ_{n=0}^{N−1} ( x[n] − s[n;θ] )²

Ex. 8.1: Estimate DC Level    x[n] = A + e[n] = s[n;θ] + e[n]

To Minimize:

   J(A) = Σ_{n=0}^{N−1} ( x[n] − A )²

Set ∂J(A)/∂A = 0   ⇒   Â = (1/N) Σ_{n=0}^{N−1} x[n] = x̄

Same thing we've gotten before!   Note: If e[n] is WGN, then LS = MVU.

Page 291: Class Notes

4

Weighted LS Criterion

Sometimes not all data samples are equally good:

   x[0], x[1], … , x[N−1]

Say you know x[10] was poor in quality compared to the other data…
You'd want to de-emphasize its importance in the sum of squares:

   J(θ) = Σ_{n=0}^{N−1} w_n ( x[n] − s[n;θ] )²

Set w_n small to de-emphasize a sample.

Page 292: Class Notes

5

8.4 Linear Least-Squares

A linear least-squares problem is one where the parameter-observation model is linear:

   s = Hθ        x = Hθ + e

   x: N×1    H: N×p Known Matrix    θ: p×1    (p = Order of the model)

We must assume that H is full rank… otherwise there are multiple parameter vectors that will map to the same s!!!

Note: Linear LS does NOT mean "fitting a line to data"… although that is a special case:

   s[n] = A + Bn   ⇒   s = Hθ,    H = [ 1 0 ; 1 1 ; 1 2 ; … ; 1 N−1 ],    θ = [ A  B ]^T

Page 293: Class Notes

6

Finding the LSE for the Linear Model

For the linear model the LS cost is:

   J(θ) = Σ_{n=0}^{N−1} ( x[n] − s[n;θ] )² = ( x − Hθ )^T ( x − Hθ )

Now, to minimize, first expand:

   J(θ) = x^T x − x^T Hθ − θ^T H^T x + θ^T H^T Hθ
        = x^T x − 2 x^T Hθ + θ^T H^T Hθ            (a scalar equals its transpose, so θ^T H^T x = (θ^T H^T x)^T = x^T Hθ)

Now setting ∂J(θ)/∂θ = 0 gives   −2 H^T x + 2 H^T H θ̂ = 0

   H^T H θ̂ = H^T x        Called the "LS Normal Equations"

Because H is full rank we know that H^T H is invertible:

   θ̂_LS = ( H^T H )^{−1} H^T x           ŝ_LS = H θ̂_LS = H ( H^T H )^{−1} H^T x
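A small Python example of the linear LSE above, fitting the line model s[n] = A + Bn; the data values are synthetic and the solver uses NumPy's least-squares routine, which solves the same normal equations in a numerically stable way.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
n = np.arange(N)
x = 2.0 + 0.5 * n + rng.standard_normal(N)          # data: A = 2, B = 0.5 plus noise
H = np.column_stack([np.ones(N), n])                # columns of H: [1, n]
theta_hat, *_ = np.linalg.lstsq(H, x, rcond=None)   # theta_LS = (H^T H)^{-1} H^T x
s_hat = H @ theta_hat                               # signal estimate
print(theta_hat)
```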

Page 294: Class Notes

7

Comparing the Linear LSE to Other Estimates

   Model          Assumptions                    Estimate
   x = Hθ + e     No Probability Model Needed    θ̂_LS   = ( H^T H )^{−1} H^T x
   x = Hθ + w     PDF Unknown, White             θ̂_BLUE = ( H^T H )^{−1} H^T x
   x = Hθ + w     PDF Gaussian, White            θ̂_ML   = ( H^T H )^{−1} H^T x
   x = Hθ + w     PDF Gaussian, White            θ̂_MVU  = ( H^T H )^{−1} H^T x

If you assume Gaussian & apply these BUT you are WRONG… you at least get the LSE!

Page 295: Class Notes

8

The LS Cost for Linear LS

For the linear LS problem… what is the resulting LS cost for using θ̂_LS = ( H^T H )^{−1} H^T x ?

   J_min = ( x − H θ̂_LS )^T ( x − H θ̂_LS )
         = ( x − H(H^T H)^{−1} H^T x )^T ( x − H(H^T H)^{−1} H^T x )
         = x^T ( I − H(H^T H)^{−1} H^T )^T ( I − H(H^T H)^{−1} H^T ) x        (factor out the x's; properties of transpose)
         = x^T ( I − H(H^T H)^{−1} H^T ) x

   (easily verified: I − H(H^TH)^{−1}H^T is idempotent; note: if AA = A then A is called idempotent)

   J_min = x^T ( I − H(H^T H)^{−1} H^T ) x = x^T x − x^T H (H^T H)^{−1} H^T x

   0 ≤ J_min ≤ ||x||²

Page 296: Class Notes

9

Weighted LS for Linear LS

Recall: de-emphasize bad samples' importance in the sum of squares:

   J(θ) = Σ_{n=0}^{N−1} w_n ( x[n] − s[n;θ] )²

For the linear LS case we get:

   J(θ) = ( x − Hθ )^T W ( x − Hθ )        (W a Diagonal Matrix)

Minimizing the weighted LS cost gives:

   θ̂_WLS = ( H^T W H )^{−1} H^T W x          J_min = x^T ( W − W H (H^T W H)^{−1} H^T W ) x

Note: Even though there is no true LS-based reason… many people use an inverse covariance matrix as the weight: W = C_x^{−1}.
This makes WLS look like BLUE!!!!

Page 297: Class Notes

10

8.5 Geometry of Linear LS
  • Provides a different derivation
  • Enables new versions of LS
      – Order-Recursive
      – Sequential

Recall the LS Cost to be minimized:

   J(θ) = ( x − Hθ )^T ( x − Hθ ) = || x − Hθ ||²

Thus, LS minimizes the length of the error vector between the data and the signal estimate:   ε = x − ŝ

But… for Linear LS we have

   s = Hθ = Σ_{i=1}^{p} θ_i h_i          H = [ h_1  h_2  …  h_p ]   (N×p, N > p)

[Figure: θ ∈ R^p maps to s ∈ Range(H) ⊂ R^N; s lies in a subspace of R^N, but x can lie anywhere in R^N]

Page 298: Class Notes

11

LS Geometry Example   (N = 3, p = 2; notation a bit different from the book)

x = s + e: the "noise" takes s out of Range(H) and into R^N.

[Figure: the columns h_1, h_2 of H lie in a plane = the "subspace" spanned by the columns of H = S_2 (S_p in general); x sits above the plane; the signal estimate ŝ = θ̂_1 h_1 + θ̂_2 h_2 lies in the plane; the error ε = x − ŝ is ⊥ to each h_i]

Page 299: Class Notes

12

LS Orthogonality Principle

The LS error vector must be ⊥ to all columns of H:

   ε^T H = 0^T     or     H^T ε = 0

Can use this property to derive the LS estimate:

   H^T ε = 0   ⇒   H^T ( x − Hθ̂ ) = 0   ⇒   H^T H θ̂ = H^T x   ⇒   θ̂_LS = ( H^T H )^{−1} H^T x

Same answer as before… but no derivatives to worry about!

[Figure: θ ∈ R^p maps through H into Range(H) ⊂ R^N; the matrix (H^T H)^{−1} H^T acts like an inverse from R^N back to R^p, called the pseudo-inverse of H]

Page 300: Class Notes

13

LS Projection Viewpoint

From the R³ example earlier… we see that ŝ must lie "right below" x:

   ŝ = "Projection" of x onto Range(H)        (Recall: Range(H) = subspace spanned by the columns of H)

From our earlier results we have:

   ŝ = H θ̂_LS = H ( H^T H )^{−1} H^T x ≜ P_H x           ε = x − ŝ

P_H = H ( H^T H )^{−1} H^T is the "Projection Matrix onto Range(H)".

Page 301: Class Notes

14

Aside on Projections

If something is "on the floor"… its projection onto the floor = itself!

   if z ∈ Range(H), then P_H z = z

Now… for a given x in the full space… P_H x is already in Range(H)… so P_H (P_H x) = P_H x.

Thus… for any projection matrix P_H we have:   P_H P_H = P_H, i.e. P_H² = P_H        (Projection Matrices are Idempotent)

Note also that the projection onto Range(H) is symmetric:

   P_H^T = [ H ( H^T H )^{−1} H^T ]^T = P_H        (easily verified)
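A quick numerical check of the projection-matrix properties above in Python, using a random full-rank H (the matrix and sizes are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((6, 2))                       # full-rank N x p matrix
PH = H @ np.linalg.inv(H.T @ H) @ H.T                 # projection onto Range(H)
print(np.allclose(PH @ PH, PH))                       # idempotent: PH^2 = PH
print(np.allclose(PH, PH.T))                          # symmetric:  PH^T = PH
z = H @ rng.standard_normal(2)                        # a vector already in Range(H)
print(np.allclose(PH @ z, z))                         # its projection is itself
```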

Page 302: Class Notes

15

What Happens w/ Orthonormal Columns of H

Recall the general Linear LS solution:  θ̂_LS = ( H^T H )^{−1} H^T x,  where

   H^T H = [ <h_1,h_1>  <h_1,h_2>  …  <h_1,h_p> ]
           [ <h_2,h_1>  <h_2,h_2>  …  <h_2,h_p> ]
           [     ⋮          ⋮        ⋱     ⋮     ]
           [ <h_p,h_1>  <h_p,h_2>  …  <h_p,h_p> ]

If the columns of H are orthonormal then <h_i, h_j> = δ_ij ⇒ H^T H = I, so

   θ̂_LS = H^T x

Easy!! No Inversion Needed!!   Recall Vector Space Ideas with an ON Basis!!

Page 303: Class Notes

16

Geometry with Orthonormal Columns of H

Re-write this LS solution as:

   θ̂_i = h_i^T x        (the Inner Product between the ith Column and the Data Vector)

Then we have:

   ŝ = H θ̂ = Σ_{i=1}^{p} θ̂_i h_i = Σ_{i=1}^{p} ( h_i^T x ) h_i        each term = the Projection of x onto the h_i axis

[Figure: for p = 2, ŝ = (h_1^T x) h_1 + (h_2^T x) h_2]

When the columns of H are ⊥ we can first find the projection onto each 1-D subspace independently, then add these independently derived results. Nice!

Page 304: Class Notes

1

8.6 Order-Recursive LS

Motivate this idea with Curve Fitting. We want to fit a polynomial to data… but which one is the right model?

   ! Constant   ! Linear   ! Quadratic   ! Cubic, Etc.

Given data:  s[0], s[1], . . ., s[N−1]   for n = 0, 1, 2, . . ., N−1

Try each model, look at Jmin … which one works "best"?

[Figure: Jmin(p) vs. p (# of parameters in the model): Constant (p = 1), Line (p = 2), Quadratic (p = 3), Cubic (p = 4)]

Page 305: Class Notes

2

Choosing the Best Model Order

Q: Should you pick the order p that gives the smallest Jmin??
A: NO!!!!

Fact: Jmin(p) is monotonically non-increasing as order p increases.

If you have any N data points… you can perfectly fit a p = N model to them!!!!

[Figure: N noisy data points s[n] vs. n, with a high-order polynomial passing exactly through every point]

2 points define a line; 3 points define a quadratic; 4 points define a cubic; …
N points define a_{N−1}x^{N−1} + a_{N−2}x^{N−2} + … + a_1x + a_0.

Warning: Don't Fit the Noise!!

Page 306: Class Notes

3

Choosing the Order in Practice

Practice: use the simplest model that adequately describes the data.
Scheme: Only increase the order if the cost reduction is "significant".

• Increase to order p+1 only if Jmin(p) − Jmin(p+1) > ε    (ε = user-set threshold)

• Also, in practice you may have some idea of the expected level of error ⇒ thus have some idea of the expected Jmin ⇒ use the order p such that Jmin(p) ≈ Expected Jmin

Wasteful to independently compute the LS solution for each order

Drives Need for: Efficient way to compute LS for many models

Q: If we have computed p-order model, can we use it to recursively compute (p+1)-order model?

A: YES!! ⇒ Order-Recursive LS

Page 307: Class Notes

4

Define General Order-Increasing Models

Define:  H_{p+1} = [ H_p  h_{p+1} ]   built from columns h_1, h_2, h_3, . . .   ⇒  H_1, H_2, H_3, Etc.

Order-Recursive LS with Orthonormal Columns

If all the h_i are ⊥ ⇒ EASY !!

   p = 1:   ŝ_1 = ( h_1^T x ) h_1
   p = 2:   ŝ_2 = ŝ_1 + ( h_2^T x ) h_2
   p = 3:   ŝ_3 = ŝ_2 + ( h_3^T x ) h_3
   ⋮

[Figure: for p = 2, ŝ_2 = ŝ_1 + (h_2^T x) h_2 adds the projection onto h_2 to the previous estimate]

Page 308: Class Notes

5

Order-Recursive Solution for General H

If the h_i are Not ⊥ ⇒ Harder, but Possible!

Basic Idea: Given the current-order estimate:
  • map the new column of H into an ON version
  • use it to find a new "estimate,"
  • then transform to correct for the orthogonalization

(Quotes here because this estimate is for the orthogonalized model.)

[Figure: h_1 and h_2 span S_2 = Range(H_2); h̃_3 is the orthogonalized version of h_3, perpendicular to that 2-D space. Note: x is not shown here… it is in a higher-dimensional space!!]

Page 309: Class Notes

6

Geometrical Development of Order-Recursive LS

The Geometry of Vector Space is indispensable for DSP!   (See App. 8A for the Algebraic Development… Yuk! Geometry is Easier!)

Current Order = k   ⇒   H_k = [ h_1  h_2  . . .  h_k ]   (not necessarily ⊥)

Recall:   P_k = H_k ( H_k^T H_k )^{−1} H_k^T     is the Projector onto S_k = Range(H_k)

Given the next column h_{k+1}, find h̃_{k+1}, which is ⊥ to S_k:

   h̃_{k+1} = P_k^⊥ h_{k+1} = ( I − P_k ) h_{k+1} = h_{k+1} − P_k h_{k+1}

   h̃_{k+1} ⊥ S_k   ⇒   h̃_{k+1} ⊥ ŝ_k

[Figure: h_{k+1} decomposed into its projection P_k h_{k+1} onto S_k and the perpendicular component h̃_{k+1}]

Page 310: Class Notes

7

So our approach is now: project x onto h̃_{k+1} and then add the result to ŝ_k.

The projection of x onto h̃_{k+1} is given by (divide by the norm to normalize):

   Δŝ_{k+1} ≜ ( h̃_{k+1}^T x / || h̃_{k+1} ||² ) h̃_{k+1}
            = ( h_{k+1}^T P_k^⊥ x / ( h_{k+1}^T P_k^⊥ h_{k+1} ) ) P_k^⊥ h_{k+1}

   (uses h̃_{k+1} = P_k^⊥ h_{k+1} and the fact that P_k^⊥ is symmetric and idempotent)

Now add this to the current signal estimate:

   ŝ_{k+1} = ŝ_k + Δŝ_{k+1} = H_k θ̂_k + Δŝ_{k+1}

Page 311: Class Notes

8

""#""$% 11

11

121

11

)(ˆ

ˆˆ

+⊥

+

⊥++

+⊥

+⊥

+⊥

+

−+=

+=

kkTk

kTkkk

kk

kk

kk

kkT

kkk

hPhxPhhPIθH

hPhP

hPxθHsScalar…

can move here and transpose

Write out Pk⊥

scalar… define as b for convenience

Now we have:

Write out ||.||2 and use that Pk

⊥ is idempotent

[ ]

−=

−+=

+−

=+

+−

+

+ b

b

bb

kTkk

Tkk

kk

kTkk

Tkkkkkk

k

11

1

11

1

)(ˆ

)(ˆˆ

1

hHHHθhH

hHHHHhθHs

H"#"$%

Finally:

Clearly this is 1ˆ+kθ

Page 312: Class Notes

9

Order-Recursive LS Solution

   θ̂_{k+1} = [ θ̂_k − ( H_k^T H_k )^{−1} H_k^T h_{k+1} · ( h_{k+1}^T P_k^⊥ x ) / ( h_{k+1}^T P_k^⊥ h_{k+1} ) ]
             [                   ( h_{k+1}^T P_k^⊥ x ) / ( h_{k+1}^T P_k^⊥ h_{k+1} )                         ]

Drawback: Needs an Inversion Each Recursion. See Eq. (8.29) and (8.30) for a way to avoid the inversion.

Comments:

1. If h_{k+1} ⊥ H_k ⇒ the problem simplifies, as we've seen (this equation simplifies to our earlier result).

2. Note: P_k^⊥ x above is the residual of the k-order model = the part of x not modeled by the k-order model ⇒ the update recursion works solely with this. Makes Sense!!!

Page 313: Class Notes

10

8.7 Sequential LS

   In Last Section:               In This Section:
   • Data Stays Fixed             • Data Length Increases
   • Model Order Increases        • Model Order Stays Fixed

You have received a new data sample!

Say we have θ̂[N−1] based on x[0], . . ., x[N−1].

If we get x[N]… can we compute θ̂[N] based on θ̂[N−1] and x[N]? (w/o re-solving using the full data set!)

We want…   θ̂[N] = f( θ̂[N−1], x[N] )

Approach Here:
  1. Derive it for the DC-Level case
  2. Interpret the Results
  3. Write Down the General Result w/o Proof

Page 314: Class Notes

11

Sequential LS for DC-Level Case

We know this:

   Â_{N−1} = (1/N) Σ_{n=0}^{N−1} x[n]

Re-Write:

   Â_N = (1/(N+1)) Σ_{n=0}^{N} x[n]
       = (1/(N+1)) [ Σ_{n=0}^{N−1} x[n] + x[N] ]
       = (N/(N+1)) Â_{N−1} + (1/(N+1)) x[N]

… and this:

   Â_N = Â_{N−1} + (1/(N+1)) ( x[N] − Â_{N−1} )

   (old estimate) + (gain) × (prediction error = new data minus the prediction of the new data, which is the old estimate)

Page 315: Class Notes

12

Weighted Sequential LS for DC-Level Case

This is an even better illustration… w[n] has an unknown PDF but a known time-dependent variance.

Assumed model:   x[n] = A + w[n],    var{w[n]} = σ_n²

Standard WLS gives:

   Â_{N−1} = [ Σ_{n=0}^{N−1} x[n]/σ_n² ] / [ Σ_{n=0}^{N−1} 1/σ_n² ]

With manipulations similar to the above case we get:

   Â_N = Â_{N−1} + K_N ( x[N] − Â_{N−1} )          K_N = (1/σ_N²) / Σ_{n=0}^{N} (1/σ_n²)
         (old estimate)   (prediction error)

K_N is a "Gain" term that reflects the "goodness" of the new data.

Page 316: Class Notes

13

Exploring The Gain Term

We know that

   var(Â_{N−1}) = 1 / Σ_{n=0}^{N−1} (1/σ_n²)

… and using it in K_N we get that

   K_N = var(Â_{N−1}) / ( var(Â_{N−1}) + σ_N² )

   ("poorness" of the current estimate in the numerator; σ_N² = variance of the new data = "poorness" of the new data)

Note: 0 ≤ K[N] ≤ 1

⇒ The Gain depends on the Relative Goodness Between:
   o the Current Estimate
   o the New Data Point

Page 317: Class Notes

14

Extreme Cases for The Gain Term

   Â[N] = Â[N−1] + K[N] ( x[N] − Â[N−1] )
          (old estimate)    (prediction error)

If var(Â[N−1]) << σ_N²   (Good Estimate, Bad Data)
   ⇒ K[N] ≈ 0:  the New Data Has Little Use ⇒ Make Little "Correction" Based on the New Data

If var(Â[N−1]) >> σ_N²   (Bad Estimate, Good Data)
   ⇒ K[N] ≈ 1:  the New Data is Very Useful ⇒ Make a Large "Correction" Based on the New Data

Page 318: Class Notes

15

General Sequential LS Result     (See App. 8C for derivation)

At time index n−1 we have:

   x_{n−1} = [ x[0]  x[1]  …  x[n−1] ]^T = H_{n−1} θ + w_{n−1}

   C_{n−1} = diag{ σ_0², σ_1², …, σ_{n−1}² }      Diagonal Covariance (Sequential LS requires this)

   θ̂_{n−1} = LS Estimate using x_{n−1}
   Σ_{n−1} = cov( θ̂_{n−1} ) = quality measure of the estimate

At time index n we get x[n]:

   x_n = H_n θ + w_n = [ H_{n−1} ] θ + w_n        (tack a row h_n^T on the bottom to show how θ maps to x[n])
                       [  h_n^T  ]

Page 319: Class Notes

16

Iterate these Equations:

Given the following:   θ̂_{n−1},  Σ_{n−1},  h_n,  x[n],  σ_n²

   Compute the Gain:       k_n = Σ_{n−1} h_n / ( σ_n² + h_n^T Σ_{n−1} h_n )

   Update the Estimate:    θ̂_n = θ̂_{n−1} + k_n ( x[n] − h_n^T θ̂_{n−1} )
                           ( h_n^T θ̂_{n−1} = the prediction of x[n] using the current parameter estimate )

   Update the Est. Cov.:   Σ_n = ( I − k_n h_n^T ) Σ_{n−1}

Initialization: (Assume p parameters)
  • Collect the first p data samples x[0], . . ., x[p−1]
  • Use "Batch" LS to compute θ̂_{p−1} and Σ_{p−1}
  • Then start the sequential processing

The Gain has the same kind of dependence on the Relative Goodness between:
   o the Current Estimate
   o the New Data Point
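A compact Python sketch of the sequential-LS recursion above; the DC-level example and the vague initialization are my own illustrative choices (rather than the "batch LS on the first p samples" initialization described in the notes).

```python
import numpy as np

def sequential_ls_update(theta, Sigma, h, x_n, sigma2):
    """One sequential-LS step: gain, estimate update, covariance update."""
    h = np.asarray(h, dtype=float)
    k = Sigma @ h / (sigma2 + h @ Sigma @ h)           # gain
    theta = theta + k * (x_n - h @ theta)              # correct with the prediction error
    Sigma = (np.eye(len(theta)) - np.outer(k, h)) @ Sigma
    return theta, Sigma

# Example: DC level (p = 1, h = [1]) in unit-variance noise
rng = np.random.default_rng(0)
theta, Sigma = np.array([0.0]), np.array([[1e6]])      # vague initialization
for x_n in 3.0 + rng.standard_normal(200):
    theta, Sigma = sequential_ls_update(theta, Sigma, [1.0], x_n, 1.0)
print(theta)    # close to 3.0
```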

Page 320: Class Notes

17

Sequential LS Block Diagram

[Block diagram: the observation x[n] minus the predicted observation h_n^T θ̂_{n−1} forms the prediction error; it is multiplied by the gain k_n (computed from Σ_{n−1}, h_n, σ_n²) and added to the previous estimate θ̂_{n−1} (held in a z^{−1} delay) to give the updated estimate θ̂_n.]

Page 321: Class Notes

1

8.8 Constrained LS

Why Constrain? Because sometimes we know (or believe!) that certain values are not allowed for θ.

For example: In emitter location you may know that the emitter’s range can’t exceed the “radio horizon”

You may also know that the emitter is on the left side of the aircraft (because you got a strong signal from the left-side antennas and a weak one from the right-side antennas)

Thus, when finding θ̂_LS you want to constrain it to satisfy these conditions.

Page 322: Class Notes

2

Constrained LS Problem Statement

Say that S_c is the set of allowable θ values (due to constraints).

Then we seek θ̂_CLS ∈ S_c such that

   || x − H θ̂_CLS ||² = min_{θ ∈ S_c} || x − Hθ ||²

Types of Constraints (progressively HARDER):

  1. Linear Equality       Aθ = b                  (constrained to a line, plane, or hyperplane)
  2. Nonlinear Equality    f(θ) = b
  3. Linear Inequality     Aθ ≥ b  or  Aθ ≤ b      (constrained to lie above/below a hyperplane)
  4. Nonlinear Inequality  f(θ) ≥ b  or  f(θ) ≤ b

We'll Cover #1. See Books on Optimization for the Other Cases.

Page 323: Class Notes

3

LS Cost with a Linear Equality Constraint

Using Lagrange Multipliers… we need to minimize, w.r.t. θ and λ,

   J_c(θ) = ( x − Hθ )^T ( x − Hθ ) + λ^T ( Aθ − b )        (LS cost + the Linear Equality Constraint term)

[Figure: contours of (x − Hθ)^T (x − Hθ) in the parameter plane; the unconstrained minimum sits at the center, and the constrained minimum is where a contour touches the 2-D linear equality constraint line]

Page 324: Class Notes

4

Constrained Optimization: Lagrange Multiplier

Constraint:  g(x_1, x_2) = C,  so define  h(x_1, x_2) = g(x_1, x_2) − C = 0

The Constrained Max occurs when:

   ∇f(x_1, x_2) = −λ ∇h(x_1, x_2)   ⇒   ∇f(x_1, x_2) + λ ∇h(x_1, x_2) = 0
   i.e.   ∇[ f(x_1, x_2) + λ ( g(x_1, x_2) − C ) ] = 0

Ex.  a x_1 + b x_2 − c = 0  ⇒  x_2 = (−a/b) x_1 + c/b    (A Linear Constraint)

   ∇h(x_1, x_2) = [ ∂h/∂x_1 ]  =  [ a ]
                  [ ∂h/∂x_2 ]     [ b ]

Ex. The gradient vector has "slope" b/a ⇒ it is orthogonal to the constraint line.

[Figure: contours of f(x_1, x_2) with the constraint line; at the constrained max the gradients of f and h are parallel]

Page 325: Class Notes

5

LS Solution with a Linear Equality Constraint

Follow the usual steps for the Lagrange Multiplier Solution:

1. Set ∂J_c(θ)/∂θ = 0 ⇒ θ̂_CLS as a function of λ:

      −2 H^T x + 2 H^T H θ̂_c + A^T λ = 0
      ⇒  θ̂_c(λ) = ( H^T H )^{−1} H^T x − (1/2) ( H^T H )^{−1} A^T λ = θ̂_uc − (1/2) ( H^T H )^{−1} A^T λ

      (θ̂_uc is the Unconstrained Estimate)

2. Solve for λ to make θ̂_CLS satisfy the constraint:  A θ̂_c(λ) = b

      λ_c = 2 [ A ( H^T H )^{−1} A^T ]^{−1} ( A θ̂_uc − b )

3. Plug in to get the constrained solution:  θ̂_c = θ̂_c(λ_c)

      θ̂_c = θ̂_uc − ( H^T H )^{−1} A^T [ A ( H^T H )^{−1} A^T ]^{−1} ( A θ̂_uc − b )

      (a "Correction" Term applied to the unconstrained estimate; (A θ̂_uc − b) = the Amount of Constraint Deviation)
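A short Python sketch of the constrained-LS formula above; the example (a line fit whose coefficients must sum to 1) is made up for illustration.

```python
import numpy as np

def constrained_ls(H, x, A, b):
    """Equality-constrained LS: minimize ||x - H theta||^2 subject to A theta = b."""
    HtH_inv = np.linalg.inv(H.T @ H)
    theta_uc = HtH_inv @ H.T @ x                              # unconstrained estimate
    S = A @ HtH_inv @ A.T
    correction = HtH_inv @ A.T @ np.linalg.solve(S, A @ theta_uc - b)
    return theta_uc - correction

# Example: fit s[n] = A + B*n subject to A + B = 1
rng = np.random.default_rng(0)
n = np.arange(50)
x = 0.3 + 0.7 * n + 0.1 * rng.standard_normal(50)
H = np.column_stack([np.ones(50), n])
theta_c = constrained_ls(H, x, A=np.array([[1.0, 1.0]]), b=np.array([1.0]))
print(theta_c, theta_c.sum())          # the sum equals 1 (to numerical precision)
```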

Page 326: Class Notes

6

Geometry of Constrained Linear LS

The above result can be interpreted geometrically:

[Figure: the data x, the unconstrained signal estimate ŝ_uc, and the constrained signal estimate ŝ_c on the constraint line]

The Constrained Estimate of the Signal is the Projection of the Unconstrained Estimate onto the Linear Constraint Subspace.

Page 327: Class Notes

7

8.9 Nonlinear LS

Everything we've done up to now has assumed a linear observation model… but we've already seen that many applications have nonlinear observation models: s(θ) ≠ Hθ.

Recall: For the linear case there is a closed-form solution. < Not so for the nonlinear case!! >

We must use numerical, iterative methods to minimize the LS cost given by:

   J(θ) = [ x − s(θ) ]^T [ x − s(θ) ]

But first… Two Tricks!!!

Page 328: Class Notes

8

Two Tricks for Nonlinear LS

Sometimes it is possible to:
  1. Transform into a Linear Problem
  2. Separate out any Linear Parameters

Trick #1: Seek an invertible function α = g(θ), θ = g^{−1}(α), such that

   s(θ(α)) = Hα

which can be easily solved for α̂_LS, and then find θ̂_LS = g^{−1}( α̂_LS ).

Trick #2: See if some of the parameters are linear. Try to decompose

   θ = [ α ]      to get    s(θ) = H(α) β        Nonlinear in α, Linear in β!!!
       [ β ]

Sometimes it is Possible to Do Both Tricks Together.

Page 329: Class Notes

9

Example of Linearization Trick

Consider estimation of a sinusoid's amplitude and phase (with a known frequency):

   s[n] = A cos( 2π f_o n + φ )          θ = [ A  φ ]^T

But we can re-write this model as:

   s[n] = A cos(φ) cos(2π f_o n) − A sin(φ) sin(2π f_o n) = α_1 cos(2π f_o n) + α_2 sin(2π f_o n)

which is linear in α = [ α_1  α_2 ]^T, so:

   α̂ = ( H^T H )^{−1} H^T x

Then map this estimate back using

   θ̂ = g^{−1}( α̂ ) = [ sqrt( α̂_1² + α̂_2² )     ]
                      [ tan^{−1}( −α̂_2 / α̂_1 ) ]

Note that for this example this is merely exploiting polar-to-rectangular ideas!!!
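A small Python example of the linearization trick above; the frequency, amplitude, phase, and noise level are made-up test values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, f0 = 200, 0.05                              # f0 in cycles/sample (known)
A_true, phi_true = 1.3, 0.7
n = np.arange(N)
x = A_true * np.cos(2 * np.pi * f0 * n + phi_true) + 0.2 * rng.standard_normal(N)

H = np.column_stack([np.cos(2 * np.pi * f0 * n), np.sin(2 * np.pi * f0 * n)])
alpha, *_ = np.linalg.lstsq(H, x, rcond=None)  # alpha1 = A cos(phi), alpha2 = -A sin(phi)
A_hat = np.hypot(alpha[0], alpha[1])
phi_hat = np.arctan2(-alpha[1], alpha[0])      # map back (polar-to-rectangular idea)
print(A_hat, phi_hat)
```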

Page 330: Class Notes

10

Example of Separation Trick

Consider a signal model of three exponentials:

   s[n] = A_1 r^n + A_2 r^{2n} + A_3 r^{3n},    0 < r < 1

   θ = [ r  A_1  A_2  A_3 ]^T       α = r  (nonlinear),    β = [ A_1  A_2  A_3 ]^T  (linear)

   H(r) = [    1           1            1       ]
          [    r           r²           r³      ]
          [    ⋮           ⋮            ⋮       ]
          [ r^{N−1}    r^{2(N−1)}   r^{3(N−1)}  ]

Then we can write:

   s(θ) = H(r) β          β̂(r) = [ H^T(r) H(r) ]^{−1} H^T(r) x

Then we need to minimize:

   J(r) = [ x − H(r) β̂(r) ]^T [ x − H(r) β̂(r) ] = x^T [ I − H(r) ( H^T(r) H(r) )^{−1} H^T(r) ] x

This depends on only one variable… so we might conceivably just compute it on a grid and find the minimum.

Page 331: Class Notes

11

Iterative Methods for Solving Nonlinear LS

Goal: Find the θ value that minimizes J(θ) = [x − s(θ)]^T [x − s(θ)] without computing it over a p-dimensional grid.

Two most common approaches:

1. Newton-Raphson
   a. Analytically find ∂J(θ)/∂θ
   b. Apply Newton-Raphson to find a zero of ∂J(θ)/∂θ (i.e. linearize ∂J(θ)/∂θ about the current estimate)
   c. Iteratively Repeat

2. Gauss-Newton
   a. Linearize the signal model s(θ) about the current estimate
   b. Solve the resulting linear problem
   c. Iteratively Repeat

Both involve:
  • Linearization (but they each linearize something different!)
  • Solving a linear problem
  • Iteratively improving the result

Page 332: Class Notes

12

Newton-Raphson Solution to Nonlinear LS

To find the minimum of J(θ): set

   g(θ) ≜ ∂J(θ)/∂θ = 0          where   ∂J(θ)/∂θ = [ ∂J(θ)/∂θ_1  …  ∂J(θ)/∂θ_p ]^T

We need to find this for

   J(θ) = Σ_{i=0}^{N−1} ( x[i] − s[i;θ] )²

Taking these partials gives:

   ∂J(θ)/∂θ_j = −2 Σ_{i=0}^{N−1} ( x[i] − s[i;θ] ) ∂s[i;θ]/∂θ_j

   (define r_i ≜ x[i] − s[i;θ] and h_{ij} ≜ ∂s[i;θ]/∂θ_j; the factor of −2 can be ignored… why?)

Page 333: Class Notes

13

Now set to zero:

   Σ_{i=0}^{N−1} r_i h_{ij} = 0   for j = 1, …, p      ⇒      g(θ) = H_θ^T r_θ = 0      (Matrix × Vector; both depend nonlinearly on θ)

where

   H_θ = [ ∂s[0;θ]/∂θ_1      …   ∂s[0;θ]/∂θ_p   ]          r_θ = [ x[0] − s[0;θ]     ]
         [ ∂s[1;θ]/∂θ_1      …   ∂s[1;θ]/∂θ_p   ]                [ x[1] − s[1;θ]     ]
         [       ⋮                     ⋮         ]                [        ⋮          ]
         [ ∂s[N−1;θ]/∂θ_1    …   ∂s[N−1;θ]/∂θ_p ]                [ x[N−1] − s[N−1;θ] ]

Define the ith row of H_θ:   h_i^T(θ) = ∂s[i;θ]/∂θ^T = [ ∂s[i;θ]/∂θ_1  …  ∂s[i;θ]/∂θ_p ]

Then the equation to solve is:

   g(θ) = H_θ^T r_θ = Σ_{n=0}^{N−1} r[n] h_n(θ) = 0

Page 334: Class Notes

14

For Newton-Raphson we linearize g(θ) around our current estimate and iterate (we need the Jacobian of g):

   θ̂_{k+1} = θ̂_k − [ ∂g(θ)/∂θ^T ]^{−1} g(θ) |_{θ = θ̂_k} = θ̂_k − [ ∂( H_θ^T r_θ )/∂θ ]^{−1} H_θ^T r_θ |_{θ = θ̂_k}

By the Derivative-of-a-Product Rule:

   ∂( H_θ^T r_θ )/∂θ = Σ_{n=0}^{N−1} [ ∂h_n(θ)/∂θ ] r[n] + Σ_{n=0}^{N−1} h_n(θ) [ ∂r[n]/∂θ^T ]

with

   ∂r[n]/∂θ = ∂( x[n] − s[n;θ] )/∂θ = − [ ∂s[n;θ]/∂θ_1  …  ∂s[n;θ]/∂θ_p ]^T

   [ G_n(θ) ]_{ij} = ∂² s[n;θ] / ∂θ_i ∂θ_j ,    i, j = 1, 2, …, p        (matrix of 2nd partials)

so that

   ∂( H_θ^T r_θ )/∂θ = Σ_{n=0}^{N−1} ( x[n] − s[n;θ] ) G_n(θ) − H_θ^T H_θ

Page 335: Class Notes

15

So the Newton-Raphson method becomes:

   θ̂_{k+1} = θ̂_k + [ H_{θ̂_k}^T H_{θ̂_k} − Σ_{n=0}^{N−1} ( x[n] − s[n; θ̂_k] ) G_n(θ̂_k) ]^{−1} H_{θ̂_k}^T ( x − s(θ̂_k) )

   (H_θ holds the 1st partials of the signal w.r.t. the parameters; G_n holds the 2nd partials of s[n] w.r.t. the parameters)

Note: if the signal is linear in the parameters… this collapses to the non-iterative result we found for the linear case!!!

Newton-Raphson LS Iteration Steps:
  1. Start with an initial estimate
  2. Iterate the above equation until the change is "small"

Page 336: Class Notes

16

Gauss-Newton Solution to Nonlinear LS

First we linearize the model around our current estimate by using a Taylor series and keeping only the linear terms:

   s(θ) ≈ s(θ̂_k) + [ ∂s(θ)/∂θ |_{θ = θ̂_k} ] ( θ − θ̂_k ) = s(θ̂_k) + H_{θ̂_k} ( θ − θ̂_k )

Then we use this linearized model in the LS cost:

   J(θ) = [ x − s(θ) ]^T [ x − s(θ) ]
        ≈ [ x − s(θ̂_k) − H_{θ̂_k}( θ − θ̂_k ) ]^T [ x − s(θ̂_k) − H_{θ̂_k}( θ − θ̂_k ) ]
        = [ y − H_{θ̂_k} θ ]^T [ y − H_{θ̂_k} θ ]          where   y ≜ x − s(θ̂_k) + H_{θ̂_k} θ̂_k      (All Known Things)

Page 337: Class Notes

17

This gives a form for the LS cost that looks like a linear problem!!

   J(θ) = [ y − H_{θ̂_k} θ ]^T [ y − H_{θ̂_k} θ ]

We know the LS solution to that problem, so

   θ̂_{k+1} = ( H_{θ̂_k}^T H_{θ̂_k} )^{−1} H_{θ̂_k}^T y
            = ( H_{θ̂_k}^T H_{θ̂_k} )^{−1} H_{θ̂_k}^T [ x − s(θ̂_k) + H_{θ̂_k} θ̂_k ]
            = θ̂_k + ( H_{θ̂_k}^T H_{θ̂_k} )^{−1} H_{θ̂_k}^T ( x − s(θ̂_k) )

Gauss-Newton LS Iteration:

   θ̂_{k+1} = θ̂_k + ( H_{θ̂_k}^T H_{θ̂_k} )^{−1} H_{θ̂_k}^T ( x − s(θ̂_k) )

Gauss-Newton LS Iteration Steps:
  1. Start with an initial estimate
  2. Iterate the above equation until the change is "small"
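A generic Python sketch of the Gauss-Newton iteration above, using a finite-difference Jacobian instead of analytic derivatives; the exponential-fit example, step size, and stopping tolerance are illustrative choices.

```python
import numpy as np

def gauss_newton(s_func, x, theta0, n_iter=20, eps=1e-6, tol=1e-10):
    """theta_{k+1} = theta_k + (H^T H)^{-1} H^T (x - s(theta_k)), H = ds/dtheta by differences."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        s0 = s_func(theta)
        H = np.zeros((len(s0), len(theta)))
        for j in range(len(theta)):
            d = np.zeros_like(theta)
            d[j] = eps * max(1.0, abs(theta[j]))
            H[:, j] = (s_func(theta + d) - s_func(theta - d)) / (2 * d[j])
        delta, *_ = np.linalg.lstsq(H, x - s0, rcond=None)   # solves the linearized LS problem
        theta = theta + delta
        if np.linalg.norm(delta) < tol:
            break
    return theta

# Example: fit a decaying exponential s[n] = a * exp(-b n)
n = np.arange(50)
x = 2.0 * np.exp(-0.1 * n) + 0.01 * np.random.default_rng(0).standard_normal(50)
print(gauss_newton(lambda th: th[0] * np.exp(-th[1] * n), x, [1.0, 0.5]))
```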

Page 338: Class Notes

18

Newton-Raphson vs. Gauss-Newton

How do these two methods compare?

   G-N:  θ̂_{k+1} = θ̂_k + ( H_{θ̂_k}^T H_{θ̂_k} )^{−1} H_{θ̂_k}^T ( x − s(θ̂_k) )

   N-R:  θ̂_{k+1} = θ̂_k + [ H_{θ̂_k}^T H_{θ̂_k} − Σ_{n=0}^{N−1} ( x[n] − s[n; θ̂_k] ) G_n(θ̂_k) ]^{−1} H_{θ̂_k}^T ( x − s(θ̂_k) )

The term of 2nd partials is missing in the Gauss-Newton Equation.

Which is better? Typically I prefer Gauss-Newton:
  • the G_n matrices are often small enough to be negligible
  • … or the error term is small enough to make the sum term negligible
  • inclusion of the sum term can sometimes de-stabilize the iteration

See p. 683 of Numerical Recipes book

Page 339: Class Notes

1

8.10 Signal Processing Examples of LSWe’ll briefly look at two examples from the book…

Book Examples

1. Digital Filter Design

2. AR Parameter Estimation for the ARMA Model

3. Adaptive Noise Cancellation

4. Phase-Locked Loop (used in phase-coherent demodulation)

The two examples we will cover highlight the flexibility of the LS viewpoint!!!

Then (in separate note files) we’ll look in detail at two emitter location examples not in the book

Page 340: Class Notes

2

Ex. 8.11 Filter Design by Prony's LS Method

The problem:
  • You have some desired impulse response hd[n]
  • Find a rational TF with impulse response h[n] ≈ hd[n]

View: hd[n] as the observed "data"!!! The rational TF model's coefficients are the parameters.

General LS Problem:
[Block diagram: data x[n] minus the signal model s[n;θ̂] gives the residual ε[n]; choose the estimate θ̂ to make the residual small]

LS Filter Design Problem:
[Block diagram: δ[n] drives H(z) = B(z)/A(z), with denominator coefficient vector â (p×1) and numerator coefficient vector b̂ ((q+1)×1); the output h[n; â, b̂] is subtracted from hd[n] to form the residual ε[n]]

Page 341: Class Notes

3

Prony's Modification to Get a Linear Model

The previous formulation results in a model that is nonlinear in the TF coefficient vectors a, b.

Prony's idea was to change the model slightly…

[Block diagram: the desired response hd[n] is passed through A(z) and compared to δ[n] passed through B(z); the difference is the residual ε[n]. This model is only approximately equivalent to the original!!]

Solution (see book for details):

   â = − ( H_q^T H_q )^{−1} H_q^T h_{q,N−1}            b̂ = h_{0,q} + H_0 â

H_q, H_0, h_{q,N−1}, and h_{0,q} all contain elements from hd[n]… the subscripts indicate the range of those elements.

Page 342: Class Notes

4

Key Ideas in Prony LS Example

1. Shows the power and flexibility of the LS approach
   - There is no noise here!!! ⇒ MVU, ML, etc. are not applicable
   - But, LS works nicely!

2. Shows a slick trick to convert a nonlinear problem to a linear one
   - Be aware that finding such tricks is an art!!!

3. Results for the LS "Prony" method have links to modeling methods for Random Processes (i.e. AR, MA, ARMA)

Is this a practical filter design method? It's not the best: the Remez-Based Method is Used Most.

Page 343: Class Notes

5

Ex. 8.13 Adaptive Noise Cancellation     (Done a bit differently from the book)

[Block diagram: the primary sensor gives x[n] = d[n] + i[n] (Desired + Interference). A reference signal ĩ[n], statistically correlated with the interference i[n] but mostly uncorrelated with the desired d[n], drives an Adaptive FIR Filter whose output î[n] is an estimate of the interference, adapted to "best" cancel it. Subtracting gives d̂[n] = x[n] − î[n], the estimate of the desired signal with "cancelled" interference.]

   î[n] = Σ_{l=0}^{p} h_n[l] ĩ[n − l]        Time-Varying Filter!! The coefficients change at each sample index.

Page 344: Class Notes

6

Noise Cancellation Typical Applications

1. Fetal Heartbeat Monitoring

[Block diagram: the sensor on the Mother's Stomach gives x[n] = d[n] + i[n] (Fetal Heartbeat + Mother's Heartbeat via the Stomach); the sensor on the Mother's Chest gives the reference ĩ[n] (Mother's Heartbeat via the Chest); the adaptive filter has to mimic the TF of the chest-to-stomach propagation, and subtracting its output yields d̂[n].]

Page 345: Class Notes

7

2. Noise Canceling Headphones

[Block diagram: the ambient noise reference ĩ[n] drives the Adaptive FIR Filter to produce î[n]; the music signal m[n] minus î[n] is played into the ear, where it adds to the acoustic noise i[n], so the ear hears m[n] + i[n] − î[n] with the noise (nearly) cancelled.]

Page 346: Class Notes

8

3. Bistatic Radar System

[Block diagram: the receiver collects x[n] = t[n] + d_t[n], the target echo t[n] (Desired) plus the direct-path transmitter signal d_t[n] (Interference); a reference copy of the transmitted signal d[n] drives the Adaptive FIR Filter producing d̂_t[n], which is subtracted to give t̂[n] for the Delay/Doppler Radar Processing.]

Page 347: Class Notes

9

LS and Adaptive Noise Cancellation

Goal: Adjust the filter coefficients to cancel the interference. There are many signal processing approaches to this problem…

We'll look at this from a LS point of view: adjust the filter coefficients to minimize

   J = Σ_n d̂²[n]          where   d̂[n] = x[n] − î[n] = d[n] + ( i[n] − î[n] )

Because i[n] is uncorrelated with d[n], minimizing J is essentially the same as making the ( i[n] − î[n] ) term zero.

Because the interference likely changes its character with time… we want to adapt! Use Sequential LS with Fading Memory.

Page 348: Class Notes

10

Sequential LS with Forgetting Factor

We want to weight recent measurements more heavily than past measurements… that is, we want to "forget" past values.

So we can use weighted LS… and if we choose our weighting factor as an exponential function then it is easy to implement!

   J[n] = Σ_{k=0}^{n} λ^{n−k} ( x[k] − î[k] )² = Σ_{k=0}^{n} λ^{n−k} ( x[k] − Σ_{l=0}^{p} h_n[l] ĩ[k − l] )²

λ = forgetting factor, 0 < λ < 1. A small λ quickly "down-weights" the past errors.

See book for solution details

See Fig. 8.17 for simulation results

Page 349: Class Notes

Single Platform Emitter Location

[Tree: methods include AOA (DF), FOA, Interferometry (SBI, LBI), and TOA]

Emitter Location is Two Estimation Problems in One:

1) Estimate Signal Parameter(s) that Depend on the Emitter's Location:
   a) Time-of-Arrival (TOA) of Pulses
   b) Phase Interferometry: phase is measured between two different signals received at nearby antennas
      • SBI – Short Baseline Interferometry (antennas are close enough together that the phase is measured without ambiguity)
      • LBI – Long Baseline Interferometry (antennas are far enough apart that the phase is measured with ambiguity; the ambiguity is resolved either using processing or is so-called self-resolved)
   c) Frequency-of-Arrival (FOA) or Doppler
   d) Angle-of-Arrival (AOA)

2) Use Signal Parameters Measured at Several Instants to Estimate the Location

Page 350: Class Notes

Frequency-Based Location (i.e. Doppler Location)

The Problem
• Emitter assumed non-moving and at position (X,Y,Z)
  – Transmitting a radar signal at an unknown carrier frequency fo
• Signal is intercepted by a receiver on a single aircraft
  – A/C dynamics are considered to be perfectly known as a function of time
• Nav Data: Position Xp(t), Yp(t), Zp(t) and Velocity Vx(t), Vy(t), Vz(t)
• Relative motion between the Tx and Rx causes Doppler shift
  – Received carrier frequency differs from the transmitted carrier frequency
  – Thus, the carrier frequency of the received signal will change with time
• For a given set of nav data, how the frequency changes depends on the transmitter's carrier frequency fo and the emitter's position (X,Y,Z)
  – Parameter Vector: x = [X Y Z fo]^T
  – fo is a "nuisance" parameter
• Received frequency is a function of time as well as the parameter vector x:

   f(t, x) = fo − (fo/c) · [ Vx(t)(Xp(t) − X) + Vy(t)(Yp(t) − Y) + Vz(t)(Zp(t) − Z) ]
                           / sqrt( (Xp(t) − X)² + (Yp(t) − Y)² + (Zp(t) − Z)² )            (1)

Page 351: Class Notes

• Make noisy frequency measurements at t1, …, tN:

   f̃(t_i, x) = f(t_i, x) + v(t_i)

• Problem: Given the noisy frequency measurements and the nav data, estimate x
• What PDF model do we use for our data????

In the TDOA/FDOA case… we had an ML estimator for TDOA/FDOA so we could claim that the measurements were asymptotically Gaussian. Because we then had a well-specified PDF for the TDOA/FDOA we could hope to use ML for the location processing.

However, here we have no ML estimator for the instantaneous frequency so claiming that the inst. freq. estimates are Gaussian is a bit of a stretch.

So we could:

1. Outright ASSUME Gaussian and then use ML approach

2. Resort to LS… which does not even require a PDF viewpoint!

Page 352: Class Notes

Both paths get us to the exact same place:

Find the estimate x̂ that minimizes

   J(x̂) = Σ_{i=1}^{N} [ f̃(t_i, x) − f(t_i, x̂) ]²

If we Assume Gaussian… we could choose:

• Newton-Raphson MLE approach: leads to double derivatives of the measurement model f (ti,xe).

If we Resort to LS… we could choose either:

• Newton-Raphson approach, which in this case is identical to N-R under the Gaussian assumption

• Gauss-Newton approach, which needs only first derivatives of the measurement model f (ti,xe).

We’ll resort to LS and use Gauss-Newton

Page 353: Class Notes

LS Approach: Find the estimate x̂ such that the corresponding computed frequency measurements f(t_i, x̂) are "close" to the actual measurements:

– Minimize   J = Σ_{i=1}^{N} [ f̃(t_i, x) − f(t_i, x̂) ]²

[Figure: measured frequency vs. time, compared with the frequency computed using the measured nav and a poor assumed location (bad fit) and with a good assumed location (close fit)]

Page 354: Class Notes

The Solution

The measurement model in (1) is nonlinear in x ⇒ no closed-form solution.
  – Newton-Raphson: Linearize the derivative of the cost function
  – Gauss-Newton: Linearize the measurement model

Thus:

   f̃(x) ≈ f(x̂_n) + H [ x − x̂_n ] + v      ⇒      Δf̃(x̂_n) ≈ H Δx + v        (A Linear Model)

where…

   H = ∂f(t, x)/∂x |_{x = x̂_n} = [ h_1 | h_2 | h_3 | h_4 ]

Get the LS solution for the update and then update the current estimate:

   Δx̂_n = ( H^T R^{−1} H )^{−1} H^T R^{−1} Δf(x̂_n)          x̂_{n+1} = x̂_n + Δx̂_n

Under the condition that the frequency measurement errors are Gaussian, the CRLB for the problem can be shown to be

   var(x̂) ≥ ( H^T R^{−1} H )^{−1}

Can use this to investigate performance under geometries of interest…even when the measurement errors aren’t truly Gaussian

Page 355: Class Notes

The Algorithm

Initialization:
• Use the average of the measured frequencies as an initial transmitter frequency estimate.
• To get an initial estimate of the emitter's X, Y, Z components there are several possibilities:
  – Perform a grid search
  – Use some information from another sensor (e.g., if other on-board sensors can give a rough angle, use that together with a typical range)
  – Pick several typical initial locations (e.g., one in each quadrant with some typical range)
• Let the initial estimate be

   x̂_0 = [ X̂_0  Ŷ_0  Ẑ_0  f̂_{o,0} ]^T

Page 356: Class Notes

Iteration:

For n = 0, 1, 2, …

1. Compute the vector of predicted frequencies at times t1, t2, …, tN using the current nth estimate and the nav info:

   f̂(t_j, x̂_n) = f̂_{o,n} − (f̂_{o,n}/c) · [ Vx(t_j)(Xp(t_j) − X̂_n) + Vy(t_j)(Yp(t_j) − Ŷ_n) + Vz(t_j)(Zp(t_j) − Ẑ_n) ]
                                           / sqrt( (Xp(t_j) − X̂_n)² + (Yp(t_j) − Ŷ_n)² + (Zp(t_j) − Ẑ_n)² )

   f̂(x̂_n) = [ f̂(t_1, x̂_n)  f̂(t_2, x̂_n)  …  f̂(t_N, x̂_n) ]^T

2. Compute the residual vector by subtracting the predicted frequency vector from the measured frequency vector:

   Δf(x̂_n) = f̃(x) − f̂(x̂_n)

Page 357: Class Notes

3. Compute the Jacobian matrix H using the nav info and the current estimate:

   H = ∂f(t, x)/∂x |_{x = x̂_n} = [ h_1 | h_2 | h_3 | h_4 ]

Define:

   ΔX̂_n(t_j) = Xp(t_j) − X̂(n),    ΔŶ_n(t_j) = Yp(t_j) − Ŷ(n),    ΔẐ_n(t_j) = Zp(t_j) − Ẑ(n)

   R̂_n(t_j) = sqrt( ΔX̂_n²(t_j) + ΔŶ_n²(t_j) + ΔẐ_n²(t_j) )

Then the jth elements of the four columns are

   h_1(j) = ∂f(t_j, x)/∂X |_{x̂_n} = (f̂_o(n)/c) [ Vx(t_j) − ΔX̂_n(t_j) ( Vx(t_j)ΔX̂_n(t_j) + Vy(t_j)ΔŶ_n(t_j) + Vz(t_j)ΔẐ_n(t_j) ) / R̂_n²(t_j) ] / R̂_n(t_j)

   h_2(j) = ∂f(t_j, x)/∂Y |_{x̂_n} = (f̂_o(n)/c) [ Vy(t_j) − ΔŶ_n(t_j) ( Vx(t_j)ΔX̂_n(t_j) + Vy(t_j)ΔŶ_n(t_j) + Vz(t_j)ΔẐ_n(t_j) ) / R̂_n²(t_j) ] / R̂_n(t_j)

   h_3(j) = ∂f(t_j, x)/∂Z |_{x̂_n} = (f̂_o(n)/c) [ Vz(t_j) − ΔẐ_n(t_j) ( Vx(t_j)ΔX̂_n(t_j) + Vy(t_j)ΔŶ_n(t_j) + Vz(t_j)ΔẐ_n(t_j) ) / R̂_n²(t_j) ] / R̂_n(t_j)

   h_4(j) = ∂f(t_j, x)/∂f_o |_{x̂_n} ≈ 1

Page 358: Class Notes

4. Compute the estimate update:

   Δx̂_n = ( H^T C^{−1} H )^{−1} H^T C^{−1} Δf(x̂_n)

C is the covariance of the frequency measurements;

usually assumed to be diagonal with measurement variances on the diagonal

In practice you would implement this inverse using Singular Value Decomposition (SVD) due to numerical issues of H being near to

singular (MATLAB will give you a warning when this is a problem)See pp. 676-677 of the book “Numerical Recipes …”

5. Update the estimate using

   x̂_{n+1} = x̂_n + Δx̂_n

Page 359: Class Notes

6. Check for convergence of the solution: look to see if the update is small in some specified sense.

   If "Not Converged"… go to Step 1.

   If Converged, or the Maximum number of iterations is reached… quit the loop & set x̂ = x̂_{n+1}.

7. Compute the Least-Squares Cost of the Converged solution:

   C(x̂) = Σ_{n=1}^{N} [ ( f̃(t_n, x) − f̂(t_n, x̂) ) / σ_n ]²

This last step is often done to

allow assessment of how much confidence you have in the

solution. There are other ways to assess confidence – see discussion in Ch. 15 of

“Numerical Recipes …”

Note: There is no guarantee that this algorithm will converge… it might not converge at all… it might: (i) simply wander around aimlessly,

(ii) oscillate back and forth along some path, or

(iii) wander off in complete divergence.

In practical algorithms it is a good idea to put tests into the code to check for such occurrences

Page 360: Class Notes

[Figure: Simulation Results with 95% CRLB Error Ellipses; x (meters) vs. y (meters), showing the Platform Trajectory and the error ellipses at two zoom levels]
Page 361: Class Notes

1/11

Doppler Tracking

Passive Tracking of an Airborne Radar: An Example of Least-Squares "State" Estimation

Page 362: Class Notes

2/11

Problem Statement

An airborne radar to be located follows a trajectory X(t), Y(t), Z(t) with velocities Vx(t), Vy(t), Vz(t)   (unknown).

It is transmitting a radar signal whose carrier frequency is fo.

The signal is intercepted by a non-moving receiver at a known location Xp, Yp, Zp.

Problem: Estimate the trajectory X(t), Y(t), Z(t)

Solution Here:
  • Measure the received frequency at instants t1, t2, … , tN
  • Assume a simple model for the aircraft's motion
  • Estimate the model parameters to give estimates of the trajectory

Page 363: Class Notes

3/11

An Admission

This problem is somewhat of a "rigged" application…
  • Unlikely it would be done in practice just like this
  • Because it will lead to poorly observable parameters
  • The H matrix is likely to be less than full rank

In real practice we would likely need either:• Multiple Doppler sensors or • A single sensor that can measure other things in addition to Doppler (e.g., bearing).

We present it this way to maximize the similarity to the example of locating a non-moving radar from a moving platform

• Can focus on the main characteristic that arises when the parameter to be estimated is a varying function (i.e. state estimation).

Page 364: Class Notes

4/11

Doppler Shift Model

Relative motion between the emitter and receiver… Doppler Shift.

The frequency observed at time t is related to the unknown transmitted frequency fo by:

   f(t) = fo − (fo/c) · [ Vx(t)(X(t) − Xp) + Vy(t)(Y(t) − Yp) + Vz(t)(Z(t) − Zp) ]
                        / sqrt( (X(t) − Xp)² + (Y(t) − Yp)² + (Z(t) − Zp)² )

We measure this at time instants t1, t2, … , tN:

   f̃(t_i) = f(t_i) + v(t_i)        (Frequency Measurement = true frequency + "Noise")

And group them into a measurement vector:

   f̃ = [ f̃(t_1)  f̃(t_2)  …  f̃(t_N) ]^T

But what are we trying to estimate from this data vector???

Page 365: Class Notes

5/11

Trajectory Model

We can't estimate arbitrary trajectory functions… like X(t), Y(t), etc.

We need a trajectory model… to reduce the problem to estimating a few parameters.

Here we will choose the simplest… the Constant-Velocity Model:

   X(t) = Vx × (t − tN) + X_N
   Y(t) = Vy × (t − tN) + Y_N
   Z(t) = Vz × (t − tN) + Z_N

   (X_N, Y_N, Z_N = Final Positions in the Observation Block;  Vx, Vy, Vz = Velocity Values)

Now, given measurements of the frequencies f(t1), f(t2), … , f(tN)… we wish to estimate the 7-parameter vector:

   x = [ X_N  Y_N  Z_N  Vx  Vy  Vz  fo ]^T

Page 366: Class Notes

6/11

Measurement Model and Estimation Problem

Substituting the Trajectory Model into the Doppler Model gives our measurement model:

   f(t, x) = fo − (fo/c) · [ Vx( [Vx(t − tN) + X_N] − Xp ) + Vy( [Vy(t − tN) + Y_N] − Yp ) + Vz( [Vz(t − tN) + Z_N] − Zp ) ]
                           / sqrt( ([Vx(t − tN) + X_N] − Xp)² + ([Vy(t − tN) + Y_N] − Yp)² + ([Vz(t − tN) + Z_N] − Zp)² )

   f̃(t, x) = f(t, x) + v(t)        (Noisy Frequency Measurement; note the dependence on the parameter vector x)

   f̃(x) = [ f̃(t_1, x)  f̃(t_2, x)  …  f̃(t_N, x) ]^T = f(x) + v

   (Noisy Measurement Vector = Noise-Free Frequency Vector + Noise Vector)

Page 367: Class Notes

7/11

Estimation Problem

Given:      Noisy Data Vector:     f̃(x) = [ f̃(t_1, x)  f̃(t_2, x)  …  f̃(t_N, x) ]^T
            Sensor Position:       Xp, Yp, Zp

Estimate:   Parameter Vector:      x = [ X_N  Y_N  Z_N  Vx  Vy  Vz  fo ]^T        (fo is a "Nuisance" parameter)

This is a nonlinear problem…

Although we could use ML to attack this, we choose LS here partly because we aren't given an explicit noise model and partly because LS is "easily" applied here!!!

Page 368: Class Notes

8/11

Linearize the Nonlinear Model

We have a non-linear measurement model here… so we choose to linearize our model (as before):

   f̃(x) ≈ f(x̂_n) + H [ x − x̂_n ] + v

where…

   x̂_n       is the "current" estimate of the parameter vector

   f(x̂_n)    is the vector of "predicted" frequency measurements computed using the Doppler & Trajectory models with "back-propagation" (see next)

   H = ∂f(t, x)/∂x |_{x = x̂_n} = [ h_1 | h_2 | h_3 | h_4 | h_5 | h_6 | h_7 ]    is the N×7 Jacobian matrix evaluated at the current estimate

Page 369: Class Notes

9/11

Back-Propagate to Get Predicted Frequencies

Given the current parameter estimate:

   x̂_n = [ X̂_N(n)  Ŷ_N(n)  Ẑ_N(n)  V̂x(n)  V̂y(n)  V̂z(n)  f̂o(n) ]^T

Back-Propagate to get the current trajectory estimate:

   X̂_n(t) = V̂x(n) × (t − tN) + X̂_N(n)
   Ŷ_n(t) = V̂y(n) × (t − tN) + Ŷ_N(n)
   Ẑ_n(t) = V̂z(n) × (t − tN) + Ẑ_N(n)

Use the Back-Propagated trajectory to get the predicted frequencies:

   f̂(t_i, x̂_n) = f̂o(n) − (f̂o(n)/c) · [ V̂x(n)(X̂_n(t_i) − Xp) + V̂y(n)(Ŷ_n(t_i) − Yp) + V̂z(n)(Ẑ_n(t_i) − Zp) ]
                                       / sqrt( (X̂_n(t_i) − Xp)² + (Ŷ_n(t_i) − Yp)² + (Ẑ_n(t_i) − Zp)² )

Page 370: Class Notes

10/11

Converting to Linear LS Problem Form

From the linearized model and the back-propagated trajectory estimate we get:

   Δf(x̂_n) ≜ f̃(x) − f(x̂_n) ≈ H Δx_n + v        (Δf = "Residual" Vector;  Δx = "Update" Vector)

This is in the standard form of Linear LS… so the solution is:

   Δx̂_n = ( H^T R^{−1} H )^{−1} H^T R^{−1} Δf(x̂_n)        (R is the covariance matrix of the measurements)

This LS-estimated "update" is then used to get an updated parameter estimate:

   x̂_{n+1} = x̂_n + Δx̂_n

Page 371: Class Notes

11/11

Iterating to the Solution

• n = 0: Start with some initial estimate
• Loop until the stopping criterion is satisfied:
  – n ← n+1
  – Compute the Back-Propagated Trajectory
  – Compute the Residual
  – Compute the Jacobian
  – Compute the Update
  – Check the Update for smallness of norm
• If the Update is small enough… stop
• Otherwise, update the estimate and loop

Page 372: Class Notes

1

Pre-Chapter 10: Results for Two Random Variables

See Reading Notes posted on BB

Page 373: Class Notes

2

Let X and Y be two RVs, each with their own PDF: pX(x) and pY(y).

Their complete probabilistic description is captured in…

Joint PDF of X and Y: pXY(x,y)

Describes probabilities of joint events concerning X and Y.

   Pr( (a < X < b) and (c < Y < d) ) = ∫_a^b ∫_c^d pXY(x,y) dy dx

Marginal PDFs of X and Y: The individual PDFs pX(x) and pY(y)

Imagine "adding up" the joint PDF along one direction of a piece of paper to give values "along one of the margins".

   pX(x) = ∫ pXY(x,y) dy          pY(y) = ∫ pXY(x,y) dx

Page 374: Class Notes

3

Expected Value of Functions of X and Y: You sometimes create a new RV that is a function of the two of them: Z = g(X,Y).

   E{Z} = E_XY{ g(X,Y) } = ∫∫ g(x,y) pXY(x,y) dx dy

Example: Z = X + Y

   E{Z} = E{X + Y} = ∫∫ (x + y) pXY(x,y) dx dy
        = ∫∫ x pXY(x,y) dx dy + ∫∫ y pXY(x,y) dx dy
        = ∫ x [ ∫ pXY(x,y) dy ] dx + ∫ y [ ∫ pXY(x,y) dx ] dy
        = ∫ x pX(x) dx + ∫ y pY(y) dy
        = E{X} + E{Y}

Page 375: Class Notes

4

Conditional PDFs: If you know the value of one RV, how is the remaining RV now distributed?

   pY|X(y|x) = pXY(x,y) / pX(x),   if pX(x) ≠ 0      (0 otherwise)

   pX|Y(x|y) = pXY(x,y) / pY(y),   if pY(y) ≠ 0      (0 otherwise)

Sometimes we think of a specific numerical value upon which we are conditioning… pY|X(y|X = 5)

Other times it is an arbitrary value…

pY|X(y|X = x) or pY|X(y|x) or pY|X(y|X)

Various Notations

Page 376: Class Notes

5

Independence: RVs X and Y are said to be independent if knowledge of the value of one does not change the PDF model for the other:

   pY|X(y|x) = pY(y)          pX|Y(x|y) = pX(x)

This implies (and is implied by)…

   pXY(x,y) = pX(x) pY(y)

since then

   pY|X(y|x) = pXY(x,y)/pX(x) = pX(x)pY(y)/pX(x) = pY(y)          pX|Y(x|y) = pXY(x,y)/pY(y) = pX(x)pY(y)/pY(y) = pX(x)

Page 377: Class Notes

6

Decomposing the Joint PDF: Sometimes it is useful to be able to write the joint PDF in terms of conditional and marginal PDFs.

From our results for conditioning above we get…

   pXY(x,y) = pY|X(y|x) pX(x)
   pXY(x,y) = pX|Y(x|y) pY(y)

From this we can get results for the marginals:

   pX(x) = ∫ pX|Y(x|y) pY(y) dy
   pY(y) = ∫ pY|X(y|x) pX(x) dx

Page 378: Class Notes

7

Bayes’ Rule: Sometimes it is useful to be able to write one conditional PDF in terms of the other conditional PDF.

   pY|X(y|x) = pX|Y(x|y) pY(y) / pX(x)          pX|Y(x|y) = pY|X(y|x) pX(x) / pY(y)

Some alternative versions of Bayes' rule can be obtained by writing the marginal PDFs using some of the above results:

   pY|X(y|x) = pX|Y(x|y) pY(y) / ∫ pX|Y(x|y) pY(y) dy          pX|Y(x|y) = pY|X(y|x) pX(x) / ∫ pY|X(y|x) pX(x) dx

Page 379: Class Notes

8

Conditional Expectations: Once you have a conditional PDF it works EXACTLY like a PDF… that is because it IS a PDF!

Remember that any expectation involves a function of a random variable(s) times a PDF and then integrating that product.

So the trick to working with expected values is to make sure you know three things:

1. What function of which RVs

2. What PDF

3. What variable to integrate over

Page 380: Class Notes

9

For conditional expectations… one idea but several notations!

   E_{X|Y}{ g(X,Y) }         = ∫ g(x,y) pX|Y(x|y) dx

   E_{X|Y=y_o}{ g(X,Y) }     = ∫ g(x,y_o) pX|Y(x|y_o) dx

   E{ g(X,Y) | Y }           = ∫ g(x,y) pX|Y(x|y) dx

   E{ g(X,Y) | Y = y_o }     = ∫ g(x,y_o) pX|Y(x|y_o) dx

Uses subscript on E to indicate that you use the cond. PDF.

Does not explicitly state the value at which Y should be fixed so use an arbitrary y

Uses subscript on E to indicate that you use the cond. PDF.

Explicitly states that the value at which Y should be fixed is yo

Uses “conditional bar” inside brackets of E to indicate use of the cond. PDF.

Does not explicitly state the value at which Y should be fixed so use an arbitrary y

Uses “conditional bar” inside brackets of E to indicate use of the cond. PDF.

Explicitly states that the value at which Y should be fixed is yo

Page 381: Class Notes

10

Decomposing Joint Expectations: When averaging over the joint PDF it is sometimes useful to be able to decompose it into nested averaging in terms of conditional and marginal PDFs.

This uses the results for decomposing joint PDFs.

   E_XY{ g(X,Y) } = E_X{ E_{Y|X}{ g(X,Y) } }

   E_XY{ g(X,Y) } = ∫∫ g(x,y) pXY(x,y) dx dy = ∫ [ ∫ g(x,y) pY|X(y|x) dy ] pX(x) dx

   (the inner integral is E_{Y|X}{g(X,Y)}: an RV, a function of x, that "inherits" the PDF of X!!!)

Page 382: Class Notes

11

Ex. Decomposing Joint Expectations:

Let X = # on Red Die Y = # on Blue Die g(X,Y) = X + Y

),(),( | YXgEEYXgE XYX=

∑=

=+6

15.9

61)6(

yy(6+6)(6+5)(6+4)(6+3)(6+2)(6+1)6

(5+6)(5+5)(5+4)(5+3)(5+2)(5+1)5

(4+6)(4+5)(4+4)(4+3)(4+2)(4+1)4

(3+6)(3+5)(3+4)(3+3)(3+2)(3+1)3

(2+6)(2+5)(2+4)(2+3)(2+2)(2+1)2

(1+6)(1+5)(1+4)(1+3)(1+2)(1+1)1

EY|X654321X/Y

∑=

=+6

15.4

61)1(

yy

∑=

=+6

15.5

61)2(

yy

∑=

=+6

15.8

61)5(

yy

∑=

=+6

15.7

61)4(

yy

∑=

=+6

15.6

61)3(

yy

These constitute

an RV with uniform

probability of 1/6

∑=

===+6

17

61||

xxYEXYEEYXE

Page 383: Class Notes

1

Chapter 10: Bayesian Philosophy

Page 384: Class Notes

2

10.1 Introduction

Up to now… Classical Approach: assumes θ is deterministic. This has a few ramifications:
• Variance of the estimate could depend on θ          (E is w.r.t. p(x;θ))
• In Monte Carlo simulations:
  – M runs done at the same θ
  – must do M runs at each θ of interest
  – averaging done over the data; no averaging over θ values

Bayesian Approach: assumes θ is random with pdf p(θ). This has a few ramifications:
• Variance of the estimate CAN'T depend on θ          (E is w.r.t. p(x,θ), the joint pdf)
• In Monte Carlo simulations:
  – each run done at a randomly chosen θ
  – averaging done over the data AND over θ values

Page 385: Class Notes

3

Why Choose Bayesian?

1. Sometimes we have prior knowledge on θ ⇒ some values are more likely than others

2. Useful when the classical MVU estimator does not exist because of nonuniformity of minimal variance
   [Figure: the variance curves σ²_θ̂i(θ) of two estimators cross as a function of θ, so neither is uniformly best]

3. To combat the "signal estimation problem"… estimate the signal s:

   x = s + w      If s is deterministic and is the parameter to estimate, then H = I

   Classical Solution:   ŝ = ( I^T I )^{−1} I^T x = x      The Signal Estimate is the data itself!!!

The Wiener filter is a Bayesian method to combat this!!

Page 386: Class Notes

4

10.3 Prior Knowledge and Estimation

Bayesian Data Model:• Parameter is “chosen” randomly w/ known “prior PDF”• Then data set is collected• Estimate value chosen for parameter

Every time you collect data, the parameter has a different value, but some values may be more likely to occur than others

This is how you think about it mathematically and how you run simulations to test it.

This is what you know ahead of time about the parameter.

Page 387: Class Notes

5

Ex. of Bayesian Viewpoint: Emitter Location

Emitters are where they are and don't randomly jump around each time you collect data. So why the Bayesian model?

(At least) Three Reasons

1. You may know from maps, intelligence data, other sensors, etc. that certain locations are more likely to have emitters
   • Emitters likely at airfields, unlikely in the middle of a lake

2. Recall the Classical Method: Parm Est. Variance often depends on the parameter
   • It is often desirable (e.g. marketing) to have a single number that measures accuracy.

3. Classical Methods try to give an estimator that gives low variance at each θ value. However, this could give large variance where emitters are likely and low variance where they are unlikely.

Page 388: Class Notes

6

Bayesian Criteria Depend on Joint PDFThere are several different optimization criteria within the Bayesian framework. The most widely used is…

Minimize the Bayesian MSE: Bmse ∫∫ −=

−=

dθd,θpθθ

θθEθ

xxx )()](ˆ[

)ˆ()ˆ(

2

2Take E w.r.t.

joint pdf of x and θ

Can Not Depend on θ Joint pdf of x and θ

To see the difference… compare to the Classical MSE:

∫ −=

−=

xxx dθpθθ

θθEθmse

);()](ˆ[

)ˆ()ˆ(

2

2

pdf of x parameterized by θCan Depend on θ

Page 389: Class Notes

7

Ex. Bayesian for DC Level Zero-Mean White Gaussian

Same as before… x[n] = A + w[n] p(A)

-Ao

1/2Ao

Ao A

But here we use the following model: • that A is random w/ uniform pdf• RVs A and w[n] are independent of each other

Now we want to find the estimator function that maps data x into the estimate of A that minimizes Bayesian MSE:

[ ] xxx

xx

dpdAApAA

dAdApAAABmse

∫ ∫

∫∫−=

−=

)()|(]ˆ[

),(]ˆ[)ˆ(

2

2 Now use… p(x,A) = p(A|x)p(x)

Minimize this for each x valueThis works because p(x) ≥ 0

So… fix x, take its partial derivative, set to 0

Page 390: Class Notes

8

Finding the Partial Derivative gives:

∫∫

∫∫

+−=

−−=

∂−∂

=−∂∂

dAApAdAAAp

dAApAA

dAApA

AAdAApAAA

)|(ˆ2)|(2

)|(]ˆ[2

)|(ˆ]ˆ[)|(]ˆ[ˆ

22

xx

x

xx

=1Setting this equal to zero and solving gives:

x

x

|

)|(ˆ

AE

dAAApA

=

= ∫Conditional mean of A given data x

Bayesian Minimum MSE Estimate = The Mean of “posterior pdf”

MMSE So… we need to explore how to compute this from our data given knowledge of the

Bayesian model for a problem

Page 391: Class Notes

9

Compare this Bayesian Result to the Classical Result:… for a given observed data vector x look at

MVUE = x

AAoAo

p(A|x)p(x;A)

MMSE = EA|x

Before taking any data… what is the best “estimate” of A?• Classical: No best guess exists!• Bayesian: Mean of the Prior PDF…

– observed data “updates” this “a priori” estimate into an “a posteriori” estimate that balances “prior” vs. data

Page 392: Class Notes

10

So… for this example we’ve seen that we need EA|x.How do we compute that!!!?? Well…

∫==

dAAAp

AEA

)|(

x

x

So… we need the posterior pdf of A given the data… which can be found using Bayes’ Rule:

∫=

=

dAApApApAp

pApApAp

)()|()()|(

)()()|()|(

xx

xxx

Allows us to write one cond. PDF in terms of the other way around

Assumed KnownMore easily found than p(A|x)… very much the same structure as the parameterized PDF

used in Classical Methods

Page 393: Class Notes

11

So now we need p(x|A)… For x[n] = A + w[n] we know that

( )

−−=

−=

−=

222

][2

1exp2

1

)][(

)|][()|][(

Anx

Anxp

AAnxpAnxp

w

wx

σπσ

For A known, x[n] is the known A plus random w[n]

PDF of x

Because w[n] and A are assumed Independent

Because w[n] is White Gaussian they are independent… thus, the data conditioned on A is independent:

( )( )

−−= ∑

=

1

0

222/2

][2

1exp2

1)|(N

nN AnxAp

σπσx

Same structure as the parameterized PDF used in Classical Methods… But here A is an RV upon which we have conditioned the PDF!!!

Page 394: Class Notes

12

Now we can use all this to find the MMSE for this problem:

MMSE Estimator…A function that maps observed data into the estimate… No Closed Form for this Case!!!

( ) ( ) [ ]

( ) ( ) [ ]

( )

( )∫ ∑

∫ ∑

∫ ∑

∫ ∑

∫∫∫

=

=

=

=

−−

−−

=

−−

−−

=

===

o

o

o

o

o

o

o

o

A

A

N

n

A

A

N

n

A

A o

N

nN

A

A o

N

nN

dAAnx

dAAnxA

A

dAAAnx

dAAAnxA

dAApAp

dAApAApdAAApAEA

1

0

22

1

0

22

1

0

222/2

1

0

222/2

][2

1exp

][2

1expˆ

2/1][2

1exp2

1

2/1][2

1exp2

1

)()|(

)()|()|(|ˆ

σ

σ

σπσ

σπσ

x

xxx Using

Bayes’ Rule

Use Prior PDF

Use Parameter-Conditioned PDF

IdeaEasy!!

Hard toBuild

Page 395: Class Notes

13

How the Bayesian approach balances a priori and a posteriori info:

AAoAo

p(A)

EA

No Data

AAoAo

p(A|x)

EA|x x

Short DataRecord

AAoAo

p(A|x)

xx ≈|AE

Long DataRecord

Page 396: Class Notes

14

General Insights From Example1. After collecting data: our knowledge is captured by the

posterior PDF p(θ |x)

2. Estimator that minimizes the Bmse is Eθ |x… the mean of the posterior PDF

3. Choice of prior is crucial: Bad Assumption of Prior ⇒ Bad Bayesian Estimate!

(Especially for short data records)

4. Bayesian MMSE estimator always exists!But not necessarily in closed form

(Then must use numerical integration)

Page 397: Class Notes

15

10.4 Choosing a Prior PDFChoice is crucial:1. Must be able to justify it physically

2. Anything other than a Gaussian prior will likely result in no closed-form estimates

We just saw that a uniform prior led to a non-closed form

We’ll see here an example where a Gaussian prior gives a closed form

So… there seems to be a trade-off between:• Choosing the prior PDF as accurately as possible• Choosing the prior PDF to give computable closed form

Page 398: Class Notes

16

Ex. 10.1: DC in WGN with Gaussian Prior PDFWe assume our Bayesian model is now: x[n] = A + w[n]with a prior PDF of

),(~ 2AANA σµ

AWGN

So… for a given value of the RV A the conditional PDF is

( )( )

−−= ∑

=

1

0

222/2

][2

1exp2

1)|(N

nN AnxAp

σπσx

Then to get the needed conditional PDF we use this and the a priori PDF for A in Bayes’ Theorem:

∫=

dAApApApApAp)()|()()|()|(

xxx

Page 399: Class Notes

17

Then… after much algebra and gnashing of teeth we get:See the Book( )

−−= 2

|2|2

|2

1exp2

1)|( xAxAxA

AAp µσπσ

x

which is a Gaussian PDF with

22

2|

2

2|

2

2|

|

11

A

xA

AA

xAxAxA

N

xN

σσ

σ

µσ

σ

σ

σµ

+=

+

= Weighted Combination of a

priori and sample means

“Parallel” Combination of a priori and sample variances

So… the main point here so far is that by assuming:• Gaussian noise• Gaussian a priori PDF on the parameter

We get a Gaussian a posteriori PDF for Bayesian estimation!!

Page 400: Class Notes

18

Now recall that the Bayesian MMSE was the conditional a posteriori mean: x|ˆ AEA =

Because we now have a Gaussian a posteriori PDF it is easy to find an expression for this:

AA

xAxAxA x

NAEA µ

σ

σ

σ

σµ

+

=== 2

2|

2

2|

||ˆ x

After some algebra we get:

( ) 10,1

ˆ2

2

2

22

2

<<−+=

+

+

+

=

αµαα

µσ

σ

σ

σσ

σ

A

A

AA

A

x

N

Nx

N

A

Easily Computable Estimator:• Sample mean computed from data• σ known from data model• µA and σA known from prior model

22

2| 1

1|varˆvar

A

xA NAA

σσ

σ+

=== x

Little or Poor Data:

Much or Good Data:

AA AN µσσ ≈<< ˆ/22

xANA ≈>> ˆ/22 σσ

Page 401: Class Notes

19

Comments on this Example for Gaussian Noise and Gaussian Prior1. Closed-Form Solution for Estimate!2. Estimate is… Weighted sum of prior mean & data mean3. Weights balance between prior info quality and data quality4. As N increases…

a. Estimate EA|x movesb. Accuracy varA|x moves

xA →µNA /22 σσ →

p(A|x)

A1AxA ≈2

ˆ

N2 > N1 > 0AoA µ=ˆ

No = 0

Page 402: Class Notes

20

2|)ˆ( xAABmse σ=Bmse for this Example:

To see this: ( ) ( )

( ) ( )

( ) ( ) ( )

( ) ( )[ ] ( )∫ ∫

∫∫

∫∫

==

−=

−=

−=

−=

xxxx

xxxx

xx

x

dpdAApAEA

dAdpApAEA

dAdApAA

AAEABmse

xAA!!!! "!!!! #$

2||var

2

2

2

2

|

|

ˆˆ

σ

General Result: Bmse = posterior variance averaged over PDF of x

In this case σA|x is not a function of x:

( ) ( ) 2|

2|

ˆxAxA dpABmse σσ == ∫ xx

Page 403: Class Notes

21

The big thing that this example shows:

Gaussian Data & Gaussian Prior gives Closed-Form MMSE SolutionThis will hold in general!

Page 404: Class Notes

1

10.5 Properties of Gaussian PDFTo help us develop some general MMSE theory for the Gaussian Data/Gaussian Prior case, we need to have some solid results forjoint and conditional Gaussian PDFs.

Well consider the bivariate case but the ideas carry over to the general N-dimensional case.

Page 405: Class Notes

2

Bivariate Gaussian Joint PDF for 2 RVs X and Y

−−= −

!!!! "!!!! #$formquadratic

y

xT

y

x

y

x

y

xyxp

µ

µ

µ

µ

π1

1/2 21exp

||21),( CC

=

Y

X

Y

XE

µ

µ

=

=

=

2

2

2

2

)var(),cov(

),cov()var(

YYX

YXX

YYX

XYX

YXY

YXX

σσρσ

σρσσ

σσ

σσC

-10 -5 0 5 10-8

-6

-4

-2

0

2

4

6

8

x

y

-8-6

-4-2

02

46

8

-10

-5

0

5

100

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

xy

p(x,

y)

xy

p(x,

y)

x

y

Page 406: Class Notes

3

Marginal PDFs of Bivariate GaussianWhat are the marginal (or individual) PDFs?

∫∞

∞−= dyyxpxp ),()( ∫

∞−= dxyxpyp ),()(

We know that we can get them by integrating:

After performing these integrals you get that:

X ~ N(µX, varX) Y ~ N(µY, varY)

-10 -5 0 5 10-8

-6

-4

-2

0

2

4

6

8

x

y

x

y

p(x)

p(y)

Page 407: Class Notes

4

Comment on Jointly GaussianWe have used the term Jointly Gaussian

Q: EXACTLY what does that mean?A: That the RVs have a joint PDF that is Gaussian

−−= −

y

xT

y

x

y

x

y

xyxp

µ

µ

µ

µ

π1

1/2 21exp

||21),( CC

Weve shown that jointly Gaussian RVs also have Gaussian marginal PDFs

Q: Does having Gaussian Marginals imply Jointly Gaussian?

In other words if X is Gaussian and Y is Gaussian is it always true that X and Y are jointly Gaussian???

A: No!!!!!

Example for 2 RVs

See Reading Notes on Counter Example

posted on BB

Page 408: Class Notes

5

Well construct a counterexample: start with a zero-mean, uncorrelated 2-D joint Gaussian PDF and modify it so it is no longer 2-D Gaussian but still has Gaussian marginals.

+

−= 2

2

2

2

21exp

21),(

YXYXXY

yxyxpσσσπσ x

y

x

y

But if we modify it by: Setting it to 0 in the shaded regions Doubling its value elsewhere

We get a 2-D PDF that is not a joint Gaussian but the marginals are the same as the original!!!!

Page 409: Class Notes

6

Conditional PDFs of Bivariate GaussianWhat are the conditional PDFs?

If you know that X has taken value X = xo, how is Y distributed?

×

×=

1616258.0

16258.025C

Slope of LinecovX,Y/varX = ρσY/σX

∫∞

∞−

==dyyxp

yxpxp

xxpxyp),(

),()(

)|()|(0

0

0

00

Slice @ xo

Normalizer

-15 -10 -5 0 5 10 15-15

-10

-5

0

5

10

15

x

y

p(y|X=5)

p(y)Note: Conditioning on correlated RV shifts mean reduces variance

Page 410: Class Notes

7

Theorem 10.1: Conditional PDF of Bivariate GaussianLet X and Y be random variables distributed jointly Gaussian with mean vector [EX EY]T and covariance matrix

=

=

2

2

)var(),cov(

),cov()var(

YYX

XYX

YXY

YXX

σσ

σσC

Then p(y|x) is also Gaussian with mean and variance given by:

( )

( )

| 2

XExYE

XExYExXYE

oX

Y

oX

XYo

−+=

−+==

σρσ

σσ

( ) 22222

2

22

1

|var

YYY

X

XYYoxXY

σρσρσ

σσσ

−=−=

−==

Slope of Line

Amount of Reduction

Reduction Factor

Page 411: Class Notes

8

Impact on MMSE

We know the MMSE of RV Y after observing the RV X = xo: oxXYEY == |

So using the ideas we have just seen: if the data and the parameter are jointly Gaussian, then

( )|2 XExYExXYEY oX

XYoMMSE −+===

σσ

It is the correlation between the RVs X and Y that allow us to perform Bayesian estimation.

Page 412: Class Notes

9

Theorem 10.2: Conditional PDF of Multivariate GaussianLet X (k×1) and Y (l×1) be random vectors distributed jointlyGaussian with mean vector [EXT EYT ]T and covariance matrix

××

××=

=

)()(

)()(

llkl

lkkk

YYYX

XYXX

CC

CCC

Then p(y|x) is also Gaussian with mean vector and covariance matrix given by:

( )| 1 XxCCYxXY XXYX EEE oo −+== −XYXXYXYYxX|Y CCCCC 1−

= −=o

( )| 2 XExYExXYE oX

XYo −+==

σσ

2

22|var

X

XYYoxXY

σσσ −==

Compare to Bivariate Results

For the Gaussian case the cond. covariance does not depend on the conditioning x-value!!!

Page 413: Class Notes

10

10.6 Bayesian Linear ModelNow we have all the machinery we need to find the MMSE for the Bayesian Linear Model

wHθx +=

N×1 N×p known

p×1~N(µθ,Cθ)

N×1~N(0,Cw)

Clearly, x is Gaussian and θ is GaussianBut are they jointly Gaussian???

If yes then we can use Theorem 10.2 to get the MMSE for θ!!!

Answer = Yes!!

Page 414: Class Notes

11

Bayesian Linear Model is Jointly Gaussianθ and w are each Gaussian and are independent

Thus their joint PDF is a product of Gaussians which has the form of a jointly Gaussian PDF

Can now use: a linear transform of jointly Gaussian is jointly Gaussian

=

w

θ

0I

IH

θ

xJointly Gaussian

Thus, Thm. 10.2 applies! Posterior PDF is

! Joint Gaussian

! Completely described by its mean and variance

Page 415: Class Notes

12

Conditional PDF for Bayesian Linear ModelTo apply Theorem 10.2, notationally let X = x and Y = θ.

First we need EX = H Eθ + Ew = Hµθ

EY = Eθ = µθ

And also θYY CC = ( )( ) ( )[ ] ( )[ ] ( )( ) TTT

T

T

EE

E

EEE

wwHµθµθH

wµθHwµθH

xxxxC

θCθθ

θθ

XX

+−−=

+−+−=

−−=

!!! "!!! #$

TT E wwHHCC θXX +=

Cross Terms are Zero because θ and w are

independent

Page 416: Class Notes

13

( )( ) ( )( ) ( )( ) TT

T

T

E

E

E

Hµθµθ

HµwHθµθ

µxµθCC

θθ

θθ

xθθxYX

−−=

−+−=

−−==SimilarlyTHCC θθx =Use Eθw = 0

Eµθw = 0

Then Theorem 10.2 gives the conditional PDFs mean and cov(and we know the conditional mean is the MMSE estimate)

( ) ( )θwθθθ HµxCHHCHCµ

xθθ

−++=

=

−1

|

TT

MMSE E

Data Prediction Error

Update TransformationMaps unpredictable part

a priori estimate

Cross Correlation CθxRelative Quality

( ) θwθθθxθ HCCHHCHCCC1

|−

+−= TTPosterior Covariance:

a priori covariance Reduction Due to Data

Posterior Mean:

Bayesian MMSE

Estimator

Page 417: Class Notes

14

Ex. 10.2: DC in AWGN w/ Gaussian PriorData Model: x[n] = A + w[n] A & w[n] are independent

),(~ 2AAN σµ ),0(~ 2σN

Write in linear model form:

x = 1A + w with H = 1 = [ 1 1 1]T

Now General Result gives the MMSE estimate as:

)-()(

)-()(|

12

2

2

2

1222

ATATA

A

AT

AT

AAMMSE AEA

µσσ

σσµ

µσσσµ

1x11I1

1xI111x

++=

++==

Can simplify using The Matrix Inversion Lemma

Page 418: Class Notes

15

Aside: Matrix Inversion Lemma

( ) ( ) 1111111 −−−−−−− +−=+ DACBDABAABCDA

n×n n×m m×m m×n

( )uAu

AuuAAuuA 1

1111

1 −

−−−−

+−=+ T

TT

n×n n×1

Special Case (m = 1):

Page 419: Class Notes

16

Continuing the Example Apply the Matrix Inversion Lemma:

)(/

1

)(/

)(/

)(

222

2

222

2

222

2

1

2

2

2

2

AA

AA

AT

A

TAA

AA

TTA

A

ATATA

AMMSE

NxNN

N

NN

N

A

µσσσ

σµ

µσσσ

σµ

µσσσ

σµ

µσσ

σσµ

+−+=

+−+=

+−+=

++=

1x11

1x11I1

1x11I1

)(/

22

2

AA

AAMMSE x

NA µ

σσσµ −

++=

Use Matrix Inv Lemma

Pass through 1T

& use 1T 1 = N

Factor Out 1T

& use 1T 1 = N

Algebraic Manipulation

Error BetweenData-Only Est.

& Prior-Only Est.

GainFactor

a priori estimate

When data is bad (σ2/N >> σ2A),

gain is small, data has little use

When data is good (σ2/N >> σ2A),

gain is large, data has large use

AMMSEA µ≈

xAMMSE ≈

Page 420: Class Notes

17

Using similar manipulations gives:

NN

NA

AA

A

/11

1)x|var(

222

2

22

σσσσ

σσ

+=

+

=

Like || resistors small one wins!⇒ var (A|x) is ≈ the smaller of:

data estimate variance prior variance

NA A /11

)x|var(1

22 σσ+=

Or looking at it another way:

additive information!

Page 421: Class Notes

18

10.7 Nuisance Parameters

One difficulty in classical methods is that nuisance parameters must explicitly dealt with.

In Bayesian methods they are simply Integrated Away!!!!

Recall Emitter Location: [x y z f0]

In Bayesian ApproachFrom p(x, y, z, f0 | x) can get p(x, y, z | x):

Nuisance Parameter

∫= 00 )x|,,,()|,,( dffzyxpzyxp x

Then find conditional mean for the MMSE estimate!

Page 422: Class Notes

1

Ch. 11 General Bayesian Estimators

Page 423: Class Notes

2

IntroductionIn Chapter 10 we:

• introduced the idea of a “a priori” information on θ⇒ use “prior” pdf: p(θ)

• defined a new optimality criterion⇒ Bayesian MSE

• showed the Bmse is minimized by E θ|x

called:• “mean of posterior pdf”• “conditional mean”

In Chapter 11 we will:• define a more general optimality criterion

⇒ leads to several different Bayesian approaches⇒ includes Bmse as special case

Why? Provides flexibility in balancing: • model, • performance, and• computations

Page 424: Class Notes

3

11.3 Risk FunctionsPreviously we used Bmse as the Bayesian measure to minimize

( )

εθθ

θθθ

∆=−

−=

ˆ

),(...ˆ 2xptrwEBmse

So, Bmse is… Expected value of square of error

Let’s write this in a way that will allow us to generalize it.

Define a quadratic Cost Function: ( )22 ˆ)( θθεε −==C

Then we have that )(εCEBmse =

ε

C(ε) = ε2

Why limit the cost function to just quadratic?

Page 425: Class Notes

4

General Bayesian Criteria1. Define a cost function: C(ε)

2. Define Bayes Risk: R = EC(ε) w.r.t. p(x,θ )

)ˆ()ˆ( θθθ −= CER

Depends on choice of estimator

3. Minimize Bayes Risk w.r.t. estimate θ

The choice of the cost function can be tailored to:• Express importance of avoiding certain kinds of errors• Yield desirable forms for estimates

– e.g., easily computed• Etc.

Page 426: Class Notes

5

Three Common Cost Functions

1. Quadratic: C(ε) = ε2

ε

C(ε)

2. Absolute: C(ε) = | ε |ε

C(ε)

3. Hit-or-Miss:

<=

δε

δεε

,1

,0)(C

δ > 0 and small ε

C(ε)

δ–δ

Page 427: Class Notes

6

General Bayesian EstimatorsDerive how to choose estimator to minimize the chosen risk:

[ ] dxxpdθθ|xpθθC

xpθ|xp

dθdxx,θpθθC

θθCE

θg

)()()ˆ(

)()(

)()ˆ(

)ˆ()ˆ(

)ˆ(

∫ ∫

∫∫

∆=

−=

=

−=

−=θR

must minimize this for each x value

So… for a given desired cost function… you have to find the form of the optimal estimator

Page 428: Class Notes

7

The Optimal Estimates for the Typical Costs1. Quadratic: ( ) )ˆ(ˆ)ˆ(

2θθθθ BmseE =

−=R

x

x

|( of mean

θ

θθ

p

E

=

=

As we saw in Ch. 10

2. Absolute: θθθ ˆ)ˆ( −= ER )|( of median ˆ xθθ p=

3. Hit-or-Miss: )|( of mode ˆ xθθ p=

p(θ|x)

θMode

MedianMean

If p(θ|x) is unimodal & symmetricmean = median = mode

“Maximum A Posteriori”or MAP

Page 429: Class Notes

8

Derivation for Absolute Cost Function

θθθθ

θ

θθθθ

θθθθθθθθθ

θθθθθ

ˆ|ˆ| whereregion

ˆ

ˆ|ˆ| whereregion

ˆ

)|()ˆ()|()ˆ(

)|(|ˆ|)ˆ(

−=−

−=−

∞−

∞−

∫∫

−+−=

−=

dpdp

dpg

xx

x

Writing out the function to be minimized gives:

Now set 0ˆ)ˆ(=

θ∂θ∂g and use Leibnitz’s rule for ∫

φ

φ∂∂ )(

)(

2

1

),(u

udvvuh

u

0)|()|(ˆ

ˆ=−⇒ ∫ ∫

∞−

∞θ

θ

dθθpdθθp xx

which is satisfied if… (area to the left) = (area to the right)⇒ Median of conditional PDF

Page 430: Class Notes

9

Derivation for Hit-or-Miss Cost Function

∫∫

+

+

∞−

∞−

−=

⋅+⋅=

−=

δθ

δθ

δθ

δθ

θθ

θθθθ

θθθθ

ˆ

ˆ

ˆ

ˆ

)x|(1

)x|(1)x|(1

)x|()ˆ()ˆ(

dp

dpdp

dθpCg

Writing out the function to be minimized gives:

Almost all the probability = 1 – left out

Maximize this integral

So… center the integral around peak of integrand⇒ Mode of conditional PDF

Page 431: Class Notes

10

11.4 MMSE EstimatorsWe’ve already seen the solution for the scalar parameter case

x

x

|( of mean

θ

θθ

p

E

=

=

Here we’ll look at:• Extension to the vector parameter case• Analysis of Useful Properties

Page 432: Class Notes

11

Vector MMSE EstimatorThe criterion is… minimize the MSE for each component

Vector Parameter: [ ]Tpθθθ 21=θ

Vector Estimate: [ ]Tpθθθ ˆˆˆˆ21=θ

is chosen to minimize each of the MSE elements:

∫ −=− iiiiii ddpE θθθθθθ xx ),()ˆ()ˆ( 22= p(x, θ) integrated over all other θj’s

∫ ∫

∫ ∫

=

=

=

θxθx

xx

xx

ddp

dddp

ddp

i

ppi

iiii

),(

),,,(

),(ˆ

11

θ

θθθθθ

θθθθ

From the scalar case we know the solution is:

|ˆ xii E θθ =

Page 433: Class Notes

12

So… putting all these into a vector gives:

[ ][ ]

[ ] x

xxx

θ

|

|||

ˆˆˆˆ

21

21

21

Tp

Tp

Tp

E

EEE

θθθ

θθθ

θθθ

=

=

=

xθθ |ˆ E=Vector MMSE Estimate

= Vector Conditional Mean

Similarly… [ ] pidpBmse iii ,,1)(C)ˆ( | …== ∫ xxxθθ

where [ ][ ] TEEE |||| xθθxθθC xθxθ −−=

Page 434: Class Notes

13

Ex. 11.1 Bayesian Fourier AnalysisSignal model is: x[n] = acos(2πfon) + bsin(2πfon) + w[n]

AWGN w/ zero mean and σ2),(~ 2I0θ θσN

b

a

=

θ and w[n] are independent for each n

This is a common propagation model called Rayleigh Fading

Write in matrix form: x = Hθ + w Bayesian Linear Model

↓↓

↑↑

= sinecosineH

Page 435: Class Notes

14

1

222

1

2211|ˆ

−−

+=

+==

σσσσσ θθ

HHICxHHHIxθθ x|θ

TTTE

Results from Ch. 10 show that

For fo chosen such that H has orthogonal columns then

xHxθθ TE

+==

22

2

11

1

σσ

σ

θ

=

=

=

=

1

0

1

0

)2sin(][2ˆ

)2cos(][2ˆ

N

no

N

no

nfnxN

b

nfnxN

a

πβ

πβ

2

2 /21

1

θσσ

βN

+

=

Fourier Coefficients in the Brackets

Recall: Same form as classical result, except there β = 1

Note: β ≈ 1 if σθ2 >> 2σ2/N

⇒ if prior knowledge is poor, this degrades to classical

Page 436: Class Notes

15

Impact of Poor Prior KnowledgeConclusion: For poor prior knowledge in Bayesian Linear Model

MMSE Est. → MVU Est.

Can see this holds in general: Recall that

[ ] [ ]θwwθθ HµxCHHCHCµxθθ +++== −−−− 1111|ˆ TTE

For no prior information: 0Cθ →−1 and 0µθ→

[ ] xCHHCHθ ww111ˆ −−−→ TT

MVUE for General Linear Model

Page 437: Class Notes

16

Useful Properties of MMSE Est.1. Commutes over affine mappings:

If we have α = Aθ + b then bθAα += ˆˆ

2. Additive Property for independent data sets Assume θ, x1, x2 are jointly Gaussian w/ x1 and x2 independent

][][ˆ22

111

12211

xxCCxxCCθθ xxθxxθ EEE −+−+= −−

a priori Estimate Update due to x1 Update due to x2

Proof: Let x = [x1T x2

T]T. The jointly Gaussian assumption gives:

[ ]

−−

+=

−+=

00

][ˆ

22

111

1

1

2

121 xx

xxC

CCCθ

xxCCθθ

EE

E

EE

x

xxx

xx

θθ

θ Indep. ⇒ Block Diagonal

Simplify to get the result

Will be used for Kalman Filter

3. Jointly Gaussian case leads to a linear estimator: mPxθ +=ˆ

Page 438: Class Notes

1

11.5 MAP EstimatorRecall that the “hit-or-miss” cost function gave the MAP estimator… it maximizes the a posteriori PDF

Q: Given that the MMSE estimator is “the most natural” one…why would we consider the MAP estimator?

A: If x and θ are not jointly Gaussian, the form for MMSE estimate requires integration to find the conditional mean.

MAP avoids this Computational Problem!Note: MAP doesn’t require this integration

Trade “natural criterion” vs. “computational ease”

What else do you gain? More flexibility to choose the prior PDF

Page 439: Class Notes

2

Notation and Form for MAP

)|(maxargˆ xθθθ

pMAP =

MAPθNotation: maximizes the posterior PDF

“arg max” extracts the value of θ that causes the maximum

Equivalent Form (via Bayes’ Rule): )]()|([maxargˆ θθθθ

ppMAP x=

Proof: Use )()()|()|(

xxx

pppp θθθ =

)]()|([maxarg)(

)()|(maxargˆ θθθθθθθ

ppp

ppMAP x

xx

=

=

Does not depend on θ

Page 440: Class Notes

3

Vector MAP < Not as straight-forward as vector extension for MMSE >

The obvious extension leads to problems:

iθChoose to minimize ˆ()ˆ( iii CE θθθ −=R

Exp. over p(x,θi)

⇒ )|(maxargˆ xii pi

θθθ

= 1-D marginal conditioned on x

Need to integrate to get it!!

Problem: The whole point of MAP was to avoid doing the integration needed in MMSE!!!

Is there a way around this?Can we find an Integration-Free Vector MAP?

pddpp θθθ 21 )|θ()|( ∫∫= xx

Page 441: Class Notes

4

Circular Hit-or-Miss Cost Function Not in Book

First look at the p-dimensional cost function for this “troubling”version of a vector map:

It consists of p individual applications of 1-D “Hit-or-Miss”

ε1

ε2

δ-δ-δ

δ

=square innot ),(,1

square in ),(,0),(

21

2121

εε

εεεεC

The corners of the square “let too much in” ⇒ use a circle!

ε1

ε2

δ

<=

δ

δ

ε

εε

,1

,0)(C

This actually seems more natural than the “square” cost function!!!

Page 442: Class Notes

5

MAP Estimate using Circular Hit-or-Miss Back to Book

So… what vector Bayesian estimator comes from using this circular hit-or-miss cost function?

Can show that it is the following “Vector MAP”

)|(maxargˆ xθθθ

pMAP = Does Not Require Integration!!!

That is… find the maximum of the joint conditional PDF

in all θi conditioned on x

Page 443: Class Notes

6

How Do These Vector MAP Versions CompareIn general: They are NOT the Same!!

Example: p = 2p(θ1, θ2 | x)

1/6

1/3

1/6

θ1

θ2

1 2 3 4 5

1

2

The vector MAP using Circular Hit-or-Miss is: [ ]T5.05.2ˆ =θ

To find the vector MAP using the element-wise maximization:

θ1

p(θ1|x)

1 2 3 4 5

1/6

1/3

θ2

p(θ2|x)

1 2

1/3

2/3[ ]T5.15.2ˆ =θ

Page 444: Class Notes

7

“Bayesian MLE”Recall… As we keep getting good data, p(θ|x) becomes more concentrated as a function of θ. But… since:

)]()|([maxarg)|(maxargˆ θθxxθθθθ

pppMAP ==

… p(x|θ) should also become more concentrated as a function of θ.

p(x|θ)p(θ)

θ

• Note that the prior PDF is nearly constant where p(x|θ) is non-zero

• This becomes truer as N →∞, and p(x|θ) gets more concentrated

)|(maxarg)]()|([maxarg θxθθxθθ

ppp ≈

MAP “Bayesian MLE”

Uses conditional PDF rather than the parameterized PDF

Page 445: Class Notes

8

11.6 Performance CharacterizationThe performance of Bayesian estimators is characterized by looking at the estimation error: θθε ˆ−=

Random (due to a priori PDF)

Random (due to x)

Performance characterized by error’s PDF p(ε)We’ll focus on Mean and Variance

If ε is Gaussian then these tell the whole storyThis will be the case for the Bayesian Linear Model (see Thm. 10.3)

We’ll also concentrate on the MMSE Estimator

Page 446: Class Notes

9

Performance of Scalar MMSE Estimator

∫==

θθθ

θθ

dp

E

)|(

x

xThe estimator is:

Function of x

So the estimation error is: ),(| θθθε xx fE =−=Function of two RV’s

General Result for a function of two RVs: Z = f (X, Y)

dydxyxpyxfZE XY ),(),( ∫∫=

dydxyxpZEyxfZEZEZ XY ),()),(()(var 22 ∫∫ −=−=

Page 447: Class Notes

10

Evaluated as seen below

So… applying the mean result gives:

00

|][

]|[][

]|[

|

|

|

||

|

,

==

−=

−=

−=

−=

x

xx

x

xxx

xx

x

x

x

x

x

E

EEE

EEEE

EEE

EEE

E

θθ

θθ

θθ

θθε

θ

θ

θθ

θ

θ

See Chart on “Decomposing Joint

Expectations” in “Notes on 2 RVs”

Pass Eθ |x through the terms

Two Notations for the same thing

|)|(|

)|(|]|[

|

|

on dependnot does

|

xxx

xxx

x

xx

θθθθ

θθθθ

θθ

θθ

θ

θ

EdpE

dpEEE

==

=

∫0 =εE

i.e., the Mean of the Estimation Error (over data

& parm) is Zero!!!!

Page 448: Class Notes

11

And… applying the variance result gives:

)ˆ(

),()ˆ(

)ˆ()(var

,2

222

θ

θθθθ

θθεεεε

θ

Bmse

ddp

EEEE

=

−=

−==−=

∫∫ xxx

Use Eε = 0

So… the MMSE estimation error has:

• mean = 0

• var = Bmse

So… when we minimize Bmsewe are minimizing the variance

of the estimate

If ε is Gaussian then ( ))ˆ(,0~ θε BmseN

Page 449: Class Notes

12

Ex. 11.6: DC Level in WGN w/ Gaussian PriorWe saw that

22 /1/1)ˆ(

ANABmse

σσ +=

AAA

A

N

Nx

NA µ

σσ

σ

σσσ

++

+=

/

/

22

2

22

2with

constantconstant

So… A is Gaussian

+ 22 /1/1,0~

ANN

σσε

Note: As N gets large this PDF collapses around 0.

This estimate is “consistent in the Bayesian sense”

Bayesian Consistency: For large N

(regardless of the realization of A!)

AA ≈ˆ

this is Gaussian because it is a linear combo of the jointly

Gaussian data samples

If X is Gaussian then Y = aX + b

is also Gaussian

Page 450: Class Notes

13

Performance of Vector MMSE Estimatorθθε ˆ−=Vector estimation error: The mean result is obvious.

Must extend the variance result:

θθx,ε MεεCε ˆcov ∆=== TE

Some New Notation…“Bayesian Mean Square

Error Matrix”

Look some more at this:

|

|

ˆ

]][[

]][[

||

||

xθx

xθx

θx,θ

xθθxθθ

xθθxθθM

CE

EEEE

EEE

T

T

=

−−=

−−=

0ε =E |ˆ xθxθε MC CE==

General Vector Results:

See Chart on “Decomposing

Joint Expectations”

= Cθ|x In general this is a function of x

Page 451: Class Notes

14

θM ˆThe Diagonal Elements of are Bmse’s of the Estimates

[ ]

)(

),(]|[

),(]|[

2

21

i

iiii

iiiiT

Bmse

dθdθpE

ddpEE

i

p

θ

θθ

θθ

θ

θ θ

=

−=

−=

∫ ∫

∫ ∫ ∫

x

xθx,

xxx

θxθxxεε

To see this:

Why do we call the error covariance the “Bayesian MSE Matrix”?

Integrate over all the other

parameters…“marginalizing”

the PDF

Page 452: Class Notes

15

Perf. of MMSE Est. for Jointly Gaussian CaseLet the data vector x and the parameter vector θ be jointly Gaussian.

0ε =ENothing new to say about the mean result:

Now… look at the Error Covariance (i.e., Bayesian MSq Matrix):

|ˆ xθxθε MC CE==Recall General Result:

Thm 10.2 says that for Jointly Gaussian Vectors we get that…Cθ|x does NOT depend on x

xθxθxθε MC ||ˆ CCE ===

xθxθxθ

xθθε

CCCC

MC

1

−−=

== C

Thm 10.2 also gives the form as:

Page 453: Class Notes

16

Perf. of MMSE Est. for Bayesian Linear ModelwHθx += ~N(µθ,Cθ)

~N(0,Cw)Recall the model:

0ε =ENothing new to say about the mean result:

Now… for the error covariance… this is nothing more than a special case of the jointly Gaussian case we just saw:

xθxθxθ

xθθε

CCCC

MC

1

−−=

== CResults for Jointly Gaussian Case

THCC θθx =

wθx CHHCC += T

Evaluations for Bayesian Linear

( )( ) 111

1|ˆ

−−−

+=

+−===

HCHC

HCCHHCHCCMC

θwθθθxθθε

T

TTC

Alternate Form … see (10.33)

Page 454: Class Notes

17

Summary of MMSE Est. Error Results1. For all cases: Est. Error is zero mean 0ε =E

2. Error Covariance for three “Nested” Cases:

|ˆ xθxθε MC CE==

[ ]iiiBmse θM ˆ)( =θ

General Case:

Jointly Gaussian: xθxθxθxθθε CCCCMC 1|ˆ

−−=== C

Bayesian Linear:Jointly Gaussian

& Linear Observation

( )( ) 111

1|ˆ

−−−

+=

+−===

HCHC

HCCHHCHCCMC

θwθθθxθθε

T

TTC

Page 455: Class Notes

18

Main Bayesian Approaches

MAP“Hit-or-Miss” Cost Function

MMSE“Squared” Cost Function

(In General: Nonlinear Estimate)

xθxθ CMxθθ

|ˆ :Cov. Err.

ˆ :EstimateE

E=

=

Jointly Gaussian x and θ(Yields Linear Estimate)

( )xθxxθxθθθ

xxθx

CCCCM

xxCCθθ1

ˆ

1

:Cov. Err.

ˆ :Estimate−

−=

−+= EE

Bayesian Linear Model(Yields Linear Estimate)

( ) ( )

( ) θθθθθ

θθθ

HCCHHCHCCM

HµxCHHCHCθθ1

ˆ

1

:Cov. Err.

ˆ :Estimate−

+−=

−++=

wTT

wTTE

)|(maxargˆ :Estimate xθθθ

p=

Hard to Implement…numerical integration

Easy to Implement…Performance Analysis

is Challenging

Easier to Implement…Determining Cθx can be

hard to find

“Easy” to Implement…Only need accurate model: Cθ, Cw, H

Page 456: Class Notes

19

11.7 Example: Bayesian DeconvolutionThis example shows the power of Bayesian approaches over classical methods in signal estimation problems (i.e. estimating the signal rather than some parameters)

h(t) Σs(t)

w(t)

x(t)

Model as a zero-mean

WSS Gaussian

Process w/ known

ACF Rs(τ)

Assumed Known

Gaussian Bandlimited White Noise w/ Known Variance

Measured Data = Samples of x(t)

Goal: Observe x(t) & Estimate s(t)Note: At Output…

s(t) is Smeared & Noisy

So… model as D-T System

Page 457: Class Notes

20

Sampled-Data Formulation

+

−−−

=

− ]1[

]1[

]0[

]1[

]1[

]0[

][]1[]1[

00]0[]1[

000]0[

]1[

]1[

]0[

Nw

w

w

ns

s

s

nNhNhNh

hh

h

Nx

x

x

ss

Measured Data Vector x

Known Observation Matrix H

Signal Vector sto Estimate

AWGN: wCw = σ2I

We have modeled s(t) as zero-mean WSS process with known ACF…So… s[n] is a D-T WSS process with known ACF Rs[m]…So… vector s has a known covariance matrix (Toeplitz & Symmetric) given by:

=

]0[]1[]2[]1[

]1[

]2[]0[]1[]2[

]1[]0[]1[

]1[]2[]1[]0[

sssss

s

ssss

sss

sssss

RRRnR

R

RRRR

RRR

nRRRR

sC

Model for Prior PDF is then s ~ N(0,Cs)

s and w are independent

Page 458: Class Notes

21

MMSE Solution for DeconvolutionWe have the case of the Bayesian Linear Model… so:

( ) xIHHCHCs ss12ˆ −

+= σTT

Note that this is a linear estimateThis matrix is called “The Weiner Filter”

( ) 121ˆ /

−− +== σHHCMC ssεT

The performance of the filter is characterized by:

Page 459: Class Notes

22

Sub-Example: No Inverse Filtering, Noise OnlyDirect observation of s with H = I… x = s + w

Σs(t)

w(t)

x(t) Goal: Observe x(t) & “De-Noise” s(t)Note: At Output… s(t) with Noise

( ) xICCs ss12ˆ −

+= σ ( ) 121ˆ /

−− +== σICMC ssε

Note: Dimensionality Problem… # of “parms” = # of observationsClassical Methods Fail… xs =ˆ Bayesian methods can solve it!!

For insight… consider “single sample” case:

)(]0[]0[1

]0[]0[

]0[]0[s 22 SNRRxxR

R s

s

s

ση

ηη

σ=

+=

+=

0]0[ˆ]0[]0[ˆ ≈≈ sSNRLow

xsSNRHigh

Data Driven Prior PDF Driven

Page 460: Class Notes

23

Sub-Sub-Example: Specific Signal ModelDirect observation of s with H = I… x = s + w

But here… the signal follows a specific random signal model

][]1[][ 1 nunsans +−−= u[n] is White Gaussian “Driving Process”

This is a 1st-order “auto-regressive” model: AR(1)Such a random signal has an ACF & PSD of

||12

1

2)(

1][ ku

s aa

kR −

−=

σ22

1

2

1)(

fj

us

eafP

π

σ−+

=

See Figures 11.9 & 11.10 in the

Textbook

Page 461: Class Notes

1

Ch. 12 Linear Bayesian Estimators

Page 462: Class Notes

2

IntroductionIn chapter 11 we saw:

the MMSE estimator takes a simple form when x and θ are jointly Gaussian – it is linear and used only the 1st and 2nd order moments (means and covariances).

Without the Gaussian assumption, the General MMSE estimator requires integrations to implement – undesirable!

So what to do if we can’t “assume Gaussian” but want MMSE?

Keep the MMSE criteria

But…restrict the form of the estimator to be LINEAR

⇒ “LMMSE Estimator” Something similar to

BLUE!

LMMSE Estimator = Wiener Filter

Page 463: Class Notes

3

Bayesian Approaches

MAP“Hit-or-Miss” Cost Function

xθxθ CMxθθ

|ˆ :Cov. Err.

ˆ :EstimateE

E=

=

MMSE“Squared” Cost Function

(Nonlinear Estimate)

Other Cost

Functions

LMMSEForce Linear EstimateKnown: Eθ,Ex, C

Jointly Gaussian x and θ(Yields Linear Estimate)

( )xθxxθxθθθ

xxθx

CCCCM

xxCCθθ1

ˆ

1

:Cov. Err.

ˆ :Estimate−

−=

−+= EE ( )xθxxθxθθθ

xxθx

CCCCM

xxCCθθ1

ˆ

1

:Cov. Err.

ˆ :−

−=

−+= EEEstimateSame!

Bayesian Linear Model(Yields Linear Estimate)

( ) ( )

( ) θθθθθ

θθθ

HCCHHCHCCM

HµxCHHCHCθθ1

ˆ

1

:Cov. Err.

ˆ :Estimate−

+−=

−++=

wTT

wTTE

Page 464: Class Notes

4

12.3 Linear MMSE Estimator SolutionScalar Parameter Case:Estimate: θ, a random variable realizationGiven: data vector x = [x[0] x[1] . . .x[N-1] ]T

Assume:– Joint PDF p(x, θ) is unknown– But…its 1st two moments are known– There is some statistical dependence between x and θ

• E.g., Could estimate θ = salary using x = 10 past years’ taxes owed• E.g., Can’t estimate θ = salary using x = 10 past years’ number of Christmas

cards sent

Goal: Make the best possible estimate while using an affine form for the estimator

∑−

=

+=1

0][ˆ

N

nNn anxaθ

Handles Non-Zero Mean Case

)ˆ()ˆ( 2θθθ θ −= xEBmseChoose an to minimize

Page 465: Class Notes

5

Derivation of Optimal LMMSE CoefficientsUsing the desired affine form of the estimator, the Bmse is

+−= ∑

=

21

0][)ˆ(

N

nNn anxaEBmse θθ

0)ˆ(=

∂∂

NaBmse θ

0][21

0=+−− ∑

=

N

nNn anxaE θ

Step #1: Focus on aN

Passing ∂/∂aN through E gives

∑−

=−=

1

0][

N

nnN nxEaEa θ

Note: aN = 0 if Eθ = Ex[n] = 0

Page 466: Class Notes

6

Step #2: Plug-In Step #1 Result for aN

−−−=

−−−= ∑

=

2

21

0

)()(

)(])[][()ˆ(

!"!#$!!"!!#$scalarscalar

T

N

nn

EEE

EnxEnxaEBmse

θθ

θθθ

xxa

where a = [a0 a1 . . . aN-1]T

Only up to N-1

Note: aT (x – Ex) = (x – Ex)Ta since it is scalar

Page 467: Class Notes

7

Thus, expanding out [aT (x – Ex) – (θ – Eθ )] 2 gives

θθθθTT

T

TT

TT

c

Etc

EtcEEE

EtcEEEBmse

+−−=

+=

+−−=

+−−=

accaaCa

aCa

axxxxa

axxxxa

xxxx

xx .

.))((

.))(()ˆ(θ

N×N N×1 1×N 1×1

θTθ

Tθθ θEθE

xx

xx

cc

xcxc

=

== cross-covariancevectors ⇒

θθθTT cBmse +−= xxx caaCa 2)ˆ(θ

Page 468: Class Notes

8

Step #3: Minimize w.r.t. a1, a2, … , aN-1

Only up to N-10)ˆ(

=∂

∂aθBmse

θxxxcCa 1−= 1−= xxxCca θT0caC xxx =− θ22

This is where the statistical dependence between the data and the parameter is used… via a cross-covariance vector

Step #4: Combine Results

[ ] ( )

][ˆ1

0

xxaxaxa EEEE

anxa

TTT

N

nNn

−+=−+=

+= ∑−

=

θθ

θ

So the Optimal LMMSE Estimate is:

xCc xxx1ˆ −= θθ( )ˆ 1 xxCc xxx EE θ −+= −θθ If Means = 0

Note: LMMSE Estimate Only Needs 1st and 2nd Moments… not PDFs!!

Page 469: Class Notes

9

Step #5: Find Minimum BmseSubstitute into Bmse result and simplify:

θθθθθθ

θθθθθθ

θθθTT

c

c

cBmse

+−=

+−=

+−=

−−

−−−

xxxxxxxx

xxxxxxxxxxxx

xxx

cCccCc

cCccCCCc

caaCa

11

111

2

2

2)ˆ(θ

θθθθcBmse xxxx cCc 1)ˆ( −−=θ

Note: If θ and x are statistically independent then Cθx = 0

ˆ θθ E=Totally based on prior info… the data is uselessθθcBmse =)ˆ(θ

Page 470: Class Notes

10

Ex. 12.1 DC Level in WGN with Uniform PriorRecall: Uniform prior gave a non-closed form requiring integration

…but changing to a Gaussian prior fixed this.

Here we keep the uniform prior and get a simple form:

• by using the Linear MMSE

xCc xxx1ˆ −= AAFor this problem the LMMSE estimate is:

( )( ) I11

w1w1Cxx

22 σσ +=

++=

TA

TAAE

( ) T

A

Tθ AAEAE

1

w1xc x

2

σ=

+==Need

A & w are uncorrelatedA & w are

uncorrelated

xN

AA

A

+=

22

2

σσσ

Page 471: Class Notes

11

12.4 Geometrical InterpretationsAbstract Vector Space

Mathematicians first tackled “physical” vector spaces like RN

and CN, etc.

But… then abstracted the “bare essence” of these structures into the general idea of a vector space.

We’ve seen that we can interpret Linear LS in terms of “Physical” vector spaces.

We’ll now see that we can interpret Linear MMSE in terms of “Abstract” vector space ideas.

Page 472: Class Notes

12

Abstract Vector Space RulesAn abstract vector space consists of a set of “mathematical objects” called vectors and another set called scalars that obey:1. There is a well-defined operation of “addition” of vectors that

gives a vector in the set, and…• “Adding” is commutative and associative• There is a vector in the set – call it 0 – for which “adding” it to any

vector in the set gives back that same vector • For every vector there is another vector s.t. when the 2 are added you get

the 0 vector

2. There is a well-defined operation of “multiplying” a vector by a “scalar” and it gives a vector in the set, and…

• “Multiplying” is associative • Multiplying a vector by the scalar 1 gives back the same vector

3. The distributive property holds• Multiplication distributes over vector addition• Multiplication distributes over scalar addition

Page 473: Class Notes

13

Examples of Abstract Vector Spaces

1. Scalars = Real Numbers Vectors = Nth Degree Polynomials w/ Real Coefficients

2. Scalars = Real Numbers Vectors = M×N Matrices of Real Numbers

3. Scalars = Real Numbers Vectors = Functions from [0,1] to R

4. Scalars = Real Numbers Vectors = Real-Valued Random Variables with Zero Mean

Colliding Terminology… a scalar RV is a vector!!!

Page 474: Class Notes

14

There is a well-defined concept of inner product s.t. all the rules of “ordinary” inner product still hold

• <x,y> = <y, x>*

• <a1x1+ a2x2,y> = a1<x1,y > + a2<x2,y> • <x,x> ≥ 0; <x,x> = 0 iff x = 0

Note: an inner product “induces” a norm (or length measure):

||x||2 = <x,x>

So an inner product space has:1. Two sets of elements: Vectors and Scalars2. Algebraic Structure (Vector Addition & Scalar Multiplication) 3. Geometric Structure

• Direction (Inner Product)• Distance (Norm)

Not needed for Real IP Spaces

Inner Product SpacesAn extension of the idea of Vector Space… must also have:

Page 475: Class Notes

15

Inner Product Space of Random VariablesVectors: Set of all real RVs w/ zero mean & finite variance (ZMFV)Scalars: Set of all real numbersInner Product: <X,Y> = EXYClaim… This is an Inner Product Space

Inner Product is Correlation!Uncorrelated = Orthogonal

First this is a vector space

Addition Properties: X+Y is another ZMFV RV1. It is Associative and Commutative: X+(Y+Z) = (X+Y)+Z; X+Y = Y+X2. The zero RV has variance of 0 (What is an RV with var = 0???)3. The negative of RV X is –X

Multiplication Properties: For any real # a, aX is another ZMFV RV1. It is Associative: a(bX) = (ab)X2. 1X = X

Distributive Properties:1. a(X+Y) = aX + aY2. (a+b)X = aX + bX

NextThis is an inner product space• <a1X1+ a2X2,Y> = E(a1X1+ a2X2)Y

= a1EX1Y+ a2EX2Y• ||X||2 = <X, X> = EX2 = varX ≥ 0

Page 476: Class Notes

16

Use IP Space Ideas for Section 12.3∑−

==

1

0][ˆ

N

nn nxaθApply to the Estimation of a zero-mean scalar RV:

Trying to estimate the realization of RV θ via a linear combination of N other RVs x[0], x[1], x[2],… x[N-1]

Zero-Mean… don’t need aN

Now…using our new vector space view of RVs, this is the same structural mathematics that we saw for the Linear LS !

( ) ( )θθθθθ ˆˆˆ 22BmseE =

−=−N = 2 Case Minimize:

Each RV is viewed as a vector

θ

x[0]x[1] θ

Connects to Geometry Connects to MSE

Recall Orthogonality Principle!!!

Estimation Error ⊥ Data Space

0][ˆ )( =− nxE θθ

Page 477: Class Notes

17

Now apply this Orthogonality Principle…TTE 0x =− )( θθ with xaT=θ

axxxxxax0xxa )( TTTTTTTT EEEEE =⇒=⇒=− θθθ

θxxx caC = The Normal Equations”

Assuming that Cxx is invertible…

θxxxcCa 1−= xCcxa xxx1ˆ −== θθ T

Same as before!!!

Page 478: Class Notes

18

12.5 Vector LMMSE EstimatorMeaning a “Physical” Vector

[ ]Tpθθθ %21=θEstimate: Realization of

Linear Estimator: aAxθ +=ˆ

Goal: Minimize Bmse for each element

View ith row in A and ith element in a as forming a scalar LMMSE estimator for θi

Already know the individual element solutions!

• Write them down

• Combine into matrix form

Page 479: Class Notes

19

Solutions to Vector LMMSE

][ˆ 1 xxCCθθ xxθx EE −+= −The Vector LMMSE estimate is:

Now… p×N Matrix…Cross-Covariance Matrix

Still… N×N Matrix…Covariance Matrix

xCCθ xxθx1ˆ −=If Eθ = 0 & Ex = 0

Can show similarly that Bmse Matrix is

)ˆ)(ˆ(ˆTE θθθθMθ −−=

xθxxθxθθθ CCCCM 1ˆ

−−=

p×pprior Cov. Matrix

p×N N×pN×N

Page 480: Class Notes

20

Two Properties of LMMSE Estimator1. Commutes over affine transformations

If bAθα += and θ is LMMSE Estimate

Then bθAα += ˆˆ is LMMSE Estimate for α

2. If α = θ1 + θ2 then 21ˆˆˆ θθα +=

Page 481: Class Notes

21

Bayesian Gauss-Markov Theorem Like G-M Theorem for the BLUE

wHθx +=Let the data be modeled as

knownp×1 random

mean µθCov Mat Cθθ

(Not Gaussian)

N×1 randomzero mean

Cov Mat Cw(Not Gaussian)

( ) ][ˆ 1θwθθθθθ HµxCHHCHCµθ −++=

−TT

Application of previous results, evaluated for this data model gives:

( ) θθwθθθθθθε HCCHHCHCCC1−

+−= TTMMSE Matrix: εθ CM =ˆ

Same forms as for Bayesian Linear Model (which include Gaussian assumption)

Except here… the result is suboptimal… unless the optimal estimate is linear

In practice… generally don’t know if linear estimate is optimal… but we useLMMSE for its simple form!

The challenge is to “guess” or estimate the needed means & cov matrices

Page 482: Class Notes

1

12.6 Sequential LMMSE EstimationSame kind if setting as for Sequential LS…

Fixed number of parameters (but here they are modeled as random)

Increasing number of data samples

][][][ nnn wθHx +=Data Model:

(n+1)×1x[n] = [x[0] … x[n]]T

p×1unknown PDF

known mean & cov

(n+1)×1w[n] = [w[0] … w[n]]T

unknown PDFknown mean & covCw must be diagonal

with elements σ2n

θ & w are uncorrelated

−=

][

]1[][

n

nn

Th

HH

(n+1)×pknown

Goal: Given an estimate ]1[ˆ −nθ based on x[n – 1], when newdata sample x[n] arrives, update the estimate to ][ˆ nθ

Page 483: Class Notes

2

Development of Sequential LMMSE EstimateOur Approach Here: Use vector space ideas to derive solution for “DC Level in White Noise” then write down general solution. ][][ nwAnx +=

For convenience… Assume both A and w[n] have zero mean

Given x[0] we can find the LMMSE estimate

]0[]0[])[(

])[(]0[]0[]0[ˆ

22

2

220 xxnwAE

nwAAExxEAxEA

A

A

+=

+

+=

=

σσσ

Now we seek to sequentially update this estimate with the info from x[1]…

Page 484: Class Notes

3

• From Vector Space View: A

x[0]

x[1]0A

1A

• First project x[1] onto x[0] to get ]0|1[x

Estimate new data given old data…

Prediction!

Notation: the estimate “at 1” based “on 0”• Use Orthogonality Principle

]0|1[ˆ]1[]1[~ xxx −=∆ is ⊥ to x[0]

⇒ This is the new, non-redundant info provided by data x[1]It is called the “innovation”

x[0]

x[1]0A

]1[~x

Page 485: Class Notes

4

• Find Estimation Update by Projecting A onto Innovation

]1[~]1[~]1[~]1[~

]1[~]1[~

,]1[~]1[~

]1[~]1[~

,ˆ221 x

xExAEx

x

xAxx

xxAA

===∆

Gain: k1

• Recall Property: Two Estimates from ⊥ data just add:

]0[]1[~ xx ⊥

]1[~ˆ

ˆˆˆ

10

101

xkA

AAA

+=

∆+=

[ ]]0|1[ˆ]1[ˆˆ101 xxkAA −+=

x[0]

x[1]0A

]1[~x1A∆

1APredictedNew Data

New Data

Old Estimate

Gain

“Innovation” is ⊥ Old Data

Page 486: Class Notes

5

The Innovations SequenceThe Innovations Sequence is…

• Key to the derivation & implementation of Seq. LMMSE• A sequence of orthogonal (i.e., uncorrelated) RVs• Broadly significant in Signal Processing and Controls

]0[x

…],2[~],1[~],0[~ xxx

]0|1[ˆ]1[ xx −

]1|2[ˆ]2[ xx −

Means: “Based on ALL data up to n = 1

(inclusive)

Page 487: Class Notes

6

General Sequential LMMSE EstimationInitialization No Data Yet! ⇒ Use Prior Information

ˆ1 θθ E=−

Estimate

θθCM =−1 MMSE Matrix

Update Loop For n = 0, 1, 2, …

nnTnn

nnn

hMhhMk

12

1

+=σ

Gain Vector Calculation

[ ]11ˆ][ˆˆ−− −+= n

Tnnnn nx θhkθθ Estimate Update

[ ] 1−−= nTnnn MhkIM MMSE Matrix Update

Page 488: Class Notes

7

Sequential LMMSE Block Diagram

][][][ nnn wθHx +=

−=

][

]1[][

n

nn

Th

HH

Data Model

kn Σ

21 ,, nnn σhM −

Compute Gain

][~ nx

z-1

1ˆ−nθ

+

+

x[n]Σ

Tnh

1ˆ]1|[ˆ −=− n

Tnnnx θh

+

InnovationObservation

( )11ˆ][ˆ−− −+ n

Tnnn nx θhkθ

nθ−

Delay Updated Estimate

Previous Estimate

nhPredicted Observation

Exact Same Structure as for Sequential Linear LS!!

Page 489: Class Notes

8

Comments on Sequential LMMSE Estimation1. Same structure as for sequential linear LS. BUT… they solve

the estimation problem under very different assumptions.

2. No matrix inversion required… So computationally Efficient

3. Gain vector kn weighs confidence in new data (σ2n) against all

previous data (Mn-1)• when previous data is better, gain is small… don’t use new data much• when new data is better, gain is large… new data is heavily used

4. If you know noise statistics σ2n and observation rows hn

T over the desired range of n:

• Can run MMSE Matrix Recursion without data measurements!!!• This provides a Predictive Performance Analysis

Page 490: Class Notes

9

12.7 Examples Wiener Filtering During WWII, Norbert Wiener developed the mathematical ideas that led to the Wiener filter when he was working on ways to improve anti-aircraft guns.

He posed the problem in C-T form and sought the best linear filter that would reduce the effect of noise in the observed A/Ctrajectory.

He modeled the aircraft motion as a wide-sense stationary random process and used the MMSE as the criterion for optimality. The solutions were not simple and there were many different ways of interpreting and casting the results.

The results were difficult for engineers of the time to understand.

Others (Kolmogorov, Hopf, Levinson, etc.) developed these ideas for the D-T case and various special cases.

Page 491: Class Notes

10

Weiner Filter: Model and Problem Statement

Signal Model: x[n] = s[n] + w[n]

Observed: Noisy SignalModel as WSS, Zero-Mean

Cxx = Rxx

covariance matrix

correlationmatrix TE xxRxx =

))(( TEEE xxxxCxx −−=

Desired SignalModel as WSS, Zero-Mean

Css = Rss

NoiseModel as WSS, Zero-Mean

Cww = Rww

Same if zero-mean

Problem Statement: Process x[n] using a linear filter to provide a “de-noised” version of the signal that has minimum MSErelative to the desired signal

LMMSE Problem!

Page 492: Class Notes

11

Filtering, Smoothing, PredictionTerminology for three different ways to cast the Wiener filter problem

Filtering Smoothing PredictionGiven: x[0], x[1], …, x[n] Given: x[0], x[1], …, x[N-1] Given: x[0], x[1], …, x[N-1]

Find: ]1[ˆ,],1[ˆ],0[ˆ −Nsss … 0,][ˆ >+ llNxFind: ][ˆ ns Find:

xCCθ xxθx1ˆ −=

x[n]

]0[s

21 n

]1[s

]2[s

]3[s

3

x[n]

21

x[n]

]5[x

21 4 5

Note!!

n n3 3

]0[s ]1[s ]2[s ]3[s

All three solved using General LMMSE Est.

Page 493: Class Notes

12

Filtering Smoothing

xCCθ xxθx1ˆ −=

Prediction(vector) sθ =(scalar) ][nsθ =

[ ]

)(vector! ~

]0[][

][

][

Tss

ssss

T

T

rnr

nsE

nsE

r

s

xC x

=

=

=

=

"

θ

wwss

TT

Txx

E

E

RR

wwss

wswsC

+=

+=

++= ))((

[ ][ ][ ]1)1()1()1()1(1

1)(~][ˆ

×++×++×

−+=

nnnn

wwssT

ssns xRRr

)(Matrix!

)(

ss

TT

T

T

E

E

E

R

swss

wss

sxCθx

=

+=

+=

=

(scalar) ]1[ lNxθ +−=

x not s!

[ ]

)(vector! ~

][]1[

]1[

Txx

xxxx

Tx

lrlNr

lNxE

r

xC

=

+−=

+−=

"

θ

wwss

TT

Txx

E

E

RR

wwss

wswsC

+=

+=

++= ))(( xx

Txx E

RxxC

==

[ ][ ][ ]1

1)(ˆ×××

−+=NNNNNwwssss xRRRs

[ ][ ][ ]11

1~]1[ˆ×××

−=+−NNNN

xxTxxlNx xRr

Page 494: Class Notes

13

Comments on Filtering: FIR WienerxaxRRr

a

Twwss

Tss

T

ns =+= −### $### %&

1)(~][ˆ

[ ][ ]Tnn

Tnnn

aaa

nhhh

01

)()()(

...

][...]1[]0[

−=

=h

∑=

−=n

k

n knxkhns0

)( ][][][ˆ

Wiener Filter as Time-Varying FIR Filter

• Causal!• Length Grows!

Wiener-Hopf Filtering Equations

[ ]Tssssssss

sswwss

nrrrxx

][...]1[]0[

)(

=

=+

r

rhRRR##$##%&

=

][

]1[]0[

][

]1[]0[

]0[]1[][

]1[]0[]1[][]1[]0[

)(

)(

)(

Toeplitz & Symmetric

nr

rr

nh

hh

rnrnr

nrrrnrrr

ss

ss

ss

n

n

n

xxxxxx

xxxxxx

xxxxxx

''

###### $###### %& "'(''

""

In Principle: Solve WHF Eqs for filter h at each nIn Practice: Use Levinson Recursion to Recursively Solve

Page 495: Class Notes

14

Comments on Filtering: IIR Wiener

Can Show: as n →∞ Wiener filter becomes Time-InvariantThus: h(n)[k] → h[k]

Then the Wiener-Hopf Equations become:

…,1,0][][][0

==−∑∞

=llrklrkh ss

kxx

and these are solved using so-called “Spectral Factorization”

And… the Wiener Filter becomes IIR Time-Invariant:

∑∞

=−=

0][][][ˆ

kknxkhns

Page 496: Class Notes

15

Revisit the FIR Wiener: Fixed Length L

∑−

=−=

1

0][][][ˆ

L

kknxkhns

]6[s

The way the Wiener filter was formulated above, the length of filter grew so that the current estimate was based on all the past data

Reformulate so that current estimate is based on only L most recent data: … x[3] x[4] x[5] x[6] x[7] x[8] x[9] …

]7[s]8[s

Wiener-Hopf Filtering Equations for WSS Process w/ Fixed FIR

[ ]Tssssssss

sswwss

nrrrxx

][...]1[]0[

)(

=

=+

r

rhRRR##$##%&

=

][

]1[

]0[

][

]1[

]0[

]0[]1[][

]1[]0[]1[

][]1[]0[

Toeplitz & Symmetric

nr

r

r

nh

h

h

rnrnr

nrrr

nrrr

ss

ss

ss

xxxxxx

xxxxxx

xxxxxx

''

####### $####### %& "

'(''

"

"

Solve W-H Filtering Eqs ONCE for filter h

Page 497: Class Notes

16

Comments on Smoothing: FIR Smoother

WxxRRRsW

=+= −### $### %&

1)(ˆ wwssssEach row of W like a FIR Filter

• Time-Varying • Non-Causal!• Block-Based

To interpret this – Consider N=1 Case:

]0[1

]0[]0[]0[

]0[]0[ˆ

SNRLow ,0SNR High,1

xSNR

SNRxrr

rswwss

ss

#$#%&

+=

+

=

Page 498: Class Notes

17

Comments on Smoothing: IIR Smoother

Estimate s[n] based on …, x[–1], x[0], x[1],…

∑∞

−∞=−=

kknxkhns ][][][ˆ Time-Invariant &

Non-Causal IIR Filter

The Wiener-Hopf Equations become:

∞<<∞−=−∑∞

−∞=llrklrkh ss

kxx ][][][ ][][][ nrnrnh ssxx =∗

)()()(

)()()(

fPfPfP

fPfPfH

wwss

ss

xx

ss

+=

=Differs From Filter CaseSum over all k

Differs From Filter CaseSolve for all l

H( f ) ≈ 1 when Pss( f ) >> Pww( f )H( f ) ≈ 0 when Pss( f ) << Pww( f )

Page 499: Class Notes

18

Relationship of Prediction to AR Est. & Yule-WalkerWiener-Hopf Prediction Equations

[ ]Txxxxxxxx

xxxx

Nlrlrlr ]1[...]1[][ −++=

=

r

rhR

−+

+=

−−

−−

]1[

]1[][

][

]1[]0[

]0[]2[]1[

]2[]0[]1[]1[]1[]0[

Toeplitz & Symmetric

Nlr

lrlr

nh

hh

rNrNr

NrrrNrrr

xx

xx

xx

xxxxxx

xxxxxx

xxxxxx

''

######## $######## %& "'(''

""

For l=1 we get EXACTLY the Yule-Walker Eqs used inEx. 7.18 to solve for the ML estimates of the AR parameters!!! FIR Prediction Coefficients are estimated AR parms

Recall: we first estimated the ACF lags rxx[k] using the dataThen used the estimates to find estimates of the AR parameters

xxxx rhR ˆˆ =

Page 500: Class Notes

19

Relationship of Prediction to Inverse/Whitening Filter

)(11

za−

u[k] x[k]

)(za

Σ

AR Model

][ˆ kx

+

Inverse Filter: 1a(z)

FIR Pred.

Signal Observed

u[k]

White NoiseWhite Noise

1-Step PredictionImagination

& Modeling

Physical Reality

Page 501: Class Notes

20

Results for 1-Step Prediction: For AR(3)

0 20 40 60 80 100-6

-4

-2

0

2

4

Sample Index, k

Sign

al V

alue

Signal PredictionError

At each k we predict x[k] using past 3 samples

Application to Data CompressionSmaller Dynamic Range of Error gives More Efficient Binary Coding

(e.g., DPCM – Differential Pulse Code Modulation)

Page 502: Class Notes

1

Ch. 13 Kalman Filters

Page 503: Class Notes

2

IntroductionIn 1960, Rudolf Kalman developed a way to solve some of the practical difficulties that arise when trying to apply Weiner filters.

There are D-T and C-T versions of the Kalman Filter… we will only consider the D-T version.

The Kalman filter is widely used in:• Control Systems• Navigation Systems • Tracking Systems

It is less widely used in signal processing applications

KF initially arose in the field of control systems –

in order to make a system do what you

want, you must know what it is doing now

Page 504: Class Notes

3

The Three Keys to Leading to the Kalman Filter

Wiener Filter: LMMSE of a Signal (i.e., a Varying Parameter)

Sequential LMMSE: Sequentially Estimate a Fixed Parameter

State-Space Models: Dynamical Models for Varying Parameters

Kalman Filter: Sequential LMMSE Estimation for a time-varying parameter vector – but the time variation is

constrained to follow a “state-space” dynamical model.

Aside: There are many ways to mathematically model dynamical systems…• Differential/Difference Equations• Convolution Integral/Summation• Transfer Function via Laplace/Z transforms• State-Space Model

Page 505: Class Notes

4

13.3 State-Variable Dynamical ModelsSystem State: the collection of variables needed to know how to determine how the system will “exist” at some future time (in the absence of an input)…

For an RLC circuit… you need to know all of its current capacitor voltages and all of its current inductor currents

Motivational Example: Constant Velocity Aircraft in 2-D

=

)(

)(

)(

)(

)(

tv

tv

tr

tr

t

y

x

y

x

s

A/C positions (m) For the constant velocity model we would constrain vx(t) & vy(t) to be constants Vx & Vy.A/C velocities (m/s)

If we know s(to) and there is no input we know how the A/C behaves for all future times: rx(to + τ) = Vxτ + rx(to)

rx(to + τ) = rx(to) + Vxτry(to + τ) = ry(to) + Vyτ

Page 506: Class Notes

5

D-T State Model for Constant Velocity A/CBecause measurements are often taken at discrete times… we oftenneed D-T models for what are otherwise C-T systems

(This is the same as using a difference equation to approximate a differential equation)

If every increment of n corresponds to a duration of ∆ sec and there is no driving force then we can write a D-T State Model as:

=

1000

0100

010

001

A

]1[][ −= nn Ass

State Transition Matrix

rx[n] = rx[n-1] + vx[n-1]∆

ry [n] = ry[n-1] + vy[n-1]∆

vx[n] = vx[n-1]∆

vy[n] = vy[n-1]∆

We can include the effect of a vector input:

][]1[][ nnn BuAss +−=

Input could be deterministic and/or random.Matrix B combines inputs & distributes them to states.

Page 507: Class Notes

6

Thm 13.1 Vector Gauss-Markov Model Don’t confuse with the G-M Thm. of Ch. 6This theorem characterizes the probability model for a

specific state-space model with Gaussian Inputs

][]1[][ nnn BuAss +−=Linear State Model: n ≥ 0

p×1 p×pknown

p×rknown

r×1

s[n]: “state vector” is a vector Gauss-Markov processA: “state transition matrix”; assumed |λi| < 1 for stabilityB: “input matrix”u[n]: “driving noise” is vector WGN w/ zero means[-1]: “initial state” ~ N(µs,Cs) and independent of u[n]

eigenvalues

u[n] ~ N(0,Q)Eu[n] uT[m] = 0, n ≠ m

Page 508: Class Notes

7

Theorem:• s[n] for n ≥ 0 is Gaussian with the following characteristics…• Mean of state vector is sµAs 1][ += nnE diverges if e-values

have |λi| ≥ 1• Covariance between state vectors at m and n is

( ) ∑

−=

+++ +=

−−=≥

m

nmk

Tkn-mTkTnm

TnEnmEmEnmnm

)(

]][][]][[][[],[:for

11 ABQBAACA

ssssC

s

s

],[],[:for mnnmnm Tss CC =< State Process is Not WSS!

• Covariance Matrix: C[n] = Cs[n,n] (this is just notation)• Propagation of Mean & Covariance:

TTnn

nEnE

BQBAACC

sAs

+−=

−=

]1[][

]1[][

Page 509: Class Notes

8

Proof: (only for the scalar case: p = 1)For the scalar case the model is: s[n] = a s[n-1] + b u[n] n ≥ 0

differs a bit from (13.1) etc.

Now we can just iterate this model and surmise its general form:]0[]1[]0[ bu as s +−=

]1[]0[]1[

]1[]0[]1[

2 bu abu s a

bu as s

++−=

+=

]2[]1[]0[]1[

]2[]1[]2[

23 bu abubu a s a

bu as s

+++−=

+=

∑=

+ −+−=n

k

kn knbua s a ns0

1 ][]1[][

!Now easy to find the mean:

sn

n

k

kn

a

knuEba sE a nsEs

µ

µ

1

0 0

1 ][]1[][

+

= ==

+

=

−+−= ∑ "#"$%"#"$%

… as claimed!z.i. response… exponential

z.s. response… convolution

Page 510: Class Notes

9

Covariance between s[m] and s[n] is:

  Cs[m,n] = E{ (s[m] - E{s[m]}) (s[n] - E{s[n]}) }

          = E{ ( a^(m+1)(s[-1] - µs) + Σ_{k=0}^{m} a^k b u[m-k] )
               ( a^(n+1)(s[-1] - µs) + Σ_{l=0}^{n} a^l b u[n-l] ) }

          = a^(m+1) a^(n+1) σs^2 + Σ_{k=0}^{m} Σ_{l=0}^{n} a^k b a^l b E{ u[m-k] u[n-l] }

where E{ u[m-k] u[n-l] } = σu^2 δ( l - (k - (m-n)) ).  (Must use different dummy variables!!  The cross-terms are zero because s[-1] is independent of the zero-mean driving noise.)

For m ≥ n:  Cs[m,n] = a^(m+1) a^(n+1) σs^2 + σu^2 b^2 Σ_{k=m-n}^{m} a^k a^(k-(m-n))

For m < n:  Cs[m,n] = Cs[n,m]
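A quick Monte Carlo check of this scalar covariance formula. The parameter values and the pair (m, n) are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)

a, b, sig_u, mu_s, sig_s = 0.9, 1.0, 0.5, 2.0, 1.0   # assumed scalar Gauss-Markov parameters
N, trials = 30, 100_000
m, n = 10, 7                                          # check Cs[m,n] for m >= n

s = mu_s + sig_s * rng.standard_normal(trials)        # s[-1] ~ N(mu_s, sig_s^2), one per trial
hist = []
for k in range(N):
    s = a * s + b * sig_u * rng.standard_normal(trials)
    hist.append(s.copy())                             # hist[k] holds all realizations of s[k]

emp = np.mean((hist[m] - hist[m].mean()) * (hist[n] - hist[n].mean()))
theory = a**(m+1) * a**(n+1) * sig_s**2 + \
         sig_u**2 * b**2 * sum(a**k * a**(k - (m-n)) for k in range(m-n, m+1))
print(emp, theory)   # the two should agree to within Monte Carlo error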

For mean & covariance propagation: from s[n] = a s[n-1] + b u[n],

  E{s[n]} = a E{s[n-1]} + b E{u[n]} = a E{s[n-1]}      (propagates as in the theorem, since E{u[n]} = 0)

  var{s[n]} = E{ (s[n] - E{s[n]})^2 }
            = E{ ( a (s[n-1] - E{s[n-1]}) + b u[n] )^2 }
            = a^2 var{s[n-1]} + b^2 σu^2

… which propagates as in the theorem.   < End of Proof >

So we now have:
• A Random Dynamical Model (a State Model)
• A Statistical Characterization of it

Random Model for "Constant" Velocity A/C

  [ rx[n] ]     [ 1  0  ∆  0 ] [ rx[n-1] ]     [   0   ]
  [ ry[n] ]  =  [ 0  1  0  ∆ ] [ ry[n-1] ]  +  [   0   ]
  [ vx[n] ]     [ 0  0  1  0 ] [ vx[n-1] ]     [ ux[n] ]
  [ vy[n] ]     [ 0  0  0  1 ] [ vy[n-1] ]     [ uy[n] ]

The A s[n-1] term is the deterministic propagation of the constant-velocity motion; the input term is a random perturbation of the "constant" velocities.

  cov{u[n]} = [ 0  0    0     0   ]
              [ 0  0    0     0   ]
              [ 0  0  σu^2    0   ]
              [ 0  0    0   σu^2  ]

[Figure: example set of "constant-velocity" A/C trajectories, X position (m) vs. Y position (m). The red line is the non-random constant-velocity trajectory.]

Parameters used:  ∆ = 1 sec,  σu = 5 m/s,  rx[-1] = ry[-1] = 0 m,  vx[-1] = vy[-1] = 100 m/s

Note: σu = 5 m/s per 1-s step corresponds to accelerations on the order of (5 m/s)/1 s = 5 m/s^2.
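A sketch that generates one such random trajectory with these parameters (plotting is left out; the random seed is arbitrary):

import numpy as np

rng = np.random.default_rng(1)

dt, sigma_u = 1.0, 5.0                 # 1-s steps, 5 m/s velocity perturbation (as in the figure)
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]])
s = np.array([0.0, 0.0, 100.0, 100.0]) # [rx, ry, vx, vy] at n = -1

traj = []
for n in range(150):
    u = np.concatenate(([0.0, 0.0], sigma_u * rng.standard_normal(2)))  # perturb velocities only
    s = A @ s + u
    traj.append(s[:2].copy())

traj = np.array(traj)                  # each row is (rx[n], ry[n]); plot these to reproduce the figure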

Observation Model

So… we have a random state-variable model for the dynamics of the "signal" (the "signal" is often some true A/C trajectory).

We need to have some observations (i.e., measurements) of the “signal”

• In Navigation Systems… inertial sensors make noisy measurements at intervals of time

• In Tracking Systems… sensing systems make noisy measurements (e.g., range and angles) at intervals of time

Linear Observation Model:   x[n] = H[n] s[n] + w[n]

  x[n]: measured "observation" vector at each time (allows multiple measurements at each time)
  H[n]: observation matrix… can change with time
  s[n]: state vector process being observed
  w[n]: vector noise process

The Estimation Problem

Observe a sequence of observation vectors x[0], x[1], …, x[n] and compute an estimate ŝ[n|n] of the state vector s[n]: the state at time n, estimated using the observations up to time n.

Notation:  ŝ[n|m] = estimate of s[n] using x[0], x[1], …, x[m]

Want a Recursive Solution:
  Given:  ŝ[n|n] and a new observation vector x[n+1]
  Find:   ŝ[n+1|n+1]

Three Cases of Interest:
• Scalar State – Scalar Observation
• Vector State – Scalar Observation
• Vector State – Vector Observation

13.4 Scalar Kalman Filter

Data Model
To derive the Kalman filter we need the data model:

  s[n] = a s[n-1] + u[n]     <State Equation>
  x[n] = s[n] + w[n]         <Observation Equation>

Assumptions
1. u[n] is zero-mean Gaussian, white, with E{u^2[n]} = σu^2
2. w[n] is zero-mean Gaussian, white, with E{w^2[n]} = σn^2   (can vary with time)
3. The initial state is s[-1] ~ N(µs, σs^2)
4. u[n], w[n], and s[-1] are all independent of each other

To simplify the derivation: let µs = 0 (we'll account for this later).

Goal and Two Properties

Goal: Recursively compute  ŝ[n|n] = E{ s[n] | x[0], x[1], …, x[n] }

Notation:  X[n] = [ x[0], x[1], …, x[n] ]^T is the set of all observations; x[n] is a single vector observation.

Two Properties We Need
1. For the jointly Gaussian case, the MMSE estimator of a zero-mean θ based on two uncorrelated data vectors x1 & x2 is (see p. 350 of text)

   θ̂ = E{θ | x1, x2} = E{θ | x1} + E{θ | x2}

2. If θ = θ1 + θ2 then the MMSE estimator is

   θ̂ = E{θ | x} = E{θ1 + θ2 | x} = E{θ1 | x} + E{θ2 | x}

   (a result of the linearity of the E{·} operator)

Derivation of Scalar Kalman Filter

Innovation:  x̃[n] = x[n] - x̂[n|n-1]

Recall from Section 12.6… x̂[n|n-1] is the MMSE estimate of x[n] given X[n-1] (a prediction!!).

By the MMSE Orthogonality Principle:  E{ x̃[n] X[n-1] } = 0

x̃[n] is the part of x[n] that is uncorrelated with the previous data.

Now note: X[n] is equivalent to { X[n-1], x̃[n] }. Why? Because we can get X[n] from it as follows:

  { X[n-1], x̃[n] }  →  { X[n-1], x[n] } = X[n],   since  x[n] = x̃[n] + x̂[n|n-1] = x̃[n] + Σ_{k=0}^{n-1} a_k x[k]

(the prediction x̂[n|n-1] is a linear combination of the past data).

What have we done so far?

• Have shown that X[n] ↔ { X[n-1], x̃[n] }
⇒ Have split the current data set into 2 parts:
  1. Old data
  2. Uncorrelated part of the new data ("just the new facts")

⇒ Because of this:  ŝ[n|n] = E{ s[n] | X[n] } = E{ s[n] | X[n-1], x̃[n] }

So what??!! Well… we can now exploit Property #1 (the two pieces are uncorrelated):

  ŝ[n|n] = E{ s[n] | X[n-1] } + E{ s[n] | x̃[n] }

where E{ s[n] | X[n-1] } = ŝ[n|n-1] is the prediction of s[n] based on past data, and E{ s[n] | x̃[n] } is the update based on the innovation part of the new data.

Now we need to look more closely at each of these!

Look at the Prediction Term ŝ[n|n-1]:

Use the Dynamical Model… it is the key to prediction because it tells us how the state should progress from instant to instant.

  ŝ[n|n-1] = E{ s[n] | X[n-1] } = E{ a s[n-1] + u[n] | X[n-1] }

Now use Property #2:

  ŝ[n|n-1] = a E{ s[n-1] | X[n-1] } + E{ u[n] | X[n-1] }
           = a ŝ[n-1|n-1] + 0

(the first term by definition; the second because u[n] is independent of X[n-1]… see the bottom of p. 433 in the textbook)

  ŝ[n|n-1] = a ŝ[n-1|n-1]

The Dynamical Model provides the update from estimate to prediction!!

Look at the Update Term E{ s[n] | x̃[n] }:

Use the form for the Gaussian MMSE estimate:

  E{ s[n] | x̃[n] } = ( E{ s[n] x̃[n] } / E{ x̃^2[n] } ) x̃[n]  ≜  k[n] x̃[n]

With x̃[n] = x[n] - x̂[n|n-1]:

  E{ s[n] | x̃[n] } = k[n] ( x[n] - x̂[n|n-1] )

Now, x̂[n|n-1] = ŝ[n|n-1] + ŵ[n|n-1] = ŝ[n|n-1] + 0 by Property #2, because w[n] is independent of x[0], …, x[n-1]. The prediction shows up again!!!

Put these results together:

  ŝ[n|n] = ŝ[n|n-1] + k[n] ( x[n] - ŝ[n|n-1] ),   with  ŝ[n|n-1] = a ŝ[n-1|n-1]

This is the Kalman Filter.  How do we get the gain?

Look at the Gain Term: need two properties…

A.  E{ s[n] ( x[n] - ŝ[n|n-1] ) } = E{ ( s[n] - ŝ[n|n-1] ) ( x[n] - ŝ[n|n-1] ) }

    Note that x[n] - x̂[n|n-1] = x[n] - ŝ[n|n-1] = x̃[n] is the innovation.
    Aside: <x, y> = <x + z, y> for any z ⊥ y; here ŝ[n|n-1] is a linear combination of past data and thus ⊥ the innovation.

B.  E{ w[n] ( s[n] - ŝ[n|n-1] ) } = 0

    "Proof":
    • w[n] is the measurement noise and by assumption is independent of the "dynamical driving noise" u[n] and of s[-1]… In other words, w[n] is independent of everything dynamical, so E{w[n] s[n]} = 0.
    • ŝ[n|n-1] is based on past data, which involve only w[0], …, w[n-1]; since the measurement noise has independent samples, ŝ[n|n-1] ⊥ w[n].

So… we start with the gain as defined above:

  k[n] = E{ s[n] x̃[n] } / E{ x̃^2[n] }

Numerator (plug in the innovation, use Prop. A, then x[n] = s[n] + w[n]):

  E{ s[n] x̃[n] } = E{ s[n] ( x[n] - ŝ[n|n-1] ) }
                 = E{ ( s[n] - ŝ[n|n-1] ) ( x[n] - ŝ[n|n-1] ) }
                 = E{ ( s[n] - ŝ[n|n-1] )^2 } + E{ ( s[n] - ŝ[n|n-1] ) w[n] }
                 = M[n|n-1] + 0                                               (!!)

Denominator (plug in the innovation, use x[n] = s[n] + w[n], expand):

  E{ x̃^2[n] } = E{ ( s[n] - ŝ[n|n-1] + w[n] )^2 }
              = E{ ( s[n] - ŝ[n|n-1] )^2 } + 2 E{ ( s[n] - ŝ[n|n-1] ) w[n] } + E{ w^2[n] }
              = M[n|n-1] + 0 + σn^2                                           (!)

where M[n|n-1] ≜ E{ ( s[n] - ŝ[n|n-1] )^2 } is the MSE when s[n] is estimated by 1-step prediction, and the cross-terms are 0 by Prop. B.

This gives a form for the gain:

  k[n] = M[n|n-1] / ( σn^2 + M[n|n-1] )

This balances…
• the quality of the measured data
• against the predicted state

In the Kalman filter the prediction acts like the prior information about the state at time n before we observe the data at time n.

Look at the Prediction MSE Term:

But now we need to know how to find M[n|n-1]!!!

Use the dynamical model and exploit the form of the prediction:

  M[n|n-1] = E{ ( s[n] - ŝ[n|n-1] )^2 }
           = E{ ( a s[n-1] + u[n] - a ŝ[n-1|n-1] )^2 }
           = E{ ( a ( s[n-1] - ŝ[n-1|n-1] ) + u[n] )^2 }
           = a^2 M[n-1|n-1] + σu^2

(a^2 M[n-1|n-1] is the estimation error at the previous time; the cross-terms are zero)

Why are the cross-terms zero? Two parts:
1. s[n-1] depends on u[0] … u[n-1] and s[-1], which are independent of u[n]
2. ŝ[n-1|n-1] depends on s[0]+w[0] … s[n-1]+w[n-1], which are independent of u[n]

Look at a Recursion for the MSE Term M[n|n]:

By definition:

  M[n|n] = E{ ( s[n] - ŝ[n|n] )^2 } = E{ ( [ s[n] - ŝ[n|n-1] ] - k[n] [ x[n] - ŝ[n|n-1] ] )^2 }

where Term A = s[n] - ŝ[n|n-1] and Term B = k[n] ( x[n] - ŝ[n|n-1] ). Now we'll get three terms: E{A^2}, E{AB}, E{B^2}.

  E{A^2} = M[n|n-1]

  E{AB} = k[n] E{ ( s[n] - ŝ[n|n-1] ) ( x[n] - ŝ[n|n-1] ) } = k[n] M[n|n-1]
          (the expectation is the numerator of k[n], from (!!))

  E{B^2} = k^2[n] E{ ( x[n] - ŝ[n|n-1] )^2 } = k^2[n] × (denominator of k[n], from (!))
         = k[n] × (numerator of k[n]) = k[n] M[n|n-1]       (by the definition of k[n])

Recall:  k[n] = M[n|n-1] / ( σn^2 + M[n|n-1] )

So this gives…

  M[n|n] = M[n|n-1] - 2 k[n] M[n|n-1] + k[n] M[n|n-1]
         = ( 1 - k[n] ) M[n|n-1]

Putting all of these results together gives some very simple equations to iterate… called the Kalman Filter.

We just derived the form for Scalar State & Scalar Observation. On the next three charts we give the Kalman Filter equations for:
• Scalar State & Scalar Observation
• Vector State & Scalar Observation
• Vector State & Vector Observation

Kalman Filter: Scalar State & Scalar Observation

State Model:        s[n] = a s[n-1] + u[n],   u[n] WGN, WSS, ~ N(0, σu^2)
Observation Model:  x[n] = s[n] + w[n],       w[n] WGN, ~ N(0, σn^2)   (variance can vary with n)

Must Know: µs, σs^2, a, σu^2, σn^2

Initialization:   ŝ[-1|-1] = E{s[-1]} = µs
                  M[-1|-1] = E{ (s[-1] - ŝ[-1|-1])^2 } = σs^2

Prediction:       ŝ[n|n-1] = a ŝ[n-1|n-1]

Pred. MSE:        M[n|n-1] = a^2 M[n-1|n-1] + σu^2

Kalman Gain:      K[n] = M[n|n-1] / ( σn^2 + M[n|n-1] )

Update:           ŝ[n|n] = ŝ[n|n-1] + K[n] ( x[n] - ŝ[n|n-1] )

Est. MSE:         M[n|n] = ( 1 - K[n] ) M[n|n-1]
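A minimal NumPy sketch of these scalar recursions applied to simulated data. The parameter values are illustrative assumptions, not values from the notes:

import numpy as np

rng = np.random.default_rng(0)

a, sig_u, sig_n, mu_s, sig_s = 0.95, 0.3, 1.0, 0.0, 1.0
N = 200

# Simulate the state and the observations
s_true = np.zeros(N)
s_prev = mu_s + sig_s * rng.standard_normal()          # s[-1]
for n in range(N):
    s_prev = a * s_prev + sig_u * rng.standard_normal()
    s_true[n] = s_prev
x = s_true + sig_n * rng.standard_normal(N)

# Scalar Kalman filter
s_hat, M = mu_s, sig_s**2                 # initialization: s_hat[-1|-1], M[-1|-1]
s_est = np.zeros(N)
for n in range(N):
    s_pred = a * s_hat                    # prediction
    M_pred = a**2 * M + sig_u**2          # prediction MSE
    K = M_pred / (sig_n**2 + M_pred)      # Kalman gain
    s_hat = s_pred + K * (x[n] - s_pred)  # update
    M = (1 - K) * M_pred                  # estimation MSE
    s_est[n] = s_hat

print(np.mean((s_est - s_true)**2), np.mean((x - s_true)**2))  # KF MSE vs. raw-measurement MSE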

Kalman Filter: Vector State & Scalar Observation

State Model:        s[n] = A s[n-1] + B u[n],   s: p×1, A: p×p, B: p×r, u[n] ~ N(0,Q): r×1
Observation Model:  x[n] = h^T[n] s[n] + w[n],  h[n]: p×1,  w[n] WGN ~ N(0, σn^2)

Must Know: µs, Cs, A, B, h, Q, σn^2

Initialization:     ŝ[-1|-1] = E{s[-1]} = µs
                    M[-1|-1] = E{ (s[-1] - ŝ[-1|-1]) (s[-1] - ŝ[-1|-1])^T } = Cs

Prediction:         ŝ[n|n-1] = A ŝ[n-1|n-1]

Pred. MSE (p×p):    M[n|n-1] = A M[n-1|n-1] A^T + B Q B^T

Kalman Gain (p×1):  K[n] = M[n|n-1] h[n] / ( σn^2 + h^T[n] M[n|n-1] h[n] )    (denominator is 1×1)

Update:             ŝ[n|n] = ŝ[n|n-1] + K[n] ( x[n] - h^T[n] ŝ[n|n-1] )
                    (the term in parentheses is the innovation x̃[n] = x[n] - x̂[n|n-1])

Est. MSE (p×p):     M[n|n] = ( I - K[n] h^T[n] ) M[n|n-1]

Kalman Filter: Vector State & Vector Observation

State Model:   s[n] = A s[n-1] + B u[n],   s: p×1, A: p×p, B: p×r, u[n] ~ N(0,Q): r×1
Observation:   x[n] = H[n] s[n] + w[n],    x[n]: M×1, H[n]: M×p, w[n] ~ N(0, C[n]): M×1

Must Know: µs, Cs, A, B, H, Q, C[n]

Initialization:     ŝ[-1|-1] = E{s[-1]} = µs
                    M[-1|-1] = E{ (s[-1] - ŝ[-1|-1]) (s[-1] - ŝ[-1|-1])^T } = Cs

Prediction:         ŝ[n|n-1] = A ŝ[n-1|n-1]

Pred. MSE (p×p):    M[n|n-1] = A M[n-1|n-1] A^T + B Q B^T

Kalman Gain (p×M):  K[n] = M[n|n-1] H^T[n] ( C[n] + H[n] M[n|n-1] H^T[n] )^{-1}    (the inverted term is M×M)

Update:             ŝ[n|n] = ŝ[n|n-1] + K[n] ( x[n] - H[n] ŝ[n|n-1] )
                    (the term in parentheses is the innovation x̃[n])

Est. MSE (p×p):     M[n|n] = ( I - K[n] H[n] ) M[n|n-1]
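A compact NumPy sketch of one prediction/update cycle of these vector equations (the function name is ours; the matrices are placeholders whose shapes follow the notation above):

import numpy as np

def kalman_step(s_hat, M, x, A, B, Q, H, C):
    """One prediction/update cycle of the vector-vector Kalman filter."""
    # Prediction and prediction MSE
    s_pred = A @ s_hat
    M_pred = A @ M @ A.T + B @ Q @ B.T
    # Kalman gain (p x M)
    K = M_pred @ H.T @ np.linalg.inv(C + H @ M_pred @ H.T)
    # Update using the innovation x - H s_pred, then the estimation MSE
    s_new = s_pred + K @ (x - H @ s_pred)
    M_new = (np.eye(len(s_hat)) - K @ H) @ M_pred
    return s_new, M_new

Calling this once per measurement, feeding back (s_new, M_new), implements the filter.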

Kalman Filter Block Diagram

[Block diagram: the observations x[n] enter a summing node where the predicted observation x̂[n|n-1] = H[n] ŝ[n|n-1] is subtracted to form the innovation x̃[n]; the innovation is scaled by the gain K[n] (an "estimated driving noise" term B û[n]) and added to the predicted state ŝ[n|n-1] to give the estimated state ŝ[n|n]; the embedded dynamical model A z^{-1} produces the predicted state from ŝ[n|n], and the embedded observation model H[n] closes the loop.]

Looks a lot like Sequential LS/MMSE except it has the Embedded Dynamical Model!!!

Overview of MMSE Estimation

[Chart relating the MMSE estimators covered so far:]

• General MMSE ("squared" cost function):  θ̂ = E{θ | x}
• Jointly Gaussian (or force linear with any PDF and known 2nd moments) ⇒ LMMSE:
    θ̂ = E{θ} + C_θx C_xx^{-1} ( x - E{x} )
• Bayesian Linear Model / LMMSE Linear Model:
    θ̂ = µ_θ + C_θ H^T ( H C_θ H^T + C_w )^{-1} ( x - H µ_θ )
• Optimal / Linear Sequential Filter (no dynamics):
    θ̂_n = θ̂_{n-1} + k_n ( x[n] - h_n^T θ̂_{n-1} )
• Optimal / Linear Kalman Filter (with dynamics):
    ŝ[n|n] = ŝ[n|n-1] + K[n] ( x[n] - H[n] A ŝ[n-1|n-1] )

Important Properties of the KF

1. The Kalman filter is an extension of the sequential MMSE estimator
   • Sequential MMSE is for a fixed parameter
   • Kalman is for a time-varying parameter, but must have a known dynamical model
   • The block diagrams are nearly identical except for the A z^{-1} feedback box in the Kalman filter… just a z^{-1} box in seq. MMSE… the A is the dynamical model's state-transition matrix

2. Inversion is only needed for the vector observation case

3. The Kalman filter is a time-varying filter
   • Due to two time-varying blocks: the gain K[n] & the observation matrix H[n]
   • Note: K[n] changes constantly to adjust the balance between "info from the data" (the innovation) vs. "info from the model" (the prediction)

4. The Kalman filter computes (and uses!) its own performance measure M[n|n] (which is the MMSE matrix)
   • Used to help balance between innovation and prediction

5. There is a natural up-down progression in the error
   • The Prediction Stage increases the error
   • The Update Stage decreases the error:  M[n|n-1] > M[n|n]
   • This is OK… prediction is just a natural, intermediate step in the optimal processing

   [Figure: sawtooth plot of the MSE vs. n showing M[5|4] > M[5|5], M[6|5] > M[6|6], M[7|6] > M[7|7], M[8|7], …]

6. Prediction is an integral part of the KF
   • And it is based entirely on the Dynamical Model!!!

7. After a "long" time (as n → ∞) the KF reaches "steady-state" operation… and the KF becomes a Linear Time-Invariant filter
   • M[n|n] and M[n|n-1] both become constant
   • … but still have M[n|n-1] > M[n|n]
   • Thus, the gain k[n] becomes constant, too (a quick numerical check of this appears after this list).

8. The KF creates an uncorrelated sequence… the innovations
   • Can view the innovations as "an equivalent input sequence"
   • Or… if we view the innovations as the output, then the steady-state KF is an LTI whitening filter (need steady state to get constant-power innovations)

9. The KF is optimal for the Gaussian case (minimizes MSE)
   • If not Gaussian… the KF is still the optimal Linear MMSE estimator!!!

10. M[n|n-1], M[n|n], and K[n] can be computed ahead of time ("off-line")
   • As long as the expected measurement variance σn^2 is known
   • This allows off-line, data-independent assessment of KF performance
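A small sketch illustrating properties 7 and 10: the scalar MSE/gain recursions can be iterated off-line (no data needed) and settle to constant steady-state values. The parameter values are illustrative assumptions:

# Iterate the scalar KF variance/gain recursions off-line (no data required).
a, sig_u2, sig_n2 = 0.95, 0.09, 1.0   # assumed a, sigma_u^2, sigma_n^2
M = 1.0                               # assumed initial MSE M[-1|-1]

for n in range(100):
    M_pred = a**2 * M + sig_u2        # M[n|n-1]  (prediction increases the error)
    K = M_pred / (sig_n2 + M_pred)    # K[n]
    M = (1 - K) * M_pred              # M[n|n]    (update decreases it: M[n|n] < M[n|n-1])

print(M_pred, M, K)                   # after many iterations these settle to steady-state values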

13.5 Kalman Filters vs. Wiener Filters

They are hard to compare directly… they have different models:
• Wiener assumes WSS signal + noise
• Kalman assumes a dynamical model with an observation model

So… to compare we need to put them in the same context. If we:
1. Consider only after much time has elapsed (as n → ∞)
   • Gives the IIR Wiener case
   • Gives the steady-state Kalman, and the dynamical model becomes AR
2. For the Kalman filter, let σn^2 be constant
   • The observation noise becomes WSS

Then… Kalman = Wiener!!!  See the book for more details.

13.7 Extended Kalman Filter

The dynamical and observation models we assumed when developing the Kalman filter were linear models:

  Dynamics:      s[n] = A s[n-1] + B u[n]    (a matrix is a linear operator)
  Observations:  x[n] = H[n] s[n] + w[n]

However, many (most?) applications have a
• Nonlinear State Equation
and/or
• Nonlinear Observation Equation

Solving for the optimal Kalman filter in the nonlinear model case is generally intractable!!!

The "Extended Kalman Filter" is a sub-optimal approach that linearizes the model(s) and then applies the standard KF.

EKF Motivation: A/C Tracking with Radar

Case #1: Dynamics are Linear but Observations are Nonlinear

Recall the constant-velocity model for an aircraft. Define the state in rectangular coordinates:

  s[n] = [ rx[n]  ry[n]  vx[n]  vy[n] ]^T     (A/C positions (m) and velocities (m/s))

Dynamics Model:   s[n] = A s[n-1] + B u[n]

  A = [ 1  0  ∆  0 ]        B = [ 0  0 ]
      [ 0  1  0  ∆ ]            [ 0  0 ]
      [ 0  0  1  0 ]            [ 1  0 ]
      [ 0  0  0  1 ]            [ 0  1 ]

For rectangular coordinates the state equation is linear.

But… the choice of rectangular coordinates makes the radar's observations nonlinearly related to the state. A radar can observe range and bearing (i.e., the angle to the target), and also radial and angular velocities, which we will ignore here.

So the observation equations – relating the observation to the state – are given by:

  x[n] = [ R[n] ]  =  [ sqrt( rx^2[n] + ry^2[n] ) ]  +  [ wR[n] ]
         [ β[n] ]     [ tan^{-1}( ry[n] / rx[n] ) ]     [ wβ[n] ]

[Figure: radar at the origin, target at (rx, ry); range R and bearing β are measured from the radar.]

Observation Model: for rectangular state coordinates the observation equation is Non-Linear.
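A small helper showing this nonlinear measurement function h(s); the function name is ours, not the book's:

import numpy as np

def range_bearing(s):
    """Nonlinear radar observation h(s) for the rectangular state s = [rx, ry, vx, vy]."""
    rx, ry = s[0], s[1]
    R = np.sqrt(rx**2 + ry**2)        # range (m)
    beta = np.arctan2(ry, rx)         # bearing (rad); arctan2 handles all quadrants
    return np.array([R, beta])

print(range_bearing(np.array([3000.0, 4000.0, 100.0, 100.0])))   # -> [5000.0, 0.927...]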

Case #2: Observations are Linear but Dynamics are Nonlinear

If we choose the state to be in polar form then the observations will be linear functions of the state… so maybe then we won't have a problem??? WRONG!!!

  s[n] = [ R[n]  β[n]  S[n]  α[n] ]^T     (A/C range & bearing, A/C speed & heading)

[Figure: radar at the origin, target at range R and bearing β, moving with speed S at heading α.]

Observation Model: the observation is linear:

  x[n] = [ R[n] ]  =  [ 1  0  0  0 ] s[n]  +  [ wR[n] ]
         [ β[n] ]     [ 0  1  0  0 ]          [ wβ[n] ]

But… the Dynamics Model is now Non-Linear:

  R[n] = sqrt( ( R[n-1] cos β[n-1] + ∆ S[n-1] cos α[n-1] )^2 + ( R[n-1] sin β[n-1] + ∆ S[n-1] sin α[n-1] )^2 )
  β[n] = tan^{-1}( ( R[n-1] sin β[n-1] + ∆ S[n-1] sin α[n-1] ) / ( R[n-1] cos β[n-1] + ∆ S[n-1] cos α[n-1] ) )
  S[n] = sqrt( ( S[n-1] cos α[n-1] + ux[n] )^2 + ( S[n-1] sin α[n-1] + uy[n] )^2 )
  α[n] = tan^{-1}( ( S[n-1] sin α[n-1] + uy[n] ) / ( S[n-1] cos α[n-1] + ux[n] ) )

In each of these cases… we can't apply the standard KF because it relies on the assumption of linear state and observation models!!!

Nonlinear Models

We state here the case where both the state and observation equations are nonlinear…

  s[n] = a( s[n-1] ) + B u[n]
  x[n] = h_n( s[n] ) + w[n]

where a(·) and h_n(·) are both nonlinear functions mapping a vector to a vector.

What To Do When Facing a Non-Linear Model?

1. Go back and re-derive the MMSE estimator for the nonlinear case to develop the "your-last-name-here filter"??
   • Nonlinearities don't preserve Gaussianity, so it will be hard to derive…
   • There has been some recent progress in this area: "particle filters"

2. Give up and try to convince your company's executives and the FAA (Federal Aviation Administration) that tracking airplanes is not that important??
   • Probably not a good career move!!!

3. Argue that you should use an extremely dense grid of radars networked together??
   • Would be extremely expensive… although with today's efforts in sensor networks this may not be so far-fetched!!!

4. Linearize each nonlinear model using a 1st-order Taylor series?
   • Yes!!!
   • Of course, it won't be optimal… but it might give the required performance!

Linearization of Models

State:        a( s[n-1] ) ≈ a( ŝ[n-1|n-1] ) + A[n-1] ( s[n-1] - ŝ[n-1|n-1] ),
              where A[n-1] ≜ ∂a/∂s[n-1] evaluated at s[n-1] = ŝ[n-1|n-1]

Observation:  h_n( s[n] ) ≈ h_n( ŝ[n|n-1] ) + H[n] ( s[n] - ŝ[n|n-1] ),
              where H[n] ≜ ∂h_n/∂s[n] evaluated at s[n] = ŝ[n|n-1]
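For the radar observation model above, the Jacobian H[n] can be written out by hand. A sketch (our own helper, evaluated at the prediction ŝ[n|n-1]):

import numpy as np

def radar_jacobian(s_pred):
    """Jacobian of h(s) = [sqrt(rx^2+ry^2), tan^-1(ry/rx)] w.r.t. s = [rx, ry, vx, vy]."""
    rx, ry = s_pred[0], s_pred[1]
    R2 = rx**2 + ry**2
    R = np.sqrt(R2)
    return np.array([[rx / R,   ry / R,   0.0, 0.0],    # d(range)/ds
                     [-ry / R2, rx / R2,  0.0, 0.0]])   # d(bearing)/ds

H = radar_jacobian(np.array([3000.0, 4000.0, 100.0, 100.0]))
print(H)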

Using the Linearized Models

  s[n] = A[n-1] s[n-1] + B u[n] + [ a( ŝ[n-1|n-1] ) - A[n-1] ŝ[n-1|n-1] ]

Just like what we did in the linear case, except now we have a time-varying A matrix. The new additive term is known at each step, so in terms of the development we can imagine that we just subtract off this known part… ⇒ Result: this part has no real impact!

  x[n] = H[n] s[n] + w[n] + [ h_n( ŝ[n|n-1] ) - H[n] ŝ[n|n-1] ]

1. The resulting EKF iteration is virtually the same – except there is a "linearization" step.
2. We can no longer do data-free, off-line performance iteration:
   • H[n] and A[n-1] are computed on each iteration using the data-dependent estimate and prediction.

Extended Kalman Filter (Vector-Vector)

Initialization:   ŝ[-1|-1] = µs,   M[-1|-1] = Cs

Prediction:       ŝ[n|n-1] = a( ŝ[n-1|n-1] )

Linearizations:   A[n-1] = ∂a/∂s[n-1] evaluated at s[n-1] = ŝ[n-1|n-1]
                  H[n]   = ∂h_n/∂s[n] evaluated at s[n] = ŝ[n|n-1]

Pred. MSE:        M[n|n-1] = A[n-1] M[n-1|n-1] A^T[n-1] + B Q B^T

Kalman Gain:      K[n] = M[n|n-1] H^T[n] ( C[n] + H[n] M[n|n-1] H^T[n] )^{-1}

Update:           ŝ[n|n] = ŝ[n|n-1] + K[n] ( x[n] - h_n( ŝ[n|n-1] ) )

Est. MSE:         M[n|n] = ( I - K[n] H[n] ) M[n|n-1]
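A compact sketch of one EKF cycle following these equations. The function and parameter names (a_fun, A_jac, h_fun, H_jac, etc.) are our own; they are user-supplied callables, e.g. the constant-velocity dynamics with the range_bearing / radar_jacobian helpers sketched earlier:

import numpy as np

def ekf_step(s_hat, M, x, a_fun, A_jac, B, Q, h_fun, H_jac, C):
    """One prediction/update cycle of the extended Kalman filter."""
    A = A_jac(s_hat)                          # linearize the dynamics at the last estimate
    s_pred = a_fun(s_hat)                     # prediction
    M_pred = A @ M @ A.T + B @ Q @ B.T        # prediction MSE
    H = H_jac(s_pred)                         # linearize the observation at the prediction
    K = M_pred @ H.T @ np.linalg.inv(C + H @ M_pred @ H.T)   # Kalman gain
    s_new = s_pred + K @ (x - h_fun(s_pred))  # update with the nonlinear predicted observation
    M_new = (np.eye(len(s_hat)) - K @ H) @ M_pred
    return s_new, M_new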

13.8 Signal Processing Examples

Ex. 13.3 Time-Varying Channel Estimation

[Figure: transmitter Tx sends v(t) to receiver Rx over a direct path plus multipath reflections, producing y(t).]

  y(t) = ∫_0^T h_t(τ) v(t-τ) dτ        (T is the maximum delay)

Model this using a time-varying D-T FIR system. The channel changes with time if:
• there is relative motion between Rx and Tx
• reflectors move/change with time

  y[n] = Σ_{k=0}^{p} h_n[k] v[n-k]      (the coefficients change at each n to model the time-varying channel)

In communication systems, multipath channels degrade performance (inter-symbol interference (ISI), flat fading, frequency-selective fading, etc.).

Need to: first estimate the channel coefficients, then build an inverse filter or equalizer.

2 Broad Scenarios:
1. The signal v(t) being sent is known (Training Data)
2. The signal v(t) being sent is not known (Blind Channel Estimation)

One method for scenario #1 is to use a Kalman Filter:

The state to be estimated is h[n] = [ h_n[0] … h_n[p] ]^T

(Note: h here no longer denotes the observation model.)

Need a State Equation: assume the FIR tap coefficients change slowly,

  h[n] = A h[n-1] + u[n]

with A assumed known (that is a weakness!!).

Assume the FIR taps are uncorrelated with each other <uncorrelated scattering>:
A, Q = cov{u[n]}, and C_h = cov{h[-1]} = M[-1|-1] are all diagonal.

Have the measurement model from the convolution view:

  x[n] = Σ_{k=0}^{p} h_n[k] v[n-k] + w[n],     w[n] zero-mean WGN with variance σ^2;  v[n] is the known training signal

Need the Observation Equation:

  x[n] = v_n^T h[n] + w[n]

The "observation matrix" v_n^T = [ v[n] v[n-1] … v[n-p] ] is made up of samples of the known transmitted signal; the state vector h[n] is the filter coefficients.

Simple Specific Example: p = 2 (1 Direct Path, 1 Multipath)

  h[n] = A h[n-1] + u[n],    A = [ 0.99    0    ]     Q = cov{u[n]} = [ 0.0001    0     ]
                                 [  0    0.999  ]                     [   0     0.0001 ]

[Figure: typical realization of the channel coefficients. The book doesn't state how the initial coefficients were chosen for this realization. Note that h_n[0] decays faster and that the random perturbation is small.]
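A sketch tying the pieces together: simulate this two-tap channel and track it with the vector-state/scalar-observation KF equations above. The training signal, random seed, and initial taps are our own illustrative choices:

import numpy as np

rng = np.random.default_rng(2)

A = np.diag([0.99, 0.999])
Q = np.diag([0.0001, 0.0001])
sig2 = 0.1                                   # measurement-noise variance from the example
N = 300

v = rng.choice([-1.0, 1.0], size=N + 1)      # assumed known +/-1 training signal
h = np.array([1.0, 0.5])                     # assumed initial channel taps
h_hat = np.zeros(2)                          # arbitrary initial estimate…
M = 100.0 * np.eye(2)                        # …with a large initial covariance

for n in range(1, N + 1):
    h = A @ h + rng.multivariate_normal(np.zeros(2), Q)        # true channel evolves
    vn = np.array([v[n], v[n - 1]])                            # "observation matrix" v_n
    x = vn @ h + np.sqrt(sig2) * rng.standard_normal()         # noisy received sample

    # Kalman filter (vector state, scalar observation); B = I here
    h_pred = A @ h_hat
    M_pred = A @ M @ A.T + Q
    K = M_pred @ vn / (sig2 + vn @ M_pred @ vn)
    h_hat = h_pred + K * (x - vn @ h_pred)
    M = (np.eye(2) - np.outer(K, vn)) @ M_pred

print(h, h_hat)    # after the initial transient the estimate tracks the true taps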

[Figures: the known transmitted signal, the noise-free received signal <it is a bit odd that the received signal is larger than the transmitted signal>, and the noisy received signal. The variance of the noise in the measurement model is σ^2 = 0.1.]

Estimation Results Using Standard Kalman Filter

Initialization:  ĥ[-1|-1] = [0 0]^T,  M[-1|-1] = 100 I,  σ^2 = 0.1
(chosen to reflect that little prior knowledge is available)

In theory we said that we initialize to the a priori mean, but in practice it is common to just pick some arbitrary initial value and set the initial covariance quite high; this forces the filter to start out trusting the data a lot!

[Figure: estimates of h_n[0] and h_n[1] vs. n. There is a transient due to the wrong initial condition, but the filter eventually tracks well!!]

[Figure: Kalman filter gains vs. n. The gains decay, so the filter relies more on the model; the gain is zero when the signal is noise only.]

[Figure: Kalman filter MMSE vs. n. The filter performance improves with time.]

Example: Radar Target Tracking

State Model: Constant-Velocity A/C Model

  [ rx[n] ]     [ 1  0  ∆  0 ] [ rx[n-1] ]     [   0   ]
  [ ry[n] ]  =  [ 0  1  0  ∆ ] [ ry[n-1] ]  +  [   0   ]
  [ vx[n] ]     [ 0  0  1  0 ] [ vx[n-1] ]     [ ux[n] ]
  [ vy[n] ]     [ 0  0  0  1 ] [ vy[n-1] ]     [ uy[n] ]

  Q = cov{u[n]} = [ 0  0    0     0   ]
                  [ 0  0    0     0   ]
                  [ 0  0  σu^2    0   ]
                  [ 0  0    0   σu^2  ]

(the velocity perturbations model wind, slight speed corrections, etc.)

Observation Model: Noisy Range/Bearing Radar Measurements

  x[n] = [ sqrt( rx^2[n] + ry^2[n] ) ]  +  [ wR[n] ]
         [ tan^{-1}( ry[n] / rx[n] ) ]     [ wβ[n] ]          (bearing in radians)

For this simple example, assume:

  C = cov{w[n]} = [ σR^2    0   ]
                  [  0    σβ^2  ]

Extended Kalman Filter Issues

1. Linearization of the observation model (see book for details)
   • Calculate it by hand and program it into the EKF to be evaluated each iteration

2. Covariance of the State Driving Noise
   • Assume wind gusts, etc. are as likely to occur in any direction with the same magnitude ⇒ model them as independent with a common variance:

     Q = cov{u[n]} = [ 0  0    0     0   ]
                     [ 0  0    0     0   ]
                     [ 0  0  σu^2    0   ]
                     [ 0  0    0   σu^2  ]

   • σu = what??? Note: ux[n]/∆ = the acceleration from n-1 to n.
     So choose σu (in m/s) so that σu/∆ gives a reasonable range of accelerations for the type of target expected to track.

3. Covariance of the Measurement Noise
   • The DSP engineers working on the radar usually specify this, or build routines into the radar to provide time-updated assessments of range/bearing accuracy
   • Usually assumed to be white and zero-mean
   • Can use CRLBs for Range & Bearing
     – Note: the CRLBs depend on SNR, so the range & bearing measurement accuracy should get worse when the target is farther away
   • Often assume the range error to be uncorrelated with the bearing error
     – So use C[n] = diag{ σR^2[n], σβ^2[n] }
     – But it is best to derive the joint CRLB to see if they are correlated

4. Initialization Issues
   • Typically convert the first range/bearing measurement into initial rx & ry values
   • If the radar provides no velocity info (i.e., does not measure Doppler), can assume zero velocities
   • Pick a large initial MSE to force the KF to be unbiased
     – If we follow the above two ideas, then we might pick the MSE for rx & ry based on a statistical analysis of the conversion of range/bearing accuracy into rx & ry accuracies
   • Sometimes one radar gets a hand-off from some other radar or sensor
     – The other radar/sensor would likely hand off its last track values, so use those as ICs for initializing the new radar
     – The other radar/sensor would likely hand off an MSE measure of the quality of its last track, so use that as M[-1|-1]

State Model Example Trajectories: Constant-Velocity A/C Model

[Figure: example trajectories, X position (m) vs. Y position (m), with the radar at the origin. The red line is the non-random constant-velocity trajectory.]

Parameters used:  ∆ = 1 sec,  σu = 0.0316 m/s (σu^2 = 0.001 m^2/s^2),
                  rx[-1] = 10 m,  ry[-1] = -5 m,  vx[-1] = -0.2 m/s,  vy[-1] = 0.2 m/s

Observation Model Example Measurements

[Figure: range R (meters) and bearing β (degrees) vs. sample index n; the red lines are the noise-free measurements.]

  σR = 0.3162 m   (σR^2 = 0.1 m^2)
  σβ = 0.1 rad ≈ 5.7 deg   (σβ^2 = 0.01 rad^2)

In reality, these would get worse when the target is far away due to a weaker returned signal.
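Putting the earlier sketches together for this scenario (range_bearing, radar_jacobian, and ekf_step as defined above; the initialization values follow the chart after next, and the random seed is arbitrary):

import numpy as np

rng = np.random.default_rng(3)

dt = 1.0
A = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]])
B = np.vstack([np.zeros((2, 2)), np.eye(2)])          # drives the velocity components only
Q = 0.001 * np.eye(2)                                 # sigma_u^2 = 0.001 (m/s)^2
C = np.diag([0.1, 0.01])                              # sigma_R^2 = 0.1 m^2, sigma_beta^2 = 0.01 rad^2

s = np.array([10.0, -5.0, -0.2, 0.2])                 # true initial state
s_hat = np.array([5.0, 5.0, 0.0, 0.0])                # arbitrary initialization (see the chart after next)
M = 100.0 * np.eye(4)

for n in range(100):
    s = A @ s + B @ (np.sqrt(0.001) * rng.standard_normal(2))        # true dynamics
    x = range_bearing(s) + rng.multivariate_normal(np.zeros(2), C)   # noisy range/bearing measurement
    s_hat, M = ekf_step(s_hat, M, x,
                        a_fun=lambda v: A @ v, A_jac=lambda v: A,    # dynamics are linear here
                        B=B, Q=Q, h_fun=range_bearing, H_jac=radar_jacobian, C=C)

print(s[:2], s_hat[:2])    # true vs. estimated position after 100 samples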

Measurements Directly Give a Poor Track

[Figure: track obtained by directly converting the noisy range and bearing measurements into x-y positions, with the radar marked.]

If we tried to directly convert the noisy range and bearing measurements into a track, this is what we'd get. Not a very accurate track!!!! ⇒ Need a Kalman Filter!!! But the observation model is nonlinear, so use the Extended KF!

Note how the track gets worse when far from the radar (angle accuracy converts into position accuracy in a way that depends on range).

Extended Kalman Filter Gives Better Track

Note: The EKF was run with the correct values for Q and C (i.e., the Q and C used to simulate the trajectory and measurements were used to implement the Kalman filter).

Initialization:  ŝ[-1|-1] = [5 5 0 0]^T (picked arbitrarily),  M[-1|-1] = 100 I (set large to assert that little is known a priori)

[Figure: EKF track vs. the true trajectory, with the radar and the initialization point marked.]

After about 20 samples the EKF attains track, even with the poor ICs and the linearization. The track gets worse near the end, where the measurements are worse; the MSE plots confirm that the filter obtains track and that things degrade at the end.

MSE Plots Show Performance

[Figure: MSE vs. sample index n.]
• First a transient where things get worse
• Next the EKF seems to obtain track
• Finally the accuracy degrades due to range magnification of the bearing errors